Kafka `cleanup.policy`: A Deep Dive into Log Management for Production Systems
1. Introduction
Imagine a globally distributed financial trading platform. Every trade, order modification, and market data update is streamed through Kafka. We're dealing with hundreds of millions of events per second, strict regulatory compliance requiring long-term audit trails, and the need for real-time risk analysis. A naive retention policy can quickly lead to disk exhaustion, performance degradation, and even data loss. The `cleanup.policy` configuration is the critical control point for managing Kafka's log segments, directly impacting storage costs, query latency, and overall system stability. This post dives deep into `cleanup.policy`, covering its architecture, configuration, failure modes, and best practices for production deployments. We'll focus on scenarios involving stream processing with Kafka Streams, CDC replication using Debezium, and microservices communicating via event-driven architectures.
2. What is `cleanup.policy` in Kafka Systems?
`cleanup.policy` dictates how Kafka manages the lifecycle of log segments within a topic partition. It has a broker-wide default (`log.cleanup.policy`) and can be overridden per topic (`cleanup.policy`). Log compaction has been part of Kafka since the 0.8.1 release. The policy does not replace the retention settings; rather, the `delete` policy is governed by `retention.ms` and `retention.bytes` at the topic level (with `log.retention.hours`/`log.retention.ms` and `log.retention.bytes` as broker-level defaults), which is what gives you granular, per-topic control.
The core values are:
- `delete`: Segments are deleted based on time or size retention policies (configured via `retention.ms` and `retention.bytes`). This is the default.
- `compact`: Segments are compacted, retaining only the latest value for each key. This is crucial for stateful stream processing and change data capture.
- `delete,compact`: Combines both policies. Segments are compacted and are also deleted once they exceed the retention limits.
`cleanup.policy` operates within the Kafka broker's log management system. Producers append messages to the log and consumers read from it, while the broker, guided by `cleanup.policy`, manages the underlying log segments on disk. The controller is responsible for coordinating partition leadership; the topic configuration itself is propagated to every broker hosting a replica, and each broker applies the policy to its local segments.
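As a concrete illustration, here is a minimal sketch of creating a compacted topic with an explicit `cleanup.policy` using the standard Java `AdminClient`. The topic name, partition/replication counts, bootstrap address, and the extra tuning configs are placeholders for this example, not recommendations:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Topic-level overrides take precedence over the broker-wide log.cleanup.policy default.
            NewTopic topic = new NewTopic("customer-state", 6, (short) 3)   // hypothetical topic
                    .configs(Map.of(
                            TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT,
                            TopicConfig.MIN_CLEANABLE_DIRTY_RATIO_CONFIG, "0.5",
                            TopicConfig.SEGMENT_MS_CONFIG, "3600000"));     // roll segments hourly so they become cleanable

            admin.createTopics(Set.of(topic)).all().get();
        }
    }
}
```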
3. Real-World Use Cases
- Change Data Capture (CDC): Debezium streams database changes to Kafka. Using `compact`, we retain only the latest state for each database record, creating a materialized view (see the consumer sketch after this list). Without compaction, the topic would grow indefinitely with every historical change.
- Event Sourcing: Microservices emit events representing state changes. `compact` keeps the latest event for each entity, which supports snapshot-style state reconstruction; where the full event history is required, a `delete`-based topic with long retention is the better fit.
- Log Aggregation: Collecting logs from distributed applications. `delete` with appropriate retention settings provides a time-bound audit trail.
- Real-time Analytics: Aggregating metrics. `compact` can be used to maintain the latest aggregate value per key, reducing storage and improving query performance.
- Multi-Datacenter Replication (MirrorMaker 2): Maintaining consistent data across regions. `delete` policies must be carefully aligned between source and target clusters to prevent data loss or excessive replication lag.
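For the CDC case, here is a sketch of the materialized-view idea: a plain Java consumer that replays a compacted topic from the beginning and keeps only the latest value per key in a map. The topic name is hypothetical and a real Debezium pipeline would deserialize structured change events rather than raw strings:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class MaterializedView {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "materialized-view-builder");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // replay the compacted log from the start

        Map<String, String> latestByKey = new HashMap<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("customers.cdc")); // hypothetical compacted CDC topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    if (record.value() == null) {
                        latestByKey.remove(record.key()); // tombstone: the row was deleted upstream
                    } else {
                        latestByKey.put(record.key(), record.value()); // latest state wins
                    }
                }
            }
        }
    }
}
```

Kafka Streams' `KTable` gives you the same semantics with state stores and fault tolerance built in; the hand-rolled map above is only meant to show what compaction preserves.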
4. Architecture & Internal Mechanics
```mermaid
graph LR
    A[Producer] --> B(Kafka Broker 1);
    A --> C(Kafka Broker 2);
    B --> D{Topic Partition};
    C --> D;
    D --> E[Log Segments];
    E --> F{"Cleanup Policy (delete/compact)"};
    F --> G[Disk Storage];
    H[Consumer] --> D;
    I[Kafka Controller] -- Coordinates --> B;
    I -- Coordinates --> C;
    style D fill:#f9f,stroke:#333,stroke-width:2px
```
Kafka partitions are divided into log segments, and `cleanup.policy` operates on these segments. When a segment exceeds its size or time limit (for `delete`), it is marked for deletion. For `compact`, the broker's log cleaner scans segments, identifies the latest offset for each key, and rewrites the segments so that older values for the same key are removed. This process is resource-intensive. Each broker runs its log cleaner threads against its local replicas; the controller — ZooKeeper-based, or the KRaft controller quorum that replaces ZooKeeper for metadata management — coordinates partition leadership and propagates topic configuration rather than driving compaction itself. When a Schema Registry is in use, backward/forward-compatible schema evolution keeps compacted records deserializable even as older and newer value versions coexist in the log.
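To make the compaction semantics concrete, here is a toy sketch, pure Java with no broker involved, of the logical rule the log cleaner applies: for each key, only the record with the highest offset survives, and older values for that key are discarded. It is a model of the outcome, not of the cleaner's actual segment-rewriting implementation:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CompactionSemantics {
    record LogRecord(long offset, String key, String value) {}

    // Logical equivalent of compaction: keep the highest-offset value per key.
    static Map<String, LogRecord> compact(List<LogRecord> segment) {
        Map<String, LogRecord> latest = new LinkedHashMap<>();
        for (LogRecord r : segment) {
            latest.put(r.key(), r);   // later offsets overwrite earlier ones for the same key
        }
        return latest;
    }

    public static void main(String[] args) {
        List<LogRecord> segment = List.of(
                new LogRecord(0, "acct-1", "balance=100"),
                new LogRecord(1, "acct-2", "balance=50"),
                new LogRecord(2, "acct-1", "balance=75"));   // supersedes offset 0

        // Only the records at offsets 2 and 1 remain; acct-1's older value is gone,
        // and surviving records keep their original offsets, just as in a compacted partition.
        compact(segment).values().forEach(System.out::println);
    }
}
```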
5. Configuration & Deployment Details
`server.properties` (Broker Configuration):

```properties
log.cleanup.policy=compact,delete
# retain for 7 days
log.retention.ms=604800000
# size-based retention disabled
log.retention.bytes=-1
# 1 GB segments
log.segment.bytes=1073741824
```
`kafka-configs.sh` / `kafka-topics.sh` (Topic Configuration): on current brokers, per-topic config changes are applied with `kafka-configs.sh` rather than `kafka-topics.sh --alter --config`:

```bash
kafka-configs.sh --bootstrap-server localhost:9092 --alter --entity-type topics --entity-name my-topic --add-config cleanup.policy=compact
kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic my-topic
```
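The same change can be made programmatically. A sketch using the Java `AdminClient` and `incrementalAlterConfigs` (available on reasonably recent Kafka versions); the topic name and bootstrap address are placeholders:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import org.apache.kafka.common.config.TopicConfig;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class SetCleanupPolicy {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
            AlterConfigOp setCompact = new AlterConfigOp(
                    new ConfigEntry(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT),
                    AlterConfigOp.OpType.SET);

            admin.incrementalAlterConfigs(Map.of(topic, List.of(setCompact))).all().get();

            // Verify: read the effective topic configuration back.
            admin.describeConfigs(List.of(topic)).all().get()
                 .get(topic).entries().stream()
                 .filter(e -> e.name().equals(TopicConfig.CLEANUP_POLICY_CONFIG))
                 .forEach(e -> System.out.println(e.name() + " = " + e.value()));
        }
    }
}
```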
Producer Configuration (`producer.properties`):

```properties
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
acks=all
enable.idempotence=true
```
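A minimal producer sketch matching those properties (topic and keys are illustrative). Note that on a compacted topic, publishing a `null` value is the standard way to emit a tombstone that eventually removes the key:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class StateProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            // Upsert: the latest value for this key is what survives compaction.
            producer.send(new ProducerRecord<>("customer-state", "acct-1",
                    "balance=75".getBytes(StandardCharsets.UTF_8)));

            // Tombstone: a null value marks the key for removal once compaction runs
            // (and delete.retention.ms has elapsed).
            producer.send(new ProducerRecord<>("customer-state", "acct-2", null));

            producer.flush();
        }
    }
}
```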
Consumer Configuration (`consumer.properties`):

```properties
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
enable.auto.commit=false
```
6. Failure Modes & Recovery
- Broker Failure During Compaction: Compaction is idempotent and its progress is checkpointed per partition, so a restarted broker resumes cleaning from where it left off; in the meantime the controller re-elects a leader on another replica so the partition stays available.
- ISR Shrinkage: If the number of in-sync replicas falls below `min.insync.replicas`, writes with `acks=all` are rejected, preventing data loss. Compaction itself runs per broker on local segments and is not gated on the ISR.
- Message Loss: Idempotent producers and transactional guarantees are crucial to prevent message loss, especially across leader failovers.
- Consumer Lag: Aggressive compaction competes with consumers for broker CPU and I/O, and consumers can fall behind if they cannot keep up with the rate of updates. Monitor consumer lag closely.
- Recovery: Utilize Dead Letter Queues (DLQs) for messages that fail processing, and implement robust offset tracking so consumers resume from the correct position after failures (see the sketch after this list).
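A hedged sketch of that recovery pattern: a consumer with `enable.auto.commit=false` that routes records it cannot process to a hypothetical dead-letter topic (`trades.dlq`) and commits offsets only after each batch has been handled:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class DlqConsumer {
    public static void main(String[] args) {
        Properties cProps = new Properties();
        cProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        cProps.put(ConsumerConfig.GROUP_ID_CONFIG, "risk-analyzer");
        cProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit only after processing

        Properties pProps = new Properties();
        pProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        pProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        pProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<String, String> dlqProducer = new KafkaProducer<>(pProps)) {
            consumer.subscribe(List.of("trades"));            // hypothetical source topic
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : batch) {
                    try {
                        process(record);                      // application-specific logic
                    } catch (Exception e) {
                        // Park the poison record instead of blocking the partition.
                        dlqProducer.send(new ProducerRecord<>("trades.dlq", record.key(), record.value()));
                    }
                }
                consumer.commitSync();                        // offsets advance only after the batch is handled
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        // Placeholder for real processing.
    }
}
```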
7. Performance Tuning
- `linger.ms` & `batch.size` (Producer): Increase these to improve throughput, at the cost of higher latency.
- `compression.type` (Producer/Broker): Use compression (e.g., `gzip`, `snappy`, `lz4`) to reduce storage costs and network bandwidth.
- `fetch.min.bytes` & `replica.fetch.max.bytes` (Consumer/Broker): Adjust these to optimize fetch requests.
- Compaction Overhead: Compaction is CPU and I/O intensive. Monitor broker resource utilization, and consider increasing the topic's partition count to distribute the cleaning load (see the producer tuning sketch below).
Benchmark: A single Kafka broker with SSD storage can typically handle compaction rates of up to 50 MB/s, depending on key cardinality and segment size.
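A sketch of the producer-side tuning knobs from the list above, expressed as Java producer properties; the values are illustrative starting points under assumed workloads, not recommendations:

```java
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class ThroughputTuning {
    static Properties tunedProducerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.LINGER_MS_CONFIG, "20");          // wait up to 20 ms to fill larger batches
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "262144");     // 256 KB batches
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");  // lighter on CPU than gzip, good ratio
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        return props;
    }
}
```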
8. Observability & Monitoring
- Prometheus & JMX: Monitor Kafka JMX metrics using Prometheus (e.g., via the JMX exporter agent).
- Critical Metrics:
  - `kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec`
  - `kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec`
  - `kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions`
  - `kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*` (`records-lag-max`)
- Alerting: Alert on high consumer lag, low ISR count, and high broker CPU/disk utilization (a lag-checking sketch follows this list).
- Grafana Dashboards: Create dashboards to visualize these metrics.
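For the lag alerting above, here is a sketch that computes group lag with the Java `AdminClient` rather than scraping JMX; the group id and bootstrap address are placeholders:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class LagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for the group.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("my-group")
                         .partitionsToOffsetAndMetadata().get();

            // Log-end offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> logEnd =
                    admin.listOffsets(latestSpec).all().get();

            committed.forEach((tp, offset) -> {
                long lag = logEnd.get(tp).offset() - offset.offset();
                System.out.printf("%s lag=%d%n", tp, lag);  // feed this into your alerting pipeline
            });
        }
    }
}
```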
9. Security and Access Control
- SASL/SSL: Encrypt communication between clients and brokers.
- SCRAM: Use SCRAM authentication for secure access (see the client configuration sketch after this list).
- ACLs: Implement Access Control Lists (ACLs) to restrict access to topics and operations.
- Kerberos: Integrate with Kerberos for strong authentication.
- Audit Logging: Enable audit logging to track access and modifications to Kafka configuration.
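A client-side sketch of the SASL/SSL + SCRAM setup mentioned above. The mechanism, credentials, and truststore path are placeholders, and the broker-side listeners plus the SCRAM user must be configured separately:

```java
import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.common.config.SaslConfigs;
import org.apache.kafka.common.config.SslConfigs;

import java.util.Properties;

public class SecureClientConfig {
    static Properties secureProps() {
        Properties props = new Properties();
        props.put(CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG, "broker1:9093");        // placeholder
        props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_SSL");            // encrypt + authenticate
        props.put(SaslConfigs.SASL_MECHANISM, "SCRAM-SHA-512");
        props.put(SaslConfigs.SASL_JAAS_CONFIG,
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
                        + "username=\"svc-trading\" password=\"change-me\";");          // placeholder credentials
        props.put(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, "/etc/kafka/client.truststore.jks");
        props.put(SslConfigs.SSL_TRUSTSTORE_PASSWORD_CONFIG, "change-me");
        return props;
    }
}
```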
10. Testing & CI/CD Integration
- Testcontainers: Use Testcontainers to spin up ephemeral Kafka clusters for integration testing (see the sketch after this list).
- Embedded Kafka: Utilize embedded Kafka for unit testing.
- Consumer Mock Frameworks: Mock consumers to verify producer behavior.
- CI Pipeline:
- Schema compatibility checks.
- Throughput tests.
- Contract testing to ensure data consistency.
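Building on the Testcontainers point above, a sketch of an integration test that starts an ephemeral broker and asserts the topic's `cleanup.policy`. The image tag and the dependencies (`org.testcontainers:kafka`, JUnit 5) are assumptions for this example:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.ConfigResource;
import org.apache.kafka.common.config.TopicConfig;
import org.junit.jupiter.api.Test;
import org.testcontainers.containers.KafkaContainer;
import org.testcontainers.utility.DockerImageName;

import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

import static org.junit.jupiter.api.Assertions.assertEquals;

class CleanupPolicyIT {

    @Test
    void compactedTopicKeepsItsPolicy() throws Exception {
        try (KafkaContainer kafka = new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:7.4.0"))) {
            kafka.start();

            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, kafka.getBootstrapServers());

            try (AdminClient admin = AdminClient.create(props)) {
                // Create a compacted topic against the throwaway broker.
                admin.createTopics(Set.of(new NewTopic("state", 1, (short) 1)
                        .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT))))
                     .all().get();

                // Read the effective config back and assert on it.
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "state");
                String policy = admin.describeConfigs(List.of(topic)).all().get()
                        .get(topic).get(TopicConfig.CLEANUP_POLICY_CONFIG).value();

                assertEquals(TopicConfig.CLEANUP_POLICY_COMPACT, policy);
            }
        }
    }
}
```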
11. Common Pitfalls & Misconceptions
- Incorrect Key Selection: Compaction relies on keys. Poor key selection leads to uneven distribution and hotspots.
- Overly Aggressive Compaction: Can starve consumers.
- Ignoring Schema Evolution: Schema changes can break compaction if not handled carefully.
- Insufficient Disk Space: Compaction creates temporary files.
- Misunderstanding `delete,compact`: It doesn't guarantee all data is retained; compacted data is still subject to the time/size retention limits.
Example Logging (Consumer Lag):

```bash
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group my-group --describe
```

Look for a large gap between `CURRENT-OFFSET` and `LOG-END-OFFSET` (the `LAG` column).
12. Enterprise Patterns & Best Practices
- Shared vs. Dedicated Topics: Use dedicated topics for different applications to isolate compaction and retention policies.
- Multi-Tenant Cluster Design: Implement resource quotas and ACLs to prevent interference between tenants.
- Retention vs. Compaction: Understand the trade-offs between retaining all data and retaining only the latest state.
- Schema Evolution: Use a Schema Registry and backward/forward compatibility strategies.
- Streaming Microservice Boundaries: Design microservices to own their data and manage compaction policies accordingly.
13. Conclusion
`cleanup.policy` is a foundational configuration for building reliable, scalable, and efficient Kafka-based platforms. Properly configuring and monitoring this policy is essential for managing storage costs, optimizing query performance, and ensuring data consistency. Next steps include implementing comprehensive observability, building internal tooling for managing compaction, and continuously refining topic structures based on evolving business requirements. A proactive approach to `cleanup.policy` will significantly improve the long-term health and performance of your Kafka ecosystem.