Kafka `cleanup.policy`: A Deep Dive into Log Management for Production Systems
1. Introduction
Imagine a globally distributed financial trading platform. Every trade, order modification, and market data update is streamed through Kafka. We're dealing with hundreds of millions of events per second, strict regulatory compliance requiring long-term audit trails, and the need for real-time risk analysis. A naive retention policy can quickly lead to disk exhaustion, performance degradation, and even data loss. The `cleanup.policy` configuration is the critical control point for managing Kafka's log segments, directly impacting storage costs, query latency, and overall system stability. This post dives deep into `cleanup.policy`, covering its architecture, configuration, failure modes, and best practices for production deployments. We'll focus on scenarios involving stream processing with Kafka Streams, CDC replication using Debezium, and microservices communicating via event-driven architectures.
2. What is `cleanup.policy` in Kafka Systems?
`cleanup.policy` dictates how Kafka manages the lifecycle of log segments within a topic partition. It has a broker-wide default (`log.cleanup.policy`) and can be overridden per topic (`cleanup.policy`). Log compaction has been part of Kafka since the 0.8.1 release. The policy does not replace the retention settings; rather, the `delete` policy is governed by `retention.ms` and `retention.bytes` at the topic level (with `log.retention.hours`/`log.retention.ms` and `log.retention.bytes` as broker-level defaults), which is what gives you granular, per-topic control.
The core values are:
- `delete`: Segments are deleted based on time or size retention policies (configured via `retention.ms` and `retention.bytes`). This is the default.
- `compact`: Segments are compacted, retaining only the latest value for each key. This is crucial for stateful stream processing and change data capture.
- `delete,compact`: Combines both policies. Segments are compacted and are also deleted once they exceed the retention limits.
`cleanup.policy` operates within the Kafka broker's log management system. Producers append messages to the log and consumers read from it, while the broker, guided by `cleanup.policy`, manages the underlying log segments on disk. The controller is responsible for coordinating partition leadership; the topic configuration itself is propagated to every broker hosting a replica, and each broker applies the policy to its local segments.
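As a concrete illustration, here is a minimal sketch of creating a compacted topic with an explicit `cleanup.policy` using the standard Java `AdminClient`. The topic name, partition/replication counts, bootstrap address, and the extra tuning configs are placeholders for this example, not recommendations:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Topic-level overrides take precedence over the broker-wide log.cleanup.policy default.
            NewTopic topic = new NewTopic("customer-state", 6, (short) 3)   // hypothetical topic
                    .configs(Map.of(
                            TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT,
                            TopicConfig.MIN_CLEANABLE_DIRTY_RATIO_CONFIG, "0.5",
                            TopicConfig.SEGMENT_MS_CONFIG, "3600000"));     // roll segments hourly so they become cleanable

            admin.createTopics(Set.of(topic)).all().get();
        }
    }
}
```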
3. Real-World Use Cases
- Change Data Capture (CDC): Debezium streams database changes to Kafka. Using `compact`, we retain only the latest state for each database record, creating a materialized view (see the consumer sketch after this list). Without compaction, the topic would grow indefinitely with every historical change.
- Event Sourcing: Microservices emit events representing state changes. `compact` keeps the latest event for each entity, which supports snapshot-style state reconstruction; where the full event history is required, a `delete`-based topic with long retention is the better fit.
- Log Aggregation: Collecting logs from distributed applications. `delete` with appropriate retention settings provides a time-bound audit trail.
- Real-time Analytics: Aggregating metrics. `compact` can be used to maintain the latest aggregate value per key, reducing storage and improving query performance.
- Multi-Datacenter Replication (MirrorMaker 2): Maintaining consistent data across regions. `delete` policies must be carefully aligned between source and target clusters to prevent data loss or excessive replication lag.
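For the CDC case, here is a sketch of the materialized-view idea: a plain Java consumer that replays a compacted topic from the beginning and keeps only the latest value per key in a map. The topic name is hypothetical and a real Debezium pipeline would deserialize structured change events rather than raw strings:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class MaterializedView {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "materialized-view-builder");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // replay the compacted log from the start

        Map<String, String> latestByKey = new HashMap<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("customers.cdc")); // hypothetical compacted CDC topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    if (record.value() == null) {
                        latestByKey.remove(record.key()); // tombstone: the row was deleted upstream
                    } else {
                        latestByKey.put(record.key(), record.value()); // latest state wins
                    }
                }
            }
        }
    }
}
```

Kafka Streams' `KTable` gives you the same semantics with state stores and fault tolerance built in; the hand-rolled map above is only meant to show what compaction preserves.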
4. Architecture & Internal Mechanics
```mermaid
graph LR
    A[Producer] --> B(Kafka Broker 1);
    A --> C(Kafka Broker 2);
    B --> D{Topic Partition};
    C --> D;
    D --> E[Log Segments];
    E --> F{"Cleanup Policy (delete/compact)"};
    F --> G[Disk Storage];
    H[Consumer] --> D;
    I[Kafka Controller] -- Coordinates --> B;
    I -- Coordinates --> C;
    style D fill:#f9f,stroke:#333,stroke-width:2px
```
Kafka partitions are divided into log segments, and `cleanup.policy` operates on these segments. When a segment exceeds its size or time limit (for `delete`), it is marked for deletion. For `compact`, the broker's log cleaner scans segments, identifies the latest offset for each key, and rewrites the segments so that older values for the same key are removed. This process is resource-intensive. Each broker runs its log cleaner threads against its local replicas; the controller — ZooKeeper-based, or the KRaft controller quorum that replaces ZooKeeper for metadata management — coordinates partition leadership and propagates topic configuration rather than driving compaction itself. When a Schema Registry is in use, backward/forward-compatible schema evolution keeps compacted records deserializable even as older and newer value versions coexist in the log.
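To make the compaction semantics concrete, here is a toy sketch, pure Java with no broker involved, of the logical rule the log cleaner applies: for each key, only the record with the highest offset survives, and older values for that key are discarded. It is a model of the outcome, not of the cleaner's actual segment-rewriting implementation:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CompactionSemantics {
    record LogRecord(long offset, String key, String value) {}

    // Logical equivalent of compaction: keep the highest-offset value per key.
    static Map<String, LogRecord> compact(List<LogRecord> segment) {
        Map<String, LogRecord> latest = new LinkedHashMap<>();
        for (LogRecord r : segment) {
            latest.put(r.key(), r);   // later offsets overwrite earlier ones for the same key
        }
        return latest;
    }

    public static void main(String[] args) {
        List<LogRecord> segment = List.of(
                new LogRecord(0, "acct-1", "balance=100"),
                new LogRecord(1, "acct-2", "balance=50"),
                new LogRecord(2, "acct-1", "balance=75"));   // supersedes offset 0

        // Only the records at offsets 2 and 1 remain; acct-1's older value is gone,
        // and surviving records keep their original offsets, just as in a compacted partition.
        compact(segment).values().forEach(System.out::println);
    }
}
```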
5. Configuration & Deployment Details
`server.properties` (Broker Configuration):

```properties
log.cleanup.policy=compact,delete
# retain for 7 days
log.retention.ms=604800000
# size-based retention disabled
log.retention.bytes=-1
# 1 GB segments
log.segment.bytes=1073741824
```
`kafka-configs.sh` / `kafka-topics.sh` (Topic Configuration): on current brokers, per-topic config changes are applied with `kafka-configs.sh` rather than `kafka-topics.sh --alter --config`:

```bash
kafka-configs.sh --bootstrap-server localhost:9092 --alter --entity-type topics --entity-name my-topic --add-config cleanup.policy=compact
kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic my-topic
```
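The same change can be made programmatically. A sketch using the Java `AdminClient` and `incrementalAlterConfigs` (available on reasonably recent Kafka versions); the topic name and bootstrap address are placeholders:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import org.apache.kafka.common.config.TopicConfig;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class SetCleanupPolicy {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
            AlterConfigOp setCompact = new AlterConfigOp(
                    new ConfigEntry(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT),
                    AlterConfigOp.OpType.SET);

            admin.incrementalAlterConfigs(Map.of(topic, List.of(setCompact))).all().get();

            // Verify: read the effective topic configuration back.
            admin.describeConfigs(List.of(topic)).all().get()
                 .get(topic).entries().stream()
                 .filter(e -> e.name().equals(TopicConfig.CLEANUP_POLICY_CONFIG))
                 .forEach(e -> System.out.println(e.name() + " = " + e.value()));
        }
    }
}
```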
Producer Configuration (`producer.properties`):

```properties
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
acks=all
enable.idempotence=true
```
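A minimal producer sketch matching those properties (topic and keys are illustrative). Note that on a compacted topic, publishing a `null` value is the standard way to emit a tombstone that eventually removes the key:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class StateProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            // Upsert: the latest value for this key is what survives compaction.
            producer.send(new ProducerRecord<>("customer-state", "acct-1",
                    "balance=75".getBytes(StandardCharsets.UTF_8)));

            // Tombstone: a null value marks the key for removal once compaction runs
            // (and delete.retention.ms has elapsed).
            producer.send(new ProducerRecord<>("customer-state", "acct-2", null));

            producer.flush();
        }
    }
}
```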
Consumer Configuration (`consumer.properties`):

```properties
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
enable.auto.commit=false
```
6. Failure Modes & Recovery
- Broker Failure During Compaction: Compaction is idempotent and its progress is checkpointed per partition, so a restarted broker resumes cleaning from where it left off; in the meantime the controller re-elects a leader on another replica so the partition stays available.
- ISR Shrinkage: If the number of in-sync replicas falls below `min.insync.replicas`, writes with `acks=all` are rejected, preventing data loss. Compaction itself runs per broker on local segments and is not gated on the ISR.
- Message Loss: Idempotent producers and transactional guarantees are crucial to prevent message loss, especially across leader failovers.
- Consumer Lag: Aggressive compaction competes with consumers for broker CPU and I/O, and consumers can fall behind if they cannot keep up with the rate of updates. Monitor consumer lag closely.
- Recovery: Utilize Dead Letter Queues (DLQs) for messages that fail processing, and implement robust offset tracking so consumers resume from the correct position after failures (see the sketch after this list).
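A hedged sketch of that recovery pattern: a consumer with `enable.auto.commit=false` that routes records it cannot process to a hypothetical dead-letter topic (`trades.dlq`) and commits offsets only after each batch has been handled:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class DlqConsumer {
    public static void main(String[] args) {
        Properties cProps = new Properties();
        cProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        cProps.put(ConsumerConfig.GROUP_ID_CONFIG, "risk-analyzer");
        cProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit only after processing

        Properties pProps = new Properties();
        pProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        pProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        pProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<String, String> dlqProducer = new KafkaProducer<>(pProps)) {
            consumer.subscribe(List.of("trades"));            // hypothetical source topic
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : batch) {
                    try {
                        process(record);                      // application-specific logic
                    } catch (Exception e) {
                        // Park the poison record instead of blocking the partition.
                        dlqProducer.send(new ProducerRecord<>("trades.dlq", record.key(), record.value()));
                    }
                }
                consumer.commitSync();                        // offsets advance only after the batch is handled
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        // Placeholder for real processing.
    }
}
```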
7. Performance Tuning
- `linger.ms` & `batch.size` (Producer): Increase these to improve throughput, at the cost of higher latency.
- `compression.type` (Producer/Broker): Use compression (e.g., `gzip`, `snappy`, `lz4`) to reduce storage costs and network bandwidth.
- `fetch.min.bytes` & `replica.fetch.max.bytes` (Consumer/Broker): Adjust these to optimize fetch requests.
- Compaction Overhead: Compaction is CPU and I/O intensive. Monitor broker resource utilization, and consider increasing the topic's partition count to distribute the cleaning load (see the producer tuning sketch below).
Benchmark: A single Kafka broker with SSD storage can typically handle compaction rates of up to 50 MB/s, depending on key cardinality and segment size.
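A sketch of the producer-side tuning knobs from the list above, expressed as Java producer properties; the values are illustrative starting points under assumed workloads, not recommendations:

```java
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class ThroughputTuning {
    static Properties tunedProducerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.LINGER_MS_CONFIG, "20");          // wait up to 20 ms to fill larger batches
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "262144");     // 256 KB batches
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");  // lighter on CPU than gzip, good ratio
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        return props;
    }
}
```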
8. Observability & Monitoring
- Prometheus & JMX: Monitor Kafka JMX metrics using Prometheus (e.g., via the JMX exporter agent).
- Critical Metrics:
  - `kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec`
  - `kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec`
  - `kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions`
  - `kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*` (`records-lag-max`)
- Alerting: Alert on high consumer lag, low ISR count, and high broker CPU/disk utilization (a lag-checking sketch follows this list).
- Grafana Dashboards: Create dashboards to visualize these metrics.
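For the lag alerting above, here is a sketch that computes group lag with the Java `AdminClient` rather than scraping JMX; the group id and bootstrap address are placeholders:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class LagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for the group.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("my-group")
                         .partitionsToOffsetAndMetadata().get();

            // Log-end offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> logEnd =
                    admin.listOffsets(latestSpec).all().get();

            committed.forEach((tp, offset) -> {
                long lag = logEnd.get(tp).offset() - offset.offset();
                System.out.printf("%s lag=%d%n", tp, lag);  // feed this into your alerting pipeline
            });
        }
    }
}
```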
9. Security and Access Control
- SASL/SSL: Encrypt communication between clients and brokers.
- SCRAM: Use SCRAM authentication for secure access (see the client configuration sketch after this list).
- ACLs: Implement Access Control Lists (ACLs) to restrict access to topics and operations.
- Kerberos: Integrate with Kerberos for strong authentication.
- Audit Logging: Enable audit logging to track access and modifications to Kafka configuration.
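A client-side sketch of the SASL/SSL + SCRAM setup mentioned above. The mechanism, credentials, and truststore path are placeholders, and the broker-side listeners plus the SCRAM user must be configured separately:

```java
import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.common.config.SaslConfigs;
import org.apache.kafka.common.config.SslConfigs;

import java.util.Properties;

public class SecureClientConfig {
    static Properties secureProps() {
        Properties props = new Properties();
        props.put(CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG, "broker1:9093");        // placeholder
        props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_SSL");            // encrypt + authenticate
        props.put(SaslConfigs.SASL_MECHANISM, "SCRAM-SHA-512");
        props.put(SaslConfigs.SASL_JAAS_CONFIG,
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
                        + "username=\"svc-trading\" password=\"change-me\";");          // placeholder credentials
        props.put(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, "/etc/kafka/client.truststore.jks");
        props.put(SslConfigs.SSL_TRUSTSTORE_PASSWORD_CONFIG, "change-me");
        return props;
    }
}
```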
10. Testing & CI/CD Integration
- Testcontainers: Use Testcontainers to spin up ephemeral Kafka clusters for integration testing (see the sketch after this list).
- Embedded Kafka: Utilize embedded Kafka for unit testing.
- Consumer Mock Frameworks: Mock consumers to verify producer behavior.
- CI Pipeline:
- Schema compatibility checks.
- Throughput tests.
- Contract testing to ensure data consistency.
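Building on the Testcontainers point above, a sketch of an integration test that starts an ephemeral broker and asserts the topic's `cleanup.policy`. The image tag and the dependencies (`org.testcontainers:kafka`, JUnit 5) are assumptions for this example:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.ConfigResource;
import org.apache.kafka.common.config.TopicConfig;
import org.junit.jupiter.api.Test;
import org.testcontainers.containers.KafkaContainer;
import org.testcontainers.utility.DockerImageName;

import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

import static org.junit.jupiter.api.Assertions.assertEquals;

class CleanupPolicyIT {

    @Test
    void compactedTopicKeepsItsPolicy() throws Exception {
        try (KafkaContainer kafka = new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:7.4.0"))) {
            kafka.start();

            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, kafka.getBootstrapServers());

            try (AdminClient admin = AdminClient.create(props)) {
                // Create a compacted topic against the throwaway broker.
                admin.createTopics(Set.of(new NewTopic("state", 1, (short) 1)
                        .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT))))
                     .all().get();

                // Read the effective config back and assert on it.
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "state");
                String policy = admin.describeConfigs(List.of(topic)).all().get()
                        .get(topic).get(TopicConfig.CLEANUP_POLICY_CONFIG).value();

                assertEquals(TopicConfig.CLEANUP_POLICY_COMPACT, policy);
            }
        }
    }
}
```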
11. Common Pitfalls & Misconceptions
- Incorrect Key Selection: Compaction relies on keys. Poor key selection leads to uneven distribution and hotspots.
- Overly Aggressive Compaction: Can starve consumers.
- Ignoring Schema Evolution: Schema changes can break compaction if not handled carefully.
- Insufficient Disk Space: Compaction creates temporary files.
- Misunderstanding `delete,compact`: It doesn't guarantee all data is retained; compacted data is still subject to the time/size retention limits.
Example Logging (Consumer Lag):

```bash
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group my-group --describe
```

Look for a large gap between `CURRENT-OFFSET` and `LOG-END-OFFSET` (the `LAG` column).
12. Enterprise Patterns & Best Practices
- Shared vs. Dedicated Topics: Use dedicated topics for different applications to isolate compaction and retention policies.
- Multi-Tenant Cluster Design: Implement resource quotas and ACLs to prevent interference between tenants.
- Retention vs. Compaction: Understand the trade-offs between retaining all data and retaining only the latest state.
- Schema Evolution: Use a Schema Registry and backward/forward compatibility strategies.
- Streaming Microservice Boundaries: Design microservices to own their data and manage compaction policies accordingly.
13. Conclusion
`cleanup.policy` is a foundational configuration for building reliable, scalable, and efficient Kafka-based platforms. Properly configuring and monitoring this policy is essential for managing storage costs, optimizing query performance, and ensuring data consistency. Next steps include implementing comprehensive observability, building internal tooling for managing compaction, and continuously refining topic structures based on evolving business requirements. A proactive approach to `cleanup.policy` will significantly improve the long-term health and performance of your Kafka ecosystem.