Kafka Fundamentals: kafka retention.ms

Kafka Retention.ms: A Deep Dive for Production Systems

1. Introduction

Imagine a financial trading platform where order events must be replayed for audit purposes or to correct erroneous trades. Or consider a global IoT system where device telemetry needs to be available for historical analysis, even during network partitions. These scenarios demand careful consideration of data retention in Kafka. A poorly configured retention.ms can lead to data loss, compliance violations, or performance bottlenecks. This post dives deep into kafka retention.ms, focusing on its architectural implications, operational considerations, and performance tuning for large-scale, production Kafka deployments. We’ll assume a microservices architecture leveraging Kafka for event streaming, with a strong emphasis on data contracts enforced via a Schema Registry and observability via Prometheus.

2. What is "kafka retention.ms" in Kafka Systems?

retention.ms is a topic-level Kafka configuration that defines the maximum time a log message is retained in a topic's partitions, regardless of whether it has been consumed. It is specified in milliseconds and overrides the broker-wide defaults (log.retention.ms, log.retention.minutes, or log.retention.hours) for that topic; the policy is enforced per partition by deleting whole log segments.

Time-based retention, along with its size-based counterpart retention.bytes, has been part of Kafka's log management since the early releases. Because topic configurations are dynamic, retention.ms can be changed on a live topic without a broker restart. Key related configurations include:

  • log.retention.check.interval.ms: How often the broker checks for expired segments.
  • cleanup.policy: delete (time/size-based retention), compact, or both; determines whether retention.ms applies at all.
  • log.cleaner.enable: Enables the log cleaner used for compaction, which interacts with retention.
  • log.segment.bytes: Size of individual log segments; retention deletes whole segments, so this sets the granularity of deletion.

The broker’s log manager periodically scans each partition and deletes segments whose newest record timestamp is older than the retention.ms threshold; the active (currently written) segment is never deleted. This process is independent of consumer offsets; messages are removed based on their timestamps, not on whether they’ve been consumed.
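As a quick illustration (hypothetical topic name and broker address), the per-topic override can be inspected with kafka-configs.sh; anything not listed there falls back to the broker-wide default:

kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name orders --describe
# Lists only explicit overrides, e.g. retention.ms=259200000 (3 days);
# topics without an override inherit log.retention.hours from server.properties.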

3. Real-World Use Cases

  1. Audit Logging: Financial institutions require long-term retention (years) of transaction events for regulatory compliance. retention.ms is set to a large value, often combined with log compaction to retain only the latest state per key (a topic-creation sketch for this pattern follows the list).
  2. Out-of-Order Processing: Stream processing applications sometimes receive events out of order. Sufficient retention.ms allows time for late-arriving events to be processed.
  3. Multi-Datacenter Replication (MirrorMaker 2): Replication lag between datacenters necessitates a longer retention period on the source cluster to ensure data availability during failover.
  4. Consumer Lag & Backpressure: If consumers fall behind, a longer retention.ms provides a buffer, preventing producers from being blocked due to full partitions. However, this is a symptom fix, not a solution. Addressing the root cause of consumer lag is crucial.
  5. CDC Replication: Change Data Capture (CDC) streams often require a retention window to allow downstream systems to catch up after outages or schema changes.
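For the audit-logging pattern in use case 1, a minimal sketch, assuming a hypothetical topic name, sizing, and retention value:

# ~7 years of time-based retention combined with compaction, so the latest state per key
# is kept even after older records for that key are compacted away.
kafka-topics.sh --bootstrap-server localhost:9092 --create --topic trade-audit \
  --partitions 12 --replication-factor 3 \
  --config retention.ms=220752000000 \
  --config cleanup.policy=compact,delete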

4. Architecture & Internal Mechanics

graph LR
    A[Producer] --> B(Kafka Broker 1);
    A --> C(Kafka Broker 2);
    B --> D{Partition Leader};
    C --> D;
    D --> E[Log Segment 1];
    D --> F[Log Segment 2];
    E --> G(Disk);
    F --> G;
    H[Consumer] --> B;
    H --> C;
    I[Schema Registry] --> A;
    J[Controller Quorum] --> B;
    J --> C;
    style D fill:#f9f,stroke:#333,stroke-width:2px

When a producer sends a message, the broker appends it to the end of the relevant partition’s log. The log is divided into segments. The retention.ms check, performed by the log manager, iterates through these segments. If a segment’s newest message is older than the retention window (and it is not the active segment), the segment is marked for deletion.

The Controller Quorum manages partition leadership and replication. Retention checks run locally on every broker that hosts a replica, so each replica deletes its own expired segments and all copies converge on the same retained range. If log compaction is enabled, the cleaner process runs periodically, removing older records that share a key with a newer record. KRaft mode replaces ZooKeeper for metadata management, but the retention logic remains fundamentally the same.
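To see the structures the retention check walks, here is a sketch of inspecting a partition's segment files on a broker (assuming log.dirs=/var/lib/kafka/data and a topic named orders; kafka-dump-log.sh ships with the Kafka distribution):

ls /var/lib/kafka/data/orders-0/
# 00000000000000000000.log  00000000000000000000.index  00000000000000000000.timeindex  ...

# Print the records and their timestamps in one segment; the segment becomes eligible for
# deletion once its largest timestamp is older than now minus retention.ms.
kafka-dump-log.sh --files /var/lib/kafka/data/orders-0/00000000000000000000.log --print-data-log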

5. Configuration & Deployment Details

server.properties (Broker Configuration):

# Check for expired segments every 5 minutes
log.retention.check.interval.ms=300000

# Broker-wide default retention of 7 days, overridden per topic by retention.ms
log.retention.hours=168


Topic Configuration (using kafka-configs.sh; altering topic configs via kafka-topics.sh --alter is deprecated in newer Kafka releases):

kafka-configs.sh --bootstrap-server localhost:9092 --alter --entity-type topics --entity-name my-topic --add-config retention.ms=604800000 # 7 days

kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic my-topic

Producer Configuration (producer.properties):

While producers don't configure retention.ms directly, they should use idempotent delivery (enable.idempotence=true) and, where end-to-end exactly-once is required, transactions (transactional.id), so that retries don't introduce duplicates or gaps in topics whose data will eventually expire.

enable.idempotence=true
transactional.id=my-producer-id

6. Failure Modes & Recovery

  • Broker Failure: If a broker fails, retention checks are paused on that broker. Upon recovery, the broker resumes retention checks, potentially deleting segments that were nearing expiration during the outage.
  • Rebalance: During a consumer group rebalance, consumers may temporarily fall behind. A sufficient retention.ms can mitigate data loss, but it’s not a substitute for proper consumer scaling (a temporary retention-override sketch for incidents like this follows the list).
  • Message Loss: Idempotent producers and transactional guarantees are crucial to prevent message loss, especially when retention is configured aggressively.
  • ISR Shrinkage: If the in-sync replica set shrinks below min.insync.replicas, producers using acks=all have their writes rejected for that partition. This doesn’t directly affect retention.ms, but it can exacerbate data loss if retention is too short.
  • Recovery: Dead Letter Queues (DLQs) should be used to handle messages that cannot be processed due to errors or expiration.
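A common operational lever in these situations is a temporary retention bump while the lagging side catches up, removed once things recover; a sketch with hypothetical topic and values:

# Extend retention to 14 days during the incident.
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name orders \
  --alter --add-config retention.ms=1209600000

# Remove the override afterwards so the topic falls back to its normal policy.
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name orders \
  --alter --delete-config retention.ms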

7. Performance Tuning

  • Throughput: Shorter retention.ms reduces disk footprint and the number of segments per partition; appends always go to the active segment, so retention has little direct effect on write throughput.
  • Latency: Longer retention mainly affects consumers reading far back in the log; those reads miss the OS page cache and hit disk, while tail readers stay fast.
  • Lagging Consumers: Consumers that fall far behind pull older segments from disk, evicting page cache and competing with producer writes for I/O.
  • Tuning Configs (illustrative values appear in the sketch after this list):
    • linger.ms: Increase to batch more messages, improving throughput.
    • batch.size: Increase to send larger batches, improving throughput.
    • compression.type: Use compression (e.g., gzip, snappy, lz4) to reduce storage costs and network bandwidth.
    • fetch.min.bytes: Increase to reduce the number of fetch requests, improving throughput.
    • replica.fetch.max.bytes: Increase to allow replicas to fetch larger batches, improving replication performance.
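The settings above in properties form, as a sketch; the values are illustrative starting points (fetch.max.wait.ms added to pair with fetch.min.bytes) and must be validated against your own workload:

# producer.properties
linger.ms=20
batch.size=65536
compression.type=lz4

# consumer.properties
fetch.min.bytes=1048576
fetch.max.wait.ms=500

# server.properties (broker)
replica.fetch.max.bytes=4194304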

Benchmark: A typical Kafka cluster can sustain >1MB/s per partition with reasonable retention settings. Performance degrades as retention increases and consumer lag grows.

8. Observability & Monitoring

  • Prometheus & JMX: Monitor Kafka JMX metrics using Prometheus.
  • Critical Metrics:
    • kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec: Track message ingestion rate.
    • kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec: Track data ingestion rate.
    • records-lag-max under kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*: Monitor consumer lag (a CLI spot-check is sketched after this list).
    • kafka.controller:type=KafkaController,name=ActiveControllerCount: Ensure a healthy controller quorum.
  • Alerting: Alert on:
    • Consumer lag exceeding a threshold.
    • Low ISR count.
    • High broker CPU utilization.
    • Disk space approaching capacity.
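For the consumer lag alert, a CLI spot-check sketch (assuming a consumer group named orders-processor):

kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group orders-processor
# The LAG column is the gap between each partition's log end offset and the group's committed
# offset; if lag keeps growing, retention.ms is all that stands between you and data loss.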

9. Security and Access Control

retention.ms itself doesn’t directly introduce security vulnerabilities. However, long retention periods increase the risk of data breaches. Implement:

  • SASL/SSL: Encrypt communication between producers, brokers, and consumers.
  • SCRAM/Kerberos: Authenticate users and services.
  • ACLs: Control access to topics and consumer groups (a sketch follows this list).
  • Audit Logging: Track access to sensitive data.
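A sketch of the ACL item above, assuming SASL principals named orders-producer and orders-consumer:

# Allow the producer to write and the consumer group to read the orders topic.
# On a secured cluster, add --command-config admin.properties to authenticate the tool itself.
kafka-acls.sh --bootstrap-server localhost:9092 --add \
  --allow-principal User:orders-producer --operation Write --topic orders

kafka-acls.sh --bootstrap-server localhost:9092 --add \
  --allow-principal User:orders-consumer --operation Read --topic orders --group orders-processor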

10. Testing & CI/CD Integration

  • Testcontainers: Use Testcontainers to spin up ephemeral Kafka clusters for integration testing.
  • Embedded Kafka: Use embedded Kafka for unit testing.
  • Consumer Mock Frameworks: Mock consumers to simulate various consumption scenarios.
  • CI Pipeline (a retention drift check is sketched after this list):
    • Schema compatibility checks.
    • Throughput tests with varying retention settings.
    • Consumer lag monitoring.
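For the pipeline checks above, a minimal drift-check sketch (hypothetical topic name, expected value, and $BROKER variable) that fails the build if a topic's live retention diverges from what is declared in source control:

#!/usr/bin/env bash
# Compare the live retention.ms override on my-topic against the expected value.
EXPECTED=604800000   # 7 days
ACTUAL=$(kafka-configs.sh --bootstrap-server "$BROKER" \
  --entity-type topics --entity-name my-topic --describe \
  | grep -o 'retention\.ms=[0-9]*' | head -n1 | cut -d= -f2)
if [ "$ACTUAL" != "$EXPECTED" ]; then
  echo "retention.ms drift on my-topic: expected $EXPECTED, found ${ACTUAL:-<none>}" >&2
  exit 1
fi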

11. Common Pitfalls & Misconceptions

  1. Setting retention.ms too low: Leads to data loss for late-arriving events or slow consumers.
    • Symptom: Missing data in downstream systems.
    • Fix: Increase retention.ms.
  2. Ignoring Consumer Lag: Relying on retention.ms to compensate for chronic consumer lag.
    • Symptom: High consumer lag, increasing retention requirements.
    • Fix: Scale consumers, optimize consumer code, or partition data more effectively.
  3. Not considering Log Compaction: Using retention.ms without understanding how it interacts with log compaction.
    • Symptom: Unexpected data retention behavior.
    • Fix: Understand the interaction between retention.ms and log.cleaner.enable.
  4. Incorrectly interpreting retention behavior: Assuming retention is based on consumer offsets.
    • Symptom: Data disappearing unexpectedly.
    • Fix: Understand that retention is timestamp-based and independent of consumer offsets (a log-start-offset check is sketched after this list).
  5. Forgetting about replication: Retention checks are performed on all replicas.
    • Symptom: Inconsistent retention across brokers.
    • Fix: Ensure proper replication configuration and monitor ISR count.
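Pitfall 4 is easy to demonstrate: the earliest available offset of a partition advances as segments expire, no matter what any consumer group has committed. A sketch, assuming a topic named orders and a recent Kafka distribution that ships kafka-get-offsets.sh:

# --time -2 asks for the earliest offset still on disk; run it periodically and watch it
# move forward as expired segments are deleted, independent of consumer offsets.
kafka-get-offsets.sh --bootstrap-server localhost:9092 --topic orders --time -2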

12. Enterprise Patterns & Best Practices

  • Shared vs. Dedicated Topics: Use dedicated topics for different applications or data streams to allow for granular retention policies.
  • Multi-Tenant Cluster Design: Implement resource quotas and access control to isolate tenants (a client-quota sketch follows this list).
  • Retention vs. Compaction: Use retention for time-based data and compaction for stateful data.
  • Schema Evolution: Ensure schema compatibility to prevent data corruption and consumer errors.
  • Streaming Microservice Boundaries: Define clear boundaries between microservices and use Kafka to facilitate asynchronous communication.
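For the multi-tenant point above, a sketch of client quotas via kafka-configs.sh (hypothetical client id and byte rates):

# Cap a tenant's produce/consume throughput so one team cannot blow through the disk
# budget that the cluster's retention settings were sized for.
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type clients --entity-name tenant-a-app \
  --add-config 'producer_byte_rate=10485760,consumer_byte_rate=10485760'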

13. Conclusion

kafka retention.ms is a critical configuration parameter for building reliable, scalable, and operationally efficient Kafka-based platforms. Careful consideration of its architectural implications, potential failure modes, and performance characteristics is essential. Investing in observability, building internal tooling, and continuously refining topic structure will ensure your Kafka deployment can meet the demands of your real-time data pipelines. Next steps include implementing comprehensive monitoring, automating retention policy management, and exploring advanced features like tiered storage.
