Kafka Retention: A Deep Dive for Production Systems
1. Introduction
Imagine a financial trading platform built on Kafka. We need to reliably capture every trade event for auditing, regulatory compliance, and potential replay for fraud detection. However, storing every event indefinitely is prohibitively expensive and introduces significant operational complexity. This is where Kafka retention becomes critical. It’s not just about disk space; it’s about balancing data durability, cost, and the operational realities of a high-throughput, real-time data platform. Our architecture relies on microservices communicating via Kafka, with stream processing jobs (Kafka Streams, Flink) consuming events for real-time analytics and CDC pipelines replicating data to data lakes. Data contracts are enforced via a Schema Registry, and robust observability is paramount. Incorrect retention settings can lead to data loss, consumer lag, and ultimately, system failure.
2. What is "kafka retention" in Kafka Systems?
Kafka retention defines how long messages are stored on the broker's disk before becoming eligible for deletion. It is a topic-level configuration, so each topic can have its own policy. Retention can be time-based (e.g., 7 days), size-based (e.g., 10 GB), or both; whichever limit is reached first makes a segment eligible for deletion. Since message timestamps were introduced in the 0.10.x releases, time-based retention is evaluated against the largest message timestamp in a segment rather than the segment file's modification time.
Retention is enforced by the Kafka brokers themselves. When a producer sends a message, it is appended to the active segment of the partition's log. A background task (controlled by `log.retention.check.interval.ms`) periodically checks whether closed segments are eligible for deletion under the configured policy. Each broker applies the policy independently to the replicas it hosts; because all replicas of a partition share the same topic configuration, they converge on the same result.
Key configuration flags:

- `retention.ms`: Retention time in milliseconds.
- `retention.bytes`: Maximum size of the log, in bytes, per partition.
- `max.message.bytes`: Maximum size of a single message (record batch).
- `cleanup.policy`: `delete` (default) or `compact`; `compact` retains only the latest value for each key.
- `segment.bytes`: Size of each log segment file.
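As a concrete illustration, here is a minimal sketch (assuming the Java `kafka-clients` AdminClient and a broker at `localhost:9092`) that creates a topic with explicit time- and size-based retention; the topic name and values are hypothetical.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateRetainedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical topic: 6 partitions, RF 3, 7-day time limit and 10 GB size limit per partition.
            NewTopic topic = new NewTopic("trade-events", 6, (short) 3)
                    .configs(Map.of(
                            "retention.ms", "604800000",      // 7 days
                            "retention.bytes", "10737418240", // 10 GB per partition
                            "cleanup.policy", "delete"));     // default policy, stated explicitly

            admin.createTopics(List.of(topic)).all().get(); // block until the creation is acknowledged
        }
    }
}
```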
3. Real-World Use Cases
- Out-of-Order Messages: In a distributed system, messages can arrive out of order. Sufficient retention allows consumers to “rewind” and reprocess messages if they detect gaps in their own sequence numbers (see the rewind sketch after this list).
- Multi-Datacenter Deployment (MirrorMaker 2): When replicating data across datacenters using MirrorMaker 2, retention on the source cluster must be long enough to allow replication to complete, even during network disruptions.
- Consumer Lag & Backpressure: If consumers fall behind, a longer retention window gives them time to catch up before data is deleted out from under them. However, excessive retention increases disk usage and I/O pressure on the brokers.
- Event Sourcing: Applications using event sourcing rely on a complete, immutable history of events. Retention policies must align with the application’s requirements for replayability and auditability.
- CDC Replication: Change Data Capture (CDC) pipelines often require a retention window to allow for initial snapshotting and ongoing replication of database changes.
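The rewind scenario above can be sketched with the standard Java consumer: `offsetsForTimes` maps a timestamp to an offset (which only works while that data is still inside the retention window), and `seek` repositions the consumer. The topic name and broker address are illustrative assumptions.

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class ReplayFromTimestamp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("trade-events", 0);
            consumer.assign(List.of(tp));

            // Ask the broker which offset corresponds to "one hour ago".
            long oneHourAgo = Instant.now().minus(Duration.ofHours(1)).toEpochMilli();
            Map<TopicPartition, OffsetAndTimestamp> offsets =
                    consumer.offsetsForTimes(Map.of(tp, oneHourAgo));

            OffsetAndTimestamp ot = offsets.get(tp);
            if (ot != null) {
                consumer.seek(tp, ot.offset()); // rewind; only possible while the data is still retained
            }
            consumer.poll(Duration.ofSeconds(1)); // resume processing from the rewound position
        }
    }
}
```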
4. Architecture & Internal Mechanics
Kafka stores messages in an append-only log, divided into segments. Each segment has a maximum size defined by `segment.bytes`. Retention operates at the segment level: only closed (non-active) segments that have aged or grown past the configured limits are deleted. The controller is responsible for partition leadership and metadata management; segment deletion itself is performed locally by each broker for the replicas it hosts. Replication ensures data durability, and because every replica of a partition uses the same topic configuration, retention is applied consistently across all in-sync replicas (ISRs).
```mermaid
graph LR
    A[Producer] --> B(Kafka Broker 1);
    A --> C(Kafka Broker 2);
    A --> D(Kafka Broker 3);
    B --> E{Log Segments};
    C --> E;
    D --> E;
    E --> F[Retention Policy Check];
    F --> G{Eligible for Deletion?};
    G -- Yes --> H[Delete Segment];
    G -- No --> E;
    I(Consumer) --> B;
    I --> C;
    I --> D;
    style E fill:#f9f,stroke:#333,stroke-width:2px
```
With KRaft mode (Kafka Raft metadata mode), the controller’s responsibilities are handled by a Raft quorum, improving scalability and fault tolerance. Schema Registry integration ensures data consistency and compatibility, while MirrorMaker 2 replicates data and applies retention policies across clusters. ZooKeeper is no longer required for metadata management in KRaft mode.
5. Configuration & Deployment Details
`server.properties` (Broker Configuration):

```properties
# Default retention: 7 days
log.retention.hours=168
# No size limit
log.retention.bytes=-1
# 1 GB segment size
log.segment.bytes=1073741824
```
`consumer.properties` (Consumer Configuration):

```properties
# Start consuming from the beginning if no committed offset is found
auto.offset.reset=earliest
# Automatically commit offsets (consider manual commits for at-least-once processing)
enable.auto.commit=true
```
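For comparison, here is a minimal sketch of the manual-commit alternative mentioned above (set `enable.auto.commit=false` and call `commitSync` only after records have been processed), using the standard Java consumer; the topic and group names are hypothetical.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "audit-consumer");     // hypothetical group
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");    // commit only after processing
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("trade-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // application-specific handling
                }
                consumer.commitSync(); // offsets are committed only once the batch is fully processed
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("offset=%d key=%s%n", record.offset(), record.key());
    }
}
```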
CLI Examples:

- Set retention time for a topic (7 days):

```bash
kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name my-topic \
  --alter --add-config retention.ms=604800000
```

- Set retention size for a topic (10 GB per partition):

```bash
kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name my-topic \
  --alter --add-config retention.bytes=10737418240
```

- Describe topic configuration:

```bash
kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name my-topic --describe
```
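The same changes can be made programmatically. Here is a minimal sketch with the Java AdminClient (`incrementalAlterConfigs` to set `retention.ms`, then `describeConfigs` to read it back); `my-topic` and the broker address follow the CLI examples above.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class AlterRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");

            // Equivalent of: kafka-configs.sh ... --alter --add-config retention.ms=604800000
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "604800000"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(setRetention))).all().get();

            // Equivalent of: kafka-configs.sh ... --describe
            Config config = admin.describeConfigs(List.of(topic)).all().get().get(topic);
            System.out.println("retention.ms = " + config.get("retention.ms").value());
        }
    }
}
```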
6. Failure Modes & Recovery
- Broker Failure: Retention policies are applied consistently across replicas. If a broker fails, the ISR shrinks, but retention is still enforced by the remaining replicas.
- Rebalances: During rebalances, consumers may temporarily fall behind. Sufficient retention provides a buffer.
- Message Loss: Retention doesn’t prevent message loss due to producer errors or network issues. Idempotent producers and transactional guarantees are crucial for ensuring message delivery.
- ISR Shrinkage: If the ISR shrinks down to the leader alone and that broker then fails, the partition goes offline (unless unclean leader election is enabled), and no retention processing happens for it until a replica is brought back.
Recovery Strategies:
- Idempotent Producers: Prevent duplicate writes caused by producer retries within a partition.
- Transactional Guarantees: Atomically write messages to multiple partitions and, combined with `read_committed` consumers, enable exactly-once processing (see the producer sketch after this list).
- Offset Tracking: Consumers must reliably track their offsets to avoid reprocessing or skipping messages.
- Dead Letter Queues (DLQs): Route failed messages to a DLQ for investigation and reprocessing.
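As referenced above, here is a minimal sketch of an idempotent, transactional producer using the standard Java client; the topic names and transactional id are hypothetical, and error handling is reduced to a single abort.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class TransactionalTradeProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");            // no duplicates on retry
        props.put(ProducerConfig.ACKS_CONFIG, "all");                           // required for idempotence
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "trade-producer-1");  // hypothetical id
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                // Both writes commit or abort together.
                producer.send(new ProducerRecord<>("trade-events", "trade-42", "BUY 100 ACME"));
                producer.send(new ProducerRecord<>("audit-log", "trade-42", "recorded"));
                producer.commitTransaction();
            } catch (Exception e) {
                producer.abortTransaction(); // nothing becomes visible to read_committed consumers
            }
        }
    }
}
```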
7. Performance Tuning
Retention impacts disk I/O and producer performance.
- Throughput: Excessive retention can reduce write throughput, especially when compaction runs frequently, so a reasonable balance is crucial. (Illustrative example: 100 MB/s write throughput with 7-day retention, dropping to 70 MB/s with 30-day retention.)
- Latency: Long retention periods can increase read latency, especially for tail log reads.
- Tail Log Pressure: High write rates combined with long retention drive up disk utilization; once disks or page cache saturate, appends slow down and producers start retrying.
Tuning Configs:

- `linger.ms`: Increase to batch more messages, improving throughput.
- `batch.size`: Increase to send larger batches, improving throughput.
- `compression.type`: Use compression (e.g., `gzip`, `snappy`, `lz4`) to reduce storage costs and I/O.
- `fetch.min.bytes`: Increase to reduce the number of fetch requests, improving consumer performance.
- `replica.fetch.max.bytes`: Increase to allow replicas to fetch larger batches, improving replication performance.
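To make the producer-side knobs concrete, here is a minimal sketch of a throughput-oriented Java producer configuration; the specific values are illustrative assumptions, not recommendations.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class ThroughputTunedProducer {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.LINGER_MS_CONFIG, "20");            // wait up to 20 ms to fill batches
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "131072");       // 128 KB batches
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");    // cheaper storage and I/O
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        return new KafkaProducer<>(props);
    }
}
```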
8. Observability & Monitoring
- Prometheus & JMX: Monitor Kafka JMX metrics using Prometheus.
- Grafana Dashboards: Visualize key metrics in Grafana.
Critical Metrics:

- `kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec`: Incoming message rate.
- `kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec`: Incoming byte rate.
- `kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*`, attribute `records-lag-max`: Consumer lag as reported by the clients (or use a broker-side exporter such as Burrow or kafka-lag-exporter).
- `kafka.controller:type=KafkaController,name=ActiveControllerCount`: Active controller count (should be exactly 1).
- `kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce`: Request processing time (also available per Fetch request type).
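If you scrape JMX directly (for example, before a Prometheus JMX exporter is in place), reading one of these MBeans from Java looks roughly like this; the JMX port 9999 is an assumption and must match the broker's `JMX_PORT`.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerMetricProbe {
    public static void main(String[] args) throws Exception {
        // Assumes the broker was started with JMX_PORT=9999.
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");

        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName messagesIn =
                    new ObjectName("kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec");
            // The meter exposes rate attributes such as OneMinuteRate and Count.
            Object oneMinuteRate = mbs.getAttribute(messagesIn, "OneMinuteRate");
            System.out.println("MessagesInPerSec (1m rate): " + oneMinuteRate);
        } finally {
            connector.close();
        }
    }
}
```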
Alerting Conditions:
- Consumer lag exceeding a threshold (a lag-check sketch follows this list).
- ISR shrinkage below a threshold.
- High request latency.
- Disk space utilization exceeding a threshold.
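For the lag alert referenced above, here is a sketch of computing group lag with the AdminClient by comparing committed offsets against log-end offsets; the group id and broker address are assumptions.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult.ListOffsetsResultInfo;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Offsets the group has committed, per partition.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("fraud-replay") // hypothetical group
                         .partitionsToOffsetAndMetadata().get();

            // Latest offsets currently on the brokers for the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResultInfo> latest =
                    admin.listOffsets(latestSpec).all().get();

            committed.forEach((tp, offset) -> {
                long lag = latest.get(tp).offset() - offset.offset();
                System.out.printf("%s lag=%d%n", tp, lag); // alert if lag exceeds your threshold
            });
        }
    }
}
```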
9. Security and Access Control
Long retention windows keep potentially sensitive data on broker disks for longer, so retention must be paired with proper access control and encryption.
- SASL/SSL: Use SASL/SSL for authentication and encryption in transit.
- SCRAM: Use SCRAM for password-based authentication.
- ACLs: Configure ACLs to restrict access to topics based on user roles.
- Kerberos: Integrate with Kerberos for strong authentication.
- Audit Logging: Enable audit logging to track access to Kafka data.
10. Testing & CI/CD Integration
- Testcontainers: Use Testcontainers to spin up ephemeral Kafka clusters for integration testing (see the sketch after this list).
- Embedded Kafka: Use embedded Kafka for unit testing.
- Consumer Mock Frameworks: Mock consumers to verify producer behavior.
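A minimal integration-test sketch using the Testcontainers `KafkaContainer` (JUnit 5 assumed); the image tag, topic name, and retention value are illustrative assumptions.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.junit.jupiter.api.Test;
import org.testcontainers.containers.KafkaContainer;
import org.testcontainers.utility.DockerImageName;

import java.util.List;
import java.util.Map;
import java.util.Properties;

import static org.junit.jupiter.api.Assertions.assertTrue;

class RetentionPolicyIT {

    @Test
    void topicIsCreatedWithExpectedRetention() throws Exception {
        try (KafkaContainer kafka =
                     new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:7.5.0"))) {
            kafka.start();

            Properties props = new Properties();
            props.put("bootstrap.servers", kafka.getBootstrapServers());

            try (AdminClient admin = AdminClient.create(props)) {
                NewTopic topic = new NewTopic("retention-test", 1, (short) 1)
                        .configs(Map.of("retention.ms", "3600000")); // 1 hour, illustrative
                admin.createTopics(List.of(topic)).all().get();

                // In a fuller test, read the config back with describeConfigs and assert on it.
                assertTrue(admin.listTopics().names().get().contains("retention-test"));
            }
        }
    }
}
```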
CI Strategies:
- Schema Compatibility Checks: Ensure schema compatibility between producers and consumers.
- Throughput Checks: Verify that the system can handle the expected message volume.
- Retention Policy Validation: Confirm that retention policies are correctly applied.
11. Common Pitfalls & Misconceptions
- Incorrect Retention Configuration: Setting retention too short leads to data loss; too long leads to excessive storage costs.
- Ignoring Compaction: Failing to use compaction when appropriate results in unnecessary storage consumption.
- Assuming Retention Guarantees Delivery: Retention only guarantees storage; it doesn’t prevent message loss due to producer errors.
- Not Monitoring Consumer Lag: Ignoring consumer lag can lead to data backlogs and system instability.
- Misunderstanding `cleanup.policy`: Using `compact` without understanding its implications for key-based retention (a compacted-topic sketch follows this list).
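To illustrate the compaction pitfall, here is a sketch that creates a compacted topic; the compaction-related values are illustrative assumptions. Note that with pure `compact` the latest record per key is retained indefinitely unless you combine policies (`cleanup.policy=compact,delete`).

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("account-balances", 3, (short) 3) // hypothetical topic
                    .configs(Map.of(
                            "cleanup.policy", "compact",
                            "min.cleanable.dirty.ratio", "0.1",   // compact more aggressively
                            "delete.retention.ms", "86400000"));  // keep tombstones readable for 1 day

            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```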
Example Logging (Consumer Lag):
```
[2024-01-26 10:00:00,000] WARN Consumer lag detected for topic my-topic, partition 0: 10000 messages.
```
12. Enterprise Patterns & Best Practices
- Shared vs. Dedicated Topics: Use dedicated topics for different applications or data streams to simplify retention management.
- Multi-Tenant Cluster Design: Implement quotas and ACLs to isolate tenants and prevent resource contention.
- Retention vs. Compaction: Use retention for time-based data and compaction for key-based data.
- Schema Evolution: Implement a robust schema evolution strategy to ensure compatibility between producers and consumers.
- Streaming Microservice Boundaries: Define clear boundaries between streaming microservices to simplify data ownership and retention policies.
13. Conclusion
Kafka retention is a fundamental aspect of building reliable, scalable, and cost-effective real-time data platforms. Careful consideration of retention policies, coupled with robust observability and monitoring, is essential for ensuring data durability, preventing data loss, and optimizing system performance. Next steps include implementing comprehensive monitoring dashboards, building internal tooling for managing retention policies, and refactoring topic structures to align with application requirements.