
Kafka Fundamentals: kafka at-least-once

Kafka At-Least-Once: A Deep Dive for Production Systems

1. Introduction

Consider a financial transaction processing system built on Kafka. Each transaction event must be processed exactly once to avoid discrepancies. However, achieving true exactly-once semantics in a distributed system is complex and often comes with significant performance overhead. In practice, many systems opt for “at-least-once” delivery, coupled with idempotent processing on the consumer side. This approach balances reliability with acceptable performance. This post dives deep into the architectural implications, configuration, and operational considerations of building Kafka systems that leverage at-least-once delivery, focusing on production-grade deployments within a microservices architecture utilizing Kafka for event streaming and CDC replication. We’ll assume a cloud-native environment with robust observability and CI/CD pipelines.

2. What is "kafka at-least-once" in Kafka Systems?

“Kafka at-least-once” guarantees that a message published to a Kafka topic will be delivered to each consumer at least once. It’s a consequence of Kafka’s design, prioritizing throughput and durability over strict exactly-once semantics. The guarantee stems from Kafka’s replication mechanism and offset management.

From an architectural perspective, at-least-once is achieved through:

  • Producer Acknowledgements: Producers configure acks to control how many broker acknowledgements are required before a write is considered successful. acks=all waits for every in-sync replica (ISR), ensuring data is replicated before acknowledgement and providing the strongest guarantee.
  • Consumer Offset Management: Consumers track their progress by committing offsets. If a consumer fails after processing a message but before committing the offset, the message will be re-delivered upon restart.
  • Replication: Kafka brokers replicate partitions across multiple brokers. If a broker fails, a replica takes over, ensuring data availability.

Key configurations:

  • producer.acks: Controls acknowledgement requirements. (0, 1, all)
  • consumer.enable.auto.commit: Enables/disables automatic offset committing.
  • consumer.auto.commit.interval.ms: Frequency of automatic offset commits.
  • consumer.isolation.level: (read_uncommitted, read_committed) – controls whether consumers see messages from transactions that have not yet been committed (or were aborted).
  • consumer.auto.offset.reset: (earliest, latest) – where a consumer starts reading when no committed offset exists.
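
A minimal Java sketch wiring these settings together for at-least-once, assuming the standard kafka-clients library and a broker at localhost:9092 (both assumptions, not part of the original setup):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class AtLeastOnceConfigs {

    // Producer side: wait for all in-sync replicas and retry safely.
    static Properties producerProps() {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        p.put(ProducerConfig.ACKS_CONFIG, "all");                // replicate before acknowledging
        p.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); // broker dedupes producer retries
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        return p;
    }

    // Consumer side: commit offsets manually, only after processing succeeds.
    static Properties consumerProps() {
        Properties c = new Properties();
        c.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        c.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");
        c.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        c.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        c.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
        c.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        c.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        return c;
    }
}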

KIP-98 introduced idempotent and transactional producers, offering a path toward stronger guarantees, but they still build on at-least-once delivery as a foundation. Kafka Raft (KRaft) mode, which replaces ZooKeeper, doesn’t fundamentally change at-least-once behavior but improves cluster stability and scalability.

3. Real-World Use Cases

  1. Financial Transaction Logging: Every transaction must be logged, even if processing fails initially. Idempotent processing on the consumer side prevents duplicate application of the transaction (a handler sketch follows this list).
  2. Change Data Capture (CDC): Replicating database changes to downstream systems requires at-least-once delivery. Duplicate events are handled by the downstream system’s idempotent update logic.
  3. Log Aggregation: Collecting logs from numerous sources. A few duplicate logs are acceptable, while missing logs are critical.
  4. Event Sourcing: Storing a sequence of events representing state changes. Idempotent event handlers ensure consistent state reconstruction.
  5. Multi-Datacenter Replication (MirrorMaker 2): Replicating data across geographically distributed datacenters. At-least-once ensures no data loss during network partitions or failures.
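
To make at-least-once safe in cases like 1, 2, and 4, the consumer must apply each event idempotently. A minimal handler sketch, assuming each event carries a unique ID in the record key and a processedIds dedup store (both assumptions; in production this would be a durable store such as a unique-keyed database table rather than an in-memory set):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.consumer.ConsumerRecord;

public class IdempotentHandler {

    // Hypothetical dedup store; swap for a durable store in production.
    private final Set<String> processedIds = ConcurrentHashMap.newKeySet();

    public void handle(ConsumerRecord<String, String> record) {
        String eventId = record.key(); // assumes the event ID is the record key
        if (!processedIds.add(eventId)) {
            return; // duplicate delivery: already applied, skip
        }
        applyTransaction(record.value()); // business logic runs once per event ID
    }

    private void applyTransaction(String payload) {
        // apply the state change / write to the downstream system
    }
}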

4. Architecture & Internal Mechanics

graph LR
    A[Producer] --> B(Kafka Broker 1);
    A --> C(Kafka Broker 2);
    A --> D(Kafka Broker 3);
    B --> E{Partition Leader};
    C --> E;
    D --> E;
    E --> F[Log Segment];
    F --> G(Replicas on Brokers 2 & 3);
    H[Consumer] --> E;
    H --> I{Offset Commit};
    I --> J["Kafka Broker (Offset Storage)"];
    style E fill:#f9f,stroke:#333,stroke-width:2px

The diagram illustrates the core flow. The producer sends messages to the broker cluster. The partition leader appends the message to its log segment. Replication ensures copies exist on other brokers. Consumers read from the leader and commit offsets. The controller manages partition leadership and ISRs.

Kafka’s internal mechanics are crucial:

  • Log Segments: Messages are appended to immutable log segments.
  • Controller Quorum: The controller manages cluster metadata and partition assignments.
  • Replication: Data is replicated across brokers based on the replication factor.
  • ISR (In-Sync Replicas): The set of replicas currently caught up to the leader. acks=all requires acknowledgement from all ISRs (see the inspection sketch after this list).
  • Retention: Messages are retained for a configurable period or size.
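
Leadership and ISR membership can also be inspected programmatically. A short sketch using the kafka-clients AdminClient (allTopicNames() requires kafka-clients 3.1+; the broker address and topic name are assumptions):

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class IsrInspector {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // Describe the topic and print leader and ISR for each partition.
            Map<String, TopicDescription> topics =
                admin.describeTopics(List.of("my-topic")).allTopicNames().get();
            topics.get("my-topic").partitions().forEach(p ->
                System.out.printf("partition %d leader=%s isr=%s%n",
                    p.partition(), p.leader(), p.isr()));
        }
    }
}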

5. Configuration & Deployment Details

server.properties (Broker):

log.retention.hours=168
log.retention.bytes=-1
num.partitions=10
default.replication.factor=3
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3

consumer.properties (Consumer):

group.id=my-consumer-group
enable.auto.commit=false
auto.offset.reset=earliest
isolation.level=read_committed
max.poll.records=500
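
With enable.auto.commit=false, the application owns offset commits. A minimal poll-loop sketch that commits only after the whole batch has been processed (topic name and processing logic are placeholders):

import java.time.Duration;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceLoop {
    public static void run(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(List.of("my-topic"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                process(record); // if this throws, nothing is committed and the batch is re-delivered
            }
            consumer.commitSync(); // commit only after the whole batch has been processed
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // idempotent business logic goes here
    }
}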

CLI Examples:

  • Create a topic with replication factor 3:

    kafka-topics.sh --create --topic my-topic --bootstrap-server localhost:9092 --partitions 10 --replication-factor 3
    
  • Describe topic configuration:

    kafka-topics.sh --describe --topic my-topic --bootstrap-server localhost:9092
    
  • Check consumer group offsets:

    kafka-consumer-groups.sh --group my-consumer-group --describe --bootstrap-server localhost:9092
    

6. Failure Modes & Recovery

  • Broker Failure: If the leader fails, a replica takes over. At-least-once is maintained as long as sufficient replicas are in the ISR.
  • Rebalance: During a rebalance, consumers may re-read messages from the last committed offset.
  • Message Loss (Rare): If the leader fails before a message has been replicated to the other in-sync replicas (for example, when produced with acks=0 or acks=1), that message can be lost. Increasing acks and min.insync.replicas mitigates this.
  • ISR Shrinkage: If the number of ISRs falls below min.insync.replicas, writes are blocked.

Recovery Strategies:

  • Idempotent Producers: Set enable.idempotence=true so producer retries cannot create duplicate writes.
  • Transactional Producers: Use transactions to ensure atomic writes across multiple partitions.
  • Offset Tracking: Manually commit offsets after successful processing.
  • Dead Letter Queues (DLQs): Send failed messages to a DLQ for investigation and reprocessing, as sketched below.
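
A sketch combining manual offset tracking with the DLQ pattern; the DLQ topic name and the commit-after-parking policy are assumptions rather than built-in Kafka behavior:

import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DlqConsumer {
    private static final String DLQ_TOPIC = "my-topic.dlq"; // hypothetical naming convention

    static void pollOnce(KafkaConsumer<String, String> consumer,
                         KafkaProducer<String, String> producer) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            try {
                process(record);
            } catch (Exception e) {
                // Park the failed message for later inspection instead of blocking the partition.
                producer.send(new ProducerRecord<>(DLQ_TOPIC, record.key(), record.value()));
            }
        }
        consumer.commitSync(); // offsets advance whether the record succeeded or went to the DLQ
    }

    private static void process(ConsumerRecord<String, String> record) {
        // idempotent business logic goes here
    }
}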

7. Performance Tuning

Benchmark: A typical Kafka cluster can achieve throughput of 100MB/s - 1GB/s depending on hardware and configuration.

  • linger.ms: Increase to batch messages, improving throughput but increasing latency.
  • batch.size: Larger batches improve throughput but increase memory usage.
  • compression.type: gzip, snappy, or lz4 can reduce network bandwidth.
  • fetch.min.bytes: Increase to reduce the number of fetch requests.
  • replica.fetch.max.bytes: Control the maximum size of fetch requests for replicas.

At-least-once can impact latency due to potential re-deliveries. Careful tuning of batching and compression is crucial. Tail log pressure can increase with high throughput, requiring sufficient disk I/O.
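
A hedged starting point for the batching and compression settings above; the values are illustrative, not benchmark results, and should be tuned against your own workload:

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ThroughputTuning {
    static Properties tunedProducerProps(Properties base) {
        Properties p = new Properties();
        p.putAll(base);
        p.put(ProducerConfig.LINGER_MS_CONFIG, "20");          // wait up to 20 ms to fill batches
        p.put(ProducerConfig.BATCH_SIZE_CONFIG, "65536");      // 64 KB batches
        p.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");  // low CPU cost, good ratio
        return p;
    }
}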

8. Observability & Monitoring

  • Prometheus: Expose Kafka JMX metrics to Prometheus.
  • Kafka JMX Metrics: Monitor consumer-fetch-manager-metrics and producer-metrics on clients, and BrokerTopicMetrics on brokers.
  • Grafana Dashboards: Visualize key metrics:
    • Consumer Lag (critical)
    • Replication In-Sync Count
    • Request/Response Time
    • Queue Length
  • Alerting:
    • Alert if consumer lag exceeds a threshold (a lag-check sketch follows this list).
    • Alert if the ISR count falls below min.insync.replicas.
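
Alongside JMX, consumer lag can be computed directly by comparing committed group offsets with log-end offsets. A minimal AdminClient sketch (group name and bootstrap address are assumptions):

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // Committed offsets for the group.
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets("my-consumer-group")
                     .partitionsToOffsetAndMetadata().get();
            // Latest (log-end) offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                admin.listOffsets(latestSpec).all().get();
            committed.forEach((tp, om) -> {
                long lag = latest.get(tp).offset() - om.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}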

9. Security and Access Control

  • TLS/SSL: Encrypt communication between clients and brokers.
  • SASL/SCRAM: Use SCRAM (SCRAM-SHA-256 or SCRAM-SHA-512) for authentication.
  • ACLs: Control access to topics and consumer groups.
  • Kerberos: Integrate with Kerberos for authentication.
  • Audit Logging: Enable audit logging to track access and modifications.

At-least-once delivery doesn’t inherently introduce new security vulnerabilities, but proper security configuration is essential to protect data in transit and at rest.

10. Testing & CI/CD Integration

  • Testcontainers: Spin up throwaway Kafka containers for integration tests (see the sketch at the end of this section).
  • Consumer Mock Frameworks: Simulate consumer behavior for testing.
  • Integration Tests:
    • Publish messages and verify they are consumed at least once.
    • Simulate broker failures and verify recovery.
    • Test schema compatibility.
  • CI Strategies:
    • Run integration tests on every commit.
    • Perform throughput checks to ensure performance regressions are detected.
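
A minimal Testcontainers sketch of the "consumed at least once" check, assuming JUnit 5, the org.testcontainers:kafka module, and the confluentinc/cp-kafka image (all assumptions):

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;
import org.testcontainers.containers.KafkaContainer;
import org.testcontainers.utility.DockerImageName;

class AtLeastOnceIT {

    @Test
    void messageIsConsumedAtLeastOnce() throws Exception {
        try (KafkaContainer kafka =
                 new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:7.4.0"))) {
            kafka.start();
            String bootstrap = kafka.getBootstrapServers();

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps(bootstrap));
                 KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps(bootstrap))) {
                producer.send(new ProducerRecord<>("my-topic", "k1", "v1")).get();
                consumer.subscribe(List.of("my-topic"));

                // Poll a few times to allow group assignment before asserting delivery.
                int received = 0;
                for (int i = 0; i < 10 && received == 0; i++) {
                    received += consumer.poll(Duration.ofSeconds(2)).count();
                }
                Assertions.assertTrue(received >= 1); // at least once; duplicates are allowed
            }
        }
    }

    private static Properties producerProps(String bootstrap) {
        Properties p = new Properties();
        p.put("bootstrap.servers", bootstrap);
        p.put("acks", "all");
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        return p;
    }

    private static Properties consumerProps(String bootstrap) {
        Properties c = new Properties();
        c.put("bootstrap.servers", bootstrap);
        c.put("group.id", "it-group");
        c.put("enable.auto.commit", "false");
        c.put("auto.offset.reset", "earliest");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        return c;
    }
}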

11. Common Pitfalls & Misconceptions

  1. Assuming Exactly-Once: At-least-once is not exactly-once. Idempotent processing is required.
  2. Ignoring Consumer Lag: High consumer lag indicates processing bottlenecks.
  3. Insufficient ISRs: Low ISRs increase the risk of data loss.
  4. Rebalancing Storms: Frequent rebalances disrupt processing. Tune session.timeout.ms and heartbeat.interval.ms.
  5. Incorrect Offset Management: Failing to commit offsets correctly leads to message re-delivery or loss.

Example kafka-consumer-groups.sh output showing a consumer stuck on an old offset:

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                HOST                                CLIENT-ID
my-consumer-group my-topic       0          100             200             100             consumer-1234567890-abcdefg               /127.0.0.1                        client-1

12. Enterprise Patterns & Best Practices

  • Shared vs. Dedicated Topics: Use dedicated topics for critical data streams.
  • Multi-Tenant Cluster Design: Isolate tenants using quotas and ACLs.
  • Retention vs. Compaction: Use compaction to retain only the latest value for each key (see the sketch after this list).
  • Schema Evolution: Use a Schema Registry to manage schema changes.
  • Streaming Microservice Boundaries: Design microservices to consume and produce events independently.
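
A sketch of the retention-vs-compaction choice: creating a compacted topic with the AdminClient (the topic name, partition count, and min.insync.replicas value are assumptions):

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            NewTopic topic = new NewTopic("customer-latest-state", 10, (short) 3)
                .configs(Map.of(
                    "cleanup.policy", "compact",     // keep only the latest value per key
                    "min.insync.replicas", "2"));    // pair with acks=all for durability
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}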

13. Conclusion

Kafka’s at-least-once delivery provides a robust foundation for building reliable, scalable, and real-time data platforms. By understanding the underlying architecture, carefully configuring brokers and clients, and implementing robust observability and recovery strategies, engineers can leverage at-least-once to build production-grade systems that meet demanding business requirements. Next steps include implementing comprehensive monitoring, building internal tooling for offset management, and refactoring topic structures to optimize performance and scalability.
