
Kafka Fundamentals: kafka at-most-once

Kafka At-Most-Once: A Deep Dive for Production Systems

1. Introduction

Consider a financial transaction processing system built on Kafka. Each transaction event must be processed exactly once to maintain data integrity. However, achieving true exactly-once semantics in a distributed system is complex and often comes with significant performance overhead. In many scenarios, a carefully engineered “at-most-once” approach, coupled with application-level idempotency, provides a pragmatic and scalable solution. This is particularly relevant in microservice architectures where event-driven communication is paramount, and data consistency is critical across services. We need to understand how Kafka’s inherent capabilities and configurations can be leveraged to minimize data loss and ensure reliable event delivery, even in the face of failures. This post will explore the nuances of achieving “at-most-once” delivery in Kafka, focusing on architectural considerations, configuration, and operational best practices.

2. What is "kafka at-most-once" in Kafka Systems?

“Kafka at-most-once” isn’t a single configuration flag, but rather a delivery semantic produced by a particular combination of producer and consumer settings. It means that a message is delivered to a consumer zero or one times: it can be lost in certain failure scenarios, but it is never redelivered. This is the opposite trade-off from at-least-once, where messages survive failures but may be duplicated; Kafka’s out-of-the-box behavior sits closer to at-least-once, so at-most-once is something you opt into deliberately. Kafka itself never guarantees that a message will be processed exactly once unless you use transactions.

This behavior stems from Kafka’s design. Producers can configure acks=0 (fire and forget), acks=1 (leader acknowledgement), or acks=all (leader plus in-sync replica acknowledgement). acks=0 is the producer-side at-most-once setting: a message lost in transit is simply gone, while acks=all provides the strongest durability guarantee. On the consumer side, the semantic is determined by when offsets are committed relative to processing. Committing before processing yields at-most-once: if the consumer crashes mid-batch, it resumes after the committed offset and the unprocessed messages are skipped. Committing after processing yields at-least-once: a crash means re-reading from the last committed offset and potentially processing some messages twice.
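
To make the commit-before-processing pattern concrete, here is a minimal Java consumer sketch of at-most-once consumption. The topic name, group id, and handleRecord method are assumptions for illustration, not part of any specific system.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtMostOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-consumer-group");
        props.put("enable.auto.commit", "false"); // commit manually, *before* processing
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                // Commit first: if we crash in the loop below, these records are never
                // redelivered (at-most-once). Committing *after* the loop would give at-least-once.
                consumer.commitSync();
                for (ConsumerRecord<String, String> record : records) {
                    handleRecord(record); // hypothetical processing step
                }
            }
        }
    }

    private static void handleRecord(ConsumerRecord<String, String> record) {
        System.out.printf("offset=%d key=%s%n", record.offset(), record.key());
    }
}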

Introduced in KIP-500, the Kafka Raft (KRaft) mode replaces ZooKeeper for metadata management, impacting the controller quorum and broker failover behavior, but it doesn’t fundamentally change delivery semantics. Key configuration flags impacting this behavior include acks, retries, enable.idempotence, and max.in.flight.requests.per.connection on the producer, and enable.auto.commit and session.timeout.ms on the consumer.
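
For contrast with the durable producer settings shown later, a minimal Java sketch of a producer configured for at-most-once on the producer side. The topic, key, and value are placeholders; note that enable.idempotence=true requires acks=all and retries, so an at-most-once producer disables it explicitly.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AtMostOnceProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Fire-and-forget: no broker acknowledgement, no retries. A message lost in
        // transit is simply dropped -- the producer-side at-most-once trade-off.
        props.put("acks", "0");
        props.put("retries", "0");
        props.put("enable.idempotence", "false"); // idempotence would require acks=all

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key-1", "hello"));
        }
    }
}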

3. Real-World Use Cases

  • Log Aggregation: Losing a log event is less critical than ensuring high throughput. Duplicate logs are tolerable, and application-level deduplication can be implemented if necessary.
  • Clickstream Data: Similar to log aggregation, occasional duplicate click events are acceptable in many analytics scenarios.
  • CDC Replication (Change Data Capture): Replicating database changes to a data lake. While data consistency is important, at-most-once with application-level reconciliation is often preferred over the complexity of full transactional guarantees.
  • Event-Driven Microservices (Non-Critical Events): For events that trigger asynchronous tasks where idempotency can be enforced in the consuming service (e.g., sending a welcome email); a minimal sketch of this pattern follows the list.
  • Out-of-Order Messages: When message ordering is not strictly required, at-most-once delivery combined with application-level sorting and deduplication can be a viable strategy.
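
For the welcome-email style use case above, a minimal sketch of consumer-side idempotency. It assumes each event carries a unique event ID in its key; the topic name, sendWelcomeEmail method, and in-memory processedIds set (in practice a database or cache) are illustrative assumptions.

import java.time.Duration;
import java.util.HashSet;
import java.util.List;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class IdempotentEmailConsumer {
    // In production this would be a durable store (database, cache), not a HashSet.
    private static final Set<String> processedIds = new HashSet<>();

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "welcome-email-service");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("user-signups")); // hypothetical topic name
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    String eventId = record.key();
                    // Skip events we have already handled, so duplicates are harmless.
                    if (processedIds.add(eventId)) {
                        sendWelcomeEmail(record.value()); // hypothetical side effect
                    }
                }
            }
        }
    }

    private static void sendWelcomeEmail(String payload) {
        System.out.println("Sending welcome email for: " + payload);
    }
}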

4. Architecture & Internal Mechanics

Kafka’s architecture centers around topics, partitions, and brokers. Messages are appended to the end of a partition’s log segment. Replication ensures data durability. The controller manages partition leadership and broker failures.

graph LR
    A[Producer] --> B(Kafka Topic);
    B --> C1{"Broker 1 (Leader)"};
    B --> C2{"Broker 2 (Replica)"};
    B --> C3{"Broker 3 (Replica)"};
    C1 --> D[Log Segment];
    C2 --> D;
    C3 --> D;
    E[Consumer] --> B;
    style D fill:#f9f,stroke:#333,stroke-width:2px
    subgraph Kafka Cluster
        C1
        C2
        C3
    end

When a producer sends a message with acks=all, the message is written to the leader, replicated to the in-sync followers, and only then acknowledged to the producer. If a broker fails, the controller elects a new leader for the affected partition. Consumers track their offset within each partition. If a consumer fails and restarts, it resumes consumption from the last committed offset. The In-Sync Replica (ISR) set determines which replicas are considered up-to-date. If the leader fails and no other replica is in the ISR, the partition becomes unavailable; with unclean leader election enabled, a stale replica can take over and previously acknowledged messages can be lost. Schema Registry ensures data contract compatibility, preventing issues caused by schema evolution.
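
To observe these mechanics from a client, a small sketch using the Java AdminClient (Kafka clients 3.x) to print the leader and ISR for each partition of a topic. The topic name and bootstrap address are assumptions.

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

public class IsrInspector {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription description = admin.describeTopics(List.of("my-topic"))
                    .allTopicNames().get()
                    .get("my-topic");
            description.partitions().forEach(p ->
                    // A partition whose ISR has shrunk below min.insync.replicas
                    // will reject acks=all writes until replicas catch up.
                    System.out.printf("partition=%d leader=%s isr=%s%n",
                            p.partition(), p.leader(), p.isr()));
        }
    }
}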

5. Configuration & Deployment Details

server.properties (Broker):

log.retention.hours=168
log.retention.bytes=-1
default.replication.factor=3
min.insync.replicas=2

consumer.properties (Consumer):

group.id=my-consumer-group
# or "latest" to start from new messages only
auto.offset.reset=earliest

enable.auto.commit=true
auto.commit.interval.ms=5000
session.timeout.ms=30000

producer.properties (Producer):

acks=all
retries=3
linger.ms=5
batch.size=16384
compression.type=snappy
enable.idempotence=true

CLI Examples:

  • Create a topic: kafka-topics.sh --create --topic my-topic --partitions 10 --replication-factor 3 --bootstrap-server localhost:9092
  • Describe a topic: kafka-topics.sh --describe --topic my-topic --bootstrap-server localhost:9092
  • Configure a topic: kafka-configs.sh --alter --entity-type topics --entity-name my-topic --add-config retention.ms=604800000 --bootstrap-server localhost:9092

6. Failure Modes & Recovery

  • Broker Failure: If the leader fails, a new leader is elected from the ISR. If no replicas are in the ISR, data loss can occur.
  • Consumer Failure: The consumer restarts and resumes from the last committed offset. Depending on whether offsets were committed before or after processing, in-flight messages at the time of the crash are either skipped (at-most-once) or re-processed (at-least-once).
  • Message Loss: Rare, but possible if the leader fails before replication completes and no replicas are in the ISR.
  • ISR Shrinkage: If the number of in-sync replicas falls below min.insync.replicas, the leader rejects produce requests that use acks=all (NotEnoughReplicasException); writes with acks=0 or acks=1 still succeed, at a higher risk of loss.

Recovery Strategies:

  • Idempotent Producers: enable.idempotence=true prevents duplicates caused by producer retries within a producer session.
  • Transactional Guarantees: For exactly-once semantics, use Kafka Transactions.
  • Offset Tracking: Consumers should commit offsets regularly to ensure progress is saved.
  • Dead Letter Queues (DLQs): Route failed messages to a DLQ for investigation and reprocessing; a minimal sketch follows this list.
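
A minimal sketch of the DLQ pattern mentioned above, assuming a hypothetical dead-letter topic named my-topic.dlq and a process() method that may throw on bad records.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DlqConsumer {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "my-consumer-group");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> dlqProducer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of("my-topic"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    try {
                        process(record); // hypothetical business logic
                    } catch (Exception e) {
                        // Park the failed message instead of blocking the partition.
                        dlqProducer.send(new ProducerRecord<>("my-topic.dlq", record.key(), record.value()));
                    }
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) { /* ... */ }
}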

7. Performance Tuning

Typical throughput for a well-configured Kafka cluster can range from 100 MB/s to several GB/s, depending on hardware and network bandwidth.

  • linger.ms: Increase to batch more messages, improving throughput but increasing latency.
  • batch.size: Larger batches improve throughput but increase memory usage.
  • compression.type: snappy offers a good balance between compression ratio and CPU usage.
  • fetch.min.bytes: Increase to reduce the number of fetch requests, improving throughput.
  • replica.fetch.max.bytes: Control the maximum amount of data fetched from a replica.

At-most-once delivery generally has lower latency than exactly-once due to the reduced overhead of transactional guarantees. However, producer retries can increase latency if network conditions are poor.
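
As an illustration only, the tuning knobs above expressed as client properties in Java; the values are starting points under assumed hardware and should be validated against real workloads.

import java.util.Properties;

public class TuningDefaults {
    // Illustrative producer batching settings: larger batches, slight added latency.
    static Properties producerTuning() {
        Properties p = new Properties();
        p.put("linger.ms", "5");
        p.put("batch.size", "65536");
        p.put("compression.type", "snappy");
        return p;
    }

    // Illustrative consumer fetch settings: fewer, larger fetch requests.
    static Properties consumerTuning() {
        Properties p = new Properties();
        p.put("fetch.min.bytes", "65536");
        p.put("fetch.max.wait.ms", "500"); // caps the latency added by waiting on fetch.min.bytes
        return p;
    }
}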

8. Observability & Monitoring

  • Prometheus & Grafana: Use Kafka Exporter to expose Kafka JMX metrics to Prometheus.
  • Critical Metrics:
    • consumer:fetch-latency-avg: Consumer fetch latency.
    • consumer:records-lag-max: Maximum consumer lag.
    • broker:replication:underreplicated-partitions: Number of underreplicated partitions.
    • broker:requests-per-second: Broker request rate.
  • Alerting: Alert on high consumer lag, low ISR count, or increased fetch latency.
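
Beyond dashboards, consumer lag can also be computed programmatically, which is handy for ad-hoc checks or custom alerting. A sketch using the Java AdminClient; the group id and bootstrap address are assumptions.

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagChecker {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for the group, per partition.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("my-consumer-group")
                         .partitionsToOffsetAndMetadata().get();

            // Latest (end) offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            var endOffsets = admin.listOffsets(latestSpec).all().get();

            committed.forEach((tp, meta) -> {
                long lag = endOffsets.get(tp).offset() - meta.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}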

9. Security and Access Control

  • SASL/SSL: Encrypt communication between clients and brokers.
  • SCRAM: Use SCRAM authentication for secure access.
  • ACLs: Control access to topics and consumer groups.
  • Kerberos: Integrate with Kerberos for authentication.
  • Audit Logging: Enable audit logging to track access and modifications.
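
For reference, a client-side sketch of SASL_SSL with SCRAM authentication; the broker address, username, password, and truststore path are placeholders.

import java.util.Properties;

public class SecureClientConfig {
    static Properties secureProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9093");
        props.put("security.protocol", "SASL_SSL");  // TLS-encrypted, SASL-authenticated
        props.put("sasl.mechanism", "SCRAM-SHA-512");
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
                + "username=\"my-user\" password=\"my-password\";");
        props.put("ssl.truststore.location", "/etc/kafka/truststore.jks"); // placeholder path
        props.put("ssl.truststore.password", "changeit");
        return props;
    }
}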

10. Testing & CI/CD Integration

  • Testcontainers: Use Testcontainers to spin up temporary Kafka clusters for integration tests (see the sketch after this list).
  • Embedded Kafka: Use embedded Kafka for unit tests.
  • Consumer Mock Frameworks: Mock consumer behavior for testing producer logic.
  • CI Pipeline:
    • Schema compatibility checks.
    • Throughput tests.
    • Contract testing to ensure data contracts are maintained.
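
A minimal sketch of the Testcontainers approach, written as a plain main method for brevity; the image tag and topic name are assumptions, and in a real suite this would live inside a JUnit test.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.testcontainers.containers.KafkaContainer;
import org.testcontainers.utility.DockerImageName;

public class KafkaIntegrationTest {
    public static void main(String[] args) {
        // Spins up a throwaway single-broker cluster in Docker for the duration of the test.
        try (KafkaContainer kafka =
                     new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:7.4.0"))) {
            kafka.start();

            Properties props = new Properties();
            props.put("bootstrap.servers", kafka.getBootstrapServers());
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("test-topic", "k", "v"));
                // Assertions against a consumer would go here in a real test.
            }
        }
    }
}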

11. Common Pitfalls & Misconceptions

  • Assuming acks=all guarantees exactly-once: It doesn’t. Consumer failures can still lead to duplication.
  • Ignoring Consumer Lag: High consumer lag indicates a processing bottleneck and risks data loss if retention expires before the consumer catches up.
  • Incorrect Offset Management: Failing to commit offsets regularly can lead to reprocessing or data loss.
  • Insufficient Replication Factor: A low replication factor increases the risk of data loss during broker failures.
  • Overly Aggressive session.timeout.ms: Can lead to unnecessary rebalances.

Example kafka-consumer-groups.sh command to inspect consumer lag (the LAG column in its output shows how far each partition is behind the log end offset):

kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group my-consumer-group --describe

12. Enterprise Patterns & Best Practices

  • Shared vs. Dedicated Topics: Consider dedicated topics for critical data streams.
  • Multi-Tenant Cluster Design: Use resource quotas and ACLs to isolate tenants.
  • Retention vs. Compaction: Use compaction to retain only the latest value for each key (see the sketch after this list).
  • Schema Evolution: Use a Schema Registry to manage schema changes.
  • Streaming Microservice Boundaries: Clearly define event boundaries between microservices.
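
A sketch of creating a dedicated, compacted topic programmatically with the Java AdminClient; the topic name, partition count, and replication settings are placeholders.

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CompactedTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("customer-profile-latest", 10, (short) 3)
                    // Keep only the newest value per key instead of deleting by age.
                    .configs(Map.of("cleanup.policy", "compact",
                                    "min.insync.replicas", "2"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}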

13. Conclusion

“Kafka at-most-once” delivery, when combined with application-level idempotency and robust monitoring, provides a pragmatic and scalable solution for many real-world use cases. By understanding Kafka’s internal mechanics, carefully configuring brokers and clients, and implementing appropriate recovery strategies, you can build reliable and performant event-driven systems. Next steps include implementing comprehensive observability, building internal tooling for managing Kafka clusters, and continuously refining topic structures to optimize performance and scalability.
