Kafka Acks: A Deep Dive into Delivery Semantics and Operational Excellence
1. Introduction
Imagine a financial trading platform where order events must be processed exactly once. A lost or duplicated trade can have catastrophic consequences. Or consider a CDC (Change Data Capture) pipeline replicating data across multiple data centers – data consistency is paramount. These scenarios demand a deep understanding of Kafka’s acknowledgement (ack) mechanism. Kafka acks aren’t just a configuration setting; they represent a fundamental trade-off between latency, throughput, and data durability. This post will dissect Kafka acks from an architectural and operational perspective, focusing on production-grade considerations for engineers building and maintaining real-time data platforms. We’ll cover everything from internal mechanics to failure modes, performance tuning, and observability.
2. What is "kafka acks" in Kafka Systems?
“Kafka acks” define the level of acknowledgement a producer requires from the Kafka brokers before considering a message successfully sent. It’s a core component of Kafka’s delivery semantics. The `acks` configuration (set in producer configs) dictates how many replicas must acknowledge a write before the producer receives a success response.
- `acks=0`: The producer doesn’t wait for any acknowledgement. Highest throughput and lowest latency, but no durability guarantee: the message is lost if the leader broker goes down immediately after receiving it.
- `acks=1`: The producer waits for acknowledgement from the leader broker only. A good balance between throughput and durability, but message loss is still possible if the leader fails before replicating to its followers.
- `acks=all` (or `-1`): The producer waits for acknowledgement from all in-sync replicas (ISRs). Highest durability at the cost of throughput and latency: the message is replicated to every current ISR (with `min.insync.replicas` as the enforced floor) before being considered committed.
The `acks` setting interacts directly with the Kafka broker’s replication mechanism. When a producer sends a message, the leader broker appends it to its log. With `acks=all`, the leader waits for its followers to replicate the message; only when all in-sync replicas have acknowledged the write does the leader respond to the producer. The broker dynamically tracks which replicas count as “in-sync” based on replication lag (governed by `replica.lag.time.max.ms`); KIP-98 later built on these guarantees with idempotent producers and transactions (see section 6).
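To make the trade-off concrete, here is a minimal Java producer sketch configured with `acks=all`; the topic name, key/value, and bootstrap address are placeholders, not values from this article:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // "0" = fire-and-forget, "1" = leader only, "all"/"-1" = every in-sync replica
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // get() blocks until the configured acknowledgement level is satisfied
            producer.send(new ProducerRecord<>("my-topic", "key", "value")).get();
        }
    }
}
```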
3. Real-World Use Cases
- Financial Transactions: `acks=all` is non-negotiable. Data loss is unacceptable. Idempotent producers (see section 6) are also crucial here.
- Log Aggregation: `acks=1` is often sufficient. Some log loss is tolerable, and higher throughput is prioritized.
- CDC Replication (Multi-Datacenter): `acks=all` is essential for consistency. MirrorMaker 2.0 or similar tools replicate data across regions, and acks ensure data is reliably written to both locations.
- Event Sourcing: `acks=all` is critical. Events represent the source of truth, and any loss would compromise the entire system.
- Stream Processing Backpressure: Incorrect `acks` settings can exacerbate backpressure issues. If consumers are slow and `acks=all` is used, producers can be blocked waiting for acknowledgements, leading to increased latency and potential producer timeouts; a sketch of bounding this blocking follows this list.
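The producer exposes knobs that bound how long a send can stall waiting on acknowledgements or buffer space. A minimal sketch of those settings; the timeout and buffer values below are illustrative, not recommendations:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class BoundedBlockingConfig {
    static Properties boundedProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Max time send() may block when the record buffer is full or metadata is unavailable
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "10000");
        // Upper bound on the total time to report success or failure for a record
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "60000");
        // Memory for batched, not-yet-acknowledged records; when exhausted, send() blocks
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, "33554432");
        return props;
    }
}
```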
4. Architecture & Internal Mechanics
```mermaid
graph LR
    A[Producer] --> B(Leader Broker);
    B --> C{ISR Replicas};
    C -- Ack --> B;
    B -- Ack --> A;
    B --> D[Log Segment];
    D --> E[Follower Brokers];
    E --> C;
    subgraph Kafka Cluster
        B
        C
        D
        E
    end
```
The diagram illustrates the core flow. The producer sends a message to the leader broker, which appends it to a log segment. With `acks=all`, the leader replicates the message to the ISRs; once all in-sync replicas acknowledge the write, the leader sends an acknowledgement back to the producer. The controller manages the ISR list, dynamically adjusting it based on replica health and replication lag. In KRaft mode, Kafka Raft replaces ZooKeeper for metadata management, which affects controller election and ISR updates. A Schema Registry, where used, enforces data contracts and helps prevent data corruption.
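The acknowledgement flowing back to the producer is surfaced through the send callback. A minimal sketch, with placeholder topic and bootstrap address:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AckCallbackExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"), (metadata, exception) -> {
                if (exception != null) {
                    // The required replicas never acknowledged the write
                    System.err.println("send failed: " + exception.getMessage());
                } else {
                    // metadata reflects the position the leader acknowledged
                    System.out.printf("acked: partition=%d offset=%d%n",
                        metadata.partition(), metadata.offset());
                }
            });
            producer.flush();
        }
    }
}
```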
5. Configuration & Deployment Details
`server.properties` (Broker):

```properties
# Max bytes fetched from a replica per fetch request during replication
replica.fetch.max.bytes=1048576
# Default replication factor for automatically created topics
default.replication.factor=3
```
`producer.properties` (Producer):

```properties
acks=all
retries=3
# With retries > 0 and more than one in-flight request, enable.idempotence=true
# is recommended to prevent reordering and duplicates on retry
max.in.flight.requests.per.connection=5
linger.ms=2
batch.size=16384
compression.type=snappy
```
`consumer.properties` (Consumer):

```properties
fetch.min.bytes=1048576
fetch.max.wait.ms=500
max.poll.records=500
```
CLI Examples:

- Note that `acks` is a producer setting and cannot be set per topic; the broker-side counterpart that backs `acks=all` is `min.insync.replicas`. Set it at the topic level:

```bash
kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics \
  --entity-name my-topic --alter --add-config min.insync.replicas=2
```

- Verify the topic config:

```bash
kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics \
  --entity-name my-topic --describe
```
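The same change can be made programmatically with the AdminClient. A minimal sketch; topic name and bootstrap address are placeholders:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetMinIsr {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
            // SET overwrites (or adds) min.insync.replicas on the topic
            AlterConfigOp op = new AlterConfigOp(
                new ConfigEntry("min.insync.replicas", "2"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(
                    Collections.singletonMap(topic, Collections.singletonList(op)))
                .all().get(); // block until the broker applies the change
        }
    }
}
```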
6. Failure Modes & Recovery
- Broker Failure: With `acks=0` or `acks=1`, message loss is possible. With `acks=all`, acknowledged messages survive as long as at least one in-sync replica remains to take over leadership.
- ISR Shrinkage: If the number of ISRs falls below `min.insync.replicas`, the leader rejects `acks=all` writes (producers see `NotEnoughReplicasException`).
- Rebalance: During a consumer group rebalance, consumers temporarily stop processing, and work that was not yet committed may be reprocessed afterwards.
- Message Loss and Duplication: Idempotent producers (enabled with `enable.idempotence=true`) prevent duplicate messages from retries. Transactional guarantees (using Kafka’s transactional API) provide atomic writes across multiple partitions; see the sketch after this list.
- Recovery: Utilize Dead Letter Queues (DLQs) to handle messages that cannot be processed. Proper offset tracking is crucial for consumer recovery.
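A minimal sketch of the transactional API; the transactional id, topics, and payloads below are hypothetical:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Transactions require idempotence; the client also enforces acks=all
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "orders-tx-1"); // hypothetical id

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                producer.send(new ProducerRecord<>("orders", "k1", "debit"));
                producer.send(new ProducerRecord<>("ledger", "k1", "credit"));
                producer.commitTransaction(); // both writes become visible atomically
            } catch (KafkaException e) {
                producer.abortTransaction();  // neither write is exposed to read_committed consumers
                throw e;
            }
        }
    }
}
```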
7. Performance Tuning
| Configuration | Impact | Recommendation |
|---|---|---|
| `linger.ms` | Latency vs. throughput | Increase for higher throughput, decrease for lower latency |
| `batch.size` | Throughput | Increase to maximize batching efficiency |
| `compression.type` | Throughput & network bandwidth | `snappy` offers a good balance |
| `fetch.min.bytes` | Consumer throughput | Increase to reduce fetch requests |
| `replica.fetch.max.bytes` | Replication throughput | Increase to speed up replication |
Benchmark: A cluster with `acks=all`, `linger.ms=2`, `batch.size=16384`, and `compression.type=snappy` can achieve sustained throughput of up to 500 MB/s, depending on hardware and network conditions. Latency typically ranges from 5 to 15 ms.
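Your numbers will differ, so measure on your own hardware. A rough sketch of a timed produce loop with the settings above; the topic name, record count, and payload size are arbitrary:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class AcksThroughputProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.LINGER_MS_CONFIG, "2");
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "16384");
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");

        byte[] payload = new byte[1024]; // 1 KB records
        int records = 1_000_000;
        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            long start = System.nanoTime();
            for (int i = 0; i < records; i++) {
                producer.send(new ProducerRecord<>("bench-topic", payload));
            }
            producer.flush(); // wait for all outstanding acknowledgements
            double seconds = (System.nanoTime() - start) / 1e9;
            double mbPerSec = (records * (double) payload.length) / (1024.0 * 1024.0) / seconds;
            System.out.printf("~%.1f MB/s with acks=all%n", mbPerSec);
        }
    }
}
```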
8. Observability & Monitoring
Metrics:
- Consumer Lag: Indicates how far behind consumers are. High lag can signal backpressure.
- Replication In-Sync Count: Shows the number of ISRs per partition. A shrinking count (or a non-zero `UnderReplicatedPartitions` broker metric) indicates potential issues.
- Request/Response Time: Monitors the time taken for producer requests to complete.
- Queue Length: Tracks the number of messages waiting to be written to disk.
Alerting:
- Alert if consumer lag exceeds a threshold.
- Alert if the replication in-sync count falls below a quorum.
- Alert if producer request latency exceeds a threshold.
Tools: Prometheus, Grafana, Kafka JMX metrics.
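The producer-level metrics that back the JMX beans are also available in-process via `producer.metrics()`. A minimal sketch that logs two ack-sensitive metrics; it assumes an already-configured producer instance, and the metric names come from the standard `producer-metrics` group:

```java
import java.util.Map;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class AckMetrics {
    // Logs producer metrics that reflect acknowledgement latency and send rate
    static void logAckMetrics(KafkaProducer<?, ?> producer) {
        for (Map.Entry<MetricName, ? extends Metric> e : producer.metrics().entrySet()) {
            MetricName n = e.getKey();
            // request-latency-avg includes the broker-side wait for ISR acks
            if ("producer-metrics".equals(n.group())
                    && ("request-latency-avg".equals(n.name())
                        || "record-send-rate".equals(n.name()))) {
                System.out.println(n.name() + " = " + e.getValue().metricValue());
            }
        }
    }
}
```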
9. Security and Access Control
`acks` itself doesn’t directly impact security, but ensuring secure communication is vital.
- SASL/SSL: Encrypt communication between producers, brokers, and consumers.
- SCRAM: Use SCRAM authentication for secure access.
- ACLs: Implement Access Control Lists to restrict access to topics and resources.
- Kerberos: Integrate with Kerberos for strong authentication.
- Audit Logging: Enable audit logging to track access and modifications.
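On the client side, the SASL/SSL and SCRAM items above translate into a few producer properties. A minimal sketch; the username and password are placeholders, and in production they should come from a secret store:

```java
import java.util.Properties;
import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.config.SaslConfigs;

public class SecureProducerConfig {
    static Properties secureProps() {
        Properties props = new Properties();
        // Encrypt traffic and authenticate via SASL
        props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_SSL");
        props.put(SaslConfigs.SASL_MECHANISM, "SCRAM-SHA-512");
        // Placeholder credentials; never hard-code these in real deployments
        props.put(SaslConfigs.SASL_JAAS_CONFIG,
            "org.apache.kafka.common.security.scram.ScramLoginModule required "
            + "username=\"svc-producer\" password=\"changeit\";");
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // durability is orthogonal to auth
        return props;
    }
}
```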
10. Testing & CI/CD Integration
- Testcontainers: Use Testcontainers to spin up ephemeral Kafka clusters for integration testing.
- Embedded Kafka: Utilize embedded Kafka for unit testing.
- Consumer Mock Frameworks: Mock consumers to test producer behavior.
- CI Pipeline:
  - Schema compatibility checks.
  - Throughput tests with varying `acks` settings.
  - Failure injection tests (broker failures, network partitions).
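As a concrete starting point for the Testcontainers approach above, a minimal sketch that produces with `acks=all` against an ephemeral broker; the image tag is an assumption, so pin whatever version you actually run:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import org.testcontainers.containers.KafkaContainer;
import org.testcontainers.utility.DockerImageName;

public class AcksIntegrationTest {
    public static void main(String[] args) {
        // Ephemeral single-broker cluster; torn down when the try block exits
        try (KafkaContainer kafka =
                 new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:7.4.0"))) {
            kafka.start();
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, kafka.getBootstrapServers());
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.ACKS_CONFIG, "all");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("test-topic", "k", "v"));
                producer.flush(); // fails loudly if the broker never acknowledges
            }
        }
    }
}
```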
11. Common Pitfalls & Misconceptions
- Misconception: `acks=1` guarantees no data loss. Incorrect: leader failure before replication can still cause loss.
- Problem: Slow consumers causing producer blocking. Symptom: High producer latency. Fix: Increase consumer concurrency, optimize consumer processing logic.
- Problem: Rebalancing storms. Symptom: Frequent consumer rebalances. Fix: Optimize group ID strategy, increase `session.timeout.ms`.
- Problem: ISR shrinkage. Symptom: Broker logs showing ISR changes. Fix: Investigate broker health and network connectivity.
- Problem: Incorrect `acks` setting for the use case. Symptom: Data loss or unacceptable latency. Fix: Re-evaluate requirements and adjust `acks` accordingly.
12. Enterprise Patterns & Best Practices
- Shared vs. Dedicated Topics: Consider dedicated topics for critical data streams.
- Multi-Tenant Cluster Design: Use resource quotas and ACLs to isolate tenants.
- Retention vs. Compaction: Balance data retention with storage costs.
- Schema Evolution: Use a Schema Registry to manage schema changes.
- Streaming Microservice Boundaries: Design microservices to consume and produce events from well-defined Kafka topics.
13. Conclusion
Kafka acks are a cornerstone of building reliable, scalable, and operationally sound real-time data platforms. Choosing the right `acks` setting requires a thorough understanding of your application’s requirements and the trade-offs involved. Investing in robust observability, automated testing, and well-defined operational procedures will ensure your Kafka-based systems can handle the demands of production environments. Next steps include implementing comprehensive monitoring, building internal tooling for managing `acks` configurations, and proactively refactoring topic structures to optimize performance and durability.