Kafka Distributed Log: How to Build Reliable Async Data Pipelines
Kafka distributed log internals: partitioning, replication, consumer groups, and production gotchas.
20+ years shipping large-scale distributed systems. Drawn from code that ran under real load.
Kafka's distributed log is an append-only, partitioned, replicated commit log that stores records in topics. Producers write to partitions, consumers read from them, and the log ensures durability and ordering within a partition.
Imagine a library where every new book gets added to the end of a shelf, and each shelf is labeled by genre. Multiple librarians (producers) can add books simultaneously to different shelves. Readers (consumers) can pick any shelf and read books in order, from oldest to newest. If a shelf gets too long, you split it into multiple shelves (partitions) but keep the genre label. The library keeps copies of each shelf in different rooms (replication) so if one room burns down, the books aren't lost.
Most developers think Kafka is just a message queue. It's not. It's a distributed log — and that distinction is the difference between a system that survives a 3AM partition failure and one that silently loses data. I've seen a payments service go down because the team treated Kafka like RabbitMQ and ran out of consumer threads. Don't be that team.
The problem Kafka solves is brutally simple: how do you reliably move data between services without tight coupling, data loss, or performance bottlenecks? Before Kafka, teams hacked together databases as message queues, polled for changes, or used point-to-point HTTP calls that failed under load. Kafka's distributed log gives you a single source of truth for event streams.
By the end of this article, you'll understand Kafka's internal architecture — the log, partitions, segments, and replication protocol — well enough to design resilient async pipelines, debug production issues, and avoid the common pitfalls that burn teams. You'll also get battle-tested code for a realistic checkout service.
The Log: Append-Only, Immutable, and Ordered
At its core, Kafka's distributed log is an append-only sequence of records. Each record has a unique offset within its partition. This immutability is what makes Kafka fast: no locks, no random writes, just sequential I/O. The log is split into segments — files on disk that are rolled over when they reach a size or time limit. Old segments are deleted or compacted based on retention policies.
Why does this matter? Because sequential disk writes are faster than random writes by orders of magnitude. Kafka can saturate a 10Gbps NIC with a single partition if the log is on fast SSDs. But here's the gotcha: if you have too many partitions (thousands), the sequential I/O advantage disappears because the OS is seeking between many small files. I've seen teams create 10,000 partitions and wonder why throughput tanked. Keep partitions per broker under 4,000 for high throughput.
Partitioning: The Key to Parallelism and Ordering
Partitions are the unit of parallelism in Kafka. Each partition is an ordered, immutable log. Producers write to partitions based on a key (or round-robin if no key). Consumers read from partitions in order. This gives you a trade-off: within a partition, messages are ordered; across partitions, order is not guaranteed.
Here's the production reality: if you need global ordering, you can only have one partition. That kills throughput. Most systems don't need global ordering — they need ordering per entity (e.g., per user, per order). Use the entity ID as the partition key. I've seen teams use a random key for load balancing and then wonder why user events arrive out of order. Don't do that.
When choosing the number of partitions, consider your throughput requirements and consumer parallelism. A good rule of thumb: start with 3x the number of consumers you plan to run. You can increase partitions later, but decreasing is not supported without recreating the topic.
max(3 * expected_consumers, 6). Monitor consumer lag. If lag grows, increase partitions by doubling. Never go below 3 partitions for production topics — you need at least 3 for the default min.insync.replicas=2 to work with replication factor 3.Replication: How Kafka Survives Broker Failures
Kafka replicates each partition across multiple brokers (configurable via replication.factor). One broker is the leader for a partition; the rest are followers. Producers write to the leader; followers replicate the data. If the leader fails, a follower becomes the new leader (controlled by the controller broker).
The key config is min.insync.replicas — the minimum number of replicas that must acknowledge a write for it to be considered committed. If you set acks=all and min.insync.replicas=2, a write must be acknowledged by the leader and at least one follower. This prevents data loss if the leader crashes after acknowledging but before the follower replicates.
Here's the gotcha: if the number of in-sync replicas drops below min.insync.replicas, the broker stops accepting writes (NotEnoughReplicasException). I've seen this happen when a broker goes down for maintenance and the team forgot to adjust min.insync.replicas. Always set min.insync.replicas to replication.factor - 1 for maximum durability, but be prepared for availability impact during broker failures.
acks=0 means the producer doesn't wait for any acknowledgment. You will lose data on broker failure. I've seen a metrics pipeline use this and lose 20% of events during a rolling restart. Use acks=all with min.insync.replicas=2 for any data you care about.Consumer Groups: Scaling Reads Without Losing Order
Consumer groups allow multiple consumers to read from a topic in parallel. Each partition is assigned to exactly one consumer in the group. This ensures ordering within a partition (since only one consumer reads it) while allowing horizontal scaling. If you have more consumers than partitions, some consumers will be idle.
The consumer group protocol uses a group coordinator broker and a rebalance protocol. When a consumer joins or leaves, a rebalance triggers, and partitions are reassigned. During rebalance, all consumers in the group stop processing (stop-the-world). This can cause latency spikes. I've seen teams with hundreds of consumers experience 30-second rebalances. Mitigation: use static group membership (introduced in Kafka 2.3) or cooperative rebalancing (incremental rebalance).
Another gotcha: if your consumer processes messages slowly, lag grows. But if it crashes, the group rebalances, and the new consumer starts from the last committed offset. If you use auto-commit, you might lose messages between the last commit and the crash. Always use manual offset commits after processing, not before.
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor for large groups.Producers: Idempotence, Batching, and Throughput
Kafka producers batch records to improve throughput. The batch.size and linger.ms configs control batching. batch.size is the maximum bytes to accumulate before sending; linger.ms is the maximum time to wait for more records. Setting linger.ms=5 adds 5ms latency but can double throughput under moderate load.
Idempotent producers (enable.idempotence=true) prevent duplicate records caused by retries. When a producer sends a batch and the broker acknowledges but the producer doesn't receive the ack (e.g., network issue), the producer retries. Without idempotence, this can cause duplicate records. With idempotence, the broker deduplicates based on producer ID and sequence number. Always enable idempotence in production — the performance cost is negligible.
Here's a gotcha: if you set max.in.flight.requests.per.connection > 1 with idempotence, Kafka ensures ordering within a partition. But if you set it > 1 without idempotence, retries can cause out-of-order writes. I've seen a logging pipeline lose ordering because of this. Keep max.in.flight.requests.per.connection=5 with idempotence for high throughput.
compression.type=snappy or lz4 for text-heavy data. Snappy gives good compression with low CPU overhead. Gzip compresses more but uses more CPU. For high-throughput pipelines, Snappy is the sweet spot.When Not to Use Kafka: The Overkill Trap
Kafka is not a message queue. If you need point-to-point messaging with low latency (sub-millisecond) and don't need replay or long-term storage, use RabbitMQ or Redis Streams. If you need a simple work queue for background jobs, use Redis or a database-backed queue. Kafka's strength is in streaming, replay, and multi-subscriber patterns.
I've seen teams use Kafka for a simple email notification service with 10 messages per second. They ended up with a 3-broker cluster, ZooKeeper (or KRaft), and a complex deployment. A single Redis instance would have handled it with less ops overhead. Don't use Kafka unless you need at least two of: replay, multiple consumer groups, or high throughput (>10K msg/s).
The 4GB Container That Kept Dying
log.retention.bytes=1GB but log.segment.bytes=1GB and log.retention.check.interval.ms=300000. With high throughput, segments accumulated faster than retention could clean them, causing the log directory to grow beyond the container's disk limit, which triggered OOM as the OS tried to page.log.retention.bytes=500MB and log.segment.bytes=256MB. Also set log.retention.check.interval.ms=60000 to clean more frequently. Added disk monitoring with Prometheus.- Always align segment size, retention size, and check interval with your throughput.
- Disk pressure kills brokers faster than CPU or memory.
kafka-consumer-groups --bootstrap-server localhost:9092 --group my-group --describe to check lag per partition. 2. Check if number of consumers equals number of partitions. 3. If lag is on specific partitions, check if those partitions have a slow consumer or hot key. 4. Increase partitions (double) and add consumers.min.insync.replicas — if ISR count drops below it, writes are rejected. 3. Increase request.timeout.ms and delivery.timeout.ms. 4. Ensure network is stable.kafka-topics --describe --topic my-topic to see ISR. 2. If a broker is down, wait for it to come back or reduce min.insync.replicas temporarily. 3. Check if unclean.leader.election.enable is set to false (default) — if so, partitions with no ISR become unavailable.kafka-consumer-groups --bootstrap-server localhost:9092 --group my-group --describekafka-consumer-groups --bootstrap-server localhost:9092 --group my-group --reset-offsets --to-latest --executeKey takeaways
enable.idempotence=true) and manual offset commits (enable.auto.commit=false) in production to prevent data loss and duplicates.Interview Questions on This Topic
How does Kafka handle a broker failure? Describe the leader election process.
unclean.leader.election.enable=false and no ISR exists, the partition becomes unavailable. The new leader starts accepting writes and followers replicate from it.Frequently Asked Questions
20+ years shipping large-scale distributed systems. Drawn from code that ran under real load.
That's Async & Data Processing. Mark it forged?
5 min read · try the examples if you haven't