Senior 5 min · March 06, 2026

Apache Kafka — Why Rebalance Storms Kill Consumer Groups

One slow consumer caused 45 min of Black Friday downtime.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • Kafka is a distributed, partitioned, append-only commit log — not a message queue. Data persists, consumers read by offset, messages are not deleted after consumption
  • Core components: topics (named logs), partitions (parallelism units), offsets (sequence numbers), consumer groups (shared load), brokers (servers)
  • Performance magic: sequential append writes + zero-copy sendfile — a single broker can saturate a 10Gbps NIC
  • Partition count is your permanent max parallelism: you can never have more active consumers than partitions in a group
  • Production trap: consumer rebalance storms when max.poll.interval.ms is too low — the group stops all consumption and oscillates
  • Biggest mistake: acks=all without min.insync.replicas=2 — you think writes are durable but a single broker loss can still lose data
Plain-English First

Imagine a massive newspaper printing plant that runs 24/7. Instead of delivering one paper to one house at a time, it prints millions of copies, sorts them into labeled bundles (topics), loads each bundle onto separate delivery trucks (partitions), and lets any number of paperboys (consumers) grab from their assigned truck without slowing anyone else down. If a truck breaks down, a spare steps in immediately — nobody misses their morning paper. Kafka is that printing plant, but for data.

Every modern distributed system eventually hits the same wall: services need to talk to each other faster than synchronous HTTP allows, and they need to do it reliably even when parts of the system go down. A payment service can't wait for analytics to finish before confirming a transaction. An inventory system can't drop an event just because the warehouse app is restarting. The moment you have more than two services exchanging real-time data, you need a message broker — and Kafka has become the industry's default answer.

Kafka solves a deceptively hard problem: durable, ordered, high-throughput, multi-consumer event streaming. Traditional message queues like RabbitMQ delete a message once it's consumed. Kafka keeps it, indexed by offset, for as long as you tell it to. This means you can replay events, onboard new consumers without re-producing data, and audit exactly what happened and when. That's not a queue — it's a distributed, append-only log. Understanding the difference separates engineers who use Kafka from engineers who understand Kafka.

By the end you'll be able to reason about partition assignment and why it determines your throughput ceiling, explain exactly how consumer group rebalancing works and when it silently kills your throughput, configure producers for durability versus latency trade-offs, and walk into any production incident knowing which knobs to turn first.

The Append-Only Log: Why Kafka's Core Data Structure Changes Everything

Most engineers learn Kafka by learning its API. That's backwards. To really understand it, start with the log — a data structure so simple it's almost boring, yet so powerful it underpins everything from databases (WAL in PostgreSQL) to distributed consensus (Raft).

A Kafka topic is not a queue. It's a named, partitioned, append-only log. When a producer sends a message, Kafka appends it to the end of the log and assigns it a monotonically increasing integer: the offset. Consumers don't pop messages off — they read forward from an offset they track themselves. This is the first mental model shift you need to make.

Because the log is append-only and offset-indexed, reads are sequential I/O — the fastest kind of disk operation. Kafka exploits this aggressively with zero-copy I/O (sendfile syscall), meaning data moves from disk to the network socket without ever touching user-space memory. A single Kafka broker can saturate a 10Gbps NIC handling millions of messages per second, not because Kafka is magic, but because it's built around hardware's natural strengths.

Each partition is an independent log stored as a set of segment files on disk. The active segment is open for writes. Older segments are immutable and eligible for deletion or compaction based on your retention policy. Understanding this file structure is critical when you're debugging disk space alerts or tuning log compaction.

io/thecodeforge/kafka/kafka_topic_and_log_structure.shBASH
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
# ── Step 1: Create a topic with 3 partitions and replication factor 3 ──
# More partitions = more parallelism, but also more overhead.
# Rule of thumb: partitions = (target throughput MB/s) / (single-partition throughput MB/s)
kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --create \
  --topic order-events \
  --partitions 3 \
  --replication-factor 3 \
  --config retention.ms=604800000    # 7 days retention
  --config segment.bytes=536870912   # Roll a new segment file at 512MB

# ── Step 2: Inspect the physical log files on the broker ──
# Each partition gets its own directory: <topic>-<partition-number>
ls -lh /var/kafka-logs/order-events-0/
# Expected output:
# 00000000000000000000.index   <- sparse offset index for fast seek
# 00000000000000000000.log     <- the actual message data, append-only
# 00000000000000000000.timeindex <- maps timestamps to offsets
# leader-epoch-checkpoint      <- tracks leader changes for safe consumer recovery

# ── Step 3: Dump raw messages from a partition segment to inspect offsets ──
# kafka-dump-log is your best friend for debugging corruption or offset gaps
kafka-dump-log.sh \
  --files /var/kafka-logs/order-events-0/00000000000000000000.log \
  --print-data-log | head -20

# ── Step 4: Check topic configuration and current low/high watermarks ──
# High watermark = last offset fully replicated to all ISR members
# Log end offset = last offset written (may not be replicated yet)
kafka-log-dirs.sh \
  --bootstrap-server localhost:9092 \
  --topic-list order-events \
  --describe | python3 -m json.tool
Why the High Watermark Matters More Than You Think:
Consumers can only read up to the high watermark — not the log end offset. If your ISR (In-Sync Replica) set shrinks because a follower is slow, the high watermark stalls and consumer lag appears to spike even though producers are still writing. Check ISR sizes first when you see sudden unexplained consumer lag.
Production Insight
Sequential disk writes are faster than random writes on SSDs — Kafka's log structure is engineered for this.
Zero-copy sendfile moves data from disk to network without touching user memory, eliminating a full copy.
Rule: Never fight Kafka's append-only nature. Random access patterns and message-level deletes will destroy your throughput.
Key Takeaway
Kafka's performance advantage is architectural, not magic — sequential append-only disk writes plus zero-copy I/O let a single broker saturate a 10Gbps NIC.
Fight this architecture with random access and you will lose throughput fast.
Rule: treat Kafka as an append-only log with immutable segments. Writes are free; overwrites and deletes are expensive.
Kafka Compaction Decision Tree
IfTopic stores immutable event data (logs, clicks, transactions)
UseUse deletion compaction: cleanup.policy=delete, retention by time or size. Each event is independent and never updated.
IfTopic stores changelog — latest state per key matters (user profiles, inventory counts)
UseUse log compaction: cleanup.policy=compact. Kafka retains only the latest value per key. Deleting a key requires adding a tombstone record with null value.
IfTopic has very low volume (< 1000 messages/day)
UseUse segment.bytes=107374182 (100MB) and retention.ms=86400000 (1 day) only. Compaction overhead is wasted.
IfSegment file corruption suspected or offsets missing
UseRun kafka-dump-log on the segment. Corrupt segment? Reassign replicas to force a healthy replica to become leader, then stop the corrupt broker and let it rebuild from the healthy replica.
IfBroker runs out of disk space despite retention policy
UseCheck for active segments that cannot be deleted. A partition with very low volume may never roll a segment. Force roll with kafka-log-dirs.sh or set log.roll.ms to a sane value (604800000 = 7 days).

Partitions, Consumer Groups, and the Parallelism Contract You Must Honour

Here's the rule that most engineers learn the hard way: within a consumer group, each partition is assigned to exactly one consumer at any moment. One consumer can read from multiple partitions, but one partition cannot be read by multiple consumers in the same group simultaneously. This is Kafka's ordering guarantee — and it's also your parallelism ceiling.

If you have a topic with 6 partitions and spin up 8 consumers in the same group, 2 of those consumers will sit completely idle. You've paid for 8 processes and get the throughput of 6. Conversely, if you have 6 partitions and 2 consumers, each consumer reads from 3 partitions — perfectly fine, just not maximally parallel.

Consumer group rebalancing is where production systems silently degrade. A rebalance happens when a consumer joins, leaves, or is considered dead by the group coordinator broker. During a rebalance, all consumption stops — this is the stop-the-world event of the Kafka world. A slow consumer that exceeds max.poll.interval.ms (default 5 minutes) triggers a rebalance even if the consumer process is healthy. This is one of the most misdiagnosed issues in Kafka production systems.

Kafka 2.4+ introduced Incremental Cooperative Rebalancing (the StickyAssignor strategy), which lets consumers release only the partitions they need to give up rather than all of them. This eliminates the full stop-the-world pause for most real rebalances. If you're on an older assignor, you're leaving throughput on the table.

io/thecodeforge/kafka/OrderEventConsumer.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
package io.thecodeforge.kafka;

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.errors.WakeupException;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.atomic.AtomicBoolean;

public class OrderEventConsumer {

    // Shared shutdown flag — set from a shutdown hook so the poll loop exits cleanly
    private final AtomicBoolean shutdownRequested = new AtomicBoolean(false);

    public void startConsuming() {
        Properties consumerConfig = new Properties();

        consumerConfig.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker-1:9092,kafka-broker-2:9092");
        consumerConfig.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processing-service"); // Consumer group name
        consumerConfig.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        consumerConfig.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // CRITICAL: How long between poll() calls before the broker declares this consumer dead.
        // If your business logic per batch takes > 5 min, raise this — or better, move processing off the poll thread.
        consumerConfig.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000"); // 5 minutes

        // CRITICAL: How many records to fetch per partition per poll() call.
        // Lower this if your processing is slow to avoid triggering max.poll.interval.ms
        consumerConfig.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");

        // Use CooperativeStickyAssignor to avoid full stop-the-world rebalances (Kafka 2.4+)
        consumerConfig.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
            CooperativeStickyAssignor.class.getName());

        // Disable auto-commit — we commit manually AFTER processing succeeds
        // Auto-commit is at-most-once: it can commit before your DB write finishes
        consumerConfig.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        // Start reading from the earliest available offset if no committed offset exists
        consumerConfig.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> orderConsumer = new KafkaConsumer<>(consumerConfig)) {

            // Register a ConsumerRebalanceListener to log rebalance events in production
            orderConsumer.subscribe(Collections.singletonList("order-events"),
                new ConsumerRebalanceListener() {
                    @Override
                    public void onPartitionsRevoked(java.util.Collection<org.apache.kafka.common.TopicPartition> partitions) {
                        // Called BEFORE partitions are taken away — commit offsets here to avoid reprocessing
                        System.out.printf("[REBALANCE] Revoking partitions: %s — committing offsets now%n", partitions);
                        orderConsumer.commitSync(); // Synchronous commit before giving partitions up
                    }

                    @Override
                    public void onPartitionsAssigned(java.util.Collection<org.apache.kafka.common.TopicPartition> partitions) {
                        System.out.printf("[REBALANCE] Assigned partitions: %s%n", partitions);
                    }
                });

            // Register shutdown hook so Ctrl+C triggers a clean wakeup instead of an abrupt kill
            Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                shutdownRequested.set(true);
                orderConsumer.wakeup(); // Causes the blocking poll() to throw WakeupException
            }));

            while (!shutdownRequested.get()) {
                try {
                    // poll() does two things: fetches records AND sends heartbeats to the group coordinator
                    ConsumerRecords<String, String> orderBatch = orderConsumer.poll(Duration.ofMillis(500));

                    for (ConsumerRecord<String, String> orderRecord : orderBatch) {
                        System.out.printf("Partition=%d | Offset=%d | Key=%s | Value=%s%n",
                            orderRecord.partition(),
                            orderRecord.offset(),       // Monotonically increasing per partition
                            orderRecord.key(),
                            orderRecord.value());

                        processOrder(orderRecord.value()); // Your actual business logic
                    }

                    if (!orderBatch.isEmpty()) {
                        // Synchronous commit: blocks until broker acknowledges.
                        // Guarantees at-least-once delivery — your processing must be idempotent.
                        orderConsumer.commitSync();
                        System.out.printf("Committed offsets for %d records%n", orderBatch.count());
                    }

                } catch (WakeupException shutdownSignal) {
                    // Expected during shutdown — exit the loop cleanly
                    if (!shutdownRequested.get()) throw shutdownSignal;
                    System.out.println("Shutdown signal received — exiting poll loop");
                }
            }
        }
    }

    private void processOrder(String orderJson) {
        // Simulate order processing — in production this writes to a DB, calls downstream services, etc.
        System.out.printf("Processing order: %s%n", orderJson);
    }

    public static void main(String[] args) {
        new OrderEventConsumer().startConsuming();
    }
}
Watch Out: The Silent Rebalance Loop
If your consumer group is stuck in a continuous rebalance loop, check max.poll.interval.ms vs your actual processing time per batch first. Then check session.timeout.ms — if your broker's group.min.session.timeout.ms is higher than your configured value, the broker silently ignores your setting and uses its minimum. Run kafka-consumer-groups.sh --describe and watch the STATE field: 'PreparingRebalance' appearing repeatedly is your smoking gun.
Production Insight
Consumer rebalancing stops ALL consumption in the group. A 30-second rebalance on a 6-partition topic is 30 seconds of zero throughput.
The default RangeAssignor revokes all partitions from all consumers during any rebalance, even a single consumer join/leave.
Rule: upgrade to CooperativeStickyAssignor (Kafka 2.4+). It only moves the partitions that actually need to move. The stop-the-world pause disappears.
Key Takeaway
Partition count is your permanent parallelism ceiling — you can never have more active consumers than partitions in a group, and you cannot reduce partition count without creating a new topic and migrating data.
Over-provision partitions by at least 3x your current consumer count to leave room for growth.
Rule: partition count = max_throughput_MBps / single_partition_throughput_MBps, then multiply by 3. Err on the side of more partitions.
Consumer Lag Investigation Tree
IfLag is growing, all consumers at < 80% CPU
UseYou're I/O bound or network bound. Check consumer's downstream system latency (DB calls, HTTP requests). Increase fetch.max.bytes to get more data per poll. Add more consumers — but only if you have partitions available.
IfLag is growing, at least one consumer at 100% CPU
UseYou're CPU bound. Optimize message processing logic. Move to a compiled language for the consumer if currently in Python/Ruby. Partition the workload differently — low-cardinality keys cause hot partitions.
IfLag spikes every few minutes, then returns to zero
UseRebalance storm. Check max.poll.interval.ms vs processing time. Check group state with kafka-consumer-groups.sh. Switch to CooperativeStickyAssignor. Move heavy processing off the poll thread.
IfOne specific partition has much higher lag than others
UseHot partition. Check message keys — low-cardinality keys (country, region) create imbalance. Check producer key distribution with kafka-run-class GetOffsetShell per partition. Redesign key to include salt: composite key with higher cardinality.
IfLag is zero but downstream system sees no messages
UseCommitted offset moved without processing. Check auto-commit setting. If true, set to false. If false, verify commitSync() is called AFTER processing, not before. Look for exceptions between processing and commit.

Producer Durability vs Latency: The acks, retries, and idempotence Triangle

Every Kafka producer configuration is really a negotiation between three forces: how fast you want to send, how sure you want to be the message arrived, and how much you'll tolerate duplicates. Getting this wrong is expensive in production — either you lose data silently or you process orders twice.

The acks setting controls acknowledgement behaviour. acks=0 means fire and forget — fastest, no guarantees. acks=1 means the partition leader acknowledges — fast, but if the leader dies before replication, the message is lost. acks=all (or -1) means every in-sync replica (ISR) must acknowledge — durable, but only as strong as your min.insync.replicas setting.

Here's the trap: acks=all with min.insync.replicas=1 is effectively the same as acks=1, because only one replica needs to be in sync. The correct production configuration is acks=all combined with min.insync.replicas=2 on a cluster with replication factor 3. This tolerates one broker failure with zero data loss.

Idempotent producers (enable.idempotence=true), available since Kafka 0.11, solve duplicate delivery during retries. The broker assigns each producer a PID (Producer ID) and tracks a sequence number per partition. If a retry arrives with the same sequence number, the broker deduplicates it. This gives you exactly-once semantics at the producer level with zero application-side logic. Always enable this in production — the overhead is negligible.

io/thecodeforge/kafka/DurableOrderEventProducer.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
package io.thecodeforge.kafka;

import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class DurableOrderEventProducer {

    public static void main(String[] args) throws ExecutionException, InterruptedException {

        Properties producerConfig = new Properties();

        producerConfig.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker-1:9092,kafka-broker-2:9092,kafka-broker-3:9092");
        producerConfig.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerConfig.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // ── DURABILITY CONFIG ──────────────────────────────────────────────────
        // Require ALL in-sync replicas to acknowledge before the write is confirmed.
        // Combined with min.insync.replicas=2 on the broker/topic, this means
        // the write survives a single broker failure.
        producerConfig.put(ProducerConfig.ACKS_CONFIG, "all");

        // Enable idempotent producer: assigns this producer a unique PID.
        // Broker deduplicates retried messages using PID + partition + sequence number.
        // REQUIRES: acks=all, retries > 0, max.in.flight.requests.per.connection <= 5
        producerConfig.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        // How many times to retry a failed send before giving up.
        // With idempotence enabled, retries are safe — no duplicates.
        producerConfig.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);

        // Max unacknowledged requests per connection.
        // Must be <= 5 when idempotence is enabled.
        // Higher values increase throughput but with idempotence, 5 is the max safe value.
        producerConfig.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "5");

        // ── THROUGHPUT / LATENCY TUNING ───────────────────────────────────────
        // linger.ms: How long to wait for more messages before sending a batch.
        // 0 = send immediately (low latency, small batches).
        // 20 = wait 20ms, batch more records, fewer network round-trips (higher throughput).
        producerConfig.put(ProducerConfig.LINGER_MS_CONFIG, "20");

        // batch.size: Max bytes per batch per partition before sending regardless of linger.ms.
        // Default is 16KB — increasing to 64KB or 128KB dramatically improves throughput
        // when producing at high volume.
        producerConfig.put(ProducerConfig.BATCH_SIZE_CONFIG, "65536"); // 64KB

        // compression.type: Compresses batches before sending. 'snappy' is CPU-light.
        // 'lz4' is faster. 'zstd' gives best compression ratio (Kafka 2.1+).
        // Compression happens client-side — reduces network and disk usage.
        producerConfig.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");

        // buffer.memory: Total memory the producer uses to buffer records before sending.
        // If the buffer fills up (broker is slow), send() will block for max.block.ms,
        // then throw a TimeoutException.
        producerConfig.put(ProducerConfig.BUFFER_MEMORY_CONFIG, "67108864"); // 64MB

        try (KafkaProducer<String, String> orderProducer = new KafkaProducer<>(producerConfig)) {

            String orderId = "orderId-9821";
            String orderPayload = "{\"status\":\"PLACED\",\"amount\":149.99,\"customerId\":\"cust-4477\"}";

            // Using the orderId as the message key ensures all events for the same order
            // always land on the same partition — preserving per-order event ordering.
            ProducerRecord<String, String> orderRecord = new ProducerRecord<>(
                "order-events",   // topic
                orderId,          // key — determines partition via murmur2 hash
                orderPayload      // value
            );

            // ── OPTION A: Async send with callback (preferred for high throughput) ──
            orderProducer.send(orderRecord, (metadata, sendException) -> {
                if (sendException != null) {
                    // In production: push to a dead-letter topic or alert on-call
                    System.err.printf("FAILED to send order %s: %s%n", orderId, sendException.getMessage());
                } else {
                    System.out.printf("[ASYNC] Sent to topic=%s partition=%d offset=%d timestamp=%d%n",
                        metadata.topic(),
                        metadata.partition(),
                        metadata.offset(),
                        metadata.timestamp());
                }
            });

            // ── OPTION B: Sync send — blocks until ack or exception ──
            // Use when you MUST know the write succeeded before proceeding (e.g., financial transactions)
            try {
                RecordMetadata syncMetadata = orderProducer.send(orderRecord).get(); // Blocks here
                System.out.printf("[SYNC] Confirmed at partition=%d offset=%d%n",
                    syncMetadata.partition(),
                    syncMetadata.offset());
            } catch (ExecutionException kafkaSendError) {
                System.err.println("[SYNC] Send failed: " + kafkaSendError.getCause().getMessage());
            }

            // flush() ensures all buffered messages are sent before the try-with-resources closes the producer
            orderProducer.flush();
        }
    }
}
Pro Tip: Partition Key Design Is a Throughput Decision
Using a high-cardinality key like orderId distributes load evenly across partitions — ideal. Using a low-cardinality key like region or country creates hot partitions: one partition gets 80% of traffic while others sit idle. If you're seeing one partition's offset lag growing while others are flat, suspect a hot-key problem. Consider a composite key (region + customerId) or a custom partitioner that salts the key.
Production Insight
acks=all with min.insync.replicas=1 is a false sense of security. Lose the leader and writes fail.
Idempotent producers (enable.idempotence=true) add 2-5% latency overhead but eliminate duplicate delivery during retries.
Rule: For financial data: replication factor=3, min.insync.replicas=2, acks=all, enable.idempotence=true. For analytics logs: acks=1, compression=zstd, and accept occasional loss.
Key Takeaway
acks=all is necessary but not sufficient for durability — you must also set min.insync.replicas=2 at the topic level.
Without MISR=2, acks=all with only one in-sync replica is functionally identical to acks=1.
Rule: Always set enable.idempotence=true in production. The overhead is small and eliminated duplicates are worth it.
Producer Configuration Strategy
IfFinancial transactions, inventory updates, any data where loss is unacceptable
Useacks=all, min.insync.replicas=2 (topic config), replication factor=3, enable.idempotence=true, retries=MAX_INT
IfBusiness metrics, analytics, logs — loss is tolerable, throughput is priority
Useacks=1, enable.idempotence=false, compression=lz4 or zstd, linger.ms=50 for batch efficiency
IfDevelopment/testing environment
Useacks=0, no compression, no idempotence. Speed over durability — data loss is acceptable in non-production.
IfMessages must be processed in strict order per key (e.g., user events)
Usemax.in.flight.requests.per.connection=1. This serialises all requests, killing throughput. Alternative: enable.idempotence=true (limits to 5 in-flight) which is usually sufficient for ordering.
IfProducer buffer filling frequently — send() blocks
UseIncrease buffer.memory from 32MB to 128MB. Increase max.block.ms to 120000. Check if broker is slow — if so, increase broker's num.io.threads.

Replication, ISR, and What Actually Happens When a Broker Dies

Replication in Kafka is leader-follower. Each partition has one leader and N-1 followers spread across brokers. Producers and consumers always talk to the leader — followers exist purely for fault tolerance. The leader tracks which followers are caught up: if a follower hasn't fetched within replica.lag.time.max.ms (default 30 seconds), it's removed from the ISR. When the leader fails, the controller broker (elected via Zookeeper or KRaft) promotes one of the ISR members to leader.

This is where min.insync.replicas (often called MISR) becomes critical. With replication factor 3 and min.insync.replicas=2, you can lose one broker and still accept writes. Lose two brokers and the topic partition becomes unavailable for writes — producers get a NotEnoughReplicasException. This is intentional: Kafka chooses consistency over availability in this scenario, which is the correct trade-off for financial data.

Unclean leader election (unclean.leader.election.enable) is the escape hatch. If all ISR members are dead and this setting is true, Kafka will elect an out-of-sync replica as leader — potentially losing committed messages. This defaults to false in Kafka 0.11+ for exactly this reason. Only enable it for topics where availability beats data integrity.

Kafka 2.8+ introduced KRaft mode, replacing Zookeeper with a Raft-based metadata quorum built into Kafka itself. This eliminates the operational complexity of running a separate Zookeeper ensemble and dramatically speeds up controller failover — from tens of seconds to under a second.

io/thecodeforge/kafka/kafka_replication_diagnostics.shBASH
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
# ── Check ISR status for all partitions of a topic ──
# Under-replicated partitions (ISR < replication factor) are a red alert
kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --topic order-events

# Expected healthy output:
# Topic: order-events  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
# Topic: order-events  Partition: 1  Leader: 2  Replicas: 2,3,1  Isr: 2,3,1
# Topic: order-events  Partition: 2  Leader: 3  Replicas: 3,1,2  Isr: 3,1,2

# ── Simulate broker 2 going down, then check ISR shrinkage ──
# After replica.lag.time.max.ms (30s), broker 2 is removed from ISR
kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --topic order-events

# Output after broker 2 failure:
# Topic: order-events  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,3  <-- broker 2 dropped!
# Topic: order-events  Partition: 1  Leader: 3  Replicas: 2,3,1  Isr: 3,1  <-- leader elected from ISR
# Topic: order-events  Partition: 2  Leader: 3  Replicas: 3,1,2  Isr: 3,1

# ── Check consumer group lag — the most important production health metric ──
# LAG = LOG-END-OFFSET minus CURRENT-OFFSET per partition
# Growing lag means consumers can't keep up with producer throughput
kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --group order-processing-service

# Healthy output (lag near zero):
# GROUP                    TOPIC        PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID
# order-processing-service order-events 0          1047            1047            0    consumer-1-uuid
# order-processing-service order-events 1          892             892             0    consumer-2-uuid
# order-processing-service order-events 2          1103            1103            0    consumer-1-uuid

# ── Set min.insync.replicas at the topic level (overrides broker default) ──
# This is the safest place to set it — per topic, not globally
kafka-configs.sh \
  --bootstrap-server localhost:9092 \
  --alter \
  --entity-type topics \
  --entity-name order-events \
  --add-config min.insync.replicas=2

# ── Verify the config was applied ──
kafka-configs.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --entity-type topics \
  --entity-name order-events
Watch Out: Under-Replicated Partitions Are a Time Bomb
An under-replicated partition isn't a crisis on its own — but it means you've lost your fault-tolerance buffer. If the remaining replica goes down while you're under-replicated, that partition becomes offline and you lose write availability. Set up a Prometheus alert on kafka_server_replicamanager_underreplicatedpartitions > 0 and treat it as P1 regardless of time of day.
Production Insight
ISR (In-Sync Replica) shrinkage happens silently. One slow broker can drop from ISR without any alert.
min.insync.replicas is a topic-level config, not a broker default. Set it per topic based on data criticality.
Rule: Monitor under-replicated partitions with a P1 alert. If ISR size < replication factor for more than 5 minutes, investigate immediately. An under-replicated partition cannot survive another broker loss.
Key Takeaway
Replication factor sets the total copies. ISR tracks copies that are fully caught up.
min.insync.replicas sets the minimum ISR size required for writes — the true durability knob.
Rule: For production critical topics: RF=3, MISR=2. For analytics: RF=2, MISR=1. Never run RF=1 in production unless you accept data loss.
Replication and ISR Decision Tree
IfTopic contains financial or transactional data where loss is unacceptable
Usereplication.factor=3, min.insync.replicas=2. Tolerates one broker failure without downtime or data loss. Can survive with one remaining replica but no further failures.
IfTopic contains logs or metrics where loss is tolerable, availability matters more
Usereplication.factor=2, min.insync.replicas=1. Faster writes, slightly lower durability. Data loss possible if the leader fails before replication to the single follower.
IfISR size < replication.factor consistently — one follower always lagging
UseIncrease replica.lag.time.max.ms from 30s to 60s. Monitor follower disk I/O and network latency. If a broker is permanently slow, replace it or move its replicas to another broker.
IfYou have a single broker in production (development only)
Usereplication.factor=1, min.insync.replicas=1. No fault tolerance. Any broker failure causes data loss and downtime. Do not run production workloads on a single broker.
IfMultiple brokers down simultaneously
UseIf down brokers exceed (replication.factor - min.insync.replicas), writes will fail with NotEnoughReplicasException. This is intentional. Kafka chooses consistency over availability. To restore, bring brokers back online or (dangerously) set min.insync.replicas=1 temporarily.
● Production incidentPOST-MORTEMseverity: high

The Rebalance Storm That Killed Black Friday Sales

Symptom
Consumer lag monitoring showed no processing for 45 minutes during Black Friday peak. When lag finally dropped, it then spiked again 5 minutes later. The pattern repeated: complete stall, burst of consumption, stall, burst. kafka-consumer-groups.sh showed STATE=PreparingRebalance repeatedly. 8 consumer instances were in the group, but only 2 actually processed messages at any given time.
Assumption
The team assumed max.poll.interval.ms of 5 minutes was generous — their processing per batch averaged 30 seconds. They didn't account for long-tail latency: occasional batches with large messages or complex enrichment took 4.5 minutes. The consumer sent heartbeats, so the broker kept the session alive, but the group coordinator saw no poll() calls for 4.5 minutes and initiated a rebalance, thinking the consumer was dead.
Root cause
A single consumer in the group occasionally exceeded max.poll.interval.ms by 30 seconds due to an external API call timing out. The group coordinator removed it from the group and triggered a full rebalance (stop-the-world). During the rebalance, the consumer finished processing and called poll() again — but the coordinator had already removed it. The consumer re-joined, triggering another rebalance. This cycle repeated endlessly. The default partition.assignment.strategy (RangeAssignor) forced a full revocation of all partitions, not just the affected ones, making the storm worse.
Fix
Increased max.poll.interval.ms to 10 minutes to cover the worst-case processing time. Switched to CooperativeStickyAssignor (Kafka 2.4+), which only revokes partitions that need to move, not all of them. Moved the slow external API call from the poll thread to a separate thread pool, keeping the poll thread fast. Added a circuit breaker that fails fast on API timeouts instead of waiting the full timeout. After the fix, rebalances dropped from 47 per hour to 0.
Key lesson
  • max.poll.interval.ms must cover your 99.9th percentile processing time, not the average. Long-tail latency kills consumer groups.
  • Never place blocking or slow operations on the Kafka poll thread. Offload to a processing thread pool and keep poll() under 100ms.
  • Upgrade to CooperativeStickyAssignor. The stop-the-world rebalance of RangeAssignor is a production liability for any group with >2 consumers.
  • Monitor rebalance metrics: kafka.consumer:type=consumer-coordinator-metrics looking for rebalances.total and rebalance.time.total. A rising rebalance count is always a red alert.
Production debug guideSymptom → Action mapping for common Kafka production failures5 entries
Symptom · 01
Consumer group stuck in continuous rebalance loop (PreparingRebalance state repeating)
Fix
Check max.poll.interval.ms vs actual per-batch processing time. Run kafka-consumer-groups.sh --describe --group <group> and look for any consumer with high time since last poll. Increase max.poll.interval.ms to cover worst-case processing. Switch from RangeAssignor to CooperativeStickyAssignor by setting partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor in consumer config.
Symptom · 02
Producers getting NotEnoughReplicasException — writes failing intermittently
Fix
Check ISR (In-Sync Replica) count per partition with kafka-topics.sh --describe --topic <topic>. If Isr list size < min.insync.replicas, writes will fail. Check broker logs for follower fetch lag. Most common cause: a follower broker is falling behind due to disk I/O saturation or network latency. Increase replica.lag.time.max.ms temporarily, but address the underlying performance issue.
Symptom · 03
Consumer lag growing but consumers are at 100% CPU
Fix
You're CPU-bound. Add more consumers — but only if you have partitions available. Check partition count: if 6 partitions and 8 consumers, 2 are idle. You must increase partition count (requires downtime or new topic) or optimize processing logic to reduce CPU per message. Use async processing to batch work or parallelize per-message within the consumer.
Symptom · 04
Messages seem to be processed twice — duplicates in downstream system
Fix
You have 'at-least-once' delivery and your consumer crashes after processing but before committing offsets. Check enable.auto.commit — if true, change to false. Implement manual commitSync() ONLY after the entire batch is successfully processed and the downstream transaction is committed. Make downstream operations idempotent using a message-ID deduplication table with TTL.
Symptom · 05
High producer latency — send() blocks for seconds
Fix
Check buffer.memory vs actual usage. If the producer's buffer fills because the broker is slow to acknowledge, send() blocks. Monitor record-queue-time metric. Increase buffer.memory from default 32MB to 64MB or 128MB. Check broker request.handler.threads — if saturated, broker is the bottleneck, not producer.
★ Kafka Quick Debug Cheat SheetFast diagnostics for production Kafka issues. Run these commands to confirm the root cause before changing configuration.
Consumer group rebalancing repeatedly
Immediate action
Check rebalance state and frequency
Commands
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group>
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group> --state
Fix now
Set max.poll.interval.ms=600000 (10 min). Set partition.assignment.strategy=CooperativeStickyAssignor. Move slow processing off the poll thread to a separate executor pool.
Under-replicated partitions (ISR < replication factor)+
Immediate action
Identify which replicas are out of sync
Commands
kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic <topic> | grep -i 'isr'
kafka-replica-verification.sh --broker-list localhost:9092
Fix now
Check follower broker disk space (df -h). Check network latency between brokers. Increase replica.lag.time.max.ms temporarily. Add followers back by restarting slow broker.
Message loss suspected — offset commits without processing+
Immediate action
Check auto-commit config and look for gaps in offset sequence
Commands
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic <topic> --from-beginning --property print.offset=true 2>/dev/null | head -20
kafka-dump-log.sh --files /var/kafka-logs/<topic>-<partition>/00000000000000000000.log --print-data-log 2>/dev/null | grep -E 'offset:|payload'
Fix now
Set enable.auto.commit=false. Implement manual commitSync() in finally block after DB transaction. Add idempotent processing using original offset as deduplication key.
Producer requests timing out — write latency spikes+
Immediate action
Check broker request queue depth and network
Commands
echo 'cat /proc/net/sockstat' | ssh <broker> ; netstat -an | grep :9092 | wc -l
kafka-run-class.sh kafka.tools.JmxTool --object-name kafka.server:type=BrokerTopicMetrics --report-byte-ratio | grep -i 'request'
Fix now
Increase request.timeout.ms to 60000. Increase buffer.memory to 134217728. Check num.network.threads on brokers — default 3 may be too low for high throughput.
Log compaction not cleaning up old records as expected+
Immediate action
Check min.compaction.lag.ms and segment deletion settings
Commands
kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic <topic> | grep -E 'cleanup.policy|segment|compaction'
kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic <topic> --time -2
Fix now
Set min.cleanable.dirty.ratio=0.5 to make compaction more aggressive. Ensure segment.bytes is not too large (default 1GB). Run kafka-leader-election.sh to force compaction if stuck.
Kafka Configuration Trade-offs
Configuration AspectHigh Throughput (Analytics)High Durability (Financial)
acks settingacks=1 (leader only)acks=all (all ISR members)
min.insync.replicas1 (effectively same as acks=1)2 (tolerates 1 broker loss)
enable.idempotencefalse (not needed, duplicates OK)true (exactly-once at producer level)
linger.ms50-100ms (large batches)0-5ms (low latency commits)
batch.size128KB-512KB (maximize batching)16KB-64KB (flush quickly)
compression.typelz4 or zstd (heavy compression)snappy or none (CPU budget for other work)
Consumer: enable.auto.committrue (at-most-once acceptable)false (manual commitSync after DB write)
Consumer: max.poll.records1000+ (process fast in bulk)50-100 (safer for slow processing)
Replication factor2 (cost savings)3+ (survive 1 broker loss with ISR=2)
unclean.leader.electiontrue (availability > consistency)false (never lose committed data)

Key takeaways

1
Kafka's performance advantage is architectural, not magic
sequential append-only disk writes plus zero-copy I/O (sendfile syscall) let a single broker saturate a 10Gbps NIC; fight this architecture and you will lose throughput fast
2
Partition count is your permanent parallelism ceiling
you can never have more active consumers than partitions in a group, and you cannot reduce partition count without creating a new topic and migrating data, so over-provision by at least 3x your current consumer count
3
Consumer group rebalances stop all consumption
switch to CooperativeStickyAssignor (Kafka 2.4+) to eliminate the stop-the-world pause, and always commit offsets in the onPartitionsRevoked callback to avoid reprocessing after a rebalance
4
acks=all is necessary but not sufficient for durability
you must also set min.insync.replicas=2 at the topic level; without MISR=2, acks=all with only one in-sync replica is functionally identical to acks=1
5
Idempotent producers (enable.idempotence=true) eliminate duplicate delivery during retries at 2-5% overhead
enable them in every production workload, no exceptions.

Common mistakes to avoid

5 patterns
×

Using low-cardinality keys (country, region) as partition keys

Symptom
Some partitions have 10x the message count of others. Consumer lag grows on hot partitions while others idle. Throughput is limited by the hottest partition, not total partition count.
Fix
Salt the key: partitionKey = s"${randomSalt}_${naturalKey}". For composite keys, ensure the first component has high cardinality. Test distribution with kafka-run-class GetOffsetShell and compare partition totals. If hot partitions are unavoidable, increase partition count to spread the hot keys across more partitions.
×

Using auto-commit with slow business logic

Symptom
Messages appear processed but are actually lost after a consumer crash mid-batch. Downstream DB writes succeeded but Kafka re-delivers the same messages to the next consumer instance, causing duplicate inserts.
Fix
Set enable.auto.commit=false. Call consumer.commitSync() explicitly only after your DB transaction commits. Make downstream operations idempotent using a message-ID deduplication table with TTL. For high throughput, commit offsets periodically after every N messages, but accept at-least-once semantics.
×

Setting partition count higher than consumer count without planning for future scale

Symptom
Idle consumers sitting at 0% CPU while some partitions become hot. Adding more consumers later does nothing because partitions are already at 1:1 ratio. Throughput cannot scale beyond partition count.
Fix
Choose partition count based on your 12-month throughput target, not today's consumer count. Formula: max(target_throughput_MB / single_consumer_throughput_MB, target_consumer_count * 3) rounded up to a power of 2. You can increase partitions later but you cannot decrease them without creating a new topic. Over-provision by at least 3x.
×

Ignoring max.poll.interval.ms when processing is slow

Symptom
Consumer group enters a continuous rebalance storm. kafka-consumer-groups.sh shows STATE=PreparingRebalance repeatedly. Lag oscillates rather than growing linearly. Consumption stops completely during rebalances.
Fix
Increase max.poll.interval.ms to exceed your worst-case processing time per batch (e.g., 10 minutes). Better yet, decouple Kafka polling from business logic using an internal queue and a separate processing thread pool, keeping the Kafka poll thread under 100ms.
×

Setting acks=all but forgetting min.insync.replicas=2

Symptom
Believing writes are durable, but a single broker loss can still lose data. If the leader fails before any follower catches up, committed messages are lost even with acks=all.
Fix
Always set min.insync.replicas=2 at the topic level when using acks=all. This requires replication.factor=3 (so 3 replicas total, 2 in ISR required). Test by killing a leader broker while producing — writes should continue without errors and no data loss.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
Explain exactly what happens — at the broker and consumer level — when a...
Q02SENIOR
You have a topic with 12 partitions, replication factor 3, and min.insyn...
Q03SENIOR
A colleague suggests setting acks=all to guarantee no data loss. You agr...
Q04SENIOR
How does Kafka's log compaction differ from deletion-based retention, an...
Q05SENIOR
Walk me through Kafka's consumer rebalancing protocol. What's the differ...
Q01 of 05SENIOR

Explain exactly what happens — at the broker and consumer level — when a Kafka consumer exceeds max.poll.interval.ms. What does the broker do, what does the consumer see, and how does this affect other consumers in the same group?

ANSWER
The consumer's internal heartbeat thread continues running. The group coordinator receives heartbeats, so session.timeout.ms doesn't trigger. However, the coordinator also tracks the last time the consumer called poll(). When max.poll.interval.ms passes without a poll call, the coordinator marks the consumer as dead and initiates a rebalance, removing that consumer from the group. The consumer meanwhile finishes processing its batch and calls poll() again. It discovers that its assignment has been revoked (via a WakeupException on older clients, or an empty partition set on newer ones). It then re-joins the group, triggering ANOTHER rebalance. This creates a continuous loop: the slow consumer is repeatedly removed, rejoins, and removed again. During each rebalance, all consumers in the group stop processing entirely. On the default RangeAssignor, all partitions are revoked from all consumers, causing a full stop-the-world pause. On CooperativeStickyAssignor, only the partitions belonging to the slow consumer are revoked, leaving others processing. The solution is to increase max.poll.interval.ms to cover worst-case processing time, or move processing off the poll thread entirely.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
How many Kafka partitions should I create for a new topic?
02
What is the difference between Kafka offset and a message ID?
03
Is Kafka a database? Can I query it like one?
04
How do I choose between Kafka and RabbitMQ for a new project?
🔥

That's NoSQL. Mark it forged?

5 min read · try the examples if you haven't

Previous
Neo4j Graph Database Basics
14 / 15 · NoSQL
Next
Apache HBase Basics