Apache Kafka — Why Rebalance Storms Kill Consumer Groups
One slow consumer caused 45 min of Black Friday downtime.
20+ years shipping high-throughput database systems. Drawn from code that ran under real load.
- Kafka is a distributed, partitioned, append-only commit log — not a message queue. Data persists, consumers read by offset, messages are not deleted after consumption
- Core components: topics (named logs), partitions (parallelism units), offsets (sequence numbers), consumer groups (shared load), brokers (servers)
- Performance magic: sequential append writes + zero-copy sendfile — a single broker can saturate a 10Gbps NIC
- Partition count is your permanent max parallelism: you can never have more active consumers than partitions in a group
- Production trap: consumer rebalance storms when max.poll.interval.ms is too low — the group stops all consumption and oscillates
- Biggest mistake: acks=all without min.insync.replicas=2 — you think writes are durable but a single broker loss can still lose data
Imagine a massive newspaper printing plant that runs 24/7. Instead of delivering one paper to one house at a time, it prints millions of copies, sorts them into labeled bundles (topics), loads each bundle onto separate delivery trucks (partitions), and lets any number of paperboys (consumers) grab from their assigned truck without slowing anyone else down. If a truck breaks down, a spare steps in immediately — nobody misses their morning paper. Kafka is that printing plant, but for data.
Every modern distributed system eventually hits the same wall: services need to talk to each other faster than synchronous HTTP allows, and they need to do it reliably even when parts of the system go down. A payment service can't wait for analytics to finish before confirming a transaction. An inventory system can't drop an event just because the warehouse app is restarting. The moment you have more than two services exchanging real-time data, you need a message broker — and Kafka has become the industry's default answer.
Kafka solves a deceptively hard problem: durable, ordered, high-throughput, multi-consumer event streaming. Traditional message queues like RabbitMQ delete a message once it's consumed. Kafka keeps it, indexed by offset, for as long as you tell it to. This means you can replay events, onboard new consumers without re-producing data, and audit exactly what happened and when. That's not a queue — it's a distributed, append-only log. Understanding the difference separates engineers who use Kafka from engineers who understand Kafka.
By the end you'll be able to reason about partition assignment and why it determines your throughput ceiling, explain exactly how consumer group rebalancing works and when it silently kills your throughput, configure producers for durability versus latency trade-offs, and walk into any production incident knowing which knobs to turn first.
Apache Kafka — The Log That Never Sleeps
Apache Kafka is a distributed event store and stream-processing platform built around an immutable, partitioned commit log. Producers append records to topics; consumers read from offsets they control. The core mechanic is that each partition is an ordered, replayable sequence — no fan-out, no broker-side filtering. This design gives Kafka O(1) disk reads per partition and horizontal scaling by adding partitions.
In practice, Kafka decouples data producers from consumers via a pull model. Consumers within a group coordinate to assign partitions — each partition goes to exactly one consumer in the group. This is where the rebalance protocol enters: when a consumer joins, leaves, or fails, the group triggers a rebalance that stops all consumers, revokes partitions, and reassigns them. During a rebalance, no messages are processed. A large group with many partitions can take seconds to minutes to stabilize.
Use Kafka when you need a durable, replayable, high-throughput event backbone — think order processing, metrics pipelines, or CDC feeds. It shines where multiple downstream systems must consume the same stream independently. But its rebalance semantics mean you must design for partition count, consumer count, and session timeouts carefully. A misconfigured group with 50 partitions and 10 consumers can rebalance for 30 seconds every time a single consumer restarts.
The Append-Only Log: Why Kafka's Core Data Structure Changes Everything
Most engineers learn Kafka by learning its API. That's backwards. To really understand it, start with the log — a data structure so simple it's almost boring, yet so powerful it underpins everything from databases (WAL in PostgreSQL) to distributed consensus (Raft).
A Kafka topic is not a queue. It's a named, partitioned, append-only log. When a producer sends a message, Kafka appends it to the end of the log and assigns it a monotonically increasing integer: the offset. Consumers don't pop messages off — they read forward from an offset they track themselves. This is the first mental model shift you need to make.
Because the log is append-only and offset-indexed, reads are sequential I/O — the fastest kind of disk operation. Kafka exploits this aggressively with zero-copy I/O (sendfile syscall), meaning data moves from disk to the network socket without ever touching user-space memory. A single Kafka broker can saturate a 10Gbps NIC handling millions of messages per second, not because Kafka is magic, but because it's built around hardware's natural strengths.
Each partition is an independent log stored as a set of segment files on disk. The active segment is open for writes. Older segments are immutable and eligible for deletion or compaction based on your retention policy. Understanding this file structure is critical when you're debugging disk space alerts or tuning log compaction.
Partitions, Consumer Groups, and the Parallelism Contract You Must Honour
Here's the rule that most engineers learn the hard way: within a consumer group, each partition is assigned to exactly one consumer at any moment. One consumer can read from multiple partitions, but one partition cannot be read by multiple consumers in the same group simultaneously. This is Kafka's ordering guarantee — and it's also your parallelism ceiling.
If you have a topic with 6 partitions and spin up 8 consumers in the same group, 2 of those consumers will sit completely idle. You've paid for 8 processes and get the throughput of 6. Conversely, if you have 6 partitions and 2 consumers, each consumer reads from 3 partitions — perfectly fine, just not maximally parallel.
Consumer group rebalancing is where production systems silently degrade. A rebalance happens when a consumer joins, leaves, or is considered dead by the group coordinator broker. During a rebalance, all consumption stops — this is the stop-the-world event of the Kafka world. A slow consumer that exceeds max.poll.interval.ms (default 5 minutes) triggers a rebalance even if the consumer process is healthy. This is one of the most misdiagnosed issues in Kafka production systems.
Kafka 2.4+ introduced Incremental Cooperative Rebalancing (the StickyAssignor strategy), which lets consumers release only the partitions they need to give up rather than all of them. This eliminates the full stop-the-world pause for most real rebalances. If you're on an older assignor, you're leaving throughput on the table.
Producer Durability vs Latency: The acks, retries, and idempotence Triangle
Every Kafka producer configuration is really a negotiation between three forces: how fast you want to send, how sure you want to be the message arrived, and how much you'll tolerate duplicates. Getting this wrong is expensive in production — either you lose data silently or you process orders twice.
The acks setting controls acknowledgement behaviour. acks=0 means fire and forget — fastest, no guarantees. acks=1 means the partition leader acknowledges — fast, but if the leader dies before replication, the message is lost. acks=all (or -1) means every in-sync replica (ISR) must acknowledge — durable, but only as strong as your min.insync.replicas setting.
Here's the trap: acks=all with min.insync.replicas=1 is effectively the same as acks=1, because only one replica needs to be in sync. The correct production configuration is acks=all combined with min.insync.replicas=2 on a cluster with replication factor 3. This tolerates one broker failure with zero data loss.
Idempotent producers (enable.idempotence=true), available since Kafka 0.11, solve duplicate delivery during retries. The broker assigns each producer a PID (Producer ID) and tracks a sequence number per partition. If a retry arrives with the same sequence number, the broker deduplicates it. This gives you exactly-once semantics at the producer level with zero application-side logic. Always enable this in production — the overhead is negligible.
send() blocksReplication, ISR, and What Actually Happens When a Broker Dies
Replication in Kafka is leader-follower. Each partition has one leader and N-1 followers spread across brokers. Producers and consumers always talk to the leader — followers exist purely for fault tolerance. The leader tracks which followers are caught up: if a follower hasn't fetched within replica.lag.time.max.ms (default 30 seconds), it's removed from the ISR. When the leader fails, the controller broker (elected via Zookeeper or KRaft) promotes one of the ISR members to leader.
This is where min.insync.replicas (often called MISR) becomes critical. With replication factor 3 and min.insync.replicas=2, you can lose one broker and still accept writes. Lose two brokers and the topic partition becomes unavailable for writes — producers get a NotEnoughReplicasException. This is intentional: Kafka chooses consistency over availability in this scenario, which is the correct trade-off for financial data.
Unclean leader election (unclean.leader.election.enable) is the escape hatch. If all ISR members are dead and this setting is true, Kafka will elect an out-of-sync replica as leader — potentially losing committed messages. This defaults to false in Kafka 0.11+ for exactly this reason. Only enable it for topics where availability beats data integrity.
Kafka 2.8+ introduced KRaft mode, replacing Zookeeper with a Raft-based metadata quorum built into Kafka itself. This eliminates the operational complexity of running a separate Zookeeper ensemble and dramatically speeds up controller failover — from tens of seconds to under a second.
Leader Election: Why ZooKeeper Isn't the Bottleneck You Think (But Your Controller Configuration Is)
Every Kafka cluster has exactly one controller broker. That controller is responsible for electing new partition leaders when a broker dies. The election itself is fast — ZooKeeper handles the ephemeral znode check, and the controller picks the first in-sync replica from the ISR list. That decision happens in milliseconds.
What kills you is the controller failover when the controller broker itself goes down. All brokers compete for the /controller znode in ZooKeeper. The winner becomes the new controller and immediately processes pending partition reassignments. If you have thousands of partitions, that batch operation can block metadata requests from producers and consumers for seconds.
You tune zookeeper.session.timeout.ms to detect failures faster. You set auto.leader.rebalance.enable=true only if you want the controller to continuously push leaders back to preferred replicas — which adds load. The real optimization? Keep your partition count sane. A broker with 10,000 partitions is a limp hand grenade. Start raising alarms at 4,000 per broker.
Consumer Rebalancing: The Silent Throttle That Wastes Your Parallelism Gains
You set up a consumer group with 12 consumers, you have 12 partitions, you expect 12-way parallelism. Then the group coordinator (one of the brokers) triggers a rebalance because a consumer’s heartbeat timed out. For the duration of the rebalance — 'stop the world' — every consumer in that group stops processing. All 12 partitions go dark.
The new consumer protocol (cooperative rebalancing) fixes the worst of this. Instead of revoking all partitions, it revokes a subset. Consumers keep processing the partitions they still own. You enable it with partition.assignment.strategy=CooperativeStickyAssignor. But here is the catch: the coordinator still serializes assignment changes. Your consumers are idle during the assignment computation.
The hard truth: rebalancing frequency is a function of consumer instability. Short-lived consumers, consumer crashes, or network blips trigger cascading rebalances. Keep session.timeout.ms at 45 seconds and heartbeat.interval.ms at 3 seconds. Don't let a misconfigured health check kill your consumer every 30 seconds. Each rebalance costs your entire group its throughput for a window.
Also, monitor kafka.consumer:type=consumer-coordinator-metrics for rebalance rate and time since last rebalance. If you see more than one rebalance per minute, you have a problem.
Kafka CLI Basics: The Commands That Actually Matter in Production
You don't need a UI to run Kafka—you need a terminal and three commands that will diagnose 90% of your cluster issues before they hit production. The Kafka CLI tools are ugly, but they're fast and they don't lie.
Start with kafka-console-producer to blast test data into a topic: it's your canary. Then kafka-console-consumer with --from-beginning to verify everything arrived—or your consumer group isn't stuck. The real money is kafka-consumer-groups --describe. That single command shows you lag per partition. If lag grows, you have a consumer bottleneck or a partition count mismatch.
Don't fall for the trap of using kafka-topics --describe to check health. That's metadata, not data. The logs tell the truth. If you're debugging, grep for "LEADER_NOT_AVAILABLE" or "NOT_ENOUGH_REPLICAS" in broker logs. That's where production fires start.
kafka-console-consumer with --group in production unless you intend to join that consumer group. You'll mess up offset commits and cause rebalances on real services.kafka-consumer-groups --describe shows partition lag—the only metric that tells you if your consumers are falling behind.Scalability: Adding Partitions Does Not Magically Speed You Up
Everyone thinks "more partitions = more throughput." That's a half-truth that will cost you. Kafka partitions are the unit of parallelism—each partition can only be consumed by one consumer in a group. So if you have 3 partitions and 10 consumers, 7 sit idle. The bottleneck isn't partitions—it's the number of consumer threads that can actually process data concurrently.
Here's the real math: max throughput = (number of partitions) × (single partition throughput). Single partition throughput is bounded by disk I/O and network bandwidth on that broker. Adding partitions spreads load across brokers if your topic is well-distributed, but adding partitions on a single broker just adds seek contention.
Production rule: set partition count to at least 2× your peak expected consumer count. Overshooting wastes broker resources—each partition is a file with leader election overhead. And never use kafka-topics --alter to increase partitions on a keyed topic unless you understand that keys will no longer map to the same partitions. Your joins will silently break.
Log Aggregation: Kafka The Unsexy Firehose That Replaced Syslog
Log aggregation was the original Kafka use case—LinkedIn built it to replace a mountain of point-to-point syslog pipes. Every service dumped structured logs to a single Kafka topic, and consumers parsed, filtered, and stored them to HDFS, Elasticsearch, or S3. It worked because Kafka decoupled producers (anything that logs) from consumers (anything that stores or alerts).
The pattern is dead simple: each service writes to a topic like logs_app_ with a JSON payload containing timestamp, level, service, trace_id, and message. Your consumers read from the topic, filter on log level, and route to the right sink. Need to add a new monitoring tool? Spin up a new consumer group. No changes to producers.
Here's the trap: Kafka is fast, but log aggregation at scale means high throughput with small messages. Each log line is maybe 500 bytes. Batching matters. Tune batch.size and linger.ms on producers to avoid a flood of tiny I/O operations. Otherwise your brokers thrash on disk writes and your aggregation pipeline collapses under metadata overhead.
acks=all for log aggregation. Losing a few log lines is acceptable. Acks=all kills throughput. Set acks=1, compress with gzip, and accept eventual delivery.Real-World Applications Beyond Log Aggregation
Kafka's append-only log and replayability make it the backbone for mission-critical systems far beyond simple log shipping. Financial institutions use Kafka for real-time fraud detection — each transaction is a message consumed by multiple applications: one checks velocity limits, another cross-references blacklists, and a third updates risk scores. E-commerce platforms rely on Kafka for order processing state machines. When a customer places an order, Kafka streams events through inventory reservation, payment authorization, and shipping dispatch, all in separate consumer groups. Healthcare systems use Kafka for patient monitoring data pipelines, where sensor readings from thousands of devices must be processed in real-time to alert clinicians. The key pattern is the decoupling of producers from consumers — each system writes its events once, and any number of downstream systems replay or filter those events independently. Kafka replaces fragile point-to-point integrations with a durable, ordered event backbone that scales horizontally. Any system where data loss is unacceptable and multiple consumers need the same stream is a candidate for Kafka.
Advanced Kafka Features That Save Your Architecture
Most engineers stop at basic produce/consume patterns, but Kafka's advanced features solve real production headaches. Idempotent producers, enabled via enable.idempotence=true, guarantee exactly-once semantics for writes — no duplicate messages even during retries, which prevents financial double-counting. Transactional messaging extends this across multiple partitions: wrap a batch of messages in a transaction so they all commit or none do, critical for atomic updates across topics. Compacted topics retain only the latest value per key, not the full history — ideal for restoring state from a changelog without reprocessing old events. Tiered storage offloads older log segments to cheaper object stores (S3, HDFS), reducing broker disk pressure while keeping data accessible. KSQL (Kafka Streams SQL) lets you define streaming joins, aggregations, and windowing with plain SQL, saving you from writing Java Streams code for simple transformations. The MirrorMaker 2 pattern enables cross-datacenter replication with active-active topologies, but beware of cyclic replication loops. These features exist because real systems need transactions, state restoration, and infinite retention without crashing brokers.
acks=all and min.insync.replicas=2. Without proper replication, a single broker failure can abort uncommitted transactions.Activity Tracking at Scale: Kafka's Original Purpose
Before Kafka became the universal event bus, LinkedIn used it to solve a simple problem: track every user action — page views, clicks, likes, shares — and make that data available to dozens of systems without overwhelming databases. This is still one of Kafka's most powerful yet underused applications. Each user action is a message in an activity topic partitioned by user ID. Multiple consumers read the same stream for different purposes: one writes to a real-time dashboard for active user counts, another feeds a recommendation engine, a third sends data to a data lake for weekly analytics. The key insight is that Kafka decouples the capture of activity from its consumption. You never worry about backpressure from a slow consumer because Kafka buffers messages on disk. You can add new consumers years later and replay past activity to backfill models — a capability impossible with traditional pub/sub systems. Activity tracking demands high throughput (millions of events per second) and low latency (sub-second from click to dashboard). Kafka handles both when you configure sufficient partitions (at least 50 per broker) and use batching wisely. Every web application generating user events should consider Kafka before building custom polling solutions.
Installation: Kafka Without the Pain
Installing Kafka isn't about running a script and forgetting it; it's about understanding the moving parts that make your cluster reliable. Why does the installation method matter? Because your environment dictates how ZooKeeper, brokers, and your CLI tools communicate. For macOS, use Homebrew: brew install kafka. This pulls both ZooKeeper and the broker binaries. On Windows, avoid native installs—use WSL2 for Linux parity. Install Java 11+, then download Kafka from Apache. On Linux, extract the tarball and set KAFKA_HOME. The key insight: install ZooKeeper separately (brew or manual) because Kafka's controller is still dependent on it despite KIP-500 progress. Always verify with kafka-broker-api-versions.sh --bootstrap-server localhost:9092. A failed install often means Java version mismatch or ZooKeeper not starting first—check logs, not assumptions.
Starting Kafka: The CLI Lifecycle You Must Own
Starting Kafka isn't a single command—it's a chain of decisions that affect durability and replication. Why start this way? Because mixing startup order causes phantom leader failures. On WSL2 and Linux, begin with ZooKeeper: bin/zookeeper-server-start.sh config/zookeeper.properties. It must bind to 0.0.0.0:2181 for cross-container access. Next, start the Kafka broker: bin/kafka-server-start.sh config/server.properties. Tweak log.dirs to a persistent mount (e.g., /data/kafka), not /tmp. For WSL2, use /mnt/c/kafka-logs to survive restarts. After startup, test with bin/kafka-topics.sh --create --topic test --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1. If you see Connection refused, check that ZooKeeper is alive first. The silent killer? advertised.listeners misconfiguration in server.properties—set it to your WSL2 IP (found via ip addr show eth0) or localhost for single-broker dev. Only then does your CLI toolchain work reliably.
advertised.listeners in server.properties will make your WSL2 consumers fail silently—they can't reach the broker's internal IP.Console CLI Tools: Producers, Consumers, and Groups
The kafka-console-producer and kafka-console-consumer are your surgical instruments for testing message flow and group behavior. Why use them before code? Because they isolate network and config issues from application logic. Produce a message: echo "order:123" | bin/kafka-console-producer.sh --topic orders --bootstrap-server localhost:9092. The --property parse.key=true --property key.separator=: adds key semantics—without it, you get null keys, killing partition strategy. For consumers, use --group my-group to enable group management: bin/kafka-console-consumer.sh --topic orders --group my-group --from-beginning --bootstrap-server localhost:9092. To inspect group state, use kafka-consumer-groups.sh --group my-group --describe --bootstrap-server localhost:9092. This shows lag (difference between LOG-END-OFFSET and CURRENT-OFFSET). High lag means your consumer is throttled. The trap: --from-beginning with an existing group only works if the group has no committed offsets—reset with --reset-offsets --to-earliest --execute. Master these commands, and you debug Kafka blindfolded.
--from-beginning on a group with committed offsets does nothing—you must reset offsets via --reset-offsets to reprocess old data.The Rebalance Storm That Killed Black Friday Sales
max.poll.interval.ms of 5 minutes was generous — their processing per batch averaged 30 seconds. They didn't account for long-tail latency: occasional batches with large messages or complex enrichment took 4.5 minutes. The consumer sent heartbeats, so the broker kept the session alive, but the group coordinator saw no poll() calls for 4.5 minutes and initiated a rebalance, thinking the consumer was dead.max.poll.interval.ms by 30 seconds due to an external API call timing out. The group coordinator removed it from the group and triggered a full rebalance (stop-the-world). During the rebalance, the consumer finished processing and called poll() again — but the coordinator had already removed it. The consumer re-joined, triggering another rebalance. This cycle repeated endlessly. The default partition.assignment.strategy (RangeAssignor) forced a full revocation of all partitions, not just the affected ones, making the storm worse.max.poll.interval.ms to 10 minutes to cover the worst-case processing time. Switched to CooperativeStickyAssignor (Kafka 2.4+), which only revokes partitions that need to move, not all of them. Moved the slow external API call from the poll thread to a separate thread pool, keeping the poll thread fast. Added a circuit breaker that fails fast on API timeouts instead of waiting the full timeout. After the fix, rebalances dropped from 47 per hour to 0.max.poll.interval.msmust cover your 99.9th percentile processing time, not the average. Long-tail latency kills consumer groups.- Never place blocking or slow operations on the Kafka poll thread. Offload to a processing thread pool and keep
poll()under 100ms. - Upgrade to CooperativeStickyAssignor. The stop-the-world rebalance of RangeAssignor is a production liability for any group with >2 consumers.
- Monitor rebalance metrics:
kafka.consumer:type=consumer-coordinator-metricslooking forrebalances.totalandrebalance.time.total. A rising rebalance count is always a red alert.
max.poll.interval.ms vs actual per-batch processing time. Run kafka-consumer-groups.sh --describe --group <group> and look for any consumer with high time since last poll. Increase max.poll.interval.ms to cover worst-case processing. Switch from RangeAssignor to CooperativeStickyAssignor by setting partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor in consumer config.NotEnoughReplicasException — writes failing intermittentlykafka-topics.sh --describe --topic <topic>. If Isr list size < min.insync.replicas, writes will fail. Check broker logs for follower fetch lag. Most common cause: a follower broker is falling behind due to disk I/O saturation or network latency. Increase replica.lag.time.max.ms temporarily, but address the underlying performance issue.enable.auto.commit — if true, change to false. Implement manual commitSync() ONLY after the entire batch is successfully processed and the downstream transaction is committed. Make downstream operations idempotent using a message-ID deduplication table with TTL.send() blocks for secondsbuffer.memory vs actual usage. If the producer's buffer fills because the broker is slow to acknowledge, send() blocks. Monitor record-queue-time metric. Increase buffer.memory from default 32MB to 64MB or 128MB. Check broker request.handler.threads — if saturated, broker is the bottleneck, not producer.kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group>kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group> --statemax.poll.interval.ms=600000 (10 min). Set partition.assignment.strategy=CooperativeStickyAssignor. Move slow processing off the poll thread to a separate executor pool.Key takeaways
Common mistakes to avoid
5 patternsUsing low-cardinality keys (country, region) as partition keys
partitionKey = s"${randomSalt}_${naturalKey}". For composite keys, ensure the first component has high cardinality. Test distribution with kafka-run-class GetOffsetShell and compare partition totals. If hot partitions are unavoidable, increase partition count to spread the hot keys across more partitions.Using auto-commit with slow business logic
enable.auto.commit=false. Call consumer.commitSync() explicitly only after your DB transaction commits. Make downstream operations idempotent using a message-ID deduplication table with TTL. For high throughput, commit offsets periodically after every N messages, but accept at-least-once semantics.Setting partition count higher than consumer count without planning for future scale
max(target_throughput_MB / single_consumer_throughput_MB, target_consumer_count * 3) rounded up to a power of 2. You can increase partitions later but you cannot decrease them without creating a new topic. Over-provision by at least 3x.Ignoring max.poll.interval.ms when processing is slow
kafka-consumer-groups.sh shows STATE=PreparingRebalance repeatedly. Lag oscillates rather than growing linearly. Consumption stops completely during rebalances.max.poll.interval.ms to exceed your worst-case processing time per batch (e.g., 10 minutes). Better yet, decouple Kafka polling from business logic using an internal queue and a separate processing thread pool, keeping the Kafka poll thread under 100ms.Setting acks=all but forgetting min.insync.replicas=2
min.insync.replicas=2 at the topic level when using acks=all. This requires replication.factor=3 (so 3 replicas total, 2 in ISR required). Test by killing a leader broker while producing — writes should continue without errors and no data loss.Interview Questions on This Topic
Explain exactly what happens — at the broker and consumer level — when a Kafka consumer exceeds max.poll.interval.ms. What does the broker do, what does the consumer see, and how does this affect other consumers in the same group?
poll(). When max.poll.interval.ms passes without a poll call, the coordinator marks the consumer as dead and initiates a rebalance, removing that consumer from the group. The consumer meanwhile finishes processing its batch and calls poll() again. It discovers that its assignment has been revoked (via a WakeupException on older clients, or an empty partition set on newer ones). It then re-joins the group, triggering ANOTHER rebalance. This creates a continuous loop: the slow consumer is repeatedly removed, rejoins, and removed again. During each rebalance, all consumers in the group stop processing entirely. On the default RangeAssignor, all partitions are revoked from all consumers, causing a full stop-the-world pause. On CooperativeStickyAssignor, only the partitions belonging to the slow consumer are revoked, leaving others processing. The solution is to increase max.poll.interval.ms to cover worst-case processing time, or move processing off the poll thread entirely.Frequently Asked Questions
20+ years shipping high-throughput database systems. Drawn from code that ran under real load.
That's NoSQL. Mark it forged?
15 min read · try the examples if you haven't