Design a Distributed Message Queue: Avoid the 3 AM Thread Pool Exhaustion
Design a distributed message queue that survives production.
20+ years shipping large-scale distributed systems. Written from production experience, not tutorials.
Design a distributed message queue by partitioning data across brokers, using a commit log for durability, and implementing consumer groups for parallel processing. Key choices: Kafka vs Pulsar vs RabbitMQ — each trades off latency, ordering, and operational complexity.
Imagine a restaurant kitchen with one chef and a hundred waiters. Without a queue, waiters shout orders at the chef, orders get lost, and the chef burns out. A distributed message queue is like a ticket rail — waiters drop tickets, chefs pick them in order, and multiple chefs can work different sections (partitions) without stepping on each other.
The 3 AM page that wakes you up: 'Payment service latency spike — messages stuck in queue.' You SSH in, see 10,000 unacknowledged messages, and the consumer thread pool is completely deadlocked. This isn't a Kafka bug. It's a design failure you introduced. Most engineers think a distributed message queue is just 'Kafka with a topic.' They're wrong. The hard part isn't the queue — it's the guarantees: exactly-once semantics, ordering across partitions, and handling a broker crash without losing a single message. By the end of this, you'll design a queue that survives a node failure, a network partition, and a consumer that takes 30 seconds to process a message — without dropping data or deadlocking.
Core Architecture: Partitions Are Your Scalability Unit
Before you write a single line, understand this: a distributed message queue is a partitioned commit log. Each partition is an ordered, immutable sequence of messages. Producers append to partitions, consumers read from them. Because partitions can live on different brokers, you get horizontal scaling. But here's the trap: ordering is only guaranteed within a partition. If you need global ordering, you need a single partition, which kills throughput. Choose wisely. In production, we use a key-based partitioning scheme: all messages for the same order ID go to the same partition. That preserves order per order, while allowing parallel consumption across orders. Never spread keyed messages across partitions at random: you lose per-key ordering. And watch for skewed keys. A single high-volume key (one huge customer, say) turns its partition into a hot partition.
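A minimal sketch of key-based routing, assuming string keys and a fixed partition count. The function names `partition_for` and `route` are made up for illustration, and CRC32 stands in for the hash (Kafka's default partitioner uses murmur2 for keyed messages).

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition with a stable hash.

    zlib.crc32 is deterministic across processes; Python's built-in
    hash() is salted per process for strings and would not be.
    CRC32 is an assumption for this sketch, not what Kafka uses.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(messages, num_partitions):
    """Group (key, payload) pairs into per-partition lists, keeping arrival order."""
    partitions = [[] for _ in range(num_partitions)]
    for key, payload in messages:
        partitions[partition_for(key, num_partitions)].append((key, payload))
    return partitions


if __name__ == "__main__":
    events = [
        ("order-1", "created"),
        ("order-2", "created"),
        ("order-1", "paid"),
        ("order-1", "shipped"),
    ]
    for index, part in enumerate(route(events, 4)):
        print(index, part)
```

Every event for `order-1` lands in the same list in the order it was produced, which is the per-key ordering guarantee. Note that changing `num_partitions` remaps keys, so ordering across a partition-count change needs separate care.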
Durability: The Commit Log and Replication Factor
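A queue that loses messages is a liability. Durability comes from the commit log: every message is appended to the partition's log on the leader broker and copied to the follower replicas before it is acknowledged. Note that Kafka by default does not fsync each message before acknowledging; it relies on replication across brokers rather than a per-message disk flush. Per-message round trips are slow, so we batch them. In Kafka, the producer can set acks=all and linger.ms=5, which waits up to 5 ms to fill a batch and sends it as a single request. That gives you durability with much higher throughput (10x is achievable, though the gain depends on message size and load). Pair acks=all with min.insync.replicas=2 on the topic; otherwise, when the in-sync set shrinks to one broker, that broker acknowledges writes alone. Replication factor (RF) of 3 means three brokers have a copy. If the leader dies, another replica takes over. But RF=3 means three times the storage and network traffic. For non-critical logs, RF=2 is fine. For payment events, RF=3 minimum. Never set RF=1 in production: I've seen a single disk failure wipe out a day of orders.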
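To make the acknowledgement rule concrete, here is a toy model of one replicated partition. It is a sketch under loud assumptions: `Replica` and `Partition` are hypothetical classes, replication is synchronous, and a dead replica never catches up. It only illustrates the idea behind acks=all combined with a minimum in-sync replica count.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    alive: bool = True
    log: list = field(default_factory=list)


class Partition:
    """Toy replicated partition (illustrative only, not Kafka's implementation)."""

    def __init__(self, replicas, min_insync=2):
        self.replicas = replicas
        self.min_insync = min_insync

    def append(self, message) -> bool:
        """Acknowledge only if at least min_insync live replicas store the message."""
        live = [r for r in self.replicas if r.alive]
        if len(live) < self.min_insync:
            # Refuse the write rather than accept it on too few copies.
            return False
        for replica in live:
            replica.log.append(message)
        return True


if __name__ == "__main__":
    partition = Partition([Replica("b1"), Replica("b2"), Replica("b3")])
    print(partition.append("m1"))      # three live replicas
    partition.replicas[0].alive = False
    print(partition.append("m2"))      # two live replicas, still enough
    partition.replicas[1].alive = False
    print(partition.append("m3"))      # one live replica, write refused
```

With RF=3 and a minimum of two in-sync copies, the partition tolerates one broker failure and stays writable. A second failure makes it refuse writes instead of silently accepting data that exists on a single disk.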
Consumer Groups and Rebalancing: The Silent Killer
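Consumer groups let you scale consumption: multiple consumers split partitions among themselves. But when a consumer joins or leaves, a rebalance triggers. With the classic eager protocol, all consumers stop, revoke their partitions, and wait for reassignment. During the rebalance, no messages are processed. In a 100-partition topic, a rebalance can take 30 seconds. That's 30 seconds of messages piling up. The fix: use cooperative rebalancing (Kafka 2.4+), which reassigns partitions incrementally so only the partitions that move are paused. Also set session.timeout.ms high enough (45s) to avoid false rebalances on GC pauses, and keep max.poll.interval.ms above your worst-case processing time so a slow batch does not get the consumer evicted. Static group membership (group.instance.id, Kafka 2.3+) helps as well: a consumer that restarts within the session timeout gets its old partitions back without a rebalance. What you should avoid is manual partition assignment unless you truly need it. It's a maintenance nightmare.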
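The difference between eager and cooperative rebalancing comes down to how many partitions stop being consumed. The sketch below is a simplified sticky assignment, not Kafka's CooperativeStickyAssignor; `sticky_assign` and `moved` are hypothetical helpers that keep existing ownership where possible and report which partitions actually change hands.

```python
def sticky_assign(current, members, partitions):
    """Return {member: [partitions]}, keeping existing owners where possible.

    current:    {member: [partitions]} before the rebalance
    members:    list of members after the join or leave
    partitions: all partition ids of the topic
    """
    # Each member may keep at most the floor of the fair share; the rest is
    # handed out one by one to whichever member holds the fewest partitions.
    keep_cap = max(1, len(partitions) // len(members))
    owner = {p: m for m, ps in current.items() for p in ps}
    new = {m: [] for m in members}
    unassigned = []
    for p in partitions:
        m = owner.get(p)
        if m in new and len(new[m]) < keep_cap:
            new[m].append(p)
        else:
            unassigned.append(p)
    for p in unassigned:
        target = min(members, key=lambda name: len(new[name]))
        new[target].append(p)
    return new


def moved(current, new):
    """Partitions whose owner changed: the only ones a cooperative rebalance pauses."""
    old_owner = {p: m for m, ps in current.items() for p in ps}
    return sorted(p for m, ps in new.items() for p in ps if old_owner.get(p) != m)


if __name__ == "__main__":
    before = {"c1": [0, 1, 2], "c2": [3, 4, 5]}
    after = sticky_assign(before, ["c1", "c2", "c3"], list(range(6)))
    print("new assignment:", after)
    print("paused under eager rebalance:", 6)
    print("paused under cooperative rebalance:", len(moved(before, after)))
```

When a third consumer joins a six-partition group, an eager rebalance stops all six partitions, while the sticky version only hands over the ones that change owner.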
Exactly-Once Semantics: The Holy Grail and Its Cost
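At-least-once is easy: retry on failure. But duplicates happen. Exactly-once requires idempotent producers and transactional consumers. With an idempotent producer, the broker gives each producer an ID and the producer stamps every batch with a per-partition sequence number; the broker drops any retry it has already written. The consumer reads messages and writes its results and its offsets in one transaction. If it crashes, the transaction is aborted and the messages are re-delivered. Mind the boundary: in Kafka this guarantee covers read-process-write between Kafka topics. A side effect in an external system (a card charge, an email) still needs its own idempotency. Transactions also add latency (2x-3x). For payment processing, it's worth it. For logging, it's overkill. Use exactly-once only when duplicates cause financial loss. Otherwise, use idempotent consumers (dedup by message ID), which are cheaper and simpler.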
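The cheaper option, an idempotent consumer that deduplicates by message ID, can be sketched in a few lines. `IdempotentHandler` is a hypothetical name, and the in-memory set is an assumption made for brevity: in production the seen-ID record has to be durable and written atomically with the side effect (for example a unique constraint in the same database transaction).

```python
class IdempotentHandler:
    """Apply each message at most once, keyed by message ID."""

    def __init__(self, apply):
        self._apply = apply   # the side effect, e.g. crediting an account
        self._seen = set()    # assumption: in-memory; use durable storage in production

    def handle(self, message_id, payload) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._apply(payload)
        # Recorded only after the side effect succeeded, so a failure in
        # apply() leaves the message eligible for redelivery.
        self._seen.add(message_id)
        return True


if __name__ == "__main__":
    balance = {"total": 0}

    def credit(amount):
        balance["total"] += amount

    handler = IdempotentHandler(credit)
    handler.handle("msg-1", 100)
    handler.handle("msg-1", 100)   # redelivery of the same message
    handler.handle("msg-2", 50)
    print(balance["total"])
```

The redelivered `msg-1` is ignored, so at-least-once delivery produces the same result as exactly-once for this handler.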
Backpressure and Flow Control: Don't Let the Queue Eat Your Memory
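If producers outpace consumers, the queue grows unbounded. Brokers run out of disk. Consumers get overwhelmed. The fix: backpressure. Producers should block when the queue is full. Kafka has no end-to-end backpressure from consumers to producers, but the producer client does have a bounded send buffer: when buffer.memory fills up, send() blocks for up to max.block.ms and then fails. Let that propagate to the caller instead of buffering around it. (max.in.flight.requests.per.connection is an ordering knob, not backpressure: set it to 1 if you need strict ordering without the idempotent producer.) Better: throttle on the consumer side. If consumer processing time spikes, slow down polling, or pause() the affected partitions and resume() them once you catch up. Also set retention limits: delete messages older than 7 days (retention.ms) or cap the size (retention.bytes, which applies per partition, so a 100GB topic budget has to be divided by the partition count). Never let a topic grow forever: I've seen a 2TB topic that took 3 days to clean up.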
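The principle of blocking the producer when the buffer is full can be shown with the standard library alone. This is a single-process sketch using `queue.Queue`, not a Kafka client; `produce` and `consume` are illustrative names, and the timeout plays the role that max.block.ms plays in the Kafka producer.

```python
import queue
import threading
import time


def produce(buf, items, timeout=1.0):
    """Put items into a bounded buffer; block when it is full, give up after timeout."""
    sent = 0
    for item in items:
        try:
            buf.put(item, timeout=timeout)   # blocks while the buffer is full
        except queue.Full:
            break                            # surface the overload to the caller
        sent += 1
    return sent


def consume(buf, out, delay=0.0):
    """Drain the buffer until a None sentinel arrives."""
    while True:
        item = buf.get()
        if item is None:
            break
        time.sleep(delay)                    # simulated processing time
        out.append(item)


if __name__ == "__main__":
    buf = queue.Queue(maxsize=2)             # the bound is what creates backpressure
    results = []
    worker = threading.Thread(target=consume, args=(buf, results, 0.01))
    worker.start()
    sent = produce(buf, range(10))
    buf.put(None)
    worker.join()
    print(sent, results)
```

The producer never gets more than two items ahead of the consumer. Without the `maxsize` bound, a slow consumer would simply let the buffer grow until memory runs out.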
Monitoring and Alerting: What to Watch Before It Burns
You can't fix what you don't measure. Monitor these metrics: consumer lag (the difference between the partition's log end offset and the consumer's committed offset), request rate, error rate, and disk usage. Use Prometheus + Grafana. Alert on consumer lag > 1000 for more than 5 minutes: that means consumers are falling behind. Scale the threshold to your throughput, since 1000 messages is seconds of work on a busy topic and hours on a quiet one. Also alert on under-replicated partitions (URP): that means a replica has fallen behind or a broker is down, and data might be at risk. And monitor GC pauses: if young GC takes > 200ms, your heap sizing or GC settings need attention. I've seen a 10-second GC pause cause a consumer group rebalance that took down a whole microservice.
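The lag definition and the "above 1000 for more than 5 minutes" rule can be written down as two small functions. This is a sketch with hypothetical names (`partition_lag`, `should_alert`) and plain dictionaries as input; in practice the offsets would come from your metrics exporter and the rule would live in your alerting system.

```python
def partition_lag(end_offsets, committed):
    """Per-partition lag: log end offset minus committed consumer offset."""
    return {
        p: max(0, end_offsets[p] - committed.get(p, 0))
        for p in end_offsets
    }


def should_alert(samples, threshold=1000, window_s=300):
    """Decide whether lag stayed above threshold for at least window_s seconds.

    samples: list of (timestamp_seconds, total_lag), sorted by timestamp.
    """
    breach_start = None
    for ts, lag in samples:
        if lag > threshold:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= window_s:
                return True
        else:
            breach_start = None   # lag recovered, so the window starts over
    return False


if __name__ == "__main__":
    lag = partition_lag({0: 5200, 1: 4100}, {0: 4000, 1: 4100})
    print(lag, sum(lag.values()))
    sustained = [(t, 1500) for t in range(0, 301, 60)]
    print(should_alert(sustained))
```

Requiring the breach to last the whole window keeps a short burst from paging anyone, while a consumer that is genuinely stuck still triggers the alert.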
When Not to Use a Distributed Message Queue
Distributed message queues add operational complexity: brokers to manage, partitions to tune, rebalances to handle. If you have a single server and low throughput (< 100 msg/s), use an in-memory queue (e.g., Disruptor) or a simple database table with a status column. If you need strict FIFO across all messages, a single-partition queue limits throughput — consider a database with ordered IDs. And if your messages are tiny and latency-critical (< 1ms), a distributed queue's network overhead kills you. Use a shared memory queue instead. I've seen teams adopt Kafka for a 10 msg/s internal logging pipeline — they spent more time managing Kafka than building features. Don't be that team.
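For the low-throughput case, the "database table with a status column" alternative is small enough to sketch in full. This version uses SQLite from the standard library and assumes a single worker; the table name `jobs` and the helper functions are made up for illustration.

```python
import sqlite3


def open_queue(path=":memory:"):
    """Open (and create if needed) a one-table job queue."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def enqueue(conn, payload):
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))


def claim_next(conn):
    """Mark the oldest pending job as 'processing' and return (id, payload).

    Assumes a single worker. With several workers the select-then-update
    needs locking (in PostgreSQL, SELECT ... FOR UPDATE SKIP LOCKED).
    """
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM jobs"
            " WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE jobs SET status = 'processing' WHERE id = ?", (row[0],)
        )
        return row


def complete(conn, job_id):
    with conn:
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))


if __name__ == "__main__":
    conn = open_queue()
    enqueue(conn, "send welcome email")
    job = claim_next(conn)
    print(job)
    complete(conn, job[0])
```

The auto-incrementing `id` gives strict FIFO across all messages, which is the property a partitioned queue cannot offer without collapsing to one partition.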
The 4GB Container That Kept Dying
- Always bound memory per partition, not per broker.
- Unbounded fetch buffers are silent killers.
1. kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group <group>
2. Check if consumer is stuck in rebalance: look for 'Revoke' and 'Assign' logs
3. Check session.timeout.ms and heartbeat.interval.ms: make sure several heartbeats fit inside one session timeout
4. If rebalancing too often, increase session.timeout.ms to 60s

1. df -h on broker
2. Reduce retention: kafka-configs --bootstrap-server localhost:9092 --entity-type topics --entity-name <topic> --alter --add-config retention.bytes=10737418240
3. Last resort, with the broker stopped: delete only the oldest closed segments in /data/kafka/<topic>-<partition>/ (each .log file together with its matching .index and .timeindex). A blanket rm -rf of *.log also removes the active segment and every message still in the partition.
4. Add more brokers and reassign partitions

kafka-consumer-groups --bootstrap-server localhost:9092 --group <group> --describe
kafka-run-class kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic <topic> --time -1
Interview Questions on This Topic
How does Kafka handle a broker crash without losing messages? Describe the leader election and recovery process.