Senior 4 min · June 25, 2026

Design a Distributed Message Queue: Avoid the 3 AM Thread Pool Exhaustion

Design a distributed message queue that survives production.

N
Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Written from production experience, not tutorials.

Follow
Production
production tested
June 25, 2026
last updated
1,663
articles · all by Naren
 ● Production Incident 🔎 Debug Guide ⚙ Triage Commands
Quick Answer

Design a distributed message queue by partitioning data across brokers, using a commit log for durability, and implementing consumer groups for parallel processing. Key choices: Kafka vs Pulsar vs RabbitMQ — each trades off latency, ordering, and operational complexity.

✦ Definition~90s read
What is Design a Distributed Message Queue?

A distributed message queue is a system that decouples producers and consumers across multiple machines, providing at-least-once delivery, ordering guarantees, and horizontal scalability. It's the backbone of async microservices communication.

Imagine a restaurant kitchen with one chef and a hundred waiters.
Plain-English First

Imagine a restaurant kitchen with one chef and a hundred waiters. Without a queue, waiters shout orders at the chef, orders get lost, and the chef burns out. A distributed message queue is like a ticket rail — waiters drop tickets, chefs pick them in order, and multiple chefs can work different sections (partitions) without stepping on each other.

The 3 AM page that wakes you up: 'Payment service latency spike — messages stuck in queue.' You SSH in, see 10,000 unacknowledged messages, and the consumer thread pool is completely deadlocked. This isn't a Kafka bug. It's a design failure you introduced. Most engineers think a distributed message queue is just 'Kafka with a topic.' They're wrong. The hard part isn't the queue — it's the guarantees: exactly-once semantics, ordering across partitions, and handling a broker crash without losing a single message. By the end of this, you'll design a queue that survives a node failure, a network partition, and a consumer that takes 30 seconds to process a message — without dropping data or deadlocking.

Core Architecture: Partitions Are Your Scalability Unit

Before you write a single line, understand this: a distributed message queue is a partitioned commit log. Each partition is an ordered, immutable sequence of messages. Producers append to partitions, consumers read from them. The magic is that partitions can live on different brokers, giving you horizontal scaling. But here's the trap: ordering is only guaranteed within a partition. If you need global ordering, you need a single partition — which kills throughput. Choose wisely. In production, we use a key-based partitioning scheme: all messages for the same order ID go to the same partition. That preserves order per order, while allowing parallel consumption across orders. Never hash by random — you'll get hot partitions.

PartitionedMessageQueue.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
// io.thecodeforge — System Design tutorial

// Producer: partition by order ID
ProducerRecord<String, OrderEvent> record = new ProducerRecord<>(
    "order-events",
    order.getOrderId(), // key -> same partition
    order.toEvent()
);
producer.send(record, (metadata, exception) -> {
    if (exception != null) {
        // Retry with exponential backoff, log to dead letter queue
        deadLetterQueue.send(record);
    }
});

// Consumer: assign partitions to threads
// Each thread owns a set of partitions
for (TopicPartition partition : assignment) {
    executor.submit(() -> {
        while (true) {
            ConsumerRecords<String, OrderEvent> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, OrderEvent> record : records) {
                processOrder(record.value());
            }
            consumer.commitSync(); // commit after batch
        }
    });
}
Output
Messages for order-123 always go to partition 0. Consumer thread 1 processes partition 0, thread 2 processes partition 1. No global ordering, but per-order ordering preserved.
Production Trap: Hot Partitions
If you use a timestamp as key, all messages within the same millisecond hit the same partition. Use a composite key like {customerId}_{timestamp} to spread load.
Distributed Message Queue Architecture THECODEFORGE.IO Distributed Message Queue Architecture Core components and failure modes for production systems Partitions Shard data for parallel throughput Commit Log Append-only log with replication factor Consumer Group Rebalancing can cause latency spikes Exactly-Once Semantics Idempotent producers and transactional log Backpressure Throttle producers to prevent overload Monitoring Track lag, throughput, and error rates ⚠ Consumer rebalancing can cause thread pool exhaustion Use static group membership or incremental rebalancing THECODEFORGE.IO
thecodeforge.io
Distributed Message Queue Architecture
Design Distributed Message Queue

Durability: The Commit Log and Replication Factor

A queue that loses messages is a liability. Durability comes from the commit log — every message is written to disk on the broker before being acknowledged. But disk writes are slow. So we batch them. In Kafka, the producer can set acks=all and linger.ms=5 to batch 5ms worth of messages into a single disk write. That gives you durability with 10x throughput. Replication factor (RF) of 3 means three brokers have a copy. If one dies, the others take over. But RF=3 means three times the storage and network traffic. For non-critical logs, RF=2 is fine. For payment events, RF=3 minimum. Never set RF=1 in production — I've seen a single disk failure wipe out a day of orders.

DurableProducer.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
// io.thecodeforge — System Design tutorial

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
props.put("acks", "all"); // wait for all in-sync replicas
props.put("retries", Integer.MAX_VALUE); // infinite retries on transient errors
props.put("enable.idempotence", true); // exactly-once semantics
props.put("linger.ms", 5); // batch up to 5ms
props.put("batch.size", 65536); // 64KB batches
props.put("compression.type", "snappy"); // reduce network bandwidth

KafkaProducer<String, OrderEvent> producer = new KafkaProducer<>(props);
Output
Producer sends 64KB batches every 5ms. Broker writes to disk and replicates to 2 followers. If broker1 crashes, broker2 takes over with zero data loss.
Senior Shortcut: Tune batch.size and linger.ms together
Don't set linger.ms too high (>100ms) for latency-sensitive apps. For payment processing, 5ms is safe. For analytics, 500ms is fine.

Consumer Groups and Rebalancing: The Silent Killer

Consumer groups let you scale consumption: multiple consumers split partitions among themselves. But when a consumer joins or leaves, a rebalance triggers — all consumers stop, revoke partitions, and reassign. During rebalance, no messages are processed. In a 100-partition topic, rebalance can take 30 seconds. That's 30 seconds of piling messages. The fix: use cooperative rebalancing (Kafka 2.4+) which reassigns partitions incrementally. Also set session.timeout.ms high enough (45s) to avoid false rebalances on GC pauses. And never use static group membership unless you want manual partition assignment — it's a maintenance nightmare.

ConsumerGroupRebalance.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
// io.thecodeforge — System Design tutorial

Properties consumerProps = new Properties();
consumerProps.put("group.id", "order-processor");
consumerProps.put("session.timeout.ms", 45000); // 45s before considered dead
consumerProps.put("heartbeat.interval.ms", 15000); // send heartbeat every 15s
consumerProps.put("partition.assignment.strategy", 
    "org.apache.kafka.clients.consumer.CooperativeStickyAssignor"); // incremental rebalance
consumerProps.put("max.poll.interval.ms", 300000); // 5min max between polls
consumerProps.put("max.poll.records", 100); // process 100 records per poll

KafkaConsumer<String, OrderEvent> consumer = new KafkaConsumer<>(consumerProps);
consumer.subscribe(Arrays.asList("order-events"));
Output
When consumer-3 crashes, only its partitions (e.g., 5-7) are reassigned. Other consumers keep processing. Rebalance time drops from 30s to 2s.
Never Do This: Static Group Membership
Setting group.instance.id makes consumers static — they keep partitions even if they disconnect. Sounds good? Until a consumer crashes and its partitions are stuck unprocessed for hours. Avoid.
Consumer Group Rebalance CycleTHECODEFORGE.IOConsumer Group Rebalance CycleHow a join/leave stops all processingStable GroupConsumers each own partitions, processing messagesMember Joins/LeavesGroup coordinator detects changeRevoke PartitionsAll consumers stop and release partitionsRebalanceNew partition assignment computedResume ProcessingConsumers start on new partitions⚠ Rebalance on 100 partitions can stall processing for secondsTHECODEFORGE.IO
thecodeforge.io
Consumer Group Rebalance Cycle
Design Distributed Message Queue

Exactly-Once Semantics: The Holy Grail and Its Cost

At-least-once is easy: retry on failure. But duplicates happen. Exactly-once requires idempotent producers and transactional consumers. The producer assigns a unique ID to each message; the broker deduplicates. The consumer reads messages and writes results in a transaction — if it crashes, the transaction is rolled back, and the message is re-delivered. But transactions add latency (2x-3x). For payment processing, it's worth it. For logging, it's overkill. Use exactly-once only when duplicates cause financial loss. Otherwise, use idempotent consumers (dedup by message ID) — cheaper and simpler.

ExactlyOnceConsumer.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
// io.thecodeforge — System Design tutorial

// Consumer with idempotent processing
consumer.subscribe(Arrays.asList("order-events"));
while (true) {
    ConsumerRecords<String, OrderEvent> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, OrderEvent> record : records) {
        String messageId = record.key(); // unique per message
        if (processedIds.contains(messageId)) {
            continue; // dedup
        }
        processOrder(record.value());
        processedIds.add(messageId);
    }
    consumer.commitSync();
}

// For exactly-once with transactions:
producer.initTransactions();
producer.beginTransaction();
for (ConsumerRecord<String, OrderEvent> record : records) {
    producer.send(new ProducerRecord<>("processed-orders", record.key(), record.value()));
}
producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
producer.commitTransaction();
Output
Idempotent consumer skips duplicates. Transactional producer ensures output topic and offsets are committed atomically.
Interview Gold: Exactly-Once vs Idempotent
Exactly-once is a system guarantee. Idempotent is an application property. You can have idempotent consumers without exactly-once — just dedup by message ID. That's what most production systems do.

Backpressure and Flow Control: Don't Let the Queue Eat Your Memory

If producers outpace consumers, the queue grows unbounded. Brokers run out of disk. Consumers get overwhelmed. The fix: backpressure. Producers should block when the queue is full. Kafka doesn't have native backpressure — you implement it with max.in.flight.requests.per.connection (set to 1 for strict ordering) and a callback that blocks. Better: use a rate limiter on the consumer side. If consumer processing time spikes, slow down polling. Also set retention limits: delete messages older than 7 days or when topic size exceeds 100GB. Never let a topic grow forever — I've seen a 2TB topic that took 3 days to clean up.

BackpressureConsumer.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
// io.thecodeforge — System Design tutorial

// Rate-limited consumer
RateLimiter limiter = RateLimiter.create(100.0); // 100 permits per second
while (true) {
    limiter.acquire(); // blocks if rate exceeded
    ConsumerRecords<String, OrderEvent> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, OrderEvent> record : records) {
        processOrder(record.value());
    }
    consumer.commitSync();
}

// Broker-side retention
// Set on topic creation or via CLI:
// bin/kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name order-events --alter --add-config retention.ms=604800000,retention.bytes=107374182400
Output
Consumer processes max 100 messages per second. Topic retains max 100GB or 7 days, whichever comes first.
Production Trap: Unbounded Retention
Default retention is 7 days. For high-throughput topics (10k msg/s), that's 6 billion messages. Set retention.bytes aggressively. Monitor disk usage with kafka-log-dirs.sh.
Backpressure: Block vs. DropTHECODEFORGE.IOBackpressure: Block vs. DropTwo strategies to avoid unbounded queue growthBlock ProducerProducer waits when queue is fullUses max.in.flight.requests=1Preserves all messagesMay cause client timeoutsDrop MessagesProducer discards excess messagesUses a bounded queue with rejectLoses data under loadRequires retry logic elsewhereBlocking is safer for critical data; dropping suits real-time metricsTHECODEFORGE.IO
thecodeforge.io
Backpressure: Block vs. Drop
Design Distributed Message Queue

Monitoring and Alerting: What to Watch Before It Burns

You can't fix what you don't measure. Monitor these metrics: consumer lag (difference between latest offset and consumer offset), request rate, error rate, and disk usage. Use Prometheus + Grafana. Alert on consumer lag > 1000 for more than 5 minutes — that means consumers are falling behind. Also alert on under-replicated partitions (URP) — that means a broker is down and data might be at risk. And monitor GC pauses: if young GC takes > 200ms, your heap is too small. I've seen a 10-second GC pause cause a consumer group rebalance that took down a whole microservice.

MonitoringSetup.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
// io.thecodeforge — System Design tutorial

# Prometheus alert rule for consumer lag
groups:
  - name: kafka_alerts
    rules:
      - alert: ConsumerLagHigh
        expr: kafka_consumer_lag > 1000
        for: 5m
        annotations:
          summary: "Consumer lag > 1000 for topic {{ $labels.topic }}"
      - alert: UnderReplicatedPartitions
        expr: kafka_server_ReplicaManager_UnderReplicatedPartitions > 0
        for: 1m
        annotations:
          summary: "Under-replicated partitions detected"

# Check lag via CLI
# bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group order-processor --describe
Output
Prometheus alerts when lag > 1000 for 5 minutes. CLI shows per-partition lag: partition 0 lag 500, partition 1 lag 1200.
Senior Shortcut: Lag per partition, not total
Total lag hides hot partitions. Monitor max partition lag — if one partition has 10k lag while others have 0, your partitioning key is skewed.

When Not to Use a Distributed Message Queue

Distributed message queues add operational complexity: brokers to manage, partitions to tune, rebalances to handle. If you have a single server and low throughput (< 100 msg/s), use an in-memory queue (e.g., Disruptor) or a simple database table with a status column. If you need strict FIFO across all messages, a single-partition queue limits throughput — consider a database with ordered IDs. And if your messages are tiny and latency-critical (< 1ms), a distributed queue's network overhead kills you. Use a shared memory queue instead. I've seen teams adopt Kafka for a 10 msg/s internal logging pipeline — they spent more time managing Kafka than building features. Don't be that team.

Interview Gold: When to Choose Simplicity
If your queue fits in memory on one machine and you can tolerate data loss on crash, use an in-memory queue. If you need persistence but not distribution, use SQLite with WAL mode. Distributed queues are for when you need horizontal scaling and fault tolerance.
● Production incidentPOST-MORTEMseverity: high

The 4GB Container That Kept Dying

Symptom
Kafka broker OOM-killed every 6 hours. Heap dump showed 3.5GB of unread message buffers.
Assumption
We assumed a memory leak in the consumer library.
Root cause
Consumer fetch.max.bytes was set to 50MB per partition, and with 100 partitions, the broker tried to buffer 5GB in memory. Combined with replication threads, the JVM heap (4GB) couldn't keep up.
Fix
Set fetch.max.bytes to 5MB, fetch.max.wait.ms to 500ms, and added a consumer-side rate limiter with Guava's RateLimiter (permits per second = target throughput / average message size).
Key lesson
  • Always bound memory per partition, not per broker.
  • Unbounded fetch buffers are silent killers.
Production debug guideSystematic recovery paths for the failure modes engineers actually hit.3 entries
Symptom · 01
Consumer not processing messages, no errors in logs
Fix
1. Check consumer group status: kafka-consumer-groups --describe --group <group> 2. Check if consumer is stuck in rebalance: look for 'Revoke' and 'Assign' logs 3. Check session.timeout.ms and heartbeat.interval.ms — ensure heartbeat is sent before timeout 4. If rebalancing too often, increase session.timeout.ms to 60s
Symptom · 02
Messages duplicated in output
Fix
1. Check if producer has enable.idempotence=true 2. Check consumer commitSync() vs commitAsync() — async can cause duplicates on rebalance 3. Implement dedup by message ID in consumer 4. If using transactions, ensure commitTransaction() is called
Symptom · 03
Broker out of disk space
Fix
1. Check disk usage: df -h on broker 2. Reduce retention: kafka-configs --alter --add-config retention.bytes=10737418240 3. Delete old segments manually: rm -rf /data/kafka/<topic>-<partition>/*.log (stop broker first) 4. Add more brokers and reassign partitions
★ Distributed Message Queue Triage Cheat SheetFirst-response commands for when things go wrong — copy-paste ready.
Consumer lag > 1000
Immediate action
Check which partition has highest lag
Commands
kafka-consumer-groups --bootstrap-server localhost:9092 --group <group> --describe
kafka-run-class kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic <topic> --time -1
Fix now
Increase consumer threads or reduce max.poll.records. If single partition hot, repartition with better key.
Under-replicated partitions+
Immediate action
Check which broker is down
Commands
kafka-topics --bootstrap-server localhost:9092 --describe --under-replicated-partitions
kafka-broker-api-versions --bootstrap-server localhost:9092
Fix now
Restart the down broker. If disk failure, replace broker and reassign partitions.
Producer timeout errors+
Immediate action
Check network and broker load
Commands
ping <broker-host>
kafka-run-class kafka.tools.JmxTool --object-name kafka.server:type=BrokerTopicMetrics,name=TotalProduceRequestsPerSec
Fix now
Increase request.timeout.ms to 30000. If broker overloaded, add more brokers.
Consumer rebalancing every few minutes+
Immediate action
Check session.timeout.ms and GC pauses
Commands
kafka-consumer-groups --bootstrap-server localhost:9092 --group <group> --describe
jstat -gcutil <consumer-pid> 1000
Fix now
Increase session.timeout.ms to 60000. If GC > 10% time, increase heap. Use CooperativeStickyAssignor.
Feature / AspectKafkaRabbitMQ
OrderingPer partition onlyPer queue (FIFO if single consumer)
Throughput1M msg/s per broker50k msg/s per node
Latency10ms (batched)1ms (no batching)
DurabilityCommit log, RF configurableMirrored queues, but slower
Exactly-onceTransactional producer + consumerNot native, requires dedup
Operational complexityHigh (ZooKeeper/KRaft, brokers, partitions)Medium (Erlang VM, simpler clustering)

Key takeaways

1
Partition by key to preserve ordering per entity while allowing parallel consumption.
2
Always set acks=all and enable.idempotence for critical data
never trade durability for speed.
3
Monitor consumer lag per partition, not total
hot partitions kill throughput.
4
Distributed queues are overkill for single-server, low-throughput use cases
use simpler tools.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
How does Kafka handle a broker crash without losing messages? Describe t...
Q02SENIOR
When would you choose RabbitMQ over Kafka for a production system?
Q03SENIOR
What happens when a consumer's processing time exceeds max.poll.interval...
Q04JUNIOR
What is the difference between at-least-once and exactly-once delivery i...
Q05SENIOR
You notice consumer lag growing on one partition while others are fine. ...
Q06SENIOR
Design a distributed message queue for a payment system that requires ex...
Q01 of 06SENIOR

How does Kafka handle a broker crash without losing messages? Describe the leader election and recovery process.

ANSWER
Kafka uses a controller that detects broker failure via ZooKeeper ephemeral nodes. It elects new leaders for partitions from in-sync replicas (ISR). Producers with acks=all will retry until new leader is ready. Consumers see no data loss because committed offsets are stored in the __consumer_offsets topic, which is replicated. The key is that only ISR replicas can become leaders — if a broker had stale data, it's not in ISR and won't be elected.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
How do I design a distributed message queue for high throughput?
02
What's the difference between Kafka and RabbitMQ for a distributed queue?
03
How do I prevent message loss in a distributed message queue?
04
How does a distributed message queue handle consumer failures?
N
Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Written from production experience, not tutorials.

Follow
Verified
production tested
June 25, 2026
last updated
1,663
articles · all by Naren
🔥

That's Real World. Mark it forged?

4 min read · try the examples if you haven't

Previous
Design Instagram
21 / 40 · Real World
Next
Design a Distributed Key-Value Store