Senior 4 min · June 25, 2026

Design a Distributed Message Queue: Avoid the 3 AM Thread Pool Exhaustion

Q: How do I design a distributed message queue for high throughput?

Partition your topic across multiple brokers, use key-based partitioning for ordering, batch producer requests (linger.ms=5, batch.size=64KB), and set acks=1 for throughput (but risk data loss). For durability, use acks=all and replication factor 3. Consumer side: use multiple consumer threads per partition, and commit offsets in batches.

Q: What's the difference between Kafka and RabbitMQ for a distributed queue?

Kafka is a distributed commit log optimized for high throughput and replayability. RabbitMQ is a message broker with complex routing and lower latency. Choose Kafka for event streaming and analytics; choose RabbitMQ for task queues and RPC.

Q: How do I prevent message loss in a distributed message queue?

Set producer acks=all, enable.idempotence=true, and use replication factor >= 3. On consumer side, commit offsets only after processing (commitSync). Monitor under-replicated partitions and consumer lag. Use a dead letter queue for failed messages.

Q: How does a distributed message queue handle consumer failures?

Consumer failures are detected via session timeout. The group coordinator triggers a rebalance, reassigning partitions to remaining consumers. To minimize impact, use cooperative rebalancing and set session.timeout.ms high enough to avoid false positives. Messages are redelivered to the new consumer (at-least-once).

Design a distributed message queue that survives production.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Written from production experience, not tutorials.

✓ Production

production tested

June 25, 2026

last updated

1,663

articles · all by Naren

● Production Incident 🔎 Debug Guide ⚙ Triage Commands

⚡Quick Answer

Design a distributed message queue by partitioning data across brokers, using a commit log for durability, and implementing consumer groups for parallel processing. Key choices: Kafka vs Pulsar vs RabbitMQ — each trades off latency, ordering, and operational complexity.

✦ Definition~90s read

What is Design a Distributed Message Queue?

A distributed message queue is a system that decouples producers and consumers across multiple machines, providing at-least-once delivery, ordering guarantees, and horizontal scalability. It's the backbone of async microservices communication.

★

Imagine a restaurant kitchen with one chef and a hundred waiters.

Plain-English First

Imagine a restaurant kitchen with one chef and a hundred waiters. Without a queue, waiters shout orders at the chef, orders get lost, and the chef burns out. A distributed message queue is like a ticket rail — waiters drop tickets, chefs pick them in order, and multiple chefs can work different sections (partitions) without stepping on each other.

The 3 AM page that wakes you up: 'Payment service latency spike — messages stuck in queue.' You SSH in, see 10,000 unacknowledged messages, and the consumer thread pool is completely deadlocked. This isn't a Kafka bug. It's a design failure you introduced. Most engineers think a distributed message queue is just 'Kafka with a topic.' They're wrong. The hard part isn't the queue — it's the guarantees: exactly-once semantics, ordering across partitions, and handling a broker crash without losing a single message. By the end of this, you'll design a queue that survives a node failure, a network partition, and a consumer that takes 30 seconds to process a message — without dropping data or deadlocking.

Core Architecture: Partitions Are Your Scalability Unit

Before you write a single line, understand this: a distributed message queue is a partitioned commit log. Each partition is an ordered, immutable sequence of messages. Producers append to partitions, consumers read from them. The magic is that partitions can live on different brokers, giving you horizontal scaling. But here's the trap: ordering is only guaranteed within a partition. If you need global ordering, you need a single partition — which kills throughput. Choose wisely. In production, we use a key-based partitioning scheme: all messages for the same order ID go to the same partition. That preserves order per order, while allowing parallel consumption across orders. Never hash by random — you'll get hot partitions.

PartitionedMessageQueue.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Producer: partition by order ID
ProducerRecord<String, OrderEvent> record = new ProducerRecord<>(
    "order-events",
    order.getOrderId(), // key -> same partition
    order.toEvent()
);
producer.send(record, (metadata, exception) -> {
    if (exception != null) {
        // Retry with exponential backoff, log to dead letter queue
        deadLetterQueue.send(record);
    }
});

// Consumer: assign partitions to threads
// Each thread owns a set of partitions
for (TopicPartition partition : assignment) {
    executor.submit(() -> {
        while (true) {
            ConsumerRecords<String, OrderEvent> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, OrderEvent> record : records) {
                processOrder(record.value());
            }
            consumer.commitSync(); // commit after batch
        }
    });
}

Output

Messages for order-123 always go to partition 0. Consumer thread 1 processes partition 0, thread 2 processes partition 1. No global ordering, but per-order ordering preserved.

Production Trap: Hot Partitions

If you use a timestamp as key, all messages within the same millisecond hit the same partition. Use a composite key like {customerId}_{timestamp} to spread load.

thecodeforge.io

Distributed Message Queue Architecture

Design Distributed Message Queue

Durability: The Commit Log and Replication Factor

A queue that loses messages is a liability. Durability comes from the commit log — every message is written to disk on the broker before being acknowledged. But disk writes are slow. So we batch them. In Kafka, the producer can set acks=all and linger.ms=5 to batch 5ms worth of messages into a single disk write. That gives you durability with 10x throughput. Replication factor (RF) of 3 means three brokers have a copy. If one dies, the others take over. But RF=3 means three times the storage and network traffic. For non-critical logs, RF=2 is fine. For payment events, RF=3 minimum. Never set RF=1 in production — I've seen a single disk failure wipe out a day of orders.

DurableProducer.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
props.put("acks", "all"); // wait for all in-sync replicas
props.put("retries", Integer.MAX_VALUE); // infinite retries on transient errors
props.put("enable.idempotence", true); // exactly-once semantics
props.put("linger.ms", 5); // batch up to 5ms
props.put("batch.size", 65536); // 64KB batches
props.put("compression.type", "snappy"); // reduce network bandwidth

KafkaProducer<String, OrderEvent> producer = new KafkaProducer<>(props);

Output

Producer sends 64KB batches every 5ms. Broker writes to disk and replicates to 2 followers. If broker1 crashes, broker2 takes over with zero data loss.

Senior Shortcut: Tune batch.size and linger.ms together

Don't set linger.ms too high (>100ms) for latency-sensitive apps. For payment processing, 5ms is safe. For analytics, 500ms is fine.

Consumer Groups and Rebalancing: The Silent Killer

Consumer groups let you scale consumption: multiple consumers split partitions among themselves. But when a consumer joins or leaves, a rebalance triggers — all consumers stop, revoke partitions, and reassign. During rebalance, no messages are processed. In a 100-partition topic, rebalance can take 30 seconds. That's 30 seconds of piling messages. The fix: use cooperative rebalancing (Kafka 2.4+) which reassigns partitions incrementally. Also set session.timeout.ms high enough (45s) to avoid false rebalances on GC pauses. And never use static group membership unless you want manual partition assignment — it's a maintenance nightmare.

ConsumerGroupRebalance.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

Properties consumerProps = new Properties();
consumerProps.put("group.id", "order-processor");
consumerProps.put("session.timeout.ms", 45000); // 45s before considered dead
consumerProps.put("heartbeat.interval.ms", 15000); // send heartbeat every 15s
consumerProps.put("partition.assignment.strategy", 
    "org.apache.kafka.clients.consumer.CooperativeStickyAssignor"); // incremental rebalance
consumerProps.put("max.poll.interval.ms", 300000); // 5min max between polls
consumerProps.put("max.poll.records", 100); // process 100 records per poll

KafkaConsumer<String, OrderEvent> consumer = new KafkaConsumer<>(consumerProps);
consumer.subscribe(Arrays.asList("order-events"));

Output

When consumer-3 crashes, only its partitions (e.g., 5-7) are reassigned. Other consumers keep processing. Rebalance time drops from 30s to 2s.

Never Do This: Static Group Membership

Setting group.instance.id makes consumers static — they keep partitions even if they disconnect. Sounds good? Until a consumer crashes and its partitions are stuck unprocessed for hours. Avoid.

thecodeforge.io

Consumer Group Rebalance Cycle

Design Distributed Message Queue

Exactly-Once Semantics: The Holy Grail and Its Cost

At-least-once is easy: retry on failure. But duplicates happen. Exactly-once requires idempotent producers and transactional consumers. The producer assigns a unique ID to each message; the broker deduplicates. The consumer reads messages and writes results in a transaction — if it crashes, the transaction is rolled back, and the message is re-delivered. But transactions add latency (2x-3x). For payment processing, it's worth it. For logging, it's overkill. Use exactly-once only when duplicates cause financial loss. Otherwise, use idempotent consumers (dedup by message ID) — cheaper and simpler.

ExactlyOnceConsumer.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Consumer with idempotent processing
consumer.subscribe(Arrays.asList("order-events"));
while (true) {
    ConsumerRecords<String, OrderEvent> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, OrderEvent> record : records) {
        String messageId = record.key(); // unique per message
        if (processedIds.contains(messageId)) {
            continue; // dedup
        }
        processOrder(record.value());
        processedIds.add(messageId);
    }
    consumer.commitSync();
}

// For exactly-once with transactions:
producer.initTransactions();
producer.beginTransaction();
for (ConsumerRecord<String, OrderEvent> record : records) {
    producer.send(new ProducerRecord<>("processed-orders", record.key(), record.value()));
}
producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
producer.commitTransaction();

Output

Idempotent consumer skips duplicates. Transactional producer ensures output topic and offsets are committed atomically.

Interview Gold: Exactly-Once vs Idempotent

Exactly-once is a system guarantee. Idempotent is an application property. You can have idempotent consumers without exactly-once — just dedup by message ID. That's what most production systems do.

Backpressure and Flow Control: Don't Let the Queue Eat Your Memory

If producers outpace consumers, the queue grows unbounded. Brokers run out of disk. Consumers get overwhelmed. The fix: backpressure. Producers should block when the queue is full. Kafka doesn't have native backpressure — you implement it with max.in.flight.requests.per.connection (set to 1 for strict ordering) and a callback that blocks. Better: use a rate limiter on the consumer side. If consumer processing time spikes, slow down polling. Also set retention limits: delete messages older than 7 days or when topic size exceeds 100GB. Never let a topic grow forever — I've seen a 2TB topic that took 3 days to clean up.

BackpressureConsumer.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Rate-limited consumer
RateLimiter limiter = RateLimiter.create(100.0); // 100 permits per second
while (true) {
    limiter.acquire(); // blocks if rate exceeded
    ConsumerRecords<String, OrderEvent> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, OrderEvent> record : records) {
        processOrder(record.value());
    }
    consumer.commitSync();
}

// Broker-side retention
// Set on topic creation or via CLI:
// bin/kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name order-events --alter --add-config retention.ms=604800000,retention.bytes=107374182400

Output

Consumer processes max 100 messages per second. Topic retains max 100GB or 7 days, whichever comes first.

Production Trap: Unbounded Retention

Default retention is 7 days. For high-throughput topics (10k msg/s), that's 6 billion messages. Set retention.bytes aggressively. Monitor disk usage with kafka-log-dirs.sh.

thecodeforge.io

Backpressure: Block vs. Drop

Design Distributed Message Queue

Monitoring and Alerting: What to Watch Before It Burns

You can't fix what you don't measure. Monitor these metrics: consumer lag (difference between latest offset and consumer offset), request rate, error rate, and disk usage. Use Prometheus + Grafana. Alert on consumer lag > 1000 for more than 5 minutes — that means consumers are falling behind. Also alert on under-replicated partitions (URP) — that means a broker is down and data might be at risk. And monitor GC pauses: if young GC takes > 200ms, your heap is too small. I've seen a 10-second GC pause cause a consumer group rebalance that took down a whole microservice.

MonitoringSetup.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

# Prometheus alert rule for consumer lag
groups:
  - name: kafka_alerts
    rules:
      - alert: ConsumerLagHigh
        expr: kafka_consumer_lag > 1000
        for: 5m
        annotations:
          summary: "Consumer lag > 1000 for topic {{ $labels.topic }}"
      - alert: UnderReplicatedPartitions
        expr: kafka_server_ReplicaManager_UnderReplicatedPartitions > 0
        for: 1m
        annotations:
          summary: "Under-replicated partitions detected"

# Check lag via CLI
# bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group order-processor --describe

Output

Prometheus alerts when lag > 1000 for 5 minutes. CLI shows per-partition lag: partition 0 lag 500, partition 1 lag 1200.

Senior Shortcut: Lag per partition, not total

Total lag hides hot partitions. Monitor max partition lag — if one partition has 10k lag while others have 0, your partitioning key is skewed.

When Not to Use a Distributed Message Queue

Distributed message queues add operational complexity: brokers to manage, partitions to tune, rebalances to handle. If you have a single server and low throughput (< 100 msg/s), use an in-memory queue (e.g., Disruptor) or a simple database table with a status column. If you need strict FIFO across all messages, a single-partition queue limits throughput — consider a database with ordered IDs. And if your messages are tiny and latency-critical (< 1ms), a distributed queue's network overhead kills you. Use a shared memory queue instead. I've seen teams adopt Kafka for a 10 msg/s internal logging pipeline — they spent more time managing Kafka than building features. Don't be that team.

Interview Gold: When to Choose Simplicity

If your queue fits in memory on one machine and you can tolerate data loss on crash, use an in-memory queue. If you need persistence but not distribution, use SQLite with WAL mode. Distributed queues are for when you need horizontal scaling and fault tolerance.

● Production incidentPOST-MORTEMseverity: high

The 4GB Container That Kept Dying

Symptom

Kafka broker OOM-killed every 6 hours. Heap dump showed 3.5GB of unread message buffers.

Assumption

We assumed a memory leak in the consumer library.

Root cause

Consumer fetch.max.bytes was set to 50MB per partition, and with 100 partitions, the broker tried to buffer 5GB in memory. Combined with replication threads, the JVM heap (4GB) couldn't keep up.

Fix

Set fetch.max.bytes to 5MB, fetch.max.wait.ms to 500ms, and added a consumer-side rate limiter with Guava's RateLimiter (permits per second = target throughput / average message size).

Key lesson

Always bound memory per partition, not per broker.
Unbounded fetch buffers are silent killers.

Production debug guideSystematic recovery paths for the failure modes engineers actually hit.3 entries

Symptom · 01

Consumer not processing messages, no errors in logs

→

Fix

1. Check consumer group status: kafka-consumer-groups --describe --group <group> 2. Check if consumer is stuck in rebalance: look for 'Revoke' and 'Assign' logs 3. Check session.timeout.ms and heartbeat.interval.ms — ensure heartbeat is sent before timeout 4. If rebalancing too often, increase session.timeout.ms to 60s

Symptom · 02

Messages duplicated in output

→

Fix

1. Check if producer has enable.idempotence=true 2. Check consumer commitSync() vs commitAsync() — async can cause duplicates on rebalance 3. Implement dedup by message ID in consumer 4. If using transactions, ensure commitTransaction() is called

Symptom · 03

Broker out of disk space

→

Fix

1. Check disk usage: df -h on broker 2. Reduce retention: kafka-configs --alter --add-config retention.bytes=10737418240 3. Delete old segments manually: rm -rf /data/kafka/<topic>-<partition>/*.log (stop broker first) 4. Add more brokers and reassign partitions

★ Distributed Message Queue Triage Cheat SheetFirst-response commands for when things go wrong — copy-paste ready.

Consumer lag > 1000−

Immediate action

Check which partition has highest lag

Commands

kafka-consumer-groups --bootstrap-server localhost:9092 --group <group> --describe

kafka-run-class kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic <topic> --time -1

Fix now

Increase consumer threads or reduce max.poll.records. If single partition hot, repartition with better key.

Under-replicated partitions+

Producer timeout errors+

Consumer rebalancing every few minutes+

Feature / Aspect	Kafka	RabbitMQ
Ordering	Per partition only	Per queue (FIFO if single consumer)
Throughput	1M msg/s per broker	50k msg/s per node
Latency	10ms (batched)	1ms (no batching)
Durability	Commit log, RF configurable	Mirrored queues, but slower
Exactly-once	Transactional producer + consumer	Not native, requires dedup
Operational complexity	High (ZooKeeper/KRaft, brokers, partitions)	Medium (Erlang VM, simpler clustering)

Key takeaways

Partition by key to preserve ordering per entity while allowing parallel consumption.

Always set acks=all and enable.idempotence for critical data

never trade durability for speed.

Monitor consumer lag per partition, not total

hot partitions kill throughput.

Distributed queues are overkill for single-server, low-throughput use cases

use simpler tools.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

How does Kafka handle a broker crash without losing messages? Describe t...

Q02SENIOR

When would you choose RabbitMQ over Kafka for a production system?

Q03SENIOR

What happens when a consumer's processing time exceeds max.poll.interval...

Q04JUNIOR

What is the difference between at-least-once and exactly-once delivery i...

Q05SENIOR

You notice consumer lag growing on one partition while others are fine. ...

Q06SENIOR

Design a distributed message queue for a payment system that requires ex...

Q01 of 06SENIOR

How does Kafka handle a broker crash without losing messages? Describe the leader election and recovery process.

ANSWER

Kafka uses a controller that detects broker failure via ZooKeeper ephemeral nodes. It elects new leaders for partitions from in-sync replicas (ISR). Producers with acks=all will retry until new leader is ready. Consumers see no data loss because committed offsets are stored in the __consumer_offsets topic, which is replicated. The key is that only ISR replicas can become leaders — if a broker had stale data, it's not in ISR and won't be elected.

FAQ · 4 QUESTIONS

Frequently Asked Questions

How do I design a distributed message queue for high throughput?

What's the difference between Kafka and RabbitMQ for a distributed queue?

How do I prevent message loss in a distributed message queue?

How does a distributed message queue handle consumer failures?

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Written from production experience, not tutorials.

✓ Verified

production tested

June 25, 2026

last updated

1,663

articles · all by Naren

🔥

That's Real World. Mark it forged?

4 min read · try the examples if you haven't