Pub/sub decouples message producers from consumers via a message broker. Publishers emit events to topics; subscribers consume from those topics. This enables async, scalable, and fault-tolerant communication between services.
✦ Definition~90s read
What is Publish-Subscribe Pattern?
The publish-subscribe pattern (pub/sub) is a messaging paradigm where publishers send messages to topics without knowing which subscribers receive them. Subscribers express interest in topics and receive messages asynchronously, decoupling the sender from the receiver.
★
Think of a radio station.
Plain-English First
Think of a radio station. The station (publisher) broadcasts music on a frequency (topic). Anyone with a radio tuned to that frequency (subscriber) hears the music. The station doesn't know who's listening, and listeners don't affect the broadcast. If you miss a song, you can't replay it unless the station recorded it (persistent subscription).
I've seen a well-intentioned pub/sub system bring down a payments service at 3 AM because the subscriber couldn't keep up and the broker's memory filled up. The team had to restart the broker and replay hours of lost messages. That's the kind of failure that makes you appreciate the pattern's sharp edges.
Pub/sub solves a fundamental problem: how do you let multiple services react to events without coupling them? Before pub/sub, teams used direct HTTP calls or shared databases. Both create tight coupling and single points of failure. A payment service calling a notification service directly means if notifications are down, payments fail. That's unacceptable.
By the end of this article, you'll know exactly when to use pub/sub, how to implement it without the rookie mistakes, and how to debug the three most common production failures. You'll also know when a simple queue or direct call is the smarter choice.
Why Decoupling Matters More Than You Think
Direct coupling between services creates fragility. If service A calls service B directly, and B is slow or down, A fails too. Pub/sub breaks that chain. The publisher doesn't care if subscribers exist or are healthy. It just fires a message and moves on. This lets you scale subscribers independently, add new consumers without touching publishers, and survive subscriber failures without data loss.
But decoupling comes with a cost: you lose visibility. A direct call returns a response; pub/sub is fire-and-forget. You need monitoring, retries, and dead-letter queues to handle failures. Many teams jump into pub/sub without these, then wonder why messages vanish.
No direct output — messages flow asynchronously. Both subscribers receive the same message independently.
Production Trap: No Message Validation
Never trust message payloads blindly. Always validate schema on the subscriber side. I've seen a malformed JSON crash a whole fleet of subscribers because the parser threw an unhandled exception.
thecodeforge.io
Pub/Sub Pattern: Async Pipeline Architecture
Pub Sub Pattern
Choosing Your Broker: Kafka vs Redis vs RabbitMQ
Your broker choice defines your failure modes. Kafka is built for high-throughput, persistent, replayable streams. Redis pub/sub is fast but ephemeral — messages are lost if no subscriber is listening. RabbitMQ sits in between: persistent queues, flexible routing, but lower throughput than Kafka.
Here's the rule: Use Kafka when you need message replay and ordering guarantees (e.g., event sourcing, audit logs). Use Redis pub/sub for transient notifications where loss is acceptable (e.g., live dashboards). Use RabbitMQ for reliable task distribution with complex routing (e.g., work queues with multiple bindings).
Don't use Redis pub/sub for anything that must survive a restart. Don't use Kafka for low-latency real-time chat — the overhead is too high.
No direct output — each broker delivers messages to subscribers asynchronously.
Broker Selection Decision Tree
IfNeed message replay and ordering across consumers
→
UseKafka
IfTransient notifications, loss acceptable
→
UseRedis pub/sub
IfReliable delivery with complex routing
→
UseRabbitMQ
IfExactly-once semantics required
→
UseKafka with idempotent producer + transactional API
Handling Backpressure: Don't Let the Broker Eat Your Memory
The most common pub/sub failure is the subscriber falling behind. Messages pile up on the broker, memory grows, and eventually the broker OOMs or starts dropping messages. This is backpressure, and you must handle it.
Solutions: 1) Use a bounded buffer on the subscriber side — never let the client library buffer unlimited messages. 2) Implement a circuit breaker: if the subscriber's processing queue exceeds a threshold, pause consumption and alert. 3) Use a dead-letter queue for messages that can't be processed after retries.
In Kafka, you can increase partitions and add more consumers to scale horizontally. In Redis, you're limited — consider switching to Redis Streams which support consumer groups and backpressure.
BackpressureHandler.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
// io.thecodeforge — SystemDesign tutorial
// Subscriber with bounded buffer and circuit breaker
const MAX_BUFFER_SIZE = 1000;
const processingQueue = [];
let isPaused = false;
broker.subscribe('orders', (message) => {
if (processingQueue.length >= MAX_BUFFER_SIZE) {
// Circuit breaker: pause subscription
if (!isPaused) {
broker.pause('orders');
isPaused = true;
alertOps('Order subscriber buffer full — pausing consumption');
}
// Reject message — broker will retry or DLQreturnfalse;
}
processingQueue.push(message);
processNext();
});
async function processNext() {
if (processingQueue.length === 0) {
if (isPaused) {
broker.resume('orders');
isPaused = false;
}
return;
}
const message = processingQueue.shift();
try {
await processOrder(message);
} catch (err) {
// Send to dead-letter queue
deadLetterQueue.send(message);
}
setImmediate(processNext); // process next without stack overflow
}
Output
No direct output — backpressure handling is internal. Alerts fire when buffer is full.
Senior Shortcut: Monitor Consumer Lag
Always monitor consumer lag (Kafka: kafka-consumer-groups --describe). Set alerts when lag exceeds a threshold (e.g., 10,000 messages). That's your early warning before OOM.
thecodeforge.io
Backpressure Cascade
Pub Sub Pattern
Idempotency: Because At-Least-Once Means Duplicates
Most pub/sub systems guarantee at-least-once delivery. That means your subscriber will see the same message more than once — during retries, rebalances, or broker failovers. If you're not idempotent, you'll double-charge customers, send duplicate emails, or create duplicate database records.
Solution: Every message carries a unique idempotency key (e.g., UUID). The subscriber checks a dedup cache (Redis with TTL) before processing. If the key exists, skip. If not, process and store the key with a TTL longer than the maximum possible duplicate window.
Never rely on database unique constraints alone — they cause constraint violation errors that need handling.
No direct output — duplicate messages are silently skipped.
The Classic Bug: TTL Too Short
Set your dedup TTL longer than the maximum retry window. If your broker retries for 30 minutes, set TTL to at least 1 hour. I've seen TTL set to 5 minutes, and duplicates slipped through during a 10-minute outage.
thecodeforge.io
At-Least-Once vs Idempotent
Pub Sub Pattern
Dead-Letter Queues: Where Messages Go to Die (or Be Resurrected)
Not all messages can be processed. Maybe the downstream service is down, the data is corrupt, or a bug in your subscriber. If you keep retrying forever, you'll clog the queue and block other messages. Enter the dead-letter queue (DLQ): a separate queue for messages that failed after a maximum number of retries.
Configure your broker to send messages to a DLQ after N retries. Then have a separate process that periodically replays DLQ messages (after fixing the issue) or alerts a human. Never let messages vanish silently.
In Kafka, you can use a DLQ topic and a separate consumer. In RabbitMQ, use dead letter exchanges. In Redis Streams, you'll need to implement it manually.
DLQSetup.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
// io.thecodeforge — SystemDesign tutorial
// RabbitMQ: setup dead letter exchange
channel.exchangeDeclare("orders.dlx", "direct", true);
channel.queueDeclare("orders.dlq", true, false, false, null);
channel.queueBind("orders.dlq", "orders.dlx", "order.failed");
// Main queue with DLX config
Map<String, Object> args = newHashMap<>();
args.put("x-dead-letter-exchange", "orders.dlx");
args.put("x-dead-letter-routing-key", "order.failed");
args.put("x-max-retries", 3);
channel.queueDeclare("orders_queue", true, false, false, args);
// Consumer with retry count in headers
channel.basicConsume("orders_queue", false, (consumerTag, delivery) -> {
try {
processOrder(newString(delivery.getBody()));
channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
} catch (Exception e) {
long retryCount = delivery.getProperties().getHeaders().getOrDefault("x-retry-count", 0);
if (retryCount < 3) {
// Reject and requeue with incremented retry count
AMQP.BasicProperties props = newAMQP.BasicProperties.Builder()
.headers(Map.of("x-retry-count", retryCount + 1))
.build();
channel.basicReject(delivery.getEnvelope().getDeliveryTag(), true);
// Note: actual requeue with new headers requires custom logic
} else {
// Send to DLQ
channel.basicReject(delivery.getEnvelope().getDeliveryTag(), false);
}
}
}, consumerTag -> {});
Output
No direct output — messages exceeding retries end up in orders.dlq.
Interview Gold: DLQ Monitoring
Always monitor DLQ depth. A growing DLQ means something is broken. Set an alert on DLQ message count > 0. In an interview, mention that you'd build a dashboard showing DLQ age and content for quick debugging.
When Not to Use Pub/Sub: The Overkill Trap
Pub/sub is not always the answer. If you have a single consumer and need a response, use a direct call or a request-reply pattern. If you need exactly-once processing and can't tolerate duplicates, consider a transactional outbox pattern with a database instead.
I've seen teams use Kafka to send a notification email — that's a sledgehammer for a nail. A simple queue (RabbitMQ) or even an HTTP call with retries would be simpler and cheaper.
Also avoid pub/sub for low-latency (<10ms) interactions. The broker hop adds latency. For real-time gaming or trading, consider direct WebSocket connections or UDP multicast.
WhenToAvoidPubSub.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
// io.thecodeforge — SystemDesign tutorial
// Scenario: User registration needs to send welcome email.
// Overkill: Kafka topic with one partition and one consumer
// Better: DirectHTTP call to email service with retry
async function registerUser(userData) {
const user = await db.insertUser(userData);
// Direct call with retry — simpler and synchronous
const emailSent = await retry(() =>
emailService.sendWelcomeEmail(user.email),
{ retries: 3, backoff: 'exponential' }
);
if (!emailSent) {
// Fallback: queue for later processing
await backgroundJobQueue.add('send_email', { userId: user.id });
}
return user;
}
Output
No direct output — the email is sent synchronously with retry.
Never Do This: Pub/Sub for Request-Reply
Don't use pub/sub to implement RPC. You'll end up with correlation IDs, temporary reply queues, and timeouts — a poor man's HTTP. Use a proper RPC framework or direct calls instead.
Monitoring and Debugging: The Blind Spot
Pub/sub systems are notoriously hard to debug because messages flow asynchronously. You need end-to-end tracing. Attach a unique trace ID to every message at the publisher, and propagate it through all subscribers. Use distributed tracing tools (Jaeger, Zipkin) to visualize the flow.
Also log every message arrival and processing outcome. Without logs, you're blind. I've spent hours chasing a missing message only to find it was published to the wrong topic due to a typo.
Set up metrics: publish rate, consume rate, lag, error rate, DLQ depth. Alert on anomalies.
Logs with traceId allow correlating publisher and subscriber events.
Production Trap: No Logging on Publish
Always log when you publish a message, including the topic and message ID. Otherwise, when a message goes missing, you can't tell if it was never published or lost in transit.
● Production incidentPOST-MORTEMseverity: high
The 4GB Container That Kept Dying
Symptom
A microservice processing order events kept crashing with OOMKilled every 30 minutes. The container had 4GB RAM limit.
Assumption
The team assumed a memory leak in the subscriber code.
Root cause
The subscriber was using Redis pub/sub with a blocking brpop. When the Redis broker couldn't keep up, messages buffered in Redis memory. The subscriber's client library also had an unbounded in-memory buffer (default 10,000 messages). Combined, they consumed 3.8GB before OOM.
Fix
Set a maximum in-memory buffer size on the subscriber (e.g., 1000 messages). Configure Redis maxmemory-policy allkeys-lru. Added a circuit breaker to pause subscription when buffer exceeds 80%.
Key lesson
Always bound every buffer in your pipeline — broker, client, and application.
Unbounded buffers are landmines.
Production debug guideSystematic recovery paths for the failure modes engineers actually hit.3 entries
Symptom · 01
Messages not being consumed — consumer lag growing
→
Fix
1. Check consumer health: is it running? CPU/memory? 2. Check broker logs for rebalances. 3. Increase partitions and consumers. 4. If using Kafka, run kafka-consumer-groups --describe --group <group> to see lag per partition.
Symptom · 02
Duplicate messages processed
→
Fix
1. Check if subscriber is idempotent. 2. Verify dedup cache TTL. 3. Check for rebalances causing reprocessing. 4. In Kafka, ensure enable.auto.commit=false and commit manually after processing.
Symptom · 03
Broker OOM or high memory usage
→
Fix
1. Check consumer lag — messages buffered in broker memory. 2. Reduce max in-memory buffer on subscriber. 3. For Redis, set maxmemory-policy allkeys-lru. 4. For Kafka, reduce log retention or increase disk.
★ Publish-Subscribe Pattern Triage Cheat SheetFirst-response commands for when things go wrong — copy-paste ready.
Consumer lag growing — `kafka-consumer-groups --describe --group my-group` shows high LAG−
Increase partitions: kafka-topics --alter --topic orders --partitions 10 and add more consumers.
Duplicate messages — same order processed twice+
Immediate action
Check idempotency key in logs
Commands
grep 'duplicate' /var/log/subscriber.log
redis-cli TTL order_placed:12345
Fix now
Increase dedup TTL to 3600 seconds. Ensure idempotency key is set before processing.
Redis OOM — `OOM command not allowed when used memory > 'maxmemory'`+
Immediate action
Check memory usage and eviction policy
Commands
redis-cli INFO memory | grep used_memory_human
redis-cli CONFIG GET maxmemory-policy
Fix now
Set maxmemory-policy allkeys-lru: redis-cli CONFIG SET maxmemory-policy allkeys-lru. Reduce subscriber buffer size.
RabbitMQ queue growing — messages not consumed+
Immediate action
Check if consumer is connected and not blocked
Commands
rabbitmqctl list_queues name messages consumers
rabbitmqctl list_consumers
Fix now
Restart consumer or increase prefetch count: channel.basicQos(100).
Feature / Aspect
Kafka
Redis Pub/Sub
RabbitMQ
Message Persistence
Yes, on disk
No, in-memory only
Yes, optional
Message Replay
Yes, by offset
No
No (unless using shovel/backup)
Ordering Guarantee
Within partition
No
Within queue
Throughput
Very high (millions/s)
High (hundreds of thousands/s)
Moderate (tens of thousands/s)
Latency
Low (ms)
Very low (sub-ms)
Low (ms)
Consumer Groups
Yes
No (use Redis Streams)
Yes
Routing Flexibility
Topic-based
Channel-based
Exchanges + bindings
Operational Complexity
High (ZooKeeper/KRaft)
Low
Medium
Key takeaways
1
Pub/sub decouples producers and consumers via a broker, enabling async, scalable communication
but adds complexity in monitoring and error handling.
2
Always bound buffers at every layer
broker, client, and application. Unbounded buffers cause OOMs.
3
Idempotency is mandatory for at-least-once delivery. Use a dedup cache with TTL longer than the retry window.
4
Dead-letter queues are your safety net. Never let messages vanish silently. Monitor DLQ depth.
5
Choose your broker based on durability and replay needs
Kafka for replay, Redis for ephemeral, RabbitMQ for flexible routing.
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01SENIOR
How does Kafka handle backpressure when a consumer is slower than the pr...
Q02SENIOR
When would you choose RabbitMQ over Kafka for a pub/sub system?
Q03SENIOR
What happens when a Kafka consumer crashes mid-processing? How do you pr...
Q04JUNIOR
What is the difference between pub/sub and message queues?
Q05SENIOR
A subscriber is processing messages but the broker shows increasing lag....
Q06SENIOR
Design a pub/sub system for a ride-hailing app that needs to broadcast d...
Q01 of 06SENIOR
How does Kafka handle backpressure when a consumer is slower than the producer?
ANSWER
Kafka doesn't have backpressure to the producer. Instead, it buffers messages on disk (log). The consumer lag grows. You must monitor lag and scale consumers by adding partitions. If lag becomes too large, you may hit disk space limits or retention thresholds, causing data loss.
Q02 of 06SENIOR
When would you choose RabbitMQ over Kafka for a pub/sub system?
ANSWER
Choose RabbitMQ when you need flexible routing (exchanges/bindings), lower operational complexity, and don't need message replay. For example, a task distribution system with multiple queues and routing keys. Choose Kafka when you need high throughput, durable storage, and replay capability.
Q03 of 06SENIOR
What happens when a Kafka consumer crashes mid-processing? How do you prevent data loss?
ANSWER
If enable.auto.commit=true, offsets are committed periodically, so messages processed after the last commit will be reprocessed (at-least-once). To prevent data loss, commit offsets only after processing is complete (enable.auto.commit=false). Use idempotent processing to handle duplicates.
Q04 of 06JUNIOR
What is the difference between pub/sub and message queues?
ANSWER
In pub/sub, each message is delivered to all subscribers (fan-out). In message queues, each message is delivered to one consumer (point-to-point). Pub/sub is for broadcasting events; queues are for distributing work.
Q05 of 06SENIOR
A subscriber is processing messages but the broker shows increasing lag. How do you debug?
ANSWER
1. Check subscriber throughput: is it CPU-bound, I/O-bound, or blocked on a downstream call? 2. Check for rebalances: if consumers join/leave, processing pauses. 3. Check message size: large messages take longer to process. 4. Increase partitions and consumers to scale horizontally.
Q06 of 06SENIOR
Design a pub/sub system for a ride-hailing app that needs to broadcast driver location updates to nearby riders. What broker and topology?
ANSWER
Use Redis pub/sub for low-latency, ephemeral location updates. Partition by geohash: each geohash region is a channel. Drivers publish to their geohash channel; riders subscribe to channels in their vicinity. For persistence, log location updates to Kafka for analytics. Handle backpressure by limiting subscription count per rider.
01
How does Kafka handle backpressure when a consumer is slower than the producer?
SENIOR
02
When would you choose RabbitMQ over Kafka for a pub/sub system?
SENIOR
03
What happens when a Kafka consumer crashes mid-processing? How do you prevent data loss?
SENIOR
04
What is the difference between pub/sub and message queues?
JUNIOR
05
A subscriber is processing messages but the broker shows increasing lag. How do you debug?
SENIOR
06
Design a pub/sub system for a ride-hailing app that needs to broadcast driver location updates to nearby riders. What broker and topology?
SENIOR
FAQ · 4 QUESTIONS
Frequently Asked Questions
01
What is the publish-subscribe pattern used for?
It's used to decouple services that produce events from those that consume them. Common uses: notifying multiple services of an event (e.g., order placed), streaming data (e.g., user activity), and building event-driven architectures.
Was this helpful?
02
What's the difference between pub/sub and observer pattern?
Observer pattern is in-process: subjects notify observers directly within the same application. Pub/sub is cross-process: publishers and subscribers communicate via a broker, often across different services or machines.
Was this helpful?
03
How do I ensure exactly-once delivery in pub/sub?
Exactly-once is extremely hard in distributed systems. Most systems offer at-least-once with idempotent processing. Kafka provides exactly-once semantics via transactional API and idempotent producer, but it's complex and costly. For most cases, at-least-once + idempotency is sufficient.
Was this helpful?
04
What happens if a subscriber crashes in Kafka?
The consumer group rebalances: partitions assigned to the crashed consumer are reassigned to other consumers. Messages that were sent but not yet committed will be reprocessed by the new consumer (at-least-once). If you need exactly-once, use idempotent processing.