Senior 4 min · June 25, 2026

Publish-Subscribe Pattern: Build Async Pipelines That Don't Fall Over at 3 AM

The publish-subscribe pattern decouples producers from consumers.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Lessons pulled from things that broke in production.

Last updated: June 25, 2026
Quick Answer

Pub/sub decouples message producers from consumers via a message broker. Publishers emit events to topics; subscribers consume from those topics. This enables async, scalable, and fault-tolerant communication between services.

Definition
What Is the Publish-Subscribe Pattern?

The publish-subscribe pattern (pub/sub) is a messaging paradigm where publishers send messages to topics without knowing which subscribers receive them. Subscribers express interest in topics and receive messages asynchronously, decoupling the sender from the receiver.

Plain-English First

Think of a radio station. The station (publisher) broadcasts music on a frequency (topic). Anyone with a radio tuned to that frequency (subscriber) hears the music. The station doesn't know who's listening, and listeners don't affect the broadcast. If you miss a song, you can't replay it unless the station recorded it (persistent subscription).

I've seen a well-intentioned pub/sub system bring down a payments service at 3 AM because the subscriber couldn't keep up and the broker's memory filled up. The team had to restart the broker and replay hours of lost messages. That's the kind of failure that makes you appreciate the pattern's sharp edges.

Pub/sub solves a fundamental problem: how do you let multiple services react to events without coupling them? Before pub/sub, teams used direct HTTP calls or shared databases. Both create tight coupling and single points of failure. A payment service calling a notification service directly means if notifications are down, payments fail. That's unacceptable.

By the end of this article, you'll know exactly when to use pub/sub, how to implement it without the rookie mistakes, and how to debug the three most common production failures. You'll also know when a simple queue or direct call is the smarter choice.

Why Decoupling Matters More Than You Think

Direct coupling between services creates fragility. If service A calls service B directly, and B is slow or down, A fails too. Pub/sub breaks that chain. The publisher doesn't care if subscribers exist or are healthy. It just fires a message and moves on. This lets you scale subscribers independently, add new consumers without touching publishers, and survive subscriber failures without data loss, provided the broker persists messages until they are acknowledged.

But decoupling comes with a cost: you lose visibility. A direct call returns a response; pub/sub is fire-and-forget. You need monitoring, retries, and dead-letter queues to handle failures. Many teams jump into pub/sub without these, then wonder why messages vanish.
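
To make the fault isolation concrete, here is a minimal in-memory sketch. `InMemoryBroker` is a hypothetical stand-in for a real broker (no persistence, no network), and its `failed` list is only a placeholder for proper retries and a dead-letter queue. It shows one subscriber throwing while the publisher and the other subscriber carry on.

```javascript
// Hypothetical in-memory broker, for illustration only.
class InMemoryBroker {
  constructor() {
    this.topics = new Map(); // topic name -> array of handler functions
    this.failed = [];        // failed deliveries, standing in for retries/DLQ
  }

  subscribe(topic, handler) {
    if (!this.topics.has(topic)) this.topics.set(topic, []);
    this.topics.get(topic).push(handler);
  }

  publish(topic, message) {
    const handlers = this.topics.get(topic) || [];
    for (const handler of handlers) {
      try {
        handler(message);
      } catch (err) {
        // One subscriber failing must not affect the publisher or its peers.
        this.failed.push({ topic, message, error: err.message });
      }
    }
    return handlers.length; // how many subscribers the message fanned out to
  }
}

const broker = new InMemoryBroker();
const emailsSent = [];

// Healthy subscriber.
broker.subscribe('orders', (msg) => emailsSent.push(JSON.parse(msg).orderId));
// Broken subscriber.
broker.subscribe('orders', () => { throw new Error('inventory service down'); });

const delivered = broker.publish('orders', JSON.stringify({ event: 'order_placed', orderId: 42 }));
console.log(delivered, emailsSent, broker.failed.length); // 2 [ 42 ] 1
```

The failure is recorded rather than swallowed, which is the visibility that the monitoring, retry, and dead-letter machinery has to provide in a real deployment.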

BasicPubSub.systemdesign
// io.thecodeforge — System Design tutorial

// Scenario: Order service publishes 'order_placed' event.
// Notification and Inventory services subscribe.

// Publisher (Order Service)
function publishOrderPlaced(orderId, userId, items) {
  const message = {
    event: 'order_placed',
    orderId: orderId,
    userId: userId,
    items: items,
    timestamp: Date.now()
  };
  // Publish to topic 'orders' — broker handles fan-out
  broker.publish('orders', JSON.stringify(message));
}

// Subscriber 1 (Notification Service)
broker.subscribe('orders', (message) => {
  const event = JSON.parse(message);
  if (event.event === 'order_placed') {
    sendEmail(event.userId, 'Your order is confirmed!');
  }
});

// Subscriber 2 (Inventory Service)
broker.subscribe('orders', (message) => {
  const event = JSON.parse(message);
  if (event.event === 'order_placed') {
    event.items.forEach(item => decrementStock(item.sku, item.quantity));
  }
});
Output
No direct output — messages flow asynchronously. Both subscribers receive the same message independently.
Production Trap: No Message Validation
Never trust message payloads blindly. Always validate schema on the subscriber side. I've seen a malformed JSON crash a whole fleet of subscribers because the parser threw an unhandled exception.
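
A minimal sketch of subscriber-side validation, assuming the `order_placed` payload shape used in this article. The function name `parseOrderPlaced` and the `{ ok, reason }` result shape are made up for illustration; a schema library would do the same job with less hand-written code.

```javascript
// Validate an untrusted payload before acting on it.
// Returns { ok: true, event } or { ok: false, reason } and never throws.
function parseOrderPlaced(rawMessage) {
  let event;
  try {
    event = JSON.parse(rawMessage);
  } catch (err) {
    return { ok: false, reason: 'malformed JSON' };
  }
  if (event === null || typeof event !== 'object') {
    return { ok: false, reason: 'payload is not an object' };
  }
  if (event.event !== 'order_placed') {
    return { ok: false, reason: 'unexpected event type' };
  }
  if (event.orderId === undefined || event.userId === undefined) {
    return { ok: false, reason: 'missing orderId or userId' };
  }
  const itemsValid = Array.isArray(event.items) &&
    event.items.every((item) =>
      item !== null && typeof item === 'object' &&
      typeof item.sku === 'string' &&
      Number.isInteger(item.quantity) && item.quantity > 0);
  if (!itemsValid) {
    return { ok: false, reason: 'invalid items' };
  }
  return { ok: true, event };
}

const good = JSON.stringify({
  event: 'order_placed', orderId: 1, userId: 7,
  items: [{ sku: 'A-1', quantity: 2 }],
});
console.log(parseOrderPlaced(good).ok);           // true
console.log(parseOrderPlaced('{not json').reason); // malformed JSON
```

A subscriber would call this first and route any `ok: false` result to logging or a dead-letter queue instead of letting the parser exception take the process down.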
[Figure: Pub/Sub Pattern: Async Pipeline Architecture. Flow from decoupling to broker choice, backpressure, idempotency, and DLQ. Stages: Decoupling Producers & Consumers, Broker Selection, Backpressure Handling, Idempotent Processing, Dead-Letter Queue, Monitoring & Debugging. Caution: overusing pub/sub for simple sync calls adds latency; use direct request-reply when decoupling isn't needed.]

Choosing Your Broker: Kafka vs Redis vs RabbitMQ

Your broker choice defines your failure modes. Kafka is built for high-throughput, persistent, replayable streams. Redis pub/sub is fast but ephemeral — messages are lost if no subscriber is listening. RabbitMQ sits in between: persistent queues, flexible routing, but lower throughput than Kafka.

Here's the rule: Use Kafka when you need message replay and per-partition ordering guarantees (e.g., event sourcing, audit logs). Use Redis pub/sub for transient notifications where loss is acceptable (e.g., live dashboards). Use RabbitMQ for reliable task distribution with complex routing (e.g., work queues with multiple bindings).

Don't use Redis pub/sub for anything that must survive a restart. Don't use Kafka for low-latency real-time chat — the overhead is too high.
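
Kafka's ordering guarantee holds within a partition, so related events must share a partition, which is normally arranged by giving them the same message key. The sketch below illustrates the idea only: `partitionFor` is a hypothetical helper, and its simple string hash is not the hash Kafka's client uses.

```javascript
// Illustrative key -> partition mapping.
// Same key and same partition count always give the same partition.
function partitionFor(key, numPartitions) {
  let hash = 0;
  for (const ch of String(key)) {
    hash = (hash * 31 + ch.codePointAt(0)) >>> 0; // keep as unsigned 32-bit
  }
  return hash % numPartitions;
}

// Every event for order-1 lands in one partition, so they stay in order.
const events = ['order_placed', 'order_paid', 'order_shipped'];
const partitions = events.map(() => partitionFor('order-1', 6));
console.log(new Set(partitions).size); // 1

// Changing the partition count can move a key to a different partition.
console.log(partitionFor('order-1', 6), partitionFor('order-1', 10));
```

The second log line is the reason partition counts are worth planning up front: repartitioning a keyed topic can interleave old and new events for the same key.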

BrokerComparison.systemdesign
// io.thecodeforge — System Design tutorial

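// Fragments only: client setup (producer, consumer, redis, channel) and imports are omitted.

// Kafka: durable, ordered per partition, replayable
// Producer: keying by orderId keeps one order's events in one partition
producer.send(new ProducerRecord<>("orders", orderId, orderJson));
// Consumer (manual offset management; requires enable.auto.commit=false)
consumer.subscribe(Arrays.asList("orders"));
while (true) {
  ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
  for (ConsumerRecord<String, String> record : records) {
    processOrder(record.value());
  }
  // Commit only after processing, so a crash replays the batch (at-least-once)
  consumer.commitSync();
}

// Redis pub/sub: fast, ephemeral, no persistence
// Publisher
redis.publish("orders", orderJson);
// Subscriber (subscribe() blocks the calling thread)
redis.subscribe(new JedisPubSub() {
  @Override
  public void onMessage(String channel, String message) {
    processOrder(message);
  }
}, "orders");

// RabbitMQ: reliable, flexible routing
// Publisher: persistent delivery mode so the message survives a broker restart
channel.basicPublish("orders_exchange", "order.placed",
    MessageProperties.PERSISTENT_TEXT_PLAIN, orderJson.getBytes(StandardCharsets.UTF_8));
// Consumer: autoAck=false, so a message is removed only after it has been processed
channel.basicConsume("orders_queue", false, (consumerTag, delivery) -> {
  processOrder(new String(delivery.getBody(), StandardCharsets.UTF_8));
  channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
}, consumerTag -> {});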
Output
No direct output — each broker delivers messages to subscribers asynchronously.
Broker Selection Decision Tree
If: Need message replay and ordering across consumers
Use: Kafka
If: Transient notifications, loss acceptable
Use: Redis pub/sub
If: Reliable delivery with complex routing
Use: RabbitMQ
If: Exactly-once semantics required
Use: Kafka with idempotent producer + transactional API
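
The decision tree can be written down as a small function, which is handy for design-review checklists. This is a sketch: the flag names are hypothetical, and the order in which the rules are checked (strictest requirement first) is an assumption, since the tree itself does not rank them.

```javascript
// Encodes the broker decision tree. Returns null when no rule matches.
function chooseBroker({
  exactlyOnce = false,     // exactly-once semantics required
  replay = false,          // message replay and ordering needed
  lossAcceptable = false,  // transient notifications, loss is fine
  complexRouting = false,  // reliable delivery with complex routing
} = {}) {
  if (exactlyOnce) return 'Kafka with idempotent producer + transactional API';
  if (replay) return 'Kafka';
  if (lossAcceptable) return 'Redis pub/sub';
  if (complexRouting) return 'RabbitMQ';
  return null; // the tree does not cover this case
}

console.log(chooseBroker({ replay: true }));         // Kafka
console.log(chooseBroker({ lossAcceptable: true })); // Redis pub/sub
console.log(chooseBroker({ complexRouting: true })); // RabbitMQ
```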

Handling Backpressure: Don't Let the Broker Eat Your Memory

The most common pub/sub failure is the subscriber falling behind. Messages pile up on the broker, memory grows, and eventually the broker OOMs or starts dropping messages. This is a backpressure problem: nothing slows the publisher down when consumers lag, so you have to handle it yourself.

Solutions: 1) Use a bounded buffer on the subscriber side — never let the client library buffer unlimited messages. 2) Implement a circuit breaker: if the subscriber's processing queue exceeds a threshold, pause consumption and alert. 3) Use a dead-letter queue for messages that can't be processed after retries.

In Kafka, you can increase partitions and add more consumers to scale horizontally. In Redis pub/sub, you're limited: consider switching to Redis Streams, which support consumer groups and pull-based reads, so each consumer fetches messages at its own pace.

BackpressureHandler.systemdesign
// io.thecodeforge — System Design tutorial

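// Subscriber with bounded buffer and circuit breaker.
// broker.pause/resume and the `return false` rejection are illustrative;
// check what your client library actually offers.
const MAX_BUFFER_SIZE = 1000;
const processingQueue = [];
let isPaused = false;
let isDraining = false;

broker.subscribe('orders', (message) => {
  if (processingQueue.length >= MAX_BUFFER_SIZE) {
    // Circuit breaker: pause subscription
    if (!isPaused) {
      broker.pause('orders');
      isPaused = true;
      alertOps('Order subscriber buffer full, pausing consumption');
    }
    // Reject message so the broker can retry it or route it to a DLQ
    return false;
  }
  processingQueue.push(message);
  drainQueue(); // returns immediately if a drain loop is already running
});

// Process buffered messages one at a time. The isDraining guard ensures only
// one loop runs, however many messages arrive while it is working.
async function drainQueue() {
  if (isDraining) return;
  isDraining = true;
  try {
    while (processingQueue.length > 0) {
      const message = processingQueue.shift();
      try {
        await processOrder(message);
      } catch (err) {
        // Send to dead-letter queue
        deadLetterQueue.send(message);
      }
    }
  } finally {
    isDraining = false;
  }
  // Buffer is empty: safe to accept messages again
  if (isPaused) {
    broker.resume('orders');
    isPaused = false;
  }
}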
Output
No direct output — backpressure handling is internal. Alerts fire when buffer is full.
Senior Shortcut: Monitor Consumer Lag
Always monitor consumer lag (Kafka: kafka-consumer-groups --describe). Set alerts when lag exceeds a threshold (e.g., 10,000 messages). That's your early warning before OOM.
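
Lag itself is simple arithmetic: for each partition, the log end offset minus the consumer group's committed offset. A sketch of the alert check, assuming a hypothetical input shape; the real numbers would come from your broker's admin tooling.

```javascript
// Lag per partition = log end offset - committed offset.
function computeLag(partitions) {
  const perPartition = partitions.map((p) => ({
    partition: p.partition,
    lag: Math.max(0, p.logEndOffset - p.committedOffset),
  }));
  const total = perPartition.reduce((sum, p) => sum + p.lag, 0);
  return { perPartition, total };
}

// The default threshold mirrors the 10,000-message example above.
function shouldAlert(lagReport, threshold = 10000) {
  return lagReport.total > threshold;
}

const report = computeLag([
  { partition: 0, logEndOffset: 1500, committedOffset: 1200 },
  { partition: 1, logEndOffset: 900, committedOffset: 900 },
  { partition: 2, logEndOffset: 20000, committedOffset: 5000 },
]);
console.log(report.total, shouldAlert(report)); // 15300 true
```

Looking at the per-partition numbers matters as much as the total: here almost all the lag sits in partition 2, which points at a hot key or a stuck consumer rather than a general capacity problem.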
[Figure: Backpressure Cascade. How a slow subscriber kills the broker: Publisher (sends messages at peak rate) → Broker Queue (messages pile up unread) → Memory Growth (broker RAM usage spikes) → OOM / Drop (broker crashes or evicts). Always use bounded buffers and monitor consumer lag.]

Idempotency: Because At-Least-Once Means Duplicates

Most pub/sub systems guarantee at-least-once delivery. That means your subscriber will see the same message more than once — during retries, rebalances, or broker failovers. If you're not idempotent, you'll double-charge customers, send duplicate emails, or create duplicate database records.

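Solution: Every message carries a unique idempotency key (e.g., UUID). Before processing, the subscriber atomically claims that key in a dedup cache (for example, Redis SET with NX and a TTL). If the claim fails, the key already exists, so skip the message. If it succeeds, process the message, and release the key if processing fails so a retry can go through. A separate "check, then process, then store" sequence is not enough, because two concurrent deliveries can both pass the check. The TTL must be longer than the maximum possible duplicate window.

Don't rely on database unique constraints alone: a violated constraint surfaces as an error your handler has to catch, and it protects only the database write, not side effects such as emails or charges.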

IdempotentSubscriber.systemdesign
// io.thecodeforge — System Design tutorial

// Idempotent subscriber using a Redis dedup cache.
// Assumes an ioredis-style client and that the publisher sets a unique `id` on every event.
const dedupCache = new Redis();
const DEDUP_TTL_SECONDS = 3600; // 1 hour, longer than the broker's retry window

broker.subscribe('orders', async (message) => {
  const event = JSON.parse(message);
  const idempotencyKey = event.id; // e.g., 'order_placed:12345'

  // Atomically claim the key: SET ... NX succeeds only for the first caller,
  // so two concurrent deliveries of the same message cannot both proceed.
  const claimed = await dedupCache.set(idempotencyKey, '1', 'EX', DEDUP_TTL_SECONDS, 'NX');
  if (claimed !== 'OK') {
    logger.info(`Skipping duplicate message ${idempotencyKey}`);
    return;
  }

  try {
    await processOrder(event);
  } catch (err) {
    // Release the claim so a redelivery can retry the work
    await dedupCache.del(idempotencyKey);
    throw err;
  }
});

async function processOrder(event) {
  // Database insert with ON CONFLICT DO NOTHING as safety net
  await db.query(`
    INSERT INTO orders (id, user_id, items, created_at)
    VALUES ($1, $2, $3, $4)
    ON CONFLICT (id) DO NOTHING
  `, [event.orderId, event.userId, JSON.stringify(event.items), new Date(event.timestamp)]);
}
Output
No direct output — duplicate messages are silently skipped.
The Classic Bug: TTL Too Short
Set your dedup TTL longer than the maximum retry window. If your broker retries for 30 minutes, set TTL to at least 1 hour. I've seen TTL set to 5 minutes, and duplicates slipped through during a 10-minute outage.
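
The TTL bug is easy to reproduce with a fake clock. The sketch below uses a hypothetical in-memory `DedupCache` (a shared store such as Redis is needed once there is more than one subscriber process) and replays the scenario above: a duplicate arriving 10 minutes after the original, checked against a 5-minute and a 60-minute TTL.

```javascript
// In-memory stand-in for a dedup cache with TTL. The clock is injectable
// so the expiry behaviour can be demonstrated without waiting.
class DedupCache {
  constructor(ttlMs, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.expiry = new Map(); // key -> expiry timestamp in ms
  }

  // Returns true if the caller is the first to claim this key within the TTL.
  claim(key) {
    const t = this.now();
    const expiresAt = this.expiry.get(key);
    if (expiresAt !== undefined && expiresAt > t) return false; // duplicate
    this.expiry.set(key, t + this.ttlMs);
    return true;
  }
}

const MINUTE = 60 * 1000;
let fakeNow = 0;
const shortTtl = new DedupCache(5 * MINUTE, () => fakeNow);
const longTtl = new DedupCache(60 * MINUTE, () => fakeNow);

// Original delivery at t = 0.
shortTtl.claim('order_placed:12345');
longTtl.claim('order_placed:12345');

// Redelivery after a 10-minute outage.
fakeNow = 10 * MINUTE;
const slippedThrough = shortTtl.claim('order_placed:12345');
const caught = !longTtl.claim('order_placed:12345');
console.log(slippedThrough, caught); // true true
```

With the 5-minute TTL the key has already expired, so the duplicate is treated as new work; the 60-minute TTL still remembers it.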
[Figure: At-Least-Once vs Idempotent. Why duplicates demand idempotent handlers. Without idempotency: duplicate charge to customer, duplicate email sent, duplicate DB row created, inconsistent state. With idempotency: idempotency key dedupes, same result on retry, consistent state. Use a unique message ID + dedup cache (e.g., Redis).]

Dead-Letter Queues: Where Messages Go to Die (or Be Resurrected)

Not all messages can be processed. Maybe the downstream service is down, the data is corrupt, or there's a bug in your subscriber. If you keep retrying forever, you'll clog the queue and block other messages. Enter the dead-letter queue (DLQ): a separate queue for messages that failed after a maximum number of retries.

Configure your broker to send messages to a DLQ after N retries. Then have a separate process that periodically replays DLQ messages (after fixing the issue) or alerts a human. Never let messages vanish silently.

In Kafka, you can use a DLQ topic and a separate consumer. In RabbitMQ, use dead letter exchanges. In Redis Streams, you'll need to implement it manually.

DLQSetup.systemdesign
// io.thecodeforge — System Design tutorial

// RabbitMQ: set up dead letter exchange
channel.exchangeDeclare("orders.dlx", "direct", true);
channel.queueDeclare("orders.dlq", true, false, false, null);
channel.queueBind("orders.dlq", "orders.dlx", "order.failed");

// Main queue with DLX config: messages rejected with requeue=false go to orders.dlx
Map<String, Object> args = new HashMap<>();
args.put("x-dead-letter-exchange", "orders.dlx");
args.put("x-dead-letter-routing-key", "order.failed");
// Classic queues have no "max retries" argument, so the count is tracked in a
// custom header below (quorum queues offer x-delivery-limit instead).
channel.queueDeclare("orders_queue", true, false, false, args);

final int MAX_RETRIES = 3;

// Consumer with retry count in headers
channel.basicConsume("orders_queue", false, (consumerTag, delivery) -> {
  long tag = delivery.getEnvelope().getDeliveryTag();
  try {
    processOrder(new String(delivery.getBody(), StandardCharsets.UTF_8));
    channel.basicAck(tag, false);
  } catch (Exception e) {
    Map<String, Object> headers = delivery.getProperties().getHeaders(); // may be null
    Object raw = headers == null ? null : headers.get("x-retry-count");
    int retryCount = raw instanceof Number ? ((Number) raw).intValue() : 0;
    if (retryCount < MAX_RETRIES) {
      // basicReject with requeue=true cannot change headers, so republish a copy
      // with the incremented count and ack the original.
      Map<String, Object> newHeaders = new HashMap<>();
      if (headers != null) newHeaders.putAll(headers);
      newHeaders.put("x-retry-count", retryCount + 1);
      AMQP.BasicProperties props = delivery.getProperties().builder()
        .headers(newHeaders)
        .build();
      channel.basicPublish("", "orders_queue", props, delivery.getBody());
      channel.basicAck(tag, false);
    } else {
      // Send to DLQ: requeue=false routes the message through orders.dlx
      channel.basicReject(tag, false);
    }
  }
}, consumerTag -> {});
Output
No direct output — messages exceeding retries end up in orders.dlq.
Interview Gold: DLQ Monitoring
Always monitor DLQ depth. A growing DLQ means something is broken. Set an alert on DLQ message count > 0. In an interview, mention that you'd build a dashboard showing DLQ age and content for quick debugging.
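
Independent of any particular broker, the retry-then-park flow and the later replay step look roughly like this. It is a sketch with hypothetical names (`deliverWithRetries`, `replayDeadLetters`); a real implementation would add a backoff delay between attempts and persist the dead letters.

```javascript
// Try the handler up to maxRetries times, then park the message in deadLetters.
function deliverWithRetries(message, handler, maxRetries, deadLetters) {
  let lastError = null;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      handler(message);
      return { status: 'processed', attempts: attempt };
    } catch (err) {
      lastError = err;
    }
  }
  deadLetters.push({
    message,
    error: lastError ? lastError.message : 'no attempts made',
    attempts: maxRetries,
  });
  return { status: 'dead-lettered', attempts: maxRetries };
}

// After the underlying issue is fixed, replay parked messages.
// Returns the entries that still fail.
function replayDeadLetters(deadLetters, handler) {
  const stillFailing = [];
  for (const entry of deadLetters) {
    try {
      handler(entry.message);
    } catch (err) {
      stillFailing.push({ ...entry, error: err.message });
    }
  }
  return stillFailing;
}

const dlq = [];
const brokenHandler = () => { throw new Error('downstream unavailable'); };
const result = deliverWithRetries({ orderId: 42 }, brokenHandler, 3, dlq);
console.log(result.status, dlq.length); // dead-lettered 1

const processed = [];
const remaining = replayDeadLetters(dlq, (msg) => processed.push(msg.orderId));
console.log(remaining.length, processed); // 0 [ 42 ]
```

Keeping the error and attempt count next to the parked message is what makes the DLQ useful for debugging rather than just a graveyard.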

When Not to Use Pub/Sub: The Overkill Trap

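Pub/sub is not always the answer. If you have a single consumer and need a response, use a direct call or a request-reply pattern. If you can't tolerate duplicates or lost events, consider a transactional outbox pattern with a database instead: it writes the state change and the outgoing event in one transaction. Consumers still need to be idempotent, since the outbox relay can deliver an event more than once.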

I've seen teams use Kafka to send a notification email — that's a sledgehammer for a nail. A simple queue (RabbitMQ) or even an HTTP call with retries would be simpler and cheaper.

Also avoid pub/sub for low-latency (<10ms) interactions. The broker hop adds latency. For real-time gaming or trading, consider direct WebSocket connections or UDP multicast.

WhenToAvoidPubSub.systemdesign
// io.thecodeforge — System Design tutorial

// Scenario: User registration needs to send welcome email.
// Overkill: Kafka topic with one partition and one consumer
// Better: Direct HTTP call to email service with retry

async function registerUser(userData) {
  const user = await db.insertUser(userData);

  try {
    // Direct call with retry: simpler, and the caller knows the outcome.
    // `retry` is an illustrative helper that rethrows once retries are exhausted.
    await retry(
      () => emailService.sendWelcomeEmail(user.email),
      { retries: 3, backoff: 'exponential' }
    );
  } catch (err) {
    // Fallback: queue for later processing
    await backgroundJobQueue.add('send_email', { userId: user.id });
  }

  return user;
}
Output
No direct output — the email is sent synchronously with retry.
Never Do This: Pub/Sub for Request-Reply
Don't use pub/sub to implement RPC. You'll end up with correlation IDs, temporary reply queues, and timeouts — a poor man's HTTP. Use a proper RPC framework or direct calls instead.

Monitoring and Debugging: The Blind Spot

Pub/sub systems are notoriously hard to debug because messages flow asynchronously. You need end-to-end tracing. Attach a unique trace ID to every message at the publisher, and propagate it through all subscribers. Use distributed tracing tools (Jaeger, Zipkin) to visualize the flow.

Also log every message arrival and processing outcome. Without logs, you're blind. I've spent hours chasing a missing message only to find it was published to the wrong topic due to a typo.

Set up metrics: publish rate, consume rate, lag, error rate, DLQ depth. Alert on anomalies.

TracingSetup.systemdesign
// io.thecodeforge — System Design tutorial

// Publisher: attach trace ID
const traceId = uuidv4();
const message = {
  event: 'order_placed',
  orderId: orderId,
  traceId: traceId,
  timestamp: Date.now()
};
logger.info('Publishing order_placed', { traceId, orderId });
broker.publish('orders', JSON.stringify(message));

// Subscriber: log arrival and processing
broker.subscribe('orders', async (rawMessage) => {
  let message;
  try {
    message = JSON.parse(rawMessage);
  } catch (err) {
    // No trace ID is available for a payload that cannot be parsed
    logger.error('Dropping unparseable message', { error: err.message });
    return;
  }
  const { traceId, orderId } = message;
  logger.info('Received order_placed', { traceId, orderId });

  const start = Date.now();
  try {
    await processOrder(message); // await so async failures reach the catch block
    logger.info('Processed order_placed', { traceId, orderId, duration: Date.now() - start });
  } catch (err) {
    logger.error('Failed to process order_placed', { traceId, orderId, error: err.message });
  }
});
Output
Logs with traceId allow correlating publisher and subscriber events.
Production Trap: No Logging on Publish
Always log when you publish a message, including the topic and message ID. Otherwise, when a message goes missing, you can't tell if it was never published or lost in transit.
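
One way to make publish-side logging hard to forget is to route every publish through a single wrapper. A sketch, assuming a `broker` with `publish(topic, payload)` and a `logger` with `info`/`error` methods; both shapes, and the name `publishWithLogging`, are assumptions for illustration.

```javascript
const { randomUUID } = require('node:crypto');

// Stamps each message with a messageId (and a traceId if missing),
// publishes it, and logs the outcome either way.
function publishWithLogging(broker, logger, topic, event) {
  const envelope = {
    ...event,
    messageId: randomUUID(),
    traceId: event.traceId || randomUUID(),
  };
  try {
    broker.publish(topic, JSON.stringify(envelope));
  } catch (err) {
    logger.error('publish failed', { topic, messageId: envelope.messageId, error: err.message });
    throw err;
  }
  logger.info('published', { topic, messageId: envelope.messageId, traceId: envelope.traceId });
  return envelope;
}

// Usage with stand-in broker and logger objects that record what they receive.
const sent = [];
const logLines = [];
const fakeBroker = { publish: (topic, payload) => sent.push({ topic, payload }) };
const fakeLogger = {
  info: (msg, fields) => logLines.push({ level: 'info', msg, ...fields }),
  error: (msg, fields) => logLines.push({ level: 'error', msg, ...fields }),
};

const envelope = publishWithLogging(fakeBroker, fakeLogger, 'orders', { event: 'order_placed', orderId: 42 });
console.log(logLines[0].topic, logLines[0].messageId === envelope.messageId); // orders true
```

With the topic and message ID in the publisher's log, a "missing" message can be classified quickly: no log line means it was never published, while a log line with no matching subscriber entry points at the broker or the subscription.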
Production incident · Post-mortem · Severity: high

The 4GB Container That Kept Dying

Symptom
A microservice processing order events kept crashing with OOMKilled every 30 minutes. The container had 4GB RAM limit.
Assumption
The team assumed a memory leak in the subscriber code.
Root cause
The subscriber read from Redis with a blocking BRPOP (a list-based queue, not native Redis pub/sub). When the subscriber couldn't keep up, messages accumulated in Redis memory. The subscriber's client library also kept its own in-memory buffer, which nobody had tuned down from the default of 10,000 messages. Combined, they consumed 3.8GB before the OOM kill.
Fix
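The team set a maximum in-memory buffer size on the subscriber (e.g., 1000 messages), configured Redis maxmemory-policy allkeys-lru, and added a circuit breaker that pauses the subscription when the buffer exceeds 80%. Note that allkeys-lru evicts keys under memory pressure, so it trades an OOM for data loss and only suits data you can afford to drop.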
Key lesson
  • Always bound every buffer in your pipeline — broker, client, and application.
  • Unbounded buffers are landmines.
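
A sketch of the bounded buffer from the fix, with a pause signal at 80% of capacity. The `BoundedBuffer` class is hypothetical, and the 50% resume threshold is an assumption added here so the subscription doesn't flap between paused and resumed on every message.

```javascript
// Bounded FIFO buffer with high/low watermarks.
class BoundedBuffer {
  constructor(capacity, { pauseAt = 0.8, resumeAt = 0.5 } = {}) {
    this.capacity = capacity;
    this.pauseThreshold = Math.floor(capacity * pauseAt);
    this.resumeThreshold = Math.floor(capacity * resumeAt);
    this.items = [];
    this.paused = false; // the caller should pause the subscription while true
  }

  // Returns false when the buffer is full and the item was rejected.
  offer(item) {
    if (this.items.length >= this.capacity) return false;
    this.items.push(item);
    if (this.items.length >= this.pauseThreshold) this.paused = true;
    return true;
  }

  // Removes and returns the oldest item (undefined if empty).
  take() {
    const item = this.items.shift();
    if (this.paused && this.items.length <= this.resumeThreshold) this.paused = false;
    return item;
  }
}

const buffer = new BoundedBuffer(10);
for (let i = 0; i < 7; i++) buffer.offer(i);
console.log(buffer.paused); // false (7 of 10)
buffer.offer(7);
console.log(buffer.paused); // true (8 of 10 reaches the 80% mark)
```

Pausing before the buffer is completely full leaves headroom for messages already in flight from the broker when the pause takes effect.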
Production debug guide
Systematic recovery paths for the failure modes engineers actually hit. 3 entries.
Symptom · 01
Messages not being consumed — consumer lag growing
Fix
1. Check consumer health: is it running? CPU/memory? 2. Check broker logs for rebalances. 3. Increase partitions and consumers. 4. If using Kafka, run kafka-consumer-groups --describe --group <group> to see lag per partition.
Symptom · 02
Duplicate messages processed
Fix
1. Check if subscriber is idempotent. 2. Verify dedup cache TTL. 3. Check for rebalances causing reprocessing. 4. In Kafka, ensure enable.auto.commit=false and commit manually after processing.
Symptom · 03
Broker OOM or high memory usage
Fix
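1. Check consumer lag: unconsumed messages may be buffered in broker memory. 2. Reduce max in-memory buffer on subscriber. 3. For Redis, set maxmemory and an eviction policy such as allkeys-lru, accepting that evicted data is lost; for native pub/sub, also check client-output-buffer-limit pubsub, which caps how much Redis buffers for a slow subscriber. 4. For Kafka, reduce log retention or increase disk.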
Publish-Subscribe Pattern Triage Cheat Sheet
First-response commands for when things go wrong, copy-paste ready.
Consumer lag growing — `kafka-consumer-groups --describe --group my-group` shows high LAG
Immediate action
Check if consumer is stuck or slow
Commands
kafka-consumer-groups --bootstrap-server localhost:9092 --group my-group --describe
curl http://consumer-host:8080/health
Fix now
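Increase partitions: kafka-topics --bootstrap-server localhost:9092 --alter --topic orders --partitions 10, then add more consumers. Adding partitions changes which partition a key maps to, so per-key ordering is not preserved across the change.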
Duplicate messages: same order processed twice
Immediate action
Check idempotency key in logs
Commands
grep 'duplicate' /var/log/subscriber.log
redis-cli TTL order_placed:12345
Fix now
Increase dedup TTL to 3600 seconds. Ensure idempotency key is set before processing.
Redis OOM: `OOM command not allowed when used memory > 'maxmemory'`
Immediate action
Check memory usage and eviction policy
Commands
redis-cli INFO memory | grep used_memory_human
redis-cli CONFIG GET maxmemory-policy
Fix now
Set maxmemory-policy allkeys-lru: redis-cli CONFIG SET maxmemory-policy allkeys-lru. Reduce subscriber buffer size.
RabbitMQ queue growing: messages not consumed
Immediate action
Check if consumer is connected and not blocked
Commands
rabbitmqctl list_queues name messages consumers
rabbitmqctl list_consumers
Fix now
Restart consumer or increase prefetch count: channel.basicQos(100).
Feature / Aspect | Kafka | Redis Pub/Sub | RabbitMQ
Message Persistence | Yes, on disk | No, in-memory only | Yes, optional
Message Replay | Yes, by offset | No | No (unless using shovel/backup)
Ordering Guarantee | Within partition | No | Within queue
Throughput | Very high (millions/s) | High (hundreds of thousands/s) | Moderate (tens of thousands/s)
Latency | Low (ms) | Very low (sub-ms) | Low (ms)
Consumer Groups | Yes | No (use Redis Streams) | Yes
Routing Flexibility | Topic-based | Channel-based | Exchanges + bindings
Operational Complexity | High (ZooKeeper/KRaft) | Low | Medium

Key takeaways

1. Pub/sub decouples producers and consumers via a broker, enabling async, scalable communication, but adds complexity in monitoring and error handling.
2. Always bound buffers at every layer: broker, client, and application. Unbounded buffers cause OOMs.
3. Idempotency is mandatory for at-least-once delivery. Use a dedup cache with TTL longer than the retry window.
4. Dead-letter queues are your safety net. Never let messages vanish silently. Monitor DLQ depth.
5. Choose your broker based on durability and replay needs: Kafka for replay, Redis for ephemeral, RabbitMQ for flexible routing.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 (Senior): How does Kafka handle backpressure when a consumer is slower than the pr...
Q02 (Senior): When would you choose RabbitMQ over Kafka for a pub/sub system?
Q03 (Senior): What happens when a Kafka consumer crashes mid-processing? How do you pr...
Q04 (Junior): What is the difference between pub/sub and message queues?
Q05 (Senior): A subscriber is processing messages but the broker shows increasing lag....
Q06 (Senior): Design a pub/sub system for a ride-hailing app that needs to broadcast d...

Q01 of 06 (Senior)

How does Kafka handle backpressure when a consumer is slower than the producer?

ANSWER
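Kafka applies no backpressure to the producer. It appends messages to an on-disk log, and a slow consumer simply accumulates lag. Monitor that lag and scale out by adding consumers to the group, adding partitions if you already have one consumer per partition. If lag keeps growing, unread messages eventually age past the retention limit (or the disk fills) and are deleted before the consumer reads them, which is data loss.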
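
The "lag grows until retention bites" part of the answer can be estimated with simple arithmetic. The sketch below assumes constant produce and consume rates, a consumer that starts with zero lag, and purely time-based retention; real retention works on whole log segments, so treat the result as a rough estimate. The function names are made up for illustration.

```javascript
// Backlog in messages after `seconds` of a consumer running slower than the producer.
function lagAfter(produceRate, consumeRate, seconds) {
  return Math.max(0, (produceRate - consumeRate) * seconds);
}

// Seconds until the consumer is reading messages older than the retention window.
// At time t the consumer is on the message produced at t * consumeRate / produceRate,
// so that message's age is t * (1 - consumeRate / produceRate).
function secondsUntilRetentionLoss(produceRate, consumeRate, retentionSeconds) {
  if (consumeRate >= produceRate) return Infinity; // consumer keeps up
  return retentionSeconds / (1 - consumeRate / produceRate);
}

const SEVEN_DAYS = 7 * 24 * 3600;
// Producer at 1000 msg/s, consumer at 500 msg/s, 7-day retention.
console.log(lagAfter(1000, 500, 3600));                                  // 1800000
console.log(secondsUntilRetentionLoss(1000, 500, SEVEN_DAYS) / 86400);   // 14
```

In this example the backlog reaches 1.8 million messages after one hour, and the consumer would start missing data after about 14 days, which is why a lag alert should fire long before retention becomes the limiting factor.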
FAQ · 4 QUESTIONS

Frequently Asked Questions

01. What is the publish-subscribe pattern used for?
02. What's the difference between pub/sub and observer pattern?
03. How do I ensure exactly-once delivery in pub/sub?
04. What happens if a subscriber crashes in Kafka?