Senior 4 min · June 25, 2026

Publish-Subscribe Pattern: Build Async Pipelines That Don't Fall Over at 3 AM

Q: What is the publish-subscribe pattern used for?

It's used to decouple services that produce events from those that consume them. Common uses: notifying multiple services of an event (e.g., order placed), streaming data (e.g., user activity), and building event-driven architectures.

Q: What's the difference between pub/sub and observer pattern?

Observer pattern is in-process: subjects notify observers directly within the same application. Pub/sub is cross-process: publishers and subscribers communicate via a broker, often across different services or machines.

Q: How do I ensure exactly-once delivery in pub/sub?

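Exactly-once delivery is extremely hard in distributed systems. Most brokers offer at-least-once delivery, which you pair with idempotent processing. Kafka provides exactly-once semantics through its idempotent producer and transactional API, but the guarantee covers reads and writes that stay inside Kafka; side effects in external systems (emails, payments, other databases) still need idempotency, and transactions add complexity and latency. For most cases, at-least-once delivery plus idempotent consumers is sufficient.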

Q: What happens if a subscriber crashes in Kafka?

The consumer group rebalances: partitions assigned to the crashed consumer are reassigned to the remaining consumers. Messages that were delivered but whose offsets had not yet been committed are redelivered to the new owner, which is at-least-once behavior. Make processing idempotent so the redelivered messages are harmless.


Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Lessons pulled from things that broke in production.

✓ Production tested · Last updated June 25, 2026 · 1,663 articles, all by Naren


⚡Quick Answer

Pub/sub decouples message producers from consumers via a message broker. Publishers emit events to topics; subscribers consume from those topics. This enables async, scalable, and fault-tolerant communication between services.

Definition (~90s read)

What is Publish-Subscribe Pattern?

The publish-subscribe pattern (pub/sub) is a messaging paradigm where publishers send messages to topics without knowing which subscribers receive them. Subscribers express interest in topics and receive messages asynchronously, decoupling the sender from the receiver.


Plain-English First

Think of a radio station. The station (publisher) broadcasts music on a frequency (topic). Anyone with a radio tuned to that frequency (subscriber) hears the music. The station doesn't know who's listening, and listeners don't affect the broadcast. If you miss a song, you can't replay it unless the station recorded it (persistent subscription).

I've seen a well-intentioned pub/sub system bring down a payments service at 3 AM because the subscriber couldn't keep up and the broker's memory filled up. The team had to restart the broker and replay hours of lost messages. That's the kind of failure that makes you appreciate the pattern's sharp edges.

Pub/sub solves a fundamental problem: how do you let multiple services react to events without coupling them? Before pub/sub, teams used direct HTTP calls or shared databases. Both create tight coupling and single points of failure. A payment service calling a notification service directly means if notifications are down, payments fail. That's unacceptable.

By the end of this article, you'll know exactly when to use pub/sub, how to implement it without the rookie mistakes, and how to debug the three most common production failures. You'll also know when a simple queue or direct call is the smarter choice.

Why Decoupling Matters More Than You Think

Direct coupling between services creates fragility. If service A calls service B directly, and B is slow or down, A fails too. Pub/sub breaks that chain. The publisher doesn't care if subscribers exist or are healthy. It just fires a message and moves on. This lets you scale subscribers independently, add new consumers without touching publishers, and survive subscriber failures without data loss, provided the broker persists messages until they are consumed.

But decoupling comes with a cost: you lose visibility. A direct call returns a response; pub/sub is fire-and-forget. You need monitoring, retries, and dead-letter queues to handle failures. Many teams jump into pub/sub without these, then wonder why messages vanish.

BasicPubSub.systemdesign

// io.thecodeforge — System Design tutorial

// Scenario: Order service publishes 'order_placed' event.
// Notification and Inventory services subscribe.

// Publisher (Order Service)
function publishOrderPlaced(orderId, userId, items) {
  const message = {
    event: 'order_placed',
    orderId: orderId,
    userId: userId,
    items: items,
    timestamp: Date.now()
  };
  // Publish to topic 'orders' — broker handles fan-out
  broker.publish('orders', JSON.stringify(message));
}

// Subscriber 1 (Notification Service)
broker.subscribe('orders', (message) => {
  const event = JSON.parse(message);
  if (event.event === 'order_placed') {
    sendEmail(event.userId, 'Your order is confirmed!');
  }
});

// Subscriber 2 (Inventory Service)
broker.subscribe('orders', (message) => {
  const event = JSON.parse(message);
  if (event.event === 'order_placed') {
    event.items.forEach(item => decrementStock(item.sku, item.quantity));
  }
});

Output

No direct output — messages flow asynchronously. Both subscribers receive the same message independently.
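To try the fan-out behavior without a real broker, here is a minimal sketch of an in-memory stand-in. The `InMemoryBroker` class and its synchronous delivery are assumptions made for illustration; a real broker delivers asynchronously over the network and none of this is a real client library API.

```javascript
// Minimal in-memory broker (illustrative only; not a real client library).
class InMemoryBroker {
  constructor() {
    this.topics = new Map(); // topic name -> array of subscriber callbacks
  }

  subscribe(topic, handler) {
    if (!this.topics.has(topic)) this.topics.set(topic, []);
    this.topics.get(topic).push(handler);
  }

  // Fan out to every subscriber. A failing subscriber is logged and skipped,
  // so it cannot break the publisher or the other subscribers.
  publish(topic, message) {
    const handlers = this.topics.get(topic) || [];
    let delivered = 0;
    for (const handler of handlers) {
      try {
        handler(message);
        delivered += 1;
      } catch (err) {
        console.error(`subscriber on '${topic}' failed: ${err.message}`);
      }
    }
    return delivered;
  }
}

const broker = new InMemoryBroker();
const emails = [];
const stock = { 'sku-1': 10 };

// Notification subscriber
broker.subscribe('orders', (raw) => {
  const event = JSON.parse(raw);
  emails.push(`confirmation for ${event.userId}`);
});

// Inventory subscriber
broker.subscribe('orders', (raw) => {
  const event = JSON.parse(raw);
  for (const item of event.items) stock[item.sku] -= item.quantity;
});

const deliveredCount = broker.publish('orders', JSON.stringify({
  event: 'order_placed',
  orderId: 'o-1',
  userId: 'u-1',
  items: [{ sku: 'sku-1', quantity: 2 }],
}));

console.log(deliveredCount, emails, stock);
```

The publisher only sees the topic name; adding a third subscriber requires no change on the publishing side.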

Production Trap: No Message Validation

Never trust message payloads blindly. Always validate schema on the subscriber side. I've seen a malformed JSON crash a whole fleet of subscribers because the parser threw an unhandled exception.
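One way to apply this is to parse and validate in a function that never throws, so a bad payload becomes a value the subscriber can route to a dead-letter queue. The sketch below is hand-rolled and the field rules (string `orderId`, positive integer `quantity`) are assumptions about the `order_placed` schema, not something the article specifies; a schema library would do the same job.

```javascript
// Validate an incoming 'order_placed' payload without throwing.
// Returns { ok: true, event } or { ok: false, error }.
function parseOrderEvent(raw) {
  let event;
  try {
    event = JSON.parse(raw);
  } catch (err) {
    return { ok: false, error: `invalid JSON: ${err.message}` };
  }
  if (event === null || typeof event !== 'object') {
    return { ok: false, error: 'payload is not an object' };
  }
  if (event.event !== 'order_placed') {
    return { ok: false, error: `unexpected event type: ${event.event}` };
  }
  // Assumed schema: orderId is a non-empty string.
  if (typeof event.orderId !== 'string' || event.orderId.length === 0) {
    return { ok: false, error: 'orderId must be a non-empty string' };
  }
  if (!Array.isArray(event.items) || event.items.length === 0) {
    return { ok: false, error: 'items must be a non-empty array' };
  }
  for (const item of event.items) {
    const valid = item !== null && typeof item === 'object'
      && typeof item.sku === 'string'
      && Number.isInteger(item.quantity) && item.quantity > 0;
    if (!valid) {
      return { ok: false, error: 'each item needs a sku and a positive integer quantity' };
    }
  }
  return { ok: true, event };
}

const good = parseOrderEvent(JSON.stringify({
  event: 'order_placed', orderId: 'o-1', items: [{ sku: 'sku-1', quantity: 1 }],
}));
const malformed = parseOrderEvent('{not json');
const missingItems = parseOrderEvent(JSON.stringify({ event: 'order_placed', orderId: 'o-2' }));

console.log(good.ok, malformed.ok, missingItems.error);
```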

[Figure: Pub/Sub Pattern: Async Pipeline Architecture]

Choosing Your Broker: Kafka vs Redis vs RabbitMQ

Your broker choice defines your failure modes. Kafka is built for high-throughput, persistent, replayable streams. Redis pub/sub is fast but ephemeral — messages are lost if no subscriber is listening. RabbitMQ sits in between: persistent queues, flexible routing, but lower throughput than Kafka.

Here's the rule: Use Kafka when you need message replay and ordering guarantees (e.g., event sourcing, audit logs). Use Redis pub/sub for transient notifications where loss is acceptable (e.g., live dashboards). Use RabbitMQ for reliable task distribution with complex routing (e.g., work queues with multiple bindings).

Don't use Redis pub/sub for anything that must survive a restart. Don't use Kafka for low-latency real-time chat — the overhead is too high.

BrokerComparison.systemdesign

// io.thecodeforge — System Design tutorial

// Kafka: durable, ordered, replayable
// Producer
producer.send(new ProducerRecord<>("orders", orderId, orderJson));
// Consumer (with offset management)
consumer.subscribe(Arrays.asList("orders"));
while (true) {
  ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
  for (ConsumerRecord<String, String> record : records) {
    processOrder(record.value());
  }
  consumer.commitSync(); // commit after processing (with enable.auto.commit=false) for at-least-once
}

// Redis: fast, ephemeral, no persistence
// Publisher
redis.publish("orders", orderJson);
// Subscriber (blocking)
redis.subscribe(new JedisPubSub() {
  public void onMessage(String channel, String message) {
    processOrder(message);
  }
}, "orders");

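// RabbitMQ: reliable, flexible routing
// Publisher (persistent delivery mode so the message survives a broker restart)
channel.basicPublish("orders_exchange", "order.placed",
    MessageProperties.PERSISTENT_TEXT_PLAIN, orderJson.getBytes(StandardCharsets.UTF_8));
// Consumer: autoAck=false, so a message is removed only after processing succeeded
channel.basicConsume("orders_queue", false, (consumerTag, delivery) -> {
  processOrder(new String(delivery.getBody(), StandardCharsets.UTF_8));
  channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
}, consumerTag -> {});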

Output

No direct output — each broker delivers messages to subscribers asynchronously.

Broker Selection Decision Tree

If: Need message replay and ordering across consumers → Use: Kafka
If: Transient notifications, loss acceptable → Use: Redis pub/sub
If: Reliable delivery with complex routing → Use: RabbitMQ
If: Exactly-once semantics required → Use: Kafka with idempotent producer + transactional API
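The decision tree can be written down as a small function, which makes the precedence between rules explicit. This is only a sketch: the flag names, the order in which the rules are checked, and the RabbitMQ default when no flag is set are assumptions, not part of the tree itself.

```javascript
// Encodes the broker decision tree. Rule order is an assumption:
// the strictest requirement wins when several flags are set.
function chooseBroker({
  needsExactlyOnce = false,
  needsReplay = false,
  complexRouting = false,
  lossAcceptable = false,
} = {}) {
  if (needsExactlyOnce) return 'Kafka with idempotent producer + transactional API';
  if (needsReplay) return 'Kafka';
  if (complexRouting) return 'RabbitMQ';
  if (lossAcceptable) return 'Redis pub/sub';
  // Assumed default: when nothing is stated, prefer reliable delivery.
  return 'RabbitMQ';
}

const auditLog = chooseBroker({ needsReplay: true });
const liveDashboard = chooseBroker({ lossAcceptable: true });
const workQueue = chooseBroker({ complexRouting: true });

console.log(auditLog, liveDashboard, workQueue);
```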

Handling Backpressure: Don't Let the Broker Eat Your Memory

The most common pub/sub failure is the subscriber falling behind. Messages pile up on the broker, memory grows, and eventually the broker OOMs or starts dropping messages. This is backpressure, and you must handle it.

Solutions: 1) Use a bounded buffer on the subscriber side — never let the client library buffer unlimited messages. 2) Implement a circuit breaker: if the subscriber's processing queue exceeds a threshold, pause consumption and alert. 3) Use a dead-letter queue for messages that can't be processed after retries.

In Kafka, you can increase partitions and add more consumers to scale horizontally. In Redis, you're limited — consider switching to Redis Streams which support consumer groups and backpressure.

BackpressureHandler.systemdesign

// io.thecodeforge — System Design tutorial

// Subscriber with bounded buffer and circuit breaker.
// broker, alertOps, processOrder and deadLetterQueue are placeholders for your own client code.
const MAX_BUFFER_SIZE = 1000;
const processingQueue = [];
let isPaused = false;
let isDraining = false;

broker.subscribe('orders', (message) => {
  if (processingQueue.length >= MAX_BUFFER_SIZE) {
    // Circuit breaker: pause subscription
    if (!isPaused) {
      broker.pause('orders');
      isPaused = true;
      alertOps('Order subscriber buffer full, pausing consumption');
    }
    // Reject message; the broker is expected to redeliver or dead-letter it
    return false;
  }
  processingQueue.push(message);
  drain(); // no-op if a drain loop is already running
  return true;
});

// Single drain loop: one message at a time, so concurrency stays bounded
// no matter how fast messages arrive.
async function drain() {
  if (isDraining) return;
  isDraining = true;
  try {
    while (processingQueue.length > 0) {
      const message = processingQueue.shift();
      try {
        await processOrder(message);
      } catch (err) {
        // Send to dead-letter queue
        deadLetterQueue.send(message);
      }
    }
    // Buffer is empty again: resume consumption if we had paused it
    if (isPaused) {
      broker.resume('orders');
      isPaused = false;
    }
  } finally {
    isDraining = false;
  }
}

Output

No direct output — backpressure handling is internal. Alerts fire when buffer is full.

Senior Shortcut: Monitor Consumer Lag

Always monitor consumer lag (Kafka: kafka-consumer-groups --describe). Set alerts when lag exceeds a threshold (e.g., 10,000 messages). That's your early warning before OOM.
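Lag is simply the log end offset minus the committed offset, per partition. The sketch below shows that arithmetic and the threshold check; the offset maps are hand-written stand-ins for what `kafka-consumer-groups --describe` reports, and treating a partition with no committed offset as offset 0 is an assumption.

```javascript
// lag per partition = log end offset - committed offset
function computeLag(endOffsets, committedOffsets) {
  const perPartition = {};
  let total = 0;
  for (const [partition, endOffset] of Object.entries(endOffsets)) {
    // Assumption: a partition with no committed offset counts from 0.
    const committed = committedOffsets[partition] ?? 0;
    const lag = Math.max(0, endOffset - committed);
    perPartition[partition] = lag;
    total += lag;
  }
  return { perPartition, total };
}

// Threshold defaults to the 10,000 messages suggested above.
function shouldAlert(lagReport, threshold = 10000) {
  return lagReport.total > threshold;
}

const report = computeLag(
  { 0: 15000, 1: 4200 }, // log end offsets
  { 0: 3000, 1: 4200 },  // committed offsets for the consumer group
);

console.log(report, shouldAlert(report));
```

Partition 0 carries all the lag here, which usually points at a hot key or a stuck consumer rather than a general capacity problem.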

[Figure: Backpressure Cascade]

Idempotency: Because At-Least-Once Means Duplicates

Most pub/sub systems guarantee at-least-once delivery. That means your subscriber will see the same message more than once — during retries, rebalances, or broker failovers. If you're not idempotent, you'll double-charge customers, send duplicate emails, or create duplicate database records.

Solution: Every message carries a unique idempotency key (e.g., UUID). The subscriber checks a dedup cache (Redis with TTL) before processing. If the key exists, skip. If not, process and store the key with a TTL longer than the maximum possible duplicate window.

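Don't rely on database unique constraints alone: left unhandled, they surface as constraint-violation errors, and they do nothing for side effects outside the database such as emails. Keep them as a safety net behind the dedup check.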

IdempotentSubscriber.systemdesign

// io.thecodeforge — System Design tutorial

// Idempotent subscriber using a Redis dedup cache.
// Assumes an ioredis-style client where set(key, value, 'EX', ttl, 'NX') resolves to 'OK' or null.
const dedupCache = new RedisClient();
const DEDUP_TTL_SECONDS = 3600; // 1 hour, covers the retry window

broker.subscribe('orders', async (message) => {
  const event = JSON.parse(message);
  const idempotencyKey = event.id; // e.g., 'order_placed:12345'

  // Atomically claim the key. SET ... NX succeeds only for the first delivery,
  // so two concurrent duplicates cannot both pass the check.
  const claimed = await dedupCache.set(idempotencyKey, '1', 'EX', DEDUP_TTL_SECONDS, 'NX');
  if (claimed !== 'OK') {
    logger.info(`Skipping duplicate message ${idempotencyKey}`);
    return;
  }

  try {
    // Process the message
    await processOrder(event);
  } catch (err) {
    // Release the claim so a redelivery can retry the work
    await dedupCache.del(idempotencyKey);
    throw err;
  }
});

async function processOrder(event) {
  // Database insert with ON CONFLICT DO NOTHING as safety net
  await db.query(`
    INSERT INTO orders (id, user_id, items, created_at)
    VALUES ($1, $2, $3, $4)
    ON CONFLICT (id) DO NOTHING
  `, [event.orderId, event.userId, JSON.stringify(event.items), new Date(event.timestamp)]);
}

Output

No direct output — duplicate messages are silently skipped.

The Classic Bug: TTL Too Short

Set your dedup TTL longer than the maximum retry window. If your broker retries for 30 minutes, set TTL to at least 1 hour. I've seen TTL set to 5 minutes, and duplicates slipped through during a 10-minute outage.
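The failure is easy to reproduce with a fake clock. The sketch below uses an in-memory `DedupCache` (a hypothetical stand-in for the Redis cache) and replays the scenario above: a 5-minute TTL versus a 60-minute TTL, with a redelivery arriving after a 10-minute outage.

```javascript
// In-memory dedup cache with TTL and an injectable clock (illustrative only).
class DedupCache {
  constructor(ttlMs, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.expiry = new Map(); // key -> expiry timestamp in ms
  }

  // Returns true if the caller is the first to claim the key,
  // false if a live (unexpired) claim already exists.
  claim(key) {
    const t = this.now();
    const expiresAt = this.expiry.get(key);
    if (expiresAt !== undefined && expiresAt > t) return false;
    this.expiry.set(key, t + this.ttlMs);
    return true;
  }
}

const MINUTE = 60 * 1000;
let clock = 0;
const shortTtl = new DedupCache(5 * MINUTE, () => clock);
const longTtl = new DedupCache(60 * MINUTE, () => clock);

// First delivery: both caches accept the message.
const firstShort = shortTtl.claim('order_placed:12345');
const firstLong = longTtl.claim('order_placed:12345');

// The broker redelivers after a 10-minute outage.
clock = 10 * MINUTE;
const redeliveredShort = shortTtl.claim('order_placed:12345'); // claim expired: duplicate gets processed
const redeliveredLong = longTtl.claim('order_placed:12345');   // claim still live: duplicate is skipped

console.log({ firstShort, firstLong, redeliveredShort, redeliveredLong });
```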

[Figure: At-Least-Once vs Idempotent]

Dead-Letter Queues: Where Messages Go to Die (or Be Resurrected)

Not all messages can be processed. Maybe the downstream service is down, the data is corrupt, or a bug in your subscriber. If you keep retrying forever, you'll clog the queue and block other messages. Enter the dead-letter queue (DLQ): a separate queue for messages that failed after a maximum number of retries.

Configure your broker to send messages to a DLQ after N retries. Then have a separate process that periodically replays DLQ messages (after fixing the issue) or alerts a human. Never let messages vanish silently.

In Kafka, you can use a DLQ topic and a separate consumer. In RabbitMQ, use dead letter exchanges. In Redis Streams, you'll need to implement it manually.
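Independent of the broker, the core logic is "retry up to N times, then park the message with enough context to debug it". Below is a minimal, broker-free sketch of that loop. `processWithRetry` and the shape of the dead-letter record are hypothetical, the DLQ is a plain array, and there is no backoff between attempts, which a real consumer would want.

```javascript
// Try the handler once plus up to maxRetries retries; on exhaustion,
// park the message in deadLetters together with the last error.
function processWithRetry(message, handler, deadLetters, maxRetries = 3) {
  let lastError = null;
  for (let attempt = 0; attempt <= maxRetries; attempt += 1) {
    try {
      handler(message);
      return { status: 'processed', attempts: attempt + 1 };
    } catch (err) {
      lastError = err;
    }
  }
  deadLetters.push({
    message,
    error: lastError.message,
    attempts: maxRetries + 1,
    failedAt: Date.now(),
  });
  return { status: 'dead-lettered', attempts: maxRetries + 1 };
}

const deadLetters = [];

// Transient failure: succeeds on the third attempt.
let calls = 0;
const flaky = () => {
  calls += 1;
  if (calls < 3) throw new Error('downstream unavailable');
};
const flakyResult = processWithRetry({ orderId: 'o-1' }, flaky, deadLetters);

// Poison message: fails every time and ends up in the DLQ.
const poisonResult = processWithRetry(
  { orderId: 'o-2' },
  () => { throw new Error('corrupt payload'); },
  deadLetters,
);

console.log(flakyResult, poisonResult, deadLetters.length);
```

Storing the error and attempt count next to the message is what makes a later replay or manual inspection practical.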

DLQSetup.systemdesign

// io.thecodeforge — System Design tutorial

// RabbitMQ: setup dead letter exchange
channel.exchangeDeclare("orders.dlx", "direct", true);
channel.queueDeclare("orders.dlq", true, false, false, null);
channel.queueBind("orders.dlq", "orders.dlx", "order.failed");

// Main queue with DLX config. "x-max-retries" is not a RabbitMQ queue argument
// (quorum queues offer "x-delivery-limit"), so the consumer counts retries itself.
Map<String, Object> args = new HashMap<>();
args.put("x-dead-letter-exchange", "orders.dlx");
args.put("x-dead-letter-routing-key", "order.failed");
channel.queueDeclare("orders_queue", true, false, false, args);

final int MAX_RETRIES = 3;

// Consumer with retry count carried in a custom "x-retry-count" header
channel.basicConsume("orders_queue", false, (consumerTag, delivery) -> {
  long tag = delivery.getEnvelope().getDeliveryTag();
  try {
    processOrder(new String(delivery.getBody(), StandardCharsets.UTF_8));
    channel.basicAck(tag, false);
  } catch (Exception e) {
    Map<String, Object> headers = delivery.getProperties().getHeaders(); // may be null
    Object raw = headers == null ? null : headers.get("x-retry-count");
    int retryCount = raw instanceof Number ? ((Number) raw).intValue() : 0;
    if (retryCount < MAX_RETRIES) {
      // A plain requeue redelivers the original headers unchanged, so republish
      // a copy with the incremented count and ack the original.
      Map<String, Object> newHeaders = headers == null ? new HashMap<>() : new HashMap<>(headers);
      newHeaders.put("x-retry-count", retryCount + 1);
      AMQP.BasicProperties props = delivery.getProperties().builder().headers(newHeaders).build();
      channel.basicPublish("", "orders_queue", props, delivery.getBody());
      channel.basicAck(tag, false);
    } else {
      // Reject without requeue: the broker routes the message to orders.dlx
      channel.basicReject(tag, false);
    }
  }
}, consumerTag -> {});

Output

No direct output — messages exceeding retries end up in orders.dlq.

Interview Gold: DLQ Monitoring

Always monitor DLQ depth. A growing DLQ means something is broken. Set an alert on DLQ message count > 0. In an interview, mention that you'd build a dashboard showing DLQ age and content for quick debugging.

When Not to Use Pub/Sub: The Overkill Trap

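Pub/sub is not always the answer. If you have a single consumer and need a response, use a direct call or a request-reply pattern. If a state change and its event must never diverge, consider a transactional outbox: write the event to the database in the same transaction as the state change and relay it to the broker afterwards. The outbox still delivers at-least-once, so consumers must stay idempotent.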

I've seen teams use Kafka to send a notification email — that's a sledgehammer for a nail. A simple queue (RabbitMQ) or even an HTTP call with retries would be simpler and cheaper.

Also avoid pub/sub for low-latency (<10ms) interactions. The broker hop adds latency. For real-time gaming or trading, consider direct WebSocket connections or UDP multicast.

WhenToAvoidPubSub.systemdesign

// io.thecodeforge — System Design tutorial

// Scenario: User registration needs to send welcome email.
// Overkill: Kafka topic with one partition and one consumer
// Better: Direct HTTP call to email service with retry

async function registerUser(userData) {
  const user = await db.insertUser(userData);

  try {
    // Direct call with retry: simpler, and the caller learns the outcome.
    // retry is a placeholder helper that rejects once retries are exhausted.
    await retry(
      () => emailService.sendWelcomeEmail(user.email),
      { retries: 3, backoff: 'exponential' }
    );
  } catch (err) {
    // Fallback: queue for later processing
    await backgroundJobQueue.add('send_email', { userId: user.id });
  }

  return user;
}

Output

No direct output — the email is sent synchronously with retry.

Never Do This: Pub/Sub for Request-Reply

Don't use pub/sub to implement RPC. You'll end up with correlation IDs, temporary reply queues, and timeouts — a poor man's HTTP. Use a proper RPC framework or direct calls instead.

Monitoring and Debugging: The Blind Spot

Pub/sub systems are notoriously hard to debug because messages flow asynchronously. You need end-to-end tracing. Attach a unique trace ID to every message at the publisher, and propagate it through all subscribers. Use distributed tracing tools (Jaeger, Zipkin) to visualize the flow.

Also log every message arrival and processing outcome. Without logs, you're blind. I've spent hours chasing a missing message only to find it was published to the wrong topic due to a typo.

Set up metrics: publish rate, consume rate, lag, error rate, DLQ depth. Alert on anomalies.
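As a starting point, the counters behind those metrics can be this small. The sketch is deliberately simplified and its names are hypothetical: it tracks totals rather than rates over a time window, and it approximates backlog as published minus handled, which only holds when one process sees both sides. In production you would export these to your metrics system instead.

```javascript
// Tiny in-process metrics registry (illustrative only).
function createMetrics() {
  const counts = { published: 0, consumed: 0, failed: 0, deadLettered: 0 };
  return {
    increment(name) {
      if (!Object.prototype.hasOwnProperty.call(counts, name)) {
        throw new Error(`unknown metric: ${name}`);
      }
      counts[name] += 1;
    },
    snapshot() {
      const handled = counts.consumed + counts.failed;
      return {
        ...counts,
        backlog: counts.published - handled, // messages published but not yet handled
        errorRate: handled === 0 ? 0 : counts.failed / handled,
      };
    },
  };
}

const metrics = createMetrics();
for (let i = 0; i < 10; i += 1) metrics.increment('published');
for (let i = 0; i < 6; i += 1) metrics.increment('consumed');
for (let i = 0; i < 2; i += 1) metrics.increment('failed');
metrics.increment('deadLettered');

const snapshot = metrics.snapshot();
console.log(snapshot);
```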

TracingSetup.systemdesign

// io.thecodeforge — System Design tutorial

// Publisher: attach trace ID
const traceId = uuidv4();
const message = {
  event: 'order_placed',
  orderId: orderId,
  traceId: traceId,
  timestamp: Date.now()
};
logger.info(`Publishing order_placed`, { traceId, orderId });
broker.publish('orders', JSON.stringify(message));

// Subscriber: log arrival and processing
broker.subscribe('orders', async (rawMessage) => {
  const message = JSON.parse(rawMessage);
  const { traceId, orderId } = message;
  logger.info(`Received order_placed`, { traceId, orderId });

  const start = Date.now();
  try {
    await processOrder(message); // await so async failures reach the catch block
    logger.info(`Processed order_placed`, { traceId, orderId, duration: Date.now() - start });
  } catch (err) {
    logger.error(`Failed to process order_placed`, { traceId, orderId, error: err.message });
  }
});

Output

Logs with traceId allow correlating publisher and subscriber events.

Production Trap: No Logging on Publish

Always log when you publish a message, including the topic and message ID. Otherwise, when a message goes missing, you can't tell if it was never published or lost in transit.
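One way to make this hard to forget is to wrap the publish call so logging and ID assignment happen in a single place. The sketch below assumes a broker with a `publish(topic, body)` method and a logger with `info(message, fields)`; `createTracedPublisher` and the `messageId` field are hypothetical names, and the fakes exist only so the example runs on its own.

```javascript
const { randomUUID } = require('crypto');

// Returns a publish function that stamps every message with a messageId and
// traceId, and logs the topic and IDs before handing the message to the broker.
function createTracedPublisher(broker, logger) {
  return function publish(topic, payload, traceId = randomUUID()) {
    const envelope = { ...payload, messageId: randomUUID(), traceId };
    logger.info('publishing', { topic, messageId: envelope.messageId, traceId });
    broker.publish(topic, JSON.stringify(envelope));
    return envelope;
  };
}

// Fakes standing in for a real broker client and logger.
const sent = [];
const logLines = [];
const fakeBroker = { publish: (topic, body) => sent.push({ topic, body }) };
const fakeLogger = { info: (msg, fields) => logLines.push({ msg, ...fields }) };

const publish = createTracedPublisher(fakeBroker, fakeLogger);
const parent = publish('orders', { event: 'order_placed', orderId: 'o-1' });
// A downstream event reuses the parent's traceId so the whole flow can be correlated.
const child = publish('shipments', { event: 'shipment_requested', orderId: 'o-1' }, parent.traceId);

console.log(logLines);
```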

Production incident · Post-mortem · Severity: high

The 4GB Container That Kept Dying

Symptom

A microservice processing order events kept crashing with OOMKilled every 30 minutes. The container had 4GB RAM limit.

Assumption

The team assumed a memory leak in the subscriber code.

Root cause

The subscriber was reading from Redis with a blocking BRPOP, which means the events sat in a Redis list rather than a true pub/sub channel. When the subscriber couldn't keep up, messages accumulated in Redis memory. The subscriber's client library also had an unbounded in-memory buffer (default 10,000 messages). Combined, they consumed 3.8GB before the OOM kill.

Fix

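Capped the subscriber's in-memory buffer (e.g., 1000 messages). Configured Redis maxmemory-policy allkeys-lru so Redis evicts keys instead of refusing writes; eviction discards data, so it only suits data you can afford to lose. Added a circuit breaker that pauses the subscription when the buffer exceeds 80% of capacity.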

Key lesson

Always bound every buffer in your pipeline — broker, client, and application.
Unbounded buffers are landmines.
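A bounded buffer with a pause signal at 80% of capacity, as in the fix above, can be sketched in a few lines. The `BoundedBuffer` class and its method names are hypothetical; the point is that `offer` refuses work when full instead of growing, and `shouldPause` gives the circuit breaker its trigger.

```javascript
// Fixed-capacity FIFO buffer with a high watermark (illustrative only).
class BoundedBuffer {
  constructor(capacity, highWatermarkRatio = 0.8) {
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new RangeError('capacity must be a positive integer');
    }
    this.capacity = capacity;
    this.highWatermark = Math.floor(capacity * highWatermarkRatio);
    this.items = [];
  }

  // Returns false when full; the caller must reject or nack the message.
  offer(item) {
    if (this.items.length >= this.capacity) return false;
    this.items.push(item);
    return true;
  }

  // Removes and returns the oldest item, or undefined when empty.
  poll() {
    return this.items.shift();
  }

  get size() {
    return this.items.length;
  }

  // True once the buffer has reached the high watermark: time to pause consumption.
  shouldPause() {
    return this.items.length >= this.highWatermark;
  }
}

const buffer = new BoundedBuffer(10);
for (let i = 0; i < 7; i += 1) buffer.offer(`m${i}`);
const pauseAtSeven = buffer.shouldPause();
buffer.offer('m7');
const pauseAtEight = buffer.shouldPause();
buffer.offer('m8');
buffer.offer('m9');
const acceptedWhenFull = buffer.offer('m10');

console.log(pauseAtSeven, pauseAtEight, acceptedWhenFull, buffer.size);
```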

Production debug guide

Systematic recovery paths for the failure modes engineers actually hit. 3 entries.

Symptom 01: Messages not being consumed, consumer lag growing

Fix:
1. Check consumer health: is it running? How are CPU and memory?
2. Check broker logs for rebalances.
3. Increase partitions and consumers.
4. If using Kafka, run kafka-consumer-groups --bootstrap-server <broker> --describe --group <group> to see lag per partition.

Symptom 02: Duplicate messages processed

Fix:
1. Check if subscriber is idempotent.
2. Verify dedup cache TTL.
3. Check for rebalances causing reprocessing.
4. In Kafka, ensure enable.auto.commit=false and commit manually after processing.

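Symptom 03: Broker OOM or high memory usage

Fix:
1. Check consumer lag: a large lag means messages are buffered in broker memory.
2. Reduce max in-memory buffer on subscriber.
3. For Redis pub/sub, cap per-subscriber buffers with client-output-buffer-limit pubsub. maxmemory-policy allkeys-lru only governs key eviction and discards data when it triggers.
4. For Kafka, reduce log retention or increase disk.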
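The duplicate-message symptom follows directly from committing offsets after processing: a crash between the two steps means the work is done but the offset is not recorded. The simulation below is a sketch with a plain array as the partition log; `runConsumer` and its crash parameter are hypothetical and no Kafka client is involved.

```javascript
// Simulated consumer that commits an offset only after a whole batch succeeded.
// crashAfterMessages lets the demo kill the consumer mid-batch.
function runConsumer(log, committedOffset, batchSize, handler, crashAfterMessages = Infinity) {
  let committed = committedOffset;
  let handled = 0;
  while (committed < log.length) {
    const batch = log.slice(committed, committed + batchSize);
    for (const message of batch) {
      if (handled === crashAfterMessages) {
        // Crash: the offsets for this batch were never committed.
        return { committed, crashed: true };
      }
      handler(message);
      handled += 1;
    }
    committed += batch.length; // commit after processing (at-least-once)
  }
  return { committed, crashed: false };
}

const log = ['m0', 'm1', 'm2', 'm3'];
const seen = [];

// First run crashes after handling three of the four messages in the batch.
const firstRun = runConsumer(log, 0, 4, (m) => seen.push(m), 3);
// The replacement consumer starts from the last committed offset.
const secondRun = runConsumer(log, firstRun.committed, 4, (m) => seen.push(m));

console.log(firstRun, secondRun, seen);
```

Nothing is lost, but `m0` to `m2` are handled twice, which is why the idempotency check comes first in the list above.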

Publish-Subscribe Pattern Triage Cheat Sheet

First-response commands for when things go wrong, copy-paste ready.

Consumer lag growing: `kafka-consumer-groups --describe --group my-group` shows high LAG

Immediate action: Check if consumer is stuck or slow

Commands:

kafka-consumer-groups --bootstrap-server localhost:9092 --group my-group --describe

curl http://consumer-host:8080/health

Fix now: Increase partitions with kafka-topics --bootstrap-server localhost:9092 --alter --topic orders --partitions 10, then add more consumers. Adding partitions changes which partition a key maps to, so per-key ordering is not preserved across the change.

Duplicate messages: same order processed twice

Redis OOM: `OOM command not allowed when used memory > 'maxmemory'`

RabbitMQ queue growing: messages not consumed

Feature / Aspect	Kafka	Redis Pub/Sub	RabbitMQ
Message Persistence	Yes, on disk	No, in-memory only	Yes, optional
Message Replay	Yes, by offset	No	No for classic queues (unless using shovel/backup); RabbitMQ Streams support replay
Ordering Guarantee	Within partition	No	Within queue
Throughput	Very high (millions/s)	High (hundreds of thousands/s)	Moderate (tens of thousands/s)
Latency	Low (ms)	Very low (sub-ms)	Low (ms)
Consumer Groups	Yes	No (use Redis Streams)	Yes
Routing Flexibility	Topic-based	Channel-based	Exchanges + bindings
Operational Complexity	High (ZooKeeper/KRaft)	Low	Medium

Key takeaways

Pub/sub decouples producers and consumers via a broker, enabling async, scalable communication, but adds complexity in monitoring and error handling.

Always bound buffers at every layer: broker, client, and application. Unbounded buffers cause OOMs.

Idempotency is mandatory for at-least-once delivery. Use a dedup cache with TTL longer than the retry window.

Dead-letter queues are your safety net. Never let messages vanish silently. Monitor DLQ depth.

Choose your broker based on durability and replay needs: Kafka for replay, Redis for ephemeral, RabbitMQ for flexible routing.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 (Senior): How does Kafka handle backpressure when a consumer is slower than the pr...

Q02 (Senior): When would you choose RabbitMQ over Kafka for a pub/sub system?

Q03 (Senior): What happens when a Kafka consumer crashes mid-processing? How do you pr...

Q04 (Junior): What is the difference between pub/sub and message queues?

Q05 (Senior): A subscriber is processing messages but the broker shows increasing lag....

Q06 (Senior): Design a pub/sub system for a ride-hailing app that needs to broadcast d...

Q01 of 06 (Senior)

How does Kafka handle backpressure when a consumer is slower than the producer?

ANSWER

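Kafka applies no backpressure to the producer. Consumers pull at their own pace while the broker keeps appending messages to its on-disk log, so a slow consumer shows up as growing consumer lag. Monitor lag and scale out by adding consumers, up to one per partition, and add partitions if you need more parallelism. If lag outlasts the topic's retention settings, the broker deletes messages the consumer has not read yet, which is data loss for that consumer; with generous retention the risk shifts to running out of disk.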
