Junior 15 min · March 05, 2026

Message Queue Components — The Auto-ack Data Loss Gotcha

Q: What is the difference between a message queue and an event streaming platform like Kafka?

A message queue like RabbitMQ deletes a message after a consumer acknowledges it. The message exists only until it is processed — then it is gone. This is ideal for work distribution: process this payment, send this email. Kafka retains all messages as an immutable log for a configurable retention period. Any consumer group can replay from any offset at any time without affecting other groups. Use a queue for task distribution where exactly one service should process each message. Use Kafka for event streaming, audit logs, or any use case where replay capability or multiple independent consumers on the same event stream are requirements.

Q: What happens if a consumer crashes before acknowledging a message?

With manual acknowledgement, the broker detects the lost connection (a closed TCP socket or a missed heartbeat) and automatically redelivers the unacknowledged message to another available consumer. This is at-least-once delivery: it prevents loss of the in-flight message but means the same message can be received more than once. This is why every consumer performing non-idempotent operations must check the event_id against a processed_events table before acting. A consumer that does not handle duplicates will produce incorrect side effects on redelivery, such as double charges or duplicate emails.

Q: Can I use a database as a message queue?

You can, but it does not scale and creates operational problems at volume. Databases are optimised for transactional reads and updates — not for the constant write-then-delete pattern of queue consumption, which causes index fragmentation, table bloat, and lock contention under load. Polling-based database queues also introduce artificial latency. That said, the Outbox pattern uses a database table as a transient event store — not as a queue itself, but as an atomicity bridge between a business transaction and a real message broker. The distinction matters: the outbox table is a staging area, not the delivery mechanism.

Q: What is the Outbox pattern and when should I use it?

The Outbox pattern writes events to a database table inside the same transaction as your business logic. A separate poller process reads that table and publishes to the message queue, then marks the row as sent. Use it whenever a database write and a queue publish must succeed or fail together atomically. Without it, the window between a successful database commit and the subsequent queue publish can contain a process crash — leaving you with data that exists but no downstream service will ever know about. The trade-off is additional latency equal to the poller's interval and an outbox table that needs periodic cleanup, but for payment events, order creation, and any operation with financial or compliance implications, that trade-off is always worth making.

auto_ack=True deletes messages before processing completes, causing double charges from payment gateway.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Notes here come from systems that actually shipped.


Last updated: May 24, 2026


⚡Quick Answer

A message queue decouples services by letting a producer drop a message and move on — the consumer picks it up whenever it's ready, regardless of whether the producer is still running
Producers serialize and publish; brokers store and route; consumers process and acknowledge — these three form the core pipeline and each has independent failure modes
Ack after success, Nack with requeue=False to route poison messages to the DLQ — never requeue a message that will never succeed
RabbitMQ deletes messages after ACK (work queue model); Kafka retains them as an immutable, time-ordered log (streaming model) — the architectural difference is significant, not cosmetic
Prefetch count = 1 prevents one fast consumer from hoarding all in-flight messages while slower consumers sit idle with nothing to process
The biggest trap: auto-ack means the broker deletes the message the instant it is delivered, before your code does anything — a crash mid-processing loses it permanently with zero trace

✦ Definition

What Is a Message Queue?

Message queues are the backbone of asynchronous, decoupled communication in distributed systems. They solve the fundamental problem of one service needing to reliably hand off work to another without blocking or requiring both to be available simultaneously.


Instead of a producer directly calling a consumer via HTTP (which couples them in time and availability), the producer writes a message to a broker, and the consumer reads it when ready. This pattern enables load leveling, fault tolerance, and independent scaling of services.

However, the abstraction hides a critical failure mode: automatic acknowledgment (auto-ack) of messages upon delivery to the consumer, before the consumer has actually processed the message. If the consumer crashes after receiving but before processing, that message is lost forever — the broker considers it delivered and removes it.

This is the auto-ack data loss gotcha that silently corrupts pipelines in production.

In the ecosystem, message queues sit between your application services and your data stores, acting as a durable buffer. They are not databases — you should not query them for historical state — and they are not event streams (though Kafka blurs this line).

Use a queue when you need guaranteed delivery of discrete work items to exactly one consumer (competing consumers pattern). Use a stream when you need replayable, ordered logs consumed by multiple subscribers. RabbitMQ is the classic smart broker: it handles routing, acknowledgments, and redelivery logic server-side, making it ideal for complex routing and transactional workloads.

Kafka takes the opposite approach, often summarised as a dumb broker with smart consumers: the broker stores an immutable log and leaves offset tracking and idempotency to the consumer, which suits high-throughput replay and event sourcing. As rough orders of magnitude, RabbitMQ handles tens of thousands of messages per second per node and Kafka handles millions, at the cost of operational complexity and a larger infrastructure footprint.
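To make the log-versus-queue distinction concrete, here is a minimal in-memory sketch of an append-only log with one offset per consumer group. The class and method names (ToyLog, poll, seek) are hypothetical and chosen for illustration; this is not the Kafka client API, only a model of the idea that reading never deletes anything.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy append-only log with per-group offsets (hypothetical names, not the Kafka client API). */
class ToyLog {
    private final List<String> events = new ArrayList<>();
    private final Map<String, Integer> committedOffsets = new HashMap<>();

    void append(String event) {
        events.add(event);
    }

    /** Returns everything after the group's offset and advances it. Nothing is deleted. */
    List<String> poll(String group) {
        int from = committedOffsets.getOrDefault(group, 0);
        List<String> batch = new ArrayList<>(events.subList(from, events.size()));
        committedOffsets.put(group, events.size());
        return batch;
    }

    /** Rewinds one group so it can replay history without affecting any other group. */
    void seek(String group, int offset) {
        committedOffsets.put(group, offset);
    }

    public static void main(String[] args) {
        ToyLog log = new ToyLog();
        log.append("ORDER_PLACED 1");
        log.append("ORDER_PLACED 2");
        System.out.println("billing:          " + log.poll("billing"));
        System.out.println("analytics:        " + log.poll("analytics"));
        log.seek("analytics", 0); // replay from the beginning
        System.out.println("analytics replay: " + log.poll("analytics"));
        System.out.println("billing again:    " + log.poll("billing"));
    }
}
```

In this model both groups see both events, the analytics group sees them again after the seek, and the final billing poll returns an empty list because its own offset was never moved.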

When NOT to use a message queue: if your system can tolerate synchronous RPC (e.g., internal microservices on the same network with low latency requirements), adding a queue is premature complexity. If you need exactly-once processing (not just delivery), you must build idempotency into your consumer regardless of the broker — no queue guarantees exactly-once processing without application-level deduplication.

And if your workload is purely batch processing (e.g., nightly ETL), a queue adds unnecessary overhead compared to a simple job scheduler like Airflow or a cron-triggered script. The auto-ack gotcha is most dangerous in high-throughput pipelines where developers assume the broker handles reliability — it doesn't.

You must manually acknowledge messages only after your consumer has durably persisted the result, and you must design your consumer to be idempotent so that redelivered messages don't corrupt state. Dead-letter queues (DLQs) are your safety net: when a consumer fails repeatedly, the message moves to a DLQ instead of being lost or retried infinitely, allowing you to inspect and replay failures without blocking the main pipeline.
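The dead-letter behaviour described above can be sketched without a broker. The example below is a simplified model, assuming a retry budget of three attempts (MAX_ATTEMPTS is an illustrative value, not a broker default) and tracking attempts by message text; a real system would typically track them by message ID or a delivery-count header.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class DeadLetterSketch {
    static final int MAX_ATTEMPTS = 3; // assumed retry budget, for illustration only

    /** Drains the main queue; messages that keep failing are parked in deadLetters. */
    static List<String> drain(Deque<String> mainQueue,
                              Predicate<String> handler,
                              List<String> deadLetters) {
        Map<String, Integer> attempts = new HashMap<>();
        List<String> processed = new ArrayList<>();
        while (!mainQueue.isEmpty()) {
            String message = mainQueue.pollFirst();
            int attempt = attempts.merge(message, 1, Integer::sum);
            if (handler.test(message)) {
                processed.add(message);       // success: the equivalent of an ack
            } else if (attempt < MAX_ATTEMPTS) {
                mainQueue.addLast(message);   // failure with budget left: retry later
            } else {
                deadLetters.add(message);     // budget spent: park it for inspection
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        Deque<String> queue =
            new ArrayDeque<>(Arrays.asList("order-1", "malformed", "order-2"));
        List<String> dlq = new ArrayList<>();
        List<String> done = drain(queue, m -> !m.equals("malformed"), dlq);
        System.out.println("processed=" + done + " deadLettered=" + dlq);
        // prints: processed=[order-1, order-2] deadLettered=[malformed]
    }
}
```

The point of the sketch is that the poison message stops blocking the main queue after a bounded number of attempts and remains available for inspection instead of being dropped.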

Plain-English First

Imagine a busy restaurant kitchen. Waiters (producers) do not cook food themselves — they write orders on tickets and pin them to a rotating wheel (the broker and queue). Cooks (consumers) grab tickets when they are free, prepare the dish, and discard the ticket when it is done. The waiter never stands at the stove waiting. The cook never chases waiters around the dining room. If the kitchen gets backed up, the wheel just holds more tickets without slowing down the front of house. That rotating ticket wheel is a message queue — it decouples the people taking orders from the people fulfilling them, so both sides work at their own pace and neither blocks the other.

Modern software systems are rarely a single monolith running by itself. They are a constellation of services — payments, notifications, inventory, analytics — all needing to communicate. When service A calls service B synchronously and B is slow, A slows down too. When B crashes, A crashes or returns an error to the user. That tight coupling is a hidden time bomb in nearly every high-traffic system, and message queues are the first tool experienced engineers reach for.

A message queue solves one fundamental problem: it lets two services communicate without either needing to know the other's current availability or processing speed. The sender drops a message into the queue and moves on immediately. The receiver picks it up whenever it is ready. This tiny architectural shift unlocks fault tolerance, horizontal scalability, and independent deployability — the three properties every distributed system must eventually achieve.

By the end of this guide you will understand every moving part inside a message queue system — producers, brokers, queues versus topics, consumers, consumer groups, acknowledgements, and dead-letter queues — well enough to design one from scratch on a whiteboard, explain the trade-offs clearly under interview pressure, and avoid the subtle bugs that trip up engineers who only understand the happy path.

In 2026, message queues underpin everything from payment processing to LLM inference pipelines. The fundamentals have not changed, but the operational expectations have — consumers are expected to be idempotent by default, DLQs are expected to have alerting from day one, and the choice between RabbitMQ and Kafka is increasingly driven by replay requirements rather than throughput alone.

What Message Queues Actually Do — And Where They Break

A message queue is a durable buffer between producers and consumers, decoupling their execution timelines. Producers enqueue messages; consumers poll or subscribe to receive them. The core mechanic is asynchronous handoff: the producer doesn't wait for the consumer to finish processing. This turns a synchronous RPC into a fire-and-forget pattern, but only if the queue guarantees at-least-once delivery.

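In practice, a message queue provides three properties that matter: ordering (FIFO or not), persistence (in-memory vs. disk-backed), and acknowledgment semantics. Many brokers and client libraries offer a mode that acknowledges a message as soon as it is delivered to a consumer, and that is the gotcha. If the consumer crashes after the auto-ack but before completing its work, the message is lost forever. RabbitMQ's auto-ack mode behaves this way, and Kafka's auto-commit can do the same when an offset is committed before the records it covers have been processed. SQS is the exception: it never deletes a message until the consumer calls DeleteMessage, so a visibility timeout that is too short causes duplicate deliveries rather than loss. On RabbitMQ, explicitly disable auto-ack and acknowledge manually only after successful processing.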

Use a message queue when you need to absorb traffic spikes, decouple microservices, or guarantee delivery across failures. It's not for low-latency request-response — that's what HTTP or gRPC is for. The real value is resilience: a queue lets you survive consumer restarts, backpressure, and partial outages without dropping data. But that resilience is only as good as your acknowledgment strategy.
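The difference between the two acknowledgment modes is easiest to see in a toy model. The sketch below uses a hypothetical in-memory ToyBroker (not a RabbitMQ API) that tracks at most one in-flight message, and simulates a consumer that dies after delivery but before finishing its work.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class AutoAckDemo {
    /** Toy in-memory broker (hypothetical, for illustration; not a RabbitMQ API). */
    static class ToyBroker {
        private final Deque<String> ready = new ArrayDeque<>();
        private String unacked; // at most one in-flight message, to keep the model small

        void publish(String message) {
            ready.addLast(message);
        }

        /** Auto-ack: the message is forgotten the moment it is handed to the consumer. */
        String deliverAutoAck() {
            return ready.pollFirst();
        }

        /** Manual ack: the message stays tracked as unacked until ack() is called. */
        String deliverManualAck() {
            unacked = ready.pollFirst();
            return unacked;
        }

        void ack() {
            unacked = null;
        }

        /** The broker notices a dead consumer and puts unacked work back on the queue. */
        void consumerDied() {
            if (unacked != null) {
                ready.addFirst(unacked);
                unacked = null;
            }
        }

        int messagesStillHeld() {
            return ready.size() + (unacked == null ? 0 : 1);
        }
    }

    /** Returns how many messages the broker still holds after a mid-processing crash. */
    static int survivorsAfterCrash(boolean autoAck) {
        ToyBroker broker = new ToyBroker();
        broker.publish("charge order-1");
        if (autoAck) {
            broker.deliverAutoAck();
        } else {
            broker.deliverManualAck();
        }
        // The consumer crashes here: no work completed, no ack sent.
        broker.consumerDied();
        return broker.messagesStillHeld();
    }

    public static void main(String[] args) {
        System.out.println("auto-ack survivors:   " + survivorsAfterCrash(true));  // 0
        System.out.println("manual-ack survivors: " + survivorsAfterCrash(false)); // 1
    }
}
```

With auto-ack the broker has nothing left to redeliver, which is exactly the silent gap described in this section. With manual ack the message returns to the queue and another consumer can pick it up.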

Auto-ack Is a Data Loss Contract

Auto-acknowledgment does not mean 'processed successfully' — it means 'delivered to consumer memory.' If the consumer dies before persisting, the message is gone.

Production Insight

A payment service using RabbitMQ auto-ack lost 2% of transactions when a consumer OOM'd mid-processing.

Symptom: no errors, no retries — just silent gaps in the database.

Rule: always use manual acknowledgment and ack only after the side effect (DB write, API call) is durable.

Key Takeaway

Auto-ack is a data loss mechanism, not a convenience feature.

Always ack after the work is done, not after the message is received.

Test consumer crashes with a chaos experiment to verify your ack logic.


The Producer — Implementing Reliable Event Egress

The producer is the entry point for all data entering your distributed system, and its reliability story is more nuanced than most tutorials suggest. A producer that simply calls publish and moves on is adequate for logging and analytics. A producer handling payment events, order creation, or anything with financial or regulatory consequences needs to think carefully about three things: persistence, confirmation, and the atomicity gap between a database write and a queue publish.

Persistence is about what happens when the broker restarts. A non-durable message lives only in memory: broker restart, message gone. For anything that matters, publish with delivery mode 2 (persistent) to a durable queue. Both are required, because a persistent message in a non-durable queue still disappears with the queue. Combined with publisher confirms, the broker confirms a persistent message only after writing it to disk. The trade-off is disk I/O on every publish, which reduces maximum throughput but eliminates the silent data loss vector.

Confirmation is about knowing the broker actually received the message. The standard publish call is fire-and-forget — if the network drops between publish and the broker's internal receipt, you never know. Publisher confirms enable an ack back from the broker to the producer. This adds latency per message but gives you a guarantee that the message is in the broker's hands.

The atomicity gap is the hardest problem. If your producer writes a row to Postgres and then publishes an event to RabbitMQ, there is a window where the database write succeeds but the queue publish fails — or the process is killed between the two operations. The downstream services never learn that the order exists. The Outbox pattern closes this gap by writing the event into a table inside the same database transaction, with a separate polling process handling the actual publish to the queue.

OrderProducer.java (Java)

package io.thecodeforge;

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;
import com.rabbitmq.client.ConfirmListener;
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

/**
 * Production-grade producer with durable messaging and publisher confirms.
 * io.thecodeforge naming conventions throughout.
 *
 * Two critical settings:
 *   1. durable=true on the queue declaration — survives broker restart
 *   2. PERSISTENT_TEXT_PLAIN delivery mode — message flushed to disk before broker ACKs
 *
 * Publisher confirms (channel.confirmSelect()) ensure we know when the broker
 * has actually received and persisted the message, not just buffered it.
 */
public class OrderProducer {
    private static final String QUEUE_NAME = "order_processing";

    // Track outstanding unconfirmed publishes: sequenceNumber -> message payload
    // In production this would trigger a retry or fallback path on nack
    private final ConcurrentNavigableMap<Long, String> outstandingConfirms =
        new ConcurrentSkipListMap<>();

    public void publishOrder(String customerId, double amount) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        factory.setUsername("forge_user");
        factory.setPassword("forge_password");

        // Opened per call here only to keep the example short. In production, share one
        // Connection per process: a new Connection per publish adds 100-200ms of TCP
        // handshake overhead and exhausts file descriptors at scale.
        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {

            // durable=true: this queue survives a broker restart
            // exclusive=false: allow multiple consumers
            // autoDelete=false: queue persists when the last consumer disconnects
            channel.queueDeclare(QUEUE_NAME, true, false, false, null);

            // Enable publisher confirms — the broker will ack or nack each message
            // so we know it was received and persisted, not just buffered in memory
            channel.confirmSelect();

            // Wire up confirm listeners before publishing
            channel.addConfirmListener(new ConfirmListener() {
                @Override
                public void handleAck(long deliveryTag, boolean multiple) {
                    if (multiple) {
                        outstandingConfirms.headMap(deliveryTag + 1).clear();
                    } else {
                        outstandingConfirms.remove(deliveryTag);
                    }
                    System.out.printf(" [Forge] Broker confirmed delivery for tag: %d%n", deliveryTag);
                }

                @Override
                public void handleNack(long deliveryTag, boolean multiple) {
                    // Broker rejected the message — trigger your retry or fallback path
                    String nacked = outstandingConfirms.get(deliveryTag);
                    System.err.printf(" [Forge] Broker NACK'd message, triggering fallback: %s%n", nacked);
                    // In production: write to outbox table for later retry
                }
            });

            // Generate a unique event ID that consumers use for idempotency checks
            String eventId = UUID.randomUUID().toString();
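            // Locale.ROOT keeps the decimal separator a '.', so the JSON is valid in every locale
            String payload = String.format(java.util.Locale.ROOT,
                "{\"event_id\":\"%s\", \"customer_id\":\"%s\", \"amount\":%.2f, \"currency\":\"USD\"}",
                eventId, customerId, amount
            );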

            // Track this publish before sending so the confirm listener can correlate
            long sequenceNumber = channel.getNextPublishSeqNo();
            outstandingConfirms.put(sequenceNumber, payload);

            // PERSISTENT_TEXT_PLAIN = Delivery Mode 2 + Content-Type text/plain
            // Mode 2 tells the broker to write to disk before acknowledging
            // Without this, a broker crash between receipt and flush loses the message
            channel.basicPublish(
                "",               // default exchange — routes directly to named queue
                QUEUE_NAME,       // routing key for default exchange = queue name
                MessageProperties.PERSISTENT_TEXT_PLAIN,
                payload.getBytes(StandardCharsets.UTF_8)
            );

            System.out.printf(" [Forge] Published order event %s for customer %s%n",
                eventId, customerId);

            // Wait for broker confirmation before returning to caller.
            // waitForConfirms returns false if the broker nacked any message and throws
            // TimeoutException if no confirm arrives within the timeout.
            // In high-throughput scenarios, use async confirms and batch publishing instead.
            if (!channel.waitForConfirms(5000)) {
                throw new RuntimeException("Broker nacked the message");
            }
        }
    }
}

Output

[Forge] Published order event 550e8400-e29b-41d4-a716-446655440000 for customer cust-123

[Forge] Broker confirmed delivery for tag: 1
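The confirm listener's bookkeeping can be exercised without a broker. The sketch below isolates the outstanding-confirms map from the producer into a hypothetical ConfirmTracker class and shows how the multiple flag settles every tag up to and including the given one. It also applies the same rule on nack, which is one reasonable way to handle multiple there, not the only one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

/** Broker-free sketch of publisher-confirm bookkeeping (hypothetical helper class). */
class ConfirmTracker {
    private final ConcurrentNavigableMap<Long, String> outstanding =
        new ConcurrentSkipListMap<>();

    void track(long sequenceNumber, String payload) {
        outstanding.put(sequenceNumber, payload);
    }

    /** Mirrors handleAck: multiple=true confirms every tag up to and including deliveryTag. */
    void onAck(long deliveryTag, boolean multiple) {
        if (multiple) {
            outstanding.headMap(deliveryTag, true).clear();
        } else {
            outstanding.remove(deliveryTag);
        }
    }

    /** Mirrors handleNack: returns the payloads that need a retry or fallback path. */
    List<String> onNack(long deliveryTag, boolean multiple) {
        List<String> toRetry = new ArrayList<>();
        if (multiple) {
            ConcurrentNavigableMap<Long, String> nacked =
                outstanding.headMap(deliveryTag, true);
            toRetry.addAll(nacked.values());
            nacked.clear();
        } else {
            String payload = outstanding.remove(deliveryTag);
            if (payload != null) {
                toRetry.add(payload);
            }
        }
        return toRetry;
    }

    int outstandingCount() {
        return outstanding.size();
    }

    public static void main(String[] args) {
        ConfirmTracker tracker = new ConfirmTracker();
        for (long tag = 1; tag <= 5; tag++) {
            tracker.track(tag, "payload-" + tag);
        }
        tracker.onAck(3, true); // confirms tags 1, 2 and 3 in one call
        System.out.println("outstanding after batch ack: " + tracker.outstandingCount());
        System.out.println("to retry after nack of 5:    " + tracker.onNack(5, false));
    }
}
```

After the batch ack two entries remain (tags 4 and 5). The nack of tag 5 returns its payload for retry and leaves only tag 4 outstanding.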

The Outbox Pattern — Closing the Atomicity Gap Between DB and Queue

Write the event to an 'outbox' table inside the same database transaction as your business logic — if the transaction rolls back, the event disappears with it
A separate polling process reads committed outbox rows that have not yet been sent and publishes them to the message queue, decoupling the database commit from the queue publish
Once published successfully, the poller marks the outbox row as sent or deletes it. A poller crash between the publish and the mark republishes the row on restart, so the outbox delivers at least once and consumers still deduplicate on event_id
This pattern guarantees atomicity between the data write and the event record: either both are committed or neither is, so a committed event is eventually published and a rolled-back one never is
The trade-off is added latency equal to the polling interval (typically 100ms to 5 seconds) and an outbox table that needs periodic cleanup
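The steps above can be sketched in memory. In the example below, plain lists stand in for the business table, the outbox table, and the broker, and the failBeforeCommit and crashBeforeMarkingSent flags are hypothetical switches for simulating failures. A real implementation would use a database transaction and a broker client instead.

```java
import java.util.ArrayList;
import java.util.List;

class OutboxSketch {
    static class OutboxRow {
        final String eventId;
        final String payload;
        boolean sent;

        OutboxRow(String eventId, String payload) {
            this.eventId = eventId;
            this.payload = payload;
        }
    }

    final List<String> orders = new ArrayList<>();    // stands in for the business table
    final List<OutboxRow> outbox = new ArrayList<>(); // stands in for the outbox table
    final List<String> broker = new ArrayList<>();    // stands in for the message queue

    /** One "transaction": both rows become visible together, or neither does. */
    synchronized void placeOrder(String orderId, boolean failBeforeCommit) {
        if (failBeforeCommit) {
            throw new IllegalStateException("rolled back: nothing written");
        }
        orders.add(orderId);
        outbox.add(new OutboxRow("evt-" + orderId, "ORDER_PLACED " + orderId));
    }

    /** One poller pass: publish committed-but-unsent rows, then mark them sent. */
    synchronized int pollOnce(boolean crashBeforeMarkingSent) {
        int published = 0;
        for (OutboxRow row : outbox) {
            if (row.sent) {
                continue;
            }
            broker.add(row.eventId);
            published++;
            if (crashBeforeMarkingSent) {
                return published; // row stays unsent, so the next pass republishes it
            }
            row.sent = true;
        }
        return published;
    }

    public static void main(String[] args) {
        OutboxSketch db = new OutboxSketch();
        db.placeOrder("o1", false);
        db.pollOnce(true);   // publishes, then "crashes" before marking the row sent
        db.pollOnce(false);  // restart: the same event is published a second time
        System.out.println("broker received: " + db.broker);
    }
}
```

The broker ends up with the same event ID twice, which is why the outbox is an at-least-once mechanism and consumers still need an idempotency check.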

Production Insight

Non-durable messages are a silent data loss vector that only surfaces during broker restarts — which happen during upgrades, failovers, and incidents, exactly when you can least afford data loss. Always use delivery mode 2 for anything that matters.

Connection pooling is critical at scale: creating a new TCP connection per publish adds 100 to 200 milliseconds of handshake overhead and exhausts file descriptors under load. The correct pattern is one Connection per process and one Channel per thread — Channels are lightweight and cheap, Connections are expensive and shared.

Publisher confirms are not optional for payment or order events. Fire-and-forget publish gives you no signal when the network drops between your process and the broker. Confirms give you that signal at the cost of per-message latency.

Key Takeaway

The producer's job is not just publish — it is publish reliably and know when the broker has received it. Non-durable delivery mode is a silent data loss vector that only reveals itself during infrastructure failures. Publisher confirms give you the signal you need to implement retry or fallback logic. Use the Outbox pattern whenever a queue publish and a database write must succeed or fail together.

Producer Reliability Decisions

If: Message loss is acceptable (analytics events, access logs, metrics) → Use: Non-durable queues with transient delivery mode. Faster, less disk I/O, appropriate for high-volume observability data

If: Message loss is unacceptable (payments, orders, compliance events) → Use: Durable queues, persistent delivery mode, and publisher confirms. All three are required, not optional

If: Need transactional guarantee that ties a database write to a queue publish atomically → Use: The Outbox pattern. Write to the outbox table in the same transaction, poll and publish separately with idempotency on the poller

If: High-throughput producer sending more than 10,000 messages per second → Use: Batch publishing and asynchronous publisher confirms rather than waiting for synchronous confirmation on each message. The throughput difference is significant

The Broker — Orchestrating Infrastructure with Docker

The broker is the central post office of your distributed system — every message passes through it, and its configuration determines the durability, routing, and delivery guarantees your system provides. Senior engineers containerise the broker with Docker Compose for a simple reason: dev-prod parity. When your local RabbitMQ instance has the same exchanges, bindings, and policies as production, you catch routing bugs and configuration mismatches before they hit staging.

The compose file below includes a named volume for data persistence — without it, a docker-compose down wipes every message in every queue, which is fine for unit tests and genuinely dangerous for integration tests where you are verifying retry and DLQ behaviour. The management UI on port 15672 is not just an admin panel — during an incident it is your primary visibility into queue depth, consumer count, message rates, and memory usage. Know how to read it under pressure before you need it.

One often-overlooked aspect of broker configuration is the Dead Letter Exchange. Defining the DLX in the definitions file that the compose file mounts, alongside the primary queue, means every developer's local environment has the same failure routing as production. Discovering that your DLQ configuration is wrong during a production incident is far worse than discovering it locally.

docker-compose.yml (YAML)

version: '3.8'

services:
  thecodeforge-broker:
    image: rabbitmq:3.13-management
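    container_name: forge-rabbitmq
    # RabbitMQ names its data directory after the node name, which defaults to the hostname.
    # Pin the hostname so a recreated container finds its data in the named volume.
    hostname: forge-rabbitmq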
    ports:
      - "5672:5672"   # AMQP protocol — application connections
      - "15672:15672" # Management UI — use for debugging queue depth and consumer health
    environment:
      RABBITMQ_DEFAULT_USER: forge_user
      RABBITMQ_DEFAULT_PASS: forge_password
      # Load exchange, queue, and binding definitions from the mounted file at startup
      RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS: '-rabbitmq_management load_definitions "/etc/rabbitmq/definitions.json"'
    volumes:
      # Named volume ensures messages survive docker-compose down and up
      # Without this, every restart wipes the queue — dangerous for DLQ testing
      - rabbitmq_data:/var/lib/rabbitmq
      # Pre-load exchange and queue definitions on startup for dev-prod parity
      - ./rabbitmq-definitions.json:/etc/rabbitmq/definitions.json:ro
    healthcheck:
      # Wait for the broker to be ready before starting dependent services
      test: ["CMD", "rabbitmq-diagnostics", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 20s

volumes:
  rabbitmq_data:
    # Explicit driver declaration makes the volume's purpose visible to future engineers
    driver: local

Output

# Container startup:

# forge-rabbitmq started. Waiting for healthcheck...

# rabbitmq-diagnostics ping succeeded after 15s

# Broker ready at amqp://forge_user:forge_password@localhost:5672

# Management UI at http://localhost:15672 (forge_user / forge_password)

# Verify broker is healthy:

# docker exec forge-rabbitmq rabbitmqctl status

# docker exec forge-rabbitmq rabbitmqctl list_queues name messages consumers

Queue vs Topic — The Scalability Difference That Matters Most

Queue (point-to-point): one message, one consumer — correct for task distribution like image resizing, email sending, or payment processing where you want exactly one service to act
Topic or fanout (pub/sub): one message, every subscriber — correct for domain events like ORDER_PLACED where inventory, notifications, and analytics all need to react independently
Adding a new microservice to a topic requires zero producer code changes — bind a new queue to the existing exchange and the service starts receiving events
Competing consumers on a single queue scale horizontally and naturally — add more consumer instances and the broker distributes work across all of them
Rule of thumb: if more than one downstream service will ever need the event, use a topic exchange from day one — retrofitting it later requires coordinating multiple teams
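The two delivery models in this list can be compared with a small in-memory sketch. The method names below (competingConsumers, fanout) are hypothetical, and the round-robin dispatch is a simplification of what a broker does when all consumers are equally fast.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RoutingSketch {
    /** Point-to-point: each message goes to exactly one of the competing consumers. */
    static Map<String, List<String>> competingConsumers(List<String> messages,
                                                        List<String> consumers) {
        Map<String, List<String>> received = new LinkedHashMap<>();
        for (String consumer : consumers) {
            received.put(consumer, new ArrayList<>());
        }
        for (int i = 0; i < messages.size(); i++) {
            String consumer = consumers.get(i % consumers.size()); // round-robin dispatch
            received.get(consumer).add(messages.get(i));
        }
        return received;
    }

    /** Pub/sub: every bound subscriber queue gets its own copy of each message. */
    static Map<String, List<String>> fanout(List<String> messages,
                                            List<String> subscribers) {
        Map<String, List<String>> received = new LinkedHashMap<>();
        for (String subscriber : subscribers) {
            received.put(subscriber, new ArrayList<>(messages));
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println(competingConsumers(
            Arrays.asList("resize-1", "resize-2", "resize-3", "resize-4"),
            Arrays.asList("worker-a", "worker-b")));
        System.out.println(fanout(
            Arrays.asList("ORDER_PLACED"),
            Arrays.asList("inventory", "notifications", "analytics")));
    }
}
```

In the first call each worker handles two of the four jobs. In the second, all three services receive the single ORDER_PLACED event, and adding a fourth subscriber would not change the publishing side at all.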

Production Insight

Docker named volumes prevent data loss on container restart — without them, a docker-compose down between test runs wipes all messages, making DLQ and retry behaviour untestable locally.

Pin the RabbitMQ image version explicitly — rabbitmq:3-management can pull a breaking minor version change between developers' machines if Docker cache is stale, creating environment-specific bugs that are painful to trace.

The management UI on port 15672 is your first stop during any message queue incident — queue depth, unacknowledged message count, consumer count, and publish rates are all visible in real time without any additional tooling.

Key Takeaway

The broker is the single source of truth for message delivery — its configuration drives your system's durability and routing guarantees. Dev-prod parity via Docker Compose means routing bugs and DLQ misconfigurations surface locally rather than in production. Use topic exchanges from the start whenever more than one downstream service needs an event — retrofitting point-to-point queues to pub/sub requires coordinating producer changes across teams.

The Consumer — Ensuring Idempotency and Manual Acknowledgement

Consumers are where processing happens and where most data loss actually occurs — not in the broker, not in the producer. The configuration decisions you make in the consumer determine whether your system has at-least-once delivery with retry guarantees or auto-ack with silent data loss disguised as reliability.

The most dangerous configuration is auto-ack (auto_ack=True in Python's pika, autoAck=true in the Java client's basicConsume). The broker treats delivery itself as confirmation that the message was processed successfully. If your Java process receives the message, starts processing, and is then killed by the OOM killer between the payment gateway call and the database commit, the message is gone. The broker has already deleted it. There is no retry, no DLQ entry, and no trace in any log, only a successful charge in the gateway and a missing order in your database.

Manual acknowledgement changes the contract: the broker keeps the message in an unacknowledged state until your code explicitly calls basicAck. Only call basicAck after every side effect has succeeded — not after receipt, not after the external API call, but after the database commit that makes the operation durable.

The second non-negotiable for production consumers is idempotency. At-least-once delivery means duplicates are guaranteed over time. A network blip between your ack call and the broker receiving it causes redelivery. A consumer restart with unacked messages causes redelivery. Assume it will happen and build accordingly — check the event_id against a processed_events table before acting, and ack without processing if the event was already handled.
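As a minimal illustration of that check, the sketch below uses an in-memory set in place of the processed_events table and a counter in place of the real side effect. IdempotentHandler is a hypothetical name. In a real consumer the insert into processed_events and the business write would share one database transaction.

```java
import java.util.HashSet;
import java.util.Set;

class IdempotentHandler {
    // Stands in for the processed_events table
    private final Set<String> processedEvents = new HashSet<>();
    private int chargesMade = 0;

    /** Returns true if the side effect ran, false if the event was a duplicate. */
    synchronized boolean handle(String eventId) {
        if (!processedEvents.add(eventId)) {
            return false; // duplicate: ack without repeating the side effect
        }
        chargesMade++;    // the non-idempotent side effect, e.g. charging a card
        return true;
    }

    synchronized int chargesMade() {
        return chargesMade;
    }

    public static void main(String[] args) {
        IdempotentHandler handler = new IdempotentHandler();
        System.out.println("first delivery ran:  " + handler.handle("evt-1")); // true
        System.out.println("redelivery ran:      " + handler.handle("evt-1")); // false
        System.out.println("charges made:        " + handler.chargesMade());   // 1
    }
}
```

The redelivered event is acknowledged but produces no second charge, which is the behaviour at-least-once delivery requires of every consumer.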

Prefetch count rounds out the three required settings. Without it, RabbitMQ uses an unbounded prefetch and may deliver 500 messages to one fast consumer while three other consumers sit idle. Set basicQos(1) to enforce fair dispatch — the broker only sends a new message to a consumer after the previous one is acknowledged.
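The effect of prefetch on total completion time can be estimated with a virtual-time model. The sketch below is a simplification under stated assumptions: each consumer has a fixed per-message cost, unbounded prefetch is modelled as dealing every message out round-robin up front, and prefetch=1 is modelled as handing the next message to whichever consumer frees up first. The cost values are made up for illustration.

```java
import java.util.Arrays;

class PrefetchSketch {
    /** Unbounded prefetch: messages are dealt out round-robin up front, ignoring speed. */
    static long makespanUnbounded(int messageCount, long[] costPerMessage) {
        long[] busyUntil = new long[costPerMessage.length];
        for (int m = 0; m < messageCount; m++) {
            int consumer = m % costPerMessage.length;
            busyUntil[consumer] += costPerMessage[consumer];
        }
        return Arrays.stream(busyUntil).max().orElse(0);
    }

    /** prefetch=1: the next message goes to whichever consumer acks (frees up) first. */
    static long makespanPrefetchOne(int messageCount, long[] costPerMessage) {
        long[] busyUntil = new long[costPerMessage.length];
        for (int m = 0; m < messageCount; m++) {
            int next = 0;
            for (int c = 1; c < busyUntil.length; c++) {
                if (busyUntil[c] < busyUntil[next]) {
                    next = c;
                }
            }
            busyUntil[next] += costPerMessage[next];
        }
        return Arrays.stream(busyUntil).max().orElse(0);
    }

    public static void main(String[] args) {
        long[] costs = {10, 1}; // one slow consumer, one fast consumer (time units)
        System.out.println("unbounded prefetch: " + makespanUnbounded(10, costs));   // 50
        System.out.println("prefetch = 1:       " + makespanPrefetchOne(10, costs)); // 10
    }
}
```

With ten messages, the up-front split leaves the slow consumer holding five messages for 50 time units while the fast one idles after 5. With prefetch=1 the fast consumer absorbs nine of the ten messages and everything finishes in 10.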

OrderConsumer.java (Java)

package io.thecodeforge;

import com.rabbitmq.client.*;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

/**
 * Production-grade consumer with:
 *   1. Manual acknowledgement — only ack after all side effects succeed
 *   2. Idempotency check — safe to receive the same message twice
 *   3. Prefetch=1 — fair dispatch across all consumer instances
 *   4. DLQ routing on permanent failure — no infinite poison message loops
 */
public class OrderConsumer {
    private static final String QUEUE_NAME = "order_processing";
    private static final String DB_URL     = "jdbc:postgresql://localhost:5432/forge";
    private static final String DB_USER    = "forge_user";
    private static final String DB_PASS    = "forge_password";

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        factory.setUsername("forge_user");
        factory.setPassword("forge_password");

        // Enable automatic connection recovery — reconnects after network interruptions
        // without requiring consumer code to restart or re-register callbacks
        factory.setAutomaticRecoveryEnabled(true);
        factory.setNetworkRecoveryInterval(5000); // retry every 5 seconds

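        // Fully qualified: the single-type import of java.sql.Connection would otherwise
        // shadow the wildcard import and this line would not compile
        com.rabbitmq.client.Connection connection = factory.newConnection();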
        Channel channel = connection.createChannel();

        // Declare the queue with the same parameters as the producer
        // If the queue already exists with different params, this throws — catch it early
        channel.queueDeclare(QUEUE_NAME, true, false, false, null);

        // CRITICAL: prefetch=1 prevents one consumer from hoarding all messages
        // The broker will not deliver a new message until the previous one is acked
        // This gives fair dispatch: work goes to whichever consumer instance is free
        channel.basicQos(1);

        System.out.println(" [Forge] Consumer waiting for messages on: " + QUEUE_NAME);

        DeliverCallback deliverCallback = (consumerTag, delivery) -> {
            String body = new String(delivery.getBody(), StandardCharsets.UTF_8);
            long deliveryTag = delivery.getEnvelope().getDeliveryTag();
            String eventId   = extractEventId(body);

            System.out.printf(" [Forge] Received event %s%n", eventId);

            try {
                // Step 1: Idempotency check — has this event been processed before?
                // Required because at-least-once delivery means duplicates WILL arrive
                if (isAlreadyProcessed(eventId)) {
                    System.out.printf(" [Forge] Duplicate event %s — acking without processing%n", eventId);
                    channel.basicAck(deliveryTag, false);
                    return;
                }

                // Step 2: Execute business logic — all of this must succeed or fail atomically
                processOrder(body);

                // Step 3: Mark as processed in the idempotency table
                // Do this inside the same DB transaction as processOrder() in production
                markAsProcessed(eventId);

                // Step 4: Only ack AFTER all side effects are committed to durable storage
                // If we crash between processOrder() and basicAck(), the broker redelivers
                // The idempotency check in Step 1 handles that redelivery safely
                channel.basicAck(deliveryTag, false);
                System.out.printf(" [Forge] Successfully processed and acked event %s%n", eventId);

            } catch (Exception error) {
                System.err.printf(" [Forge] Failed to process event %s: %s%n",
                    eventId, error.getMessage());

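                // requeue=false: the broker dead-letters the message instead of requeueing it
                // This only reaches the DLQ if the queue has a dead-letter exchange configured
                // (queue arguments or a policy); without one the message is silently dropped
                // requeue=true would create an infinite loop for permanent failures
                // The DLQ preserves the message for manual inspection and replay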
                channel.basicNack(deliveryTag, false, false);
            }
        };

        // autoAck=false: broker keeps message unacked until we call basicAck or basicNack
        // This is the single most important setting — never set this to true in production
        channel.basicConsume(QUEUE_NAME, false, deliverCallback, consumerTag ->
            System.out.println(" [Forge] Consumer cancelled: " + consumerTag)
        );
    }

    private static void processOrder(String json) throws Exception {
        // In production: parse JSON, call payment gateway, write to DB in one transaction
        System.out.printf(" [Forge] Processing order payload: %s%n", json);
        Thread.sleep(50); // simulate processing time
    }

    private static boolean isAlreadyProcessed(String eventId) throws Exception {
        try (var conn = DriverManager.getConnection(DB_URL, DB_USER, DB_PASS);
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT 1 FROM processed_events WHERE event_id = ? LIMIT 1")) {
            ps.setString(1, eventId);
            ResultSet rs = ps.executeQuery();
            return rs.next();
        }
    }

    private static void markAsProcessed(String eventId) throws Exception {
        try (var conn = DriverManager.getConnection(DB_URL, DB_USER, DB_PASS);
             PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO processed_events (event_id, processed_at) VALUES (?, NOW()) ON CONFLICT DO NOTHING")) {
            ps.setString(1, eventId);
            ps.executeUpdate();
        }
    }

    private static String extractEventId(String json) {
        // In production use Jackson or Gson for JSON parsing
        // This simple extraction is for illustration clarity only
        int start = json.indexOf('"', json.indexOf("event_id") + 10) + 1;
        int end   = json.indexOf('"', start);
        return json.substring(start, end);
    }
}

Output

[Forge] Consumer waiting for messages on: order_processing

[Forge] Received event 550e8400-e29b-41d4-a716-446655440000

[Forge] Processing order payload: {"event_id":"550e8400...", "customer_id":"cust-123", "amount":99.99}

[Forge] Successfully processed and acked event 550e8400-e29b-41d4-a716-446655440000

# On redelivery of the same event_id:

[Forge] Received event 550e8400-e29b-41d4-a716-446655440000

[Forge] Duplicate event 550e8400-e29b-41d4-a716-446655440000 — acking without processing

The Poison Message Problem — Why requeue=True Is Dangerous

If a message has a permanent defect — malformed JSON, a missing required field, a schema the consumer cannot parse — it will fail on every delivery attempt. With requeue=True on nack, the broker delivers it again immediately. The consumer fails again, nacks with requeue=True, and the cycle repeats thousands of times per second. This infinite loop blocks every other message in the queue and spikes consumer CPU while making zero progress. Always use requeue=False to route unrecoverable failures to the DLQ where they can be inspected and replayed manually after the root cause is fixed.
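The loop is easy to reproduce without a broker. The sketch below is a toy in-memory model, not RabbitMQ itself: `handle`, `run`, and the `max_deliveries` cap are hypothetical names chosen for illustration, and the cap exists only so the requeue case terminates.

```python
from collections import deque


def handle(message: str) -> None:
    """Toy handler: any message starting with 'bad' has a permanent defect."""
    if message.startswith("bad"):
        raise ValueError("malformed payload")


def run(messages, requeue, max_deliveries=1000):
    """Deliver messages until the queue drains or the delivery cap is hit.

    Returns (processed, dead_lettered, deliveries). A real broker has no
    delivery cap; it is here only to stop the requeue=True loop.
    """
    queue = deque(messages)
    processed, dead_lettered, deliveries = [], [], 0
    while queue and deliveries < max_deliveries:
        message = queue.popleft()
        deliveries += 1
        try:
            handle(message)
            processed.append(message)
        except ValueError:
            if requeue:
                queue.appendleft(message)  # back to the head of the queue
            else:
                dead_lettered.append(message)  # isolated for later inspection
    return processed, dead_lettered, deliveries


looping = run(["bad-1", "order-1", "order-2"], requeue=True)
routed = run(["bad-1", "order-1", "order-2"], requeue=False)
print(looping)
print(routed)
```

With requeue enabled, the two healthy orders are never reached: every delivery is spent on the poison message. With requeue disabled, the same input drains in three deliveries.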

Production Insight

basicQos(1) is the single most impactful consumer setting for fair distribution — without it, one fast consumer can hold hundreds of unacknowledged messages while other instances have nothing to process, which makes horizontal scaling ineffective.

auto_ack=True is a production incident waiting to happen for any consumer whose messages must not be lost. The broker deletes each message on delivery, so a crash mid-processing leaves nothing to redeliver. The convenience it offers creates exactly the false confidence behind the payment incident in this article's post-mortem.

Automatic connection recovery (factory.setAutomaticRecoveryEnabled(true)) handles the silent disconnection scenario described in the debug guide — without it, your consumer appears to be running but stops receiving messages after a network interruption.

Key Takeaway

The consumer is where data loss happens — not in the broker, not in the producer. Manual ack, idempotency check, and DLQ routing form a three-layer defence that makes your consumer safe to restart, safe to run as multiple instances, and safe to deliver the same message twice. If your consumer uses auto-ack, you are one OOM kill away from a production incident with no recovery path.

Consumer Configuration Decisions

If: Processing is idempotent, fast (under 100ms), and stateless

→

Use: A higher prefetch count (10 to 50) to improve throughput, but still use manual ack, never auto-ack

If: Processing involves external API calls, database writes, or any I/O

→

Use: Prefetch count of 1 and manual ack. Only ack after all side effects are committed to durable storage

If: Message fails due to a permanent defect (malformed data, schema mismatch)

→

Use: Nack with requeue=False to route directly to DLQ. Do not requeue; it will never succeed and will block the queue

If: Message fails due to a transient error (downstream timeout, temporary unavailability)

→

Use: Nack with requeue=False and route to a retry queue with exponential backoff TTL, not back to the main queue, which would allow immediate retry without backoff
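The decisions above can be sketched as two small functions. This is a minimal illustration under stated assumptions: `TransientError`, `classify_failure`, and `choose_prefetch` are hypothetical names, the exception groupings are one plausible mapping rather than a rule, and the value 20 is simply a point inside the 10 to 50 range mentioned above.

```python
import json


class TransientError(Exception):
    """Hypothetical marker for failures that may succeed later (timeouts, 503s)."""


def classify_failure(error: Exception) -> str:
    """Map an exception to a routing decision for the failed message."""
    if isinstance(error, (TransientError, TimeoutError, ConnectionError)):
        return "retry_queue"  # nack with requeue=False, retry after a backoff delay
    if isinstance(error, (json.JSONDecodeError, KeyError, ValueError)):
        return "dead_letter"  # permanent defect: retrying cannot help
    return "dead_letter"      # unknown errors: fail safe and let a human look


def choose_prefetch(does_io: bool, idempotent: bool) -> int:
    """Pick a prefetch count following the decision list."""
    if does_io or not idempotent:
        return 1   # one message in flight, ack only after side effects commit
    return 20      # fast, stateless, idempotent work can batch deliveries


print(classify_failure(TimeoutError("gateway timed out")))
print(classify_failure(KeyError("amount")))
print(choose_prefetch(does_io=True, idempotent=True))
```

Both functions only decide; the actual nack or ack call stays in the delivery callback so the routing policy can be unit-tested without a broker.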

Dead-Letter Queues — Handling Failures Without Blocking the Pipeline

A Dead Letter Queue is where messages go when they cannot be processed — not to be forgotten, but to be preserved for diagnosis and recovery. The DLQ's value is not that it catches failures (your nack logic does that), but that it isolates them from the primary queue so the happy path continues processing at full speed while the failed messages wait for human attention.

In RabbitMQ, a Dead Letter Exchange is a first-class feature configured on the primary queue via arguments or a policy. When a message is nacked with requeue=False, or its TTL expires, or the queue exceeds its maximum length, the broker automatically routes it to the configured dead-letter exchange, which then delivers it to the DLQ. This requires no code changes in the consumer beyond using requeue=False; the broker handles the routing. If no dead-letter exchange is configured, a message nacked with requeue=False is simply discarded.
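As a rough sketch of what "configured via arguments" means, the helper below builds the argument dictionary a primary queue would be declared with. The argument keys are RabbitMQ's; the helper name `dead_letter_arguments`, the exchange name `orders.dlx`, and the routing key `order_processing.dlq` are assumptions made for this example.

```python
def dead_letter_arguments(dlx_name: str, dlq_routing_key: str) -> dict:
    """Build queue arguments that send rejected or expired messages to a DLX."""
    return {
        "x-dead-letter-exchange": dlx_name,
        # Optional: without it the message keeps its original routing key
        "x-dead-letter-routing-key": dlq_routing_key,
    }


primary_args = dead_letter_arguments("orders.dlx", "order_processing.dlq")

# With pika, the dictionary would be passed at declaration time, e.g.:
#   channel.queue_declare(queue="order_processing", durable=True,
#                         arguments=primary_args)
print(primary_args)
```

Queue arguments are fixed at creation: redeclaring an existing queue with different arguments fails, so a queue that was first declared without these keys needs either a policy or a delete-and-recreate to gain a dead-letter exchange.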

Archiving DLQ messages to a relational database table adds context that the message broker does not fully preserve on its own. RabbitMQ records the source queue, a coarse reason (rejected, expired, maxlen), and a count in the message's x-death header, but not the application-level failure: the exception message, the timestamp of each attempt, or whether anyone has resolved it. Without this context, when someone opens the RabbitMQ management UI a week later and sees 47 messages in the DLQ, they have no idea which ones are related, which ones can be replayed, or what went wrong.

The alerting is not optional. A DLQ that accumulates silently for days is just deferred data loss. Every message that enters the DLQ represents a failure that a customer or downstream system is already experiencing. Treat DLQ growth exactly like a 500 error rate spike — investigate immediately, not during the next weekly review.

audit_schema.sql (SQL)

-- io.thecodeforge: Schema for auditing and replaying DLQ extractions
-- This table provides the context that the message broker cannot preserve:
-- which queue failed, why it failed, how many times it was attempted,
-- and whether it has been replayed or resolved.

CREATE TABLE dlq_archive (
    id                  BIGSERIAL       PRIMARY KEY,
    original_event_id   UUID            NOT NULL,
    source_queue        VARCHAR(255)    NOT NULL,  -- which queue the message came from
    payload             JSONB           NOT NULL,  -- full message body for replay
    failure_reason      TEXT            NOT NULL,  -- exception message or error code
    retry_count         INT             NOT NULL DEFAULT 0,
    first_failed_at     TIMESTAMPTZ     NOT NULL DEFAULT NOW(),
    last_failed_at      TIMESTAMPTZ     NOT NULL DEFAULT NOW(),
    resolved_at         TIMESTAMPTZ,               -- null until an engineer resolves it
    resolution_notes    TEXT                       -- what was done to fix or discard
);

-- Fast lookup by event_id for idempotency checks and deduplication during replay
CREATE UNIQUE INDEX idx_dlq_event_id
    ON dlq_archive(original_event_id);

-- For ops dashboards: filter by source queue and unresolved status
CREATE INDEX idx_dlq_source_unresolved
    ON dlq_archive(source_queue, resolved_at)
    WHERE resolved_at IS NULL;

-- Query: find all unresolved failures for the order_processing queue in the last 24 hours
SELECT
    original_event_id,
    failure_reason,
    retry_count,
    first_failed_at,
    payload->>'customer_id'  AS customer_id,
    payload->>'amount'       AS amount
FROM dlq_archive
WHERE source_queue  = 'order_processing'
  AND resolved_at   IS NULL
  AND first_failed_at >= NOW() - INTERVAL '24 hours'
ORDER BY first_failed_at DESC;

-- Replay query: select fixable messages to republish to the original queue
-- After fixing the consumer bug, republish these payloads with their original event_ids
-- so the consumer's idempotency check can skip anything that was already processed
SELECT original_event_id, payload
FROM dlq_archive
WHERE source_queue    = 'order_processing'
  AND failure_reason  LIKE '%connection timeout%'  -- transient errors worth replaying
  AND resolved_at     IS NULL;

Output

-- Table created with UNIQUE constraint on original_event_id.

-- Partial index for unresolved messages keeps ops dashboard queries fast

-- even as the archive grows to millions of rows over months.

-- Example unresolved failures query result:

-- original_event_id | failure_reason | retry_count | customer_id

-- 550e8400-e29b-41d4-a716-446655440000 | DB connection timeout | 3 | cust-123

-- 7c9e6679-7425-40de-944b-e07fc1f90ae7 | JSON parse error: ... | 1 | cust-456

A DLQ Without Alerting Is Deferred Data Loss — Not a Safety Net

Setting up a DLQ without an alert is the engineering equivalent of installing a smoke detector without batteries. You feel safer but you are not. Set a CloudWatch alarm, Prometheus alert rule, or Datadog monitor that fires the moment DLQ message count exceeds zero — not when it exceeds some arbitrary threshold you will rarely reach. A single message in the DLQ means a customer or downstream system is already experiencing a failure right now. Treat it like a 500 error: page the on-call engineer, investigate immediately, and resolve it before the next one arrives.

Production Insight

RabbitMQ keeps only a coarse record of a dead-lettered message's history in its x-death header (source queue, reason, count). The dlq_archive table is where you preserve the application-level failure reason, retry count, resolution status, and the full payload so engineers can diagnose without guessing a week later.

A growing DLQ is a symptom of an upstream problem, not a feature of your system working correctly. If more than a handful of messages hit the DLQ on a normal business day, something upstream needs fixing — it is not time to replay the DLQ, it is time to fix the consumer or the data.

Design a replay mechanism from the start: a script or internal tool that reads from dlq_archive and republishes to the original queue, keeping the original event ID so the consumer's idempotency check still applies. Without it, DLQ recovery requires manual intervention that is slow, error-prone, and stressful during an incident.

Key Takeaway

The DLQ is your system's emergency stop valve — it prevents poison messages from blocking the entire pipeline while preserving failed messages for diagnosis and recovery. But it is only useful if you know when messages arrive in it. A DLQ without alerting is a silent graveyard. Archive DLQ entries to a relational table with source queue, failure reason, and retry count so engineers can diagnose root causes and replay selectively when the fix is deployed.

DLQ Handling Decisions

If: Message failed due to a transient error (database timeout, network interruption, downstream service temporarily unavailable)

→

Use: Route to a retry queue with exponential backoff TTL (5s, 30s, 5min). Only route to the DLQ after max retry count is exhausted. This avoids filling the DLQ with recoverable errors.

If: Message failed due to a permanent error (malformed JSON, schema version mismatch, missing required fields)

→

Use: Route directly to DLQ and archive to dlq_archive. Do not retry; it will never succeed without a code fix or manual data correction

If: DLQ depth is growing faster than engineers can resolve it

→

Use: Stop treating individual DLQ messages as the problem and fix the root cause upstream. A DLQ growing at 50 messages per hour means your consumer or producer has a bug affecting a significant percentage of traffic

If: You need to replay DLQ messages after fixing the root cause in the consumer

→

Use: The dlq_archive replay query to identify fixable messages by failure_reason, then republish them to the original queue with their original event IDs, so the idempotency check in the consumer handles any overlap with previously processed events
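A selective replay tool can be quite small. The sketch below uses an in-memory SQLite table as a stand-in for the PostgreSQL dlq_archive table (only the columns it needs), and a `publish` callable as a stand-in for the real broker client; `select_replayable` and `replay` are hypothetical names. It republishes the payload unchanged, which is an assumption made here so the original event_id survives and the consumer's idempotency check can still recognise the event.

```python
import json
import sqlite3


def select_replayable(conn, source_queue, reason_fragment):
    """Return (event_id, payload) for unresolved rows matching a failure reason."""
    rows = conn.execute(
        "SELECT original_event_id, payload FROM dlq_archive "
        "WHERE source_queue = ? AND failure_reason LIKE ? "
        "AND resolved_at IS NULL ORDER BY first_failed_at",
        (source_queue, f"%{reason_fragment}%"),
    ).fetchall()
    return [(event_id, json.loads(payload)) for event_id, payload in rows]


def replay(conn, source_queue, reason_fragment, publish):
    """Republish matching messages and mark each one resolved."""
    replayed = []
    for event_id, payload in select_replayable(conn, source_queue, reason_fragment):
        publish(source_queue, payload)  # payload still carries its event_id
        conn.execute(
            "UPDATE dlq_archive SET resolved_at = CURRENT_TIMESTAMP, "
            "resolution_notes = ? WHERE original_event_id = ?",
            ("replayed after fix", event_id),
        )
        replayed.append(event_id)
    conn.commit()
    return replayed


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dlq_archive ("
    "original_event_id TEXT PRIMARY KEY, source_queue TEXT NOT NULL, "
    "payload TEXT NOT NULL, failure_reason TEXT NOT NULL, "
    "first_failed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP, "
    "resolved_at TEXT, resolution_notes TEXT)"
)
conn.executemany(
    "INSERT INTO dlq_archive (original_event_id, source_queue, payload, failure_reason) "
    "VALUES (?, ?, ?, ?)",
    [
        ("evt-1", "order_processing", '{"event_id": "evt-1", "amount": 99.99}', "DB connection timeout"),
        ("evt-2", "order_processing", '{"event_id": "evt-2"}', "JSON parse error"),
    ],
)

published = []
replayed = replay(conn, "order_processing", "connection timeout",
                  lambda queue, payload: published.append((queue, payload)))
unresolved = conn.execute(
    "SELECT COUNT(*) FROM dlq_archive WHERE resolved_at IS NULL"
).fetchone()[0]
print(replayed, unresolved)
```

Only the timeout failure is replayed; the parse error stays unresolved in the archive until someone fixes the data or the parser.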

RabbitMQ vs Kafka Decision Matrix: Smart Broker vs Smart Consumer

The architectural distinction between RabbitMQ and Kafka is not about performance — it is about where you place the intelligence. RabbitMQ is a smart broker: the broker itself handles routing, persistence, acknowledgements, dead-lettering, and queue management. The consumer is relatively dumb — it connects, receives messages, acks them, and moves on. Kafka flips this model. The broker is a dumb, fast, immutable log. The consumer is smart — it manages its own offset, participates in rebalancing, and controls exactly when and how many messages to pull.

This decision determines your system's operational characteristics at scale. The table below summarises the key architectural differences and the trade-offs they create.

Decision Factor	RabbitMQ (Smart Broker)	Kafka (Smart Consumer)
Who manages delivery?	Broker pushes messages and controls ack/dead-letter routing	Consumer pulls messages and commits offsets independently
Routing flexibility	Built-in exchanges (direct, topic, fanout, headers): complex routing without code	Fan-out is native (one consumer group per subscriber); content-based routing needs consumer logic or stream processing (Kafka Streams, ksqlDB)
Ordering guarantee	Strict per-queue ordering as long as there is a single consumer	Strict per-partition ordering; global ordering requires a single partition (low throughput)
Replay capability	Not native — once acked, message is gone unless you archive separately	Core feature — retain messages for days/weeks, seek to any offset
Scalability model	Add more consumers to a queue — the broker distributes work	Add more consumer group members — partition rebalancing handles redistribution
Operational maturity	Simpler setup, single binary, management UI out of the box	Requires ZooKeeper in older versions (replaced by KRaft in newer ones), partition tuning, careful consumer group coordination

When to choose each:
- RabbitMQ if you need complex routing (e.g., RPC, work queues, multiple exchange types), or if your operations team wants a simple, observable infrastructure with minimal tuning.
- Kafka if you need event replay, high-throughput streaming, multiple independent consumer groups, or a durable audit log that survives consumer failure without data loss.

Smart Broker vs Smart Consumer — The Core Trade-Off

RabbitMQ's smart broker model makes it easier to add new consumers without coordination — the broker handles all the routing and delivery logic. Kafka's smart consumer model gives you replay and independent fan-out at the cost of requiring consumers to manage offsets and rebalancing. There is no universally correct choice; the answer depends on whether your system values routing flexibility (RabbitMQ) or replay capability (Kafka).
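The difference is easier to see in a toy model. The two classes below are deliberately simplified in-memory stand-ins, not client code for either broker; `WorkQueue`, `EventLog`, and their methods are hypothetical names for this illustration.

```python
class WorkQueue:
    """Smart-broker model: the broker forgets a message once it is acked."""

    def __init__(self):
        self._messages = []

    def publish(self, message):
        self._messages.append(message)

    def consume_and_ack(self):
        # Delivery plus ack removes the message for every consumer
        return self._messages.pop(0) if self._messages else None


class EventLog:
    """Smart-consumer model: append-only log, each group tracks its own offset."""

    def __init__(self):
        self._log = []
        self._offsets = {}

    def append(self, message):
        self._log.append(message)

    def poll(self, group, max_records=10):
        start = self._offsets.get(group, 0)
        records = self._log[start:start + max_records]
        self._offsets[group] = start + len(records)  # commit the new offset
        return records

    def seek(self, group, offset):
        self._offsets[group] = offset  # rewind for replay


queue = WorkQueue()
log = EventLog()
for event in ("order_created", "order_paid"):
    queue.publish(event)
    log.append(event)

queue_reads = [queue.consume_and_ack() for _ in range(3)]
billing = log.poll("billing")
analytics = log.poll("analytics")   # independent of billing's progress
log.seek("billing", 0)
billing_replay = log.poll("billing")
print(queue_reads, billing, analytics, billing_replay)
```

In the queue model the third read finds nothing because the data is gone; in the log model a second group, or the same group after a seek, reads the full history again.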

Production Insight

In practice, many production systems run both. Use RabbitMQ for task queues and RPC patterns where complex routing and simple consumer logic are valuable. Use Kafka for event sourcing, audit logs, and streaming analytics where replay and high throughput are critical. Running both introduces operational overhead but lets you use each broker for what it does best.

Key Takeaway

The smart broker (RabbitMQ) vs smart consumer (Kafka) distinction determines your system's routing flexibility, replay capabilities, and operational complexity. Choose based on your need for complex routing (RabbitMQ) versus replayable event streams (Kafka), not on perceived performance alone.

Message Delivery Trade-offs: At-Most-Once, At-Least-Once, Exactly-Once

Every message queue system must make a trade-off between reliability and performance. The three common delivery semantics — at-most-once, at-least-once, and exactly-once — represent different points on that trade-off curve. Understanding which one your system actually needs is more important than any feature comparison between brokers.

Delivery Semantics	Description	Best For	Example Config	Risk
At-most-once	Message delivered zero or one time — no retry	Monitoring/analytics where occasional data loss is acceptable	auto_ack=True, no confirmation	Message loss
At-least-once	Message delivered one or more times — retries guaranteed	Business transactions where data loss is unacceptable but duplicates can be handled	manual ack, requeue on transient failure, consumer idempotency	Duplicates
Exactly-once (end-to-end)	Message processed exactly once — no loss, no duplicates	Payment processing, financial systems where duplicates or loss have real-world cost	Idempotent consumer + transactional outbox (or Kafka transactions for Kafka-to-Kafka)	Throughput reduction

Trade-offs in practice:
- At-most-once is trivial for the broker but dangerous for the business. It is the default in many frameworks and must be deliberately avoided for any meaningful work.
- At-least-once is the most common production choice because it balances reliability with performance. The catch? You must make your consumer idempotent. This is not optional; it is a prerequisite for at-least-once delivery.
- Exactly-once is the gold standard but requires significant infrastructure: either an idempotent consumer with an outbox pattern, or Kafka's transactional API (which only guarantees end-to-end exactly-once within Kafka itself). The throughput cost is real but workload-dependent; a 20-40% reduction in peak throughput versus at-least-once is a rough planning figure, not a number to rely on without measuring.

Choosing: If you cannot tolerate data loss, use at-least-once with an idempotent consumer. If you also cannot tolerate duplicates, invest in exactly-once infrastructure. If both loss and duplicates are acceptable, you probably should not be using a message queue at all — a simple log file may suffice.

Exactly-Once Is Overrated — At-Least-Once with Idempotency Wins in Production

Most teams are better served by at-least-once with idempotency than by broker-level exactly-once for the majority of use cases. The reason is operational simplicity: an idempotent consumer handles duplicates via a database constraint (event_id unique index) without requiring any broker-level transaction coordination. Exactly-once adds latency, throughput reduction, and tight coupling between the broker and the consumer's data store. Reserve exactly-once for financial transactions where a single duplicate can cost real money, and use at-least-once everywhere else.
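A minimal sketch of the database-constraint approach follows, using SQLite in memory in place of the production database. The function name `handle_once`, the `orders` table, and the `record_order` callback are assumptions for this example. The key idea is that the idempotency insert and the business write share one transaction, so a duplicate delivery fails on the unique key and changes nothing.

```python
import sqlite3


def handle_once(conn, event_id, apply_side_effect):
    """Process an event at most once. Returns True if processed, False if duplicate.

    Caveat for this sketch: an IntegrityError raised by the side effect itself
    would also be reported as a duplicate.
    """
    try:
        with conn:  # one transaction: commit on success, roll back on any error
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
            apply_side_effect(conn)
        return True
    except sqlite3.IntegrityError:
        return False  # event_id already recorded: safe to ack and move on


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE orders (event_id TEXT, amount REAL)")


def record_order(c):
    c.execute("INSERT INTO orders (event_id, amount) VALUES (?, ?)", ("evt-1", 99.99))


# The broker delivers the same event three times (at-least-once)
results = [handle_once(conn, "evt-1", record_order) for _ in range(3)]
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(results, order_count)
```

This covers side effects that live in the same database. A call to an external system such as a payment gateway is outside the transaction and needs its own idempotency key on the remote side.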

Production Insight

Kafka's exactly-once semantics (enable.idempotence=true, a transactional producer, and consumers reading with isolation.level=read_committed) only provide end-to-end exactly-once when both the input and the output are in Kafka, i.e., reading from Kafka and writing back to Kafka. If your consumer writes to a relational database or any external system, the practical way to achieve exactly-once effects end-to-end is an idempotent consumer that checks event_id uniqueness at the database level.

Key Takeaway

The delivery trade-off is not a feature to be enabled; it is an architecture decision that affects every component. At-least-once with idempotency is the pragmatic choice for the large majority of production systems. Reserve exactly-once for the narrow case where a single duplicate has financial or compliance consequences.

DLQ Pattern Visual Workflow — Message Routing Through Failure Paths

The DLQ workflow diagram shows the three outcomes of consumer processing. The most dangerous path is the loop from a failed delivery straight back to the primary queue (requeue=true), which creates the poison message problem described in earlier sections. The correct transient error path goes through a retry queue with a time-to-live (TTL) that introduces backoff before re-delivery. The permanent error path sends the message directly to the DLQ, where it remains until an engineer inspects it, fixes the root cause, and selectively replays it.

Key routing rules:
- Never requeue a message that will never succeed: use requeue=false and route to DLQ.
- For transient errors, use a retry queue with exponential backoff TTL (e.g., 5s, 30s, 5min). This is a separate queue from the primary queue to avoid blocking the happy path.
- When a message's TTL expires, the retry queue's dead-letter exchange returns it to the primary queue for another attempt. After max retries (tracked via the x-death header count or an application-level counter), the consumer sends it to the DLQ instead.
- The DLQ must have an alert when depth > 0. The on-call engineer investigates and, after fixing the bug, replays the archived messages (not from the DLQ itself, but from the dlq_archive table). This ensures no re-insertion into the primary queue until the root cause is resolved.

Replaying from DLQ — Never Republish Directly from the DLQ Queue

When you need to replay a DLQ message, do not consume it from the DLQ and republish it immediately. The DLQ may contain messages that failed for the same bug that you just fixed — but other messages may have failed for different, unresolved reasons. Always archive DLQ messages to a database table, let an engineer examine the failure_reason, and only republish messages whose root cause has been fixed. Otherwise you risk re-injecting the same poison message into the primary queue and starting the cycle again.

Production Insight

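The retry queue pattern with TTL needs very little code. Configure the primary queue's dead-letter exchange to point to a retry queue, and the retry queue's dead-letter exchange to point back to the primary queue. The consumer calls basicNack with requeue=false, the message waits in the retry queue until its TTL expires, and RabbitMQ then returns it to the primary queue automatically. The TTL provides the backoff; exponential steps need one retry queue per delay. The one piece the consumer must own is the retry limit: read the count in the x-death header and, once retries are exhausted or the failure is permanent, publish the message to the DLQ and ack it rather than nacking again.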
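One way to picture the retry chain is as a set of queue arguments plus a small retry-limit check. The sketch below is hedged: the queue naming scheme (`.retry`, `.dlq`), the helper names, and the use of the default exchange (`""`, which routes by queue name) as the dead-letter exchange are assumptions for illustration. The `x-death` header layout, a list of entries with `queue`, `reason`, and `count`, follows what RabbitMQ attaches to dead-lettered messages.

```python
def retry_topology(primary: str, backoff_ms: int) -> dict:
    """Return {queue_name: arguments} for a primary -> retry -> primary loop."""
    retry, dlq = f"{primary}.retry", f"{primary}.dlq"
    return {
        # Rejected messages leave the primary queue for the retry queue
        primary: {"x-dead-letter-exchange": "",
                  "x-dead-letter-routing-key": retry},
        # No consumers here: messages expire after the TTL and go back
        retry: {"x-message-ttl": backoff_ms,
                "x-dead-letter-exchange": "",
                "x-dead-letter-routing-key": primary},
        # Terminal queue: nothing leaves it automatically
        dlq: {},
    }


def rejection_count(headers, queue: str) -> int:
    """How many times `queue` has rejected this message, per the x-death header."""
    for entry in (headers or {}).get("x-death", []):
        if entry.get("queue") == queue and entry.get("reason") == "rejected":
            return int(entry.get("count", 0))
    return 0


def next_hop(headers, primary: str, max_retries: int = 3) -> str:
    """Decide where a failed message goes next.

    '<primary>.retry' means: nack with requeue=False and let the broker route it.
    '<primary>.dlq' means: publish to the DLQ yourself, then ack the original.
    """
    if rejection_count(headers, primary) >= max_retries:
        return f"{primary}.dlq"
    return f"{primary}.retry"


topology = retry_topology("order_processing", backoff_ms=5000)
first_failure = next_hop(None, "order_processing")
exhausted = next_hop(
    {"x-death": [{"queue": "order_processing", "reason": "rejected", "count": 3}]},
    "order_processing",
)
print(first_failure, exhausted)
```

A single retry queue gives a fixed delay. For the 5s, 30s, 5min schedule mentioned earlier, one option is a retry queue per delay, chosen from the rejection count.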

Key Takeaway

The DLQ pattern is a three-stage routing pipeline: primary queue for processing, retry queue with backoff for transient failures, and DLQ for permanent failures. The broker handles the routing automatically if configured correctly. Alert on DLQ depth > 0, archive to a database before replaying, and never use requeue=true for failed messages.

[Figure: DLQ Pattern Workflow]

Throughput vs Latency Benchmarks for Top Message Brokers

Choosing a message broker often comes down to throughput and latency numbers, but these benchmarks vary wildly with configuration, hardware, and message size. The numbers below are from controlled tests on modern cloud hardware (3-node clusters, 8 vCPU / 32 GB RAM, NVMe SSDs, 1 KB message payload, published and consumed in the same region). They represent peak sustainable throughput and median end-to-end latency under moderate load (50% of max throughput). Use them as directional guidance, not as absolute performance guarantees.

Broker	Max Throughput (msg/s)	Median Latency (ms)	P99 Latency (ms)	Strength	Weakness
RabbitMQ 3.13	~50,000	5–15	50–100	Simple setup, excellent routing, low latency at moderate throughput	Throughput drops with durable messaging and clustered setups; not designed for streaming
Apache Kafka 3.7	~1,500,000	10–30	100–200	Extreme throughput, durable log, replay capabilities	Higher operational complexity; latency increases under heavy producer load
Apache Pulsar 3.0	~1,000,000	5–10	30–60	Low latency, geo-replication, separation of serving/storage	Newer ecosystem; less community adoption than Kafka
Amazon SQS (Standard)	Nearly unlimited (auto-scaling)	50–200	500–1000	Fully managed, no ops overhead, elastic scale	Higher latency, best-effort ordering only (FIFO queues order strictly but default to 300 msg/s per API action without batching, 3,000 with batching)

Key observations:
- RabbitMQ delivers low latency at moderate throughput. If your system processes tens of thousands of messages per second and cares about low-millisecond delivery, RabbitMQ is a strong choice.
- Kafka is designed for throughput, not latency. Producer batching, replication, and consumer fetch settings typically add tens of milliseconds end to end. For time-sensitive operations (e.g., real-time trading), Kafka's latency may be unacceptable.
- Pulsar offers a compelling middle ground: high throughput with low latency, thanks to its separate serving and storage layers. However, its ecosystem is smaller, which affects debug tooling and hiring.
- SQS is for teams that want zero ops overhead and can tolerate variable latency. The standard queue's at-least-once delivery with potential duplicates is acceptable for many use cases.

Recommendation: Run your own benchmark with your message size, payload format, and consumption pattern. The numbers above assume 1 KB messages. Once a broker is bound by network or disk bandwidth, doubling the message size roughly halves the message rate, because bytes per second stay about constant. For a realistic test, use your actual message schema and simulate your production traffic profile.

Benchmark With Your Data, Not Generic Numbers

The throughput numbers in this section are directional — they come from standardised tests with 1 KB messages on homogeneous hardware. Your mileage will vary significantly based on message size (throughput is inversely proportional), consumer processing time, network latency between services, and broker configuration choices like replication factor. Always run a benchmark with your own payloads and traffic patterns before making a final broker selection. The cost of picking the wrong broker is a painful migration six months into production.

Production Insight

In production, the bottleneck is rarely the broker; it is the consumer's processing logic or the network between services. A well-tuned RabbitMQ setup can deliver thousands of messages per second to a consumer instance, but with prefetch=1 only one message is in flight at a time, so if each message takes 50ms to process (DB write + external API call), one consumer handles only 1000 / 50 = 20 messages per second. The broker is not the limiting factor; the consumer's downstream dependencies are. When benchmarking, always include the full processing pipeline, not just publish-and-ack latency.
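The arithmetic generalises into a quick capacity estimate. The helper names below are hypothetical, and the model assumes processing time is constant and that each consumer works on `concurrency` messages at once (1 when prefetch is 1 and the handler is single-threaded).

```python
import math


def per_consumer_rate(processing_ms: float, concurrency: int = 1) -> float:
    """Messages per second one consumer instance can sustain."""
    return concurrency * 1000.0 / processing_ms


def consumers_needed(target_rate: float, processing_ms: float,
                     concurrency: int = 1) -> int:
    """Instances required to keep up with a target message rate."""
    return math.ceil(target_rate / per_consumer_rate(processing_ms, concurrency))


print(per_consumer_rate(50))          # 20.0 messages/second
print(consumers_needed(10_000, 50))   # 500 instances
```

At 50ms per message, sustaining 10,000 messages per second takes about 500 single-threaded consumers, which is why shortening the processing path usually pays off more than tuning the broker.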

Key Takeaway

Throughput and latency benchmarks are directional indicators, not absolute guarantees. RabbitMQ excels at low latency with moderate throughput; Kafka dominates at extreme throughput with higher latency; Pulsar offers a balance; SQS trades latency for zero ops. Benchmark with your actual message size and processing pipeline before committing to a broker, and remember that consumer processing time — not broker speed — is usually the real bottleneck.

Point-to-Point vs Pub-Sub — Choosing the Right Topology

Your broker topology determines how messages flow and scale. Point-to-point queues work like a shared workbench: each message goes to exactly one consumer. Use this when each task must be handled by a single worker, such as payment processing or order fulfillment. Publish-subscribe topics broadcast every message to all subscribers. Use this when multiple independent systems need the same data, like inventory updates feeding warehouse, analytics, and notification services simultaneously. Mixing them up causes chaos: pub-sub for job queues gives you duplicate work, point-to-point for events starves downstream systems. Map your intent first. If you need each task done by one worker, choose point-to-point. If you need broadcast with fan-out, choose pub-sub. The broker doesn't care about your mistake until pager duty calls.

topology_choice.py (Python)

# io.thecodeforge

# Point-to-point: RabbitMQ queue (each message goes to exactly one consumer)
import pika
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='payment_orders', durable=True)
channel.basic_publish(
    exchange='',
    routing_key='payment_orders',
    body='order_1234',
    properties=pika.BasicProperties(delivery_mode=2)  # persistent
)
connection.close()

# Publish-subscribe: Redis pub/sub (every current subscriber gets a copy)
import redis
r = redis.Redis()

# Subscribers must be connected BEFORE the publish:
# Redis pub/sub does not store messages for subscribers that join later

# Consumer 1: warehouse
pubsub = r.pubsub()
pubsub.subscribe('order_events')

# Consumer 2: analytics (same channel)
pubsub2 = r.pubsub()
pubsub2.subscribe('order_events')

r.publish('order_events', 'new_order_1234')

Output

Point-to-point: one consumer gets 'order_1234'. Pub-sub: both warehouse and analytics get it.

Production Trap:

Don't rely on fire-and-forget pub-sub, such as Redis pub/sub, for delivery guarantees. If a consumer is disconnected when a message is published, that message is gone forever for that consumer. You need durable subscriptions or a point-to-point queue.

Key Takeaway

Match topology to behavior: point-to-point for work, pub-sub for events.

Message Prioritization — Don't Let Low-Value Jobs Starve Critical Work

Your queue will fill with junk. Promotional emails, outdated cache invalidation, analytics logs: they'll sit next to password reset requests and fraud alerts. Without prioritization, your consumers process in FIFO order, meaning a thousand low-priority tasks can delay a single high-priority one by minutes. Implement priority queues: RabbitMQ supports a per-message priority on queues declared with x-max-priority (up to 255); Kafka has no native message priority, so the usual workaround is a separate topic per priority level. But here's the catch: every priority level costs the broker memory and CPU, because RabbitMQ maintains separate internal structures per level, so throughput suffers as the level count grows. Stick to three tiers: critical (password resets, charge failures), normal (order confirmations), background (analytics). Anything beyond five levels is premature optimization that burns CPU. Prioritization isn't about fairness. It's about preventing the mundane from killing the mission-critical.

PriorityPublisher.java (Java)

// io.thecodeforge
import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.ConnectionFactory;
import java.util.Map;

public class PriorityPublisher {
    public static void main(String[] args) {
        try (var connection = new ConnectionFactory().newConnection();
             var channel = connection.createChannel()) {
            // Queue must declare max priority
            channel.queueDeclare("critical_work", true, false, false,
                Map.of("x-max-priority", 10));

            // Publish with priority 9 (critical)
            var criticalProps = new AMQP.BasicProperties.Builder()
                .priority(9).deliveryMode(2).build();
            channel.basicPublish("", "critical_work",
                criticalProps, "fraud_alert_9987".getBytes());

            // Publish with priority 1 (background)
            var bgProps = new AMQP.BasicProperties.Builder()
                .priority(1).deliveryMode(2).build();
            channel.basicPublish("", "critical_work",
                bgProps, "cache_warm_332".getBytes());
        } catch (Exception e) {
            System.err.println("Priority publish failed: " + e.getMessage());
        }
    }
}

Output

RabbitMQ delivers fraud_alert_9987 before cache_warm_332 whenever both are waiting in the queue, even if the cache message arrived first.

Performance Note:

Priority queues add latency. Test your broker under load. RabbitMQ can drop to roughly 70% of its baseline throughput with 10 priority levels compared with a queue that has none. Use sparingly.

Key Takeaway

Three priority levels are enough. More makes your broker choke.
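The three-tier advice can be illustrated without a broker. The sketch below is only a model: the `Tier` enum, its numeric values, and the message names are assumptions made for illustration, and `java.util.PriorityQueue` stands in for the broker's priority queue. A sequence number keeps arrival order within a tier.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical three-tier scheme. The numeric value is what you would pass to
// BasicProperties.Builder.priority(...) on a queue declared with x-max-priority >= 2.
enum Tier {
    BACKGROUND(0), NORMAL(1), CRITICAL(2);

    final int amqpPriority;

    Tier(int amqpPriority) { this.amqpPriority = amqpPriority; }
}

class ThreeTierDemo {
    // A waiting message: its tier, its arrival sequence number, and its payload.
    record Queued(Tier tier, long seq, String body) {}

    // Returns the order in which an in-memory priority queue hands out the messages:
    // highest tier first, arrival order within a tier.
    static List<String> drainOrder(List<Queued> arrivals) {
        var queue = new PriorityQueue<Queued>(
            Comparator.comparingInt((Queued q) -> -q.tier().amqpPriority)
                      .thenComparingLong(Queued::seq));
        queue.addAll(arrivals);
        var out = new ArrayList<String>();
        while (!queue.isEmpty()) {
            out.add(queue.poll().body());
        }
        return out;
    }

    public static void main(String[] args) {
        var arrivals = List.of(
            new Queued(Tier.BACKGROUND, 1, "cache_warm_332"),
            new Queued(Tier.NORMAL, 2, "order_confirmation_71"),
            new Queued(Tier.CRITICAL, 3, "fraud_alert_9987"),
            new Queued(Tier.NORMAL, 4, "order_confirmation_72"));
        // prints [fraud_alert_9987, order_confirmation_71, order_confirmation_72, cache_warm_332]
        System.out.println(drainOrder(arrivals));
    }
}
```

Keeping the tiers in one enum means publishers never pick raw numbers, which makes it harder for the number of levels to creep upward over time.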

● Production incident · POST-MORTEM · severity: high

The Charged-But-No-Order Incident: auto-ack Deleted the Payment Message Before Processing Completed

Symptom

Customer support receives a spike in 'charged but no order created' tickets within hours of a deployment. Payment gateway logs show successful charges for the affected customers. Application logs show the consumer received the messages but have no log entries for the order creation step. The message queue management dashboard shows zero unacknowledged messages in the queue — which looks healthy but is actually the problem.

Assumption

The payment gateway must have returned an error that the code did not handle properly. The team suspects the message was retried and a second attempt created an order without a corresponding charge, or that the message was dropped in network transit. Two engineers spend four hours investigating the gateway integration before anyone looks at the ack configuration.

Root cause

The consumer was configured with auto_ack=True. RabbitMQ deleted every message the instant it was delivered to the consumer channel, before the consumer could confirm that the payment gateway call AND the database write had both succeeded. The consumer process ran out of memory during the database connection pool exhaustion that followed a slow migration deployed the same day, and the OOM killer terminated it after the gateway call but before the order was written. The message was already gone from the broker. No retry, no DLQ entry, no trace anywhere in the system except a successful gateway log and a missing order.

Fix

1. Switched to manual acknowledgement: channel.basicConsume(QUEUE_NAME, false, deliverCallback, cancelCallback). The false is the critical parameter.
2. Moved the channel.basicAck() call to after both the payment gateway response is received AND the database transaction commits successfully.
3. On any exception, call channel.basicNack(deliveryTag, false, false) to route the message to the DLQ rather than requeuing it.
4. Added an idempotency check using the event_id against a processed_events table before calling the gateway. If the event_id already exists, skip processing and ack.
5. Set a DLQ depth alert at depth greater than 0 that pages the on-call engineer immediately.
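The ack placement described in steps 2 and 3 can be sketched without a broker. In the snippet below, `Acker`, `OrderProcessor`, and `RecordingAcker` are hypothetical stand-ins introduced for illustration; in a real consumer the two `Acker` methods would correspond to `channel.basicAck(deliveryTag, false)` and `channel.basicNack(deliveryTag, false, false)`.

```java
import java.util.ArrayList;
import java.util.List;

class AckAfterCommitDemo {
    // Hypothetical stand-in for the two Channel calls named in the fix.
    interface Acker {
        void ack(long deliveryTag);
        void nackToDlq(long deliveryTag);
    }

    // Hypothetical business step: charge the gateway and commit the order row.
    interface OrderProcessor {
        void chargeAndCommit(String body) throws Exception;
    }

    // Acknowledge only after the whole business transaction has succeeded.
    static void handleDelivery(long deliveryTag, String body,
                               OrderProcessor processor, Acker acker) {
        try {
            processor.chargeAndCommit(body);
            acker.ack(deliveryTag);        // reached only after the commit returned
        } catch (Exception e) {
            acker.nackToDlq(deliveryTag);  // requeue=false: dead-letter instead of looping
        }
    }

    // Records ack/nack calls so the behaviour can be inspected without a broker.
    static final class RecordingAcker implements Acker {
        final List<String> calls = new ArrayList<>();
        @Override public void ack(long tag) { calls.add("ack:" + tag); }
        @Override public void nackToDlq(long tag) { calls.add("nack:" + tag); }
    }

    public static void main(String[] args) {
        var acker = new RecordingAcker();
        handleDelivery(1, "payment_ok", payload -> { }, acker);
        handleDelivery(2, "payment_db_down",
            payload -> { throw new IllegalStateException("commit failed"); }, acker);
        System.out.println(acker.calls); // prints [ack:1, nack:2]
    }
}
```

If the process is killed after the commit but before the ack, neither call is made and the broker redelivers the message, which is why the idempotency check in step 4 is still required.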

Key lesson

Auto-ack is a data loss mechanism disguised as a convenience feature — it has no legitimate use case in any consumer that performs meaningful work
Manual ack must happen only after the entire business transaction completes — after the database commit, not after the message is received and certainly not before calling external services
Every consumer that calls a payment gateway or any non-idempotent external system must check an event_id deduplication table before acting — at-least-once delivery is guaranteed, which means duplicates will happen
A DLQ without depth alerting is a graveyard — set the threshold at greater than zero, not at some arbitrary number you will never reach until something is seriously broken
OOM kills are silent — the consumer can die between a gateway call and a database commit with zero application log output, which is why the ack position matters more than good exception handling

Production debug guide · Symptom to Action mapping for common MQ failures in production · 5 entries

Symptom · 01

Messages disappearing without a trace — no errors in consumer logs, no messages in DLQ, queue shows zero depth

→

Fix

Check if auto_ack is enabled on the consumer. If true, the broker deletes messages on delivery regardless of processing outcome and there will be no record of the failure anywhere. Switch to manual ack immediately and redeploy. Then audit whether any messages were lost and need manual recovery.

Symptom · 02

Queue depth growing steadily while consumers report low CPU and appear to be running normally

→

Fix

Check prefetch_count. If unset, RabbitMQ uses an unbounded prefetch and may be delivering hundreds of messages to one consumer while others sit idle with nothing to process. Set channel.basicQos(1) to enforce fair dispatch. Also check for slow downstream dependencies — a consumer that spends 30 seconds waiting on a database call can appear healthy while the queue backs up behind it.

Symptom · 03

Same message processed ten or more times in rapid succession — duplicate charges, duplicate emails, duplicate inventory decrements

→

Fix

Check if requeue=True is being used on nack. A poison message gets requeued and redelivered immediately in an infinite loop that looks like high throughput but is actually a single message being hammered. Change to requeue=False to route to DLQ. Also verify the consumer has idempotency checks in place — even after fixing the nack, at-least-once delivery means duplicates will arrive eventually through normal retry flows.

Symptom · 04

Consumer connection drops silently — no crash, no exception thrown, messages simply stop flowing and the consumer appears healthy

→

Fix

Check TCP keepalive settings and heartbeat configuration. RabbitMQ defaults to a 60-second heartbeat timeout. If a network device drops the connection without sending a TCP RST, neither side notices until heartbeats are missed, so the consumer can sit on a dead connection for a minute or more, and indefinitely if heartbeats are disabled. Lower the heartbeat to 10 to 30 seconds and add connection recovery listeners. Use the AMQP client's built-in automatic recovery where available.

Symptom · 05

DLQ depth spikes immediately after a deployment — dozens of messages arrive in the DLQ within the first few minutes

→

Fix

Check for schema changes or serialization version mismatches between the new consumer and messages already in the queue. A new consumer that cannot deserialize the old message format will nack every message in the queue. Roll back the consumer deployment first, drain the DLQ manually while examining the message payloads, then redeploy with backward-compatible deserialization before cutting over fully.
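Backward-compatible deserialization usually means a tolerant reader: accept the old shape, default what is missing, and ignore what is unknown. The sketch below assumes the payload has already been decoded into string fields; the `PaymentEvent` record, the field names, and the "USD" default are all hypothetical.

```java
import java.util.Map;

class TolerantReaderDemo {
    // Hypothetical event shape; the new consumer version added "currency".
    record PaymentEvent(String eventId, long amountCents, String currency) {}

    // Accepts both the old payload (no "currency" field) and the new one.
    // Unknown extra fields are ignored rather than rejected.
    static PaymentEvent fromFields(Map<String, String> fields) {
        String eventId = fields.get("event_id");
        String amount = fields.get("amount_cents");
        if (eventId == null || amount == null) {
            // Genuinely malformed: let the caller nack this one to the DLQ.
            throw new IllegalArgumentException("missing required field");
        }
        // Assumed default for messages published before the field existed.
        String currency = fields.getOrDefault("currency", "USD");
        return new PaymentEvent(eventId, Long.parseLong(amount), currency);
    }

    public static void main(String[] args) {
        var oldFormat = Map.of("event_id", "evt-1", "amount_cents", "1999");
        var newFormat = Map.of("event_id", "evt-2", "amount_cents", "500", "currency", "EUR");
        System.out.println(fromFields(oldFormat));
        System.out.println(fromFields(newFormat));
    }
}
```

With a reader like this, messages already sitting in the queue in the old format are processed instead of being nacked in bulk right after the deployment.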

★ Message Queue Quick Debug Cheat Sheet · Immediate diagnostic commands when message queue processing breaks in production.

Queue depth growing: consumers not draining despite appearing to run

Immediate action

Check consumer count, prefetch settings, and unacknowledged message count on the queue

Commands

docker exec forge-rabbitmq rabbitmqctl list_consumers

docker exec forge-rabbitmq rabbitmqctl list_queues name messages messages_unacknowledged consumers

Fix now

If consumers=0, the consumer process crashed — restart it and check logs for the crash reason. If messages_unacknowledged is high and growing, a consumer is stuck holding messages it received but cannot process — check for deadlocks, slow DB queries, or blocked external calls.

Poison message looping: same message reprocessed hundreds of times, consumer CPU spikes

Messages published successfully (producer logs confirm) but consumer receives nothing

RabbitMQ vs Apache Kafka — Head-to-Head Comparison

Feature / Aspect	RabbitMQ (Queue-based)	Apache Kafka (Log-based)
Delivery model	Push — broker pushes messages to consumers as they become available	Pull — consumers poll the broker at their own pace, which naturally provides backpressure
Message retention	Deleted after ACK — once processed and acknowledged, the message is gone	Retained for a configurable period (hours, days, weeks) regardless of consumer acknowledgement
Message replay	Not supported on classic or quorum queues: once a message is acked it cannot be replayed without re-publishing (RabbitMQ streams are the exception)	Core feature: any consumer group can seek to an earlier offset and replay everything still within retention
Throughput ceiling	Approximately 50,000 messages per second per node under ideal conditions	Millions of messages per second per node — designed for extreme throughput at the cost of lower-level control
Ordering guarantee	FIFO per queue — messages in a single queue are delivered in order	FIFO per partition — ordering is guaranteed within a partition; global ordering requires a single partition at the cost of throughput
Consumer isolation	Competing consumers share one queue — each message delivered to exactly one consumer	Each consumer group maintains independent offsets — truly independent fan-out with no interference between groups
Best for	Task queues, RPC patterns, complex routing logic with multiple exchanges, work distribution	Event streaming, audit logs, data pipelines, event sourcing, analytics, and any use case requiring replay
Operational complexity	Low — single binary, easy setup, excellent management UI out of the box	Higher — requires partition tuning, consumer group coordination, and offset management; KRaft mode removes ZooKeeper in recent versions
Dead-letter support	First-class feature via x-dead-letter-exchange queue argument — automatic routing on nack or TTL expiry	Convention-based retry and dead-letter topic pattern — requires explicit consumer logic to route to retry or DLT topics
Message size practical limit	16 MiB default in RabbitMQ 4.x (128 MiB in 3.x), configurable; large messages should be stored externally with the queue carrying a reference	1MB default, configurable; same advice applies: use the topic for metadata and a reference, not for large payloads
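The retention and consumer-isolation rows are the heart of the comparison, and they can be modelled in a few lines. The sketch below is a toy in-memory model, not client code for either broker; `WorkQueue`, `EventLog`, and the group names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

class QueueVsLogDemo {
    // Queue semantics: a message is removed when it is taken, so only one consumer sees it.
    static final class WorkQueue {
        private final Queue<String> messages = new ArrayDeque<>();
        void publish(String m) { messages.add(m); }
        String take() { return messages.poll(); } // null when empty
    }

    // Log semantics: entries stay in place; each group only advances its own offset.
    static final class EventLog {
        private final List<String> entries = new ArrayList<>();
        private final Map<String, Integer> offsets = new HashMap<>();
        void publish(String m) { entries.add(m); }
        String poll(String group) {
            int offset = offsets.getOrDefault(group, 0);
            if (offset >= entries.size()) return null; // group is caught up
            offsets.put(group, offset + 1);
            return entries.get(offset);
        }
        void seekToBeginning(String group) { offsets.put(group, 0); }
    }

    public static void main(String[] args) {
        var queue = new WorkQueue();
        queue.publish("order-1");
        System.out.println(queue.take()); // order-1
        System.out.println(queue.take()); // null: the message is gone

        var log = new EventLog();
        log.publish("order-1");
        System.out.println(log.poll("billing")); // order-1
        System.out.println(log.poll("audit"));   // order-1: independent offset
        log.seekToBeginning("billing");
        System.out.println(log.poll("billing")); // order-1 again: replay
    }
}
```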

Key takeaways

Producers fire-and-forget

They hand the message to the broker and return immediately. This latency independence between services is the core architectural benefit, and it is why message queues are the first tool experienced engineers reach for when synchronous coupling creates reliability problems.

The acknowledgement model is your delivery guarantee

Only ack after all side effects succeed and are committed to durable storage. Use nack with requeue=False for permanent failures to route to the DLQ. Auto-ack is a data loss mechanism with no legitimate use in any consumer that does real work.

Queue means one consumer gets the message

This is correct for tasks. Topic or fanout means every subscriber gets the message, which is correct for domain events. Choose based on whether multiple downstream services need to react to the same event, not based on expected volume.

A DLQ without depth alerting is deferred data loss

Set the alert threshold at one message, not some arbitrary number that gives you false confidence. Treat DLQ growth exactly like a 500 error rate spike: investigate immediately, fix the root cause, and replay selectively after the fix is deployed.

Common mistakes to avoid

5 patterns

Using auto-ack in production consumers

Symptom

Messages disappear silently with no trace in logs or DLQ — the broker deletes them on delivery before any processing occurs. Consumer crashes or OOM kills mid-processing lose messages permanently and the queue dashboard shows zero unacknowledged messages, which looks healthy.

Fix

Set autoAck=false in every consumer and call basicAck only after all business logic succeeds and all side effects are committed to durable storage. This is not a performance optimisation to revisit later — it is table stakes for any consumer doing real work.

Not setting prefetch count, relying on default unbounded delivery

Symptom

One fast consumer accumulates hundreds of unacknowledged messages while other consumer instances sit idle with nothing to process. When that one consumer crashes, every one of its in-flight messages has to be redelivered at once (or is lost outright under auto-ack). The queue appears to be processing but throughput is effectively single-threaded.

Fix

Call channel.basicQos(1) before starting consumption. This tells the broker to deliver at most one unacknowledged message to each consumer at a time, enforcing fair dispatch across all running instances and making horizontal scaling actually effective.
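The effect of the prefetch limit is easiest to see with numbers. The snippet below is a deliberately simplified model, not broker code: it assumes consumers subscribe one after another to a queue with an existing backlog, and that the broker pushes ready messages to each new subscriber until its prefetch limit is reached. A prefetch of 0 models "no limit".

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PrefetchDemo {
    // Returns how many unacknowledged messages each consumer holds right after
    // subscribing, plus how many messages are still ready in the queue.
    static Map<String, Integer> inFlightAfterSubscribe(int backlog,
                                                       List<String> consumers,
                                                       int prefetch) {
        var inFlight = new LinkedHashMap<String, Integer>();
        int ready = backlog;
        for (String consumer : consumers) {
            int granted = prefetch <= 0 ? ready : Math.min(prefetch, ready);
            inFlight.put(consumer, granted);
            ready -= granted;
        }
        inFlight.put("ready", ready);
        return inFlight;
    }

    public static void main(String[] args) {
        var consumers = List.of("worker-a", "worker-b", "worker-c");
        // Unlimited prefetch: the first subscriber swallows the whole backlog.
        System.out.println(inFlightAfterSubscribe(100, consumers, 0));
        // prints {worker-a=100, worker-b=0, worker-c=0, ready=0}

        // Prefetch of 1: each worker holds one message, the rest wait in the queue.
        System.out.println(inFlightAfterSubscribe(100, consumers, 1));
        // prints {worker-a=1, worker-b=1, worker-c=1, ready=97}
    }
}
```

In the second case the 97 ready messages go to whichever worker acks first, which is what makes adding consumers actually increase throughput.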

Skipping idempotency checks because duplicates 'rarely happen'

Symptom

At-least-once delivery guarantees duplicates over time — they are not edge cases. Users get charged twice, confirmation emails arrive multiple times, inventory decrements twice for the same order. The incidents trace back to a consumer that processed the same message twice during a network blip or consumer restart.

Fix

Every consumer that performs non-idempotent operations must check the event_id against a processed_events table before acting. If the event_id already exists, ack the message and return — this is the correct response to a duplicate, not an error.
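A minimal sketch of that check, with an in-memory `Set` standing in for the processed_events table. The class and method names are hypothetical, and the counter is a placeholder for the real gateway call; in a real system the processed_events insert would normally share a database transaction with the business write, which is an assumption this sketch does not model.

```java
import java.util.HashSet;
import java.util.Set;

class IdempotentConsumerDemo {
    // In-memory stand-in for the processed_events table.
    private final Set<String> processedEvents = new HashSet<>();
    private int chargesMade = 0;

    // Returns true if the charge ran, false if the event was a duplicate.
    // Either way the caller should ack the message afterwards.
    boolean handle(String eventId) {
        if (processedEvents.contains(eventId)) {
            return false;              // duplicate delivery: skip the side effect
        }
        chargesMade++;                 // placeholder for the non-idempotent gateway call
        processedEvents.add(eventId);  // record only after the side effect succeeded
        return true;
    }

    int chargesMade() { return chargesMade; }

    public static void main(String[] args) {
        var consumer = new IdempotentConsumerDemo();
        System.out.println(consumer.handle("evt-42")); // true: first delivery
        System.out.println(consumer.handle("evt-42")); // false: redelivery
        System.out.println(consumer.chargesMade());    // 1
    }
}
```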

Requeuing poison messages with requeue=True on nack

Symptom

A permanently broken message fails, gets requeued immediately, fails again, repeats thousands of times per second. The consumer's CPU spikes, no other messages in the queue can be processed, and the queue dashboard shows the same messages_unacknowledged count cycling while messages_ready stays frozen.

Fix

Always use requeue=False on nack for permanent failures. Implement a retry queue with exponential backoff TTL for transient failures, and route to the DLQ after max retries. Never put a message back on the main queue without a delay — immediate requeue of a failing message is always wrong.
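The routing decision described above can be written as a small pure function. This is a sketch under stated assumptions: the `Route` and `Decision` types, the doubling schedule, and the parameter names are illustrative and not part of any broker API.

```java
class RetryRoutingDemo {
    enum Route { RETRY_QUEUE, DEAD_LETTER }

    // Where the failed message should go and, for retries, how long it should wait.
    record Decision(Route route, long delayMillis) {}

    // Hypothetical policy: delay = baseDelay * 2^attempts, capped at maxDelay;
    // permanent failures and exhausted retries go straight to the DLQ.
    static Decision decide(int attemptsSoFar, boolean permanentFailure,
                           int maxRetries, long baseDelayMillis, long maxDelayMillis) {
        if (permanentFailure || attemptsSoFar >= maxRetries) {
            return new Decision(Route.DEAD_LETTER, 0);
        }
        // The shift amount is capped so large attempt counts cannot overflow the long.
        long delay = baseDelayMillis << Math.min(attemptsSoFar, 20);
        return new Decision(Route.RETRY_QUEUE, Math.min(delay, maxDelayMillis));
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt <= 3; attempt++) {
            System.out.println(attempt + " -> " + decide(attempt, false, 3, 1_000, 30_000));
        }
        // Attempts 0, 1, 2 are retried after 1000, 2000, 4000 ms; attempt 3 is dead-lettered.
    }
}
```

The returned delay would typically become the TTL of the message on the retry queue, so the message never returns to the main queue without a pause.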

Treating the DLQ as a quiet background feature with no alerting

Symptom

DLQ accumulates hundreds of failed messages over days or weeks while everyone assumes the system is healthy. By the time someone checks, the root cause requires forensic investigation across days of logs, and time-sensitive messages representing real customer failures are permanently unrecoverable.

Fix

Set an alert on DLQ message count at a threshold of one — not ten, not one hundred, one. A single message in the DLQ represents a real failure happening right now. Page the on-call engineer, treat it like a 500 error spike, and investigate before more messages arrive.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

How do you handle backpressure in a system where the producer is signifi...

Q02 · SENIOR

Explain the difference between at-least-once and exactly-once delivery s...

Q03 · SENIOR

Design a distributed retry pattern. How do you avoid thundering herd pro...

Q04 · JUNIOR

What is the difference between a message queue and an event streaming pl...

Q01 of 04 · SENIOR

How do you handle backpressure in a system where the producer is significantly faster than the consumer?

ANSWER

I implement three coordinated layers, not just one setting. First, prefetch count: setting basicQos(1) on each consumer prevents the broker from flooding any single consumer and forces fair distribution, giving you meaningful horizontal scaling. Second, queue length limits: configuring max-length on the queue with an overflow policy prevents unbounded memory growth on the broker. With reject-publish, the broker nacks new publishes once the queue is full, which tells a producer using publisher confirms that the system is under load and can trigger producer-side rate limiting. With drop-head, the oldest messages are discarded and the producer gets no signal, so it only suits data you can afford to lose. Third, consumer autoscaling: monitoring queue depth as a Kubernetes HPA metric or an AWS Auto Scaling trigger lets you spin up additional consumer instances when depth exceeds a threshold. The key insight is that prefetch and queue limits are backpressure valves that protect the system under temporary overload, while autoscaling is what actually resolves the underlying throughput mismatch when the overload is sustained.
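The queue-length layer can be illustrated with a bounded in-memory queue. This is an analogy rather than broker code: `ArrayBlockingQueue.offer` returning false plays the role of a rejected publish, and the method name and message names are made up for the example.

```java
import java.util.concurrent.ArrayBlockingQueue;

class BoundedQueueDemo {
    // Publishes `total` messages into a queue capped at `capacity` while no consumer
    // is draining it. Returns {accepted, rejected}.
    static int[] publishBurst(int total, int capacity) {
        var queue = new ArrayBlockingQueue<String>(capacity);
        int accepted = 0;
        int rejected = 0;
        for (int i = 0; i < total; i++) {
            // offer() does not block: it returns false when the queue is full,
            // which is the producer's cue to slow down or shed load.
            if (queue.offer("msg-" + i)) {
                accepted++;
            } else {
                rejected++;
            }
        }
        return new int[] { accepted, rejected };
    }

    public static void main(String[] args) {
        int[] result = publishBurst(10, 4);
        System.out.println("accepted=" + result[0] + " rejected=" + result[1]);
        // prints accepted=4 rejected=6
    }
}
```

A producer that counts rejections like this has a concrete signal to back off on, instead of discovering the overload from broker memory alarms.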

FAQ · 4 QUESTIONS

Frequently Asked Questions

What is the difference between a message queue and an event streaming platform like Kafka?

What happens if a consumer crashes before acknowledging a message?

Can I use a database as a message queue?

What is the Outbox pattern and when should I use it?

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Notes here come from systems that actually shipped.

May 24, 2026

last updated
