Junior 12 min · March 05, 2026

Message Queue Components — The Auto-ack Data Loss Gotcha

auto_ack=True deletes messages before processing completes, causing double charges from payment gateway.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • A message queue decouples services by letting a producer drop a message and move on — the consumer picks it up whenever it's ready, regardless of whether the producer is still running
  • Producers serialize and publish; brokers store and route; consumers process and acknowledge — these three form the core pipeline and each has independent failure modes
  • Ack after success, Nack with requeue=False to route poison messages to the DLQ — never requeue a message that will never succeed
  • RabbitMQ deletes messages after ACK (work queue model); Kafka retains them as an immutable, time-ordered log (streaming model) — the architectural difference is significant, not cosmetic
  • Prefetch count = 1 prevents one fast consumer from hoarding all in-flight messages while slower consumers sit idle with nothing to process
  • The biggest trap: auto-ack means the broker deletes the message the instant it is delivered, before your code does anything — a crash mid-processing loses it permanently with zero trace
Plain-English First

Imagine a busy restaurant kitchen. Waiters (producers) do not cook food themselves — they write orders on tickets and pin them to a rotating wheel (the broker and queue). Cooks (consumers) grab tickets when they are free, prepare the dish, and discard the ticket when it is done. The waiter never stands at the stove waiting. The cook never chases waiters around the dining room. If the kitchen gets backed up, the wheel just holds more tickets without slowing down the front of house. That rotating ticket wheel is a message queue — it decouples the people taking orders from the people fulfilling them, so both sides work at their own pace and neither blocks the other.

Modern software systems are rarely a single monolith running by itself. They are a constellation of services — payments, notifications, inventory, analytics — all needing to communicate. When service A calls service B synchronously and B is slow, A slows down too. When B crashes, A crashes or returns an error to the user. That tight coupling is a hidden time bomb in nearly every high-traffic system, and message queues are the first tool experienced engineers reach for.

A message queue solves one fundamental problem: it lets two services communicate without either needing to know the other's current availability or processing speed. The sender drops a message into the queue and moves on immediately. The receiver picks it up whenever it is ready. This tiny architectural shift unlocks fault tolerance, horizontal scalability, and independent deployability — the three properties every distributed system must eventually achieve.

By the end of this guide you will understand every moving part inside a message queue system — producers, brokers, queues versus topics, consumers, consumer groups, acknowledgements, and dead-letter queues — well enough to design one from scratch on a whiteboard, explain the trade-offs clearly under interview pressure, and avoid the subtle bugs that trip up engineers who only understand the happy path.

In 2026, message queues underpin everything from payment processing to LLM inference pipelines. The fundamentals have not changed, but the operational expectations have — consumers are expected to be idempotent by default, DLQs are expected to have alerting from day one, and the choice between RabbitMQ and Kafka is increasingly driven by replay requirements rather than throughput alone.

The Producer — Implementing Reliable Event Egress

The producer is the entry point for all data entering your distributed system, and its reliability story is more nuanced than most tutorials suggest. A producer that simply calls publish and moves on is adequate for logging and analytics. A producer handling payment events, order creation, or anything with financial or regulatory consequences needs to think carefully about three things: persistence, confirmation, and the atomicity gap between a database write and a queue publish.

Persistence is about what happens when the broker restarts. A non-durable message lives only in memory — broker restart, message gone. For anything that matters, use delivery mode 2 (persistent), which tells the broker to flush the message to disk before acknowledging receipt. The trade-off is disk I/O on every publish, which reduces maximum throughput but eliminates the silent data loss vector.

Confirmation is about knowing the broker actually received the message. The standard publish call is fire-and-forget — if the network drops between publish and the broker's internal receipt, you never know. Publisher confirms enable an ack back from the broker to the producer. This adds latency per message but gives you a guarantee that the message is in the broker's hands.

The atomicity gap is the hardest problem. If your producer writes a row to Postgres and then publishes an event to RabbitMQ, there is a window where the database write succeeds but the queue publish fails — or the process is killed between the two operations. The downstream services never learn that the order exists. The Outbox pattern closes this gap by writing the event into a table inside the same database transaction, with a separate polling process handling the actual publish to the queue.

OrderProducer.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
package io.thecodeforge;

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;
import com.rabbitmq.client.ConfirmListener;
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

/**
 * Production-grade producer with durable messaging and publisher confirms.
 * io.thecodeforge naming conventions throughout.
 *
 * Two critical settings:
 *   1. durable=true on the queue declaration — survives broker restart
 *   2. PERSISTENT_TEXT_PLAIN delivery mode — message flushed to disk before broker ACKs
 *
 * Publisher confirms (channel.confirmSelect()) ensure we know when the broker
 * has actually received and persisted the message, not just buffered it.
 */
public class OrderProducer {
    private static final String QUEUE_NAME = "order_processing";

    // Track outstanding unconfirmed publishes: sequenceNumber -> message payload
    // In production this would trigger a retry or fallback path on nack
    private final ConcurrentNavigableMap<Long, String> outstandingConfirms =
        new ConcurrentSkipListMap<>();

    public void publishOrder(String customerId, double amount) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        factory.setUsername("forge_user");
        factory.setPassword("forge_password");

        // One Connection per process — creating a new Connection per publish
        // adds 100-200ms of TCP handshake overhead and exhausts file descriptors at scale.
        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {

            // durable=true: this queue survives a broker restart
            // exclusive=false: allow multiple consumers
            // autoDelete=false: queue persists when the last consumer disconnects
            channel.queueDeclare(QUEUE_NAME, true, false, false, null);

            // Enable publisher confirms — the broker will ack or nack each message
            // so we know it was received and persisted, not just buffered in memory
            channel.confirmSelect();

            // Wire up confirm listeners before publishing
            channel.addConfirmListener(new ConfirmListener() {
                @Override
                public void handleAck(long deliveryTag, boolean multiple) {
                    if (multiple) {
                        outstandingConfirms.headMap(deliveryTag + 1).clear();
                    } else {
                        outstandingConfirms.remove(deliveryTag);
                    }
                    System.out.printf(" [Forge] Broker confirmed delivery for tag: %d%n", deliveryTag);
                }

                @Override
                public void handleNack(long deliveryTag, boolean multiple) {
                    // Broker rejected the message — trigger your retry or fallback path
                    String nacked = outstandingConfirms.get(deliveryTag);
                    System.err.printf(" [Forge] Broker NACK'd message, triggering fallback: %s%n", nacked);
                    // In production: write to outbox table for later retry
                }
            });

            // Generate a unique event ID that consumers use for idempotency checks
            String eventId = UUID.randomUUID().toString();
            String payload = String.format(
                "{\"event_id\":\"%s\", \"customer_id\":\"%s\", \"amount\":%.2f, \"currency\":\"USD\"}",
                eventId, customerId, amount
            );

            // Track this publish before sending so the confirm listener can correlate
            long sequenceNumber = channel.getNextPublishSeqNo();
            outstandingConfirms.put(sequenceNumber, payload);

            // PERSISTENT_TEXT_PLAIN = Delivery Mode 2 + Content-Type text/plain
            // Mode 2 tells the broker to write to disk before acknowledging
            // Without this, a broker crash between receipt and flush loses the message
            channel.basicPublish(
                "",               // default exchange — routes directly to named queue
                QUEUE_NAME,       // routing key for default exchange = queue name
                MessageProperties.PERSISTENT_TEXT_PLAIN,
                payload.getBytes(StandardCharsets.UTF_8)
            );

            System.out.printf(" [Forge] Published order event %s for customer %s%n",
                eventId, customerId);

            // Wait for broker confirmation before returning to caller
            // In high-throughput scenarios, use async confirms and batch publishing instead
            if (!channel.waitForConfirms(5000)) {
                throw new RuntimeException("Broker did not confirm message within 5 seconds");
            }
        }
    }
}
Output
[Forge] Published order event 550e8400-e29b-41d4-a716-446655440000 for customer cust-123
[Forge] Broker confirmed delivery for tag: 1
The Outbox Pattern — Closing the Atomicity Gap Between DB and Queue
  • Write the event to an 'outbox' table inside the same database transaction as your business logic — if the transaction rolls back, the event disappears with it
  • A separate polling process reads uncommitted outbox rows and publishes them to the message queue — decoupling the database commit from the queue publish
  • Once published successfully, the poller marks the outbox row as sent or deletes it — the idempotent marker prevents double-publish on poller restart
  • This pattern guarantees atomicity: either both the data write and the event publish succeed, or neither does — no phantom events, no silent drops
  • The trade-off is added latency equal to the polling interval (typically 100ms to 5 seconds) and an outbox table that needs periodic cleanup
Production Insight
Non-durable messages are a silent data loss vector that only surfaces during broker restarts — which happen during upgrades, failovers, and incidents, exactly when you can least afford data loss. Always use delivery mode 2 for anything that matters.
Connection pooling is critical at scale: creating a new TCP connection per publish adds 100 to 200 milliseconds of handshake overhead and exhausts file descriptors under load. The correct pattern is one Connection per process and one Channel per thread — Channels are lightweight and cheap, Connections are expensive and shared.
Publisher confirms are not optional for payment or order events. Fire-and-forget publish gives you no signal when the network drops between your process and the broker. Confirms give you that signal at the cost of per-message latency.
Key Takeaway
The producer's job is not just publish — it is publish reliably and know when the broker has received it. Non-durable delivery mode is a silent data loss vector that only reveals itself during infrastructure failures. Publisher confirms give you the signal you need to implement retry or fallback logic. Use the Outbox pattern whenever a queue publish and a database write must succeed or fail together.
Producer Reliability Decisions
IfMessage loss is acceptable — analytics events, access logs, metrics
UseNon-durable queues with transient delivery mode — faster, less disk I/O, appropriate for high-volume observability data
IfMessage loss is unacceptable — payments, orders, compliance events
UseDurable queues, persistent delivery mode, and publisher confirms — all three are required, not optional
IfNeed transactional guarantee that ties a database write to a queue publish atomically
UseImplement the Outbox pattern — write to the outbox table in the same transaction, poll and publish separately with idempotency on the poller
IfHigh-throughput producer sending more than 10,000 messages per second
UseUse batch publishing and asynchronous publisher confirms rather than waiting for synchronous confirmation on each message — the throughput difference is significant

The Broker — Orchestrating Infrastructure with Docker

The broker is the central post office of your distributed system — every message passes through it, and its configuration determines the durability, routing, and delivery guarantees your system provides. Senior engineers containerise the broker with Docker Compose for a simple reason: dev-prod parity. When your local RabbitMQ instance has the same exchanges, bindings, and policies as production, you catch routing bugs and configuration mismatches before they hit staging.

The compose file below includes a named volume for data persistence — without it, a docker-compose down wipes every message in every queue, which is fine for unit tests and genuinely dangerous for integration tests where you are verifying retry and DLQ behaviour. The management UI on port 15672 is not just an admin panel — during an incident it is your primary visibility into queue depth, consumer count, message rates, and memory usage. Know how to read it under pressure before you need it.

One often-overlooked aspect of broker configuration is the Dead Letter Exchange. Defining the DLX in your compose file alongside the primary queue means every developer's local environment has the same failure routing as production. Discovering that your DLQ configuration is wrong during a production incident is several orders of magnitude worse than discovering it locally.

docker-compose.ymlYAML
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
version: '3.8'

services:
  thecodeforge-broker:
    image: rabbitmq:3.13-management
    container_name: forge-rabbitmq
    ports:
      - "5672:5672"   # AMQP protocol — application connections
      - "15672:15672" # Management UI — use for debugging queue depth and consumer health
    environment:
      RABBITMQ_DEFAULT_USER: forge_user
      RABBITMQ_DEFAULT_PASS: forge_password
      # Reduce heartbeat to detect dropped connections faster (default is 60s)
      RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS: "-rabbitmq_management load_definitions /etc/rabbitmq/definitions.json"
    volumes:
      # Named volume ensures messages survive docker-compose down and up
      # Without this, every restart wipes the queue — dangerous for DLQ testing
      - rabbitmq_data:/var/lib/rabbitmq
      # Pre-load exchange and queue definitions on startup for dev-prod parity
      - ./rabbitmq-definitions.json:/etc/rabbitmq/definitions.json:ro
    healthcheck:
      # Wait for the broker to be ready before starting dependent services
      test: ["CMD", "rabbitmq-diagnostics", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 20s

volumes:
  rabbitmq_data:
    # Explicit driver declaration makes the volume's purpose visible to future engineers
    driver: local
Output
# Container startup:
# forge-rabbitmq started. Waiting for healthcheck...
# rabbitmq-diagnostics ping succeeded after 15s
# Broker ready at amqp://forge_user:forge_password@localhost:5672
# Management UI at http://localhost:15672 (forge_user / forge_password)
#
# Verify broker is healthy:
# docker exec forge-rabbitmq rabbitmqctl status
# docker exec forge-rabbitmq rabbitmqctl list_queues name messages consumers
Queue vs Topic — The Scalability Difference That Matters Most
  • Queue (point-to-point): one message, one consumer — correct for task distribution like image resizing, email sending, or payment processing where you want exactly one service to act
  • Topic or fanout (pub/sub): one message, every subscriber — correct for domain events like ORDER_PLACED where inventory, notifications, and analytics all need to react independently
  • Adding a new microservice to a topic requires zero producer code changes — bind a new queue to the existing exchange and the service starts receiving events
  • Competing consumers on a single queue scale horizontally and naturally — add more consumer instances and the broker distributes work across all of them
  • Rule of thumb: if more than one downstream service will ever need the event, use a topic exchange from day one — retrofitting it later requires coordinating multiple teams
Production Insight
Docker named volumes prevent data loss on container restart — without them, a docker-compose down between test runs wipes all messages, making DLQ and retry behaviour untestable locally.
Pin the RabbitMQ image version explicitly — rabbitmq:3-management can pull a breaking minor version change between developers' machines if Docker cache is stale, creating environment-specific bugs that are painful to trace.
The management UI on port 15672 is your first stop during any message queue incident — queue depth, unacknowledged message count, consumer count, and publish rates are all visible in real time without any additional tooling.
Key Takeaway
The broker is the single source of truth for message delivery — its configuration drives your system's durability and routing guarantees. Dev-prod parity via Docker Compose means routing bugs and DLQ misconfigurations surface locally rather than in production. Use topic exchanges from the start whenever more than one downstream service needs an event — retrofitting point-to-point queues to pub/sub requires coordinating producer changes across teams.

The Consumer — Ensuring Idempotency and Manual Acknowledgement

Consumers are where processing happens and where most data loss actually occurs — not in the broker, not in the producer. The configuration decisions you make in the consumer determine whether your system has at-least-once delivery with retry guarantees or auto-ack with silent data loss disguised as reliability.

The most dangerous configuration is auto_ack=True. The broker interprets delivery as confirmation that the message was processed successfully. If your Java process receives the message, starts processing, and then gets killed by the OOM killer between the payment gateway call and the database commit, the message is gone. The broker has already deleted it. There is no retry, no DLQ entry, and no trace in any log — only a successful charge in the gateway and a missing order in your database.

Manual acknowledgement changes the contract: the broker keeps the message in an unacknowledged state until your code explicitly calls basicAck. Only call basicAck after every side effect has succeeded — not after receipt, not after the external API call, but after the database commit that makes the operation durable.

The second non-negotiable for production consumers is idempotency. At-least-once delivery means duplicates are guaranteed over time. A network blip between your ack call and the broker receiving it causes redelivery. A consumer restart with unacked messages causes redelivery. Assume it will happen and build accordingly — check the event_id against a processed_events table before acting, and ack without processing if the event was already handled.

Prefetch count rounds out the three required settings. Without it, RabbitMQ uses an unbounded prefetch and may deliver 500 messages to one fast consumer while three other consumers sit idle. Set basicQos(1) to enforce fair dispatch — the broker only sends a new message to a consumer after the previous one is acknowledged.

OrderConsumer.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
package io.thecodeforge;

import com.rabbitmq.client.*;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

/**
 * Production-grade consumer with:
 *   1. Manual acknowledgement — only ack after all side effects succeed
 *   2. Idempotency check — safe to receive the same message twice
 *   3. Prefetch=1 — fair dispatch across all consumer instances
 *   4. DLQ routing on permanent failure — no infinite poison message loops
 */
public class OrderConsumer {
    private static final String QUEUE_NAME = "order_processing";
    private static final String DB_URL     = "jdbc:postgresql://localhost:5432/forge";
    private static final String DB_USER    = "forge_user";
    private static final String DB_PASS    = "forge_password";

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        factory.setUsername("forge_user");
        factory.setPassword("forge_password");

        // Enable automatic connection recovery — reconnects after network interruptions
        // without requiring consumer code to restart or re-register callbacks
        factory.setAutomaticRecoveryEnabled(true);
        factory.setNetworkRecoveryInterval(5000); // retry every 5 seconds

        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        // Declare the queue with the same parameters as the producer
        // If the queue already exists with different params, this throws — catch it early
        channel.queueDeclare(QUEUE_NAME, true, false, false, null);

        // CRITICAL: prefetch=1 prevents one consumer from hoarding all messages
        // The broker will not deliver a new message until the previous one is acked
        // This forces round-robin distribution across all running consumer instances
        channel.basicQos(1);

        System.out.println(" [Forge] Consumer waiting for messages on: " + QUEUE_NAME);

        DeliverCallback deliverCallback = (consumerTag, delivery) -> {
            String body = new String(delivery.getBody(), StandardCharsets.UTF_8);
            long deliveryTag = delivery.getEnvelope().getDeliveryTag();
            String eventId   = extractEventId(body);

            System.out.printf(" [Forge] Received event %s%n", eventId);

            try {
                // Step 1: Idempotency check — has this event been processed before?
                // Required because at-least-once delivery means duplicates WILL arrive
                if (isAlreadyProcessed(eventId)) {
                    System.out.printf(" [Forge] Duplicate event %s — acking without processing%n", eventId);
                    channel.basicAck(deliveryTag, false);
                    return;
                }

                // Step 2: Execute business logic — all of this must succeed or fail atomically
                processOrder(body);

                // Step 3: Mark as processed in the idempotency table
                // Do this inside the same DB transaction as processOrder() in production
                markAsProcessed(eventId);

                // Step 4: Only ack AFTER all side effects are committed to durable storage
                // If we crash between processOrder() and basicAck(), the broker redelivers
                // The idempotency check in Step 1 handles that redelivery safely
                channel.basicAck(deliveryTag, false);
                System.out.printf(" [Forge] Successfully processed and acked event %s%n", eventId);

            } catch (Exception error) {
                System.err.printf(" [Forge] Failed to process event %s: %s%n",
                    eventId, error.getMessage());

                // requeue=false: send to DLQ instead of requeueing
                // requeue=true would create an infinite loop for permanent failures
                // The DLQ preserves the message for manual inspection and replay
                channel.basicNack(deliveryTag, false, false);
            }
        };

        // autoAck=false: broker keeps message unacked until we call basicAck or basicNack
        // This is the single most important setting — never set this to true in production
        channel.basicConsume(QUEUE_NAME, false, deliverCallback, consumerTag ->
            System.out.println(" [Forge] Consumer cancelled: " + consumerTag)
        );
    }

    private static void processOrder(String json) throws Exception {
        // In production: parse JSON, call payment gateway, write to DB in one transaction
        System.out.printf(" [Forge] Processing order payload: %s%n", json);
        Thread.sleep(50); // simulate processing time
    }

    private static boolean isAlreadyProcessed(String eventId) throws Exception {
        try (var conn = DriverManager.getConnection(DB_URL, DB_USER, DB_PASS);
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT 1 FROM processed_events WHERE event_id = ? LIMIT 1")) {
            ps.setString(1, eventId);
            ResultSet rs = ps.executeQuery();
            return rs.next();
        }
    }

    private static void markAsProcessed(String eventId) throws Exception {
        try (var conn = DriverManager.getConnection(DB_URL, DB_USER, DB_PASS);
             PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO processed_events (event_id, processed_at) VALUES (?, NOW()) ON CONFLICT DO NOTHING")) {
            ps.setString(1, eventId);
            ps.executeUpdate();
        }
    }

    private static String extractEventId(String json) {
        // In production use Jackson or Gson for JSON parsing
        // This simple extraction is for illustration clarity only
        int start = json.indexOf('"', json.indexOf("event_id") + 10) + 1;
        int end   = json.indexOf('"', start);
        return json.substring(start, end);
    }
}
Output
[Forge] Consumer waiting for messages on: order_processing
[Forge] Received event 550e8400-e29b-41d4-a716-446655440000
[Forge] Processing order payload: {"event_id":"550e8400...", "customer_id":"cust-123", "amount":99.99}
[Forge] Successfully processed and acked event 550e8400-e29b-41d4-a716-446655440000
# On redelivery of the same event_id:
[Forge] Received event 550e8400-e29b-41d4-a716-446655440000
[Forge] Duplicate event 550e8400-e29b-41d4-a716-446655440000 — acking without processing
The Poison Message Problem — Why requeue=True Is Dangerous
If a message has a permanent defect — malformed JSON, a missing required field, a schema the consumer cannot parse — it will fail on every delivery attempt. With requeue=True on nack, the broker delivers it again immediately. The consumer fails again, nacks with requeue=True, and the cycle repeats thousands of times per second. This infinite loop blocks every other message in the queue and spikes consumer CPU while making zero progress. Always use requeue=False to route unrecoverable failures to the DLQ where they can be inspected and replayed manually after the root cause is fixed.
Production Insight
basicQos(1) is the single most impactful consumer setting for fair distribution — without it, one fast consumer can hold hundreds of unacknowledged messages while other instances have nothing to process, which makes horizontal scaling ineffective.
auto_ack=True is a production incident waiting to happen — it has no legitimate use in any consumer that does real work. The convenience it offers is exactly the confidence that leads to the kind of incident described above.
Automatic connection recovery (factory.setAutomaticRecoveryEnabled(true)) handles the silent disconnection scenario described in the debug guide — without it, your consumer appears to be running but stops receiving messages after a network interruption.
Key Takeaway
The consumer is where data loss happens — not in the broker, not in the producer. Manual ack, idempotency check, and DLQ routing form a three-layer defence that makes your consumer safe to restart, safe to run as multiple instances, and safe to deliver the same message twice. If your consumer uses auto-ack, you are one OOM kill away from a production incident with no recovery path.
Consumer Configuration Decisions
IfProcessing is idempotent, fast (under 100ms), and stateless
UsePrefetch count can be higher (10 to 50) to improve throughput — but still use manual ack, never auto-ack
IfProcessing involves external API calls, database writes, or any I/O
UseSet prefetch count to 1 and use manual ack — only ack after all side effects are committed to durable storage
IfMessage fails due to a permanent defect — malformed data, schema mismatch
UseNack with requeue=False to route directly to DLQ — do not requeue, it will never succeed and will block the queue
IfMessage fails due to a transient error — downstream timeout, temporary unavailability
UseNack with requeue=False and route to a retry queue with exponential backoff TTL — not back to the main queue, which would allow immediate retry without backoff

Dead-Letter Queues — Handling Failures Without Blocking the Pipeline

A Dead Letter Queue is where messages go when they cannot be processed — not to be forgotten, but to be preserved for diagnosis and recovery. The DLQ's value is not that it catches failures (your nack logic does that), but that it isolates them from the primary queue so the happy path continues processing at full speed while the failed messages wait for human attention.

In RabbitMQ, a Dead Letter Exchange is a first-class feature configured on the primary queue via arguments. When a message is nacked with requeue=False, or expires its TTL, or exceeds the queue's maximum length, the broker automatically routes it to the configured dead-letter exchange, which then delivers it to the DLQ. This requires no code changes in the consumer beyond using requeue=False — the broker handles the routing.

Archiving DLQ messages to a relational database table adds context that the message broker cannot preserve on its own — the source queue name, the failure reason, the retry count, the timestamp of each attempt. Without this context, when someone opens the RabbitMQ management UI a week later and sees 47 messages in the DLQ, they have no idea which ones are related, which ones can be replayed, or what went wrong.

The alerting is not optional. A DLQ that accumulates silently for days is just deferred data loss. Every message that enters the DLQ represents a failure that a customer or downstream system is already experiencing. Treat DLQ growth exactly like a 500 error rate spike — investigate immediately, not during the next weekly review.

audit_schema.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
-- io.thecodeforge: Schema for auditing and replaying DLQ extractions
-- This table provides the context that the message broker cannot preserve:
-- which queue failed, why it failed, how many times it was attempted,
-- and whether it has been replayed or resolved.

CREATE TABLE dlq_archive (
    id                  BIGSERIAL       PRIMARY KEY,
    original_event_id   UUID            NOT NULL,
    source_queue        VARCHAR(255)    NOT NULL,  -- which queue the message came from
    payload             JSONB           NOT NULL,  -- full message body for replay
    failure_reason      TEXT            NOT NULL,  -- exception message or error code
    retry_count         INT             NOT NULL DEFAULT 0,
    first_failed_at     TIMESTAMPTZ     NOT NULL DEFAULT NOW(),
    last_failed_at      TIMESTAMPTZ     NOT NULL DEFAULT NOW(),
    resolved_at         TIMESTAMPTZ,               -- null until an engineer resolves it
    resolution_notes    TEXT                       -- what was done to fix or discard
);

-- Fast lookup by event_id for idempotency checks and deduplication during replay
CREATE UNIQUE INDEX idx_dlq_event_id
    ON dlq_archive(original_event_id);

-- For ops dashboards: filter by source queue and unresolved status
CREATE INDEX idx_dlq_source_unresolved
    ON dlq_archive(source_queue, resolved_at)
    WHERE resolved_at IS NULL;

-- Query: find all unresolved failures for the order_processing queue in the last 24 hours
SELECT
    original_event_id,
    failure_reason,
    retry_count,
    first_failed_at,
    payload->>'customer_id'  AS customer_id,
    payload->>'amount'       AS amount
FROM dlq_archive
WHERE source_queue  = 'order_processing'
  AND resolved_at   IS NULL
  AND first_failed_at >= NOW() - INTERVAL '24 hours'
ORDER BY first_failed_at DESC;

-- Replay query: select fixable messages to republish to the original queue
-- After fixing the consumer bug, republish these payloads with new event_ids
SELECT original_event_id, payload
FROM dlq_archive
WHERE source_queue    = 'order_processing'
  AND failure_reason  LIKE '%connection timeout%'  -- transient errors worth replaying
  AND resolved_at     IS NULL;
Output
-- Table created with UNIQUE constraint on original_event_id.
-- Partial index for unresolved messages keeps ops dashboard queries fast
-- even as the archive grows to millions of rows over months.
--
-- Example unresolved failures query result:
-- original_event_id | failure_reason | retry_count | customer_id
-- 550e8400-e29b-41d4-a716-446655440000 | DB connection timeout | 3 | cust-123
-- 7c9e6679-7425-40de-944b-e07fc1f90ae7 | JSON parse error: ... | 1 | cust-456
A DLQ Without Alerting Is Deferred Data Loss — Not a Safety Net
Setting up a DLQ without an alert is the engineering equivalent of installing a smoke detector without batteries. You feel safer but you are not. Set a CloudWatch alarm, Prometheus alert rule, or Datadog monitor that fires the moment DLQ message count exceeds zero — not when it exceeds some arbitrary threshold you will rarely reach. A single message in the DLQ means a customer or downstream system is already experiencing a failure right now. Treat it like a 500 error: page the on-call engineer, investigate immediately, and resolve it before the next one arrives.
Production Insight
DLQ messages lose their original routing context inside the broker — the dlq_archive table is where you preserve source queue, failure reason, retry count, and the full payload so engineers can diagnose without guessing a week later.
A growing DLQ is a symptom of an upstream problem, not a feature of your system working correctly. If more than a handful of messages hit the DLQ on a normal business day, something upstream needs fixing — it is not time to replay the DLQ, it is time to fix the consumer or the data.
Design a replay mechanism from the start: a script or internal tool that reads from dlq_archive, generates new event IDs, and republishes to the original queue. Without it, DLQ recovery requires manual intervention that is slow, error-prone, and stressful during an incident.
Key Takeaway
The DLQ is your system's emergency stop valve — it prevents poison messages from blocking the entire pipeline while preserving failed messages for diagnosis and recovery. But it is only useful if you know when messages arrive in it. A DLQ without alerting is a silent graveyard. Archive DLQ entries to a relational table with source queue, failure reason, and retry count so engineers can diagnose root causes and replay selectively when the fix is deployed.
DLQ Handling Decisions
IfMessage failed due to a transient error — database timeout, network interruption, downstream service temporarily unavailable
UseRoute to a retry queue with exponential backoff TTL (5s, 30s, 5min). Only route to the DLQ after max retry count is exhausted. This avoids filling the DLQ with recoverable errors.
IfMessage failed due to a permanent error — malformed JSON, schema version mismatch, missing required fields
UseRoute directly to DLQ and archive to dlq_archive — do not retry, it will never succeed without a code fix or manual data correction
IfDLQ depth is growing faster than engineers can resolve it
UseStop treating individual DLQ messages as the problem — find and fix the root cause upstream. A DLQ growing at 50 messages per hour means your consumer or producer has a bug affecting a significant percentage of traffic
IfNeed to replay DLQ messages after fixing the root cause in the consumer
UseUse the dlq_archive replay query to identify fixable messages by failure_reason, generate new event IDs, and republish to the original queue — the idempotency check in the consumer handles any overlap with previously processed events

RabbitMQ vs Kafka Decision Matrix: Smart Broker vs Smart Consumer

The architectural distinction between RabbitMQ and Kafka is not about performance — it is about where you place the intelligence. RabbitMQ is a smart broker: the broker itself handles routing, persistence, acknowledgements, dead-lettering, and queue management. The consumer is relatively dumb — it connects, receives messages, acks them, and moves on. Kafka flips this model. The broker is a dumb, fast, immutable log. The consumer is smart — it manages its own offset, participates in rebalancing, and controls exactly when and how many messages to pull.

This decision determines your system's operational characteristics at scale. The table below summarises the key architectural differences and the trade-offs they create.

Decision FactorRabbitMQ (Smart Broker)Kafka (Smart Consumer)
Who manages delivery?Broker pushes messages and controls ack/dead-letter routingConsumer pulls messages and commits offsets independently
Routing flexibilityBuilt-in exchanges (direct, topic, fanout, headers) — complex routing without codeRequires explicit consumer logic or stream processing (KSQL) for fan-out
Ordering guaranteeStrict per-queue ordering as long as one consumerStrict per-partition ordering; global ordering requires single partition (low throughput)
Replay capabilityNot native — once acked, message is gone unless you archive separatelyCore feature — retain messages for days/weeks, seek to any offset
Scalability modelAdd more consumers to a queue — the broker distributes workAdd more consumer group members — partition rebalancing handles redistribution
Operational maturitySimpler setup, single binary, management UI out of the boxRequires Zookeeper (until KRaft), partition tuning, careful consumer group coordination

When to choose each: - RabbitMQ if you need complex routing (e.g., RPC, work queues, multiple exchange types), or if your operations team wants a simple, observable infrastructure with minimal tuning. - Kafka if you need event replay, high-throughput streaming, multiple independent consumer groups, or a durable audit log that survives consumer failure without data loss.

Smart Broker vs Smart Consumer — The Core Trade-Off
RabbitMQ's smart broker model makes it easier to add new consumers without coordination — the broker handles all the routing and delivery logic. Kafka's smart consumer model gives you replay and independent fan-out at the cost of requiring consumers to manage offsets and rebalancing. There is no universally correct choice; the answer depends on whether your system values routing flexibility (RabbitMQ) or replay capability (Kafka).
Production Insight
In practice, many production systems run both. Use RabbitMQ for task queues and RPC patterns where complex routing and simple consumer logic are valuable. Use Kafka for event sourcing, audit logs, and streaming analytics where replay and high throughput are critical. Running both introduces operational overhead but lets you use each broker for what it does best.
Key Takeaway
The smart broker (RabbitMQ) vs smart consumer (Kafka) distinction determines your system's routing flexibility, replay capabilities, and operational complexity. Choose based on your need for complex routing (RabbitMQ) versus replayable event streams (Kafka), not on perceived performance alone.

Message Delivery Trade-offs: At-Most-Once, At-Least-Once, Exactly-Once

Every message queue system must make a trade-off between reliability and performance. The three common delivery semantics — at-most-once, at-least-once, and exactly-once — represent different points on that trade-off curve. Understanding which one your system actually needs is more important than any feature comparison between brokers.

Delivery SemanticsDescriptionBest ForExample ConfigRisk
At-most-onceMessage delivered zero or one time — no retryMonitoring/analytics where occasional data loss is acceptableauto_ack=True, no confirmationMessage loss
At-least-onceMessage delivered one or more times — retries guaranteedBusiness transactions where data loss is unacceptable but duplicates can be handledmanual ack, requeue on transient failure, consumer idempotencyDuplicates
Exactly-once (end-to-end)Message processed exactly once — no loss, no duplicatesPayment processing, financial systems where duplicates or loss have real-world costIdempotent consumer + transactional outbox (or Kafka transactions for Kafka-to-Kafka)Throughput reduction

Trade-offs in practice: - At-most-once is trivial for the broker but dangerous for the business. It is the default in many frameworks and must be deliberately avoided for any meaningful work. - At-least-once is the most common production choice because it balances reliability with performance. The catch? You must make your consumer idempotent. This is not optional — it is a prerequisite for at-least-once delivery. - Exactly-once is the gold standard but requires significant infrastructure: either an idempotent consumer with an outbox pattern, or Kafka's transactional API (which only guarantees end-to-end exactly-once within Kafka itself). The throughput cost is real — expect 20-40% reduction in peak throughput for exactly-once versus at-least-once.

Choosing: If you cannot tolerate data loss, use at-least-once with an idempotent consumer. If you also cannot tolerate duplicates, invest in exactly-once infrastructure. If both loss and duplicates are acceptable, you probably should not be using a message queue at all — a simple log file may suffice.

Exactly-Once Is Overrated — At-Least-Once with Idempotency Wins in Production
Practising senior engineers overwhelmingly prefer at-least-once with idempotency over exactly-once for most use cases. The reason is operational simplicity: an idempotent consumer handles duplicates via a database constraint (event_id unique index) without requiring any broker-level transaction coordination. Exactly-once adds latency, throughput reduction, and tight coupling between the broker and the consumer's data store. Reserve exactly-once for financial transactions where a single duplicate can cost real money, and use at-least-once everywhere else.
Production Insight
Kafka's exactly-once semantics (enable.idempotence=true + transactional producer) only provide end-to-end exactly-once when both the producer and consumer are within Kafka — i.e., reading from Kafka and writing back to Kafka. If your consumer writes to a relational database or any external system, the only way to achieve exactly-once end-to-end is through an idempotent consumer that checks event_id uniqueness at the database level.
Key Takeaway
The delivery trade-off is not a feature to be enabled — it is an architecture decision that affects every component. At-least-once with idempotency is the pragmatic choice for 90% of production systems. Reserve exactly-once for the narrow case where a single duplicate has financial or compliance consequences.

DLQ Pattern Visual Workflow — Message Routing Through Failure Paths

The diagram shows the three outcomes of consumer processing. The most dangerous path is the loop from D back to A (requeue=true) — this creates the poison message problem described in earlier sections. The correct transient error path goes through a retry queue with a time-to-live (TTL) that introduces backoff before re-delivery. The permanent error path sends the message directly to the DLQ, where it remains until an engineer inspects, fixes the root cause, and selectively replays it.

Key routing rules: - Never requeue a message that will never succeed — use requeue=false and route to DLQ. - For transient errors, use a retry queue with exponential backoff TTL (e.g., 5s, 30s, 5min). This is a separate queue from the primary queue to avoid blocking the happy path. - After max retries, the retry queue should also route to the DLQ via its own dead-letter exchange. - The DLQ must have an alert when depth > 0. The on-call engineer investigates and, after fixing the bug, replays the archived messages (not from the DLQ itself, but from the dlq_archive table). This ensures no re-insertion into the primary queue until the root cause is resolved.

Replaying from DLQ — Never Republish Directly from the DLQ Queue
When you need to replay a DLQ message, do not consume it from the DLQ and republish it immediately. The DLQ may contain messages that failed for the same bug that you just fixed — but other messages may have failed for different, unresolved reasons. Always archive DLQ messages to a database table, let an engineer examine the failure_reason, and only republish messages whose root cause has been fixed. Otherwise you risk re-injecting the same poison message into the primary queue and starting the cycle again.
Production Insight
The retry queue pattern with TTL requires no code beyond the initial nack with requeue=false. Configure the primary queue's dead-letter exchange to point to a retry queue, and the retry queue's dead-letter exchange to point to the DLQ. RabbitMQ handles the entire routing chain automatically — the consumer only needs to call basicNack with requeue=false. The TTL on the retry queue provides the backoff.
Key Takeaway
The DLQ pattern is a three-stage routing pipeline: primary queue for processing, retry queue with backoff for transient failures, and DLQ for permanent failures. The broker handles the routing automatically if configured correctly. Alert on DLQ depth > 0, archive to a database before replaying, and never use requeue=true for failed messages.

Throughput vs Latency Benchmarks for Top Message Brokers

Choosing a message broker often comes down to throughput and latency numbers, but these benchmarks vary wildly with configuration, hardware, and message size. The numbers below are from controlled tests on modern cloud hardware (3-node clusters, 8 vCPU / 32 GB RAM, NVMe SSDs, 1 KB message payload, published and consumed in the same region). They represent peak sustainable throughput and median end-to-end latency under moderate load (50% of max throughput). Use them as directional guidance, not as absolute performance guarantees.

BrokerMax Throughput (msg/s)Median Latency (ms)P99 Latency (ms)StrengthWeakness
RabbitMQ 3.13~50,0005–1550–100Simple setup, excellent routing, low latency at moderate throughputThroughput drops with durable messaging and clustered setups; not designed for streaming
Apache Kafka 3.7~1,500,00010–30100–200Extreme throughput, durable log, replay capabilitiesHigher operational complexity; latency increases under heavy producer load
Apache Pulsar 3.0~1,000,0005–1030–60Low latency, geo-replication, separation of serving/storageNewer ecosystem; less community adoption than Kafka
Amazon SQS (Standard)Unlimited (auto-scaling)50–200500–1000Fully managed, no ops overhead, infinite scaleHigher latency, no ordering guarantee (FIFO queues have throughput limit of 300 msg/s)

Key observations: - RabbitMQ is the latency king for moderate throughput. If your system processes tens of thousands of messages per second and cares about sub-10ms delivery, RabbitMQ is the best choice. - Kafka is designed for throughput, not latency. Its pull-based model naturally adds 10–30ms of polling latency. For time-sensitive operations (e.g., real-time trading), Kafka's latency may be unacceptable. - Pulsar offers a compelling middle ground: high throughput with low latency, thanks to its separate serving and storage layers. However, its ecosystem is smaller, which affects debug tooling and hiring. - SQS is for teams that want zero ops overhead and can tolerate variable latency. The standard queue's at-least-once delivery with potential duplicates is acceptable for many use cases.

Recommendation: Run your own benchmark with your message size, payload format, and consumption pattern. The numbers above assume 1 KB messages — double the message size and throughput drops by roughly 50% for most brokers due to network and serialization overhead. For a realistic test, use your actual message schema and simulate your production traffic profile.

Benchmark With Your Data, Not Generic Numbers
The throughput numbers in this section are directional — they come from standardised tests with 1 KB messages on homogeneous hardware. Your mileage will vary significantly based on message size (throughput is inversely proportional), consumer processing time, network latency between services, and broker configuration choices like replication factor. Always run a benchmark with your own payloads and traffic patterns before making a final broker selection. The cost of picking the wrong broker is a painful migration six months into production.
Production Insight
In production, the bottleneck is rarely the broker — it is the consumer's processing logic or the network between services. A well-optimised RabbitMQ setup with manual ack and prefetch=1 can handle 10,000 messages per second per consumer instance, but if each message takes 50ms to process (DB write + external API call), one consumer handles only 20 messages per second. The broker is not the limiting factor; the consumer's downstream dependencies are. When benchmarking, always include the full processing pipeline, not just publish-and-ack latency.
Key Takeaway
Throughput and latency benchmarks are directional indicators, not absolute guarantees. RabbitMQ excels at low latency with moderate throughput; Kafka dominates at extreme throughput with higher latency; Pulsar offers a balance; SQS trades latency for zero ops. Benchmark with your actual message size and processing pipeline before committing to a broker, and remember that consumer processing time — not broker speed — is usually the real bottleneck.
● Production incidentPOST-MORTEMseverity: high

The Double-Charge Incident: auto-ack Deleted the Payment Message Before Processing Completed

Symptom
Customer support receives a spike in 'charged but no order created' tickets within hours of a deployment. Payment gateway logs show successful charges for the affected customers. Application logs show the consumer received the messages but have no log entries for the order creation step. The message queue management dashboard shows zero unacknowledged messages in the queue — which looks healthy but is actually the problem.
Assumption
The payment gateway must have returned an error that the code did not handle properly. The team suspects the message was retried and a second attempt created an order without a corresponding charge, or that the message was dropped in network transit. Two engineers spend four hours investigating the gateway integration before anyone looks at the ack configuration.
Root cause
The consumer was configured with auto_ack=True. RabbitMQ deleted every message the instant it was delivered to the consumer channel — before the consumer confirmed that the payment gateway call AND the database write both succeeded. The consumer process hit an OutOfMemoryError during the database connection pool exhaustion that followed a slow migration deployed the same day. The OOM killer terminated the process. The message was already gone from the broker. No retry, no DLQ entry, no trace anywhere in the system except a successful gateway log and a missing order.
Fix
1. Switched to manual acknowledgement: channel.basicConsume(QUEUE_NAME, false, deliverCallback, cancelCallback) — the false is the critical parameter. 2. Moved the channel.basicAck() call to after both the payment gateway response is received AND the database transaction commits successfully. 3. On any exception, call channel.basicNack(deliveryTag, false, false) to route the message to the DLQ rather than requeuing it. 4. Added an idempotency check using the event_id against a processed_events table before calling the gateway — if the event_id already exists, skip processing and ack. 5. Set a DLQ depth alert at depth greater than 0 that pages the on-call engineer immediately.
Key lesson
  • Auto-ack is a data loss mechanism disguised as a convenience feature — it has no legitimate use case in any consumer that performs meaningful work
  • Manual ack must happen only after the entire business transaction completes — after the database commit, not after the message is received and certainly not before calling external services
  • Every consumer that calls a payment gateway or any non-idempotent external system must check an event_id deduplication table before acting — at-least-once delivery is guaranteed, which means duplicates will happen
  • A DLQ without depth alerting is a graveyard — set the threshold at greater than zero, not at some arbitrary number you will never reach until something is seriously broken
  • OOM kills are silent — the consumer can die between a gateway call and a database commit with zero application log output, which is why the ack position matters more than good exception handling
Production debug guideSymptom to Action mapping for common MQ failures in production5 entries
Symptom · 01
Messages disappearing without a trace — no errors in consumer logs, no messages in DLQ, queue shows zero depth
Fix
Check if auto_ack is enabled on the consumer. If true, the broker deletes messages on delivery regardless of processing outcome and there will be no record of the failure anywhere. Switch to manual ack immediately and redeploy. Then audit whether any messages were lost and need manual recovery.
Symptom · 02
Queue depth growing steadily while consumers report low CPU and appear to be running normally
Fix
Check prefetch_count. If unset, RabbitMQ uses an unbounded prefetch and may be delivering hundreds of messages to one consumer while others sit idle with nothing to process. Set channel.basicQos(1) to enforce fair dispatch. Also check for slow downstream dependencies — a consumer that spends 30 seconds waiting on a database call can appear healthy while the queue backs up behind it.
Symptom · 03
Same message processed ten or more times in rapid succession — duplicate charges, duplicate emails, duplicate inventory decrements
Fix
Check if requeue=True is being used on nack. A poison message gets requeued and redelivered immediately in an infinite loop that looks like high throughput but is actually a single message being hammered. Change to requeue=False to route to DLQ. Also verify the consumer has idempotency checks in place — even after fixing the nack, at-least-once delivery means duplicates will arrive eventually through normal retry flows.
Symptom · 04
Consumer connection drops silently — no crash, no exception thrown, messages simply stop flowing and the consumer appears healthy
Fix
Check TCP keepalive settings and heartbeat configuration. RabbitMQ defaults to a 60-second heartbeat. If a network device drops the connection without sending a TCP RST, the broker waits indefinitely for the next heartbeat while the consumer has no idea the connection is gone. Lower heartbeat to 10 to 30 seconds and add connection recovery listeners. Use the AMQP client's built-in automatic recovery where available.
Symptom · 05
DLQ depth spikes immediately after a deployment — dozens of messages arrive in the DLQ within the first few minutes
Fix
Check for schema changes or serialization version mismatches between the new consumer and messages already in the queue. A new consumer that cannot deserialize the old message format will nack every message in the queue. Roll back the consumer deployment first, drain the DLQ manually while examining the message payloads, then redeploy with backward-compatible deserialization before cutting over fully.
★ Message Queue Quick Debug Cheat SheetImmediate diagnostic commands when message queue processing breaks in production.
Queue depth growing — consumers not draining despite appearing to run
Immediate action
Check consumer count, prefetch settings, and unacknowledged message count on the queue
Commands
docker exec forge-rabbitmq rabbitmqctl list_consumers
docker exec forge-rabbitmq rabbitmqctl list_queues name messages messages_unacknowledged consumers
Fix now
If consumers=0, the consumer process crashed — restart it and check logs for the crash reason. If messages_unacknowledged is high and growing, a consumer is stuck holding messages it received but cannot process — check for deadlocks, slow DB queries, or blocked external calls.
Poison message looping — same message reprocessed hundreds of times, consumer CPU spikes+
Immediate action
Verify the nack call is using requeue=False and messages are routing to DLQ
Commands
docker exec forge-rabbitmq rabbitmqctl list_queues name messages_ready messages_unacknowledged
docker logs forge-rabbitmq --tail 200 | grep -i 'dead'
Fix now
If messages_ready stays constant while messages_unacknowledged cycles, a poison message is looping. Temporarily pause the consumer using the management UI, inspect the message payload in the queue, fix the consumer nack logic to use requeue=False, then restart. Do not purge without examining the payload first.
Messages published successfully — producer logs confirm — but consumer receives nothing+
Immediate action
Verify the routing key on the producer and the binding pattern on the consumer match exactly
Commands
docker exec forge-rabbitmq rabbitmqctl list_bindings
docker exec forge-rabbitmq rabbitmqctl list_queues name messages
Fix now
If the queue shows 0 messages but the producer confirms publish, the messages are being routed to a different queue or dropped at the exchange level because the routing key does not match any binding. Check that the producer routing key matches the consumer binding pattern character for character — a trailing dot or typo is the most common cause.
RabbitMQ vs Apache Kafka — Head-to-Head Comparison
Feature / AspectRabbitMQ (Queue-based)Apache Kafka (Log-based)
Delivery modelPush — broker pushes messages to consumers as they become availablePull — consumers poll the broker at their own pace, which naturally provides backpressure
Message retentionDeleted after ACK — once processed and acknowledged, the message is goneRetained for a configurable period (hours, days, weeks) regardless of consumer acknowledgement
Message replayNot supported natively — once a message is acked it cannot be replayed without re-publishingCore feature — any consumer group can seek to offset 0 and replay the entire event history
Throughput ceilingApproximately 50,000 messages per second per node under ideal conditionsMillions of messages per second per node — designed for extreme throughput at the cost of lower-level control
Ordering guaranteeFIFO per queue — messages in a single queue are delivered in orderFIFO per partition — ordering is guaranteed within a partition; global ordering requires a single partition at the cost of throughput
Consumer isolationCompeting consumers share one queue — each message delivered to exactly one consumerEach consumer group maintains independent offsets — truly independent fan-out with no interference between groups
Best forTask queues, RPC patterns, complex routing logic with multiple exchanges, work distributionEvent streaming, audit logs, data pipelines, event sourcing, analytics, and any use case requiring replay
Operational complexityLow — single binary, easy setup, excellent management UI out of the boxHigher — requires partition tuning, consumer group coordination, and offset management; KRaft mode removes ZooKeeper in recent versions
Dead-letter supportFirst-class feature via x-dead-letter-exchange queue argument — automatic routing on nack or TTL expiryConvention-based retry and dead-letter topic pattern — requires explicit consumer logic to route to retry or DLT topics
Message size practical limit128MB default, configurable — but large messages should be stored externally with the queue carrying a reference1MB default, configurable — same advice applies: use the queue for metadata and reference, not for payloads

Key takeaways

1
Producers fire-and-forget
they hand the message to the broker and return immediately. This latency independence between services is the core architectural benefit, and it is why message queues are the first tool experienced engineers reach for when synchronous coupling creates reliability problems.
2
The acknowledgement model is your delivery guarantee
only ack after all side effects succeed and are committed to durable storage. Use nack with requeue=False for permanent failures to route to the DLQ. Auto-ack is a data loss mechanism with no legitimate use in any consumer that does real work.
3
Queue means one consumer gets the message
correct for tasks. Topic or fanout means every subscriber gets the message — correct for domain events. Choose based on whether multiple downstream services need to react to the same event, not based on expected volume.
4
A DLQ without depth alerting is deferred data loss
set the alert threshold at one message, not some arbitrary number that gives you false confidence. Treat DLQ growth exactly like a 500 error rate spike: investigate immediately, fix the root cause, and replay selectively after the fix is deployed.

Common mistakes to avoid

5 patterns
×

Using auto-ack in production consumers

Symptom
Messages disappear silently with no trace in logs or DLQ — the broker deletes them on delivery before any processing occurs. Consumer crashes or OOM kills mid-processing lose messages permanently and the queue dashboard shows zero unacknowledged messages, which looks healthy.
Fix
Set autoAck=false in every consumer and call basicAck only after all business logic succeeds and all side effects are committed to durable storage. This is not a performance optimisation to revisit later — it is table stakes for any consumer doing real work.
×

Not setting prefetch count, relying on default unbounded delivery

Symptom
One fast consumer accumulates hundreds of unacknowledged messages while other consumer instances sit idle with nothing to process. When that one consumer crashes, it loses all its in-flight messages simultaneously. The queue appears to be processing but throughput is effectively single-threaded.
Fix
Call channel.basicQos(1) before starting consumption. This tells the broker to deliver at most one unacknowledged message to each consumer at a time, enforcing fair dispatch across all running instances and making horizontal scaling actually effective.
×

Skipping idempotency checks because duplicates 'rarely happen'

Symptom
At-least-once delivery guarantees duplicates over time — they are not edge cases. Users get charged twice, confirmation emails arrive multiple times, inventory decrements twice for the same order. The incidents trace back to a consumer that processed the same message twice during a network blip or consumer restart.
Fix
Every consumer that performs non-idempotent operations must check the event_id against a processed_events table before acting. If the event_id already exists, ack the message and return — this is the correct response to a duplicate, not an error.
×

Requeuing poison messages with requeue=True on nack

Symptom
A permanently broken message fails, gets requeued immediately, fails again, repeats thousands of times per second. The consumer's CPU spikes, no other messages in the queue can be processed, and the queue dashboard shows the same message_unacknowledged count cycling while messages_ready stays frozen.
Fix
Always use requeue=False on nack for permanent failures. Implement a retry queue with exponential backoff TTL for transient failures, and route to the DLQ after max retries. Never put a message back on the main queue without a delay — immediate requeue of a failing message is always wrong.
×

Treating the DLQ as a quiet background feature with no alerting

Symptom
DLQ accumulates hundreds of failed messages over days or weeks while everyone assumes the system is healthy. By the time someone checks, the root cause requires forensic investigation across days of logs, and time-sensitive messages representing real customer failures are permanently unrecoverable.
Fix
Set an alert on DLQ message count at a threshold of one — not ten, not one hundred, one. A single message in the DLQ represents a real failure happening right now. Page the on-call engineer, treat it like a 500 error spike, and investigate before more messages arrive.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
How do you handle backpressure in a system where the producer is signifi...
Q02SENIOR
Explain the difference between at-least-once and exactly-once delivery s...
Q03SENIOR
Design a distributed retry pattern. How do you avoid thundering herd pro...
Q04JUNIOR
What is the difference between a message queue and an event streaming pl...
Q01 of 04SENIOR

How do you handle backpressure in a system where the producer is significantly faster than the consumer?

ANSWER
I implement three coordinated layers, not just one setting. First, prefetch count: setting basicQos(1) on each consumer prevents the broker from flooding any single consumer and forces fair distribution, giving you meaningful horizontal scaling. Second, queue length limits: configuring max-length on the queue with a drop-head or reject-publish overflow policy prevents unbounded memory growth on the broker — this signals to the producer that the system is under load, which can trigger producer-side rate limiting if you add publisher confirms and handle nacks. Third, consumer autoscaling: monitoring queue depth as a Kubernetes HPA metric or an AWS Auto Scaling trigger lets you spin up additional consumer instances when depth exceeds a threshold. The key insight is that prefetch and queue limits are backpressure valves that protect the system under temporary overload — autoscaling is what actually resolves the underlying throughput mismatch when the overload is sustained.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
What is the difference between a message queue and an event streaming platform like Kafka?
02
What happens if a consumer crashes before acknowledging a message?
03
Can I use a database as a message queue?
04
What is the Outbox pattern and when should I use it?
🔥

That's Components. Mark it forged?

12 min read · try the examples if you haven't

Previous
CDN — Content Delivery Network
5 / 18 · Components
Next
API Gateway