Transactional Outbox Pattern: Stop Losing Events in Production — The Definitive Guide
Transactional outbox pattern explained with production code, failure modes, and debugging.
20+ years shipping large-scale distributed systems. Lessons pulled from things that broke in production.
Use the transactional outbox pattern when you need to guarantee that a message is eventually published after a database transaction commits. Write the event to an outbox table in the same DB transaction, then have a background process poll and publish those events to your message broker. Delivery is at-least-once, not exactly-once, so consumers must tolerate duplicates.
Imagine you're a bartender who needs to both pour a drink and tell the waitstaff it's ready. If you pour the drink but forget to call out, the waitstaff never picks it up. The outbox pattern is like writing the order on a slip of paper and putting it on a spike — the drink is poured (DB commit) and the order slip is there (outbox). A runner checks the spike every few seconds and yells the order. If the runner crashes, the slip is still on the spike when they restart. No drink gets lost.
You've been there. A payment succeeds in the database, but the confirmation email never sends. Or an order is placed, but the inventory service never gets the message. The root cause? Your service committed the transaction, then tried to publish a message — and crashed between the two. That's the dual-write problem, and it's been burning production systems since the dawn of microservices.
The transactional outbox pattern is the battle-tested solution. Instead of sending messages directly from your business logic, you write them to a database table in the same transaction. A separate process — the publisher — reads that table and sends the messages to the broker. If the publisher crashes, it picks up where it left off. No more lost events.
By the end of this article, you'll be able to implement the transactional outbox pattern in your own services, handle edge cases like duplicate messages and backpressure, and debug the most common production failures. You'll also know exactly when this pattern is overkill and what simpler alternatives exist.
Why You Can't Trust Direct Message Publishing
The naive approach: after your business logic commits the DB transaction, you publish a message to Kafka/RabbitMQ/SQS. This works 99.9% of the time. But that 0.1%? That's your 3am call. The service crashes between commit and publish. Or the broker is briefly unavailable. Or the network times out. The DB says the order is placed, but the rest of the system never knows.
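Here's a minimal sketch of that failure, assuming an in-memory SQLite database and a plain Python list standing in for the broker. The names `place_order_naive`, `publish`, and `BrokerDown` are made up for illustration; the forced failure simulates a crash or outage between the commit and the publish.

```python
import sqlite3


class BrokerDown(Exception):
    """Stands in for a crash or network failure at publish time."""


def publish(broker, message, fail=False):
    # Hypothetical publish call; `fail` simulates the broker being unreachable.
    if fail:
        raise BrokerDown("broker unreachable")
    broker.append(message)


def place_order_naive(conn, broker, order_id, fail_publish=False):
    # Step 1: commit the business data.
    with conn:
        conn.execute("INSERT INTO orders (id) VALUES (?)", (order_id,))
    # Step 2: publish after the commit. Anything that goes wrong here
    # leaves the database and the broker out of sync.
    publish(broker, {"type": "OrderPlaced", "order_id": order_id}, fail=fail_publish)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY)")
broker = []  # in-memory stand-in for Kafka/RabbitMQ/SQS

try:
    place_order_naive(conn, broker, "order-1", fail_publish=True)
except BrokerDown:
    pass

orders_in_db = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(orders_in_db, len(broker))  # the order exists, the message does not
```

The order row is committed and the broker never hears about it. Retrying inside the request handler only narrows the window; it doesn't close it.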
Before the outbox pattern, teams hacked around this with two-phase commits (too slow), delayed retries (complex), or just hoped for the best (production incidents). The outbox pattern is the pragmatic middle ground: atomicity without distributed transactions.
The Outbox Pattern: Atomic Writes to the Rescue
The fix is brutally simple: write the event to a database table (the outbox) in the same transaction as your business operation. If the transaction commits, the event is persisted. If it rolls back, the event disappears with the business data. A separate process — the publisher — reads the outbox table and sends events to the broker.
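A minimal sketch of the atomic write, assuming SQLite and a hypothetical `orders` table. The `outbox_events` table name and its `published` flag match the debugging query later in this article; the other columns, the `place_order` function, and the `fail_before_commit` switch are assumptions for illustration.

```python
import json
import sqlite3
import uuid


def init_schema(conn):
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL)")
    conn.execute(
        """CREATE TABLE outbox_events (
               id TEXT PRIMARY KEY,
               event_type TEXT NOT NULL,
               payload TEXT NOT NULL,
               published INTEGER NOT NULL DEFAULT 0,
               created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
           )"""
    )


def place_order(conn, order_id, total_cents, fail_before_commit=False):
    event_id = str(uuid.uuid4())
    payload = json.dumps({"order_id": order_id, "total_cents": total_cents})
    # `with conn` wraps both inserts in one transaction:
    # commit on success, rollback if anything inside raises.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, total_cents) VALUES (?, ?)",
            (order_id, total_cents),
        )
        conn.execute(
            "INSERT INTO outbox_events (id, event_type, payload) VALUES (?, ?, ?)",
            (event_id, "OrderPlaced", payload),
        )
        if fail_before_commit:
            # Simulated late failure, to show that both rows roll back together.
            raise RuntimeError("simulated failure before commit")
    return event_id


conn = sqlite3.connect(":memory:")
init_schema(conn)

place_order(conn, "order-1", 4999)
try:
    place_order(conn, "order-2", 1500, fail_before_commit=True)
except RuntimeError:
    pass

order_ids = [row[0] for row in conn.execute("SELECT id FROM orders ORDER BY id")]
event_count = conn.execute("SELECT COUNT(*) FROM outbox_events").fetchone()[0]
print(order_ids, event_count)
```

Only `order-1` and its single event survive. The failed call leaves neither an order nor an orphaned event behind, which is the whole point.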
This decouples the reliability of the DB from the reliability of the network. The DB is local, fast, and transactional. The network is none of those things. By writing the event first, you guarantee it won't be lost even if the publisher crashes mid-flight.
Polling vs CDC: Which Publisher Strategy Wins?
The simplest publisher polls the outbox table every N milliseconds. This works fine for most systems, but has two drawbacks: (1) polling adds up to one poll interval of latency to every event, and (2) it puts query load on the database, especially if the table holds many rows and the unpublished-rows query is not indexed.
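A polling publisher can be sketched in a few lines. This assumes SQLite, a trimmed-down `outbox_events` table, and a `send` callable standing in for your broker client; `publish_pending` and `run_forever` are hypothetical names, and ordering by `rowid` is SQLite-specific (use a sequence or timestamp column elsewhere).

```python
import sqlite3
import time


def publish_pending(conn, send, batch_size=100):
    """Publish one batch of unpublished events, oldest first. Returns the count sent."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox_events "
        "WHERE published = 0 ORDER BY rowid LIMIT ?",
        (batch_size,),
    ).fetchall()
    sent_count = 0
    for event_id, payload in rows:
        send(event_id, payload)  # if this raises, the row stays unpublished
        with conn:
            # Mark as published only after the broker accepted the event.
            conn.execute("UPDATE outbox_events SET published = 1 WHERE id = ?", (event_id,))
        sent_count += 1
    return sent_count


def run_forever(conn, send, poll_interval_s=0.5):
    # Sleep only when the outbox is empty, so a backlog drains at full speed.
    while True:
        if publish_pending(conn, send) == 0:
            time.sleep(poll_interval_s)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox_events ("
    "id TEXT PRIMARY KEY, payload TEXT NOT NULL, published INTEGER NOT NULL DEFAULT 0)"
)
with conn:
    conn.executemany(
        "INSERT INTO outbox_events (id, payload) VALUES (?, ?)",
        [("evt-1", "{}"), ("evt-2", "{}")],
    )

sent = []
first_pass = publish_pending(conn, lambda event_id, payload: sent.append(event_id))
second_pass = publish_pending(conn, lambda event_id, payload: sent.append(event_id))
print(first_pass, second_pass, sent)
```

Note the ordering inside the loop: send first, mark second. A crash between those two steps means the event goes out again on restart, which is why delivery is at-least-once.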
Change Data Capture (CDC) is the alternative. Tools like Debezium read the database's transaction log (WAL in PostgreSQL, binlog in MySQL) and stream changes to Kafka. This gives you low latency without polling queries hitting the outbox table, although reading the log is not entirely free for the database. The trade-off: you now manage a CDC pipeline, which is another moving part.
My rule of thumb: if your latency requirement is <100ms and you already have Kafka, use CDC. If you're on a simpler stack or latency isn't critical, polling is fine. I've run polling-based outboxes at 500ms intervals handling 10k events/sec without issues.
Handling Duplicates: Idempotent Consumers Save Your Sanity
Even with the outbox pattern, duplicates can happen. The publisher might crash after publishing but before marking the event as published. On restart, it re-publishes the same event. Your consumer sees it twice.
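The fix is idempotent consumers. Each event carries a unique ID (UUID). The consumer stores processed event IDs in a deduplication table (or as Redis keys with a TTL, since Redis cannot expire individual members of a set). Before processing an event, it checks whether the ID was already processed. If yes, it skips. Record the ID in the same transaction as the event's side effects where you can; otherwise a crash between the two reopens the gap.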
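Here's one way the deduplication check might look, assuming the consumer's own state and the dedup table live in the same SQLite database so both can share a transaction. The `processed_events` and `inventory` tables and the `handle_once` helper are hypothetical.

```python
import sqlite3


def handle_once(conn, event_id, apply_effect):
    """Apply an event's effect at most once. Returns False for a duplicate."""
    with conn:  # the dedup record and the side effect commit (or roll back) together
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,),
        )
        if cur.rowcount == 0:
            # The ID was already recorded, so this delivery is a duplicate.
            return False
        apply_effect(conn)
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE inventory (sku TEXT PRIMARY KEY, quantity INTEGER NOT NULL)")
with conn:
    conn.execute("INSERT INTO inventory (sku, quantity) VALUES ('widget', 10)")


def reserve_one(c):
    c.execute("UPDATE inventory SET quantity = quantity - 1 WHERE sku = 'widget'")


first = handle_once(conn, "evt-1", reserve_one)
second = handle_once(conn, "evt-1", reserve_one)  # same event delivered again
quantity = conn.execute("SELECT quantity FROM inventory WHERE sku = 'widget'").fetchone()[0]
print(first, second, quantity)
```

The second delivery is skipped and the stock is decremented once. If the side effect lives outside the database (an email, a third-party API call), this trick doesn't apply directly and you need an idempotency key on that call instead.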
This is not optional. Every production outbox implementation I've seen eventually produces duplicates — network retries, publisher crashes, DB replication lag. Idempotent consumers are your safety net.
When the Outbox Pattern Is Overkill
The outbox pattern adds complexity: an extra table, a publisher process, and deduplication logic. Don't use it if you don't need it.
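- You're using a message broker that supports transactions (e.g., Kafka transactional producers) and the state you're changing lives in the broker itself. Be careful here: a Kafka transaction covers only Kafka reads and writes. It is not atomic with a write to your relational database, so on its own it does not remove the dual-write problem.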
- Your system can tolerate occasional message loss. For example, a cache invalidation event can be retried on the next read.
- You're building a prototype or internal tool where losing a message means a manual retry.
My rule: if losing a single message costs you money or reputation, use the outbox pattern. Otherwise, keep it simple.
Production Gotchas: What Will Burn You
I've seen three common failures in production outbox implementations:
- Outbox table growth: If the publisher falls behind, the outbox table grows unbounded. This slows down your business transactions because every write also inserts into the outbox. Fix: add a TTL or archive processed events to a separate table. Or use a partitioned table and drop old partitions.
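- Lock contention and deadlocks on the outbox table: If several publisher instances use SELECT ... FOR UPDATE to claim events, they can block or deadlock one another, and long-held locks slow down everything else touching the table. Fix: claim rows with FOR UPDATE SKIP LOCKED where your database supports it (PostgreSQL and MySQL 8.0 do), keep the claiming transaction short, or use optimistic locking with a version column. A separate connection pool for the publisher won't prevent deadlocks, but it stops the publisher from starving business transactions of connections.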
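- Publisher backpressure: If the message broker is slow, the publisher's poll loop blocks, and events pile up. Fix: stop fetching new batches while the broker is struggling, retry with backoff, and leave unsent rows in the outbox, which already serves as your durable buffer. Never drop events from the publisher; move events that fail repeatedly to a dead-letter table so they can be inspected and replayed.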
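For the table-growth problem in the list above, a retention job can be sketched like this. It assumes SQLite, a `created_at` column stored as UTC text, and a seven-day retention window; `purge_published` and its parameters are hypothetical. Deleting in small batches keeps each transaction short.

```python
import sqlite3


def purge_published(conn, older_than_days=7, batch_size=1000):
    """Delete published events older than the cutoff, in batches. Returns rows deleted."""
    total = 0
    while True:
        with conn:  # one short transaction per batch
            cur = conn.execute(
                """DELETE FROM outbox_events
                   WHERE id IN (
                       SELECT id FROM outbox_events
                       WHERE published = 1
                         AND created_at < datetime('now', ?)
                       LIMIT ?
                   )""",
                (f"-{older_than_days} days", batch_size),
            )
        total += cur.rowcount
        if cur.rowcount < batch_size:
            return total


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE outbox_events (
           id TEXT PRIMARY KEY,
           published INTEGER NOT NULL DEFAULT 0,
           created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
       )"""
)
with conn:
    conn.execute("INSERT INTO outbox_events VALUES ('old-sent', 1, '2000-01-01 00:00:00')")
    conn.execute("INSERT INTO outbox_events VALUES ('old-unsent', 0, '2000-01-01 00:00:00')")
    conn.execute("INSERT INTO outbox_events (id, published) VALUES ('new-sent', 1)")

deleted = purge_published(conn)
remaining = sorted(row[0] for row in conn.execute("SELECT id FROM outbox_events"))
print(deleted, remaining)
```

Old unpublished rows are deliberately left alone: an old event that was never sent is an incident to investigate, not garbage to collect.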
The 3AM Payment That Vanished
- Never trust a network call after a DB commit.
- Always persist the intent to send in the same transaction as the business change.
kubectl logs deployment/outbox-publisher --tail=100
SELECT COUNT(*) FROM outbox_events WHERE published = false;
Key takeaways
Interview Questions on This Topic
How does the transactional outbox pattern handle concurrent writes to the outbox table from multiple service instances?
Frequently Asked Questions