Saga Pattern - Compensation Race with Slow Retry
Payment notification delay causes saga compensation to run while retry succeeds, double-charging customers.
- Saga pattern manages distributed transactions across services without distributed locks
- Two flavors: choreography (events, no central coordinator) and orchestration (central coordinator sends commands)
- Compensation transactions undo each step on failure — must be idempotent
- Performance: choreography reduces latency by ~30% but debugging is harder
- Production pitfall: partial compensation if a retry succeeds during rollback — need idempotency keys
- Biggest mistake: thinking exactly-once delivery is possible; design for at-least-once with dedup
Imagine you're booking a holiday online — the website books your flight, then your hotel, then your rental car, all in one go. If the car rental fails, it doesn't just stop there and leave you stranded; it cancels the hotel and then cancels the flight, working backwards to undo everything cleanly. That's a Saga: a sequence of steps where each step knows its own 'undo' move. It's how big distributed systems keep your data consistent without locking everything up at once.
Modern distributed systems have quietly broken one of the oldest guarantees in computing: the database transaction. When your order touches a Payment Service, an Inventory Service, and a Shipping Service — each with its own database — you can't just wrap them in a single BEGIN/COMMIT block. The network doesn't care about your ACID properties, and holding distributed locks across three services for 800ms in production is a reliability nightmare waiting to happen.
The Saga pattern is the pragmatic answer to this problem. Instead of one giant atomic transaction, a Saga breaks a business operation into a sequence of local transactions, each immediately committed to its own service's database. If anything goes wrong midway, a series of compensating transactions — purpose-built undo operations — roll the system back to a semantically consistent state. It's eventual consistency with a safety net.
By the end of this article you'll understand exactly how Sagas work at the implementation level, when to pick choreography over orchestration (and why getting this wrong is expensive), how to handle the genuinely hard edge cases like idempotency and concurrent compensations, and what production systems like Uber and Netflix actually wrestle with when running Sagas at scale. Expect real, runnable Java code, concrete failure scenarios, and the kind of detail that separates a confident system design interview answer from a vague one.
What is Saga Pattern?
The Saga pattern is a sequence of local transactions where each step has a compensating action. Unlike a distributed transaction (2PC), the saga commits each step immediately. If a later step fails, earlier steps must be undone via compensations. This trades strong consistency for availability — your data may be temporarily inconsistent, but the system stays up.
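To make that definition concrete, here is a minimal in-memory sketch of a saga runner. The names `SagaStep` and `SimpleSaga` are made up for illustration and come from no library, and each "local transaction" is just a `Runnable`. A real saga would persist its progress and talk to remote services.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical types for illustration only.
final class SagaStep {
    final String name;
    final Runnable action;        // local transaction, committed immediately
    final Runnable compensation;  // semantic undo for that transaction

    SagaStep(String name, Runnable action, Runnable compensation) {
        this.name = name;
        this.action = action;
        this.compensation = compensation;
    }
}

final class SimpleSaga {
    /** Runs steps in order; on failure, compensates completed steps in reverse. Returns true on success. */
    static boolean run(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.action.run();
                completed.push(step); // the most recent step ends up on top of the stack
            } catch (RuntimeException e) {
                // The failed step never committed, so only earlier steps are undone.
                while (!completed.isEmpty()) {
                    completed.pop().compensation.run();
                }
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        List<SagaStep> steps = List.of(
            new SagaStep("flight", () -> log.add("book flight"), () -> log.add("cancel flight")),
            new SagaStep("hotel", () -> log.add("book hotel"), () -> log.add("cancel hotel")),
            new SagaStep("car", () -> { throw new IllegalStateException("no cars left"); },
                         () -> log.add("cancel car")));
        System.out.println(run(steps) + " " + log);
    }
}
```

The stack is the whole trick: pushing each committed step and popping on failure gives you the "work backwards" behaviour from the holiday-booking analogy without any extra bookkeeping.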
Why does this matter? Because in production, network partitions happen. Databases go down. Holding a lock across three services for 800ms doesn't just slow things down — it causes cascading timeouts. Sagas let each service commit locally and move on. The inconsistency window is usually seconds, not minutes.
Common misconception: that sagas are asynchronous by default. They are not. You can run a saga synchronously with a coordinator, but that defeats the purpose. Real sagas use async messaging or event-driven coordination to decouple services.
Choreography vs Orchestration: Choosing Your Coordination Model
Two coordination patterns dominate saga implementations: choreography and orchestration. Choreography uses async events — each service publishes an event after its local transaction, and the next service subscribes. There's no central brain. Orchestration uses a coordinator that tells each service what to do next, like a conductor.
Choreography feels simpler at first. No single point of failure, no coordinator to crash. But the hidden cost is observability. To understand why a saga failed, you must correlate logs across every participant. Orchestration centralizes flow logic, making monitoring and retries straightforward. The trade-off? An extra network hop per step and a potential SPOF — unless you persist the orchestrator's state.
In production, most teams start with choreography for low-risk flows (notification chains) and switch to orchestration for payment pipelines where every step must be auditable. I've seen a team waste three weeks debugging a choreography saga that only had four services. After moving to orchestration, the same bug took two hours to fix.
- Choreography: Decentralized, resilient, but hard to trace.
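- Orchestration: Centralized flow control, easy to monitor, but a potential single point of failure unless the coordinator's state is persisted.
- The orchestration coordinator process itself should hold no in-memory state; it recovers by reloading persisted saga state (for example via event sourcing).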
- For critical money flows, always use orchestration — you'll thank yourself during an audit.
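To see what "no central brain" looks like, here is a rough choreography sketch. `EventBus` is a hypothetical in-memory stand-in for a real message broker, delivery is synchronous purely for readability, and the event names are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical in-memory bus; a real system would use a durable broker.
final class EventBus {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    void subscribe(String eventType, Consumer<String> handler) {
        subscribers.computeIfAbsent(eventType, k -> new ArrayList<>()).add(handler);
    }

    void publish(String eventType, String orderId) {
        for (Consumer<String> handler : subscribers.getOrDefault(eventType, List.of())) {
            handler.accept(orderId);
        }
    }
}

final class ChoreographyDemo {
    /** Wires three reactions together and returns the order in which they fired. */
    static List<String> run(boolean inventoryAvailable) {
        EventBus bus = new EventBus();
        List<String> trace = new ArrayList<>();

        // Payment service: reacts to a new order, then announces the result.
        bus.subscribe("OrderCreated", id -> {
            trace.add("payment charged");
            bus.publish("PaymentCompleted", id);
        });
        // Inventory service: reacts to the payment, succeeds or fails.
        bus.subscribe("PaymentCompleted", id -> {
            if (inventoryAvailable) {
                trace.add("inventory reserved");
                bus.publish("InventoryReserved", id);
            } else {
                trace.add("inventory failed");
                bus.publish("InventoryFailed", id);
            }
        });
        // Payment service again: compensates when inventory fails.
        bus.subscribe("InventoryFailed", id -> trace.add("payment refunded"));

        bus.publish("OrderCreated", "order-1");
        return trace;
    }
}
```

Notice that the flow exists nowhere as a single piece of code: it is implied by who subscribes to what. That is the observability cost described above. An orchestrator would instead hold this sequence in one place and call each service in turn.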
Compensation Transactions: Designing the Undo Button That Actually Works
A compensation transaction is the logical inverse of a forward step. It's a new transaction that reverses the effect — not a database rollback. For example, if a payment deducted $10, the compensation is a refund of $10. That refund is a separate call with its own side effects.
Designing compensations requires care: they must be idempotent (running twice is safe), commutative with their forward step (safe even if the compensation arrives before the forward action it undoes), and retryable until they succeed (they must not fail permanently). In production, compensating actions do fail: payment gateways go down, inventory systems are slow. Your saga must handle that by retrying with exponential backoff and a capped number of attempts. After the retries are exhausted, escalate.
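A capped backoff loop for a compensation might look roughly like this. `CompensationRetrier` and its parameters are hypothetical, the compensation is modelled as a `BooleanSupplier` that returns true on success, and the sleep is injected so the sketch can be exercised without real waiting.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;

// Hypothetical helper for illustration; not part of any library.
final class CompensationRetrier {
    enum Outcome { COMPENSATED, ESCALATED }

    /**
     * Tries the compensation up to maxAttempts times, doubling the delay
     * between attempts up to maxDelayMs. Returns ESCALATED when attempts run out.
     */
    static Outcome run(BooleanSupplier compensation, int maxAttempts,
                       long baseDelayMs, long maxDelayMs, LongConsumer sleeper) {
        long delay = baseDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (compensation.getAsBoolean()) {
                return Outcome.COMPENSATED;
            }
            if (attempt < maxAttempts) {          // no point sleeping after the last try
                sleeper.accept(delay);
                delay = Math.min(delay * 2, maxDelayMs);
            }
        }
        return Outcome.ESCALATED;                 // hand over to a human or a dead-letter queue
    }
}
```

In a real service the `sleeper` would be something like a scheduled re-delivery rather than a blocking sleep, and `ESCALATED` would trigger whatever manual-intervention process your team has. Adding jitter to the delay is a common refinement that this sketch leaves out.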
A common trap: compensating a payment that was never actually charged. If the payment step timed out but later completes, you now have a compensation fighting a forward operation. The fix is a saga state machine with explicit phases: PENDING, PROCESSING, COMPENSATING, COMPLETED. Reject any forward completion when in COMPENSATING.
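Here is one way the phase check could be sketched, using the four phases named above. `SagaStateMachine` is a hypothetical class, and the state lives in memory only; in production it would be a persisted row or event stream.

```java
// Hypothetical in-memory state machine for illustration only.
final class SagaStateMachine {
    enum Phase { PENDING, PROCESSING, COMPENSATING, COMPLETED }

    private Phase phase = Phase.PENDING;

    synchronized Phase phase() {
        return phase;
    }

    synchronized void start() {
        if (phase != Phase.PENDING) {
            throw new IllegalStateException("cannot start from " + phase);
        }
        phase = Phase.PROCESSING;
    }

    /** Called when a step fails or times out. Returns false if the saga already moved on. */
    synchronized boolean beginCompensation() {
        if (phase != Phase.PROCESSING) {
            return false;
        }
        phase = Phase.COMPENSATING;
        return true;
    }

    /** Called when the forward path finishes. Rejected once compensation has begun. */
    synchronized boolean completeForward() {
        if (phase != Phase.PROCESSING) {
            return false;   // late arrival: the caller must not treat the saga as successful
        }
        phase = Phase.COMPLETED;
        return true;
    }
}
```

Whichever of `beginCompensation()` and `completeForward()` runs first wins, and the loser gets `false`. Rejecting the late completion only protects the saga record. If the slow payment did go through, the compensation still has to refund it.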
Latency mismatch is another real problem. A refund might take 24 hours in banking. Your saga timeout must account for that, or you can split the refund into an async step that doesn't block the main flow.
Idempotency and Ordering: The Two Hardest Problems in Sagas
In distributed sagas, the same message can be delivered more than once. If your forward step is not idempotent, duplicate deliveries cause double charges, duplicate reservations, or duplicate shipments. The fix is an idempotency key: a unique string (for example saga ID + step name) that stays the same across retries and that the receiver deduplicates against. Do not fold the retry number into the key, or every retry looks like a brand-new request.
The receiver stores processed keys with a TTL longer than the maximum retry window. If a duplicate arrives, return the stored response. The TTL must cover the entire saga lifetime plus the retry margin. A 5-minute TTL when retries last 20 minutes is a time bomb.
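A receiver-side dedup store could be sketched as below. `IdempotencyStore` is a hypothetical class, responses are plain strings, and the clock is injected so expiry can be tested. A real implementation would normally keep these entries in a database or cache shared by all instances of the service.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical in-memory store for illustration only.
final class IdempotencyStore {
    private static final class Entry {
        final String response;
        final long expiresAtMs;

        Entry(String response, long expiresAtMs) {
            this.response = response;
            this.expiresAtMs = expiresAtMs;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final long ttlMs;
    private final LongSupplier clock;

    IdempotencyStore(long ttlMs, LongSupplier clock) {
        this.ttlMs = ttlMs;
        this.clock = clock;
    }

    /** Runs the handler once per key within the TTL; duplicates get the stored response. */
    String execute(String key, Supplier<String> handler) {
        long now = clock.getAsLong();
        // compute() is atomic per key, so two concurrent duplicates cannot both run the handler.
        // Running the handler inside compute() keeps the sketch short; a real service would
        // avoid holding the map entry during a slow remote call.
        Entry entry = entries.compute(key, (k, existing) -> {
            if (existing != null && existing.expiresAtMs > now) {
                return existing;
            }
            return new Entry(handler.get(), now + ttlMs);
        });
        return entry.response;
    }
}
```

Calling `execute("saga-42:charge-payment", ...)` twice inside the TTL runs the charge once and hands back the stored result the second time. Once the entry expires, the same key is treated as new, which is exactly the time bomb described above when the TTL is shorter than the retry window.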
Ordering is equally tricky. In a choreography saga, events can arrive out of order if the message broker partitions or reorders messages. A 'shipped' event might arrive before 'payment confirmed'. Your service must handle that gracefully — either reject out-of-order events or buffer and reorder them using sequence numbers embedded in the event payload.
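The buffer-and-reorder option can be sketched like this, assuming each event carries a gapless, per-saga sequence number. `ReorderBuffer` is a hypothetical name, and events are plain strings to keep the example small.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-saga buffer for illustration only.
final class ReorderBuffer {
    private final Map<Long, String> pending = new HashMap<>();
    private long nextSeq;

    ReorderBuffer(long firstSeq) {
        this.nextSeq = firstSeq;
    }

    /** Accepts one event and returns every event that is now deliverable, in sequence order. */
    synchronized List<String> accept(long seq, String event) {
        List<String> ready = new ArrayList<>();
        if (seq < nextSeq) {
            return ready;                    // already delivered: drop the duplicate
        }
        pending.putIfAbsent(seq, event);     // hold early arrivals until the gap closes
        while (pending.containsKey(nextSeq)) {
            ready.add(pending.remove(nextSeq));
            nextSeq++;
        }
        return ready;
    }
}
```

If 'shipped' arrives with sequence 2 before 'payment confirmed' with sequence 1, the first call returns nothing and the second returns both in the right order. A production version would also need a timeout for gaps that never close, which this sketch does not attempt.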
In an orchestration saga, the coordinator enforces ordering. But what if the coordinator sends two commands concurrently for the same saga? Use a versioned saga state — reject any command carrying a stale version. This prevents the lost update problem.
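A versioned state check is essentially optimistic concurrency control. The sketch below uses an `AtomicReference` as a stand-in for a database row with a version column. `VersionedSaga` and the step names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical in-memory stand-in for a versioned saga row.
final class VersionedSaga {
    static final class State {
        final long version;
        final String step;

        State(long version, String step) {
            this.version = version;
            this.step = step;
        }
    }

    private final AtomicReference<State> state =
        new AtomicReference<>(new State(0L, "STARTED"));

    State current() {
        return state.get();
    }

    /** Applies the command only if it was issued against the current version. */
    boolean apply(long expectedVersion, String nextStep) {
        State cur = state.get();
        if (cur.version != expectedVersion) {
            return false;                        // stale command: reject it
        }
        // compareAndSet fails if another thread updated the state in between.
        return state.compareAndSet(cur, new State(cur.version + 1, nextStep));
    }
}
```

With a relational database the same idea is usually an `UPDATE ... WHERE version = ?` that checks the affected row count. Two commands issued against version 0 cannot both succeed, so the second one has to reload the state and decide again.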
Production Pitfalls: What Actually Breaks at Scale
Running sagas in production uncovers failure modes that unit tests never simulate. Here are the most common ones:
- Partial compensation: compensating step A succeeds, step B fails. Now you have a partial undo. The only safe path: keep retrying B with exponential backoff, then escalate after exhaustion. You need a manual intervention playbook.
- Timeout-induced double execution: A step times out, compensation starts, but the original step actually completed slowly. The compensation then undoes a completed step. The fix: use a state machine with explicit 'compensating' state that rejects forward completions.
- Dependency order in compensations: If you must release inventory before refunding payment, the compensation order must be the forward order reversed, so the last completed step is undone first. In orchestration, this is built in. In choreography, you need to enforce event sequencing, which is non-trivial.
- Monitoring blind spots: A saga that fails silently because nobody monitors pending sagas. Each saga should emit metrics: active count, completed count, failed count by step. Alert on active sagas older than X minutes — this catches stuck sagas before customers notice.
- Database sagas: When your saga spans database transactions (e.g., an order database and a payment database), you need to handle the case where the first DB commits but the second fails. Use the transactional outbox pattern to ensure reliable message delivery.
- Persist state transitions in a database or event log.
- Use a lock on the saga row to prevent concurrent state updates.
- The coordinator is just a state machine interpreter.
- With state persisted, you can restart the coordinator and replay incomplete sagas.
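The monitoring advice in the pitfalls list (alert on active sagas older than X minutes) can be sketched as a simple scan over persisted saga start times. `StuckSagaDetector` is a hypothetical helper, and the map stands in for a query against your saga table.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only.
final class StuckSagaDetector {
    /** Returns the IDs of active sagas that started more than maxAge before now, sorted by ID. */
    static List<String> findStuck(Map<String, Instant> activeSagaStartTimes,
                                  Instant now, Duration maxAge) {
        List<String> stuck = new ArrayList<>();
        // TreeMap gives a stable, sorted iteration order for the report.
        for (Map.Entry<String, Instant> e : new TreeMap<>(activeSagaStartTimes).entrySet()) {
            Duration age = Duration.between(e.getValue(), now);
            if (age.compareTo(maxAge) > 0) {
                stuck.add(e.getKey());
            }
        }
        return stuck;
    }
}
```

Run something like this on a schedule and page someone when the list is non-empty. Because the saga state is persisted, the same scan also tells a restarted coordinator which sagas it needs to resume.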
Compensation Race: When Rollback Clashes with a Slow Retry
- Never trust message ordering. Use a persistent saga state that all participants read before acting.
- Idempotency alone isn't enough for sagas — you need versioned state to reject late arrivals.
- Always test with network delays and message duplication in chaos experiments.
Common mistakes to avoid
- Not designing compensations for failure
- Assuming exactly-once message delivery
- Ignoring timeout and partial completion edge cases
- Using a synchronous coordination model without async boundaries
Interview Questions on This Topic
What is a Saga pattern, and how does it differ from a distributed transaction (2PC)?