Senior 5 min · March 05, 2026

Saga Pattern - Compensation Race with Slow Retry

Payment notification delay causes saga compensation to run while retry succeeds, double-charging customers.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Saga pattern manages distributed transactions across services without distributed locks
  • Two flavors: choreography (events, no central coordinator) and orchestration (central coordinator sends commands)
  • Compensation transactions undo each step on failure — must be idempotent
  • Performance: choreography reduces latency by ~30% but debugging is harder
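  • Production pitfall: a retry that succeeds during rollback leaves the saga partially compensated. Idempotency keys are necessary but not sufficient; the saga state must also reject late forward completions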
  • Biggest mistake: thinking exactly-once delivery is possible; design for at-least-once with dedup
Plain-English First

Imagine you're booking a holiday online — the website books your flight, then your hotel, then your rental car, all in one go. If the car rental fails, it doesn't just stop there and leave you stranded; it cancels the hotel and then cancels the flight, working backwards to undo everything cleanly. That's a Saga: a sequence of steps where each step knows its own 'undo' move. It's how big distributed systems keep your data consistent without locking everything up at once.

Modern distributed systems have quietly broken one of the oldest guarantees in computing: the database transaction. When your order touches a Payment Service, an Inventory Service, and a Shipping Service — each with its own database — you can't just wrap them in a single BEGIN/COMMIT block. The network doesn't care about your ACID properties, and holding distributed locks across three services for 800ms in production is a reliability nightmare waiting to happen.

The Saga pattern is the pragmatic answer to this problem. Instead of one giant atomic transaction, a Saga breaks a business operation into a sequence of local transactions, each immediately committed to its own service's database. If anything goes wrong midway, a series of compensating transactions — purpose-built undo operations — roll the system back to a semantically consistent state. It's eventual consistency with a safety net.

By the end of this article you'll understand exactly how Sagas work at the implementation level, when to pick choreography over orchestration (and why getting this wrong is expensive), how to handle the genuinely hard edge cases like idempotency and concurrent compensations, and what production systems like Uber and Netflix actually wrestle with when running Sagas at scale. Expect real, runnable Java code, concrete failure scenarios, and the kind of detail that separates a confident system design interview answer from a vague one.

What is Saga Pattern?

The Saga pattern is a sequence of local transactions where each step has a compensating action. Unlike a distributed transaction (2PC), the saga commits each step immediately. If a later step fails, earlier steps must be undone via compensations. This trades strong consistency for availability — your data may be temporarily inconsistent, but the system stays up.

Why does this matter? Because in production, network partitions happen. Databases go down. Holding a lock across three services for 800ms doesn't just slow things down — it causes cascading timeouts. Sagas let each service commit locally and move on. The inconsistency window is usually seconds, not minutes.

Common misconception: sagas are asynchronous by default. They are not. You can run a saga synchronously through a coordinator, but blocking on every step brings back the tight coupling that sagas are meant to remove. Production sagas usually rely on async messaging or event-driven coordination to decouple services.

io/thecodeforge/saga/OrderSagaOrchestrator.java · Java
package io.thecodeforge.saga;

import java.util.UUID;

public class OrderSagaOrchestrator {

    private final PaymentClient paymentClient;
    private final InventoryClient inventoryClient;
    private final ShippingClient shippingClient;

    public OrderSagaOrchestrator(PaymentClient payment, InventoryClient inventory, ShippingClient shipping) {
        this.paymentClient = payment;
        this.inventoryClient = inventory;
        this.shippingClient = shipping;
    }

    public void executeOrderSaga(UUID orderId, OrderDetails details) {
        String sagaId = UUID.randomUUID().toString();
        SagaContext ctx = new SagaContext(sagaId, orderId);

        try {
            // Step 1: Reserve Payment
            paymentClient.reserve(ctx, details.totalAmount());
            ctx.stepCompleted("payment");

            // Step 2: Reserve Inventory
            inventoryClient.reserve(ctx, details.items());
            ctx.stepCompleted("inventory");

            // Step 3: Create Shipment
            shippingClient.schedule(ctx, details.shippingAddress());
            ctx.stepCompleted("shipping");

            // Confirm — all participants commit their reservation
            ctx.commitAll();

        } catch (Exception e) {
            // Compensate in reverse order
            if (ctx.isStepCompleted("shipping")) {
                shippingClient.cancel(ctx);
            }
            if (ctx.isStepCompleted("inventory")) {
                inventoryClient.release(ctx);
            }
            if (ctx.isStepCompleted("payment")) {
                paymentClient.refund(ctx);
            }
            ctx.markFailed(e);
            throw new SagaExecutionException("Order saga failed for " + orderId, e);
        }
    }
}
Real Talk: ACID vs BASE
Sagas embrace BASE (Basically Available, Soft state, Eventually consistent). Don't try to force strong consistency into a saga — you'll end up with a slow, fragile 2PC. Accept the inconsistency window and design for it.
Production Insight
At a large fintech, a saga without idempotency keys caused a double withdrawal of $2M.
The compensation ran twice because the message broker delivered the same event after a network partition healed.
Rule: Idempotency is not optional in sagas — it's the safety net.
Key Takeaway
A saga is a sequence of local transactions with compensating actions.
It trades strong consistency for availability and scalability.
Always design compensations to be idempotent and commutative.

Choreography vs Orchestration: Choosing Your Coordination Model

Two coordination patterns dominate saga implementations: choreography and orchestration. Choreography uses async events — each service publishes an event after its local transaction, and the next service subscribes. There's no central brain. Orchestration uses a coordinator that tells each service what to do next, like a conductor.

Choreography feels simpler at first. No single point of failure, no coordinator to crash. But the hidden cost is observability. To understand why a saga failed, you must correlate logs across every participant. Orchestration centralizes flow logic, making monitoring and retries straightforward. The trade-off? An extra network hop per step and a potential SPOF — unless you persist the orchestrator's state.

In production, most teams start with choreography for low-risk flows (notification chains) and switch to orchestration for payment pipelines where every step must be auditable. I've seen a team waste three weeks debugging a choreography saga that only had four services. After moving to orchestration, the same bug took two hours to fix.

io/thecodeforge/saga/ChoreographyBasedSaga.java · Java
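package io.thecodeforge.saga;

// Assumption: @EventListener is Spring's annotation; the article does not name the framework.
import org.springframework.context.event.EventListener;

// Example: choreography via event publishing.
// PaymentService, InventoryService and ShippingService are illustrative collaborator types.
public class ChoreographyBasedSaga {

    private final PaymentService paymentService;
    private final InventoryService inventoryService;
    private final ShippingService shippingService;

    public ChoreographyBasedSaga(PaymentService paymentService,
                                 InventoryService inventoryService,
                                 ShippingService shippingService) {
        this.paymentService = paymentService;
        this.inventoryService = inventoryService;
        this.shippingService = shippingService;
    }

    @EventListener
    public void onOrderCreated(OrderCreatedEvent event) {
        paymentService.reservePayment(event.orderId(), event.amount());
        // payment will publish PaymentReservedEvent
    }

    @EventListener
    public void onPaymentReserved(PaymentReservedEvent event) {
        inventoryService.reserveInventory(event.orderId(), event.items());
        // inventory publishes InventoryReservedEvent
    }

    @EventListener
    public void onInventoryReserved(InventoryReservedEvent event) {
        shippingService.scheduleShipment(event.orderId(), event.address());
        // shipping publishes ShipmentScheduledEvent
    }

    @EventListener
    public void onPaymentFailed(PaymentFailedEvent event) {
        // Payment is the first step, so there is no inventory to release yet.
        // A failure at a later step would need to release inventory and refund payment,
        // and each listener has to work out which steps already succeeded.
        // This is where choreography gets messy.
    }
}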
Mental Model: Dance vs. Puppet Show
  • Choreography: Decentralized, resilient, but hard to trace.
  • Orchestration: Centralized flow control, easy to monitor, a single point of failure unless its state is persisted.
  • Keep the orchestrator process itself stateless: persist saga state (for example via event sourcing) so a restarted instance can recover it.
  • For critical money flows, always use orchestration — you'll thank yourself during an audit.
Production Insight
An orchestrator that crashes mid-saga leaves the system in an unknown state.
Persist saga state in a database before each step, then poll for incomplete sagas on restart.
Rule: Always store saga state before executing any external call.
Key Takeaway
Choreography for simple, low-risk flows.
Orchestration for anything financial or auditable.
Persist saga state — a crash without state is data loss.
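
The "store state before each external call, then find incomplete sagas on restart" rule can be sketched as below. This is a minimal illustration, not the article's implementation: SagaRecord, SagaStateStore and their methods are hypothetical names, and the in-memory map stands in for a durable saga table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical record of where a saga is; a real system would keep this in a database row.
final class SagaRecord {
    final String sagaId;
    final String nextStep;   // the step the coordinator is about to execute
    final boolean finished;

    SagaRecord(String sagaId, String nextStep, boolean finished) {
        this.sagaId = sagaId;
        this.nextStep = nextStep;
        this.finished = finished;
    }
}

final class SagaStateStore {
    // Assumption: this map stands in for durable storage.
    private final Map<String, SagaRecord> rows = new ConcurrentHashMap<>();

    // Call this BEFORE the external call for nextStep, so a crash leaves a trace.
    void recordIntent(String sagaId, String nextStep) {
        rows.put(sagaId, new SagaRecord(sagaId, nextStep, false));
    }

    void markFinished(String sagaId) {
        rows.computeIfPresent(sagaId, (id, r) -> new SagaRecord(id, r.nextStep, true));
    }

    // On restart, every unfinished saga must be resumed or compensated.
    List<SagaRecord> findIncomplete() {
        List<SagaRecord> incomplete = new ArrayList<>();
        for (SagaRecord r : rows.values()) {
            if (!r.finished) {
                incomplete.add(r);
            }
        }
        return incomplete;
    }
}
```

On startup the coordinator would call findIncomplete() and decide, per record, whether to re-issue nextStep (safe only if that step is idempotent) or to start compensating.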

Compensation Transactions: Designing the Undo Button That Actually Works

A compensation transaction is the logical inverse of a forward step. It's a new transaction that reverses the effect — not a database rollback. For example, if a payment deducted $10, the compensation is a refund of $10. That refund is a separate call with its own side effects.

Designing compensations requires care: they must be idempotent (running twice is safe), commutative with their forward step (a compensation that arrives before the forward request must still leave the correct end state), and self-healing (they must not fail permanently). In production, compensating actions fail: payment gateways go down, inventory systems are slow. Your saga must handle that by retrying with exponential backoff and a cap on retries. After exhaustion, escalate.
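
The retry policy described above (exponential backoff, capped retries, then escalation) might look like the following sketch. CompensationRetrier and its parameters are hypothetical; the sleeper is injected so the policy can be exercised without real waiting.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;

final class CompensationRetrier {
    private final int maxAttempts;
    private final long baseDelayMs;
    private final long maxDelayMs;
    private final LongConsumer sleeper; // receives the delay in ms; production code would sleep or schedule

    CompensationRetrier(int maxAttempts, long baseDelayMs, long maxDelayMs, LongConsumer sleeper) {
        this.maxAttempts = maxAttempts;
        this.baseDelayMs = baseDelayMs;
        this.maxDelayMs = maxDelayMs;
        this.sleeper = sleeper;
    }

    // Delay after failed attempt number `attempt` (1-based): base * 2^(attempt-1), capped at maxDelayMs.
    long delayFor(int attempt) {
        int shift = Math.min(attempt - 1, 20); // bound the shift so realistic base delays cannot overflow
        return Math.min(maxDelayMs, baseDelayMs << shift);
    }

    // Returns true if the compensation succeeded; false if retries ran out and escalate was invoked.
    boolean run(BooleanSupplier compensation, Runnable escalate) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (compensation.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                sleeper.accept(delayFor(attempt));
            }
        }
        escalate.run(); // e.g. page on-call or park the saga for manual intervention
        return false;
    }
}
```

The compensation passed in must itself be idempotent, because every retry re-executes it.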

A common trap: compensating a payment that was never actually charged. If the payment step timed out but later completes, you now have a compensation fighting a forward operation. The fix is a saga state machine with explicit phases: PENDING, PROCESSING, COMPENSATING, COMPLETED. Reject any forward completion when in COMPENSATING.
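
A minimal sketch of that state machine, using the four phases named above. SagaStateMachine and its method names are assumptions for illustration; a real coordinator would persist the phase rather than hold it in memory.

```java
import java.util.concurrent.atomic.AtomicReference;

enum SagaPhase { PENDING, PROCESSING, COMPENSATING, COMPLETED }

final class SagaStateMachine {
    private final AtomicReference<SagaPhase> phase = new AtomicReference<>(SagaPhase.PENDING);

    boolean start() {
        return phase.compareAndSet(SagaPhase.PENDING, SagaPhase.PROCESSING);
    }

    // A forward completion is only accepted while the saga is still PROCESSING.
    // A late success that arrives after compensation began loses this compare-and-set.
    boolean completeForward() {
        return phase.compareAndSet(SagaPhase.PROCESSING, SagaPhase.COMPLETED);
    }

    // Timeout or failure path: only wins if the saga has not already completed.
    boolean beginCompensation() {
        return phase.compareAndSet(SagaPhase.PROCESSING, SagaPhase.COMPENSATING);
    }

    SagaPhase current() {
        return phase.get();
    }
}
```

When completeForward() returns false for a payment success, the charge still happened on the payment side, so the caller has to make sure the compensation covers it (for example by issuing the refund) instead of silently dropping the notification.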

Latency mismatch is another real problem. A refund might take 24 hours in banking. Your saga timeout must account for that, or you can split the refund into an async step that doesn't block the main flow.

io/thecodeforge/saga/compensation/PaymentCompensation.java · Java
package io.thecodeforge.saga.compensation;

import io.thecodeforge.saga.SagaContext;
import java.util.UUID;

public class PaymentCompensation {

    private final PaymentGatewayClient client;

    public PaymentCompensation(PaymentGatewayClient client) {
        this.client = client;
    }

    /**
     * Refund the full amount. Idempotent: same refundId always results in one refund.
     */
    public RefundResult refund(SagaContext ctx, UUID refundId, Money amount) {
        // Use refundId as the idempotency key so a retried call cannot refund twice
        RefundResult result = client.submitRefund(refundId, amount);
        if (!result.isSuccess()) {
            // Surface the failure; the caller retries with exponential backoff, reusing the same refundId
            throw new CompensatingActionFailedException(ctx.sagaId(), refundId, result.errorMessage());
        }
        return result;
    }
}
Common Trap: Synchronous Compensation
Don't run compensations synchronously in the same thread as the forward operation. If the compensation blocks, your whole saga coordinator stalls. Always use an async message or a separate thread pool with a timeout.
Production Insight
In an inventory saga, a compensation that releases stock ran twice due to a retry.
The inventory count became artificially high, causing overselling later.
Rule: Compensations must be idempotent AND have a guard (e.g., a status flag).
Key Takeaway
A compensation is a new transaction, not a rollback.
Always idempotent with a unique idempotency key.
Design for compensation latency and failure — include retry and escalation.

Idempotency and Ordering: The Two Hardest Problems in Sagas

In distributed sagas, the same message can be delivered more than once. If your forward step is not idempotent, duplicate deliveries cause double charges, duplicate reservations, or duplicate shipments. The fix is an idempotency key: a unique string (for example saga ID + step name) that stays identical on every retry of that step, so the receiver can deduplicate against it.

The receiver stores processed keys with a TTL longer than the maximum retry window. If a duplicate arrives, return the stored response. The TTL must cover the entire saga lifetime plus the retry margin. A 5-minute TTL when retries last 20 minutes is a time bomb.

Ordering is equally tricky. In a choreography saga, events can arrive out of order if the message broker partitions or reorders messages. A 'shipped' event might arrive before 'payment confirmed'. Your service must handle that gracefully — either reject out-of-order events or buffer and reorder them using sequence numbers embedded in the event payload.
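
The "buffer and reorder using sequence numbers" option can be sketched as a small resequencer. EventResequencer is a hypothetical name, events are plain strings for brevity, and the sketch assumes sequence numbers are contiguous per saga.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class EventResequencer {
    private final TreeMap<Long, String> buffered = new TreeMap<>();
    private long nextExpected;

    EventResequencer(long firstSequence) {
        this.nextExpected = firstSequence;
    }

    // Accepts one event and returns the events that are now deliverable, in order (possibly none).
    synchronized List<String> accept(long sequence, String event) {
        List<String> ready = new ArrayList<>();
        if (sequence < nextExpected) {
            return ready; // already delivered: treat as a duplicate and drop it
        }
        buffered.putIfAbsent(sequence, event);
        // Drain the buffer for as long as the next expected sequence number is present.
        while (!buffered.isEmpty() && buffered.firstKey() == nextExpected) {
            ready.add(buffered.pollFirstEntry().getValue());
            nextExpected++;
        }
        return ready;
    }
}
```

A production version would also bound the buffer and time out on gaps, since a sequence number that never arrives would otherwise hold back every later event.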

In an orchestration saga, the coordinator enforces ordering. But what if the coordinator sends two commands concurrently for the same saga? Use a versioned saga state — reject any command carrying a stale version. This prevents the lost update problem.
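
One way to picture versioned saga state is optimistic concurrency: every command carries the version it was based on, and a stale version is rejected. The sketch below is an assumption-laden illustration (VersionedSaga, Snapshot and apply are made-up names, and state lives in memory rather than in a database row).

```java
import java.util.concurrent.atomic.AtomicReference;

final class VersionedSaga {
    // Immutable snapshot: the current step plus a monotonically increasing version.
    static final class Snapshot {
        final long version;
        final String step;

        Snapshot(long version, String step) {
            this.version = version;
            this.step = step;
        }
    }

    private final AtomicReference<Snapshot> state =
            new AtomicReference<>(new Snapshot(0, "PENDING"));

    Snapshot read() {
        return state.get();
    }

    // Applies the transition only if the caller saw the latest version; otherwise rejects it.
    boolean apply(long expectedVersion, String newStep) {
        Snapshot current = state.get();
        if (current.version != expectedVersion) {
            return false; // stale command: someone else already advanced the saga
        }
        return state.compareAndSet(current, new Snapshot(expectedVersion + 1, newStep));
    }
}
```

With a relational store the same idea is usually an update that matches on both the saga ID and the expected version, treating "zero rows updated" as the rejection.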

io/thecodeforge/saga/idempotency/IdempotencyFilter.java · Java
package io.thecodeforge.saga.idempotency;

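import java.time.Duration;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

public class IdempotencyFilter {

    private final ConcurrentHashMap<String, IdempotencyRecord> store = new ConcurrentHashMap<>();
    // Intended retention for processed keys; this in-memory sketch does not evict on it yet
    private final Duration ttl = Duration.ofHours(24);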

    /**
     * Returns true if this key has been processed before.
     * If not, marks it as processing and returns false.
     */
    public boolean isDuplicate(String idempotencyKey) {
        IdempotencyRecord record = store.computeIfAbsent(idempotencyKey, key -> new IdempotencyRecord());
        return !record.tryAcquire();
    }

    public void markProcessed(String idempotencyKey, Object response) {
        store.computeIfPresent(idempotencyKey, (key, record) -> {
            record.complete(response);
            return record;
        });
    }

    private static class IdempotencyRecord {
        private AtomicReference<Status> status = new AtomicReference<>(Status.PENDING);
        private Object response;

        boolean tryAcquire() {
            return status.compareAndSet(Status.PENDING, Status.PROCESSING);
        }

        void complete(Object response) {
            this.response = response;
            status.set(Status.COMPLETED);
        }
    }

    private enum Status { PENDING, PROCESSING, COMPLETED }
}
Idempotency Key Design
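Use a combination of saga ID and step name, and reuse the exact same key on every retry. Example: "saga_abc_step_payment". If the key changes per retry (say, by appending a retry counter), each retry looks like a new request and deduplication never triggers. Add a sequence number only when the same step legitimately runs more than once within one saga.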
Production Insight
A DNS failure caused a saga coordinator to retry the same payment request 12 times.
The idempotency key store in Redis had a TTL of 5 minutes, but the total retry window was 20 minutes.
After the key expired, a later retry was treated as new, triggering a second payment.
Rule: Idempotency key TTL must exceed the longest possible retry chain.
Key Takeaway
Use unique idempotency keys per step.
Handle out-of-order events with sequence numbers.
Idempotency key TTL must cover the entire saga lifetime plus retry margin.
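
To make the TTL rule concrete, here is a small sketch of a dedup store that returns the stored response for a repeated key and forgets it after the TTL. TtlDedupStore is a hypothetical name, the map stands in for something like Redis, and the clock is injected so expiry can be demonstrated without waiting.

```java
import java.time.Duration;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

final class TtlDedupStore {
    private static final class Entry {
        final String response;
        final long expiresAtMs;

        Entry(String response, long expiresAtMs) {
            this.response = response;
            this.expiresAtMs = expiresAtMs;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final long ttlMs;
    private final LongSupplier clockMs; // current time in ms; injected for testability

    TtlDedupStore(Duration ttl, LongSupplier clockMs) {
        this.ttlMs = ttl.toMillis();
        this.clockMs = clockMs;
    }

    // Returns the stored response if the key was processed and its TTL has not expired.
    Optional<String> lookup(String key) {
        Entry e = entries.get(key);
        if (e == null) {
            return Optional.empty();
        }
        if (clockMs.getAsLong() >= e.expiresAtMs) {
            entries.remove(key, e); // expired: from here on a retry looks like a new request
            return Optional.empty();
        }
        return Optional.of(e.response);
    }

    void remember(String key, String response) {
        entries.put(key, new Entry(response, clockMs.getAsLong() + ttlMs));
    }
}
```

With a 5-minute TTL, a retry arriving at minute 20 finds nothing and would be processed again, which is the failure described in the Production Insight above.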

Production Pitfalls: What Actually Breaks at Scale

Running sagas in production uncovers failure modes that unit tests never simulate. Here are the most common ones:

  1. Partial compensation: compensating step A succeeds, step B fails. Now you have a partial undo. The only safe path: keep retrying B with exponential backoff, then escalate after exhaustion. You need a manual intervention playbook.
  2. Timeout-induced double execution: A step times out, compensation starts, but the original step actually completed slowly. The compensation then undoes a completed step. The fix: use a state machine with explicit 'compensating' state that rejects forward completions.
  3. Dependency order in compensations: If you must release inventory before refunding payment, compensations must run in the reverse of the forward order. In orchestration, this is built in. In choreography, you need to enforce event sequencing, which is non-trivial.
  4. Monitoring blind spots: A saga that fails silently because nobody monitors pending sagas. Each saga should emit metrics: active count, completed count, failed count by step. Alert on active sagas older than X minutes — this catches stuck sagas before customers notice.
  5. Database sagas: When your saga spans database transactions (e.g., an order database and a payment database), you need to handle the case where the first DB commits but the second fails. Use the transactional outbox pattern to ensure reliable message delivery.
io/thecodeforge/saga/monitoring/SagaMetrics.java · Java
package io.thecodeforge.saga.monitoring;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;

public class SagaMetrics {

    private final Counter sagaStarted;
    private final Counter sagaCompleted;
    private final Counter sagaFailed;
    private final Gauge activeSagas;

    public SagaMetrics(MeterRegistry registry) {
        sagaStarted = registry.counter("saga.started");
        sagaCompleted = registry.counter("saga.completed");
        sagaFailed = registry.counter("saga.failed");
        activeSagas = Gauge.builder("saga.active", () -> getActiveSagaCount()).register(registry);
    }

    public void onSagaStart() { sagaStarted.increment(); }
    public void onSagaComplete() { sagaCompleted.increment(); }
    public void onSagaFail() { sagaFailed.increment(); }

    private long getActiveSagaCount() {
        // Query saga state store
        return 0; // placeholder
    }
}
Mental Model: Sagas Are Finite State Machines
  • Persist state transitions in a database or event log.
  • Use a lock on the saga row to prevent concurrent state updates.
  • The coordinator is just a state machine interpreter.
  • With state persisted, you can restart the coordinator and replay incomplete sagas.
Production Insight
A team forgot to alert on sagas stuck in 'pending' state for >30 minutes.
A payment gateway outage left 15,000 orders in limbo.
Customers called support, but support had no visibility.
Rule: Monitor saga health like you monitor your database — count, age, and failure rate.
Key Takeaway
Partial compensation is the hardest failure to handle — retry with escalation.
Use a state machine with explicit transitions.
Monitor saga health: active count, age, step failure rate.
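
The "alert on active sagas older than X minutes" check can be sketched as a pure function over saga start times. StuckSagaDetector is a hypothetical helper; where the start times come from (the saga state store) and how the alert is raised are left out.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class StuckSagaDetector {
    private final Duration maxAge;

    StuckSagaDetector(Duration maxAge) {
        this.maxAge = maxAge;
    }

    // activeSagaStartTimes maps saga ID to the start time of each saga that is still active.
    // Returns the IDs of sagas that have been active for longer than maxAge.
    List<String> findStuck(Map<String, Instant> activeSagaStartTimes, Instant now) {
        List<String> stuck = new ArrayList<>();
        for (Map.Entry<String, Instant> e : activeSagaStartTimes.entrySet()) {
            if (Duration.between(e.getValue(), now).compareTo(maxAge) > 0) {
                stuck.add(e.getKey());
            }
        }
        return stuck;
    }
}
```

Run on a schedule, the size of the returned list is a natural value to alert on, and the IDs give support something concrete to look up.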
Production incident · Post-mortem · severity: high

Compensation Race: When Rollback Clashes with a Slow Retry

Symptom
Customers are double-charged, or orders show as 'cancelled' while inventory is still deducted. No single error in logs, just inconsistent state.
Assumption
Once a compensation is triggered, no new forward actions will succeed for that saga instance.
Root cause
The payment step originally succeeded but the notification to the saga coordinator was delayed. The saga timed out and started compensating. Meanwhile, the payment service retried the notification, delivering a success after the compensation had completed. Both sides assumed they were the final word.
Fix
Add a status flag per saga instance in a shared database (e.g., 'compensating'). Reject any success notification after the flag is set. Use Compare-and-Swap to set the flag atomically.
Key lesson
  • Never trust message ordering. Use a persistent saga state that all participants read before acting.
  • Idempotency alone isn't enough for sagas — you need versioned state to reject late arrivals.
  • Always test with network delays and message duplication in chaos experiments.
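
The atomic-flag fix can be checked with a small concurrency test that races the timeout path against the late success notification. This is a sketch under assumptions: CompensationRaceCheck is a made-up harness, and an in-memory AtomicReference stands in for the shared database flag.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

final class CompensationRaceCheck {
    private enum Flag { PROCESSING, COMPENSATING, COMPLETED }

    // Races a timeout-triggered compensation against a late success notification on one
    // shared flag, `rounds` times. Returns the number of rounds with exactly one winner.
    static int roundsWithSingleWinner(int rounds) {
        int single = 0;
        for (int i = 0; i < rounds; i++) {
            AtomicReference<Flag> flag = new AtomicReference<>(Flag.PROCESSING);
            AtomicInteger winners = new AtomicInteger();
            CountDownLatch go = new CountDownLatch(1);

            Thread timeout = new Thread(() -> {
                await(go);
                if (flag.compareAndSet(Flag.PROCESSING, Flag.COMPENSATING)) {
                    winners.incrementAndGet();
                }
            });
            Thread lateSuccess = new Thread(() -> {
                await(go);
                if (flag.compareAndSet(Flag.PROCESSING, Flag.COMPLETED)) {
                    winners.incrementAndGet();
                }
            });

            timeout.start();
            lateSuccess.start();
            go.countDown(); // release both threads at once to maximise contention
            join(timeout);
            join(lateSuccess);

            if (winners.get() == 1) {
                single++;
            }
        }
        return single;
    }

    private static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static void join(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because both transitions start from PROCESSING and go through compare-and-set, exactly one side can win each round; replacing the compare-and-set with a plain read-then-write is a quick way to see the race reappear.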
Production debug guide
Symptom → root cause → fix for common saga issues
Symptom · 01
Order stuck in 'pending' state; no subsequent steps fire
Fix
Check if the orchestration engine or event bus is healthy. Look for missing consumers or dead-letter queues. Verify timeout configuration on the saga coordinator.
Symptom · 02
Compensation runs but order ends up in inconsistent state (e.g., payment refunded but order marked 'completed')
Fix
Inspect the saga state machine: are compensating steps correctly ordered? Did any forward step succeed after compensation started? Enable per-step hooks to log transitions.
Symptom · 03
Duplicate compensation executed (customer refunded twice)
Fix
Check idempotency keys: is the compensation service using a unique key per saga instance? Are retries replaying the same key? Add a store of processed keys with TTL.
Saga Debugging Cheat Sheet
Three symptoms that hit every production saga system. One command, one config check, one rule.
No forward progress after step N
Immediate action
Check dead-letter queue and timeout logs
Commands
curl http://orchestrator/saga/{id} | jq .state
kubectl logs -l app=saga-coordinator --tail=100 | grep 'timeout'
Fix now
Increase saga timeouts or add retry backoff. Ensure all downstream services accept idempotency keys.
Inconsistent order state (payment ok but ship failed)
Immediate action
Verify compensation actually ran on the failed step
Commands
SELECT saga_state FROM order_saga WHERE order_id = X
curl http://shipping/saga/{id}/compensation-status
Fix now
Add a compensating transaction for the shipping step (e.g., cancel shipment request). Ensure all steps have compensation defined.
Duplicate compensation or double charge
Immediate action
Check idempotency key store for duplicates
Commands
redis-cli GET saga:idempotency:{order_id}:{step}
curl http://compensation-service/status?key=X
Fix now
Use a unique idempotency key per step action. Set the TTL longer than the saga lifetime plus the full retry window. For a key that was already processed, return the stored result instead of executing the step again.
Saga Pattern vs 2PC vs Eventual Consistency
Concept | Coordination | Consistency Model | Failure Handling | Use Case Example
Saga Pattern | Choreography or Orchestration | Eventual consistency | Compensating transactions | Order fulfillment pipeline
2PC (Two-Phase Commit) | Centralized coordinator | Strong consistency (ACID) | Rollback via coordinator, but can block | Tightly coupled financial settlement
Eventual Consistency without Sagas | No explicit coordination | Eventual (no compensations) | Manual intervention or TTL-based cleanup | User profile updates, non-critical data

Key takeaways

  1. Saga pattern splits a distributed transaction into local steps with compensating actions.
  2. Choreography uses events for coordination; orchestration uses a central controller.
  3. Compensations must be idempotent, commutative, and designed for failure (retry + escalate).
  4. Always use idempotency keys and persistent saga state to handle retries and crashes.
  5. Monitor saga health: active count, age, step failure rate. Treat sagas as critical infrastructure.

Common mistakes to avoid

Not designing compensations for failure

Symptom
A compensation throws an exception, leaving the system partially compensated and inconsistent.
Fix
Treat compensations as first-class transactions: retry with backoff, use dead-letter queues, and escalate if all retries fail.

Assuming exactly-once message delivery

Symptom
Duplicate forward steps lead to double charges or duplicate reservations.
Fix
Assume at-least-once delivery and implement idempotency with uniqueness constraints on each step.

Ignoring timeout and partial completion edge cases

Symptom
A step times out, compensation starts, but the original step eventually completes, causing both a forward and a compensation to apply.
Fix
Use a state machine with explicit states and a 'compensating' flag. Reject any forward completion after the flag is set.

Using a synchronous coordination model without async boundaries

Symptom
The saga coordinator blocks while waiting for a slow downstream service, causing thread pool exhaustion.
Fix
Use async messaging or reactive client for all saga steps. Each step should be non-blocking and run in its own thread or event loop.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 (Senior): What is a Saga pattern, and how does it differ from a distributed transa...
Q02 (Senior): Compare choreography and orchestration. When would you pick one over the...
Q03 (Senior): How do you handle idempotency in a saga? Give a concrete example.
Q04 (Senior): What happens if a compensation fails? How do you recover?
Q05 (Senior): How would you handle a saga that involves both database writes and exter...
Q01 of 05 · Senior

What is a Saga pattern, and how does it differ from a distributed transaction (2PC)?

ANSWER
A saga is a sequence of local transactions where each step has a compensating action to undo it. Unlike two-phase commit (2PC), which holds locks across resources and requires a coordinator to reach consensus, sagas commit each step immediately and rely on compensations for rollback. This makes sagas more available and scalable (no global locks) but introduces eventual consistency. You use sagas when you can tolerate a window of inconsistency and cannot afford the coordination cost of 2PC.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01. What is Saga Pattern in simple terms?
02. What's the difference between a saga and a distributed transaction (2PC)?
03. Do I always need an orchestrator for my saga?
04. Can I use sagas for long-running workflows that take days?