Advanced 5 min · March 05, 2026

Saga Pattern - Compensation Race with Slow Retry

Q: What is Saga Pattern in simple terms?

A saga is a sequence of local transactions, one per service, where every step has a compensating action that undoes it if a later step fails. Think of it as a chain of steps that each know their own 'undo' move.

Q: What's the difference between a saga and a distributed transaction (2PC)?

A saga commits each step immediately and uses compensations to undo on failure. 2PC holds locks and coordinates all participants before committing. Sagas are more available and scalable but provide eventual consistency; 2PC provides strong consistency but is less available and can block under failures.

Q: Do I always need an orchestrator for my saga?

No. If your saga is simple (2-3 services) and you have good event tracing, choreography works. For more complex flows with retries, timeouts, and audit requirements, orchestration is strongly recommended.

Q: Can I use sagas for long-running workflows that take days?

Yes, but you must design for that latency. Compensations may need to be asynchronous. The saga state must be persisted durably. You'll need a background recovery process to handle sagas that span days — the coordinator should be stateless and able to recover from the state store.

Payment notification delay causes saga compensation to run while retry succeeds, double-charging customers.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Lessons pulled from things that broke in production.

✓ Production

production tested

July 19, 2026

last updated

2,466

articles · all by Naren

Before you start · ⏱ 30 min

✓ Deep production experience
✓ Understanding of internals and trade-offs
✓ Experience debugging complex systems


⚡Quick Answer

Saga pattern manages distributed transactions across services without distributed locks
Two flavors: choreography (events, no central coordinator) and orchestration (central coordinator sends commands)
Compensation transactions undo each step on failure — must be idempotent
Performance: choreography avoids the extra coordinator hop on every step, but debugging is harder
Production pitfall: a slow forward step or retry can succeed while rollback is already running; you need idempotency keys plus a guarded saga state that rejects the late arrival
Biggest mistake: thinking exactly-once delivery is possible; design for at-least-once with dedup

✦ Definition · ~90s read

What is Saga Pattern?

The Saga pattern is a sequence of local transactions where each step has a compensating action. Unlike a distributed transaction (2PC), the saga commits each step immediately. If a later step fails, earlier steps must be undone via compensations. This trades strong consistency for availability — your data may be temporarily inconsistent, but the system stays up.

Why does this matter? Because in production, network partitions happen. Databases go down. Holding a lock across three services for 800ms doesn't just slow things down — it causes cascading timeouts. Sagas let each service commit locally and move on. The inconsistency window is usually seconds, not minutes.

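Common misconception: sagas are asynchronous by definition. They are not. You can run one synchronously with a coordinator that calls each service in turn, but that ties the saga's availability to every participant being up at the same moment. Production sagas usually rely on async messaging or event-driven coordination to decouple services.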

Plain-English First

Imagine you're booking a holiday online — the website books your flight, then your hotel, then your rental car, all in one go. If the car rental fails, it doesn't just stop there and leave you stranded; it cancels the hotel and then cancels the flight, working backwards to undo everything cleanly. That's a Saga: a sequence of steps where each step knows its own 'undo' move. It's how big distributed systems keep your data consistent without locking everything up at once.

Modern distributed systems have quietly broken one of the oldest guarantees in computing: the database transaction. When your order touches a Payment Service, an Inventory Service, and a Shipping Service — each with its own database — you can't just wrap them in a single BEGIN/COMMIT block. The network doesn't care about your ACID properties, and holding distributed locks across three services for 800ms in production is a reliability nightmare waiting to happen.

The Saga pattern is the pragmatic answer to this problem. Instead of one giant atomic transaction, a Saga breaks a business operation into a sequence of local transactions, each immediately committed to its own service's database. If anything goes wrong midway, a series of compensating transactions — purpose-built undo operations — roll the system back to a semantically consistent state. It's eventual consistency with a safety net.

By the end of this article you'll understand exactly how Sagas work at the implementation level, when to pick choreography over orchestration (and why getting this wrong is expensive), how to handle the genuinely hard edge cases like idempotency and concurrent compensations, and what production systems like Uber and Netflix actually wrestle with when running Sagas at scale. Expect real, runnable Java code, concrete failure scenarios, and the kind of detail that separates a confident system design interview answer from a vague one.

What is Saga Pattern?

io/thecodeforge/saga/OrderSagaOrchestrator.java · JAVA

package io.thecodeforge.saga;

import java.util.UUID;

public class OrderSagaOrchestrator {

    private final PaymentClient paymentClient;
    private final InventoryClient inventoryClient;
    private final ShippingClient shippingClient;

    public OrderSagaOrchestrator(PaymentClient payment, InventoryClient inventory, ShippingClient shipping) {
        this.paymentClient = payment;
        this.inventoryClient = inventory;
        this.shippingClient = shipping;
    }

    public void executeOrderSaga(UUID orderId, OrderDetails details) {
        String sagaId = UUID.randomUUID().toString();
        SagaContext ctx = new SagaContext(sagaId, orderId);

        try {
            // Step 1: Reserve Payment
            paymentClient.reserve(ctx, details.totalAmount());
            ctx.stepCompleted("payment");

            // Step 2: Reserve Inventory
            inventoryClient.reserve(ctx, details.items());
            ctx.stepCompleted("inventory");

            // Step 3: Create Shipment
            shippingClient.schedule(ctx, details.shippingAddress());
            ctx.stepCompleted("shipping");

            // Confirm — all participants commit their reservation
            ctx.commitAll();

        } catch (Exception e) {
            // Compensate in reverse order
            if (ctx.isStepCompleted("shipping")) {
                shippingClient.cancel(ctx);
            }
            if (ctx.isStepCompleted("inventory")) {
                inventoryClient.release(ctx);
            }
            if (ctx.isStepCompleted("payment")) {
                paymentClient.refund(ctx);
            }
            ctx.markFailed(e);
            throw new SagaExecutionException("Order saga failed for " + orderId, e);
        }
    }
}

🔥 Real Talk: ACID vs BASE

Sagas embrace BASE (Basically Available, Soft state, Eventually consistent). Don't try to force strong consistency into a saga — you'll end up with a slow, fragile 2PC. Accept the inconsistency window and design for it.

📊 Production Insight

At a large fintech, a saga without idempotency keys caused a double withdrawal of $2M.

The compensation ran twice because the message broker delivered the same event after a network partition healed.

Rule: Idempotency is not optional in sagas — it's the safety net.

🎯 Key Takeaway

A saga is a sequence of local transactions with compensating actions.

It trades strong consistency for availability and scalability.

Always design compensations to be idempotent and commutative.

thecodeforge.io

Saga Pattern

Choreography vs Orchestration: Choosing Your Coordination Model

Two coordination patterns dominate saga implementations: choreography and orchestration. Choreography uses async events — each service publishes an event after its local transaction, and the next service subscribes. There's no central brain. Orchestration uses a coordinator that tells each service what to do next, like a conductor.

Choreography feels simpler at first. No single point of failure, no coordinator to crash. But the hidden cost is observability. To understand why a saga failed, you must correlate logs across every participant. Orchestration centralizes flow logic, making monitoring and retries straightforward. The trade-off? An extra network hop per step and a potential SPOF — unless you persist the orchestrator's state.

In production, most teams start with choreography for low-risk flows (notification chains) and switch to orchestration for payment pipelines where every step must be auditable. I've seen a team waste three weeks debugging a choreography saga that only had four services. After moving to orchestration, the same bug took two hours to fix.

io/thecodeforge/saga/ChoreographyBasedSaga.java · JAVA

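package io.thecodeforge.saga;

// Assumes Spring's event listener annotation; swap in your own event bus if different.
import org.springframework.context.event.EventListener;

// Example: choreography via event publishing.
// PaymentService, InventoryService, ShippingService and the event types are
// application classes assumed to exist elsewhere in the project.
public class ChoreographyBasedSaga {

    private final PaymentService paymentService;
    private final InventoryService inventoryService;
    private final ShippingService shippingService;

    public ChoreographyBasedSaga(PaymentService paymentService,
                                 InventoryService inventoryService,
                                 ShippingService shippingService) {
        this.paymentService = paymentService;
        this.inventoryService = inventoryService;
        this.shippingService = shippingService;
    }

    @EventListener
    public void onOrderCreated(OrderCreatedEvent event) {
        paymentService.reservePayment(event.orderId(), event.amount());
        // payment will publish PaymentReservedEvent
    }

    @EventListener
    public void onPaymentReserved(PaymentReservedEvent event) {
        inventoryService.reserveInventory(event.orderId(), event.items());
        // inventory publishes InventoryReservedEvent
    }

    @EventListener
    public void onInventoryReserved(InventoryReservedEvent event) {
        shippingService.scheduleShipment(event.orderId(), event.address());
        // shipping publishes ShipmentScheduledEvent
    }

    @EventListener
    public void onPaymentFailed(PaymentFailedEvent event) {
        // Payment is the first step, so there is no inventory to release yet.
        // A failure at a later step would need compensation for every earlier step.
        // This is where choreography gets messy: you must know which steps succeeded.
    }
}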

Mental Model: Dance vs. Puppet Show

Choreography is a dance where each dancer reacts to the last move. Orchestration is a puppet show with a single puppeteer pulling strings.

Choreography: Decentralized, resilient, but hard to trace.
Orchestration: Centralized flow control, easy to monitor, but a single point of failure unless its state is persisted.
The coordinator process itself should be stateless, with saga state recovered from a durable store (a database or an event log).
For critical money flows, always use orchestration — you'll thank yourself during an audit.

📊 Production Insight

An orchestrator that crashes mid-saga leaves the system in an unknown state.

Persist saga state in a database before each step, then poll for incomplete sagas on restart.

Rule: Always store saga state before executing any external call.
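
The rule above can be sketched as follows. This is a minimal illustration, assuming an in-memory map stands in for the saga table; `SagaStore`, `SagaRecord`, and the step names are hypothetical and not part of any framework.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical saga row: which step is about to run, and whether the saga finished.
record SagaRecord(String sagaId, String currentStep, boolean finished) {}

// Stand-in for a durable saga table; a real store would be a database.
class SagaStore {
    private final Map<String, SagaRecord> rows = new ConcurrentHashMap<>();

    // Called BEFORE the external call for this step is made.
    void recordStepStarted(String sagaId, String step) {
        rows.put(sagaId, new SagaRecord(sagaId, step, false));
    }

    void recordFinished(String sagaId) {
        rows.computeIfPresent(sagaId, (id, r) -> new SagaRecord(id, r.currentStep(), true));
    }

    // Called on coordinator restart: every unfinished saga must be resumed.
    List<SagaRecord> findIncomplete() {
        List<SagaRecord> result = new ArrayList<>();
        for (SagaRecord r : rows.values()) {
            if (!r.finished()) {
                result.add(r);
            }
        }
        return result;
    }
}

class PersistBeforeCallDemo {
    public static void main(String[] args) {
        SagaStore store = new SagaStore();
        store.recordStepStarted("saga-1", "payment");
        store.recordFinished("saga-1");
        store.recordStepStarted("saga-2", "inventory"); // coordinator "crashes" here
        // After restart only saga-2 is picked up, and the row says which step to re-drive.
        for (SagaRecord r : store.findIncomplete()) {
            System.out.println(r.sagaId() + " resumes at " + r.currentStep());
        }
    }
}
```

Because the row is written before the call, a resumed step may already have executed once, which is why the re-driven call has to carry the same idempotency key.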

🎯 Key Takeaway

Choreography for simple, low-risk flows.

Orchestration for anything financial or auditable.

Persist saga state — a crash without state is data loss.

Compensation Transactions: Designing the Undo Button That Actually Works

A compensation transaction is the logical inverse of a forward step. It's a new transaction that reverses the effect — not a database rollback. For example, if a payment deducted $10, the compensation is a refund of $10. That refund is a separate call with its own side effects.

Designing compensations requires care. They must be idempotent (running twice is safe), commutative with their forward step (the final state is the same even if the compensation arrives before a delayed forward request), and retryable until they succeed (a compensation must not fail permanently). In production, compensating actions do fail: payment gateways go down, inventory systems are slow. Your saga must handle that: retry with exponential backoff, but cap retries. After exhaustion, escalate to a manual intervention path.
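
As a rough sketch of "retry with exponential backoff, but cap retries, then escalate": the helper below is hypothetical (`CompensationRetrier` and its parameters are illustrative names), and the sleep function is injected so the logic can be exercised without real waiting.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;

// Hypothetical helper: retries a compensation with capped exponential backoff.
class CompensationRetrier {
    private final int maxAttempts;
    private final long baseDelayMs;
    private final long maxDelayMs;
    private final LongConsumer sleeper; // injected so tests need not really sleep

    CompensationRetrier(int maxAttempts, long baseDelayMs, long maxDelayMs, LongConsumer sleeper) {
        this.maxAttempts = maxAttempts;
        this.baseDelayMs = baseDelayMs;
        this.maxDelayMs = maxDelayMs;
        this.sleeper = sleeper;
    }

    // Returns true if the compensation succeeded; false means "escalate to a human".
    boolean run(BooleanSupplier compensation) {
        long delay = baseDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (compensation.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                sleeper.accept(delay);
                delay = Math.min(delay * 2, maxDelayMs); // double, but never past the cap
            }
        }
        return false;
    }
}
```

With `maxAttempts = 4`, `baseDelayMs = 100`, and `maxDelayMs = 250`, a compensation that never succeeds waits 100, 200, and 250 ms between attempts and then returns `false`, at which point the saga should be flagged for manual handling rather than retried forever.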

A common trap: compensating a payment that was never actually charged. If the payment step timed out but later completes, you now have a compensation fighting a forward operation. The fix is a saga state machine with explicit phases: PENDING, PROCESSING, COMPENSATING, COMPLETED. Reject any forward completion when in COMPENSATING.

Latency mismatch is another real problem. A refund might take 24 hours in banking. Your saga timeout must account for that, or the refund should be split out into an async step that doesn't block the main flow.

io/thecodeforge/saga/compensation/PaymentCompensation.java · JAVA

package io.thecodeforge.saga.compensation;

import io.thecodeforge.saga.SagaContext;
import java.util.UUID;

public class PaymentCompensation {

    private final PaymentGatewayClient client;

    public PaymentCompensation(PaymentGatewayClient client) {
        this.client = client;
    }

    /**
     * Refund the full amount. Idempotent: same refundId always results in one refund.
     */
    public RefundResult refund(SagaContext ctx, UUID refundId, Money amount) {
        // Use refundId as idempotency key
        RefundResult result = client.submitRefund(refundId, amount);
        if (!result.isSuccess()) {
            // Signal the failure to the caller, which is expected to retry with
            // exponential backoff using the SAME refundId
            throw new CompensatingActionFailedException(ctx.sagaId(), refundId, result.errorMessage());
        }
        return result;
    }
}

⚠ Common Trap: Synchronous Compensation

Don't run compensations synchronously in the same thread as the forward operation. If the compensation blocks, your whole saga coordinator stalls. Always use an async message or a separate thread pool with a timeout.

📊 Production Insight

In an inventory saga, a compensation that releases stock ran twice due to a retry.

The inventory count became artificially high, causing overselling later.

Rule: Compensations must be idempotent AND have a guard (e.g., a status flag).
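
One way to read "idempotent AND guarded" is sketched below. It assumes each reservation carries a released flag and that stock is returned only by the call that flips it; `GuardedInventory` and `Reservation` are hypothetical names, and the sketch deliberately skips stock-availability checks.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical reservation row: quantity plus the status flag used as the guard.
record Reservation(int quantity, boolean released) {}

class GuardedInventory {
    private final AtomicInteger available;
    private final Map<String, Reservation> reservations = new ConcurrentHashMap<>();

    GuardedInventory(int initialStock) {
        this.available = new AtomicInteger(initialStock);
    }

    // Forward step: a repeated reserve with the same id is ignored.
    void reserve(String reservationId, int quantity) {
        if (reservations.putIfAbsent(reservationId, new Reservation(quantity, false)) == null) {
            available.addAndGet(-quantity);
        }
    }

    // Compensation: stock is returned only by the call that flips the flag.
    boolean release(String reservationId) {
        Reservation current = reservations.get(reservationId);
        if (current == null || current.released()) {
            return false; // nothing reserved, or already released
        }
        boolean flipped = reservations.replace(
                reservationId, current, new Reservation(current.quantity(), true));
        if (flipped) {
            available.addAndGet(current.quantity());
        }
        return flipped;
    }

    int available() {
        return available.get();
    }
}
```

A second `release` for the same reservation returns `false` and leaves the count untouched, which is the behaviour the retried compensation in the incident above was missing.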

🎯 Key Takeaway

A compensation is a new transaction, not a rollback.

Always idempotent with a unique idempotency key.

Design for compensation latency and failure — include retry and escalation.

Idempotency and Ordering: The Two Hardest Problems in Sagas

In distributed sagas, the same message can be delivered more than once. If your forward step is not idempotent, duplicate deliveries cause double charges, duplicate reservations, or duplicate shipments. The fix is an idempotency key: a unique string (for example saga ID + step name) that stays the same across every retry of that step and that the receiver deduplicates against.

The receiver stores processed keys with a TTL longer than the maximum retry window. If a duplicate arrives, return the stored response. The TTL must cover the entire saga lifetime plus the retry margin. A 5-minute TTL when retries last 20 minutes is a time bomb.
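
The TTL behaviour described above can be sketched like this. It is an in-memory stand-in for whatever key store you actually use; `TtlIdempotencyStore` is a hypothetical class, and the current time is passed in explicitly so expiry can be demonstrated deterministically.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical dedup store: remembers the response for a key until the TTL elapses.
class TtlIdempotencyStore {
    private record Stored(String response, Instant expiresAt) {}

    private final Map<String, Stored> entries = new ConcurrentHashMap<>();
    private final Duration ttl;

    TtlIdempotencyStore(Duration ttl) {
        this.ttl = ttl;
    }

    void remember(String key, String response, Instant now) {
        entries.put(key, new Stored(response, now.plus(ttl)));
    }

    // Returns the stored response for a duplicate, or empty if the key is unknown or expired.
    Optional<String> lookup(String key, Instant now) {
        Stored stored = entries.get(key);
        if (stored == null) {
            return Optional.empty();
        }
        if (!now.isBefore(stored.expiresAt())) {
            entries.remove(key, stored); // expired: the next request looks brand new
            return Optional.empty();
        }
        return Optional.of(stored.response());
    }
}
```

With a 5-minute TTL, a duplicate arriving 4 minutes after the original gets the stored response back, while a retry arriving at minute 20 finds nothing and would be processed as a new request. That is the "time bomb" case.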

Ordering is equally tricky. In a choreography saga, events can arrive out of order if the message broker partitions or reorders messages. A 'shipped' event might arrive before 'payment confirmed'. Your service must handle that gracefully — either reject out-of-order events or buffer and reorder them using sequence numbers embedded in the event payload.
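
The "buffer and reorder" option might look like the sketch below. It assumes one buffer per saga instance and sequence numbers that start at 1 with no gaps; `SequencedEventBuffer` is a hypothetical name, and events are plain strings to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical per-saga buffer that releases events strictly in sequence order.
class SequencedEventBuffer {
    private final TreeMap<Long, String> pending = new TreeMap<>();
    private long nextExpected = 1;

    // Accepts an event and returns every event that is now deliverable, in order.
    synchronized List<String> accept(long sequence, String event) {
        List<String> ready = new ArrayList<>();
        if (sequence < nextExpected) {
            return ready; // duplicate or stale: already delivered
        }
        pending.putIfAbsent(sequence, event);
        // Drain the buffer for as long as the next expected sequence number is present.
        while (!pending.isEmpty() && pending.firstKey() == nextExpected) {
            ready.add(pending.pollFirstEntry().getValue());
            nextExpected++;
        }
        return ready;
    }
}
```

If 'shipped' (sequence 2) arrives first it is held back; once 'payment confirmed' (sequence 1) arrives, both are released in the right order. A real implementation would also need a timeout for gaps that never fill.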

In an orchestration saga, the coordinator enforces ordering. But what if the coordinator sends two commands concurrently for the same saga? Use a versioned saga state — reject any command carrying a stale version. This prevents the lost update problem.

io/thecodeforge/saga/idempotency/IdempotencyFilter.java · JAVA

package io.thecodeforge.saga.idempotency;

import java.time.Duration;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

public class IdempotencyFilter {

    private final ConcurrentHashMap<String, IdempotencyRecord> store = new ConcurrentHashMap<>();
    private final Duration ttl = Duration.ofHours(24);

    /**
     * Returns true if this key has been processed before.
     * If not, marks it as processing and returns false.
     */
    public boolean isDuplicate(String idempotencyKey) {
        IdempotencyRecord record = store.computeIfAbsent(idempotencyKey, key -> new IdempotencyRecord());
        return !record.tryAcquire();
    }

    public void markProcessed(String idempotencyKey, Object response) {
        store.computeIfPresent(idempotencyKey, (key, record) -> {
            record.complete(response);
            return record;
        });
    }

    private static class IdempotencyRecord {
        private final AtomicReference<Status> status = new AtomicReference<>(Status.PENDING);
        // volatile so the stored response is visible to other threads after completion
        private volatile Object response;

        boolean tryAcquire() {
            return status.compareAndSet(Status.PENDING, Status.PROCESSING);
        }

        void complete(Object response) {
            this.response = response;
            status.set(Status.COMPLETED);
        }
    }

    private enum Status { PENDING, PROCESSING, COMPLETED }
}

💡Idempotency Key Design

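Use a combination of saga ID and step name, and keep the key identical across retries of the same step. Example: "saga_abc_step_payment". Add a sequence number only when the saga legitimately performs the same step more than once (for example a second, separate charge); a per-retry suffix such as "_retry_2" would make every retry look like a new request and defeat deduplication.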

📊 Production Insight

A DNS failure caused a saga coordinator to retry the same payment request 12 times.

The idempotency key store in Redis had a TTL of 5 minutes, but the total retry window was 20 minutes.

After the key expired, a later retry was treated as new, triggering a second payment.

Rule: Idempotency key TTL must exceed the longest possible retry chain.

🎯 Key Takeaway

Use unique idempotency keys per step.

Handle out-of-order events with sequence numbers.

Idempotency key TTL must cover the entire saga lifetime plus retry margin.

Production Pitfalls: What Actually Breaks at Scale

Running sagas in production uncovers failure modes that unit tests never simulate. Here are the most common ones:

Partial compensation: compensating step A succeeds, step B fails. Now you have a partial undo. The only safe path: keep retrying B with exponential backoff, then escalate after exhaustion. You need a manual intervention playbook.
Timeout-induced double execution: A step times out, compensation starts, but the original step actually completed slowly. The compensation then undoes a completed step. The fix: use a state machine with explicit 'compensating' state that rejects forward completions.
Dependency order in compensations: compensations normally run in the reverse of the forward order. If shipping was scheduled last, it is cancelled first, then inventory is released, then payment is refunded. In orchestration, this is built in. In choreography, you need to enforce event sequencing, which is non-trivial.
Monitoring blind spots: A saga that fails silently because nobody monitors pending sagas. Each saga should emit metrics: active count, completed count, failed count by step. Alert on active sagas older than X minutes — this catches stuck sagas before customers notice.
Database sagas: When your saga spans database transactions (e.g., an order database and a payment database), you need to handle the case where the first DB commits but the second fails. Use the transactional outbox pattern: write the state change and the outgoing message in the same local transaction, then publish from the outbox, so the message is never lost.
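
The transactional outbox mentioned in the last item can be sketched as follows. This is an in-memory approximation in which a synchronized method stands in for one local database transaction; `OutboxOrderStore` and the event string format are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical order store with an outbox living in the "same database".
class OutboxOrderStore {
    private final Map<String, String> orders = new LinkedHashMap<>();
    private final List<String> outbox = new ArrayList<>(); // unpublished events, oldest first

    // Stand-in for ONE local transaction: the order row and the outbox row are written together.
    synchronized void createOrder(String orderId) {
        orders.put(orderId, "CREATED");
        outbox.add("OrderCreated:" + orderId);
    }

    // Relay: publishes pending events; an event is removed only after publish returns.
    synchronized int relay(Consumer<String> publisher) {
        int published = 0;
        while (!outbox.isEmpty()) {
            publisher.accept(outbox.get(0)); // may throw; the event then stays in the outbox
            outbox.remove(0);
            published++;
        }
        return published;
    }

    synchronized int pendingEvents() {
        return outbox.size();
    }
}
```

If the broker is down, the event simply stays in the outbox and goes out on a later relay pass. A crash between publishing and removing the row would publish the same event twice, so delivery is at-least-once and consumers still need idempotency keys.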

io/thecodeforge/saga/monitoring/SagaMetrics.java · JAVA

package io.thecodeforge.saga.monitoring;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;

public class SagaMetrics {

    private final Counter sagaStarted;
    private final Counter sagaCompleted;
    private final Counter sagaFailed;
    private final Gauge activeSagas;

    public SagaMetrics(MeterRegistry registry) {
        sagaStarted = registry.counter("saga.started");
        sagaCompleted = registry.counter("saga.completed");
        sagaFailed = registry.counter("saga.failed");
        activeSagas = Gauge.builder("saga.active", () -> getActiveSagaCount()).register(registry);
    }

    public void onSagaStart() { sagaStarted.increment(); }
    public void onSagaComplete() { sagaCompleted.increment(); }
    public void onSagaFail() { sagaFailed.increment(); }

    private long getActiveSagaCount() {
        // Query saga state store
        return 0; // placeholder
    }
}

Mental Model: Sagas Are Finite State Machines

Visualize each saga instance as a state machine with states: STARTED, STEP_N_COMPLETED, COMPENSATING, COMPLETED, FAILED. Every external input transitions the state.

Persist state transitions in a database or event log.
Use a lock on the saga row to prevent concurrent state updates.
The coordinator is just a state machine interpreter.
With state persisted, you can restart the coordinator and replay incomplete sagas.

📊 Production Insight

A team forgot to alert on sagas stuck in 'pending' state for >30 minutes.

A payment gateway outage left 15,000 orders in limbo.

Customers called support, but support had no visibility.

Rule: Monitor saga health like you monitor your database — count, age, and failure rate.
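
The "age" part of that rule is the one most often missing. A minimal sketch of a stuck-saga check follows, assuming the saga store can list active sagas with their start time; `ActiveSaga` and `StuckSagaDetector` are hypothetical names, and the threshold is whatever "X minutes" means for your flow.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical view of a saga that has not reached a terminal state yet.
record ActiveSaga(String sagaId, String state, Instant startedAt) {}

class StuckSagaDetector {
    private final Duration maxAge;

    StuckSagaDetector(Duration maxAge) {
        this.maxAge = maxAge;
    }

    // Returns the ids of active sagas that are strictly older than the allowed age.
    List<String> findStuck(List<ActiveSaga> active, Instant now) {
        return active.stream()
                .filter(s -> Duration.between(s.startedAt(), now).compareTo(maxAge) > 0)
                .map(ActiveSaga::sagaId)
                .collect(Collectors.toList());
    }
}
```

Running this on a schedule and alerting when the result is non-empty would have surfaced the 15,000 pending orders within the first half hour instead of through support calls.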

🎯 Key Takeaway

Partial compensation is the hardest failure to handle — retry with escalation.

Use a state machine with explicit transitions.

Monitor saga health: active count, age, step failure rate.

Two-Phase Commit: The Devil You Should Know Before Sagas

Before you adopt Sagas, understand why you're abandoning Two-Phase Commit (2PC). 2PC works like a wedding vow: everyone agrees first, then commits together. Sounds safe. In production, it's a liability. The coordinator becomes a single point of failure. If it crashes after phase one, participants are left in limbo, holding locks on databases. Long-running transactions amplify this: your entire system blocks while waiting for a slow participant to vote. Network partitions cause split-brain scenarios where some nodes commit and others abort. You get inconsistency anyway, despite the overhead. 2PC sacrifices availability for theoretical consistency. In microservices, availability wins every time. Sagas embrace this reality. They accept eventual consistency and use compensating transactions to clean up failures. You trade strict ACID for resilience. That trade-off is what makes distributed systems actually work at scale. Don't romanticize 2PC. It breaks when you need it most.

TwoPhaseCommitCoordinator.java · JAVA

// io.thecodeforge
import java.util.List;

public class TransactionCoordinator {
    private List<Participant> participants;

    public boolean executeTransaction() {
        // Phase 1: Prepare - ask everyone to vote
        for (Participant p : participants) {
            if (!p.prepare()) {
                // Abort all participants, rollback
                participants.forEach(Participant::abort);
                return false;
            }
        }
        // Phase 2: Commit - everyone voted yes, so tell each participant to commit
        for (Participant p : participants) {
            // If this fails, we're stuck in an inconsistent state
            p.commit();
        }
        return true;
    }
}

Output

TransactionCoordinator stuck waiting for slow participant — cluster degrades → 502 errors → on-call paged at 3 AM

⚠ Production Trap:

2PC locks databases for the entire transaction duration. A slow service holding a lock cascades into timeout hell across your entire system. Sagas avoid this by never holding locks across services.

🎯 Key Takeaway

2PC guarantees consistency at the cost of availability. In microservice architectures, availability usually wins.

Running Sagas in Practice: State Machines and Recovery

A saga isn't just a sequence of calls. It's a state machine. Each step has states: PENDING, SUCCEEDED, FAILED, COMPENSATING, COMPENSATED. Storing this state in a database gives you crash recovery. When a service restarts after failure, it reads the saga state and resumes from where it left off. This is how you survive process crashes, network blips, and data center outages. Without a persisted state machine, you're just hoping compensations run correctly. Hope is not a strategy. Implement a saga log, not as an afterthought, but as core infrastructure. Each step writes its state before executing work. Use an idempotency key on each step so retries are safe. If a compensation fails, the log tells you which steps need manual intervention. Your operations team will thank you when they see a clear recovery path instead of guessing which services are inconsistent. The state machine pattern turns chaos into process.

SagaStateMachine.java · JAVA

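// io.thecodeforge
// Note: in a real project each public type lives in its own file.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

public enum SagaStepState {
    PENDING, SUCCEEDED, FAILED, COMPENSATING, COMPENSATED
}

public class OrderSaga {
    private String sagaId;
    // Insertion-ordered, so the step history reads in execution order
    private final Map<String, SagaStepState> steps = new LinkedHashMap<>();
    // Compensations of succeeded steps, most recent first
    private final Deque<Map.Entry<String, Runnable>> undoStack = new ArrayDeque<>();

    public void executeStep(String stepName, Runnable action, Runnable compensation) {
        steps.put(stepName, SagaStepState.PENDING);
        try {
            action.run();
            steps.put(stepName, SagaStepState.SUCCEEDED);
            undoStack.push(Map.entry(stepName, compensation));
        } catch (Exception e) {
            steps.put(stepName, SagaStepState.FAILED);
            compensate();
        }
    }

    private void compensate() {
        // Roll back succeeded steps in reverse order, each with its OWN compensation
        while (!undoStack.isEmpty()) {
            Map.Entry<String, Runnable> entry = undoStack.pop();
            steps.put(entry.getKey(), SagaStepState.COMPENSATING);
            entry.getValue().run();
            steps.put(entry.getKey(), SagaStepState.COMPENSATED);
        }
    }
}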

Output

Saga state persisted to PostgreSQL → service crash → restart reads saga log → resumes compensation → data consistent

💡Recovery Pattern:

Persist saga state in a separate table per saga type. Use saga_id as primary key. Query all sagas in COMPENSATING state on startup and resume them. This makes crash recovery automatic.

🎯 Key Takeaway

A saga without a persisted state machine is just a fragile script. Persist state transitions for crash recovery.

● Production incident · POST-MORTEM · severity: high

Compensation Race: When Rollback Clashes with a Slow Retry

Symptom

Customers are double-charged, or orders show as 'cancelled' while inventory is still deducted. No single error in logs, just inconsistent state.

Assumption

Once a compensation is triggered, no new forward actions will succeed for that saga instance.

Root cause

The payment step originally succeeded but the notification to the saga coordinator was delayed. The saga timed out and started compensating. Meanwhile, the payment service retried the notification, delivering a success after the compensation had completed. Both sides assumed they were the final word.

Fix

Add a status flag per saga instance in a shared database (e.g., 'compensating'). Reject any success notification after the flag is set. Use Compare-and-Swap to set the flag atomically.

Key lesson

Never trust message ordering. Use a persistent saga state that all participants read before acting.
Idempotency alone isn't enough for sagas — you need versioned state to reject late arrivals.
Always test with network delays and message duplication in chaos experiments.
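
The fix from this post-mortem can be sketched with an atomic status per saga instance. This is a single-process illustration using `AtomicReference`; `SagaInstance` is a hypothetical class, and in a shared database the equivalent would be a conditional update that only succeeds while the row is still in the expected status.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical saga instance whose status can leave PROCESSING exactly once.
class SagaInstance {
    enum Status { PROCESSING, COMPENSATING, COMPLETED }

    private final AtomicReference<Status> status = new AtomicReference<>(Status.PROCESSING);

    // Timeout path: only one caller can move PROCESSING -> COMPENSATING.
    boolean startCompensation() {
        return status.compareAndSet(Status.PROCESSING, Status.COMPENSATING);
    }

    // Success-notification path: accepted only if compensation has not started.
    boolean acceptForwardSuccess() {
        return status.compareAndSet(Status.PROCESSING, Status.COMPLETED);
    }

    Status status() {
        return status.get();
    }
}
```

Whichever path wins the compare-and-swap is the final word: if the timeout fires first, the late payment success is rejected and the compensation must refund the charge; if the success lands first, `startCompensation()` returns `false` and no rollback runs. Either way, only one side believes it decided the outcome.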

Production debug guide: symptom → root cause → fix for common saga issues

Symptom · 01

Order stuck in 'pending' state; no subsequent steps fire

Fix

Check if the orchestration engine or event bus is healthy. Look for missing consumers or dead-letter queues. Verify timeout configuration on the saga coordinator.

Symptom · 02

Compensation runs but order ends up in inconsistent state (e.g., payment refunded but order marked 'completed')

Fix

Inspect the saga state machine: are compensating steps correctly ordered? Did any forward step succeed after compensation started? Enable per-step hooks to log transitions.

Symptom · 03

Duplicate compensation executed (customer refunded twice)

Fix

Check idempotency keys: is the compensation service using a unique key per saga instance? Are retries replaying the same key? Add a store of processed keys with TTL.

★ Saga Debugging Cheat Sheet

Three symptoms that hit every production saga system. One command, one config check, one rule.

No forward progress after step N

Immediate action

Check dead-letter queue and timeout logs

Commands

curl http://orchestrator/saga/{id} | jq .state

kubectl logs -l app=saga-coordinator --tail=100 | grep 'timeout'

Fix now

Increase saga timeouts or add retry backoff. Ensure all downstream services accept idempotency keys.

Inconsistent order state (payment ok but ship failed)

Duplicate compensation or double charge

Saga Pattern vs 2PC vs Eventual Consistency

Concept	Coordination	Consistency Model	Failure Handling	Use Case Example
Saga Pattern	Choreography or Orchestration	Eventual consistency	Compensating transactions	Order fulfillment pipeline
2PC (Two-Phase Commit)	Centralized coordinator	Strong consistency (ACID)	Rollback via coordinator, but can block	Tightly coupled financial settlement
Eventual Consistency without Sagas	No explicit coordination	Eventual (no compensations)	Manual intervention or TTL-based cleanup	User profile updates, non-critical data

⚙ Quick Reference

7 commands from this guide

File	Command / Code	Purpose
io/thecodeforge/saga/OrderSagaOrchestrator.java	public class OrderSagaOrchestrator {	What is Saga Pattern?
io/thecodeforge/saga/ChoreographyBasedSaga.java	public class ChoreographyBasedSaga {	Choreography vs Orchestration
io/thecodeforge/saga/compensation/PaymentCompensation.java	public class PaymentCompensation {	Compensation Transactions
io/thecodeforge/saga/idempotency/IdempotencyFilter.java	public class IdempotencyFilter {	Idempotency and Ordering
io/thecodeforge/saga/monitoring/SagaMetrics.java	public class SagaMetrics {	Production Pitfalls
TwoPhaseCommitCoordinator.java	public class TransactionCoordinator {	Two-Phase Commit
SagaStateMachine.java	public enum SagaStepState {	Running Sagas in Practice

Key takeaways

Saga pattern splits a distributed transaction into local steps with compensating actions.

Choreography uses events for coordination; orchestration uses a central controller.

Compensations must be idempotent, commutative, and designed for failure (retry + escalate).

Always use idempotency keys and persistent saga state to handle retries and crashes.

Monitor saga health: active count, age, step failure rate. Treat sagas as critical infrastructure.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

What is a Saga pattern, and how does it differ from a distributed transa...

Q02 · SENIOR

Compare choreography and orchestration. When would you pick one over the...

Q03 · SENIOR

How do you handle idempotency in a saga? Give a concrete example.

Q04 · SENIOR

What happens if a compensation fails? How do you recover?

Q05 · SENIOR

How would you handle a saga that involves both database writes and exter...

Q01 of 05 · SENIOR

What is a Saga pattern, and how does it differ from a distributed transaction (2PC)?

ANSWER

A saga is a sequence of local transactions where each step has a compensating action to undo it. Unlike two-phase commit (2PC), which holds locks across resources and requires a coordinator to reach consensus, sagas commit each step immediately and rely on compensations for rollback. This makes sagas more available and scalable (no global locks) but introduces eventual consistency. You use sagas when you can tolerate a window of inconsistency and cannot afford the coordination cost of 2PC.
