Mid 8 min · May 23, 2026

Saga Pattern: How I Learned to Stop Worrying and Love Distributed Failure

Saga pattern for distributed transactions in Spring Boot 3.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Sagas break a distributed transaction into a sequence of local transactions with compensating rollbacks.
  • Choreography uses event-driven coordination; orchestration uses a central coordinator service.
  • Never use 2PC across microservices in production — it creates a distributed lock that WILL fail.
  • Compensating actions must be idempotent and handle partial failures gracefully.
  • Test sagas with chaos engineering. Your compensating logic is the first thing that breaks under load.
Definition
What is Saga Pattern?

The Saga Pattern is a failure management pattern for coordinating distributed transactions across microservices. Instead of a monolithic database transaction with ACID guarantees, you break the transaction into a sequence of local transactions. Each step has a compensating action that undoes it. If a step fails, the saga executes the compensating actions for all completed steps — in reverse order.

Sagas come in two flavors: choreography and orchestration. Choreography uses events: Service A publishes an event, Service B listens and does its work, then publishes its own event. Orchestration uses a coordinator service (orchestrator) that tells each service what to do via commands.

Choreography is simpler to set up but harder to debug. Orchestration adds a single point of failure but gives you visibility and control. Choose choreography for simple, linear workflows. Choose orchestration for anything with branching, complex compensation logic, or compliance requirements.

Plain-English First

Imagine booking a flight, hotel, and car rental. If any step fails, you need to cancel the ones that succeeded. A saga is like a travel agent who knows the cancellation policy for each booking and calls to undo them in reverse order when things go wrong.
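To make the travel-agent picture concrete, here is a minimal sketch in plain Java. The names SagaStep and SimpleSaga are made up for illustration, the "services" are just lambdas, and a compensation that itself throws is not handled here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// One saga step: a forward action plus the compensation that undoes it.
record SagaStep(String name, Runnable action, Runnable compensation) {}

final class SimpleSaga {
    // Runs the steps in order. On the first failure it compensates the steps
    // that already completed, newest first, and returns false.
    static boolean run(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.action().run();
                completed.push(step); // newest completed step sits on top
            } catch (RuntimeException e) {
                while (!completed.isEmpty()) {
                    completed.pop().compensation().run();
                }
                return false;
            }
        }
        return true;
    }

    // Travel-agent demo: the car rental fails, so hotel and flight are undone.
    static List<String> demo() {
        List<String> calls = new ArrayList<>();
        run(List.of(
            new SagaStep("flight", () -> calls.add("book flight"), () -> calls.add("cancel flight")),
            new SagaStep("hotel", () -> calls.add("book hotel"), () -> calls.add("cancel hotel")),
            new SagaStep("car", () -> { throw new IllegalStateException("no cars left"); },
                         () -> calls.add("cancel car"))));
        return calls;
    }
}
```

Calling SimpleSaga.demo() returns the calls in the order they happened: both bookings, then the hotel cancellation, then the flight cancellation. The failed car step is never compensated because it never completed.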

You just pushed a new order flow to production. The customer hits 'Place Order' and boom — inventory service decrements stock, payment service charges the card, shipping service creates a label. Then the fraud check fails. Now you have an inventory leak, a charge you can't reverse easily, and a shipping label that costs money to void.

This is distributed transaction hell. Monoliths had it easy: one database, one transaction, one rollback. Microservices? Each service owns its data. You can't just roll back across PostgreSQL, Redis, and a third-party payment API.

I learned this the hard way. 2019, Black Friday. Our payment service went down for 90 seconds. The order service kept accepting orders. When payment came back, it processed 2,000 charges with no inventory to back them. We shipped air. The customer support nightmare was biblical.

Senior devs know this pattern by heart. The Saga Pattern is your escape from distributed transaction misery. It doesn't prevent failures; it handles them gracefully. You compensate, you don't roll back.

This article covers the real production mechanics. The code patterns that survive a holiday rush. The debugging that actually works. The stuff your system design interview prep doesn't teach.

Choreography vs Orchestration: Pick Your Poison

Two paths. Both hurt. Choose wisely.

Choreography is event-driven. Service A publishes an OrderCreated event. Service B consumes it, decrements inventory, then publishes InventoryDecremented. Service C consumes that, processes payment, publishes PaymentProcessed. It's beautiful until it's not. In production, events get lost, arrive out of order, or duplicate. You end up with a distributed spaghetti of event handlers that's impossible to trace.

I've seen a team implement choreography with 8 microservices. A single failed event caused a cascade of 47 compensatory events across 3 queues. The debugging took 3 weeks. The fix was to switch to orchestration.

Orchestration uses a central coordinator. A saga orchestrator is a state machine that tells each service what to do. It knows the current step and the compensating action for each step. This is easier to debug: one place to look, one log stream. The downside? The orchestrator becomes a single point of failure and a throughput bottleneck.

We hit this at TheCodeForge. Our orchestrator handled 200 sagas per second. Not enough for Black Friday. We had to partition sagas by region — each region got its own orchestrator instance. The orchestrator itself must be stateless (persist state in a database) so you can scale horizontally.

Here's my rule of thumb: choreography for simple linear flows (max 3 services) where each step is idempotent and the handlers tolerate duplicate or out-of-order events. Orchestration for anything with compensation logic, branching, or compliance reporting. Orchestration also copes better with complex flows because the coordinator can run independent steps in parallel.

Never mix both in the same saga. You'll get event-ordering nightmares that break compensation.

Production Trap:
Orchestrator state persistence. If your orchestrator crashes and you haven't persisted the saga state, you lose all in-flight sagas. Always persist to a database (PostgreSQL, Cassandra) after every state transition. The saga becomes stale if the DB is down — handle that with a dead-letter queue.
Production Insight
Our orchestrator state machine used Spring Statemachine with JPA persistence. In 2023, a DB connection pool exhaustion caused saga state loss for 12 minutes. We now use a separate connection pool for saga persistence with a lower max connections.
Key Takeaway
Always persist saga state before calling external services. If the external call fails, you can still compensate. If you persist after the call, you risk losing state on a crash.
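
A rough sketch of the persist-first ordering, assuming an in-memory map as a stand-in for the saga table. SagaStateStore and PersistFirstExecutor are hypothetical names, not Spring Statemachine APIs; a real store would commit each write to the database before returning.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BooleanSupplier;

enum StepStatus { PENDING, COMPLETED, FAILED }

// Hypothetical stand-in for a durable saga table.
final class SagaStateStore {
    private final Map<String, StepStatus> rows = new ConcurrentHashMap<>();

    void save(String sagaId, String step, StepStatus status) {
        rows.put(sagaId + ":" + step, status);
    }

    StepStatus load(String sagaId, String step) {
        return rows.get(sagaId + ":" + step);
    }
}

final class PersistFirstExecutor {
    private final SagaStateStore store;

    PersistFirstExecutor(SagaStateStore store) {
        this.store = store;
    }

    // Records PENDING before the remote call, so a crash during the call
    // still leaves a row that a recovery job can find and compensate.
    boolean execute(String sagaId, String step, BooleanSupplier remoteCall) {
        store.save(sagaId, step, StepStatus.PENDING);
        boolean ok;
        try {
            ok = remoteCall.getAsBoolean();
        } catch (RuntimeException e) {
            ok = false; // treat an exception like a failed step
        }
        store.save(sagaId, step, ok ? StepStatus.COMPLETED : StepStatus.FAILED);
        return ok;
    }
}
```

The point of the ordering: while the remote call is in flight, the store already says PENDING. If the process dies at that moment, the row survives and tells you which step might need compensating.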

Compensating Actions: The Art of Un-Doing

This is the part everyone screws up.

Compensating actions are not rollbacks. They are business transactions that undo the effect of a previous transaction. There's a fundamental difference. A rollback is a database command. A compensation is a business operation that might fail, take time, or have side effects.

For example: you decremented inventory by 1. The compensation is to increment inventory by 1. But what if the item was backordered? Or what if another saga already used the freed inventory? You need idempotency and concurrency control.

Here's the rule: every command in a saga must have a corresponding compensating command. The compensating command must be idempotent — calling it twice is safe. This is not optional.

We learned this when a network partition caused a saga to issue two compensations for the same inventory decrement. The inventory shot up by 2. The finance team noticed a $200k discrepancy in 3 weeks.

The fix was to use idempotency keys on all compensating endpoints. The inventory service checks: 'Has this compensation_id been processed before?' If yes, return success without doing anything.
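
The check can be sketched like this. InventoryCompensator is a hypothetical class, and the in-memory set stands in for a compensation log table. In a real service, recording the ID and updating stock would need to happen in the same database transaction.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class InventoryCompensator {
    private final AtomicInteger stock;
    // In-memory stand-in for a compensation log keyed by compensation ID.
    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    InventoryCompensator(int initialStock) {
        this.stock = new AtomicInteger(initialStock);
    }

    // Restores stock once per compensationId. A repeated call with the same
    // ID reports success without changing stock again.
    boolean compensate(String compensationId, int quantity) {
        if (processed.add(compensationId)) { // add returns false for a known ID
            stock.addAndGet(quantity);
        }
        return true;
    }

    int stock() {
        return stock.get();
    }
}
```

With this in place, the double compensation from the network partition story would have raised stock by 1, not 2.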

Another gotcha: temporal compensations. Some operations have a time limit. A flight booking might have a 24-hour cancellation window. After that, you can't compensate — you need a different business process. Handle this in the saga orchestrator: query the 'latest_valid_compensate_time' before issuing the command.
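
One way to sketch that deadline check, assuming the orchestrator already knows when the booking was made and how long the cancellation window is. CompensationWindow and CompensationRoute are illustrative names, and MANUAL_PROCESS stands for whatever fallback business process you have.

```java
import java.time.Duration;
import java.time.Instant;

enum CompensationRoute { COMPENSATE, MANUAL_PROCESS }

final class CompensationWindow {
    // Latest moment at which the automatic compensation is still allowed.
    static Instant deadline(Instant bookedAt, Duration window) {
        return bookedAt.plus(window);
    }

    // Picks the automatic compensation only while the window is open; after
    // the deadline the saga hands off to a separate business process.
    static CompensationRoute route(Instant now, Instant latestValidCompensateTime) {
        return now.isAfter(latestValidCompensateTime)
                ? CompensationRoute.MANUAL_PROCESS
                : CompensationRoute.COMPENSATE;
    }
}
```

Passing the current time in as a parameter keeps the check easy to test without touching the system clock.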

Don't forget async compensations. If a step is async (e.g., 'we'll send you an email in 5 minutes'), the compensation must handle the 'in-flight' scenario. Your saga should have a 'pending' state that waits for async completion or times out.

Senior Shortcut:
Use a single CompensationLog table per service. Store sagaId, compensationId, status, and timestamp. This one table saves you from debugging duplicate compensations and provides an audit trail for compliance.
Production Insight
Compensations that fail silently are the number one cause of data corruption in sagas. Log every compensation attempt, success, and failure. We run this query every hour: "SELECT * FROM compensation_log WHERE processed = false AND created_at < NOW() - INTERVAL '1 hour'". It catches the silent failures.
Key Takeaway
Idempotency is not optional. Temporal compensations are a source of production bugs. Log every compensation attempt — you will need that audit trail.

Saga State Machines: Concrete Spring Boot Implementation

Spring Statemachine with JPA persistence. This is the production pattern I've used for 3 years.

The state machine defines states (STARTED, INVENTORY_PENDING, INVENTORY_COMPLETED, PAYMENT_PENDING, PAYMENT_COMPLETED, COMPENSATING, COMPLETED, FAILED) and transitions (events). Each transition triggers a method that calls an external service.

When you call an external service, you transition the state to 'pending'. Then, on response, you transition to 'completed' or 'failed'. If failed, you transition to 'compensating' and the state machine automatically fires the compensating action for each completed step in reverse order.

Key config: set a timeout on every state. If the external service doesn't respond in 10 seconds, the state machine fires a 'timeout' event that transitions to 'failed' and triggers compensation. This prevents stuck sagas.
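
To show the shape of the transition table without pulling in a framework, here is a plain-Java sketch. It is not the Spring Statemachine API. The state names come from the list above; the event names are my assumptions, and so is the choice to route a failure or timeout through COMPENSATING before landing in FAILED.

```java
import java.util.EnumMap;
import java.util.Map;

enum SagaState { STARTED, INVENTORY_PENDING, INVENTORY_COMPLETED, PAYMENT_PENDING,
                 PAYMENT_COMPLETED, COMPENSATING, COMPLETED, FAILED }

// Event names are assumptions for this sketch.
enum SagaEvent { RESERVE_INVENTORY, CHARGE_PAYMENT, STEP_OK, STEP_FAILED, TIMEOUT,
                 FINISH, COMPENSATION_DONE }

final class SagaStateMachine {
    private static final Map<SagaState, Map<SagaEvent, SagaState>> TABLE =
            new EnumMap<>(SagaState.class);

    private static void allow(SagaState from, SagaEvent on, SagaState to) {
        TABLE.computeIfAbsent(from, s -> new EnumMap<SagaEvent, SagaState>(SagaEvent.class))
             .put(on, to);
    }

    static {
        allow(SagaState.STARTED, SagaEvent.RESERVE_INVENTORY, SagaState.INVENTORY_PENDING);
        allow(SagaState.INVENTORY_PENDING, SagaEvent.STEP_OK, SagaState.INVENTORY_COMPLETED);
        allow(SagaState.INVENTORY_COMPLETED, SagaEvent.CHARGE_PAYMENT, SagaState.PAYMENT_PENDING);
        allow(SagaState.PAYMENT_PENDING, SagaEvent.STEP_OK, SagaState.PAYMENT_COMPLETED);
        allow(SagaState.PAYMENT_COMPLETED, SagaEvent.FINISH, SagaState.COMPLETED);
        // A failure or a timeout in any pending state starts compensation.
        for (SagaState pending : new SagaState[] {
                SagaState.INVENTORY_PENDING, SagaState.PAYMENT_PENDING}) {
            allow(pending, SagaEvent.STEP_FAILED, SagaState.COMPENSATING);
            allow(pending, SagaEvent.TIMEOUT, SagaState.COMPENSATING);
        }
        allow(SagaState.COMPENSATING, SagaEvent.COMPENSATION_DONE, SagaState.FAILED);
    }

    private SagaState state = SagaState.STARTED;

    SagaState state() {
        return state;
    }

    // Applies the event, or throws if the current state does not accept it.
    SagaState fire(SagaEvent event) {
        Map<SagaEvent, SagaState> row = TABLE.get(state);
        SagaState next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException(state + " does not accept " + event);
        }
        state = next;
        return state;
    }
}
```

Keeping every legal transition in one table makes illegal ones loud: firing FINISH from PAYMENT_PENDING throws instead of silently completing a saga that never got paid.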

We ran into the 'N+1 state problem' early on. Too many states = too many events = state machine spaghetti. Keep it simple. Max 8 states for a typical saga. If you need more, you probably need a different split of services.

Another pattern: use a SagaRepository that persists the entire saga instance (current state, completed steps, correlation ID). This lets you restart the orchestrator and pick up where you left off. In production, we had a heartbeat thread that periodically scans for sagas stuck in 'pending' state for more than 60 seconds and re-drives them.
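
The scan itself is simple. A sketch, assuming each persisted saga carries its state name and the time of its last transition; SagaInstance and StuckSagaScanner are hypothetical names, and the real version would be a database query rather than an in-memory filter.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Minimal view of a persisted saga row (field names are assumptions).
record SagaInstance(String sagaId, String state, Instant lastTransitionAt) {}

final class StuckSagaScanner {
    // Returns the IDs of sagas that have sat in a *PENDING state for longer
    // than maxAge, measured from their last state transition.
    static List<String> findStuck(List<SagaInstance> sagas, Instant now, Duration maxAge) {
        return sagas.stream()
                .filter(s -> s.state().endsWith("PENDING"))
                .filter(s -> Duration.between(s.lastTransitionAt(), now).compareTo(maxAge) > 0)
                .map(SagaInstance::sagaId)
                .toList();
    }
}
```

The heartbeat would call findStuck with a 60-second maxAge and re-drive whatever comes back.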

Never Do This:
Don't put the external service call inside the state machine action. It blocks the state machine thread. Use asynchronous actions (start the call, return immediately, then handle callback to transition). We had a production outage when a slow payment service blocked the state machine thread pool and all sagas stalled.
Production Insight
Use a thread pool executor separate from the web server thread pool for saga actions. We had a 10-thread pool in 2021 — not enough. Now we use a 50-thread pool with a separate queue. Blocking the state machine is the most common production failure I've seen.
Key Takeaway
Async every external call in a saga action. Blocking the state machine is a cardinal sin.
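
A sketch of the non-blocking shape using only the JDK's CompletableFuture. The remote call is whatever future your async client hands back; the event names are assumptions. Note that orTimeout completes the passed-in future itself when the deadline passes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class AsyncStep {
    // Attaches a timeout to an in-flight remote call and maps the outcome to
    // a state-machine event name. The caller gets a future back immediately
    // and is never blocked on the remote service.
    static <T> CompletableFuture<String> toEvent(CompletableFuture<T> remoteCall,
                                                 long timeoutMillis) {
        return remoteCall
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
                .handle((result, error) -> {
                    if (error == null) {
                        return "STEP_OK";
                    }
                    // Failures may arrive wrapped in a CompletionException.
                    Throwable cause = (error instanceof CompletionException
                            && error.getCause() != null) ? error.getCause() : error;
                    return (cause instanceof TimeoutException) ? "TIMEOUT" : "STEP_FAILED";
                });
    }
}
```

The state machine action starts the call, registers a callback on the returned future that fires the event, and returns. A payment service that never answers produces a TIMEOUT event instead of a stuck thread.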

Testing Sagas: Build a Chaos Lab, Not a Unit Test

Unit tests won't save you. Integration tests might. But the real test is running a saga while every external service fails intermittently.

Here's what we do at TheCodeForge: we have a 'chaos saga' test environment. It deploys all microservices with WireMock stubs that randomly return 500 errors, timeouts, or malformed responses for 5% of requests. We run 10,000 fake orders through it every night.

The tests check: does the saga eventually complete or fail cleanly? Are the compensating actions idempotent? Does the inventory balance stay accurate? Is there any data leak?
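
The inventory-balance check boils down to an invariant. Here is a toy version with no WireMock and no real services: ChaosRun is a hypothetical class, the payment call fails at a seeded random rate, and the whole "saga" is two local steps plus one compensation.

```java
import java.util.Random;

final class ChaosRun {
    private final int initialStock;
    private final double failureRate;
    private final Random random;
    private int stock;
    private int paidOrders;

    ChaosRun(int initialStock, double failureRate, long seed) {
        this.initialStock = initialStock;
        this.failureRate = failureRate;
        this.random = new Random(seed); // seeded so a failing run can be replayed
        this.stock = initialStock;
    }

    // Hypothetical payment call that fails for roughly failureRate of requests.
    private boolean flakyPayment() {
        return random.nextDouble() >= failureRate;
    }

    // One order: reserve stock, attempt payment, release stock if payment fails.
    void placeOrder() {
        if (stock == 0) {
            return;
        }
        stock--;              // local transaction 1: reserve
        if (flakyPayment()) {
            paidOrders++;     // local transaction 2: charge
        } else {
            stock++;          // compensation for transaction 1
        }
    }

    // The invariant: no unit of stock is lost or invented.
    boolean balanced() {
        return stock + paidOrders == initialStock;
    }

    int paidOrders() {
        return paidOrders;
    }
}
```

A real chaos run checks the same equation against the actual inventory and payment databases after the 10,000 orders have drained.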

This caught a bug where the payment service's compensating action had a race condition — it would sometimes refund twice if the saga was fast enough. The test found it because the chaos environment had random latencies.

You also need to test saga timeouts. What happens if the orchestrator itself is slow? We simulate high CPU on the orchestrator pod. The sagas should time out and compensate, not hang forever.

One more thing: test the 'saga recovery' path. Kill the orchestrator pod mid-saga, restart it, and verify the saga picks up from the last persisted state. This is a common failure in production — orchestrator restarts due to OOM or config errors.

Don't forget to test concurrent sagas for the same resource. If two sagas try to decrement the same inventory stock, one should fail with a concurrency error and compensate. Your saga should handle OptimisticLockException.
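
The conflict can be sketched without a database. This mimics what a JPA @Version column does, using an AtomicReference in place of the row; VersionedInventory and StockRow are illustrative names.

```java
import java.util.concurrent.atomic.AtomicReference;

// A stock row with a version counter, as an optimistic-locking stand-in.
record StockRow(int quantity, long version) {}

final class VersionedInventory {
    private final AtomicReference<StockRow> row;

    VersionedInventory(int quantity) {
        this.row = new AtomicReference<>(new StockRow(quantity, 0));
    }

    StockRow read() {
        return row.get();
    }

    // Applies the decrement only if nobody replaced the row since it was read.
    // compareAndSet compares references, so 'seen' must be the object that
    // read() returned. Returns false on a conflict or insufficient stock; the
    // calling saga then fails this step and compensates the earlier ones.
    boolean decrement(StockRow seen, int amount) {
        if (seen.quantity() < amount) {
            return false;
        }
        StockRow next = new StockRow(seen.quantity() - amount, seen.version() + 1);
        return row.compareAndSet(seen, next);
    }
}
```

Two sagas that both read the row at version 0 cannot both win: the second decrement sees a stale row and returns false.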

Senior Shortcut:
Don't mock the orchestrator. Use WireMock for external services, but run the real saga orchestrator with real state machine and real DB. You want to test the real async behavior, especially timeouts and retries.
Production Insight
In 2022, we missed a bug where the compensating action for the shipping service had a typo in the URL. The chaos test caught it on day 2 because the stub returned 404 on that endpoint. Without chaos testing, it would have hit production and caused a shipping label that couldn't be voided.
Key Takeaway
Run chaos tests every night. Test with random failures, timeouts, high latency, and orchestrator restarts. If your saga passes those, you can sleep at night.

Monitoring and Observing Sagas in Production

You can't debug a saga by reading logs. You need structured observability.

We use OpenTelemetry to trace every saga step. Each saga gets a unique trace ID. Every service call (external or compensating) is a span under that trace. This gives us a waterfall view in Jaeger or Datadog. When a saga fails, we can see exactly which service failed and at what latency.

We also export metrics: saga_total, saga_completed, saga_failed, saga_compensating, and saga_duration_seconds. Alert when the ratio of saga_failed to saga_total over your alert window exceeds 0.01. That means more than 1 in 100 sagas is failing, and you have a problem.
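
Read as a ratio of failed to total sagas, the 1-in-100 threshold is a one-liner. A sketch with hypothetical names; in practice the two counts would come from your metrics backend for a fixed time window.

```java
final class SagaAlert {
    // Fraction of sagas that failed in the window; 0 when no sagas ran.
    static double failureRatio(long failed, long total) {
        return total == 0 ? 0.0 : (double) failed / total;
    }

    // True when the failure ratio is strictly above the threshold.
    static boolean shouldAlert(long failed, long total, double threshold) {
        return failureRatio(failed, total) > threshold;
    }
}
```

Guarding the zero-traffic case matters: a quiet window should not page anyone with a division-by-zero artifact.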

Key metric: saga_compensating_duration_seconds. If this spikes, your compensating actions are slow. That's bad because long compensations increase the window for data inconsistency. We keep a separate histogram for compensating actions.

Logging: every saga step logs at INFO level with the saga ID, step name, status, and duration. The compensating action logs at WARN level. This makes grep-based debugging fast.

We also have a 'Saga Admin Dashboard' — a simple React app that queries the saga orchestrator's persistence DB. It shows all active sagas, their state, time in state, and a 'Force Compensate' button. This is a must-have for production ops. You will need to manually nuke a stuck saga.

Corner case: what about sagas that are 'stuck' because the service is down? The orchestrator should time out and compensate. But what if the compensation itself fails because the service is down? Then you need a dead-letter queue (DLQ). The DLQ worker retries compensations with exponential backoff. After 5 retries, it sends a Slack alert to the on-call engineer.
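
The retry loop of such a DLQ worker might look like the sketch below. CompensationRetrier is a hypothetical name, the 1-second base and 60-second cap are assumptions, and the sleeper is injected so the loop can be tested without actually waiting.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;

final class CompensationRetrier {
    static final long BASE_MILLIS = 1_000;  // assumed base delay
    static final long MAX_MILLIS = 60_000;  // assumed cap

    // Delay before retry n (1-based): base * 2^(n-1), capped at maxMillis.
    static long backoffMillis(int retry, long baseMillis, long maxMillis) {
        long delay = baseMillis << Math.min(retry - 1, 20);
        return Math.min(delay, maxMillis);
    }

    // Tries the compensation once, then up to maxRetries more times with
    // backoff in between. Returns false when every attempt failed, so the
    // caller can raise the alert.
    static boolean run(BooleanSupplier compensation, int maxRetries, LongConsumer sleeper) {
        if (compensation.getAsBoolean()) {
            return true;
        }
        for (int retry = 1; retry <= maxRetries; retry++) {
            sleeper.accept(backoffMillis(retry, BASE_MILLIS, MAX_MILLIS));
            if (compensation.getAsBoolean()) {
                return true;
            }
        }
        return false;
    }
}
```

With maxRetries set to 5, a compensation that never succeeds waits 1, 2, 4, 8, and 16 seconds between attempts before run returns false. That false is the signal to send the Slack alert.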

Interview Gold:
Question: 'What metrics do you track for sagas?' Answer: saga completion rate, compensation duration, time-in-state, and dead-letter queue depth. Bonus points: mention you alert on compensation duration > P99 of normal step duration because compensations should be faster than the forward operation.
Production Insight
We have a Slack bot that posts a message every time a saga is stuck in 'COMPENSATING' for more than 5 minutes. It links to the Jira issue for the offending saga. This turns a production fire into a process improvement.
Key Takeaway
Observability is not optional. Trace every saga step, measure every compensation action, and have a manual override dashboard. Your on-call engineer will thank you.

When NOT to Use Saga: The 2PC Trap

I've seen teams try to use Two-Phase Commit (2PC) across microservices. Don't. 2PC requires a global coordinator, and every participating database holds its locks until the coordinator decides. If one participant or the coordinator becomes unavailable, the others stay blocked until it recovers. This is a distributed deadlock machine.

Sagas don't hold locks. Each service commits its local transaction immediately. If something fails later, you compensate. This means you have eventual consistency. There's a window where the data is inconsistent (e.g., inventory decremented but not yet paid). This is acceptable for most business domains.

But there are cases where you want an atomic commit instead of a saga: within a single service boundary, using the same database. For example, if you need to transfer money between two accounts in the same bank, use a single database transaction. Not a saga.

Also, don't use saga for operations that must be immediately consistent. For example: 'reserve a seat on a flight and issue a boarding pass'. If the payment fails, you can't un-issue a boarding pass. Use a local transaction or a different pattern.

Our rule: if you can tolerate a 10-second window of inconsistency, use a saga. If not, use a local transaction. Never cross service boundaries with 2PC.

I once saw a team implement a saga for a 'create user' flow. User service, email service, CRM service. They had compensating actions that deleted the user, unsubscribed from email, and removed from CRM. This was overkill. A simple eventual consistency with a retry queue would have been simpler. Don't over-architect. Sagas have operational cost.

Production Trap:
Never use a saga for operations that involve physical resources that cannot be undone. For example, shipping a physical product, printing a document, or sending a 'you're hired' email. Compensating actions for these are difficult or impossible.
Production Insight
We had a 'badge printing' saga that compensated by sending an 'oops' email. It was terrible. Users got badges they shouldn't have. We switched to a local transaction with a pre-print validation check.
Key Takeaway
Sagas are for eventual consistency. If you need strong consistency, use a local transaction. If you can't, 2PC is a trap. Choose saga when the business can tolerate a short inconsistency window.
Production incident · POST-MORTEM · severity: high

The Phantom Inventory Leak

Symptom
Customers got shipping confirmations for out-of-stock items. Fraud team flagged 400+ orders where payment succeeded but inventory was zero. Support queue exploded. Revenue loss: $80k in refunds and shipping label charges.
Assumption
The dev assumed the payment service was idempotent and would correctly fail when inventory was low. The payment service itself was not the problem — the coordination between services was.
Root cause
The order flow spanned several services with no saga coordinating them. The payment service failed for 90 seconds (a transient network blip). The order service treated this as a permanent failure but didn't execute any compensating actions. Inventory had been decremented before payment confirmed, and no compensating restock was issued to the inventory service.
Fix
1. Implemented a saga orchestrator using Spring Boot 3.x with a state machine (event-driven with Spring Statemachine).
2. Made every service expose a compensating endpoint (PUT /api/compensate).
3. Added idempotency keys to all saga steps.
4. Wrote integration tests that simulate payment failure mid-flow.
5. Deployed with a 5-second timeout; any step taking longer triggers compensation immediately.
Key lesson
  • Distributed transactions are lies.
  • If you don't write compensating actions, you are one network timeout away from a data leak.
  • Test your compensations with chaos engineering before you need them.
Production debug guide
Symptom → root cause → fix for the failures that actually happen
Symptom · 01
Orders stuck in 'PENDING' state and never completing or failing back.
Fix
Check the saga orchestrator logs for the correlation IDs. Look for missing compensating calls. Most likely the orchestrator's retry logic is exhausted or the compensating endpoint is not idempotent. Check the orchestrator's persistence store — is the saga status saved? If yes, manually query the DB. If no, you have a race condition in your state machine.
Symptom · 02
Duplicate charges on the payment provider.
Fix
Idempotency key violation. The saga retried a step after a timeout, but the step actually succeeded. The payment service didn't check the idempotency key. Fix: make all saga steps idempotent using a unique correlation ID per saga instance. For payments, store the charge ID returned by the provider on the first attempt; use it to skip the second.
Symptom · 03
Inventory shows negative stock after a failed order.
Fix
Compensating action didn't run. Either the orchestrator crashed before issuing the compensate, or the compensate endpoint failed silently. Check orchestrator logs for the 'compensate' event. Add a dead-letter queue for failed compensations. Write a reconciliation job that runs hourly to detect inventory discrepancies.
Symptom · 04
Saga times out on a third-party API call (e.g., fraud check).
Fix
Third-party APIs are unreliable. Set a circuit breaker on the call. If it times out, don't block the saga — fail fast and compensate. Add a manual retry mechanism for operators. In production, we had a 'Saga Admin' endpoint that could retry a failed step from a known safe state.
Debug Cheat Sheet
Commands for fast diagnosis in production
Order stuck in saga pending
Immediate action
Query saga orchestrator state
Commands
kubectl logs -l app=saga-orchestrator --tail=1000 | grep 'saga_id=abc123'
curl -X GET http://saga-orchestrator:8080/api/saga/abc123/status
Fix now
If status is STARTED, expire it: curl -X PUT http://saga-orchestrator:8080/api/saga/abc123/compensate
Duplicate payment charge
Immediate action
Check idempotency key in payment service logs
Commands
curl -X GET 'http://payment-service:8080/api/charges?correlationId=abc123'
kubectl logs -l app=payment-service | grep 'idempotency_key' | grep 'abc123'
Fix now
If duplicate found, submit a refund via the payment provider's API. Then add idempotency check in PaymentController.java
Negative inventory count
Immediate action
Find the saga ID for the failed order
Commands
psql -c "SELECT * FROM inventory_saga WHERE order_id = 'ORDER-456';"
curl http://saga-orchestrator:8080/api/saga/ORDER-456/steps
Fix now
Run the compensation for inventory: curl -X PUT http://inventory-service:8080/api/compensate -H 'Content-Type: application/json' -d '{"orderId":"ORDER-456","quantity":1}'
Third-party fraud check timeout
Immediate action
Check circuit breaker state
Commands
curl http://fraud-service:8080/actuator/health | grep circuitBreaker
kubectl logs -l app=fraud-service --tail=100 | grep 'timeout'
Fix now
Reset circuit breaker: curl -X POST http://fraud-service:8080/actuator/circuitbreakers/fraudCheck/reset
Saga Pattern vs Two-Phase Commit (2PC)
Feature | Saga Pattern | Two-Phase Commit (2PC)
Consistency Model | Eventual consistency | Strong consistency (ACID)
Lock Duration | No locks (each step commits immediately) | Locks held during 'prepare' phase, can block all participants
Failure Handling | Compensating actions (business rollback) | Automatic rollback by coordinator
Performance | High (no blocking) | Lower (blocking during prepare/commit)
Complexity | Moderate (need idempotency, compensation logic) | High (requires coordinator, global XA transactions)
Use Case | Multi-service order flow, booking systems | Single-database, same-service transactions
Resilience | High (can compensate and retry) | Low (coordinator is single point of failure)

Key takeaways

1. Compensating actions are business transactions, not database rollbacks. They must be idempotent.
2. Never block the state machine thread. Async every external call.
3. Persist saga state before calling external services. Always.
4. Chaos test your sagas with random failures, timeouts, and orchestrator restarts.
5. Orchestration is better for complex compensation logic. Choreography for simple linear flows.

Common mistakes to avoid

Not making compensating actions idempotent.

Symptom
Duplicate compensations cause data corruption (e.g., inventory incremented twice).
Fix
Use a unique compensation ID per saga step. Check if the compensation has already been processed before applying it.

Blocking the saga orchestrator thread with an external HTTP call.

Symptom
All sagas stall when one external service is slow. Thread pool exhaustion causes orchestrator to reject new sagas.
Fix
Use asynchronous HTTP clients (WebClient) or message queues. The state machine should not block on I/O.

Not persisting saga state before calling external services.

Symptom
Orchestrator crashes mid-saga. On restart, the saga is lost, and data is inconsistent (e.g., decremented inventory but no compensating action triggered).
Fix
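Persist saga state (current step, completed steps) in a database before each external call. Commit that write in its own transaction; don't wrap the external call in the same @Transactional method, or the state isn't durable until the call returns.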

Using 2PC across microservices instead of saga.

Symptom
Distributed deadlocks, blocked databases, and application timeouts during network partitions.
Fix
Replace 2PC with saga pattern. Accept eventual consistency. Implement compensating actions.

Skipping chaos testing for sagas.

Symptom
Compensating actions fail silently in production when a service is temporarily unavailable. Data corruption goes unnoticed for days.
Fix
Set up a chaos test environment that randomly fails external services, introduces latency, and kills orchestrator pods.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR: You have a saga that performs inventory decrement, payment capture, and ...
Q02 · SENIOR: What's the difference between a compensating action and a database rollb...
Q03 · SENIOR: You need to implement a saga that handles a flight booking, hotel reserv...
Q04 · SENIOR: Why is 2PC a bad choice for microservices?
Q05 · SENIOR: How do you handle a saga that has a non-idempotent step?
Q06 · JUNIOR: Explain the difference between choreography and orchestration in sagas.
Q07 · SENIOR: You test your saga and it works. In production, the orchestrator OOMs. W...
Q08 · SENIOR: What metrics would you monitor for a production saga system?
Q01 of 08 · SENIOR

You have a saga that performs inventory decrement, payment capture, and shipping label creation. The payment service times out. The inventory was already decremented. How do you ensure the inventory is not leaked?

ANSWER
The saga orchestrator should have a timeout on the payment step. When the timeout fires, the orchestrator retries the payment step a configurable number of times, reusing the same idempotency key so a charge that actually went through is not duplicated. If the retries are exhausted, it transitions to the 'compensating' state, executes the compensating action for inventory (increment quantity), and then marks the saga as failed. The compensating action must be idempotent: it checks a compensation log before incrementing. The key is that the orchestrator persists its state before each external call, so if it crashes during compensation, it can resume on restart.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01. What is the Saga Pattern in microservices?
02. When should I use choreography vs orchestration for sagas?
03. How do I make a saga compensating action idempotent?
04. Can I use a saga with a third-party API that doesn't support compensation?
05. What are the most common production failures with sagas?