Saga Pattern: How I Learned to Stop Worrying and Love Distributed Failure
Saga pattern for distributed transactions in Spring Boot 3.
- Sagas break a distributed transaction into a sequence of local transactions with compensating rollbacks.
- Choreography uses event-driven coordination; orchestration uses a central coordinator service.
- Never use 2PC across microservices in production — it creates a distributed lock that WILL fail.
- Compensating actions must be idempotent and handle partial failures gracefully.
- Test sagas with chaos engineering. Your compensating logic is the first thing that breaks under load.
Imagine booking a flight, hotel, and car rental. If any step fails, you need to cancel the ones that succeeded. A saga is like a travel agent who knows the cancellation policy for each booking and calls to undo them in reverse order when things go wrong.
You just pushed a new order flow to production. The customer hits 'Place Order' and boom — inventory service decrements stock, payment service charges the card, shipping service creates a label. Then the fraud check fails. Now you have an inventory leak, a charge you can't reverse easily, and a shipping label that costs money to void.
This is distributed transaction hell. Monoliths had it easy: one database, one transaction, one rollback. Microservices? Each service owns its data. You can't just roll back across PostgreSQL, Redis, and a third-party payment API.
I learned this the hard way. 2019, Black Friday. Our payment service went down for 90 seconds. The order service kept accepting orders. When payment came back, it processed 2,000 charges with no inventory to back them. We shipped air. The customer support nightmare was biblical.
Senior devs know this pattern by heart. The Saga Pattern is your escape from distributed transaction misery. It doesn't prevent failures; it handles them gracefully. You compensate, you don't roll back.
This article covers the real production mechanics. The code patterns that survive a holiday rush. The debugging that actually works. The stuff your system design interview prep doesn't teach.
Choreography vs Orchestration: Pick Your Poison
Two paths. Both hurt. Choose wisely.
Choreography is event-driven. Service A publishes an OrderCreated event. Service B consumes it, decrements inventory, then publishes InventoryDecremented. Service C consumes that, processes payment, publishes PaymentProcessed. It's beautiful until it's not. In production, events get lost, arrive out of order, or duplicate. You end up with a distributed spaghetti of event handlers that's impossible to trace.
I've seen a team implement choreography with 8 microservices. A single failed event caused a cascade of 47 compensatory events across 3 queues. The debugging took 3 weeks. The fix was to switch to orchestration.
Orchestration uses a central coordinator. A saga orchestrator is a state machine that tells each service what to do. It knows the current step and the compensating action for each step. This is easier to debug: one place to look, one log stream. The downside? The orchestrator becomes a single point of failure and a throughput bottleneck.
We hit this at TheCodeForge. Our orchestrator handled 200 sagas per second. Not enough for Black Friday. We had to partition sagas by region — each region got its own orchestrator instance. The orchestrator itself must be stateless (persist state in a database) so you can scale horizontally.
Here's my rule of thumb: choreography for simple linear flows (max 3 services) where each step is idempotent and duplicate or out-of-order events do no harm. Orchestration for anything with compensation logic, branching, or compliance reporting. Orchestration also copes better with complex flows because the coordinator can run independent steps in parallel, provided the orchestrator itself is scaled out as described above.
Never mix both in the same saga. You'll get event-ordering nightmares that break compensation.
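To make the orchestration idea concrete, here is a minimal in-process sketch of a coordinator that runs steps in order and, on the first failure, compensates the completed ones in reverse. The names `SagaStep` and `SimpleSagaOrchestrator` are hypothetical, and a real orchestrator would persist its state between steps instead of keeping it in memory.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** One saga step: a forward action plus the business operation that undoes it. */
record SagaStep(String name, Runnable action, Runnable compensation) {}

/** Hypothetical in-process orchestrator; a production one persists state between steps. */
final class SimpleSagaOrchestrator {
    private final List<String> log = new ArrayList<>();

    /** Returns true if every step succeeded, false if the saga failed and was compensated. */
    boolean run(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.action().run();
                completed.push(step); // most recent success sits on top of the stack
                log.add("done:" + step.name());
            } catch (RuntimeException e) {
                log.add("failed:" + step.name());
                while (!completed.isEmpty()) { // popping the stack yields reverse order
                    SagaStep done = completed.pop();
                    done.compensation().run();
                    log.add("compensated:" + done.name());
                }
                return false;
            }
        }
        return true;
    }

    List<String> log() {
        return List.copyOf(log);
    }
}
```

Feed it three steps where the last one throws and the log shows the first two being compensated newest-first. That single log is the "one place to look" that makes orchestration easier to debug.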
Compensating Actions: The Art of Un-Doing
This is the part everyone screws up.
Compensating actions are not rollbacks. They are business transactions that undo the effect of a previous transaction. There's a fundamental difference. A rollback is a database command. A compensation is a business operation that might fail, take time, or have side effects.
For example: you decremented inventory by 1. The compensation is to increment inventory by 1. But what if the item was backordered? Or what if another saga already used the freed inventory? You need idempotency and concurrency control.
Here's the rule: every command in a saga must have a corresponding compensating command. The compensating command must be idempotent — calling it twice is safe. This is not optional.
We learned this when a network partition caused a saga to issue two compensations for the same inventory decrement. The inventory shot up by 2. The finance team noticed a $200k discrepancy in 3 weeks.
The fix was to use idempotency keys on all compensating endpoints. The inventory service checks: 'Has this compensation_id been processed before?' If yes, return success without doing anything.
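A rough sketch of that check, assuming an in-memory set of processed IDs. `InventoryCompensator` and `releaseStock` are made-up names, and in a real service the processed-ID record and the stock update would need to commit in the same database transaction.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical inventory-side handler for the "release stock" compensation. */
final class InventoryCompensator {
    private final AtomicInteger stock;
    // Stand-in for a table of processed compensation IDs.
    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    InventoryCompensator(int initialStock) {
        this.stock = new AtomicInteger(initialStock);
    }

    /**
     * Applies the compensation at most once per compensationId.
     * Returns true if stock was changed, false if this ID was already handled.
     * Either way the caller can treat the call as a success.
     */
    boolean releaseStock(String compensationId, int quantity) {
        if (!processed.add(compensationId)) {
            return false; // duplicate delivery: do nothing
        }
        stock.addAndGet(quantity);
        return true;
    }

    int stock() {
        return stock.get();
    }
}
```

With this in place, the duplicate compensation from the network partition becomes a no-op instead of a phantom unit of stock.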
Another gotcha: temporal compensations. Some operations have a time limit. A flight booking might have a 24-hour cancellation window. After that, you can't compensate — you need a different business process. Handle this in the saga orchestrator: query the 'latest_valid_compensate_time' before issuing the command.
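The deadline check itself is tiny. This sketch assumes the orchestrator has already fetched the 'latest_valid_compensate_time' as an `Instant`; the `CompensationRoute` enum and the "manual process" fallback are placeholders for whatever business process applies once the window has closed.

```java
import java.time.Instant;

/** Where a compensation should go once the deadline is known. Names are hypothetical. */
enum CompensationRoute { AUTOMATIC, MANUAL_PROCESS }

final class CompensationWindow {
    /**
     * AUTOMATIC while the cancellation window is still open (deadline inclusive),
     * MANUAL_PROCESS once it has passed.
     */
    static CompensationRoute route(Instant latestValidCompensateTime, Instant now) {
        return now.isAfter(latestValidCompensateTime)
                ? CompensationRoute.MANUAL_PROCESS
                : CompensationRoute.AUTOMATIC;
    }
}
```

Passing `now` in explicitly keeps the decision testable without touching the system clock.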
Don't forget async compensations. If a step is async (e.g., 'we'll send you an email in 5 minutes'), the compensation must handle the 'in-flight' scenario. Your saga should have a 'pending' state that waits for async completion or times out.
NOW() - INTERVAL '1 hour'.' It catches the silent failures.
Saga State Machines: Concrete Spring Boot Implementation
Spring Statemachine with JPA persistence. This is the production pattern I've used for 3 years.
The state machine defines states (STARTED, INVENTORY_PENDING, INVENTORY_COMPLETED, PAYMENT_PENDING, PAYMENT_COMPLETED, COMPENSATING, COMPLETED, FAILED) and transitions (events). Each transition triggers a method that calls an external service.
When you call an external service, you transition the state to 'pending'. Then, on response, you transition to 'completed' or 'failed'. If failed, you transition to 'compensating', and the actions you attached to that transition fire the compensating action for each completed step in reverse order.
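The transition rules can be sketched as a plain lookup table. This is deliberately not the Spring Statemachine API, just an illustration of the shape. The state names come from the list above; the event names are invented, and ending in FAILED once compensation finishes is an assumption about how the flow terminates.

```java
import java.util.EnumMap;
import java.util.Map;

enum SagaState {
    STARTED, INVENTORY_PENDING, INVENTORY_COMPLETED, PAYMENT_PENDING,
    PAYMENT_COMPLETED, COMPENSATING, COMPLETED, FAILED
}

// Event names are hypothetical; only the states are named in the text.
enum SagaEvent { CALL_SENT, STEP_SUCCEEDED, STEP_FAILED, TIMEOUT, SAGA_FINISHED, COMPENSATION_DONE }

final class SagaTransitions {
    private static final Map<SagaState, Map<SagaEvent, SagaState>> TABLE =
            new EnumMap<>(SagaState.class);

    private static void allow(SagaState from, SagaEvent on, SagaState to) {
        TABLE.computeIfAbsent(from, s -> new EnumMap<>(SagaEvent.class)).put(on, to);
    }

    static {
        allow(SagaState.STARTED, SagaEvent.CALL_SENT, SagaState.INVENTORY_PENDING);
        allow(SagaState.INVENTORY_PENDING, SagaEvent.STEP_SUCCEEDED, SagaState.INVENTORY_COMPLETED);
        allow(SagaState.INVENTORY_COMPLETED, SagaEvent.CALL_SENT, SagaState.PAYMENT_PENDING);
        allow(SagaState.PAYMENT_PENDING, SagaEvent.STEP_SUCCEEDED, SagaState.PAYMENT_COMPLETED);
        allow(SagaState.PAYMENT_COMPLETED, SagaEvent.SAGA_FINISHED, SagaState.COMPLETED);
        // Any pending step that fails or times out sends the saga into compensation.
        for (SagaState pending : new SagaState[] {SagaState.INVENTORY_PENDING, SagaState.PAYMENT_PENDING}) {
            allow(pending, SagaEvent.STEP_FAILED, SagaState.COMPENSATING);
            allow(pending, SagaEvent.TIMEOUT, SagaState.COMPENSATING);
        }
        // Assumption: the saga ends in FAILED after all compensations have run.
        allow(SagaState.COMPENSATING, SagaEvent.COMPENSATION_DONE, SagaState.FAILED);
    }

    /** Returns the next state, or throws if the event is not legal in the current state. */
    static SagaState next(SagaState current, SagaEvent event) {
        SagaState target = TABLE.getOrDefault(current, Map.of()).get(event);
        if (target == null) {
            throw new IllegalStateException("No transition from " + current + " on " + event);
        }
        return target;
    }
}
```

Rejecting illegal transitions loudly is the point: a late response arriving for a saga that is already COMPLETED should blow up in a log, not silently move the state.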
Key config: set a timeout on every state. If the external service doesn't respond in 10 seconds, the state machine fires a 'timeout' event that transitions to 'failed' and triggers compensation. This prevents stuck sagas.
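One way to get that behaviour around a single call, sketched with the JDK's `CompletableFuture.orTimeout` rather than any state machine timer. `StepTimeout` is a hypothetical helper, and the returned strings stand in for firing the real events.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class StepTimeout {
    /**
     * Runs the remote call on another thread and waits at most timeoutMillis.
     * Returns "COMPLETED" on success, or "COMPENSATING" if the call throws
     * or does not answer in time.
     */
    static String callWithTimeout(Supplier<String> remoteCall, long timeoutMillis) {
        try {
            CompletableFuture.supplyAsync(remoteCall)
                    .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
                    .join();
            return "COMPLETED";
        } catch (CompletionException e) {
            // join() wraps both TimeoutException and failures from the call itself.
            return "COMPENSATING";
        }
    }
}
```

Note that a timed-out call may still complete on the remote side, which is one more reason the compensating action has to be idempotent.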
We ran into the 'N+1 state problem' early on. Too many states = too many events = state machine spaghetti. Keep it simple. Max 8 states for a typical saga. If you need more, you probably need a different split of services.
Another pattern: use a SagaRepository that persists the entire saga instance (current state, completed steps, correlation ID). This lets you restart the orchestrator and pick up where you left off. In production, we had a heartbeat thread that periodically scans for sagas stuck in 'pending' state for more than 60 seconds and re-drives them.
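The stuck-saga scan boils down to one query. Here is a sketch against an in-memory map instead of JPA; `SagaInstance`, `InMemorySagaRepository`, and `findStuck` are illustrative names, and the `_PENDING` suffix check assumes the state naming used above.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Everything needed to resume a saga after an orchestrator restart. */
record SagaInstance(String correlationId, String state, List<String> completedSteps, Instant lastUpdated) {}

/** In-memory stand-in for a JPA-backed repository. */
final class InMemorySagaRepository {
    private final Map<String, SagaInstance> store = new ConcurrentHashMap<>();

    void save(SagaInstance saga) {
        store.put(saga.correlationId(), saga);
    }

    /** Sagas sitting in a *_PENDING state for longer than maxAge, oldest first. */
    List<SagaInstance> findStuck(Duration maxAge, Instant now) {
        Instant cutoff = now.minus(maxAge);
        return store.values().stream()
                .filter(s -> s.state().endsWith("_PENDING"))
                .filter(s -> s.lastUpdated().isBefore(cutoff))
                .sorted(Comparator.comparing(SagaInstance::lastUpdated))
                .toList();
    }
}
```

A scheduled job would call `findStuck(Duration.ofSeconds(60), Instant.now())` and re-drive each result. With several orchestrator instances you would also need a claim or lock so two of them don't re-drive the same saga.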
Testing Sagas: Build a Chaos Lab, Not a Unit Test
Unit tests won't save you. Integration tests might. But the real test is running a saga while every external service fails intermittently.
Here's what we do at TheCodeForge: we have a 'chaos saga' test environment. It deploys all microservices with WireMock stubs that randomly return 500 errors, timeouts, or malformed responses for 5% of requests. We run 10,000 fake orders through it every night.
The tests check: does the saga eventually complete or fail cleanly? Are the compensating actions idempotent? Does the inventory balance stay accurate? Is there any data leak?
This caught a bug where the payment service's compensating action had a race condition — it would sometimes refund twice if the saga was fast enough. The test found it because the chaos environment had random latencies.
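If you want the same kind of fault injection inside a plain JVM test, without a full stubbed environment, a wrapper like this is enough to start. It is a stand-in for the WireMock setup, not a reproduction of it; `ChaosProxy` and its parameters are hypothetical, and it only injects errors, not latency.

```java
import java.util.Random;
import java.util.function.Supplier;

/** Wraps a call and makes a configurable fraction of invocations fail. */
final class ChaosProxy<T> {
    private final Supplier<T> delegate;
    private final double failureRate; // 0.05 means roughly 5% of calls fail
    private final Random random;

    ChaosProxy(Supplier<T> delegate, double failureRate, long seed) {
        this.delegate = delegate;
        this.failureRate = failureRate;
        this.random = new Random(seed); // fixed seed keeps a failing run reproducible
    }

    T call() {
        // nextDouble() is in [0.0, 1.0), so rate 0.0 never fails and 1.0 always fails.
        if (random.nextDouble() < failureRate) {
            throw new IllegalStateException("injected fault");
        }
        return delegate.get();
    }
}
```

Seeding the generator matters more than it looks: when the nightly run finds a double refund, you want to replay the exact same sequence of faults.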
You also need to test saga timeouts. What happens if the orchestrator itself is slow? We simulate high CPU on the orchestrator pod. The sagas should time out and compensate, not hang forever.
One more thing: test the 'saga recovery' path. Kill the orchestrator pod mid-saga, restart it, and verify the saga picks up from the last persisted state. This is a common failure in production — orchestrator restarts due to OOM or config errors.
Don't forget to test concurrent sagas for the same resource. If two sagas try to decrement the same inventory stock, one should fail with a concurrency error and compensate. Your saga should handle OptimisticLockException.
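The mechanics behind that exception are a version check on write. This sketch imitates it with an `AtomicReference` rather than JPA's `@Version` column, so the names `StockRow` and `VersionedStock` are illustrative; in the real service the losing write would surface as an OptimisticLockException instead of a `false` return.

```java
import java.util.concurrent.atomic.AtomicReference;

/** A stock row plus the version that was current when it was read. */
record StockRow(int quantity, long version) {}

final class VersionedStock {
    private final AtomicReference<StockRow> row;

    VersionedStock(int quantity) {
        this.row = new AtomicReference<>(new StockRow(quantity, 0));
    }

    StockRow read() {
        return row.get();
    }

    /**
     * Succeeds only if nobody has written since 'seen' was read and there is enough stock.
     * 'seen' must be the object returned by read(); compareAndSet compares references.
     */
    boolean decrement(StockRow seen, int amount) {
        if (seen.quantity() < amount) {
            return false;
        }
        return row.compareAndSet(seen, new StockRow(seen.quantity() - amount, seen.version() + 1));
    }
}
```

Two sagas that read the same row can both call `decrement`, but only one wins. The loser should move to compensation rather than blindly retry, since the stock it wanted may be gone.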
Monitoring and Observing Sagas in Production
You can't debug a saga by reading logs. You need structured observability.
We use OpenTelemetry to trace every saga step. Each saga gets a unique trace ID. Every service call (external or compensating) is a span under that trace. This gives us a waterfall view in Jaeger or Datadog. When a saga fails, we can see exactly which service failed and at what latency.
We also export metrics: saga_total, saga_completed, saga_failed, saga_compensating, and saga_duration_seconds. Alert when the ratio of saga_failed to saga_total exceeds 0.01. That means more than 1 in 100 sagas is failing, and you have a problem.
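The "1 in 100" check is a ratio of two counters over the same time window, not a raw per-second count. A small sketch of the arithmetic, with `SagaAlerting` as a hypothetical helper and the counter values supplied by whatever metrics backend you use:

```java
final class SagaAlerting {
    /**
     * True when failed / total is strictly above the threshold.
     * Both counters must cover the same time window.
     */
    static boolean shouldAlert(long failed, long total, double threshold) {
        if (total == 0) {
            return false; // no traffic in the window: nothing to judge
        }
        return (double) failed / total > threshold;
    }
}
```

For example, 2 failures out of 100 sagas with a threshold of 0.01 fires the alert, while 1 out of 200 does not.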
Key metric: saga_compensating_duration_seconds. If this spikes, your compensating actions are slow. That's bad because long compensations increase the window for data inconsistency. We keep a separate histogram for compensating actions.
Logging: every saga step logs at INFO level with the saga ID, step name, status, and duration. The compensating action logs at WARN level. This makes grep-based debugging fast.
We also have a 'Saga Admin Dashboard' — a simple React app that queries the saga orchestrator's persistence DB. It shows all active sagas, their state, time in state, and a 'Force Compensate' button. This is a must-have for production ops. You will need to manually nuke a stuck saga.
Corner case: what about sagas that are 'stuck' because the service is down? The orchestrator should time out and compensate. But what if the compensation itself fails because the service is down? Then you need a dead-letter queue (DLQ). The DLQ worker retries compensations with exponential backoff. After 5 retries, it sends a Slack alert to the on-call engineer.
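A sketch of the retry loop such a DLQ worker might run. The names are hypothetical, the base delay is an assumption, and the `false` return is where the Slack alert would be sent; real workers usually add jitter and persist the attempt count so a restart doesn't reset it.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

final class CompensationRetrier {
    static final int MAX_RETRIES = 5;

    /** Delay after failed attempt n (1-based): base * 2^(n-1). */
    static Duration backoff(Duration base, int attempt) {
        return base.multipliedBy(1L << (attempt - 1));
    }

    /**
     * Calls the compensation up to MAX_RETRIES times, sleeping between attempts.
     * Returns true on the first success, false when retries are exhausted
     * or the thread is interrupted (time to page a human).
     */
    static boolean retry(BooleanSupplier compensation, Duration base) {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            if (compensation.getAsBoolean()) {
                return true;
            }
            if (attempt < MAX_RETRIES) {
                try {
                    Thread.sleep(backoff(base, attempt).toMillis());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

With a base of one second the waits are 1, 2, 4, and 8 seconds, so a compensation that keeps failing is escalated after roughly 15 seconds plus call time.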
When NOT to Use Saga: The 2PC Trap
I've seen teams try to use Two-Phase Commit (2PC) across microservices. Don't. 2PC requires a global coordinator, and every participating database holds its locks from the prepare phase until the coordinator announces the outcome. If the coordinator or one participant becomes unavailable in that window, everyone else is blocked until it recovers. This is a distributed blocking machine.
Sagas don't hold locks. Each service commits its local transaction immediately. If something fails later, you compensate. This means you have eventual consistency. There's a window where the data is inconsistent (e.g., inventory decremented but not yet paid). This is acceptable for most business domains.
But there are cases where a plain ACID transaction is the right call: within a single service boundary, using the same database. For example, if you need to transfer money between two accounts in the same bank, use a single database transaction. Not a saga, and not 2PC.
Also, don't use a saga for operations that must be immediately consistent. For example: 'reserve a seat on a flight and issue a boarding pass'. If the payment fails, you can't un-issue a boarding pass. Use a local transaction or a different pattern.
Our rule: if you can tolerate a 10-second window of inconsistency, use a saga. If not, use a local transaction. Never cross service boundaries with 2PC.
I once saw a team implement a saga for a 'create user' flow. User service, email service, CRM service. They had compensating actions that deleted the user, unsubscribed from email, and removed from CRM. This was overkill. A simple eventual consistency with a retry queue would have been simpler. Don't over-architect. Sagas have operational cost.
The Phantom Inventory Leak
- Distributed transactions are lies.
- If you don't write compensating actions, you are one network timeout away from a data leak.
- Test your compensations with chaos engineering before you need them.
kubectl logs -l app=saga-orchestrator --tail=1000 | grep 'saga_id=abc123'
curl -X GET http://saga-orchestrator:8080/api/saga/abc123/status
Key takeaways
Common mistakes to avoid
- Not making compensating actions idempotent.
- Blocking the saga orchestrator thread with an external HTTP call.
- Not persisting saga state before calling external services.
- Using 2PC across microservices instead of saga.
- Skipping chaos testing for sagas.
Interview Questions on This Topic
You have a saga that performs inventory decrement, payment capture, and shipping label creation. The payment service times out. The inventory was already decremented. How do you ensure the inventory is not leaked?
Frequently Asked Questions