Advanced 8 min · May 23, 2026

Saga Pattern: How I Learned to Stop Worrying and Love Distributed Failure

Q: What is the Saga Pattern in microservices?

The Saga Pattern breaks a distributed transaction into a sequence of local transactions. Each step has a compensating action that undoes it. If a step fails, the saga executes the compensating actions for all completed steps in reverse order, ensuring eventual consistency across services.

Q: When should I use choreography vs orchestration for sagas?

Use choreography for simple linear workflows (max 3 services) where each step is idempotent. Use orchestration for anything with branching, complex compensation logic, or compliance requirements. Orchestration is easier to debug and manage but adds a single point of failure.

Q: How do I make a saga compensating action idempotent?

Use a unique compensation ID per saga step. Before applying the compensation, check if the ID exists in a CompensationLog table. If it does, return success without doing anything. This ensures calling the compensation multiple times is safe.

Q: Can I use a saga with a third-party API that doesn't support compensation?

Yes, but the compensation must be designed at the business level. For example, if you send an SMS that can't be unsent, the compensation could send a follow-up SMS saying 'order cancelled'. If no compensation is possible, don't include that step in the saga.

Q: What are the most common production failures with sagas?

1) Compensating actions are not idempotent, causing duplicate compensation. 2) State machine blocks on external calls, causing thread pool exhaustion. 3) Saga state not persisted, so orchestrator crashes lose in-flight sagas. 4) Compensating actions fail silently and corrupt data.

Saga pattern for distributed transactions in Spring Boot 3.x.

Naren Founder & Principal Engineer

20+ years shipping production Java in banking & fintech. Drawn from code that ran under real load.

✓ Production

production tested

July 04, 2026

last updated

1,697

articles · all by Naren

Before you start⏱ 30 min

✓Deep production experience
✓Understanding of internals and trade-offs
✓Experience debugging complex systems

⚡Quick Answer

Sagas break a distributed transaction into a sequence of local transactions with compensating rollbacks.
Choreography uses event-driven coordination; orchestration uses a central coordinator service.
Never use 2PC across microservices in production — it creates a distributed lock that WILL fail.
Compensating actions must be idempotent and handle partial failures gracefully.
Test sagas with chaos engineering. Your compensating logic is the first thing that breaks under load.

✦ Definition~90s read

What is Saga Pattern for Distributed Transactions?

The Saga Pattern is a failure management pattern for coordinating distributed transactions across microservices. Instead of a monolithic database transaction with ACID guarantees, you break the transaction into a sequence of local transactions. Each step has a compensating action that undoes it. If a step fails, the saga executes the compensating actions for all completed steps — in reverse order.

Sagas come in two flavors: choreography and orchestration. Choreography uses events: Service A publishes an event, Service B listens and does its work, then publishes its own event. Orchestration uses a coordinator service (orchestrator) that tells each service what to do via commands.

Choreography is simpler to set up but harder to debug. Orchestration adds a single point of failure but gives you visibility and control. Choose choreography for simple, linear workflows. Choose orchestration for anything with branching, complex compensation logic, or compliance requirements.

Plain-English First

Imagine booking a flight, hotel, and car rental. If any step fails, you need to cancel the ones that succeeded. A saga is like a travel agent who knows the cancellation policy for each booking and calls to undo them in reverse order when things go wrong.

You just pushed a new order flow to production. The customer hits 'Place Order' and boom — inventory service decrements stock, payment service charges the card, shipping service creates a label. Then the fraud check fails. Now you have an inventory leak, a charge you can't reverse easily, and a shipping label that costs money to void.

This is distributed transaction hell. Monoliths had it easy: one database, one transaction, one rollback. Microservices? Each service owns its data. You can't just roll back across PostgreSQL, Redis, and a third-party payment API.

I learned this the hard way. 2019, Black Friday. Our payment service went down for 90 seconds. The order service kept accepting orders. When payment came back, it processed 2,000 charges with no inventory to back them. We shipped air. The customer support nightmare was biblical.

Senior devs know this pattern by heart. The Saga Pattern is your escape from distributed transaction misery. It doesn't prevent failures; it handles them gracefully. You compensate, you don't roll back.

This article covers the real production mechanics. The code patterns that survive a holiday rush. The debugging that actually works. The stuff your system design interview prep doesn't teach.

Choreography vs Orchestration: Pick Your Poison

Two paths. Both hurt. Choose wisely.

Choreography is event-driven. Service A publishes an OrderCreated event. Service B consumes it, decrements inventory, then publishes InventoryDecremented. Service C consumes that, processes payment, publishes PaymentProcessed. It's beautiful until it's not. In production, events get lost, arrive out of order, or duplicate. You end up with a distributed spaghetti of event handlers that's impossible to trace.

I've seen a team implement choreography with 8 microservices. A single failed event caused a cascade of 47 compensatory events across 3 queues. The debugging took 3 weeks. The fix was to switch to orchestration.

Orchestration uses a central coordinator. A saga orchestrator is a state machine that tells each service what to do. It knows the current step and the compensating action for each step. This is easier to debug: one place to look, one log stream. The downside? The orchestrator becomes a single point of failure and a throughput bottleneck.
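To make the "knows the current step and the compensating action for each step" idea concrete, here is a minimal in-memory sketch in plain Java. It is not the Spring Statemachine setup described later: SagaStep, SimpleSagaOrchestrator, and the SagaStep.of helper are hypothetical names, and the sketch deliberately leaves out persistence, retries, and timeouts.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** One saga step: a forward action plus the compensation that undoes it. */
interface SagaStep {
    String name();
    void execute() throws Exception; // forward local transaction
    void compensate();               // business-level undo; must be idempotent

    /** Demo helper (hypothetical): a step that throws on execute when fail is true. */
    static SagaStep of(String name, boolean fail) {
        return new SagaStep() {
            public String name() { return name; }
            public void execute() throws Exception {
                if (fail) throw new IllegalStateException(name + " unavailable");
            }
            public void compensate() { /* nothing to undo in the demo */ }
        };
    }
}

/** Minimal orchestrator: no persistence, retries, or timeouts. */
final class SimpleSagaOrchestrator {
    private final List<String> log = new ArrayList<>();

    /** Runs steps in order; on failure, compensates completed steps in reverse. */
    boolean run(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.execute();
                completed.push(step); // most recently completed step sits on top
                log.add("done:" + step.name());
            } catch (Exception e) {
                log.add("failed:" + step.name());
                while (!completed.isEmpty()) { // popping yields reverse completion order
                    SagaStep done = completed.pop();
                    done.compensate();
                    log.add("compensated:" + done.name());
                }
                return false;
            }
        }
        return true;
    }

    List<String> log() { return log; }
}
```

Running it with inventory and payment succeeding and shipping failing produces the log done:inventory, done:payment, failed:shipping, compensated:payment, compensated:inventory. The failed step itself is not compensated, only the steps that completed before it.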

We hit this at TheCodeForge. Our orchestrator handled 200 sagas per second. Not enough for Black Friday. We had to partition sagas by region — each region got its own orchestrator instance. The orchestrator itself must be stateless (persist state in a database) so you can scale horizontally.

Here's my rule of thumb: choreography for simple linear flows (max 3 services) where each step is idempotent and you don't care about order. Orchestration for anything with compensation logic, branching, or compliance reporting. Orchestration also copes better as the workflow grows, because the coordinator can run independent steps in parallel, as long as you partition it so it doesn't become the bottleneck described above.

Never mix both in the same saga. You'll get event-ordering nightmares that break compensation.

Production Trap:

Orchestrator state persistence. If your orchestrator crashes and you haven't persisted the saga state, you lose all in-flight sagas. Always persist to a database (PostgreSQL, Cassandra) after every state transition. The saga becomes stale if the DB is down — handle that with a dead-letter queue.

Production Insight

Our orchestrator state machine used Spring Statemachine with JPA persistence. In 2023, DB connection pool exhaustion caused saga state loss for 12 minutes. We now use a separate connection pool for saga persistence, with its own lower connection limit, so other workloads can't starve it.

Key Takeaway

Always persist saga state before calling external services. If the external call fails, you can still compensate. If you persist after the call, you risk losing state on a crash.

thecodeforge.io

Microservices Saga Pattern

Compensating Actions: The Art of Un-Doing

This is the part everyone screws up.

Compensating actions are not rollbacks. They are business transactions that undo the effect of a previous transaction. There's a fundamental difference. A rollback is a database command. A compensation is a business operation that might fail, take time, or have side effects.

For example: you decremented inventory by 1. The compensation is to increment inventory by 1. But what if the item was backordered? Or what if another saga already used the freed inventory? You need idempotency and concurrency control.

Here's the rule: every command in a saga must have a corresponding compensating command. The compensating command must be idempotent — calling it twice is safe. This is not optional.

We learned this when a network partition caused a saga to issue two compensations for the same inventory decrement. The inventory shot up by 2. The finance team noticed a $200k discrepancy in 3 weeks.

The fix was to use idempotency keys on all compensating endpoints. The inventory service checks: 'Has this compensation_id been processed before?' If yes, return success without doing anything.
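A minimal sketch of that check, assuming an in-memory set as a stand-in for the compensation log. InventoryCompensator and its method names are hypothetical. In a real service the log insert and the stock update would need to share one local database transaction, with a unique constraint on the compensation ID.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory stand-in for an inventory service with a compensation log. */
final class InventoryCompensator {
    private final Map<String, Integer> stock = new ConcurrentHashMap<>();
    // Stand-in for a CompensationLog table keyed by compensation_id.
    private final Set<String> processedCompensations = ConcurrentHashMap.newKeySet();

    void setStock(String sku, int quantity) { stock.put(sku, quantity); }

    int stockOf(String sku) { return stock.getOrDefault(sku, 0); }

    /**
     * Restores the quantity at most once per compensationId.
     * Returns true if the compensation was applied, false if it was a duplicate.
     * Either way the caller can treat the compensation as successful.
     */
    boolean compensateDecrement(String compensationId, String sku, int quantity) {
        // add() is atomic on this set: only the first caller for an ID gets true.
        if (!processedCompensations.add(compensationId)) {
            return false; // already processed, change nothing
        }
        stock.merge(sku, quantity, Integer::sum);
        return true;
    }
}
```

Calling compensateDecrement twice with the same ID raises the stock once. The second call is a no-op, which is exactly the behavior that would have prevented the double increment described above.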

Another gotcha: temporal compensations. Some operations have a time limit. A flight booking might have a 24-hour cancellation window. After that, you can't compensate — you need a different business process. Handle this in the saga orchestrator: query the 'latest_valid_compensate_time' before issuing the command.
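The window check can be as small as the sketch below. CompensationWindow and its Route enum are hypothetical names, and the 24-hour figure in the usage note is only the flight example from the paragraph above, not a rule.

```java
import java.time.Duration;
import java.time.Instant;

/** Decides how to undo a step whose compensation is only valid for a limited time. */
final class CompensationWindow {
    enum Route { COMPENSATE, MANUAL_PROCESS }

    private final Instant latestValidCompensateTime;

    CompensationWindow(Instant stepCompletedAt, Duration window) {
        this.latestValidCompensateTime = stepCompletedAt.plus(window);
    }

    /** COMPENSATE while the window is open, otherwise hand off to a different business process. */
    Route routeAt(Instant now) {
        return now.isAfter(latestValidCompensateTime) ? Route.MANUAL_PROCESS : Route.COMPENSATE;
    }
}
```

With a booking completed at midnight and a 24-hour window, routeAt returns COMPENSATE at hour 23 and MANUAL_PROCESS at hour 25. Passing the current time in as a parameter keeps the check easy to test.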

Don't forget async compensations. If a step is async (e.g., 'we'll send you an email in 5 minutes'), the compensation must handle the 'in-flight' scenario. Your saga should have a 'pending' state that waits for async completion or times out.

Senior Shortcut:

Use a single CompensationLog table per service. Store sagaId, compensationId, status, and timestamp. This one table saves you from debugging duplicate compensations and provides an audit trail for compliance.

Production Insight

Compensations that fail silently are the number one cause of data corruption in sagas. Log every compensation attempt, success, and failure. We run this query every hour: SELECT * FROM compensation_log WHERE processed = false AND created_at < NOW() - INTERVAL '1 hour'; It catches the silent failures.

Key Takeaway

Idempotency is not optional. Temporal compensations are a source of production bugs. Log every compensation attempt — you will need that audit trail.

Saga State Machines: Concrete Spring Boot Implementation

Spring Statemachine with JPA persistence. This is the production pattern I've used for 3 years.

The state machine defines states (STARTED, INVENTORY_PENDING, INVENTORY_COMPLETED, PAYMENT_PENDING, PAYMENT_COMPLETED, COMPENSATING, COMPLETED, FAILED) and transitions (events). Each transition triggers a method that calls an external service.

When you call an external service, you transition the state to 'pending'. Then, on response, you transition to 'completed' or 'failed'. If failed, you transition to 'compensating', and the actions configured on that transition fire the compensating action for each completed step in reverse order. Spring Statemachine does not do this for you; the reverse walk is logic you wire into the machine.
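The state and transition model can be sketched without any framework. The listing below is a plain-Java transition table, not Spring Statemachine configuration. The state names come from the list above; the event names (RESERVE_INVENTORY, STEP_SUCCEEDED, and so on) and the exact set of allowed transitions are assumptions made for illustration.

```java
import java.util.EnumMap;
import java.util.Map;

enum SagaState {
    STARTED, INVENTORY_PENDING, INVENTORY_COMPLETED, PAYMENT_PENDING,
    PAYMENT_COMPLETED, COMPENSATING, COMPLETED, FAILED
}

// Hypothetical event names; the article does not list its events.
enum SagaEvent {
    RESERVE_INVENTORY, CHARGE_PAYMENT, STEP_SUCCEEDED, STEP_FAILED,
    TIMEOUT, FINISH, COMPENSATION_DONE
}

final class SagaTransitions {
    private static final Map<SagaState, Map<SagaEvent, SagaState>> TABLE =
            new EnumMap<>(SagaState.class);

    private static void allow(SagaState from, SagaEvent on, SagaState to) {
        TABLE.computeIfAbsent(from, s -> new EnumMap<SagaEvent, SagaState>(SagaEvent.class))
             .put(on, to);
    }

    static {
        allow(SagaState.STARTED, SagaEvent.RESERVE_INVENTORY, SagaState.INVENTORY_PENDING);
        allow(SagaState.INVENTORY_PENDING, SagaEvent.STEP_SUCCEEDED, SagaState.INVENTORY_COMPLETED);
        allow(SagaState.INVENTORY_PENDING, SagaEvent.STEP_FAILED, SagaState.COMPENSATING);
        allow(SagaState.INVENTORY_PENDING, SagaEvent.TIMEOUT, SagaState.COMPENSATING);
        allow(SagaState.INVENTORY_COMPLETED, SagaEvent.CHARGE_PAYMENT, SagaState.PAYMENT_PENDING);
        allow(SagaState.PAYMENT_PENDING, SagaEvent.STEP_SUCCEEDED, SagaState.PAYMENT_COMPLETED);
        allow(SagaState.PAYMENT_PENDING, SagaEvent.STEP_FAILED, SagaState.COMPENSATING);
        allow(SagaState.PAYMENT_PENDING, SagaEvent.TIMEOUT, SagaState.COMPENSATING);
        allow(SagaState.PAYMENT_COMPLETED, SagaEvent.FINISH, SagaState.COMPLETED);
        allow(SagaState.COMPENSATING, SagaEvent.COMPENSATION_DONE, SagaState.FAILED);
    }

    /** Returns the next state, or throws if the event is not legal in the current state. */
    static SagaState next(SagaState current, SagaEvent event) {
        Map<SagaEvent, SagaState> row = TABLE.get(current);
        SagaState target = (row == null) ? null : row.get(event);
        if (target == null) {
            throw new IllegalStateException("No transition from " + current + " on " + event);
        }
        return target;
    }
}
```

Rejecting illegal transitions loudly is the design choice worth copying: a late STEP_SUCCEEDED arriving after a TIMEOUT should fail visibly rather than drag a compensating saga back to a completed state.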

Key config: set a timeout on every state. If the external service doesn't respond in 10 seconds, the state machine fires a 'timeout' event that transitions to 'failed' and triggers compensation. This prevents stuck sagas.

We ran into the 'N+1 state problem' early on. Too many states = too many events = state machine spaghetti. Keep it simple. Max 8 states for a typical saga. If you need more, you probably need a different split of services.

Another pattern: use a SagaRepository that persists the entire saga instance (current state, completed steps, correlation ID). This lets you restart the orchestrator and pick up where you left off. In production, we had a heartbeat thread that periodically scans for sagas stuck in 'pending' state for more than 60 seconds and re-drives them.
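The scan for stuck sagas might look roughly like this. It is a sketch under assumptions: InMemorySagaRepository stands in for a database-backed SagaRepository, the SagaInstance fields are illustrative, and "pending" is detected by the _PENDING suffix on the state name. The actual re-drive (re-sending the command with the same idempotency key) is left out.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

/** Persisted view of a saga; field names are illustrative. */
record SagaInstance(String sagaId, String state, Instant stateEnteredAt) { }

/** In-memory stand-in for a database-backed saga repository. */
final class InMemorySagaRepository {
    private final Map<String, SagaInstance> bySagaId = new ConcurrentHashMap<>();

    void save(SagaInstance saga) { bySagaId.put(saga.sagaId(), saga); }

    /** IDs of sagas that have sat in a *_PENDING state for longer than maxAge, sorted. */
    List<String> findStuck(Instant now, Duration maxAge) {
        Instant cutoff = now.minus(maxAge);
        return bySagaId.values().stream()
                .filter(s -> s.state().endsWith("_PENDING"))
                .filter(s -> s.stateEnteredAt().isBefore(cutoff))
                .map(SagaInstance::sagaId)
                .sorted()
                .collect(Collectors.toList());
    }
}
```

A scheduled job would call findStuck(Instant.now(), Duration.ofSeconds(60)) and re-drive each result. With a real database this becomes a single indexed query on state and timestamp rather than a scan.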

Never Do This:

Don't put the external service call inside the state machine action. It blocks the state machine thread. Use asynchronous actions (start the call, return immediately, then handle callback to transition). We had a production outage when a slow payment service blocked the state machine thread pool and all sagas stalled.
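One way to keep the calling thread free is to run the call on a dedicated pool and turn the result into an event when it completes. The sketch below uses only the JDK's CompletableFuture. AsyncSagaAction and the Outcome names are hypothetical, and the Supplier stands in for whatever client actually calls the external service.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

final class AsyncSagaAction {
    enum Outcome { STEP_SUCCEEDED, STEP_FAILED, TIMEOUT }

    /**
     * Starts the external call on the given pool and returns immediately.
     * The returned future completes with the event to feed back into the state machine.
     */
    static CompletableFuture<Outcome> call(Supplier<Boolean> externalCall,
                                           Executor sagaPool,
                                           long timeoutMillis) {
        return CompletableFuture.supplyAsync(externalCall, sagaPool)
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
                .handle((ok, error) -> {
                    if (error == null) {
                        return Boolean.TRUE.equals(ok) ? Outcome.STEP_SUCCEEDED : Outcome.STEP_FAILED;
                    }
                    // Exceptions thrown by the supplier arrive wrapped in CompletionException.
                    Throwable cause = (error instanceof CompletionException && error.getCause() != null)
                            ? error.getCause() : error;
                    return (cause instanceof TimeoutException) ? Outcome.TIMEOUT : Outcome.STEP_FAILED;
                });
    }
}
```

The caller attaches a callback (for example thenAccept) that sends the resulting event to the state machine, so no state machine thread waits on I/O. One caveat: orTimeout completes the future but does not interrupt the underlying call, so a slow service still occupies a pool thread until the client's own timeout fires. Set that timeout too.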

Production Insight

Use a thread pool executor separate from the web server thread pool for saga actions. We had a 10-thread pool in 2021 — not enough. Now we use a 50-thread pool with a separate queue. Blocking the state machine is the most common production failure I've seen.

Key Takeaway

Async every external call in a saga action. Blocking the state machine is a cardinal sin.

Testing Sagas: Build a Chaos Lab, Not a Unit Test

Unit tests won't save you. Integration tests might. But the real test is running a saga while every external service fails intermittently.

Here's what we do at TheCodeForge: we have a 'chaos saga' test environment. It deploys all microservices with WireMock stubs that randomly return 500 errors, timeouts, or malformed responses for 5% of requests. We run 10,000 fake orders through it every night.

The tests check: does the saga eventually complete or fail cleanly? Are the compensating actions idempotent? Does the inventory balance stay accurate? Is there any data leak?

This caught a bug where the payment service's compensating action had a race condition — it would sometimes refund twice if the saga was fast enough. The test found it because the chaos environment had random latencies.

You also need to test saga timeouts. What happens if the orchestrator itself is slow? We simulate high CPU on the orchestrator pod. The sagas should time out and compensate, not hang forever.

One more thing: test the 'saga recovery' path. Kill the orchestrator pod mid-saga, restart it, and verify the saga picks up from the last persisted state. This is a common failure in production — orchestrator restarts due to OOM or config errors.

Don't forget to test concurrent sagas for the same resource. If two sagas try to decrement the same inventory stock, one should fail with a concurrency error and compensate. Your saga should handle OptimisticLockException.
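The version check behind that kind of failure can be illustrated without JPA. In the sketch below an AtomicReference plays the role of a row with a version column. StockRow and VersionedStock are hypothetical names, and a false return stands in for the OptimisticLockException a JPA @Version field would raise.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Immutable snapshot of a stock row with a version column. */
record StockRow(int quantity, long version) { }

final class VersionedStock {
    private final AtomicReference<StockRow> row;

    VersionedStock(int quantity) {
        this.row = new AtomicReference<>(new StockRow(quantity, 0));
    }

    /** Each saga reads a snapshot first, then tries to write against that snapshot. */
    StockRow read() { return row.get(); }

    /**
     * Mirrors UPDATE stock SET quantity = ?, version = version + 1 WHERE version = ?.
     * Succeeds only if nobody else has written since the snapshot was read.
     * The snapshot must be the object returned by read(): the swap compares references.
     */
    boolean decrementIfUnchanged(StockRow seen, int amount) {
        if (seen.quantity() < amount) {
            return false; // not enough stock in the snapshot
        }
        StockRow updated = new StockRow(seen.quantity() - amount, seen.version() + 1);
        return row.compareAndSet(seen, updated);
    }
}
```

If two sagas read the same snapshot of a single remaining unit, the first decrement wins and the second returns false. The losing saga should then compensate its earlier steps instead of retrying blindly against stock that is gone.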

Senior Shortcut:

Don't mock the orchestrator. Use WireMock for external services, but run the real saga orchestrator with real state machine and real DB. You want to test the real async behavior, especially timeouts and retries.

Production Insight

In 2022, we missed a bug where the compensating action for the shipping service had a typo in the URL. The chaos test caught it on day 2 because the stub returned 404 on that endpoint. Without chaos testing, it would have hit production and caused a shipping label that couldn't be voided.

Key Takeaway

Run chaos tests every night. Test with random failures, timeouts, high latency, and orchestrator restarts. If your saga passes those, you can sleep at night.

Monitoring and Observing Sagas in Production

You can't debug a saga by reading logs. You need structured observability.

We use OpenTelemetry to trace every saga step. Each saga gets a unique trace ID. Every service call (external or compensating) is a span under that trace. This gives us a waterfall view in Jaeger or Datadog. When a saga fails, we can see exactly which service failed and at what latency.

We also export metrics: saga_total, saga_completed, saga_failed, saga_compensating, and saga_duration_seconds. Alert when the ratio of saga_failed to saga_total exceeds 0.01. That means 1 in 100 sagas is failing, and you have a problem.

Key metric: saga_compensating_duration_seconds. If this spikes, your compensating actions are slow. That's bad because long compensations increase the window for data inconsistency. We keep a separate histogram for compensating actions.

Logging: every saga step logs at INFO level with the saga ID, step name, status, and duration. The compensating action logs at WARN level. This makes grep-based debugging fast.

We also have a 'Saga Admin Dashboard' — a simple React app that queries the saga orchestrator's persistence DB. It shows all active sagas, their state, time in state, and a 'Force Compensate' button. This is a must-have for production ops. You will need to manually nuke a stuck saga.

Corner case: what about sagas that are 'stuck' because the service is down? The orchestrator should time out and compensate. But what if the compensation itself fails because the service is down? Then you need a dead-letter queue (DLQ). The DLQ worker retries compensations with exponential backoff. After 5 retries, it sends a Slack alert to the on-call engineer.
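The retry schedule for the DLQ worker is simple arithmetic. Here is a sketch assuming a doubling delay with no jitter or cap. CompensationRetryPolicy is a hypothetical name, and the one-second base delay in the usage note is an assumption; only the "5 retries, then alert" figure comes from the text above.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

/** Exponential backoff for retrying failed compensations from a dead-letter queue. */
final class CompensationRetryPolicy {
    private final Duration baseDelay;
    private final int maxRetries;

    CompensationRetryPolicy(Duration baseDelay, int maxRetries) {
        if (maxRetries < 1 || maxRetries > 30) {
            throw new IllegalArgumentException("maxRetries must be between 1 and 30");
        }
        this.baseDelay = baseDelay;
        this.maxRetries = maxRetries;
    }

    /** Delay before retry number attempt (1-based): baseDelay * 2^(attempt - 1). */
    Duration delayBefore(int attempt) {
        if (attempt < 1 || attempt > maxRetries) {
            throw new IllegalArgumentException("attempt out of range: " + attempt);
        }
        return baseDelay.multipliedBy(1L << (attempt - 1));
    }

    /** True once every retry has been used up and a human should be paged. */
    boolean shouldAlert(int failedAttempts) {
        return failedAttempts >= maxRetries;
    }

    /** The full list of delays, in order. */
    List<Duration> schedule() {
        List<Duration> delays = new ArrayList<>();
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            delays.add(delayBefore(attempt));
        }
        return delays;
    }
}
```

With a one-second base and five retries, the delays are 1, 2, 4, 8, and 16 seconds, and shouldAlert turns true after the fifth failure. In practice you would likely add random jitter and an upper bound so that many failed compensations don't retry in lockstep.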

Interview Gold:

Question: 'What metrics do you track for sagas?' Answer: saga completion rate, compensation duration, time-in-state, and dead-letter queue depth. Bonus points: mention you alert on compensation duration > P99 of normal step duration because compensations should be faster than the forward operation.

Production Insight

We have a Slack bot that posts a message every time a saga is stuck in 'COMPENSATING' for more than 5 minutes. It links to the Jira issue for the offending saga. This turns a production fire into a process improvement.

Key Takeaway

Observability is not optional. Trace every saga step, record metrics for every compensating action, and have a manual override dashboard. Your on-call engineer will thank you.

When NOT to Use Saga: The 2PC Trap

I've seen teams try to use Two-Phase Commit (2PC) across microservices. Don't. 2PC requires a global coordinator that holds locks across all participating databases. If one database becomes unavailable, all participants are blocked until it recovers. This is a distributed deadlock machine.

Sagas don't hold locks. Each service commits its local transaction immediately. If something fails later, you compensate. This means you have eventual consistency. There's a window where the data is inconsistent (e.g., inventory decremented but not yet paid). This is acceptable for most business domains.

But there are cases where a single atomic transaction is the right call: within a single service boundary, using the same database. For example, if you need to transfer money between two accounts in the same bank, use a single database transaction. Not a saga, and not 2PC either.

Also, don't use saga for operations that must be immediately consistent. For example: 'reserve a seat on a flight and issue a boarding pass'. If the payment fails, you can't un-issue a boarding pass. Use a local transaction or a different pattern.

Our rule: if you can tolerate a 10-second window of inconsistency, use a saga. If not, use a local transaction. Never cross service boundaries with 2PC.

I once saw a team implement a saga for a 'create user' flow. User service, email service, CRM service. They had compensating actions that deleted the user, unsubscribed from email, and removed from CRM. This was overkill. A simple eventual consistency with a retry queue would have been simpler. Don't over-architect. Sagas have operational cost.

Production Trap:

Never use a saga for operations that involve physical resources that cannot be undone. For example, shipping a physical product, printing a document, or sending a 'you're hired' email. Compensating actions for these are difficult or impossible.

Production Insight

We had a 'badge printing' saga that compensated by sending an 'oops' email. It was terrible. Users got badges they shouldn't have. We switched to a local transaction with a pre-print validation check.

Key Takeaway

Sagas are for eventual consistency. If you need strong consistency, use a local transaction. If you can't, 2PC is a trap. Choose saga when the business can tolerate a short inconsistency window.

The Two-Phase Commit Pipe Dream

You know what's worse than a distributed transaction? Pretending you have one. The Two-Phase Commit (2PC) protocol was designed for a transaction manager coordinating a handful of tightly coupled databases, not for independently deployed microservices. It requires a global coordinator to lock resources across services. In a distributed system, that coordinator becomes a single point of failure and a performance bottleneck. If any participant fails during the commit phase, you're stuck holding locks that no one can release.

Production systems that attempt 2PC across service boundaries usually end up with cascading failures and deadlocks. The real problem is that 2PC assumes all participants are available and trustworthy. In a world of network partitions and partial failures, that assumption is naive. You cannot guarantee atomic commit across independent databases without sacrificing availability. That's the CAP theorem doing its thing.

So why does 2PC keep appearing in architecture discussions? Because it sounds clean and academic. In practice, it's a trap that will burn your production environment. Saga patterns exist because they embrace the reality that failures happen and you need to handle them gracefully.

TwoPhaseCommitTrap.java · JAVA

// io.thecodeforge — java tutorial
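// Anti-pattern: never use this in production
public class TwoPhaseCommitTrap {
    // orderService, paymentService and inventoryService are assumed to be
    // injected clients for three separately deployed services.
    @Transactional
    public void createOrder(Order order) {
        orderService.save(order);        // local write: holds a row lock until commit
        paymentService.charge(order);    // remote call: the local lock is held while it waits
        inventoryService.reserve(order); // remote call: a failure here leaves the payment charged
        // Without XA, @Transactional rolls back only the local write.
        // With XA/2PC, a failure here can leave every participant's locks held.
    }
}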

Output

ERROR: Deadlock detected. Transaction rolled back.

ERROR: Coordinator unavailable. Locks held indefinitely.

Production Trap:

If you see @Transactional spanning multiple service calls, someone is about to learn what a distributed deadlock looks like. That annotation works fine for a single database. Across services, it's lying to you.

Key Takeaway

If you need distributed transactions, you need a saga, not a 2PC. Two-phase commit is for databases, not services.

Database Per Service: Why Your Monolith Friends Are Wrong

Here's the hard truth that every junior wants to argue about: each microservice gets its own database. Not a shared schema. Not a separate table in the same cluster. Its own database. This isn't about being fancy; it's about survival.

When service A and service B share a database, they share coupling. A schema change in service A now requires coordinated deployments with service B. That kills your ability to deploy independently. It also means a poorly written query in service A can bring down service B's performance.

The database-per-service pattern enforces boundaries. Each service owns its data and exposes it through APIs. If service B needs service A's data, it calls service A's API. Yes, this means you lose the ability to do JOIN queries across services. You lose foreign keys. You lose relational integrity at the database level. That's the trade-off. In exchange, you get services that can scale independently, use different database technologies, and evolve their schemas without breaking the entire system.

This pattern is the foundation that makes sagas necessary. Without separate databases, you wouldn't need sagas at all. You'd just use a single ACID transaction.

DatabasePerService.java · JAVA

// io.thecodeforge — java tutorial
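// Each service has its own datasource.
// The two classes below belong to different services (separate codebases and
// deployments); they are shown side by side only for comparison.
// Credentials are hard-coded for illustration; load them from configuration
// or a secrets store in a real deployment.
import javax.sql.DataSource;
import org.springframework.boot.jdbc.DataSourceBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Primary;

// order-service
@Configuration
class OrderServiceDatabaseConfig {
    @Bean
    @Primary
    public DataSource orderDataSource() {
        return DataSourceBuilder.create()
            .url("jdbc:postgresql://orders-db:5432/orders")
            .username("orders_user")
            .password("orders_pass")
            .build();
    }
}

// payment-service
@Configuration
class PaymentServiceDatabaseConfig {
    @Bean
    public DataSource paymentDataSource() {
        return DataSourceBuilder.create()
            .url("jdbc:postgresql://payments-db:5432/payments")
            .username("payments_user")
            .password("payments_pass")
            .build();
    }
}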

Output

Service deployment order: 5 seconds

Service deployment payment: 5 seconds

No schema conflicts across services

The One Exception:

If two services truly cannot function without shared data access (for example, an aggregate that spans both services), you probably have a bounded context problem, not a database problem. Fix the service boundaries.

Key Takeaway

One service, one database. If you're sharing databases between services, you're building a distributed monolith with extra network calls.

● Production incident · POST-MORTEM · severity: high

The Phantom Inventory Leak

Symptom

Customers got shipping confirmations for out-of-stock items. Fraud team flagged 400+ orders where payment succeeded but inventory was zero. Support queue exploded. Revenue loss: $80k in refunds and shipping label charges.

Assumption

The dev assumed the payment service was idempotent and would correctly fail when inventory was low. The payment service itself was not the problem — the coordination between services was.

Root cause

The order flow used a distributed transaction with no saga. The payment service failed for 90 seconds (a transient network blip). The order service treated this as a permanent failure and didn't execute compensating actions. Inventory was decremented before payment confirmed. No compensating increment was issued to the inventory service.

Fix

1. Implemented a saga orchestrator using Spring Boot 3.x with a state machine (event-driven with Spring Statemachine).
2. Made every service expose a compensating endpoint (PUT /api/compensate).
3. Added idempotency keys to all saga steps.
4. Wrote integration tests that simulate payment failure mid-flow.
5. Deployed with a 5-second timeout; any step taking longer triggers compensation immediately.

Key lesson

Distributed transactions are lies.
If you don't write compensating actions, you are one network timeout away from a data leak.
Test your compensations with chaos engineering before you need them.

Production debug guide: symptom → root cause → fix for the failures that actually happen (4 entries)

Symptom · 01

Orders stuck in 'PENDING' state and never completing or failing back.

→

Fix

Check the saga orchestrator logs for the correlation IDs. Look for missing compensating calls. Most likely the orchestrator's retry logic is exhausted or the compensating endpoint is not idempotent. Check the orchestrator's persistence store — is the saga status saved? If yes, manually query the DB. If no, you have a race condition in your state machine.

Symptom · 02

Duplicate charges on the payment provider.

→

Fix

Idempotency key violation. The saga retried a step after a timeout, but the step actually succeeded. The payment service didn't check the idempotency key. Fix: make all saga steps idempotent using a unique correlation ID per saga instance. For payments, store the charge ID returned by the provider on the first attempt; use it to skip the second.

Symptom · 03

Inventory shows negative stock after a failed order.

→

Fix

Compensating action didn't run. Either the orchestrator crashed before issuing the compensate, or the compensate endpoint failed silently. Check orchestrator logs for the 'compensate' event. Add a dead-letter queue for failed compensations. Write a reconciliation job that runs hourly to detect inventory discrepancies.

Symptom · 04

Saga times out on a third-party API call (e.g., fraud check).

→

Fix

Third-party APIs are unreliable. Set a circuit breaker on the call. If it times out, don't block the saga — fail fast and compensate. Add a manual retry mechanism for operators. In production, we had a 'Saga Admin' endpoint that could retry a failed step from a known safe state.

★ Debug Cheat Sheet: commands for fast diagnosis in production

Order stuck in saga pending

Immediate action

Query saga orchestrator state

Commands

kubectl logs -l app=saga-orchestrator --tail=1000 | grep 'saga_id=abc123'

curl -X GET http://saga-orchestrator:8080/api/saga/abc123/status

Fix now

If status is STARTED, expire it: curl -X PUT http://saga-orchestrator:8080/api/saga/abc123/compensate

Saga Pattern vs Two-Phase Commit (2PC)

Feature	Saga Pattern	Two-Phase Commit (2PC)
Consistency Model	Eventual consistency	Strong consistency (ACID)
Lock Duration	No locks (each step commits immediately)	Locks held during 'prepare' phase, can block all participants
Failure Handling	Compensating actions (business rollback)	Automatic rollback by coordinator
Performance	High (no blocking)	Lower (blocking during prepare/commit)
Complexity	Moderate (need idempotency, compensation logic)	High (requires coordinator, global XA transactions)
Use Case	Multi-service order flow, booking systems	A few tightly coupled resources under one transaction manager, within a single service
Resilience	High (can compensate and retry)	Low (coordinator is single point of failure)

Key takeaways

Compensating actions are business transactions, not database rollbacks. They must be idempotent.

Never block the state machine thread. Async every external call.

Persist saga state before calling external services. Always.

Chaos test your sagas with random failures, timeouts, and orchestrator restarts.

Orchestration is better for complex compensation logic. Choreography for simple linear flows.

Common mistakes to avoid

5 patterns

Not making compensating actions idempotent.

Symptom

Duplicate compensations cause data corruption (e.g., inventory incremented twice).

Fix

Use a unique compensation ID per saga step. Check if the compensation has already been processed before applying it.

Blocking the saga orchestrator thread with an external HTTP call.

Symptom

All sagas stall when one external service is slow. Thread pool exhaustion causes orchestrator to reject new sagas.

Fix

Use asynchronous HTTP clients (WebClient) or message queues. The state machine should not block on I/O.

Not persisting saga state before calling external services.

Symptom

Orchestrator crashes mid-saga. On restart, the saga is lost, and data is inconsistent (e.g., decremented inventory but no compensating action triggered).

Fix

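Persist saga state (current step, completed steps) in a database before each external call. Commit that state change in its own short transaction; don't wrap the external call in the same @Transactional, or the state won't be durable until the call returns.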

Using 2PC across microservices instead of saga.

Symptom

Distributed deadlocks, blocked databases, and application timeouts during network partitions.

Fix

Replace 2PC with saga pattern. Accept eventual consistency. Implement compensating actions.

Skipping chaos testing for sagas.

Symptom

Compensating actions fail silently in production when a service is temporarily unavailable. Data corruption goes unnoticed for days.

Fix

Set up a chaos test environment that randomly fails external services, introduces latency, and kills orchestrator pods.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

You have a saga that performs inventory decrement, payment capture, and ...

Q02 · SENIOR

What's the difference between a compensating action and a database rollb...

Q03 · SENIOR

You need to implement a saga that handles a flight booking, hotel reserv...

Q04 · SENIOR

Why is 2PC a bad choice for microservices?

Q05 · SENIOR

How do you handle a saga that has a non-idempotent step?

Q06 · JUNIOR

Explain the difference between choreography and orchestration in sagas.

Q07 · SENIOR

You test your saga and it works. In production, the orchestrator OOMs. W...

Q08 · SENIOR

What metrics would you monitor for a production saga system?

Q01 of 08 · SENIOR

You have a saga that performs inventory decrement, payment capture, and shipping label creation. The payment service times out. The inventory was already decremented. How do you ensure the inventory is not leaked?

ANSWER

The saga orchestrator should have a timeout on the payment step. When the timeout fires, the orchestrator retries the payment step a configurable number of times, reusing the same idempotency key so a charge that actually went through is not duplicated. If the retries are exhausted, it transitions to the 'compensating' state and executes the compensating action for inventory (increment quantity). The compensating action must be idempotent: it checks a compensation log before incrementing. The saga is then marked as failed. The key is that the orchestrator persists its state before each external call, so if it crashes during compensation, it can resume on restart.
