Distributed Transactions and 2PC: When Atomicity Goes Wrong in Production
Distributed transactions and 2PC explained with production failures, code, and debugging.
20+ years shipping large-scale distributed systems. Lessons pulled from things that broke in production.
Distributed transactions use 2PC to ensure all participants commit or abort together. The coordinator sends a prepare request, waits for all votes, then sends commit or abort. If any participant votes no, the entire transaction aborts. This guarantees atomicity but adds latency and blocking risks.
Imagine three friends deciding where to eat. First, everyone votes (prepare phase). If all agree, they go (commit). If anyone disagrees, they stay home (abort). But if one friend goes silent after voting, the others wait forever — that's the blocking problem in 2PC.
Distributed transactions are a lie we tell ourselves. We pretend multiple databases can atomically agree on a state change, but in production, that agreement breaks under network partitions, coordinator crashes, and timeouts. I've seen a payment service double-charge customers because a 2PC coordinator crashed after the prepare phase but before commit — the participants were left in doubt, and a manual recovery script went wrong.
You need distributed transactions when a single operation spans multiple systems and must be all-or-nothing: transferring money between banks, placing an order that deducts inventory and charges a card, or booking a flight and hotel together. Without atomicity, you get inconsistent state — money lost, inventory oversold, customers furious.
By the end of this, you'll know exactly how 2PC works under the hood, where it fails, and how to debug it in production. You'll also know when to walk away and use a saga instead.
Why You Can't Just Use ACID Across Databases
ACID transactions work inside a single database because the database controls all resources — locks, logs, recovery. Across databases, there's no shared lock manager, no global transaction log. Without coordination, you get partial failures: inventory deducted but payment not charged, or vice versa.
Before 2PC, teams hacked around this with two-phase writes: write to DB1, then write to DB2. If the second write fails, you're in inconsistent state. Some added a compensating transaction (manual rollback), but that's fragile and error-prone. I've seen a team try to use database triggers to sync two MySQL instances — it ended with a split-brain and data corruption.
2PC solves this by introducing a coordinator that orchestrates the commit decision. But it's not magic — it has its own failure modes.
How 2PC Works: The Coordinator's Dance
2PC has two phases: prepare and commit. The coordinator sends a 'prepare' request to all participants. Each participant writes the transaction to a durable log (so it can survive crashes) and replies 'yes' (ready) or 'no' (abort). If all say yes, the coordinator sends 'commit'. If any says no, it sends 'abort'.
Participants must block after voting yes — they hold locks on the modified data until they receive the final decision. This is the blocking property that makes 2PC risky for long-running transactions.
Here's a minimal implementation in Go using a coordinator and two participants (PostgreSQL and Redis). The coordinator uses a transaction log to recover after crashes.
The Blocking Problem: Why 2PC Can Freeze Your System
After a participant votes 'yes' in the prepare phase, it must hold locks on the modified data until it receives the final commit or abort. If the coordinator crashes after collecting all 'yes' votes but before sending commit, participants are stuck — they can't release locks because they don't know the outcome. This is the blocking problem.
In production, this means your database connections pile up, lock contention spikes, and eventually the system grinds to a halt. I've seen a PostgreSQL instance hit max_connections because 200 prepared transactions were holding locks on the same row.
The fix is to set a timeout on prepared transactions. PostgreSQL has idle_in_transaction_session_timeout — set it to a few seconds. But be careful: if the timeout fires before the coordinator recovers, you might abort a transaction that should have committed. That's why you need a robust coordinator recovery mechanism.
idle_in_transaction_session_timeout to a value lower than your expected coordinator recovery time. If the coordinator takes 30 seconds to restart and your timeout is 10 seconds, you'll abort transactions that should have committed. Set it to at least 2x the coordinator's expected recovery time.When 2PC Fails: The Three Failure Modes You'll Actually Hit
- Coordinator crash after prepare, before commit: Participants are in-doubt. On restart, the coordinator reads its log and sends commit/abort. If the log is lost, you must manually query each participant and decide.
- Participant crash after voting yes: The coordinator sees a timeout and aborts the transaction. But the participant might have already committed locally if it crashed after writing the commit to its log but before responding. This leads to a 'heuristic' outcome — the participant's actual state differs from the coordinator's decision.
- Network partition during commit: The coordinator sends commit but some participants don't receive it. They remain prepared. The coordinator might retry, but if the partition lasts too long, participants timeout and abort unilaterally — causing inconsistency.
For each failure mode, you need a recovery procedure. The coordinator should have a 'transaction manager' that periodically scans for in-doubt transactions and resolves them by re-contacting participants.
Performance: Why 2PC Is Slow and How to Make It Less Painful
2PC adds at least two round trips per participant (prepare + commit/abort). With three participants across different data centers, that's 6 network round trips plus disk I/O for logging. Latency adds up fast.
- Presumed Abort: If the coordinator doesn't hear back from a participant, assume abort. This reduces the need for a third phase but can cause premature aborts.
- Read-Only Optimization: If a participant only reads data, it can skip the prepare phase and vote 'read-only' — it doesn't need to hold locks.
- Batching: Group multiple transactions into one 2PC round trip. Useful for high-throughput scenarios.
But honestly, if you need high throughput, don't use 2PC. Use sagas with compensating actions. 2PC is for low-volume, high-value transactions where consistency is paramount.
When NOT to Use 2PC: The Saga Alternative
2PC is overkill for most microservices architectures. If your transaction spans services that don't share a database, consider a saga — a sequence of local transactions with compensating actions for rollback.
Sagas come in two flavors: choreography (each service publishes events) and orchestration (a coordinator tells each service what to do). Orchestrated sagas look similar to 2PC but without the blocking — each step commits immediately, and compensation is a separate transaction.
- You need strong consistency (e.g., financial transfers)
- Participants are within the same trust boundary (e.g., same organization)
- Transaction volume is low (e.g., < 100 TPS)
- You can tolerate eventual consistency
- Participants are across different teams or organizations
- You need high throughput
I've seen teams blindly adopt 2PC for every cross-service operation and then wonder why their system is slow and brittle. Don't be that team.
Production Debugging: How to Unstick a Stuck 2PC Transaction
When a 2PC transaction gets stuck, you need to identify the coordinator and participants, then manually resolve. Here's the playbook:
- Find the coordinator: Check the coordinator's logs for the transaction ID. If the coordinator is down, restart it and let it recover.
- Query participants: For each participant, check if the transaction is in 'prepared' state. In PostgreSQL:
SELECT * FROM pg_prepared_xacts;. In MySQL:XA RECOVER;. - Decide commit or abort: Based on business rules. If the coordinator's log says 'commit', commit on all participants. If 'abort', rollback. If log is lost, you need to infer from participant states — if all participants are prepared, it's safe to commit; if some are aborted, abort all.
- Execute: Use
COMMIT PREPARED 'txid'orROLLBACK PREPARED 'txid'on each participant.
Automate this with a script that queries all participants and resolves based on a majority vote or coordinator log.
pg_prepared_xacts has more than 5 rows. That's a sign of a stuck coordinator. Automate the resolution with a script that checks the coordinator's health and resolves in-doubt transactions.The 3AM Coordinator Crash That Double-Charged Customers
commit_if_prepared flag that prevents double-committing.- Always make your 2PC transactions idempotent.
- A coordinator crash and recovery can replay commits — your participants must handle that gracefully.
SELECT * FROM pg_prepared_xacts;. 4. If coordinator log says commit, run COMMIT PREPARED 'txid' on all participants. If abort, run ROLLBACK PREPARED. If log lost, manually decide based on business rules.coordinator_timeout (default 60s). 4. If participant is unresponsive, abort the transaction and retry.SELECT gid, prepared, owner FROM pg_prepared_xacts;SELECT count(*) FROM pg_prepared_xacts;COMMIT PREPARED 'txid'; or ROLLBACK PREPARED 'txid';Key takeaways
Interview Questions on This Topic
How does 2PC handle a coordinator crash after all participants have voted yes but before commit is sent?
Frequently Asked Questions
20+ years shipping large-scale distributed systems. Lessons pulled from things that broke in production.
That's Distributed Systems. Mark it forged?
5 min read · try the examples if you haven't