DBMS Deadlock — Lock Order Bugs That Freeze Payment Systems
Two procedures locked accounts in opposite order, spiking latency from 20ms to 60s.
- Deadlock is a permanent block when two or more transactions each hold a lock and wait for another's lock
- Four Coffman conditions must hold simultaneously: mutual exclusion, hold-and-wait, no preemption, circular wait
- Detection uses wait-for graphs; cycles indicate a deadlock
- Prevention costs throughput — you break one Coffman condition but pay in performance
- Recovery typically kills one victim transaction, rolling back its changes
- Biggest mistake: assuming
depends_onin tools equals deadlock prevention — it's about lock order, not start order
Imagine two kids at a dinner table. Kid A grabs the ketchup and waits for Kid B to pass the salt. Kid B already grabbed the salt and is waiting for Kid A to pass the ketchup. Neither will let go. Neither can move forward. That's a deadlock — two parties permanently frozen, each holding something the other needs. In a database, transactions are the kids and locks on data rows are the condiments.
Every production database will eventually deadlock. It's not a sign that something is broken — it's a mathematical certainty when multiple transactions compete for overlapping resources. What separates a system that handles deadlocks gracefully from one that freezes under load is understanding the four conditions that create them, the algorithms that detect them, and the prevention strategies that trade throughput for safety. Netflix, Amazon, and every major bank have war stories about deadlocks that brought services down at peak traffic. This isn't academic — it's survival knowledge for anyone building data-intensive systems.
The root problem is that databases need locks to guarantee isolation — one of the four ACID properties. But the moment you have locks, you have the possibility of circular waiting. Two transactions each hold a resource and each wants a resource the other holds. Without intervention, they'll wait forever. The database has to pick a strategy: never let it happen (prevention), detect it after the fact and break the cycle (detection + recovery), or let the caller handle timeouts (avoidance via timeouts). Each strategy has real performance costs.
By the end of this article you'll be able to draw a resource allocation graph and spot a deadlock cycle in it, explain exactly how PostgreSQL's deadlock detector works and when it fires, write transaction code that minimises deadlock probability, and answer the trick interview questions about why deadlock prevention can be worse for throughput than detection. You'll also understand the four Coffman conditions and why you only need to break one of them to eliminate deadlocks entirely.
What is Deadlock in DBMS? The Four Coffman Conditions
A deadlock requires exactly four conditions to hold simultaneously. That's not a simplification — it's a theorem from Coffman et al. (1971). Break any one, and the deadlock disappears. Here's each condition with a database example.
- Mutual Exclusion: Resources (rows, tables, pages) can be held by only one transaction at a time. If you could share every resource, no problem. But writes need exclusive access.
- Hold and Wait: A transaction holds at least one lock while waiting for another. This is the default behavior — acquire lock A, then request lock B without releasing A.
- No Preemption: The database can't forcibly take a lock from a transaction. Only the transaction itself releases its locks (usually at commit or rollback).
- Circular Wait: There's a cycle in the lock-wait graph. Transaction T1 waits for a lock held by T2, T2 waits for T3, ..., Tn waits for T1.
The practical insight: prevention strategies target one condition. For example, using a lock ordering protocol (e.g., always lock accounts by ascending ID) prevents circular wait without changing anything else. That's the most common production fix.
- Each transaction holds a lock (a resource), and waits for another. The graph of 'holds' and 'waits' must contain a cycle.
- The cycle is the deadlock. If there's no cycle, no deadlock — even if transactions wait a long time.
- Breaking any edge in the cycle (by killing a transaction or releasing a lock) resolves the deadlock.
Deadlock Detection Algorithms: Wait-For Graph and Wound-Wait
Detection is the most common real-world strategy. Instead of preventing deadlocks (which costs throughput), allow them to happen, but catch them quickly and break them. There are two main detection approaches:
Wait-For Graph (WFG): The database maintains a directed graph where nodes are transactions, and an edge Ti → Tj means Ti is waiting for a lock held by Tj. Periodically, or when a lock wait exceeds a threshold, the system checks for cycles. If a cycle exists, choose a victim transaction and abort it (preferably the one that will cause the least work to rollback).
Example: PostgreSQL uses a background deadlock detector that runs every deadlock_timeout (default 1 second). It builds a wait-for graph from the lock manager's state and searches for cycles using a depth-first search. If found, it aborts the youngest transaction in the cycle. The error message tells you exactly which transaction was killed and why.
MySQL/InnoDB uses a different approach: it detects deadlocks immediately upon lock acquisition that would cause a cycle. It automatically chooses a transaction to roll back (the one that holds fewer locks or has done less work). You can see the last deadlock with SHOW ENGINE INNODB STATUS.
Wound-Wait and Wait-Die: These are both prevention and detection hybrids using transaction age (timestamp). Older transactions get priority. In Wound-Wait, if an older transaction requests a lock held by a younger one, it 'wounds' the younger (aborts it). In Wait-Die, older transactions wait, younger ones die (abort). These can cause more aborts than simple WFG detection.
deadlock_timeout low enough to catch deadlocks under your normal transaction duration, but not so low that it fires on benign lock waits.Deadlock Prevention Strategies: Breaking the Coffman Conditions
Prevention eliminates deadlocks by design — but it costs throughput. Here are the four strategies, each targeting one Coffman condition:
1. Prevent Mutual Exclusion: That's impossible for write operations. You can use optimistic concurrency control (like MVCC snapshot isolation) where writers don't block readers and write conflicts are detected on commit. But writes still need exclusive locks on the data being modified — you can't two people update the same row simultaneously without conflict.
2. Prevent Hold-and-Wait: Require a transaction to acquire all locks it will ever need upfront (predeclaration). In a database, this is often done by starting the transaction, locking all required rows in a single statement (e.g., SELECT ... FOR UPDATE for all needed rows in a consistent order). If any lock is unavailable, the transaction fails immediately instead of waiting. This reduces deadlock risk but increases lock contention and reduces concurrency.
3. Prevent No Preemption: Allow the database to preempt (forcibly abort) a transaction when deadlock is detected — that's detection, not prevention. True 'prevention' of no preemption would require a mechanism for the DB to take locks away, which isn't standard.
4. Prevent Circular Wait: Enforce a global ordering of resources. For example, always lock rows in ascending primary key order. If every transaction follows the same order, cycles cannot form. This is the most practical prevention technique and widely used in production.
Trade-off: Prevention via predeclaration (hold-and-wait) reduces concurrency dramatically. Prevention via lock ordering (circular wait) is cheap but requires discipline across the entire codebase. Most teams go with detection + recovery because prevention is complex to enforce.
Deadlock Recovery: Choosing the Victim and Rolling Back
When a deadlock is detected, the database must break the cycle by aborting one or more transactions (the 'victim'). The victim is chosen based on a cost metric:
- Transaction age: Younger transactions are cheaper to abort (less work done).
- Number of locks held: Fewer locks means less rollback work.
- Work done (e.g., rows modified): Rollback of a large transaction costs more I/O and CPU.
PostgreSQL's victim selection: it picks the transaction that will cause the least amount of redo (i.e., the one that has made the fewest changes). In practice, it's the youngest transaction in the cycle — not by timestamp, but by the transaction's virtual transaction ID (VXID).
MySQL's victim selection: InnoDB chooses the transaction that has performed the least work (based on undo log size). If equal, it picks one with fewer locks.
Once a victim is chosen, the database rolls back that transaction entirely. All its locks are released, allowing the other waiting transaction(s) to proceed. The application receives an error (e.g., '40001' serialization failure in PostgreSQL). The application must retry.
- Immediate retry often causes the same deadlock. Use exponential backoff with jitter.
- Track the number of retries; after 3-5, escalate to a human or a dead letter queue.
- Retry at the transaction boundary, not inside it — you want a fresh transaction to avoid leftover locks.
MVCC and Optimistic Concurrency Control: Avoiding Locks Altogether
Multi-Version Concurrency Control (MVCC) doesn't eliminate locks, but it changes the game. In MVCC, readers don't block writers and writers don't block readers — each transaction sees a snapshot of the database (a version). Conflicts only arise on write-write collisions. Deadlocks still happen, but they're less frequent because read operations don't acquire locks.
In PostgreSQL's MVCC (snapshot isolation), a write-transaction that updates a row locks that row. If another write-transaction tries to update the same row, it waits. If there's a cycle, deadlock occurs. So MVCC reduces the chance of deadlock by eliminating reads from the lock picture, but write-write contention remains.
Optimistic Concurrency Control (OCC): No locks at all during the transaction. All operations are applied locally to a private copy. At commit time, the database checks if a conflict occurred (e.g., via a version number or timestamp). If yes, the transaction is aborted and retried. OCC completely eliminates deadlocks because there are no locks! But the cost is high abort rate under contention.
In practice, most relational databases use MVCC (which is a form of pessimistic concurrency control for writes). In-memory databases and some NoSQL systems use OCC. The trade-off is clear: OCC has zero deadlock risk but can waste work on aborts under heavy contention. MVCC has lower abort rates but risk of deadlock.
The Midnight Payment Jam: A Classic Deadlock Cascade
pg_stat_activity showed dozens of idle-in-transaction sessions waiting on lock events.- Always acquire locks in a fixed, documented order across your entire codebase.
- Add a lock timeout as a safety net — never rely solely on deadlock detection.
- Retry logic on deadlock must include exponential backoff and jitter, not immediate replay.
pg_stat_activity (PostgreSQL) or SHOW PROCESSLIST (MySQL). Look for many sessions in 'idle in transaction' or 'waiting for lock'.log_lock_waits = on and deadlock_timeout = 1s. Collect the deadlock details from the log.pg_locks with joins to find which sessions hold conflicting locks. Then refactor to use a consistent order.SELECT pg_terminate_backend(<blocked_pid>);Key takeaways
Common mistakes to avoid
5 patternsAssuming deadlocks will never happen in your DB
Using `NOWAIT` or `SKIP LOCKED` everywhere to avoid waits
NOWAIT only when you truly cannot wait (e.g., UI real-time). For backend batch processing, let transactions wait with a reasonable timeout.Implementing retry without backoff or jitter
sleep = min(base * 2^attempt + random(0, jitter), max_sleep).Ignoring lock ordering in stored procedures
Believing MVCC eliminates deadlocks entirely
Interview Questions on This Topic
Explain the four Coffman conditions necessary for deadlock. Can a deadlock occur if only three of them hold?
Frequently Asked Questions
That's DBMS. Mark it forged?
7 min read · try the examples if you haven't