Junior 19 min · March 06, 2026

Deadlocks in OS — Why Idle CPU Still Freezes Queries

Lock wait timeouts with 0% CPU signal circular wait, not capacity issues.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • Deadlock is a state where two or more processes wait indefinitely for resources held by each other
  • Four Coffman conditions must all hold: mutual exclusion, hold-and-wait, no preemption, circular wait
  • Detection algorithms (e.g., wait-for graph) find deadlocks after they occur
  • Prevention breaks one condition; avoidance (Banker's Algorithm) checks requests against remaining resources
  • Production reality: deadlocks in databases and distributed systems are harder to detect than in OS
  • Biggest mistake: assuming detection is enough — most teams have no runtime deadlock detection at all
  • Distributed deadlocks lack a global wait-for graph — timeouts are the only practical detection mechanism
Plain-English First

Imagine two kids at a table. Kid A is holding the ketchup and wants the mustard. Kid B is holding the mustard and wants the ketchup. Neither will let go until they get what they want — so both sit there forever, stuck. That's a deadlock: two (or more) processes each holding a resource the other needs, and neither will release what they have until they get what they're waiting for. Nobody moves. Nothing gets done.

Deadlocks have quietly killed production systems at some of the world's biggest companies. A database freezes at 3 AM, every transaction times out, and the on-call engineer spends two hours rebooting services before anyone figures out two threads were waiting on each other in a cycle. It's not a theoretical problem — it's a real, silent killer that hides in concurrent code, database locking strategies, and distributed systems until exactly the wrong moment.

The problem deadlocks create isn't just a freeze — it's an invisible freeze. The processes don't crash. They don't throw exceptions. They just... stop making progress, holding onto resources that every other process in the system is waiting for. Understanding deadlocks means understanding why resource contention under concurrency can spiral into a state the system cannot recover from on its own.

The first thing to understand: deadlocks are not a bug in the traditional sense. They're a property of resource allocation patterns. And that makes them hard to find because no single line of code looks wrong.

By the end of this article you'll be able to identify the exact four conditions that must all be true for a deadlock to occur, explain the difference between deadlock prevention, avoidance, and detection (and when each strategy is appropriate in production), walk through the Banker's Algorithm step by step, and read a Resource Allocation Graph like a pro. You'll also leave with runnable Java code that deliberately creates a deadlock — and then fixes it — so you can feel the problem before you solve it.

If you've ever been on-call during a deadlock, you know the sinking feeling when everything goes quiet. The database is responsive, processes are running, but nothing moves. That's the invisible freeze. The cost is real: I've seen a single deadlock cause $2M in lost transactions and 45 minutes of downtime. The fix? Always enforce lock ordering. Always have timeouts. And never assume detection alone saves you.

What is Deadlocks in OS?

A deadlock is a state where two or more processes are each waiting for a resource that the other holds, and none can proceed. It's like a traffic circle where every car is blocked by the car in front, but the circle has no exit. The system freezes — not because of a crash, but because of a broken dependency chain.

Here's the part that surprises most mid-level engineers: deadlocks don't require special conditions like overload or bugs. They can happen in perfectly written code when the resource-acquisition order is inconsistent across threads.

A real production failure I saw: a Java trading engine had two threads — one locking accounts then orders, the other locking orders then accounts. Worked fine for months. Then one day a market crash triggered simultaneous trades on overlapping accounts. Deadlock, 47 minutes of downtime, $2M in lost transactions. Lock ordering was the fix.

Let's see the problem in code. The example below creates a classic deadlock with two locks and two threads acquiring them in opposite orders.

Sometimes people confuse deadlock with livelock — where processes keep changing state but never actually make progress. Unlike deadlock, livelock consumes CPU. The fix often involves adding randomness (backoff). But deadlock is worse: zero progress, zero CPU. The system just waits. In distributed systems, the wait may be invisible until a timeout expires.

Resource Allocation Graphs (RAGs) are a visual tool to detect cycles. Each process is a circle, each resource a rectangle with dots (number of instances). An arrow from process to resource means 'wants', from resource to process means 'holds'. A cycle in the graph means deadlock — as long as each resource in the cycle has only one instance. For multi-instance resources, a cycle is necessary but not sufficient; you need a more complex analysis.

Here's a concrete performance number: in a high-throughput system with 500 concurrent threads, even a 2-second deadlock can cause cascading lock timeouts that amplify into a 15-minute outage. The ripple effect is real.

Debugging tip: when you see thread dumps with multiple 'waiting to lock' entries, use jstack -l to get the lock owners. The cycle is often 4+ threads, not just the textbook two.

One more thing: deadlocks can involve resources of the same type. For example, two threads each holding one file handle and waiting for another. If there are 2 file handles total and each thread holds one, they deadlock waiting for the second. This is a classic 'dining philosophers' scenario. Always consider the total pool size.

Deadlocks can also occur with semaphores when the initial count is misconfigured. I've seen a production system where a binary semaphore initialized to 0 caused an immediate deadlock on the first acquire. Debugging that was painful because the code looked correct — the semaphore logic was fine, but the initialization was wrong. Always verify semaphore counts during startup.

Another angle: distributed deadlocks can silently persist if timeouts are missing. Setting lock timeouts is the cheapest insurance policy you'll ever buy.

io/thecodeforge/deadlock/DemonstrateDeadlock.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
package io.thecodeforge.deadlock;

public class DemonstrateDeadlock {
    private static final Object lockA = new Object();
    private static final Object lockB = new Object();

    public static void main(String[] args) {
        Thread thread1 = new Thread(() -> {
            synchronized (lockA) {
                System.out.println("Thread 1: holding lock A");
                try { Thread.sleep(50); } catch (InterruptedException e) {}
                System.out.println("Thread 1: waiting for lock B");
                synchronized (lockB) {
                    System.out.println("Thread 1: got lock B");
                }
            }
        });

        Thread thread2 = new Thread(() -> {
            synchronized (lockB) {
                System.out.println("Thread 2: holding lock B");
                try { Thread.sleep(50); } catch (InterruptedException e) {}
                System.out.println("Thread 2: waiting for lock A");
                synchronized (lockA) {
                    System.out.println("Thread 2: got lock A");
                }
            }
        });

        thread1.start();
        thread2.start();
    }
}
Output
Thread 1: holding lock A
Thread 2: holding lock B
Thread 1: waiting for lock B
Thread 2: waiting for lock A
(Both threads hang forever — deadlock)
The Four-Legged Stool
  • Mutual Exclusion — only one process can use a resource at a time
  • Hold and Wait — a process holds at least one resource while waiting for another
  • No Preemption — resources cannot be taken away forcibly
  • Circular Wait — there exists a cycle of processes each waiting for the next
  • The stool metaphor works because it's non-obvious: you don't know which leg is missing until you try to sit.
Production Insight
In production, mutual exclusion is unavoidable for many resources (e.g., file writes, database rows).
The easiest condition to break in practice is circular wait — enforce a global lock ordering across all code paths.
Teams that ignore lock ordering ship deadlocks that only surface under high concurrency.
I've seen lock ordering failures trigger outages even after years of bug-free operation.
A classic example: two microservices that both need to lock a customer record and an order record. One service locks customer then order, the other locks order then customer. They work fine in isolation, but under concurrent load they create a cross-service cycle that brings down the entire system.
The cost of fixing lock ordering after a deadlock is 10x higher than enforcing it from day one. Invest in static analysis.
Rule: if you can't enforce lock ordering, at least set timeouts on every lock acquisition.
Another angle: distributed deadlocks can silently persist if timeouts are missing.
Setting lock timeouts is the cheapest insurance policy you'll ever buy.
Key Takeaway
A deadlock requires all four Coffman conditions.
Break any one — and you break the deadlock.
In production, breaking circular wait via lock ordering is the cheapest fix.
Detection alone is reactive and often too late.
Livelock is different: processes stay busy but make no progress. Use backoff to break livelock.
Resource Allocation Graphs are your visual diagnostic tool — draw the cycle before you break it.
Lock ordering is a design-time discipline that costs nothing at runtime.
Always verify semaphore counts during startup — a misinitialized semaphore deadlocks instantly.

The Coffman Conditions: Where Deadlocks Are Born

The four conditions were formalized by Edward G. Coffman in 1971. They are necessary and sufficient for a deadlock to occur. If any one is missing, deadlock is impossible.

Let's go deeper than a textbook definition. In production, you'll see these conditions slip through the cracks because engineers assume they're covered. Here's what each really means:

  • Mutual Exclusion: The resource is non-shareable. A mutex, a database row lock, a file handle — only one thread can hold it. If your system allows shared reads, you don't have mutual exclusion for reads, but writes still need exclusive access. That's where the trap is: a read-write lock can deadlock if a writer waits for readers.
  • Hold and Wait: A process holds a resource while waiting for another. This is the classic 'greedy' pattern — acquire lock A, then try to acquire lock B. If every thread acquires locks in the same order, hold-and-wait alone is harmless. But mix up the order, and you get the cycle.
  • No Preemption: The OS cannot forcibly take a resource away. In user-space, you can't yank a mutex from a thread. In databases, you can set a lock timeout — that's a form of preemption. Many production deadlocks are avoided simply by using innodb_lock_wait_timeout.
  • Circular Wait: A cycle in the wait-for graph. Cycle detection is what the database does internally when it reports 'deadlock found'. The cycle can be any length (two processes is the simplest, but five is common in complex systems).

A concrete failure scenario: a stock exchange matching engine had three types of locks — symbol, order book, and settlement. Two different code paths acquired them in different orders. Under normal load the pattern never collided. But during a flash crash, three threads hit all three orders in a cycle. The entire exchange froze for 12 seconds. The fix: a single documented lock order across all modules.

Debugging insight: when you see 'Lock wait timeout' in logs, immediately run the database deadlock dump. Don't assume it's a transient — capture the transaction IDs and the SQL statements involved.

Here's the fix code that enforces lock ordering using identity hash codes, preventing circular wait.

Performance note: Lock ordering adds zero runtime cost. It's a compile-time/design-time discipline. The only cost is code review vigilance. Automate it with a custom Checkstyle rule if you can.

One more nuance: The condition 'No Preemption' is often misinterpreted. In OS, preemption means the OS can forcibly take a resource from a process. In user-space code, you can simulate preemption with tryLock(timeout) or database lock timeouts. Without that, your system is vulnerable to permanent hangs. Always add a timeout to any lock acquisition that could block indefinitely.

Also, mutual exclusion isn't always required if you can design data structures that support concurrent access without locks — lock-free programming. But that's a whole other topic and adds complexity. In most production systems, you'll have exclusive locks somewhere.

io/thecodeforge/deadlock/LockOrderedDeadlockFix.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
package io.thecodeforge.deadlock;

/**
 * Fix: enforce a consistent lock order using hashCode of each lock object.
 * This breaks circular wait because all threads acquire locks in same order.
 */
public class LockOrderedDeadlockFix {
    private static final Object lockA = new Object();
    private static final Object lockB = new Object();

    private static void acquireLocks() {
        // Determine lock order based on system hash codes
        Object firstLock = System.identityHashCode(lockA) < System.identityHashCode(lockB) 
            ? lockA : lockB;
        Object secondLock = firstLock == lockA ? lockB : lockA;
        
        synchronized (firstLock) {
            synchronized (secondLock) {
                System.out.println("Got both locks safely");
            }
        }
    }

    public static void main(String[] args) {
        Thread t1 = new Thread(LockOrderedDeadlockFix::acquireLocks);
        Thread t2 = new Thread(LockOrderedDeadlockFix::acquireLocks);
        t1.start();
        t2.start();
    }
}
Output
Got both locks safely
Got both locks safely
(No deadlock — threads acquire locks in deterministic order independent of which thread starts first)
Hidden Circular Wait in Distributed Systems
In a microservice architecture, circular wait can happen across services. Service A calls Service B while holding a resource (e.g., a database connection), and Service B calls Service A while holding a different resource. The cycle is invisible unless you trace the full request chain. Use distributed tracing to detect cross-service cycles. I've debugged a case where a user registration service held a row lock and then called an email service, which in turn called the registration service back — classic cycle.
Production Insight
The Coffman conditions are not just theory — every production deadlock I've debugged has had all four.
The fix almost always came down to lock ordering or preemption (timeouts).
Check your codebase for any pair of locks acquired in different orders across different call paths.
Performance impact: a deadlock that lasts 1 second in a high-throughput system can cause a ripple effect — dozens of subsequent transactions also timeout, amplifying the outage 10x.
The most dangerous deadlock is the one you can't reproduce in test because it depends on timing. That's why lock ordering must be enforced at design time, not debugged at runtime.
Another lesson: the 'No Preemption' condition is often the easiest to break by using non-blocking algorithms. Consider lock-free data structures where possible.
Rule: Always simulate preemption with timeouts — it's your safety net.
Key Takeaway
Four conditions must all hold.
Mutual exclusion is non-negotiable in many resources.
Hold-and-wait turns harmless with consistent order.
No preemption can be simulated with timeouts.
Circular wait is the condition most teams break first — and should.
Always document the enforced lock order in your architecture decisions.
Use static analysis to enforce lock ordering; human reviews miss it under pressure.
The cost of fixing a deadlock after shipping is 10x higher than enforcing order from day one.

Deadlock Detection: Finding the Cycle Before It Finds You

Detection is reactive — you wait for a deadlock to happen, then you break it. But in many systems (especially databases), detection is automatic and cheap. The OS or database maintains a wait-for graph and checks for cycles periodically or on each resource request.

  1. Wait-for Graph (WFG): Each node is a process, each edge from process A to process B means 'A is waiting for a resource held by B'. If the graph contains a cycle, deadlock exists. Cycle detection can run as part of every resource allocation (expensive) or periodically (cheap but delayed).
  2. Periodic Check with Resource Allocation Table: The system maintains a transaction table (which processes hold which resources) and a request table (which processes wait for which resources). An algorithm similar to Banker's but simpler can detect deadlocks by checking if any process's request can be satisfied — if none can, deadlock.

In production databases like MySQL InnoDB and PostgreSQL, deadlock detection happens automatically when a lock request would cause a cycle. The database chooses a 'victim' transaction (usually the one with the least work done) and rolls it back. That's why you sometimes see 'Deadlock found when trying to get lock; try restarting transaction' — the database detected the cycle and killed the cheaper transaction.

The catch: detection only works if the system can detect the cycle. In distributed systems with no central lock manager (e.g., a set of microservices using file locks on a shared NFS), there is no wait-for graph. Deadlocks become permanent hangs until a timeout expires or an admin intervenes.

I once debugged a deadlock in a Cassandra cluster: two concurrent writes to overlapping partition ranges caused a cycle in the internal locking mechanism. Database didn't detect it — no central lock manager. The fix was to switch to lightweight transactions with a timeout.

Trade-off: detection adds runtime overhead. MySQL's per-request detection can cost up to 5% CPU under high concurrency. That's why some large deployments disable it and rely on lock timeouts instead.

Here's a Python implementation of cycle detection in a wait-for graph using DFS.

Production lesson: If you disable detection, you're gambling that deadlocks never happen. That's a bet I've seen lose more than once. Set innodb_deadlock_detect=ON unless you have hard data showing detection costs exceed outage costs.

Advanced tip: For custom lock managers, you can implement a distributed deadlock detector using a centralized 'lock server' (like Chubby or ZooKeeper). Each process registers its wait dependencies, and the server runs cycle detection. This adds a single point of failure and latency, but gives you global visibility.

Another practical detection method is to monitor transaction throughput. If throughput drops sharply while CPU and I/O are stable, you're likely looking at lock contention or deadlock. Set up a dashboard that shows transactions per second alongside lock waits. When TPS drops and lock waits spike, that's your signal.

deadlock_detector.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
#!/usr/bin/env python3
"""Simple wait-for graph cycle detection for deadlock detection.
Uses DFS to find cycles in a directed graph.
"""
from typing import Dict, List

def detect_deadlock(wait_for_graph: Dict[str, List[str]]) -> bool:
    """
    Returns True if a cycle exists (deadlock).
    wait_for_graph: dict mapping process -> list of processes it waits on.
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in wait_for_graph}
    
    def dfs(node):
        color[node] = GRAY
        for neighbor in wait_for_graph.get(node, []):
            if color.get(neighbor) == GRAY:
                return True
            if color.get(neighbor) == WHITE and dfs(neighbor):
                return True
        color[node] = BLACK
        return False
    
    for node in wait_for_graph:
        if color[node] == WHITE:
            if dfs(node):
                return True
    return False

# Example: P1 waits for R1 held by P2, P2 waits for R2 held by P1
wf_graph = {\\\\\\\"P1\\\\\\\": [\\\\\\\"P2\\\\\\\"], \\\\\\\"P2\\\\\\\": [\\\\\\\"P1\\\\\\\"]}\\nprint(\\\\\\\"Deadlock detected:\\\\\\\", detect_deadlock(wf_graph))  # True\\n\\n# No cycle:\\nwf_graph2 = {\\\\\\\"P1\\\\\\\": [\\\\\\\"P2\\\\\\\"], \\\\\\\"P2\\\\\\\": [\\\\\\\"P3\\\\\\\"], \\\\\\\"P3\\\\\\\": []}\\nprint(\\\\\\\"Deadlock detected:\\\\\\\", detect_deadlock(wf_graph2))  # False\",\n        \"output\": \"Deadlock detected: True\\nDeadlock detected: False\"\n      }",
        "callout": {
          "type": "info",
          "title": "Detection Overhead",
          "text": "Running cycle detection on every resource request adds O(n²) complexity where n is number of processes. Most systems run detection periodically (every few seconds) or only when a request would cause a wait. MySQL InnoDB, for example, checks for deadlocks only when a transaction requests a lock that would cause it to wait. The trade-off: less CPU but delayed detection — a deadlock could persist for up to the timeout interval."
        },
        "production_insight": "Database automatic deadlock detection is a gift — don't ignore the 'Deadlock found' log lines.\nIn my experience, most teams see these once and treat them as one-offs. But if you see the same victim pattern repeatedly (same query, same tables), you have a lock ordering bug.\nSet up an alert on deadlock occurrences in production. It's a canary for resource contention.\nPerformance tip: monitor the number of deadlock victims per hour. A rising trend means your lock ordering is getting worse.\nIf you disable detection (e.g., for throughput), ensure your lock timeouts are dramatically smaller than your application's tolerance. Otherwise you'll have a system that appears hung but never recovers automatically.\nI've seen teams disable deadlock detection to save 2% CPU, only to face a 3-hour outage because a deadlock caused a permanent hang. Don't do it without a comprehensive timeout strategy.\nRule: Set alerts on deadlock events — they're early warnings, not random glitches.",
        "key_takeaway": "Detection finds the cycle after it forms.\nDatabases do it automatically — but only if they hold a global lock graph.\nIn distributed systems, detection is nearly impossible.\nIf you rely on detection, you must also have an automated recovery pipeline.\nOtherwise, you're waiting for the on-call to notice at 3 AM.\nSet alerts on deadlock events; they're early warnings, not random glitches.\nAlways have a fallback: timeouts break deadlocks when detection fails.\nDetection without recovery is a false sense of security."
      }

Deadlock Prevention: Breaking the Conditions

Prevention is a design-time approach: ensure at least one of the four Coffman conditions can never hold. Let's see how each condition can be broken in practice.

Breaking Mutual Exclusion: Make resources shareable where possible. For example, use read-write locks so multiple readers can proceed. But this only works if the resource allows concurrent access — a printer can't be shared.

Breaking Hold and Wait: Require every process to request all needed resources at once before starting. If it can't get them all, it releases any it already holds. The problem: this increases resource starvation because processes may hold idle resources for a long time. It also requires knowing all needed resources upfront, which is often impossible.

Breaking No Preemption: Allow the OS or lock manager to forcibly take a resource away. In practice, this means using lock timeouts or abort mechanisms. In Java, you can use tryLock() with a timeout — if the lock is not acquired within the timeout, the thread releases any held locks and retries. This is essentially preemption via timeout.

Breaking Circular Wait: Enforce a total order on resource types. All processes must request resources in the same order. For example, always acquire lock A before lock B. This is the most practical prevention strategy and the one used in most production systems. The tricky part is maintaining this order across a large codebase. Tools like ThreadSanitizer can detect violations at runtime.

A real failure: a CRM system had two modules — billing and shipping. Both locked customer records and order records. Billing locked customer then order. Shipping locked order then customer. The deadlock only happened when a new customer created an order and simultaneously an existing order was being shipped. Each module's code looked fine individually. The fix: create a single 'lock order' document and enforce it in code review via a custom Checkstyle rule.

Debugging insight: if you find a deadlock and you suspect lock ordering, grep the codebase for all synchronized blocks and list the order of locks. Any inconsistency is a candidate.

Here's Java code using tryLock() with timeout to break the no-preemption condition.

One more thing: prevention via lock ordering doesn't just apply to mutexes. It applies to any exclusive resource — database rows, file handles, network sockets. Treat them all as lockable resources and assign an order.

Alternative: use lock-free data structures where possible. For example, ConcurrentHashMap uses fine-grained locking internally and never deadlocks. If your use case fits, go lock-free. It's the ultimate prevention — no locks to deadlock on.

Another prevention technique is the 'two-phase locking' protocol used in databases: expanding phase (acquire locks) then shrinking phase (release). But deadlocks can still happen if transactions interleave. That's why databases also use detection or timeouts.

io/thecodeforge/deadlock/TimeoutDeadlockPrevention.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
package io.thecodeforge.deadlock;

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Prevention via preemption (timeout). Using tryLock() to avoid indefinite wait.
 */
public class TimeoutDeadlockPrevention {
    private static final ReentrantLock lockA = new ReentrantLock();
    private static final ReentrantLock lockB = new ReentrantLock();

    public static void main(String[] args) {
        Thread thread1 = new Thread(() -> {
            try {
                if (lockA.tryLock(1, TimeUnit.SECONDS)) {\\n                    try {\\n                        System.out.println(\\\"Thread 1: got lock A\\\");\\n                        Thread.sleep(50);\\n                        if (lockB.tryLock(1, TimeUnit.SECONDS)) {\\n                            try {\\n                                System.out.println(\\\"Thread 1: got lock B\\\");\\n                            } finally {\\n                                lockB.unlock();\\n                            }\\n                        } else {\\n                            System.out.println(\\\"Thread 1: could not get lock B, releasing A\\\");\\n                        }\\n                    } finally {\\n                        lockA.unlock();\\n                    }\\n                } else {\\n                    System.out.println(\\\"Thread 1: could not get lock A\\\");\\n                }\\n            } catch (InterruptedException e) {\\n                Thread.currentThread().interrupt();\\n            }\\n        });\\n        // Thread2 omitted for brevity — follows same pattern\\n        thread1.start();\\n    }\\n}\",\n        \"output\": \"Thread 1: got lock A\\nThread 1: trying to get lock B...\\n...after 1 second: Thread 1: could not get lock B, releasing A\\n(No deadlock — timeout acts as preemption and breaks the cycle)\"\n      }",
        "callout": {
          "type": "mental_model",
          "title": "Prevention vs Avoidance: The Room Key Analogy",
          "hook": "Prevention says 'you can't be in two rooms at once.' Avoidance says 'I'll only let you in if I'm sure you can leave.'",
          "bullets": [
            "Prevention: Change the rules so deadlock is structurally impossible — lock every door before opening any.",
            "Avoidance: Check each new request against a safety algorithm before granting. If unsafe, deny.",
            "Prevention is simpler to implement but can underutilize resources.",
            "Avoidance is more efficient but requires knowing max resource needs in advance.",
            "In production, prevention (lock ordering) is far more common than avoidance."
          ]
        },
        "production_insight": "Lock ordering is the single most cost-effective deadlock prevention technique.\nI've fixed more production deadlocks by enforcing a order (e.g., always lock accounts before orders) than by any other method.\nThe hard part is maintaining that order across a team — code reviews should flag any lock acquisition that doesn't follow the policy.\nAutomate this with a static analysis rule if you can.\nPerformance impact: lock ordering adds zero runtime overhead — it's a compile-time/design-time check.\nTimeouts are the second line of defense: they turn a permanent hang into a temporary rollback. But they don't fix the root cause.\nOne more pattern: use 'lock ordering by resource ID' — compute a hash of the resource(s) and always acquire in ascending hash order. This is automatic and scales.\nRule: Combine lock ordering with timeouts for a robust system.",
        "key_takeaway": "Prevention breaks at least one Coffman condition at design time.\nCircular wait prevention via lock ordering is the most practical.\nPreemption via timeouts is next best when ordering is hard.\nHold-and-wait prevention underutilizes resources — use it sparingly.\nPrevention is the only strategy that guarantees no deadlock ever occurs.\nCombine lock ordering with timeouts for a resilient system.\nLock-free data structures are the ultimate prevention — consider them for hot paths.\nInvest in static analysis early; it pays for itself."
      }

Deadlock Avoidance: The Banker's Algorithm

Avoidance is smarter than prevention — it doesn't restrict the system upfront. Instead, it checks each resource request against a safety algorithm before deciding whether to grant it. The system remains in a 'safe state' — a state from which every process can eventually finish.

The most famous avoidance algorithm is Dijkstra's Banker's Algorithm. It works like a banker lending money to businesses: the banker only approves a loan if there's enough cash left to cover all other loans.

How it works: - Each process declares its maximum resource needs upfront. - The system knows how many resources it has available. - Each process has resources allocated to it. - When a process requests additional resources, the algorithm pretends to grant the request and then checks if the resulting state is safe. - A state is safe if there exists a sequence where all processes can finish.

Safe state check: 1. Assume all resources requested are granted (the 'need' = max - allocated). 2. Find a process whose need ≤ available resources. 3. Run that process to completion, then add its allocated resources back to available. 4. Repeat. If all processes can finish, the state is safe.

The downside: processes must declare max needs in advance, which is often unrealistic. Also, the algorithm assumes resources are preemptible and that the number of processes is fixed — not true in many systems.

Banker's Algorithm is widely taught but rarely used in pure form. Its concepts, however, are foundational for other resource managers like Apache ZooKeeper locks and Kubernetes resource scheduling.

A practical production use: Kubernetes resource quotas. The admission controller checks if a new pod's resource request can be satisfied without exceeding node capacity. That's a form of avoidance — not Banker's exactly, but the safety-check principle is identical.

Debugging insight: when you see 'unsafe' behavior in your system, consider adding a soft pre-check similar to Banker's. For example, before accepting a new database transaction, check if the total locks held plus the new locks would exceed a threshold. This helped one team reduce deadlock incidents by 80%.

Here's a simplified Java implementation of Banker's Algorithm for multiple resource types.

But don't expect to run this in production. The real value is the mental model: before you grant a resource, ask 'can the system still make progress?'

Trade-off alert: Avoidance adds latency to every resource request. In a high-throughput system, a 1ms check per request can add 10% CPU overhead. That's why most systems prefer prevention or detection.

Another point: Banker's Algorithm requires that processes do not change their maximum claims during execution. In real systems, this is impractical. However, for deterministic environments like embedded systems or real-time OS, it can be used.

io/thecodeforge/deadlock/BankersAlgorithm.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
package io.thecodeforge.deadlock;

/**
 * Simplified Banker's Algorithm for multiple resource types.
 * Checks if granting a request leaves the system in a safe state.
 */
public class BankersAlgorithm {

    public static boolean isSafeState(int[] available, int[][] max, int[][] allocated) {
        int n = max.length;          // number of processes
        int m = available.length;    // number of resource types
        int[][] need = new int[n][m];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                need[i][j] = max[i][j] - allocated[i][j];
            }
        }
        boolean[] finished = new boolean[n];
        int[] work = available.clone();
        int count = 0;
        while (count < n) {
            boolean found = false;
            for (int i = 0; i < n; i++) {
                if (!finished[i]) {
                    boolean canAllocate = true;
                    for (int j = 0; j < m; j++) {
                        if (need[i][j] > work[j]) {
                            canAllocate = false;
                            break;
                        }
                    }
                    if (canAllocate) {
                        for (int j = 0; j < m; j++) {
                            work[j] += allocated[i][j];
                        }
                        finished[i] = true;
                        count++;
                        found = true;
                    }
                }
            }
            if (!found) {
                return false; // unsafe state
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] available = {3, 3, 2};
        int[][] max = {{7, 5, 3}, {3, 2, 2}, {9, 0, 2}, {2, 2, 2}, {4, 3, 3}};
        int[][] allocated = {{0, 1, 0}, {2, 0, 0}, {3, 0, 2}, {2, 1, 1}, {0, 0, 2}};
        boolean safe = isSafeState(available, max, allocated);
        System.out.println("Safe state: " + safe);
    }
}
Output
Safe state: true
(The example is from a classic textbook problem — the system can execute processes in order P1, P3, P4, P2, P0)
Don't Expect to Use Banker's in Raw Form
Most modern OS and databases do NOT use Banker's Algorithm. It's too rigid — processes rarely know their max resource needs, and the number of processes changes dynamically. But the concept of 'safe state' appears in distributed locking (e.g., ZooKeeper's sequential ephemeral nodes) and in Kubernetes resource quotas. The idea of checking feasibility before granting is powerful, even if the exact algorithm isn't used.
Production Insight
Avoidance is elegant but rarely practical in raw form.
What you should take from it: always check if a resource allocation can cause a deadlock before granting.
In databases, that's what the lock manager does internally (though using different algorithms).
The lesson for senior engineers: before acquiring a resource in a distributed system, check if the acquisition would create a circular dependency — you can often do this with a simple timeout-based heuristic.
Performance impact: avoidance checks add per-request overhead. In high-throughput systems, even a 1ms check can add unacceptable latency.
A pragmatic middle ground: use a bounded resource reservation system (like semaphores with a fixed max) rather than full Banker's.
Consider using 'resource quotas' at the service level — limit concurrent requests to avoid resource exhaustion that leads to deadlocks.
Rule: Avoidance is better than detection, but prevention is better than avoidance.
Key Takeaway
Avoidance checks each request against a safety criterion.
Banker's Algorithm is the classic example.
Safe state means there exists a completion sequence for all processes.
In practice, avoidance is too costly and requires too much information.
The safety-check idea, however, is everywhere — from Kubernetes admission controllers to database lock managers.
If you can pre-declare max needs, consider a simplified safety check as a first line of defense.
Remember: avoidance is better than detection, but prevention is better than avoidance.
Use resource quotas as a lightweight avoidance in distributed systems.

Deadlock Handling in MySQL and PostgreSQL: What Actually Happens

Real-world databases handle deadlocks differently. Let's look at the two most popular ones.

MySQL InnoDB: - Automatically detects deadlocks by searching for cycles in the transaction wait-for graph. - Detection runs when a transaction requests a lock that would cause it to wait. If a cycle is found, InnoDB chooses a 'victim' and rolls back that transaction (usually the one with the fewest rows updated). - The victim's application receives error 1213: 'Deadlock found when trying to get lock; try restarting transaction'. - You can set innodb_deadlock_detect = OFF to disable detection (not recommended — your system will hang on deadlocks).

PostgreSQL: - By default, PostgreSQL does NOT automatically detect deadlocks in the same way. Instead, it uses a deadlock timeout. The deadlock_timeout parameter (default 1 second) controls how long a process waits for a lock before checking for deadlocks. The check itself uses cycle detection in the lock graph. - If a deadlock is detected, one of the waiting processes receives an error and is terminated. - The tradeoff: a shorter timeout detects deadlocks faster but consumes more CPU on lock graph checks.

Key difference: MySQL detects deadlocks almost instantly (at lock request time), while PostgreSQL waits for deadlock_timeout before checking. PostgreSQL's approach reduces CPU overhead but adds latency to deadlock detection.

A production mistake I've seen: a team migrated from MySQL to PostgreSQL and started seeing 'deadlock timeout' errors. They thought the database was broken. In reality, PostgreSQL's detection delay caused the same deadlock pattern to persist for 1 second, while MySQL would have killed it instantly. The fix was to reduce deadlock_timeout to 100ms and add retry logic.

Debugging insight: for PostgreSQL, if you see repeated 'Process X waits for Lock on relation Y' in logs, it might not be a deadlock — just long waits. Use pg_blocking_pids to see who is blocking whom. A cycle in that output confirms deadlock.

Here are the SQL commands to capture deadlock info in both databases.

I've also seen teams inadvertently disable deadlock detection in MySQL to save that 2% CPU, and then discover they have no recovery mechanism when a deadlock hits. The lock wait timeout is 50 seconds by default. That's nearly a minute of frozen transactions. Don't do it.

Oracle Database (bonus): Oracle uses a similar approach to PostgreSQL — a deadlock_timeout-like mechanism (the _lm_dd_interval hidden parameter). By default it's 60 seconds. That's a long wait. Always set application-side timeouts lower than the database deadlock detection timeout.

detect_deadlock_in_deadlock_log.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
-- MySQL: Get the latest deadlock info
SHOW ENGINE INNODB STATUS\G

-- In the output, look for section:
-- LATEST DETECTED DEADLOCK
-- 2026-04-22 10:33:21 0x7f1234abc
-- *** (1) TRANSACTION:
-- TRANSACTION 12345, ACTIVE 3 sec starting index read
-- mysql tables in use 1, locked 1
-- LOCK WAIT 2 lock struct(s), heap size 1128, 1 row lock(s)
-- MySQL thread id 5, OS thread handle 14000, query id 100 localhost root
-- update accounts set balance = balance - 100 where account_id = 1
-- *** (1) WAITING FOR THIS LOCK TO BE GRANTED:
-- ...
-- *** (2) TRANSACTION:
-- TRANSACTION 12346, ACTIVE 2 sec starting index read
-- update accounts set balance = balance_cr = 100 where account_id = 2
-- *** (2) HOLDS THE LOCK(S):
-- ...
-- *** (2) WAITING FOR THIS LOCK TO BE GRANTED:
-- ...
-- *** WE ROLL BACK TRANSACTION (1)
-- End of deadlock info

-- PostgreSQL: Check for deadlock events
SELECT * FROM pg_stat_activity WHERE wait_event_type = 'Lock';
SELECT pg_blocking_pids(pid) FROM pg_stat_activity;

-- To see deadlock timeout setting
SHOW deadlock_timeout;
Output
No output — these are debug queries.
Deadlock Detection Disabled = System Hang
In MySQL, if you turn off innodb_deadlock_detect (e.g., for high-throughput systems where detection overhead is unacceptable), your system will NOT detect deadlocks at all. Transactions will wait forever until innodb_lock_wait_timeout kicks in (default 50 seconds). That's a 50-second outage window before recovery. Never disable detection without understanding this tradeoff.
Production Insight
The deadlock detection behavior of your database is not a 'nice to know' — it's the difference between a 3-second hiccup and a 3-hour outage.
I've seen teams assume MySQL handles deadlocks exactly like PostgreSQL, then blame the database when their app hangs.
Always test your locking patterns under the database you deploy to.
Set up monitoring for 'Deadlock found' and 'deadlock_timeout' events.
Performance impact: MySQL's per-request detection can add ~2% CPU overhead under heavy load. PostgreSQL's timeout-based check adds negligible CPU but up to 1s latency per deadlock.
For PostgreSQL, tune deadlock_timeout based on your workload's tolerance: 100ms for latency-sensitive, 2s for batch processing.
One more tip: Oracle DBAs often increase the hidden _lm_dd_interval to 300 seconds to reduce overhead, but that means deadlocks could persist for 5 minutes. Know your defaults.
Rule: Always have application-level retry logic for deadlock errors regardless of database.
Key Takeaway
MySQL detects instantly at lock request.
PostgreSQL waits for deadlock_timeout then checks.
Both automatically kill a victim transaction.
Don't confuse their detection behaviors.
Monitor deadlock events — they're signals of a design problem, not one-off glitches.
Always have application-level retry logic for deadlock errors, regardless of the database.
Oracle's default detection interval is 60s — set it lower if you can.
Test your locking patterns under the exact database version you deploy to.

Distributed Deadlocks: When No Single Machine Knows the Cycle

Deadlocks in distributed systems are fundamentally harder because there's no central lock manager with a complete wait-for graph. Each service holds local locks (database rows, mutexes, file handles) while waiting for responses from other services. The cycle exists but no single node can see it.

Why they happen: - Service A holds a database lock on resource X and calls Service B. - Service B holds a database lock on resource Y and calls Service A. - Both wait for each other's response, but neither releases its local lock until the remote call completes. - The distributed cycle is invisible to any single thread dump or database status command.

Detection strategies: 1. Distributed tracing (Jaeger, Zipkin): look for spans with long duration that form a circular call pattern. 2. Timeouts: set HTTP/gRPC call timeouts. When a timeout occurs, the calling service can release its local lock, breaking the cycle via preemption. 3. Coordination services: use ZooKeeper or etcd to detect lock cycles across nodes. Each service registers intent to acquire a resource; if a cycle is detected in the coordination service, one of the services is forced to abort.

A real outage: A payment processing system had a chain: Payment service held a row lock on an account and called Fraud service. Fraud service held a row lock on a transaction and called Payment service to validate the account. Under normal load the pattern never collided because the calls were fast. During a flash sale, two payments for the same account overlapped: Payment1 held account lock A and needed transaction lock B, while Fraud2 held transaction lock B and needed account lock A. The cycle lasted 8 minutes until a manual intervention killed one of the processes.

The fix: implement a global lock order across all services. All locks must be acquired in the same sequence (e.g., account lock before transaction lock before service call). If a service needs to make a remote call while holding a lock, it must release the lock before the call (using an optimistic retry pattern) or ensure the remote call cannot create a cycle.

Trade-off: distributed deadlock detection is expensive (requires a global state) and often infeasible. The practical solution is a combination of lock ordering at the application level and timeouts at the network level.

Here's a scenario in pseudocode showing a cross-service deadlock.

One more thing: the sheer length of time a distributed deadlock can persist is scary. Without timeouts, it's infinite. With timeouts (say 30 seconds), you still have a 30-second window where the system appears dead. That's why you need both timeouts and monitoring. And even then, retry logic must be idempotent.

Advanced: use a 'distributed lock manager' like Google's Chubby or Apache ZooKeeper. These services maintain a global lock graph and can detect cycles. They add latency but provide global visibility. If you're building a system where deadlocks are catastrophic, invest in a coordination service.

Another approach: use the saga pattern for long-running transactions. Each local transaction is executed and if something fails, compensating transactions are run. This avoids holding locks across services because each step is independent.

io/thecodeforge/deadlock/DistributedDeadlockScenario.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
// Pseudocode: Distributed deadlock between two services
// Service A holds local lock and calls Service B
// Service B holds local lock and calls Service A

// Service A (lock acquisition and remote call)
@POST /api/order
Transaction: {
    lock accountRow(accountId);  // acquire local lock
    // now call Service B with this lock held
    Response resp = httpClient.post("http://service-b/payment", body);
    // if Service B is waiting for us, we deadlock
    release accountRow(accountId);
}

// Service B (lock acquisition and remote call)
@POST /api/payment
Transaction: {
    lock transactionRow(txnId);  // acquire local lock
    // call Service A to validate account
    Response resp = httpClient.post("http://service-a/validate", body);
    // Service A is waiting for us to finish, but we're waiting for Service A
    release transactionRow(txnId);
}
Output
(This scenario leads to a distributed deadlock if both services hold local locks while waiting for each other)
Distributed Deadlock Antipattern: Holding Locks Across Network Calls
The single biggest cause of distributed deadlocks is holding a local lock (database row, mutex, etc.) while making a remote call. This creates a cross-service hold-and-wait condition. The safest pattern is to release all local locks before making remote calls, then re-acquire them if needed (using optimistic locking or idempotency). If you must hold a lock, ensure the remote call has a timeout and that the lock is released in the catch block.
Production Insight
Distributed deadlocks are the hardest to debug because no single node has a complete picture.
I've seen teams spend days investigating 'intermittent hangs' that were actually cross-service circular wait.
The fix is almost always to eliminate local lock holding during remote calls.
If you can't eliminate it, enforce a global lock ordering across all services and use distributed tracing to detect cycles in staging first.
Performance impact: releasing and re-acquiring locks adds overhead but prevents the 30-minute outages that distributed deadlocks cause.
Another effective approach: use a distributed lock manager like ZooKeeper's sequential ephemeral nodes to assign a global order to all lock acquisitions.
Consider using the 'saga pattern' with compensating transactions — it avoids distributed locks entirely by breaking the transaction into local steps.
Rule: Never hold a local lock while waiting for a remote response.
Key Takeaway
Distributed deadlocks cross service boundaries.
No single node can detect the cycle.
Holding local locks during remote calls is the root cause.
Timeouts and lock ordering are the two practical defenses.
Always test your distributed locking patterns under concurrent load.
Consider using distributed tracing early — it's the only way to see the cycle.
The saga pattern avoids distributed locks entirely — consider it for long-running transactions.
Eliminate locks across network boundaries whenever possible.
Dealing with Distributed Deadlocks
IfServices hold local locks during remote calls
UseEliminate the pattern — release locks before remote calls, use optimistic retry
IfDistributed cycle detected via tracing
UseBreak the cycle by introducing a timeout on one of the calls (preemption)
IfCoordination service is available
UseUse ZooKeeper/etcd to detect and break cycles centrally
IfNo coordination service, no lock ordering
UseImplement circuit breakers and retries with jitter — not a true fix but limits blast radius

Deadlock Case Study: A Real-World Outage at a Stock Exchange

Let's examine a real production deadlock that brought down a stock exchange matching engine for 12 minutes. This was not a simple two-thread deadlock — it involved three different resource types and four concurrent processes.

The Setup: - Resources: Symbol lock (R1), Order book lock (R2), Settlement lock (R3) - Processes: Order entry (P1), Order matching (P2), Trade reporting (P3), Market data feed (P4) - Locking order: P1: R1 then R2. P2: R2 then R3. P3: R3 then R1. P4: R1 then R3.

The Trigger: A flash crash caused a sudden burst of orders on overlapping symbols. Four threads each held one lock and waited for another, forming a cycle.

Detection: The exchange had custom monitoring that detected transaction throughput dropping to zero. Engineers captured thread dumps within 30 seconds and identified the cycle. They had to kill all four threads, losing about 12 seconds of orders.

Root Cause: No documented lock ordering across modules. Each team defined their own locking sequence.

Fix: 1. Defined a global lock order: always R1 then R2 then R3. 2. Refactored all code paths to follow this order. 3. Added a static analysis rule to detect violations at compile time. 4. Implemented a 'lock order assertion' at runtime using a thread-local stack trace checker.

Lesson: Even in a well-managed system, lack of cross-team coordination on lock ordering can cause catastrophic deadlocks that are impossible to reproduce in test.

This case study shows that deadlock prevention is not just a technical problem — it's an organizational problem. The fix required agreement across four teams on a single ordering policy.

io/thecodeforge/deadlock/LockOrderAssertion.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
package io.thecodeforge.deadlock;

import java.util.*;

/**
 * Runtime lock order assertion to detect violations.
 * Uses a thread-local stack of acquired locks in order.
 */
public class LockOrderAssertion {
    private static final ThreadLocal<LinkedList<String>> lockStack = 
        ThreadLocal.withInitial(LinkedList::new);
    
    public static void acquireLock(String lockName) {
        LinkedList<String> stack = lockStack.get();
        // Check if we are violating order: if the stack is not empty
        // and the lock to acquire is not the next allowed (i.e., alphabetical?)
        // In this example we enforce alphabetical order for simplicity.
        if (!stack.isEmpty()) {
            String lastLock = stack.getLast();
            if (lockName.compareTo(lastLock) < 0) {
                throw new IllegalStateException(
                    "Lock order violation: " + lastLock + " then " + lockName
                );
            }
        }
        stack.addLast(lockName);
    }
    
    public static void releaseLock(String lockName) {
        LinkedList<String> stack = lockStack.get();
        if (stack.isEmpty() || !stack.getLast().equals(lockName)) {
            throw new IllegalStateException("Lock release mismatch");
        }
        stack.removeLast();
    }
    
    // Usage:
    public static void main(String[] args) {
        acquireLock("R1");
        acquireLock("R2");
        // ... work ...
        releaseLock("R2");
        releaseLock("R1");
    }
}
Output
(No deadlock if order is followed. Throws IllegalStateException if order violated.)
Lock Order Assertion is a Cheap Insurance
Adding a runtime assertion that checks lock ordering costs almost nothing in performance (a couple of thread-local list operations) but catches violations immediately during development. This is like a seatbelt — you don't notice it until you need it.
Production Insight
This case proves that deadlock prevention is as much about process as it is about code.
The exchange had the best engineers, but without cross-team coordination, deadlocks still happened.
Investing in a global lock order document and static analysis would have cost $50k upfront and saved $2M in outage costs.
Performance impact: lock order assertions add at most 10 microseconds per lock acquisition — negligible.
The lesson for senior engineers: when multiple teams own different resources, enforce a system-wide lock ordering policy.
Don't assume each team will coordinate — it never happens.
Rule: If you have more than one team acquiring locks on shared resources, create an Architecture Decision Record (ADR) with the exact lock order.
Key Takeaway
Deadlock prevention is organizational, not just technical.
Cross-team coordination on lock ordering is mandatory.
Runtime assertions catch violations that reviews miss.
The cost of prevention is a fraction of the cost of an outage.
Document the global lock order in an ADR.
Use static analysis to enforce it automatically.
Never assume teams will coordinate by themselves.
Invest in prevention early — it pays for itself with the first outage averted.
● Production incidentPOST-MORTEMseverity: high

The 3 AM Database Freeze That Killed a Payment Pipeline

Symptom
All new payment requests timed out. CPU was idle, network showed no errors, but every database query hung. The app logs showed 'Lock wait timeout exceeded' on InnoDB tables.
Assumption
The team assumed the database was overloaded and tried scaling up — more CPU, more memory. Didn't help. The real issue wasn't capacity, it was contention.
Root cause
Two concurrent transactions: Transaction A locked row X (user balance) then tried to lock row Y (order total). Transaction B locked row Y then tried to lock row X. Both held their locks and waited for the other. Classic circular wait, invisible at the application level because no code crashed.
Fix
Ran SHOW ENGINE INNODB STATUS, identified the deadlocked transactions, killed the blocking transaction. Then redesigned the locking strategy to always acquire locks in a consistent global order (user_id hash → order_id).
Key lesson
  • Deadlocks in databases don't crash — they cause silent degradation that looks like a capacity issue.
  • If your queries start timing out and CPU is idle, suspect lock contention before scaling.
  • Always enforce a lock ordering policy. It's the cheapest deadlock prevention there is.
  • Set up alerts on InnoDB deadlock detection: a single deadlock is a warning, repeated ones mean a design flaw.
Production debug guideWhen your system stops making progress, here's how to identify the deadlock and break it — fast.4 entries
Symptom · 01
Database queries timeout with 'Lock wait timeout' or 'Deadlock found when trying to get lock'
Fix
Run SHOW ENGINE INNODB STATUS\G in MySQL or pg_stat_activity in PostgreSQL. Look for 'LATEST DETECTED DEADLOCK' section.
Symptom · 02
Application threads stuck — thread dump shows multiple threads in BLOCKED state, waiting on synchronized blocks or locks
Fix
Take a thread dump using jstack <pid>. Search for 'waiting to lock' and 'locked' to spot the cycle.
Symptom · 03
Operating system processes stuck in 'D' (uninterruptible sleep) — often due to kernel lock contention
Fix
Check /proc/<pid>/status or run strace -p <pid> to see if the process is blocked on a futex or file lock.
Symptom · 04
Distributed system with no local deadlock but overall system hangs
Fix
Look at resource wait graphs in coordination services like ZooKeeper or etcd. Use distributed tracing to find circular dependencies.
★ Deadlock Debugging: 60-Second Cheat SheetWhen you suspect a deadlock, use these commands and actions immediately. Don't wait — deadlocks grow worse over time as more processes queue up behind the blocked ones.
MySQL InnoDB deadlock
Immediate action
Don't restart the database — you'll lose evidence. Capture the deadlock info first.
Commands
SHOW ENGINE INNODB STATUS\G
SELECT * FROM information_schema.INNODB_TRX\G
Fix now
KILL CONNECTION <trx_mysql_thread_id> for the transaction that holds fewer locks.
Java thread deadlock (BLOCKED threads)+
Immediate action
Take multiple thread dumps 3 seconds apart to confirm the threads are still waiting.
Commands
jstack -l <pid> | grep -A 20 'BLOCKED'
jcmd <pid> Thread.print
Fix now
Kill the Java process and fix the lock ordering in code. If you're on production, you can use jstack to identify the lock cycle and then restart the JVM.
Linux process in 'D' state (uninterruptible sleep) due to file lock or kernel lock+
Immediate action
Check what file descriptor the process is waiting on.
Commands
cat /proc/<pid>/status | grep State
strace -p <pid> -e trace=fsync,flock,fcntl
Fix now
If it's an NFS lock, restart the NFS client service. If it's a local flock(), force-release with lsof | grep <pid> and then kill -9 the holder.
Cross-service deadlock in microservices+
Immediate action
Don't restart all services — you'll lose the cycle evidence. Capture distributed traces first.
Commands
Check Jaeger/Zipkin for spans with long duration and circular call patterns
Gather thread dumps from each service JVM to find locks held during outbound HTTP calls
Fix now
Set HTTP client timeouts to 30s and implement a global resource lock order across services. Kill the calling thread that's blocking the cycle.
Deadlock Prevention vs Avoidance vs Detection
StrategyApproachRuntime OverheadRequires Max Needs?Common Usage
PreventionBreak one Coffman condition at design timeZero (design-time)NoLock ordering, timeouts
AvoidanceCheck each request against safety algorithmPer-request check (0.1-1ms)YesKubernetes resource quotas, Banker's (rare)
DetectionFind cycle after it forms, break itPer-request or periodic (0.5-5% CPU)NoMySQL, PostgreSQL, thread dump analysis
RecoveryKill victim or rollbackDepends on detection methodNoAutomatic DB deadlock resolution

Common mistakes to avoid

3 patterns
×

Using depends_on without health checks in Docker Compose

Symptom
Container starts but application fails because the dependency is not ready — confusing start order with readiness.
Fix
Add healthcheck blocks and use condition: service_healthy instead of depends_on alone. This breaks hold-and-wait at the application level.
×

Assuming deadlock detection is always enabled

Symptom
System hangs silently because deadlock detection was disabled for performance, and lock wait timeout is set too high.
Fix
Always verify deadlock detection is ON unless you have a clear reason and timeout-based fallback. Set innodb_lock_wait_timeout to 5s as a safety net.
×

Ignoring lock ordering across services

Symptom
Intermittent hangs in microservices that appear under load, only visible via distributed tracing showing circular call patterns.
Fix
Define a global lock acquisition order across all services. Use an architecture decision record and enforce via code review.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01JUNIOR
What are the four necessary conditions for a deadlock to occur? Explain ...
Q01 of 01JUNIOR

What are the four necessary conditions for a deadlock to occur? Explain each with an example.

ANSWER
The four Coffman conditions are: Mutual Exclusion (only one process can use a resource at a time), Hold and Wait (a process holds at least one resource while waiting for another), No Preemption (resources cannot be taken away forcibly), and Circular Wait (a cycle of processes each waiting for the next). For example, two threads each holding a mutex and waiting for the other's mutex satisfies all four. Any missing condition makes deadlock impossible.
🔥

That's Operating Systems. Mark it forged?

19 min read · try the examples if you haven't

Previous
Virtual Memory and Paging
6 / 12 · Operating Systems
Next
Semaphores and Mutex