Distributed Locking: The Production Guide to Avoiding Split-Brain and Data Corruption
Distributed locking prevents race conditions in distributed systems.
20+ years shipping large-scale distributed systems. Notes here come from systems that actually shipped.
Distributed locking ensures only one process holds a lock on a resource at a time across a network. Use Redis Redlock, ZooKeeper ephemeral nodes, or database-based locks. Beware of clock drift, network partitions, and lock expiration — these cause split-brain scenarios where two processes believe they hold the lock.
Imagine a single bathroom key for a whole office building. Only one person can use the bathroom at a time because they hold the physical key. Distributed locking is that key, but for computer resources across multiple servers. If someone forgets to return the key (lock expires), someone else might walk in on them — that's a split-brain bug.
Distributed locking is one of those things that sounds simple until it breaks your production database at 3 AM. The textbook says 'use a lock' — but the real world says 'your lock just failed and now you have duplicate payments.' This article is the no-bullshit guide to distributed locking: what works, what doesn't, and how to debug when it all goes wrong.
The core problem is mutual exclusion across machines. Without it, two services can simultaneously process the same order, decrement the same inventory twice, or overwrite each other's data. You need a lock that all nodes respect — and that's harder than it sounds because networks are unreliable, clocks drift, and processes crash.
By the end of this, you'll be able to choose the right locking strategy for your system, implement it without the classic mistakes, and diagnose failures when locks misbehave. You'll also know when not to use distributed locking at all — because sometimes the simplest solution is no lock.
Why Distributed Locking Is Hard: The Fallacies of Distributed Computing
Before we talk about solutions, let's talk about why this is a hard problem. The network is not reliable — packets drop, latency spikes, and partitions happen. Clocks are not synchronized — NTP can drift, and even with PTP, you get skew. Processes can pause for garbage collection or get preempted by the OS. These three facts make distributed locking fundamentally different from single-process locking.
Without distributed locking, two nodes can simultaneously modify the same resource. Classic example: an inventory service that decrements stock on order placement. Two orders come in at the same time, both read stock=1, both write stock=0 — you just oversold. The fix is a distributed lock that serializes access to the inventory row.
But here's the kicker: even with a lock, you can still get corruption if the lock expires while the holder is still working. This is the split-brain problem. The holder thinks it has the lock, but another node acquires it and starts modifying the same resource. Now you have two writers. This is why lock fencing (a monotonically increasing token) is critical — it lets the resource reject stale writes.
Redis-Based Locks: The Good, the Bad, and the Split-Brain
Redis is the most popular choice for distributed locking because it's fast and simple. The basic pattern: SET key value NX PX 5000 — atomically set the key if it doesn't exist with a 5-second TTL. To release, delete the key only if the value matches your lock token (to avoid deleting someone else's lock).
But Redis has a fundamental problem: it's not strongly consistent. In a Redis cluster, if the master goes down after acknowledging the write but before replicating, the lock is lost. The new master doesn't have the lock, so another client can acquire it. This is why Redis Labs proposed Redlock — a consensus-based algorithm that acquires the lock from a majority of Redis nodes.
Redlock is controversial. Martin Kleppmann famously argued it's unsafe because of clock drift and GC pauses. In practice, it works well if you have tight clock sync (NTP) and short TTLs. But if you need absolute correctness, use ZooKeeper or etcd. For most systems, Redis locks are good enough — just be aware of the edge cases.
ZooKeeper and etcd: Strong Consistency at a Cost
When correctness matters more than latency, use ZooKeeper or etcd. Both use consensus algorithms (Zab and Raft respectively) to provide linearizable writes. The lock pattern: create an ephemeral sequential node. The client with the smallest sequence number holds the lock. When the client disconnects (or crashes), the ephemeral node is automatically deleted — no TTL needed.
This solves the lock expiration problem because the lock lives as long as the session. But it introduces new problems: session timeouts. If the client's session expires due to a network blip, the lock is released even though the client is still working. This can cause split-brain again. The solution is to use a fencing token: the lock service gives you a monotonically increasing token that you pass to the resource. The resource rejects any write with a stale token.
etcd has a built-in locking package (concurrency/stm) that handles this. ZooKeeper requires more manual work. The trade-off: ZooKeeper/etcd are slower than Redis (10-100ms vs 1ms) but provide stronger guarantees. Use them for critical resources like leader election or financial transactions.
Database-Based Locks: The Old Reliable
If you're already using a relational database, you can use row-level locks with SELECT FOR UPDATE. This is the simplest distributed lock — no extra infrastructure. The lock is held for the duration of the transaction. When the transaction commits or rolls back, the lock is released.
But there's a catch: database locks are coarse and can become a bottleneck. If you lock a row for too long, other transactions queue up. Also, if the application crashes mid-transaction, the lock is held until the database detects the dead connection (which can take minutes). This is why you should keep transactions short and use a timeout.
Another pattern: use a database table as a lock registry with a unique constraint on the lock name. INSERT a row with the lock name and a token. If the INSERT succeeds, you have the lock. DELETE to release. This is simple but doesn't handle crashes — you need a background job to clean up stale locks.
When Not to Use Distributed Locking
Distributed locking is a hammer, but not every problem is a nail. Sometimes you can avoid locks entirely by using idempotent operations or optimistic concurrency. For example, instead of locking an inventory row, use an atomic decrement: UPDATE inventory SET stock = stock - 1 WHERE stock > 0. If the update affects zero rows, you know stock was insufficient. No lock needed.
Another alternative: use a message queue with exactly-once semantics. Process orders sequentially from a single partition. This avoids locks but introduces ordering constraints. Or use a database transaction with SERIALIZABLE isolation — but that kills performance.
The rule of thumb: if you can design your system to be conflict-free (e.g., using event sourcing or CRDTs), do that instead. Distributed locking should be your last resort, not your first instinct. It adds latency, complexity, and failure modes.
The 4GB Container That Kept Dying
- Always bound retries and connections in distributed lock clients — an unconstrained retry loop is a self-inflicted DDoS.
unlock() in code — add a watchdog to auto-release stale locks.redis-cli TTL lock:order:123redis-cli GET lock:order:123Key takeaways
Interview Questions on This Topic
How does Redlock handle clock drift? What happens if two Redis nodes have unsynchronized clocks?
Frequently Asked Questions
20+ years shipping large-scale distributed systems. Notes here come from systems that actually shipped.
That's Distributed Systems. Mark it forged?
5 min read · try the examples if you haven't