Senior 5 min · June 25, 2026

Distributed Locking: The Production Guide to Avoiding Split-Brain and Data Corruption

Distributed locking prevents race conditions in distributed systems.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Notes here come from systems that actually shipped.

Last updated: June 25, 2026 · Production tested
Quick Answer

Distributed locking ensures only one process holds a lock on a resource at a time across a network. Use Redis Redlock, ZooKeeper ephemeral nodes, or database-based locks. Beware of clock drift, network partitions, and lock expiration — these cause split-brain scenarios where two processes believe they hold the lock.

What is Distributed Locking?

Distributed locking is a mechanism to coordinate access to a shared resource across multiple processes or machines, ensuring mutual exclusion in a distributed environment. It prevents race conditions and data corruption when concurrent operations modify the same data.

Plain-English First

Imagine a single bathroom key for a whole office building. Only one person can use the bathroom at a time because they hold the physical key. Distributed locking is that key, but for computer resources across multiple servers. If someone forgets to return the key (lock expires), someone else might walk in on them — that's a split-brain bug.

Distributed locking is one of those things that sounds simple until it breaks your production database at 3 AM. The textbook says 'use a lock' — but the real world says 'your lock just failed and now you have duplicate payments.' This article is the no-bullshit guide to distributed locking: what works, what doesn't, and how to debug when it all goes wrong.

The core problem is mutual exclusion across machines. Without it, two services can simultaneously process the same order, decrement the same inventory twice, or overwrite each other's data. You need a lock that all nodes respect — and that's harder than it sounds because networks are unreliable, clocks drift, and processes crash.

By the end of this, you'll be able to choose the right locking strategy for your system, implement it without the classic mistakes, and diagnose failures when locks misbehave. You'll also know when not to use distributed locking at all — because sometimes the simplest solution is no lock.

Why Distributed Locking Is Hard: The Fallacies of Distributed Computing

Before we talk about solutions, let's talk about why this is a hard problem. The network is not reliable — packets drop, latency spikes, and partitions happen. Clocks are not synchronized — NTP can drift, and even with PTP, you get skew. Processes can pause for garbage collection or get preempted by the OS. These three facts make distributed locking fundamentally different from single-process locking.

Without distributed locking, two nodes can simultaneously modify the same resource. Classic example: an inventory service that decrements stock on order placement. Two orders come in at the same time, both read stock=1, both write stock=0 — you just oversold. The fix is a distributed lock that serializes access to the inventory row.

But here's the kicker: even with a lock, you can still get corruption if the lock expires while the holder is still working. This is the split-brain problem. The holder thinks it has the lock, but another node acquires it and starts modifying the same resource. Now you have two writers. This is why lock fencing (a monotonically increasing token) is critical — it lets the resource reject stale writes.
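To make the fencing idea concrete, here is a minimal sketch in Python. It assumes the protected resource can remember the highest token it has accepted; `FencedStore` and `StaleTokenError` are hypothetical names for illustration, and the token values stand in for whatever your lock service hands out.

```python
class StaleTokenError(Exception):
    """Raised when a write carries a fencing token older than one already seen."""


class FencedStore:
    """Hypothetical in-memory resource that enforces fencing tokens."""

    def __init__(self):
        self.value = None
        self.highest_token = 0  # highest fencing token accepted so far

    def write(self, token, value):
        # Reject any writer whose token is older than the newest one seen.
        if token < self.highest_token:
            raise StaleTokenError(f"token {token} < {self.highest_token}")
        self.highest_token = token
        self.value = value


store = FencedStore()
store.write(33, "from worker A")          # A holds the lock with token 33
store.write(34, "from worker B")          # A's lock expired; B acquired token 34
try:
    store.write(33, "late write from A")  # A resumes after a long pause
except StaleTokenError as err:
    print("rejected:", err)
print(store.value)
```

The check lives in the resource, not in the lock client. That is the point: a paused holder cannot be trusted to notice that its lock is gone, so the last line of defense has to sit where the write lands.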

InventoryService.systemdesign (pseudocode)
// io.thecodeforge — System Design tutorial

// Pseudocode for inventory decrement with distributed lock
function decrementStock(itemId, quantity) {
  // Acquire a lock scoped to this inventory item
  lock = redisLock.acquire("inventory:item:" + itemId, ttl=5000) // 5 second TTL
  if (!lock) {
    throw new Exception("Could not acquire lock, try again")
  }
  try {
    // Read current stock
    stock = db.query("SELECT stock FROM inventory WHERE id = ?", itemId)
    if (stock < quantity) {
      throw new Exception("Insufficient stock")
    }
    // Update stock
    db.execute("UPDATE inventory SET stock = stock - ? WHERE id = ?", quantity, itemId)
  } finally {
    // Always release the lock, even if the critical section throws
    redisLock.release(lock)
  }
}
Output
No output — this is a pseudocode pattern.
Production Trap: Lock Expiration
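If your lock TTL is too short, the lock expires while the worker is still processing. Another worker grabs the lock and you get split-brain. Solution: run a watchdog thread that extends the TTL periodically while work is in progress, and pass a fencing token to the resource so it can reject stale writes. The watchdog alone is not enough, because a paused process cannot renew anything.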
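A watchdog can be sketched as a background thread that renews the lock on a fixed interval. The sketch below is an assumption-laden illustration, not a library API: `LockWatchdog` is a hypothetical class, and `extend` is any callable you supply that renews the TTL and returns False once the lock is no longer yours. Here a fake `extend` stands in for the real renewal call.

```python
import threading
import time


class LockWatchdog:
    """Calls `extend()` every `interval_s` seconds until stopped.

    `extend` is a hypothetical callable that renews the lock TTL and
    returns True on success, False if the lock is no longer ours.
    """

    def __init__(self, extend, interval_s):
        self._extend = extend
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.lost = threading.Event()  # set when a renewal fails

    def _run(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval_s):
            if not self._extend():
                self.lost.set()
                return

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()


renewals = []


def fake_extend():
    # Stand-in for a real TTL renewal against the lock service.
    renewals.append(time.monotonic())
    return True


watchdog = LockWatchdog(fake_extend, interval_s=0.02)
watchdog.start()
time.sleep(0.2)  # stands in for the critical section
watchdog.stop()
print(f"renewed {len(renewals)} times, lost={watchdog.lost.is_set()}")
```

A common choice is to renew at roughly a third of the TTL so that one missed renewal does not lose the lock. The worker should check `lost` before each write and abort if it is set.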
Locking Strategy Decision Tree
If: Need strong consistency and can tolerate latency
Use: ZooKeeper or etcd with ephemeral nodes and fencing tokens
If: Need low latency and can tolerate occasional lock failure
Use: Redis Redlock with short TTL and watchdog
If: Already using a relational database with transactions
Use: Database row-level locks (SELECT FOR UPDATE), the simplest option
[Figure: Distributed Locking: Avoiding Split-Brain & Corruption. Comparison of lock implementations and production pitfalls. Panels: Why Distributed Locking Is Hard (fallacies of distributed systems: network, clocks, failures); Redis-Based Locks (fast but weak consistency; split-brain risk); ZooKeeper / etcd Locks (strong consistency via consensus; higher latency); Database-Based Locks (reliable with ACID; potential bottleneck); When Not to Use Distributed Locking (alternatives: idempotency, optimistic concurrency). Warning: Redis locks without proper fencing can cause split-brain; use Redlock or etcd with fencing tokens for safety.]

Redis-Based Locks: The Good, the Bad, and the Split-Brain

Redis is the most popular choice for distributed locking because it's fast and simple. The basic pattern: SET key value NX PX 5000 — atomically set the key if it doesn't exist with a 5-second TTL. To release, delete the key only if the value matches your lock token (to avoid deleting someone else's lock).

But Redis has a fundamental problem: it's not strongly consistent. Replication is asynchronous, so if the master acknowledges the write and then goes down before replicating it, the lock is lost. The promoted replica doesn't have the key, so another client can acquire the same lock. This is why Salvatore Sanfilippo, the creator of Redis, proposed Redlock: a quorum-based algorithm that acquires the lock on a majority of independent Redis masters.

Redlock is controversial. Martin Kleppmann famously argued it's unsafe because of clock drift and GC pauses. In practice, it works well if you have tight clock sync (NTP) and short TTLs. But if you need absolute correctness, use ZooKeeper or etcd. For most systems, Redis locks are good enough — just be aware of the edge cases.

redis_lock.py (Python)
# io.thecodeforge — System Design tutorial

import redis
import uuid
import time

class RedisLock:
    def __init__(self, client, key, ttl=5000):
        self.client = client
        self.key = key
        self.ttl = ttl
        self.token = str(uuid.uuid4())  # Unique per lock attempt

    def acquire(self):
        # SET NX PX — atomic: set if not exists, with TTL
        result = self.client.set(self.key, self.token, nx=True, px=self.ttl)
        return result is True

    def release(self):
        # Lua script to atomically delete only if token matches
        lua = """
        if redis.call("get", KEYS[1]) == ARGV[1] then
            return redis.call("del", KEYS[1])
        else
            return 0
        end
        """
        self.client.eval(lua, 1, self.key, self.token)

# Usage
client = redis.Redis(host='localhost', port=6379)
lock = RedisLock(client, "lock:order:123", ttl=5000)
if lock.acquire():
    try:
        # Critical section
        print("Lock acquired, processing order 123")
    finally:
        lock.release()
else:
    print("Failed to acquire lock")
Output
Lock acquired, processing order 123
Never Do This: Non-Atomic Release
Don't do GET + DEL in two separate commands. Between your GET and your DEL, your lock can expire and another client can acquire it, and your DEL then removes their lock. Always do the compare-and-delete atomically in a Lua script. UNLINK does not help here: it is only a non-blocking DEL and performs no value check.
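The race is easier to see when the interleaving is spelled out. The following sketch replays it deterministically against a tiny in-memory stand-in; `FakeRedis` is hypothetical and only mimics the handful of operations needed here, with `compare_and_delete` playing the role of the Lua script.

```python
class FakeRedis:
    """Tiny in-memory stand-in for a Redis server (illustration only)."""

    def __init__(self):
        self.data = {}

    def set_nx(self, key, value):
        # Set only if the key does not exist, like SET ... NX.
        if key in self.data:
            return False
        self.data[key] = value
        return True

    def get(self, key):
        return self.data.get(key)

    def delete(self, key):
        return self.data.pop(key, None) is not None

    def expire_now(self, key):
        # Simulates the TTL running out.
        self.data.pop(key, None)

    def compare_and_delete(self, key, expected):
        # What the Lua script does: check and delete as one indivisible step.
        if self.data.get(key) == expected:
            del self.data[key]
            return True
        return False


r = FakeRedis()
r.set_nx("lock:order:123", "token-A")

# Unsafe release by client A: GET, then (after a delay) DEL.
seen = r.get("lock:order:123")          # A sees its own token
r.expire_now("lock:order:123")          # A's TTL runs out here
r.set_nx("lock:order:123", "token-B")   # B acquires the lock
if seen == "token-A":
    r.delete("lock:order:123")          # A deletes B's lock
print(r.get("lock:order:123"))          # None: B's lock is gone

# Safe release: A's stale token no longer matches, so nothing is deleted.
r.set_nx("lock:order:123", "token-B")
print(r.compare_and_delete("lock:order:123", "token-A"))  # False
print(r.get("lock:order:123"))                            # token-B
```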
Redis Lock Decision Tree
If: Single Redis instance, no cluster
Use: Simple SET NX PX, accepting the risk of lock loss on restart or failover
If: Replicated Redis where a failover can drop the lock
Use: Redlock across independent Redis masters (acquire from a majority), but beware of clock drift
If: Need absolute correctness, can't tolerate split-brain
Use: Not Redis. Use ZooKeeper or etcd
[Figure: Redis Lock Lifecycle, from acquisition to split-brain risk. Stages: SET NX PX (atomic lock acquire with TTL); Hold Lock (client does critical work); GC Pause (TTL expires, lock lost); Second Client (acquires same lock); Split-Brain (both clients write concurrently). Warning: without safe deletion, TTL expiry causes data corruption.]

ZooKeeper and etcd: Strong Consistency at a Cost

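When correctness matters more than latency, use ZooKeeper or etcd. Both use consensus protocols (Zab and Raft respectively) to provide linearizable writes. The ZooKeeper lock pattern: each client creates an ephemeral sequential node under the lock path, and the client with the smallest sequence number holds the lock. When that client's session ends (disconnect or crash), the ephemeral node is deleted automatically, so no per-lock TTL is needed. etcd follows the same idea with a key attached to a lease: the lowest creation revision wins, and the key disappears when the lease expires.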
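The "smallest sequence number wins" rule can be sketched as a pure function, with no ZooKeeper client involved. The node names and the `lock-` prefix below are hypothetical; the only assumption is that each sequential node ends in a numeric suffix after the final hyphen. The function also returns which node a waiter should watch, on the assumption that each waiter watches only its immediate predecessor.

```python
def sequence_of(node_name):
    """Extract the numeric suffix appended to a sequential node name."""
    return int(node_name.rsplit("-", 1)[1])


def lock_state(children, my_node):
    """Return (holds_lock, node_to_watch) for `my_node` among `children`."""
    ordered = sorted(children, key=sequence_of)
    index = ordered.index(my_node)
    if index == 0:
        return True, None            # lowest sequence number holds the lock
    return False, ordered[index - 1]  # otherwise wait on the predecessor


children = ["lock-0000000012", "lock-0000000010", "lock-0000000011"]
print(lock_state(children, "lock-0000000010"))  # (True, None)
print(lock_state(children, "lock-0000000012"))  # (False, 'lock-0000000011')
```

Watching only the predecessor, rather than the whole lock directory, means a release wakes one waiter instead of all of them.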

This solves the lock expiration problem because the lock lives as long as the session. But it introduces new problems: session timeouts. If the client's session expires due to a network blip, the lock is released even though the client is still working. This can cause split-brain again. The solution is to use a fencing token: the lock service gives you a monotonically increasing token that you pass to the resource. The resource rejects any write with a stale token.

etcd's Go client ships a concurrency package whose Mutex type implements this pattern on top of leases; the key's creation revision can serve as the fencing token. ZooKeeper requires more manual work. The trade-off: ZooKeeper and etcd are slower than Redis (roughly 10-100 ms vs about 1 ms per operation) but provide stronger guarantees. Use them for critical resources like leader election or financial transactions.

etcd_lock.go (Go)
// io.thecodeforge — System Design tutorial

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Create a session with a 10-second TTL
	session, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Create a mutex for the resource
	mutex := concurrency.NewMutex(session, "/my-lock")

	// Acquire lock (blocks until acquired)
	if err := mutex.Lock(context.TODO()); err != nil {
		log.Fatal(err)
	}
	fmt.Println("Lock acquired")

	// Critical section
	time.Sleep(2 * time.Second)

	// Release lock
	if err := mutex.Unlock(context.TODO()); err != nil {
		log.Fatal(err)
	}
	fmt.Println("Lock released")
}
Output
Lock acquired
Lock released
Senior Shortcut: Fencing Tokens
Always use a fencing token with ZooKeeper/etcd locks. The token is a monotonically increasing integer from the lock service. Pass it to the resource (e.g., database). The resource rejects writes with a token older than the last write. This prevents stale lock holders from corrupting data.

Database-Based Locks: The Old Reliable

If you're already using a relational database, you can use row-level locks with SELECT FOR UPDATE. This is the simplest distributed lock — no extra infrastructure. The lock is held for the duration of the transaction. When the transaction commits or rolls back, the lock is released.

But there's a catch: database locks are coarse and can become a bottleneck. If you lock a row for too long, other transactions queue up. Also, if the application crashes mid-transaction, the lock is held until the database detects the dead connection (which can take minutes). This is why you should keep transactions short and use a timeout.

Another pattern: use a database table as a lock registry with a unique constraint on the lock name. INSERT a row with the lock name and a token. If the INSERT succeeds, you have the lock. DELETE to release. This is simple but doesn't handle crashes — you need a background job to clean up stale locks.

db_lock.sql (SQL)
-- io.thecodeforge — System Design tutorial

-- Create lock table
CREATE TABLE distributed_locks (
    lock_name VARCHAR(255) PRIMARY KEY,
    token VARCHAR(255) NOT NULL,
    acquired_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    ttl_seconds INT NOT NULL
);

-- Acquire lock, step 1: clear the lock if its TTL has passed
DELETE FROM distributed_locks
WHERE lock_name = 'order:123'
  AND acquired_at < NOW() - INTERVAL ttl_seconds SECOND;

-- Acquire lock, step 2: INSERT if not exists; the primary key rejects a second holder
INSERT IGNORE INTO distributed_locks (lock_name, token, ttl_seconds)
VALUES ('order:123', 'token-abc', 30);

-- Check if we acquired the lock (1 row inserted = acquired, 0 = held by someone else)
SELECT ROW_COUNT() > 0 AS acquired;

-- Release lock
DELETE FROM distributed_locks WHERE lock_name = 'order:123' AND token = 'token-abc';
Output
acquired: 1
Production Trap: Long Transactions
Holding a database lock for more than a few seconds can exhaust the connection pool, because every waiting transaction pins a connection, and it raises the odds of deadlocks. Keep critical sections short (aim for under 100ms). If you need longer, use a heartbeat or switch to Redis/ZooKeeper.
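The lock-registry pattern described in this section can be tried end to end with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: SQLite stands in for your relational database, the table stores an absolute `expires_at` instead of `acquired_at` plus a TTL, and the `now` parameter exists only so the example is deterministic. The function names are hypothetical.

```python
import sqlite3
import time
import uuid


def make_db():
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE distributed_locks (
               lock_name TEXT PRIMARY KEY,
               token TEXT NOT NULL,
               expires_at REAL NOT NULL
           )"""
    )
    return db


def acquire(db, lock_name, ttl_seconds, now=None):
    """Return a token if the lock was acquired, else None."""
    now = time.time() if now is None else now
    token = str(uuid.uuid4())
    with db:  # one transaction: purge an expired row, then try to insert
        db.execute(
            "DELETE FROM distributed_locks WHERE lock_name = ? AND expires_at <= ?",
            (lock_name, now),
        )
        try:
            db.execute(
                "INSERT INTO distributed_locks (lock_name, token, expires_at) "
                "VALUES (?, ?, ?)",
                (lock_name, token, now + ttl_seconds),
            )
        except sqlite3.IntegrityError:
            return None  # primary key conflict: someone else holds the lock
    return token


def release(db, lock_name, token):
    """Delete the lock row only if the token still matches."""
    with db:
        cur = db.execute(
            "DELETE FROM distributed_locks WHERE lock_name = ? AND token = ?",
            (lock_name, token),
        )
    return cur.rowcount == 1


db = make_db()
a = acquire(db, "order:123", ttl_seconds=30, now=1000.0)
b = acquire(db, "order:123", ttl_seconds=30, now=1010.0)  # still held by a
c = acquire(db, "order:123", ttl_seconds=30, now=1031.0)  # a expired, c takes over
print(a is not None, b is None, c is not None)
print(release(db, "order:123", a))  # False: a's token is stale
print(release(db, "order:123", c))  # True
```

Purging expired rows inside `acquire` removes the need for a separate cleanup job in this sketch. The unique constraint still decides the winner if two clients race to take over an expired lock.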

When Not to Use Distributed Locking

Distributed locking is a hammer, but not every problem is a nail. Sometimes you can avoid locks entirely by using idempotent operations or optimistic concurrency. For example, instead of locking an inventory row, use an atomic decrement: UPDATE inventory SET stock = stock - 1 WHERE stock > 0. If the update affects zero rows, you know stock was insufficient. No lock needed.

Another alternative: use a message queue with exactly-once semantics. Process orders sequentially from a single partition. This avoids locks but introduces ordering constraints. Or use a database transaction with SERIALIZABLE isolation — but that kills performance.

The rule of thumb: if you can design your system to be conflict-free (e.g., using event sourcing or CRDTs), do that instead. Distributed locking should be your last resort, not your first instinct. It adds latency, complexity, and failure modes.

atomic_decrement.sql (SQL)
-- io.thecodeforge — System Design tutorial

-- Atomic decrement with check — no lock needed
UPDATE inventory
SET stock = stock - 1
WHERE id = 123 AND stock > 0;

-- Check if the update succeeded
SELECT ROW_COUNT() > 0 AS decremented;
Output
decremented: 1
Senior Shortcut: Idempotency Keys
For payment processing, use an idempotency key. The client sends a unique key with each request. The server checks if it has seen the key before, using an atomic check such as a unique constraint on the key column. If yes, return the previous response. This eliminates the need for locks on the payment resource.
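As an illustration of the idempotency-key idea, the sketch below uses `sqlite3` with a primary key on the idempotency key so that the "have we seen this?" check is atomic. Everything here is hypothetical scaffolding: the `payments` table, the `charge` function, and the `charges` list that stands in for calls to a payment provider. A real system also has to decide how the stored record and the external side effect are kept consistent, which this sketch does not address.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE payments (idempotency_key TEXT PRIMARY KEY, response TEXT NOT NULL)"
)

charges = []  # stands in for calls to a payment provider


def charge(idempotency_key, amount_cents):
    """Charge at most once per key; replay the stored response on retries."""
    response = f"charged {amount_cents}"
    try:
        with db:
            # The primary key makes the duplicate check atomic.
            db.execute(
                "INSERT INTO payments (idempotency_key, response) VALUES (?, ?)",
                (idempotency_key, response),
            )
            charges.append(amount_cents)  # side effect runs only for a new key
    except sqlite3.IntegrityError:
        # Key already seen: return what we answered the first time.
        row = db.execute(
            "SELECT response FROM payments WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        return row[0]
    return response


print(charge("key-1", 500))  # charged 500
print(charge("key-1", 500))  # charged 500 (replayed, no second charge)
print(charges)               # [500]
```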
[Figure: Locking vs. Idempotency, when you can skip distributed locks. Distributed lock: requires external coordinator; adds latency and failure modes; complex fencing token logic; hard to debug split-brain. Idempotent operation: no coordinator needed; retry-safe by design; atomic SQL update with WHERE; simpler, faster, resilient. Idempotent operations avoid lock overhead entirely.]
Production incident · Post-mortem · Severity: high

The 4GB Container That Kept Dying

Symptom
A payment processing service would randomly crash with OOMKilled every few hours. No pattern in traffic spikes.
Assumption
Memory leak in the payment processing logic — team spent days profiling heap dumps.
Root cause
The distributed lock client (Redis-based) had a bug: on lock acquisition failure, it retried in a tight loop creating thousands of connections. Each connection consumed ~50KB until the container hit the 4GB limit and got killed by Kubernetes.
Fix
Set a max retry limit (3 retries with exponential backoff) and a connection pool limit (max 10 connections). Also added a circuit breaker to stop retrying after 5 consecutive failures.
Key lesson
  • Always bound retries and connections in distributed lock clients — an unconstrained retry loop is a self-inflicted DDoS.
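The bounded-retry part of that fix might look like the following sketch. It is not the code from the incident: `acquire_with_backoff` and `LockUnavailable` are hypothetical names, `try_acquire` is any callable that returns True when the lock is obtained, and the jitter strategy is one reasonable choice among several. The circuit breaker and connection pool limit are left out.

```python
import random
import time


class LockUnavailable(Exception):
    """Raised when the lock could not be acquired within the retry budget."""


def acquire_with_backoff(try_acquire, max_retries=3, base_delay_s=0.05,
                         sleep=time.sleep):
    """Call `try_acquire()` up to 1 + max_retries times, backing off between tries."""
    for attempt in range(max_retries + 1):
        if try_acquire():
            return attempt  # number of retries that were needed
        if attempt < max_retries:
            # Exponential backoff with jitter: 0 .. base * 2^attempt seconds.
            sleep(random.uniform(0, base_delay_s * (2 ** attempt)))
    raise LockUnavailable(f"gave up after {max_retries} retries")


# The lock becomes free on the third attempt.
outcomes = iter([False, False, True])
print(acquire_with_backoff(lambda: next(outcomes), sleep=lambda s: None))  # 2

# The lock never becomes free: exactly 4 attempts, then an error.
attempts = []
try:
    acquire_with_backoff(lambda: attempts.append(1) or False, sleep=lambda s: None)
except LockUnavailable as err:
    print(err, "after", len(attempts), "attempts")
```

Passing `sleep` as a parameter keeps the example fast and makes the retry logic easy to unit test.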
Production debug guide
Systematic recovery paths for the failure modes engineers actually hit.
Symptom · 01
Workers stuck waiting for lock — no progress
Fix
1. Check lock key TTL in Redis: TTL lock:key. 2. If TTL is -1 (no expiry), delete the key manually. 3. Check for a missing unlock() in the code, and make sure every lock is created with a TTL so that stale locks expire on their own.
Symptom · 02
Duplicate processing — two workers handle the same job
Fix
1. Check lock key value — are both workers using the same token? 2. Verify lock acquisition is atomic (SET NX). 3. Check for clock drift between nodes — sync NTP. 4. Add fencing token to reject stale writes.
Symptom · 03
Lock acquisition fails intermittently with 'connection refused'
Fix
1. Check Redis server health: redis-cli ping. 2. Check connection pool exhaustion — increase max connections. 3. Check for network partitions between app and Redis. 4. Add retry with backoff.
Distributed Locking Triage Cheat Sheet
First-response commands for when things go wrong, copy-paste ready.
Lock not releasing — workers stuck. Error: `Timeout waiting for lock`
Immediate action
Check if lock key exists and its TTL
Commands
redis-cli TTL lock:order:123
redis-cli GET lock:order:123
Fix now
If TTL is -1: DEL lock:order:123. If TTL > 0: wait or manually delete if safe.
Split-brain: two workers think they hold the lock. Error: `Duplicate order processed`
Immediate action
Check lock key value on both workers
Commands
redis-cli GET lock:order:123
Check worker logs for lock acquisition timestamps
Fix now
Add fencing token: increment a counter in Redis and pass it to the resource. Resource rejects writes with stale token.
Lock acquisition fails with `MOVED` error (Redis cluster)
Immediate action
Check cluster topology
Commands
redis-cli CLUSTER NODES
redis-cli CLUSTER KEYSLOT lock:order:123
Fix now
Ensure client uses cluster mode: redis.RedisCluster(startup_nodes=[...])
ZooKeeper lock session expired, lock lost. Error: `Session expired`
Immediate action
Check session timeout configuration
Commands
Inspect the lock node and its owning session: zkCli.sh stat /lock
Check client heartbeat interval
Fix now
Increase session timeout (e.g., 30s) and ensure client sends heartbeats. Use fencing token to handle stale lock holders.
Feature / Aspect | Redis (SET NX PX) | ZooKeeper / etcd
Consistency | Eventual (loss on failover) | Linearizable (strong)
Latency (p99) | 1-5 ms | 10-100 ms
Lock expiration | TTL-based (watchdog needed) | Session-based (ephemeral nodes)
Fencing token | Manual (e.g., increment counter) | Built-in (monotonic revision)
Complexity | Low | Medium-High
Best for | High-throughput, tolerate occasional split-brain | Critical resources, leader election

Key takeaways

1. Distributed locking is hard because of network partitions, clock drift, and process pauses: always use a fencing token to reject stale writes.
2. Redis locks are fast but not strongly consistent: use Redlock or ZooKeeper for critical resources.
3. Always release locks in a finally block and use a unique token to avoid deleting someone else's lock.
4. Consider avoiding locks entirely with idempotent operations or atomic database updates: locks are a last resort.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 (Senior): How does Redlock handle clock drift? What happens if two Redis nodes hav...
Q02 (Senior): When would you choose ZooKeeper over Redis for distributed locking in a ...
Q03 (Senior): What happens when a distributed lock holder pauses for a long GC — and h...
Q04 (Junior): What is a distributed lock and why is it needed?
Q05 (Senior): You're debugging a production issue where two workers processed the same...
Q06 (Senior): How would you design a distributed lock service for a global e-commerce ...
Q01 of 06 · Senior

How does Redlock handle clock drift? What happens if two Redis nodes have unsynchronized clocks?

ANSWER
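Redlock does not correct for clock drift; it assumes drift is bounded and small relative to the TTL (e.g., < 10ms). The client measures how long acquisition took and treats the lock as valid only for the TTL minus that elapsed time minus a drift allowance. If clocks drift more than the allowance, a key can expire early on some nodes while the client still believes it holds a majority, allowing another client to acquire the lock. Mitigation: use NTP with tight sync, and add a safety margin to the TTL (e.g., 2x expected drift).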
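The arithmetic behind that answer can be sketched in a few lines. The function below is a hypothetical helper, not part of any Redis client: it assumes a drift allowance of 1% of the TTL plus 2 ms, which is a common choice in Redlock implementations but should be treated as an assumption and tuned to your environment.

```python
def redlock_validity_ms(ttl_ms, elapsed_ms, acquired, total_nodes,
                        drift_factor=0.01):
    """Return the remaining validity in ms, or 0 if the lock must be treated as not held."""
    quorum = total_nodes // 2 + 1
    # Allowance for clock drift between nodes (assumed: 1% of TTL + 2 ms).
    drift_ms = int(ttl_ms * drift_factor) + 2
    validity = ttl_ms - elapsed_ms - drift_ms
    if acquired >= quorum and validity > 0:
        return validity
    return 0


print(redlock_validity_ms(5000, 120, acquired=3, total_nodes=5))   # 4828
print(redlock_validity_ms(5000, 120, acquired=2, total_nodes=5))   # 0: no majority
print(redlock_validity_ms(5000, 4990, acquired=5, total_nodes=5))  # 0: took too long
```

The critical section has to finish within the returned validity, not within the original TTL. If the result is 0, the client should release whatever keys it did set and retry later.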
Frequently Asked Questions

01. How do distributed locks work in Redis?
02. What's the difference between Redis and ZooKeeper for distributed locking?
03. How do I prevent split-brain in distributed locking?
04. What happens if a distributed lock holder crashes?