Senior 5 min · June 25, 2026

Distributed Locking: The Production Guide to Avoiding Split-Brain and Data Corruption

Q: How do distributed locks work in Redis?

Redis distributed locks work by atomically setting a key with a TTL using SET NX PX. The key acts as the lock. Only one client can set it at a time. To release, delete the key only if the value matches your unique token. For stronger guarantees, use Redlock, which acquires the lock from a majority of independent Redis nodes.

Q: What's the difference between Redis and ZooKeeper for distributed locking?

Redis is faster (1-5ms) but replicates asynchronously, so a lock can be lost on failover. ZooKeeper is slower (10-100ms) but provides linearizable writes and automatic lock release via ephemeral nodes. Use Redis for high-throughput workloads that can tolerate occasional split-brain. Use ZooKeeper for critical resources and for leader election.

Q: How do I prevent split-brain in distributed locking?

Use a fencing token: a monotonically increasing number from the lock service. Pass it to the resource (e.g., database). The resource rejects writes with a token older than the last write. Also ensure clock sync (NTP) and use a watchdog to extend lock TTL if processing takes longer than expected.

Q: What happens if a distributed lock holder crashes?

With Redis, the lock key has a TTL — it will expire automatically after the TTL. With ZooKeeper/etcd, the ephemeral node is deleted when the session times out (usually 10-30 seconds). During that window, no other client can acquire the lock. To reduce the window, use a short TTL or session timeout.

Distributed locking prevents race conditions in distributed systems.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Notes here come from systems that actually shipped.

✓ Production tested · Last updated June 25, 2026 · 1,663 articles, all by Naren

⚡Quick Answer

Distributed locking ensures only one process holds a lock on a resource at a time across a network. Use Redis Redlock, ZooKeeper ephemeral nodes, or database-based locks. Beware of clock drift, network partitions, and lock expiration — these cause split-brain scenarios where two processes believe they hold the lock.

✦ Definition~90s read

What is Distributed Locking?

Distributed locking is a mechanism to coordinate access to a shared resource across multiple processes or machines, ensuring mutual exclusion in a distributed environment. It prevents race conditions and data corruption when concurrent operations modify the same data.

Plain-English First

Imagine a single bathroom key for a whole office building. Only one person can use the bathroom at a time because they hold the physical key. Distributed locking is that key, but for computer resources across multiple servers. If someone forgets to return the key (lock expires), someone else might walk in on them — that's a split-brain bug.

Distributed locking is one of those things that sounds simple until it breaks your production database at 3 AM. The textbook says 'use a lock' — but the real world says 'your lock just failed and now you have duplicate payments.' This article is the no-bullshit guide to distributed locking: what works, what doesn't, and how to debug when it all goes wrong.

The core problem is mutual exclusion across machines. Without it, two services can simultaneously process the same order, decrement the same inventory twice, or overwrite each other's data. You need a lock that all nodes respect — and that's harder than it sounds because networks are unreliable, clocks drift, and processes crash.

By the end of this, you'll be able to choose the right locking strategy for your system, implement it without the classic mistakes, and diagnose failures when locks misbehave. You'll also know when not to use distributed locking at all — because sometimes the simplest solution is no lock.

Why Distributed Locking Is Hard: The Fallacies of Distributed Computing

Before we talk about solutions, let's talk about why this is a hard problem. The network is not reliable: packets drop, latency spikes, and partitions happen. Clocks are not perfectly synchronized: NTP only bounds the drift, and even with PTP you get some skew. Processes can pause for garbage collection or get preempted by the OS. These three facts make distributed locking fundamentally different from single-process locking.

Without distributed locking, two nodes can simultaneously modify the same resource. Classic example: an inventory service that decrements stock on order placement. Two orders come in at the same time, both read stock=1, both write stock=0 — you just oversold. The fix is a distributed lock that serializes access to the inventory row.

But here's the kicker: even with a lock, you can still get corruption if the lock expires while the holder is still working. This is the split-brain problem. The holder thinks it has the lock, but another node acquires it and starts modifying the same resource. Now you have two writers. This is why lock fencing (a monotonically increasing token) is critical — it lets the resource reject stale writes.

InventoryService.systemdesign · SYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Pseudocode for inventory decrement with distributed lock
function decrementStock(itemId, quantity) {
  // Acquire a lock scoped to this inventory item
  lock = redisLock.acquire("inventory:item:" + itemId, ttl=5000) // 5 second TTL
  if (!lock) {
    throw new Exception("Could not acquire lock, try again")
  }
  try {
    // Read current stock
    stock = db.query("SELECT stock FROM inventory WHERE id=?", itemId)
    if (stock < quantity) {
      throw new Exception("Insufficient stock")
    }
    // Update stock; this is only safe while the lock has not expired
    db.execute("UPDATE inventory SET stock = stock - ? WHERE id=?", quantity, itemId)
  } finally {
    // Always release the lock
    redisLock.release(lock)
  }
}

Output

No output — this is a pseudocode pattern.

Production Trap: Lock Expiration

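If your lock TTL is too short, the lock expires while the worker is still processing. Another worker grabs the lock and you get split-brain. Solution: run a watchdog thread that extends the TTL periodically while the work is in progress, and use a fencing token so the resource rejects stale writes. The watchdog alone is not enough, because a paused process (GC, preemption) cannot renew its lock.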
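A minimal sketch of the watchdog idea, assuming the lock client can supply an `extend()` callable that returns False once the lock is no longer ours. The class name `LockWatchdog`, the `lost` flag, and the wiring shown in the comments are illustrative assumptions, not part of any library.

```python
import threading

# Lua script an extend() callable could run: renew the TTL only if we still own the key.
EXTEND_LUA = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("pexpire", KEYS[1], ARGV[2])
else
    return 0
end
"""


class LockWatchdog:
    """Calls extend() every `interval` seconds until stopped or until the lock is lost."""

    def __init__(self, extend, interval):
        self._extend = extend
        self._interval = interval
        self._stop = threading.Event()
        self.lost = threading.Event()  # set when extend() reports we no longer own the lock
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait() returns False on timeout and True once stop was requested.
        while not self._stop.wait(self._interval):
            if not self._extend():
                self.lost.set()
                return

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        return False


# Hypothetical wiring with redis-py (not executed here):
#   extend = lambda: client.eval(EXTEND_LUA, 1, key, token, ttl_ms) == 1
#   with LockWatchdog(extend, interval=ttl_ms / 3000) as wd:   # renew at a third of the TTL
#       do_work()   # check wd.lost.is_set() before each write
```

Renewing at roughly a third of the TTL leaves room for a couple of missed renewals before the lock expires. The worker still has to check `lost` (and ideally pass a fencing token), because the watchdog cannot run while the whole process is paused.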

Locking Strategy Decision Tree

If: Need strong consistency and can tolerate latency
→ Use: ZooKeeper or etcd with ephemeral nodes and fencing tokens

If: Need low latency and can tolerate occasional lock failure
→ Use: Redis Redlock with short TTL and watchdog

If: Already using a relational database with transactions
→ Use: database row-level locks (SELECT FOR UPDATE), the simplest option

[Figure: Distributed Locking: Avoiding Split-Brain & Corruption]

Redis-Based Locks: The Good, the Bad, and the Split-Brain

Redis is the most popular choice for distributed locking because it's fast and simple. The basic pattern: SET key value NX PX 5000 — atomically set the key if it doesn't exist with a 5-second TTL. To release, delete the key only if the value matches your lock token (to avoid deleting someone else's lock).

But Redis has a fundamental problem: replication is asynchronous, so it's not strongly consistent. If the master goes down after acknowledging the write but before replicating it, the lock is lost. The promoted replica doesn't have the lock, so another client can acquire it. This is why Salvatore Sanfilippo, the creator of Redis, proposed Redlock: a quorum-based algorithm that acquires the lock from a majority of independent Redis masters.

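Redlock is controversial. Martin Kleppmann famously argued it's unsafe because its correctness depends on timing assumptions (bounded clock drift, no long GC pauses) and it does not issue fencing tokens. In practice, it works well if you have tight clock sync (NTP), short TTLs, and critical sections that finish well within the TTL. But if you need absolute correctness, use ZooKeeper or etcd together with fencing tokens. For most systems, Redis locks are good enough; just be aware of the edge cases.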

redis_lock.py · PYTHON

# io.thecodeforge — System Design tutorial

import uuid

import redis


class RedisLock:
    def __init__(self, client, key, ttl=5000):
        self.client = client
        self.key = key
        self.ttl = ttl  # milliseconds
        self.token = None  # set only while we believe we hold the lock

    def acquire(self):
        # Fresh token per attempt, so an old holder can never match a newer lock
        token = str(uuid.uuid4())
        # SET NX PX is atomic: set only if the key does not exist, with a TTL
        if self.client.set(self.key, token, nx=True, px=self.ttl):
            self.token = token
            return True
        return False

    def release(self):
        # Returns True if we deleted our own lock, False if it was already gone
        if self.token is None:
            return False
        # Lua script to atomically delete only if token matches
        lua = """
        if redis.call("get", KEYS[1]) == ARGV[1] then
            return redis.call("del", KEYS[1])
        else
            return 0
        end
        """
        released = self.client.eval(lua, 1, self.key, self.token) == 1
        self.token = None
        return released

# Usage
client = redis.Redis(host='localhost', port=6379)
lock = RedisLock(client, "lock:order:123", ttl=5000)
if lock.acquire():
    try:
        # Critical section
        print("Lock acquired, processing order 123")
    finally:
        lock.release()
else:
    print("Failed to acquire lock")

Output

Lock acquired, processing order 123

Never Do This: Non-Atomic Release

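Don't do GET + DEL in two separate commands. Between the GET and the DEL, your lock can expire and another client can acquire it. Your DEL then deletes someone else's lock and causes corruption. Always do the compare-and-delete atomically, with a Lua script like the one above. Note that UNLINK is not a substitute: it is just a non-blocking DEL and performs no token check.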
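The race is easier to see when the interleaving is forced. The sketch below uses a tiny in-memory stand-in for Redis (`FakeRedis` is a hypothetical helper, with `expire_now` simulating a TTL firing) and replays the bad ordering step by step, then the same ordering with an atomic compare-and-delete.

```python
class FakeRedis:
    """In-memory stand-in for one Redis key space (illustration only, no real TTLs)."""

    def __init__(self):
        self.data = {}

    def set_nx(self, key, token):
        if key in self.data:
            return False
        self.data[key] = token
        return True

    def get(self, key):
        return self.data.get(key)

    def delete(self, key):
        return self.data.pop(key, None) is not None

    def expire_now(self, key):
        # Simulates the TTL firing
        self.data.pop(key, None)

    def compare_and_delete(self, key, token):
        # What the Lua script does: check and delete as one indivisible step
        if self.data.get(key) == token:
            del self.data[key]
            return True
        return False


def unsafe_scenario():
    r = FakeRedis()
    r.set_nx("lock", "A")        # client A acquires
    seen = r.get("lock")         # A: GET, sees its own token
    r.expire_now("lock")         # A's TTL fires before A sends DEL
    r.set_nx("lock", "B")        # client B acquires
    if seen == "A":
        r.delete("lock")         # A: DEL, which removes B's lock
    return r.get("lock")


def safe_scenario():
    r = FakeRedis()
    r.set_nx("lock", "A")
    r.expire_now("lock")
    r.set_nx("lock", "B")
    r.compare_and_delete("lock", "A")  # token mismatch, nothing deleted
    return r.get("lock")
```

In the unsafe ordering the key ends up missing even though B still believes it holds the lock; in the safe ordering B's token survives.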

Redis Lock Decision Tree

If: Single Redis instance, no replication
→ Use: simple SET NX PX, but the lock is lost if that instance fails or restarts

If: Redis with replication and failover
→ Use: Redlock (acquire from a majority of independent masters), but beware of clock drift

If: Need absolute correctness, can't tolerate split-brain
→ Use: ZooKeeper or etcd instead of Redis

[Figure: Redis Lock Lifecycle]

ZooKeeper and etcd: Strong Consistency at a Cost

When correctness matters more than latency, use ZooKeeper or etcd. Both use consensus algorithms (Zab and Raft respectively) to provide linearizable writes. The ZooKeeper lock pattern: each client creates an ephemeral sequential node, and the client with the smallest sequence number holds the lock. When the client disconnects (or crashes) and its session ends, the ephemeral node is automatically deleted, so no per-lock TTL is needed. etcd works similarly: the lock key is attached to a lease that the client keeps alive.

This solves the lock expiration problem because the lock lives as long as the session. But it introduces new problems: session timeouts. If the client's session expires due to a network blip, the lock is released even though the client is still working. This can cause split-brain again. The solution is to use a fencing token: the lock service gives you a monotonically increasing token that you pass to the resource. The resource rejects any write with a stale token.

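etcd's Go client has a built-in locking package (concurrency, which provides Session and Mutex) that handles lease keep-alives for you. etcd revisions are monotonic, so the revision at which the lock key was created can serve as the fencing token. ZooKeeper requires more manual work. The trade-off: ZooKeeper/etcd are slower than Redis (10-100ms vs 1-5ms) but provide stronger guarantees. Use them for critical work such as leader election or financial transactions.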

etcd_lock.go · GO

// io.thecodeforge — System Design tutorial

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Create a session with a 10-second TTL
	session, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Create a mutex for the resource
	mutex := concurrency.NewMutex(session, "/my-lock")

	// Acquire lock (blocks until acquired)
	if err := mutex.Lock(context.TODO()); err != nil {
		log.Fatal(err)
	}
	fmt.Println("Lock acquired")

	// Critical section
	time.Sleep(2 * time.Second)

	// Release lock
	if err := mutex.Unlock(context.TODO()); err != nil {
		log.Fatal(err)
	}
	fmt.Println("Lock released")
}

Output

Lock acquired

Lock released
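For the ZooKeeper side of the pattern (ephemeral sequential nodes, smallest sequence number wins), the decision each client makes can be sketched without a ZooKeeper client at all. The function below is a hypothetical helper that assumes node names end in a dash followed by the sequence number, such as `lock-0000000007`. It returns whether the caller holds the lock and, if not, which node it should watch.

```python
def lock_state(children, my_node):
    """Return (holds_lock, node_to_watch) for an ephemeral-sequential lock.

    children: names of all nodes currently under the lock's parent node
    my_node:  the name of the node this client created
    """
    # Order by the numeric sequence suffix, not by the raw string
    ordered = sorted(children, key=lambda name: int(name.rsplit("-", 1)[1]))
    index = ordered.index(my_node)
    if index == 0:
        return True, None
    # Watch only the immediate predecessor; when it disappears, re-check.
    return False, ordered[index - 1]
```

Watching only the predecessor, rather than the whole parent node, means a release wakes up one waiter instead of all of them.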

Senior Shortcut: Fencing Tokens

Always use a fencing token with ZooKeeper/etcd locks. The token is a monotonically increasing integer from the lock service. Pass it to the resource (e.g., database). The resource rejects writes with a token older than the last write. This prevents stale lock holders from corrupting data.
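As a rough illustration of the check the resource performs, here is an in-memory sketch. `LockService` and `FencedResource` are hypothetical names; in a real system the token would come from the lock service and the comparison would live in the database or storage layer. The sketch accepts a repeated token so the current holder can write more than once.

```python
import itertools


class LockService:
    """Hypothetical lock service: hands out a strictly increasing token per grant."""

    def __init__(self):
        self._tokens = itertools.count(1)

    def grant(self):
        return next(self._tokens)


class FencedResource:
    """Remembers the highest token seen and rejects writes carrying an older one."""

    def __init__(self):
        self.value = None
        self.highest_token = 0

    def write(self, token, value):
        if token < self.highest_token:
            return False  # stale holder: its lock was superseded
        self.highest_token = token
        self.value = value
        return True


service = LockService()
token_a = service.grant()   # worker A gets the lock, then stalls (e.g. long GC pause)
token_b = service.grant()   # A's lock expires, worker B gets the lock

resource = FencedResource()
accepted_b = resource.write(token_b, "written by B")
accepted_a = resource.write(token_a, "written by A")  # A wakes up and tries to write
```

Here `accepted_b` is True and `accepted_a` is False, so the resource keeps B's value even though A still believes it holds the lock.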

Database-Based Locks: The Old Reliable

If you're already using a relational database, you can use row-level locks with SELECT FOR UPDATE. This is the simplest distributed lock — no extra infrastructure. The lock is held for the duration of the transaction. When the transaction commits or rolls back, the lock is released.

But there's a catch: database locks are coarse and can become a bottleneck. If you lock a row for too long, other transactions queue up. Also, if the application crashes mid-transaction, the lock is held until the database detects the dead connection (which can take minutes). This is why you should keep transactions short and use a timeout.

Another pattern: use a database table as a lock registry with a unique constraint on the lock name. INSERT a row with the lock name and a token. If the INSERT succeeds, you have the lock. DELETE to release. This is simple but doesn't handle crashes — you need a background job to clean up stale locks.

db_lock.sql · SQL

-- io.thecodeforge — System Design tutorial

-- Create lock table
CREATE TABLE distributed_locks (
    lock_name VARCHAR(255) PRIMARY KEY,
    token VARCHAR(255) NOT NULL,
    acquired_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    ttl_seconds INT NOT NULL
);

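-- Acquire lock (MySQL): insert the row, or take it over if the existing lock has expired.
-- Re-running the statement with the same token extends the lock.
-- This relies on MySQL evaluating the assignments left to right: token is assigned
-- first (it still sees the old acquired_at and ttl_seconds), and the later
-- assignments compare against the already-updated token.
INSERT INTO distributed_locks (lock_name, token, ttl_seconds)
VALUES ('order:123', 'token-abc', 30)
ON DUPLICATE KEY UPDATE
    token = IF(acquired_at < NOW() - INTERVAL ttl_seconds SECOND, VALUES(token), token),
    acquired_at = IF(token = VALUES(token), NOW(), acquired_at),
    ttl_seconds = IF(token = VALUES(token), VALUES(ttl_seconds), ttl_seconds);

-- Check if we acquired the lock: the stored token is ours only if the statement above won
SELECT token = 'token-abc' AS acquired
FROM distributed_locks
WHERE lock_name = 'order:123';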

-- Release lock
DELETE FROM distributed_locks WHERE lock_name = 'order:123' AND token = 'token-abc';

Output

acquired: 1
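The same lock-table idea can be tried locally with SQLite. This is a sketch under stated assumptions: the schema stores an absolute `expires_at` instead of `acquired_at` plus `ttl_seconds`, the helper names are made up for illustration, and the `now` parameter exists only so expiry can be exercised without sleeping. It also uses the application clock, whereas a production version would more safely use the database server's time.

```python
import sqlite3
import time
import uuid


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE distributed_locks ("
        " lock_name TEXT PRIMARY KEY,"
        " token TEXT NOT NULL,"
        " expires_at REAL NOT NULL)"
    )
    return conn


def acquire(conn, lock_name, ttl_seconds, now=None):
    """Return a token if the lock was acquired, else None."""
    now = time.time() if now is None else now
    token = str(uuid.uuid4())
    with conn:  # one transaction: clear an expired lock, then try to insert ours
        conn.execute(
            "DELETE FROM distributed_locks WHERE lock_name = ? AND expires_at <= ?",
            (lock_name, now),
        )
        try:
            conn.execute(
                "INSERT INTO distributed_locks VALUES (?, ?, ?)",
                (lock_name, token, now + ttl_seconds),
            )
        except sqlite3.IntegrityError:
            return None  # primary key conflict: someone else holds a live lock
    return token


def release(conn, lock_name, token):
    """Delete the lock only if the token matches; return True if we released it."""
    with conn:
        cur = conn.execute(
            "DELETE FROM distributed_locks WHERE lock_name = ? AND token = ?",
            (lock_name, token),
        )
    return cur.rowcount == 1
```

Because expiry is handled inside `acquire`, a crashed holder does not need a separate cleanup job in this variant; the next caller removes the stale row.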

Production Trap: Long Transactions

Holding a database lock for more than a few seconds can exhaust the connection pool and trigger lock-wait timeouts or deadlocks. Keep critical sections short (the target here is under 100ms). If you need longer, use a lease-style lock with a heartbeat or switch to Redis/ZooKeeper.

When Not to Use Distributed Locking

Distributed locking is a hammer, but not every problem is a nail. Sometimes you can avoid locks entirely by using idempotent operations or optimistic concurrency. For example, instead of locking an inventory row, use an atomic decrement: UPDATE inventory SET stock = stock - 1 WHERE id = 123 AND stock > 0. If the update affects zero rows, you know stock was insufficient. No lock needed.

Another alternative: use a message queue with exactly-once semantics. Process orders sequentially from a single partition. This avoids locks but introduces ordering constraints. Or use a database transaction with SERIALIZABLE isolation — but that kills performance.

The rule of thumb: if you can design your system to be conflict-free (e.g., using event sourcing or CRDTs), do that instead. Distributed locking should be your last resort, not your first instinct. It adds latency, complexity, and failure modes.

atomic_decrement.sql · SQL

-- io.thecodeforge — System Design tutorial

-- Atomic decrement with check — no lock needed
UPDATE inventory
SET stock = stock - 1
WHERE id = 123 AND stock > 0;

-- Check if the update succeeded
SELECT ROW_COUNT() > 0 AS decremented;

Output

decremented: 1
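Optimistic concurrency, mentioned above as the other lock-free option, usually means a version column and a conditional update. The sketch below uses SQLite and assumes a `version` column that the article's `inventory` table does not define; the helper names are illustrative.

```python
import sqlite3


def make_inventory(stock):
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inventory (id INTEGER PRIMARY KEY, stock INTEGER, version INTEGER)"
    )
    conn.execute("INSERT INTO inventory VALUES (123, ?, 0)", (stock,))
    conn.commit()
    return conn


def read_item(conn, item_id):
    """Return (stock, version) as seen by this reader."""
    return conn.execute(
        "SELECT stock, version FROM inventory WHERE id = ?", (item_id,)
    ).fetchone()


def try_write(conn, item_id, new_stock, expected_version):
    """Write only if nobody changed the row since we read it."""
    with conn:
        cur = conn.execute(
            "UPDATE inventory SET stock = ?, version = version + 1"
            " WHERE id = ? AND version = ?",
            (new_stock, item_id, expected_version),
        )
    return cur.rowcount == 1  # False means a concurrent writer won; re-read and retry
```

Two workers that both read `stock = 1` at version 0 will both attempt the write, but only the first update matches `version = 0`. The loser sees zero affected rows and must re-read before deciding what to do.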

Senior Shortcut: Idempotency Keys

For payment processing, use an idempotency key. The client sends a unique key with each request. The server records the key atomically (for example, under a unique constraint) and checks if it has seen the key before. If yes, it returns the previous response instead of charging again. This eliminates the need for locks on the payment resource.
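A small sketch of that flow, using SQLite's primary key as the unique constraint. The table name, the `charge` function, and the response shape are all hypothetical; a real service would also have to decide how the external charge call is ordered relative to the insert.

```python
import json
import sqlite3


def make_payments_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payments (idempotency_key TEXT PRIMARY KEY, response TEXT NOT NULL)"
    )
    return conn


def charge(conn, idempotency_key, amount_cents):
    """Return (response, newly_processed). A repeated key replays the stored response."""
    response = {"status": "charged", "amount_cents": amount_cents}
    try:
        with conn:
            # The primary key makes "have we seen this key?" an atomic check-and-record
            conn.execute(
                "INSERT INTO payments VALUES (?, ?)",
                (idempotency_key, json.dumps(response)),
            )
    except sqlite3.IntegrityError:
        row = conn.execute(
            "SELECT response FROM payments WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        return json.loads(row[0]), False
    return response, True
```

A client retry with the same key gets the original response back and `newly_processed` is False, so the payment is recorded once no matter how many times the request is delivered.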

[Figure: Locking vs. Idempotency]

● Production incident · POST-MORTEM · severity: high

The 4GB Container That Kept Dying

Symptom

A payment processing service would randomly crash with OOMKilled every few hours. No pattern in traffic spikes.

Assumption

Memory leak in the payment processing logic — team spent days profiling heap dumps.

Root cause

The distributed lock client (Redis-based) had a bug: on lock acquisition failure, it retried in a tight loop creating thousands of connections. Each connection consumed ~50KB until the container hit the 4GB limit and got killed by Kubernetes.

Fix

Set a max retry limit (3 retries with exponential backoff) and a connection pool limit (max 10 connections). Also added a circuit breaker to stop retrying after 5 consecutive failures.

Key lesson

Always bound retries and connections in distributed lock clients — an unconstrained retry loop is a self-inflicted DDoS.
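A bounded retry loop along the lines of the fix could look like the sketch below. The function name and parameters are illustrative, and the random jitter is an addition not mentioned in the post-mortem; `sleep` is injectable so the behaviour can be checked without waiting.

```python
import random
import time


def acquire_with_backoff(try_acquire, max_retries=3, base_delay=0.1,
                         max_delay=2.0, sleep=time.sleep):
    """Call try_acquire() once, then retry at most max_retries times with backoff.

    Returns True as soon as an attempt succeeds, False when retries are exhausted.
    """
    for attempt in range(max_retries + 1):
        if try_acquire():
            return True
        if attempt == max_retries:
            break  # out of retries: give up instead of looping forever
        # Exponential backoff capped at max_delay, with random jitter (an assumption
        # added here) so many clients do not retry in lockstep
        delay = min(max_delay, base_delay * (2 ** attempt))
        sleep(random.uniform(0, delay))
    return False
```

With the defaults, a client that never gets the lock makes four attempts in total and then returns False, which gives the caller a clear point to fail the request or trip a circuit breaker.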

Production debug guide

Systematic recovery paths for the failure modes engineers actually hit. (3 entries)

Symptom · 01

Workers stuck waiting for lock — no progress

→

Fix

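1. Check lock key TTL in Redis: TTL lock:key. 2. If TTL is -1 (no expiry), the lock was set without a TTL; confirm the holder is dead, then delete the key manually. 3. Check for a missing unlock() in code, and always acquire with a TTL (SET NX PX) so stale locks expire on their own.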

Symptom · 02

Duplicate processing — two workers handle the same job

→

Fix

1. Check lock key value — are both workers using the same token? 2. Verify lock acquisition is atomic (SET NX). 3. Check for clock drift between nodes — sync NTP. 4. Add fencing token to reject stale writes.

Symptom · 03

Lock acquisition fails intermittently with 'connection refused'

→

Fix

1. Check Redis server health: redis-cli ping. 2. Check connection pool exhaustion — increase max connections. 3. Check for network partitions between app and Redis. 4. Add retry with backoff.

★ Distributed Locking Triage Cheat Sheet

First-response commands for when things go wrong (copy-paste ready).

Lock not releasing: workers stuck. Error: `Timeout waiting for lock`

Immediate action

Check if lock key exists and its TTL

Commands

redis-cli TTL lock:order:123

redis-cli GET lock:order:123

Fix now

If TTL is -1: DEL lock:order:123. If TTL > 0: wait or manually delete if safe.

Split-brain: two workers think they hold the lock. Error: `Duplicate order processed`

Lock acquisition fails with `MOVED` error (Redis cluster)

ZooKeeper lock session expired, lock lost. Error: `Session expired`

Feature / Aspect	Redis (SET NX PX)	ZooKeeper / etcd
Consistency	Eventual (loss on failover)	Linearizable (strong)
Latency (p99)	1-5 ms	10-100 ms
Lock expiration	TTL-based (watchdog needed)	Session-based (ephemeral nodes)
Fencing token	Manual (e.g., increment counter)	Built-in (monotonic revision)
Complexity	Low	Medium-High
Best for	High-throughput, tolerate occasional split-brain	Critical resources, leader election

Key takeaways

Distributed locking is hard because of network partitions, clock drift, and process pauses: always use a fencing token to reject stale writes.

Redis locks are fast but not strongly consistent: use Redlock or ZooKeeper for critical resources.

Always release locks in a finally block and use a unique token to avoid deleting someone else's lock.

Consider avoiding locks entirely with idempotent operations or atomic database updates: locks are a last resort.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

How does Redlock handle clock drift? What happens if two Redis nodes hav...

Q02 · SENIOR

When would you choose ZooKeeper over Redis for distributed locking in a ...

Q03 · SENIOR

What happens when a distributed lock holder pauses for a long GC — and h...

Q04 · JUNIOR

What is a distributed lock and why is it needed?

Q05 · SENIOR

You're debugging a production issue where two workers processed the same...

Q06 · SENIOR

How would you design a distributed lock service for a global e-commerce ...

Q01 of 06 · SENIOR

How does Redlock handle clock drift? What happens if two Redis nodes have unsynchronized clocks?

ANSWER

Redlock assumes bounded clock drift (e.g., < 10ms). If clocks drift more, a lock might be acquired on a majority of nodes but the TTL expires earlier on some, allowing another client to acquire the lock. Mitigation: use NTP with tight sync, and add a safety margin to TTL (e.g., 2x expected drift).
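The arithmetic behind that answer can be sketched as follows. In the published Redlock description, the time a lock remains valid is the TTL minus the time spent acquiring it minus an allowance for clock drift, and the lock only counts if a majority of nodes granted it. The drift allowance used here (1% of the TTL plus 2 ms) mirrors what common client implementations use and should be treated as an assumption, as should the function names.

```python
def redlock_validity_ms(ttl_ms, elapsed_ms, drift_factor=0.01):
    """Remaining time the lock can be trusted, after acquisition time and drift."""
    drift_ms = ttl_ms * drift_factor + 2  # assumed drift allowance
    return ttl_ms - elapsed_ms - drift_ms


def redlock_acquired(successes, total_nodes, ttl_ms, elapsed_ms):
    """The lock is held only with a majority of nodes AND positive validity time."""
    quorum = total_nodes // 2 + 1
    return successes >= quorum and redlock_validity_ms(ttl_ms, elapsed_ms) > 0
```

For a 5000 ms TTL that took 120 ms to acquire, the usable window is about 4828 ms. If acquisition itself takes nearly the whole TTL, the validity goes negative and the client should release on all nodes and retry, even though a majority said yes.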

