Senior 3 min · June 25, 2026

Distributed Locking Service: Build a Production-Grade Lock That Won't Fail at 3 AM

Q: What is the best distributed locking service for production?

There's no single best. For most applications, Redis Redlock is sufficient and fast. For strict correctness, use ZooKeeper or etcd. The choice depends on your tolerance for rare double-lock events and your operational capabilities.

Q: What's the difference between Redis Redlock and ZooKeeper locks?

Redlock uses timeouts and majority voting across independent Redis nodes, making it fast but dependent on timing assumptions: clock jumps, long pauses, or delayed messages can produce two holders. ZooKeeper uses ephemeral sequential nodes on top of a consensus protocol, making it safe under partitions but slower. Use Redlock for performance, ZooKeeper for correctness.

Q: How do I implement a distributed lock in Redis?

Use SET key value NX PX <ttl> to acquire the lock atomically (plain SETNX cannot set a TTL in the same command), and a Lua script to release it (checking the lock value). For production, use the Redlock algorithm with multiple Redis nodes. Never use DEL without checking the value.

Q: How do you handle a distributed lock that expires while the holder is still working?

Use a watchdog thread that extends the lock TTL periodically. Also use fencing tokens so that if the lock expires and another client acquires it, the original client's writes are rejected by the resource.

Design a distributed locking service that survives network partitions, clock skew, and GC pauses.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Everything here is grounded in real deployments.

✓ Production

production tested

June 25, 2026

last updated

1,663

articles · all by Naren


⚡Quick Answer

Use Redis Redlock for most cases, but understand its limitations under network partitions. For strict correctness, use ZooKeeper or etcd with fencing tokens. Never implement your own consensus algorithm.

✦ Definition · ~90s read

What Is a Distributed Locking Service?

A distributed locking service coordinates exclusive access to shared resources across multiple machines. It ensures only one process holds a lock at a time, even when nodes crash, networks partition, or clocks drift.

Plain-English First

Imagine a single physical key to the only bathroom in a busy office. People line up, take the key, lock the door, do their business, and return the key. Distributed locking is that key, but spread across multiple buildings (servers) that must agree on who has it. If someone walks off with the key (a crash), the system must detect that and cut a new key without letting two people into the bathroom at once.

Distributed locking is the single most over-engineered and misunderstood primitive in distributed systems. I've seen it take down payment pipelines, corrupt databases, and cause 3 AM pages that no one could explain. The problem isn't the lock itself — it's the assumptions you make about time, failures, and consensus. By the end of this, you'll design a distributed lock that survives network partitions, clock skew, and GC pauses. You'll know exactly when to use Redis, ZooKeeper, or etcd, and more importantly, when not to use any of them.

Why Your Lock Will Fail: The Three Enemies of Distributed Locks

Before you write a single line of code, understand the three things that will break your lock in production. First: network partitions. If a lock holder gets partitioned, it can't release the lock, and the system must decide when to break the lock. Second: clock skew. If your lock uses timeouts and clocks drift, you get two lock holders. Third: GC pauses. A 30-second GC pause can outlive your lock TTL, and suddenly two processes think they own the resource. These aren't edge cases — they're the normal state of a distributed system. Design for them or don't bother locking.

LockFailureScenarios.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

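// Scenario: a GC pause outlives the lock TTL.
// redis.setnx stands for an atomic SET resource_lock <instanceId> NX EX 30
// (plain SETNX cannot set a TTL in the same command).
boolean locked = redis.setnx("resource_lock", instanceId, 30);
// ... 45 seconds of stop-the-world GC ...
// The lock expired and another instance acquired it.
// Now two instances think they hold the lock.
// Data corruption follows.

// Mitigation for long-running work: a watchdog thread that renews the TTL.
// A stop-the-world pause freezes the watchdog too, so pair it with fencing tokens.
Thread watchdog = new Thread(() -> {
    try {
        while (holdingLock) {
            // extendIfOwner is a hypothetical helper: compare the stored value
            // with instanceId and EXPIRE in one Lua script, never a blind EXPIRE
            redis.extendIfOwner("resource_lock", instanceId, 30);
            Thread.sleep(10_000);
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // stop renewing when interrupted
    }
});
watchdog.setDaemon(true);
watchdog.start();

Output

No direct output. This is a design pattern. The watchdog keeps the lock alive during long-running work; fencing tokens cover the case where a pause freezes the watchdog as well.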

Production Trap:

Setting lock TTL to 30 seconds because 'nothing takes that long' is how you get paged at 3 AM. Measure your worst-case pause, not your average.
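
To see why the TTL has to be sized against the worst-case pause, here is a minimal in-memory sketch of the GC-pause scenario above. It involves no Redis at all: `LeaseExpiryDemo` and its fake clock are hypothetical stand-ins I made up so the 45-second pause can be replayed deterministically.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for a TTL lock store, driven by a manually advanced clock. */
final class LeaseExpiryDemo {
    private final Map<String, String> owners = new HashMap<>();
    private final Map<String, Long> expiresAtMs = new HashMap<>();
    private long nowMs = 0; // fake clock so the pause is deterministic

    void advance(long ms) {
        nowMs += ms;
    }

    /** Mirrors SET key owner NX PX ttl: succeeds only if the key is absent or expired. */
    boolean tryAcquire(String key, String owner, long ttlMs) {
        Long expiry = expiresAtMs.get(key);
        if (expiry != null && expiry > nowMs) {
            return false; // still held by someone
        }
        owners.put(key, owner);
        expiresAtMs.put(key, nowMs + ttlMs);
        return true;
    }

    /** True only while the lease is unexpired and still belongs to owner. */
    boolean isHeldBy(String key, String owner) {
        Long expiry = expiresAtMs.get(key);
        return expiry != null && expiry > nowMs && owner.equals(owners.get(key));
    }

    public static void main(String[] args) {
        LeaseExpiryDemo store = new LeaseExpiryDemo();
        System.out.println("A acquires: " + store.tryAcquire("resource_lock", "A", 30_000));
        System.out.println("B acquires: " + store.tryAcquire("resource_lock", "B", 30_000));
        store.advance(45_000); // A is frozen in a 45-second pause
        System.out.println("B acquires after pause: " + store.tryAcquire("resource_lock", "B", 30_000));
        // A wakes up; unless it re-checks ownership, it keeps writing as if it held the lock
        System.out.println("A still owner: " + store.isHeldBy("resource_lock", "A"));
    }
}
```

The last line prints `false`: from the store's point of view A lost the lock 15 seconds before it woke up. Nothing tells A that, which is the whole problem.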

Lock Failure Mode Decision Tree

If network partition > lock TTL → use fencing tokens with monotonic IDs

If clock skew > 10% of TTL → use ZooKeeper or etcd (no reliance on synchronized wall clocks)

If GC pauses > 50% of TTL → add a watchdog thread, reduce heap, tune GC

[Figure: Distributed Locking Service Architecture]

Redis Redlock: The Good, The Bad, and The Ugly

Redlock is the most popular distributed lock algorithm, and it's also the most controversial. It works by acquiring the lock on a majority of independent Redis nodes within a bounded time. The good: it's fast, simple, and works for most cases. The bad: its safety rests on timing assumptions. A clock jump on a Redis node, a node that restarts without persisted lock state, or a client that pauses past the TTL can each leave two clients believing they hold the lock. The ugly: Martin Kleppmann wrote a famous critique arguing that Redlock cannot guarantee mutual exclusion in an asynchronous system and provides no fencing tokens. For most applications, it's fine. For financial systems, don't touch it.

RedlockExample.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

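// Redlock implementation (simplified)
import java.util.List;

public class Redlock {
    private static final int QUORUM = 3; // out of 5 nodes
    private static final int TTL = 1000; // ms

    // RedisNode is a thin wrapper around one connection to one independent Redis node
    private final List<RedisNode> nodes;

    public Redlock(List<RedisNode> nodes) {
        this.nodes = nodes;
    }

    public boolean acquire(String resource, String instanceId) {
        int acquired = 0;
        long start = System.nanoTime(); // monotonic, unaffected by wall-clock adjustments
        for (RedisNode node : nodes) {
            // setnx must map to an atomic SET resource instanceId NX PX TTL
            if (node.setnx(resource, instanceId, TTL)) {
                acquired++;
            }
        }
        // Check if we got majority AND didn't take too long
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        if (acquired >= QUORUM && elapsedMs < TTL) {
            return true;
        }
        release(resource, instanceId); // failed: undo any partial acquisitions
        return false;
    }

    public void release(String resource, String instanceId) {
        // Lua script to ensure we only delete our own lock
        String script = "if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end";
        for (RedisNode node : nodes) {
            node.eval(script, 1, resource, instanceId);
        }
    }
}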

Output

No direct output — this is a code pattern. The Lua script prevents deleting another client's lock.

Senior Shortcut:

Redlock's safety depends on clocks across nodes advancing at nearly the same rate, with no sudden jumps. If you can't guarantee NTP within 10ms, don't use it. Use ZooKeeper instead: it orders lock requests through consensus, not wall-clock time.
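
The two checks behind a Redlock acquisition, quorum and remaining validity time, come down to simple arithmetic. The sketch below is mine, not part of the simplified class above. The drift allowance of 1% of the TTL plus 2 ms is an assumption borrowed from common open-source Redlock clients, and `RedlockMath` is a hypothetical name.

```java
/** Arithmetic behind a Redlock acquisition decision. */
final class RedlockMath {
    // Assumption: drift allowance of 1% of the TTL plus 2 ms, as in common Redlock clients
    static final double CLOCK_DRIFT_FACTOR = 0.01;

    /** Majority of the full node set, not of the nodes that happen to be reachable. */
    static int quorum(int totalNodes) {
        return totalNodes / 2 + 1;
    }

    /** Time the lock can still be trusted after acquisition; zero or less means failure. */
    static long validityMs(long ttlMs, long elapsedMs) {
        long driftMs = (long) (ttlMs * CLOCK_DRIFT_FACTOR) + 2;
        return ttlMs - elapsedMs - driftMs;
    }

    static boolean acquired(int nodesLocked, int totalNodes, long ttlMs, long elapsedMs) {
        return nodesLocked >= quorum(totalNodes) && validityMs(ttlMs, elapsedMs) > 0;
    }

    public static void main(String[] args) {
        System.out.println(quorum(5));                 // 3
        System.out.println(validityMs(1000, 200));     // 1000 - 200 - 12 = 788
        System.out.println(acquired(3, 5, 1000, 200)); // true
        System.out.println(acquired(2, 5, 1000, 200)); // false: two nodes are not a majority of five
        System.out.println(acquired(3, 5, 1000, 995)); // false: acquisition took too long
    }
}
```

The work you do under the lock has to fit inside the validity time, not the raw TTL. With a 1000 ms TTL and 200 ms spent acquiring, you have roughly 788 ms, not a full second.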

Redlock or Not?

If you need strong consistency (financial) → use ZooKeeper or etcd

If you can tolerate a rare double-lock (caching) → Redlock is fine

If network partitions are frequent → avoid Redlock; use a consensus-based lock

ZooKeeper Locks: The Gold Standard for Correctness

ZooKeeper provides a lock whose ordering is decided by consensus rather than by clients comparing clocks. It uses ephemeral sequential nodes: each client creates an ephemeral node, and the one with the smallest sequence number holds the lock. If the client crashes, its session ends and the ephemeral node disappears automatically. Clock skew between clients does not affect who holds the lock. A GC pause longer than the session timeout can still expire the session while the client believes it holds the lock, so keep fencing tokens on the resource side. The trade-off: higher latency (multiple RTTs per lock acquisition) and more complex operations. Use this when correctness is non-negotiable.

ZooKeeperLock.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

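// ZooKeeper distributed lock using ephemeral sequential nodes
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkLock {
    private final ZooKeeper zk;
    private final String lockPath = "/locks/resource";
    private String currentNode;

    public ZkLock(ZooKeeper zk) {
        this.zk = zk;
    }

    public boolean acquire() throws Exception {
        // Create an ephemeral sequential node
        currentNode = zk.create(lockPath + "/lock-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        String myName = currentNode.substring(lockPath.length() + 1);

        while (true) {
            // Get all children and check if we are the smallest
            List<String> children = zk.getChildren(lockPath, false);
            Collections.sort(children);
            int myIndex = children.indexOf(myName);
            if (myIndex < 0) {
                throw new IllegalStateException("Lock node vanished (session expired?)");
            }
            if (myIndex == 0) {
                return true; // We hold the lock
            }

            // Watch only the node just before us
            String predecessor = lockPath + "/" + children.get(myIndex - 1);
            CountDownLatch latch = new CountDownLatch(1);
            // exists() returns null if the predecessor is already gone: re-check at once
            if (zk.exists(predecessor, event -> latch.countDown()) != null) {
                latch.await(); // Wait for predecessor to change or disappear
            }
            // Loop: the predecessor may have crashed while an earlier node still holds the lock
        }
    }

    public void release() throws Exception {
        zk.delete(currentNode, -1);
    }
}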

Output

No direct output — this is a code pattern. The lock guarantees mutual exclusion even under partitions.

Interview Gold:

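A naive ZooKeeper lock in which every waiter watches the same node suffers from the 'herd effect': when the lock is released, all waiting clients wake up at once. Mitigate it by having each client watch only its predecessor, as in the code above, or use the Curator library, whose InterProcessMutex implements this fair queue.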
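
The queueing rule itself can be modeled without a ZooKeeper ensemble. The sketch below is an in-memory illustration only; `SequentialLockQueue` is a hypothetical class, not a ZooKeeper or Curator API. It shows why a waiter must re-check its position when its predecessor disappears instead of assuming it now owns the lock.

```java
import java.util.TreeSet;

/** In-memory model of the ephemeral sequential node recipe (no ZooKeeper involved). */
final class SequentialLockQueue {
    private final TreeSet<Long> nodes = new TreeSet<>(); // live sequence numbers
    private long nextSeq = 0;

    /** Like creating an EPHEMERAL_SEQUENTIAL node: returns this client's sequence number. */
    long join() {
        long seq = nextSeq++;
        nodes.add(seq);
        return seq;
    }

    /** Like the node being deleted, by release or by session expiry. */
    void leave(long seq) {
        nodes.remove(seq);
    }

    /** The lowest live sequence number holds the lock. */
    boolean holdsLock(long seq) {
        return !nodes.isEmpty() && nodes.first() == seq;
    }

    /** The single node a waiter should watch, or null if it is first in line. */
    Long predecessorOf(long seq) {
        return nodes.lower(seq);
    }

    public static void main(String[] args) {
        SequentialLockQueue queue = new SequentialLockQueue();
        long a = queue.join();
        long b = queue.join();
        long c = queue.join();
        System.out.println("c watches " + queue.predecessorOf(c)); // c watches 1
        queue.leave(b); // b crashes while a still holds the lock
        System.out.println("c holds lock: " + queue.holdsLock(c)); // c holds lock: false
        System.out.println("c now watches " + queue.predecessorOf(c)); // c now watches 0
        queue.leave(a); // a releases
        System.out.println("c holds lock: " + queue.holdsLock(c)); // c holds lock: true
    }
}
```

Each release or crash wakes exactly one waiter, the one watching that node. When b crashes, only c is notified, and c goes back to waiting on a.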

[Figure: Redlock vs. ZooKeeper Locks]

Fencing Tokens: The One Trick That Saves Your Data

Even with a perfect lock, you can still corrupt data if the lock holder is slow. Imagine a client acquires a lock, writes to a file, but the write takes 30 seconds. Meanwhile, the lock expires, another client acquires it, and both write to the same file. Fencing tokens solve this: each time a lock is acquired, the lock service issues a monotonically increasing token. The client includes this token in every write request. The resource (e.g., database) rejects any write with a token older than the last accepted one. This is how you make locks safe under GC pauses.

FencingToken.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Fencing token generation in lock service
public class FencingTokenLock {
    private AtomicLong counter = new AtomicLong(0);
    
    public LockAcquireResult acquire(String resource, String clientId) {
        // ... acquire lock logic ...
        long token = counter.incrementAndGet();
        return new LockAcquireResult(true, token);
    }
}

// Client uses token in write
public void writeData(Data data, long fencingToken) {
    // Database checks: token must be >= the last accepted token, so the current
    // holder can write repeatedly while older holders are rejected
    String sql = "UPDATE resource SET data = ?, token = ? WHERE token <= ?";
    int rows = jdbc.update(sql, data, fencingToken, fencingToken);
    if (rows == 0) {
        throw new StaleLockException("Fencing token rejected");
    }
}

Output

No direct output — this is a design pattern. The database rejects stale writes.

Never Do This:

Skipping fencing tokens because 'our lock is perfect' is how you get silent data corruption. Always fence. Always.
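
The resource-side check is small enough to show in full. This is a minimal in-memory sketch of the rule described above. `FencedStore` is a hypothetical stand-in for whatever storage you actually protect, and the token values are arbitrary.

```java
/** A resource that remembers the newest fencing token and rejects older ones. */
final class FencedStore {
    private long highestToken = 0;
    private String value;

    /** Accepts the write only if the token is not older than the newest one seen. */
    synchronized boolean write(long fencingToken, String newValue) {
        if (fencingToken < highestToken) {
            return false; // stale holder: its lock expired and someone newer has written
        }
        highestToken = fencingToken;
        value = newValue;
        return true;
    }

    synchronized String read() {
        return value;
    }

    public static void main(String[] args) {
        FencedStore store = new FencedStore();
        // Client 1 gets token 33, then stalls in a long pause before writing.
        // Client 2 gets token 34 after the lock expires and writes first.
        System.out.println(store.write(34, "from client 2")); // true
        System.out.println(store.write(33, "from client 1")); // false
        System.out.println(store.write(34, "client 2 again")); // true
        System.out.println(store.read()); // client 2 again
    }
}
```

Equal tokens are accepted on purpose: the current holder usually needs more than one write per lock acquisition.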

[Figure: Fencing Token Flow]

Lock Semantics: Reentrant, Read/Write, and Lease-Based

Not all locks are equal. Reentrant locks allow the same thread to acquire the lock multiple times without deadlocking. Read/write locks allow concurrent readers but exclusive writers. Lease-based locks have a time-bound lease that auto-renews. Choose the right semantics for your use case. For a distributed rate limiter, a lease-based lock with short TTL works. For a distributed mutex protecting a critical section, a non-reentrant lock with fencing tokens is safer.

LockSemantics.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Reentrant distributed lock using Redis
public class ReentrantRedisLock {
    private ThreadLocal<Integer> reentrantCount = ThreadLocal.withInitial(() -> 0);
    private String lockKey;
    private String instanceId;

    public boolean acquire() {
        if (reentrantCount.get() > 0) {
            reentrantCount.set(reentrantCount.get() + 1);
            return true;
        }
        // setnx here stands for an atomic SET lockKey instanceId NX EX 30; plain SETNX cannot set a TTL
        boolean acquired = redis.setnx(lockKey, instanceId, 30);
        if (acquired) {
            reentrantCount.set(1);
        }
        return acquired;
    }

    public void release() {
        int count = reentrantCount.get();
        if (count == 0) throw new IllegalStateException("Not locked");
        if (count > 1) {
            reentrantCount.set(count - 1);
            return;
        }
        redis.eval("if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end",
                Collections.singletonList(lockKey), Collections.singletonList(instanceId));
        reentrantCount.set(0);
    }
}

Output

No direct output — this is a code pattern. Reentrancy prevents deadlocks in recursive calls.

Senior Shortcut:

Read/write locks in distributed systems are tricky because readers must coordinate. Use a separate lock for reads and writes, or use a lease-based approach where readers hold a shared lease.
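
One way to picture the shared-lease approach is a small state machine: any number of unexpired reader leases, or exactly one writer lease. The sketch below is an in-memory illustration under that assumption. `ReadWriteLeases` is a hypothetical class, time is passed in explicitly so the behavior is deterministic, and writer starvation is deliberately not handled.

```java
import java.util.HashMap;
import java.util.Map;

/** Shared leases for readers, one exclusive lease for a writer, all time-bound. */
final class ReadWriteLeases {
    private final Map<String, Long> readerExpiryMs = new HashMap<>(); // clientId -> expiry
    private String writer;
    private long writerExpiryMs;

    private void purgeExpired(long nowMs) {
        readerExpiryMs.values().removeIf(expiry -> expiry <= nowMs);
        if (writer != null && writerExpiryMs <= nowMs) {
            writer = null;
        }
    }

    /** Readers share the resource as long as no writer lease is live. */
    synchronized boolean tryRead(String clientId, long nowMs, long ttlMs) {
        purgeExpired(nowMs);
        if (writer != null) {
            return false;
        }
        readerExpiryMs.put(clientId, nowMs + ttlMs);
        return true;
    }

    /** A writer needs the resource to itself: no live readers, no live writer. */
    synchronized boolean tryWrite(String clientId, long nowMs, long ttlMs) {
        purgeExpired(nowMs);
        if (writer != null || !readerExpiryMs.isEmpty()) {
            return false;
        }
        writer = clientId;
        writerExpiryMs = nowMs + ttlMs;
        return true;
    }

    synchronized void release(String clientId) {
        readerExpiryMs.remove(clientId);
        if (clientId.equals(writer)) {
            writer = null;
        }
    }

    public static void main(String[] args) {
        ReadWriteLeases leases = new ReadWriteLeases();
        System.out.println(leases.tryRead("r1", 0, 1_000));    // true
        System.out.println(leases.tryRead("r2", 0, 1_000));    // true
        System.out.println(leases.tryWrite("w1", 100, 1_000)); // false: readers are live
        System.out.println(leases.tryWrite("w1", 1_500, 1_000)); // true: reader leases expired
        System.out.println(leases.tryRead("r1", 1_600, 1_000)); // false: writer is live
    }
}
```

In a real deployment this state would live in the lock service, not in one process, and writes would still carry a fencing token.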

Production Deployment: What Your Lock Service Needs to Survive

Your lock service is a critical piece of infrastructure. It needs: (1) High availability — at least 3 nodes, preferably 5 for consensus-based systems. (2) Monitoring — track lock acquisition latency, failure rate, and hold time. Alert if any lock is held longer than 2x TTL. (3) Graceful degradation — if the lock service is down, your application should fail fast, not hang. (4) Rate limiting — prevent a single client from hogging locks. (5) Authentication — not everyone should be able to lock any resource.

LockServiceHealthCheck.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Health check endpoint for lock service
@GET
@Path("/health")
public Response health() {
    // Check each backend node
    for (LockBackend backend : backends) {
        if (!backend.isHealthy()) {
            return Response.status(503)
                .entity("Backend " + backend.getName() + " is down")
                .build();
        }
    }
    // Check end-to-end acquire latency
    long start = System.nanoTime();
    boolean acquired = tryAcquire("health-check-lock", "health-check", 100);
    long latency = (System.nanoTime() - start) / 1_000_000;
    if (!acquired) {
        // Never release a lock we did not get
        return Response.status(503)
            .entity("Could not acquire health-check lock")
            .build();
    }
    release("health-check-lock", "health-check");
    if (latency > 500) {
        return Response.status(503)
            .entity("Lock latency too high: " + latency + "ms")
            .build();
    }
    return Response.ok().build();
}

Output

HTTP 200 if healthy, 503 if not. The health check acquires and releases a lock to verify end-to-end functionality.

Production Trap:

Don't use the same Redis cluster for locks and caching. A cache miss storm can cause lock acquisition to time out, taking down your entire system. Isolate lock traffic.
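
Failing fast, as item (3) above asks, can be sketched as a bounded retry loop with jittered backoff and a hard time budget. `LockRetry` and its parameters are hypothetical. The `tryAcquire` supplier stands in for one non-blocking attempt against whatever lock backend you use.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.BooleanSupplier;

/** Bounded lock acquisition: a few jittered retries, then give up. */
final class LockRetry {
    /** Returns true if tryAcquire succeeded within maxAttempts and within budgetMs. */
    static boolean acquireWithBackoff(BooleanSupplier tryAcquire, int maxAttempts,
                                      long baseDelayMs, long budgetMs) {
        long deadline = System.nanoTime() + budgetMs * 1_000_000L;
        long delayMs = baseDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (tryAcquire.getAsBoolean()) {
                return true;
            }
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
            if (attempt == maxAttempts || remainingMs <= 0) {
                break;
            }
            // Full jitter: sleep a random time in [0, delayMs], capped by the remaining budget
            long sleepMs = Math.min(ThreadLocalRandom.current().nextLong(delayMs + 1), remainingMs);
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            delayMs = Math.min(delayMs * 2, 1_000L); // exponential growth, capped at 1 s
        }
        return false; // fail fast: the caller decides whether to skip, queue, or alert
    }

    public static void main(String[] args) {
        int[] calls = {0};
        boolean ok = acquireWithBackoff(() -> ++calls[0] == 3, 5, 10, 2_000);
        System.out.println(ok + " after " + calls[0] + " attempts"); // true after 3 attempts
    }
}
```

The jitter matters when many clients lose the same lock at once: without it they all retry on the same schedule and hit the lock service in waves.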

When Not to Use Distributed Locks

Distributed locks are overkill for many problems. If you need to prevent duplicate work, use idempotency keys instead. If you need to coordinate access to a database row, use database transactions with optimistic locking. If you need to rate limit, use a token bucket. Distributed locks add latency, complexity, and failure modes. Only use them when you absolutely need mutual exclusion across machines and cannot tolerate any chance of concurrent access.

IdempotencyKeyExample.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

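// Idempotency key pattern (alternative to locks)
public class PaymentService {
    public PaymentResult processPayment(String idempotencyKey, PaymentRequest request) {
        // Fast path: already processed
        PaymentResult existing = cache.get(idempotencyKey);
        if (existing != null) return existing;

        try {
            // Claim the key first. The unique constraint on idempotency_key lets exactly
            // one concurrent request win (payment_requests is a hypothetical table).
            jdbc.update("INSERT INTO payment_requests (idempotency_key) VALUES (?)", idempotencyKey);
        } catch (DuplicateKeyException e) {
            // Another request already claimed this key; its result may still be in flight (null)
            return cache.get(idempotencyKey);
        }
        // Only the request that claimed the key reaches the gateway
        PaymentResult result = paymentGateway.charge(request);
        cache.set(idempotencyKey, result);
        return result;
    }
}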

Output

No direct output — this is a code pattern. Idempotency keys are simpler and faster than locks for preventing duplicate operations.

Senior Shortcut:

Before reaching for a distributed lock, ask: 'Can I make this operation idempotent?' If yes, skip the lock. Idempotency is cheaper and safer.
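
Inside a single process, the "run at most once per key" rule can be sketched with nothing but a concurrent map. `IdempotentExecutor` is a hypothetical helper. Across machines the map would have to be replaced by shared storage with a unique constraint, which this sketch does not attempt.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Runs an operation at most once per idempotency key within this process. */
final class IdempotentExecutor<R> {
    private final ConcurrentMap<String, R> results = new ConcurrentHashMap<>();

    /**
     * computeIfAbsent is atomic per key, so concurrent retries of the same key
     * run the operation once. The operation must not return null.
     */
    R execute(String idempotencyKey, Supplier<R> operation) {
        return results.computeIfAbsent(idempotencyKey, key -> operation.get());
    }

    public static void main(String[] args) {
        AtomicInteger charges = new AtomicInteger();
        IdempotentExecutor<String> executor = new IdempotentExecutor<>();
        String first = executor.execute("order-42", () -> "charge#" + charges.incrementAndGet());
        String retry = executor.execute("order-42", () -> "charge#" + charges.incrementAndGet());
        System.out.println(first + " " + retry + " charges=" + charges.get());
        // charge#1 charge#1 charges=1
    }
}
```

The retry gets the original result back and the side effect happens once, with no lock to expire or leak.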

● Production incident · POST-MORTEM · severity: high

The 4GB Container That Kept Dying

Symptom

A critical job that processed financial reconciliations would randomly fail with 'Lock acquisition timeout' at 2 AM, causing a 4-hour delay in settlement.

Assumption

The team assumed the Redis cluster was overloaded and added more nodes.

Root cause

The lock TTL was 30 seconds, but the job's GC pause could spike to 45 seconds during full GC. The lock expired while the job still held it, allowing a second instance to start and corrupt data.

Fix

Reduced heap to 2GB to avoid full GC, set lock TTL to 60 seconds, and added a watchdog thread that extends the lock every 10 seconds while the job runs.

Key lesson

Lock TTL must be greater than the worst-case GC pause, not the average.
Watchdog threads are not optional for long-running operations.
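
A watchdog of the kind described in the fix might look like the sketch below. `LockWatchdog` is a hypothetical class. The `renewIfOwner` supplier is assumed to extend the TTL only when the stored lock value is still ours and to return false otherwise. The important part is that a refused renewal is recorded, so the job can stop writing.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Periodically renews a lease and stops as soon as a renewal is refused. */
final class LockWatchdog implements AutoCloseable {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                Thread thread = new Thread(runnable, "lock-watchdog");
                thread.setDaemon(true); // never keep the JVM alive just to renew a lock
                return thread;
            });
    private final AtomicInteger renewals = new AtomicInteger();
    private volatile boolean lockLost = false;

    /** renewIfOwner must extend the TTL only when the stored value is still ours. */
    LockWatchdog(BooleanSupplier renewIfOwner, long periodMs) {
        scheduler.scheduleAtFixedRate(() -> {
            if (renewIfOwner.getAsBoolean()) {
                renewals.incrementAndGet();
            } else {
                lockLost = true;      // the job must check this before each write
                scheduler.shutdown(); // no point renewing a lock we no longer own
            }
        }, periodMs, periodMs, TimeUnit.MILLISECONDS);
    }

    boolean isLockLost() {
        return lockLost;
    }

    int renewalCount() {
        return renewals.get();
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger calls = new AtomicInteger();
        // Simulated backend: the first three renewals succeed, then ownership is lost
        try (LockWatchdog watchdog = new LockWatchdog(() -> calls.incrementAndGet() <= 3, 10)) {
            while (!watchdog.isLockLost()) {
                Thread.sleep(5);
            }
            System.out.println("renewals before loss: " + watchdog.renewalCount());
        }
    }
}
```

The renewal period should be a fraction of the TTL (10 seconds against 60 in the incident above) so that one missed tick does not cost the lock. A stop-the-world pause still freezes this thread along with everything else, which is why the fencing check stays in place.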

Production debug guide

Systematic recovery paths for the failure modes engineers actually hit.

Symptom · 01

Lock acquisition timeout: 'Could not acquire lock within 5 seconds'

→

Fix

1. Check Redis/ZooKeeper cluster health. 2. Check lock key TTL — is it stuck? 3. Force release with redis-cli DEL <key> if safe. 4. Increase timeout or add retry with backoff.

Symptom · 02

Two instances hold the same lock: 'Duplicate job execution'

→

Fix

1. Check clock skew across nodes: ntpdate -q <ntp-server>. 2. Check lock TTL vs GC pause: add -XX:+PrintGCDetails to the JVM (on JDK 9+, use -Xlog:gc*). 3. Implement fencing tokens immediately. 4. Switch to ZooKeeper if skew > 100ms.

Symptom · 03

Lock release fails: 'No such lock' or 'ERR no such key'

→

Fix

1. Check if lock expired naturally. 2. Verify release script uses Lua with value check. 3. Check if another client deleted it. 4. Add idempotency to release: ignore 'no such key' errors.

★ Distributed Locking Triage Cheat Sheet

First-response commands for when things go wrong, copy-paste ready.

Lock acquisition timeout: `Could not acquire lock within 5 seconds`

Immediate action

Check if lock key exists and its TTL

Commands

redis-cli TTL lock:resource

redis-cli GET lock:resource

Fix now

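If TTL > 0 and the instance is dead: redis-cli DEL lock:resource. If TTL = -1: the key has no expiry and will never release itself, so delete it once you confirm the holder is dead. If TTL = -2: the lock doesn't exist, check network.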
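
If you script this triage step, the reply of the Redis TTL command maps to three cases. The sketch below only classifies the integer reply. `TtlTriage` and its advice strings are hypothetical, and fetching the value from Redis is left to your client.

```java
/** Classifies the integer reply of the Redis TTL command for a lock key. */
final class TtlTriage {
    enum LockState { MISSING, NO_EXPIRY, EXPIRING }

    static LockState classify(long ttlReply) {
        if (ttlReply == -2) {
            return LockState.MISSING;   // key does not exist
        }
        if (ttlReply == -1) {
            return LockState.NO_EXPIRY; // key exists but was written without a TTL
        }
        return LockState.EXPIRING;      // seconds remaining before the lock frees itself
    }

    static String advice(LockState state) {
        switch (state) {
            case MISSING:
                return "No lock is held: look at network or client configuration";
            case NO_EXPIRY:
                return "Lock will never expire: confirm the holder is dead, then delete it";
            default:
                return "Lock is live: check whether the holder is still running";
        }
    }

    public static void main(String[] args) {
        for (long reply : new long[] {-2, -1, 17}) {
            System.out.println(reply + " -> " + classify(reply) + ": " + advice(classify(reply)));
        }
    }
}
```

A reply of -1 deserves an alert of its own: it means some code path acquired the lock without a TTL, so a crash would hold it forever.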

Feature / Aspect	Redis Redlock	ZooKeeper Lock
Consistency model	Probabilistic (time-based)	Strong (sequential consistency)
Safety under partition	Can fail (double lock)	Safe (ephemeral nodes)
Latency (p99)	~5ms	~20ms
Clock dependency	Yes (TTL)	No
Complexity	Low	Medium
Operational overhead	Low (Redis cluster)	Medium (ZooKeeper ensemble)

Key takeaways

Distributed locks are not safe under network partitions unless you use consensus-based systems like ZooKeeper or etcd.

Always use fencing tokens to prevent stale lock holders from corrupting data.

Lock TTL must exceed the worst-case GC pause, not the average. Watchdog threads are mandatory for long operations.

Before using a distributed lock, ask if idempotency or optimistic locking can solve the problem with less complexity.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR: How does Redlock behave under a network partition where a minority of Re...

Q02 · SENIOR: When would you choose ZooKeeper over Redis for distributed locking in a ...

Q03 · SENIOR: What happens when a ZooKeeper lock holder's session expires due to a GC ...

Q04 · JUNIOR: What is a distributed lock?

Q05 · SENIOR: You see 'Lock acquisition timeout' errors in production for a Redis-base...

Q06 · SENIOR: Design a distributed locking service for a global e-commerce platform th...

Q01 of 06 · SENIOR

How does Redlock behave under a network partition where a minority of Redis nodes are isolated?

ANSWER

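Redlock's quorum is a majority of all N nodes, not of the nodes a client can currently reach. A client that can reach the majority side still acquires the lock normally. A client stuck with the isolated minority cannot collect N/2 + 1 acknowledgements, so its acquisition fails and it should release any partial locks. A partition alone therefore does not produce two holders. The danger comes from timing: if the holder is paused or delayed past the TTL, or a node's clock jumps, or a node restarts without persisted state, a second client can reach quorum while the first still believes it holds the lock. That is why Redlock is not considered safe in an asynchronous system without fencing tokens.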
