Senior 4 min · June 25, 2026

Leader Election in Distributed Systems: Avoid Split-Brain and Downtime

Leader election explained with production patterns, ZooKeeper vs Raft, split-brain prevention, and a debugging guide for distributed systems.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Everything here is grounded in real deployments.

Quick Answer

Leader election ensures only one node acts as the master in a distributed system. Common implementations use ZooKeeper, etcd, or Raft consensus. The key challenge is handling network partitions without causing split-brain.

✦ Definition~90s read
What is Leader Election?

Leader election is a distributed systems pattern where nodes in a cluster agree on a single coordinator to manage shared resources, avoid conflicts, and ensure fault tolerance. It prevents split-brain scenarios where multiple nodes act as master simultaneously.

Plain-English First

Imagine a team of chefs in a kitchen. If everyone decides the menu, you get chaos. So they pick one head chef. If the head chef gets sick, the team quickly votes a new one. But if two chefs think they're head chef because of a miscommunication, you get two different meals. Leader election is the protocol to pick one head chef and handle when they disappear, without ending up with two.

You've got three database replicas. One handles writes, the others replicate. Then the network hiccups. Suddenly two replicas think they're the writer. You now have diverging data, angry customers, and a 3am restore. That's split-brain. Leader election is the only thing standing between you and that nightmare.

Without leader election, every node would need to coordinate on every write, which amounts to a distributed lock per operation and kills throughput. With it, only the leader coordinates writes; followers just replicate. The hard part is guaranteeing that at most one leader exists at any moment, and that a replacement is elected quickly, even when nodes crash or networks partition.

By the end of this, you'll be able to design a leader election system using ZooKeeper or Raft, debug common failures like stale leaders and split-brain, and know exactly when a simpler approach like a single coordinator is better.

Why Leader Election Exists: The Split-Brain Problem

Before leader election, distributed systems used a single coordinator. If it died, the system was down until manual recovery. That's not acceptable for modern services. So we automated failover. But automation introduces a new problem: two nodes might both think they're the coordinator. That's split-brain. It corrupts data, breaks idempotency, and causes cascading failures.

Leader election solves this by ensuring that at most one node acts as leader at any time. It relies on a consensus mechanism: either a lock-style primitive from a coordination service (like ZooKeeper) or a voting protocol (like Raft). The key property is safety: even under network partitions, two nodes are never elected leader for the same term.

Without this, you get the classic disaster: two writers to a database, each overwriting the other's changes. I've seen this bring down a payments service when a network switch failed and two instances of the payment processor both accepted transactions. The result? Duplicate charges and a weekend of manual reconciliation.

LeaderElectionExample.systemdesign
// io.thecodeforge — System Design tutorial

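// Simulated leader election using a lease-style lock.
// LeaseLock is an in-memory stand-in for a real distributed lock (ZooKeeper, etcd);
// it only illustrates the acquire / renew / expire cycle inside one process.

import java.util.concurrent.TimeUnit;

class LeaderElection {
    private final LeaseLock lock;
    private final String nodeId;

    LeaderElection(LeaseLock lock, String nodeId) {
        this.lock = lock;
        this.nodeId = nodeId;
    }

    public boolean tryBecomeLeader() {
        // Take the lease with a TTL of 30 seconds; fails immediately if another node holds it
        boolean acquired = lock.tryAcquire(nodeId, 30, TimeUnit.SECONDS);
        if (acquired) {
            // Start heartbeat to renew the lease
            startHeartbeat();
        }
        return acquired;
    }

    private void startHeartbeat() {
        // In production, this would be a scheduled task that renews the lease
        // If the heartbeat fails, the lease expires and another node can become leader
    }

    // Usage: two nodes compete for the same lease
    public static void main(String[] args) {
        LeaseLock shared = new LeaseLock();
        for (String id : new String[] {"node-1", "node-2"}) {
            LeaderElection election = new LeaderElection(shared, id);
            System.out.println(election.tryBecomeLeader()
                    ? "I am the leader!"
                    : "Another node is leader.");
        }
    }
}

class LeaseLock {
    private String holder = null;
    private long expiresAtNanos = 0;

    // Grants the lease if it is free, expired, or already held by nodeId (a renewal)
    public synchronized boolean tryAcquire(String nodeId, long ttl, TimeUnit unit) {
        long now = System.nanoTime();
        if (holder == null || now - expiresAtNanos >= 0 || holder.equals(nodeId)) {
            holder = nodeId;
            expiresAtNanos = now + unit.toNanos(ttl);
            return true;
        }
        return false;
    }
}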
Output
I am the leader!
Another node is leader.
Production Trap:
If your lock TTL is too short, the leader might lose the lock before it finishes critical work. If too long, failover takes forever. Start with 30 seconds and monitor.
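The TTL trade-off in this trap mostly shows up in how the renewal heartbeat is scheduled. Here is a minimal sketch, assuming a hypothetical `renew` callback that returns false when the lock service refuses to extend the lease. `LeaseHeartbeat`, its parameters, and the renew-at-one-third-of-the-TTL rule are illustrative choices, not part of any specific library.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper: renews a lease at a fraction of its TTL and
// reports leadership loss as soon as a renewal is refused.
class LeaseHeartbeat {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "lease-heartbeat");
                t.setDaemon(true); // do not keep the JVM alive just for heartbeats
                return t;
            });

    // Renew at a third of the TTL so a delayed renewal still lands before expiry.
    static long renewIntervalMillis(long ttlMillis) {
        return Math.max(1, ttlMillis / 3);
    }

    // renew:  assumed callback, returns true if the lock service extended the lease.
    // onLost: invoked when a renewal is refused; the node must stop acting as leader.
    void start(long ttlMillis, BooleanSupplier renew, Runnable onLost) {
        long interval = renewIntervalMillis(ttlMillis);
        scheduler.scheduleAtFixedRate(() -> {
            if (!renew.getAsBoolean()) {
                onLost.run();
                scheduler.shutdown(); // cancels further renewals
            }
        }, interval, interval, TimeUnit.MILLISECONDS);
    }

    void stop() {
        scheduler.shutdownNow();
    }
}
```

With a 30-second TTL this renews every 10 seconds. A shorter TTL speeds up failover but leaves less room for a slow renewal, which is the balance the trap describes.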
[Figure: Leader Election in Distributed Systems. Panels: Split-Brain Problem (multiple nodes act as leader, causing data corruption); ZooKeeper Leader Election (sequential ephemeral znodes for leader selection); Raft Consensus (leader election via randomized timeouts and log replication); Fencing & Quorum (prevent split-brain with fencing mechanisms and majority); Overkill Trap (avoid leader election for simple single-node systems).]

ZooKeeper-Based Leader Election: The Battle-Tested Approach

ZooKeeper is the old guard. It provides a reliable distributed coordination service with ephemeral nodes. The idea: each candidate creates an ephemeral sequential znode under an election path. The one with the smallest sequence number is the leader. If the leader dies, its ephemeral node disappears, and the next in line becomes leader.
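The ordering step can be sketched without a ZooKeeper connection. The example below assumes znode names of the form `n_` plus a zero-padded sequence number; the prefix and the `ZnodeOrdering` class are hypothetical. It also shows the refinement from the standard recipe where each candidate watches only the znode directly ahead of it.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the ordering step in the ephemeral-sequential recipe.
// Names such as "n_0000000007" are illustrative; ZooKeeper appends the
// zero-padded sequence number to whatever prefix the client chooses.
class ZnodeOrdering {
    // Parses the numeric suffix after the last underscore.
    static long sequenceOf(String znode) {
        return Long.parseLong(znode.substring(znode.lastIndexOf('_') + 1));
    }

    static List<String> sorted(List<String> children) {
        List<String> copy = new ArrayList<>(children);
        copy.sort(Comparator.comparingLong(ZnodeOrdering::sequenceOf));
        return copy;
    }

    // The candidate holding the smallest sequence number is the leader.
    static String leader(List<String> children) {
        return sorted(children).get(0);
    }

    // Each non-leader watches only the znode just ahead of it, so a leader
    // failure wakes one candidate instead of all of them.
    // Returns null for the leader (or for a name that is not in the list).
    static String predecessorToWatch(List<String> children, String self) {
        List<String> order = sorted(children);
        int index = order.indexOf(self);
        return index <= 0 ? null : order.get(index - 1);
    }

    public static void main(String[] args) {
        List<String> children = List.of("n_0000000012", "n_0000000007", "n_0000000009");
        System.out.println("leader = " + leader(children));
        System.out.println("n_0000000012 watches "
                + predecessorToWatch(children, "n_0000000012"));
    }
}
```

Watching only the predecessor avoids the herd effect, where every candidate re-reads the election path each time any znode changes.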

Why ZooKeeper? It's battle-tested at scale (Kafka, HBase, Solr). But it's also a separate service to manage. You need to run a ZooKeeper ensemble (odd number, 3 or 5). That's operational overhead.

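The classic rookie mistake: leaving the session timeout at a value that is too short for your environment. If the leader's session expires (for example, during a long GC pause or a brief network stall), its ephemeral node is deleted, triggering an election even though the leader process is still alive. This causes unnecessary leader changes. Always set a session timeout that comfortably exceeds both your heartbeat interval and your worst expected pause.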

ZooKeeperLeaderElection.systemdesign
// io.thecodeforge — System Design tutorial

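// ZooKeeper leader election using the Curator framework

import java.io.IOException;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ZooKeeperLeaderElection {
    private final CuratorFramework client;
    private final String electionPath = "/election";
    private final String nodeId;
    private LeaderLatch leaderLatch;

    public ZooKeeperLeaderElection(String connectionString, String nodeId) {
        this.nodeId = nodeId;
        // Retry with exponential backoff: 1s base sleep, up to 3 retries
        client = CuratorFrameworkFactory.newClient(connectionString,
                new ExponentialBackoffRetry(1000, 3));
        client.start();
    }

    public void start() throws Exception {
        // Each participant passes its own ID so the current leader can be identified
        leaderLatch = new LeaderLatch(client, electionPath, nodeId);
        leaderLatch.addListener(new LeaderLatchListener() {
            @Override
            public void isLeader() {
                System.out.println("I am the leader!");
                // Start accepting writes
            }

            @Override
            public void notLeader() {
                // Called when leadership is lost, including when the
                // ZooKeeper connection is suspended or lost
                System.out.println("I am a follower.");
                // Stop accepting writes
            }
        });
        leaderLatch.start();
    }

    public void close() throws IOException {
        // Close the latch first so the ephemeral znode is released promptly
        if (leaderLatch != null) {
            leaderLatch.close();
        }
        client.close();
    }
}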
Output
I am the leader!
I am a follower.
Senior Shortcut:
Use Curator's LeaderLatch — it handles session management and re-election. Don't reinvent the ephemeral node logic.
[Figure: ZooKeeper Leader Election Flow. Candidates each create an ephemeral sequential znode; the smallest sequence number wins; the active leader maintains its znode and sends heartbeats; when the leader dies, its znode disappears automatically; the next candidate, watching the path, becomes the new leader. Note: ensure watch notifications are reliable to avoid stale leaders.]

Raft Consensus: The Modern Alternative

Raft is a consensus algorithm designed to be understandable. It's used in etcd, Consul, and TiKV. Unlike ZooKeeper which is a general coordination service, Raft is a protocol for replicated state machines. Leader election is built-in.

In Raft, each node is in one of three states: Leader, Follower, or Candidate. The leader sends periodic heartbeats. If a follower doesn't hear from the leader within its election timeout, it becomes a candidate, increments the term, and starts a new election. The candidate that gets votes from a majority of the cluster becomes the new leader.

Raft's advantage: no external dependency. The cluster manages itself. But it's more complex to implement correctly. Most teams use an existing implementation like etcd or HashiCorp's Raft library.

I've seen teams try to implement Raft from scratch and get it wrong — the leader election can livelock if election timeouts are not randomized. Always use a battle-tested library.
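Randomizing the timeout is a small amount of code. Here is a minimal sketch, assuming the 150-300 ms range suggested in the Raft paper; the `ElectionTimeout` class and its constants are hypothetical and should be tuned to your network's round-trip time.

```java
import java.util.Random;

// Sketch: each node draws its own election timeout from a range, so one node
// usually times out first and wins before the others start campaigning.
class ElectionTimeout {
    static final int MIN_MS = 150;
    static final int MAX_MS = 300;

    // Returns a timeout in [MIN_MS, MAX_MS). Redraw it every time the timer is reset.
    static int next(Random random) {
        return MIN_MS + random.nextInt(MAX_MS - MIN_MS);
    }

    public static void main(String[] args) {
        Random random = new Random();
        for (int node = 1; node <= 3; node++) {
            System.out.println("node-" + node + " election timeout: " + next(random) + " ms");
        }
    }
}
```

Redrawing the value on every reset matters: two nodes that happen to draw the same timeout once will most likely diverge on the next round.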

RaftLeaderElection.systemdesign
// io.thecodeforge — System Design tutorial

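// Simplified, single-process sketch of Raft leader election.
// Real Raft also compares log freshness before granting a vote and sends
// RequestVote as asynchronous RPCs; both are omitted here.

import java.util.ArrayList;
import java.util.List;

class RaftNode {
    enum State { FOLLOWER, CANDIDATE, LEADER }

    private final int nodeId;
    private final List<RaftNode> peers = new ArrayList<>(); // every node except this one
    private State state = State.FOLLOWER;
    private int term = 0;
    private int votedFor = -1; // -1 means no vote cast in the current term

    RaftNode(int nodeId) {
        this.nodeId = nodeId;
    }

    void addPeer(RaftNode peer) {
        peers.add(peer);
    }

    // Called when the (randomized) election timeout fires without a heartbeat
    public void startElection() {
        state = State.CANDIDATE;
        term++;
        votedFor = nodeId;
        int votesReceived = 1; // vote for self
        // Send RequestVote to all other nodes (direct calls stand in for RPCs)
        for (RaftNode peer : peers) {
            if (peer.requestVote(term, nodeId)) {
                votesReceived++;
            }
        }
        int clusterSize = peers.size() + 1; // peers excludes this node
        if (votesReceived > clusterSize / 2) {
            state = State.LEADER;
            System.out.println("Became leader for term " + term);
            startHeartbeats();
        } else {
            state = State.FOLLOWER;
        }
    }

    public boolean requestVote(int candidateTerm, int candidateId) {
        if (candidateTerm < term) {
            return false; // stale candidate
        }
        if (candidateTerm > term) {
            // Newer term: adopt it, step down, and forget any earlier vote
            term = candidateTerm;
            state = State.FOLLOWER;
            votedFor = -1;
        }
        // Grant at most one vote per term
        if (votedFor == -1 || votedFor == candidateId) {
            votedFor = candidateId;
            return true;
        }
        return false;
    }

    private void startHeartbeats() {
        // Send AppendEntries RPCs periodically to suppress new elections
    }

    public static void main(String[] args) {
        RaftNode a = new RaftNode(1), b = new RaftNode(2), c = new RaftNode(3);
        a.addPeer(b); a.addPeer(c);
        b.addPeer(a); b.addPeer(c);
        c.addPeer(a); c.addPeer(b);
        a.startElection();
    }
}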
Output
Became leader for term 1
Interview Gold:
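Raft's liveness depends on the election timeout being randomized. Without randomization, multiple candidates start elections simultaneously, causing repeated split votes and no leader. Safety (at most one leader per term) comes from a different rule: each node grants only one vote per term, and a candidate needs a majority to win.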

When Not to Use Leader Election: The Overkill Trap

Leader election adds complexity. You need consensus, heartbeats, and failover logic. For many systems, a simpler approach works fine.

If your system can tolerate temporary inconsistency (e.g., caching layer), use a gossip protocol or CRDTs instead. If you have a single writer that rarely fails, manual failover might be acceptable. If your cluster is small and you control the network, a primary-replica setup with a static leader is simpler.

I've seen teams add ZooKeeper to a two-node system. That's madness. ZooKeeper needs at least three nodes to be fault-tolerant. For two nodes, use a shared disk or a simple heartbeat with STONITH (Shoot The Other Node In The Head).

The rule: only use leader election when you need automatic failover and you have at least three nodes. Otherwise, you're adding complexity without benefit.

Never Do This:
Running ZooKeeper with an even number of nodes. It doesn't improve fault tolerance — you still need a majority. 3 or 5 nodes only.
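The arithmetic behind this rule is short enough to check directly. Here is a minimal sketch; the `Quorum` class and its method names are hypothetical.

```java
// Quorum arithmetic for majority-based clusters.
class Quorum {
    // Smallest number of nodes that forms a majority.
    static int majority(int clusterSize) {
        return clusterSize / 2 + 1;
    }

    // Number of nodes that can fail while a majority is still reachable.
    static int tolerableFailures(int clusterSize) {
        return clusterSize - majority(clusterSize);
    }

    public static void main(String[] args) {
        for (int n = 2; n <= 5; n++) {
            System.out.println(n + " nodes: majority " + majority(n)
                    + ", tolerates " + tolerableFailures(n) + " failure(s)");
        }
    }
}
```

Four nodes need a majority of three and so tolerate one failure, the same as three nodes. The extra node adds cost and replication traffic without adding fault tolerance.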
[Figure: Leader Election vs Simpler Approaches. Leader election: requires a consensus protocol, heartbeats and failover logic; prevents split-brain with fencing; adds latency and ops overhead. Simpler approaches: gossip protocol for caches; CRDTs for eventual consistency; single writer with manual failover; no consensus, lower complexity. Use leader election only when strong consistency is mandatory.]

Split-Brain Prevention: Fencing and Quorum

Even with leader election, split-brain can happen if the old leader doesn't know it's been deposed. The solution: fencing. When a new leader is elected, it must ensure the old leader can no longer access shared resources. This is done via a fence mechanism — e.g., revoking IAM permissions, killing the old leader's process, or using a distributed lock with a generation clock.

In Raft, the term number acts as a generation clock. The leader includes its term in every request. If a follower receives a request with an older term, it rejects it. This prevents stale leaders from writing.

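In ZooKeeper, use a fencing token: the leader records its epoch in a znode and attaches that epoch to every write. Before writing, check that the epoch still matches; if not, abort. Where possible, have the storage layer also reject any write that carries an older epoch, since a leader can be deposed between its own check and the write.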
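A fencing token is most robust when the protected resource enforces it. Here is a minimal sketch, assuming a hypothetical `FencedStore` that stands in for the database or storage service; in a real system this comparison would live inside that service.

```java
// Sketch of a resource that enforces fencing tokens itself.
// FencedStore is hypothetical and keeps a single value in memory.
class FencedStore {
    private long highestTokenSeen = 0;
    private String value;

    // Accepts the write only if the token is at least as new as any token seen so far.
    synchronized boolean write(long fencingToken, String newValue) {
        if (fencingToken < highestTokenSeen) {
            return false; // stale leader: a newer leader has already written
        }
        highestTokenSeen = fencingToken;
        value = newValue;
        return true;
    }

    synchronized String read() {
        return value;
    }

    public static void main(String[] args) {
        FencedStore store = new FencedStore();
        System.out.println(store.write(1, "from old leader")); // true
        System.out.println(store.write(2, "from new leader")); // true
        System.out.println(store.write(1, "late write"));      // false
        System.out.println(store.read());                      // from new leader
    }
}
```

Because the store remembers the highest token it has accepted, a deposed leader's delayed write is refused even if that leader never learns it was replaced.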

I've seen a production outage where a network partition caused two leaders to be elected (ZooKeeper misconfigured). The old leader kept writing to a database, corrupting data. The fix: add a fencing layer that kills the old leader's process when a new leader is elected.

FencingExample.systemdesign
// io.thecodeforge — System Design tutorial

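// Fencing with a generation clock.
// AtomicInteger and AtomicBoolean are in-memory stand-ins for a counter and a
// lock kept in a reliable store (ZooKeeper, etcd).

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

class FencedLeader {
    private static final AtomicInteger counter = new AtomicInteger();    // "leader-generation"
    private static final AtomicBoolean leaderLock = new AtomicBoolean(); // "leader-lock"
    private int generation = 0;

    public boolean tryBecomeLeader() {
        // Take the lock first: bumping the counter before winning would
        // invalidate the current leader's generation on every failed attempt
        if (leaderLock.compareAndSet(false, true)) {
            generation = counter.incrementAndGet();
            return true;
        }
        return false;
    }

    public void writeData(String data) {
        // Before writing, check that our generation is still current.
        // A gap remains between this check and the write, so the storage layer
        // should ideally also reject writes that carry an older generation.
        if (counter.get() != generation) {
            throw new IllegalStateException("Stale leader: aborting write");
        }
        // Perform write
        System.out.println("Writing: " + data);
    }

    public static void main(String[] args) {
        FencedLeader leader = new FencedLeader();
        if (leader.tryBecomeLeader()) {
            leader.writeData("some data");
        }
    }
}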
Output
Writing: some data
Senior Shortcut:
Use a generation counter stored in a reliable store (ZooKeeper, etcd). Always check it before writing. This prevents split-brain writes.
Production incident · Post-mortem · Severity: high

The 4GB Container That Kept Dying

Symptom
Every 30 minutes, the leader node would crash with OOMKilled. A new leader would be elected, then crash again. Writes were down for 2 minutes each cycle.
Assumption
Assumed a memory leak in the application code.
Root cause
The leader node was running a ZooKeeper ephemeral node watcher that triggered a full heap dump on leader election. The heap dump exceeded the 4GB container memory limit, causing OOMKill.
Fix
Removed the heap dump trigger. Increased container memory to 8GB as a buffer. Added a memory limit alert at 70%.
Key lesson
  • Never run expensive operations like heap dumps in the leader election callback — it's a critical path that must complete quickly.
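One way to apply this lesson is to keep the callback to a state flip and hand anything slow to a background thread. Here is a minimal sketch; `LeaderCallback` and its method names are hypothetical and not tied to a specific election library.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: the election callback only flips state and queues slow work
// (diagnostics, cache warm-up) on a background thread.
class LeaderCallback {
    private final ExecutorService background = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "leader-background");
        t.setDaemon(true); // background work must not block JVM shutdown
        return t;
    });
    private volatile boolean leader = false;

    // Called by the election library; must return quickly.
    void onElected(Runnable slowWork) {
        leader = true;               // cheap: start accepting writes
        background.submit(slowWork); // expensive: runs off the election thread
    }

    void onDeposed() {
        leader = false;
    }

    boolean isLeader() {
        return leader;
    }
}
```

The election thread returns as soon as the task is queued, so a slow or memory-hungry diagnostic cannot delay the leadership transition itself. It still needs its own memory budget.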
Production debug guide
Systematic recovery paths for the failure modes engineers actually hit.
Symptom · 01
No leader elected for extended period
Fix
1. Check network connectivity between nodes. 2. Verify majority of nodes are alive. 3. Check election timeout configuration. 4. Restart all nodes if necessary.
Symptom · 02
Multiple leaders (split-brain)
Fix
1. Immediately stop all writes. 2. Check fencing mechanism. 3. Verify generation clock consistency. 4. Manually designate a single leader and restart others.
Symptom · 03
Leader flapping (frequent changes)
Fix
1. Check leader node resource usage (CPU, memory). 2. Check the heartbeat interval: it should stay well below the election timeout. 3. Increase election timeout. 4. Check for network packet loss.
Leader Election Triage Cheat Sheet
First-response commands for when things go wrong, copy-paste ready.
No leader elected: `No leader` error in logs
Immediate action
Check node count and network
Commands
curl -s http://node:2379/v2/stats/leader | jq .leader   # etcd v2 API only
etcdctl endpoint status --cluster -w table   # etcd v3: see the IS LEADER column
ping <other-node-ip>
Fix now
Restart all nodes with systemctl restart etcd
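Split-brain: data inconsistency
Immediate action
Stop all writes immediately
Commands
etcdctl endpoint health --cluster   # v3 API; the v2 equivalent is etcdctl cluster-health
etcdctl member list
Fix now
If your application elects its leader through an etcd key, manually set a single leader via etcdctl put /election/leader node-1 (v3 syntax; etcdctl set on the v2 API). This writes an application-level key and does not change etcd's own Raft leader.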
Leader flapping: `leader changed` every few seconds
Immediate action
Check leader CPU and memory
Commands
top -b -n1 | head -20
etcdctl --endpoints=http://localhost:2379 member list
Fix now
Raise the election timeout in config. etcd defaults to 1000 ms with a 100 ms heartbeat, so go above 1000 ms and keep the heartbeat interval well below the timeout.
ZooKeeper session expired: `Session expired` error
Immediate action
Check ZooKeeper server health
Commands
echo ruok | nc localhost 2181
echo stat | nc localhost 2181
Fix now
Increase session timeout in client config to 30s
Feature / Aspect | ZooKeeper | Raft (etcd)
External dependency | Yes: separate ZooKeeper ensemble | No: embedded in application
Consistency model | Linearizable | Linearizable
Election speed | Fast (sub-second) | Fast (sub-second)
Operational complexity | High: manage ZooKeeper cluster | Medium: manage etcd cluster or embed
Maturity | Very mature (HBase, Kafka) | Mature (etcd, Consul)
Split-brain prevention | Requires fencing | Built-in via term numbers

Key takeaways

1. Leader election prevents split-brain by ensuring at most one coordinator at any time.
2. ZooKeeper is mature but adds operational overhead; Raft is modern and embeddable.
3. Always use fencing (generation clock) to prevent stale leaders from writing.
4. If you have fewer than three nodes, leader election is probably overkill; use a simpler approach.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 (Senior): How does Raft handle a network partition where the leader is isolated fr...
Q02 (Senior): When would you choose ZooKeeper over Raft for leader election in a produ...
Q03 (Senior): What happens if the leader in a Raft cluster crashes before sending a he...
Q04 (Junior): What is the purpose of an ephemeral node in ZooKeeper leader election?
Q05 (Senior): You see 'leader changed' errors every few seconds in production. What do...
Q06 (Senior): Design a leader election system for a global multi-datacenter deployment...
Q01 of 06 · Senior

How does Raft handle a network partition where the leader is isolated from a majority of followers?

ANSWER
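The isolated leader can no longer replicate entries to a majority, so it cannot commit new writes; implementations with a quorum check also make it step down. Meanwhile, the majority side times out and elects a new leader in a higher term. When the old leader reconnects, it sees the higher term and becomes a follower.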
FAQ · 4 QUESTIONS

Frequently Asked Questions

01. What is leader election in distributed systems?
02. What's the difference between ZooKeeper and Raft for leader election?
03. How do I implement leader election in my application?
04. How does leader election handle network partitions?