Senior 4 min · June 25, 2026

Leader Election in Distributed Systems: Avoid Split-Brain and Downtime

Q: What is leader election in distributed systems?

Leader election is a process where nodes in a distributed system agree on a single coordinator to manage shared resources and avoid conflicts. It ensures fault tolerance by automatically selecting a new leader if the current one fails.

Q: What's the difference between ZooKeeper and Raft for leader election?

ZooKeeper is an external coordination service that uses ephemeral nodes for leader election. Raft is a consensus algorithm with leader election built in. ZooKeeper requires managing a separate ensemble; Raft can be consumed as a service (e.g., etcd) or embedded in your application through a library, but it is complex to implement correctly from scratch.

Q: How do I implement leader election in my application?

Use a library like Apache Curator for ZooKeeper, or etcd's client library if you want a Raft-backed service. For ZooKeeper, create ephemeral sequential znodes under an election path. To embed Raft directly, use an existing implementation like HashiCorp's Raft library.

Q: How does leader election handle network partitions?

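In Raft, a leader isolated from the majority can no longer commit writes, and implementations with a quorum check (such as etcd) make it step down; the majority side elects a new leader in a higher term. In ZooKeeper, if the leader's session expires due to the partition, its ephemeral node is deleted, triggering a new election. Both ensure that only the majority side makes progress, but a stale leader may briefly still believe it is in charge, which is why writes to external resources also need fencing.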

Leader election explained with production patterns, ZooKeeper vs Raft, split-brain prevention, and a debugging guide for distributed systems.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Everything here is grounded in real deployments.

✓ Production

production tested

June 25, 2026

last updated

1,663

articles · all by Naren


⚡Quick Answer

Leader election ensures only one node acts as the master in a distributed system. Common implementations use ZooKeeper, etcd, or an embedded Raft library. The key challenge is handling network partitions without causing split-brain.

✦ Definition · ~90s read

What is Leader Election?

Leader election is a distributed systems pattern where nodes in a cluster agree on a single coordinator to manage shared resources, avoid conflicts, and ensure fault tolerance. It prevents split-brain scenarios where multiple nodes act as master simultaneously.

Plain-English First

Imagine a team of chefs in a kitchen. If everyone decides the menu, you get chaos. So they pick one head chef. If the head chef gets sick, the team quickly votes a new one. But if two chefs think they're head chef because of a miscommunication, you get two different meals. Leader election is the protocol to pick one head chef and handle when they disappear, without ending up with two.

You've got three database replicas. One handles writes, the others replicate. Then the network hiccups. Suddenly two replicas think they're the writer. You now have diverging data, angry customers, and a 3am restore. That's split-brain. Leader election is the only thing standing between you and that nightmare.

Without leader election, every node would need to coordinate on every write. That's a distributed lock per operation, and it kills throughput. With it, only the leader coordinates writes; followers just replicate. The problem is making sure at most one leader exists at any time, and that a new one emerges quickly when nodes crash or networks partition.

By the end of this, you'll be able to design a leader election system using ZooKeeper or Raft, debug common failures like stale leaders and split-brain, and know exactly when a simpler approach like a single coordinator is better.

Why Leader Election Exists: The Split-Brain Problem

Before leader election, distributed systems used a single coordinator. If it died, the system was down until manual recovery. That's not acceptable for modern services. So we automated failover. But automation introduces a new problem: two nodes might both think they're the coordinator. That's split-brain. It corrupts data, breaks idempotency, and causes cascading failures.

Leader election solves this by ensuring that at most one node acts as leader at any time. It relies on a consensus mechanism: either a lock held in a coordination service (like ZooKeeper, which runs its own consensus protocol internally) or a voting protocol (like Raft). The key property is safety: even under network partitions, at most one leader is elected for a given term.

Without this, you get the classic disaster: two writers to a database, each overwriting the other's changes. I've seen this bring down a payments service when a network switch failed and two instances of the payment processor both accepted transactions. The result? Duplicate charges and a weekend of manual reconciliation.

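LeaderElectionExample.systemdesign

// io.thecodeforge - System Design tutorial
// Simulated leader election using a lease-based lock.
// InMemoryLeaseLock is a single-process stand-in for a real distributed lock service.
import java.time.Duration;
import java.time.Instant;

public class LeaderElectionExample {

    /** Stand-in lock: one holder at a time, and the lease expires unless renewed. */
    static class InMemoryLeaseLock {
        private String holder;
        private Instant expiresAt = Instant.MIN;

        synchronized boolean tryAcquire(String nodeId, Duration ttl) {
            Instant now = Instant.now();
            boolean free = holder == null || now.isAfter(expiresAt);
            if (free || holder.equals(nodeId)) {   // re-acquiring renews the lease
                holder = nodeId;
                expiresAt = now.plus(ttl);
                return true;
            }
            return false;
        }
    }

    static class LeaderElection {
        private static final Duration LEASE_TTL = Duration.ofSeconds(30);
        private final InMemoryLeaseLock lock;
        private volatile String leaderId;

        LeaderElection(InMemoryLeaseLock lock) { this.lock = lock; }

        boolean tryBecomeLeader(String nodeId) {
            // Attempt to acquire the lock with a lease (TTL) of 30 seconds
            if (lock.tryAcquire(nodeId, LEASE_TTL)) {
                leaderId = nodeId;
                // In production, a scheduled task renews the lease well before it expires.
                // If renewal fails, the lease lapses and another node can become leader.
                return true;
            }
            return false;
        }
    }

    // Usage: two nodes compete for the same lock
    public static void main(String[] args) {
        InMemoryLeaseLock sharedLock = new InMemoryLeaseLock();
        for (String nodeId : new String[] {"node-1", "node-2"}) {
            LeaderElection election = new LeaderElection(sharedLock);
            if (election.tryBecomeLeader(nodeId)) {
                System.out.println("I am the leader!");
            } else {
                System.out.println("Another node is leader.");
            }
        }
    }
}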

Output

I am the leader!

Another node is leader.

Production Trap:

If your lock TTL is too short, the leader might lose the lock before it finishes critical work. If too long, failover takes forever. Start with 30 seconds and monitor.
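The TTL trade-off is easier to reason about with a small lease model. The sketch below is an in-memory illustration rather than a real lock service: `Lease`, `tryAcquire`, and `renew` are hypothetical names, and the current time is passed in explicitly so the behaviour is deterministic.

```java
/** In-memory lease: one holder at a time; ownership lapses ttlMillis after the last renewal. */
final class Lease {
    private final long ttlMillis;
    private String holder;        // null until someone acquires the lease
    private long expiresAtMillis; // only meaningful while holder != null

    Lease(long ttlMillis) { this.ttlMillis = ttlMillis; }

    /** Succeeds if the lease is free, expired, or already held by nodeId. */
    synchronized boolean tryAcquire(String nodeId, long nowMillis) {
        boolean expired = holder != null && nowMillis >= expiresAtMillis;
        if (holder == null || expired || holder.equals(nodeId)) {
            holder = nodeId;
            expiresAtMillis = nowMillis + ttlMillis;
            return true;
        }
        return false;
    }

    /** Heartbeat: extends the lease only if nodeId still holds it and it has not expired. */
    synchronized boolean renew(String nodeId, long nowMillis) {
        if (nodeId.equals(holder) && nowMillis < expiresAtMillis) {
            expiresAtMillis = nowMillis + ttlMillis;
            return true;
        }
        return false;
    }
}

final class LeaseDemo {
    public static void main(String[] args) {
        Lease lease = new Lease(30_000);                        // 30-second TTL
        System.out.println(lease.tryAcquire("node-1", 0));      // true: lease was free
        System.out.println(lease.renew("node-1", 10_000));      // true: now expires at 40 s
        System.out.println(lease.tryAcquire("node-2", 20_000)); // false: node-1 still holds it
        // node-1 stalls and misses its heartbeats
        System.out.println(lease.tryAcquire("node-2", 40_000)); // true: lease expired at 40 s
        System.out.println(lease.renew("node-1", 41_000));      // false: node-1 is no longer leader
    }
}
```

In this model, failover can take up to one full TTL after the last successful renewal, and the stalled node only learns it was replaced on its next `renew` call. That gap is the reason the fencing discussion later in the article matters.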

Figure: Leader Election in Distributed Systems

ZooKeeper-Based Leader Election: The Battle-Tested Approach

ZooKeeper is the old guard. It provides a reliable distributed coordination service with ephemeral nodes. The idea: each candidate creates an ephemeral sequential znode under an election path. The one with the smallest sequence number is the leader. If the leader dies, its ephemeral node disappears, and the next in line becomes leader.
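The selection rule can be sketched in plain Java, with no ZooKeeper client involved. This is a simulation under stated assumptions: the znode names (`n_` plus a zero-padded counter) and the helper names are illustrative, and the `predecessor` helper reflects the common variant of the recipe in which each candidate watches only the znode just ahead of it, so that a leader failure wakes one node rather than all of them.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Simulates the selection rule of the ephemeral-sequential election recipe. */
final class SequentialElection {
    /** Extracts the numeric suffix appended to a sequential znode name. */
    static int sequenceOf(String znode) {
        return Integer.parseInt(znode.substring(znode.lastIndexOf('_') + 1));
    }

    /** The candidate whose znode has the smallest sequence number is the leader. */
    static Optional<String> leader(List<String> znodes) {
        return znodes.stream().min(Comparator.comparingInt(SequentialElection::sequenceOf));
    }

    /** The znode a candidate should watch: the one just ahead of it, or empty if it leads. */
    static Optional<String> predecessor(List<String> znodes, String self) {
        int mine = sequenceOf(self);
        return znodes.stream()
                .filter(z -> sequenceOf(z) < mine)
                .max(Comparator.comparingInt(SequentialElection::sequenceOf));
    }
}

final class SequentialElectionDemo {
    public static void main(String[] args) {
        List<String> znodes =
                new ArrayList<>(List.of("n_0000000002", "n_0000000000", "n_0000000001"));
        System.out.println(SequentialElection.leader(znodes).get());                      // n_0000000000
        System.out.println(SequentialElection.predecessor(znodes, "n_0000000002").get()); // n_0000000001

        // The leader's session ends, so its ephemeral znode is deleted
        znodes.remove("n_0000000000");
        System.out.println(SequentialElection.leader(znodes).get());                      // n_0000000001
    }
}
```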

Why ZooKeeper? It's battle-tested at scale (Kafka, HBase, Solr). But it's also a separate service to manage. You need to run a ZooKeeper ensemble (odd number, 3 or 5). That's operational overhead.

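The classic rookie mistake: leaving the session timeout too short. If the leader's session expires, the ephemeral node is deleted, triggering an election even if the leader is still alive (for example, after a long GC pause or a brief network blip). This causes unnecessary leader changes. Always set a session timeout comfortably longer than the heartbeat interval and your worst expected pause, and remember that the server clamps the requested value (by default to between 2x and 20x its tickTime).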

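ZooKeeperLeaderElection.systemdesign

// io.thecodeforge - System Design tutorial
// ZooKeeper leader election using the Curator framework
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.curator.utils.CloseableUtils;

public class ZooKeeperLeaderElection {
    private final CuratorFramework client;
    private final String electionPath = "/election";
    private final String nodeId;
    private LeaderLatch leaderLatch;

    public ZooKeeperLeaderElection(String connectionString, String nodeId) {
        this.nodeId = nodeId;
        // 15 s session timeout and 5 s connection timeout are example values; tune them
        client = CuratorFrameworkFactory.newClient(connectionString, 15_000, 5_000,
                new ExponentialBackoffRetry(1000, 3));
        client.start();
    }

    public void start() throws Exception {
        leaderLatch = new LeaderLatch(client, electionPath, nodeId);
        leaderLatch.addListener(new LeaderLatchListener() {
            @Override
            public void isLeader() {
                // Called when this node acquires leadership
                System.out.println("I am the leader!");
                // Start accepting writes
            }

            @Override
            public void notLeader() {
                // Called when this node loses leadership
                System.out.println("I am a follower.");
                // Stop accepting writes
            }
        });
        leaderLatch.start();
    }

    public void close() {
        // LeaderLatch.close() declares IOException; closeQuietly handles it (and null)
        CloseableUtils.closeQuietly(leaderLatch);
        CloseableUtils.closeQuietly(client);
    }
}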

Output

I am the leader!

I am a follower.

Senior Shortcut:

Use Curator's LeaderLatch — it handles session management and re-election. Don't reinvent the ephemeral node logic.

Figure: ZooKeeper Leader Election Flow

Raft Consensus: The Modern Alternative

Raft is a consensus algorithm designed to be understandable. It's used in etcd, Consul, and TiKV. Unlike ZooKeeper which is a general coordination service, Raft is a protocol for replicated state machines. Leader election is built-in.

In Raft, each node is in one of three states: Leader, Follower, or Candidate. The leader sends periodic heartbeats. If a follower doesn't hear from the leader within its election timeout, it becomes a candidate, increments the term, and starts a new election. The candidate that gets votes from a majority of the cluster becomes the new leader for that term.

Raft's advantage: no external dependency. The cluster manages itself. But it's more complex to implement correctly. Most teams use an existing implementation like etcd or HashiCorp's Raft library.

I've seen teams try to implement Raft from scratch and get it wrong — the leader election can livelock if election timeouts are not randomized. Always use a battle-tested library.

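RaftLeaderElection.systemdesign

// io.thecodeforge - System Design tutorial
// Simplified, single-process sketch of Raft leader election.
// Omitted for brevity: log comparison in RequestVote, real RPCs, timers, persistence.
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RaftLeaderElection {
    static class RaftNode {
        enum State { FOLLOWER, CANDIDATE, LEADER }

        private final int nodeId;
        private final List<RaftNode> peers = new ArrayList<>(); // every node except this one
        private final Random random = new Random();
        private State state = State.FOLLOWER;
        private int term = 0;
        private int votedFor = -1;   // -1 means no vote cast in the current term

        RaftNode(int nodeId) { this.nodeId = nodeId; }

        void addPeer(RaftNode peer) { peers.add(peer); }

        // Randomized so that nodes rarely time out, and start elections, at the same moment
        long nextElectionTimeoutMillis() {
            return 150 + random.nextInt(151); // 150-300 ms, the range suggested in the Raft paper
        }

        public void startElection() {
            state = State.CANDIDATE;
            term++;
            votedFor = nodeId;
            int votesReceived = 1; // vote for self
            // Send RequestVote to all other nodes (direct calls stand in for RPCs)
            for (RaftNode peer : peers) {
                if (peer.requestVote(term, nodeId)) {
                    votesReceived++;
                }
            }
            int clusterSize = peers.size() + 1; // peers excludes this node
            if (votesReceived > clusterSize / 2) {
                state = State.LEADER;
                System.out.println("Became leader for term " + term);
                startHeartbeats();
            } else {
                state = State.FOLLOWER;
            }
        }

        public boolean requestVote(int candidateTerm, int candidateId) {
            if (candidateTerm < term) {
                return false; // stale candidate
            }
            if (candidateTerm > term) { // newer term: step down and forget the old vote
                term = candidateTerm;
                state = State.FOLLOWER;
                votedFor = -1;
            }
            if (votedFor == -1 || votedFor == candidateId) {
                votedFor = candidateId; // at most one vote per term
                return true;
            }
            return false;
        }

        private void startHeartbeats() {
            // Send AppendEntries RPCs periodically
        }
    }

    public static void main(String[] args) {
        RaftNode a = new RaftNode(1), b = new RaftNode(2), c = new RaftNode(3);
        a.addPeer(b); a.addPeer(c);
        b.addPeer(a); b.addPeer(c);
        c.addPeer(a); c.addPeer(b);
        a.startElection();
    }
}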

Output

Became leader for term 1

Interview Gold:

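Raft's liveness, not its safety, depends on the election timeout being randomized. Safety (at most one leader per term) holds regardless, because each node votes at most once per term. Without randomization, multiple candidates start elections simultaneously, causing repeated split votes and no leader.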
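A rough way to see why randomization helps is to count how many nodes would time out at the same instant. The sketch below is only an illustration: `ElectionTimeouts` and its methods are hypothetical names, and the 150 ms base is an assumed value, not a recommendation from any particular library.

```java
import java.util.Random;

final class ElectionTimeouts {
    /** Draws a timeout uniformly from [baseMillis, 2 * baseMillis). */
    static long randomized(Random rng, long baseMillis) {
        return baseMillis + (long) (rng.nextDouble() * baseMillis);
    }

    /** Counts how many of the given timeouts tie for the earliest expiry. */
    static int firstToExpireCount(long[] timeouts) {
        long min = Long.MAX_VALUE;
        int count = 0;
        for (long t : timeouts) {
            if (t < min) {
                min = t;
                count = 1;
            } else if (t == min) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Identical timeouts: every follower becomes a candidate at once
        long[] fixed = {150, 150, 150};
        System.out.println("fixed: " + firstToExpireCount(fixed) + " simultaneous candidates");

        // Randomized timeouts: usually one node times out first and wins before the rest wake up
        Random rng = new Random();
        long[] randomizedTimeouts = new long[3];
        for (int i = 0; i < randomizedTimeouts.length; i++) {
            randomizedTimeouts[i] = randomized(rng, 150);
        }
        System.out.println("randomized: " + firstToExpireCount(randomizedTimeouts)
                + " simultaneous candidate(s)");
    }
}
```

With fixed timeouts the three nodes always tie. With randomized ones a tie is still possible but unlikely, and a fresh draw on the next round breaks it.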

When Not to Use Leader Election: The Overkill Trap

Leader election adds complexity. You need consensus, heartbeats, and failover logic. For many systems, a simpler approach works fine.

If your system can tolerate temporary inconsistency (e.g., caching layer), use a gossip protocol or CRDTs instead. If you have a single writer that rarely fails, manual failover might be acceptable. If your cluster is small and you control the network, a primary-replica setup with a static leader is simpler.

I've seen teams add ZooKeeper to a two-node system. That's madness. ZooKeeper needs at least three nodes to be fault-tolerant. For two nodes, use a shared disk or a simple heartbeat with STONITH (Shoot The Other Node In The Head).

The rule: only use leader election when you need automatic failover and you have at least three nodes. Otherwise, you're adding complexity without benefit.

Never Do This:

Running ZooKeeper with an even number of nodes. It doesn't improve fault tolerance — you still need a majority. 3 or 5 nodes only.
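The arithmetic behind this rule is short enough to check directly. The class and method names below are illustrative; the formulas are the standard majority-quorum definitions.

```java
final class Quorum {
    /** Smallest number of nodes that forms a majority of the ensemble. */
    static int majority(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    /** Number of nodes that can fail while a majority stays reachable. */
    static int tolerableFailures(int ensembleSize) {
        return ensembleSize - majority(ensembleSize);
    }

    public static void main(String[] args) {
        for (int n = 2; n <= 6; n++) {
            System.out.println(n + " nodes: majority=" + majority(n)
                    + ", tolerates " + tolerableFailures(n) + " failure(s)");
        }
    }
}
```

Four nodes tolerate one failure, the same as three, and six tolerate two, the same as five. The extra even-numbered node adds cost and another machine that can fail, without raising the number of failures the ensemble survives.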

Figure: Leader Election vs Simpler Approaches

Split-Brain Prevention: Fencing and Quorum

Even with leader election, split-brain can happen if the old leader doesn't know it's been deposed. The solution: fencing. When a new leader is elected, it must ensure the old leader can no longer access shared resources. This is done via a fence mechanism — e.g., revoking IAM permissions, killing the old leader's process, or using a distributed lock with a generation clock.

In Raft, the term number acts as a generation clock. The leader includes its term in every request. If a follower receives a request with an older term, it rejects it. This prevents stale leaders from writing.
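The follower-side check can be reduced to a few lines. This is a sketch of just the term comparison, with hypothetical names (`TermGuard`, `accept`); a real Raft follower also validates log position and resets its election timer.

```java
/** Follower-side term check, reduced to the part that fences off stale leaders. */
final class TermGuard {
    private long currentTerm;

    TermGuard(long initialTerm) { this.currentTerm = initialTerm; }

    /** Returns true if a request carrying leaderTerm is accepted. */
    synchronized boolean accept(long leaderTerm) {
        if (leaderTerm < currentTerm) {
            return false;          // sender was deposed in an earlier term
        }
        currentTerm = leaderTerm;  // adopt the newer (or equal) term
        return true;
    }

    synchronized long currentTerm() { return currentTerm; }
}

final class TermGuardDemo {
    public static void main(String[] args) {
        TermGuard follower = new TermGuard(3);
        System.out.println(follower.accept(3)); // true: current leader, term 3
        System.out.println(follower.accept(4)); // true: new leader elected in term 4
        System.out.println(follower.accept(3)); // false: old leader is now stale
    }
}
```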

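In ZooKeeper, use a fencing token: a number that increases with every election, such as an epoch the leader writes to a znode (the znode's creation zxid or version also works, since both only ever increase). The leader attaches the token to every write, and the protected resource rejects any write whose token is older than the newest one it has seen.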

I've seen a production outage where a network partition caused two leaders to be elected (ZooKeeper misconfigured). The old leader kept writing to a database, corrupting data. The fix: add a fencing layer that kills the old leader's process when a new leader is elected.

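FencingExample.systemdesign

// io.thecodeforge - System Design tutorial
// Fencing with a generation clock.
// The AtomicInteger and AtomicReference are in-process stand-ins for a counter and
// a lock kept in a reliable store (ZooKeeper, etcd).
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

public class FencingExample {
    static final AtomicInteger generationCounter = new AtomicInteger();
    static final AtomicReference<String> lockHolder = new AtomicReference<>();

    static class FencedLeader {
        private final String nodeId;
        private int generation = -1;

        FencedLeader(String nodeId) { this.nodeId = nodeId; }

        public boolean tryBecomeLeader() {
            // Acquire the lock first and bump the generation only on success,
            // so a failed candidate never invalidates the current leader.
            if (lockHolder.compareAndSet(null, nodeId)) {
                generation = generationCounter.incrementAndGet();
                return true;
            }
            return false;
        }

        public void writeData(String data) {
            // Before writing, check that our generation is still current.
            // This check-then-write is not atomic: for a hard guarantee, the storage
            // layer itself must reject writes that carry an older generation.
            if (generationCounter.get() != generation) {
                throw new IllegalStateException("Stale leader: aborting write");
            }
            // Perform write
            System.out.println("Writing: " + data);
        }
    }

    public static void main(String[] args) {
        FencedLeader leader = new FencedLeader("node-1");
        if (leader.tryBecomeLeader()) {
            leader.writeData("some data");
        }
    }
}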

Output

Writing: some data

Senior Shortcut:

Use a generation counter stored in a reliable store (ZooKeeper, etcd). Always check it before writing. This prevents split-brain writes.
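A leader-side check can still race with a new election, so a stronger variant has the protected resource enforce the token itself. The sketch below assumes a hypothetical `FencedStore` that remembers the highest generation it has seen; the name and API are illustrative and do not belong to any real storage system.

```java
import java.util.ArrayList;
import java.util.List;

/** A resource that enforces fencing itself by remembering the highest token it has seen. */
final class FencedStore {
    private long highestToken = Long.MIN_VALUE;
    private final List<String> log = new ArrayList<>();

    /** Applies the write only if token is at least as new as every token seen so far. */
    synchronized boolean write(long token, String data) {
        if (token < highestToken) {
            return false; // stale leader: reject
        }
        highestToken = token;
        log.add(data);
        return true;
    }

    synchronized List<String> contents() {
        return List.copyOf(log);
    }
}

final class FencedStoreDemo {
    public static void main(String[] args) {
        FencedStore store = new FencedStore();
        System.out.println(store.write(33, "from old leader")); // true
        System.out.println(store.write(34, "from new leader")); // true: newer generation
        System.out.println(store.write(33, "late write"));      // false: old leader is fenced off
        System.out.println(store.contents());                   // [from old leader, from new leader]
    }
}
```

Because the comparison and the write happen under one lock inside the store, a paused leader that wakes up with an old generation cannot slip a write in between the check and the update.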

● Production incident · Post-mortem · Severity: high

The 4GB Container That Kept Dying

Symptom

Every 30 minutes, the leader node would crash with OOMKilled. A new leader would be elected, then crash again. Writes were down for 2 minutes each cycle.

Assumption

Assumed a memory leak in the application code.

Root cause

The leader node was running a ZooKeeper ephemeral node watcher that triggered a full heap dump on leader election. The heap dump exceeded the 4GB container memory limit, causing OOMKill.

Fix

Removed the heap dump trigger. Increased container memory to 8GB as a buffer. Added a memory limit alert at 70%.

Key lesson

Never run expensive operations like heap dumps in the leader election callback — it's a critical path that must complete quickly.
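One way to keep the callback cheap is to flip local state and hand anything slow to a separate thread. The sketch below uses hypothetical names (`LeadershipHandler`, `onElected`, `onDeposed`) and plain `java.util.concurrent` classes; it addresses how long the callback takes, not the container memory limit from the incident.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Keeps the leadership callback cheap by handing slow work to a separate thread. */
final class LeadershipHandler {
    private final ExecutorService background = Executors.newSingleThreadExecutor();
    private volatile boolean acceptingWrites;

    /** Called by the election library on its own thread; must return quickly. */
    void onElected(Runnable slowDiagnostics) {
        acceptingWrites = true;             // critical path: flip state and return
        background.submit(slowDiagnostics); // heavy work runs off the election thread
    }

    void onDeposed() { acceptingWrites = false; }

    boolean isAcceptingWrites() { return acceptingWrites; }

    void shutdown() throws InterruptedException {
        background.shutdown();
        background.awaitTermination(5, TimeUnit.SECONDS);
    }
}

final class LeadershipHandlerDemo {
    public static void main(String[] args) throws InterruptedException {
        LeadershipHandler handler = new LeadershipHandler();
        CountDownLatch done = new CountDownLatch(1);
        handler.onElected(() -> {
            try {
                Thread.sleep(200); // stands in for an expensive diagnostic
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            done.countDown();
        });
        System.out.println(handler.isAcceptingWrites()); // true, without waiting for the diagnostic
        done.await();
        handler.shutdown();
    }
}
```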

Production debug guide

Systematic recovery paths for the failure modes engineers actually hit.

Symptom · 01

No leader elected for extended period

Fix

1. Check network connectivity between nodes.
2. Verify majority of nodes are alive.
3. Check election timeout configuration.
4. Restart all nodes if necessary.

Symptom · 02

Multiple leaders (split-brain)

Fix

1. Immediately stop all writes.
2. Check fencing mechanism.
3. Verify generation clock consistency.
4. Manually designate a single leader and restart others.

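Symptom · 03

Leader flapping (frequent changes)

Fix

1. Check leader node resource usage (CPU, memory).
2. Check that the heartbeat interval is well below the election timeout (etcd's defaults are 100 ms and 1000 ms); lengthening heartbeats alone makes flapping worse.
3. Increase the election timeout if heartbeats are being delayed by load or network latency.
4. Check for network packet loss.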

★ Leader Election Triage Cheat Sheet

First-response commands for when things go wrong, copy-paste ready.

No leader elected: `No leader` error in logs

Immediate action

Check node count and network

Commands

curl -s http://node:2379/v2/stats/leader | jq .leader

ping <other-node-ip>

Note: the /v2/stats endpoint is only served when etcd's v2 API is available. On etcd v3, `etcdctl endpoint status --cluster -w table` shows which member is the leader.

Fix now

Restart all nodes with systemctl restart etcd


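Feature / Aspect	ZooKeeper	Raft (etcd)
External dependency	Yes: separate ZooKeeper ensemble	Optional: run an etcd cluster, or embed a Raft library in the application
Consistency model	Linearizable writes; reads are sequentially consistent unless preceded by sync	Linearizable
Election speed	Bounded by the session timeout (typically seconds)	Bounded by the election timeout (etcd default: 1000 ms)
Operational complexity	High: manage ZooKeeper cluster	Medium: manage etcd cluster or embed
Maturity	Very mature (HBase, Kafka)	Mature (etcd, Consul)
Split-brain prevention	Requires fencing	Term numbers protect the replicated log; external resources still need fencing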

Key takeaways

Leader election prevents split-brain by ensuring at most one coordinator at any time.

ZooKeeper is mature but adds operational overhead; Raft is modern and embeddable.

Always use fencing (generation clock) to prevent stale leaders from writing.

If you have fewer than three nodes, leader election is probably overkill; use a simpler approach.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

How does Raft handle a network partition where the leader is isolated fr...

Q02 · SENIOR

When would you choose ZooKeeper over Raft for leader election in a produ...

Q03 · SENIOR

What happens if the leader in a Raft cluster crashes before sending a he...

Q04 · JUNIOR

What is the purpose of an ephemeral node in ZooKeeper leader election?

Q05 · SENIOR

You see 'leader changed' errors every few seconds in production. What do...

Q06 · SENIOR

Design a leader election system for a global multi-datacenter deployment...

Q01 of 06 · SENIOR

How does Raft handle a network partition where the leader is isolated from a majority of followers?

ANSWER

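The isolated leader can no longer replicate entries to a majority, so it cannot commit any new writes. In implementations with a quorum check or leader lease (etcd, HashiCorp's Raft library), it also steps down once it stops receiving heartbeat responses from a majority. The majority side times out, elects a new leader in a higher term, and keeps serving. When the old leader reconnects, it sees the higher term, becomes a follower, and discards its uncommitted entries.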

