Leader Election in Distributed Systems: Avoid Split-Brain and Downtime
Leader election explained with production patterns, ZooKeeper vs Raft, split-brain prevention, and debugging guide for distributed systems..
20+ years shipping large-scale distributed systems. Everything here is grounded in real deployments.
Leader election ensures only one node acts as the master in a distributed system. Common implementations use ZooKeeper, etcd, or Raft consensus. The key challenge is handling network partitions without causing split-brain.
Imagine a team of chefs in a kitchen. If everyone decides the menu, you get chaos. So they pick one head chef. If the head chef gets sick, the team quickly votes a new one. But if two chefs think they're head chef because of a miscommunication, you get two different meals. Leader election is the protocol to pick one head chef and handle when they disappear, without ending up with two.
You've got three database replicas. One handles writes, the others replicate. Then the network hiccups. Suddenly two replicas think they're the writer. You now have diverging data, angry customers, and a 3am restore. That's split-brain. Leader election is the only thing standing between you and that nightmare.
Without leader election, every node would need to coordinate on every write — that's a distributed lock per operation, and it kills throughput. With it, only the leader coordinates writes; followers just replicate. The problem is making sure exactly one leader exists at all times, even when nodes crash or networks partition.
By the end of this, you'll be able to design a leader election system using ZooKeeper or Raft, debug common failures like stale leaders and split-brain, and know exactly when a simpler approach like a single coordinator is better.
Why Leader Election Exists: The Split-Brain Problem
Before leader election, distributed systems used a single coordinator. If it died, the system was down until manual recovery. That's not acceptable for modern services. So we automated failover. But automation introduces a new problem: two nodes might both think they're the coordinator. That's split-brain. It corrupts data, breaks idempotency, and causes cascading failures.
Leader election solves this by ensuring that at most one node acts as leader at any time. It uses a consensus mechanism — either a distributed lock (like ZooKeeper) or a voting protocol (like Raft). The key property is safety: even under network partitions, only one leader is elected.
Without this, you get the classic disaster: two writers to a database, each overwriting the other's changes. I've seen this bring down a payments service when a network switch failed and two instances of the payment processor both accepted transactions. The result? Duplicate charges and a weekend of manual reconciliation.
ZooKeeper-Based Leader Election: The Battle-Tested Approach
ZooKeeper is the old guard. It provides a reliable distributed coordination service with ephemeral nodes. The idea: each candidate creates an ephemeral sequential znode under an election path. The one with the smallest sequence number is the leader. If the leader dies, its ephemeral node disappears, and the next in line becomes leader.
Why ZooKeeper? It's battle-tested at scale (Kafka, HBase, Solr). But it's also a separate service to manage. You need to run a ZooKeeper ensemble (odd number, 3 or 5). That's operational overhead.
The classic rookie mistake: forgetting to set a session timeout. If the leader's session expires, the ephemeral node is deleted, triggering an election even if the leader is still alive. This causes unnecessary leader changes. Always set a session timeout that's longer than your heartbeat interval.
Raft Consensus: The Modern Alternative
Raft is a consensus algorithm designed to be understandable. It's used in etcd, Consul, and TiKV. Unlike ZooKeeper which is a general coordination service, Raft is a protocol for replicated state machines. Leader election is built-in.
In Raft, nodes are in three states: Leader, Follower, or Candidate. Leaders send heartbeats. If followers don't hear from the leader within an election timeout, they become candidates and start a new election. The candidate that gets votes from a majority becomes the new leader.
Raft's advantage: no external dependency. The cluster manages itself. But it's more complex to implement correctly. Most teams use an existing implementation like etcd or HashiCorp's Raft library.
I've seen teams try to implement Raft from scratch and get it wrong — the leader election can livelock if election timeouts are not randomized. Always use a battle-tested library.
When Not to Use Leader Election: The Overkill Trap
Leader election adds complexity. You need consensus, heartbeats, and failover logic. For many systems, a simpler approach works fine.
If your system can tolerate temporary inconsistency (e.g., caching layer), use a gossip protocol or CRDTs instead. If you have a single writer that rarely fails, manual failover might be acceptable. If your cluster is small and you control the network, a primary-replica setup with a static leader is simpler.
I've seen teams add ZooKeeper to a two-node system. That's madness. ZooKeeper needs at least three nodes to be fault-tolerant. For two nodes, use a shared disk or a simple heartbeat with STONITH (Shoot The Other Node In The Head).
The rule: only use leader election when you need automatic failover and you have at least three nodes. Otherwise, you're adding complexity without benefit.
Split-Brain Prevention: Fencing and Quorum
Even with leader election, split-brain can happen if the old leader doesn't know it's been deposed. The solution: fencing. When a new leader is elected, it must ensure the old leader can no longer access shared resources. This is done via a fence mechanism — e.g., revoking IAM permissions, killing the old leader's process, or using a distributed lock with a generation clock.
In Raft, the term number acts as a generation clock. The leader includes its term in every request. If a follower receives a request with an older term, it rejects it. This prevents stale leaders from writing.
In ZooKeeper, use a fencing token: the leader writes its epoch to a znode. Before writing, check that the epoch matches. If not, abort.
I've seen a production outage where a network partition caused two leaders to be elected (ZooKeeper misconfigured). The old leader kept writing to a database, corrupting data. The fix: add a fencing layer that kills the old leader's process when a new leader is elected.
The 4GB Container That Kept Dying
- Never run expensive operations like heap dumps in the leader election callback — it's a critical path that must complete quickly.
curl -s http://node:2379/v2/stats/leader | jq .leaderping <other-node-ip>systemctl restart etcdKey takeaways
Interview Questions on This Topic
How does Raft handle a network partition where the leader is isolated from a majority of followers?
Frequently Asked Questions
20+ years shipping large-scale distributed systems. Everything here is grounded in real deployments.
That's Distributed Systems. Mark it forged?
4 min read · try the examples if you haven't