Distributed Consensus: Paxos vs Raft – Which One Won't Fail You at 3 AM?
Distributed consensus deep dive: Paxos and Raft internals, production gotchas, failure modes, and when to use each.
20+ years shipping large-scale distributed systems. Everything here is grounded in real deployments.
Paxos is the foundational but notoriously hard-to-implement consensus algorithm. Raft is its understandable, engineer-friendly cousin that's now the default for most new systems. For production, start with Raft unless you have a specific reason (e.g., existing Paxos infrastructure) — it's easier to debug and maintain.
Imagine a group of friends deciding where to eat dinner. They can't all talk at once, so they pick a leader who proposes a restaurant. Everyone votes, and if a majority agrees, that's where they go. But if the leader disappears mid-vote, someone else takes over. Paxos is like having a complex voting protocol where anyone can propose, but it's easy to get confused. Raft simplifies by always having a clear leader — like a designated decision-maker — making the whole process easier to follow and recover from failures.
You've probably never seen a pure Paxos implementation in production. The systems usually described as 'Paxos' are heavily modified variants: Google's Chubby runs a customized Multi-Paxos, and ZooKeeper's Zab is a separate atomic broadcast protocol that borrows ideas from Paxos rather than implementing it. Why? Because the original Paxos is a mathematical proof, not a blueprint. It's elegant on paper but a nightmare to implement correctly. I've debugged a Paxos-based system at 2 AM where a network partition caused a split-brain that took down an entire payment pipeline. The fix wasn't a patch. It was rewriting the consensus layer on Raft.
This article is the definitive guide to distributed consensus algorithms for engineers who've shipped production code. You'll learn exactly how Paxos and Raft work under the hood, where they break, and which one to choose for your next system. By the end, you'll be able to debug a consensus failure in production, explain the trade-offs in a system design interview, and avoid the mistakes that have burned teams before yours.
Why Consensus Matters: The Problem Nobody Talks About
Before consensus algorithms, distributed systems used ad-hoc replication: primary-backup with a heartbeat. If the primary died, a backup took over. But what if the primary is just slow? The backup promotes itself while the old primary keeps serving, and you get split-brain: two servers accepting writes, diverging data. That's how you lose money. Consensus solves this by ensuring that at most one leader is elected per term, so a stale leader can no longer get writes committed, and that all replicas agree on the order of operations. Without it, you can't build a reliable replicated state machine, the foundation of databases, configuration stores, and coordination services.
Paxos: The Gold Standard You'll Never Implement
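Paxos is the mathematical foundation of consensus. It proves that a set of nodes can agree on a value even if a minority of them fail. But it's notoriously hard to implement correctly. The original paper describes 'single-decree' Paxos, which agrees on exactly one value. Multi-Paxos extends this to a log of values, but the protocol is underspecified. Production systems described as 'Paxos' (e.g., Google's Chubby) are actually custom variants, and ZooKeeper's Zab is a related but distinct atomic broadcast protocol. The core idea: a proposer sends a 'prepare' request with a unique proposal number (some texts call it a ballot or epoch) to the acceptors. If a majority promises not to accept lower-numbered proposals, the proposer sends an 'accept' request. The value in that request is constrained: if any promise reported a previously accepted value, the proposer must reuse the value from the highest-numbered one, and only otherwise may it propose its own. A proposer with a higher number can preempt an in-flight proposal, but it can never change a value that a majority has already accepted. This works, but the complexity of handling multiple proposers, failures, and log replication makes it a minefield.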
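The two phases can be sketched in a few lines. This is a minimal in-memory illustration of single-decree Paxos, assuming reliable synchronous calls and no persistence. The names `Acceptor` and `propose` are hypothetical and not taken from any library.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised: int = -1                    # highest proposal number promised so far
    accepted_n: int = -1                  # number of the accepted proposal, -1 if none
    accepted_value: Optional[str] = None

    def prepare(self, n: int) -> Tuple[bool, int, Optional[str]]:
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_value
        return False, self.accepted_n, self.accepted_value

    def accept(self, n: int, value: str) -> bool:
        """Phase 2: accept unless a higher-numbered promise was already made."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


def propose(acceptors: List[Acceptor], n: int, value: str) -> Optional[str]:
    """Run both phases; return the chosen value, or None if preempted."""
    majority = len(acceptors) // 2 + 1

    replies = [a.prepare(n) for a in acceptors]
    promises = [(acc_n, acc_v) for ok, acc_n, acc_v in replies if ok]
    if len(promises) < majority:
        return None

    # Reuse the value of the highest-numbered accepted proposal, if any.
    prev_n, prev_value = max(promises, key=lambda p: p[0])
    if prev_n >= 0:
        value = prev_value

    acks = sum(1 for a in acceptors if a.accept(n, value))
    return value if acks >= majority else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, n=1, value="pizza")
second = propose(acceptors, n=2, value="sushi")
print(first, second)  # pizza pizza
```

The second proposer wanted "sushi" but is forced to re-propose "pizza", because a majority had already accepted it. That value-adoption rule is what keeps a higher-numbered proposal from overwriting a decision.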
Raft: The Engineer's Consensus Algorithm
Raft was designed to be understandable. It decomposes consensus into three subproblems: leader election, log replication, and safety. The key insight: always have a strong leader. The leader handles all client requests and replicates its log to followers. If the leader fails, a new one is elected with a higher term. Raft uses randomized election timeouts to avoid split votes. Log entries are committed once they're replicated to a majority. This simplicity makes Raft the go-to for new systems: etcd, Consul, TiKV, and many others. But don't be fooled — Raft has its own edge cases, like log inconsistency after a leader crash, which requires the leader to force its log on followers.
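A sketch of the voting side of leader election may help here. It assumes an in-memory node with no RPC layer or persistence. `Node` and `election_timeout_ms` are hypothetical names, and the 150-300 ms range is the example range given in the Raft paper, not a recommendation for any particular deployment.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    node_id: str
    current_term: int = 0
    voted_for: Optional[str] = None
    log: List[Tuple[int, str]] = field(default_factory=list)  # (term, command)

    def last_log(self) -> Tuple[int, int]:
        """Return (last_term, last_index); indices are 1-based, 0 means empty."""
        if not self.log:
            return (0, 0)
        return (self.log[-1][0], len(self.log))

    def handle_request_vote(self, term: int, candidate_id: str,
                            last_log_term: int, last_log_index: int) -> bool:
        if term < self.current_term:
            return False                      # stale candidate
        if term > self.current_term:
            self.current_term = term          # newer term: forget the old vote
            self.voted_for = None
        # Grant only if the candidate's log is at least as up-to-date as ours:
        # later last term wins, and on equal terms the longer log wins.
        up_to_date = (last_log_term, last_log_index) >= self.last_log()
        if up_to_date and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False


def election_timeout_ms(rng: random.Random, low: float = 150.0,
                        high: float = 300.0) -> float:
    """Pick a fresh random timeout so nodes rarely time out together."""
    return rng.uniform(low, high)


a = Node("a", log=[(1, "x"), (2, "y")])
b = Node("b", log=[(1, "x")])
print(a.handle_request_vote(3, "b", *b.last_log()))  # False: b's log is behind a's
print(b.handle_request_vote(4, "a", *a.last_log()))  # True: a's log is ahead of b's
print(b.handle_request_vote(4, "c", 2, 2))           # False: b already voted in term 4
```

Each node votes at most once per term, which is why two candidates cannot both collect a majority in the same term.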
Log Replication and Commitment: The Devil in the Details
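Once a leader is elected, it replicates log entries to followers. The leader sends AppendEntries RPCs with new entries. Each RPC carries the index and term of the entry that precedes the new ones. A follower appends only if its own log matches at that point, and otherwise rejects so the leader can retry from an earlier index. The leader commits an entry from its current term once it's stored on a majority of nodes, which also commits every entry before it. But what if a follower is behind? The leader keeps retrying until that follower catches up, without holding back commits that already have a majority. What if the leader crashes after committing but before responding to the client? The client retries, and the new leader's log will contain the committed entry. To keep the retry from applying the command twice, clients tag each request with a unique ID that the state machine uses to deduplicate. The tricky part: Raft guarantees that if an entry is committed in a given term, it will be present in the logs of the leaders of all later terms. This is enforced by the election restriction: a voter denies its vote to any candidate whose log is less up-to-date than its own, so a candidate can only win a majority if its log contains every committed entry.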
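The commit rule is easy to get subtly wrong, so here is a small sketch of how a leader might advance its commit index. It assumes 1-based log indices and a map of how far each follower's log is known to match the leader's. `advance_commit_index` and its parameters are hypothetical names.

```python
from typing import Dict, List


def advance_commit_index(log_terms: List[int], follower_match: Dict[str, int],
                         current_term: int, commit_index: int) -> int:
    """Return the new commit index for the leader.

    log_terms[i - 1] is the term of the leader's entry at index i.
    follower_match maps follower id to the highest index known to be
    replicated on that follower. The leader always holds its own log.
    """
    cluster_size = len(follower_match) + 1
    majority = cluster_size // 2 + 1
    # Walk back from the newest entry and take the first index that qualifies.
    for n in range(len(log_terms), commit_index, -1):
        replicas = 1 + sum(1 for m in follower_match.values() if m >= n)
        # Only entries from the leader's current term are committed by counting.
        if replicas >= majority and log_terms[n - 1] == current_term:
            return n
    return commit_index


# 5-node cluster: leader plus four followers, leader is in term 3.
terms = [1, 1, 2, 3, 3]
match = {"b": 5, "c": 4, "d": 2, "e": 1}
print(advance_commit_index(terms, match, current_term=3, commit_index=2))  # 4

# A new term-3 leader whose log holds only older-term entries commits nothing
# by counting, even though every node stores those entries.
print(advance_commit_index([1, 2], {"b": 2, "c": 2}, current_term=3, commit_index=0))  # 0
```

The second call shows why the current-term check matters: older-term entries become committed only indirectly, once an entry from the leader's own term reaches a majority.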
Safety and Liveness: The Trade-offs That Bite
Consensus algorithms guarantee safety (no two nodes decide different values) but cannot guarantee liveness in a fully asynchronous network (the FLP impossibility result): under some failures and timing, the system stops making progress. Paxos and Raft both stay safe under asynchronous networks and crash failures. But liveness can be compromised: in Paxos, competing proposers can livelock by repeatedly preempting each other with higher proposal numbers (epochs). Raft makes this unlikely with randomized timeouts, but a network partition can still prevent a majority from forming, halting progress. The classic example: a 5-node cluster splits into 3 and 2. The partition with 3 nodes can elect a leader and make progress. The partition with 2 cannot commit anything. When the network heals, a stale leader on the minority side sees the higher term and steps down. This is correct behavior, but it means clients that can only reach the minority side get no service for the duration of the partition.
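The arithmetic behind the 3/2 split is worth making explicit. A minimal sketch, with hypothetical helper names:

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a quorum."""
    return cluster_size // 2 + 1


def can_make_progress(cluster_size: int, reachable: int) -> bool:
    """A partition can commit only if it holds a majority of the full cluster."""
    return reachable >= majority(cluster_size)


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can be lost while a majority still exists."""
    return (cluster_size - 1) // 2


for side in (3, 2):
    print(side, can_make_progress(5, side))
# 3 True
# 2 False

print(tolerated_failures(4), tolerated_failures(5))  # 1 2
```

Note that a 4-node cluster tolerates no more failures than a 3-node one, which is why odd cluster sizes are the usual choice.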
Performance: How to Not Make It Slow
Every consensus write waits on a majority: latency is at least one network round trip from the leader to the followers, plus a durable log write on each node that acknowledges. Raft's leader-based design means all writes go through the leader, which can be a bottleneck. To improve throughput, batch multiple entries into a single AppendEntries RPC. Use pipelining: the leader sends new entries without waiting for earlier ones to be acknowledged. And amortize disk cost by fsyncing the log once per batch rather than once per entry. In practice, etcd can handle tens of thousands of writes per second on modern hardware. But if you need more, consider sharding: run multiple Raft groups and distribute keys across them. This is what CockroachDB does.
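Batching is the cheapest of these wins. The sketch below groups pending entries so that each group could travel in one AppendEntries RPC and be flushed with one fsync. The function name and both limits are assumptions chosen for illustration, not values from etcd or any other implementation.

```python
from typing import Iterable, Iterator, List


def batch_entries(pending: Iterable[bytes], max_entries: int = 64,
                  max_bytes: int = 1 << 20) -> Iterator[List[bytes]]:
    """Yield batches bounded by entry count and total payload size."""
    batch: List[bytes] = []
    size = 0
    for entry in pending:
        # Close the current batch if adding this entry would exceed a limit.
        if batch and (len(batch) >= max_entries or size + len(entry) > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(entry)
        size += len(entry)
    if batch:
        yield batch


entries = [b"x" * 10 for _ in range(5)]
print([len(b) for b in batch_entries(entries, max_entries=2)])  # [2, 2, 1]
print([len(b) for b in batch_entries(entries, max_bytes=25)])   # [2, 2, 1]
```

An entry larger than `max_bytes` still goes out alone in its own batch, so a single oversized write cannot stall the queue.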
When Not to Use Consensus: The Overkill Trap
Consensus is expensive. Every write requires a durable log write (fsync) on a majority of nodes. If you need very high throughput or very low latency, consider alternatives. For leader election without running your own replicated log, use leases from an existing coordination service (e.g., etcd's lease mechanism), which keeps the consensus inside that service. For cluster membership and failure detection, use a gossip protocol (e.g., SWIM). For data replication, consider CRDTs if you can tolerate eventual consistency. Consensus is the right tool when you need strong consistency and fault tolerance, but it's not a silver bullet. I've seen teams use Raft for a simple counter that could have been a single Redis instance. Don't be that team.
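To make the CRDT option concrete, here is a sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. It assumes each replica has a stable unique ID and that replicas exchange state in some way the sketch does not model. The class name is illustrative.

```python
from typing import Dict


class GCounter:
    """Grow-only counter: each replica increments only its own slot."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.counts: Dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        """Take the per-replica maximum; safe to repeat and to reorder."""
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
a.merge(b)  # merging again changes nothing
print(a.value(), b.value())  # 5 5
```

No quorum and no fsync on a majority are involved. The price is that a replica may report a stale total until it has merged with its peers.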
Production Gotchas: What the Papers Don't Tell You
1. Timers and clocks: Raft's election timeouts measure elapsed time on each node, so correctness does not depend on synchronized wall clocks. What hurts is a timeout set too close to the network round trip or to a disk stall, which triggers needless elections. Set timeouts generously (e.g., 5-10x the expected network round trip). Keep NTP running anyway, since lease-based optimizations and log correlation assume bounded clock drift.
2. Disk latency: fsync is slow. If your disk is shared (e.g., network-attached), a single slow node can bottleneck the entire cluster. Use local SSDs.
3. Membership changes: Adding or removing nodes is tricky. Raft's joint consensus approach is correct but complex. Use single-server changes (add one or remove one at a time) to minimize risk.
4. Snapshotting: If the log grows unbounded, recovery takes forever. Set a snapshot threshold and test recovery time.
5. Network partitions: They happen. Ensure your cluster can survive a partition without data loss. Test with chaos engineering.
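The snapshotting point can be illustrated with a toy key-value log that compacts itself once a threshold is reached. This sketch assumes every appended entry is already committed and applied, and it keeps everything in memory. `CompactingLog` and its fields are hypothetical names, not etcd's internals.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CompactingLog:
    snapshot_threshold: int = 1000
    snapshot_index: int = 0                        # last log index covered by the snapshot
    snapshot_state: Dict[str, str] = field(default_factory=dict)
    entries: List[Tuple[str, str]] = field(default_factory=list)  # writes after the snapshot

    def apply(self, key: str, value: str) -> None:
        self.entries.append((key, value))
        if len(self.entries) >= self.snapshot_threshold:
            self.take_snapshot()

    def take_snapshot(self) -> None:
        """Fold the retained entries into the snapshot and drop them."""
        for key, value in self.entries:
            self.snapshot_state[key] = value
        self.snapshot_index += len(self.entries)
        self.entries.clear()

    def recover(self) -> Dict[str, str]:
        """Rebuild state: load the snapshot, then replay only the tail."""
        state = dict(self.snapshot_state)
        for key, value in self.entries:
            state[key] = value
        return state


log = CompactingLog(snapshot_threshold=3)
for i in range(4):
    log.apply("k", str(i))
print(log.snapshot_index, len(log.entries), log.recover())  # 3 1 {'k': '3'}
```

Recovery time is then bounded by the snapshot size plus at most `snapshot_threshold` entries, which is the number worth measuring before an outage forces the question.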
--initial-cluster-state=new when restarting a cluster after a full shutdown. This causes the nodes to think they're joining an existing cluster and fail. Always use new for fresh clusters, existing for restarts.
The 4GB Container That Kept Dying
--snapshot-count=10000. Also added a memory limit of 8GB per container and configured log compaction to keep the log under 500MB.
- Never run a consensus node without log compaction and resource limits.
- The default config is for a lab, not production.
etcdctl endpoint health --cluster
etcdctl member list
systemctl restart etcd
Key takeaways
Interview Questions on This Topic
How does Raft handle a situation where a leader crashes after committing an entry but before responding to the client?
Frequently Asked Questions