Senior 5 min · June 25, 2026

Distributed Consensus: Paxos vs Raft – Which One Won't Fail You at 3 AM?

Q: What is the difference between Paxos and Raft?

Paxos is a family of consensus protocols that are mathematically elegant but notoriously hard to implement. Raft is a more understandable consensus algorithm that decomposes the problem into leader election, log replication, and safety. Raft is now the default choice for most new systems.

Q: Is Raft always better than Paxos?

For new implementations, yes. Raft is easier to implement correctly, has better documentation, and is used in production by etcd, Consul, and TiKV. Paxos may be preferred if you need to integrate with an existing Paxos-based system like Google's Chubby.

Q: How do I set up a Raft cluster in production?

Use an odd number of nodes (3 or 5). Configure election timeouts to be 5-10x the expected network round trip. Enable log compaction with a snapshot threshold. Use local SSDs for fast fsync. Test with chaos engineering to ensure fault tolerance.

Q: What happens if a Raft leader crashes?

A new leader is elected via randomized timeouts. The new leader's log must contain all committed entries. Uncommitted entries from the old leader may be lost. Clients should retry requests with idempotent operations.

Distributed consensus deep dive: Paxos and Raft internals, production gotchas, failure modes, and when to use each.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Everything here is grounded in real deployments.

✓ Production

production tested

June 25, 2026

last updated

1,663

articles · all by Naren

⚡Quick Answer

Paxos is the foundational but notoriously hard-to-implement consensus algorithm. Raft is its understandable, engineer-friendly cousin that's now the default for most new systems. For production, start with Raft unless you have a specific reason (e.g., existing Paxos infrastructure) — it's easier to debug and maintain.

✦ Definition · ~90s read

What is Distributed Consensus?

Distributed consensus is the problem of getting multiple servers to agree on a single value despite failures. Paxos and Raft are algorithms that solve this, enabling reliable replication in systems like etcd and Consul (both Raft); ZooKeeper solves the same problem with Zab, a closely related protocol.

Plain-English First

Imagine a group of friends deciding where to eat dinner. They can't all talk at once, so they pick a leader who proposes a restaurant. Everyone votes, and if a majority agrees, that's where they go. But if the leader disappears mid-vote, someone else takes over. Paxos is like having a complex voting protocol where anyone can propose, but it's easy to get confused. Raft simplifies by always having a clear leader — like a designated decision-maker — making the whole process easier to follow and recover from failures.

You've probably never seen a pure Paxos implementation in production. The 'Paxos' systems you've used are heavily modified variants (Google's Chubby runs a customized Multi-Paxos), and ZooKeeper's Zab is a separate, Paxos-inspired protocol rather than Paxos itself. Why? Because the original Paxos is a mathematical proof, not a blueprint. It's elegant on paper but a nightmare to implement correctly. I've debugged a Paxos-based system at 2 AM where a network partition caused a split-brain that took down an entire payment pipeline. The fix wasn't a patch; it was rewriting the consensus layer on Raft.

This article is the definitive guide to distributed consensus algorithms for engineers who've shipped production code. You'll learn exactly how Paxos and Raft work under the hood, where they break, and which one to choose for your next system. By the end, you'll be able to debug a consensus failure in production, explain the trade-offs in a system design interview, and avoid the mistakes that have burned teams before yours.

Why Consensus Matters: The Problem Nobody Talks About

Before consensus algorithms, distributed systems used ad-hoc replication: primary-backup with a heartbeat. If the primary died, a backup took over. But what if the primary is just slow? Both think they're primary, and you get split-brain: two servers accepting writes, diverging data. That's how you lose money. Consensus solves this by ensuring that at most one leader can be elected per term (so a stale leader can no longer get writes committed), and that all replicas agree on the order of operations. Without it, you can't build a reliable replicated state machine, which is the foundation of databases, configuration stores, and coordination services.

ConsensusProblem.systemdesign

// io.thecodeforge — System Design tutorial

// Simulating split-brain without consensus
// Two nodes both think they're primary

Node1: I am primary. Accepting writes.
Node2: I am primary. Accepting writes.
Client1 writes X=1 to Node1.
Client2 writes X=2 to Node2.
Network heals. Node1 and Node2 sync. X is now 1 or 2? Data loss.

// With Raft consensus:
// Only one leader at a time. All writes go through leader.
// If leader fails, a new one is elected with a unique term.
// No split-brain.

Output

Without consensus: data divergence and loss.

With consensus: consistent state across replicas.

Production Trap:

Never assume a heartbeat-based failover is safe. Without a consensus protocol, you will eventually hit split-brain. I've seen it take down a stock exchange feed.
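
To make the "unique term" idea concrete, here is a minimal sketch of how a replica can fence off a stale primary. It is an illustration under assumptions, not code from any real Raft library: the `Replica` class and `handle_write` method are hypothetical names, and persistence and networking are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A follower that fences writes by term (hypothetical minimal model)."""
    current_term: int = 0
    log: list = field(default_factory=list)

    def handle_write(self, leader_term: int, command: str) -> bool:
        # Reject anything sent by a leader whose term is older than one we have seen.
        if leader_term < self.current_term:
            return False
        # A newer term means a newer leader was elected; remember it.
        self.current_term = leader_term
        self.log.append((leader_term, command))
        return True


replica = Replica()
accepted_old = replica.handle_write(leader_term=1, command="X=1")    # original leader
accepted_new = replica.handle_write(leader_term=2, command="X=2")    # leader elected later
rejected_stale = replica.handle_write(leader_term=1, command="X=3")  # old leader wakes up late
print(accepted_old, accepted_new, rejected_stale)
```

The slow-but-alive primary from the heartbeat scenario is exactly the `leader_term=1` caller in the last line: it still believes it is in charge, but no replica will take its writes.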

Figure: Paxos vs Raft: Consensus at 3 AM

Paxos: The Gold Standard You'll Never Implement

Paxos is the mathematical foundation of consensus. It proves that a set of nodes can agree on a value even if some fail. But it's notoriously hard to implement correctly. The original paper describes 'single-decree' Paxos, which agrees on exactly one value. Multi-Paxos extends this to a log of values, but the protocol is underspecified, so every production Paxos system (e.g., Google's Chubby) is a custom variant, and ZooKeeper went further and built its own protocol, Zab. The core idea: a proposer sends a 'prepare' request with a unique proposal number (called an epoch here) to the acceptors. If a majority promises, the proposer sends an 'accept' request. It may pick its own value only if no acceptor in that majority reported a previously accepted value; otherwise it must re-propose the value from the highest-numbered accepted proposal. A proposer with a higher epoch can preempt an in-flight proposal, but it can never change a value that has already been chosen. This works, but the complexity of handling multiple proposers, failures, and log replication makes it a minefield.

PaxosFlow.systemdesign

// io.thecodeforge — System Design tutorial

// Simplified Paxos single decree
// Proposer P1 with epoch 1
P1 -> All: Prepare(1)
Acceptors: Promise to reject any prepare with epoch < 1
P1 <- Majority: Promise(1, no previous value)
P1 -> All: Accept(1, value='A')
Acceptors: Accept if no higher epoch promised
P1 <- Majority: Accepted(1, 'A')
// Value 'A' is chosen.

// But if P2 with epoch 2 appears:
P2 -> All: Prepare(2)
Acceptors: Promise(2, acceptedEpoch=1, value='A') // they've already accepted 'A' in epoch 1
P2 -> All: Accept(2, 'A') // must use the same value
// Consensus preserved.

Output

Value 'A' is chosen. If a higher epoch proposer appears, it must reuse the same value.

Senior Shortcut:

When implementing Paxos, use a single proposer (leader) to avoid conflicts. That's what Multi-Paxos does — it elects a leader and then runs the fast path (just accept requests) until the leader fails. This is essentially Raft.
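
The single-decree flow above fits in a few dozen lines if you ignore networking and persistence. The sketch below is a toy model under those assumptions: `Acceptor` and `propose` are illustrative names, calls are synchronous in-process method calls, and nothing is written to disk, so it shows the safety rule rather than a usable implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised: int = 0                           # highest proposal number promised
    accepted: Optional[Tuple[int, str]] = None  # (number, value) last accepted

    def prepare(self, n: int):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n: int, value: str) -> bool:
        """Phase 2: accept unless a higher number has been promised."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors: List[Acceptor], n: int, value: str) -> Optional[str]:
    """Run both phases; return the chosen value, or None if preempted."""
    majority = len(acceptors) // 2 + 1
    replies = [a.prepare(n) for a in acceptors]
    promises = [prev for ok, prev in replies if ok]
    if len(promises) < majority:
        return None
    # Safety rule: adopt the value of the highest-numbered accepted proposal.
    prior = [p for p in promises if p is not None]
    if prior:
        value = max(prior)[1]
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, n=1, value="A")
second = propose(acceptors, n=2, value="B")  # later proposer wants 'B' but must keep 'A'
stale = propose(acceptors, n=1, value="C")   # lower number is preempted
print(first, second, stale)
```

Notice that the second proposer asked for 'B' and still ended up driving 'A' to completion. That one `if prior:` branch is the whole safety argument, and it is the part most hand-rolled implementations get wrong.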

Raft: The Engineer's Consensus Algorithm

Raft was designed to be understandable. It decomposes consensus into three subproblems: leader election, log replication, and safety. The key insight: always have a strong leader. The leader handles all client requests and replicates its log to followers. If the leader fails, a new one is elected with a higher term. Raft uses randomized election timeouts to make split votes rare and short-lived. A leader commits a log entry from its current term once it's replicated to a majority; entries left over from earlier terms are committed indirectly when a current-term entry after them commits. This simplicity makes Raft the go-to for new systems: etcd, Consul, TiKV, and many others. But don't be fooled: Raft has its own edge cases, like log inconsistency after a leader crash, which requires the leader to force its log on followers.

RaftLeaderElection.systemdesign

// io.thecodeforge — System Design tutorial

// Raft leader election
// Node A, B, C. Current term = 1. Leader is A.
// A crashes.
// B's election timeout expires first (random 150-300ms).
B: term=2, vote for self
B -> A, C: RequestVote(term=2, candidateId=B, lastLogIndex=5, lastLogTerm=1)
C: lastLogTerm=1, lastLogIndex=5 -> votes for B
A: crashed, no response
B receives majority (2/3) -> becomes leader
B -> C: AppendEntries(term=2, leaderId=B, entries=[], commitIndex=5)
// Now B is leader. All writes go through B.

Output

Node B becomes leader with term 2. Followers accept B's log as authoritative.

Interview Gold:

Raft's safety property: a leader must have all committed entries from previous terms. That's why the candidate's log must be at least as up-to-date as the voter's. This prevents a stale node from becoming leader and overwriting committed data.
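
The "at least as up-to-date" check is small enough to sketch directly. This is a minimal model assuming in-memory state only; `Voter`, `handle_request_vote`, and `log_is_up_to_date` are illustrative names, and a real node would persist `current_term` and `voted_for` before replying.

```python
from dataclasses import dataclass
from typing import Optional


def log_is_up_to_date(cand_last_term: int, cand_last_index: int,
                      my_last_term: int, my_last_index: int) -> bool:
    """Compare last terms first; on a tie, the longer log wins."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index


@dataclass
class Voter:
    current_term: int
    last_log_term: int
    last_log_index: int
    voted_for: Optional[str] = None

    def handle_request_vote(self, term: int, candidate_id: str,
                            last_log_index: int, last_log_term: int) -> bool:
        if term < self.current_term:
            return False                      # stale candidate
        if term > self.current_term:
            self.current_term = term          # newer term: forget the old vote
            self.voted_for = None
        if self.voted_for not in (None, candidate_id):
            return False                      # one vote per term
        if not log_is_up_to_date(last_log_term, last_log_index,
                                 self.last_log_term, self.last_log_index):
            return False                      # election restriction
        self.voted_for = candidate_id
        return True


# Node C from the trace above: its log ends at index 5, term 1.
voter_c = Voter(current_term=1, last_log_term=1, last_log_index=5)
vote_for_b = voter_c.handle_request_vote(term=2, candidate_id="B",
                                         last_log_index=5, last_log_term=1)
# A hypothetical lagging node D asks later with a shorter log.
vote_for_d = voter_c.handle_request_vote(term=3, candidate_id="D",
                                         last_log_index=3, last_log_term=1)
print(vote_for_b, vote_for_d)
```

The lagging candidate still bumps the voter's term (that part is unavoidable), but it cannot collect the vote, so it cannot become leader and truncate committed entries.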

Figure: Raft Leader Election Flow

Log Replication and Commitment: The Devil in the Details

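Once a leader is elected, it replicates log entries to followers. The leader sends AppendEntries RPCs with new entries. A follower appends the entries and replies. The leader commits an entry once it's stored on a majority of nodes (counting itself). But what if a follower is behind? The leader keeps retrying, backing up through its log until it finds the point where the two logs agree, and the follower catches up from there; a slow follower never blocks commitment as long as a majority responds. What if the leader crashes after committing but before responding to the client? The client retries, and the new leader already has the committed entry in its log, so the retry must be deduplicated (for example by client ID and request sequence number) or the command will be applied twice. The tricky part: Raft guarantees that if an entry is committed in a given term, it will be present in all future leaders' logs. This is enforced by the election restriction: a voter refuses any candidate whose log is less up-to-date than its own, so a candidate that wins a majority necessarily holds every committed entry.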

RaftLogReplication.systemdesign

// io.thecodeforge — System Design tutorial

// Leader replicates entry 'SET X=5'
Leader (term 3): AppendEntries(entries=[{index:6, term:3, command:'SET X=5'}], commitIndex:5)
Follower1: appends entry, replies success
Follower2: appends entry, replies success
Leader: entry 6 is now on all 3 nodes (2 of 3 would already be a majority), updates commitIndex=6
Leader: applies command to state machine: X=5
Leader: responds to client: OK
// If leader crashes now, any new leader will already have entry 6 in its log.

Output

Entry 6 is committed. State machine applies 'SET X=5'.

Never Do This:

Don't allow followers to serve reads without the leader's approval. Stale reads are a common source of inconsistency. In Raft, the leader must confirm it's still the leader before serving reads (via a heartbeat round).
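
The commit rule from this section (stored on a majority, and only counted for entries from the leader's current term) can be sketched as a pure function. The names `advance_commit_index`, `match_index`, and `log_terms` are assumptions for illustration; real implementations track this per follower inside the leader's replication loop.

```python
from typing import Dict, List


def advance_commit_index(match_index: List[int], log_terms: Dict[int, int],
                         current_term: int, commit_index: int) -> int:
    """Return the new commit index for the leader.

    match_index: highest log index known to be stored on each node,
                 including the leader itself.
    log_terms:   term of each entry in the leader's log, keyed by index.
    """
    quorum = len(match_index) // 2 + 1
    # The quorum-th largest match index is stored on at least a majority.
    candidate = sorted(match_index, reverse=True)[quorum - 1]
    # Only entries from the current term are committed by counting replicas.
    if candidate > commit_index and log_terms[candidate] == current_term:
        return candidate
    return commit_index


# Leader in term 3; entry 6 (term 3) is on the leader and one follower.
committed = advance_commit_index([6, 6, 5], {5: 2, 6: 3}, current_term=3, commit_index=5)
# Only the leader has entry 6 so far: nothing new to commit.
waiting = advance_commit_index([6, 5, 5], {5: 2, 6: 3}, current_term=3, commit_index=5)
# Entry 6 is from an older term: replica count alone is not enough.
old_term = advance_commit_index([6, 6, 5], {5: 2, 6: 2}, current_term=3, commit_index=5)
print(committed, waiting, old_term)
```

The third case is the one people skip. It is why a freshly elected leader usually appends a no-op in its own term before it reports anything as committed.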

Safety and Liveness: The Trade-offs That Bite

Consensus algorithms guarantee safety (no two nodes decide different values) but not liveness (the system may stop making progress under certain failures). Paxos and Raft both stay safe under asynchronous networks and crash failures. But liveness can be compromised: in Paxos, multiple proposers can livelock by continuously raising epochs. Raft makes this unlikely with randomized timeouts, but a network partition can still prevent a majority from forming, halting progress. The classic example: a 5-node cluster splits into 3 and 2. The partition with 3 nodes can elect a leader and make progress. The partition with 2 cannot. If the old leader is stranded in the minority, it can no longer commit anything, and it steps down once the network heals and it sees the higher term. This is correct behavior, but it means clients that can only reach the minority side are unavailable for the duration of the partition.

PartitionScenario.systemdesign

// io.thecodeforge — System Design tutorial

// 5-node Raft cluster: nodes A, B, C, D, E
// Network partition: {A, B, C} and {D, E}
// Majority is 3. {A, B, C} can elect leader.
// {D, E} cannot (needs 3 votes).
// Writes to {A, B, C} succeed.
// Writes to {D, E} fail with 'no leader'.
// When partition heals, any stale leader in {D, E} sees the higher term from {A, B, C}, steps down, and both nodes catch up from the new leader.
// Data is consistent.

Output

Cluster remains available in majority partition. Minority partition is unavailable. No data loss.

Production Trap:

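A 2-node cluster needs both nodes for a majority, so a single failure halts it. More generally, an even-sized cluster tolerates no more failures than the odd size below it (4 nodes tolerate 1 failure, same as 3) while adding replication cost, so use an odd number. 3 is the minimum for fault tolerance. 5 is better for availability during rolling upgrades.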
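
The cluster-sizing rule is plain arithmetic, and a few lines make it hard to argue with. The function names below are illustrative, not from any library.

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can fail while a majority is still reachable."""
    return cluster_size - quorum(cluster_size)


sizes = range(1, 8)
table = {n: (quorum(n), tolerated_failures(n)) for n in sizes}
for n, (q, f) in table.items():
    print(f"nodes={n} quorum={q} tolerated_failures={f}")
```

Going from 3 to 4 nodes raises the quorum from 2 to 3 without buying a single extra tolerated failure, which is the whole case for odd sizes.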

Figure: Paxos vs Raft: Safety & Liveness

Performance: How to Not Make It Slow

Every consensus write needs a round trip from the leader to a majority plus a durable log write, so commit latency is at least one network round trip plus an fsync. Raft's leader-based design means all writes go through the leader, which can be a bottleneck. To improve throughput, batch multiple entries into a single AppendEntries RPC. Use pipelining: the leader sends entries without waiting for previous ones to be acknowledged. And use group commit: fsync the log once per batch rather than once per entry. In practice, etcd can handle tens of thousands of writes per second on modern hardware. But if you need more, consider sharding: run multiple Raft groups and distribute keys across them. This is what CockroachDB does.

RaftBatching.systemdesign

// io.thecodeforge — System Design tutorial

// Batching entries in Raft
// Instead of sending one entry per RPC:
Leader: AppendEntries(entries=[{index:6, term:3, cmd:'SET X=1'}, {index:7, term:3, cmd:'SET Y=2'}], commitIndex:5)
Follower: appends both, replies success
// This halves the number of RPCs.

// Pipelining: leader sends next batch before receiving ack for previous.
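// Responses can arrive out of order or get lost, so the leader tracks progress per follower; commitIndex still advances strictly in log order.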

Output

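Throughput rises with batch size up to a point; past that, network bandwidth and disk write latency dominate.

Senior Shortcut:

Don't cap batches at a single network packet (about 1500 bytes on Ethernet). TCP already splits larger messages into packets, so a batch bigger than the MTU wastes nothing, and tiny batches throw away most of the benefit. Cap batches by total bytes and by how long an entry may wait, then tune both against your latency budget. Also, use a separate goroutine for fsync to overlap disk I/O with network I/O, but never acknowledge an entry before it is durable.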
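
As a rough sketch of size-capped batching, assuming entries are already encoded as bytes: `batch_entries` and `max_bytes` are illustrative names rather than an API from any Raft library, and a real leader would also flush on a timer so a lone entry never waits for company.

```python
from typing import List


def batch_entries(pending: List[bytes], max_bytes: int) -> List[List[bytes]]:
    """Greedily pack entries into batches no larger than max_bytes.

    An entry bigger than max_bytes still goes out, alone in its own batch.
    """
    batches: List[List[bytes]] = []
    current: List[bytes] = []
    size = 0
    for entry in pending:
        # Close the current batch if this entry would push it over the cap.
        if current and size + len(entry) > max_bytes:
            batches.append(current)
            current, size = [], 0
        current.append(entry)
        size += len(entry)
    if current:
        batches.append(current)
    return batches


pending = [b"SET X=1", b"SET Y=2", b"SET Z=3"]  # 7 bytes each
roomy = batch_entries(pending, max_bytes=16)    # two entries fit, the third spills over
tight = batch_entries(pending, max_bytes=4)     # every entry travels alone
print(len(roomy), len(tight))
```

Each inner list would become the `entries` field of one AppendEntries RPC, and order inside and across batches is preserved, which is what log replication needs.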

When Not to Use Consensus: The Overkill Trap

Consensus is expensive. Every write requires fsync on a majority of nodes. If you need high throughput or low latency, consider alternatives. For leader election without your own replicated log, take a lease from an existing coordination service (e.g., etcd's lease mechanism) instead of embedding a consensus library; consensus still runs underneath, but you don't operate it per application. For cluster membership and failure detection, use a gossip protocol (e.g., SWIM). For data replication, consider CRDTs if you can tolerate eventual consistency. Consensus is the right tool when you need strong consistency and fault tolerance, but it's not a silver bullet. I've seen teams use Raft for a simple counter that could have been a single Redis instance. Don't be that team.

WhenNotToUseConsensus.systemdesign

// io.thecodeforge — System Design tutorial

// Scenario: You need a distributed counter that increments frequently.
// Bad: Use Raft to replicate each increment.
// Good: Use a CRDT counter (e.g., a G-Counter; Redis Enterprise Active-Active databases provide CRDT-based counters).
// Or: Use a single node with async replication to a backup.

// Decision rule:
// If you can tolerate eventual consistency, don't use consensus.
// If you need strong consistency but have low write volume, consensus is fine.
// If you need high write throughput, consider sharding or alternative consistency models.

Output

Choose the simplest solution that meets your consistency requirements.
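
For the counter scenario, a grow-only counter (G-Counter) is about the smallest CRDT there is. The sketch below assumes increments only and in-process "replication" via direct `merge` calls; the class and node names are illustrative.

```python
from typing import Dict


class GCounter:
    """Grow-only counter: each node increments only its own slot."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.counts: Dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Per-node maximum makes merge commutative, associative, and idempotent.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


a, b = GCounter("node-a"), GCounter("node-b")
for _ in range(3):
    a.increment()      # three increments land on node-a
b.increment(2)         # two land on node-b during a partition

a.merge(b)
b.merge(a)
a.merge(b)             # merging again changes nothing
print(a.value(), b.value())
```

No leader, no quorum, no fsync on the write path. The price is that a reader may briefly see a value that is behind, which is exactly the eventual-consistency trade the decision rule above asks you to accept or reject.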

Interview Gold:

When asked 'When would you choose Paxos over Raft?' the answer is: almost never. Raft is simpler to implement and debug. Choose Paxos only if you have an existing infrastructure (like Google's Chubby) or need to integrate with a system that uses Paxos.

Production Gotchas: What the Papers Don't Tell You

1. Timers and pauses: Raft's election timeouts run on each node's local timer, so wall-clock skew between nodes doesn't break elections or safety. What hurts is a node that stalls (GC pause, CPU starvation, slow disk) past the timeout, which triggers needless elections. Set timeouts generously (e.g., 5-10x the expected network round trip), and keep NTP running if you rely on time-based leader leases for reads.
2. Disk latency: fsync is slow. If your disk is shared (e.g., network-attached), a single slow node can bottleneck the entire cluster. Use local SSDs.
3. Membership changes: Adding or removing nodes is tricky. Raft's joint consensus approach is correct but complex. Use single-server changes (add one or remove one at a time) to minimize risk.
4. Snapshotting: If the log grows unbounded, recovery takes forever. Set a snapshot threshold and test recovery time.
5. Network partitions: They happen. Ensure your cluster can survive a partition without data loss. Test with chaos engineering.
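
The snapshotting gotcha is easier to reason about with a toy model. The sketch below assumes every appended entry is already committed and applied, and keeps everything in memory; `CompactingLog` and `snapshot_threshold` are hypothetical names. Real implementations usually also keep a tail of recent entries so slow followers can catch up without a full snapshot transfer.

```python
from typing import Dict, List, Tuple


class CompactingLog:
    """Key-value state machine that snapshots after a fixed number of entries."""

    def __init__(self, snapshot_threshold: int) -> None:
        self.snapshot_threshold = snapshot_threshold
        self.snapshot_index = 0                      # last index covered by the snapshot
        self.snapshot_state: Dict[str, int] = {}
        self.entries: List[Tuple[int, str, int]] = []  # (index, key, value) after snapshot
        self.state: Dict[str, int] = {}

    def apply(self, key: str, value: int) -> int:
        index = self.snapshot_index + len(self.entries) + 1
        self.entries.append((index, key, value))
        self.state[key] = value
        if len(self.entries) >= self.snapshot_threshold:
            self._snapshot()
        return index

    def _snapshot(self) -> None:
        # Capture the state, then drop the entries the snapshot now covers.
        self.snapshot_state = dict(self.state)
        self.snapshot_index = self.entries[-1][0]
        self.entries.clear()

    def recover(self) -> Dict[str, int]:
        """Rebuild state the way a restarted node would: snapshot plus log tail."""
        rebuilt = dict(self.snapshot_state)
        for _, key, value in self.entries:
            rebuilt[key] = value
        return rebuilt


log = CompactingLog(snapshot_threshold=3)
for i, key in enumerate(["x", "y", "x", "z"], start=1):
    log.apply(key, i)
print(log.snapshot_index, len(log.entries), log.recover())
```

Recovery replays at most `snapshot_threshold - 1` entries on top of the snapshot, so restart time stays bounded no matter how long the cluster has been running.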

RaftMembershipChange.systemdesign

// io.thecodeforge — System Design tutorial

// Adding a new node to a Raft cluster
// Step 1: Add new node as a learner (non-voting)
// Step 2: Wait for it to catch up with the log
// Step 3: Promote to voting member
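// This keeps the quorum size unchanged while the new node is still catching up.

// Example using etcdctl (etcd v3.4+):
etcdctl member add new-node --peer-urls=http://10.0.0.4:2380 --learner
// Wait for catch-up, then promote using the member ID printed by 'member add' or 'member list'
etcdctl member promote <member-id>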

Output

New node added without cluster disruption.

The Classic Bug:

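Getting --initial-cluster-state wrong. Use new only when bootstrapping a fresh cluster, and existing when starting a member that is joining a cluster that is already running. A node started with new while joining tries to bootstrap its own cluster instead of joining, and typically fails to join. On a plain restart with an intact data directory, etcd ignores the --initial-* flags.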

● Production incident · POST-MORTEM · severity: high

The 4GB Container That Kept Dying

Symptom

A 3-node Raft cluster in Kubernetes kept losing its leader every few minutes. Writes failed with 'no leader' errors. The cluster was unusable.

Assumption

We assumed a network issue — maybe flaky pod-to-pod communication or a misconfigured CNI plugin.

Root cause

Each node was running in a container with only 4GB of RAM. The Raft log grew unbounded because we never set a snapshot threshold. When memory pressure hit, the OS OOM-killed the process. The node restarted, rejoined the cluster, and the cycle repeated.

Fix

Set a snapshot threshold: in etcd, that's --snapshot-count=10000. We also raised the memory limit to 8GB per container and configured log compaction to keep the log under 500MB.

Key lesson

Never run a consensus node without log compaction and resource limits.
The default config is for a lab, not production.

Production debug guide

Systematic recovery paths for the failure modes engineers actually hit. 4 entries.

Symptom · 01

No leader elected for extended period

Fix

1. Check network connectivity between nodes.
2. Verify the election timeout is well above the network RTT and above the heartbeat interval.
3. Check for long process pauses (GC, CPU throttling) or slow disks that make nodes miss heartbeats.
4. Restart nodes one by one to force an election.

Symptom · 02

Leader flapping (frequent elections)

Fix

1. Check disk latency on the leader (fsync may be slow).
2. Check CPU/memory pressure.
3. Increase the election timeout.
4. Ensure the heartbeat interval is well below the election timeout.

Symptom · 03

Writes failing with 'not leader'

Fix

1. Verify the client is connecting to the current leader endpoint.
2. Check if the leader has stepped down due to a partition.
3. Check the log for term changes.
4. If a leader is present but not accepting writes, check that it can still reach a quorum.

Symptom · 04

Log inconsistency between nodes

Fix

1. Stop writes.
2. Compare log entries across nodes. Divergence in uncommitted entries is normal and Raft repairs it automatically; divergence in committed entries points to a bug or disk corruption.
3. In Raft, force a snapshot restore from the leader.
4. In Paxos, replay the protocol for each entry.
5. Consider rebuilding the node from scratch.

★ Distributed Consensus: Paxos and Raft Triage Cheat Sheet

First-response commands for when things go wrong, copy-paste ready.

`etcdserver: no leader`

Immediate action

Check cluster health and network

Commands

etcdctl endpoint health --cluster

etcdctl member list

Fix now

Restart the node with the highest term: systemctl restart etcd

Feature / Aspect	Paxos	Raft
Understandability	Notoriously hard to understand and implement correctly	Designed for understandability; clear decomposition
Leader election	Not specified by the core protocol; competing proposers can conflict	Strong leader with randomized timeouts
Log replication	Underspecified; two phases per value unless a stable leader skips prepare (Multi-Paxos)	Simple leader-based AppendEntries
Safety	Proven safe under the asynchronous model	Proven safe under the same model, with equivalent guarantees
Performance	Can be optimized with a leader (Multi-Paxos)	Leader-based, batching and pipelining
Production adoption	Chubby, Spanner; ZooKeeper uses the related Zab protocol	etcd, Consul, TiKV, HashiCorp Nomad
Implementation complexity	High; many edge cases	Moderate; well-documented

Key takeaways

Raft is the consensus algorithm you should use for new systems. It's simpler, better documented, and easier to debug than Paxos.

Use an odd number of nodes (3, 5, 7). Majority quorums, not the odd count, are what prevent split-brain; an odd size simply gives the most fault tolerance per node.

Log compaction (snapshotting) is not optional. Without it, your cluster will OOM or take forever to recover.

Consensus is expensive. Don't use it if you can tolerate eventual consistency or if a single node with a backup meets your availability needs.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

How does Raft handle a situation where a leader crashes after committing...

Q02 · SENIOR

When would you choose Paxos over Raft in a production system?

Q03 · SENIOR

What happens in Raft when a follower's log is inconsistent with the lead...

Q04 · JUNIOR

What is the role of quorum in consensus algorithms?

Q05 · SENIOR

You have a 5-node Raft cluster. A network partition splits it into 3 and...

Q06 · SENIOR

How would you design a consensus-based system to handle millions of writ...

Q01 of 06 · SENIOR

How does Raft handle a situation where a leader crashes after committing an entry but before responding to the client?

ANSWER

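The client retries, and the new leader already has the committed entry in its log (due to the election restriction). The new leader doesn't initially know which old-term entries are committed; it learns this by committing an entry from its own term (typically a no-op), which commits everything before it, and then applies any entries that haven't been applied yet. Because the original command may already have been applied, the retry can execute it a second time, so operations should be idempotent or deduplicated by client ID and sequence number.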
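
One common way to make retries safe is to have the state machine remember the last request it applied for each client. The sketch below assumes each client has one outstanding request at a time and numbers its requests; `DedupStateMachine`, `client_id`, and `seq` are illustrative names, not part of any specific Raft library.

```python
from typing import Dict, Tuple


class DedupStateMachine:
    """Counter state machine that applies each client request at most once."""

    def __init__(self) -> None:
        self.counter = 0
        # client_id -> (last sequence number applied, result returned for it)
        self.sessions: Dict[str, Tuple[int, int]] = {}

    def apply(self, client_id: str, seq: int, amount: int) -> int:
        last = self.sessions.get(client_id)
        if last is not None and seq <= last[0]:
            # Retry of a request that was already applied: return the cached result.
            return last[1]
        self.counter += amount
        self.sessions[client_id] = (seq, self.counter)
        return self.counter


sm = DedupStateMachine()
first_try = sm.apply("client-1", seq=1, amount=5)  # leader crashes before replying
retry = sm.apply("client-1", seq=1, amount=5)      # client retries against the new leader
next_op = sm.apply("client-1", seq=2, amount=1)
print(first_try, retry, next_op, sm.counter)
```

Because the session table lives inside the replicated state machine, every replica makes the same dedup decision, and it survives leader changes along with the rest of the state.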
