Consensus and Raft basics
Raft from first principles: FLP, the election rules, log matching, snapshotting, membership changes, real bugs, and why everyone picks Raft over Paxos.
The consensus problem
N processes need to agree on a single value such that:
- Validity: the chosen value was proposed by some process.
- Agreement: all correct processes choose the same value.
- Termination: every correct process eventually chooses.
This sounds easy. It is not.
The FLP result (Fischer, Lynch, Paterson, 1985) proves no deterministic algorithm can solve consensus in an asynchronous system with even one crash-faulty process. The intuition: a process that is slow is indistinguishable from a process that has crashed. If you wait forever to be sure, you never decide. If you decide too soon, you might be wrong.
Real systems sidestep FLP by adding timing assumptions: "if a node hasn't responded in T seconds, treat it as failed." This converts crashes into a partial synchrony model where consensus is solvable. Raft and Paxos both work this way.
Why Raft over Paxos
Paxos (Lamport, 1989, published 1998) is the original consensus algorithm. It is correct, optimal in message complexity, and notoriously hard to understand and implement correctly.
Raft (Ongaro and Ousterhout, 2014) was explicitly designed for understandability. Same correctness guarantees, structured into three clean sub-problems (election, replication, safety). The paper has been read by every distributed systems engineer working today.
Practical reasons to pick Raft:
- Reference implementations exist (etcd-raft, hashicorp/raft).
- Documentation and tutorials are abundant.
- Membership changes are well-specified (joint consensus or single-server).
- Debugging tools exist (state transitions are clearer than Paxos rounds).
Practical reasons to look at Paxos variants:
- Multi-Paxos for very-high-throughput pipelined consensus.
- EPaxos for leaderless consensus (no leader bottleneck, useful for wide-area).
- Flexible Paxos for tunable quorums.
For 99% of use cases, Raft is the answer.
Raft state machine
Each node is in one of three states:
- Follower: passive. Responds to RPCs from leader and candidates. Default state.
- Candidate: trying to become leader. Has incremented its term and voted for itself.
- Leader: handles all client requests. Sends heartbeats. Replicates log entries.
State transitions:
- Follower -> Candidate: election timeout (150-300ms randomized) without heartbeat.
- Candidate -> Leader: received majority votes.
- Candidate -> Follower: discovered a leader with equal or higher term, or election timed out.
- Leader -> Follower: discovered a higher term.
Term numbers
Term is logical time. Monotonically increasing integer. Each term has at most one leader (election safety). Every RPC carries the sender's current term. If you see a higher term, you update yours and revert to follower. If you see a lower term, you reject the RPC.
Terms partition time into eras. They are critical for safety: stale leaders that wake up after a long pause see they are in an old term and step down.
Leader election in detail
When follower's election timer fires:
- Increment current term.
- Transition to candidate.
- Vote for self.
- Send RequestVote RPC to all other servers, with candidate's term and log info (last log index and term).
Each server votes for at most one candidate per term, on first-come-first-served basis, with one constraint: voter only grants vote if candidate's log is at least as up-to-date as voter's. "Up-to-date" means higher term in last entry, or same term and >= index.
This constraint ensures leader completeness: a new leader has all committed entries.
If candidate gets majority votes, becomes leader, immediately sends heartbeats.
If candidate gets a heartbeat from a leader with equal or higher term, becomes follower.
If election times out (split vote), increment term, retry with new randomized timeout.
Randomization prevents repeated split votes. Without it, two candidates could time out simultaneously every round.
Log replication in detail
Each log entry has: index, term, command.
Leader receives client command, appends to its log, sends AppendEntries RPC to all followers in parallel. The RPC includes:
- Leader's current term.
- Previous log index and term (for consistency check).
- New entries to append.
- Leader's commit index.
Follower accepts the RPC only if:
- Sender's term >= follower's term.
- Follower's log contains an entry at prevLogIndex with prevLogTerm.
If follower's log diverges (had entries from a deposed leader), it truncates and accepts leader's entries. Leader is the source of truth.
Once an entry is replicated on majority, leader marks it committed. Leader applies to state machine. On next heartbeat, leader sends new commit index. Followers apply up to that index.
Safety proof sketch
Five properties together imply linearizability:
- Election safety: at most one leader per term. By construction (majority votes).
- Leader append-only: leaders never overwrite entries. By rule.
- Log matching: if two logs agree at index i and term t, they agree on all preceding entries. By prevLogIndex consistency check.
- Leader completeness: any committed entry is in every future leader's log. By the "up-to-date" voting rule.
- State machine safety: same index always applies same command. By the previous four.
The hairy proof is leader completeness. The Raft paper has it; I recommend reading the thesis version for the full version.
Snapshotting
Logs grow forever otherwise. Periodically, each server takes a snapshot of its state machine and discards log entries up to that point. The snapshot includes:
- Last included index.
- Last included term.
- Application state.
If a follower is too far behind, leader sends a snapshot instead of incremental log entries (InstallSnapshot RPC).
Snapshots are checkpoints. Etcd defaults to a snapshot every 10000 entries.
Membership changes
Adding or removing servers without losing safety requires care. Naive approach (everyone switches config at once) is unsafe: there's a moment when two majorities can exist.
Raft solves this two ways:
- Joint consensus: transition through a "both old and new" phase. Decisions require majority in both. Slow but safe.
- Single-server changes: add or remove one server at a time. Old majority and new majority always overlap. Simpler.
Etcd uses single-server changes. Most implementations do.
Practical issues
-
Leader bottleneck: every write goes through the leader. Throughput is bounded by leader's I/O. Use batching and pipelining to maximize.
-
Cluster size: 3 nodes survives 1 failure. 5 nodes survives 2. Larger clusters are slower (must contact more for majority) and almost never used. 3 and 5 are the sizes you'll see.
-
Cross-region consensus: WAN latency dominates. Every commit pays cross-region RTT. Spanner, CockroachDB, etcd all use various tricks to mitigate. Don't put a Raft cluster across regions unless you must.
-
Network partitions: minority partition can elect no leader (no majority). Majority partition continues. When partition heals, minority catches up.
-
Disk sync: every entry must be fsync'd before responding. Otherwise crash recovery loses entries. This is the throughput killer. Consider NVMe with battery-backed cache.
Real bugs that happened
-
Etcd 3.4 had a bug where a node could vote twice in the same term under specific re-election conditions. Caught by Jepsen.
-
CockroachDB had a bug where a slow disk could cause heartbeat starvation and trigger spurious elections. Fixed by separating heartbeat path from log write path.
-
Multiple implementations have had pre-vote bugs. Pre-vote (sending RequestVote without incrementing term first) prevents disruptive elections from partitioned nodes. Subtle to get right.
-
The "ghost log entries" bug in old Raft implementations: an entry from a previous term, when replicated by a new leader, could be incorrectly committed. Fixed by requiring leader to commit at least one entry from its own term before committing older entries.
When to use Raft
Use Raft when:
- You need small-cluster (3-5) agreement on configuration or state.
- Strong consistency is required.
- Throughput is moderate (thousands of writes/sec, not millions).
- Cluster fits within a region for latency reasons.
Use Raft for the control plane (etcd, Consul, ZooKeeper). Don't use Raft for the data plane unless you understand the throughput implications. Kafka uses Raft for metadata (KRaft) but the data path is a separate replication protocol.
Where Raft doesn't fit
- Very high throughput data path (>100k writes/sec per partition). Use leaderless replication or per-partition Raft.
- Wide-area replication where every write must commit globally. Use Spanner-style TrueTime + Paxos, or accept eventual consistency.
- Byzantine fault tolerance. Raft assumes nodes fail by crashing, not lying. Use PBFT or similar for adversarial settings.
What to read next
- Raft paper (under 18 pages). Required.
- Raft thesis for the full formal treatment.
- Etcd's raft package source. The cleanest production implementation.
- Jepsen analyses for Etcd, Consul, MongoDB. See how implementations break.
Learn more
- PaperRaft paperDiego Ongaro and John Ousterhout
- PaperRaft thesis (full version)Diego Ongaro
- DocsRaft visualizationraft.github.io
- ArticleDesigning Data-Intensive Applications, Chapter 9Martin Kleppmann
- Repo
- PaperPaxos Made SimpleLeslie Lamport