Raft Consensus Algorithm Explained: Making Distributed Nodes Agree
Understand the Raft consensus algorithm — leader election, log replication, and safety guarantees, with implementation details and interview tips.
Raft Consensus Algorithm
Raft is a consensus algorithm that enables a cluster of nodes to agree on a sequence of values (a replicated log) even when some nodes fail, by electing a single leader that coordinates all changes.
What It Really Means
In a distributed system, you often need multiple nodes to agree on the same data — the current leader of the cluster, the order of transactions, or the state of a configuration. Consensus algorithms solve this problem: they guarantee that all non-faulty nodes eventually agree on the same value, even if some nodes crash or messages are delayed.
Raft was designed by Diego Ongaro and John Ousterhout at Stanford in 2014 specifically to be understandable. Its predecessor, Paxos, is notoriously difficult to understand and implement correctly. Raft achieves the same safety guarantees as Paxos but decomposes the problem into three clean subproblems: leader election, log replication, and safety.
Raft is used in production by etcd (which powers Kubernetes), CockroachDB, TiKV (the storage layer of TiDB), Consul by HashiCorp, and many other systems. If you use Kubernetes, every cluster state change goes through a Raft-based consensus in etcd.
How It Works in Practice
The Three Roles
Every node in a Raft cluster is in one of three states:
- Leader: Handles all client requests, replicates log entries to followers
- Follower: Passive — responds to RPCs from the leader and candidates
- Candidate: Transitional state during leader election
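The three roles and their legal transitions can be modeled as a small enum. This is a minimal sketch in Python; the names are illustrative and not taken from any particular implementation:

```python
from enum import Enum, auto

class Role(Enum):
    FOLLOWER = auto()
    CANDIDATE = auto()
    LEADER = auto()

# Legal role transitions in Raft: a follower becomes a candidate on
# election timeout; a candidate wins a majority and becomes leader, or
# steps back to follower on seeing a higher term; a leader steps down
# to follower on seeing a higher term.
TRANSITIONS = {
    Role.FOLLOWER: {Role.CANDIDATE},
    Role.CANDIDATE: {Role.LEADER, Role.FOLLOWER},
    Role.LEADER: {Role.FOLLOWER},
}
```

Note there is no direct follower-to-leader transition: every leader must first campaign as a candidate.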
Leader Election
- All nodes start as followers with a randomized election timeout (e.g., 150-300ms)
- If a follower does not hear from a leader before its timeout expires, it becomes a candidate
- The candidate increments its term number, votes for itself, and sends RequestVote RPCs to all other nodes
- A node grants its vote to the first candidate it hears from in that term (first-come, first-served)
- If a candidate receives votes from a majority (quorum), it becomes the leader
- The leader sends periodic heartbeats (empty AppendEntries RPCs) to maintain authority
The randomized timeout is critical: it spreads timeouts apart so that in most cases a single node times out well before the others and wins the election before anyone else starts one, avoiding split votes. If a split vote does occur, the candidates time out again and retry with new randomized timeouts.
Log Replication
- Client sends a write request to the leader
- Leader appends the entry to its local log (uncommitted)
- Leader sends AppendEntries RPCs to all followers with the new entry
- Each follower checks that the entry preceding the new one matches its own log (the AppendEntries consistency check), appends the entry, and responds with success
- Once a majority of nodes have acknowledged, the leader marks the entry as committed
- Leader responds to the client with success
- Followers learn about the commit in subsequent heartbeats and apply the entry
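The commit rule in the steps above (an entry is committed once a majority of nodes holds it) can be sketched as follows. `match_index` follows the paper's `matchIndex`: the leader's record, per node including itself, of the highest replicated entry:

```python
def majority(cluster_size: int) -> int:
    """Quorum size: strictly more than half the cluster."""
    return cluster_size // 2 + 1

def commit_index(match_index: list[int], cluster_size: int) -> int:
    """Highest log index known to be replicated on a majority.
    (Real Raft additionally requires the entry at that index to be
    from the leader's current term before committing it.)"""
    ranked = sorted(match_index, reverse=True)
    # The value at position majority-1 is held by at least that many nodes.
    return ranked[majority(cluster_size) - 1]
```

For example, with `match_index = [5, 5, 4, 2, 1]` in a 5-node cluster, the quorum is 3 and entries through index 4 are committed even though two nodes are lagging.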
Real-World: etcd and Kubernetes
Kubernetes stores all cluster state (pods, services, deployments, config maps) in etcd, a distributed key-value store built on Raft. When you run kubectl apply, the API server writes to etcd. The etcd leader replicates the change to a quorum of nodes before acknowledging the write. This guarantees that cluster state is consistent even if an etcd node crashes during the write.
A typical etcd cluster has 3 or 5 nodes. With 3 nodes, the system tolerates 1 failure. With 5, it tolerates 2. Using more than 7 nodes is rare because the cost of replication (leader must wait for majority acknowledgment) grows with cluster size.
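The sizing arithmetic follows directly from the majority requirement; a quick sketch:

```python
def tolerated_failures(n: int) -> int:
    """A cluster of n nodes needs a majority (n // 2 + 1) alive to
    commit writes, so it tolerates the remaining nodes failing."""
    return n - (n // 2 + 1)  # equals (n - 1) // 2

for n in (3, 4, 5, 7):
    print(n, "nodes tolerate", tolerated_failures(n), "failures")
```

Note that 4 nodes tolerate no more failures than 3 while adding replication cost, which is why even cluster sizes are rarely used.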
Implementation
Trade-offs
Advantages
- Understandable: Decomposed into leader election, log replication, and safety — much easier to reason about than Paxos
- Strong consistency: Guarantees linearizable reads (if reads go through the leader)
- Well-proven: Used in production by etcd, CockroachDB, TiKV, Consul, and many others
- Deterministic recovery: The log-based approach makes crash recovery straightforward
Disadvantages
- Single-leader bottleneck: All writes go through one node, limiting write throughput
- Availability during elections: The cluster is temporarily unavailable for writes during leader election (typically <1 second)
- Not ideal for geo-distributed systems: Cross-datacenter latency makes Raft slow because every write requires a round trip to a majority of nodes
- Cluster size limits: Performance degrades with more nodes because the leader must replicate to more followers
Common Misconceptions
- "Raft and Paxos solve different problems" — They solve the same fundamental problem (consensus on a replicated log) with the same safety guarantees. Raft can be seen as a more prescriptive packaging of Multi-Paxos ideas, designed for understandability.
- "The leader always has the most up-to-date data" — Raft's election restriction guarantees that a new leader holds every committed entry, but the leader does not initially know which of its entries are committed. It commits a no-op entry in its new term to establish the commit point (and to safely commit entries left over from earlier terms) before serving linearizable reads.
- "More nodes means more fault tolerance with no downsides" — Each additional node adds replication overhead. A 5-node cluster tolerates 2 failures, but every write waits for 3 acknowledgments. Going from 5 to 7 nodes adds latency for a marginal improvement (tolerating 3 failures instead of 2).
- "Raft guarantees zero data loss" — Committed entries are durable. But entries appended to the leader's log and not yet committed can be lost if the leader crashes: that is the gap between the leader receiving a write and achieving quorum.
- "You can do consistent reads from any follower" — Followers may lag behind the leader, so reading from a follower can return stale data. For linearizable reads, you must read from the leader (or use a lease- or ReadIndex-based approach).
How This Appears in Interviews
Raft appears in both system design and distributed systems interviews:
- "How does etcd ensure consistency?" — Walk through Raft: leader election, log replication, quorum writes, and how Kubernetes relies on these guarantees. See our system design interview guide.
- "Design a distributed configuration store" — etcd is the canonical answer. Explain why you choose Raft over Paxos (understandability, tooling), how you handle leader failure, and cluster sizing.
- "What happens when the leader fails?" — Explain election timeout, term increment, vote request, and that writes are unavailable for 150-300ms during election.
- "How does Raft differ from Paxos?" — Raft enforces a stronger leader and contiguous log; Paxos allows holes in the log and multi-proposer setups.
Related Concepts
- Paxos Consensus Protocol — The predecessor algorithm Raft was designed to replace
- Two-Phase Commit — A different coordination protocol for distributed transactions
- CAP Theorem — Raft chooses CP: it stays consistent, but the minority side of a partition cannot serve writes
- Vector Clocks — An alternative approach to ordering events without consensus
- System Design Interview Guide — Framework for answering design questions
GO DEEPER
Learn from senior engineers in our 12-week cohort
Our Advanced System Design cohort covers this and 11 other deep-dive topics with live sessions, assignments, and expert feedback.