Raft Consensus Algorithm Explained: Making Distributed Nodes Agree

Understand the Raft consensus algorithm — leader election, log replication, and safety guarantees, with implementation details and interview tips.

Tags: raft, consensus, distributed-systems, replication, fault-tolerance

Raft Consensus Algorithm

Raft is a consensus algorithm that enables a cluster of nodes to agree on a sequence of values (a replicated log) even when some nodes fail, by electing a single leader that coordinates all changes.

What It Really Means

In a distributed system, you often need multiple nodes to agree on the same data — the current leader of the cluster, the order of transactions, or the state of a configuration. Consensus algorithms solve this problem: they guarantee that all non-faulty nodes eventually agree on the same value, even if some nodes crash or messages are delayed.

Raft was designed by Diego Ongaro and John Ousterhout at Stanford in 2014 specifically to be understandable. Its predecessor, Paxos, is notoriously difficult to understand and implement correctly. Raft achieves the same safety guarantees as Paxos but decomposes the problem into three clean subproblems: leader election, log replication, and safety.

Raft is used in production by etcd (which powers Kubernetes), CockroachDB, TiKV (the storage layer of TiDB), Consul by HashiCorp, and many other systems. If you use Kubernetes, every cluster state change goes through a Raft-based consensus in etcd.

How It Works in Practice

The Three Roles

Every node in a Raft cluster is in one of three states:

  • Leader: Handles all client requests, replicates log entries to followers
  • Follower: Passive — responds to RPCs from the leader and candidates
  • Candidate: Transitional state during leader election
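These three roles can be modeled as a small state machine. A minimal sketch (the `RaftNode` class and its field names are illustrative, loosely following the paper's terminology, not any particular library):

```python
from enum import Enum, auto

class Role(Enum):
    FOLLOWER = auto()
    CANDIDATE = auto()
    LEADER = auto()

class RaftNode:
    """Minimal per-node state: every node tracks its role and current term."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.role = Role.FOLLOWER   # all nodes boot as followers
        self.current_term = 0       # monotonically increasing logical clock
        self.voted_for = None       # candidate this node voted for in current_term
        self.log = []               # list of (term, command) entries
```

A node only ever moves follower → candidate (timeout), candidate → leader (won election), or back to follower (saw a higher term).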

Leader Election

  1. All nodes start as followers with a randomized election timeout (e.g., 150-300ms)
  2. If a follower does not hear from a leader before its timeout expires, it becomes a candidate
  3. The candidate increments its term number, votes for itself, and sends RequestVote RPCs to all other nodes
  4. A node grants its vote to the first candidate it hears from in that term (first-come, first-served)
  5. If a candidate receives votes from a majority (quorum), it becomes the leader
  6. The leader sends periodic heartbeats (empty AppendEntries RPCs) to maintain authority

The randomized timeout is critical — it ensures that in most cases only one node times out first, avoiding split votes. If a split vote does occur, candidates retry with new randomized timeouts.
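Steps 3 through 5 can be sketched with a toy `Ballot` class (a name I am inventing for illustration): each voter grants at most one vote per term, and a candidate needs a strict majority.

```python
import random

def election_timeout_ms(low=150, high=300):
    # Randomized timeout: candidates rarely fire at the same moment.
    return random.uniform(low, high)

class Ballot:
    """One node's voting record: at most one vote per term (step 4)."""
    def __init__(self):
        self.voted_for = {}  # term -> candidate_id

    def grant(self, term, candidate_id):
        # First-come, first-served within a term; a repeat request from the
        # same candidate is re-granted, since RPCs may be retried.
        winner = self.voted_for.setdefault(term, candidate_id)
        return winner == candidate_id

# Candidate 1 starts an election for term 2 in a 5-node cluster.
voters = [Ballot() for _ in range(4)]                        # the other 4 nodes
votes = 1 + sum(b.grant(2, candidate_id=1) for b in voters)  # self-vote + grants
assert votes >= 5 // 2 + 1                                   # quorum of 3 -> leader
```

Because each `Ballot` remembers its vote per term, a second candidate asking in the same term is refused, which is exactly why at most one leader can win a given term.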

Log Replication

  1. Client sends a write request to the leader
  2. Leader appends the entry to its local log (uncommitted)
  3. Leader sends AppendEntries RPCs to all followers with the new entry
  4. Each follower checks that its log matches the leader's at the preceding index, appends the entry, and responds with success
  5. Once a majority of nodes have acknowledged, the leader marks the entry as committed
  6. Leader responds to the client with success
  7. Followers learn about the commit in subsequent heartbeats and apply the entry
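Step 5's commit rule can be sketched as a function over the leader's `match_index` bookkeeping (names echo the paper's terminology; this ignores the safety restriction that a leader only directly commits entries from its own term):

```python
def highest_committed(match_index, cluster_size):
    """Return the highest log index replicated on a majority of nodes.

    match_index holds, for every node (leader included), the highest
    log index known to be stored on that node."""
    quorum = cluster_size // 2 + 1
    # The quorum-th largest value is present on at least `quorum` nodes.
    return sorted(match_index, reverse=True)[quorum - 1]

# 5-node cluster: leader and one follower are at index 7, the rest lag.
assert highest_committed([7, 7, 5, 4, 2], cluster_size=5) == 5
```

Note that the slowest minority (indices 4 and 2 above) never holds back the commit point; that is the whole reason quorum writes stay fast despite stragglers.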

Real-World: etcd and Kubernetes

Kubernetes stores all cluster state (pods, services, deployments, config maps) in etcd, a distributed key-value store built on Raft. When you run kubectl apply, the API server writes to etcd. The etcd leader replicates the change to a quorum of nodes before acknowledging the write. This guarantees that cluster state is consistent even if an etcd node crashes during the write.

A typical etcd cluster has 3 or 5 nodes. With 3 nodes, the system tolerates 1 failure. With 5, it tolerates 2. Using more than 7 nodes is rare because the cost of replication (leader must wait for majority acknowledgment) grows with cluster size.
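The sizing arithmetic follows directly from the quorum formula: a cluster of n nodes needs n // 2 + 1 acknowledgments, so it tolerates (n - 1) // 2 failures.

```python
def quorum(n):
    # Majority needed for both elections and commits.
    return n // 2 + 1

def tolerated_failures(n):
    # Nodes that can fail while a quorum still exists.
    return n - quorum(n)

assert quorum(3) == 2 and tolerated_failures(3) == 1
assert quorum(5) == 3 and tolerated_failures(5) == 2
assert quorum(7) == 4 and tolerated_failures(7) == 3
# Even sizes add replication cost without extra fault tolerance:
assert tolerated_failures(4) == tolerated_failures(3)
```

This is also why etcd clusters use odd sizes: a fourth node raises the quorum from 2 to 3 without letting you survive any additional failures.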

Implementation

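Below is a minimal sketch of the two core RPC handlers, following the paper's rules but simplified for readability (no heartbeat timers, no persistence; class and field names are mine). A production node must persist `current_term`, `voted_for`, and the log before replying to any RPC.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional

class Role(Enum):
    FOLLOWER = auto()
    CANDIDATE = auto()
    LEADER = auto()

@dataclass
class Entry:
    term: int
    command: object

@dataclass
class Node:
    node_id: int
    current_term: int = 0
    voted_for: Optional[int] = None
    log: list = field(default_factory=list)   # list[Entry]
    commit_index: int = -1                    # highest committed index (-1 = none)
    role: Role = Role.FOLLOWER

    def handle_request_vote(self, term, candidate_id, last_log_index, last_log_term):
        """Grant a vote only if the candidate's term is current and its log is
        at least as up-to-date as ours (the election restriction)."""
        if term < self.current_term:
            return False
        if term > self.current_term:                 # newer term: reset our vote
            self.current_term, self.voted_for = term, None
            self.role = Role.FOLLOWER
        my_last_term = self.log[-1].term if self.log else -1
        my_last_index = len(self.log) - 1
        # "Up-to-date": higher last term wins; same term -> longer log wins.
        up_to_date = (last_log_term, last_log_index) >= (my_last_term, my_last_index)
        if self.voted_for in (None, candidate_id) and up_to_date:
            self.voted_for = candidate_id
            return True
        return False

    def handle_append_entries(self, term, prev_index, prev_term, entries, leader_commit):
        """Append entries if our log matches the leader's at prev_index."""
        if term < self.current_term:
            return False
        self.current_term = term
        self.role = Role.FOLLOWER                    # a valid leader exists
        # Consistency check: we must hold prev_index with a matching term.
        if prev_index >= 0:
            if prev_index >= len(self.log) or self.log[prev_index].term != prev_term:
                return False                         # leader will retry further back
        # Simplified: truncate and append. Real implementations only delete
        # on conflict, so reordered or duplicate RPCs cannot drop entries.
        self.log = self.log[:prev_index + 1] + list(entries)
        if leader_commit > self.commit_index:
            self.commit_index = min(leader_commit, len(self.log) - 1)
        return True
```

Example: a leader in term 1 replicates two entries to an empty follower, then a candidate with a shorter log fails to win that follower's vote:

```python
f = Node(node_id=2)
ok = f.handle_append_entries(term=1, prev_index=-1, prev_term=-1,
                             entries=[Entry(1, "x"), Entry(1, "y")],
                             leader_commit=0)
assert ok and len(f.log) == 2 and f.commit_index == 0
assert f.handle_request_vote(term=2, candidate_id=9,
                             last_log_index=0, last_log_term=1) is False
```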

Trade-offs

Advantages

  • Understandable: Decomposed into leader election, log replication, and safety — much easier to reason about than Paxos
  • Strong consistency: Guarantees linearizable reads (if reads go through the leader)
  • Well-proven: Used in production by etcd, CockroachDB, TiKV, Consul, and many others
  • Deterministic recovery: The log-based approach makes crash recovery straightforward

Disadvantages

  • Single-leader bottleneck: All writes go through one node, limiting write throughput
  • Availability during elections: The cluster is temporarily unavailable for writes during leader election (typically <1 second)
  • Not ideal for geo-distributed systems: Cross-datacenter latency makes Raft slow because every write requires a round trip to a majority of nodes
  • Cluster size limits: Performance degrades with more nodes because the leader must replicate to more followers

Common Misconceptions

  • "Raft and Paxos solve different problems" — They solve the same fundamental problem (consensus on a replicated log) with the same safety guarantees. Raft is a specific implementation of Multi-Paxos ideas, designed for understandability.

  • "The leader always has the most up-to-date data" — Raft's election restriction guarantees a new leader holds every committed entry, but the leader does not yet know which of its entries are committed. Raft handles this: a new leader commits a no-op entry in its term to advance the commit index before serving linearizable reads.

  • "More nodes means more fault tolerance with no downsides" — Each additional node adds replication overhead. A 5-node cluster tolerates 2 failures but every write waits for 3 acknowledgments. Going from 5 to 7 nodes adds latency for marginal improvement (tolerating 3 vs 2 failures).

  • "Raft guarantees zero data loss" — Committed entries are durable. But entries that were appended to the leader's log but not yet committed can be lost if the leader crashes. This is the gap between leader receiving a write and achieving quorum.

  • "You can do consistent reads from any follower" — Followers may lag behind the leader. Reading from a follower can return stale data. For linearizable reads, you must read from the leader (or use a lease-based approach).
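The staleness problem in the last misconception can be shown with a two-line toy model (plain Python lists standing in for replicated state, not real etcd):

```python
# The leader has applied five committed entries; a lagging follower has three.
leader_log = ["a", "b", "c", "d", "e"]
follower_log = leader_log[:3]        # replication past the quorum is asynchronous

assert leader_log[-1] == "e"         # linearizable read, served by the leader
assert follower_log[-1] == "c"       # stale read, served by the follower
```

Both answers are internally consistent logs, which is why follower reads are safe for use cases that tolerate bounded staleness but not for linearizable ones.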

How This Appears in Interviews

Raft appears in both system design and distributed systems interviews:

  • "How does etcd ensure consistency?" — Walk through Raft: leader election, log replication, quorum writes, and how Kubernetes relies on these guarantees. See our system design interview guide.
  • "Design a distributed configuration store" — etcd is the canonical answer. Explain why you choose Raft over Paxos (understandability, tooling), how you handle leader failure, and cluster sizing.
  • "What happens when the leader fails?" — Explain election timeout, term increment, vote requests, and that writes are unavailable until a new leader is elected (typically well under a second, bounded by the randomized 150-300ms timeout plus one voting round).
  • "How does Raft differ from Paxos?" — Raft enforces a stronger leader and contiguous log; Paxos allows holes in the log and multi-proposer setups.
