Audience: advanced distributed systems engineers who want a mental model that survives real outages.
Scenario: You operate a payment service. Product insists: “there must be exactly one current ledger head.” Ops insists: “it must survive machine failures.” Compliance insists: “no split-brain.”
You deploy 5 nodes across zones. Each node can crash, reboot, stall, or get partitioned. Messages can be delayed, reordered, duplicated, or dropped.
Your job: make the cluster behave like one reliable decision-maker.
If two nodes temporarily can’t talk, should both continue accepting writes?
Hold that thought - we’ll return to it when we talk about safety vs availability and why consensus is the “no split-brain” contract.
Key insight
Consensus is the mechanism that lets a distributed system act like it has a single authoritative state machine, even when nodes fail and networks misbehave.
You’re at a coffee shop with 5 baristas. Customers line up. The shop must maintain a single queue order and a single “next order to prepare.” If two baristas both start the same order or skip someone, chaos.
Which is the core problem consensus solves?
Pause.
Correct: 2. Agreeing on the next value/command in a sequence.
Consensus is about agreeing on an ordered log of decisions (commands) that all replicas apply in the same order.
Think of consensus as: a subroutine that reliably decides the next command to append to a shared log.
If you can do that, you can replicate a database, a configuration store, a scheduler, a lock service - anything modeled as a deterministic state machine.
A restaurant kitchen ticket rail: tickets (commands) must be processed in order. If two chefs disagree about ticket #17, the dining room burns.
Key insight
Most “consensus systems” are really replicated state machines built on top of a consensus log.
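To make the replicated-state-machine idea concrete, here is a toy sketch (our own names, not a real consensus implementation): three replicas apply the same agreed-upon command log in the same order, and because the state machine is deterministic they converge to identical state.

```python
# Toy sketch: the consensus layer's only job is to produce one agreed
# ordering of commands; applying that order deterministically yields
# identical replicas.

class KVStateMachine:
    def __init__(self):
        self.state = {}

    def apply(self, command):
        # Deterministic commands: ("set", key, value) or ("del", key)
        op, key, *rest = command
        if op == "set":
            self.state[key] = rest[0]
        elif op == "del":
            self.state.pop(key, None)

# Pretend consensus already produced this single agreed order.
log = [("set", "x", 1), ("set", "y", 2), ("del", "x"), ("set", "y", 3)]

replicas = [KVStateMachine() for _ in range(3)]
for sm in replicas:
    for cmd in log:
        sm.apply(cmd)

assert all(sm.state == {"y": 3} for sm in replicas)
```

If any replica applied the same commands in a different order (say, the `del` before the first `set`), the states would diverge; that is exactly the failure consensus exists to rule out.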
A delivery company must assign exactly one driver to each package. If two drivers deliver the same package, you have fraud and chargebacks.
Which property is non-negotiable for consensus?
Pause.
Correct: Safety is non-negotiable.
A consensus algorithm may become unavailable under certain failures (e.g., network partition), but it must not produce conflicting decisions.
In real systems, we typically accept temporary loss of liveness to preserve safety.
“Consensus means the system is always available.”
No. In the presence of partitions, you must choose: consistency or availability (CAP). Consensus algorithms in this family choose consistency/safety.
Key insight
If you need strict consistency, you are implicitly choosing CP behavior under partitions.
Your baristas communicate via walkie-talkies. Sometimes the walkie-talkie cuts out for 30 seconds. Sometimes a barista disappears (crash). Sometimes two baristas hear different things.
Match the failure to the typical model:
| Failure | Crash-fault tolerant (CFT) consensus (Paxos/Raft/ZK) | Byzantine fault tolerant (BFT) consensus |
|---|---|---|
| Node stops responding | Yes | Yes |
| Network delays/reordering/duplication | Yes | Yes |
| Node lies / sends conflicting messages intentionally | No | Yes |
| Disk corruption (silent) | No (needs extra measures) | Sometimes (depends on threat model + crypto) |
Pause and think: Paxos/Raft/ZooKeeper are designed for CFT, not Byzantine.
These protocols assume:
- nodes fail by crashing (and may later recover), not by lying
- messages can be delayed, reordered, duplicated, or dropped, but not forged
- the network is eventually stable for long enough to make progress
Key insight
Paxos/Raft/ZooKeeper assume crash faults and a non-malicious environment; liveness requires some eventual stability.
A committee of 5 decides the “official menu of the day.” You require at least 3 signatures to declare it official.
Why does requiring a majority help?
Pause.
Correct: B. Two majorities always overlap.
That overlap node acts as a “witness” ensuring two conflicting decisions can’t both be committed without someone noticing.
For N nodes, a quorum is usually floor(N/2) + 1.
This overlap is the foundation of safety.
A bank vault requiring 3-of-5 keys: any two sets of 3 share at least one keyholder, preventing two independent groups from opening two “different vaults” at once.
Key insight
Quorum intersection is the geometric backbone of consensus safety.
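A quick way to internalize quorum intersection is to enumerate it. This tiny check (illustrative only) shows that for N=5 with quorum size floor(N/2)+1 = 3, every pair of quorums overlaps, while size-2 "quorums" can be disjoint:

```python
# Enumerate all quorums of a 5-node cluster and verify pairwise overlap.
from itertools import combinations

N = 5
nodes = set(range(N))
quorum_size = N // 2 + 1  # floor(N/2) + 1 = 3

quorums = [set(q) for q in combinations(nodes, quorum_size)]
assert all(q1 & q2 for q1 in quorums for q2 in quorums)  # always intersect

# With minority-sized sets, disjoint "majorities" exist - no safety:
minorities = [set(q) for q in combinations(nodes, 2)]
assert any(not (q1 & q2) for q1 in minorities for q2 in minorities)
```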

You want your key-value store to behave like one machine.
What must be true for RSM correctness?
Pause.
All three.
Consensus gives you (1) and (2). Your application must provide (3) or handle nondeterminism carefully.
Key insight
Consensus is not “replication.” It’s the agreement on order that makes replication safe.
You’re building:
Pause and think.
Key insight
Use consensus when conflicting decisions are catastrophic.
Cache invalidation often fails not because of ordering, but because of loss and retry storms. If invalidation correctness is security-critical (authz), treat it like configuration and use a CP store.
Five baristas must agree on today’s special drink. They can’t all meet at once. Some are on break. Walkie-talkies drop.
Paxos is the protocol that ensures: once a special is chosen, it never changes.
What does Paxos primarily decide?
Pause.
Correct: B. A value for a single slot.
Classic Paxos solves single-decree consensus: decide one value.
To get a log, you run Paxos for each slot (Multi-Paxos).
Key insight
Paxos is a building block: one slot at a time.
In practice, nodes often play multiple roles.
Match role -> responsibility:
| Role | Responsibility |
|---|---|
| Proposer | Initiates a proposal with a proposal number |
| Acceptor | Stores promises/accepts and enforces the acceptance rules |
| Learner | Observes accepted values and determines when a value is chosen |
Key insight
Paxos safety lives inside acceptors.
Acceptors must persist (on stable storage):
- the highest proposal number they have promised
- the highest-numbered proposal they have accepted, together with its value
If acceptors lose this state (disk loss), safety can be violated unless you treat it as a new node and reconfigure membership.
Two baristas propose different specials. We must avoid choosing both.
Paxos uses proposal numbers (ballots) and a two-phase handshake:
- Phase 1 (Prepare/Promise): a proposer asks acceptors to promise to ignore lower-numbered proposals, and learns of any value already accepted.
- Phase 2 (Accept/Accepted): the proposer asks acceptors to accept a specific (n, value).
Why does Paxos need proposal numbers?
Pause.
Correct: B.
Proposal numbers establish a total order over attempts, independent of clocks.
“Paxos elects a leader and then everything is easy.”
Leader election is an optimization (Multi-Paxos). Basic Paxos does not require a stable leader to be correct.
Key insight
Prepare/Promise ensures that once a value could have been chosen, future proposals must respect it.
Real systems typically use a tuple (counter, nodeId):
This avoids collisions and ensures global ordering.
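A sketch of that ballot scheme (helper name is ours): tuple comparison gives a total order, and the nodeId tiebreaker guarantees two nodes can never mint the same ballot.

```python
# Ballots as (counter, nodeId): Python compares tuples lexicographically,
# so ballots are totally ordered and never collide across nodes.

def next_ballot(last_seen, node_id):
    counter, _ = last_seen
    return (counter + 1, node_id)

b1 = next_ballot((0, 0), node_id=1)
b2 = next_ballot((0, 0), node_id=2)
assert b1 != b2 and b2 > b1            # same counter, nodeId breaks the tie
assert next_ballot(b2, node_id=1) > b2 # strictly increasing across rounds
```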
Cluster: A1..A5 acceptors. Quorum=3. Two proposers: P and Q.
Step 0: No value chosen.
Step 1: P sends Prepare(n=10) to A1,A2,A3.
Pause: what do acceptors reply?
Answer: They reply Promise(10, lastAccepted?) and record “promised n=10.”
Step 2: P receives promises from a quorum.
Pause: what value can P propose in Phase 2?
Answer: If any acceptor reported a previously accepted value, P must propose the value with the highest accepted proposal number. Otherwise, P can propose its own value.
Step 3: P sends Accept(n=10, value=VanillaLatte) to A1,A2,A3.
Step 4: A1,A2,A3 accept and reply Accepted.
Chosen: Once a quorum accepts, VanillaLatte is chosen.
A value can be chosen even if no learner has learned it yet. Learners learn via acceptor notifications or by querying.
Key insight
Paxos chooses when a quorum of acceptors accept the same (n, value).
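The walkthrough above can be replayed with a minimal single-decree acceptor (toy, in-memory; all names are ours and ballots are plain integers for brevity):

```python
# Minimal Paxos acceptor: promise to ignore lower ballots, accept at or
# above the promised ballot, and report any previously accepted value.

class Acceptor:
    def __init__(self):
        self.promised = -1        # highest ballot promised
        self.accepted = None      # (ballot, value) or None

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, n, value):
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return "accepted"
        return "nack"

acceptors = [Acceptor() for _ in range(5)]
quorum = acceptors[:3]            # A1, A2, A3

# Phase 1: Prepare(n=10) - nothing accepted yet, so P may pick its own value.
replies = [a.prepare(10) for a in quorum]
assert all(kind == "promise" and prev is None for kind, prev in replies)

# Phase 2: Accept(n=10, "VanillaLatte") to the same quorum.
acks = [a.accept(10, "VanillaLatte") for a in quorum]
assert acks.count("accepted") >= 3     # chosen once a quorum accepts

# A later proposer (higher ballot) is forced to adopt the chosen value.
prev_values = [prev for kind, prev in (a.prepare(20) for a in quorum) if prev]
value_to_propose = max(prev_values)[1]  # value with the highest accepted ballot
assert value_to_propose == "VanillaLatte"
```

The last three lines are the heart of safety: once "VanillaLatte" could have been chosen, no competing proposer can get a quorum to accept anything else.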
Your coffee shop needs not one decision, but a sequence: specials for Monday, Tuesday, Wednesday…
Multi-Paxos runs Paxos for each log index but typically uses a stable leader to avoid repeating Phase 1 for every entry.
What does a stable leader buy you?
Pause.
Correct: B.
With a stable leader, you do Phase 1 once (establish leadership with a high ballot), then append many entries with Phase 2.
Key insight
Multi-Paxos ~= “Paxos with a long-lived distinguished proposer.”
Leader changes correlate with latency spikes because the new leader must establish a higher ballot (re-running Phase 1), repair lagging followers' logs, and wait for clients to rediscover where to send requests.
You run a 5-acceptor Multi-Paxos cluster.
Pause and think.
You inherit a Paxos-based system. It works, but nobody can confidently modify it. Incidents happen because engineers fear touching the consensus layer.
Raft was designed to be easier to understand and implement while providing similar properties (CFT, majority quorums, leader-based replication).
Which design choice is central to Raft?
Pause.
Correct: B.
Raft decomposes the problem into:
- leader election
- log replication
- safety (the rules that keep committed entries committed)
Key insight
Raft is “Multi-Paxos with a clearer story”: explicit terms, elections, and log matching.
A restaurant has: one head chef (leader), line cooks who follow the ticket rail (followers), and, when the head chef disappears, cooks who campaign to take over (candidates).
Raft adds terms: numbered epochs of leadership.
What is a “term” most like?
Pause.
Correct: B.
Terms let nodes detect stale leaders and stale messages.
Key insight
Term numbers are the “version” of leadership.
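The term rule every Raft RPC handler applies first can be sketched in a few lines (toy helper, not from any library):

```python
# On every RPC, compare terms before doing anything else: a higher term
# forces a step-down; a lower term marks the sender as stale.

def on_rpc_term(my_term, my_role, rpc_term):
    if rpc_term > my_term:
        return rpc_term, "follower"   # newer epoch: adopt it, step down
    if rpc_term < my_term:
        return my_term, my_role       # stale sender; our reply carries my_term
    return my_term, my_role           # same epoch: proceed normally

assert on_rpc_term(5, "leader", 7) == (7, "follower")  # deposed by newer term
assert on_rpc_term(5, "leader", 3) == (5, "leader")    # ignore stale message
```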
Persist on stable storage:
- currentTerm
- votedFor

If you don't, a reboot can cause a node to "forget" it voted and violate election safety.
If everyone tries to become head chef at once, you get chaos. Raft uses randomized election timeouts.
You have 5 nodes. All followers. Election timeout is random 150-300ms.
Pause and think: Why randomness?
Randomness reduces the probability that two nodes start elections simultaneously, preventing repeated split votes.
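A quick simulation (illustrative only) makes the effect visible: with identical timeouts every follower fires at once, while randomized 150-300ms timeouts almost always produce a unique first candidate.

```python
# Monte Carlo: how often does exactly one node time out first?
import random

random.seed(42)

def first_timeout_is_unique(randomized, trials=1000, n=5):
    unique = 0
    for _ in range(trials):
        if randomized:
            timeouts = [random.uniform(150, 300) for _ in range(n)]
        else:
            timeouts = [150.0] * n       # identical timeouts: guaranteed tie
        lo = min(timeouts)
        unique += timeouts.count(lo) == 1
    return unique / trials

assert first_timeout_is_unique(randomized=False) == 0.0   # always a split vote
assert first_timeout_is_unique(randomized=True) > 0.99    # ties are vanishingly rare
```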
“Leader election is just ‘pick the lowest ID’.”
Not safe under partitions. You need a mechanism that ensures at most one leader per term and that leaders have sufficiently up-to-date logs.
Key insight
Raft’s voting rule couples leadership to log freshness, preventing leaders without committed entries.
If two candidates each get 2 votes in a 5-node cluster, nobody wins; they time out again with new randomized timeouts and retry.
The head chef writes tickets in order and ensures each station has the same ticket list.
What does Raft replicate?
Pause.
Correct: B.
If two logs contain an entry with the same index and term, then:
- the entries store the same command
- the two logs are identical in all preceding entries
This is enforced by AppendEntries carrying prevLogIndex and prevLogTerm.
Pause: Why not just prevLogIndex?
Answer: Because a follower might have an entry at that index from a different leader/term; term disambiguates and detects conflicts.
Naively decrementing nextIndex one-by-one can be expensive. Production Raft implementations use a conflict hint (e.g., conflictTerm and conflictIndex) to jump back faster.
Key insight
Raft repairs divergence by forcing followers to match the leader’s log prefix.
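The follower-side consistency check, including the conflict hint, can be sketched like this (field names are ours; the log is a list of (term, command) with 0-based indices for simplicity):

```python
# AppendEntries consistency check with a conflict hint: on a term mismatch,
# report the conflicting term and the first index we hold for it, so the
# leader can back up a whole term at a time instead of one entry at a time.

def check_append(log, prev_log_index, prev_log_term):
    if prev_log_index == -1:
        return {"ok": True}                      # appending from the start
    if prev_log_index >= len(log):
        # We are missing entries entirely: hint the leader to our log end.
        return {"ok": False, "conflict_term": None, "conflict_index": len(log)}
    term = log[prev_log_index][0]
    if term != prev_log_term:
        first = prev_log_index
        while first > 0 and log[first - 1][0] == term:
            first -= 1
        return {"ok": False, "conflict_term": term, "conflict_index": first}
    return {"ok": True}

follower_log = [(1, "a"), (1, "b"), (2, "c"), (2, "d")]
assert check_append(follower_log, 3, 2)["ok"]                  # prefix matches
assert check_append(follower_log, 3, 3) == \
    {"ok": False, "conflict_term": 2, "conflict_index": 2}     # skip all of term 2
assert check_append(follower_log, 9, 3)["conflict_index"] == 4 # follower too short
```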
A ticket is not “official” until enough stations have it.
In Raft: an entry is appended by the leader, replicated to followers, and then committed - three distinct steps.
When is a log entry committed?
Pause.
Correct: B.
A leader only uses “count replicas” to advance commitIndex for entries from its current term. Earlier-term entries become committed indirectly when a later entry from the current term commits.
Key insight
“Committed” is a quorum-backed promise that the entry will survive future leader changes.
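The current-term restriction on advancing commitIndex (the Raft paper's Figure 8 subtlety) can be sketched as follows (our names; 0-based indices):

```python
# Leader-side commit rule: advance commitIndex to the highest index that is
# on a majority, but only if that entry carries the leader's CURRENT term.
# Earlier-term entries commit indirectly when a current-term entry commits.

def advance_commit(log, match_index, commit_index, current_term, n):
    # match_index: highest replicated index per follower; leader counts itself.
    for idx in range(len(log) - 1, commit_index, -1):
        replicas = 1 + sum(1 for m in match_index if m >= idx)
        if replicas > n // 2 and log[idx][0] == current_term:
            return idx
    return commit_index

log = [(1, "a"), (2, "b"), (3, "c")]          # (term, command)
# Entry 1 (term 2) is on a majority, but it is NOT from term 3: no commit.
assert advance_commit(log, [1, 1, 0, 0], 0, current_term=3, n=5) == 0
# Once entry 2 (term 3) reaches a majority, it commits entry 1 with it.
assert advance_commit(log, [2, 2, 0, 0], 0, current_term=3, n=5) == 2
```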
An old leader may continue accepting client requests, but it cannot commit without a majority. The correct behavior is to reject writes, or to accept-but-not-acknowledge them (rare; usually a bad API).
A follower with slow disk can lag far behind; leader maintains nextIndex/matchIndex per follower.
Pause.
A leader that pauses for 30s can trigger elections. When it returns, it will observe a higher term and step down. This is correct but can cause:
- a burst of failed or retried client requests during the handover
- repeated elections if the pauses recur
Mitigations:
- Pre-vote: candidates probe peers before incrementing the term
- CheckQuorum / leader leases: a leader steps down when it cannot hear a majority
- Election timeouts sized well above expected pauses (GC, disk stalls)
Your coffee shop hires a new barista (add node) or someone quits (remove node). You must change membership without creating two different majorities.
The Raft paper describes joint consensus: during the transition, decisions require quorums from both the old and the new configuration. (Many implementations instead restrict changes to one server at a time, which also preserves quorum overlap.)
Why can’t you just switch configs instantly?
Because you could create a moment where two different majorities exist (old set and new set) that don’t intersect, risking split brain.

Key insight
Membership changes must preserve quorum intersection across time.
Correct implementations persist the joint configuration as a log entry. After crash/restart, the cluster continues the transition based on the committed log.
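The joint-quorum rule itself is small enough to sketch (our names, not a real implementation): during C_old,new a decision needs a majority of *both* configurations, so it can never bypass either membership.

```python
# Joint consensus: an ack set forms a quorum only if it contains a
# majority of the old config AND a majority of the new config.

def joint_quorum(acks, old_cfg, new_cfg):
    def majority(cfg):
        return len(acks & cfg) > len(cfg) // 2
    return majority(old_cfg) and majority(new_cfg)

old_cfg = {"A", "B", "C"}
new_cfg = {"B", "C", "D", "E"}

# All of the old config is not enough: the new config lacks a majority.
assert joint_quorum({"A", "B", "C"}, old_cfg, new_cfg) is False
# A set covering both majorities qualifies.
assert joint_quorum({"B", "C", "D"}, old_cfg, new_cfg) is True
```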
You run a fleet of services that need:
- leader election
- shared configuration
- distributed locks and group membership
You don’t want every team implementing consensus. You deploy ZooKeeper as a shared coordination service.
ZooKeeper is best described as:
Pause.
Correct: B.
ZooKeeper provides a hierarchical namespace (znodes) with strong consistency guarantees.
Key insight
ZooKeeper is a CP coordination system optimized for metadata, not bulk data.
ZooKeeper is often a shared dependency; treat it like critical infrastructure:
- run it as a dedicated ensemble (typically 3 or 5 nodes)
- monitor quorum health, request latency, and session churn
- keep bulk data and noisy tenants out of it
ZooKeeper needs all servers to agree on the order of updates to the znode tree.
It uses ZAB (ZooKeeper Atomic Broadcast), a protocol similar in spirit to leader-based log replication (like Multi-Paxos/Raft).
What does ZooKeeper replicate?
Pause.
Correct: A.
ZAB ensures a total order of updates and that followers apply them consistently.
Key insight
ZooKeeper is effectively a replicated log + deterministic state machine (the znode tree).
You want leader election for a service “orders.” Each instance registers itself. When the leader dies, another should take over.
ZooKeeper gives you:
- persistent, ephemeral, and sequential znodes
- watches (one-shot change notifications)
- sessions tied to client liveness
Which feature is the secret weapon for coordination?
Pause.
Correct: C.
Ephemeral nodes represent liveness; watches provide event-driven reactions.
Watches are:
- one-shot: they fire once and must be re-registered
- not a stream: multiple changes can collapse into a single notification
Correct recipes always re-read state after a watch fires.
Each service instance creates `/election/c_` as an ephemeral sequential znode (ZooKeeper appends a monotonically increasing sequence number).
The one with the smallest sequence number is leader.
If you watch the leader znode directly, what happens when there are 10,000 clients?
You risk the herd effect: everyone wakes up on leader change.
Better: each node watches its predecessor (the znode with the next smaller sequence). Only one node wakes up when predecessor disappears.

The following code is not real ZooKeeper; it’s a TCP-style demo of the recipe. Production ZooKeeper clients use the official libraries and handle session events.
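Here is a minimal in-memory stand-in for that demo (simulated znodes, no network; all class and function names are ours): ephemeral sequential znodes plus predecessor watches.

```python
# Simulated znode store: ephemeral sequential creation, per-znode watches,
# and session expiry that deletes the session's znodes and fires watches.

class TinyZK:
    def __init__(self):
        self.seq = 0
        self.children = {}        # znode name -> owning session
        self.watches = {}         # znode name -> list of callbacks

    def create_ephemeral_sequential(self, prefix, session):
        name = f"{prefix}{self.seq:010d}"
        self.seq += 1
        self.children[name] = session
        return name

    def watch(self, name, callback):
        self.watches.setdefault(name, []).append(callback)

    def session_expired(self, session):
        for name in [n for n, s in self.children.items() if s == session]:
            del self.children[name]
            for cb in self.watches.pop(name, []):
                cb(name)

def leader(zk):
    return min(zk.children)       # smallest sequence number wins

zk = TinyZK()
znodes = {s: zk.create_ephemeral_sequential("c_", s) for s in ("s1", "s2", "s3")}

woken = []
# Each client watches its PREDECESSOR, not the leader: no herd effect.
ordered = sorted(znodes.values())
for prev, nxt in zip(ordered, ordered[1:]):
    zk.watch(prev, lambda name, nxt=nxt: woken.append(nxt))

assert zk.children[leader(zk)] == "s1"
zk.session_expired("s1")          # leader dies: only its successor wakes up
assert woken == [znodes["s2"]]
assert zk.children[leader(zk)] == "s2"
```

Note that when s1's session expires, exactly one watch fires; with a real 10,000-client ensemble, this is the difference between one wake-up and a stampede.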
Key insight
Correct ZooKeeper recipes avoid herd effects by watching predecessors, not the leader.
Use getChildren and exists(watch=true); always re-check state after a watch fires.

You store configuration in ZooKeeper. You want every service to see updates in a consistent order.
ZooKeeper provides:
- a total order over all updates (zxids)
- sequential consistency: each client observes updates in that order
But reads can be tricky: followers serve reads from local state, which may lag the leader.
Pause.
Key insight
ZooKeeper is strongly consistent for writes, but read semantics depend on how you read.
Use sync() when you must ensure your subsequent read reflects the latest committed state.
| Dimension | Paxos / Multi-Paxos | Raft | ZooKeeper |
|---|---|---|---|
| Primary goal | Consensus (single slot -> log) | Understandable consensus + log replication | Coordination service built on consensus (ZAB) |
| Typical deployment | Embedded in DBs/systems | Embedded in systems (etcd/Consul-like) | Dedicated ensemble used by many apps |
| Leader | Optional (optimization) | Central to design | Leader-based |
| API | Usually internal (append log) | Usually internal (append log) | External API (znodes, watches) |
| Strength | Proven minimal core | Clarity, safer implementations | Rich coordination primitives |
| Weakness | Hard to reason/teach | Membership changes & compaction complexity | Misuse leads to herd effects; not for bulk data |
| Failure behavior | CP; needs majority | CP; needs majority | CP; needs majority |
Key insight
Choose Raft/Multi-Paxos when you’re building a replicated system; choose ZooKeeper when many systems need shared coordination and you accept operating a dedicated ensemble.
Match each use case to the best option (Paxos/Multi-Paxos, Raft, ZooKeeper):
A client writes x=1, then reads x. They expect 1.
Which guarantee is this?
Pause.
It depends:
- reads served by the leader (with a quorum check such as ReadIndex or a lease) can be linearizable
- reads served from any replica's local state can be stale, giving only sequential consistency
Consensus orders operations; your client read path determines whether you get linearizability.
Your replicated log grows forever. Disk fills. Restart time becomes hours.
Consensus systems use:
- snapshots of the applied state machine
- log compaction: truncating entries already covered by the snapshot
Why is snapshotting tricky?
Because you must ensure that:
- the snapshot corresponds exactly to a prefix of applied entries
- lastIncludedIndex/lastIncludedTerm are retained so the remaining log still passes consistency checks
- slow followers can be caught up via snapshot transfer once their entries have been truncated

Below is a toy illustration. Key fixes vs common mistakes:
- On installing a snapshot, set commitIndex to max(commitIndex, lastIncludedIndex) (never decrease it)

Key insight
Snapshots are not just storage optimization - they are part of correctness and recovery.
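A toy snapshot install showing those fixes (our names, not a real implementation):

```python
# Snapshot install: carry lastIncludedIndex/lastIncludedTerm so the
# truncated log can still be consistency-checked, and never move
# commitIndex backwards.

class Follower:
    def __init__(self):
        self.log = []                     # entries after the snapshot
        self.last_included_index = -1
        self.last_included_term = 0
        self.commit_index = -1
        self.state = {}

    def install_snapshot(self, snap):
        self.state = dict(snap["state"])
        self.last_included_index = snap["last_included_index"]
        self.last_included_term = snap["last_included_term"]
        self.log = []                     # snapshot replaces the log prefix
        # Common bug: plain assignment can move commitIndex backwards.
        self.commit_index = max(self.commit_index, snap["last_included_index"])

f = Follower()
f.commit_index = 7                        # already committed past the snapshot
f.install_snapshot({"state": {"y": 3}, "last_included_index": 5,
                    "last_included_term": 2})
assert f.commit_index == 7                # did not decrease
assert (f.last_included_index, f.last_included_term) == (5, 2)
```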
A client wants to write. It contacts node A, but A is a follower.
Which client strategy is safest?
Pause.
Typically B, sometimes A depending on system design.
Broadcasting writes is expensive and can amplify failures.
You set election timeout to 1s to avoid spurious elections. But failover now takes seconds.
What happens if election timeout is too low?
You get unnecessary elections under transient delays (GC pauses, noisy neighbors), reducing throughput and increasing tail latency.
If too high, recovery is slow.
Key insight
Tuning timeouts is balancing false positives (unneeded elections) vs failover time.

Key insight
If you can reason about commitIndex and quorum intersection, you can reason about most outages.
You have a 5-node Raft cluster: A,B,C,D,E.
Pause and think.
True: 2, 3, 4.
A can accept writes but should typically fail them (or accept but not acknowledge) because it can’t reach a majority.
When the partition heals, the majority side’s leader will force log convergence; A’s uncommitted tail can be discarded.
Key insight
Committed entries never roll back; uncommitted entries can.
You operate 200 microservices. They need:
Constraints:
Choose an architecture:
Pause and think about:
Key insight
Consensus is easy to want and hard to operate. The best design is the one whose failure modes you can explain at 3am.
Appendix (Python): toy Raft append + commit simulation with partitions - leader appends, quorum acks, commit advancement, and rollback of uncommitted entries after a leader change.
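A heavily simplified sketch of that simulation (our names; replication ships the leader's whole log and followers adopt it wholesale, which is not how real AppendEntries works but is enough to show the commit and rollback behavior):

```python
# Toy Raft-flavoured simulation: quorum acks advance commitIndex, a
# minority-partitioned leader cannot commit, and its uncommitted tail is
# overwritten when the majority side's leader replicates after healing.

N = 5  # nodes A..E

class Node:
    def __init__(self, name):
        self.name = name
        self.log = []          # (term, command)
        self.commit_index = -1

def replicate(leader_node, followers, reachable):
    acks = 1                   # the leader counts itself
    for f in followers:
        if f.name in reachable:
            f.log = list(leader_node.log)   # followers adopt the leader's log
            acks += 1
    if acks > N // 2:
        leader_node.commit_index = len(leader_node.log) - 1

nodes = {n: Node(n) for n in "ABCDE"}
a, others = nodes["A"], [nodes[n] for n in "BCDE"]

# Term 1: A is leader and commits one entry cluster-wide.
a.log.append((1, "x=1"))
replicate(a, others, reachable={"B", "C", "D", "E"})
assert a.commit_index == 0

# Partition: A is alone. It appends but cannot reach a majority.
a.log.append((1, "x=2"))
replicate(a, others, reachable=set())
assert a.commit_index == 0                  # (1, "x=2") stays uncommitted

# Majority side elects B for term 2 and commits a different entry.
b, rest = nodes["B"], [nodes[n] for n in "CDE"]
b.log.append((2, "x=3"))
replicate(b, rest, reachable={"C", "D", "E"})
assert b.commit_index == 1

# Heal: A rejoins as a follower; its uncommitted tail is discarded.
replicate(b, [a], reachable={"A"})
assert a.log == [(1, "x=1"), (2, "x=3")]
```

Note what survived and what did not: the committed (1, "x=1") appears on every node, while A's uncommitted (1, "x=2") was rolled back, exactly as the partition scenario above predicted.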