Courses 0%
12
Distributed Systems And Algorithms · Chapter 12 of 51

Split Brain Mitigation

Akhil
Akhil Sharma
20 min

Split Brain in Distributed Systems — The Problem, the Damage, and the Mitigations (Interactive)

Audience: engineers who build/operate distributed systems (databases, coordination services, caches, message brokers, consensus clusters). Goal: understand how split brain happens, why it is destructive, and how to mitigate it with concrete design/ops patterns.


Two managers, one restaurant, and a cut phone line

Scenario. You run a busy restaurant with two floor managers: A and B. They coordinate by phone: who seats parties, who comps meals, who closes the register.

One night the phone line between them is cut. Both managers can still talk to staff on their own side of the building.

  • Manager A thinks B quit.
  • Manager B thinks A quit.

Both decide: "I'm in charge." Both start closing tabs, comping meals, and reassigning tables.

By the time the phone line is fixed, you have:

  • duplicated reservations,
  • conflicting receipts,
  • missing inventory,
  • angry customers.

That's split brain: two (or more) partitions of a distributed system each believe they are the authoritative leader/primary and proceed independently.

Pause and think

If you were the restaurant owner, what would you prefer during the phone outage?

  1. Keep serving on both sides, even if receipts conflict later.
  2. Freeze one side to avoid conflicts, even if customers wait.

Hold your answer — we will come back to it when we discuss CAP trade-offs and quorum.

Key insight

Split brain is not "a node crashed." It is a coordination failure under partial connectivity where multiple sides continue making authoritative decisions.


What "split brain" really means (and what it does not)

Scenario

You have a 5-node cluster. A network partition splits it into groups of 2 and 3.

  • Group of 3 elects leader L3.
  • Group of 2 elects leader L2 (because of a bug/misconfiguration/timeout).

Now you have two primaries.

Interactive question (pause and think)

Which of these is required for split brain?

A. A node crash B. Network partition or asymmetric reachability C. High CPU usage D. Clock skew

Take 10 seconds.

Progressive reveal — answer

Correct: B. Split brain requires some form of communication failure that allows independent progress in multiple partitions.

  • A crash can trigger leader election, but does not itself create two leaders.
  • High CPU and clock skew can contribute (timeouts, lease mis-evaluation), but the core is loss of coordination.

Real-world parallel

Two delivery dispatch centers lose the shared system link. Each dispatches drivers to the same orders.

Key insight

Split brain is a safety failure (two authorities) caused by faulty membership/leadership decisions under partial failure.

Challenge question

In your system today, what mechanism decides "who is leader" and "who is allowed to accept writes"? Is it explicit (consensus) or implicit (configuration plus timeouts)?


Why split brain is so destructive (the "double spend" of state)

Scenario

You run a primary-replica database. Two primaries accept writes during a partition.

  • Client X writes balance = balance - 100 on Primary A.
  • Client Y writes balance = balance - 100 on Primary B.

After healing, you attempt to "merge."

Pause and think

What's the best case outcome?

  1. One side's writes are discarded (data loss).
  2. Both writes are applied (overdraft/double spend).
  3. You can always merge automatically.

Progressive reveal — answer

Best case is usually (1): discard one side. Many systems choose a winner and roll back the loser's divergent history.

  • (2) can violate invariants (money, inventory, uniqueness).
  • (3) is only possible if the data type and operations are designed for merge (CRDTs), and even then, not all invariants can be preserved.

Analogy

It is like two accountants editing the same ledger offline. When they reconnect, you cannot "merge" two conflicting realities without deciding whose edits count.

Failure blast radius

Split brain can cause:

  • Lost acknowledged writes (client got 200 OK, later rolled back).
  • Uniqueness violations (two users created with same ID).
  • Monotonicity breaks (time/order-based logic fails).
  • External side effects duplicated (emails sent twice, payments charged twice).

Key insight

Split brain is not just inconsistent reads. It is inconsistent decisions — often irreconcilable without losing correctness.

Production insight: the hardest part is external side effects

Even if you can reconcile internal state, you often cannot "un-send" an email or "un-charge" a card. Treat split brain primarily as a side-effect control problem.

Challenge question

List one invariant in your system that would be catastrophic to violate (e.g., "charge card at most once"). Could split brain violate it?


The mechanics — how split brain happens in clusters

Scenario

A typical leader-based cluster uses:

  • heartbeats,
  • timeouts,
  • membership changes,
  • leader election.

Under a partition, each side observes missing heartbeats.

Think about it

Why would the minority side ever elect a leader?

  • Because it cannot distinguish "leader is unreachable" from "leader is dead."
  • Because election rules may be based on local observations rather than quorum.

Mental model: "Local suspicion" vs "Global agreement"

  • Local suspicion: "I did not hear from leader; therefore leader is dead."
  • Global agreement: "A majority agrees leader is dead and agrees on the next leader."

Split brain occurs when a system allows local suspicion to trigger authoritative action.

[IMAGE: Diagram showing 5 nodes split into partitions (3 and 2). In the unsafe case, both partitions elect leaders and accept writes. In the safe quorum case, only the majority partition elects a leader; minority becomes read-only or unavailable.]

Key insight

The fix is almost always: make "authority to write" depend on quorum, not on local timeouts alone.

Network assumption (state it explicitly)

Assume an asynchronous network: messages can be delayed, dropped, duplicated, and reordered; clocks can drift; and you cannot reliably distinguish a slow node from a dead node.

Challenge question

Does your leader election require a majority (quorum) of current membership? Or can a minority self-elect?


Decision Game: Which statement is true?

Choose exactly two that are true.

  1. Split brain is impossible if you have TCP.
  2. Split brain can happen without a clean network partition (e.g., GC pauses).
  3. Quorum-based leader election prevents split brain but may reduce availability.
  4. If you use a load balancer, split brain cannot happen.
  5. Split brain is only a database problem.

Pause and decide

Write down your picks.

Reveal

True: (2) and (3).

  • (2) Long GC pauses, CPU starvation, or packet loss can create effective partitions (timeouts trigger elections). Also asymmetric reachability (A can reach B but not vice versa) can mimic partitions.
  • (3) Quorum protects safety but sacrifices availability during partitions (CAP).

False:

  • (1) TCP does not prevent partitions; it only provides ordered reliable delivery when a path exists.
  • (4) Load balancers can mask failures but can also worsen them; they do not solve coordination.
  • (5) Any system with leadership/ownership can split brain: caches, schedulers, lock managers, message brokers.

Key insight

Split brain is a coordination problem, not a transport problem.


CAP trade-off framing (availability vs safety)

Scenario

You operate a 3-node cluster across 2 availability zones.

A zone outage partitions the cluster. You must decide:

  • continue serving writes in both zones (high availability), or
  • allow only quorum side to write (safety).

Pause and think

During the outage, which is worse?

A) Reject writes (downtime) B) Accept writes that you might later roll back (data loss)

Mental model: CAP, but with "business invariants"

Under a partition, you cannot simultaneously guarantee:

  • Safety/consistency: one authoritative history for operations that must not conflict.
  • Availability: every request gets a non-error response.

Split brain is what happens when you choose availability for non-mergeable operations.

Consistency model clarification

When we say "consistency" here, we mean linearizability / single-writer safety for the operations in question (not merely "replicas eventually converge").

Restaurant analogy

If both managers keep taking payments, you are "available," but your ledger becomes untrustworthy.

Key insight

The real decision is: do you prefer being down, or being wrong?

Challenge question

Name one operation that must be "never wrong" (payments) and one that can be "eventually right" (analytics counters). You may need different strategies per operation.


Failure scenarios that trigger split brain (beyond obvious partitions)

1) Classic network partition

  • Switch failure, misconfigured ACL, BGP flap.

2) Asymmetric reachability

  • Node A can reach B, but B cannot reach A.
  • Often due to firewall state, NAT, MTU issues.

3) Gray failures

  • Packet loss, high latency, jitter.
  • Heartbeats are delayed; timeouts fire.

4) Stop-the-world pauses

  • JVM GC pause, kernel stalls, VM pause.
  • Node appears dead to others.

5) Clock issues (lease-based leaders)

  • If leadership is tied to time leases, clock skew can cause two nodes to believe they hold valid leases.

6) Control-plane/data-plane split

  • The consensus layer is healthy, but the application continues to accept writes because it caches leadership state or bypasses quorum checks.

Interactive matching exercise

Match the symptom to the likely root cause:

SymptomRoot cause options
Heartbeats occasionally missing, then recoveringA) Gray failure B) Hard partition C) Crash
One node sees everyone; everyone sees node as deadA) Asymmetric reachability B) Clock skew C) Disk full
Two leaders appear after a 30s JVM pauseA) GC pause B) BGP flap C) Operator error

Pause and match.

Answers:

  • Missing then recovering -> A (Gray failure)
  • One-way visibility -> A (Asymmetric reachability)
  • Two leaders after pause -> A (GC pause)

Key insight

Many split brain incidents start as performance problems that look like partitions to failure detectors.

Production insight: tune failure detectors to your tail latency

Election/heartbeat timeouts must be set with awareness of:

  • worst-case GC pauses,
  • noisy-neighbor CPU throttling,
  • network tail latency during incidents.

But remember: tuning reduces false failovers; it does not provide correctness.

Challenge question

What are your heartbeat timeouts? Are they tuned for your worst-case GC pause / network jitter?


The core mitigation families (overview)

We will group mitigations into four families:

  1. Quorum / consensus (Raft/Paxos/Zab): only majority can make progress.
  2. Fencing: even if two think they are leader, only one can act (STONITH, fencing tokens).
  3. Leases: time-bounded leadership, ideally backed by a quorum store.
  4. Mergeable data: design state to converge (CRDTs) and accept multi-writer.

[IMAGE: A 2x2 grid: Single-writer vs Multi-writer on one axis, Mergeable vs Non-mergeable invariants on the other. Place quorum/consensus in single-writer/non-mergeable, CRDT in multi-writer/mergeable, etc.]

Key insight

You mitigate split brain either by preventing dual authority, or by making dual authority safe.

Challenge question

Which quadrant is your system in: single-writer with strict invariants, or multi-writer with mergeable semantics?


"Just increase timeouts"

Scenario

An incident occurs: two leaders appear during a period of high latency. Someone says: "Let's double the election timeout."

Pause and think

Will increasing timeouts eliminate split brain?

  • It can reduce false positives.
  • It cannot eliminate partitions.
  • It may increase failover time.

Explanation

Timeout tuning is like telling restaurant staff: "Wait longer before assuming the other manager is gone."

That helps when the phone line is noisy — but if the phone line is truly cut, waiting longer only delays the inevitable decision.

Key insight

Timeouts are a stability knob, not a correctness guarantee.

Challenge question

What is your maximum tolerated failover time? How does it relate to your timeout settings?


Quorum and majority — the "only one side can be right" rule

Scenario

You have N nodes. A quorum is typically floor(N/2) + 1.

  • In a partition, at most one partition can have a majority.
  • Therefore, at most one partition can elect a leader (if elections require quorum).

Mental model

Think of quorum as a single shared cash register key that requires a majority of staff to turn. Only one group can assemble enough people to turn it.

Interactive exercise

For each cluster size, compute quorum:

  • N=3 -> ?
  • N=4 -> ?
  • N=5 -> ?
  • N=6 -> ?

Pause.

Answer:

  • 3 -> 2
  • 4 -> 3
  • 5 -> 3
  • 6 -> 4

Trade-off table

NQuorumTolerates failures (crash/partition minority)Write availability under 50/50 split
321One side continues (2 nodes)
431Neither side has 3 -> outage
532One side continues (3 nodes)
642Neither side has 4 -> outage

Think about it: even-sized clusters can be awkward for partitions.

Key insight

Quorum prevents split brain by ensuring intersection: any two majorities share at least one node, preventing two leaders from being independently confirmed.

Challenge question

Why do many consensus clusters prefer odd numbers of voters?


Deep dive — how consensus protocols prevent split brain

Scenario

You are using Raft (or Paxos-family). How does it stop two leaders?

Mental model: terms and voting (Raft)

  • Time is divided into terms.
  • A leader is elected for a term by receiving votes from a majority.
  • A node votes at most once per term.

Because majorities intersect, two candidates cannot both get a majority in the same term.

Pause and think

But what if partitions happen across terms — could you get two leaders in different terms at the same time?

Progressive reveal — answer

You can temporarily have a leader in the old term that has not learned it lost leadership yet. But:

  • it cannot commit new entries without a majority,
  • clients should only consider writes durable after commit.

So the protocol preserves safety, even if leadership perception is briefly inconsistent.

[CODE: python, context: demonstrate Raft leader election and majority vote check, emphasizing "commit requires majority"]

python

Real-world parallel

A corporate policy: a new CEO is valid only if a majority of board members sign. Two CEOs cannot both get majority signatures from the same board.

Key insight

Consensus does not prevent partitions; it prevents unsafe progress without quorum.

Challenge question

In your system, do clients treat "leader accepted write" as success, or "quorum committed write" as success?


"Leader election alone prevents split brain"

Scenario

A system has leader election but replication is asynchronous and commits are local.

Leader A accepts a write, responds success, then crashes before replication.

Pause and think

Is that split brain?

Not exactly — but it is a safety gap that often appears alongside split brain.

Explanation

Leader election prevents two leaders, but if durability/commit semantics are not quorum-based, you can still lose acknowledged writes.

Key insight

To mitigate split brain and protect acknowledged writes, you need quorum-based commit, not just quorum-based election.

Production insight: define what an ACK means

Document (and test) whether a successful write means:

  • written to leader memory,
  • fsynced on leader,
  • replicated to 1 follower,
  • replicated+fsynced to a quorum.

Ambiguity here is a common root cause of "we thought we were safe" incidents.

Challenge question

What is your replication mode: async, semi-sync, quorum commit? What does "ACK" mean?


Fencing — when you cannot trust membership, block the loser from acting

Scenario

You have an active-passive setup with shared storage (SAN, EBS, NFS). Two nodes might mount and write to the same volume during a partition.

That is catastrophic: filesystem corruption.

Mental model: "The bouncer at the door"

Even if two managers claim they are in charge, the bouncer only lets one into the cash room.

Fencing ensures the old leader cannot continue making changes.

Types of fencing

  1. STONITH (Shoot The Other Node In The Head): power off the other node via IPMI/iLO/cloud API.
  2. Storage fencing: revoke disk access (SCSI-3 PR, SAN zoning).
  3. Fencing tokens / epochs: monotonic tokens; only highest token can write.

Interactive question

Which is safer for protecting shared storage?

A) "If I can ping the other node, I will not write." B) "I will only write if I hold a fencing token from a quorum service."

Answer: B. Ping checks are not authoritative.

[IMAGE: Diagram showing two nodes connected to shared disk. A fencing service issues a token/lock; only token holder can write. Without fencing, both write and corrupt disk.]

Key insight

Fencing is about preventing side effects from the wrong leader, even if it still believes it is leader.

Failure scenario to plan for

  • Old leader is alive but isolated.
  • It continues to run scheduled jobs / write to storage / call payment APIs.

If you cannot reliably stop it via protocol, you must stop it via infrastructure (power/network/storage).

Challenge question

What side effects in your system are irreversible (emails, payments, disk writes)? Do you have fencing for them?


Leases — time-bounded leadership (and where they go wrong)

Scenario

You implement leader leases:

  • leader is valid until lease_expiry.
  • leader renews periodically.

If leader cannot renew (partition), it should stop.

Pause and think

What assumption do leases rely on?

  • Bounded clock skew (or a design that avoids trusting local clocks).
  • Bounded message delay for renewal (or conservative expiry).
  • Correct lease authority.

Clarification: leases are subtle

A lease is safe only if no two nodes can both hold a valid lease at the same time.

That typically requires:

  • a single, strongly consistent lease authority (quorum/consensus), and
  • a time model that prevents overlapping leases (e.g., server-side time, monotonic clocks, or conservative bounds).

Where leases fail dangerously

  • clocks drift or jump (NTP step adjustments, VM suspend/resume),
  • renewal is based on a non-quorum store,
  • the leader ignores lease expiry under load,
  • GC pauses prevent timely renewal but the process continues acting.

Safer lease pattern

Use a quorum-backed lease store (e.g., etcd/Consul/ZooKeeper) where lease grants/renewals require majority.

[CODE: javascript, context: acquire leadership via CAS + TTL — demo only]

javascript

Key insight

Leases mitigate split brain only when lease authority is itself split-brain-safe (quorum) and clients/servers obey expiry.

Challenge question

If your leader's clock jumps forward by 60 seconds, what happens to lease logic?


[MISCONCEPTION] "A distributed lock is enough"

Scenario

You use a "lock service" to ensure one leader.

But the lock service is:

  • single-node Redis,
  • or a DB row lock in a primary that can fail over,
  • or a non-quorum cache.

Pause and think

Can the lock service itself split brain?

Yes. If the lock authority is not strongly consistent, you moved the problem.

Explanation

A lock is only as good as:

  • the correctness of the lock provider,
  • fencing semantics (what happens to the old lock holder?),
  • clock assumptions.

Production insight: prefer locks that also provide fencing

If you must use a lock, prefer one that yields a monotonic fencing token (e.g., an increasing revision/epoch) that downstream systems can validate.

Real-world parallel

A single key cabinet helps only if the cabinet is guarded. If two copies of the key exist, you are back to split brain.

Key insight

Do not build safety on a component that can itself become inconsistent under partition.

Challenge question

Is your lock provider CP (quorum/consensus) or AP (eventual)? What does it guarantee under partition?


Witnesses, tie-breakers, and arbitrators (when you cannot afford a full quorum)

Scenario

You have 2 data centers (DC1, DC2) and want active-passive DB.

If the link between DCs fails, both sides might promote.

A common mitigation: add a witness in a third location.

Mental model: "The neutral cashier"

When two managers disagree, a neutral cashier decides who can close the register.

Witness patterns

  • 3rd node (lightweight) participates in quorum.
  • Arbitrator service: can grant permission to be primary.

Trade-offs

  • Witness must be reliable and reachable.
  • Latency to witness affects failover.
  • Witness placement must avoid correlated failures.

[IMAGE: Two DCs with 1 node each plus witness in third site. Show that only side with witness forms majority.]

Key insight

A witness breaks 50/50 ties by ensuring one side can form a majority.

Production insight: don’t put the witness on the same failure plane

Avoid placing the witness:

  • in the same region,
  • behind the same ISP,
  • on the same power domain,
  • or in the same Kubernetes cluster as either primary site.

Challenge question

Where would you place a witness to minimize correlated failures (power, network provider, region)?


Application-level fencing — idempotency keys, tokens, and monotonic epochs

Scenario

Even with consensus, you may still have:

  • clients retrying,
  • old leader briefly serving,
  • downstream systems that cannot be rolled back.

Mental model: "Stamped receipts"

Every action gets a stamped receipt number. Downstream accepts only the newest stamp.

Techniques

  1. Leader epoch in requests

    • Leader includes (term, leaderId) with every write.
    • Followers/downstream reject stale epochs.
  2. Fencing tokens

    • Monotonic integer token from quorum store.
    • Only requests with highest token are accepted.
  3. Idempotency keys

    • For external side effects: Idempotency-Key: <uuid>
    • Prevent double charge/email.

[CODE: python, context: fencing token enforcement for a single-writer boundary]

python

Key insight

Even perfect leader election does not stop stale leaders from causing side effects unless you fence at the boundary.

Challenge question

Pick one irreversible side effect in your system. How would you make it idempotent and/or fenced?


When multi-writer is okay — CRDTs and convergent design

Scenario

You run a globally distributed cache of "likes" counts.

If split brain occurs, you might accept increments on both sides and merge later.

Pause and think

Which operations are naturally mergeable?

  • increment counters
  • add to a set
  • last-write-wins register
  • bank transfer

Reveal

Mergeable (with care): counters, sets, some registers. Not mergeable without extra constraints: bank transfers.

Mental model: "Tally sheets"

Each partition keeps its own tally sheet. When they meet, they add tallies.

Trade-off

CRDTs provide availability under partition but:

  • may violate invariants that require coordination,
  • can grow metadata,
  • complicate deletes (tombstones).

[IMAGE: Two partitions maintain a G-Counter per node; after heal, counters merge by max per component then sum.]

Key insight

If you can design state to converge, split brain becomes less catastrophic — but not all domains allow it.

Performance consideration

CRDTs often trade coordination for:

  • larger payloads (metadata),
  • more expensive merges,
  • background compaction/tombstone GC.

Challenge question

Could any part of your system be redesigned as convergent data (CRDT) to reduce coordination needs?


Real-world systems and how they handle split brain

ZooKeeper / etcd / Consul (CP coordination)

  • Use quorum consensus.
  • Minority partition becomes unavailable.

Kafka (controller / KRaft)

  • Controller elected via quorum (KRaft) to avoid multiple controllers.
  • Producers can use idempotence and transactions to mitigate duplicates.

Redis Sentinel / Redis Cluster

  • Sentinel uses quorum of sentinels to failover, but misconfigurations can still cause split brain.
  • Need careful placement and quorum.

Elasticsearch (historical)

  • Older versions needed minimum_master_nodes to prevent split brain.
  • Modern versions auto-manage voting configuration, but the principle remains: quorum.

Kubernetes

  • etcd is the source of truth; if etcd loses quorum, the control plane becomes read-only/unavailable.

Key insight

Mature systems pick safety: no quorum, no writes.

Challenge question

Which component in your stack is the source of truth? Is it protected by quorum?


"Split brain only happens during failover"

Scenario

You do a rolling upgrade.

  • Node A is restarted.
  • Node B experiences latency spikes.
  • Health checks flap.

Suddenly, leader changes repeatedly; clients see inconsistent behavior.

Explanation

Split brain risk increases during:

  • maintenance,
  • topology changes,
  • scaling events,
  • network reconfiguration.

It is not just big outages. It is also small instabilities.

Key insight

Operational churn is a common trigger for coordination bugs.

Production insight: membership changes are a high-risk operation

In consensus systems, membership changes (adding/removing voters) must be controlled and ideally use safe reconfiguration mechanisms (e.g., Raft joint consensus). Avoid auto-scaling voters.

Challenge question

Do you have a change freeze policy for quorum components? Do you upgrade them one at a time with health verification?


Observability — how to detect split brain early

Scenario

You suspect split brain but logs are noisy.

Signals to instrument

  • Leader identity (node ID plus term/epoch) exported as a metric.
  • Number of leaders observed per time window.
  • Quorum status: can node reach majority?
  • Commit index lag (Raft) / zxid (ZK) divergence.
  • Client write acknowledgments vs committed writes.

[IMAGE: Dashboard mock: leader_id metric across nodes; in split brain you see two distinct leader IDs simultaneously.]

Interactive question

Which alert is more actionable?

A) "CPU high on node 3" B) "Two distinct leader epochs active for more than 10s"

Answer: B. It is directly tied to safety.

Key insight

Alert on safety invariants, not just resource usage.

Production insight: log the epoch on the write path

Include (term/epoch, leaderId) in:

  • server logs for every write,
  • client responses,
  • downstream side-effect requests.

This makes incident triage dramatically faster.

Challenge question

Do you log leader epoch with every write request? If not, can you add it?


Runbooks — what to do during a suspected split brain

Scenario

Pager goes off: "Two primaries detected."

Pause and think

What is your first objective?

  • Restore availability?
  • Preserve data correctness?

In most systems: preserve correctness.

A practical runbook outline

  1. Stop writes (freeze) on at least one side.
  2. Identify authoritative leader (quorum side, highest term/epoch, freshest commit index).
  3. Fence the losing side (power off, revoke credentials, block network) if needed.
  4. Assess divergence: what writes occurred on losing side?
  5. Decide recovery strategy:
    • discard losing side writes,
    • replay from clients,
    • manual reconciliation.
  6. Postmortem: why did minority become writable?

[IMAGE: Flowchart runbook: detect -> freeze -> determine quorum leader -> fence -> reconcile -> restore.]

Key insight

During split brain, speed matters — but correctness-first matters more.

Failure scenario to explicitly cover in the runbook

  • Partial fencing failure: you attempted to fence the loser but the action failed (cloud API rate limit, IPMI unreachable). Your runbook should include verification steps and fallback actions.

Challenge question

Do you have a big red button to make a partition read-only? Who is authorized to press it?


Pick a mitigation for each system

Scenario

You have three systems:

  1. A payment ledger (strict invariants).
  2. A social likes counter.
  3. A job scheduler that must not run the same job twice.

Pick the best primary mitigation:

A) CRDT multi-writer B) Quorum consensus plus fencing of side effects C) Increase timeouts D) Single-node primary with async replicas (no quorum)

Pause and pick

Reveal

  1. Payment ledger -> B
  2. Likes counter -> A (or B if you want strictness, but A is typical)
  3. Job scheduler -> B (plus idempotent job execution)

Key insight

Mitigation choice depends on invariants: strict domains need quorum plus fencing; soft domains can use convergent design.

Challenge question

What category is your system closest to, and why?


Configuration pitfalls that reintroduce split brain

Scenario

You deploy a consensus cluster but misconfigure it.

Common pitfalls

  • Wrong quorum size (e.g., Elasticsearch minimum_master_nodes mis-set historically).
  • Too few voters (2-node cluster without witness).
  • Auto-scaling quorum nodes (membership churn).
  • Shared fate: all quorum nodes in one rack/AZ.
  • Using non-durable disks for consensus logs.
  • Health checks that restart nodes aggressively (causing repeated elections).

Interactive checklist (pause and check your system)

  • Do I have an odd number of voters?
  • Are voters spread across failure domains?
  • Is membership change controlled (not auto-scaled randomly)?
  • Do I have fencing for shared resources?
  • Do clients require quorum commit for success?

Key insight

Split brain is often an ops/config bug, not a protocol bug.

Challenge question

Which of the checklist items is currently the weakest in your environment?


Designing client behavior to survive leadership changes

Scenario

Even if the cluster is safe, clients can amplify split brain symptoms.

Examples:

  • clients cache leader address too long,
  • retries go to old leader,
  • write timeouts cause duplicate submissions.

Mental model: "Customers who keep paying the wrong cashier"

If customers keep paying a cashier after management changed, you get accounting confusion.

Client-side patterns

  • Discover leader via a strongly consistent endpoint.
  • Include leader epoch in responses and require monotonicity.
  • Idempotent retries with unique request IDs.
  • Backoff and jitter to avoid thundering herds during elections.

[CODE: javascript, context: idempotent client retry with request ID + exponential backoff]

javascript

Key insight

Split brain mitigation is end-to-end: servers, clients, and downstream side effects.

Performance consideration

Retries during elections can create a thundering herd. Use:

  • exponential backoff,
  • jitter,
  • and circuit breakers when quorum is lost.

Challenge question

Do your write APIs support idempotency keys? If not, which endpoint should be first?


Trade-offs cheat sheet (comparison table)

MitigationPrevents dual leaders?Prevents wrong side effects?Availability under partitionComplexityTypical use
Quorum consensus (Raft/Paxos)YesIndirectly (needs fencing)Reduced (minority stops)HighDatabases, coordination
Witness / arbitratorYes (for 2-site)IndirectlyBetter than 2-nodeMediumActive-passive across DCs
STONITH / fencingNot by itselfYesDependsMedium-HighShared storage, HA pairs
Lease (quorum-backed)UsuallyPartiallyReducedMediumLeader election, locks
CRDT / convergent designNo (allows multi-writer)Depends on domainHighMediumCounters, sets, collaboration
Increase timeoutsNoNoSometimes higherLowStability tuning only

Key insight

There is no free lunch: safety, availability, and operational complexity trade off. Choose based on invariants.

Challenge question

Which row matches your current approach, and what risk are you accepting?


Progressive case study — a split brain incident timeline

Scenario

A 3-node cluster: A, B, C. Quorum=2.

At t=0, A is leader.

  1. t=10: network between A and (B,C) becomes lossy.
  2. t=12: B and C stop hearing A; election starts.
  3. t=13: B becomes leader (term 9) with votes from B and C.
  4. t=14: A still thinks it is leader (term 8) and continues serving writes because the application only checks "am I leader locally?"

Pause and think

Where is the bug?

  • Consensus layer is fine.
  • Application layer is not fenced: A should stop accepting writes when it loses quorum/lease.

Fix

  • Require commit confirmation from quorum before ACK.
  • Gate writes on "still have quorum" checks.
  • Include epoch in RPC and reject stale.

Key insight

Split brain often appears as a layering violation: the app bypasses the consensus safety boundary.

Challenge question

Do your write handlers check "leadership plus quorum" or only "leadership flag"? Where is that enforced?


Designing degraded mode instead of full outage

Scenario

You want safety but also want some availability.

Options

  • Read-only mode for minority partition.
  • Stale reads with explicit staleness indicators.
  • Local queueing of writes (accept but do not commit) — dangerous unless clearly communicated.

Restaurant analogy

During the phone outage, one side can still seat customers and take orders, but cannot finalize payments until the cashier system is back.

Key insight

A safe degraded mode is often: reads yes, writes no — or writes buffered but not acknowledged as final.

Production insight: be explicit in APIs

If you buffer writes, return a response that clearly indicates:

  • the write is not committed, and
  • it may be rejected later.

Otherwise you create "acknowledged but lost" writes — often worse than a 503.

Challenge question

Can your product tolerate read-only mode during partitions? If not, what data types could be made mergeable?


Final Synthesis Challenge: Design your split-brain strategy

Scenario

You are designing a globally deployed service with:

  • user profiles (must not lose updates),
  • login sessions (can be eventually consistent),
  • billing events (must be exactly-once),
  • analytics counters (approx ok).

A partition can split regions for 15 minutes.

Your task (pause and write)

For each subsystem, choose a strategy:

  1. User profiles: (quorum / CRDT / async primary / other)
  2. Login sessions: (quorum / CRDT / cache / other)
  3. Billing events: (quorum plus fencing plus idempotency / other)
  4. Analytics counters: (CRDT counters / stream processing / other)

Also answer:

  • Where do you put your quorum nodes?
  • Do you add a witness? Where?
  • What do clients do on leader change?

Suggested solution (compare after you decide)

  • User profiles: quorum consensus replicated store (or strongly consistent per-shard leaders) with quorum commit.
  • Login sessions: AP cache or CRDT set with TTL; tolerate duplicates.
  • Billing events: quorum plus fencing tokens plus idempotency keys; never acknowledge without durable commit.
  • Analytics counters: CRDT G-Counter / PN-Counter or append-only logs merged later.

Key insight

The best split brain mitigation is not one technique — it is a portfolio, aligned to invariants.


Appendix: Quick glossary

  • Split brain: multiple partitions act as primary simultaneously.
  • Quorum: majority required to make authoritative decisions.
  • Fencing: preventing stale leaders from acting.
  • Lease: time-bounded authority.
  • Witness: tie-breaker node/service.
  • CRDT: data type that converges under concurrent updates.

End-of-article challenge question

If you could add only one improvement this week to reduce split brain risk, what would it be: (a) add a witness, (b) add fencing tokens, (c) change commit semantics to quorum, (d) improve observability of leader epochs — and why?


Key Takeaways

  1. Split-brain occurs when network partitions cause two groups of nodes to each elect their own leader — both sides accept writes, causing conflicting data
  2. Quorum-based systems prevent split-brain — a leader needs a majority (N/2+1), and only one majority can exist at a time
  3. Fencing prevents deposed leaders from continuing to serve — using fencing tokens, STONITH, or epoch numbers to invalidate stale leaders
  4. Split-brain is the most dangerous failure mode in distributed systems — it can cause data corruption, financial loss, and inconsistency that's difficult to repair
Chapter complete!

Course Complete!

You've finished all 51 chapters of

System Design Advanced

Browse courses
Up next Linearizability and Sequential Consistency
Continue