Split-Brain Problem Explained: When Distributed Systems Disagree on Who Is in Charge
How split-brain occurs in distributed systems — causes, consequences, fencing tokens, STONITH, quorum-based prevention, and real-world outage examples.
Split-brain is a failure mode in distributed systems where a network partition causes two or more subsets of nodes to independently believe they are the active primary, leading to conflicting operations and potential data corruption.
What It Really Means
Consider a database cluster with a primary node and a standby. The standby monitors the primary's health via heartbeats. If the primary goes silent — not because it crashed, but because the network between them is broken — the standby assumes the primary is dead and promotes itself to primary. Now you have two nodes, both accepting writes, both believing they are the legitimate primary. This is split-brain.
The damage from split-brain can be catastrophic. Both primaries accept conflicting writes. An inventory system might sell the same item twice. A banking system might process conflicting transfers. When the partition heals and the two primaries discover each other, reconciling their divergent states may be impossible without data loss.
Split-brain is the reason that distributed systems engineers lose sleep. It is a failure mode that does not show up in unit tests, rarely occurs during normal operation, and causes maximum damage when it does occur. Every high-availability system must have a split-brain prevention strategy.
How It Works in Practice
How Split-Brain Occurs
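The typical sequence: the network link between primary and standby fails while both machines keep running; the standby's heartbeat checks time out; the standby, unable to distinguish a crashed primary from an unreachable one, promotes itself; and both nodes now accept writes from the clients on their side of the partition. A minimal sketch of this naive failover logic (all names hypothetical) shows where it goes wrong:

```python
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds of silence before the primary is presumed dead

class NaiveStandby:
    """Standby that promotes itself on heartbeat timeout alone:
    the classic recipe for split-brain."""

    def __init__(self) -> None:
        self.role = "standby"
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self) -> None:
        self.last_heartbeat = time.monotonic()

    def check_primary(self) -> None:
        if time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT:
            # BUG: silence may mean a network partition, not a crash.
            # If the old primary is alive on the other side, there are
            # now two nodes accepting writes.
            self.role = "primary"
```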
Real-World Split-Brain Incidents
GitHub (2012): A network partition caused MySQL replicas to disagree on which node was the primary. The failover system promoted a replica that was behind, causing data loss for some repositories.
Elasticsearch clusters: Before version 7, Elasticsearch was notorious for split-brain. With a 3-node cluster and minimum_master_nodes set to 1 (the default), a partition could create two independent clusters, each with its own master. Version 7+ fixed this with a built-in voting-based quorum.
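On pre-7 clusters, the standard guard was to raise this setting to a strict majority by hand; for three master-eligible nodes:

```yaml
# elasticsearch.yml (pre-7.x only; the setting was removed in 7.0)
# Must be (master_eligible_nodes / 2) + 1, recalculated on every resize.
discovery.zen.minimum_master_nodes: 2
```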
Redis Sentinel: If Redis Sentinel loses connectivity to the primary but the primary is still serving clients, Sentinel promotes a replica. Clients connected to the old primary continue writing to it while new clients write to the new primary.
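Redis offers a partial mitigation: the primary can be told to stop accepting writes when it cannot see enough healthy replicas, which shrinks (but does not close) the window for split-brain writes:

```conf
# redis.conf on the primary: refuse writes unless at least one replica
# is connected and its replication lag is at most 10 seconds.
min-replicas-to-write 1
min-replicas-max-lag 10
```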
Prevention Strategy 1: Quorum-Based Fencing
The most reliable prevention: require a majority (quorum) to operate. In a 5-node cluster, a node needs 3 votes to be the primary. During a partition, at most one side has the majority. The minority side cannot elect a primary.
etcd and Raft: Raft requires a majority for leader election. With 5 nodes partitioned into groups of 3 and 2, only the group of 3 can elect a leader. The group of 2 has no leader and refuses writes.
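A minimal sketch of the arithmetic (hypothetical helper names; real Raft layers terms, log comparisons, and randomized election timeouts on top of this):

```python
def majority(cluster_size: int) -> int:
    """Smallest vote count that constitutes a strict majority."""
    return cluster_size // 2 + 1

def can_elect_leader(reachable_nodes: int, cluster_size: int) -> bool:
    # Each side of a partition can only count the nodes it can reach.
    return reachable_nodes >= majority(cluster_size)

# 5-node cluster partitioned 3 / 2: exactly one side keeps a leader.
assert can_elect_leader(3, 5)       # majority side: elects a leader
assert not can_elect_leader(2, 5)   # minority side: refuses writes
```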
Prevention Strategy 2: STONITH (Shoot The Other Node In The Head)
Before a standby promotes itself, it forcibly shuts down the old primary — typically by sending a command to the server's management interface (IPMI/iLO) to power off the machine. This guarantees the old primary cannot continue accepting writes.
Pacemaker/Corosync (Linux HA) uses STONITH as its primary split-brain prevention. If the fencing device is unreachable, the failover is aborted entirely.
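A sketch of the fence-then-promote ordering, assuming a management controller reachable over IPMI (host, credentials, and promote_standby are placeholders; production clusters delegate this to a fencing agent such as Pacemaker's fence_ipmilan):

```python
import subprocess

def fence_old_primary(bmc_host: str, user: str, password: str) -> None:
    """Power off the old primary via its management interface (IPMI)."""
    subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", bmc_host,
         "-U", user, "-P", password, "chassis", "power", "off"],
        check=True,  # raises CalledProcessError if the command fails
    )

def promote_standby() -> None:
    ...  # placeholder: promote the local database to primary

def failover(bmc_host: str, user: str, password: str) -> None:
    try:
        fence_old_primary(bmc_host, user, password)
    except subprocess.CalledProcessError:
        # Fencing failed or the BMC is unreachable: abort the failover.
        # Promoting anyway could create two primaries.
        raise RuntimeError("fencing failed; refusing to promote standby")
    promote_standby()  # safe: the old primary is verifiably powered off
```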
Prevention Strategy 3: Fencing Tokens
A coordination service (ZooKeeper, etcd) issues monotonically increasing tokens. When a new primary is elected, it receives token N+1. The storage system rejects any writes with token N or lower. Even if the old primary (holding token N) is still running, its writes are rejected.
Implementation
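A minimal sketch of the storage-side check (an in-memory stand-in for a real storage service; token issuance by ZooKeeper or etcd is assumed but not shown):

```python
class FencedStore:
    """Key-value store that rejects writes carrying a stale fencing token."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.data: dict[str, str] = {}

    def write(self, token: int, key: str, value: str) -> None:
        if token < self.highest_token:
            # A newer primary holds a higher token; this writer is stale.
            raise PermissionError(f"rejected write with stale token {token}")
        self.highest_token = token
        self.data[key] = value

store = FencedStore()
store.write(33, "k", "from old primary")  # accepted
store.write(34, "k", "from new primary")  # accepted, token advances to 34
try:
    store.write(33, "k", "delayed write from old primary")
except PermissionError as err:
    print(err)  # rejected write with stale token 33
```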
Trade-offs
Fencing tokens are only as strong as their enforcement: every storage system on the write path must understand and check tokens, and the coordination service that issues them becomes a hard dependency.
Prevention Strategies Compared
The table also includes a fourth approach, lease-based leadership, in which the primary holds a time-limited lease and must stop accepting writes once the lease expires unless it has renewed it.
| Strategy | Pros | Cons |
|---|---|---|
| Quorum | No external dependencies | Minority side is unavailable; odd node counts recommended |
| STONITH | Guarantees old primary is stopped | Requires hardware access; fencing failure blocks failover |
| Fencing tokens | Works with any storage | Requires all storage to check tokens |
| Lease-based | Time-bounded; simple | Safety depends on bounded clock drift |
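And a minimal sketch of the lease check on the primary's write path (lease grants from an external coordinator such as etcd are assumed but not shown; safety rests on the clock-drift bound noted above):

```python
import time

class LeasedPrimary:
    """Primary that refuses writes once its leadership lease has expired."""

    def __init__(self, lease_seconds: float = 10.0) -> None:
        self.lease_seconds = lease_seconds
        self.lease_expires_at = 0.0

    def renew_lease(self) -> None:
        # Real systems round-trip to the coordinator here; if this node is
        # partitioned away, renewal fails and the lease quietly lapses.
        self.lease_expires_at = time.monotonic() + self.lease_seconds

    def write(self, store: dict, key: str, value: str) -> None:
        if time.monotonic() >= self.lease_expires_at:
            # The coordinator may have granted a fresh lease elsewhere.
            raise RuntimeError("lease expired; refusing write")
        store[key] = value
```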
Advantages of Split-Brain Prevention
- Data integrity: Prevents conflicting writes and data corruption
- Deterministic behavior: Clear rules for which side operates during partitions
- Automated recovery: Systems can heal without manual intervention
Disadvantages
- Reduced availability: The minority side of a partition becomes unavailable
- Complexity: Fencing mechanisms add operational and engineering complexity
- False positives: Overly aggressive failure detection can trigger unnecessary fencing, causing availability loss
Common Misconceptions
- "Split-brain only happens with two nodes" — Split-brain can occur with any number of nodes if the system does not use quorum-based consensus. Even a 10-node cluster can split into two groups of 5, and without a tiebreaker, both groups might elect a leader.
- "Heartbeats prevent split-brain" — Heartbeats detect node failures but do not distinguish between a dead node and a partitioned one. Acting on a missed heartbeat without quorum can cause split-brain.
- "Having a standby database prevents split-brain" — A standby with automatic failover is the most common CAUSE of split-brain. If the standby promotes while the old primary is still alive, you have two primaries.
- "Split-brain always involves data loss" — Not always. If conflicting writes are to different keys, reconciliation may be straightforward. The damage depends on what was written during the split.
- "Cloud-managed databases do not have split-brain issues" — Managed databases (RDS, Cloud SQL) have their own split-brain prevention mechanisms, but they can still experience brief availability gaps during partition-triggered failovers.
How This Appears in Interviews
Split-brain is a critical topic in distributed systems interviews:
- "How do you prevent two primaries in a database cluster?" — Quorum-based leader election (Raft/Paxos), fencing tokens, or STONITH. Explain why heartbeat-based failover alone is insufficient.
- "Your Redis cluster has two masters. What happened?" — Network partition caused Sentinel to promote a replica while the original primary was still running. Explain the fix: require quorum for promotion, use fencing.
- "Design a highly available system that never corrupts data" — Choose CP during partitions. Use consensus protocols for leader election. Implement fencing tokens for all write operations.
- "What is the difference between split-brain and a network partition?" — A network partition is the cause (network failure). Split-brain is the consequence (two nodes both acting as primary).
See our interview questions on distributed systems for more practice.
Related Concepts
- Leader Election — proper leader election prevents split-brain
- Partition Tolerance — network partitions cause split-brain
- Quorum — majority voting prevents split-brain
- Heartbeat Mechanism — failure detection that can trigger split-brain if mishandled
- Replication — split-brain causes data divergence in replicated systems
- System Design Interview Guide