The Split-Brain Problem: When Your Cluster Forgets It's One System

2026-05-23

You've built a redundant system. Two nodes, leader-follower, automatic failover. The network blips for 30 seconds, the follower can't reach the leader, decides the leader is dead, and promotes itself. Now you have two leaders both accepting writes. When the network heals, you have two divergent histories and no clean way to merge them. Congratulations: you have a split brain.

Split brain happens when a network partition makes nodes disagree about who's in charge. Each side thinks the other is dead. Both make decisions. Reality forks.

The real-world horror story: GitHub's October 2018 outage. Routine maintenance cut connectivity between GitHub's US East Coast network hub and its primary US East Coast data center for 43 seconds. Orchestrator, the tool managing their MySQL topology, responded by failing writes over to the West Coast data center. When connectivity returned, each site held writes the other had never seen. Service was degraded for 24 hours and 11 minutes while they restored from backups, let replicas catch up, and reconciled the divergent data. Queued webhook deliveries and Pages builds took many hours to drain.

How to prevent it:

The odd-number rule of thumb: Always run quorum-based systems with an odd number of nodes. Three tolerates one failure. Five tolerates two. A 2-node cluster is the worst possible choice: losing either node loses quorum, and you can't distinguish "the other node is dead" from "I'm partitioned." Adding a third cheap node (even just an arbiter) takes you from tolerating zero failures to tolerating one.

The quorum math: For N nodes to tolerate F failures, you need N ≥ 2F + 1, because the surviving N - F nodes must still form a majority. Want to survive 2 simultaneous failures? You need 5 nodes minimum. Want to survive 3? You need 7. This is why etcd, Consul, and ZooKeeper docs all push you toward 3-, 5-, or 7-node clusters rather than 4 or 6, which give you no extra tolerance over 3 or 5 but cost more.
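
To make the arithmetic concrete, here is a minimal Python sketch. The function names quorum_size and tolerated_failures are made up for illustration and don't come from etcd, Consul, or ZooKeeper. It only assumes that quorum means a strict majority of the configured cluster size.

```python
def quorum_size(n: int) -> int:
    """Smallest strict majority of an n-node cluster."""
    if n < 1:
        raise ValueError("a cluster needs at least one node")
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many nodes can be lost while a majority is still reachable."""
    return n - quorum_size(n)


if __name__ == "__main__":
    # Print the table for small clusters; even sizes add cost, not tolerance.
    for n in range(1, 8):
        print(f"{n} nodes: quorum {quorum_size(n)}, tolerates {tolerated_failures(n)}")
```

Running it shows the pattern directly: 3 and 4 nodes both tolerate one failure, and 5 and 6 both tolerate two, so the even sizes buy you nothing.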

If you find yourself reasoning "well, the other node is probably dead" — stop. Probability isn't a partition tolerance strategy. Quorum is.
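
As a rough sketch of what "quorum, not probability" looks like in code, the check below only lets a node act as leader when it can count a strict majority of the configured cluster, itself included. The name may_act_as_leader and its parameters are hypothetical. Real consensus systems layer terms, votes, and log comparisons on top of this, so treat it as the core rule rather than a complete election.

```python
def may_act_as_leader(cluster_size: int, reachable_peers: int) -> bool:
    """Return True only if this node plus its reachable peers form a strict majority.

    cluster_size is the configured membership, not the number of nodes
    currently visible; that is what stops a minority side from promoting.
    """
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    if not 0 <= reachable_peers < cluster_size:
        raise ValueError("reachable_peers must be between 0 and cluster_size - 1")
    votes = reachable_peers + 1  # count this node itself
    return votes > cluster_size // 2


if __name__ == "__main__":
    # Two nodes, partitioned: each side sees zero peers, so neither promotes.
    print(may_act_as_leader(cluster_size=2, reachable_peers=0))  # False
    # Three nodes split 2|1: only the two-node side may lead.
    print(may_act_as_leader(cluster_size=3, reachable_peers=1))  # True
    print(may_act_as_leader(cluster_size=3, reachable_peers=0))  # False
```

Note the trade-off in the two-node case: the cluster stops accepting writes during the partition, but it never forks. The isolated node doesn't guess whether its peer is dead; it simply lacks the votes.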

Key Takeaway: Two nodes can never safely decide who's leader during a partition — use an odd number, require majority quorum, and fence the loser before the winner acts.
