October 21, 2018: The Day GitHub Went Down
At 22:52 UTC, routine network maintenance severed connectivity between GitHub's US East Coast data center, which housed the primary MySQL databases, and the rest of its infrastructure for 43 seconds. The failover system, Orchestrator, correctly detected that the primary was unreachable and promoted a replica in a different data center to be the new primary.
The problem: both the old primary and the new primary accepted writes during the partition. When the network reconnected 43 seconds later, GitHub had two divergent copies of its database: writes from users on the old primary conflicted with writes accepted by the new primary. This is the classic split-brain scenario, and it took 24 hours and 11 minutes to fully resolve.
GitHub's post-incident analysis: "This resulted in an inconsistency between our MySQL primary and replicas that left us unable to confidently restore service until we had resolved data integrity issues."
Why Did 43 Seconds Cause 24 Hours of Pain?
MySQL uses asynchronous replication by default, so the old primary had already acknowledged writes that had not yet reached the new primary when the partition began. The new primary then accepted its own writes during those 43 seconds. When GitHub's engineers reconnected the data centers, they faced three classes of conflict (made concrete in the sketch after this list):
- Conflicting auto-increment IDs: both primaries assigned the same IDs to different rows.
- Lost writes: some writes on the old primary never reached the new primary.
- Inconsistent foreign keys: a commit might reference a repository that doesn't exist on the other primary.
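To make the first two failure modes concrete, here is a minimal Python simulation of two primaries assigning auto-increment IDs independently during a partition. The table contents and IDs are invented for illustration; this is not GitHub's schema.

```python
# Minimal simulation of split-brain writes: both "primaries" start from the
# same replicated state and assign auto-increment IDs independently.

class Primary:
    def __init__(self, name, next_id):
        self.name = name
        self.next_id = next_id  # auto-increment counter
        self.rows = {}          # row id -> payload

    def insert(self, payload):
        row_id = self.next_id
        self.next_id += 1
        self.rows[row_id] = payload
        return row_id

# Before the partition, both sides agree the next auto-increment ID is 101.
east = Primary("old-primary-east", next_id=101)
west = Primary("new-primary-west", next_id=101)

# During the partition, each side accepts writes independently.
east.insert("issue comment from user A")  # assigned ID 101 on east
west.insert("pull request from user B")   # also assigned ID 101 on west

# After the partition heals, the same ID names two different rows,
# and each side also holds writes the other never saw.
for row_id in east.rows.keys() & west.rows.keys():
    print(row_id, repr(east.rows[row_id]), "!=", repr(west.rows[row_id]))
# 101 'issue comment from user A' != 'pull request from user B'
# Neither row can simply be discarded: both are legitimate user data.
```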
GitHub couldn't just pick one primary and discard the other's writes, because both held legitimate user data. Engineers had to reconcile millions of rows across dozens of database clusters by hand, careful work that took more than 24 hours.
Figure 1: A 43-second network partition caused both data centers to accept writes independently. The resulting data inconsistencies took over 24 hours to resolve.
What GitHub Changed Afterward
- Improved Orchestrator configuration: added guardrails to prevent failover during short partitions. A 43-second blip should not trigger promotion.
- Better replication monitoring: track replication lag in real time and block failover if the lag exceeds a threshold.
- Runbook improvements: documented the exact steps for split-brain recovery so future incidents don't require 24 hours of ad-hoc work.
- Moved toward MySQL semi-synchronous replication: the primary waits for at least one replica to acknowledge receipt of each transaction before confirming the commit to the client (sketched after this list).
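That last change is the most protocol-level of the four. The sketch below models the difference between asynchronous and semi-synchronous acknowledgment in Python. It is a simplified illustration, not MySQL's implementation, which ships binlog events through the semi-sync plugin and is controlled by system variables such as rpl_semi_sync_master_enabled and rpl_semi_sync_master_timeout.

```python
# Simplified model of async vs. semi-sync commit acknowledgment.
# Real MySQL semi-sync waits for a replica ACK at the binlog level;
# this sketch only models the ordering of the acknowledgment.

import queue
import threading

class Replica:
    def __init__(self):
        self.log = []
        self.inbox = queue.Queue()
        threading.Thread(target=self._apply_loop, daemon=True).start()

    def _apply_loop(self):
        while True:
            event, ack = self.inbox.get()
            self.log.append(event)
            ack.set()  # signal the primary that we received the event

def commit_async(replica, event):
    # Asynchronous: acknowledge the client immediately and ship the event
    # in the background. If the primary dies now, the write can be lost.
    ack = threading.Event()
    replica.inbox.put((event, ack))
    return "committed"  # client sees success before any replica has the write

def commit_semi_sync(replica, event, timeout=1.0):
    # Semi-synchronous: block until at least one replica acknowledges
    # receipt, so a promoted replica is not missing this write.
    ack = threading.Event()
    replica.inbox.put((event, ack))
    if not ack.wait(timeout):
        raise TimeoutError("no replica ACK within timeout")
    return "committed"

replica = Replica()
print(commit_async(replica, "write 1"))      # may not be on the replica yet
print(commit_semi_sync(replica, "write 2"))  # guaranteed on the replica
```

One real-world caveat: if no replica acknowledges within the configured timeout, MySQL's semi-sync falls back to asynchronous replication rather than blocking writes indefinitely, so the guarantee weakens under a sustained replica outage.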
The Lesson: Failover Is the Hardest Problem
The irony: GitHub's failover system worked exactly as designed. It detected an unreachable primary and promoted a replica. The design was correct for a genuine primary failure. But this wasn't a failure — it was a 43-second maintenance blip. The failover system couldn't distinguish between "primary is dead" and "primary is temporarily unreachable."
Every distributed system that uses leader-follower replication faces this dilemma: fail over too fast and you get split-brain. Fail over too slow and you get prolonged downtime. There is no perfect timeout.
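A toy detector makes the dilemma measurable. The sketch below is an invented health-check model, not Orchestrator's actual algorithm: one reachability check per second, with promotion only after a sustained run of failures.

```python
# Pure simulation of a failover detector: one health check per second,
# promote only after the primary has been down for `failure_threshold`
# consecutive checks. The threshold itself is the tradeoff.

def should_promote(health_checks, failure_threshold):
    """health_checks: iterable of booleans, one per second (True = reachable).
    Returns the second at which promotion fires, or None if it never does."""
    down_for = 0
    for t, alive in enumerate(health_checks):
        down_for = 0 if alive else down_for + 1
        if down_for >= failure_threshold:
            return t
    return None

# GitHub's scenario: a 43-second blip, then the primary comes back.
blip = [False] * 43 + [True] * 60
print(should_promote(blip, failure_threshold=30))   # 29 -> promotes mid-blip: split-brain
print(should_promote(blip, failure_threshold=60))   # None -> blip correctly ridden out

# The same "safe" 60-second threshold against a genuine crash:
crash = [False] * 600
print(should_promote(crash, failure_threshold=60))  # 59 -> a full minute of write
# downtime before recovery even begins. At second 30, a blip and a crash look
# identical to the detector, so no threshold wins both cases.
```

Raising the threshold trades split-brain risk for guaranteed extra downtime on every real failure, which is why production systems also layer on fencing, such as forcing the old primary read-only before promotion, rather than relying on a timeout alone.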