The Problem That Breaks Everything
Here is the simplest version of the hardest problem in distributed systems: you have five servers. They need to agree on a single value. Maybe it is "who is the leader?" or "what is the next entry in the log?" or "should we commit this transaction?" Sounds trivial, right? Just take a vote.
Except one server is slow. Another just rebooted. A third is responding, but its network packets are arriving out of order. The fourth is fine, but the message it sent to the fifth got dropped entirely. Welcome to consensus — the foundational problem that every distributed database, every coordination service, and every replicated state machine must solve. And it is provably impossible to solve perfectly.
Yet somehow, etcd runs the control plane for every Kubernetes cluster on the planet. CockroachDB handles serializable transactions across continents. TiKV powers the storage layer behind TiDB serving millions of queries per second. They all use consensus algorithms. How?
The Byzantine Generals Problem
In 1982, Leslie Lamport, Robert Shostak, and Marshall Pease published a paper that framed the consensus problem as a military analogy. Several generals are camped around a city. They need to agree on a common plan: attack or retreat. They communicate by messenger. The catch? Some generals might be traitors — they could send conflicting messages to different generals, trying to cause disagreement.
The result was brutal: if there are f traitors among n generals, you need n >= 3f + 1 to reach agreement. With 3 generals and 1 traitor, consensus is impossible. The traitor can tell one general "attack" and the other "retreat," and there is no way for the loyal generals to figure out who is lying.
Most consensus algorithms used in practice — Paxos, Raft, Zab — do not actually handle Byzantine (malicious) failures. They handle crash failures: nodes that stop responding but never lie. This is a reasonable assumption inside a data center where you trust your own hardware. Blockchains are the exception — they solve Byzantine consensus because they must tolerate actively malicious participants.
The Byzantine Generals Problem is not just an academic curiosity. It established the fundamental limits of fault tolerance. Every consensus protocol you use today is either solving the Byzantine version (expensive, slow) or the crash-fault version (fast enough for production).
FLP: The Impossibility Result
In 1985, Fischer, Lynch, and Paterson proved something devastating: no deterministic consensus algorithm can guarantee termination in an asynchronous system where even one node can crash. This is the FLP impossibility result, and it won the Dijkstra Prize.
An asynchronous system is one where there is no bound on how long a message takes to arrive. You cannot tell the difference between a very slow node and a dead one. In such a system, any algorithm that tries to reach consensus can always be kept in an undecided state by an adversarial scheduler that delays messages at just the wrong moments.
So how does anything work? Practical consensus algorithms cheat the FLP result by breaking one of its assumptions. Raft and Paxos use timeouts — they assume that eventually, messages do arrive in a bounded time (partial synchrony). This means they might get stuck temporarily during extreme network delays, but they will make progress once the network stabilizes. They sacrifice guaranteed liveness for guaranteed safety: they might stop deciding, but they will never decide incorrectly.
Paxos: Correct but Painful
Lamport published Paxos in 1998 (written in 1990, but shelved for years because reviewers balked at its whimsical framing as a fictional Greek parliament). Paxos was the first practical crash-fault-tolerant consensus algorithm. It is provably correct. It is also famously difficult to understand and even harder to implement correctly.
The core idea: a proposer sends a proposal with a sequence number to a set of acceptors. Acceptors promise not to accept proposals with lower sequence numbers and report any values they have already accepted. If a proposer gets promises from a majority, it sends an accept request. If a majority accepts, the value is chosen. Simple on paper, nightmarish in practice.
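To make the two phases concrete, here is a minimal, illustrative sketch of the acceptor side in Go. The type and method names are invented for this sketch; a real acceptor must persist its promises and accepted value to disk before replying, and the proposer side, retries, and networking are all omitted.

```go
// Minimal sketch of a single-decree Paxos acceptor (illustrative only:
// real implementations must persist this state before replying).
package paxos

type Proposal struct {
	Number int    // proposer's sequence number
	Value  string // proposed value
}

type Acceptor struct {
	promised int       // highest proposal number we have promised not to undercut
	accepted *Proposal // highest-numbered proposal we have accepted, if any
}

// Prepare handles phase 1: promise to ignore lower-numbered proposals and
// report anything already accepted so the proposer can adopt that value.
func (a *Acceptor) Prepare(n int) (ok bool, prev *Proposal) {
	if n <= a.promised {
		return false, nil // already promised a higher (or equal) number
	}
	a.promised = n
	return true, a.accepted
}

// Accept handles phase 2: accept the proposal unless we have since promised
// a higher number to someone else.
func (a *Acceptor) Accept(p Proposal) bool {
	if p.Number < a.promised {
		return false
	}
	a.promised = p.Number
	a.accepted = &p
	return true
}
```

The subtle part is the value returned from Prepare: a proposer that learns of an already-accepted value must propose that value instead of its own, which is what keeps Paxos safe when proposers race.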
The problem with Paxos is that the single-decree version (agreeing on one value) is only the starting point. Real systems need Multi-Paxos — repeated instances of consensus to build a replicated log. Multi-Paxos was never fully specified by Lamport. Every production system that went down this road (Google's Chubby, ZooKeeper's Zab protocol, which is Paxos-like but deliberately its own design) ended up building something subtly different, with its own bugs and corner cases.
The engineers who built Google's Chubby on Paxos put it bluntly in their "Paxos Made Live" paper: "There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system. ... the final system will be based on an unproven protocol." Diego Ongaro, the creator of Raft, quotes that passage as a key motivation for designing a new algorithm.
Raft: Consensus for Humans
Raft was designed in 2013 by Diego Ongaro and John Ousterhout at Stanford with a single goal: be understandable. It provides the same safety guarantees as Multi-Paxos but decomposes the problem into three clean subproblems: leader election, log replication, and safety.
Leader Election
In Raft, one node is the leader and the rest are followers. The leader handles all client requests and replicates log entries to followers. If a follower does not hear from the leader within a randomized election timeout (typically 150-300ms), it becomes a candidate and starts an election.
The candidate increments its term number (a logical clock), votes for itself, and sends RequestVote RPCs to all other nodes. Each node votes for at most one candidate per term, on a first-come-first-served basis. If a candidate gets votes from a majority, it becomes the new leader. The randomized timeout is the key trick — it makes split votes unlikely because different nodes time out at different moments.
Figure 1: Raft leader election. When Node A (the leader) crashes, Node C's election timeout fires first. It increments the term, votes for itself, and requests votes from all peers. Nodes B and D grant their votes. With 3 out of 4 votes (a majority), Node C becomes the new leader for Term 4.
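As a rough illustration of the election mechanics, here is a simplified candidate round in Go. The Node type and the requestVote callback are invented for the sketch; real Raft also rejects candidates whose logs are less up to date than the voter's, handles term updates on rejection, and runs the RPCs concurrently, all of which are skipped here.

```go
// Sketch of Raft's election trigger and vote counting. RPC plumbing,
// concurrency, and log-up-to-date checks are omitted for brevity.
package raft

import (
	"math/rand"
	"time"
)

// electionTimeout returns a randomized timeout in the 150-300ms range the
// Raft paper suggests; the randomness is what makes split votes unlikely.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

type Node struct {
	id          int
	peers       []int // ids of the other cluster members
	currentTerm int
	votedFor    int // -1 means "no vote cast this term"
}

// startElection runs one candidate round. requestVote stands in for the
// RequestVote RPC and returns true if the peer granted its vote.
func (n *Node) startElection(requestVote func(peer, term, candidate int) bool) bool {
	n.currentTerm++   // new term (logical clock)
	n.votedFor = n.id // vote for ourselves
	votes := 1

	for _, p := range n.peers {
		if requestVote(p, n.currentTerm, n.id) {
			votes++
		}
	}
	// Majority of the full cluster (peers plus ourselves) wins the election.
	return votes > (len(n.peers)+1)/2
}
```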
Log Replication
Once a leader is elected, it handles all client writes. The leader appends each new command to its log as an uncommitted entry and sends AppendEntries RPCs to all followers in parallel. When a majority of nodes (including the leader) have written the entry to their log, the leader marks the entry as committed and applies it to its state machine. It then notifies followers to commit and apply the entry too.
This is the heart of Raft's safety: an entry is committed only when a majority has durably stored it. Even if the leader crashes immediately after committing, the entry exists on a majority of nodes. Any future leader must have the entry in its log (because it needs votes from a majority, and any majority overlaps with the previous majority). This guarantees that committed entries are never lost.
Figure 2: Raft log replication. The leader appends a new entry (w=9) and sends AppendEntries RPCs to all followers. Once the leader and Node B have the entry (2 out of 3 = majority), it is committed. Node D will catch up eventually. Committed entries are never lost, even if the leader crashes.
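A compact way to see the commit rule is the bookkeeping a leader does over its followers' replication progress. The sketch below invents a Leader struct with a matchIndex map; real Raft additionally requires that the entry being committed belongs to the leader's current term, and of course sends the actual AppendEntries RPCs.

```go
// Sketch of how a Raft leader decides an entry is committed, assuming it
// tracks the highest log index known to be replicated on each follower.
package raft

import "sort"

type Leader struct {
	lastLogIndex int         // index of the newest entry in the leader's own log
	matchIndex   map[int]int // follower id -> highest replicated index
	commitIndex  int         // highest index known to be committed
}

// advanceCommitIndex recomputes the commit index: an entry is committed once
// it is durably stored on a majority of the cluster (leader included).
func (l *Leader) advanceCommitIndex() {
	// Collect replication progress for every member, counting the leader itself.
	indexes := []int{l.lastLogIndex}
	for _, idx := range l.matchIndex {
		indexes = append(indexes, idx)
	}
	sort.Sort(sort.Reverse(sort.IntSlice(indexes)))

	// The highest index that at least a majority of members has reached.
	majority := len(indexes)/2 + 1
	candidate := indexes[majority-1]
	if candidate > l.commitIndex {
		l.commitIndex = candidate
	}
}
```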
etcd: Raft Powering Kubernetes
Every Kubernetes cluster depends on etcd, a distributed key-value store that uses Raft for consensus. When you run kubectl apply, the desired state goes into etcd. The scheduler reads from etcd. The controller manager reads from etcd. If etcd goes down, your Kubernetes control plane is effectively dead.
etcd typically runs as a 3-node or 5-node cluster. With 3 nodes, it tolerates 1 failure. With 5 nodes, it tolerates 2 failures. Going to 7 nodes gives you tolerance for 3 failures but increases the latency of every write (because the leader must wait for 4 acknowledgments instead of 2 or 3). Most production clusters settle on 5 nodes as the sweet spot.
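The sizing tradeoff is just quorum arithmetic; this tiny snippet prints the numbers behind the 3/5/7-node choices above.

```go
package main

import "fmt"

// quorum is how many nodes must durably store a write before it commits;
// tolerates is how many node failures the cluster can survive.
func quorum(n int) int    { return n/2 + 1 }
func tolerates(n int) int { return (n - 1) / 2 }

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("%d nodes: quorum %d, tolerates %d failures\n", n, quorum(n), tolerates(n))
	}
}
```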
The etcd team at CoreOS (now part of Red Hat) made a critical design decision: etcd's Raft implementation is a library, not a monolith. The Raft logic is cleanly separated from the storage layer and network transport. This means you can swap in different storage backends or network protocols without touching the consensus code. This same library powers CockroachDB's consensus layer.
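The shape of that separation looks roughly like the following, based on the library's documented usage pattern. Treat it as a sketch: the import path has moved across releases (go.etcd.io/etcd/raft/v3, later go.etcd.io/raft/v3), and hard-state persistence, snapshots, committed-entry application, and a real transport are all omitted.

```go
// Rough shape of driving etcd's raft library. The library hands you Ready
// structs; durable storage and the network transport are your job.
package main

import (
	"time"

	"go.etcd.io/raft/v3" // import path is version-dependent; see lead-in
)

func main() {
	storage := raft.NewMemoryStorage() // pluggable: anything implementing raft.Storage
	cfg := &raft.Config{
		ID:              1,
		ElectionTick:    10, // election timeout, measured in ticks
		HeartbeatTick:   1,
		Storage:         storage,
		MaxSizePerMsg:   4096,
		MaxInflightMsgs: 256,
	}
	node := raft.StartNode(cfg, []raft.Peer{{ID: 1}, {ID: 2}, {ID: 3}})

	ticker := time.NewTicker(100 * time.Millisecond)
	for {
		select {
		case <-ticker.C:
			node.Tick() // drives elections and heartbeats
		case rd := <-node.Ready():
			// The application persists entries and sends outbound messages
			// over whatever transport it chooses, then acknowledges.
			storage.Append(rd.Entries)
			_ = rd.Messages // hand these to your own transport layer
			node.Advance()
		}
	}
}
```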
CockroachDB: Multi-Raft at Scale
CockroachDB faces a challenge that etcd does not: it needs to store terabytes or petabytes of data across hundreds of nodes. Running a single Raft group across all those nodes would be catastrophically slow — every write would require acknowledgment from a majority of hundreds of nodes.
The solution is Multi-Raft. CockroachDB splits data into ranges (chunks of roughly 512 MB). Each range is its own Raft group, typically replicated to 3 or 5 nodes. A single CockroachDB node might participate in thousands of Raft groups simultaneously — it could be the leader for some ranges and a follower for others.
This design has a beautiful property: consensus happens in parallel across ranges. A write to range A does not block a write to range B. The system scales horizontally because adding more nodes means you can split ranges further and distribute leadership more evenly. TiKV (the storage engine behind TiDB) uses the exact same Multi-Raft architecture.
Figure 3: CockroachDB's Multi-Raft architecture. Data is split into ranges, each running its own Raft group across 3 replicas. Leadership is distributed across nodes, so consensus happens in parallel. Node 1 leads Range A, Node 2 leads Range B, and Node 4 leads Range C.
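The core data structure behind this is a routing table from key spans to Raft groups. The sketch below uses invented names rather than CockroachDB's actual internals, but it captures why writes to different ranges never contend: each key resolves to its own replica set and its own consensus group.

```go
// Illustrative sketch of Multi-Raft routing: keys map to ranges, and each
// range is an independent Raft group with its own replicas and leader.
package multiraft

type RangeID int

// RangeDescriptor maps a span of the keyspace to a Raft group.
type RangeDescriptor struct {
	ID       RangeID
	StartKey string // inclusive
	EndKey   string // exclusive
	Replicas []int  // node ids holding a replica of this range
	LeaderID int    // node currently leading this range's Raft group
}

type Store struct {
	ranges []RangeDescriptor // sorted by StartKey
}

// rangeFor finds the Raft group responsible for a key. A write to one range
// runs consensus only among that range's replicas, so writes to different
// ranges commit in parallel.
func (s *Store) rangeFor(key string) *RangeDescriptor {
	for i := range s.ranges {
		r := &s.ranges[i]
		if key >= r.StartKey && key < r.EndKey {
			return r
		}
	}
	return nil
}
```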
The Latency Tax of Consensus
Consensus is not free. Every committed write requires at least one round trip from the leader to a majority of followers. In a 3-node cluster within one data center, that is typically 0.5-2ms. Across data centers, it jumps to 30-100ms depending on geographic distance. Across continents, you are looking at 100-300ms.
This is why CockroachDB lets you configure locality-aware placement. You can pin a range's replicas to nodes in the same region, so consensus round trips stay within a few milliseconds. Cross-region writes still pay the latency tax, but reads (which the leaseholder or leader can serve locally, without a full consensus round) stay fast as long as your data is nearby.
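A back-of-the-envelope model makes the geography concrete: a write commits once the leader hears from enough followers to complete a quorum, so commit latency is roughly the round trip to the follower that closes that quorum. The numbers below are invented examples, not benchmarks, and the model ignores disk syncs and queuing.

```go
// Toy model of the consensus latency tax. Assumes at least one follower.
package main

import (
	"fmt"
	"sort"
)

// commitLatencyMs estimates commit latency from the leader's round-trip
// times to its followers: the leader counts toward the quorum, so it needs
// acks from the (quorum-1) closest followers.
func commitLatencyMs(followerRTTsMs []float64) float64 {
	clusterSize := len(followerRTTsMs) + 1 // leader plus followers
	quorum := clusterSize/2 + 1
	sorted := append([]float64(nil), followerRTTsMs...)
	sort.Float64s(sorted)
	return sorted[quorum-2]
}

func main() {
	// 3-node cluster in one data center: commit waits only on the faster follower.
	fmt.Println(commitLatencyMs([]float64{0.8, 1.2})) // ~0.8ms
	// 5-node cluster spread across regions: commit waits on the second-closest
	// follower, so one far-away replica doesn't hurt, but two do.
	fmt.Println(commitLatencyMs([]float64{2, 35, 70, 140})) // ~35ms
}
```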
Google Spanner takes a different approach: it runs Paxos (not Raft) per shard and layers its TrueTime API on top. Writes still pay the cross-region Paxos round trip; what the commit-wait interval (roughly 7ms in the original paper) buys is external consistency, so globally distributed transactions get consistent timestamps and snapshot reads can be served without locks. That wait is a constant cost tied to clock uncertainty, not to geographic distance.
When Consensus Is Overkill
Not every problem needs consensus. If you are building a cache, eventual consistency is fine — Redis Cluster uses asynchronous primary-replica replication plus gossip for cluster membership and hash slots, not Raft. If you are building a message queue where at-least-once delivery is acceptable, Kafka's ISR (in-sync replicas) mechanism handles the data path without full consensus (though Kafka's KRaft mode, production-ready since version 3.3, replaces ZooKeeper with a Raft-based quorum for metadata management).
The rule of thumb: use consensus when you need linearizable reads and writes — when the cost of two clients seeing different states is higher than the latency cost of coordination. Leader election, distributed locks, configuration management, transaction commits — these are consensus territory. Caching, analytics, event streaming, content delivery — these can usually get away with weaker consistency models.
The Bottom Line
Consensus is the cornerstone of reliable distributed systems, and it is fundamentally hard. The FLP result says you cannot have it all. The Byzantine Generals Problem says you need redundancy to tolerate failures. Paxos proved it was possible but made everyone miserable trying to implement it. Raft made it understandable and practical.
Today, Raft runs inside etcd (powering Kubernetes), CockroachDB (powering globally distributed SQL), TiKV (powering TiDB), Consul (powering service discovery), and dozens of other systems. Multi-Raft scales it to petabytes. And the fundamental tradeoff remains: every consensus round costs you latency. The art of distributed systems engineering is knowing when that cost is worth paying.
References and Further Reading
- In Search of an Understandable Consensus Algorithm (Extended Version) — Ongaro & Ousterhout (2014)
- The Part-Time Parliament (Paxos) — Leslie Lamport (1998)
- Impossibility of Distributed Consensus with One Faulty Process (FLP) — Fischer, Lynch & Paterson (1985)
- The Byzantine Generals Problem — Lamport, Shostak & Pease (1982)
- CockroachDB Replication Layer — Multi-Raft Architecture Documentation
- etcd API Guarantees and Raft Consensus — etcd Documentation