How Do 1,000 Servers Agree Without a Leader?

Spreading Information Like a Rumor

Imagine you know a secret. You tell two random people. Each of them tells two more random people. Within minutes, everyone in the building knows. This is gossip — and it is exactly how distributed systems like Cassandra, DynamoDB, and Consul share cluster state without any central coordinator.

Every node periodically picks a random peer and exchanges state information. After O(log N) rounds, every node in a cluster of N nodes has the information. For a 1,000-node cluster, that is about 10 rounds. For 1 million nodes, about 20 rounds. The math is the same as epidemic spreading.
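
As a quick sanity check on the O(log N) claim, here is a minimal Python simulation of naive push gossip (the cluster sizes and fanout of 2 are illustrative choices, not parameters of any real system):

```python
import math
import random

def gossip_rounds(n_nodes: int, fanout: int = 2) -> int:
    """Simulate naive push gossip: each informed node tells `fanout` random
    peers per round. Returns the number of rounds until everyone is informed.
    Expect a small multiple of log2(N) — the last stragglers add a short tail."""
    informed = {0}                       # node 0 starts with the rumor
    rounds = 0
    while len(informed) < n_nodes:
        rounds += 1
        heard_this_round = set()
        for _ in range(len(informed)):
            heard_this_round.update(random.randrange(n_nodes) for _ in range(fanout))
        informed |= heard_this_round
    return rounds

for n in (1_000, 100_000):
    print(f"N={n}: {gossip_rounds(n)} rounds (log2 N ~ {math.log2(n):.1f})")
```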

Amazon's original Dynamo paper (2007) describes using gossip for membership and failure detection. Each node gossips its view of the cluster ring every second, so with roughly 10 rounds needed for a 1,000-node cluster, a node failure reaches every node within about 10 seconds.

How Does Gossip Work?

Every node maintains a local state table — key-value pairs with version numbers or timestamps. Periodically (every 1-2 seconds), each node runs an exchange:

  • It picks a random peer from the cluster membership list
  • It sends a digest — a summary of its state (keys + version numbers, not full values)
  • The peer compares the digest to its own state and replies with any newer values it has
  • The initiator sends back any values it has that are newer than the peer's

This three-phase exchange (SYN, ACK, ACK2) ensures both nodes converge to the same state. This is called push-pull gossip and is what Cassandra uses.
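
Below is a minimal, single-process Python sketch of that digest exchange. The Node class, integer versions, and key names are illustrative stand-ins — Cassandra's real gossip versions state per endpoint with generation and heartbeat numbers — but the SYN/ACK/ACK2 flow has the same shape:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    state: dict = field(default_factory=dict)   # key -> (version, value)

    def put(self, key, value):
        version = self.state.get(key, (0, None))[0] + 1
        self.state[key] = (version, value)

    def digest(self):
        """SYN: keys and version numbers only, no values."""
        return {k: version for k, (version, _) in self.state.items()}

    def diff_against(self, digest):
        """Values the digest's sender is missing, plus keys where the sender is newer."""
        newer_here = {k: self.state[k] for k in self.state
                      if self.state[k][0] > digest.get(k, 0)}
        stale_here = [k for k, v in digest.items()
                      if v > self.state.get(k, (0, None))[0]]
        return newer_here, stale_here

    def apply(self, updates):
        for k, (version, value) in updates.items():
            if version > self.state.get(k, (0, None))[0]:
                self.state[k] = (version, value)

def push_pull(a: Node, b: Node):
    syn = a.digest()                                   # SYN:  a -> b (digest only)
    updates_for_a, keys_b_wants = b.diff_against(syn)
    a.apply(updates_for_a)                             # ACK:  b -> a (b's newer values)
    ack2 = {k: a.state[k] for k in keys_b_wants if k in a.state}
    b.apply(ack2)                                      # ACK2: a -> b (a's newer values)

a, b = Node("a"), Node("b")
a.put("schema", "v42")
b.put("load", 0.7)
push_pull(a, b)
assert a.state == b.state                              # both sides converge
```

The digest is the key bandwidth optimization: full values cross the wire only for keys where one side is actually behind.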

[Figure: gossip propagation in O(log N) rounds — Round 1: one node knows; Round 2: three nodes know; Round 3+: all nodes know]

Figure 1: Gossip spreads exponentially. Each informed node tells random peers, doubling the informed set each round. After O(log N) rounds, the entire cluster converges.

Failure Detection: Phi Accrual

Gossip's killer app is failure detection. Every gossip message includes a heartbeat counter. If a node's heartbeat hasn't increased for a while, it might be dead. But how long is "a while"?

Cassandra uses the Phi Accrual Failure Detector. Instead of a binary alive/dead decision, it computes a suspicion level (phi) from the statistical distribution of recent heartbeat inter-arrival times. Phi is roughly the negative log10 of the probability that the silence is just a late heartbeat: a phi of 5 means there is about a 1-in-100,000 chance the node is actually still alive. The threshold (phi_convict_threshold in Cassandra) is configurable — higher values mean fewer false positives but slower detection.
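
A minimal sketch of the idea, assuming normally distributed heartbeat inter-arrival times as in the original Hayashibara et al. paper (Cassandra's implementation differs in the details; the class and window size here are illustrative):

```python
import math
import random
from collections import deque

class PhiAccrualDetector:
    def __init__(self, window: int = 100):
        self.intervals = deque(maxlen=window)   # recent heartbeat inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now: float):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """Suspicion level: -log10 of the probability that the current silence
        is just a late heartbeat, under a normal model of arrival intervals."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(var), 1e-3)
        elapsed = now - self.last_heartbeat
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-30))        # clamp to avoid log(0)

detector = PhiAccrualDetector()
t = 0.0
for _ in range(50):                  # heartbeats roughly every second, with jitter
    t += random.gauss(1.0, 0.1)
    detector.heartbeat(t)
# Suspicion grows the longer the node stays silent:
print(detector.phi(t + 1.0), detector.phi(t + 1.5), detector.phi(t + 2.0))
```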

SWIM: Scalable Gossip

Naive all-to-all heartbeating — every node pinging every other node — costs O(N^2) messages per round. The SWIM protocol (Scalable Weakly-consistent Infection-style process group Membership) fixes this. Instead of heartbeating everyone, each node pings one random peer per protocol period. If the peer doesn't respond, the node asks K other peers to ping it on its behalf (indirect probing), which also routes around a broken link between the two nodes. If none of the indirect probes succeed either, the peer is marked as suspect and eventually declared dead. Total message load is O(N) per period — constant per node — while preserving fast, cluster-wide failure detection.
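
A sketch of one SWIM protocol period is shown below. The function names, the timeout-free fake network in the demo, and the K_INDIRECT constant are illustrative assumptions; real implementations such as HashiCorp's memberlist add suspicion timeouts, piggybacked membership updates, and the Lifeguard refinements.

```python
import random

K_INDIRECT = 3    # how many helpers to ask when a direct ping gets no reply

def probe_round(self_node, members, send_ping, send_ping_req):
    """One SWIM failure-detection period from `self_node`'s point of view.

    send_ping(target) -> bool              direct ping, True if acked in time
    send_ping_req(helper, target) -> bool  ask helper to ping target for us
    """
    target = random.choice([m for m in members if m != self_node])
    if send_ping(target):
        return target, "alive"

    # Direct ping failed: indirect probing routes around a possibly broken
    # link between us and the target before we accuse it of being dead.
    helpers = random.sample(
        [m for m in members if m not in (self_node, target)],
        k=min(K_INDIRECT, len(members) - 2),
    )
    if any(send_ping_req(h, target) for h in helpers):
        return target, "alive"

    return target, "suspect"    # gossiped to the cluster; declared dead later

# Demo with a fake, instantaneous network where only "n4" has crashed.
members = ["n1", "n2", "n3", "n4", "n5"]
dead = {"n4"}
ping = lambda target: target not in dead
ping_req = lambda helper, target: helper not in dead and target not in dead
print(probe_round("n1", members, ping, ping_req))
```

Each node sends at most 1 + K messages per period no matter how large the cluster gets, which is where the O(N) total load comes from.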

HashiCorp's Serf and Consul use a modified SWIM called Lifeguard, implemented in the company's open-source memberlist Go library — the go-to choice for cluster membership in modern infrastructure.

Gossip in the Wild

  • Cassandra: gossip for cluster topology, token ring ownership, schema versions, and node health. Runs every second.
  • DynamoDB: gossip for membership and failure detection across the storage fleet.
  • Redis Cluster: gossip bus on port+10000. Nodes exchange cluster state, detect failures, and trigger failover via gossip.
  • CockroachDB: gossip for store descriptors, node liveness, and system config.
  • Consul: SWIM-based gossip for service discovery and health checking across data centers.

Consul's gossip pool scales to thousands of nodes per datacenter, detecting failures within a few seconds while keeping individual gossip messages small (a few hundred bytes). The protocol overhead is negligible compared to the availability guarantees it provides.

Why Not Use a Central Registry?

A central registry (like ZooKeeper) is a single point of failure. It needs its own replication and consensus. Gossip has no single point of failure — every node is equal. It tolerates network partitions gracefully (nodes in each partition converge independently). And it scales to thousands of nodes with constant per-node overhead. The tradeoff: eventual consistency. Gossip does not guarantee when all nodes will have the same view — just that they eventually will.

References and Further Reading