How Do Distributed Systems Know What Happened First?

Your Clock Is Lying to You

Here is a fun experiment. Open two terminals on two different servers. Run date +%s%N on both at the "same time." The timestamps will not match. They might be off by milliseconds, maybe seconds. Now imagine those two servers are processing financial transactions, and the order of operations matters. Which write came first? Your wall clock has no idea.

Wall clocks — the physical clocks sitting on your servers — are unreliable in distributed systems for three reasons. First, clock drift: quartz oscillators in commodity hardware drift at roughly 10-20 parts per million, which means up to about 1.7 seconds of drift per day (20 ppm across the 86,400 seconds in a day is roughly 1.7 seconds). Second, NTP corrections: when the Network Time Protocol daemon syncs a drifted clock, it can jump the time forward or backward, meaning timestamps are not even monotonic. Third, network latency: even if two events happen at the "same" physical time, the messages announcing them arrive at different times on different nodes.

So you cannot use wall clocks to order events across machines. But ordering events is critical. If Alice deposits $100 and then transfers $80, the deposit must be processed before the transfer. If the system gets the order wrong, Alice's account goes negative. Distributed systems needed a way to reason about causality — what happened before what — without relying on synchronized physical clocks.

Leslie Lamport's 1978 insight was deceptively simple: you do not need to know the exact time something happened. You just need to know the order in which things happened relative to each other.

Lamport Timestamps: The First Step

In his landmark 1978 paper "Time, Clocks, and the Ordering of Events in a Distributed System," Leslie Lamport introduced the happens-before relation. If event A occurs before event B on the same process, then A happens-before B. If event A is the sending of a message and event B is the receipt of that message, then A happens-before B. And if A happens-before B and B happens-before C, then A happens-before C. Transitivity gives you a partial order over events.

Lamport timestamps turn this into a simple algorithm. Every process maintains a counter. When a process performs a local event, it increments the counter. When a process sends a message, it includes its current counter value. When a process receives a message, it sets its counter to max(local_counter, received_counter) + 1. That is it. Three rules.

The guarantee: if event A happens-before event B, then L(A) < L(B). The timestamp of the cause is always less than the timestamp of the effect. But here is the catch — the converse is not true. If L(A) < L(B), you cannot conclude that A happened before B. The events might be concurrent: two things that happened independently on different nodes with no causal link between them. Lamport timestamps cannot distinguish "A caused B" from "A and B are unrelated." This is a real problem.
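
To make the three rules concrete, here is a minimal sketch in Python. The class and method names are illustrative rather than taken from Lamport's paper or any particular library, and sending is treated as a local event, so it also bumps the counter.

    # Minimal Lamport clock sketch; all names are illustrative.
    class LamportClock:
        def __init__(self):
            self.counter = 0

        def tick(self):
            # Rule 1: increment the counter on every local event.
            self.counter += 1
            return self.counter

        def send(self):
            # Rule 2: treat the send as an event, then attach the counter to the message.
            return self.tick()

        def receive(self, received_counter):
            # Rule 3: jump past both the local and the received values.
            self.counter = max(self.counter, received_counter) + 1
            return self.counter

    a, b = LamportClock(), LamportClock()
    ts = a.send()     # A stamps its message with 1
    b.receive(ts)     # B's counter becomes max(0, 1) + 1 = 2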

[Figure: Lamport Timestamps Across Three Processes. P1, P2, and P3 exchange messages, each event labeled with its counter; on receive, counter = max(local, received) + 1.]

Figure 1: Lamport timestamps propagate through message passing. Each process increments on local events and takes the max on message receipt. Note that P1's event at timestamp 1 and P3's event at timestamp 1 are concurrent — Lamport timestamps alone cannot tell you this.

Vector Clocks: Now You Can Detect Concurrency

Vector clocks fix the gap that Lamport timestamps leave open. Instead of a single counter, every process maintains a vector — an array of counters, one per process in the system. If you have three nodes, every event is tagged with a vector of three integers, like [2, 0, 3]. Entry i counts the events on process i that the vector's owner knows about: a process's own entry tracks its own progress, and every other entry tracks what it has learned about that process through messages.

The rules are almost the same as Lamport timestamps, but applied element-wise. When process i performs a local event, it increments V[i]. When process i sends a message, it includes its entire vector. When process i receives a message with vector V_msg, it sets each entry V[j] = max(V[j], V_msg[j]) for all j, then increments V[i].

The payoff is huge. Given two vector timestamps VA and VB, you can determine the exact relationship between the events:

  • A happened-before B if every entry in VA is less than or equal to the corresponding entry in VB, and at least one is strictly less.
  • B happened-before A if the reverse holds.
  • A and B are concurrent if neither vector dominates the other — some entries in VA are greater, some entries in VB are greater.

This is the key insight: vector clocks give you a perfect causality detector. If two events are concurrent (neither caused the other), vector clocks will tell you. Lamport timestamps cannot do this.
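
Here is a sketch of both the update rules and the comparison, for a fixed number of processes. The names are again illustrative, and this is a teaching aid rather than a production implementation.

    # Vector clock sketch for a fixed set of n processes; names are illustrative.
    class VectorClock:
        def __init__(self, n, pid):
            self.v = [0] * n    # one counter per process
            self.pid = pid      # index of the owning process

        def local_event(self):
            self.v[self.pid] += 1
            return list(self.v)

        def on_send(self):
            # Sending counts as a local event; the message carries a copy of the vector.
            return self.local_event()

        def on_receive(self, v_msg):
            # Element-wise max with the incoming vector, then increment our own entry.
            self.v = [max(a, b) for a, b in zip(self.v, v_msg)]
            self.v[self.pid] += 1
            return list(self.v)

    def compare(va, vb):
        """Classify two vector timestamps as 'before', 'after', 'equal', or 'concurrent'."""
        a_le_b = all(a <= b for a, b in zip(va, vb))
        b_le_a = all(b <= a for a, b in zip(va, vb))
        if a_le_b and b_le_a:
            return "equal"
        if a_le_b:
            return "before"        # A happened-before B
        if b_le_a:
            return "after"         # B happened-before A
        return "concurrent"        # neither dominates

    compare([1, 0, 0], [0, 1, 0])  # 'concurrent'
    compare([2, 0, 0], [2, 2, 0])  # 'before'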

[Figure: Vector Clocks: Detecting Concurrency. Processes A, B, and C exchange messages, each event tagged with its vector; [1,0,0] and [0,1,0] are concurrent because neither vector dominates the other.]

Figure 2: Vector clocks in action. Events [1,0,0] on A and [0,1,0] on B are concurrent — A's first element is greater, but B's second element is greater. No causal relationship exists between them. Compare [2,0,0] and [2,2,0]: every entry in the first is less than or equal to the second, so A's event happened before B's event.

Dynamo and DynamoDB: A History with Vector Clocks

Amazon's original Dynamo paper (2007) is probably the most famous real-world use of vector clocks. Dynamo used vector clocks to track the version history of every key. When two replicas independently wrote to the same key (because of a network partition or concurrent client requests), Dynamo did not just pick one — it kept both versions and attached vector clocks so the system (or the application) could figure out the relationship later.

Here is how it worked. Say replica A writes key K with vector clock [A:1]. Then replica B, without seeing A's write, also writes key K with clock [B:1]. The clocks are concurrent — neither dominates. Dynamo stores both versions as siblings. When a client reads key K, it gets both versions and their vector clocks. The client (or application logic) must reconcile the conflict. For a shopping cart, the reconciliation strategy was simple: take the union of items in both carts.
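
The same dominance test works with clocks kept as dictionaries keyed by replica name, which is closer to how Dynamo-style systems track versions per coordinator. The sketch below is a toy illustration of sibling detection and shopping-cart reconciliation; the helper names are mine, not Dynamo's actual interfaces.

    # Toy Dynamo-style siblings; clocks are dicts of replica -> counter.
    def descends(va, vb):
        # True if clock va has seen everything clock vb has seen.
        return all(va.get(r, 0) >= c for r, c in vb.items())

    def concurrent(va, vb):
        return not descends(va, vb) and not descends(vb, va)

    # Two replicas write the same cart without seeing each other:
    clock_a, cart_a = {"A": 1}, {"milk", "eggs"}
    clock_b, cart_b = {"B": 1}, {"eggs", "bread"}
    assert concurrent(clock_a, clock_b)   # neither dominates, so keep both as siblings

    # Application-level reconciliation for a shopping cart: union the items and
    # merge the clocks so the result dominates both siblings.
    merged_clock = {r: max(clock_a.get(r, 0), clock_b.get(r, 0))
                    for r in set(clock_a) | set(clock_b)}
    merged_cart = cart_a | cart_b         # {"milk", "eggs", "bread"}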

But vector clocks had a practical problem in Dynamo. As more nodes touch a key, the vector grows. If hundreds of nodes write to a hot key, the vector clock becomes hundreds of entries. Dynamo truncated vector clocks after they exceeded a size threshold, pruning the oldest entries. This could cause false conflicts — the system would report concurrent writes where one actually happened before the other. Annoying, but safe: false conflicts lead to unnecessary reconciliation, not data loss.

Modern DynamoDB (the managed AWS service) moved away from client-side conflict resolution. It uses last-writer-wins with server-side timestamps for most operations, which is simpler but loses the ability to detect true concurrency. The tradeoff was worth it: most applications would rather not write custom reconciliation logic.

Riak's Dotted Version Vectors

Riak, the distributed key-value store inspired by Dynamo, took vector clocks further. Its developers ran into a specific problem: sibling explosion. In classic vector clocks, if a client reads a value and writes it back without including the causal context, the system treats it as a new concurrent write. Do this a few times and you have dozens of siblings for a single key, all piling up.

Riak's solution was dotted version vectors (DVVs). A DVV extends a version vector with a "dot" — a single (node, counter) pair identifying the exact event that created this particular sibling. When a client writes a new value, the coordinating node creates a new dot. When the system receives a write with a causal context (the version vector from the prior read), it can discard any siblings that are dominated by that context. The dot lets the system precisely identify which sibling is being overwritten, even when the version vectors are ambiguous.
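
A rough sketch of that write path follows; the data layout and names are illustrative simplifications, not Riak's internals.

    # Toy dotted version vectors. A sibling is a (dot, value) pair, where dot is
    # (node, counter); ctx is the causal context from the client's prior read,
    # kept here as a dict of node -> highest counter seen.
    def covered(dot, ctx):
        node, counter = dot
        return ctx.get(node, 0) >= counter    # the client had already seen this sibling

    def write(siblings, ctx, new_value, coordinator, node_clock):
        # Discard every sibling the incoming context dominates...
        survivors = [(d, v) for d, v in siblings if not covered(d, ctx)]
        # ...then mint a fresh dot on the coordinating node for the new value.
        node_clock[coordinator] = node_clock.get(coordinator, 0) + 1
        new_dot = (coordinator, node_clock[coordinator])
        return survivors + [(new_dot, new_value)]

    # Because each sibling carries its own dot, a write whose context covers an
    # existing sibling replaces it instead of piling up another "concurrent" copy.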

The result: far fewer false conflicts. DVVs were a significant improvement over plain vector clocks for real-world workloads where clients frequently read-modify-write the same key. Riak adopted them in version 2.0, and the reduction in sibling bloat was dramatic.

Google Spanner and TrueTime: The Hardware Approach

Google looked at the vector clock problem and said: what if we just fixed the clocks? Spanner's TrueTime API does not return a timestamp. It returns an interval [earliest, latest] representing the uncertainty in the current time. This interval is typically under 7 milliseconds, kept tight by GPS receivers and atomic clocks in every data center.

Spanner uses TrueTime for its commit-wait protocol. When a transaction commits, Spanner assigns it a timestamp and then waits out the uncertainty interval before making the transaction visible. This guarantees that any transaction that starts after the commit will see a strictly larger timestamp. No vector clocks needed — you get a global total order of transactions using (very accurate) physical time.
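
Here is a sketch of the commit-wait idea against a TrueTime-like interface. The tt_now stand-in and its 7-millisecond bound are assumptions for illustration; the real API lives only inside Google's infrastructure.

    import time

    # Stand-in for TrueTime: returns (earliest, latest) bounds on the true time.
    def tt_now(eps=0.007):            # assumed ~7 ms uncertainty bound
        t = time.time()
        return (t - eps, t + eps)

    def commit(make_visible):
        # Assign the commit timestamp at the top of the current uncertainty interval.
        commit_ts = tt_now()[1]
        # Commit wait: stall until the timestamp is definitely in the past, so any
        # transaction that starts afterwards sees a strictly larger timestamp.
        while tt_now()[0] <= commit_ts:
            time.sleep(0.001)
        make_visible(commit_ts)
        return commit_ts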

The catch? You need Google's hardware. GPS receivers and atomic clocks in every data center are not cheap. And you pay a latency cost: every commit waits 7+ milliseconds. For Google's workloads (globally distributed, strong consistency required), it is worth it. For most of us, it is not an option.

Four Approaches to Ordering Events

  • Lamport timestamps: a single integer counter; partial ordering only; cannot detect concurrency; O(1) space. Used in: Paxos, Raft.
  • Vector clocks: an array of N counters; causality plus concurrency; perfect causality detection; O(N) space. Used in: Dynamo, Riak.
  • Dotted version vectors: a vector plus an event dot; fewer false conflicts; solves sibling explosion; O(N) space. Used in: Riak 2.0+.
  • TrueTime: GPS and atomic clocks; global total ordering; requires special hardware; O(1) space. Used in: Spanner.

Figure 3: Four approaches to event ordering. Lamport timestamps are lightweight but cannot detect concurrency. Vector clocks detect concurrency perfectly but grow with the number of nodes. Dotted version vectors solve the sibling explosion problem. TrueTime sidesteps logical clocks entirely with specialized hardware.

Hybrid Logical Clocks: The Best of Both Worlds

What if you want the causality guarantees of logical clocks with the human-readability of physical timestamps? That is exactly what Hybrid Logical Clocks (HLCs), proposed by Kulkarni et al. in 2014, give you.

An HLC timestamp has two components: a physical part (the wall clock time, loosely synchronized via NTP) and a logical part (a counter that breaks ties). The physical part is always close to real time — within the NTP synchronization error, usually tens of milliseconds. The logical part increments when the physical clock has not advanced since the last event, just like a Lamport counter.

The rules: on a local event, if the wall clock has advanced, use the new wall clock time and reset the logical counter to zero. If the wall clock has not advanced (same millisecond), increment the logical counter. On message receipt, set the physical part to the max of the local wall clock, the local physical part, and the received physical part; if the local physical part wins, increment the local counter, if the received part wins, continue from the sender's counter plus one, if the two tie, take the larger of the two counters plus one, and if the wall clock itself wins, reset the counter to zero. HLCs guarantee the happens-before property (like Lamport timestamps) while keeping timestamps close to real time (unlike pure logical clocks, which can diverge wildly from wall clock time).
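
A sketch of those rules, following the algorithm in the Kulkarni et al. paper; the class and method names are mine, and real implementations typically pack the two parts into a single 64-bit value.

    import time

    class HLC:
        # Hybrid logical clock: a physical part (the largest wall-clock time
        # observed so far) plus a logical counter that breaks ties within it.
        def __init__(self):
            self.l = 0
            self.c = 0

        def _wall(self):
            return int(time.time() * 1000)    # milliseconds, loosely NTP-synced

        def now(self):
            """Timestamp a local or send event."""
            wall = self._wall()
            if wall > self.l:
                self.l, self.c = wall, 0      # wall clock advanced: reset the counter
            else:
                self.c += 1                   # same instant: break the tie logically
            return (self.l, self.c)

        def update(self, msg_l, msg_c):
            """Timestamp a receive event, given the sender's (physical, logical) pair."""
            wall = self._wall()
            prev = self.l
            self.l = max(prev, msg_l, wall)
            if self.l == prev == msg_l:
                self.c = max(self.c, msg_c) + 1
            elif self.l == prev:
                self.c += 1
            elif self.l == msg_l:
                self.c = msg_c + 1
            else:
                self.c = 0
            return (self.l, self.c)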

CockroachDB uses HLCs. MongoDB uses a variant for its oplog. The advantage is practical: you get a single 64-bit timestamp that is both causally consistent and roughly human-readable. No vectors that grow with cluster size. No GPS receivers in your data center. The tradeoff is that HLCs, like Lamport timestamps, cannot detect concurrency — they give you a total order, but concurrent events get arbitrarily ordered by tie-breaking rules.

The choice between vector clocks and HLCs often comes down to this: do you need to know when two events are truly independent (use vector clocks), or do you just need a consistent ordering you can reason about (use HLCs)?

The Bottom Line

There is no single "right" clock for distributed systems. Lamport timestamps give you a cheap partial order. Vector clocks give you perfect concurrency detection at the cost of growing metadata. Dotted version vectors fix the practical problems of vector clocks in read-modify-write workloads. TrueTime throws hardware at the problem. Hybrid logical clocks find a sweet spot between physical and logical time.

The systems that survive at scale pick the mechanism that matches their conflict model. DynamoDB started with vector clocks and moved to last-writer-wins because most applications prefer simplicity over precision. Riak doubled down on concurrency detection with DVVs because their users needed it. Spanner invested in atomic clocks because Google can afford to and correctness is non-negotiable. CockroachDB chose HLCs because they wanted Spanner-like semantics without Spanner-like hardware.

Whatever you build, the lesson is the same: time in a distributed system is not a number on a clock. It is a relationship between events. Once you internalize that, everything else follows.

References and Further Reading