How Does Active-Active Replication Keep Two Databases in Sync?
Why Would You Let Two Databases Accept Writes?

In a typical leader-follower setup, one database handles all writes. Simple. But what happens when your users are in Tokyo and New York? A write from Tokyo has to travel roughly 12,000 km to a leader in Virginia, adding on the order of 150 ms of round-trip latency. Multiply that across every checkout, every message, every like. Active-active replication solves this by letting both data centers accept writes locally, then syncing changes between them.

The tradeoff is brutal: conflicts. When two users update the same row at the same time in different data centers, you have two conflicting versions of reality. Every active-active system must answer the question: which write wins?

Uber operates active-active across multiple data centers, processing over 100 million trips per quarter. A single-leader architecture would mean half their users experience cross-continent latency on every write.

Active-Active vs Active-Passive: What Is the Difference?

Active-passive (leader-follower): One node accepts writes. Others are read-only replicas. If the leader dies, you promote a follower. Simple, but the leader is a bottleneck and a single point of failure for writes.

Active-active (multi-leader, multi-master): Multiple nodes accept writes simultaneously. Each node replicates its changes to the others asynchronously. Lower write latency for geographically distributed users, better fault tolerance. But now you need conflict resolution.


Figure 1: Active-passive has one writer, so conflicts never happen. Active-active allows writes at multiple nodes simultaneously, creating the possibility of conflicting updates to the same data.

How Do You Resolve Conflicts?

There are three main strategies, each with different tradeoffs:

Last Writer Wins (LWW). Attach a timestamp to every write. When two writes conflict, the one with the higher timestamp wins. This is what Cassandra does. The problem? Wall clocks are unreliable in distributed systems. Clock skew between nodes means "last" might not actually be last. You can silently lose writes. LWW trades correctness for simplicity.
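A minimal sketch of LWW merging in Python, including the failure mode described above. The `Write` type and tie-breaking rule are illustrative, not any particular database's implementation; note how clock skew lets an older timestamp "win" over a write that actually happened later:

```python
from dataclasses import dataclass

@dataclass
class Write:
    value: str
    timestamp: float  # wall-clock time at the writing node
    node_id: str      # tie-breaker when timestamps are equal

def lww_merge(a: Write, b: Write) -> Write:
    """Keep the write with the higher timestamp; break ties by node id."""
    return a if (a.timestamp, a.node_id) >= (b.timestamp, b.node_id) else b

# Clock skew in action: Node B's clock runs 5 seconds behind, so its
# write -- which actually happened later -- is silently discarded.
w1 = Write("balance=500", timestamp=1000.0, node_id="A")
w2 = Write("balance=300", timestamp=995.0, node_id="B")  # written after w1
assert lww_merge(w1, w2).value == "balance=500"
```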

Version vectors / vector clocks. Instead of timestamps, track a version counter per node. When versions diverge (concurrent writes), the system detects the conflict and either merges automatically or asks the application to resolve it. Amazon's original Dynamo paper used this approach. DynamoDB later switched to LWW because vector clocks added too much metadata overhead at scale.
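To see how concurrency detection works, here is a minimal version-vector comparison in Python. Each vector maps a node id to that node's write counter; two versions are concurrent exactly when each is ahead of the other on at least one node:

```python
def compare(vv_a: dict, vv_b: dict) -> str:
    """Compare two version vectors: 'a_newer', 'b_newer', 'equal', or 'concurrent'."""
    nodes = set(vv_a) | set(vv_b)
    a_ahead = any(vv_a.get(n, 0) > vv_b.get(n, 0) for n in nodes)
    b_ahead = any(vv_b.get(n, 0) > vv_a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "concurrent"  # conflict: neither version subsumes the other
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"

# Node A and Node B each applied a write the other has not seen yet:
assert compare({"A": 2, "B": 1}, {"A": 1, "B": 2}) == "concurrent"
# A's vector strictly dominates, so A's version is simply newer:
assert compare({"A": 3, "B": 2}, {"A": 1, "B": 2}) == "a_newer"
```

Unlike LWW, the `"concurrent"` result surfaces the conflict instead of hiding it; the application then decides how to merge the two versions.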

CRDTs (Conflict-free Replicated Data Types). Design your data structures so that concurrent updates can always be merged automatically without conflicts. A counter that only increments can be safely merged by taking the max from each node. Redis Enterprise uses CRDTs for its active-active geo-replication feature.
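The increment-only counter mentioned above is the classic G-counter CRDT. A minimal sketch in Python: each node increments its own slot, and merging takes the per-slot maximum, so replicas converge no matter how merges are ordered:

```python
class GCounter:
    """Grow-only counter CRDT: one slot per node, merge takes per-slot max."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

a, b = GCounter("A"), GCounter("B")
a.increment(3)  # 3 increments applied in one data center
b.increment(2)  # 2 increments applied concurrently in another
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5  # both replicas converge
```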

  • Last Writer Wins — highest timestamp wins. Pro: simple, fast. Con: silent data loss. Used by: Cassandra, DynamoDB.
  • Version vectors — detect concurrent writes. Pro: no silent loss. Con: metadata overhead. Used by: Riak, original Dynamo.
  • CRDTs — auto-merge via math. Pro: always converges. Con: limited data types. Used by: Redis Enterprise, Figma.

Figure 2: The three main approaches to conflict resolution in active-active systems. Each trades off simplicity, correctness, and applicability differently.

How Does CockroachDB Do It Differently?

CockroachDB takes a fundamentally different approach: it avoids conflicts entirely by using Raft consensus per data range. Your data is split into ranges (default 512MB each). Each range has a Raft group with a leader and followers. Writes to a range go through Raft consensus, so only one write can succeed at a time per range. There are no conflicts to resolve because conflicting writes are serialized.

The trick is that different ranges can have leaders in different data centers. Range A's leader might be in US-East while Range B's leader is in EU-West. So both data centers are actively handling writes — just for different ranges. This gives you active-active write throughput without conflict resolution complexity.
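A minimal sketch of that routing idea in Python. The split points and region names are hypothetical; the point is that the keyspace is partitioned into ranges and each range's writes are serialized by a leader that may live in a different region:

```python
import bisect

# Hypothetical range map: split points partition the keyspace, and each
# range's Raft leader is pinned to the region closest to its traffic.
split_points = ["g", "p"]                      # ranges: [min,"g"), ["g","p"), ["p",max)
range_leaders = ["us-east", "eu-west", "us-east"]

def leader_for(key: str) -> str:
    """Find which region's leader serializes writes for this key's range."""
    return range_leaders[bisect.bisect_right(split_points, key)]

assert leader_for("alice") == "us-east"   # falls in the first range
assert leader_for("henri") == "eu-west"   # falls in the second range
```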

CockroachDB can run thousands of Raft groups simultaneously. Each range independently elects a leader and replicates writes, giving you the latency benefits of active-active for ranges whose leader is local.

DynamoDB Global Tables: AP at Planet Scale

DynamoDB Global Tables takes the AP approach. You create a table, enable global tables, and pick which regions to replicate to. Writes to any region are accepted immediately and replicated asynchronously to all other regions. Replication typically completes within one second.

Conflict resolution is LWW based on the item's timestamp. If two regions write to the same item within the replication window, the last timestamp wins. DynamoDB also supports conditional writes — you can write only if a condition is met (like a version number matching), which lets you build optimistic concurrency control on top of the AP model.
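To illustrate the optimistic concurrency pattern without AWS credentials, here is an in-memory sketch in Python that mimics a version-conditioned write (the `VersionedStore` class and method names are illustrative, not the DynamoDB API):

```python
class VersionedStore:
    """In-memory sketch of conditional writes with a version counter."""

    def __init__(self):
        self.items: dict[str, tuple] = {}  # key -> (value, version)

    def get(self, key: str) -> tuple:
        return self.items.get(key, (None, 0))

    def put_if_version(self, key: str, value, expected_version: int) -> bool:
        """Write only if the stored version matches what we read --
        a compare-and-set in the spirit of a conditional write."""
        _, current = self.get(key)
        if current != expected_version:
            return False  # lost the race: re-read and retry
        self.items[key] = (value, current + 1)
        return True

store = VersionedStore()
_, v = store.get("balance")
assert store.put_if_version("balance", 500, v)      # first writer succeeds
assert not store.put_if_version("balance", 300, v)  # stale version is rejected
```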

The Replication Lag Problem

Even without conflicts, replication lag creates headaches. A user updates their profile in US-East, then immediately reads it from EU-West. If the replication has not arrived yet, they see their old profile. Solutions include:

  • Read-your-writes consistency: Route a user's reads to the same node that handled their writes. Sticky sessions based on user ID.
  • Causal consistency: Track dependencies between operations. If write B depends on write A, ensure B is not visible until A has been replicated.
  • Conflict-free by design: Structure your data so that different regions own different data. User 42's data lives in US-East; user 99's data lives in EU-West. No cross-region conflicts possible.
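The sticky-routing idea from the first bullet can be sketched in a few lines of Python. The region names are hypothetical; what matters is that the mapping from user to replica is deterministic, so a user's reads always land on the node that accepted their writes:

```python
import hashlib

REPLICAS = ["us-east", "eu-west"]  # hypothetical region names

def home_region(user_id: str) -> str:
    """Deterministically pin each user to one region so their reads
    hit the same node that handled their writes."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return REPLICAS[digest[0] % len(REPLICAS)]

# The same user always routes to the same replica:
assert home_region("user-42") == home_region("user-42")
```

The same hash can drive the "conflict-free by design" bullet: if region ownership follows `home_region`, writes for a given user only ever originate in one region.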

When Is Active-Active Worth the Complexity?

Active-active is worth it when you need low-latency writes from multiple geographic regions and can tolerate the complexity of conflict resolution. It is overkill if all your users are in one region or if your writes are low-volume.

The pragmatic path: start with active-passive. When you expand to multiple regions and write latency becomes a problem, move to active-active for the specific tables or data that need it. Keep your conflict resolution strategy as simple as possible — LWW covers most use cases.

Cassandra processes over 1 million writes per second at Apple, running active-active across multiple data centers. LWW is their conflict resolution strategy — simple, proven, and good enough for most workloads.
