Every database has a breaking point. It does not matter how well you have tuned your queries, how many indexes you have added, or how expensive your hardware is. At some point, a single machine simply cannot keep up with the volume of reads and writes your application demands. That is when you shard.
But sharding is not a checkbox you tick. It is a fundamental architectural shift that changes how you write queries, how you handle transactions, and how you think about your data model. Let us dig into why databases break, how the biggest companies solve it, and what it actually costs you.
Why Your Single Database Will Eventually Break
A single PostgreSQL instance has hard limits that no amount of configuration tuning can eliminate. The default
max_connections is 100, and even bumping it to 500 starts causing contention on the process table. Each connection
is a separate OS process in PostgreSQL, consuming roughly 5-10MB of RAM. At 500 connections, that is 2.5-5GB just for connection
overhead before you have stored a single row of data.
Write throughput hits a ceiling around 10,000 transactions per second for write-heavy workloads. Every write must go through the Write-Ahead Log (WAL), and the WAL is sequential. You cannot parallelize it. When your WAL is the bottleneck, adding more CPU cores does not help. Autovacuum adds further pressure, competing with your application for I/O and CPU cycles. On tables with hundreds of millions of rows, vacuum operations can take hours and cause visible latency spikes.
Connection poolers like PgBouncer help, but they are band-aids. Read replicas help, but only for read-heavy workloads. If your bottleneck is writes, replicas do nothing for you. Every replica still replicates every write from the primary.
The $24K/Month Server That Still Is Not Enough
Vertical scaling has a hard ceiling, and the cost curve is brutal. AWS's largest RDS instance, the db.r6g.16xlarge,
gives you 512GB of RAM and 64 vCPUs for roughly $24,000 per month. That sounds like a lot of machine. And it is, until you have
a multi-terabyte dataset with write volumes that saturate the EBS bandwidth.
You cannot keep buying bigger servers forever. Even if you could, cost-per-gigabyte at the top end of the hardware market climbs steeply: a server with twice the RAM often costs five to ten times more. At some point, it becomes cheaper to run four mid-range machines than one monster machine. That is horizontal scaling. That is sharding.
Vertical scaling hits a hard ceiling where hardware simply does not exist or costs become prohibitive. Horizontal sharding spreads load across commodity machines that scale linearly.
Hash vs. Range: Two Ways to Split Your Data
The core question in sharding is: given a row of data, which shard does it belong to? There are two primary strategies, and each has failure modes that will bite you if you are not careful.
Hash-based sharding takes your shard key (say, user_id), runs it through a hash function, and
uses modulo to pick a shard: shard = hash(user_id) % N. This gives you excellent distribution. Even if your user
IDs are sequential, the hash function scatters them uniformly across shards.
But here is the problem with simple modulo: when N changes, almost every key maps to a different shard. If you go from 4 shards
to 5, roughly 80% of your data needs to move. That is why production systems use consistent hashing instead.
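A quick simulation makes the cost concrete. This sketch uses a stable SHA-256-based hash as a stand-in for whatever hash function a real router would use, and counts how many keys change shards when a modulo-sharded cluster grows from 4 to 5:

```python
# Illustrative sketch: data movement under modulo sharding when N changes.
import hashlib

def shard_for(key: int, n_shards: int) -> int:
    # Stable hash; Python's built-in hash() is salted per process,
    # so a real router would use something deterministic like this.
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_shards

keys = range(100_000)
moved = sum(1 for k in keys if shard_for(k, 4) != shard_for(k, 5))
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 80%
```

A key stays put only when its hash gives the same remainder mod 4 and mod 5, which happens for about 4 residues out of every 20, so roughly 80% of keys move.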
With consistent hashing, adding a shard only requires moving about 1/N of the data. Vitess, YouTube's sharding
layer for MySQL, uses vindexes (virtual indexes) to map shard keys to keyspace IDs on a consistent hash ring,
making resharding a manageable operation instead of a data migration nightmare.
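For intuition, here is a minimal consistent-hash ring. This is an illustrative toy, not Vitess's actual vindex implementation: nodes and keys hash onto the same ring, each key belongs to the first node clockwise from its position, and virtual nodes smooth out the distribution.

```python
# Toy consistent-hash ring: adding a node moves only the keys that now
# fall between the new node's points and their predecessors.
import bisect
import hashlib

def h(s: str) -> int:
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, nodes, vnodes=100):
        # vnodes: virtual points per physical node, smoothing the distribution
        self.points = sorted((h(f"{n}:{i}"), n) for n in nodes for i in range(vnodes))
        self.hashes = [p for p, _ in self.points]

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash (wrapping at the top)
        i = bisect.bisect(self.hashes, h(key)) % len(self.points)
        return self.points[i][1]

before = Ring(["s1", "s2", "s3", "s4"])
after = Ring(["s1", "s2", "s3", "s4", "s5"])
keys = [f"user{k}" for k in range(100_000)]
moved = sum(1 for k in keys if before.node_for(k) != after.node_for(k))
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 1/5
```

Note that the only keys that move are the ones the new node takes over; assignments among the original four nodes are untouched.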
Range-based sharding divides the key space into contiguous ranges. Users 1-1M on Shard 1, 1M-2M on Shard 2, and so on. Range queries are fast because you know exactly which shard to ask. But monotonically increasing keys (auto-increment IDs, timestamps) create hot shards. All new writes slam into the last shard while the others sit idle. CockroachDB handles this elegantly: it auto-splits ranges when they exceed 512MB, then automatically rebalances the resulting ranges across nodes. No manual intervention required.
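Range routing itself is simple: a sorted list of lower bounds and a binary search are enough. This sketch (boundaries hypothetical) also shows why monotonically increasing keys hurt, since every fresh ID falls past the last boundary:

```python
# Sketch of range-based routing: shard boundaries are lower bounds,
# and bisect finds the owning shard in O(log N).
import bisect

boundaries = [1, 1_000_000, 2_000_000, 3_000_000]  # lower bound of each shard

def shard_for(user_id: int) -> int:
    return bisect.bisect_right(boundaries, user_id) - 1

print(shard_for(42))         # shard 0
print(shard_for(2_500_000))  # shard 2
print(shard_for(9_999_999))  # shard 3: every new auto-increment ID lands here
```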
How Instagram Fits 300 Million Mappings Into 12 Shards
Instagram's sharding strategy is one of the most referenced case studies in the industry, and for good reason. They stored 300 million user-to-photo mappings across just 12 PostgreSQL shards using a clever combination of logical and physical sharding.
The key insight is their custom ID generation scheme. Every Instagram ID encodes three pieces of information: a timestamp (41 bits, milliseconds since a custom epoch), the logical shard ID (13 bits, supporting up to 8,192 logical shards), and a per-shard auto-incrementing sequence (10 bits, supporting 1,024 IDs per millisecond per shard). This means you can extract the shard from the ID itself, no lookup required.
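The bit layout described above can be reconstructed in a few lines. The epoch constant below is illustrative, not Instagram's actual value:

```python
# Reconstruction of the Instagram-style 64-bit ID layout: 41 bits of
# milliseconds since a custom epoch, 13 bits of logical shard ID,
# 10 bits of per-shard sequence.
EPOCH_MS = 1_293_840_000_000  # hypothetical custom epoch (2011-01-01 UTC)

def make_id(now_ms: int, shard_id: int, seq: int) -> int:
    elapsed = now_ms - EPOCH_MS
    return (elapsed << 23) | ((shard_id & 0x1FFF) << 10) | (seq % 1024)

def shard_of(item_id: int) -> int:
    # The shard is recoverable from the ID itself, no lookup table needed
    return (item_id >> 10) & 0x1FFF

item = make_id(EPOCH_MS + 5_000, shard_id=1341, seq=7)
assert shard_of(item) == 1341
```

Because the timestamp occupies the high bits, IDs also sort chronologically, which keeps time-ordered index scans cheap.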
To find a user's shard, they compute user_id % 2000 to get a logical shard ID. Those 2,000 logical shards are
mapped to a much smaller number of physical PostgreSQL servers. Each physical server hosts many logical shards as separate
PostgreSQL schemas. When they need to scale, they move schemas between machines without changing the routing logic.
Instagram maps user IDs to logical shards via modulo, then groups many logical shards onto fewer physical PostgreSQL servers. Scaling means moving schemas, not rewriting routes.
This design is brilliant because it separates the routing decision from the physical topology. The application always computes
user_id % 2000 and that never changes. What changes is which physical server hosts that logical shard. You can
double your physical servers and rebalance without touching a line of application code.
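In code, the split looks something like this sketch, where the server names and four-server topology are hypothetical: the modulo never changes, only the routing table does.

```python
# Sketch of the logical-to-physical split: the application always computes
# user_id % 2000; a routing table maps each logical shard to whichever
# physical server currently hosts its schema.
N_LOGICAL = 2000

# Rebalancing edits this table; application code never changes.
logical_to_physical = {ls: f"pg-{ls % 4}" for ls in range(N_LOGICAL)}

def server_for(user_id: int) -> str:
    return logical_to_physical[user_id % N_LOGICAL]

print(server_for(31_341_031))  # "pg-3" under this hypothetical table
```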
How Discord Handles Trillions of Messages Without Breaking a Sweat
Discord started with MongoDB but quickly hit its limits as they scaled past millions of concurrent users. They migrated to Apache Cassandra, and later to ScyllaDB (a C++ rewrite of Cassandra), to handle their messaging workload. The numbers are staggering: trillions of messages across their cluster, serving 120 million monthly active users.
Their partition key design is the key to making it work: (channel_id, bucket). The bucket is a time-based
segment, so each partition contains messages from a specific channel within a specific time window. This prevents any single
partition from growing unbounded. A popular channel with years of history is spread across hundreds of time-bucketed partitions.
This fits Discord's access pattern perfectly. Users almost always read recent messages in a single channel. That query hits exactly one partition. Scrolling back in history means reading from an older bucket on the same node (or a different one, but still a single-partition read). Cross-channel queries are rare in the Discord use case, so cross-shard operations almost never happen.
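A sketch of the idea, using a hypothetical 10-day bucket width rather than Discord's actual value:

```python
# Time-bucketed partitioning in the spirit of Discord's (channel_id, bucket)
# key: every message in the same channel and same time window shares a
# partition, so no partition grows without bound.
BUCKET_MS = 10 * 24 * 60 * 60 * 1000  # 10 days in milliseconds (illustrative)

def partition_key(channel_id: int, message_ts_ms: int) -> tuple:
    return (channel_id, message_ts_ms // BUCKET_MS)

# Two messages minutes apart in one channel hit the same partition,
# so the common "read recent messages" query is a single-partition read.
a = partition_key(801, 1_700_000_000_000)
b = partition_key(801, 1_700_000_300_000)
assert a == b
```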
Cross-Shard Joins: The Query That Brings Down Your System
Here is the ugly truth about sharding: SELECT * FROM users JOIN orders ON users.id = orders.user_id is trivial in a single database. In a sharded system, it is a nightmare. If users and orders are sharded on different keys, or on the same key but without guaranteed co-location on the same physical shards, this join becomes a scatter-gather operation.
A scatter-gather query fans out to every shard, each shard executes its portion of the query, and the results are collected and merged in the application layer (or a middleware like Vitess). The cost is O(N) where N is the number of shards. With 4 shards, it is annoying. With 400 shards, it is a system-killing operation that can take seconds even for simple queries.
A cross-shard query fans out to every shard, collects partial results, and merges them. Total latency is bounded by the slowest shard, and cost grows linearly with shard count.
Worse, the total latency is not the average shard response time. It is the maximum. If 99 shards respond in 5ms but one shard is under heavy load and takes 200ms, your query takes 200ms. The more shards you have, the more likely one of them is having a bad moment.
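A toy model shows how quickly the tail catches up with you. Assume each shard answers in about 5ms but has a 1% chance of a 200ms hiccup; the numbers are invented for illustration.

```python
# Toy model of scatter-gather tail latency: total query time is the *max*
# of the per-shard latencies, so more shards means more exposure to any
# one shard's bad moment.
import random

random.seed(1)

def query_time(n_shards: int) -> float:
    # Each shard usually answers in ~5ms, but 1% of the time takes ~200ms
    per_shard = [200.0 if random.random() < 0.01 else 5.0 for _ in range(n_shards)]
    return max(per_shard)

for n in (4, 40, 400):
    slow = sum(query_time(n) > 100 for _ in range(1000)) / 1000
    print(f"{n:3d} shards: {slow:.0%} of queries hit the slow path")
```

The math behind the simulation: the probability that at least one shard is slow is 1 - 0.99^N, which is about 4% at 4 shards but about 98% at 400.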
The escape hatches are limited. Co-locate related data on the same shard by using the same shard key for related tables. Denormalize aggressively so each shard has the data it needs without joining. Precompute common aggregations into materialized views. None of these are free. They all trade write complexity and storage for read performance.
Distributed Transactions: The 2PC Tax
In a single database, transactions are straightforward. ACID guarantees come for free. In a sharded system, a transaction that spans multiple shards requires a distributed protocol. The standard approach is Two-Phase Commit (2PC): a coordinator asks all shards to prepare, waits for all of them to confirm, then tells them all to commit.
2PC works, but it is slow. Every transaction requires two network round trips, and all shards must hold locks until the coordinator confirms. If the coordinator crashes between the prepare and commit phases, all participating shards are stuck holding locks indefinitely. This is the classic blocking problem of 2PC.
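The protocol itself is small enough to sketch with in-memory stand-ins for shards (all names hypothetical). What the sketch cannot show is the failure mode: a real coordinator that dies between the two phases leaves every prepared shard stuck holding locks.

```python
# Minimal two-phase commit sketch: phase 1 asks every shard to prepare;
# only if all vote yes does phase 2 commit. Any "no" vote aborts everywhere.
class Shard:
    def __init__(self, name, healthy=True):
        self.name, self.healthy, self.state = name, healthy, "idle"

    def prepare(self) -> bool:
        # Vote yes only if the write can be made durable; a prepared shard
        # holds its locks until the coordinator's decision arrives.
        self.state = "prepared" if self.healthy else "aborted"
        return self.healthy

def two_phase_commit(shards) -> str:
    votes = [s.prepare() for s in shards]      # phase 1: prepare everywhere
    if all(votes):
        for s in shards:
            s.state = "committed"              # phase 2: commit everywhere
        return "committed"
    for s in shards:
        s.state = "aborted"                    # phase 2: abort everywhere
    return "aborted"

print(two_phase_commit([Shard("a"), Shard("b")]))                 # committed
print(two_phase_commit([Shard("a"), Shard("b", healthy=False)]))  # aborted
```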
Google Spanner takes a radically different approach. TrueTime, an API backed by GPS receivers and atomic clocks in every data center, gives each transaction a globally consistent timestamp with bounded uncertainty. Spanner still runs two-phase commit for cross-shard writes, but it layers the protocol over Paxos groups so a crashed coordinator is replaced rather than blocking, and TrueTime lets it serve externally consistent reads across shards without holding locks. Spanner achieves 99.999% availability, which is less than 5.3 minutes of downtime per year, across globally distributed shards.
Vitess: Sharding as a Service
Not everyone can build a custom sharding layer. Vitess, originally built at YouTube, is now a CNCF graduated project that adds transparent sharding to MySQL. It powers YouTube, GitHub, Slack, and Square, managing fleets of thousands of MySQL instances across these deployments.
Vitess sits between your application and MySQL. Your application sends queries to Vitess as if it were a single MySQL instance. Vitess parses the query, determines which shards need to be involved (using vindexes to map shard keys), routes the query to the right shards, and merges the results. It also handles connection pooling (solving MySQL's connection limit problem), query rewriting, and online schema migrations across all shards simultaneously.
The most impressive feature is online resharding. Vitess can split a shard into two or merge two shards into one while the system is live, serving traffic, with no downtime. It uses a combination of filtered replication and traffic cutover to make the transition seamless.
Resharding: The Migration Nobody Wants to Do
Adding shards to a live system is one of the hardest operational challenges in distributed databases. The safest approach is the double-write strategy: you write to both the old shard layout and the new shard layout simultaneously, backfill the new shards with historical data, verify consistency, then cut over reads to the new layout.
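Stripped to its essentials, the double-write pattern looks like this sketch, with the two layouts modeled as plain dicts and all names hypothetical:

```python
# Double-write cutover: during migration every write goes to both layouts;
# reads stay on the old one until backfill and verification finish, then
# flip to the new one.
old_layout, new_layout = {}, {}  # stand-ins for the two shard topologies
reads_cut_over = False

def write(key, value):
    old_layout[key] = value  # old layout stays authoritative for reads...
    new_layout[key] = value  # ...while the new layout shadows every write

def read(key):
    store = new_layout if reads_cut_over else old_layout
    return store[key]

write("user:1", "alice")
# backfill historical rows into new_layout, verify both layouts match, then:
reads_cut_over = True
assert read("user:1") == "alice"
```

The flip is a single flag, which is what makes rollback cheap: if verification fails, you point reads back at the old layout and nothing is lost.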
Another technique is ghost tables, popularized by GitHub's gh-ost tool. You create the new table structure alongside the old one, stream changes from the old table to the new one via the binary log, and atomically rename the tables when they are in sync. This avoids long-held locks during the migration; the final cut-over needs only a brief one.
CockroachDB and TiDB sidestep much of this pain by auto-splitting ranges at configurable thresholds (512MB for CockroachDB by default) and automatically rebalancing across nodes. You add a node, and the system figures out which ranges to move there. No manual resharding required. This is the future of sharding: systems that handle it for you.
The Bottom Line
Sharding is not a feature you add. It is an architectural decision that permeates every layer of your system: your data model, your query patterns, your transaction boundaries, your deployment pipeline, and your on-call runbooks. It solves the problem of a single machine not being enough. But it creates a dozen new problems in the process.
Before you shard, exhaust every alternative. Read replicas, connection pooling, caching, query optimization, archiving cold data. If you have genuinely outgrown a single machine, study how Instagram, Discord, and YouTube did it. Pick a shard key that matches your dominant access pattern. Use logical shards that map to physical servers so you can rebalance without rearchitecting. And if you can, let the database handle sharding for you with systems like CockroachDB, TiDB, or Vitess.
The best shard is the one you never had to create. But when you need to create one, make sure you understand the full cost of the decision. Sharding is a one-way door.
References and Further Reading
- Sharding & IDs at Instagram — Instagram Engineering Blog
- How Discord Stores Trillions of Messages — Discord Engineering Blog
- How Discord Stores Billions of Messages (MongoDB to Cassandra migration) — Discord Engineering Blog
- Vitess Architecture Overview — vitess.io
- Spanner: Google's Globally-Distributed Database — Google Research (OSDI 2012)
- Distribution Layer (Range Splitting and Rebalancing) — CockroachDB Architecture Docs
- Connection Settings (max_connections and tuning) — PostgreSQL Documentation
- MySQL High Availability at GitHub (Vitess adoption) — GitHub Engineering Blog
- gh-ost: GitHub's Online Schema Migration Tool for MySQL — GitHub
- Designing Data-Intensive Applications — Martin Kleppmann (O'Reilly, 2017)