How Does Discord Store Trillions of Messages?

MongoDB: The First Year

Discord launched in 2015 with MongoDB. Single replica set, simple document model. Messages were documents with a channel_id, author, content, and timestamp. It worked perfectly for the first 100 million messages. Then the indexes stopped fitting in RAM. Read latency spiked. MongoDB's WiredTiger engine struggled with the write amplification of constantly growing indexes.

By late 2015, Discord had passed 100 million stored messages and was growing fast. The team needed a database that could scale writes horizontally and that suited time-ordered, append-heavy data. They chose Cassandra.

Cassandra: The Middle Years

Cassandra was a natural fit. Messages are time-series data: you almost always read the most recent messages in a channel. Discord's partition key was (channel_id, bucket), where bucket is a fixed time window (e.g., 10 days). Within each partition, messages are sorted by snowflake ID, which is time-ordered. A read for "latest 50 messages in #general" usually hits a single partition, which keeps it fast.
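The bucket arithmetic can be sketched in a few lines. This is a minimal illustration, not Discord's production code: it assumes the snowflake keeps its millisecond timestamp in the bits above the low 22, measured from an epoch at the start of 2015, and it uses the 10-day window from the example above. The helper make_snowflake and its simplified worker/sequence layout are hypothetical and exist only to build test IDs.

```python
# Assumption: Discord epoch = 2015-01-01T00:00:00Z, in Unix milliseconds.
DISCORD_EPOCH_MS = 1420070400000
# Assumption: 10-day bucket window, matching the example in the text.
BUCKET_SIZE_MS = 10 * 24 * 60 * 60 * 1000


def snowflake_timestamp_ms(snowflake: int) -> int:
    """Milliseconds since the Discord epoch (the bits above the low 22)."""
    return snowflake >> 22


def make_bucket(snowflake: int) -> int:
    """Index of the time window that this message ID falls into."""
    return snowflake_timestamp_ms(snowflake) // BUCKET_SIZE_MS


def make_snowflake(unix_ms: int, worker: int = 0, sequence: int = 0) -> int:
    """Hypothetical helper: pack a timestamp, a 10-bit worker field, and a
    12-bit sequence into one integer. The real layout splits the middle
    bits further; that detail does not affect the bucket."""
    return ((unix_ms - DISCORD_EPOCH_MS) << 22) | (worker << 12) | sequence


if __name__ == "__main__":
    msg_id = make_snowflake(DISCORD_EPOCH_MS + 25 * 24 * 60 * 60 * 1000)
    print(make_bucket(msg_id))  # 25 days after the epoch falls in bucket 2
```

Because the bucket is derived from the message ID alone, a writer and a reader can both compute the partition key without any lookup.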

Discord ran Cassandra for 6 years and grew to trillions of messages across 177 nodes. But problems accumulated:

GC pauses: Cassandra runs on the JVM. At Discord's scale, garbage collection pauses caused latency spikes of 5-10 seconds. They tried tuning GC (G1, ZGC, Shenandoah), but the fundamental problem was JVM memory management.
Compaction storms: Cassandra's LSM-tree storage engine needs periodic compaction (merging SSTables). During compaction, read latency spikes because the disk is busy. With trillions of messages, compaction ran constantly.
Hot partitions: popular Discord servers (100K+ members) created hot partitions that overwhelmed individual Cassandra nodes.
Repair complexity: Cassandra's anti-entropy repair process is slow and resource-intensive at scale. Running repair across 177 nodes was a multi-day operation.

Discord's engineering blog: "We were spending a disproportionate amount of engineering effort maintaining Cassandra. GC tuning, compaction configuration, repair scheduling — it was a full-time job for multiple engineers."

ScyllaDB: The Migration

In 2022, Discord migrated to ScyllaDB — a drop-in replacement for Cassandra written in C++ instead of Java. Same CQL query language, same data model, same drivers. But no JVM, no garbage collection pauses, and a completely different I/O architecture.

ScyllaDB uses a shard-per-core architecture. Each CPU core owns a shard of data and handles its own I/O, with no locks and no shared state between cores. Because ScyllaDB is written in C++, there is no garbage collector to pause, and the shard-per-core design removes cross-core contention, which together make performance far more predictable. Discord went from 177 Cassandra nodes to 72 ScyllaDB nodes with better performance.
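The ownership idea behind shard-per-core can be shown with a toy model. This sketch is not ScyllaDB's actual token-to-shard algorithm and it runs single-threaded; it only illustrates that every partition key maps deterministically to exactly one shard, so no two cores ever need to coordinate over the same data. The names shard_for and ToyNode are hypothetical.

```python
import hashlib


def shard_for(partition_key: tuple, num_shards: int) -> int:
    """Map a partition key to one shard with a stable hash.
    Assumption: a plain hash-modulo stands in for the real token mapping."""
    digest = hashlib.blake2b(repr(partition_key).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_shards


class ToyNode:
    """One node with one private store per CPU core (shard)."""

    def __init__(self, num_cores: int):
        # Each dict is touched only by its owning shard: no shared state.
        self.shards = [dict() for _ in range(num_cores)]

    def _owner(self, key: tuple) -> dict:
        return self.shards[shard_for(key, len(self.shards))]

    def write(self, channel_id: int, bucket: int, message_id: int, content: str) -> None:
        key = (channel_id, bucket)
        self._owner(key).setdefault(key, {})[message_id] = content

    def read(self, channel_id: int, bucket: int) -> dict:
        key = (channel_id, bucket)
        return self._owner(key).get(key, {})


if __name__ == "__main__":
    node = ToyNode(num_cores=8)
    node.write(channel_id=42, bucket=7, message_id=1001, content="hello")
    print(node.read(42, 7))
```

In a real shard-per-core system each shard would run on its own pinned thread with its own I/O queue; the routing step is the part this sketch captures.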

p99 read latency: dropped from 40-125ms (Cassandra) to 15ms (ScyllaDB)
p99 write latency: dropped from 5-70ms to 5ms
Node count: 177 → 72 (59% reduction)
GC pauses: eliminated entirely

The Data Model

Discord's message table (simplified):

CREATE TABLE messages (
    channel_id bigint,
    bucket int,
    message_id bigint,  -- snowflake: timestamp + worker + sequence
    author_id bigint,
    content text,
    PRIMARY KEY ((channel_id, bucket), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

The partition key (channel_id, bucket) groups messages by channel and time window. The clustering key message_id, ordered DESC, sorts newest first. Reading the latest 50 messages is a single-partition range scan, one of the cheapest reads Cassandra or ScyllaDB can serve. A quiet channel may need a few older buckets as well to fill the page.
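The read path for "latest N messages" can be sketched as a walk over buckets from newest to oldest. This is a minimal in-memory stand-in, not a real driver and not Discord's service code: MessageStore and latest_messages are hypothetical names, and each read_partition call plays the role of one single-partition query against the table above.

```python
from collections import defaultdict


class MessageStore:
    """In-memory stand-in for the messages table (hypothetical)."""

    def __init__(self):
        # (channel_id, bucket) -> {message_id: content}
        self.partitions = defaultdict(dict)

    def insert(self, channel_id, bucket, message_id, content):
        self.partitions[(channel_id, bucket)][message_id] = content

    def read_partition(self, channel_id, bucket, limit):
        # Plays the role of:
        #   SELECT message_id, content FROM messages
        #   WHERE channel_id = ? AND bucket = ? LIMIT ?
        # The table's clustering order already returns newest first.
        rows = self.partitions.get((channel_id, bucket), {})
        return sorted(rows.items(), reverse=True)[:limit]


def latest_messages(store, channel_id, newest_bucket, oldest_bucket, limit=50):
    """Collect up to `limit` messages, newest first, one partition at a time."""
    result = []
    for bucket in range(newest_bucket, oldest_bucket - 1, -1):
        result.extend(store.read_partition(channel_id, bucket, limit - len(result)))
        if len(result) >= limit:
            break  # an active channel usually stops after the first bucket
    return result


if __name__ == "__main__":
    store = MessageStore()
    store.insert(1, 3, 31, "newest")
    store.insert(1, 3, 30, "newer")
    store.insert(1, 2, 22, "older")
    print(latest_messages(store, channel_id=1, newest_bucket=3, oldest_bucket=0, limit=3))
```

For a busy channel the loop ends after one partition read; the extra iterations only matter for channels whose current bucket holds fewer messages than the page size.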

The Lesson

Discord's journey shows that database choices are not permanent. MongoDB worked at launch. Cassandra worked for 6 years. ScyllaDB works now. Each migration solved the current bottleneck. The key enabler: a clean data model that mapped naturally across all three databases, and the engineering willingness to migrate trillions of records when the current solution hit its limits.