Here is a fact that should bother you: sequential disk I/O can be faster than random memory access. Not by a little. Sequential writes on a modern disk array hit roughly 600 MB/s. Random writes on that same disk? About 100 KB/s — a difference of over 6,000x. And in published benchmarks, a linear scan of a disk has outpaced random access to RAM. Apache Kafka was built on this single, counterintuitive insight, and it changed how the entire industry thinks about messaging.
Kafka started at LinkedIn around 2010. The problem was deceptively simple: LinkedIn had activity data — profile views, searches, page clicks, connection requests — and dozens of systems that needed it. They tried traditional message queues. They tried point-to-point pipelines. Everything buckled. Traditional queues delete messages after consumption, so you cannot have ten systems reading the same stream. They keep messages in memory, so they choke when the volume exceeds what RAM can hold. And they were designed for thousands of messages per second, not millions.
Today, LinkedIn processes 7 trillion messages per day through Kafka. Netflix ingests 8 million events per second. Apple runs one of the largest deployments on Earth — 1.5 petabytes of data flowing through Kafka daily. These numbers are not marketing. They are the direct consequence of a handful of engineering decisions that we are going to take apart.
Why Is Sequential Disk I/O Faster Than Random Memory Access?
Most messaging systems store messages in complex data structures in memory — trees, hash maps, priority queues. This means random memory access patterns, pointer chasing, and garbage collection pauses. Kafka does something radical: it writes everything to a plain file on disk. Sequentially. Append-only.
Why does this work? Modern operating systems aggressively read-ahead on sequential access patterns. When you read byte 1000, the OS pre-fetches bytes 1001 through 5000 into the page cache because it predicts you will want them next. Disk hardware itself is optimized for this — HDDs avoid seek time, and SSDs avoid write amplification. The result: Kafka achieves 605 MB/s producer throughput and 1.1 GB/s consumer throughput on a single broker running commodity hardware. That is faster than most in-memory systems.
A Kafka partition is literally a directory of segment files on disk. Each segment is an append-only log. Producers append to the end. Consumers read at their own pace from whatever offset they choose. No locks. No complex indexing. Just sequential I/O and an offset pointer.
Kafka's append-only log. Producers write sequentially to the tail. Each consumer tracks its own offset and reads independently. No message is ever deleted on consumption — only by retention policy.
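The mechanics are simple enough to sketch. Below is a minimal, illustrative append-only log in Python — not Kafka's implementation, just the core idea: producers append to the tail, and each consumer advances its own offset independently, so the same records can be read by any number of consumers without ever being deleted.

```python
class PartitionLog:
    """Toy append-only log: one list per partition, offsets are list indices."""

    def __init__(self):
        self.records = []    # the log itself; never mutated, only appended
        self.offsets = {}    # consumer_id -> next offset to read

    def append(self, record):
        """Producer path: write to the tail, return the record's offset."""
        self.records.append(record)
        return len(self.records) - 1

    def poll(self, consumer_id, max_records=10):
        """Consumer path: read from this consumer's own offset; nothing is deleted."""
        start = self.offsets.get(consumer_id, 0)
        batch = self.records[start:start + max_records]
        self.offsets[consumer_id] = start + len(batch)
        return batch

log = PartitionLog()
for event in ["view", "click", "search"]:
    log.append(event)

# Two independent consumers read the same records at their own pace.
assert log.poll("billing") == ["view", "click", "search"]
assert log.poll("analytics", max_records=2) == ["view", "click"]
```

Note what is absent: no locks, no per-message bookkeeping, no deletion on read. Retention (dropping old segments) would be the only way records ever leave the log.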
Zero-Copy: The Trick That Makes Kafka Absurdly Fast
When a traditional application sends data from disk to a network socket, the data takes a scenic tour through your system. It gets read from disk into the kernel's page cache. Then it is copied from the kernel buffer into the application's user-space buffer. The application looks at it, does nothing useful, and copies it back into a kernel-space socket buffer. Finally the kernel copies it to the network card. That is four copies across two system calls (read() and send()), each crossing the user-kernel boundary both ways.
Kafka skips all of that. It uses the sendfile() system call — a Linux feature that tells the kernel: "take this data from the page cache and send it directly to this network socket." The data never enters user space. Two copies instead of four, in a single system call. This is called zero-copy transfer, and it is one of the biggest reasons Kafka can saturate a 10 Gbps network link while barely touching the CPU.
There is a second layer to this. Kafka does not manage its own cache. Most databases and messaging systems maintain a carefully tuned in-process cache — for a JVM application like Kafka, that would mean heap-resident data, garbage collection, memory management overhead, and every byte stored twice: once in the process cache and once in the OS page cache. Kafka said: forget it. Just let the operating system handle caching. The OS page cache is faster, larger (it uses all available RAM), and does not trigger GC pauses. When Kafka "reads from disk," it is usually reading from RAM anyway — the OS page cache has already pre-loaded it.
Traditional data path requires four copies across two system calls. Kafka's zero-copy path uses sendfile() to transfer data directly from the page cache to the network socket, never entering user space.
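You can poke at the kernel primitive directly. The sketch below (Linux-oriented, using Python's os.sendfile wrapper around the same syscall Kafka reaches via Java's FileChannel.transferTo()) streams a file into a socket without the payload ever passing through user-space buffers:

```python
import os
import socket
import tempfile

# Write some "log segment" data to a plain file on disk.
with tempfile.NamedTemporaryFile(delete=False) as seg:
    seg.write(b"record-1\nrecord-2\n")
    path = seg.name

# A socketpair stands in for a real consumer connection.
server, client = socket.socketpair()

with open(path, "rb") as f:
    # sendfile(): the kernel moves page-cache pages straight to the socket.
    # Our process never reads the bytes into its own buffers.
    sent = os.sendfile(server.fileno(), f.fileno(), 0, os.path.getsize(path))

assert sent == 18
assert client.recv(1024) == b"record-1\nrecord-2\n"
server.close(); client.close(); os.unlink(path)
```

Contrast with the traditional path, which would be `data = f.read()` followed by `server.sendall(data)` — two extra copies through the application's own memory.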
What Happens When a Broker Dies Mid-Write?
Every Kafka partition has a leader and zero or more followers. All reads and writes go through the leader. Followers exist purely to replicate data. But not all followers are equal — Kafka tracks which followers are "caught up" in a set called the ISR (In-Sync Replicas).
A follower stays in the ISR as long as it has fetched all messages from the leader within a configurable lag time (replica.lag.time.max.ms, 30 seconds by default in current versions; earlier releases defaulted to 10). If a follower falls behind — maybe its disk is slow or it had a network blip — it gets kicked out of the ISR. It can rejoin once it catches up.
Now, what happens when the leader broker dies? The Kafka controller (one broker elected as the cluster coordinator) detects the failure and promotes one of the ISR members to become the new leader. This is fast — typically under a second. Any follower that was not in the ISR is not eligible for promotion because it might be missing messages. This is how Kafka guarantees zero data loss with acks=all, as long as unclean leader election stays disabled (the default): a write is not acknowledged until every ISR member has it.
Here is the tradeoff you need to understand. With acks=all and min.insync.replicas=2, a write requires at least two replicas to confirm before the producer gets an acknowledgment. If only one replica is alive, writes are rejected entirely. You are choosing durability over availability. With acks=1, the leader acknowledges as soon as it has written the message locally — faster, but you risk losing data if the leader dies before followers replicate. With acks=0, the producer does not even wait for the leader to confirm. Fire and forget. You choose your durability level per producer (acks) and per topic (min.insync.replicas).
Left: normal operation with leader, one in-sync follower, and one lagging follower. Right: when the leader dies, only ISR members (Broker 2) are eligible for promotion, guaranteeing no data loss.
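The ISR bookkeeping and the failover rule can be sketched in a few lines. This is a deliberately simplified model — no real networking, and lag measured as an offset gap rather than Kafka's actual time-based check:

```python
class Partition:
    """Toy model of ISR tracking and leader failover. Real Kafka measures lag
    in time (replica.lag.time.max.ms); here an offset gap stands in for it."""

    MAX_LAG = 5  # hypothetical threshold replacing the time-based check

    def __init__(self, leader, followers):
        self.leader = leader
        self.offsets = {leader: 0, **{f: 0 for f in followers}}
        self.isr = {leader, *followers}

    def replicate(self, broker, offset):
        """Record a replica's progress and update ISR membership."""
        self.offsets[broker] = offset
        # Followers too far behind the leader are evicted; caught-up ones rejoin.
        if self.offsets[self.leader] - offset > self.MAX_LAG:
            self.isr.discard(broker)
        else:
            self.isr.add(broker)

    def fail_leader(self):
        """Promote an ISR member: it is guaranteed to hold every acked message."""
        candidates = self.isr - {self.leader}
        if not candidates:
            raise RuntimeError("no in-sync replica available")
        self.isr.discard(self.leader)
        self.leader = sorted(candidates)[0]
        return self.leader

p = Partition("broker-1", ["broker-2", "broker-3"])
p.replicate("broker-1", 100)   # leader advances
p.replicate("broker-2", 100)   # broker-2 keeps up
p.replicate("broker-3", 40)    # broker-3 lags badly -> evicted from ISR
assert p.isr == {"broker-1", "broker-2"}
assert p.fail_leader() == "broker-2"   # only the in-sync follower is eligible
```

The lagging broker-3 never becomes a promotion candidate — exactly the property that makes acks=all writes safe.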
Consumer Group Rebalancing: The Hardest Problem Nobody Talks About
Consumer groups sound simple. Multiple consumers share the partitions of a topic. When a consumer joins or leaves, Kafka rebalances — reassigning partitions to the remaining consumers. In practice, rebalancing is one of the most painful parts of running Kafka.
Here is how it works. One broker is elected as the Group Coordinator for each consumer group. Every consumer sends heartbeats to the coordinator. When the coordinator detects a membership change (new consumer, dead consumer, new topic partition), it triggers a rebalance. The coordinator picks one consumer as the Group Leader, sends it the list of members and subscriptions, and the leader computes the partition assignment using a configurable assignor strategy.
The original eager rebalance protocol was brutal: every consumer revokes all its partitions, pauses processing, and waits for reassignment. For a group with 100 consumers reading 500 partitions, this meant a complete stop-the-world pause. Every single partition stopped being consumed. If your rebalance took 30 seconds, you had 30 seconds of zero processing.
The StickyAssignor improved assignment quality — it tries to keep partitions assigned to the same consumer across rebalances — but it still used the eager protocol. The real fix came with the CooperativeStickyAssignor, which uses an incremental rebalance protocol. Instead of revoking everything, only the partitions that actually need to move are revoked and reassigned. The rest keep processing. This is a massive improvement for large consumer groups, and since Kafka 3.0 it ships as part of the default assignor configuration.
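The whole difference between the two protocols boils down to how many partitions actually change hands. Here is a simplified sticky assignor sketch (my own stand-in, not Kafka's real algorithm): keep every partition with its prior owner when possible, then move the minimum needed to even out the load — and only those moved partitions would pause under the cooperative protocol.

```python
def sticky_assign(partitions, consumers, previous=None):
    """Simplified sticky assignment: preserve prior ownership, then rebalance
    so consumer loads differ by at most one partition."""
    previous = previous or {}
    assignment = {c: [] for c in consumers}
    for p in partitions:
        owner = previous.get(p)
        if owner in assignment:
            assignment[owner].append(p)       # sticky: stay with prior owner
        else:
            target = min(assignment, key=lambda c: len(assignment[c]))
            assignment[target].append(p)      # unowned: go to least-loaded consumer
    # Move partitions off overloaded consumers until loads are even. Under the
    # cooperative protocol, ONLY these moved partitions are revoked.
    loads = lambda: [len(v) for v in assignment.values()]
    while max(loads()) - min(loads()) > 1:
        big = max(assignment, key=lambda c: len(assignment[c]))
        small = min(assignment, key=lambda c: len(assignment[c]))
        assignment[small].append(assignment[big].pop())
    return assignment

parts = ["p0", "p1", "p2", "p3"]
first = sticky_assign(parts, ["c1", "c2"])
prev = {p: c for c, ps in first.items() for p in ps}

# c3 joins the group: only one partition moves; the other three keep processing.
second = sticky_assign(parts, ["c1", "c2", "c3"], previous=prev)
moved = [p for p in parts
         if prev[p] != next(c for c, ps in second.items() if p in ps)]
assert moved == ["p2"]
```

Under the eager protocol, all four partitions would have been revoked for the same rebalance; here three of them never stop.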
Exactly-Once Semantics: How Kafka Actually Achieves EOS
"Exactly-once" is the holy grail of distributed messaging, and most people assume it is impossible. Kafka actually pulled it off, though the machinery behind it is more nuanced than the marketing suggests.
The first building block is idempotent producers. When you enable enable.idempotence=true, the broker assigns each producer a unique Producer ID (PID) and tracks a sequence number for each partition. If the producer retries a send (say, due to a network timeout), the broker detects the duplicate sequence number and silently discards the duplicate. This gives you exactly-once delivery from a single producer session to a single partition — no duplicates, guaranteed.
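The broker-side logic is essentially a per-producer sequence check. A toy model (simplified: real Kafka also rejects out-of-order gaps and tracks sequences per batch, not per record):

```python
class BrokerPartition:
    """Toy model of broker-side idempotence: remember the highest sequence
    number seen per producer ID and drop duplicates caused by retries."""

    def __init__(self):
        self.log = []
        self.last_seq = {}   # producer_id -> highest sequence number appended

    def append(self, producer_id, seq, record):
        if seq <= self.last_seq.get(producer_id, -1):
            return "duplicate-discarded"    # retry of an already-appended send
        self.last_seq[producer_id] = seq
        self.log.append(record)
        return "appended"

bp = BrokerPartition()
assert bp.append(producer_id=7, seq=0, record="order-created") == "appended"
# Network timeout: the producer never saw the ack, so it retries the same send.
assert bp.append(producer_id=7, seq=0, record="order-created") == "duplicate-discarded"
assert bp.append(producer_id=7, seq=1, record="order-paid") == "appended"
assert bp.log == ["order-created", "order-paid"]
```

The retry is absorbed silently: the producer gets its acknowledgment, and the log contains each record exactly once.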
But what about writing to multiple partitions atomically? That is where the Transactional API comes in. You call producer.beginTransaction(), send messages to multiple partitions, optionally commit consumer offsets within the same transaction, then call producer.commitTransaction(). Kafka uses a two-phase commit protocol with a Transaction Coordinator (a broker) and a special internal topic called __transaction_state. Either all writes in the transaction are visible to consumers or none are. Consumers configured with isolation.level=read_committed will only see committed messages.
This is how Kafka Streams achieves exactly-once stream processing: it reads from input topics, processes records, and writes to output topics plus commits consumer offsets — all within a single transaction. If the processing node crashes mid-way, the uncommitted transaction is aborted, and another node picks up from the last committed offset. No duplicates. No data loss.
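The consumer side of this is worth sketching too. In real Kafka, commit and abort markers are control records written into the same partition as the data; a read_committed consumer buffers transactional records until it sees the marker. Under that simplifying framing:

```python
# A toy partition log: data records carry a transactional id ("txn"); control
# records mark that transaction committed or aborted. Illustrative sketch only.
log = [
    {"txn": "t1", "value": "debit $10"},
    {"txn": "t1", "value": "credit $10"},
    {"type": "control", "txn": "t1", "marker": "commit"},
    {"txn": "t2", "value": "debit $99"},
    {"type": "control", "txn": "t2", "marker": "abort"},   # producer crashed mid-way
]

def read_committed(log):
    """Surface only data records whose transaction has a commit marker."""
    committed = {r["txn"] for r in log
                 if r.get("type") == "control" and r["marker"] == "commit"}
    return [r["value"] for r in log
            if r.get("type") != "control" and r["txn"] in committed]

# t2's records are physically in the log but invisible to this consumer.
assert read_committed(log) == ["debit $10", "credit $10"]
```

Note that aborted records are still on disk — atomicity is a read-side filter plus markers, not a rollback of the append-only log.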
Compacted Topics: Kafka as a Database
Most Kafka topics use time-based or size-based retention. But there is a second mode: log compaction. A compacted topic guarantees that Kafka will always retain at least the last message for each unique key. Think of it as a changelog that only keeps the latest state.
Internally, Kafka runs a background Log Cleaner thread that scans the log, builds an offset map of keys, and rewrites old segments by dropping superseded entries. If key user-42 has messages at offsets 10, 50, and 200, compaction keeps only the one at offset 200. If you write a message with key user-42 and a null value (a tombstone), compaction eventually removes that key entirely.
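The cleaner's core pass is easy to model. A hedged sketch (real Kafka compacts only older segments in the background, leaves the active segment alone, and uses a bounded offset map; this toy version compacts everything in one shot):

```python
def compact(segment):
    """Toy log compaction: keep only the latest record per key; drop keys whose
    latest value is a tombstone (None)."""
    latest = {}
    for offset, (key, value) in enumerate(segment):
        latest[key] = (offset, value)         # later offsets supersede earlier ones
    return [(offset, key, value)
            for key, (offset, value) in sorted(latest.items(),
                                               key=lambda kv: kv[1][0])
            if value is not None]             # tombstone: key removed entirely

segment = [
    ("user-42", "name=Ada"),
    ("user-7",  "name=Alan"),
    ("user-42", "name=Ada Lovelace"),   # supersedes offset 0
    ("user-7",  None),                  # tombstone: delete user-7
]
assert compact(segment) == [(2, "user-42", "name=Ada Lovelace")]
```

Offsets of surviving records are preserved — compaction removes entries but never renumbers the log, so consumers can still seek by offset.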
This turns Kafka into a lightweight key-value store. It is heavily used for maintaining materialized views, CDC snapshots, and configuration distribution. Kafka's own internal topics — __consumer_offsets and __transaction_state — are compacted topics. When a broker takes over as group coordinator, it reads its partitions of __consumer_offsets to rebuild the latest committed offset for every group-partition combination.
How LinkedIn Handles 7 Trillion Messages a Day
The scale at which Kafka operates at its birthplace is staggering. LinkedIn runs thousands of brokers across multiple data centers. Their Kafka clusters handle over 7 trillion messages per day — that is roughly 80 million messages per second sustained. The data includes everything from member activity events to ad impressions to internal system metrics.
They achieve this through several strategies. First, aggressive partitioning — high-volume topics have hundreds or even thousands of partitions spread across the cluster. Second, tiered storage — cold data gets offloaded to HDFS or object storage rather than keeping it all on broker disks. Third, rack-aware replication — replicas are distributed across failure domains so that a rack power failure does not take out all copies of a partition. Fourth, they run a custom Kafka deployment called "Li-Kafka" with patches for multi-tenancy, quotas, and operational tooling that have not all been upstreamed.
The architecture lesson from Kafka is simple but profound: the append-only log is one of the most powerful abstractions in computer science. A database is a log. A filesystem is a log. A message queue is a log. Kafka made the log a first-class distributed primitive, and an entire ecosystem grew around it.
If you are building anything that involves multiple systems reacting to the same events, or data that needs to be durable and replayable, or throughput measured in millions of events per second — you are going to end up either using Kafka or reinventing it. And if the engineering choices behind it teach you one thing, let it be this: before you reach for a clever in-memory data structure, ask yourself whether a boring sequential file on disk might actually be faster.
References and Further Reading
- Apache Kafka Documentation — Apache Software Foundation
- The Log: What every software engineer should know about real-time data's unifying abstraction — Jay Kreps, LinkedIn Engineering
- How Kafka Is Used by Netflix — Confluent Blog
- Running Kafka at Scale — LinkedIn Engineering Blog
- Kafka: The Definitive Guide, 2nd Edition — O'Reilly Media
- Exactly-Once Semantics Are Possible: Here's How Kafka Does It — Confluent Blog
- Cooperative Rebalancing in Kafka Streams, Consumer, and ksqlDB — Confluent Blog
- Kafka Design: Efficient Data Transfer via Zero-Copy — Apache Kafka Documentation
- Kafka: a Distributed Messaging System for Log Processing — Original Paper by Jay Kreps, Neha Narkhede, Jun Rao (LinkedIn)
- Kafka Inside Keystone Pipeline — Netflix Technology Blog