How Does Write-Ahead Logging Prevent Data Loss?

Here is a question that sounds obvious until you actually think about it: what happens to your data when the database crashes halfway through a write? The row is half-updated. The index points to garbage. The free-space map is inconsistent. If the database just wrote directly to the data files, you would be staring at corruption every time someone tripped over a power cable.

The answer, in virtually every serious database engine built in the last four decades, is Write-Ahead Logging (WAL). The rule is dead simple: before you change anything in the actual data files, you first write a description of the change to a sequential log. That is it. That one rule is the foundation of crash recovery, replication, point-in-time recovery, and a surprising number of other features you probably take for granted.

The Crash That Destroys Everything

Imagine a database without WAL. You run UPDATE accounts SET balance = balance - 500 WHERE id = 42. The database needs to do several things: update the heap tuple, update the index entry, update the visibility map, and update the free-space map. These are separate writes to separate pages on disk.

Now the power goes out after the heap tuple is updated but before the index is fixed. When the database comes back up, the index says the old value is at one location, but the heap has the new value at another. Or worse, the page was only partially written — 4KB of an 8KB page made it to disk. You now have a torn page. The data is not just stale; it is structurally broken. No query can read it safely. No VACUUM can clean it up. You are restoring from backup.

This is not a theoretical concern. Hard drives lie about fsyncs. SSDs have volatile write caches. Operating systems reorder writes for performance. The gap between "the application called write()" and "the bits are physically on stable storage" is enormous, and crashes exploit that gap ruthlessly.

The WAL Contract: Log First, Write Later

Write-Ahead Logging inverts the problem. Instead of writing directly to the data pages, every modification goes through this sequence:

  1. The transaction makes changes in memory (to shared buffer pages).
  2. A WAL record describing the change is written to the WAL buffer.
  3. At commit time, the WAL buffer is flushed to the WAL file on disk (fsync).
  4. The client gets an "OK, committed" response.
  5. Later — possibly much later — the dirty data pages are written to the actual data files (this is called a checkpoint).

The critical guarantee: the WAL record hits stable storage before the corresponding data page change hits stable storage. If the system crashes at any point, the database can replay the WAL from the last checkpoint to reconstruct every committed transaction. Uncommitted transactions are simply ignored during recovery. The data files might be stale or partially written, but the WAL contains the truth.

[Figure: Write-Ahead Logging: The Commit Path]

The WAL commit path. Steps 1-4 happen during a transaction. Step 5 happens lazily during checkpoints. On crash, the WAL is replayed from the last checkpoint to bring data files up to date.

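To make the ordering concrete, here is a minimal sketch in C of what an engine's commit and checkpoint paths might look like. It is illustrative only: wal_fd, data_fd, and the record layout are stand-ins, not any real engine's API.

    #include <unistd.h>

    /* Commit path: steps 2-4 of the sequence above. */
    int wal_commit(int wal_fd, const void *record, size_t len)
    {
        if (write(wal_fd, record, len) != (ssize_t)len)   /* append to the log */
            return -1;
        if (fsync(wal_fd) != 0)                           /* make it durable   */
            return -1;    /* a real engine would panic here, not retry         */
        return 0;         /* only now is it safe to tell the client "committed" */
    }

    /* Checkpoint path: step 5, possibly minutes later. A torn write
     * here is harmless, because replaying the WAL rebuilds the page. */
    int flush_dirty_page(int data_fd, const void *page, off_t off, size_t len)
    {
        if (pwrite(data_fd, page, len, off) != (ssize_t)len)
            return -1;
        return fsync(data_fd);
    }
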
Inside PostgreSQL's WAL

PostgreSQL calls its WAL the "transaction log," and it lives in the pg_wal/ directory (renamed from pg_xlog/ in version 10). Each WAL file is 16MB by default and is named with a monotonically increasing hexadecimal number like 000000010000000000000001. The name encodes the timeline ID and the segment number.

Internally, WAL is a stream of records. Each record contains a resource manager ID (which subsystem generated it — heap, btree, transaction, etc.), the transaction ID, a reference to the page being modified, and the actual change data. For a simple UPDATE, the WAL record contains the new tuple data plus enough information to locate the old tuple.

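The exact layout is internal to PostgreSQL, but a simplified record header, loosely modeled on PostgreSQL's XLogRecord (src/include/access/xlogrecord.h), might look like the sketch below. Field names and widths here are illustrative, not the real on-disk format.

    #include <stdint.h>

    typedef struct WalRecordHeader {
        uint32_t total_len;   /* length of the whole record             */
        uint32_t xid;         /* transaction that generated the change  */
        uint64_t prev_lsn;    /* position of the previous record        */
        uint8_t  info;        /* operation-specific flag bits           */
        uint8_t  rmgr_id;     /* resource manager: heap, btree, xact... */
        uint32_t crc;         /* checksum over the record               */
        /* followed by block references (which page is touched) and the
           payload: the new tuple data, or a full-page image */
    } WalRecordHeader;
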
PostgreSQL uses full-page writes (also called full-page images) to handle torn pages. The first time a page is modified after a checkpoint, the entire 8KB page is written to the WAL. Subsequent modifications to that same page only write the delta. This means that even if a data page is half-written during a crash, recovery can restore the complete page image from the WAL and then apply any subsequent deltas on top.

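In sketch form, the decision made on each page modification looks roughly like this. It is a simplification: real PostgreSQL also consults the full_page_writes setting and backup state inside its XLogInsert machinery.

    #include <stdint.h>

    /* redo_lsn is where the last checkpoint started; page_lsn is the
     * LSN last stamped on the page when it was modified. */
    int needs_full_page_image(uint64_t page_lsn, uint64_t redo_lsn)
    {
        /* First touch since the checkpoint: log the whole 8KB page, so
         * recovery never depends on a possibly-torn on-disk copy. */
        return page_lsn <= redo_lsn;
    }
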
The configuration parameter wal_level controls how much information goes into the WAL. At replica level (the default since PostgreSQL 10), enough information is logged to support streaming replication and point-in-time recovery. At logical level, even more data is logged to support logical replication and change data capture. The minimal level logs just enough for crash recovery but cannot support any form of replication. Changing the level requires a server restart.

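For example, enabling logical decoding is a one-line change followed by a restart. A postgresql.conf excerpt (illustrative, not a complete configuration):

    # postgresql.conf
    wal_level = logical    # minimal | replica (default) | logical
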
WAL Powers Replication

Here is where WAL gets really interesting. If the WAL is a complete, ordered record of every change to the database, then shipping that log to another machine and replaying it produces an exact copy of the database. That is physical replication in a nutshell.

PostgreSQL streaming replication works exactly this way. The primary generates WAL records. A walreceiver process on the standby connects to a walsender process on the primary and streams WAL records in real time. The standby applies those records to its own data files. At any given moment, the standby is a few milliseconds behind the primary — just the network latency plus the apply time.

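Wiring up a standby is mostly configuration. A minimal sketch with a placeholder hostname and user: in PostgreSQL 12 and later, an empty standby.signal file in the data directory puts the server in standby mode, and pg_basebackup with the -R flag generates both pieces for you.

    # standby's postgresql.conf (hostname and user are placeholders)
    primary_conninfo = 'host=primary.example.com user=replicator'
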
MySQL takes a related but different approach. Its binary log (binlog) is not a physical WAL; it is a logical log that records SQL statements or row-change events. The InnoDB storage engine has its own redo log (which is a true WAL), but replication flows through the binlog. This two-log architecture is a historical artifact of MySQL's pluggable storage engine design, and it adds complexity. PostgreSQL's single-WAL design is cleaner: one log serves crash recovery, replication, and archiving.

[Figure: WAL Streaming Replication]

PostgreSQL streaming replication. The walsender ships WAL records over the network in real time. The standby's walreceiver applies them, maintaining a near-identical copy of the primary.

Point-in-Time Recovery (PITR)

WAL archiving unlocks a powerful capability: restoring your database to any specific moment in time. The idea is straightforward. You continuously archive WAL files to durable storage (S3, NFS, whatever). You also take periodic base backups using pg_basebackup. To restore, you start from a base backup and replay archived WAL files up to the exact timestamp or transaction ID you want to recover to.

This is how you recover from the classic "someone ran DELETE without a WHERE clause" disaster. You do not restore from last night's backup and lose a day of data. You restore up to 11:42:37 AM, one second before the bad query ran. The configuration parameter recovery_target_time in postgresql.conf controls exactly where replay stops.

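A sketch of the recovery configuration for that scenario, with placeholder paths and timestamp (PostgreSQL 12 and later; an empty recovery.signal file in the data directory triggers the recovery):

    # postgresql.conf
    restore_command        = 'cp /mnt/wal-archive/%f "%p"'  # fetch archived segments
    recovery_target_time   = '2024-03-18 11:42:37'          # stop just before the bad query
    recovery_target_action = 'promote'                      # accept writes once the target is reached
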
Tools like pgBackRest and Barman automate WAL archiving and PITR. They handle compression, parallel transfer, retention policies, and the fiddly details of coordinating base backups with WAL archiving. In production, you should never manage WAL archiving by hand.

RocksDB: LSM Trees Still Need a WAL

RocksDB, the embedded storage engine behind TiKV (TiDB's storage layer), Meta's internal systems, and originally CockroachDB (which has since moved to Pebble, a RocksDB-compatible engine), uses a Log-Structured Merge-tree (LSM) architecture. Writes go to an in-memory buffer called a memtable. When the memtable is full, it is flushed to an immutable sorted file on disk called an SSTable (Sorted String Table). Background compaction merges SSTables to reclaim space and maintain read performance.

But the memtable is in RAM. If the process crashes before the memtable is flushed, those writes are gone. So RocksDB uses a WAL too. Every write goes to the WAL first, then to the memtable. On recovery, RocksDB replays the WAL to reconstruct the memtable. Once a memtable is flushed to an SSTable, the corresponding WAL file can be discarded.

RocksDB gives you explicit control over WAL durability. By default, each write is appended to the WAL through the OS page cache but not fsynced: a process crash loses nothing, but a machine crash can lose the most recent writes. Setting WriteOptions::sync forces an fsync before the write returns (safest, slowest), and the manual_wal_flush option goes the other way, buffering WAL data inside the process until FlushWAL() is called. Systems that need strict durability, such as CockroachDB and MyRocks (MySQL on RocksDB), opt into synchronous WAL writes.

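Here is how that durability choice looks through RocksDB's C API. This is a minimal sketch: error handling is abbreviated, and the database path and key are placeholders.

    #include <rocksdb/c.h>
    #include <stdio.h>

    int main(void)
    {
        char *err = NULL;
        rocksdb_options_t *opts = rocksdb_options_create();
        rocksdb_options_set_create_if_missing(opts, 1);
        rocksdb_t *db = rocksdb_open(opts, "/tmp/waldemo", &err);
        if (err) { fprintf(stderr, "open: %s\n", err); return 1; }

        /* sync = 1 fsyncs the WAL before the write returns: the
         * PostgreSQL-style durability contract. sync = 0 (the default)
         * survives a process crash but not a machine crash. */
        rocksdb_writeoptions_t *wopts = rocksdb_writeoptions_create();
        rocksdb_writeoptions_set_sync(wopts, 1);

        rocksdb_put(db, wopts, "balance:42", 10, "500", 3, &err);
        if (err) { fprintf(stderr, "put: %s\n", err); return 1; }

        rocksdb_writeoptions_destroy(wopts);
        rocksdb_options_destroy(opts);
        rocksdb_close(db);
        return 0;
    }
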
SQLite WAL Mode: One File, Two Readers

SQLite is a single-file database, and its default journaling mode uses a rollback journal. Before modifying a page, SQLite copies the original page to a separate -journal file. If a crash happens, the original pages are restored from the journal. This is the inverse of WAL: the rollback journal preserves the old contents so they can be put back (undo logging), whereas a WAL records the new contents so they can be replayed (redo logging).

Starting with version 3.7.0 (2010), SQLite added a WAL mode. In WAL mode, the original database file is never modified directly. All changes go to a -wal file. Readers read from the original database plus any committed changes in the WAL. This is a massive concurrency improvement: readers do not block writers, and writers do not block readers. In the default rollback journal mode, a writer holds an exclusive lock on the entire database.

The tradeoff is that WAL mode requires shared memory (-shm file), which means it does not work over network file systems. It also requires periodic checkpoints to transfer changes from the WAL back into the main database file, preventing the WAL from growing unbounded. SQLite auto-checkpoints by default when the WAL exceeds 1000 pages.

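Switching a database to WAL mode is a single pragma, and the setting is persistent, stored in the database file itself. A minimal sketch using the C API (the file name is a placeholder):

    #include <sqlite3.h>

    int main(void)
    {
        sqlite3 *db;
        char *err = NULL;
        if (sqlite3_open("app.db", &db) != SQLITE_OK) return 1;

        /* Persistent: the database stays in WAL mode across reopens. */
        sqlite3_exec(db, "PRAGMA journal_mode=WAL;", 0, 0, &err);

        /* Optional: checkpoint explicitly (e.g. during an idle period)
         * instead of relying on the 1000-page auto-checkpoint.
         * TRUNCATE also resets the -wal file to zero length. */
        sqlite3_wal_checkpoint_v2(db, "main", SQLITE_CHECKPOINT_TRUNCATE,
                                  NULL, NULL);

        sqlite3_close(db);
        return 0;
    }
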
The fsync Problem: When "Written" Does Not Mean "Durable"

The entire WAL contract depends on one assumption: when the database calls fsync() and it returns success, the data is on stable storage. If fsync lies, WAL cannot protect you.

And fsync does lie. In 2018, the PostgreSQL community discovered that on Linux, if an fsync() call fails (returns EIO), the kernel marks the dirty pages as clean even though they never made it to disk. A subsequent fsync on the same file descriptor returns success because the kernel thinks the pages are already clean. The data is silently lost. PostgreSQL now panics and refuses to continue if fsync fails, forcing a WAL replay on restart. This is the correct behavior — if the WAL itself might be corrupt, the only safe option is to stop.

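The lesson for anyone writing durable code, in the spirit of PostgreSQL's post-2018 behavior: treat a failed fsync as fatal, because retrying can report success for data that never reached disk. A sketch:

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* After the 2018 findings, retrying a failed fsync is unsafe on
     * Linux: the kernel may have already marked the dirty pages clean,
     * so a retry "succeeds" without writing anything. */
    void durable_sync_or_die(int fd)
    {
        if (fsync(fd) != 0) {
            fprintf(stderr, "FATAL: fsync failed: %s; "
                            "aborting so WAL replay runs on restart\n",
                    strerror(errno));
            abort();   /* crash now; replaying the WAL is the safe path */
        }
    }
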
Hardware write caches add another layer of risk. Many SSDs and RAID controllers have volatile write caches that report a write as complete before it reaches non-volatile storage. If power is lost, those cached writes vanish. Enterprise SSDs have power-loss protection (capacitors that flush the cache to NAND on power failure). Consumer drives often do not. If you are running a database on consumer hardware without battery-backed write cache, your WAL guarantees are weaker than you think.

Group Commit: Batching for Throughput

Calling fsync for every single transaction commit is expensive. Each fsync can take anywhere from 0.1ms (NVMe SSD) to 10ms (spinning disk), and it is a synchronous operation — the committing process blocks until it completes. At 10ms per fsync, you cap out at 100 commits per second on a spinning disk. That is terrible.

Group commit solves this by batching multiple concurrent transactions into a single fsync. Here is how it works: Transaction A finishes and requests an fsync. While the fsync is in progress, transactions B, C, and D also finish and queue up. When the fsync completes, A is committed. Then B, C, and D's WAL records are written and a single fsync covers all three of them. One fsync, three commits.

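Here is a sketch of the core mechanism in C with pthreads. This is illustrative, not PostgreSQL's actual implementation: whichever backend finds no flush in progress fsyncs on behalf of everyone queued behind it.

    #include <pthread.h>
    #include <stdint.h>
    #include <unistd.h>

    static pthread_mutex_t lock    = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  flushed = PTHREAD_COND_INITIALIZER;
    static uint64_t wal_written;        /* bytes appended to the WAL    */
    static uint64_t wal_durable;        /* bytes known fsynced to disk  */
    static int      flush_in_progress;

    void commit_record(int wal_fd, const void *rec, size_t len)
    {
        pthread_mutex_lock(&lock);
        write(wal_fd, rec, len);              /* sketch: error handling elided */
        wal_written += len;
        uint64_t my_pos = wal_written;        /* durability point for this txn */

        while (wal_durable < my_pos) {
            if (!flush_in_progress) {
                flush_in_progress = 1;
                uint64_t batch_end = wal_written;   /* covers all queued txns  */
                pthread_mutex_unlock(&lock);
                fsync(wal_fd);                      /* ONE fsync for the batch */
                pthread_mutex_lock(&lock);
                wal_durable = batch_end;
                flush_in_progress = 0;
                pthread_cond_broadcast(&flushed);
            } else {
                pthread_cond_wait(&flushed, &lock); /* ride on someone's fsync */
            }
        }
        pthread_mutex_unlock(&lock);          /* safe to report "committed"    */
    }
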
PostgreSQL implements group commit automatically. The GUC parameter commit_delay adds a brief pause (in microseconds) before fsyncing, giving other transactions a chance to queue up. The commit_siblings parameter controls how many concurrent transactions must be active before the delay kicks in. With enough concurrent load, group commit can improve throughput by 10-50x on fsync-bound workloads.

[Figure: Group Commit: Batching fsync Calls]

Without group commit, each transaction pays the full fsync cost. With group commit, concurrent transactions share a single fsync, dramatically improving throughput under load.

WAL Sizing and Operational Concerns

WAL files are not free. A busy PostgreSQL instance can generate gigabytes of WAL per hour. The max_wal_size parameter (default 1GB) controls how much WAL accumulates before a checkpoint is triggered. A larger value means fewer checkpoints (less I/O overhead) but more WAL to replay on crash recovery (longer recovery time). A smaller value means faster recovery but more frequent checkpoint I/O. You are tuning a tradeoff between steady-state performance and recovery time.

The pg_stat_wal view (added in PostgreSQL 14) gives you visibility into WAL generation rate, fsync counts, and buffer usage. If wal_buffers_full is nonzero, your WAL buffer is too small and processes are waiting — bump wal_buffers from the default of -1 (auto) to a higher value. In most production setups, 64MB is a good starting point.

WAL compression (wal_compression = on) can reduce WAL volume by 40-60% for full-page writes, with minimal CPU overhead. PostgreSQL 15 added support for LZ4 and Zstandard compression algorithms in addition to the original pglz. If you are shipping WAL over a network for replication, compression meaningfully reduces bandwidth usage.

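Pulling these knobs together, here is a hedged postgresql.conf sketch for a write-heavy instance. These are illustrative starting points to validate against your own workload, not drop-in recommendations:

    # postgresql.conf
    max_wal_size    = 4GB    # fewer checkpoints, but longer crash recovery
    wal_buffers     = 64MB   # only if pg_stat_wal shows wal_buffers_full > 0
    wal_compression = zstd   # PostgreSQL 15+; use 'on' on older versions
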
The Bottom Line

Write-Ahead Logging is not a feature. It is the foundation. Every transaction you commit, every replica that stays in sync, every point-in-time recovery that saves your data — they all work because of one simple rule: write to the log before you write to the data.

The rule is simple, but the engineering around it is deep. Full-page writes handle torn pages. Group commit handles throughput. WAL archiving handles disaster recovery. Streaming replication handles high availability. And the entire stack collapses if fsync does not actually do what it promises. Every database engineer should understand not just that WAL exists, but why each of these mechanisms is necessary and how they interact.

The WAL is the single source of truth. The data files are just a materialized cache of what the WAL has already recorded. Once you internalize that perspective, crash recovery, replication, and PITR all stop being mysterious.
