Write-Ahead Log (WAL) Explained: Durability Before Performance
How write-ahead logging works — crash recovery, log structure, checkpointing, and how PostgreSQL, SQLite, Kafka, and etcd use WAL for durability.
A write-ahead log is a sequential, append-only file where every data modification is recorded before it is applied to the main database, ensuring that changes can be recovered after a crash.
What It Really Means
Databases store data in complex structures — B-trees, hash indexes, heap files. Updating these structures involves modifying multiple disk pages. If the system crashes in the middle of an update (say, after writing the index but before writing the data page), the database is left in an inconsistent, corrupted state.
The write-ahead log solves this with a simple rule: before modifying any data page, first write a description of the change to a sequential log file and flush it to disk. If the system crashes at any point, it can replay the log to reconstruct the exact state of all in-progress operations and either complete them or roll them back.
The key insight is that sequential writes are fast. Writing to the end of a log file is an append operation — the disk head does not need to seek. Random writes to B-tree pages are slow because the disk head jumps around. By writing changes sequentially to the WAL first, the database can batch and defer the expensive random writes to data pages while still guaranteeing durability.
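The rule above can be sketched in a few lines. This is an illustrative toy, not a real engine: the function name, log path, and record format are all made up for the example.

```python
import os

# The WAL rule, minimally: describe the change, append it to the log,
# fsync, and only THEN modify the actual data.
def apply_change(log_path, data, key, value):
    record = f"SET {key}={value}\n".encode()
    with open(log_path, "ab") as log:
        log.write(record)          # 1. sequential append to the WAL
        log.flush()
        os.fsync(log.fileno())     # 2. force the record to stable storage
    data[key] = value              # 3. only now touch the "data page"

store = {}
apply_change("wal.log", store, "answer", "42")
print(store["answer"])  # -> 42
```

If a crash happens between steps 2 and 3, the record is already on disk, so recovery can replay it; if the crash happens before step 2, the change was never acknowledged, so losing it is safe.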
How It Works in Practice
PostgreSQL WAL
PostgreSQL writes all changes to its WAL (files in the pg_wal/ directory) before modifying the actual data files (under base/). The WAL is divided into 16 MB segment files by default. Each WAL record contains:
- Transaction ID
- The page being modified
- The redo data needed to reapply the change (PostgreSQL's WAL is redo-only; undo is handled by MVCC, which keeps old row versions in the heap)
- A checksum
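A record with roughly this shape can be modeled as follows. The field and class names here are ours for illustration, not PostgreSQL's actual on-disk layout, which is more involved (resource manager IDs, block references, back-links to the previous record).

```python
import zlib
from dataclasses import dataclass

@dataclass(frozen=True)
class WalRecord:
    """Illustrative WAL record; field names are ours, not PostgreSQL's."""
    txid: int        # transaction ID
    page_no: int     # page being modified
    payload: bytes   # redo data describing the change

    def checksum(self) -> int:
        # CRC over the record contents, verified again at replay time
        # so a torn or corrupted record is detected and replay stops.
        header = f"{self.txid}:{self.page_no}:".encode()
        return zlib.crc32(header + self.payload)

rec = WalRecord(txid=7, page_no=42, payload=b"set balance=100")
```

The checksum matters because a crash can tear a record mid-write; recovery replays only records whose checksum verifies, treating the first bad record as the end of the log.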
Crash recovery: On startup after a crash, PostgreSQL reads the WAL from the last checkpoint and replays it. Committed transactions are reapplied; changes from transactions that never committed remain as dead row versions that MVCC keeps invisible, so no explicit undo pass is needed. The data files end up in a consistent state.
Streaming replication: PostgreSQL ships WAL records to replicas in real-time. Replicas replay the WAL to stay in sync with the primary. This is why WAL is the foundation of PostgreSQL replication.
SQLite WAL Mode
SQLite's WAL mode writes changes to a separate -wal file instead of modifying the database file in place. Readers look up recently committed pages in the WAL and fall back to the main database file for everything else, so readers never block writers and writers never block readers. Periodically, the WAL is "checkpointed" — its contents are transferred into the main database file.
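WAL mode is enabled with a single pragma, which Python's stdlib `sqlite3` can issue directly. The pragma is SQLite's documented API; the throwaway database path and table are just for the demo.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)

# Switch the journal mode; SQLite reports the mode now in effect.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # prints "wal"

conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
conn.execute("INSERT INTO kv VALUES ('a', '1')")
conn.commit()

# Committed changes live in demo.db-wal until a checkpoint merges them.
assert os.path.exists(path + "-wal")
conn.close()  # closing the last connection checkpoints and removes the -wal file
```

Note that WAL mode is persistent: it is stored in the database file, so subsequent connections open in WAL mode automatically.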
Apache Kafka
Kafka's entire architecture is built on the log abstraction. Each topic partition is a commit log — an append-only, ordered sequence of records. Producers append to the end of the log. Consumers read from any position in the log. The log provides durability, ordering, and replayability.
Kafka flushes log segments to disk based on time or size thresholds (log.flush.interval.messages, log.flush.interval.ms). The sequential nature of log writes is why Kafka achieves such high write throughput.
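In broker (or per-topic) configuration, those thresholds look like the fragment below. The values are illustrative; by default Kafka leaves both unset and relies on the OS page cache plus replication for durability.

```properties
# Flush the log to disk after this many messages or this many ms,
# whichever comes first (illustrative values, not defaults).
log.flush.interval.messages=10000
log.flush.interval.ms=1000
```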
etcd
etcd (used by Kubernetes) uses a WAL in combination with the Raft consensus protocol. Every state change is first written to the WAL, then applied to the in-memory state machine. If an etcd node crashes, it replays the WAL to restore its state. The WAL is also used to bring new cluster members up to date.
Implementation
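A minimal sketch of the full append/replay cycle, in Python. Everything here — the class name, the JSON-lines record format, the `set`/`del` operations — is invented for illustration; real engines use binary records, checksums, and page-level redo data.

```python
import json
import os

class TinyWAL:
    """Toy write-ahead log: one JSON record per line (illustrative only)."""

    def __init__(self, path):
        self.path = path

    def append(self, op, key, value=None):
        record = json.dumps({"op": op, "key": key, "value": value})
        with open(self.path, "a") as f:
            f.write(record + "\n")
            f.flush()
            os.fsync(f.fileno())   # durable before we acknowledge anything

    def replay(self):
        """Rebuild in-memory state from the log, as crash recovery would."""
        state = {}
        if not os.path.exists(self.path):
            return state
        with open(self.path) as f:
            for line in f:
                rec = json.loads(line)
                if rec["op"] == "set":
                    state[rec["key"]] = rec["value"]
                elif rec["op"] == "del":
                    state.pop(rec["key"], None)
        return state

if os.path.exists("tiny.wal"):
    os.remove("tiny.wal")          # start fresh for the demo

wal = TinyWAL("tiny.wal")
wal.append("set", "a", 1)
wal.append("set", "b", 2)
wal.append("del", "a")
print(wal.replay())  # -> {'b': 2}
```

Note that `replay` is idempotent: running it twice over the same log produces the same state, which is exactly the property crash recovery depends on.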
Trade-offs
Advantages
- Crash durability: No data loss for committed transactions, even during power failure
- Fast writes: Sequential append to log is faster than random writes to B-tree pages
- Atomic transactions: Multi-page updates either fully commit or fully roll back
- Enables replication: Ship WAL records to replicas for consistent replication
- Point-in-time recovery: Replay WAL to any point in time (PITR)
Disadvantages
- Double write overhead: Data is written twice — once to the WAL, once to the data files
- Disk space: WAL files accumulate until checkpointed and archived
- Recovery time: Replaying a large WAL after a crash can take minutes to hours
- Checkpoint stalls: Checkpointing (flushing WAL to data files) causes I/O spikes
Checkpointing
The WAL grows continuously. Without cleanup, it would consume unlimited disk space. A checkpoint flushes all WAL changes to the actual data files and marks the WAL records as no longer needed for recovery.
PostgreSQL triggers checkpoints based on time (checkpoint_timeout, default 5 minutes) or WAL size (max_wal_size, default 1GB). During a checkpoint, dirty pages in the buffer cache are written to disk, causing an I/O burst.
| Parameter | Trade-off |
|---|---|
| Frequent checkpoints | Less WAL to replay on crash, but more I/O overhead |
| Infrequent checkpoints | Less I/O overhead, but longer crash recovery time |
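Continuing the toy model, a checkpoint amounts to: make the data file durable, then truncate the WAL. The atomic-rename trick below is a common pattern but the function and file names are ours, not any particular database's.

```python
import json
import os

def checkpoint(state, data_path, wal_path):
    """Flush current state to the data file, then truncate the WAL.
    After this, recovery only needs WAL records written post-checkpoint."""
    tmp = data_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())       # data file contents are durable...
    os.replace(tmp, data_path)     # ...and swapped in atomically...
    with open(wal_path, "w") as f: # ...before the WAL is truncated
        f.flush()
        os.fsync(f.fileno())

state = {"a": 1, "b": 2}
checkpoint(state, "data.json", "wal.log")
print(os.path.getsize("wal.log"))  # -> 0
```

The ordering is the whole point: truncating the WAL before the data file is safely on disk would destroy the only copy of recent changes.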
Common Misconceptions
- "WAL makes writes slower" — WAL adds a sequential write, but it enables deferring random writes. The net effect is usually faster writes because the database can batch random I/O. This is why PostgreSQL with WAL is faster than without.
- "You need WAL only for crash recovery" — WAL is also the foundation for replication, point-in-time recovery, and change data capture. It is one of the most versatile components in a database.
- "Calling fsync after every write is always necessary" — Some databases (like Kafka) batch fsync calls for performance. This creates a small window of data loss risk, but the throughput improvement is significant. The trade-off is configurable.
- "WAL and the transaction log are different things" — In most databases, they are the same thing. PostgreSQL calls it WAL, SQL Server calls it the transaction log, MySQL/InnoDB calls it the redo log. Same concept, different names.
- "Disabling WAL improves performance" — Disabling WAL removes durability guarantees. A crash can corrupt the database irreparably. The performance gain is not worth it for any data you care about.
How This Appears in Interviews
WAL is a foundational concept in database and distributed systems interviews:
- "How does PostgreSQL ensure no data loss during a crash?" — WAL. All changes are written to the WAL and fsynced before acknowledging the commit. On recovery, PostgreSQL replays the WAL from the last checkpoint.
- "Why is Kafka so fast at writes?" — Kafka's append-only log structure means all writes are sequential. Sequential disk I/O is 100-1000x faster than random I/O.
- "Design a durable key-value store" — WAL for durability, in-memory hash map for fast reads, periodic snapshots (checkpoints) to bound recovery time.
- "How does database replication work?" — Ship WAL records from primary to replicas. Replicas replay the WAL to stay in sync. This is the basis of PostgreSQL streaming replication.
- "What is the difference between a WAL and a changelog?" — A WAL is a low-level durability mechanism recording physical page changes. A changelog (or event log) records logical events. Kafka is a logical log; PostgreSQL WAL is a physical log.
See our interview questions on databases for more practice.
Related Concepts
- Replication — WAL records are shipped to replicas for replication
- Consistent Reads — WAL position tracking enables read-after-write consistency
- Idempotency — WAL replay must be idempotent to handle repeated recovery
- Merkle Trees — used alongside WAL for data integrity verification
- Bloom Filters — used in storage engines alongside WAL for read optimization
- System Design Interview Guide