Write-Ahead Logging Explained: How Databases Survive Crashes

How WAL (Write-Ahead Logging) works — why databases write logs before data, crash recovery, checkpointing, and performance implications for durability.

walwrite-ahead-loggingdatabasesdurabilitycrash-recovery

Write-Ahead Logging

Write-Ahead Logging (WAL) is a technique where all modifications to data are first recorded in a sequential log before being applied to the actual data pages, ensuring crash recovery without data loss.

What It Really Means

Imagine you are updating a row in a database. The database needs to modify the data page on disk where that row lives. But disks are unreliable — a power failure or crash can occur mid-write, leaving the page in a corrupt, half-written state. How does the database recover?

The answer is WAL. Before modifying any data page, the database first writes a log record describing the change to a sequential, append-only log file. Only after the log record is safely on disk does the database modify the actual data page. If the system crashes at any point, the database can read the WAL on startup and either redo committed changes or undo uncommitted ones.

This is the fundamental mechanism behind ACID properties — specifically Atomicity and Durability. Every major relational database uses WAL: PostgreSQL calls it WAL, MySQL InnoDB calls it the redo log, SQLite calls it the WAL file, and SQL Server calls it the transaction log.

How It Works in Practice

The Write Path

The key insight: the data page on disk might be stale (still showing balance = 1000). But the WAL on disk records the committed change. On crash recovery, the database replays the WAL to bring data pages up to date.

Crash Recovery Scenarios

Crash after step 2 (WAL written, page not modified): On restart, replay WAL: apply the change to the data page. No data lost.

Crash after step 3 (page modified in memory, not flushed to disk): On restart, the in-memory page is gone. Replay WAL: apply the change to the on-disk page. No data lost.

Crash during step 2 (partial WAL write): The WAL record has a checksum. On restart, the partial record fails the checksum and is discarded. The transaction is treated as uncommitted. No corruption.

Checkpointing

Without checkpoints, crash recovery would need to replay the entire WAL from the beginning of time. Checkpoints periodically flush all dirty pages to disk and record a checkpoint position in the WAL. On recovery, only WAL records after the last checkpoint need to be replayed.

Implementation

PostgreSQL WAL configuration:

sql

Group commit optimization:

Group commit batches multiple transactions into a single fsync, dramatically improving throughput under concurrent workloads.

Trade-offs

Benefits:

  • Crash recovery without data loss
  • Sequential writes are faster than random writes (WAL is append-only)
  • Enables replication (replicas replay the WAL from the primary)
  • Enables point-in-time recovery (replay WAL to any past moment)
  • Enables Change Data Capture

Costs:

  • Write amplification: every change is written twice (WAL + data page)
  • Storage overhead: WAL files consume disk space until archived or recycled
  • fsync latency: synchronous commit adds ~1-5ms to each transaction
  • Checkpoint I/O spikes: flushing dirty pages causes temporary performance degradation

Tuning considerations:

  • synchronous_commit = off improves write throughput but risks losing the last ~600ms of transactions on crash
  • Larger max_wal_size reduces checkpoint frequency but increases recovery time
  • wal_compression = on reduces WAL size at the cost of CPU

Common Misconceptions

  • "WAL is only for crash recovery" — WAL enables replication, CDC, point-in-time recovery, and logical decoding. It is the foundation of most database features beyond basic storage.
  • "Disabling fsync improves performance safely" — Disabling fsync removes durability guarantees entirely. A crash can corrupt your database. Never disable fsync in production.
  • "WAL is a PostgreSQL-specific feature" — Every major database uses write-ahead logging. The names differ (redo log, transaction log) but the concept is universal.
  • "More frequent checkpoints are better" — More frequent checkpoints reduce recovery time but increase I/O load during normal operation. Find the balance based on your recovery time objectives.
  • "WAL records are only for the current transaction" — WAL records persist across transactions and are the basis for streaming replication and backup.

How This Appears in Interviews

  1. "How does a database ensure durability?" — Explain the WAL write-before-data pattern, fsync, and crash recovery.
  2. "What happens when a database crashes mid-transaction?" — WAL replay: committed transactions are redone, uncommitted are undone.
  3. "How does PostgreSQL replication work?" — Streaming replication ships WAL records from primary to replicas.
  4. "Design a system that needs exactly-once processing" — WAL-based CDC with idempotent consumers ensures no data is lost or duplicated.

Related Concepts

GO DEEPER

Learn from senior engineers in our 12-week cohort

Our Advanced System Design cohort covers this and 11 other deep-dive topics with live sessions, assignments, and expert feedback.