Change Data Capture Explained: Streaming Database Changes in Real Time
How Change Data Capture (CDC) works — Debezium, WAL-based capture, event-driven architectures, and keeping derived data stores in sync with your database.
Change Data Capture
Change Data Capture (CDC) is a pattern that identifies and captures changes made to data in a database, then delivers those changes in real time to downstream systems like search indexes, caches, data warehouses, or event streams.
What It Really Means
Every application has derived data — data that exists in multiple forms across multiple systems. Your product catalog lives in PostgreSQL, but it also needs to be in Elasticsearch for search, in Redis for caching, and in a data warehouse for analytics. The fundamental challenge is keeping these copies in sync.
The naive approach is dual writes: when your application writes to PostgreSQL, it also writes to Elasticsearch and Redis in the same request. This fails in practice because any one of those writes can fail, leaving systems out of sync with no reliable way to detect or fix the inconsistency.
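A sketch of the failure mode (the function and client names here are illustrative, not from any specific codebase):

```python
# Dual writes: the application updates every store itself.
# The window between the two writes is the problem.
def update_product(db, search, product_id, fields):
    db.execute(
        "UPDATE products SET name = %s WHERE id = %s",
        (fields["name"], product_id),
    )  # commits successfully

    # If the process crashes here, or Elasticsearch is down,
    # PostgreSQL and the search index silently diverge -- and
    # nothing records that a retry is needed.
    search.index(index="products", id=product_id, document=fields)
```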
CDC solves this by treating the database's transaction log as the single source of truth. Instead of the application writing to multiple systems, a CDC pipeline reads committed changes from the database's WAL (Write-Ahead Log) and streams them to downstream consumers. If Elasticsearch is down when a change occurs, the CDC pipeline retries until it succeeds. The database log is the authoritative record.
How It Works in Practice
Architecture
A typical pipeline: PostgreSQL writes committed changes to its WAL, a Debezium connector reads the WAL and publishes change events to Kafka topics, and downstream consumers (search indexers, cache invalidators, warehouse loaders) subscribe to those topics.
CDC Methods
Log-based CDC (recommended): Read changes directly from the database's transaction log (WAL in PostgreSQL, binlog in MySQL). This is non-invasive — no changes to the application or schema required.
Trigger-based CDC: Database triggers write changes to a staging table. A separate process reads the staging table and publishes events. Higher overhead on the database.
Polling-based CDC: Periodically query for rows with updated_at > last_poll_time. Simple, but it misses deletes, has latency proportional to the poll interval, and requires an updated_at column (see the polling sketch after this list).
Timestamp-based CDC: Similar to polling but uses a monotonically increasing sequence or timestamp. Cannot capture deletes without soft-delete patterns.
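For contrast with the log-based approach, a minimal polling sketch, assuming psycopg2 and an updated_at column; the table, column, and publish_change stub are all illustrative:

```python
import time
import psycopg2

def publish_change(row_id, name):
    print(f"change: product {row_id} -> {name}")  # stand-in for a real publisher

# Polling-based CDC: repeatedly query for rows changed since the last
# poll. Note what this cannot see: hard deletes, and any intermediate
# states a row passed through between polls.
conn = psycopg2.connect("dbname=shop")  # illustrative DSN
last_poll = "1970-01-01"

while True:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, name, updated_at FROM products "
            "WHERE updated_at > %s ORDER BY updated_at",
            (last_poll,),
        )
        for row_id, name, updated_at in cur.fetchall():
            publish_change(row_id, name)
            last_poll = updated_at.isoformat()
    time.sleep(5)  # latency is at least the poll interval
```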
Debezium Example
Debezium is the most popular open-source CDC platform. It reads database transaction logs and publishes change events to Kafka.
The event includes both the before and after state, the operation type (c=create, u=update, d=delete, r=read/snapshot), and the source position in the transaction log, which consumers use for offset tracking and deduplication.
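Sketched as a Python dict, an update event's payload looks roughly like this (values are illustrative and the envelope is abbreviated):

```python
# Abbreviated Debezium change-event payload for an UPDATE on products.
# Real events also carry schema metadata alongside the payload.
event_payload = {
    "op": "u",                                 # c, u, d, or r (snapshot)
    "before": {"id": 42, "name": "Old name"},  # row state before the change
    "after":  {"id": 42, "name": "New name"},  # row state after the change
    "source": {
        "connector": "postgresql",
        "ts_ms": 1700000000000,                # when the change was committed
        "lsn": 23876544,                       # WAL position (log sequence number)
    },
    "ts_ms": 1700000000123,                    # when Debezium processed it
}
```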
Implementation
Setting up Debezium with PostgreSQL and Kafka:
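A minimal setup sketch: the connector is registered through the Kafka Connect REST API (assumed here at localhost:8083). Host names, credentials, and the connector name are placeholders; PostgreSQL must have wal_level = logical, and the config keys follow Debezium's PostgreSQL connector.

```python
import json
import requests

# Register a Debezium PostgreSQL connector through the Kafka Connect
# REST API. Prerequisite: postgresql.conf has wal_level = logical.
connector_config = {
    "name": "products-connector",  # illustrative name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "localhost",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "postgres",   # placeholder credentials
        "database.dbname": "shop",
        "topic.prefix": "shop",            # topics become shop.public.products, ...
        "table.include.list": "public.products",
        "plugin.name": "pgoutput",         # PostgreSQL's built-in logical decoding plugin
    },
}

resp = requests.post(
    "http://localhost:8083/connectors",    # Kafka Connect REST endpoint
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector_config),
)
resp.raise_for_status()
```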
Consuming CDC events to update Elasticsearch:
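And a consumer sketch, assuming the kafka-python and elasticsearch client libraries, the topic from the connector above, and Kafka Connect's default JSON converter (events arrive wrapped in a schema/payload envelope):

```python
import json
from kafka import KafkaConsumer
from elasticsearch import Elasticsearch, NotFoundError

es = Elasticsearch("http://localhost:9200")
consumer = KafkaConsumer(
    "shop.public.products",                 # topic created by the connector above
    bootstrap_servers="localhost:9092",
    group_id="search-indexer",
    value_deserializer=lambda v: json.loads(v) if v else None,
)

for message in consumer:
    if message.value is None:               # tombstone after a delete; skip
        continue
    payload = message.value["payload"]
    op = payload["op"]
    if op in ("c", "u", "r"):
        row = payload["after"]
        # Indexing by primary key is idempotent: replaying the same
        # event overwrites the document with identical content.
        es.index(index="products", id=row["id"], document=row)
    elif op == "d":
        row = payload["before"]
        try:
            es.delete(index="products", id=row["id"])
        except NotFoundError:
            pass                            # already deleted; at-least-once replay
```

Writing by primary key keeps the consumer idempotent, which matters because delivery is at-least-once (see Common Misconceptions below).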
Trade-offs
Benefits:
- Database is the single source of truth (no dual writes)
- Low latency (typically sub-second propagation)
- Non-invasive (log-based CDC requires no application changes)
- Reliable (based on durable transaction log)
- Captures all changes including deletes
Costs:
- Infrastructure complexity (Kafka, Debezium, connectors)
- Replication slot management in PostgreSQL (an unconsumed slot prevents WAL from being recycled, so disk usage grows unbounded; see the monitoring sketch after this list)
- Schema evolution requires careful handling (what happens when you add a column?)
- Initial snapshot of existing data can be resource-intensive
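A small sketch of the slot monitoring mentioned above, assuming psycopg2; the DSN and alert threshold are arbitrary:

```python
import psycopg2

# How far behind is each replication slot? An unconsumed slot pins
# WAL on disk, so retained bytes grow until the consumer catches up.
conn = psycopg2.connect("dbname=shop")  # illustrative DSN
with conn.cursor() as cur:
    cur.execute("""
        SELECT slot_name,
               active,
               pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
        FROM pg_replication_slots
    """)
    for slot_name, active, retained_bytes in cur.fetchall():
        status = "active" if active else "INACTIVE"
        print(f"{slot_name}: {status}, retaining {retained_bytes} bytes of WAL")
        if retained_bytes > 10 * 1024**3:   # arbitrary 10 GiB alert threshold
            print(f"  warning: slot {slot_name} is holding back WAL recycling")
```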
When to use CDC:
- Keeping search indexes in sync with the database
- Populating caches that must reflect database state
- Building event-driven microservices from a monolithic database
- Real-time data warehouse ingestion
- Audit logging
Common Misconceptions
- "CDC replaces event-driven architecture" — CDC captures data changes, not domain events. "User signed up" is a domain event with business meaning. "Row inserted in users table" is a data change. Both have value, but they serve different purposes.
- "Log-based CDC adds load to the database" — Reading the WAL is lightweight. PostgreSQL already writes the WAL for durability. CDC reads it as an additional consumer. The overhead is minimal compared to trigger-based or polling approaches.
- "CDC guarantees exactly-once delivery" — Most CDC systems provide at-least-once delivery. Your consumers must be idempotent (processing the same event twice produces the same result).
- "CDC works without Kafka" — Kafka is common but not required. You can use CDC with Amazon Kinesis, Pulsar, or even direct HTTP webhooks, though Kafka provides the best durability and replay guarantees.
How This Appears in Interviews
- "How do you keep your search index in sync with the database?" — CDC with Debezium streaming changes from PostgreSQL to Elasticsearch via Kafka.
- "What are the problems with dual writes?" — Explain partial failures, ordering issues, and why CDC's log-based approach solves them.
- "Design a real-time analytics pipeline" — CDC from operational database to Kafka to data warehouse. Discuss latency, exactly-once semantics, and schema evolution.
- "How do you build an audit log?" — CDC captures every change with before/after state, timestamps, and transaction IDs.
Related Concepts
- Write-Ahead Logging — the database mechanism CDC reads from
- BASE Properties — CDC creates eventually consistent derived stores
- Materialized Views — database-level alternative to CDC for precomputed data
- Database Transactions — CDC respects transaction boundaries
- Server-Sent Events — delivering real-time updates to clients
- System Design Interview Guide