Skip to content

CDC Basics

Change Data Capture is a technique for capturing every INSERT, UPDATE, and DELETE that happens in a database and streaming those changes in real-time to other systems.

Instead of polling a table with "any new rows?", CDC subscribes to the database's transaction log (the WAL in Postgres) and receives changes as they happen — with millisecond latency and near-zero overhead on the database.

Polling#

You repeatedly ask the DB: "is there anything new?"

t=0s:  Poller: any new rows? → DB: no
t=5s:  Poller: any new rows? → DB: no
t=10s: Poller: any new rows? → DB: yes, 3 rows → publish
t=15s: Poller: any new rows? → DB: no

Problems: - Wasted DB queries when nothing is new - Up to N seconds latency (polling interval) - DB load scales with polling frequency, not with actual data volume

Tailing (CDC)#

You subscribe once. The DB pushes changes to you as they happen.

t=10:00:00.001: row inserted → CDC receives it instantly → publish

t=10:00:05.234: row inserted → CDC receives it instantly → publish
(no unnecessary queries in between)

Think of it like: - Polling = refreshing your email inbox every 5 seconds - Tailing = push notification — email arrives, you're notified instantly


The WAL — Write-Ahead Log#

Every DB write in Postgres first gets written to the WAL (Write-Ahead Log) before being applied to the actual tables. This is how Postgres guarantees crash recovery — if it crashes mid-write, it replays the WAL on startup.

App writes order_123
WAL entry appended: "INSERT orders (123, created, 49.99) at LSN 0/1A2B3C"
Data applied to orders table

The WAL is a sequential, append-only log of every single change to the DB. It already exists — CDC just reads it.


How CDC Reads the WAL#

Postgres has a feature called logical replication — the same mechanism it uses to replicate data to read replicas.

CDC tools connect to Postgres using logical replication and receive WAL changes in real-time:

graph TD
    APP[App Service] -->|write| PG[(Postgres)]
    PG -->|WAL entry| WAL[Write-Ahead Log]
    WAL -->|logical replication\nsame as read replica| CDC[CDC Tool\ne.g. Debezium]
    CDC -->|publish| K[Kafka]

    style WAL fill:#4a9,color:#fff

Key point: Postgres doesn't write to CDC — CDC reads from WAL via logical replication. Zero extra write overhead on Postgres.


What CDC Captures#

CDC captures every change at the row level:

INSERT: { op: "c", table: "outbox", after: { id: 1, event_type: "OrderCreated", ... } }

UPDATE: { op: "u", table: "outbox", before: { published: false }, after: { published: true } }

DELETE: { op: "d", table: "orders", before: { order_id: 123 } }

You can filter to only capture the tables you care about (e.g., just the outbox table).


CDC vs Polling Comparison#

Polling CDC
Latency Up to N seconds Milliseconds
DB overhead Queries every N seconds Near zero (reads WAL)
Complexity Simple to implement Requires CDC tool setup
Missed events Possible if polling gaps None (WAL is complete)
Use case Low-throughput, simple systems High-throughput, real-time

Key Insight#

CDC is not polling with a shorter interval — it's a fundamentally different mechanism. Polling adds load proportional to frequency. CDC adds near-zero load because it piggybacks on the WAL that Postgres was already writing for crash recovery.