Senior 3 min · June 25, 2026

Change Data Capture: How to Stream Database Changes Without Breaking Production

Q: What is Change Data Capture (CDC) and how does it work?

CDC captures row-level changes from a database's transaction log. Tools like Debezium read the log (e.g., PostgreSQL WAL) and emit events to Kafka. No SELECT queries are used, so the database isn't loaded.

Q: What's the difference between CDC and database triggers?

Triggers run in the database and can slow down writes. CDC runs outside the database by reading the log, so it doesn't affect write performance. Use CDC for streaming, triggers for simple audit logs.

Q: How do I set up CDC for PostgreSQL with Debezium?

Set `wal_level=logical` in postgresql.conf. Create a publication for the table. Configure Debezium connector with `plugin.name=pgoutput` and the publication name. Start connector; it creates a replication slot and streams changes.

Q: What happens if the CDC connector crashes and the database recycles the WAL?

You lose events that weren't consumed. To prevent this, monitor replication lag and set `max_slot_wal_keep_size` high enough. If data loss is unacceptable, use a transactional outbox pattern as a fallback.

Change Data Capture (CDC) streams database changes to downstream systems.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Notes here come from systems that actually shipped.

✓ Production

production tested

June 25, 2026

last updated

1,663

articles · all by Naren

● Production Incident 🔎 Debug Guide ⚙ Triage Commands

⚡Quick Answer

CDC works by reading the database's transaction log (e.g., PostgreSQL WAL, MySQL binlog) and converting each change into an event. Tools like Debezium or AWS DMS handle this. You get a stream of change events without modifying your application code.

✦ Definition~90s read

What is Change Data Capture (CDC)?

Change Data Capture (CDC) is a pattern that captures row-level changes (inserts, updates, deletes) from a database transaction log and streams them to consumers in near real-time. It avoids batch polling and reduces load on the source database.

★

Imagine a security camera that records every package that goes through a warehouse conveyor belt.

Plain-English First

Imagine a security camera that records every package that goes through a warehouse conveyor belt. Instead of stopping the belt to check each package, you watch the recording later and note every change. CDC is that recording — it captures every row change from the database transaction log without slowing down the database itself.

Everyone thinks CDC is just 'read the binlog and send to Kafka.' Then your payment service goes down at 3 AM because the CDC connector ran out of memory trying to process a 2GB transaction. CDC is deceptively simple in theory and brutally complex in production. The problem it solves is real: how do you react to database changes without polling every second and hammering your DB with SELECT queries? You need a stream of changes — for caches, search indexes, analytics, or event-driven architectures. After reading this, you'll know exactly how CDC works under the hood, which failure modes will bite you, and how to build a production-grade pipeline that doesn't fall over when a DBA runs an UPDATE without a WHERE clause.

Why Polling Is a Trap and CDC Is the Escape

Before CDC, teams polled tables with SELECT * FROM orders WHERE updated_at > ?. That works until your table has 100M rows and you're polling every 5 seconds. The DB spends CPU on repeated full scans. Writes slow down. Connection pools exhaust. CDC sidesteps this by reading the transaction log — the database's own record of changes. No SELECTs, no load. The trade-off: you now depend on log retention and connector health. But for any system that needs low-latency change streams (cache invalidation, search indexing, analytics), polling is a dead end.

PollingVsCDC.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Polling approach (bad)
SELECT * FROM orders WHERE updated_at > '2025-01-01 00:00:00';
// Every 5 seconds. Full table scan. Kills DB at scale.

// CDC approach (good)
// Debezium reads PostgreSQL WAL via logical replication slot.
// Outputs JSON like:
{"op":"c","after":{"id":123,"status":"shipped","updated_at":"2025-01-01T00:00:05Z"}}
// No SELECT. No load. Real-time.

Output

Polling: SELECT returns 0 rows after initial load. CDC: emits event within milliseconds of commit.

Production Trap:

Never use polling for tables with >1M rows updated frequently. You'll hit ERROR: connection pool exhausted within hours. CDC is not optional at scale.

thecodeforge.io

CDC: Stream DB Changes Without Breaking Production

Change Data Capture

thecodeforge.io

Polling vs CDC

Change Data Capture

How CDC Reads the Transaction Log Without Breaking the Database

CDC connectors attach to the database's transaction log. In PostgreSQL, that's a logical replication slot. In MySQL, it's the binlog position. The connector reads the log sequentially, parses each transaction into row-level events, and emits them to a message bus (Kafka, Pulsar, etc.). The key insight: the log is append-only and sequential. The connector never runs SELECT. It just streams. But there's a catch: the log has a retention window. If your connector falls behind, the DB may recycle old segments, and you lose data. That's why monitoring replication lag is critical.

DebeziumPostgresConnector.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Debezium connector config for PostgreSQL (JSON)
{
  "name": "orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres-primary",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "${DB_PASSWORD}",
    "database.dbname": "ecommerce",
    "database.server.name": "ecommerce-pg",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_slot",
    "publication.name": "debezium_pub",
    "publication.autocreate.mode": "filtered",
    "table.include.list": "public.orders",
    "max.queue.size": 8192,
    "max.batch.size": 2048,
    "snapshot.mode": "when_needed",
    "tombstones.on.delete": "false"
  }
}

Output

Connector starts. Creates replication slot 'debezium_slot'. Streams changes to Kafka topic 'ecommerce-pg.public.orders'.

Senior Shortcut:

Always set publication.autocreate.mode=filtered to avoid publishing all tables. Unfiltered publications expose every table change — including internal tables — and bloat the log.

thecodeforge.io

CDC Connector Pipeline

Change Data Capture

The Three Failure Modes That Will Wake You Up at 3 AM

CDC in production fails in three predictable ways. First: the connector falls behind and the DB recycles the log. You get ERROR: replication slot "debezium_slot" is active but the server's wal_level is insufficient. Fix: increase wal_keep_segments or max_slot_wal_keep_size. Second: the connector OOMs because a bulk UPDATE floods the queue. Fix: rate-limit with max.queue.size and max.batch.size. Third: schema changes break the connector. If you ALTER TABLE ADD COLUMN, the connector may fail to deserialize old events. Fix: use Avro with Schema Registry and enable schema evolution.

MonitorLag.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Check replication lag in PostgreSQL
SELECT slot_name, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_name = 'debezium_slot';
// If lag_bytes > 1GB, alert immediately.

// Check connector offset lag in Kafka
// Run kafka-consumer-groups:
kafka-consumer-groups --bootstrap-server kafka:9092 --group connect-orders --describe
// Look at LAG column. If > 100000, investigate.

Output

slot_name: debezium_slot, lag_bytes: 52428800 (50MB). Group: connect-orders, LAG: 0 (healthy).

Never Do This:

Don't set wal_keep_segments to unlimited. It fills disk. Set a reasonable max (e.g., 100GB) and alert on lag. Also, never drop a replication slot manually without stopping the connector — you'll lose the offset and force a full snapshot.

Exactly-Once Semantics: The Holy Grail and How to Get Close

CDC connectors promise at-least-once delivery by default. That means duplicates. For idempotent consumers (e.g., upsert into a cache), duplicates are fine. For non-idempotent (e.g., sending emails), duplicates are a disaster. To get exactly-once, you need: (1) connector with exactly-once support (Debezium 2.0+ with Kafka exactly-once source), (2) transactional outbox pattern in the source app, and (3) idempotent consumer with dedup by event ID. Even then, edge cases like connector crash after emit but before commit can cause duplicates. The only true exactly-once is in the consumer — make it idempotent.

IdempotentConsumer.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Python consumer with idempotent upsert
import psycopg2

def handle_event(event):
    event_id = event['payload']['source']['ts_ms'] + '-' + event['payload']['after']['id']
    # Check if already processed
    cur.execute("SELECT 1 FROM processed_events WHERE event_id = %s", (event_id,))
    if cur.fetchone():
        return  # Skip duplicate
    # Process: upsert into cache
    cache.upsert(event['payload']['after']['id'], event['payload']['after'])
    # Mark processed
    cur.execute("INSERT INTO processed_events (event_id) VALUES (%s)", (event_id,))
    conn.commit()

Output

Duplicate events: first processes, second skipped. No duplicate side effects.

Interview Gold:

When asked 'How do you achieve exactly-once in CDC?' answer: 'You don't. You make the consumer idempotent.' That shows you understand the distributed systems reality.

Schema Evolution: Why Your Connector Will Break on Monday Morning

Databases evolve. You add a column, rename one, or change a type. CDC connectors capture the schema at snapshot time. When the schema changes, the connector may fail to deserialize old events in the log. Solution: use a schema registry (Confluent Schema Registry) with Avro or Protobuf. The registry stores every version. The connector emits events with schema ID. Consumers fetch the schema and handle backward/forward compatibility. Without this, you'll see org.apache.kafka.connect.errors.DataException: JsonConverter failed to deserialize JSON.

AvroSchemaEvolution.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Debezium connector with Avro and Schema Registry
{
  "key.converter": "io.confluent.connect.avro.AvroConverter",
  "key.converter.schema.registry.url": "http://schema-registry:8081",
  "value.converter": "io.confluent.connect.avro.AvroConverter",
  "value.converter.schema.registry.url": "http://schema-registry:8081",
  "schema.name.adjustment.mode": "avro"
}

// Consumer reads Avro
// Schema evolves: add column 'discount' with default 0.0
// Old events have no 'discount' field. Consumer handles via schema compatibility.

Output

Connector registers schema 'ecommerce-pg.public.orders.Envelope' v1. After ALTER TABLE, registers v2. Consumer reads both versions.

The Classic Bug:

If you use JSON converter without schema registry and rename a column, the connector will fail with org.apache.kafka.connect.errors.DataException: JsonConverter failed to deserialize JSON. Always use Avro with schema registry in production.

When Not to Use CDC: The Overkill Trap

CDC is powerful but not free. You need to run Kafka (or similar), a connector cluster, and schema registry. That's operational overhead. If your use case is: (1) batch processing with hourly updates, (2) a single table with <1M rows, or (3) you don't need real-time, then polling with a timestamp column is simpler and cheaper. Also, if your database is a small SQLite or a read replica, CDC adds complexity without benefit. Don't use CDC for everything. Use it when you need sub-second change propagation and can't afford to poll.

Senior Shortcut:

Ask yourself: 'Can I tolerate 5-second delay?' If yes, polling with updated_at index is fine. If no, CDC is your tool. Also, if you already have Kafka, CDC is a no-brainer.

● Production incidentPOST-MORTEMseverity: high

The 4GB Container That Kept Dying

Symptom

Debezium connector pod OOM-killed every 20 minutes. Kafka topic had millions of uncommitted offsets.

Assumption

Assumed memory leak in Debezium. Restarted connector — same result.

Root cause

A developer ran UPDATE orders SET status='shipped' without a WHERE clause, generating 12 million change events. Debezium buffered them in memory (max.queue.size default 8192). Memory spiked to 4GB, container OOM killed.

Fix

Set max.queue.size=10000 and max.batch.size=2048. Added memory limit of 6GB. Also set snapshot.mode=when_needed to avoid re-snapshot on restart.

Key lesson

Always rate-limit your CDC connector and assume someone will run a bulk UPDATE.
Set memory limits based on max event size × queue size.

Production debug guideSystematic recovery paths for the failure modes engineers actually hit.3 entries

Symptom · 01

Connector OOM killed (pod restart loop)

→

Fix

1. Check connector logs for OutOfMemoryError. 2. Reduce max.queue.size to 8192. 3. Increase container memory to 4GB. 4. Restart connector. 5. If persists, check for bulk UPDATE in source DB.

Symptom · 02

Replication lag > 1 hour

→

Fix

1. Check pg_replication_slots lag. 2. Check if long-running transaction blocks WAL cleanup. 3. Kill the transaction. 4. Increase max_slot_wal_keep_size to 10GB. 5. Restart connector to force re-snapshot if lag too high.

Symptom · 03

Duplicate events in consumer

→

Fix

1. Check connector config: exactly.once.source=true? 2. If not, enable it. 3. Make consumer idempotent (upsert, dedup by event ID). 4. If using Kafka, check enable.idempotence=true in producer.

★ Change Data Capture (CDC) Triage Cheat SheetFirst-response commands for when things go wrong — copy-paste ready.

Connector logs `OutOfMemoryError`−

Immediate action

Check max.queue.size and memory limit

Commands

kubectl logs <connector-pod> --tail=100 | grep -i 'OutOfMemory'

kubectl describe pod <connector-pod> | grep -A2 Limits

Fix now

Set max.queue.size=8192 and container memory to 4Gi

Replication lag alert+

Connector fails with `DataException`+

No events in Kafka topic+

Feature / Aspect	CDC (Log-Based)	Polling (Timestamp)
Database Load	None (reads log)	High (SELECT queries)
Real-Time Latency	Sub-second	Poll interval (seconds to minutes)
Data Loss Risk	Log retention expiry	None (if timestamp indexed)
Operational Complexity	High (Kafka, connectors, schema registry)	Low (just a query)
Schema Changes	Requires schema registry	Handled by app code
Best For	High-throughput, low-latency streams	Low-volume, batch-friendly systems

Key takeaways

CDC reads the transaction log, not the table

zero load on the source database.

Always use Avro with schema registry for production CDC to survive schema changes.

Monitor replication lag via pg_replication_slots

it's the #1 cause of data loss.

CDC is overkill for low-volume systems; polling with an index is simpler and cheaper.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

How does Debezium handle schema changes in PostgreSQL? What happens if y...

Q02SENIOR

When would you choose CDC over dual-writes in an event-driven architectu...

Q03SENIOR

What happens when a CDC connector crashes after emitting an event but be...

Q04JUNIOR

What is Change Data Capture (CDC)?

Q05SENIOR

You notice the CDC connector is falling behind and replication lag is gr...

Q06SENIOR

Design a CDC pipeline that streams changes from a PostgreSQL database to...

Q01 of 06SENIOR

How does Debezium handle schema changes in PostgreSQL? What happens if you add a column without a default?

ANSWER

Debezium captures the schema at snapshot time. When a column is added, the connector may fail to deserialize old events if using JSON. With Avro and schema registry, it registers a new schema version. If the column has no default, old events lack the field; the consumer must handle null. Always use Avro with backward compatibility.

FAQ · 4 QUESTIONS

Frequently Asked Questions

What is Change Data Capture (CDC) and how does it work?

What's the difference between CDC and database triggers?

How do I set up CDC for PostgreSQL with Debezium?

What happens if the CDC connector crashes and the database recycles the WAL?

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Notes here come from systems that actually shipped.

✓ Verified

production tested

June 25, 2026

last updated

1,663

articles · all by Naren

🔥

That's Database Internals. Mark it forged?

3 min read · try the examples if you haven't