Senior 3 min · June 25, 2026

Change Data Capture: How to Stream Database Changes Without Breaking Production

Change Data Capture (CDC) streams database changes to downstream systems.

N
Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Notes here come from systems that actually shipped.

Follow
Production
production tested
June 25, 2026
last updated
1,663
articles · all by Naren
 ● Production Incident 🔎 Debug Guide ⚙ Triage Commands
Quick Answer

CDC works by reading the database's transaction log (e.g., PostgreSQL WAL, MySQL binlog) and converting each change into an event. Tools like Debezium or AWS DMS handle this. You get a stream of change events without modifying your application code.

✦ Definition~90s read
What is Change Data Capture (CDC)?

Change Data Capture (CDC) is a pattern that captures row-level changes (inserts, updates, deletes) from a database transaction log and streams them to consumers in near real-time. It avoids batch polling and reduces load on the source database.

Imagine a security camera that records every package that goes through a warehouse conveyor belt.
Plain-English First

Imagine a security camera that records every package that goes through a warehouse conveyor belt. Instead of stopping the belt to check each package, you watch the recording later and note every change. CDC is that recording — it captures every row change from the database transaction log without slowing down the database itself.

Everyone thinks CDC is just 'read the binlog and send to Kafka.' Then your payment service goes down at 3 AM because the CDC connector ran out of memory trying to process a 2GB transaction. CDC is deceptively simple in theory and brutally complex in production. The problem it solves is real: how do you react to database changes without polling every second and hammering your DB with SELECT queries? You need a stream of changes — for caches, search indexes, analytics, or event-driven architectures. After reading this, you'll know exactly how CDC works under the hood, which failure modes will bite you, and how to build a production-grade pipeline that doesn't fall over when a DBA runs an UPDATE without a WHERE clause.

Why Polling Is a Trap and CDC Is the Escape

Before CDC, teams polled tables with SELECT * FROM orders WHERE updated_at > ?. That works until your table has 100M rows and you're polling every 5 seconds. The DB spends CPU on repeated full scans. Writes slow down. Connection pools exhaust. CDC sidesteps this by reading the transaction log — the database's own record of changes. No SELECTs, no load. The trade-off: you now depend on log retention and connector health. But for any system that needs low-latency change streams (cache invalidation, search indexing, analytics), polling is a dead end.

PollingVsCDC.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
// io.thecodeforge — System Design tutorial

// Polling approach (bad)
SELECT * FROM orders WHERE updated_at > '2025-01-01 00:00:00';
// Every 5 seconds. Full table scan. Kills DB at scale.

// CDC approach (good)
// Debezium reads PostgreSQL WAL via logical replication slot.
// Outputs JSON like:
{"op":"c","after":{"id":123,"status":"shipped","updated_at":"2025-01-01T00:00:05Z"}}
// No SELECT. No load. Real-time.
Output
Polling: SELECT returns 0 rows after initial load. CDC: emits event within milliseconds of commit.
Production Trap:
Never use polling for tables with >1M rows updated frequently. You'll hit ERROR: connection pool exhausted within hours. CDC is not optional at scale.
CDC: Stream DB Changes Without Breaking Production THECODEFORGE.IO CDC: Stream DB Changes Without Breaking Production From transaction log to exactly-once delivery Transaction Log Capture Read committed changes without querying tables Change Event Stream Ordered, immutable log of inserts/updates/deletes Exactly-Once Semantics Idempotent sinks and offset tracking Schema Evolution Handling Avro/Protobuf with schema registry Failure Modes Log truncation, connector crash, network split ⚠ Overkill trap: CDC adds latency and ops cost for simple syncs Use polling or batch jobs when latency > 1 minute is acceptable THECODEFORGE.IO
thecodeforge.io
CDC: Stream DB Changes Without Breaking Production
Change Data Capture
Polling vs CDCTHECODEFORGE.IOPolling vs CDCWhy polling breaks at scalePollingSELECT with updated_at > ?Full scans on 100M rowsDB CPU spikes every 5sConnection pool exhaustionCDCReads transaction log onlyNo repeated table scansMinimal DB overheadStreams to message busCDC avoids polling's CPU and connection costs at scaleTHECODEFORGE.IO
thecodeforge.io
Polling vs CDC
Change Data Capture

How CDC Reads the Transaction Log Without Breaking the Database

CDC connectors attach to the database's transaction log. In PostgreSQL, that's a logical replication slot. In MySQL, it's the binlog position. The connector reads the log sequentially, parses each transaction into row-level events, and emits them to a message bus (Kafka, Pulsar, etc.). The key insight: the log is append-only and sequential. The connector never runs SELECT. It just streams. But there's a catch: the log has a retention window. If your connector falls behind, the DB may recycle old segments, and you lose data. That's why monitoring replication lag is critical.

DebeziumPostgresConnector.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
// io.thecodeforge — System Design tutorial

// Debezium connector config for PostgreSQL (JSON)
{
  "name": "orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres-primary",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "${DB_PASSWORD}",
    "database.dbname": "ecommerce",
    "database.server.name": "ecommerce-pg",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_slot",
    "publication.name": "debezium_pub",
    "publication.autocreate.mode": "filtered",
    "table.include.list": "public.orders",
    "max.queue.size": 8192,
    "max.batch.size": 2048,
    "snapshot.mode": "when_needed",
    "tombstones.on.delete": "false"
  }
}
Output
Connector starts. Creates replication slot 'debezium_slot'. Streams changes to Kafka topic 'ecommerce-pg.public.orders'.
Senior Shortcut:
Always set publication.autocreate.mode=filtered to avoid publishing all tables. Unfiltered publications expose every table change — including internal tables — and bloat the log.
CDC Connector PipelineTHECODEFORGE.IOCDC Connector PipelineFrom transaction log to message busDB Transaction LogPostgreSQL WAL or MySQL binlogCDC ConnectorDebezium reads sequentiallyRow-Level EventsParsed insert/update/deleteMessage BusKafka, Pulsar, or Kinesis⚠ Connector falls behind → DB recycles log → data lossTHECODEFORGE.IO
thecodeforge.io
CDC Connector Pipeline
Change Data Capture

The Three Failure Modes That Will Wake You Up at 3 AM

CDC in production fails in three predictable ways. First: the connector falls behind and the DB recycles the log. You get ERROR: replication slot "debezium_slot" is active but the server's wal_level is insufficient. Fix: increase wal_keep_segments or max_slot_wal_keep_size. Second: the connector OOMs because a bulk UPDATE floods the queue. Fix: rate-limit with max.queue.size and max.batch.size. Third: schema changes break the connector. If you ALTER TABLE ADD COLUMN, the connector may fail to deserialize old events. Fix: use Avro with Schema Registry and enable schema evolution.

MonitorLag.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
// io.thecodeforge — System Design tutorial

// Check replication lag in PostgreSQL
SELECT slot_name, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_name = 'debezium_slot';
// If lag_bytes > 1GB, alert immediately.

// Check connector offset lag in Kafka
// Run kafka-consumer-groups:
kafka-consumer-groups --bootstrap-server kafka:9092 --group connect-orders --describe
// Look at LAG column. If > 100000, investigate.
Output
slot_name: debezium_slot, lag_bytes: 52428800 (50MB). Group: connect-orders, LAG: 0 (healthy).
Never Do This:
Don't set wal_keep_segments to unlimited. It fills disk. Set a reasonable max (e.g., 100GB) and alert on lag. Also, never drop a replication slot manually without stopping the connector — you'll lose the offset and force a full snapshot.

Exactly-Once Semantics: The Holy Grail and How to Get Close

CDC connectors promise at-least-once delivery by default. That means duplicates. For idempotent consumers (e.g., upsert into a cache), duplicates are fine. For non-idempotent (e.g., sending emails), duplicates are a disaster. To get exactly-once, you need: (1) connector with exactly-once support (Debezium 2.0+ with Kafka exactly-once source), (2) transactional outbox pattern in the source app, and (3) idempotent consumer with dedup by event ID. Even then, edge cases like connector crash after emit but before commit can cause duplicates. The only true exactly-once is in the consumer — make it idempotent.

IdempotentConsumer.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
// io.thecodeforge — System Design tutorial

// Python consumer with idempotent upsert
import psycopg2

def handle_event(event):
    event_id = event['payload']['source']['ts_ms'] + '-' + event['payload']['after']['id']
    # Check if already processed
    cur.execute("SELECT 1 FROM processed_events WHERE event_id = %s", (event_id,))
    if cur.fetchone():
        return  # Skip duplicate
    # Process: upsert into cache
    cache.upsert(event['payload']['after']['id'], event['payload']['after'])
    # Mark processed
    cur.execute("INSERT INTO processed_events (event_id) VALUES (%s)", (event_id,))
    conn.commit()
Output
Duplicate events: first processes, second skipped. No duplicate side effects.
Interview Gold:
When asked 'How do you achieve exactly-once in CDC?' answer: 'You don't. You make the consumer idempotent.' That shows you understand the distributed systems reality.

Schema Evolution: Why Your Connector Will Break on Monday Morning

Databases evolve. You add a column, rename one, or change a type. CDC connectors capture the schema at snapshot time. When the schema changes, the connector may fail to deserialize old events in the log. Solution: use a schema registry (Confluent Schema Registry) with Avro or Protobuf. The registry stores every version. The connector emits events with schema ID. Consumers fetch the schema and handle backward/forward compatibility. Without this, you'll see org.apache.kafka.connect.errors.DataException: JsonConverter failed to deserialize JSON.

AvroSchemaEvolution.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
// io.thecodeforge — System Design tutorial

// Debezium connector with Avro and Schema Registry
{
  "key.converter": "io.confluent.connect.avro.AvroConverter",
  "key.converter.schema.registry.url": "http://schema-registry:8081",
  "value.converter": "io.confluent.connect.avro.AvroConverter",
  "value.converter.schema.registry.url": "http://schema-registry:8081",
  "schema.name.adjustment.mode": "avro"
}

// Consumer reads Avro
// Schema evolves: add column 'discount' with default 0.0
// Old events have no 'discount' field. Consumer handles via schema compatibility.
Output
Connector registers schema 'ecommerce-pg.public.orders.Envelope' v1. After ALTER TABLE, registers v2. Consumer reads both versions.
The Classic Bug:
If you use JSON converter without schema registry and rename a column, the connector will fail with org.apache.kafka.connect.errors.DataException: JsonConverter failed to deserialize JSON. Always use Avro with schema registry in production.

When Not to Use CDC: The Overkill Trap

CDC is powerful but not free. You need to run Kafka (or similar), a connector cluster, and schema registry. That's operational overhead. If your use case is: (1) batch processing with hourly updates, (2) a single table with <1M rows, or (3) you don't need real-time, then polling with a timestamp column is simpler and cheaper. Also, if your database is a small SQLite or a read replica, CDC adds complexity without benefit. Don't use CDC for everything. Use it when you need sub-second change propagation and can't afford to poll.

Senior Shortcut:
Ask yourself: 'Can I tolerate 5-second delay?' If yes, polling with updated_at index is fine. If no, CDC is your tool. Also, if you already have Kafka, CDC is a no-brainer.
● Production incidentPOST-MORTEMseverity: high

The 4GB Container That Kept Dying

Symptom
Debezium connector pod OOM-killed every 20 minutes. Kafka topic had millions of uncommitted offsets.
Assumption
Assumed memory leak in Debezium. Restarted connector — same result.
Root cause
A developer ran UPDATE orders SET status='shipped' without a WHERE clause, generating 12 million change events. Debezium buffered them in memory (max.queue.size default 8192). Memory spiked to 4GB, container OOM killed.
Fix
Set max.queue.size=10000 and max.batch.size=2048. Added memory limit of 6GB. Also set snapshot.mode=when_needed to avoid re-snapshot on restart.
Key lesson
  • Always rate-limit your CDC connector and assume someone will run a bulk UPDATE.
  • Set memory limits based on max event size × queue size.
Production debug guideSystematic recovery paths for the failure modes engineers actually hit.3 entries
Symptom · 01
Connector OOM killed (pod restart loop)
Fix
1. Check connector logs for OutOfMemoryError. 2. Reduce max.queue.size to 8192. 3. Increase container memory to 4GB. 4. Restart connector. 5. If persists, check for bulk UPDATE in source DB.
Symptom · 02
Replication lag > 1 hour
Fix
1. Check pg_replication_slots lag. 2. Check if long-running transaction blocks WAL cleanup. 3. Kill the transaction. 4. Increase max_slot_wal_keep_size to 10GB. 5. Restart connector to force re-snapshot if lag too high.
Symptom · 03
Duplicate events in consumer
Fix
1. Check connector config: exactly.once.source=true? 2. If not, enable it. 3. Make consumer idempotent (upsert, dedup by event ID). 4. If using Kafka, check enable.idempotence=true in producer.
★ Change Data Capture (CDC) Triage Cheat SheetFirst-response commands for when things go wrong — copy-paste ready.
Connector logs `OutOfMemoryError`
Immediate action
Check max.queue.size and memory limit
Commands
kubectl logs <connector-pod> --tail=100 | grep -i 'OutOfMemory'
kubectl describe pod <connector-pod> | grep -A2 Limits
Fix now
Set max.queue.size=8192 and container memory to 4Gi
Replication lag alert+
Immediate action
Check pg_replication_slots
Commands
SELECT slot_name, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes FROM pg_replication_slots WHERE slot_name = 'debezium_slot';
SELECT * FROM pg_stat_replication;
Fix now
Increase max_slot_wal_keep_size to 10GB; kill long-running transactions
Connector fails with `DataException`+
Immediate action
Check schema registry
Commands
curl -s http://schema-registry:8081/subjects | jq
curl -s http://schema-registry:8081/subjects/ecommerce-pg.public.orders-value/versions | jq
Fix now
Register new schema version; set schema.compatibility=BACKWARD
No events in Kafka topic+
Immediate action
Check connector status
Commands
curl -s http://connect:8083/connectors/orders-connector/status | jq
curl -s http://connect:8083/connectors/orders-connector/tasks/0/status | jq
Fix now
Restart connector; if still fails, check DB replication slot exists
Feature / AspectCDC (Log-Based)Polling (Timestamp)
Database LoadNone (reads log)High (SELECT queries)
Real-Time LatencySub-secondPoll interval (seconds to minutes)
Data Loss RiskLog retention expiryNone (if timestamp indexed)
Operational ComplexityHigh (Kafka, connectors, schema registry)Low (just a query)
Schema ChangesRequires schema registryHandled by app code
Best ForHigh-throughput, low-latency streamsLow-volume, batch-friendly systems

Key takeaways

1
CDC reads the transaction log, not the table
zero load on the source database.
2
Always use Avro with schema registry for production CDC to survive schema changes.
3
Monitor replication lag via pg_replication_slots
it's the #1 cause of data loss.
4
CDC is overkill for low-volume systems; polling with an index is simpler and cheaper.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
How does Debezium handle schema changes in PostgreSQL? What happens if y...
Q02SENIOR
When would you choose CDC over dual-writes in an event-driven architectu...
Q03SENIOR
What happens when a CDC connector crashes after emitting an event but be...
Q04JUNIOR
What is Change Data Capture (CDC)?
Q05SENIOR
You notice the CDC connector is falling behind and replication lag is gr...
Q06SENIOR
Design a CDC pipeline that streams changes from a PostgreSQL database to...
Q01 of 06SENIOR

How does Debezium handle schema changes in PostgreSQL? What happens if you add a column without a default?

ANSWER
Debezium captures the schema at snapshot time. When a column is added, the connector may fail to deserialize old events if using JSON. With Avro and schema registry, it registers a new schema version. If the column has no default, old events lack the field; the consumer must handle null. Always use Avro with backward compatibility.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
What is Change Data Capture (CDC) and how does it work?
02
What's the difference between CDC and database triggers?
03
How do I set up CDC for PostgreSQL with Debezium?
04
What happens if the CDC connector crashes and the database recycles the WAL?
N
Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Notes here come from systems that actually shipped.

Follow
Verified
production tested
June 25, 2026
last updated
1,663
articles · all by Naren
🔥

That's Database Internals. Mark it forged?

3 min read · try the examples if you haven't

Previous
ACID vs BASE
6 / 9 · Database Internals
Next
Database Federation