Change Data Capture: How to Stream Database Changes Without Breaking Production
Change Data Capture (CDC) streams database changes to downstream systems.
20+ years shipping large-scale distributed systems. Notes here come from systems that actually shipped.
CDC works by reading the database's transaction log (e.g., PostgreSQL WAL, MySQL binlog) and converting each change into an event. Tools like Debezium or AWS DMS handle this. You get a stream of change events without modifying your application code.
Imagine a security camera that records every package that goes through a warehouse conveyor belt. Instead of stopping the belt to check each package, you watch the recording later and note every change. CDC is that recording — it captures every row change from the database transaction log without slowing down the database itself.
Everyone thinks CDC is just 'read the binlog and send to Kafka.' Then your payment service goes down at 3 AM because the CDC connector ran out of memory trying to process a 2GB transaction. CDC is deceptively simple in theory and brutally complex in production. The problem it solves is real: how do you react to database changes without polling every second and hammering your DB with SELECT queries? You need a stream of changes — for caches, search indexes, analytics, or event-driven architectures. After reading this, you'll know exactly how CDC works under the hood, which failure modes will bite you, and how to build a production-grade pipeline that doesn't fall over when a DBA runs an UPDATE without a WHERE clause.
Why Polling Is a Trap and CDC Is the Escape
Before CDC, teams polled tables with SELECT * FROM orders WHERE updated_at > ?. That works until your table has 100M rows and you're polling every 5 seconds. The DB spends CPU on repeated full scans. Writes slow down. Connection pools exhaust. CDC sidesteps this by reading the transaction log — the database's own record of changes. No SELECTs, no load. The trade-off: you now depend on log retention and connector health. But for any system that needs low-latency change streams (cache invalidation, search indexing, analytics), polling is a dead end.
ERROR: connection pool exhausted within hours. CDC is not optional at scale.How CDC Reads the Transaction Log Without Breaking the Database
CDC connectors attach to the database's transaction log. In PostgreSQL, that's a logical replication slot. In MySQL, it's the binlog position. The connector reads the log sequentially, parses each transaction into row-level events, and emits them to a message bus (Kafka, Pulsar, etc.). The key insight: the log is append-only and sequential. The connector never runs SELECT. It just streams. But there's a catch: the log has a retention window. If your connector falls behind, the DB may recycle old segments, and you lose data. That's why monitoring replication lag is critical.
publication.autocreate.mode=filtered to avoid publishing all tables. Unfiltered publications expose every table change — including internal tables — and bloat the log.The Three Failure Modes That Will Wake You Up at 3 AM
CDC in production fails in three predictable ways. First: the connector falls behind and the DB recycles the log. You get ERROR: replication slot "debezium_slot" is active but the server's wal_level is insufficient. Fix: increase wal_keep_segments or max_slot_wal_keep_size. Second: the connector OOMs because a bulk UPDATE floods the queue. Fix: rate-limit with max.queue.size and max.batch.size. Third: schema changes break the connector. If you ALTER TABLE ADD COLUMN, the connector may fail to deserialize old events. Fix: use Avro with Schema Registry and enable schema evolution.
wal_keep_segments to unlimited. It fills disk. Set a reasonable max (e.g., 100GB) and alert on lag. Also, never drop a replication slot manually without stopping the connector — you'll lose the offset and force a full snapshot.Exactly-Once Semantics: The Holy Grail and How to Get Close
CDC connectors promise at-least-once delivery by default. That means duplicates. For idempotent consumers (e.g., upsert into a cache), duplicates are fine. For non-idempotent (e.g., sending emails), duplicates are a disaster. To get exactly-once, you need: (1) connector with exactly-once support (Debezium 2.0+ with Kafka exactly-once source), (2) transactional outbox pattern in the source app, and (3) idempotent consumer with dedup by event ID. Even then, edge cases like connector crash after emit but before commit can cause duplicates. The only true exactly-once is in the consumer — make it idempotent.
Schema Evolution: Why Your Connector Will Break on Monday Morning
Databases evolve. You add a column, rename one, or change a type. CDC connectors capture the schema at snapshot time. When the schema changes, the connector may fail to deserialize old events in the log. Solution: use a schema registry (Confluent Schema Registry) with Avro or Protobuf. The registry stores every version. The connector emits events with schema ID. Consumers fetch the schema and handle backward/forward compatibility. Without this, you'll see org.apache.kafka.connect.errors.DataException: JsonConverter failed to deserialize JSON.
org.apache.kafka.connect.errors.DataException: JsonConverter failed to deserialize JSON. Always use Avro with schema registry in production.When Not to Use CDC: The Overkill Trap
CDC is powerful but not free. You need to run Kafka (or similar), a connector cluster, and schema registry. That's operational overhead. If your use case is: (1) batch processing with hourly updates, (2) a single table with <1M rows, or (3) you don't need real-time, then polling with a timestamp column is simpler and cheaper. Also, if your database is a small SQLite or a read replica, CDC adds complexity without benefit. Don't use CDC for everything. Use it when you need sub-second change propagation and can't afford to poll.
updated_at index is fine. If no, CDC is your tool. Also, if you already have Kafka, CDC is a no-brainer.The 4GB Container That Kept Dying
UPDATE orders SET status='shipped' without a WHERE clause, generating 12 million change events. Debezium buffered them in memory (max.queue.size default 8192). Memory spiked to 4GB, container OOM killed.max.queue.size=10000 and max.batch.size=2048. Added memory limit of 6GB. Also set snapshot.mode=when_needed to avoid re-snapshot on restart.- Always rate-limit your CDC connector and assume someone will run a bulk UPDATE.
- Set memory limits based on max event size × queue size.
OutOfMemoryError. 2. Reduce max.queue.size to 8192. 3. Increase container memory to 4GB. 4. Restart connector. 5. If persists, check for bulk UPDATE in source DB.pg_replication_slots lag. 2. Check if long-running transaction blocks WAL cleanup. 3. Kill the transaction. 4. Increase max_slot_wal_keep_size to 10GB. 5. Restart connector to force re-snapshot if lag too high.exactly.once.source=true? 2. If not, enable it. 3. Make consumer idempotent (upsert, dedup by event ID). 4. If using Kafka, check enable.idempotence=true in producer.kubectl logs <connector-pod> --tail=100 | grep -i 'OutOfMemory'kubectl describe pod <connector-pod> | grep -A2 Limitsmax.queue.size=8192 and container memory to 4GiKey takeaways
pg_replication_slotsInterview Questions on This Topic
How does Debezium handle schema changes in PostgreSQL? What happens if you add a column without a default?
Frequently Asked Questions
20+ years shipping large-scale distributed systems. Notes here come from systems that actually shipped.
That's Database Internals. Mark it forged?
3 min read · try the examples if you haven't