CDC and Debezium for DE interviews

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Why CDC exists

Change Data Capture is a stream of row-level changes flowing out of your OLTP database into the warehouse in near real time. The alternative — a nightly batch dump of full tables — works fine until the table crosses a few hundred million rows, or a product team asks for the intermediate states of an order, or finance wants to reconcile cancellations within the hour.

Picture a checkout flow at a payments company like Stripe. An order transitions through new → authorized → captured → refunded in under five minutes. A nightly batch picks up only the final state — refunded — and the analyst loses every intermediate transition. Multiply by a million orders per day and the warehouse systematically under-reports authorization failures and confuses every cohort study downstream. That gap is exactly what CDC closes.

DE panels at Snowflake, Databricks, and Airbnb load up on CDC questions because the topic separates engineers who have only built batch pipelines from those who have shipped streaming systems. Expect three questions in rapid order: "What is CDC?", "How does log-based differ from query-based?", and "Walk me through Debezium." The third one is where mid-level candidates pull ahead.

Query-based CDC

The cheapest form of CDC is a poll loop against an updated_at column.

SELECT *
FROM orders
WHERE updated_at > :last_run;

It is the first approach most teams try — zero special permissions, any RDBMS with a timestamp column. That simplicity is also its weakness.

The good: no replication slot, no privileged log access, works against managed databases you do not control. Often the only option for vendor systems.

The bad: four failure modes that sound innocent until production breaks.

Failure mode What goes wrong Typical blast radius
Hard DELETE Row disappears from the result set, warehouse never learns it existed Refunds, cancellations, GDPR erasures
Updates without updated_at change Backfill scripts, ORM hooks, manual SQL skip the timestamp Silent state drift
Intermediate states Row mutates 5 times between polls, you see only the final value Funnel analytics, fraud signals
Late writes Transaction commits with updated_at < last_run after your poll Permanent data loss

When to reach for it anyway: dimension tables that change weekly, taxonomies, configuration tables, anything where DELETE does not occur and you can tolerate the daily granularity. For fact tables with active writes, do not.

Log-based CDC and Debezium

Log-based CDC reads the database's own write-ahead log — the same log the engine uses for crash recovery and replication. Every committed change is already there in order, with timestamps, transaction boundaries, and full before/after images.

The mapping by engine:

Database Log mechanism
Postgres WAL via pgoutput or wal2json (logical replication)
MySQL binlog (row format)
Oracle LogMiner or GoldenGate
SQL Server SQL Server CDC, change tracking
MongoDB oplog
DynamoDB DynamoDB Streams

Debezium is the open-source CDC platform that sits on top of these logs. It runs as a Kafka Connect source connector (or as a standalone Debezium Server), reads the database log, and publishes one Kafka topic per source table.

[Postgres WAL] -> Debezium PG connector -> Kafka topic "orders"
                                        -> consumers: warehouse, search, cache invalidation

A typical Debezium event for an UPDATE looks like this:

{
  "op": "u",
  "before": {"id": 1, "amount": 100, "status": "new"},
  "after":  {"id": 1, "amount": 100, "status": "paid"},
  "source": {"ts_ms": 1730000000, "lsn": "0/16B6378"},
  "ts_ms": 1730000001
}

The op field is the headline: c for create, u for update, r for snapshot read, d for delete. The before and after images let you reconstruct any column-level change. The source block carries the log sequence number (LSN) for Postgres or the binlog position for MySQL — that LSN is your idempotency key downstream.

Load-bearing trick: the source.lsn (or source.ts_ms for engines without LSN) is monotonic per source database. Dedup by (table, primary_key, lsn) and you will never apply a stale update on top of a fresh one — even when Kafka redelivers.

Why log-based wins on production fact tables: deletes are first-class, the database is not scanned, every intermediate state is captured, and latency drops to single-digit milliseconds under steady load. The price is operational: replication slots, log retention, and the risk that a slow consumer stalls WAL recycling and fills your primary's disk.

Snapshot and streaming

When you first turn Debezium on against a non-empty table, it cannot start from the current LSN — the warehouse would be missing every row that existed before that moment. So Debezium runs an initial snapshot.

T0: connector starts, records current LSN
T1: snapshot reads all 10M existing rows, emits 10M `op=r` events
T2: connector switches to streaming mode, resumes from LSN captured at T0
T3: real INSERT/UPDATE/DELETE events flow as `c` / `u` / `d`

Four snapshot modes you should be able to recite:

  • initial — the default. Snapshot once, then stream forever.
  • never — skip the snapshot. Use when you only care about changes from now on.
  • initial_only — snapshot and stop. Useful for one-off backfills.
  • when_needed — snapshot only when the connector cannot resume from the last LSN.

The historical gotcha: a classic snapshot takes a global read lock for the duration of the table scan. On a 1 TB orders table that meant hours of degraded write performance on the primary — a real incident pattern at companies on Debezium 1.5 and earlier. The incremental snapshot in Debezium 1.6+ chunks the table by primary key, snapshots each chunk under a short lock, and interleaves with live log reads. Treat incremental as the default.

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Delivery into the warehouse

Once CDC events land in Kafka, you have two dominant patterns.

Pattern A — append-only changelog. Every event lands as a new row. State is reconstructed at query time.

-- fact_orders_changelog
-- ts_ms | op | order_id | amount | status
-- 1000  | c  | 1        | 100    | new
-- 2000  | u  | 1        | 100    | paid
-- 3000  | d  | 1        | NULL   | NULL

SELECT order_id,
       argMax(amount, ts_ms) AS amount,
       argMax(status, ts_ms) AS status
FROM fact_orders_changelog
WHERE op != 'd'
GROUP BY order_id;

In Postgres or BigQuery you would use LAST_VALUE(...) OVER (PARTITION BY order_id ORDER BY ts_ms). In ClickHouse argMax is the idiom.

Pattern B — materialized current state. Every event triggers a MERGE/UPSERT and the warehouse holds the latest value per primary key.

Engine Idiomatic write
Snowflake MERGE INTO target USING stream ...
BigQuery MERGE with partition pruning on _PARTITIONTIME
ClickHouse ReplacingMergeTree with version column, periodic OPTIMIZE FINAL
Iceberg / Delta MERGE INTO with row-level deletes

Pattern A is cheaper to write, more expensive to read. Pattern B is the inverse. Big shops run both: bronze is append-only, silver is materialized. That two-layer split is also where dbt earns its keep.

Deduplication is non-optional. Kafka guarantees at-least-once, which means duplicate events will happen — connector restarts, consumer rebalances, network blips. The standard defense is to dedup on (source.lsn, table, primary_key) at the warehouse boundary, before the MERGE.

Schema changes

ALTER TABLE is where CDC pipelines go to die. The taxonomy you should bring into an interview:

Add column is the friendly case. New events carry the new field; old events do not. Warehouses with schema evolution (Iceberg, Delta, BigQuery autodetect) absorb this with zero downtime.

Drop column is medium-hard. New events stop carrying the column. Readers must treat missing values as NULL. Downstream views referencing the dropped column break loudly — and that is the right behavior.

Rename column is the worst. The old column disappears and a new one appears with no semantic link. The fix is a manual mapping in the consumer config or a coordinated dual-write window in the application.

Type change is "do not do this in place." Add a new column with the new type, dual-write, backfill, then drop the old column.

Debezium publishes a schema change topic carrying every DDL it observes. Mature pipelines subscribe to that topic and either auto-evolve the warehouse schema (Iceberg, Delta) or page a human.

Common pitfalls

Query-based CDC without soft delete. The single most common production incident pattern: the team picks query-based for simplicity, forgets that hard DELETEs are invisible, and a quarter later discovers the warehouse customer count is 15% higher than the source. The fix is either switch to log-based, or enforce soft-delete (deleted_at) at the application layer. There is no middle ground.

Not monitoring replication slot lag. Debezium falls behind, Postgres keeps every WAL segment the slot still needs, and disk fills to 100% at 3am. The primary refuses new writes and the on-call engineer learns about Postgres replication internals the hard way. The fix is an alert on pg_replication_slots.lag_bytes at the 75% disk-free threshold, plus a runbook entry to drop the slot under extreme pressure.

Ignoring Kafka partition ordering. Kafka guarantees order within a partition, not across. If your Debezium topic is keyed by order_id, all events for one order land in one partition and stay ordered. If somebody changes the key to customer_id, two updates for the same order_id can land on different partitions and arrive in the wrong order. The warehouse merges them in arrival order and ships the stale value. Always assert the key strategy in the connector config.

Snapshotting a terabyte table without incremental mode. The classic snapshot holds a read lock for hours and write latency spikes. The fix is to set snapshot.mode=incremental (or use execute-snapshot signals on Debezium 2.0+) and configure a sensible chunk size — typically 1024 rows per chunk for OLTP tables.

No idempotency at the consumer. Kafka duplicates leak straight into the warehouse, you get double-counted revenue, finance escalates. Dedup on (source.lsn, primary_key) at the staging layer, before MERGE. Bonus: this makes the pipeline replay-safe.

Mishandling tombstones. A Debezium DELETE is actually two messages: the op=d event with after=null, and a tombstone with a null payload telling Kafka log compaction it is safe to drop the key. Consumers that crash on null payloads silently drop deletes. Test deletes end-to-end before going live.

Assuming Debezium gives you history before T0. The initial snapshot captures the current state of every row at startup. Every change before Debezium was deployed is gone unless you pre-staged a dump. If business needs full historical lineage, run a one-time backfill from periodic dumps in parallel with the live stream, joined on primary key.

If you want to drill DE interview questions like CDC, Kafka, and warehouse modeling on a daily cadence, NAILDD is launching with a streaming-and-pipelines question bank built around exactly this pattern.

FAQ

Debezium or Kafka Connect JDBC Source — which should I pick?

The JDBC Source connector is query-based. It is straightforward to set up, runs against any database with JDBC drivers, and ships with zero special permissions. It also inherits every weakness of query-based CDC: no DELETE capture, no snapshot, no intermediate states. Debezium is log-based, harder to configure (replication slot, privileged user, log retention tuning), but production-grade. Rule of thumb: JDBC for analytics dimensions that change weekly, Debezium for anything transactional.

Can I run CDC without Kafka?

Yes. Debezium Server is a standalone runtime that emits events directly to Kinesis, Pub/Sub, Pulsar, or an HTTP sink. Debezium Embedded Engine is a Java library you drop inside your own application. Both skip Kafka. The trade-off is that you also skip Kafka's replay log, fan-out, and consumer group rebalancing — which you re-implement when you outgrow a single consumer. For greenfield projects with one downstream system, Debezium Server is the cleaner choice.

What does exactly-once mean in Debezium?

Strictly, Debezium gives you at-least-once delivery into Kafka. End-to-end exactly-once in the warehouse is the consumer's responsibility — typically through an idempotent MERGE keyed on the primary key plus the source LSN. Kafka transactional producers help, but the warehouse side still needs idempotent writes because you cannot bring Snowflake or BigQuery into the Kafka transaction.

Does CDC replace ETL?

It replaces the extract step on transactional sources, not the transform step. The modern pattern is CDC into a raw bronze layer (append-only changelog), dbt for transformation into silver and gold, BI on top. CDC does not handle file sources, vendor APIs, or anything outside a database — so most platforms run CDC alongside an ELT tool (Fivetran, Airbyte) for the long tail.

Does Debezium work against Postgres read replicas?

Yes, with caveats. Postgres logical replication can run from a primary or from a replica, but the replica needs hot_standby_feedback=on and the publication must exist on the primary. Reading from a replica offloads WAL decode cost from a busy primary, but you inherit replication lag — any hiccup in primary -> replica shows up in warehouse latency. For Postgres 10+, pgoutput is the recommended plugin.

Is this content official?

No. This guide is synthesized from the Debezium documentation, Postgres logical replication docs, and patterns observed across production deployments. Specific configuration values depend on your engine version, hardware, and workload — treat the numbers as a starting point.