dbt incremental models on the data engineering interview
Contents:
Why interviewers love this question
Shadow a senior data engineer at Snowflake, Databricks, or any analytics team at Stripe for a week and you will see the same conversation at every standup: a fact table is getting too big, the nightly job is running for six hours, and the on-call engineer needs to switch a model from full refresh to incremental before the SLO breaks. dbt incremental models are the standard way that conversation ends, which is why they show up in nearly every senior DE loop. The interviewer is not testing the keyword — they are testing whether you can choose the right strategy, reason about idempotency, and avoid silent duplicates when late-arriving data lands at 3am.
You usually get a setup like: "You have an orders fact table of 1.2 TB growing 8 GB per day, the source is a Postgres CDC stream, and analytics needs hourly freshness. Walk me through your dbt model." The bar is not a syntactically correct config block — it's whether you talk about unique_key, what happens on full refresh, how you handle a record updated three days after insert, and whether your model survives a re-run without producing one duplicate row.
Load-bearing rule: an incremental model is only correct if a re-run on the same window produces the same output. If you cannot defend idempotency in one sentence, your design is not done.
Why incremental at all
A full refresh of a 1 TB fact table on Snowflake burns warehouse credits every night. On BigQuery it ties up slots; on Databricks it spins a cluster long enough that the FinOps channel notices. The economic case is obvious — process only the rows that changed instead of rewriting the world. The engineering case is more interesting: incremental models give you predictable runtimes, which makes downstream SLAs schedulable, and let you scan raw object storage at most once per row.
{{ config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge'
) }}
SELECT *
FROM {{ source('raw', 'orders') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT COALESCE(MAX(updated_at), '1900-01-01') FROM {{ this }})
{% endif %}The first run reads the whole source and materializes the target. Every subsequent run only pulls rows where updated_at is newer than the highest updated_at already in the target. That single filter is the entire trick — everything else in this article is about not breaking it.
Incremental strategies compared
dbt exposes four incremental strategies, and choosing between them is the most common follow-up question after "what is incremental." The right answer almost always depends on two things: the warehouse you're on and whether updates can happen to existing rows. The table below is the cheat sheet I keep open during interview prep.
| Strategy | What it does | Needs unique_key |
Best for | Warehouses |
|---|---|---|---|---|
merge |
UPSERT — update matched rows, insert new ones | Yes | Mutable facts, dims with updates | Snowflake, BigQuery, Databricks, Postgres 15+ |
append |
INSERT new rows, no dedup, no update | No | Append-only event logs, clickstream | All adapters |
delete+insert |
DELETE rows by unique_key, then INSERT |
Yes | MERGE-less engines, Redshift older versions | Redshift, Postgres <15 |
insert_overwrite |
Rewrite whole partitions atomically | No (uses partition_by) |
Time-partitioned huge tables | BigQuery, Spark, Databricks |
merge is the default mental model for mutable data — a row can update, and you want the target to reflect the latest version. It uses the underlying SQL MERGE statement and requires a unique_key. If you forget the key, dbt either errors or silently duplicates depending on the adapter.
append is the cheapest strategy because it skips the dedup work entirely, but it is only safe when rows never update and you have an upstream guarantee against duplicates (Kafka with idempotent producer, S3 with partition naming, etc.). Use it for event logs, page views, ad impressions — anywhere the row, once written, is immutable.
insert_overwrite is the Databricks / BigQuery favorite for daily-partitioned event tables. Instead of merging individual rows, dbt rewrites entire partitions in one atomic step, which is much cheaper on columnar warehouses than row-level merges. The trade-off: your unique_key is implicitly the partition column, so any row in a touched partition gets reprocessed.
delete+insert is the fallback for engines without MERGE support. It runs two statements in a transaction — first DELETE by unique_key, then INSERT — and ends up logically equivalent to merge, with worse performance on most warehouses.
is_incremental() and a worked model
Two Jinja macros do all the heavy lifting. {{ this }} is a reference to the model's own target table — it lets you read the high-watermark out of the table you're about to write to. is_incremental() returns true only when the model already exists and you didn't pass --full-refresh. The combination of those two lets you write a single SQL file that behaves correctly on first run, every incremental run, and every refresh.
{{ config(
materialized='incremental',
unique_key=['user_id', 'event_date'],
on_schema_change='append_new_columns'
) }}
WITH source AS (
SELECT
user_id,
DATE_TRUNC('day', event_at) AS event_date,
COUNT(*) AS events,
SUM(amount) AS revenue
FROM {{ source('raw', 'events') }}
{% if is_incremental() %}
WHERE event_at >= (SELECT MAX(event_date) FROM {{ this }})
{% endif %}
GROUP BY 1, 2
)
SELECT * FROM sourceNotice the composite unique_key — [user_id, event_date] — which is what makes the daily aggregate idempotent. If you re-run today's window, dbt updates the existing (user_id, today) row instead of inserting a duplicate. The on_schema_change='append_new_columns' option tells dbt to silently extend the target table when a new column appears in the source, which is the right default for production. The other options are ignore (silently drop new columns — dangerous), sync_all_columns (also drop removed columns — destructive), and fail (best for tightly governed pipelines).
A small ergonomic note: always alias your CTE and select from it at the bottom. Inline SELECT ... {% if is_incremental %} filters get unreadable fast.
Late-arriving data
This is where the interview gets interesting. Late-arriving data is any record that lands in the source warehouse with an event_at older than the high-watermark you used last time. Mobile apps with offline batching produce late events constantly — a user goes through a tunnel, opens the app two hours later, and 47 events flush at once with timestamps from earlier in the day. If your incremental filter is event_at > MAX(event_at), those events vanish forever.
There are three production-tested fixes.
Lookback window. Reprocess the last N days every run, regardless of the high-watermark.
{% if is_incremental() %}
WHERE event_at >= (SELECT MAX(event_date) FROM {{ this }}) - INTERVAL '7 days'
{% endif %}This is the most common pattern. Pick N based on your p99 late-arrival window — for most consumer apps it's 2 to 3 days, for B2B SaaS with batch ingestion it can be 7 to 14 days. Critically, the lookback only works if your strategy is merge or delete+insert; with append you'll just duplicate the late events.
Use updated_at instead of event_at. If your source guarantees an updated_at that bumps every time the row mutates, filter by that instead. You catch every change including backdated edits, no lookback needed. The catch: this only works if the source actually maintains updated_at correctly, which is much rarer than upstream teams claim.
CDC. With Debezium, Fivetran, or a native CDC connector, every change in the source produces an event in the stream, including soft-deletes. Your incremental model consumes that stream, and the question of "what changed" is no longer your problem. CDC is the right answer for any pipeline where correctness matters more than infrastructure cost.
Gotcha: a lookback window without merge (or delete+insert) produces silent duplicates. Always pair them.
Full refresh and snapshots
Incremental builds are not a perfect substitute for full refresh — sometimes you need to rewrite the whole table. dbt run --full-refresh drops the target and rebuilds from scratch, which is the right move in three cases: you changed the model's business logic (a new column, a fixed join, a corrected aggregation), the upstream schema broke in a way that corrupted historical rows, or you suspect drift between the incremental output and what a full rebuild would produce. A reasonable cadence is monthly for stable models and weekly for newer ones; refreshing nightly defeats the whole point.
Snapshots are a separate dbt materialization for Slowly Changing Dimensions type 2 (SCD2). Where an incremental model overwrites the latest state, a snapshot keeps the history of every change with dbt_valid_from and dbt_valid_to columns.
{% snapshot dim_users_history %}
{{ config(
target_schema='snapshots',
unique_key='user_id',
strategy='TIMESTAMP',
updated_at='updated_at'
) }}
SELECT * FROM {{ source('raw', 'users') }}
{% endsnapshot %}Snapshots are the right answer when an interviewer asks "how would you track user plan changes over time" or "how do you handle a customer renaming their account." Incremental models give you the latest state cheaply; snapshots give you the point-in-time history at the cost of more storage.
Common pitfalls
The single most common interview failure is configuring an incremental model without a unique_key while using merge or delete+insert. dbt cannot deduplicate without a key, so on the second run you get every reprocessed row inserted twice. The fix is to either declare a real natural key on the model or fall back to append if and only if the source is genuinely append-only.
A close second is filtering on event_at when the source has true updated_at semantics. The model looks correct, passes review, ships to production, and then quietly misses every row that updates after its initial insert — common for orders, subscriptions, and any entity that transitions through statuses. The fix is to filter on updated_at (and confirm the upstream team actually maintains it) or add a generous lookback window backed by merge.
Teams also routinely forget the lookback window entirely. The model technically works, an interviewer's EXPLAIN shows a clean incremental filter, and yet 3% of late-arriving events vanish every day. The data team finds out two quarters later when an executive asks why MTD revenue doesn't reconcile to the OLTP source. Pick a lookback based on your p99 late-arrival window and pair it with merge.
Setting on_schema_change='ignore' in production is another quiet killer. A new column lands in the source, dbt silently drops it, and the downstream team that requested the field thinks it's been delivered. Use append_new_columns as your default and fail for any model with a regulatory contract.
Finally, running dbt run concurrently on the same model without orchestration causes race conditions on the MAX(updated_at) lookup — both runs see the same high-watermark, both insert the same rows, and you get duplicates even with merge. Either serialize the run through Airflow / Dagster, or take an advisory lock at the start of the model.
Related reading
- MERGE and UPSERT for data engineers
- What is dbt — Data Build Tool overview
- dbt Elementary for data quality
- Airflow on the data engineering interview
- Airflow backfill patterns
If you want to drill questions like this every day before your loop, NAILDD ships data engineering scenarios with the exact pitfalls interviewers probe — incremental strategies, late-arriving data, idempotency.
FAQ
Should I default to merge or delete+insert?
merge wins on any modern warehouse — Snowflake, BigQuery, Databricks, Postgres 15+ — because it runs as a single atomic statement and the optimizer handles row-level concurrency. delete+insert is the right fallback for engines without MERGE support (older Redshift, some niche adapters) and runs the two statements inside a transaction. Performance-wise, merge is usually 1.5x to 3x faster on identical workloads, and the gap widens as the target table grows.
What does insert_overwrite actually do on BigQuery?
It rewrites entire partitions instead of merging individual rows. dbt looks at the rows your model produced, identifies which partitions they touch, and atomically replaces those partitions in the target. This is cheaper than merge on columnar warehouses because BigQuery (and Databricks Delta) optimize partition-level rewrites very aggressively. The trade-off is that you cannot have per-row uniqueness without making the partition column part of your effective key.
Can I incrementally build a model that joins two large tables?
Technically yes, but it's a trap. If your incremental filter only looks at the left table, any update on the right side after the row was first written is missed. Example: you incrementally build fact_orders joined to dim_customers. A customer changes plan tier — the join key didn't change, your filter never reprocesses the order rows, and the historical fact silently lies. The pragmatic answer is to full-refresh join-heavy models weekly, or denormalize dimension columns into the fact at ingestion.
How do I test that an incremental model is idempotent?
Run it twice on the same source data and diff the output. If row counts or any aggregate change between run 1 and run 2, the model is not idempotent. The standard CI wiring is a dbt test asserting count(*) is stable across two consecutive runs of a sample dataset. dbt Elementary and Great Expectations cover the production monitoring side.
Is this dbt's official guidance?
No. This article is based on dbt 1.7+ documentation and dbt Labs community materials. Confirm specific behavior against the docs for the version and adapter you run, since incremental_strategy defaults differ across Snowflake, BigQuery, Databricks, and Postgres.