dbt on the Data Engineer interview
Contents:
Why dbt shows up in DE interviews
dbt is the default transformation layer for warehouse stacks — Snowflake, BigQuery, Databricks SQL, Redshift, ClickHouse. If a JD mentions any of those plus modeling, expect dbt questions. Netflix, Airbnb, Stripe, and GitLab publish dbt-style guides on their engineering blogs.
The pain dbt solves is load-bearing. Before dbt, transformation logic lived as scattered SQL scripts inside Airflow DAGs — no dependency graph, no tests, no documentation that stayed in sync with the code. dbt gives you three things you should recite under pressure: lineage (a DAG built from references between models), materialization (the rule for how a model's result is stored), and tests (assertions on column- and row-level data). A candidate who confuses "dbt is a transformation tool" with "dbt is an orchestrator" loses the room within five minutes.
Load-bearing trick: dbt does not move data or schedule itself. It compiles Jinja-templated SQL into warehouse-native DDL/DML through a database adapter. Airflow, Dagster, Prefect, or dbt Cloud's scheduler are what actually fire dbt run on a cadence.
Models, ref, and source
A model is a single .sql file with exactly one SELECT. dbt wraps it in CREATE TABLE AS, CREATE VIEW AS, or an incremental MERGE, depending on the materialization. Community naming: stg_* for staging, int_* for intermediate, fct_* / dim_* for marts.
-- models/marts/orders_daily.sql
SELECT
DATE_TRUNC('day', created_at) AS order_date,
COUNT(*) AS orders_cnt,
SUM(amount) AS revenue
FROM {{ ref('stg_orders') }}
GROUP BY 1The Jinja function ref() points at another model in the same project. From every ref call dbt builds the dependency graph and decides execution order — you never write the order yourself. That graph also powers dbt docs serve.
source() points at a raw table loaded into the warehouse by something outside dbt — Fivetran, Airbyte, an Airflow job, or a streaming ingester. Sources are declared once in YAML:
# models/staging/_sources.yml
sources:
- name: raw
schema: raw
tables:
- name: orders
loaded_at_field: _ingested_at
freshness:
warn_after: { count: 6, period: hour }
error_after: { count: 24, period: hour }Usage inside a model is FROM {{ source('raw', 'orders') }}. The reason source exists as a separate concept is twofold. First, it makes the boundary between "data we received" and "data we produced" explicit in the lineage graph. Second, freshness checks only run on sources, so you find out when an upstream stops shipping before the BI dashboard goes stale.
The single most predictable question on dbt-heavy interviews is "what's the difference between ref and source?". Clean answer: ref points at models dbt creates; source points at tables some other process loaded. Anything fancier is decoration.
Materializations
Materialization is the rule for how dbt persists a model's result. The four built-in materializations and when each is the right call:
| Materialization | What dbt creates | Right call when |
|---|---|---|
view |
CREATE VIEW |
Light transformation, query volume low, source is fast |
table |
CREATE TABLE AS SELECT |
Expensive joins/aggregations, read often, recompute cheap |
incremental |
INSERT / MERGE of new rows |
Large fact tables, append- or update-heavy |
ephemeral |
Inlined as CTE in downstream models | One-off intermediate logic with one consumer |
Materialization is configured either in dbt_project.yml (folder-level default) or inline at the top of the model file:
{{ config(materialized='incremental', unique_key='order_id') }}The classic ephemeral trap deserves its own paragraph. An ephemeral model is never written to the warehouse — dbt inlines its SQL as a CTE inside every downstream model that references it via ref. If five models depend on the same ephemeral model and the underlying logic is heavy, the warehouse executes that logic five times per run, once per downstream call site. Use ephemeral only for trivial staging-shape transformations with a single consumer; reach for view or table the moment the logic costs anything.
A useful default in production is to make staging models views and marts tables, then promote individual marts to incremental once they cross the "too slow to rebuild nightly" threshold.
Incremental strategies
An incremental model rebuilds only the new or changed rows on each run instead of recomputing the whole table. dbt exposes the strategy — the SQL pattern used to merge new rows in — as a config knob, and the available strategies depend on the adapter.
The four strategies you should be able to name:
append — plain INSERT of new rows. Cheapest. Does not deduplicate, so a re-delivered row from upstream will land as a duplicate in your fact table.
merge — UPSERT keyed by unique_key. More expensive than append, but updates existing rows when a new version arrives. This is the default on Snowflake, BigQuery, Databricks SQL, and Postgres 15+ where native MERGE exists.
delete+insert — delete matching rows by key, then insert. The standard pattern on Redshift and ClickHouse adapters where native MERGE is missing or slow. Functionally equivalent to merge for most workloads.
insert_overwrite — replace whole partitions at once. The best choice on partitioned BigQuery and Spark tables when you process data in daily batches; it sidesteps row-level UPSERT cost entirely.
The shape of a real incremental model:
{{ config(
materialized='incremental',
unique_key='event_id',
incremental_strategy='merge'
) }}
SELECT
event_id,
user_id,
event_type,
created_at
FROM {{ source('raw', 'events') }}
{% if is_incremental() %}
WHERE created_at > (
SELECT MAX(created_at) - INTERVAL '1 hour' FROM {{ this }}
)
{% endif %}The subtlety is the is_incremental() filter. If events arrive late — and at any company with mobile clients, they always do — a strict > MAX(created_at) filter silently drops late arrivals. The defensive pattern is to look back a buffer window matching your real lateness distribution and rely on unique_key + merge to dedupe.
Sanity check: if your incremental model drops unique_key while using merge, you will get silent duplicates the next time upstream delivers any row twice.
Backfill question, asked verbatim about half the time: "how do you fully reload an incremental model?" Answer: dbt run --full-refresh -s model_name rebuilds the table from scratch and ignores is_incremental() for that run.
Tests in dbt
dbt ships two layers of tests, and senior candidates are expected to know both.
Generic tests are declarative — written in YAML, executed against columns:
models:
- name: stg_orders
columns:
- name: order_id
tests:
- not_null
- unique
- name: status
tests:
- accepted_values:
values: ['new', 'paid', 'cancelled', 'refunded']
- name: user_id
tests:
- relationships:
to: ref('stg_users')
field: user_idThe four built-ins — unique, not_null, accepted_values, relationships — close roughly 80% of basic data quality checks. The relationships test is the dbt equivalent of a foreign-key assertion, useful in warehouses that don't enforce FKs themselves (which is most of them).
Singular tests are imperative — SQL files in the tests/ directory. A test fails when the query returns any rows:
-- tests/no_negative_revenue.sql
SELECT *
FROM {{ ref('orders_daily') }}
WHERE revenue < 0Run with dbt test. In CI, prefer dbt build — it interleaves run, test, and seed in DAG order, so a failing test on an upstream model stops downstream models from being built on top of bad data.
Two packages worth knowing: dbt-utils (helpers like equal_rowcount, expression_is_true, recency) and dbt-expectations (Great Expectations ported as dbt tests, with checks like expect_column_values_to_be_within_n_stdevs). Mentioning them unprompted is a senior signal.
Snapshots and SCD2
A snapshot is dbt's implementation of slowly changing dimension type 2 (SCD2). It tracks row-level changes over time by appending dbt_valid_from and dbt_valid_to columns, so you can answer "what did this customer's address look like on 2025-11-01?" without a separate audit table.
-- snapshots/orders_snapshot.sql
{% snapshot orders_snapshot %}
{{ config(
target_schema='snapshots',
unique_key='order_id',
strategy='check',
check_cols=['status', 'amount']
) }}
SELECT * FROM {{ source('raw', 'orders') }}
{% endsnapshot %}Two strategies. The timestamp strategy diffs by a single updated_at column — if the upstream value is newer, a new version row is written. Cheap, but only safe when upstream updates updated_at on every change. The check strategy compares the values in check_cols; use it when there is no trustworthy timestamp.
Run with dbt snapshot. On each run, changed rows get a new version inserted with dbt_valid_from = now() and the previous version's dbt_valid_to closed off.
Common follow-up: "when do you reach for a snapshot versus an incremental model?" Snapshots when you need history (the customer's plan on any past date, audit trails, regulatory replay). Incrementals when only the current state matters or the source is append-only by construction.
Common pitfalls
Using ephemeral for anything expensive. Because ephemeral models inline as CTEs in every downstream model, the warehouse re-executes the logic once per consumer. The fix is to materialize as view or table. A safe heuristic: if more than one downstream model references the same ephemeral, promote it.
Incremental models without a unique_key. With merge and no key, dbt cannot build the ON clause and either errors out or silently degrades to insert-only behavior depending on the adapter. With append, you get the same dedup gap without a warning. Always declare unique_key even when you think duplicates are impossible — late-arriving data and pipeline retries will eventually prove you wrong.
Nightly --full-refresh as a coping mechanism. An incremental model fails at 3am, someone is on call, and the temptation is to bolt --full-refresh onto the cron and walk away. A year later that fact table is 1.2 TB and the refresh takes four hours, blocking every downstream model. Alert on the original failure, run a scoped backfill with --full-refresh -s model_name, and leave the regular cadence incremental.
Tests only in YAML. Generic tests are great for column-shape assertions but cannot express invariants like "daily revenue is within 3 standard deviations of the trailing 30-day average". Use singular SQL tests in tests/ for cross-model invariants, plus dbt-expectations for statistical checks.
Sources without freshness. A sources.yml block that omits freshness means you find out the upstream stopped delivering when a stakeholder pings you. Set at minimum warn_after matching your SLA and error_after at twice that, then wire dbt source freshness into the same CI workflow that runs dbt build.
Snapshots in the production marts schema. Snapshots grow monotonically — every change on every row adds a version, forever. Left in production, they bloat storage and confuse consumers who don't expect SCD2 columns. Use a dedicated snapshots schema, plus a retention policy if regulations don't force unbounded history.
Related reading
- dbt incremental models for the Data Engineer interview
- dbt elementary tests for the Data Engineer interview
- dbt mart best practices for the DE interview
- What is dbt (data build tool)
- Airflow on the Data Engineer interview
- SQL for Data Engineer interviews
If you want to drill DE questions like this every day with explanations and warehouse-specific variants, NAILDD has 1,500+ problems across exactly this pattern.
FAQ
Is dbt an orchestrator like Airflow?
No, and the distinction matters. dbt does not schedule itself, does not pull from APIs, does not move data between systems. It is a transformation engine that compiles Jinja-templated SQL and executes it against a warehouse adapter. In production you have Airflow, Dagster, Prefect, or dbt Cloud's scheduler triggering dbt run and dbt test on a cadence. The cleanest mental model is "Airflow tells dbt when to run; dbt tells the warehouse what SQL to execute."
dbt Core versus dbt Cloud — which one do interviewers expect?
dbt Core is the open-source CLI you install via pip and run anywhere — laptop, CI runner, EC2, Kubernetes Job. dbt Cloud is the SaaS layer on top: browser IDE, scheduler, lineage UI, run history, alerts, semantic layer. Interviewers usually expect fluency in Core because that is what shows up in CI configs and on-call runbooks. Familiarity with Cloud is a plus; if you have only used Cloud, learn dbt run, dbt test, dbt build, and dbt compile from the command line before any senior loop.
How do I test a single model locally during development?
dbt run -s model_name builds just that model. dbt test -s model_name runs only its tests. dbt build -s model_name+ builds the model plus everything downstream — the trailing + is dbt's downstream selector, and +model_name means upstream-and-itself. For inspecting compiled SQL without executing, use dbt compile -s model_name and read the rendered file in target/compiled/ — the single most useful debugging trick when a Jinja macro misbehaves.
What do I do when a dbt model is slow?
Three moves in order. First, profile with dbt_project_evaluator and the warehouse's query history to find which step dominates runtime. Second, check the lineage in dbt docs serve for huge upstream scans that could be filtered earlier. Third, change the materialization: heavy views become tables, fast-growing tables become incremental, partitioned warehouses get insert_overwrite. Warehouse-specific tricks help too — clustering keys on Snowflake and BigQuery, sort keys on Redshift, ReplacingMergeTree on ClickHouse.
Does dbt work cleanly on ClickHouse?
Through the dbt-clickhouse adapter, with caveats. The merge strategy is emulated via delete+insert or routed through ReplacingMergeTree engines that dedupe asynchronously. Snapshots are supported with reduced options. For heavy joins, the idiomatic ClickHouse pattern is materialized views maintained at the engine level, not dbt-incremental — mentioning that boundary signals you have shipped on it.
Is this an official dbt Labs reference?
No. This guide is based on public dbt documentation and patterns commonly used by data teams. Specific expectations at any given company depend on their warehouse, scheduler, and review culture.