What is dbt (data build tool)
Contents:
What dbt is, in one paragraph
dbt (data build tool) is the framework analytics engineers use to transform data inside the warehouse using only SELECT statements. Instead of stitching together Python scripts, scheduler DAGs, and brittle CREATE TABLE AS chains, you write SQL files called models, reference them with ref(), and let dbt figure out run order, materialization, and tests. Work that used to need a data engineer with Airflow now needs anyone who can write a CTE.
In hiring terms this is the wedge that created the analytics engineer title. Stripe, Notion, and Airbnb post these roles at $140k-$190k base in the US — the bar is fluent SQL plus dbt fluency, not Spark internals. If you can walk through your own warehouse on a whiteboard, you are most of the way to the loop.
Load-bearing idea: dbt does not move data. It takes whatever is in your warehouse and turns it into clean tables and views in a deterministic order, with tests attached. Loading is somebody else's job (Fivetran, Airbyte, Snowpipe).
ELT vs ETL — why dbt became possible
The old shape was ETL: extract from a source, transform on a separate box (Informatica, Talend, Python workers), then load the cleaned result into the warehouse. Transform-before-load made sense when warehouses charged per terabyte scanned. Every join was a budget conversation.
Cloud warehouses flipped the economics. Snowflake, BigQuery, Databricks SQL, and Redshift bill compute separately from storage and scale on demand. The cheap move is now ELT: dump raw rows in first, transform them in place with SQL. The transform step lives next to the data and any analyst with read access can inspect every intermediate result.
dbt is the canonical "T" in ELT. It does not extract or load. It compiles your Jinja-flavored SQL, resolves the dependency graph, and materializes results as tables, views, or incremental tables. That narrow scope is the whole reason it won.
| Era | Loader | Transform | Orchestrator | Owner |
|---|---|---|---|---|
| 2010 ETL | Informatica | Informatica / SSIS | Cron | Data engineer |
| 2016 hybrid | Stitch | Custom Python | Airflow | Data engineer |
| 2020+ ELT | Fivetran / Airbyte | dbt | Airflow / dbt Cloud | Analytics engineer |
| 2026 modern | Fivetran / native CDC | dbt + SQLMesh | dbt Cloud / Dagster | Analytics engineer |
How dbt works: models, ref, sources
Model
A model is a .sql file with a single SELECT. dbt wraps it in DDL based on the model's materialization (view, table, incremental, ephemeral). Example models/mart/daily_revenue.sql:
SELECT
DATE(order_date) AS day,
COUNT(*) AS order_count,
SUM(amount) AS revenue,
AVG(amount) AS avg_order_value
FROM {{ ref('stg_orders') }}
WHERE status = 'completed'
GROUP BY DATE(order_date)The {{ ref('stg_orders') }} is a Jinja call into another model. At compile time dbt rewrites it to the fully-qualified table name for the current target — analytics_dev.stg_orders in your sandbox, analytics_prod.stg_orders in production. You never hardcode database or schema names again.
ref() — the function that does the work
ref() does two things at once, both critical on a real team:
- Builds the DAG. dbt scans every model for
ref()calls and constructs a directed acyclic graph. Parallel-safe branches run concurrently. - Resolves the right physical name. Dev, staging, prod, per-engineer schema, CI ephemeral schema for a pull request — same model code, different targets.
You can rename a model or change its materialization without grepping every dashboard for the old table name. That alone is worth the price of admission.
source() — declaring the raw data contract
Raw tables landed by Fivetran or Snowpipe are not dbt models, so they cannot be referenced with ref(). They are declared as sources instead:
-- models/staging/stg_orders.sql
SELECT
id AS order_id,
user_id,
amount,
created_at AS order_date,
status
FROM {{ source('app_prod', 'orders') }}
WHERE created_at >= '2025-01-01'Sources are wired up in YAML alongside the staging models:
# models/staging/sources.yml
version: 2
sources:
- name: app_prod
database: raw
schema: app_prod
tables:
- name: orders
loaded_at_field: _loaded_at
freshness:
warn_after: { count: 6, period: hour }
error_after: { count: 24, period: hour }
- name: usersThe freshness block lets dbt source freshness page you if Fivetran has been stuck for a day. Cheap insurance against the dashboard-shows-zeros conversation.
Project structure and the three layers
A canonical dbt project looks like this:
my_project/
├── dbt_project.yml
├── profiles.yml # connection config (not in git)
├── models/
│ ├── staging/ # 1:1 with sources, lightly cleaned
│ │ ├── stg_orders.sql
│ │ ├── stg_users.sql
│ │ └── sources.yml
│ ├── intermediate/ # joins, business logic, reusable
│ │ └── int_user_orders.sql
│ └── marts/
│ ├── finance/
│ │ └── daily_revenue.sql
│ └── product/
│ └── user_ltv.sql
├── tests/ # singular tests (free-form SQL)
├── macros/ # reusable Jinja/SQL functions
├── seeds/ # small CSVs versioned in git
└── snapshots/ # SCD type-2 history tablesThree layers, in order:
- staging — one model per raw source table. Rename columns, cast types, filter junk. No joins. Convention:
stg_<source>__<entity>. - intermediate — joins, window functions, anything two or more marts reuse. Usually materialized as
vieworephemeral. - marts — wide, dashboard-ready tables. Subdivided by domain (
finance,product,growth). Convention:fct_*for facts,dim_*for dimensions.
Without this every analyst writes their own version of user_orders and the warehouse turns into a junk drawer. With it, a new hire reads the lineage graph and knows where to put their next model in ten minutes.
Tests and documentation
dbt ships two test flavors, both run with dbt test:
# models/marts/schema.yml
version: 2
models:
- name: daily_revenue
description: "One row per calendar day. Revenue includes refunds netted."
columns:
- name: day
description: "Calendar day in UTC."
tests:
- not_null
- unique
- name: revenue
tests:
- not_null
- dbt_utils.expression_is_true:
expression: ">= 0"The four built-ins (not_null, unique, accepted_values, relationships) cover most of what you want; dbt_utils and dbt_expectations add another fifty. A singular test is a SQL file under tests/ that should return zero rows on success:
-- tests/assert_revenue_matches_finance.sql
SELECT day, revenue, finance_revenue
FROM {{ ref('daily_revenue') }} r
JOIN {{ ref('finance_close') }} f USING (day)
WHERE ABS(r.revenue - f.finance_revenue) > 1.00Documentation comes for free. Every description field renders into a static site via dbt docs generate, complete with a clickable lineage graph. Onboarding compresses from a week of Slack archaeology to an afternoon.
Sanity check: if a model has no tests and no description, treat it as untrusted. A green dbt build with no assertions is just a faster way to publish wrong numbers.
dbt Core vs dbt Cloud
| Dimension | dbt Core | dbt Cloud |
|---|---|---|
| License | Open source (Apache 2.0) | Commercial SaaS |
| Cost | Free | $100/dev/mo Team, custom Enterprise |
| Runtime | Your laptop, your container, your Airflow | Managed runners |
| Scheduler | Bring your own (Airflow, Dagster, cron) | Built-in jobs UI |
| IDE | VS Code + extension | Browser IDE with autocomplete |
| CI on PRs | Wire it yourself with GitHub Actions | First-class Slim CI |
| Hosted docs | Self-host or skip | Hosted at a stable URL |
Solo learner or a team already on Airflow — Core is plenty. Mid-size analytics team that wants Slim CI, scheduled jobs, and hosted docs without owning the infrastructure — Cloud is worth it. Enterprises increasingly run Core under Dagster or Airflow to keep dbt as one node in a wider pipeline.
End-to-end example: raw orders to LTV mart
Three files. No Airflow, no Python.
Step 1 — staging:
-- models/staging/stg_orders.sql
SELECT
id AS order_id,
user_id,
amount,
created_at AS order_date
FROM {{ source('app_prod', 'orders') }}
WHERE amount > 0Step 2 — intermediate:
-- models/intermediate/int_user_orders.sql
SELECT
user_id,
COUNT(*) AS total_orders,
SUM(amount) AS total_revenue,
MIN(order_date) AS first_order_at,
MAX(order_date) AS last_order_at
FROM {{ ref('stg_orders') }}
GROUP BY user_idStep 3 — mart:
-- models/marts/product/fct_user_ltv.sql
SELECT
o.user_id,
u.signup_date,
o.total_orders,
o.total_revenue AS ltv_usd,
o.first_order_at,
o.last_order_at,
DATEDIFF('day', u.signup_date, o.last_order_at) AS lifetime_days
FROM {{ ref('int_user_orders') }} o
JOIN {{ ref('stg_users') }} u USING (user_id)Run it:
dbt deps # install packages
dbt build # run + test, in dependency order
dbt docs generate && dbt docs servedbt build runs stg_orders and stg_users in parallel, then int_user_orders, then fct_user_ltv, executing every attached test along the way. If unique on fct_user_ltv.user_id fails the build stops and the downstream dashboard never sees a duplicated row.
Common pitfalls
Most teams trip on the same five things in their first six months, and none are about SQL.
Hardcoding table names instead of using ref(). It is tempting in a hotfix to write FROM analytics_prod.stg_orders directly. The cost stays invisible until a junior's CI run on a feature branch silently reads production, or until you rename a model and three dashboards break. Fix it with a pre-commit hook (dbt-checkpoint or a SQLFluff rule) that fails any FROM referencing a real schema.
Skipping the staging layer. "It's just a rename, I'll do it in the mart." Six months later your mart joins a raw source whose schema drifted last Tuesday, and ten downstream models need fixing instead of one. Always pin every raw table behind a stg_* model. That file is your contract boundary.
Treating tests as optional. Teams ship, watch dbt run go green for a month, never run dbt test. Then a NULL appears in a primary key and dashboards show duplicated revenue. Wire dbt build into your scheduler and treat a failing test the same as a failing build.
Ignoring incremental materialization on big fact tables. A table materialization rebuilds the entire model every run. For a 200M-row events table on Snowflake this is $40-$80 of compute per run. Switch to incremental with a sensible unique_key and your nightly run goes from twenty minutes to two.
Macros as a clever-code playground. A macro earns its place when it abstracts a pattern used by ten models. A macro wrapping two lines of SQL nobody else will read is a debugging tax on whoever inherits the project. Worth writing only if removing it would force you to copy code into at least three models.
Related reading
- What is Apache Airflow
- SQL window functions interview questions
- CTE vs subquery in SQL
- A/B testing peeking mistake
If you want to drill the SQL patterns that show up in dbt models (window functions, incremental merges, slowly changing dimensions), the NAILDD app is launching with 500+ analytics-engineering SQL problems pulled from real interview loops.
FAQ
Do I need a data warehouse to use dbt?
Yes. dbt executes SQL inside an existing warehouse — Snowflake, BigQuery, Databricks SQL, Redshift, Postgres, DuckDB, ClickHouse, and a long tail of community adapters. For learning, the cheapest credible setup is a local DuckDB file plus dbt-duckdb: installs in a minute, supports the same Jinja and ref semantics as production. BigQuery's free tier and Snowflake's 30-day trial are fine once you want to feel the cloud cost surface.
Is dbt hard to learn if I already know SQL?
If you can write a SELECT with joins, window functions, and CTEs, the dbt-specific concepts (ref, source, Jinja, materializations) are a weekend of reading and a week of building something real. The official dbt Fundamentals course is free and runs about five hours. The harder part is the cultural shift — version control, code review, treating SQL like software — and that takes a project, not a tutorial.
Does dbt replace Airflow?
No. dbt owns the transform step. Airflow (Dagster, Prefect) owns orchestration: extraction, Fivetran syncs, kicking off dbt, downstream ML jobs, alerts. The common shape is Airflow calls dbt build on a schedule. dbt Cloud's scheduler is fine for dbt-only orgs but does not replace a real orchestrator.
What is the difference between dbt Core and dbt Cloud?
Core is the open-source CLI. Cloud is a paid SaaS wrapping Core with a browser IDE, a job scheduler, hosted docs, Slim CI on pull requests, and SSO. Teams with strong CI/CD and an orchestrator can run Core and skip Cloud. Teams that want turnkey scheduling and a stable docs URL pay for Cloud. Same model code, migration later is mostly config.
How does dbt fit into the analytics engineer career path?
The analytics engineer title formalized around 2020 for the person who owns the warehouse model layer — between the data engineer (who lands raw data) and the analyst (who queries the marts). Posted base ranges sit at $140k-$190k in the US for senior IC roles at Stripe, Notion, Linear, and DoorDash, with total comp landing $180k-$260k once equity stacks. The loop is heavy on SQL, Kimball-style dimensional modeling, and a take-home where you extend a small dbt project.
What should I build first to learn dbt for real?
Pick a public dataset (Snowflake samples, BigQuery public datasets, Kaggle CSV in DuckDB), stand up a project with staging, intermediate, and marts folders, and ship one fact and one dimension with tests and a schema.yml. Add an incremental model for the largest table and wire GitHub Actions to run dbt build on every PR. Beats three online courses on a resume.