What is dbt (data build tool)

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

What dbt is, in one paragraph

dbt (data build tool) is the framework analytics engineers use to transform data inside the warehouse using only SELECT statements. Instead of stitching together Python scripts, scheduler DAGs, and brittle CREATE TABLE AS chains, you write SQL files called models, reference them with ref(), and let dbt figure out run order, materialization, and tests. Work that used to need a data engineer with Airflow now needs anyone who can write a CTE.

In hiring terms this is the wedge that created the analytics engineer title. Stripe, Notion, and Airbnb post these roles at $140k-$190k base in the US — the bar is fluent SQL plus dbt fluency, not Spark internals. If you can walk through your own warehouse on a whiteboard, you are most of the way to the loop.

Load-bearing idea: dbt does not move data. It takes whatever is in your warehouse and turns it into clean tables and views in a deterministic order, with tests attached. Loading is somebody else's job (Fivetran, Airbyte, Snowpipe).

ELT vs ETL — why dbt became possible

The old shape was ETL: extract from a source, transform on a separate box (Informatica, Talend, Python workers), then load the cleaned result into the warehouse. Transform-before-load made sense when warehouses charged per terabyte scanned. Every join was a budget conversation.

Cloud warehouses flipped the economics. Snowflake, BigQuery, Databricks SQL, and Redshift bill compute separately from storage and scale on demand. The cheap move is now ELT: dump raw rows in first, transform them in place with SQL. The transform step lives next to the data and any analyst with read access can inspect every intermediate result.

dbt is the canonical "T" in ELT. It does not extract or load. It compiles your Jinja-flavored SQL, resolves the dependency graph, and materializes results as tables, views, or incremental tables. That narrow scope is the whole reason it won.

Era Loader Transform Orchestrator Owner
2010 ETL Informatica Informatica / SSIS Cron Data engineer
2016 hybrid Stitch Custom Python Airflow Data engineer
2020+ ELT Fivetran / Airbyte dbt Airflow / dbt Cloud Analytics engineer
2026 modern Fivetran / native CDC dbt + SQLMesh dbt Cloud / Dagster Analytics engineer

How dbt works: models, ref, sources

Model

A model is a .sql file with a single SELECT. dbt wraps it in DDL based on the model's materialization (view, table, incremental, ephemeral). Example models/mart/daily_revenue.sql:

SELECT
    DATE(order_date)           AS day,
    COUNT(*)                   AS order_count,
    SUM(amount)                AS revenue,
    AVG(amount)                AS avg_order_value
FROM {{ ref('stg_orders') }}
WHERE status = 'completed'
GROUP BY DATE(order_date)

The {{ ref('stg_orders') }} is a Jinja call into another model. At compile time dbt rewrites it to the fully-qualified table name for the current target — analytics_dev.stg_orders in your sandbox, analytics_prod.stg_orders in production. You never hardcode database or schema names again.

ref() — the function that does the work

ref() does two things at once, both critical on a real team:

  1. Builds the DAG. dbt scans every model for ref() calls and constructs a directed acyclic graph. Parallel-safe branches run concurrently.
  2. Resolves the right physical name. Dev, staging, prod, per-engineer schema, CI ephemeral schema for a pull request — same model code, different targets.

You can rename a model or change its materialization without grepping every dashboard for the old table name. That alone is worth the price of admission.

source() — declaring the raw data contract

Raw tables landed by Fivetran or Snowpipe are not dbt models, so they cannot be referenced with ref(). They are declared as sources instead:

-- models/staging/stg_orders.sql
SELECT
    id          AS order_id,
    user_id,
    amount,
    created_at  AS order_date,
    status
FROM {{ source('app_prod', 'orders') }}
WHERE created_at >= '2025-01-01'

Sources are wired up in YAML alongside the staging models:

# models/staging/sources.yml
version: 2

sources:
  - name: app_prod
    database: raw
    schema: app_prod
    tables:
      - name: orders
        loaded_at_field: _loaded_at
        freshness:
          warn_after: { count: 6, period: hour }
          error_after: { count: 24, period: hour }
      - name: users

The freshness block lets dbt source freshness page you if Fivetran has been stuck for a day. Cheap insurance against the dashboard-shows-zeros conversation.

Project structure and the three layers

A canonical dbt project looks like this:

my_project/
├── dbt_project.yml
├── profiles.yml          # connection config (not in git)
├── models/
│   ├── staging/          # 1:1 with sources, lightly cleaned
│   │   ├── stg_orders.sql
│   │   ├── stg_users.sql
│   │   └── sources.yml
│   ├── intermediate/     # joins, business logic, reusable
│   │   └── int_user_orders.sql
│   └── marts/
│       ├── finance/
│       │   └── daily_revenue.sql
│       └── product/
│           └── user_ltv.sql
├── tests/                # singular tests (free-form SQL)
├── macros/               # reusable Jinja/SQL functions
├── seeds/                # small CSVs versioned in git
└── snapshots/            # SCD type-2 history tables

Three layers, in order:

  • staging — one model per raw source table. Rename columns, cast types, filter junk. No joins. Convention: stg_<source>__<entity>.
  • intermediate — joins, window functions, anything two or more marts reuse. Usually materialized as view or ephemeral.
  • marts — wide, dashboard-ready tables. Subdivided by domain (finance, product, growth). Convention: fct_* for facts, dim_* for dimensions.

Without this every analyst writes their own version of user_orders and the warehouse turns into a junk drawer. With it, a new hire reads the lineage graph and knows where to put their next model in ten minutes.

Tests and documentation

dbt ships two test flavors, both run with dbt test:

# models/marts/schema.yml
version: 2

models:
  - name: daily_revenue
    description: "One row per calendar day. Revenue includes refunds netted."
    columns:
      - name: day
        description: "Calendar day in UTC."
        tests:
          - not_null
          - unique
      - name: revenue
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"

The four built-ins (not_null, unique, accepted_values, relationships) cover most of what you want; dbt_utils and dbt_expectations add another fifty. A singular test is a SQL file under tests/ that should return zero rows on success:

-- tests/assert_revenue_matches_finance.sql
SELECT day, revenue, finance_revenue
FROM {{ ref('daily_revenue') }} r
JOIN {{ ref('finance_close') }} f USING (day)
WHERE ABS(r.revenue - f.finance_revenue) > 1.00

Documentation comes for free. Every description field renders into a static site via dbt docs generate, complete with a clickable lineage graph. Onboarding compresses from a week of Slack archaeology to an afternoon.

Sanity check: if a model has no tests and no description, treat it as untrusted. A green dbt build with no assertions is just a faster way to publish wrong numbers.

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

dbt Core vs dbt Cloud

Dimension dbt Core dbt Cloud
License Open source (Apache 2.0) Commercial SaaS
Cost Free $100/dev/mo Team, custom Enterprise
Runtime Your laptop, your container, your Airflow Managed runners
Scheduler Bring your own (Airflow, Dagster, cron) Built-in jobs UI
IDE VS Code + extension Browser IDE with autocomplete
CI on PRs Wire it yourself with GitHub Actions First-class Slim CI
Hosted docs Self-host or skip Hosted at a stable URL

Solo learner or a team already on Airflow — Core is plenty. Mid-size analytics team that wants Slim CI, scheduled jobs, and hosted docs without owning the infrastructure — Cloud is worth it. Enterprises increasingly run Core under Dagster or Airflow to keep dbt as one node in a wider pipeline.

End-to-end example: raw orders to LTV mart

Three files. No Airflow, no Python.

Step 1 — staging:

-- models/staging/stg_orders.sql
SELECT
    id          AS order_id,
    user_id,
    amount,
    created_at  AS order_date
FROM {{ source('app_prod', 'orders') }}
WHERE amount > 0

Step 2 — intermediate:

-- models/intermediate/int_user_orders.sql
SELECT
    user_id,
    COUNT(*)            AS total_orders,
    SUM(amount)         AS total_revenue,
    MIN(order_date)     AS first_order_at,
    MAX(order_date)     AS last_order_at
FROM {{ ref('stg_orders') }}
GROUP BY user_id

Step 3 — mart:

-- models/marts/product/fct_user_ltv.sql
SELECT
    o.user_id,
    u.signup_date,
    o.total_orders,
    o.total_revenue                              AS ltv_usd,
    o.first_order_at,
    o.last_order_at,
    DATEDIFF('day', u.signup_date, o.last_order_at) AS lifetime_days
FROM {{ ref('int_user_orders') }} o
JOIN {{ ref('stg_users') }}      u USING (user_id)

Run it:

dbt deps      # install packages
dbt build     # run + test, in dependency order
dbt docs generate && dbt docs serve

dbt build runs stg_orders and stg_users in parallel, then int_user_orders, then fct_user_ltv, executing every attached test along the way. If unique on fct_user_ltv.user_id fails the build stops and the downstream dashboard never sees a duplicated row.

Common pitfalls

Most teams trip on the same five things in their first six months, and none are about SQL.

Hardcoding table names instead of using ref(). It is tempting in a hotfix to write FROM analytics_prod.stg_orders directly. The cost stays invisible until a junior's CI run on a feature branch silently reads production, or until you rename a model and three dashboards break. Fix it with a pre-commit hook (dbt-checkpoint or a SQLFluff rule) that fails any FROM referencing a real schema.

Skipping the staging layer. "It's just a rename, I'll do it in the mart." Six months later your mart joins a raw source whose schema drifted last Tuesday, and ten downstream models need fixing instead of one. Always pin every raw table behind a stg_* model. That file is your contract boundary.

Treating tests as optional. Teams ship, watch dbt run go green for a month, never run dbt test. Then a NULL appears in a primary key and dashboards show duplicated revenue. Wire dbt build into your scheduler and treat a failing test the same as a failing build.

Ignoring incremental materialization on big fact tables. A table materialization rebuilds the entire model every run. For a 200M-row events table on Snowflake this is $40-$80 of compute per run. Switch to incremental with a sensible unique_key and your nightly run goes from twenty minutes to two.

Macros as a clever-code playground. A macro earns its place when it abstracts a pattern used by ten models. A macro wrapping two lines of SQL nobody else will read is a debugging tax on whoever inherits the project. Worth writing only if removing it would force you to copy code into at least three models.

If you want to drill the SQL patterns that show up in dbt models (window functions, incremental merges, slowly changing dimensions), the NAILDD app is launching with 500+ analytics-engineering SQL problems pulled from real interview loops.

FAQ

Do I need a data warehouse to use dbt?

Yes. dbt executes SQL inside an existing warehouse — Snowflake, BigQuery, Databricks SQL, Redshift, Postgres, DuckDB, ClickHouse, and a long tail of community adapters. For learning, the cheapest credible setup is a local DuckDB file plus dbt-duckdb: installs in a minute, supports the same Jinja and ref semantics as production. BigQuery's free tier and Snowflake's 30-day trial are fine once you want to feel the cloud cost surface.

Is dbt hard to learn if I already know SQL?

If you can write a SELECT with joins, window functions, and CTEs, the dbt-specific concepts (ref, source, Jinja, materializations) are a weekend of reading and a week of building something real. The official dbt Fundamentals course is free and runs about five hours. The harder part is the cultural shift — version control, code review, treating SQL like software — and that takes a project, not a tutorial.

Does dbt replace Airflow?

No. dbt owns the transform step. Airflow (Dagster, Prefect) owns orchestration: extraction, Fivetran syncs, kicking off dbt, downstream ML jobs, alerts. The common shape is Airflow calls dbt build on a schedule. dbt Cloud's scheduler is fine for dbt-only orgs but does not replace a real orchestrator.

What is the difference between dbt Core and dbt Cloud?

Core is the open-source CLI. Cloud is a paid SaaS wrapping Core with a browser IDE, a job scheduler, hosted docs, Slim CI on pull requests, and SSO. Teams with strong CI/CD and an orchestrator can run Core and skip Cloud. Teams that want turnkey scheduling and a stable docs URL pay for Cloud. Same model code, migration later is mostly config.

How does dbt fit into the analytics engineer career path?

The analytics engineer title formalized around 2020 for the person who owns the warehouse model layer — between the data engineer (who lands raw data) and the analyst (who queries the marts). Posted base ranges sit at $140k-$190k in the US for senior IC roles at Stripe, Notion, Linear, and DoorDash, with total comp landing $180k-$260k once equity stacks. The loop is heavy on SQL, Kimball-style dimensional modeling, and a take-home where you extend a small dbt project.

What should I build first to learn dbt for real?

Pick a public dataset (Snowflake samples, BigQuery public datasets, Kaggle CSV in DuckDB), stand up a project with staging, intermediate, and marts folders, and ship one fact and one dimension with tests and a schema.yml. Add an incremental model for the largest table and wire GitHub Actions to run dbt build on every PR. Beats three online courses on a resume.