dbt mart best practices for DE interviews
Contents:
Why interviewers ask about dbt layers
A Data Engineer interview at Stripe, Airbnb, or Snowflake almost always includes a project-structure question — usually phrased as "walk me through how you'd organize a dbt project for an order-events pipeline." What the interviewer is really probing is whether you can separate raw-source cleanup from business logic from downstream consumption. Candidates who answer with one giant SELECT filled with twelve CTEs fail this signal fast.
The three-layer pattern — staging, intermediate, mart — is the dbt Labs recommendation and the de facto industry standard. Knowing the layer boundaries, the naming rules, and why each layer exists separates a senior DE candidate from a junior one. The mart layer is where most expensive mistakes happen because that is the surface BI and ML teams consume, and rewrites there are politically painful.
Load-bearing trick: every interviewer wants to hear "staging is 1:1 with sources, marts are business-grain" in the first 90 seconds. Say it explicitly.
This guide walks through each layer with runnable SQL, naming patterns, the tests you should mention, and the pitfalls that cost candidates the offer. If you've only ever touched a single models/ folder with no subdirectories, this is the upgrade.
Layered structure that scales
A healthy dbt project ships with three clear subdirectories under models/. Each layer has a strict job and a strict naming rule. Mixing the jobs — joining sources directly inside a mart, or aggregating inside staging — is the most common cause of brittle pipelines.
models/
staging/ — raw cleanup, 1:1 with sources, no joins, no aggregates
intermediate/ — reusable joins and business-agnostic transformations
marts/ — business-facing fact and dimension tables consumed by BI| Layer | Purpose | Joins allowed? | Aggregates allowed? | Exposed to BI? |
|---|---|---|---|---|
| staging | Rename, cast, light filter | No | No | No |
| intermediate | Reusable joins, derived columns | Yes | Yes (only when reused) | No |
| marts | Business grain, BI-ready | Yes | Yes | Yes |
The reason this split matters is dependency direction. Staging depends only on source(). Intermediate depends only on ref('stg_*'). Marts depend on ref('int_*') or ref('stg_*') directly when no intermediate step is needed. If a mart starts pulling from source() directly, you've broken the contract — every audit, lineage graph, and column-level test downstream gets harder.
Staging models
Staging models follow the pattern stg_<source>__<table>.sql with a double underscore between source name and table name. The double underscore is not cosmetic — it's how dbt's select syntax discriminates between source and grain in larger projects.
Purpose: a thin cleanup layer that gives downstream models a stable, well-typed contract. One staging model per source table. No joins, no aggregations.
-- stg_app__users.sql
SELECT
id AS user_id,
email,
CAST(created_at AS TIMESTAMP) AS created_at,
country_code AS country
FROM {{ source('app', 'users') }}
WHERE deleted_at IS NULLWhat belongs here: column renames to your team's standard (id → user_id, snake_case everywhere), explicit casts (especially for timestamps and money), and soft-delete filters. What does not belong here: joins to other sources, currency conversion (that's business logic), or any aggregation. If a junior engineer reads stg_app__users.sql and wonders "is this what raw looks like, or has someone transformed it?" — you've failed the staging contract.
Intermediate models
Intermediate models use the prefix int_ and live in models/intermediate/. Their job is DRY — when the same join or derived calculation feeds three different marts, you lift it into an int_ model so the logic exists in exactly one place.
-- int_users_with_order_stats.sql
WITH users AS (
SELECT * FROM {{ ref('stg_app__users') }}
),
orders AS (
SELECT * FROM {{ ref('stg_app__orders') }}
)
SELECT
u.user_id,
u.email,
u.country,
COUNT(o.order_id) AS orders_count,
COALESCE(SUM(o.amount_usd), 0) AS lifetime_revenue_usd,
MIN(o.created_at) AS first_order_at,
MAX(o.created_at) AS last_order_at
FROM users u
LEFT JOIN orders o USING (user_id)
GROUP BY 1, 2, 3Crucially, intermediate models are not exposed to BI tools. They are scaffolding. If a Looker explore or a Tableau dashboard starts pointing at an int_ model, that's a signal the model should be promoted to a mart with a proper grain declaration.
Gotcha: if you only have one downstream consumer for a piece of logic, skip the intermediate layer entirely. Premature intermediate models bloat the DAG and slow CI.
Mart models
Marts are the contract with the rest of the company. They use one of three prefixes — fct_ for facts, dim_ for dimensions, and mart_ for wider denormalized tables that combine multiple subjects for a specific use case (executive dashboards, reverse-ETL exports).
-- fct_orders_daily.sql
SELECT
order_date,
country,
channel,
COUNT(DISTINCT order_id) AS orders,
COUNT(DISTINCT user_id) AS buyers,
SUM(amount_usd) AS gross_revenue_usd,
SUM(refund_amount_usd) AS refunds_usd,
SUM(amount_usd - refund_amount_usd) AS net_revenue_usd
FROM {{ ref('int_orders_enriched') }}
GROUP BY 1, 2, 3Two non-negotiable mart rules: every mart must declare its grain in a comment or YAML description, and every mart must have a primary key test. A fct_orders_daily model with the grain (order_date, country, channel) should have a dbt_utils.unique_combination_of_columns test on exactly those three columns. Skip this and the first time someone double-joins to a dimension table, your numbers silently inflate.
Marts are where on-call pages come from. Treat them like production code, because they are.
Naming conventions
Predictable names are a productivity multiplier. When a new hire joins and sees fct_orders_daily, they should know — without opening the file — that it's a daily-grain fact table consumed by BI. Compare with orders_v2_final_FINAL (real example, real shame) and you understand why interviewers care.
| Layer | Pattern | Example |
|---|---|---|
| source | raw_<system>.<table> |
raw_app.users |
| staging | stg_<source>__<table> |
stg_app__users |
| intermediate | int_<purpose> |
int_users_with_order_stats |
| dimension | dim_<entity> |
dim_users |
| fact | fct_<grain> |
fct_orders_daily |
| wide mart | mart_<subject>__<grain> |
mart_finance__monthly |
Two extra rules worth memorizing for the interview: plural for facts, singular for dimensions is the Kimball convention some teams follow (fct_orders vs dim_user), and avoid version suffixes — use git branches and dbt environments, not _v2.
Tests and contracts
When the interviewer asks "how do you guarantee data quality?", they expect three concrete answers. First, schema tests on every staging and mart model: unique and not_null on primary keys, accepted_values on enums, relationships for foreign keys. Second, freshness checks on sources via dbt source freshness so a stalled Fivetran sync trips an alert before BI does. Third, dbt contracts (added in dbt-core 1.5) that lock column types and names so a silent schema change in upstream breaks CI, not Monday-morning dashboards.
# models/marts/_marts.yml
models:
- name: fct_orders_daily
config:
contract:
enforced: true
columns:
- name: order_date
data_type: date
tests: [not_null]
- name: country
data_type: varchar
tests:
- not_null
- accepted_values:
values: ['US', 'CA', 'GB', 'DE', 'FR']A practical rule: every mart gets at least three tests — unique grain, not-null grain columns, and one relationship test to a dimension. Anything less and you're flying blind.
Common pitfalls
The most expensive pitfall on the mart layer is fan-out from a missing grain declaration. A junior engineer joins fct_orders to dim_users thinking both are at user grain, but fct_orders is at order grain, and the join silently multiplies revenue by orders-per-user. The fix is non-negotiable: every model declares its grain in YAML, and every consumer reads that grain before joining. This single discipline prevents most "the numbers don't match finance" incidents.
A second pitfall is business logic in staging. The temptation is real — you already have a WHERE clause, why not also convert currency or apply the active-subscription rule? Because the moment you do, a different mart that needs the raw view of the same data has to build a parallel staging model, and now you have two sources of truth for the same source table. Staging is a contract with raw, nothing more. Currency conversion, fiscal-period mapping, soft-delete-vs-hard-delete decisions — all of that belongs in intermediate or mart.
The third pitfall is the god-mart, a single table with 200 columns that tries to answer every dashboard question. It looks DRY in the abstract but in practice every change breaks something and the compile time creeps past 10 minutes. Split god-marts by subject area (mart_finance__monthly, mart_growth__weekly) and let BI tools join across them when needed.
Fourth, skipping incremental on large facts. A fct_events table with 2 billion rows does not need a full refresh nightly. Use materialized='incremental' with a partition predicate on event_date and your CI bill drops by 90%. Interviewers at Snowflake, Databricks, and any consumer-app shop will probe this.
Finally, letting BI point at staging or intermediate models. The whole point of the mart layer is to give consumers a stable, tested surface. If Looker is pulling from stg_app__users, you've broken the contract and any cleanup in staging becomes a coordinated cross-team migration. The fix is a dbt exposures declaration plus a lint rule in CI that flags ref('stg_*') calls from outside dbt.
Optimization tips
For projects beyond ~200 models, three optimizations pay back fast. First, incremental materializations on any fact larger than 50M rows — full refreshes are wasted compute. Second, +materialized: view for staging models in dev environments so you don't burn warehouse credits previewing renames. Third, Slim CI (dbt build --select state:modified+) so PRs only rebuild what changed plus downstream, not the entire project.
If you train these patterns through a question bank, naildd has DE-track problems specifically on dbt project structure, layer boundaries, and the test-coverage questions interviewers love.
Related reading
- What is dbt (data build tool)
- dbt incremental models for DE interviews
- dbt exposures for DE interviews
- dbt Elementary for DE interviews
- Data mart design for DE interviews
- DWH modeling for DE interviews
FAQ
Do I really need an intermediate layer if my project is small?
No. For projects under 30 models with a single downstream consumer per logic block, going straight from staging to marts is fine. The intermediate layer earns its weight when the same join or derived column feeds two or more marts — that's when DRY pays back. Adding int_ models prematurely just clutters the DAG and slows compile time without delivering reuse.
Should I put one model per source table in staging, or can I combine related tables?
One model per source table, always. Combining users and user_profiles into a single stg_app__users_with_profile model violates the 1:1-with-source contract and makes column-level lineage harder to trace. The join belongs in an int_users_enriched model. The discipline of "staging mirrors raw" is what makes the layer useful.
How do I name a mart that combines orders, refunds, and subscriptions for the finance team?
Use the mart_<subject>__<grain> pattern: mart_finance__monthly. The double underscore separates the subject area from the grain. Avoid stuffing every detail into the name — mart_finance_orders_refunds_subs_monthly_v2 is the wrong direction. The YAML description and column comments carry the detail; the name carries the navigation.
When should I use fct_ versus mart_ as the prefix?
Use fct_ for true Kimball facts — narrow, additive, at a clear grain like fct_orders_daily or fct_payments. Use mart_ for wider denormalized tables built for a specific consumer or dashboard, where you've intentionally pre-joined dimensions for query simplicity. The rule of thumb: if a BI tool would star-join it to a dimension, it's a fct_; if a BI tool would query it standalone, it's a mart_.
Is there a single-source-of-truth doc for these conventions?
The dbt Labs "How we structure our dbt projects" guide is the canonical reference and worth reading end-to-end before any DE interview. Most teams adapt it lightly — some prefer singular fact names, some keep an explicit core/ directory under marts — but the layer boundaries and naming patterns hold across companies. Interviewers will not penalize a reasonable adaptation; they will penalize a candidate who can't explain why the layers exist.
What's the most common dbt question in DE interviews right now?
Two questions dominate. First, "walk me through your project structure" — covered above. Second, "how would you make a fact table incremental?" — expect to discuss unique_key, merge vs delete+insert strategies, late-arriving data, and how you handle backfills. Both questions reward concrete, project-grounded answers over textbook recitation.