Data mart design on a Data Engineering interview
Contents:
What a data mart actually is
When a Data Engineering interviewer asks you to design a data mart for the marketing team, they are not asking for a Kimball lecture. They want to see whether you can scope a subject area, pick a materialization strategy, defend a refresh cadence, and assign clear ownership — in roughly five minutes, with hand-wavy whiteboard SQL. The candidates who flunk this question almost always over-engineer it: they try to model the entire warehouse, draw seven fact tables, and run out of clock before they talk about refresh.
A data mart is a slice of the warehouse curated for one consumer group. Marketing wants campaigns, leads, attribution paths. Finance wants revenue, costs, deferred balances, and P&L roll-ups. Operations wants orders, fulfillment SLAs, and inventory positions. Each of those becomes its own mart, pre-modeled and indexed against the queries that team actually runs every day. The mart is opinionated on purpose — that is the whole point.
Load-bearing trick: If the team can run their top five dashboards without joining across more than two tables in your mart, the scope is right. If they need a five-way join to answer "how did last week's campaign convert?", your model is wrong.
The biggest source of confusion on this question is the difference between a mart and a DWH layer. The DWH is the integrated, normalized truth — usually a 3NF or wide-fact Kimball star spanning every domain. The mart is downstream: denormalized, query-optimized, and scoped to one audience. You build the mart on top of the DWH, not in parallel with it.
Subject scope and naming
Interviewers love to test scope discipline because it is the cheapest signal of seniority. A junior candidate will say "let's put everything marketing-adjacent here". A senior candidate will say "this mart owns paid-acquisition funnel metrics from impression to MQL — anything past MQL belongs in the sales mart". Pick one subject area, one audience, one definition of done.
Name your tables so the subject is obvious in a Slack search. The convention most teams converge on looks like this:
| Layer | Prefix | Example | Purpose |
|---|---|---|---|
| Mart fact | mkt_*_fact |
mkt_campaign_performance_fact |
Daily-grain campaign metrics |
| Mart dim | mkt_*_dim |
mkt_channel_dim |
Slowly-changing channel attributes |
| Mart agg | mkt_*_agg |
mkt_attribution_funnel_agg |
Pre-aggregated funnel by week |
| Mart view | mkt_*_v |
mkt_active_campaigns_v |
Always-fresh exposure layer |
The prefix is doing real work — it lets a marketing analyst find every relevant table by typing mkt_ into the search box, and it tells anyone outside the team to leave these tables alone unless they have a strong reason to depend on them. Naming is the cheapest piece of governance you will ever ship.
Materialization choices
This is where most candidates either shine or stumble. The interviewer wants to hear that you can pick between four options and justify the choice with a workload reason, not a personal preference.
-- Always-fresh view: simple, slow on heavy joins
CREATE VIEW mkt_active_campaigns_v AS
SELECT c.campaign_id, c.name, l.leads_count, l.cost_usd
FROM raw.campaigns c
LEFT JOIN raw.leads_daily l ON l.campaign_id = c.campaign_id
WHERE c.status = 'active';
-- Full table refresh: fast reads, expensive rebuild
CREATE TABLE mkt_campaign_performance_fact AS
SELECT campaign_id, dt, impressions, clicks, leads, spend_usd
FROM raw.campaign_events
GROUP BY campaign_id, dt;
-- Incremental table: scan only new partitions
INSERT INTO mkt_campaign_performance_fact
SELECT campaign_id, dt, impressions, clicks, leads, spend_usd
FROM raw.campaign_events
WHERE dt > (SELECT MAX(dt) FROM mkt_campaign_performance_fact)
GROUP BY campaign_id, dt;Here is the cheat sheet that earns the nod:
| Materialization | Read latency | Build cost | Storage | When to use |
|---|---|---|---|---|
| View | High | None | None | Tiny tables, prototype layer, always-fresh requirement |
| Full-refresh table | Low | High | High | Small dim tables, nightly rebuild is fine |
| Incremental table | Low | Low (deltas only) | High | Append-only event marts, the default in Snowflake/BigQuery/ClickHouse |
| Materialized view | Low | Auto-managed | Medium | When the warehouse can refresh it for you (Snowflake DT, BQ MV, PG matview) |
In modern warehouses — Snowflake, BigQuery, Databricks, ClickHouse — the default answer for a mart fact table is an incremental table built by dbt or a similar tool, partitioned by event date, clustered on the high-cardinality dimension. Views work for dim tables and exposure layers. Full-refresh works for anything under a few million rows where rebuild cost is trivial. Materialized views are great when the warehouse manages them, painful when you manage them yourself.
Sanity check: If your build job runs longer than your refresh cadence, you have picked the wrong materialization. Daily mart that takes 26 hours to rebuild is not a daily mart.
Refresh cadence
Refresh is the question most candidates underweight. The interviewer wants a number, a justification, and an awareness of the trade-off between freshness and build cost. "Hourly" is not an answer. "Hourly on the campaign performance fact because the bidding team adjusts spend off it before 10am and 6pm, daily on the attribution roll-ups because they feed weekly business reviews" — that is an answer.
The default cadences I would defend on a whiteboard:
- Hourly for anything that drives operational decisions inside the day — pacing, fraud, on-call dashboards.
- Daily for reporting, finance, executive dashboards. Run it at 04:00 local so the East Coast morning is clean.
- Weekly for slow-moving aggregates — cohort retention, LTV roll-ups, leadership scorecards.
- On-demand for backfills and audits, triggered by a Slack command or a manual dbt run.
The implementation detail interviewers probe is how you handle late-arriving data. Marketing platforms backfill conversions for 48-72 hours after the event date. If your incremental key is strictly WHERE dt > MAX(dt), you will silently miss those backfills. The right move is a rolling reprocess window — re-run the last seven days every night, then a longer monthly catch-up for any deeper corrections.
-- Rolling reprocess window: handles 7-day backfill correctly
DELETE FROM mkt_campaign_performance_fact
WHERE dt >= CURRENT_DATE - INTERVAL '7 days';
INSERT INTO mkt_campaign_performance_fact
SELECT campaign_id, dt, impressions, clicks, leads, spend_usd
FROM raw.campaign_events
WHERE dt >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY campaign_id, dt;The candidates who mention late-arriving data without being prompted are the ones who get the offer. It signals real production experience.
Ownership and Data Mesh
The last 90 seconds of a strong answer is about who owns the mart. In the classic model, the central data team builds and maintains every mart, which scales to roughly two teams and then collapses under a backlog. The modern answer — the one interviewers at companies like Netflix, Airbnb, Notion, Stripe, and Uber want to hear — is domain ownership, the core Data Mesh principle.
- Marketing team owns
mkt_*tables: schema decisions, metric definitions, SLA on freshness, communication with stakeholders. - Finance team owns
fin_*tables: revenue recognition rules, deferred-revenue logic, audit lineage. - Platform DE team owns the infrastructure that lets every domain ship a mart: dbt project conventions, CI templates, monitoring, cost guardrails, and the contract tests that catch a breaking change before it hits production.
The DE on a Data Mesh team is a platform engineer first, modeler second. You ship the tools — the dbt macros, the freshness alerts, the data-contract checks — that let a marketing analyst safely own their own tables. That framing is the difference between a senior DE answer and a staff DE answer.
If you want to drill this exact interview pattern with 1,500+ DE questions and timed mock rounds, NAILDD is the practice tool we built for it.
Common pitfalls
The biggest pitfall is scope creep inside the mart itself. Candidates start with "marketing mart" and end with a model that includes product analytics, customer support tickets, and revenue. Each addition feels small in the moment, but six months later the mart has 47 tables, three definitions of "active campaign", and no clear owner. The fix is a one-sentence charter at the top of the dbt project — "this mart serves the paid-acquisition funnel from impression to MQL" — and a code-review rule that any PR adding a non-marketing column gets rejected.
A second pitfall is shipping a mart on top of raw tables instead of on top of the integrated DWH layer. It looks faster in the short run because you skip the staging and intermediate models, but you end up with three marts that each compute "campaign cost" slightly differently, and the finance team will eventually find the discrepancy on a board deck. Always layer mart -> intermediate -> staging -> raw, even when the intermediate layer looks redundant. The redundancy is the contract.
A third pitfall is ignoring cost. A daily full-refresh on a 2TB fact table costs real money on Snowflake or BigQuery, and the finance team will eventually ask why the warehouse bill jumped 40%. Track cost per refresh as a first-class metric, set a budget alert at $50 per run for any single model, and convert anything that crosses the line to incremental. The candidate who mentions this number on a whiteboard signals they have actually owned a production warehouse bill.
A fourth pitfall is silent breakage between the source and the mart. The campaigns team renames a column in the source CRM, your mart job swallows the change, and a downstream dashboard quietly shows zeros for two weeks. The fix is data contracts — explicit schema, type, and freshness expectations between every layer, enforced in CI. Even a lightweight check (column exists, row count above floor, max date within 24 hours) catches 80% of the failures.
A fifth and final pitfall is no documentation of refresh schedule and SLA. Analysts open the table at 09:00, see stale data, and either Slack the DE team or — worse — make decisions on numbers they think are fresh. Every mart table should expose a _last_refreshed_at column and the expected SLA in the dbt YAML, and the freshness dashboard should be a public page anyone can read without asking.
Related reading
- Data contracts — Data Engineering interview
- Data lineage — Data Engineering interview
- dbt incremental models — Data Engineering interview
- Materialized views — Data Engineering interview
- SQL for the Data Engineer interview
FAQ
Is a data mart the same as a data warehouse?
No. The warehouse is the integrated, governed layer that holds the canonical view of every business entity — usually modeled as a 3NF set of tables or a wide Kimball star. The mart is a downstream, opinionated slice built for one team's queries, with denormalization, aggregations, and a naming convention that team controls. You can think of the warehouse as the library and the mart as the curated shelf for one reading group.
Should every team have its own mart?
Only if the team has at least three regular use cases that don't fit cleanly into an existing mart and someone on that team is willing to own schema decisions. Spinning up a mart for a one-off project or for a team without a designated owner is how warehouses end up with 40 orphan tables. A useful heuristic: if the team can't name the person on-call for the mart in six months, don't ship it.
What refresh cadence should I default to in an interview answer?
Daily at 04:00 local for the bulk of the mart, hourly for any table driving same-day operational decisions, weekly for slow-moving aggregates. State the cadence, justify it with the consumer's decision loop (operational vs reporting vs strategic), and mention the rolling reprocess window for late-arriving data. That structure earns full marks on every rubric I have seen.
Incremental table or materialized view — which one wins?
In Snowflake, BigQuery, Databricks, and ClickHouse, the practical answer for a mart fact table is an incremental table built by dbt, partitioned by event date. Materialized views are great when the warehouse fully manages them and your query pattern is stable, but they get expensive and hard to debug when the underlying logic involves multiple joins or window functions. For dim tables and simple exposure layers, views are fine — the warehouse will plan around them.
How do I handle late-arriving data without breaking idempotency?
Use a rolling reprocess window — the last seven days are recomputed every run, and a monthly job re-runs the last 90 days. The mart job becomes idempotent because you delete and re-insert the same window, so running it twice produces the same result. For deeper corrections, expose an on-demand backfill command that takes a date range and rebuilds only that slice without rewriting history.
What is the difference between owning a mart and owning the platform?
The domain team that owns the mart owns the business semantics — what counts as an active campaign, how attribution windows work, whether refunds offset revenue. The central DE platform team owns the tools, conventions, and guardrails — the dbt project structure, the CI checks, the freshness alerts, the cost budgets. Both ownership models coexist in a Data Mesh, and the platform team's job is to make the domain team's job as boring and safe as possible.