DWH layers on the data engineering interview

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Why interviewers keep asking

"Walk me through your DWH layers" shows up in roughly every data engineering loop, and the reason is mundane — it's the fastest way to see whether a candidate has touched a real warehouse or only read Snowflake docs. A senior at Stripe or Databricks names staging, ODS, DDS, mart, and semantic in the same breath; a junior collapses half of them into "the silver table" and moves on. Both can be right, but the framing tells the interviewer how to calibrate the next forty minutes.

The trap is that layer names are not universal. Inmon, Kimball, and lakehouse vendors use overlapping vocabulary for non-overlapping concepts, and the modern medallion (bronze / silver / gold) is a marketing simplification of a 1990s architecture. Reciting Databricks docs is how strong engineers fail this round. The interviewer wants to hear why each layer exists, what breaks if you skip it, and how your last team actually drew the line.

Load-bearing trick: name the layers, then for each one answer two sub-questions — what is the write pattern (append, overlay, merge) and who owns the next transformation. If you can do that, you've answered 80% of the follow-ups before they ask.

The classic layered DWH

Source -> Staging -> ODS -> DDS -> Data Mart -> Semantic Layer -> BI

Every layer has a single job, and the contract between layers is what makes the pipeline debuggable two years later. Skip a layer and you save engineering hours up front; you also lose the ability to answer "what did this row look like before the cleanup job touched it" when audit comes knocking.

The medallion model collapses this into three buckets — bronze, silver, gold — which works fine for greenfield lakehouses but maps awkwardly onto regulated industries where ODS and DDS are explicit contractual obligations.

Staging

Staging holds raw data exactly as the source sent it, with the minimum transformations needed to survive a query. Append-only, original payload preserved as JSON or blob, a couple of type casts so the warehouse stops complaining, and audit columns — load_timestamp, source_file, batch_id, ingestion_run_uuid.

The mental model is forensic. If a downstream metric goes weird on Monday, you should be able to walk back to staging and replay exactly what landed. No dedup, no business logic, no joins with reference tables. Retention is usually 7 to 30 days — long enough to debug, short enough that you're not paying Snowflake to store the same JSON twice. After that, data rolls off or moves to cold S3 / GCS. In medallion vocabulary, staging is bronze.

ODS — Operational Data Store

The Operational Data Store holds cleaned, normalized, current-state data that looks like a snapshot of the production OLTP database. Third normal form, foreign keys resolved, lookups joined, duplicates collapsed, business validation applied. The write pattern is overlay — you replace current state on each load, no history.

ODS is most useful when you integrate two or more source systems into one canonical view. A customer record from Salesforce, a billing record from Stripe, and an event log from the product DB all reconcile in ODS into a single customer row. Downstream consumers write SELECT * FROM ods.customers and trust the value matches what production thinks right now.

In Inmon's original architecture ODS is a distinct layer. In Kimball and modern lakehouse stacks it often disappears — the team writes directly from staging into the historical fact table, and "current state" becomes a WHERE valid_to IS NULL filter on the SCD2. The interview-safe answer: ODS exists when you have integration work; skip it when you do not.

DDS — Data Domain Store

DDS — Data Domain Store, sometimes called the historical core — is where the warehouse remembers everything. The shape is SCD Type 2 across the board: every change in a source row becomes a new record with valid_from and valid_to timestamps, and nothing is physically deleted. Some teams build this as literal Data Vault 2.0 (Hub / Link / Satellite); others use 3NF plus temporal columns.

The reason this layer is non-negotiable in regulated industries: auditors ask "what did this customer's address look like on March 14th two years ago" and you need an answer that does not require a backup restore. DDS is the source of truth for history. Marts, reports, and ML training tables derive their history claims from here.

The trade-off is volume. A high-churn dimension like "user preferences" can produce 50-100 rows per user per year in SCD2. Storage is cheap on Snowflake and BigQuery, but every "current state" query needs that WHERE valid_to IS NULL predicate, and forgetting it silently breaks dashboards.

Layer Write pattern History Typical retention
Staging Append-only Raw payload 7-30 days
ODS Overlay (current state) None Live snapshot
DDS SCD2 / Data Vault Full Forever
Data mart Rebuild or merge Aggregated 2-7 years
Semantic Definitions only N/A Versioned in git
Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Data mart

The data mart is denormalized for one specific use case. Marketing wants attribution by channel — that mart aggregates clicks, sessions, conversions, and CAC into a star schema. Finance wants revenue by product line — that's a different mart, different grain, different owner.

The shape is almost always star or snowflake schema: one fact table at a meaningful grain (one row per order, per user-day, per impression) surrounded by dimensions. Pre-aggregated rollups live here when latency matters — a fact_revenue_daily grouped by date, product, and country saves the BI tool from doing the heavy GROUP BY on every dashboard refresh.

The mart is what dashboards read from. That makes it the gold layer in medallion vocabulary, and the layer where business teams have the strongest opinions about column names, because they see those names in Looker tomorrow morning.

-- A small fact_orders mart fed from DDS
CREATE TABLE marts.fact_orders AS
SELECT
    o.order_id,
    o.user_id,
    o.order_ts::DATE         AS order_date,
    o.amount_usd,
    u.country_code,
    u.acquisition_channel,
    p.product_category
FROM dds.orders_h o
JOIN dds.users_h u
    ON o.user_id = u.user_id
   AND o.order_ts BETWEEN u.valid_from AND COALESCE(u.valid_to, '9999-12-31')
JOIN dds.products_h p
    ON o.product_id = p.product_id
   AND o.order_ts BETWEEN p.valid_from AND COALESCE(p.valid_to, '9999-12-31')
WHERE o.order_status = 'completed';

Semantic layer

The semantic layer is a business-readable description of metrics and dimensions that sits above the marts. Not a storage layer — a definition layer. A revenue metric is declared once with its formula, allowed dimensions, access controls, and joins, and every downstream tool (Looker, Tableau, the LLM answering Slack questions) consumes the same definition.

Tools that ship this in 2026: Looker LookML, dbt Semantic Layer, Cube.dev, AtScale, plus embedded layers inside Snowflake Cortex and Databricks Genie. The promise is that "monthly active users" means the same thing on the finance dashboard and the growth bot. One team that defines revenue two different ways destroys the whole point.

A semantic layer becomes mandatory at three or more analytics teams consuming the same marts. Below that, a documented dbt project and a shared metrics doc usually does the job. Above that, every metric without a definition becomes a Slack thread about why two dashboards disagree.

Mapping to the medallion model

Traditional DWH Medallion
Staging Bronze
ODS Silver (current-state view)
DDS Silver (historical core)
Data mart Gold
Semantic layer Semantic (above gold)

The medallion model is the same architecture with fewer names. The collapse works because lakehouse vendors target greenfield teams that do not need to explain ODS to a compliance officer. In a regulated stack — banking, healthcare, insurance — the explicit five-layer split survives, because regulators ask questions that "silver" cannot answer on its own.

Sanity check: if your interviewer uses medallion vocabulary, mirror it. If they use Inmon vocabulary, mirror that. Translating between the two on the whiteboard is fine; insisting one is correct and the other is wrong is the fastest way to lose the room.

Common pitfalls

The first pitfall is collapsing staging and ODS because "it's just two append-only tables, why duplicate". Staging exists to be forensic; ODS exists to be canonical. The moment you start cleaning data in staging, you lose the ability to replay raw payloads, and the first audit question that requires it costs you a week of backup restores.

The second is skipping DDS and going straight from ODS to marts. This works until a stakeholder asks "what was this segment six months ago" and the answer is "we overwrote that data on every load". If history matters to anyone — analytics, finance, ML training — DDS is non-negotiable, and the right time to build it is before you need it, not after.

The third is treating the semantic layer as a tool decision rather than an organizational one. Buying Cube.dev or turning on dbt Semantic Layer does not give you a semantic layer; defining metrics with discipline does. Teams that adopt the tool without naming a metric owner end up with three definitions of "active user" and a dashboard war by month two.

The fourth is building data marts directly off staging because the team is small and DDS feels like overkill. The short-term gain is real; the long-term cost is that every mart contains its own opinion about deduplication, joins, and history. Two marts disagree, a dashboard breaks, a sprint goes to reconciling logic that should have lived in one DDS table. Rule of thumb: as soon as you have two marts that need the same join, build DDS.

The fifth is over-engineering for a 10-person company. A team of fifteen on Postgres and a Looker dashboard does not need ODS, DDS, marts, and a semantic layer — they need one dbt project with models/staging and models/marts and a clear definition of revenue. If you reflexively recommend the full Inmon stack for a Series A team, you've signaled you cannot calibrate to context.

If you want to drill data engineering interview questions like this every day, NAILDD ships 500+ DE and SQL problems built around exactly this layered-DWH framing.

FAQ

Can I skip ODS in a modern lakehouse?

Often yes. With a single source system and no integration work, ODS does not add value — staging dedup plus a "current state" view on DDS covers the same need. The decision flips the moment you have two sources describing the same business entity. Then ODS becomes where you reconcile keys, resolve conflicts, and produce one canonical row. Lakehouse vendors sometimes hide ODS inside a "silver" table; the work still happens, the name just changed.

Is staging always raw JSON, or can I cast to columns?

Light type casting in staging is fine and usually expected. The line you do not cross is business logic — no dedup, no joins with lookup tables, no derived columns, no "fix this nullable field by defaulting to zero". The reason is forensic: if a downstream metric breaks, you want staging to reflect exactly what the source sent, not what your team thought the source should have sent. If casting int to bigint is the only transformation, that is still raw enough.

What's the difference between DDS and Data Vault 2.0?

DDS is the architectural role — the historical normalized core of the warehouse. Data Vault 2.0 is one specific implementation of that role, using Hub, Link, and Satellite tables with hash keys and explicit load timestamps. You can build DDS without using Data Vault — most teams use 3NF tables with SCD2 columns and call it a day. Data Vault adds scaffolding that pays off at petabyte scale and in highly regulated industries; below that scale, plain SCD2 is usually enough.

How do marts get refreshed — full rebuild or incremental?

Both patterns are common. Full rebuild is simpler and bulletproof: drop the mart, rebuild from DDS, swap. It works up to 100M-500M rows depending on the engine. Incremental merge scales further but introduces state — watermarks, late-arriving facts, and tests that the incremental output matches the full rebuild. dbt's incremental models give you most of this out of the box, which is why dbt won this category.

Do I need a semantic layer if I already have dbt?

Below three analytics teams, usually no — a curated dbt project with a metrics doc and PR review on metric changes covers most needs. Above three teams, almost certainly yes. The pain point is not "we don't know SQL" — it is "the finance team's revenue and the growth team's revenue disagree by 2% and no one can explain why". A semantic layer makes that disagreement impossible by construction.

How does the semantic layer interact with LLM-powered analytics in 2026?

Tightly. LLM agents need a typed contract to query against — letting one write raw SQL against marts produces hallucinated joins and made-up column names. A semantic layer gives the agent a constrained vocabulary of metrics, dimensions, and pre-defined joins, and the warehouse executes deterministic SQL underneath. Without a semantic layer, LLM analytics is a demo; with one, it is a product.