DWH modeling for the Data Engineer interview
Contents:
Why DWH modeling shows up in every DE loop
Walk into any Data Engineer loop at Snowflake, Stripe, or Airbnb and the second-round whiteboard goes the same way: a vague business prompt, a Sharpie, and forty minutes to defend your grain choices. The interviewer isn't checking whether you memorized Kimball's Toolkit. They're checking whether you can pick the right fact-table grain, draw conformed dimensions without sweating, and know when a junk dim beats ten boolean flags glued onto a fact row.
The answers are deceptively simple in isolation — transaction vs snapshot vs accumulating — but the trap is the second-order question. The follow-up is always: "Now what if we add returns?" or "What if a customer can have multiple billing addresses?" That's where modeling pretenders get filtered out from people who've actually shipped a star schema to production.
Fact table types
Four flavors. Knowing which one fits a prompt is 80% of the interview signal.
| Type | Grain | When to use | Classic example |
|---|---|---|---|
| Transaction fact | One row per event | Default. Atomic, append-only. | Orders, page views, payments |
| Periodic snapshot | One row per entity per period | Stable metric you want to compare over time | Daily inventory, monthly account balance |
| Accumulating snapshot | One row per process, updated as stages complete | Funnel / pipeline analysis with clear milestones | Order lifecycle, mortgage application |
| Factless fact | No measure, just dim FKs | Modeling relationships or coverage | Student-course enrollment, promotion eligibility |
A transaction fact is what you reach for unless something forces your hand. It's immutable, plays well with column stores, and stays small per row.
A periodic snapshot is the answer when the prompt mentions "balance," "headcount," or "open positions on day X." You can't reconstruct a snapshot reliably from transactions because of late-arriving data and corrections, so you take a picture daily and keep it.
An accumulating snapshot is the right answer when stages of a single process are the unit of analysis. The row gets updated — yes, an update in a warehouse, on purpose — as the order moves through placed → paid → shipped → delivered:
-- accumulating snapshot example
CREATE TABLE fact_order_pipeline (
order_id BIGINT PRIMARY KEY,
customer_sk BIGINT,
placed_at TIMESTAMP,
paid_at TIMESTAMP,
shipped_at TIMESTAMP,
delivered_at TIMESTAMP,
cancelled_at TIMESTAMP,
hours_to_pay NUMERIC,
hours_to_ship NUMERIC,
hours_total NUMERIC
);The columns hours_to_pay and friends are pre-computed lags between milestones — the load-bearing trick that makes funnel queries cheap. This is also why analysts love accumulating snapshots even when they pretend to be relational purists.
A factless fact has no numeric measure. It exists to record that "customer X was eligible for promotion Y on date Z" or "student S was registered for course C." You count rows to get the metric.
Gotcha: if the interviewer says "model registrations to a course," the right answer is factless fact, not a fact with is_registered = 1. The boolean is a smell.
Dimension flavors
Dimensions are the descriptors — the who, what, where, when — that make facts queryable.
Standard dimensions are the obvious ones: dim_customer, dim_product, dim_date, dim_store. Each row is one entity with a surrogate key (customer_sk) plus the natural business key (customer_id) and a pile of attributes.
Date dimension deserves its own line. You pre-populate it once for ~50 years, add columns like day_of_week, is_weekend, is_us_holiday, fiscal_quarter, iso_week, and never recompute date logic in ad-hoc queries again. Worth its weight in gold during quarterly close.
Time dimension is separate from date when minute-or-hour grain matters (think trading, ride-hailing surge windows). Most warehouses skip it and use a raw TIMESTAMP column on the fact — acceptable, but you lose pre-baked attributes like is_business_hour.
Role-playing dimensions are the same dim referenced multiple times under different aliases. An order has order_date, ship_date, and delivery_date — all three are foreign keys into dim_date, exposed as views:
CREATE VIEW dim_order_date AS SELECT * FROM dim_date;
CREATE VIEW dim_ship_date AS SELECT * FROM dim_date;
CREATE VIEW dim_delivery_date AS SELECT * FROM dim_date;Degenerate dimensions are dimensions without a separate table — usually a high-cardinality identifier like transaction_id or invoice_number that you carry on the fact directly. Creating dim_transaction with one row per transaction is pointless and an interviewer will call it out.
Conformed dimensions
A conformed dimension is a single dim referenced by multiple fact tables, with identical structure and meaning in each context. This is the difference between a real data warehouse and a pile of disconnected marts.
fact_orders ──▶ dim_customer ◀── fact_returns
fact_orders ──▶ dim_product ◀── fact_inventory
fact_payments ──▶ dim_customer ◀── fact_support_ticketsIf customer_sk = 4218 means the same human across fact_orders, fact_returns, and fact_support_tickets, you can drill across business processes in one query. If it means subtly different things — maybe one team includes guest checkouts, the other doesn't — your KPIs will disagree forever and the data team will spend three quarters in reconciliation hell.
Sanity check: if you can't write a single SELECT that joins two fact tables through a shared dim and produces a coherent number, your dimensions aren't conformed.
The implementation rule is single source of truth per entity. One dim_customer, hydrated by the customer master pipeline, referenced by every fact. Not three near-copies in three marts.
Junk dimension
You have ten low-cardinality boolean and enum columns on an order — is_premium, is_promo, is_gift, shipping_speed, payment_method, channel. Putting them all on the fact bloats every row and tanks compression. Putting each in its own dim table creates ten tiny dims and ten extra joins.
The junk dimension combines them into one wide-but-low-row-count table:
CREATE TABLE junk_dim_order (
junk_sk BIGINT PRIMARY KEY,
is_premium BOOLEAN,
is_promo BOOLEAN,
is_gift BOOLEAN,
shipping_speed VARCHAR(16), -- 'standard' | 'express' | 'next_day'
payment_method VARCHAR(16), -- 'card' | 'paypal' | 'apple_pay' | 'klarna'
channel VARCHAR(16) -- 'web' | 'ios' | 'android'
);Total rows: at most the Cartesian product of distinct values across columns, often a few thousand. The fact carries one junk_sk instead of six raw flags. The compression win on a billion-row fact is genuinely measurable — typically 30–50% smaller on Parquet/ORC.
Mini-dimension
When a dimension has both stable attributes (name, date of birth) and rapidly changing attributes (age band, income tier, churn risk score updated weekly), Type 2 SCD on the whole table explodes the row count. A 50M-customer dim with weekly demographic updates becomes 2.6B rows per year.
The fix: split into a stable dim_customer and a mini-dimension mini_dim_customer_demographics that holds only the volatile attributes. The fact joins both:
fact_orders.customer_sk → dim_customer (SCD Type 2 on slow attrs)
fact_orders.customer_demo_sk → mini_dim_customer_demographicsNow the demographics dim holds maybe a few thousand combinations of (age_band, income_band, churn_band), not 2.6B rows. Each fact row carries the demo snapshot at the time of the transaction. This pattern is also the trick most candidates miss when the interviewer asks "how would you scale this past 100M customers."
Bridge tables
Many-to-many between a fact and a dim, or between two dims. Classic prompt: a customer has multiple shipping addresses, and an order can ship to any of them. You can't put address_sk on the customer dim (one-to-many) or on the fact directly without losing the customer-address relationship.
dim_customer ──┐
├── bridge_customer_address ──── dim_address
└────────────────────────────────────────────The bridge holds (customer_sk, address_sk, weight, is_primary, valid_from, valid_to). The weight column is the allocation factor — if revenue should be split 50/50 across two addresses (rare but real for B2B), the bridge encodes that and queries multiply through.
The cost: queries through bridges are easy to double-count. If you join fact_orders → dim_customer → bridge → dim_address without aggregating on the customer side first, every order gets multiplied by the number of addresses. The fix is to either pre-aggregate or use a DISTINCT plus a weighted SUM with the allocation factor.
Common pitfalls
The most common modeling mistake on whiteboards is picking the wrong grain for a fact table. Candidates write "one row per order" when the prompt clearly says "we want to analyze line items by SKU." The grain should always be the most atomic event the business cares about — order line, not order header. You can always aggregate up; you cannot recover detail you didn't store. Interviewers test this with prompts like "what's the average discount per SKU across orders" — if your grain is the order header, the answer is "we can't tell" and you lose the round.
A second pitfall is denormalizing dimensions into the fact to "save a join." It feels clever until SCD Type 2 enters the picture and now you have customer address strings frozen into a billion fact rows. The right move is a surrogate key on the dim and a join. Modern column stores make these joins essentially free; the cost story from the 1990s no longer applies on Snowflake or BigQuery.
The third trap is conflating natural keys and surrogate keys. customer_id from the source system is the natural key — it can change, be reused after deletion, or differ across systems. The surrogate customer_sk is a warehouse-generated integer that uniquely identifies a version of the customer (critical for SCD Type 2). Mixing them up means historical reports drift silently as upstream IDs get reassigned. Surrogate keys are not a bureaucratic flourish — they're the only way to keep history honest.
A fourth pitfall is using Type 2 SCD on dims that should be mini-dims. A dim_customer with frequently changing income bands turns into a multi-billion-row table. If the interviewer mentions "weekly updates" or "demographics change often," that's your cue to propose a mini-dimension. Candidates who instinctively reach for Type 2 on everything signal they've never operated a warehouse at scale.
Finally, forgetting the unknown row in every dimension is a small but telling miss. Every dim should have a sentinel row with sk = -1, attributes 'Unknown', so that facts with late-arriving or missing dimension keys can still be loaded without nulls breaking joins. Production warehouses live or die by this discipline.
Related reading
- Data Vault 2 deep dive for Data Engineer interviews
- Data mart design for Data Engineer interviews
- Data lineage for Data Engineer interviews
- Data modeling approaches for Systems Analyst interviews
- Anti and semi joins for Data Engineer interviews
If you want to drill DE modeling prompts like these every day, NAILDD is launching with 500+ data engineering and SQL problems built around exactly this interview pattern.
FAQ
Can I skip the date dimension and just use raw timestamps?
Technically yes, and many teams do for the first six months of their warehouse life. The problem appears as soon as analysts start writing WHERE EXTRACT(quarter FROM order_ts) = ... and shipping subtly inconsistent fiscal-quarter definitions across dashboards. A pre-built dim_date with fiscal_quarter, is_us_holiday, is_weekend, and iso_week standardizes that logic in one place and makes ad-hoc queries faster because the warehouse can prune partitions through the dim join.
Is this Kimball-only, or does it apply to Data Vault and Inmon too?
This post is squarely the Kimball dimensional modeling worldview, which is what 80% of Data Engineer interviews actually test because it's what 80% of warehouses look like in practice. Data Vault uses hubs, links, and satellites instead of facts and dims — different vocabulary, different ETL pattern. Inmon advocates a normalized enterprise warehouse with dimensional marts on top. If the job posting mentions "Data Vault" explicitly, brush up on that separately; otherwise default to Kimball for the loop.
How do conformed dimensions interact with mesh architectures?
A data mesh pushes ownership of domain data to product teams, which can fracture conformed dimensions if you're not careful — two domains will independently define "customer" with different rules. The standard pattern is a thin shared dimension layer owned by a platform team, hydrated by domains via contracts, and consumed by every domain mart. Conformed dimensions are still required; the mesh just changes who owns the producer side.
When does a junk dimension stop making sense?
Once any single column on the junk dim exceeds a few hundred distinct values, the Cartesian product blows up and the junk dim loses its compression advantage. The threshold is rough but useful: stay under ~10,000 total combinations across the junk dim. Past that, promote the offending column to its own dimension or split the junk into two narrower ones.
What's the difference between a factless fact and a bridge table?
A factless fact records that an event or relationship happened — registration, eligibility, attendance — and you count rows to get the metric. A bridge table resolves a many-to-many between two existing dimensions, or between a fact and a dim, and carries optional weights for allocation. The shapes look similar (mostly FKs, no measures), but the intent differs: factless facts are events, bridges are structural plumbing.
How do I answer "how would you model this" without freezing?
Walk the interviewer through three steps out loud: identify the business process (orders, returns, sessions), pick the grain (one row per what?), then enumerate the dimensions that describe that grain (who, what, where, when, why, how). Once those are on the board, the choice of fact-table type and the need for conformed dims, junk dims, or bridges falls out naturally. Talking through this scaffold beats jumping straight to a star schema you half-remember.