Data quality dimensions for DE interviews
Contents:
Why interviewers probe DQ dimensions
If you have ever been on a 3 AM page where the revenue dashboard at Stripe shows a 40% drop and the answer is "the upstream CDC feed dropped a partition", you already know why data quality dimensions are the single most asked open-ended question in a data engineering loop. Interviewers at companies like Snowflake, Databricks, Airbnb, and DoorDash want to hear that you have a vocabulary — not a vague "we have tests" answer. The vocabulary is the DAMA-DMBOK framework: six dimensions that turn fuzzy complaints into measurable SLAs.
The trap is treating these like a checklist. Senior candidates use the dimensions as a diagnostic lens during a debugging story — "the issue was timeliness, not accuracy, because the values were correct but stale by 14 hours" — and that framing alone often clears the bar. Memorize the names; understand the metrics; rehearse one war story per dimension. That is the whole interview prep, distilled.
Load-bearing trick: when the interviewer asks "how do you ensure data quality?", do not list tools. List dimensions, then map tools to dimensions. Great Expectations covers validity. dbt source freshness covers timeliness. Monte Carlo or anomaly tests cover accuracy.
The six dimensions at a glance
Here is the canonical cheat sheet. Print it, tape it to your monitor for interview week.
| Dimension | Question it answers | Typical SLA | SQL signal |
|---|---|---|---|
| Completeness | Are required fields populated? | email_completeness >= 99% |
COUNT(col) / COUNT(*) |
| Accuracy | Do values match reality? | Variance vs source-of-truth < 0.5% | Reconciliation diff |
| Consistency | Do related tables agree? | Row count delta = 0 | Cross-table joins |
| Timeliness | Is the data fresh enough? | lag <= 60 min |
MAX(updated_at) vs now |
| Uniqueness | Any duplicates on the key? | duplicate_rate = 0% |
COUNT(*) - COUNT(DISTINCT pk) |
| Validity | Do values fit the schema/domain? | invalid_rate < 0.1% |
Regex / range / enum checks |
Each one gets its own section below with a runnable SQL probe and the SLA language that interviewers expect to hear. The order matters: completeness and uniqueness are table-level, accuracy and consistency are cross-system, timeliness is operational, and validity is schema-level. Mixing them up is the fastest way to lose a senior loop.
Completeness
Completeness is the percentage of non-null values in fields the business cares about. Not every NULL is a problem — middle_name being empty is fine, but customer_id being empty in an orders table is a data corruption event. The interviewer wants to hear you separate required from optional before you write any SQL.
SELECT
100.0 * COUNT(email) / COUNT(*) AS email_completeness,
100.0 * COUNT(phone) / COUNT(*) AS phone_completeness,
100.0 * COUNT(country_id) / COUNT(*) AS country_completeness
FROM customers;A good SLA reads like "email completeness must be at least 99% on customers created in the last 7 days". The window matters — historical backfills are usually messier and you do not want a 4-year-old data quality issue paging you tonight. This is also why most modern checks run on a rolling partition, not the full table.
Accuracy
Accuracy is the dimension that humbles juniors. It asks: do the values reflect the real world? You cannot answer this with SQL alone — you need a source of truth to reconcile against. At a fintech, that is the accounting ledger or the payment processor's settlement file. At Uber, it is the driver-app event stream. At Netflix, it is the playback heartbeat.
-- reconcile our orders table vs the processor settlement file
WITH ours AS (
SELECT DATE(created_at) AS day, SUM(amount_usd) AS our_total
FROM orders WHERE status = 'captured' GROUP BY 1
),
theirs AS (
SELECT day, total_captured AS their_total FROM processor_settlement
)
SELECT
o.day,
o.our_total,
t.their_total,
o.our_total - t.their_total AS variance,
100.0 * (o.our_total - t.their_total) / NULLIF(t.their_total, 0) AS variance_pct
FROM ours o JOIN theirs t USING (day)
WHERE ABS(o.our_total - t.their_total) > 0.01
ORDER BY day DESC;The senior move is to say "I do not test accuracy with assertions — I test it with reconciliation". Assertions catch obvious bugs; reconciliation catches the silent drift that costs millions before anyone notices. Tolerances matter too: a 0.5% variance on payments is a sev-2 ticket, but on impressions it might be within sampling noise.
Consistency
Consistency is internal accuracy — different tables in your warehouse telling the same story. If fact_orders says 12,481 orders happened yesterday and daily_summary_orders says 12,402, one of them is wrong and probably both downstream dashboards are too.
-- two views of the same truth — must match
SELECT 'fact_orders' AS src, COUNT(*) AS n
FROM fact_orders WHERE DATE(created_at) = '2026-05-01'
UNION ALL
SELECT 'daily_summary', SUM(orders_count)
FROM daily_summary_orders WHERE day = '2026-05-01';Gotcha: consistency failures usually come from late-arriving events or timezone drift, not from broken pipelines. Always check the watermark and the timezone before assuming corruption.
Timeliness
Timeliness is the gap between when an event happened in the real world and when it is queryable in the warehouse. Two metrics matter and interviewers love to see both:
- Lag —
now() - max(event_time)— how stale is the freshest row? - Freshness —
now() - max(loaded_at)— when did we last update the table at all?
An empty table that was last loaded an hour ago has good freshness but terrible lag. A frequently-updated table that only inserts events from last week has good freshness but a 7-day lag. Knowing the difference is the senior signal.
SELECT
MAX(event_time) AS latest_event,
MAX(loaded_at) AS latest_load,
EXTRACT(EPOCH FROM (NOW() - MAX(event_time))) / 60 AS lag_minutes,
EXTRACT(EPOCH FROM (NOW() - MAX(loaded_at))) / 60 AS staleness_minutes
FROM fact_events;A good SLA reads "orders are queryable within 15 minutes of capture, 99.5% of days". Notice the 99.5% — never promise 100%, because backfills and re-runs will burn you. The dbt source freshness block is the cheapest way to wire this into a daily CI run.
Uniqueness
Uniqueness asks whether your primary key is actually unique. Sounds trivial; is not. The duplicate creeps in via late-arriving CDC events, merge upserts that race, or a backfill that re-ingests a partition without truncating it first.
SELECT
COUNT(*) AS total_rows,
COUNT(DISTINCT order_id) AS unique_keys,
COUNT(*) - COUNT(DISTINCT order_id) AS duplicate_rows,
100.0 * (COUNT(*) - COUNT(DISTINCT order_id)) / COUNT(*) AS duplicate_rate_pct
FROM fact_orders
WHERE created_at >= NOW() - INTERVAL '7 days';The SLA you want to defend is "duplicate_rate = 0% on the primary key, measured on the last 24 hours". Anything else is a bug. If the interviewer pushes back with "what about soft-duplicates" — different order_id but same customer, same amount, same minute — that is a deduplication question, not a uniqueness one. Different dimension entirely.
Validity
Validity is the schema-and-domain dimension: does the value fit the declared format and the business rules? amount_usd = -42.50 might be a valid float, but it is invalid for an order capture. email = 'foo@' parses but will not deliver. country_code = 'ZZ' is two characters but is not in your enum.
-- email regex check
SELECT COUNT(*) AS invalid_emails
FROM users
WHERE email !~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$';
-- range check on amounts
SELECT COUNT(*) AS out_of_range_amounts
FROM orders
WHERE amount_usd NOT BETWEEN 0 AND 50000;
-- enum check on status
SELECT status, COUNT(*) AS n
FROM orders
WHERE status NOT IN ('pending','captured','refunded','disputed')
GROUP BY status;This is where Great Expectations, dbt tests, and Soda Core shine — they make these checks declarative. The interview signal is naming the tool and its dimension: "I use expect_column_values_to_match_regex for validity, expect_column_values_to_be_unique for uniqueness, and source freshness blocks for timeliness."
Common pitfalls
The most common candidate failure is conflating accuracy with validity. Validity asks "does the value fit the schema?" — accuracy asks "is the value true?" An email that passes regex but belongs to a different person is valid but inaccurate. The fix in interviews is to verbally separate them: "regex catches validity; reconciliation against the CRM catches accuracy." Once you say this out loud, the interviewer relaxes.
A second trap is promising 100% on any SLA. The senior framing is always a percentage with a window — "99.5% of partitions land within their freshness window over a rolling 30 days". Anyone who promises 100% has either never been on-call or is about to be page-bombed. Tie the SLA to the business cost of failure: a finance dashboard tolerates less drift than a marketing attribution model.
The third pitfall is measuring DQ on the whole table forever. Tables grow; historical partitions ossify; today's bugs hide in noise from 2019. Always scope your check to a rolling window — last 24 hours, last 7 days, current partition. This both speeds up the query and makes the metric actionable: a regression today should not be diluted by three years of clean history.
The fourth trap is alerting on every dimension at the same severity. A 0.2% drop in completeness on phone is a Monday-morning ticket; a 5% variance on amount_usd against the processor is a sev-1 page. Build your alerting matrix with per-dimension, per-table tolerances and route them differently. Otherwise the team will silence the channel within a week.
The fifth pitfall is forgetting consistency between the warehouse and the lake. If you have a Snowflake warehouse and an Iceberg lake on S3, those two stores will drift — different load schedules, different retries, different idempotency models. Add a daily reconciliation job that diffs row counts and key checksums between them, or you will discover the gap during an exec readout.
Related reading
- Data contracts in DE interviews
- Data lineage in DE interviews
- dbt incremental models in DE interviews
- How to calculate a data quality score in SQL
- SQL for data engineer interview
If you want to drill DQ scenarios like this end-to-end, NAILDD ships 500+ data engineering interview problems built around exactly these six dimensions.
FAQ
Which DQ dimension matters most?
It depends on the business surface. Finance and payments lean hardest on accuracy and consistency, because dollar variance is auditable and customer-visible. Real-time products like ride-hailing or trading lean on timeliness — a 30-second-stale price is worse than a slightly wrong one. CRM and growth lean on completeness and validity, because broken contact fields kill outreach. In the interview, pick the dimension that maps to the company you are interviewing at and use that as your opening example.
How do I turn these dimensions into an SLA document?
Write one line per dimension per critical table. Each line names the metric, the threshold, and the window: "fact_orders.completeness(customer_id) >= 99.99% over rolling 24h". Then tie each line to an alert severity and a runbook link. Publish that doc in the team wiki. When a stakeholder asks "do we measure data quality?", you point at the doc. When something breaks, the runbook tells the on-call where to start.
Is this an official framework?
The six dimensions trace back to the DAMA-DMBOK body of knowledge and ISO 8000. Different vendors add their own — Monte Carlo talks about volume and schema changes as separate dimensions, Soda includes distribution drift. The six listed here are the interview-defensible canonical set. If an interviewer pushes back on a seventh, name DAMA, agree that volume and schema can be subdimensions, and move on.
How do I implement these checks at scale?
Pick one tool per layer. For declarative schema and validity tests, use dbt tests or Great Expectations inside your transformation layer. For freshness and lag, use dbt source freshness or the warehouse's built-in metadata. For accuracy and reconciliation, write custom SQL jobs scheduled in Airflow or Dagster, because they are domain-specific. For anomaly detection on top of all of the above, a tool like Monte Carlo or Bigeye runs unsupervised drift detection — useful but optional for a small team.
What if the interviewer asks about ML model data quality specifically?
ML adds feature drift and label leakage on top of the six dimensions. Feature drift is a special case of accuracy over time — the feature distribution in production diverges from training. Label leakage is a consistency issue — your training set accidentally contains future information. Mention both, and connect them back to the six core dimensions instead of treating them as a separate framework.
How do I prioritize DQ work when everything is on fire?
Rank tables by blast radius. A fact table feeding the company KPI dashboard gets full coverage on all six dimensions. A staging table nobody queries gets uniqueness on the key and timeliness on load. Build a tiered SLO matrix: tier-1 tables get all six checks with paging alerts, tier-2 gets four with Slack alerts, tier-3 gets two with weekly digests. Anything else and you will burn a quarter writing tests nobody reads.