How to calculate Data Quality Score in SQL
Contents:
Why Data Quality Score matters
It is Wednesday morning, the dashboard is green, and your CEO finds last week's revenue is off by twelve percent. The culprit is not the BI tool — it is a silent schema drift in an upstream Stripe webhook that flipped amount_total from cents to whole dollars for forty-eight hours. Every analytics team at Stripe, Snowflake, Databricks, Notion, and DoorDash eventually learns this the hard way and reaches for the same answer: a Data Quality Score that turns trust into a number you can plot, alert on, and explain in a standup.
A Data Quality Score is a single composite number between zero and one hundred that summarizes table health across multiple dimensions. The Wang and Strong framework from MIT defines six: completeness, uniqueness, validity, consistency, accuracy, and timeliness. This post walks the SQL end to end — per-dimension queries you can run on Postgres or Snowflake, the composite formula that rolls them up, the daily snapshot pattern that catches regressions, the pitfalls that bite first-time authors, and the optimizations that keep the queries cheap on a billion-row events table.
The SQL formula
A Data Quality Score for a table is a weighted average of per-dimension scores, where each dimension score is the share of rows that pass a rule. Completeness counts non-null rows, uniqueness counts distinct keys, validity counts rows that satisfy a constraint, consistency counts rows that agree across two systems, accuracy counts rows that match a trusted reference, and timeliness counts rows that land within an SLA window. Every dimension reduces to the same shape — COUNT(*) FILTER (WHERE rule) divided by COUNT(*) — which makes the score easy to compose and to materialize.
The trap most analysts hit is treating the score as a single average across all columns. The fix is to anchor every dimension to a specific column or relationship, then weight the dimensions by business impact. A null user_id matters more than a null middle name; a duplicate order_id matters more than a duplicate user-agent string.
Completeness
Completeness is the share of rows where a column is not null. It is the cheapest dimension to compute and the one that catches the largest class of incidents — broken ETL jobs, dropped Kafka partitions, and schema drift. The query is a single aggregation per column.
SELECT
'user_email' AS column_name,
COUNT(*) AS total_rows,
COUNT(*) FILTER (WHERE user_email IS NOT NULL) AS non_null,
COUNT(*) FILTER (WHERE user_email IS NOT NULL)::NUMERIC * 100 / COUNT(*) AS completeness_pct
FROM users
UNION ALL
SELECT
'created_at',
COUNT(*),
COUNT(*) FILTER (WHERE created_at IS NOT NULL),
COUNT(*) FILTER (WHERE created_at IS NOT NULL)::NUMERIC * 100 / COUNT(*)
FROM users
UNION ALL
SELECT
'country',
COUNT(*),
COUNT(*) FILTER (WHERE country IS NOT NULL),
COUNT(*) FILTER (WHERE country IS NOT NULL)::NUMERIC * 100 / COUNT(*)
FROM users;The pattern scales by union — one block per column. In Snowflake or BigQuery the FILTER clause becomes COUNT_IF(condition) or SUM(CASE WHEN condition THEN 1 ELSE 0 END); the math is identical. Keep the output as a long table keyed on (table_name, column_name, snapshot_date), not a wide pivot, so the downstream snapshot table avoids a schema migration every time you add a column.
Uniqueness
Uniqueness is the share of key values that appear exactly once. It is the dimension that catches dedupe bugs, upsert collisions, and the classic "the join multiplied my row count by three" mistake. The query first groups by the key to count occurrences, then aggregates those counts.
WITH dup_stats AS (
SELECT
user_id,
COUNT(*) AS occurrences
FROM users
GROUP BY user_id
)
SELECT
COUNT(*) AS total_unique_keys,
COUNT(*) FILTER (WHERE occurrences > 1) AS duplicate_keys,
COUNT(*) FILTER (WHERE occurrences = 1)::NUMERIC * 100 / COUNT(*) AS uniqueness_pct
FROM dup_stats;For composite keys — order_id plus line_item_id — group by both columns and the same shape applies. Two subtleties matter. First, null keys are excluded by GROUP BY, so add a separate completeness check on the key. Second, the score above is the share of distinct keys that are unique; if you instead want the share of rows that are not part of any duplicate, divide COUNT(*) FILTER (WHERE occurrences = 1) by SUM(occurrences). Pick one definition and stick with it — the two numbers can differ by ten points on a duplicate-heavy table.
Validity
Validity is the share of rows whose values satisfy a rule. The rules are domain-specific: an email matches a regex, an age sits between zero and a hundred and twenty, a country code belongs to an allowlist, a status enum is one of a known set. One query can cover several rules at once.
SELECT
COUNT(*) FILTER (WHERE email ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$')::NUMERIC
* 100 / COUNT(*) AS email_valid_pct,
COUNT(*) FILTER (WHERE age BETWEEN 0 AND 120)::NUMERIC
* 100 / COUNT(*) AS age_valid_pct,
COUNT(*) FILTER (WHERE country IN ('US', 'CA', 'GB', 'DE', 'FR', 'JP', 'AU'))::NUMERIC
* 100 / COUNT(*) AS country_valid_pct,
COUNT(*) FILTER (WHERE status IN ('active', 'paused', 'churned'))::NUMERIC
* 100 / COUNT(*) AS status_valid_pct
FROM users
WHERE created_at >= CURRENT_DATE - INTERVAL '7 days';The seven-day filter is intentional. Validity rules drift — country allowlists grow, status enums change — so the score should reflect the recent state, not history. For columns that should never change shape (a phone format, a postcode pattern), drop the filter and score the full table. Document the rule in a dq_rules reference table so a junior engineer six months later does not have to guess what "valid" means.
Composite score
The composite score collapses per-dimension numbers into one. The naive version is a flat arithmetic mean; the honest version is a weighted mean, where weights reflect business impact. A weight of zero point four on completeness and zero point one on timeliness is a reasonable start for a transactional table; for an analytics events table the weights might be reversed.
WITH metrics AS (
SELECT
'users' AS table_name,
(SELECT COUNT(*) FILTER (WHERE user_id IS NOT NULL)::NUMERIC * 100 / COUNT(*) FROM users) AS completeness,
(SELECT COUNT(DISTINCT user_id)::NUMERIC * 100 / COUNT(*) FROM users) AS uniqueness,
(SELECT COUNT(*) FILTER (WHERE email ~ '^.+@.+\..+$')::NUMERIC * 100 / COUNT(*) FROM users) AS validity,
(SELECT COUNT(*) FILTER (WHERE updated_at >= CURRENT_DATE - INTERVAL '1 day')::NUMERIC * 100 / COUNT(*) FROM users) AS timeliness
)
SELECT
table_name,
completeness,
uniqueness,
validity,
timeliness,
0.4 * completeness + 0.3 * uniqueness + 0.2 * validity + 0.1 * timeliness AS overall_score
FROM metrics;The slide-ready form: Score = 0.4 * completeness + 0.3 * uniqueness + 0.2 * validity + 0.1 * timeliness. Tune the coefficients to the table, not the company. A finance reporting mart and a clickstream warehouse have different cost functions for the same dimension; the weights encode that difference.
Daily monitoring
A point-in-time score is a snapshot; the value comes from trend. Materialize per-dimension scores into a dq_snapshots table once a day and plot the last thirty days. A sudden drop is the signal — absolute level matters less than the change.
SELECT
snapshot_date,
table_name,
completeness_pct,
uniqueness_pct,
validity_pct,
timeliness_pct,
0.4 * completeness_pct
+ 0.3 * uniqueness_pct
+ 0.2 * validity_pct
+ 0.1 * timeliness_pct AS overall_dq
FROM dq_snapshots
WHERE snapshot_date >= CURRENT_DATE - INTERVAL '30 days'
ORDER BY snapshot_date DESC, table_name;Wire the snapshot job to your orchestrator (Airflow, Dagster, Prefect) and set the alert threshold to a relative drop — a five-point decrease versus the seven-day rolling average — rather than an absolute floor. Absolute thresholds catch nothing on a healthy table and fire constantly on a noisy one; relative thresholds catch regressions that matter.
Common pitfalls
The first pitfall is treating a single metric as enough. A table with ninety-nine percent completeness across all columns can still be unusable if the missing one percent is the primary key. The fix is to score each column separately and weight critical columns above optional ones, so a null in user_id moves the score in a way a null in referral_source cannot.
The second pitfall is letting validity rules drift without versioning. The country allowlist grows when sales opens a new region, the status enum changes when product ships a lifecycle state, and the score drops overnight for a reason unrelated to data quality. The fix is to store rule definitions in a dq_rules table with an effective_from column and score against the rules in force at snapshot time.
The third pitfall is comparing to a stale baseline. A score that sat at ninety-two for two years and dipped to eighty-nine yesterday is not a five-point regression against the long-term mean; it is a three-point dip against last week, and that is what matters. The fix is to compare against a rolling seven- or fourteen-day baseline, because drift is normal and only sudden change is news.
The fourth pitfall is applying the same weights to every table. The users table has high stakes on uniqueness because a duplicate row breaks every join downstream; an events table tolerates uniqueness noise but cannot tolerate timeliness slips. Keep a per-table weights config in version control and review it quarterly, the way you review SLOs.
The fifth pitfall is celebrating a perfect score. A Data Quality Score that sits at one hundred percent week after week almost always means the checks are weak — real data has refunds, late arrivals, and broken jobs. Treat a sustained one hundred as a code smell and audit the rules themselves.
Optimization tips
The per-dimension queries are linear scans — fine on a million rows, expensive on a billion. The first lever is to push the aggregation into a daily materialized view or an incremental dbt model, so the snapshot job reads pre-aggregated counts. Snowflake gets this with STREAMS and TASKS; Databricks with Delta Live Tables; Postgres with a scheduled REFRESH MATERIALIZED VIEW.
The second lever is partitioning. If the underlying table is partitioned by created_at or event_date, scope every dimension query to the latest partition and union with yesterday. This turns a billion-row scan into a ten-million-row scan and keeps the nightly job under a minute on commodity warehouse compute. The third lever is sampling for expensive dimensions — accuracy checks that join to a reference table can run on a one-percent stratified sample — though completeness and uniqueness should always run on the full table because the rare null is the whole point.
The fourth lever is to keep the snapshot table narrow. Store one row per (table_name, column_name, dimension, snapshot_date) and compute the composite at read time in the BI tool, rather than freezing a weighting scheme into the warehouse. The weights will change; the per-dimension scores will not.
Related reading
- SQL window functions interview questions
- How to calculate cohort revenue in SQL
- A/B testing peeking mistake
If you want to drill SQL questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.
FAQ
Six dimensions or four — which framework should I pick?
The original Wang and Strong paper from MIT lists six: completeness, uniqueness, validity, consistency, accuracy, and timeliness. Many teams consolidate to four or five by merging accuracy into validity or dropping consistency when there is only one source of truth. Either is defensible. Pick the framework your data leadership already trusts, document it in a one-page spec, and move on. The score is only useful if everyone interprets it the same way, which requires a shared definition more than a perfect taxonomy.
How do I set the dimension weights?
Anchor weights to business impact, not gut feel. Ask: if this column drops ten points on this dimension, what breaks downstream? A ten-point drop in completeness on order_id breaks every revenue report, so it deserves a heavy weight. A ten-point drop in completeness on marketing_source makes one dashboard noisier and deserves a light weight. Start with a default split — forty percent completeness, thirty percent uniqueness, twenty percent validity, ten percent timeliness — and tune per table.
Should I use a Data Quality tool or pure SQL?
Tools like Great Expectations, dbt tests, and Soda give you a fluent test syntax, a results database, and a UI for free. Pure SQL is more flexible but you pay for the maintenance. The rule of thumb: use dbt tests for per-pipeline checks that live with the model, and a dedicated tool for the cross-pipeline score that lives on the platform. Pure SQL is fine for the first dozen checks; once you cross fifty, the tooling earns its keep.
Should the snapshot run daily or in real time?
Daily snapshots are cheap and catch ninety-five percent of incidents. Real-time scoring is necessary where a bad batch has direct financial consequences within hours — fraud detection, billing, ads attribution. The trade-off is cost: real time requires streaming aggregations on Kafka or Pulsar and an OLAP store, which is a six-figure infrastructure decision. Most teams start daily, add hourly for one or two critical tables, and never need real time.
What is an acceptable Data Quality Score?
It depends on the table. For critical financial or PII tables, anything below ninety-nine is an incident. For operational tables that feed dashboards and product features, ninety-five is the working threshold. For analytics events where you expect a long tail of malformed payloads, ninety is normal and only a sudden drop matters. Anything below eighty on a production-grade table is a pause-and-investigate signal, not a side ticket.