How to calculate Data Completeness in SQL

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Why completeness matters

Your dashboard says signups are up fifteen percent, the marketing lead schedules a victory lap, and a week later someone notices the country field is forty percent empty. The signup count was real, the geo breakdown was a lie, and the whole reporting chain has to be re-explained. The data did not "go bad" — it was incomplete, silently, for a sprint.

Data completeness is the share of rows in a column that contain a usable value. It is the cheapest data-quality metric to compute and the one most often missing from production monitoring. Teams ship dbt models, BI dashboards, and ML features against tables where they have never measured the null rate of the inputs. A tracking deploy ships, a vendor changes a schema, an ETL truncates, and a column quietly stops being filled. Every downstream consumer keeps running, the numbers just drift.

Completeness is one of five canonical data-quality dimensions — completeness, uniqueness, freshness, validity, consistency — and it is the one interviewers at Stripe, Snowflake, Databricks, and Airbnb reach for first when they want to test whether a candidate has actually maintained a production pipeline. The SQL is short, but calibration, segmentation, and the alerting story separate a junior analyst from a data engineer who has been paged at 3 a.m.

The SQL formula

The plain-English definition is the number of non-null rows divided by total rows, expressed as a percentage and scoped to a recent time window so you measure today's pipeline, not the entire history of the table. The window is the most underrated part of the recipe — old data dilutes the signal you actually care about, which is whether the pipeline broke this week.

SELECT
    COUNT(*) AS total_rows,
    COUNT(email) AS non_null_rows,
    COUNT(email)::NUMERIC * 100 / NULLIF(COUNT(*), 0) AS completeness_pct
FROM users
WHERE created_at >= CURRENT_DATE - INTERVAL '7 days';

Two SQL details worth saying in an interview. First, COUNT(column) already ignores nulls, so you do not need a FILTER (WHERE column IS NOT NULL) clause — the longer form works but signals unfamiliarity with default aggregate semantics. Second, the NULLIF(COUNT(*), 0) guard protects against a zero-divide on empty result sets, which happens during cold-start backfills and partition-pruning bugs.

The WHERE clause is the lever you tune to ask the right question. Seven days catches gradual drift; one day catches a deploy that broke tracking; thirty days smooths over weekday-weekend swings on B2B products. Pick the window that matches the failure modes you have seen in this column.

Completeness across fields

A single column completeness check is a smoke alarm. The full DQ dashboard is one query that returns completeness for every important column on the table, scanned in a single pass.

SELECT
    COUNT(*) AS total_rows,
    ROUND(COUNT(email)::NUMERIC    * 100 / NULLIF(COUNT(*), 0), 2) AS email_pct,
    ROUND(COUNT(phone)::NUMERIC    * 100 / NULLIF(COUNT(*), 0), 2) AS phone_pct,
    ROUND(COUNT(country)::NUMERIC  * 100 / NULLIF(COUNT(*), 0), 2) AS country_pct,
    ROUND(COUNT(avatar_url)::NUMERIC * 100 / NULLIF(COUNT(*), 0), 2) AS avatar_pct,
    ROUND(COUNT(referrer_source)::NUMERIC * 100 / NULLIF(COUNT(*), 0), 2) AS referrer_pct
FROM users
WHERE created_at >= CURRENT_DATE - INTERVAL '7 days';

This is a single sequential scan with five aggregates — orders of magnitude cheaper than five UNION ALL queries that each scan the table. The output is one row, easy to pipe into a BI tile or a Slack alert. Keep one physical scan whenever you can.

The same query rewritten in long format is useful for charting tools that prefer tidy data — wide for dashboards, long for charts and joins.

SELECT 'email' AS field,
       COUNT(email)::NUMERIC * 100 / NULLIF(COUNT(*), 0) AS completeness_pct
FROM users WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'
UNION ALL SELECT 'phone',   COUNT(phone)::NUMERIC   * 100 / NULLIF(COUNT(*), 0) FROM users WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'
UNION ALL SELECT 'country', COUNT(country)::NUMERIC * 100 / NULLIF(COUNT(*), 0) FROM users WHERE created_at >= CURRENT_DATE - INTERVAL '7 days';

Tracking completeness over time

A snapshot tells you the current state; a time series tells you whether the pipeline is degrading. The standard pattern is one row per day per column, computed by DATE_TRUNC and the same COUNT(column) ratio.

SELECT
    DATE_TRUNC('day', created_at)::DATE AS day,
    COUNT(*) AS rows_inserted,
    ROUND(COUNT(email)::NUMERIC    * 100 / NULLIF(COUNT(*), 0), 2) AS email_pct,
    ROUND(COUNT(country)::NUMERIC  * 100 / NULLIF(COUNT(*), 0), 2) AS country_pct,
    ROUND(COUNT(referrer_source)::NUMERIC * 100 / NULLIF(COUNT(*), 0), 2) AS referrer_pct
FROM users
WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
  AND created_at <  DATE_TRUNC('day', CURRENT_DATE)
GROUP BY DATE_TRUNC('day', created_at)
ORDER BY day;

The exclusion of the current day with created_at < DATE_TRUNC('day', CURRENT_DATE) is intentional. The partial day always looks weird — early-morning traffic is not the same mix as evening traffic, and ingestion lag on mobile clients can make the current hour look half-empty for no real reason. Always cut the time series at the last fully closed period before you alert on it.

When email_pct drops from ninety-five to thirty between two days, that is your signal. The diagnosis is almost never a real change in user behavior — users do not collectively forget to enter their email overnight. It is a tracking deploy, a schema change at an upstream vendor, a feature flag that bypassed a form field, or a silent truncation in an ETL step. Open the deploy log first.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Alerting on a baseline drop

Alerts compare the current period to a recent baseline and flag when the gap exceeds a tolerance in percentage points. The recipe below benchmarks the most recent day against the previous seven, absorbing weekday-weekend swings without being so wide it masks slow regressions.

WITH current_day AS (
    SELECT COUNT(email)::NUMERIC * 100 / NULLIF(COUNT(*), 0) AS pct,
           COUNT(*) AS rows_inserted
    FROM users
    WHERE created_at >= CURRENT_DATE - INTERVAL '1 day'
      AND created_at <  CURRENT_DATE
),
baseline AS (
    SELECT COUNT(email)::NUMERIC * 100 / NULLIF(COUNT(*), 0) AS pct
    FROM users
    WHERE created_at >= CURRENT_DATE - INTERVAL '8 days'
      AND created_at <  CURRENT_DATE - INTERVAL '1 day'
)
SELECT
    c.rows_inserted,
    ROUND(c.pct, 2) AS today_pct,
    ROUND(b.pct, 2) AS baseline_pct,
    ROUND(c.pct - b.pct, 2) AS drop_pp,
    CASE
        WHEN c.rows_inserted < 50 THEN 'low_volume_skip'
        WHEN b.pct - c.pct > 10   THEN 'critical_over_10pp'
        WHEN b.pct - c.pct > 5    THEN 'warning_5_to_10pp'
        WHEN b.pct - c.pct > 2    THEN 'minor_2_to_5pp'
        ELSE 'ok'
    END AS verdict
FROM current_day c
CROSS JOIN baseline b;

The low_volume_skip branch is the part most teams forget. On a small product a single quiet day can drop the denominator to a few dozen rows, and one missing email becomes a five-point swing. Without a volume guard the alert fires every Sunday and people stop reading it. Set the threshold wherever a one-row change costs less than a percentage point — typically fifty for early-stage products, ten thousand for established ones.

The thresholds themselves are calibration. Ten percentage points is correct for a column expected at ninety-nine percent. For a column expected at thirty percent, a ten-point drop might be normal noise. Pick the threshold per column, not per table.

Common pitfalls

The first pitfall is treating an empty string as a non-null value. In Postgres, MySQL, and Snowflake, '' is not NULL, and COUNT(column) cheerfully includes it. If your application stores empty strings when a user clears a field, your completeness percentage looks pristine while the field is actually empty for half your rows. The fix is COUNT(NULLIF(column, '')) instead of COUNT(column), or an explicit WHERE column <> '' AND column IS NOT NULL filter.

The second pitfall is computing completeness on the entire history of the table. The whole point of the metric is to catch a regression in this week's data, but if you average over five years a complete outage last Tuesday moves the headline number by a hundredth of a percent. Always scope completeness to a recent window — one to seven days for alerting, thirty days for trend charts — and publish the window length next to the number.

The third pitfall is applying one threshold to every column. The email field on a signup form should be at ninety-nine percent or higher; the avatar_url field is opt-in and might run at thirty percent forever. If your alert config screams about avatar at thirty percent, the on-call engineer ignores the channel, and real ninety-to-thirty drops get missed. Calibrate per column and route low-confidence fields to a digest.

The fourth pitfall is ignoring segmentation. Completeness can collapse for one acquisition channel — say, an API partner that stopped sending country — while the global average barely moves because that channel is twenty percent of volume. Slice completeness by the same dimensions you slice product metrics by: channel, platform, country, plan tier. An extra GROUP BY channel exposes the regression the aggregate hides.

The fifth pitfall is forgetting defaults. Sometimes a column is populated with 0, 'unknown', or -1 when the real value is missing — often because the schema disallows nulls. From the database's perspective the column is full; from a data-quality perspective it is empty. Treat sentinels explicitly with COUNT(CASE WHEN column NOT IN ('unknown', '') AND column IS NOT NULL THEN 1 END).

Optimization tips

For large tables the bottleneck is the sequential scan over the time-windowed slice. The first lever is partitioning on created_at — Snowflake and BigQuery cluster natively, Postgres needs declared monthly partitions. A seven-day completeness query then reads one partition and finishes in seconds on billion-row tables.

The second lever is a pre-aggregated daily summary table — (day, column_name, non_null, total, completeness_pct) — refreshed by a dbt model overnight. The dashboard then reads kilobytes instead of gigabytes. Vendors like Monte Carlo and Bigeye are essentially commercial versions of this pattern; rolling your own with dbt is a one-day project.

The third lever is materializing the alert query itself. If the alert runs every fifteen minutes and the table only mutates on the daily ETL, persist the result on the same cadence as the data. This drops the warehouse cost of the alert to near zero.

If you want to drill data-quality SQL questions like this every day, NAILDD is launching with 500+ SQL problems on exactly this pattern.

FAQ

What counts as "complete" for a column?

The strict definition is "not null", but the practical definition is "contains a usable value". Empty strings, sentinel values like 'unknown' or -1, and obvious placeholders like noreply@example.com should be treated as missing for completeness purposes, even though the database engine reports them as populated. Decide the convention per column and enforce it with NULLIF or a CASE expression across every query that touches the field.

What threshold should I set per column?

The threshold depends on whether the field is required, important, or optional. Required fields — primary identifiers, email on a signup form, transaction amount on a payment record — should sit at ninety-nine percent or higher, and any drop below that warrants a page. Important fields the product uses but does not require — country, referrer, plan tier — typically live in the eighty-to-ninety-five range and tolerate a few points of drift. Optional fields — avatar, bio — care more about stability than absolute level, so alert on a sudden drop relative to baseline rather than on a fixed floor.

Completeness versus validity — what is the difference?

Completeness asks whether the field is populated; validity asks whether the populated value is correct. An email column where every row contains some string is one hundred percent complete; if half those strings are missing the @ symbol, validity is fifty percent. Both belong on the DQ dashboard. Completeness regressions usually point at tracking or ETL bugs; validity regressions usually point at upstream input changes or vendor schema drift.

Should I monitor completeness on the recent window or on the entire table?

Both, with different framings. Completeness on a one-to-seven-day window is the production alert metric — it catches regressions the day they happen, before they pollute downstream models. Completeness on the entire table is the historical-quality report — it tells you which columns are reliable enough to feed an ML model. The two queries are identical except for the WHERE clause, so build both.

A column's completeness dropped from ninety-five percent to thirty percent overnight — what now?

Open the deploy log for the upstream service first; this is almost always a code change rather than a user-behavior change. If no deploy correlates, check schema changes on any third-party source that feeds the column — vendors silently rename or drop fields more often than you would expect. Until the root cause is fixed, tag the affected column with a data-quality warning on every downstream dashboard, and pause any automated decisioning that consumes the field.

How do I monitor completeness on slow-arriving columns?

Some columns are populated asynchronously — a first_purchase_at field that is NULL until the user converts, a verified_at field set days after signup. Their completeness will always look low on the latest day because the events have not happened yet. Measure completeness at a fixed offset from row creation — first_purchase_at on rows at least thirty days old, for example — so the denominator only includes rows that have had enough time to be populated. Without the offset, the alert fires on the metric's own definition rather than on a real regression.