How to calculate AUC ROC in SQL

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Why AUC ROC belongs in your SQL toolbox

"How healthy is the fraud model in production?" ends most ML platform standups, and "the AUC is 0.83 this week, 0.79 last week" ends them quickly. AUC ROC collapses a model's ranking ability into one number between zero and one, threshold-independent. You do not pick a cutoff to evaluate it, so it survives the moment the product team changes risk appetite from "block 5 percent" to "block 1 percent" without invalidating last quarter's chart.

A data scientist at Stripe, Airbnb, Uber, or DoorDash reaches for AUC because production binary classification is almost always imbalanced — fraud is 0.3 percent of transactions, click-through is 2 percent of impressions. A model that predicts "not fraud" on everything has 99.7 percent accuracy and is useless. AUC asks: if I draw one positive and one negative at random, what is the probability the model scores the positive higher? The definition holds at any base rate.

Most analysts call sklearn.metrics.roc_auc_score from pandas. Fine offline, wrong tool when the model produces millions of predictions a day or you want drift monitoring as a Snowflake or Databricks tile. This post walks the SQL — Mann-Whitney shortcut, per-segment breakdown, weekly trend, ties. Queries run in Postgres with minor edits for Snowflake, BigQuery, and Redshift.

The Mann-Whitney shortcut

Computing AUC from a threshold sweep in SQL is painful — a row per threshold, precision and recall at each, a trapezoidal integral. The Mann-Whitney U identity makes it tractable: for binary labels and a continuous score, AUC equals the probability that a random positive outranks a random negative.

AUC = U_positive / (n_positive * n_negative)
U_positive = sum_of_ranks_of_positives - n_positive * (n_positive + 1) / 2

Ranks are row positions when you sort by score ascending. Sum the ranks of positives, subtract a constant compensating for positives ranking each other, divide by the product of class sizes. Three aggregates, one division — same answer, single pass, no thresholds.

The data model

A single fact table of predictions joined back to the realized outcome.

predictions (prediction_id, entity_id, score, actual, prediction_date)

score is the model's continuous output — a probability, a logit, or any monotonic transform. actual is the realized binary label. prediction_date is when the model scored the example; the label may arrive seconds later for a click prediction and days later for churn.

Four SQL queries you will actually ship

1. Base AUC over a window

Rank every prediction, sum the positive ranks, plug into the identity.

WITH ranked AS (
    SELECT
        actual,
        score,
        RANK() OVER (ORDER BY score) AS rank_score
    FROM predictions
    WHERE prediction_date >= CURRENT_DATE - INTERVAL '7 days'
),
agg AS (
    SELECT
        SUM(CASE WHEN actual = 1 THEN rank_score ELSE 0 END) AS sum_ranks_pos,
        COUNT(*) FILTER (WHERE actual = 1)                   AS n_pos,
        COUNT(*) FILTER (WHERE actual = 0)                   AS n_neg
    FROM ranked
)
SELECT
    n_pos,
    n_neg,
    (sum_ranks_pos::NUMERIC - n_pos * (n_pos + 1.0) / 2)
        / NULLIF(n_pos::NUMERIC * n_neg, 0) AS auc
FROM agg;

RANK() gives ties the same rank, which is the right behavior for the identity. NULLIF guards against a window with zero positives or zero negatives during a quiet hour.

2. AUC by segment

Group by a segment, compute AUC inside each, sort ascending so the worst-performing slice surfaces first.

WITH ranked AS (
    SELECT
        e.segment,
        p.actual,
        p.score,
        RANK() OVER (PARTITION BY e.segment ORDER BY p.score) AS rank_score
    FROM predictions p
    JOIN entities    e USING (entity_id)
    WHERE p.prediction_date >= CURRENT_DATE - INTERVAL '7 days'
),
agg AS (
    SELECT
        segment,
        SUM(CASE WHEN actual = 1 THEN rank_score ELSE 0 END) AS sum_ranks_pos,
        COUNT(*) FILTER (WHERE actual = 1)                   AS n_pos,
        COUNT(*) FILTER (WHERE actual = 0)                   AS n_neg
    FROM ranked
    GROUP BY segment
)
SELECT
    segment,
    n_pos,
    n_neg,
    (sum_ranks_pos::NUMERIC - n_pos * (n_pos + 1.0) / 2)
        / NULLIF(n_pos::NUMERIC * n_neg, 0) AS auc
FROM agg
WHERE n_pos >= 30 AND n_neg >= 30
ORDER BY auc ASC;

A global 0.85 that hides a 0.58 on new users is the most common silent model failure. The n_pos >= 30 AND n_neg >= 30 filter is not optional — with few positives in a segment, the variance of the estimate is enormous and the ranking is noise.

3. Weekly AUC trend for drift monitoring

DATE_TRUNC to weeks, compute AUC inside each, plot the result. The query that lives next to "error rate" and "latency" on a Snowflake or Databricks tile.

WITH ranked AS (
    SELECT
        DATE_TRUNC('week', prediction_date)::DATE AS week,
        actual,
        score,
        RANK() OVER (PARTITION BY DATE_TRUNC('week', prediction_date)
                     ORDER BY score)              AS rank_score
    FROM predictions
    WHERE prediction_date >= CURRENT_DATE - INTERVAL '16 weeks'
),
agg AS (
    SELECT
        week,
        SUM(CASE WHEN actual = 1 THEN rank_score ELSE 0 END) AS sum_ranks_pos,
        COUNT(*) FILTER (WHERE actual = 1)                   AS n_pos,
        COUNT(*) FILTER (WHERE actual = 0)                   AS n_neg
    FROM ranked
    GROUP BY week
)
SELECT
    week,
    n_pos,
    n_neg,
    ROUND(((sum_ranks_pos::NUMERIC - n_pos * (n_pos + 1.0) / 2)
            / NULLIF(n_pos::NUMERIC * n_neg, 0))::NUMERIC, 4) AS auc
FROM agg
ORDER BY week;

A weekly AUC drifting from 0.84 to 0.79 over four weeks is the signal that the input distribution shifted and the model needs retraining. Trend matters more than the absolute number — a stable 0.74 beats a sliding 0.88. Wrap this in LAG(auc) OVER (ORDER BY week) to drive a percent-change alert; most teams page on a three-to-five-percent week-over-week drop.

4. Approximate PR AUC for very imbalanced labels

When positives are below one percent of the dataset, ROC AUC can stay high while precision at useful recall is awful. PR AUC — area under the precision-recall curve — is the better lens. There is no closed-form Mann-Whitney shortcut here; you walk the curve.

WITH points AS (
    SELECT
        score                                            AS threshold,
        COUNT(*) FILTER (WHERE actual = 1)::NUMERIC
            / NULLIF(COUNT(*), 0)                        AS precision_at,
        SUM(COUNT(*) FILTER (WHERE actual = 1))
            OVER (ORDER BY score DESC)::NUMERIC
            / NULLIF(SUM(COUNT(*) FILTER (WHERE actual = 1))
                     OVER (), 0)                         AS recall_at
    FROM predictions
    WHERE prediction_date >= CURRENT_DATE - INTERVAL '7 days'
    GROUP BY score
),
sweep AS (
    SELECT
        recall_at,
        precision_at,
        LAG(recall_at) OVER (ORDER BY recall_at) AS prev_recall
    FROM points
)
SELECT
    SUM((recall_at - COALESCE(prev_recall, 0)) * precision_at) AS pr_auc
FROM sweep;

A trapezoidal approximation. On most production tables it lands within a percent of the sklearn value — the bar a monitoring query needs to clear.

Reading the score

AUC is bounded in zero to one, and 0.5 — not zero — is the meaningful baseline. A 0.5 means coin flips. Below 0.5 means the model is anti-correlated; flip the predictions. A perfect 1.0 means every positive outranks every negative.

0.50          random baseline
0.60 to 0.70  weak signal, often the floor for a usable production model
0.70 to 0.80  decent, where most fraud and churn models live
0.80 to 0.90  good, the band where a tuned model with strong features lands
0.90+         excellent, but verify there is no label leakage first

The leakage line is not a joke. AUC above 0.95 on a non-trivial problem almost always traces to a leaked feature — an event that occurred after prediction time but was joined in as if available beforehand. A 0.97 that does not survive a holdout cohort is why a Friday promotion becomes a Monday rollback.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Common pitfalls

The first trap is comparing AUC across populations with different base rates. AUC is robust to class imbalance in one sense — invariant under resampling — but a 0.82 on new users and a 0.82 on returning users do not say the same thing about business value. The new-user segment may have far fewer positives, so the same ranking translates into a smaller absolute number of correct flags. Pair AUC with precision at a fixed recall to keep the interpretation honest.

The second trap is a contaminated baseline window. If the dataset includes predictions from before the model was deployed at full traffic, or labels that have not fully materialized for late-resolving events like 30-day churn, the AUC is mixing two distributions. Lag the evaluation window by however long the label takes to resolve for ninety percent of examples.

The third trap is letting ties silently distort the result. A tree ensemble with few unique scores can produce thousands of predictions at the same value. RANK() shares the rank across ties — right for the identity — but conceals that the model cannot order those examples. Pair the AUC query with COUNT(DISTINCT score); a suspiciously low distinct count means ties are inflating the score.

The fourth trap is reporting one global AUC for a multi-segment product. A 0.85 overall can mask a 0.58 on the revenue-critical segment, simply because the failing slice is small. Build the per-segment query into the standard deck and treat the global AUC as a smoke test. The segment breakdown is what catches fairness regressions and subpopulation drift.

The fifth trap is treating AUC as a proxy for calibration. A model can have a perfect AUC and output probabilities unrelated to the true rate — AUC cares about ranking, not magnitude. If the surface needs a probability that means something ("alert if fraud probability exceeds 80 percent"), you also need a Brier score or reliability diagram alongside.

Optimization tips

Partition the predictions table by prediction_date — daily partitions are standard, hourly for very high-volume scoring services. AUC queries are range-bounded by time, so partition pruning collapses the scan to "the last week" without changing a line of SQL.

Materialize per-segment per-day aggregates. Store sum_ranks_pos, n_pos, n_neg, and count_distinct_score per (segment, date). Every dashboard query reduces to a small SUM and a divide; sixteen weeks goes from a multi-minute scan to a sub-second roll-up.

Bound RANK() to the window you actually plot — the window function is the expensive piece. For drift monitoring, four to sixteen weeks is the right horizon. Twelve months of ranks is wasted work.

Store calibration alongside AUC. A tile showing AUC, Brier score, and base-rate-of-positives in one row catches what each metric individually misses: drifting AUC plus stable Brier points at a label-distribution shift; stable AUC plus drifting Brier points at calibration drift.

A Python sanity check

When a SQL AUC looks off, recompute it in scikit-learn and compare. The two should agree to four decimal places.

from sklearn.metrics import roc_auc_score
auc = roc_auc_score(actual, score)

If they disagree by more than 0.001, the bug is in the SQL — an off-by-one in n_pos * (n_pos + 1) / 2, or a query using ROW_NUMBER() instead of RANK(). ROW_NUMBER() assigns distinct ranks to ties and biases the AUC.

If you want to drill SQL like this every day, NAILDD is launching with hundreds of analytics and ML-flavored problems.

FAQ

What counts as a "good" AUC ROC score?

It depends on the problem. Credit scoring lands between 0.70 and 0.82 — rich features, clean labels, genuinely hard behavior. Spam classification routinely hits 0.95+ because the signal is overwhelming. Fraud sits around 0.75 to 0.88 depending on type and label freshness. Anything above 0.95 on a non-trivial domain deserves a leakage audit before promotion.

How is AUC different from accuracy?

Accuracy uses a single decision threshold; AUC sweeps every threshold and measures ranking. Under class imbalance the gap is loud: a model predicting "not fraud" on every transaction has 99.7 percent accuracy on a 0.3 percent base rate but an AUC of exactly 0.5. AUC catches the failure accuracy hides.

When should I use PR AUC instead of ROC AUC?

When positives are below roughly two percent, ROC AUC becomes a poor proxy for production usefulness — it can stay high while precision at usable recall is awful, because the negative class is so large that a tiny false-positive rate produces an avalanche of flags. PR AUC weights the precision-recall trade-off directly. Most rare-event teams report both, with PR AUC as the primary metric.

How do I handle ties in the score column?

Use RANK() rather than ROW_NUMBER() — ties should share a rank for the identity to hold. The deeper issue is that a high tie rate, common with tree ensembles or quantized scores, hides the fact that the model cannot order those examples. Pair the AUC query with COUNT(DISTINCT score) to flag a thin score column.

Can AUC ROC be below 0.5?

Yes — the model is anti-correlated with the truth. A logistic regression with flipped coefficient signs produces it; so will a labeling bug that swapped positives and negatives upstream. An AUC of 0.42 in production almost always indicates a data bug rather than a genuinely terrible model.

How often should I recompute AUC on a deployed model?

Weekly is the right cadence for most production models — short enough to catch real drift before stakeholders notice, long enough that the metric is stable under day-to-day variation in volume and base rate. Daily AUC is appropriate only when labels resolve within hours and volume is large enough that the daily estimate is stable; otherwise noise swamps signal.