How to calculate MAD outliers in SQL

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Why MAD beats standard deviation

Picture a Tuesday afternoon at Stripe: API latency dashboards spike, and your platform PM wants to know which endpoints actually misbehaved versus which ones look noisy because a single deploy fired a burst of slow requests. You reach for the obvious z-score detector — mean and standard deviation, threshold at three sigma — and watch it fail. The slow requests you are trying to flag are pulling the mean up and inflating the standard deviation, so the detector decides nothing is unusual. You need a measure of spread that does not get contaminated by the rows under suspicion, and that is exactly what the median absolute deviation gives you.

MAD is the median of the absolute differences from the median. Up to fifty percent of the data can be garbage and MAD will still report a sensible spread. That breakdown point is dramatically better than the standard deviation, which can be wrecked by a single extreme value. On the right-skewed, long-tailed distributions you see in production — payment amounts, response times, session length, ad spend by campaign — MAD refuses to let the loudest rows dictate the verdict. That is why interviewers at Snowflake, Databricks, Stripe, and Netflix keep asking analysts to implement MAD in raw SQL.

The MAD formula

The formula has two layers. First compute the median of the values, then take the median of the absolute deviations from that median. The result is a single number that summarizes the typical distance from the center on a robust scale.

median_x       = median(x)
abs_deviation  = |x_i - median_x|
MAD            = median(abs_deviation)
modified_z_i   = 0.6745 * (x_i - median_x) / MAD

The constant 0.6745 is the inverse normal CDF evaluated at 0.75. Multiplying MAD by 1 / 0.6745 makes it a consistent estimator of the standard deviation when the underlying data is normal. Equivalently, multiplying the centered value by 0.6745 / MAD rescales the score so a threshold of three lines up with the familiar three-sigma rule. Iglewicz and Hoaglin recommend flagging anything with a modified z-score above 3.5; I use three as a wider net and review borderline cases manually.

The SQL recipe

In Postgres, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value) returns the continuous median. MAD takes two passes: one to find the median, a second to compute absolute deviations and percentile them. The query below uses an api_latency table with event_time and value columns to find requests that are unusually slow in the last seven days.

WITH median_calc AS (
    SELECT
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value) AS med
    FROM api_latency
    WHERE event_time >= CURRENT_DATE - INTERVAL '7 days'
),
abs_deviations AS (
    SELECT
        a.event_time,
        a.value,
        m.med,
        ABS(a.value - m.med) AS abs_dev
    FROM api_latency a
    CROSS JOIN median_calc m
    WHERE a.event_time >= CURRENT_DATE - INTERVAL '7 days'
),
mad_calc AS (
    SELECT
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY abs_dev) AS mad,
        MAX(med) AS med
    FROM abs_deviations
)
SELECT
    a.event_time,
    a.value,
    m.med,
    m.mad,
    0.6745 * (a.value - m.med) / NULLIF(m.mad, 0) AS modified_z
FROM abs_deviations a
CROSS JOIN mad_calc m
WHERE ABS(0.6745 * (a.value - m.med) / NULLIF(m.mad, 0)) > 3
ORDER BY ABS(a.value - m.med) DESC
LIMIT 50;

Read the query top down. The median_calc CTE returns a single row containing the seven-day median. The abs_deviations CTE joins that scalar back to every event row with a CROSS JOIN — deliberate because the right side has exactly one row, so this is a constant-broadcast, not a Cartesian explosion. The mad_calc CTE percentiles those absolute deviations to get MAD itself, and re-emits the median with MAX(med) so the final SELECT pulls both numbers from one row. The wrap guards 0.6745 * (value - med) / MAD with NULLIF(mad, 0) to avoid division by zero.

On BigQuery the percentile syntax is PERCENTILE_CONT(value, 0.5) OVER () returning one value per row. Snowflake has both the standard aggregate and a MEDIAN(value) shortcut. The CTE skeleton is identical.

Modified z-score with verdict labels

Most production dashboards want more than a binary flag. Pushing the verdict into a CASE lets analysts triage borderline rows without rerunning the query, and it gives the on-call engineer something to look at before deciding whether to wake anyone up.

WITH median_calc AS (
    SELECT
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value) AS med
    FROM api_latency
    WHERE event_time >= CURRENT_DATE - INTERVAL '7 days'
),
abs_deviations AS (
    SELECT
        a.event_time,
        a.value,
        m.med,
        ABS(a.value - m.med) AS abs_dev
    FROM api_latency a
    CROSS JOIN median_calc m
    WHERE a.event_time >= CURRENT_DATE - INTERVAL '7 days'
),
mad_calc AS (
    SELECT
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY abs_dev) AS mad,
        MAX(med) AS med
    FROM abs_deviations
),
labelled AS (
    SELECT
        a.event_time,
        a.value,
        0.6745 * (a.value - m.med) / NULLIF(m.mad, 0) AS modified_z
    FROM abs_deviations a
    CROSS JOIN mad_calc m
)
SELECT
    event_time,
    value,
    modified_z,
    CASE
        WHEN ABS(modified_z) > 3.5 THEN 'outlier'
        WHEN ABS(modified_z) > 2.5 THEN 'borderline'
        ELSE 'normal'
    END AS verdict
FROM labelled
ORDER BY ABS(modified_z) DESC;

The two thresholds are not arbitrary. The 3.5 boundary is the Iglewicz and Hoaglin recommendation from the standard reference on outlier detection. The 2.5 lower threshold marks the zone where rows are statistically suspicious but not unambiguous — the right behavior in most pipelines is to log them, not page anyone. If the median itself is unstable on small samples, swap the hard 3.5 for a percentile of the score, e.g. PERCENTILE_CONT(0.999) WITHIN GROUP (ORDER BY ABS(modified_z)).

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Rolling MAD for time series

A static MAD computed over the last seven days is a fine first pass. What it misses is drift. If your service latency creeps from 80 ms to 200 ms over a week the global median moves with it and the detector stops flagging anything. The fix is to make both the median and MAD rolling, so each point is compared against its own recent neighborhood rather than the whole window.

WITH rolling AS (
    SELECT
        event_date,
        value,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value)
            OVER (
                ORDER BY event_date
                ROWS BETWEEN 28 PRECEDING AND 1 PRECEDING
            ) AS rolling_med
    FROM daily_metric
),
rolling_dev AS (
    SELECT
        event_date,
        value,
        rolling_med,
        ABS(value - rolling_med) AS abs_dev
    FROM rolling
),
rolling_mad AS (
    SELECT
        event_date,
        value,
        rolling_med,
        abs_dev,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY abs_dev)
            OVER (
                ORDER BY event_date
                ROWS BETWEEN 28 PRECEDING AND 1 PRECEDING
            ) AS rolling_mad
    FROM rolling_dev
)
SELECT
    event_date,
    value,
    rolling_med,
    rolling_mad,
    0.6745 * (value - rolling_med) / NULLIF(rolling_mad, 0) AS modified_z
FROM rolling_mad
WHERE event_date >= CURRENT_DATE - INTERVAL '90 days'
ORDER BY event_date;

The window 28 PRECEDING AND 1 PRECEDING evaluates each day against the prior four weeks, deliberately excluding the current day so a spike cannot pull its own baseline. For slower metrics push it to 90 PRECEDING. Postgres 14+ supports PERCENTILE_CONT as a window function directly; older versions need a self-join or a lateral subquery, which is slower but tractable on small tables. Note that the first 28 days have no full window, so the rolling median is NULL and the modified z-score is undefined — filter those rows out, or use COUNT(*) OVER (...) to require a minimum sample size before emitting a score.

Common pitfalls

When MAD reports zero, the modified z-score divides by zero and your detector silently dies. This happens whenever more than half of the values in the window are identical — for example, a metric dominated by zeros because most users do not perform the tracked action. Guard with NULLIF(mad, 0) so the score becomes NULL instead of erroring, then fall back to a percentile-based detector for that segment.

Forgetting the 0.6745 rescaling makes the threshold sensitive to the underlying distribution and breaks comparability with a regular z-score. Engineers who skip it tend to pick thresholds empirically and never realize they have invented a new statistic. Always apply the constant, and pick your threshold against the rescaled score, not raw (value - median) / MAD.

Applying MAD to a multimodal distribution gives you nonsense answers. If a metric has two clear modes — free versus paid users, weekday versus weekend traffic, mobile versus desktop sessions — the global median lands in the trough between them and MAD becomes huge. The detector then flags almost nothing, including the genuine anomalies. Segment the data and compute MAD per segment, or switch to a mixture-aware detector.

Applying MAD to categorical data is a category error. MAD only makes sense for ordinal or continuous metrics where subtraction is defined. For categorical anomalies use chi-square or a frequency-based detector instead. And comparing raw MAD across populations is meaningless because the units differ — always work with the rescaled modified z-score for cross-metric comparisons, or normalize MAD by the median to get a coefficient of dispersion.

Optimization tips

PERCENTILE_CONT is an aggregate over sorted input, which means it sorts. On large tables that sort is the dominant cost. Prune by event time before the percentile, sample down to a representative subset for exploration, and only run the full-resolution query when emitting a final flag. Partitioning the source table on event date is the single biggest win on Postgres or BigQuery because the planner can skip irrelevant partitions before sorting.

If MAD is a daily job and the window is fixed, materialize the rolling median and MAD into a derived table refreshed on a schedule, then read from that table in dashboards. Recomputing two percentile aggregates over a 28-day window every dashboard load is wasteful when the underlying daily aggregates are static once the day closes. A materialized view, an incremental dbt model, or a scheduled INSERT into a summary table all work.

On Snowflake and Databricks, APPROX_PERCENTILE(value, 0.5) is significantly faster than the exact aggregate and accurate enough for outlier detection. The TDigest-backed error is well under one percent on the medians and MADs that you care about, with an order-of-magnitude speedup on tables with hundreds of millions of rows. Prefer the approximate variant unless you have a regulatory reason to compute exact statistics, and always inspect the row count of the median CTE before the join — a forgotten GROUP BY upstream can silently inflate the join and yield wrong scores without erroring.

If you want to drill SQL detectors like MAD against a graded problem set, NAILDD is launching with 500+ SQL problems covering exactly this pattern.

FAQ

When should I pick MAD over IQR for outlier detection?

Both are robust on skewed distributions, and the right choice depends on how you want to score the rows. MAD plus the 0.6745 rescaling produces a continuous modified z-score that is directly comparable to a familiar three-sigma threshold, which makes it the natural fit when downstream code already speaks z-score. IQR with the 1.5xIQR rule produces a clean binary flag and is the standard for boxplots, so reach for it when the consumer is a chart rather than an alerting pipeline.

Should I emit the modified z-score or just a boolean flag?

Emit the score and let downstream consumers threshold it. A boolean from the detector locks in a threshold you will eventually want to tune, and tuning against a stored score is a one-line change while regenerating booleans means a full recompute. The only reason to store a flag is if the storage cost of the score is genuinely prohibitive, which it almost never is.

What do I do when more than fifty percent of my data is duplicates?

MAD is zero and the modified z-score is undefined. The practical fix in most pipelines is to fall back to IQR for that segment, or compute MAD on the non-zero subset and treat the zero mass separately. Do not silently divide by zero and ship null scores; flag the segment as undetectable and route it through a human review.

Does the Iglewicz and Hoaglin threshold of 3.5 work for every metric?

It is a sensible default and worth knowing because reviewers will ask, but no fixed threshold is universally correct. On heavy-tailed distributions like ad spend, 3.5 floods the detector with false positives. On well-behaved internal metrics like service uptime it is too loose. Calibrate against a labeled sample if you have one, otherwise pick the percentile of the rescaled score that matches your alerting budget.

Can I use MAD as a feature in a downstream model?

Yes, and the modified z-score is often a better predictor than the raw value for fraud, anomaly, and churn risk models because it encodes how unusual the row is relative to its peers. The catch is that you must compute the score on a held-out window — usually the last seven or twenty-eight days excluding the row itself — to avoid leakage. Computing MAD over a window that includes the row you are scoring lets the target sneak into the feature.

How is SQL MAD different from MAD in statistics packages?

Most statistics packages return MAD pre-multiplied by the 1 / 0.6745 rescaling factor, so the number they return is already in standard-deviation-equivalent units. SQL implementations return the raw MAD without rescaling, and you apply the constant in the score formula. Mismatched conventions are a classic source of off-by-a-factor-of-1.4826 bugs.