How to calculate heteroscedasticity in SQL

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Why heteroscedasticity matters

Your PM pings you on a Thursday with a screenshot of a revenue regression: "We told the board the lift is significant, but the confidence interval looks ridiculous. Is the model broken?" You open the residuals. The spread is tight at the bottom of the x-axis and fans out by a factor of five at the top. The point estimate is fine. The standard errors are wrong. The board is being shown a number with the wrong uncertainty attached to it.

That fan shape is heteroscedasticity — non-constant residual variance — one of the classic OLS assumptions that fails more often than it holds on product data. Revenue, session length, ad spend, and almost every count or duration metric break this assumption out of the box. The coefficients stay unbiased, but the standard errors behind p-values and confidence intervals are off, often by a lot. A 95 percent interval becomes an 80 percent interval, and nobody on the call knows.

This question shows up in DS rounds at Stripe, DoorDash, Uber, Airbnb, Netflix, Snowflake, and Databricks. The interviewer slides a residual plot across the table and asks two things: how would you detect this in SQL, and what would you do about it. Strong candidates write a bucketed variance query in under three minutes, follow up with a one-query Breusch-Pagan test, and name a fix before being asked.

Postgres, Snowflake, BigQuery, and Redshift all ship REGR_SLOPE, REGR_INTERCEPT, REGR_R2, VAR_SAMP, and NTILE — the entire toolbox. This post walks the two practical tests, ranks the fixes, and collects the traps that get candidates dinged.

The bucketed residual test

The fastest detection is the bucketed residual test. Fit a simple linear regression, compute the residual for every row, split the x-axis into ten equal-count buckets with NTILE(10), and look at the residual standard deviation inside each bucket. Flat bars mean homoscedasticity. Bars that climb with x mean heteroscedasticity. The whole thing is one CTE plus a GROUP BY.

WITH residuals AS (
    SELECT
        x,
        y - (
            REGR_INTERCEPT(y, x) OVER ()
            + REGR_SLOPE(y, x) OVER () * x
        ) AS residual
    FROM training_data
),
buckets AS (
    SELECT
        NTILE(10) OVER (ORDER BY x) AS bucket,
        x,
        residual
    FROM residuals
)
SELECT
    bucket,
    COUNT(*)                  AS n,
    AVG(x)                    AS mean_x,
    VAR_SAMP(residual)        AS var_residual,
    STDDEV_SAMP(residual)     AS std_residual
FROM buckets
GROUP BY bucket
ORDER BY bucket;

Reading the output is direct. If std_residual lands within roughly twenty percent across all ten buckets, you have homoscedasticity and OLS standard errors are usable. If the bottom bucket has 100 and the top has 500, that is a five-times spread and the OLS standard errors should not be trusted. On product data, a max/min bucket variance ratio above two is enough to switch to robust standard errors.

One subtle thing about REGR_INTERCEPT and REGR_SLOPE with OVER (): they compute the global intercept and slope and broadcast that single number to every row, which is what you want for residuals. Always print the slope and intercept once in a separate query to sanity-check the line you are subtracting.

Breusch-Pagan in one query

The bucketed test is great for a dashboard. For a formal yes-or-no answer with a p-value, run Breusch-Pagan. Regress the squared residuals on the predictor. If the slope is meaningfully different from zero, the variance depends on x — the definition of heteroscedasticity. The test statistic is N times the R-squared of the auxiliary regression, distributed as chi-square with one degree of freedom under the null.

WITH residuals AS (
    SELECT
        x,
        POWER(
            y - (
                REGR_INTERCEPT(y, x) OVER ()
                + REGR_SLOPE(y, x) OVER () * x
            ),
            2
        ) AS sq_residual
    FROM training_data
)
SELECT
    REGR_SLOPE(sq_residual, x)         AS slope_on_squared,
    REGR_R2(sq_residual, x)            AS r2_of_aux_regression,
    COUNT(*) * REGR_R2(sq_residual, x) AS bp_statistic,
    COUNT(*)                           AS n
FROM residuals;

Reading the output: compare bp_statistic against chi-square critical values. The conventional thresholds are 3.84 at alpha 0.05 and 6.63 at alpha 0.01, both with one degree of freedom. If your statistic is 12.7 you reject homoscedasticity at every conventional level. If it is 1.2 you fail to reject.

On samples in the millions Breusch-Pagan rejects almost everything because the test is too powerful — trust the bucketed variance ratio more than the p-value at that scale. On samples under a hundred rows the chi-square approximation is shaky and you should bootstrap. Between a few hundred and a few hundred thousand rows the test is well calibrated.

What to do when you find it

The fix list has four entries, ranked by how often they ship.

Robust standard errors. Huber-White sandwich estimators (HC0 through HC3) recompute the standard errors without changing the coefficient. There is no clean way to do this in pure SQL — pull the residuals into Python (statsmodels.OLS(...).fit(cov_type='HC3')) or R. This is the right answer when the coefficient itself is the deliverable, for example when a stakeholder asks how much a one-unit increase in price moves revenue.

Log transform of the response. For revenue, session length, time-on-task, and almost any positive count or duration that grows with x, replacing y with LN(y) stabilizes the variance and turns a multiplicative noise model into an additive one. The coefficient now describes a percent change in y per unit change in x. Most stakeholders find percent changes easier to reason about anyway.

SELECT
    REGR_SLOPE(LN(y), x)     AS log_slope,
    REGR_INTERCEPT(LN(y), x) AS log_intercept
FROM training_data
WHERE y > 0;

Weighted least squares. Weight each row inversely to its variance — small-variance rows count more, high-variance rows count less. In SQL you can implement WLS approximately by joining a per-bucket variance table back and using inverse-variance weights in a manual normal-equations formulation. Analysts usually skip this step and go straight to robust standard errors because the gain is small and the code is fiddly.

Honest reporting. Even if you cannot fix heteroscedasticity, naming it in the methods section — "OLS standard errors, no robustness correction, residuals show mild variance growth with x" — beats reporting a too-tight interval and acting surprised when the next quarter's number lands outside it.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

The residual diagnostic plot

For a dashboard or a report, the residuals-versus-predicted plot is what reviewers want to see. A clean cloud with no shape means OLS is fine. A funnel that opens to the right means heteroscedasticity. A curve means non-linearity. A V-shape means both at once.

WITH model AS (
    SELECT
        REGR_SLOPE(y, x)     AS slope,
        REGR_INTERCEPT(y, x) AS intercept
    FROM training_data
),
diag AS (
    SELECT
        m.intercept + m.slope * d.x       AS predicted,
        d.y - (m.intercept + m.slope * d.x) AS residual
    FROM training_data d
    CROSS JOIN model m
)
SELECT predicted, residual
FROM diag
ORDER BY predicted;

Export the two columns into whatever your team uses for plots and overlay a horizontal line at zero plus the bucketed standard deviations on top. Reviewers can read the picture in five seconds, and the diagnostic plot has earned more ship-it decisions than any p-value.

Common pitfalls

The first trap is running Breusch-Pagan on multivariate models with only one predictor in the auxiliary regression. The strict definition regresses squared residuals on every x in the original model. In a univariate world the SQL above is fine, but if your model has price, marketing spend, and seasonality, you need an auxiliary regression with all three. Reaching for the bivariate version on a multivariate model gives you a number that does not test what you think it tests.

The second trap is computing the residual standard deviation on in-sample residuals after fitting on the same rows. Outliers that pulled the line toward themselves make the residuals look artificially small around the outlier, which can mask a real funnel. The clean fix is to fit on a training fold and compute residuals on a held-out validation fold. If you do not have a fold, drop the top one percent of leverage points before running the diagnostic.

The third trap is forgetting that revenue, session length, and basically every positive count or duration are heteroscedastic by default. If you skip the diagnostic and ship raw OLS on these metrics, you are publishing wrong standard errors every time. The cheap pre-flight check is to run a quick log transform side by side: if the slope and R-squared change a lot, your residual structure was distorting the linear fit and you should have logged from the start.

The fourth trap is concluding "no heteroscedasticity" from bucketed standard deviations that look flat. Bucketing aggregates and can hide a smooth fan inside a few bins. Cross-check with the formal Breusch-Pagan test. If only the bucketed view looks clean and the test rejects, trust the test and consider fixes.

The fifth trap is failing to disclose what standard errors you used. A coefficient with OLS standard errors and the same coefficient with HC3 standard errors can have wildly different p-values, and the difference matters whenever the decision is borderline. Always name the SE type in the methods line of the report.

Optimization tips

On metric tables under a million rows these queries return in milliseconds. The work starts on raw event tables with hundreds of millions of rows — the window aggregates for the global intercept and slope force a full scan. Pre-aggregate to one row per modeled unit before fitting: one row per user per day for an activation regression, one row per order for a revenue regression. The residual computation drops from minutes to under a second.

The second lever is materializing the residuals once and reusing them. The bucketed test, the Breusch-Pagan statistic, and the diagnostic scatter all consume the same residual column. A CREATE TABLE AS SELECT ... with (x, y, residual), refreshed when the upstream model is refit, lets every diagnostic touch a few megabytes instead of recomputing.

The third lever is clustering the source on the regressor for sliced diagnostics — "is variance growing with x within each country?" On Snowflake, BigQuery, and Databricks, clustering on country, x lets each per-country test scan only the relevant partition.

If you want to drill SQL questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.

FAQ

Does heteroscedasticity bias the coefficients?

No. The coefficient estimates remain unbiased even when the residual variance is not constant. What changes is the variance of those estimates: the conventional standard errors understate or overstate the true uncertainty, which is why p-values and confidence intervals built from them are not trustworthy. The line through the cloud is still in the right place, but the band you draw around it is wrong.

What p-value should I use for Breusch-Pagan?

Chi-square critical values for one degree of freedom are 3.84 at alpha 0.05 and 6.63 at alpha 0.01. With multiple predictors in the auxiliary regression the degrees-of-freedom equal the number of predictors and the critical values change accordingly. On sample sizes above a million the test rejects almost every regression because it becomes extremely sensitive to tiny deviations — trust the bucketed variance ratio at that scale. Under a few hundred rows the chi-square approximation is shaky and you should bootstrap.

Does the log transform always fix it?

Often, but not always. Log transforms work when the residual variance grows roughly proportionally to the mean, which is the common case for revenue, durations, and counts. They do not work when the variance has a different functional form — U-shaped in x, for example — or when you have zeros that force you into LN(y + 1) shims that introduce their own bias. The general-purpose alternative is the Box-Cox transform, which picks the power transform that maximizes the likelihood under a constant-variance Gaussian model.

Are robust standard errors available in pure SQL?

Not as a built-in. Postgres, Snowflake, BigQuery, and Redshift ship the regression aggregates for coefficient estimation, but none of them ship the sandwich estimators. Compute residuals in SQL, ship them to Python via your warehouse driver, and call statsmodels.OLS(...).fit(cov_type='HC3'). You can roll your own HC0 with a careful join on the residual squared times the design matrix, but the Python round-trip is almost always faster.

Is heteroscedasticity the same as having outliers?

No, they are distinct problems even though they sometimes co-occur. Outliers are individual points that sit far from the fitted line — a small number of rows that disproportionately move the slope. Heteroscedasticity is a structural property of the variance: the spread of the residual cloud changes with x, irrespective of any single point. Detect outliers with a leverage or influence metric, detect heteroscedasticity with a bucketed variance test or Breusch-Pagan.

How does heteroscedasticity affect A/B test analysis?

A/B tests are usually analyzed with a two-sample t-test or a regression with a binary treatment indicator, both of which assume equal variance across groups. When the treatment shifts both the mean and the variance — common when the treatment changes behavior in a way that increases dispersion — the conventional standard error is wrong and the p-value is mis-calibrated. The standard fix is Welch's t-test, which allows unequal variances. In a regression framing, use HC2 or HC3 robust standard errors and report the heteroscedasticity diagnostic alongside the lift.