How to calculate linear regression in SQL
Contents:
Why regression in SQL matters
It is Monday morning and your PM at Stripe pings you before standup: "Does hours-since-signup actually predict revenue per user, or am I cherry-picking that scatter?" The data lives in a warehouse table with twenty million rows and pulling it into Python would take longer than the meeting. The cleanest defensible answer is a one-line ordinary least squares regression executed inside the SQL editor: two numbers — slope and intercept — that summarize the linear relationship in a way every PM understands.
This question shows up in analytics rounds at Airbnb, DoorDash, Uber, Netflix, Snowflake, and Databricks. The framing is almost always the same: here are two columns, fit a line of best fit in SQL, then explain what the slope means in business terms. Strong candidates write the regression in under two minutes using COVAR_POP and VAR_POP, swap in REGR_SLOPE when the dialect supports it, and call out three assumptions before the interviewer asks. Postgres, Snowflake, BigQuery, Redshift, and ClickHouse all ship the REGR_* family as built-in aggregates.
The OLS formula in one line
Ordinary least squares fits a line y = intercept + slope * x by minimizing the sum of squared residuals. The closed-form solution is:
slope = cov(x, y) / var(x)
intercept = mean(y) - slope * mean(x)
y_hat = intercept + slope * xThe slope tells you how much y changes when x increases by one unit. The intercept is the predicted value of y when x is zero — which is sometimes meaningful and sometimes nonsense, depending on whether x equals zero exists in your data. The line of best fit is the unique line that minimizes the squared vertical distance between the observed y values and the predicted y_hat values, and SQL can compute both numbers in a single query.
Slope and intercept with aggregates
Every modern warehouse exposes COVAR_POP and VAR_POP as aggregate functions. The slope is the population covariance divided by the population variance of x, and the intercept follows directly from the means.
SELECT
COVAR_POP(amount, hours_since_signup)
/ NULLIF(VAR_POP(hours_since_signup), 0) AS slope,
AVG(amount)
- COVAR_POP(amount, hours_since_signup)
/ NULLIF(VAR_POP(hours_since_signup), 0)
* AVG(hours_since_signup) AS intercept,
COUNT(*) AS n_pairs
FROM users_engagement
WHERE created_at >= CURRENT_DATE - INTERVAL '90 days'
AND amount IS NOT NULL
AND hours_since_signup IS NOT NULL;If the result is slope = 0.5 and intercept = 10, the interpretation is straightforward: an extra hour of usage corresponds to an additional fifty cents of revenue per user, on a baseline of ten dollars for a brand-new user with zero hours. The NULLIF wrapper is not optional — if all x values are identical the variance is zero and you would otherwise get a division-by-zero error that crashes the query.
Always include the pair count alongside the slope. A slope computed from twelve rows is not a slope, it is an anecdote. Most reviewers expect at least a few hundred non-null pairs before they take the result seriously, and any production dashboard should hide the line of best fit when the sample size drops below that threshold.
Prediction for a new point
Once the slope and intercept are computed, scoring a new point is a single multiplication and add. The cleanest pattern is to compute the model in a CTE and CROSS JOIN it against the rows you want to predict.
WITH model AS (
SELECT
COVAR_POP(y, x) / NULLIF(VAR_POP(x), 0) AS slope,
AVG(y)
- COVAR_POP(y, x) / NULLIF(VAR_POP(x), 0) * AVG(x) AS intercept
FROM training_data
WHERE event_date >= CURRENT_DATE - INTERVAL '90 days'
)
SELECT
new_x.x_value,
m.intercept + m.slope * new_x.x_value AS predicted_y
FROM (VALUES (10), (20), (50), (100)) AS new_x(x_value)
CROSS JOIN model m;This pattern scales to any number of new points. If you want to score the whole production table, replace the VALUES clause with SELECT x FROM production_features. The CROSS JOIN against a single-row model is a constant-time broadcast and the warehouse optimizer handles it without breaking a sweat.
For real-time scoring inside a dashboard, materialize the model coefficients into a tiny one-row table that is refreshed on a schedule. Reading two floats from a one-row table is essentially free, and the dashboard query stays under one second even on hundreds of millions of rows of features.
Postgres REGR functions
Postgres, Snowflake, BigQuery, and Redshift ship a dedicated family of regression aggregates that compute the same numbers in one pass over the data and skip nulls automatically.
SELECT
REGR_SLOPE(amount, hours_since_signup) AS slope,
REGR_INTERCEPT(amount, hours_since_signup) AS intercept,
REGR_R2(amount, hours_since_signup) AS r_squared,
REGR_COUNT(amount, hours_since_signup) AS n_pairs,
REGR_AVGX(amount, hours_since_signup) AS mean_x,
REGR_AVGY(amount, hours_since_signup) AS mean_y
FROM users_engagement
WHERE created_at >= CURRENT_DATE - INTERVAL '90 days';Note the argument order: REGR_SLOPE(y, x). This trips up almost every candidate the first time they use it, because human intuition reads left to right as "x then y". The reason for y, x is historical — it mirrors the ANSI SQL spec, which mirrors the SAS convention — but the takeaway is to always read the function as "regress the first argument on the second argument".
The REGR_* family is faster than the equivalent COVAR_POP / VAR_POP expression because the warehouse can compute all six aggregates in a single pass over the data. On a billion-row table the difference is measurable, often two to three seconds versus six to eight. Prefer REGR_* whenever it is available; fall back to the manual covariance formula only on dialects that do not ship it, such as older MySQL.
Reading the slope and r-squared
The slope answers the magnitude question and r_squared answers the strength question. A slope of 0.5 means "one additional unit of x corresponds to half a unit of y". An r_squared of 0.04 means "x explains four percent of the variance in y", which is barely a signal. The two numbers must be reported together — a slope without r_squared is a number without a confidence story, and r_squared without a slope tells you nothing about direction or magnitude.
A practical reading guide for analytics work: r_squared above 0.5 is a strong relationship and you can lean on the slope in your recommendation; between 0.1 and 0.3 is the typical range for social and behavioral data and the slope is suggestive but not conclusive; below 0.05 and the line is essentially noise. For a deeper treatment of these thresholds and how to communicate them to a non-technical PM, see the correlation guide.
The sign of the slope is the same as the sign of the correlation. In bivariate OLS, r_squared equals the square of the Pearson correlation coefficient, which means you can sanity-check your regression by running CORR(x, y) separately and squaring the result. If the two numbers disagree by more than a rounding error, something is wrong with the query — usually a stray filter that is applied to one calculation but not the other.
Common pitfalls
The first trap is divide-by-zero when x is constant. If every row in your sample has the same value of x, the variance is zero and the slope is undefined. Without a NULLIF wrapper around VAR_POP(x), the query crashes with a division error. Always wrap the denominator and accept that the slope will be null on degenerate inputs rather than blowing up the entire dashboard.
The second trap is forgetting that OLS assumes linearity, normally distributed residuals, and homoscedasticity — equal variance of residuals across the range of x. When these assumptions fail, the slope is still computable but the standard errors and confidence intervals around it are biased. A scatter plot before the regression catches most violations in ten seconds, and you should always glance at one before reporting a slope to anyone outside the team.
The third trap is treating the slope as a causal effect. In a bivariate regression without controls, the slope is a description of correlation in your sample, not an experimental treatment effect. Saying "an extra hour of usage causes fifty cents more revenue" is wrong; the honest framing is "users who use the product one hour more spend fifty cents more on average". The difference matters when the PM asks whether forcing users to spend more time on the app will lift revenue.
The fourth trap is trying to fit a multivariate regression in pure SQL. With two predictors you can still write the closed-form solution by hand using matrix algebra encoded in window functions, but it is painful, hard to review, and brittle. Once you need more than one predictor, export the data to Python statsmodels or R lm. SQL is for the bivariate case; everything else belongs in a notebook.
The fifth trap is mixing population and sample variance functions. The slope formula assumes population variance and population covariance — VAR_POP and COVAR_POP. If you accidentally use VAR_SAMP with COVAR_POP (or the reverse) the result is still a number, but it is the wrong number. The clean rule is: use the REGR_* family whenever possible, and if you must write it by hand pick POP for both pieces.
Optimization tips
For tables under a hundred million rows the regression is essentially free — a single aggregate scan completes in well under a second on any warehouse with reasonable compute. Above that scale the cost is dominated by the scan, not the math, so the optimization rule is the same as for any aggregate: prune columns, prune partitions, and let the warehouse do its job.
Always filter to the relevant time window with a partition pruning predicate. A WHERE event_date >= CURRENT_DATE - INTERVAL '90 days' against a partitioned table reads only the recent partitions and skips the rest. Without the predicate the query scans the full history and the wall-clock time grows linearly with table age.
If the same model coefficients feed twenty different dashboards, materialize them. A nightly job that writes a single-row table with (slope, intercept, r_squared, n_pairs, computed_at) lets every dashboard read in constant time instead of recomputing on every refresh — the standard pattern at Snowflake and Databricks. The savings on warehouse credits compound fast.
Related reading
- How to calculate correlation in SQL
- SQL window functions interview questions
- How to calculate heteroscedasticity in SQL
- How to calculate confidence interval in SQL
If you want to drill SQL questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.
FAQ
Can I fit multivariate regression in pure SQL?
Technically yes, by encoding the matrix algebra beta = (X'X)^-1 X'y as a sequence of window functions and aggregates, but in practice the code is unreadable and breaks on any non-trivial number of predictors. The right answer for two-or-more-predictor regression is to pull the features into Python statsmodels or R lm. SQL is the right tool for the bivariate case and for batch-scoring a precomputed multivariate model; it is the wrong tool for fitting the multivariate model itself.
What r-squared is considered good?
It depends entirely on the domain. In physical or engineering data with controlled inputs, anything below 0.9 is suspicious. In behavioral and social data — clicks, churn, retention — 0.1 to 0.3 is the realistic range and a slope with r_squared of 0.2 can still be a useful signal. In financial returns data, even 0.05 is sometimes meaningful because of the sheer scale of the underlying flows. Report the number honestly and let the reader calibrate against domain expectations.
How do I compute a confidence interval for the slope?
The closed-form standard error of the slope is sqrt(residual_variance / sum_sq_x_centered), which is computable in SQL but verbose. In practice it is faster and more reliable to bootstrap: resample the rows with replacement, refit the slope each time, and take the 2.5th and 97.5th percentiles of the resulting distribution. Modern warehouses run a thousand-iteration bootstrap in seconds on tens of millions of rows.
Does REGR_R2 match the squared Pearson correlation?
Yes, in the bivariate case. The coefficient of determination from a one-predictor OLS regression equals the square of the Pearson correlation coefficient between the two variables. This is a useful sanity check: run CORR(x, y), square it, and compare to REGR_R2(y, x). The two numbers must agree to within floating-point precision. If they do not, look for a stray filter or join that is applied to one calculation but not the other.
Can I predict on new data with different units?
No. The slope is denominated in units of y per unit of x as seen during fitting. If the training data measured hours_since_signup and you score on days_since_signup, the prediction will be off by a factor of twenty-four. Standardize units at the feature engineering stage and document them in model metadata, not the dashboard query.
When should I use a robust regression instead of OLS?
When the data has heavy-tailed residuals or visible outliers that drive the fit. OLS minimizes squared residuals, which means a single extreme point can dominate the slope. Robust regressions like Huber, RANSAC, and quantile regression downweight outliers and produce a slope that reflects the bulk of the data. SQL does not ship these out of the box, so the practical workflow is to detect outliers via IQR in SQL, winsorize them, then refit OLS — and for a properly robust fit, switch to Python.