How to calculate correlation in SQL
Contents:
Why correlation matters
Your PM drops a message before standup: "Does signup age actually predict lifetime value, or am I imagining a pattern?" You have one chart and one number to bring back. The cheapest, most defensible answer is the Pearson correlation coefficient — a single number between minus one and one that captures the strength and direction of the linear relationship. If r is 0.6 you have a story worth telling; if r is 0.05 you save the team from chasing noise.
This question shows up in analytics rounds at Stripe, Airbnb, DoorDash, Uber, Netflix, and Snowflake. The framing is the same: here are two columns, compute the correlation in SQL, then tell me whether to act on it. Strong candidates write the CORR aggregate in under a minute, add a sample-size guard, and call out two traps before the interviewer asks. Postgres, Snowflake, BigQuery, Redshift, and ClickHouse all ship CORR as a built-in aggregate.
The formula in one line
Pearson r is the covariance of x and y divided by the product of their standard deviations:
r = sum( (x - mean_x) * (y - mean_y) )
/ sqrt( sum( (x - mean_x)^2 ) * sum( (y - mean_y)^2 ) )The range is minus one to one. A value near one means the variables move together in a clean linear way. A value near minus one means they move in opposite directions. A value near zero means there is no linear relationship — but watch the word "linear", because a perfect parabola has r equal to zero and is anything but unrelated.
In SQL you almost never type this formula by hand. CORR(x, y) returns the same number with one function call, skipping rows where either column is null.
Pearson r with CORR
Computing Pearson r is a single aggregate. The pattern: filter to rows where both columns are non-null, call CORR, and print the sample count next to it so a reviewer can sanity-check the result.
SELECT
CORR(user_age, lifetime_value) AS r,
COUNT(*) AS n
FROM users
WHERE lifetime_value IS NOT NULL
AND user_age IS NOT NULL;Reading the output: if r lands around 0.4 on a hundred thousand users, you have a moderate positive relationship worth digging into. If r is 0.03 on the same sample, you have nothing — the noise floor for a series this size is around 0.006, so 0.03 is technically significant but practically useless. Anchor the interpretation to the sample size and business context, not the raw number.
One subtle thing about CORR: it silently drops rows where either input is null. If thirty percent of lifetime_value is null because new users have not paid yet, the correlation you compute is for paying users only — a different population from the one the PM had in mind.
Correlation by group
A single overall r often hides the answer. The same product might show a strong age-LTV relationship in the United States, no relationship in Germany, and a negative one in Brazil. Slice by country, channel, or cohort, count the rows in each slice, and order by absolute r.
SELECT
country,
COUNT(*) AS n,
CORR(user_age, lifetime_value) AS r,
POWER(CORR(user_age, lifetime_value), 2) AS r_squared
FROM users
WHERE lifetime_value IS NOT NULL
AND user_age IS NOT NULL
GROUP BY country
HAVING COUNT(*) >= 100
ORDER BY ABS(CORR(user_age, lifetime_value)) DESC;The HAVING COUNT(*) >= 100 line is the line interviewers look for. Without it the top of the result is whichever country has five users and one rich outlier. A hundred is a reasonable floor for exploratory work; for a dashboard, push it to five hundred or a thousand. The r_squared column gives the share of variance explained, which is the number most PMs actually want — r equal to 0.5 sounds dramatic, but r-squared equal to 0.25 makes it clear that age explains a quarter of LTV variance.
Pairwise correlation matrix
When you are looking at a wide table — age, days active, session count, LTV — the next move is a pairwise correlation matrix. Every pair gets its own CORR aggregate, and the result is one row with one column per pair. This is the SQL equivalent of df.corr() in pandas.
WITH stats AS (
SELECT
user_id,
age,
ltv,
days_active,
n_sessions
FROM users
WHERE ltv IS NOT NULL
)
SELECT
CORR(age, ltv) AS age_ltv,
CORR(age, days_active) AS age_days_active,
CORR(age, n_sessions) AS age_n_sessions,
CORR(ltv, days_active) AS ltv_days_active,
CORR(ltv, n_sessions) AS ltv_n_sessions,
CORR(days_active, n_sessions) AS days_active_n_sessions
FROM stats;The result is a single row with six numbers — every unique unordered pair across four variables. In production you would pivot this into a long table with three columns to feed a heatmap. For a screen interview the flat row is faster to write and easier to read aloud. If days_active and n_sessions come back at 0.95, you have near-collinear features and almost certainly want to drop one before fitting a regression.
Spearman rank correlation
Pearson assumes the relationship is linear and both variables are roughly normal. Real product data rarely cooperates: LTV is right-skewed, session counts are heavy-tailed, ad spend is log-distributed. When the relationship is monotonic but not linear, Pearson underestimates the strength and Spearman gives a better answer.
Spearman is just Pearson applied to ranks. Two lines of SQL — RANK() OVER (ORDER BY x) for each variable, then CORR on the ranks — and you have a non-parametric correlation coefficient.
WITH ranked AS (
SELECT
RANK() OVER (ORDER BY user_age) AS rank_age,
RANK() OVER (ORDER BY lifetime_value) AS rank_ltv
FROM users
WHERE lifetime_value IS NOT NULL
AND user_age IS NOT NULL
)
SELECT
CORR(rank_age::NUMERIC, rank_ltv::NUMERIC) AS spearman_r
FROM ranked;If Pearson r is 0.3 and Spearman r is 0.7 on the same data, the relationship is strongly monotonic but not linear — log-transforming one or both variables will usually recover most of the signal. If both come back similar, Pearson is fine. The rank-then-CORR pattern is portable across every warehouse and is the version interviewers expect to see written out.
How to read r and r-squared
A correlation coefficient is meaningless without a reference for what "strong" means in your domain. On consumer product data, anything above 0.5 between two non-trivially-related variables is rare and noteworthy. On lab data with clean measurements, 0.95 is the bar. On financial markets, 0.1 between two return series is a tradable signal. Anchor the number to what the field considers strong, not to a textbook table.
Rough conventions for product analytics: r above 0.7 is strong, 0.5 to 0.7 is moderate, 0.3 to 0.5 is weak, below 0.3 is essentially noise unless the sample is huge. Negative r uses the same buckets — minus 0.6 is moderate negative, not weak.
R-squared is r squared and is the percentage of variance in one variable explained by the other. r equal to 0.5 maps to r-squared equal to 0.25 — twenty-five percent of variance explained. PMs find r-squared easier to reason about because the units are intuitive. If you only have time to ship one number, ship r-squared with the sign of r.
Statistical significance is a different question from practical strength. A correlation of 0.02 on a million rows is statistically significant but not worth acting on. A correlation of 0.6 on twenty rows is practically interesting but statistically unreliable. The right framing for a PM is "the relationship is real and the size is moderate", not "p less than 0.05".
Common pitfalls
The most common mistake is treating correlation as causation. r equal to 0.9 between in-app messages received and lifetime value does not mean sending more messages causes higher LTV. The likely story is that high-engagement users both receive more messages and pay more, and the messages did nothing. The fix is to design an experiment that breaks the confound — a random hold-out of users who do not receive the messages — and measure the lift directly.
A second trap is the linearity assumption. Pearson r equal to zero does not mean the variables are unrelated — it means there is no linear relationship. A perfect parabola has r equal to zero. Plot the scatter before you trust the number; if you see a curve, switch to Spearman or fit a non-linear model.
A third pitfall is letting outliers run the result. One user with a hundred thousand dollars of lifetime value in a sample where the median is fifty dollars can flip the sign of r. The fix is to winsorize — clip values to the 1st and 99th percentile — or switch to Spearman, which is rank-based and immune to extreme values.
A fourth trap is reporting r on a sample too small to support it. With thirty rows the standard error on r is roughly 0.2, which means a sample r of 0.4 has a 95-percent confidence interval that overlaps zero. The conventional floor is one hundred observations for exploratory work and a few hundred for a result you would present to a non-technical audience.
A fifth pitfall is spurious correlations across unrelated time series. Two metrics that both grow over time will show high correlation no matter what they are. The fix on time series is to compute correlation on first differences — value - LAG(value, 1) — instead of raw levels. This strips out the shared trend and tests whether the two series move together day to day.
Optimization tips
On tables under ten million rows the CORR queries above run in seconds. The work starts on raw event tables with hundreds of millions of rows, where the aggregate has to scan the entire dataset. Roll up to a user-level or daily-level fact table first: one row per entity with the columns you want to correlate. The cost drops by one to two orders of magnitude.
The second lever is filtering before the aggregate. WHERE predicates that eliminate nulls or outliers should sit inside the CTE, not in the outer query. The planner can push them down and skip rows before they touch the aggregator. On Snowflake and BigQuery this also reduces scanned bytes.
The third lever is caching the result. A daily snapshot of the top pairwise correlations is enough for any executive dashboard. A small correlation_daily table refreshed once a day saves you from recomputing the same matrix on every page load.
Related reading
- SQL window functions interview questions
- How to calculate autocorrelation in SQL
- How to calculate confidence interval in SQL
- A/B testing peeking mistake
If you want to drill SQL questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.
FAQ
Does every database have a CORR function?
Postgres, Snowflake, BigQuery, Redshift, Oracle, and ClickHouse all ship CORR as a built-in aggregate with identical semantics. MySQL is the holdout — there is no CORR aggregate, so you compute it manually with SUM, AVG, and STDDEV_SAMP. SQLite also lacks CORR in the standard build. Almost every modern warehouse interview is on Postgres or Snowflake, where CORR is the right answer.
What does r equal to one actually mean?
A perfect positive linear relationship: every point lies on a straight line with positive slope. In real product data you only see r equal to one when the two variables are tautologically related — for example, total revenue and total revenue in cents. If your CORR returns one between two columns you thought were distinct, double-check the inputs; you have probably duplicated a column or joined on the wrong key.
When should I use Pearson versus Spearman?
Use Pearson when the relationship is roughly linear and both variables are roughly normal. Use Spearman when the relationship is monotonic but not linear, when distributions are skewed, or when outliers are a concern. The honest default for product data is to compute both and look at the difference: a large gap is a signal the relationship is non-linear and any model you fit should account for it.
What is the difference between correlation and covariance?
Covariance has the same sign information as correlation but its magnitude depends on the units of x and y, which makes it impossible to compare across pairs. Correlation rescales covariance by the standard deviations of each variable, producing a unit-free number between minus one and one. You almost never report covariance directly — it shows up inside formulas for portfolio variance or the Kalman filter, but not on dashboards.
How do I test whether r is statistically significant?
The standard test is a t-statistic computed as r * sqrt(n - 2) / sqrt(1 - r^2) with n minus two degrees of freedom. The test rarely matters in practice — the bigger question is whether r is large enough to act on, which is a domain judgment. A useful rule of thumb: with one hundred observations, the noise floor for r is about 0.2; with one thousand observations, it is about 0.06.
Can I compute correlation on categorical variables?
Pearson and Spearman are for numeric inputs. For two categorical variables, use the chi-square test of independence. For one numeric and one categorical, use a one-way ANOVA or a difference-in-means test. Encoding a four-level country variable as integers and feeding it to CORR produces a number, but the number is meaningless because the integer ordering is arbitrary.