How to calculate cross-correlation in SQL

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

What is cross-correlation and why analysts care

The cross-correlation function (CCF) measures how strongly two time series move together when one of them is shifted forward or backward in time. Concretely, you have two daily series — say marketing spend x_t and signups y_t — and you want to know whether spend today drives signups three days from now, seven days from now, or maybe not at all. Plain Pearson correlation between same-day values would miss that delay completely. CCF makes the delay a first-class part of the answer by computing a correlation at each lag k and letting you read the lag with the strongest signal directly off the table.

Picture the scenario every senior analyst has lived through. Your growth PM at Stripe pings you on a Friday afternoon asking why the paid acquisition channel "looks broken" — spend went up sharply on Monday, signups stayed flat through Wednesday, and now finance is asking whether to cut the channel. Before you say anything, you want to know if there is a consistent lag between spend and signups in your historical data. If the CCF peaks at k=3, you have a defensible answer: signups should show up by the weekend, do not cut the channel based on a three-day window. If the CCF peaks at k=0 or is flat across all lags, that is also a defensible answer — the channel does not have a delayed effect, and the spike is genuinely worrying.

The same shape of question shows up across data analyst interviews at Meta, Airbnb, DoorDash, and Snowflake. Support tickets today and churn next week. Push notifications today and re-engagement two days later. Ad impressions today and brand search the following Tuesday. Every product analyst who works near growth or retention ends up writing this query at least once a quarter, and being able to write it directly in the warehouse — without exporting to Python — is a real skill that comes up in screens.

The CCF formula

The mathematical definition is straightforward. For two zero-mean stationary series X and Y, the cross-correlation at lag k is the correlation between X_t and Y_{t-k} across all valid timestamps. Positive k means Y lags behind X (so X "leads"). Negative k means X lags behind Y. In practice you compute it across a symmetric window such as k = -14..14 for daily data, then plot or table the result and pick the lag with the largest absolute value.

CCF(k) = corr(X_t, Y_{t-k})  for k in [-K, K]

Two conventions tend to confuse people. First, some libraries flip the sign of k — always check whether positive lag means "Y is delayed" or "X is delayed" in your tool. Second, the correlation should be computed only on rows where both X_t and Y_{t-k} exist, which means the sample size shrinks as |k| grows. We will handle both explicitly below.

Cross-correlation in SQL

The cleanest implementation uses LAG window functions to produce shifted columns, then CORR to compute the correlations. This works in PostgreSQL, Snowflake, BigQuery, Redshift, and Databricks SQL with only minor dialect tweaks.

WITH base AS (
    SELECT
        event_date,
        marketing_spend AS x,
        signups         AS y,
        LAG(signups, 0)  OVER (ORDER BY event_date) AS y_lag_0,
        LAG(signups, 1)  OVER (ORDER BY event_date) AS y_lag_1,
        LAG(signups, 3)  OVER (ORDER BY event_date) AS y_lag_3,
        LAG(signups, 7)  OVER (ORDER BY event_date) AS y_lag_7,
        LAG(signups, 14) OVER (ORDER BY event_date) AS y_lag_14
    FROM daily_marketing
)
SELECT
    CORR(x, y_lag_0)  AS ccf_0,
    CORR(x, y_lag_1)  AS ccf_1,
    CORR(x, y_lag_3)  AS ccf_3,
    CORR(x, y_lag_7)  AS ccf_7,
    CORR(x, y_lag_14) AS ccf_14
FROM base;

If ccf_3 comes out highest in absolute value — say 0.62 versus 0.18 for ccf_0 — you have evidence that today's spend correlates most strongly with signups three days later. That is the "delayed effect" interviewers want to hear you describe.

A subtle point: LAG(signups, k) shifts the value at row t so it reflects signups_{t-k}. When you correlate x (which is x_t) with y_lag_k, you are computing corr(X_t, Y_{t-k}), exactly matching the CCF definition for positive k. To extend to negative k, you can shift x instead of y, or use LEAD(signups, k) which is equivalent to LAG(signups, -k).

Searching for the optimal lag

Hand-listing a dozen LAG calls works for a screen, but for production you want a parameterized scan. The pattern below uses a VALUES table of candidate lags and a self-join on event_date.

WITH base AS (
    SELECT
        event_date,
        marketing_spend AS x,
        signups         AS y
    FROM daily_marketing
),
lag_grid AS (
    SELECT k FROM (VALUES (-14),(-7),(-3),(-1),(0),(1),(3),(7),(14)) AS t(k)
),
joined AS (
    SELECT
        g.k,
        b1.x AS x_t,
        b2.y AS y_t_minus_k
    FROM lag_grid g
    CROSS JOIN base b1
    JOIN base b2
      ON b2.event_date = b1.event_date - (g.k * INTERVAL '1 day')
)
SELECT
    k                          AS lag_days,
    CORR(x_t, y_t_minus_k)     AS ccf,
    COUNT(*)                   AS n_pairs
FROM joined
GROUP BY k
ORDER BY ABS(CORR(x_t, y_t_minus_k)) DESC;

Read the result top-down. The top row is your candidate optimal lag. The n_pairs column is your sanity check — if it drops below, say, 60 pairs at k = 14 because your history is short, treat that row with suspicion. In Snowflake or BigQuery, replace INTERVAL '1 day' with the dialect-specific syntax (DATEADD(day, -g.k, b1.event_date) in Snowflake, DATE_SUB(b1.event_date, INTERVAL g.k DAY) in BigQuery).

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Lead vs lag interpretation

lag k > 0:  x_t correlates with y_{t-k}  →  X leads Y by k periods
lag k < 0:  x_t correlates with y_{t+|k|} →  Y leads X by |k| periods
lag k = 0:  contemporaneous correlation, no delay information

In a marketing-to-conversions setup, the expected sign is k > 0 — spend today drives signups in the future. If your CCF peaks at k < 0, that means signups today predict spend tomorrow, which usually points to a feedback loop in your bidder (the platform sees high conversions and raises the bid) rather than a "signups cause spend" story. At Tesla, a model where service appointments predict next-week sales sounds plausible, but in practice it usually means both are driven by a third variable like a delivery surge — see the spurious correlation pitfall below.

Common pitfalls

When teams compute CCF for the first time, the most damaging mistake is running it on trended series without detrending. If both X and Y grow over time — and almost every product metric at a growing company does — the CCF will show high values at every lag simply because both series are drifting upward together. The fix is to detrend both series before computing CCF, either by differencing (x_t - x_{t-1}) or by subtracting a rolling mean. In SQL, a LAG(x, 1) OVER (...) followed by x - lag_x gives you the first difference in a single CTE.

A closely related trap is spurious correlation driven by seasonality. Marketing spend at Amazon ramps up every November and so do signups — the CCF will dutifully report a high correlation at lag zero and several positive lags, but the underlying cause is the holiday season, not a causal link from spend to signups. The fix is to add seasonal controls: deseasonalize each series by subtracting a 7-day rolling mean for weekly seasonality, or fit and remove a seasonal STL component. If you skip this step, your CCF table is essentially a calendar in disguise.

Outliers wreck CCF more aggressively than they wreck plain Pearson correlation because the lagged join amplifies their reach. A single Black Friday day at DoorDash can create a high correlation at k = 0 and lookalike peaks at k = 1 and k = -1 purely from one row pair on each side of the spike. Either winsorize the inputs (cap at the 99th percentile), exclude known anomaly dates, or compute a robust variant using ranks instead of raw values.

Mismatched frequencies cause silent garbage. If your marketing data is daily but your signups table is weekly, naively joining on date returns one row per week and your CCF is computed on a handful of points. Aggregate both series to the same grain before the lag scan, and document which grain you chose. A weekly grain reduces the noise but also blurs short delays — a three-day lag becomes invisible.

The final mistake is interpretive: treating the lag with the largest |CCF| as causal. CCF is descriptive — it tells you the lag at which two series correlate most strongly. To make a causal claim, you need a Granger causality test (does adding lagged X improve the forecast of Y beyond Y's own history?) or a proper experiment. Interviewers love asking the follow-up "and how would you prove causality" — have an answer ready.

Optimization tips

For tables under ten million rows, the self-join approach above runs in a few seconds on any modern warehouse. Past that, switch to the LAG-based version, which only scans the base table once. If you need lags out to k = 90 on a multi-year daily series, materialize the lagged columns in a staging table and run the CORR aggregations against that — Snowflake and BigQuery both handle this pattern efficiently because the lagged columns share the same micro-partition layout as the base.

Indexing helps a lot in PostgreSQL: a B-tree index on event_date makes the self-join essentially free. In columnar warehouses, there is no index, but clustering or partitioning by event_date plays the same role. If the date column is not the partition key, the join can degenerate into a full shuffle. Always check the query plan before running CCF on a billion-row table.

If you are computing CCF for many (X, Y) pairs — say, every channel against every conversion event — pivot the pairs into a long table with a pair_id column and add pair_id to the PARTITION BY of the window functions. One query then produces the full CCF matrix in a single scan, which is what teams at Linear and Notion do for their growth dashboards.

If you want to drill SQL questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.

FAQ

What is the difference between CCF and Granger causality?

CCF is descriptive — it answers "at what lag are these two series most correlated?" Granger causality is a formal hypothesis test that asks "does adding past values of X improve the prediction of Y beyond Y's own past?" In an interview, CCF is the right starting point because it is cheap to compute and easy to plot; Granger is the right follow-up when someone asks whether the relationship is genuinely predictive. The two often agree on direction but disagree on magnitude, and Granger can fail to reject when CCF looks visually strong, usually because of autocorrelation in the residuals.

What lag range should I scan over?

Pick the range from the domain, not from the data. For marketing-to-conversion at a consumer app, scan k = -7..14 days — anything beyond two weeks is rarely actionable. For support tickets to churn at a SaaS like Notion, scan k = 0..60 days because churn is a monthly event. If you do not know the right range, run a wide scan first (-30..30), look for the peak, then refine.

How do I add confidence bands to CCF?

The standard approximation is ±2/sqrt(n), the same band used for autocorrelation, where n is the number of paired observations at that lag. Any |CCF(k)| outside this band is statistically distinguishable from zero at roughly the 5% level. This is an approximation that assumes the two series are independent white noise under the null, so do not over-interpret marginal exceedances — and remember that n shrinks at large |k|, which widens the band.

Does CCF work for categorical variables?

Not directly. CCF is defined on numeric series. If X is categorical, encode it as one or more numeric series first — for binary categories use a 0/1 indicator, for multi-category use one indicator per level and run CCF on each against Y. For rare-event series like account_takeover_flag, prefer to aggregate to a daily count of events and run CCF on the count series instead of the raw 0/1 stream.

What does it mean if CCF peaks at lag zero?

It means X and Y move together with no detectable delay, which is honestly the most common result in practice and is usually less interesting than a peak at a nonzero lag. Contemporaneous correlation does not give you a leading indicator — you cannot use X_t to forecast Y_{t+k} because the effect is instantaneous. Before concluding "no delay exists," double-check that you actually detrended and deseasonalized; otherwise a lag-zero peak may just be the shared trend talking.

How is CCF different from a regression with lagged predictors?

A regression like Y_t ~ X_t + X_{t-1} + X_{t-3} + X_{t-7} gives you partial effects — the contribution of each lag holding the others constant. CCF gives you marginal effects — each lag's correlation with Y ignoring the other lags. When the lagged values of X are themselves correlated (which they almost always are), the two analyses can disagree sharply on which lag matters most. Use CCF to explore and the regression to commit, and report both if the interviewer asks for a forecasting recommendation.