May 18, 2026 · 13 min read

CUPED explained simply

Prep A/B testing and statistics

300+ questions on experiment design, sample size, p-values, and pitfalls.

Contents:

Why CUPED matters
The one-paragraph intuition
The CUPED formula
A worked numerical example
When CUPED helps and when it does not
Minimal SQL implementation
Python reference implementation
Common pitfalls
What interviewers ask
FAQ

Why CUPED matters

Picture a typical experimentation Monday at Stripe, Netflix, or DoorDash. A PM ships a checkout tweak, the platform spits out a 1.4 percent revenue lift with a p-value of 0.18, and the launch review is Thursday. The raw revenue metric is noisy because user spend ranges from one dollar to ten thousand per month. You either run another four weeks and miss the launch, or you ship a coin flip. CUPED gives you a third path: the same minimum detectable effect with 30 to 70 percent fewer users, achieved by subtracting the variance that has nothing to do with the treatment.

Microsoft published the original CUPED paper in 2013. By 2026 every serious experimentation stack ships it as a default adjustment, including the home-grown platforms at Meta, Uber, Airbnb, Booking and the commercial ones like Eppo and Statsig. A 50 percent variance cut means you halve the runtime of every continuous-metric test in the company, which is the kind of number that shows up in a board deck.

It is also the single most reliable interview question for senior analytics and data-science roles. Recruiters at Amazon, Meta, and Netflix screen for it because candidates who can explain CUPED in plain English, derive theta on a whiteboard, and ship the SQL are the ones who have actually shipped experiments.

The one-paragraph intuition

CUPED stands for Controlled-experiment Using Pre-Experiment Data. A user who spent four hundred dollars last month will probably spend roughly four hundred next month, regardless of whether you show them the new checkout button. That predictable chunk is variance you can subtract before comparing control and treatment. Once removed, the remaining variance is closer to pure treatment effect plus noise, and the t-test gets dramatically more powerful.

Mechanically you fit a one-variable regression of the in-experiment metric on its pre-experiment counterpart, then subtract the part of each user's value that the regression attributes to how far their pre-period value sits from the average. The result is the "CUPED-adjusted" metric. You run the same t-test you always run, just on the adjusted value instead of the raw one. Nothing else about the experiment changes. Random assignment still holds, the null hypothesis is still no effect, the p-value still means the same thing. CUPED is purely a variance-reduction wrapper.

The CUPED formula

Y_CUPED = Y - theta * (X - mean(X))

Where Y is the metric you care about measured during the experiment window, X is the same metric measured during a pre-experiment baseline period, and theta is the ordinary-least-squares slope from regressing Y on X.

theta = Cov(Y, X) / Var(X)

Theta is computed on the pooled sample, never separately by treatment group. The pooled slope is what makes the adjustment unbiased. If you let each group estimate its own theta you have implicitly fit a different model in each arm and you re-introduce the treatment effect into the regression coefficient.

The variance of the adjusted metric drops by a factor that depends on the correlation between Y and X.

Var(Y_CUPED) = Var(Y) * (1 - rho^2)
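
Both formulas come out of a short minimization, sketched here for reference. It treats theta as a fixed constant and mean(X) as known, which is the usual simplification and not something the text above states explicitly.

```latex
\begin{aligned}
\operatorname{Var}\bigl(Y - \theta (X - \bar{X})\bigr)
  &= \operatorname{Var}(Y) - 2\theta\,\operatorname{Cov}(Y, X) + \theta^{2}\,\operatorname{Var}(X) \\[4pt]
0 &= -2\,\operatorname{Cov}(Y, X) + 2\theta\,\operatorname{Var}(X)
  \quad\Longrightarrow\quad
  \theta = \frac{\operatorname{Cov}(Y, X)}{\operatorname{Var}(X)} \\[4pt]
\operatorname{Var}(Y_{\mathrm{CUPED}})
  &= \operatorname{Var}(Y) - \frac{\operatorname{Cov}(Y, X)^{2}}{\operatorname{Var}(X)}
   = \operatorname{Var}(Y)\,\bigl(1 - \rho^{2}\bigr),
  \qquad
  \rho = \frac{\operatorname{Cov}(Y, X)}{\sqrt{\operatorname{Var}(Y)\,\operatorname{Var}(X)}}
\end{aligned}
```

The second line sets the derivative with respect to theta to zero. The third substitutes that theta back into the first.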

So if Y and X are correlated at 0.7, variance drops 49 percent. At 0.9, it drops 81 percent. At 0.3, only 9 percent, which is why CUPED is sometimes oversold. The whole game is finding a pre-experiment covariate with a high correlation to the in-experiment metric.
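
The arithmetic above is easy to tabulate for any candidate covariate. The helper below is a minimal sketch; the function name is illustrative and not part of any library.

```python
import math


def cuped_remaining_variance(rho: float) -> float:
    """Fraction of Var(Y) left after CUPED, i.e. 1 - rho^2."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must be a correlation in [-1, 1]")
    return 1.0 - rho ** 2


for rho in (0.3, 0.7, 0.9):
    kept = cuped_remaining_variance(rho)
    # Variance removed is rho^2; the standard deviation shrinks by sqrt(kept).
    print(f"rho={rho:.1f}  variance removed={1.0 - kept:.0%}  "
          f"std dev multiplier={math.sqrt(kept):.2f}")
```

Because required sample size scales with variance, the same fraction is also roughly the share of users still needed to hit the same minimum detectable effect.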

A worked numerical example

Suppose you run a paywall test at a streaming product. The in-experiment metric is revenue per user during a two-week treatment window. The covariate is revenue per user during the four weeks before the test started. You pull 20,000 users split 50/50.

In the raw data, treatment users average $12.40 and control users average $12.10. Standard deviation across both groups is $8.50. The t-statistic is about 2.5, p-value 0.012. That looks significant, but a Bonferroni correction across the four dashboard metrics moves the threshold to 0.0125. You are now borderline.

Now compute CUPED. The correlation between Y and X is 0.75. Theta is roughly 0.78. After subtracting theta * (X - mean(X)) from every user's Y, the adjusted means are still about $12.40 and $12.10, because CUPED does not change means in expectation, but the standard deviation drops to about $5.60. The t-statistic jumps to 3.8, p-value under 0.001. You ship. Same effect, same sample, standard deviation down about 34 percent.
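
The numbers in this example can be checked with a few lines of arithmetic. The sketch below takes the per-arm sample size as a parameter and sets it to 10,000, the size at which the quoted raw t-statistic of about 2.5 comes out for these means and this standard deviation. The p-values use a normal approximation, and the helper names are illustrative.

```python
import math


def two_sample_t(mean_t: float, mean_c: float, sd: float, n_per_arm: int) -> float:
    """t-statistic for a difference in means with a common standard deviation."""
    se = sd * math.sqrt(2.0 / n_per_arm)
    return (mean_t - mean_c) / se


def two_sided_p(t: float) -> float:
    """Two-sided p-value from the normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2.0))


n_per_arm = 10_000  # assumption: per-arm size consistent with t of about 2.5
rho = 0.75

sd_raw = 8.50
sd_cuped = sd_raw * math.sqrt(1.0 - rho ** 2)  # Var shrinks by (1 - rho^2)

t_raw = two_sample_t(12.40, 12.10, sd_raw, n_per_arm)
t_cuped = two_sample_t(12.40, 12.10, sd_cuped, n_per_arm)

print(f"sd after CUPED: {sd_cuped:.2f}")
print(f"raw:   t={t_raw:.2f}  p={two_sided_p(t_raw):.4f}")
print(f"cuped: t={t_cuped:.2f}  p={two_sided_p(t_cuped):.5f}")
```

The means are held fixed here because the adjustment leaves them unchanged in expectation; only the standard deviation, and therefore the standard error, moves.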

When CUPED helps and when it does not

CUPED shines for continuous, heavy-tailed, user-level metrics where pre-period behavior strongly predicts in-experiment behavior. Revenue, sessions, minutes-watched, transactions, items purchased — these are the textbook wins. The variance comes mostly from user heterogeneity, not from the treatment, so soaking that heterogeneity into a covariate is a massive win.

CUPED helps less for new users, because they have no pre-period data. You can set X to zero and run the math, but you get no variance reduction on that slice, and on a test made of mostly new users the gain shrinks toward zero. Some teams stratify: run CUPED on the returning-user slice and accept the unadjusted t-test on the new-user slice.
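
One way to handle the mixed case in a single pass is sketched below. It assumes missing pre-period values arrive as NaN and fills them with the mean of the observed values rather than zero, so users without history get an adjustment of exactly zero. The function name is hypothetical.

```python
import numpy as np


def cuped_adjust_with_missing(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """CUPED where x is NaN for users without pre-period data.

    Users with missing x keep their raw y; everyone else is adjusted with a
    theta estimated on the pooled sample.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    has_pre = ~np.isnan(x)
    if not has_pre.any():
        return y.copy()  # nothing to adjust with

    # Center on the observed mean; users without history sit at exactly 0.
    x_centered = np.where(has_pre, x - x[has_pre].mean(), 0.0)
    var_x = np.var(x_centered, ddof=0)
    if var_x == 0:
        return y.copy()  # constant covariate carries no information

    theta = np.cov(y, x_centered, ddof=0)[0, 1] / var_x
    return y - theta * x_centered
```

With this fill, the users without history contribute nothing to the numerator or denominator of theta, so the slope is driven entirely by the returning users.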

CUPED helps less for binary metrics like seven-day retention or paid conversion. The variance of a Bernoulli is bounded above by 0.25 and tightly tied to the mean, so even a strong predictor only buys you single-digit-percent variance reduction. Big-tech platforms still apply it because every percent counts at their scale; smaller teams rarely see the engineering cost pay for itself on binary outcomes.

CUPED breaks entirely when the pre-period and the experiment period are not comparable. If you change instrumentation, run through a seasonal anomaly, or sample users in a way that correlates with X, theta is biased and the residuals carry that bias. The fix is boring: snapshot the same metric over a stable pre-window and audit that the distributions look the same on day one.

Minimal SQL implementation

The implementation is two CTEs. The first aggregates pre-period and in-experiment metrics per user. The second computes theta as a scalar across all users in the pooled sample.

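WITH user_data AS (
    -- One row per assigned user. Starting from the assignment table keeps
    -- users with no transactions, and COALESCE turns their sums into 0
    -- instead of NULL so they are not silently dropped from the test.
    SELECT
        a.user_id,
        a.variant,
        COALESCE(SUM(CASE WHEN t.event_date BETWEEN '2026-05-04' AND '2026-05-17'
                          THEN t.revenue END), 0) AS y,
        COALESCE(SUM(CASE WHEN t.event_date BETWEEN '2026-04-06' AND '2026-05-03'
                          THEN t.revenue END), 0) AS x
    FROM experiment_assignment a
    LEFT JOIN transactions t
        ON t.user_id = a.user_id
       AND t.event_date BETWEEN '2026-04-06' AND '2026-05-17'
    GROUP BY a.user_id, a.variant
),
theta AS (
    -- Pooled over both variants, never per variant.
    SELECT
        COVAR_POP(y, x) / NULLIF(VAR_POP(x), 0) AS theta_hat,
        AVG(x) AS mean_x
    FROM user_data
)
SELECT
    u.user_id,
    u.variant,
    u.y - t.theta_hat * (u.x - t.mean_x) AS y_cuped
FROM user_data u
CROSS JOIN theta t;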

Run a two-sample t-test on y_cuped grouped by variant. Most warehouses have built-in correlation and covariance aggregates; if yours does not, compute them manually with SUM((y - avg_y) * (x - avg_x)) style expressions. The deeper SQL recipe with handling for nulls, multi-period covariates, and ratio metrics is in the how to calculate CUPED in SQL post.

Python reference implementation

If you prefer to pull the data and run CUPED in pandas or numpy, the entire trick is six lines.

import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    theta = np.cov(y, x, ddof=0)[0, 1] / np.var(x, ddof=0)
    return y - theta * (x - x.mean())

y_adj = cuped_adjust(y, x_pre_experiment)
# now run a two-sample t-test on y_adj split by treatment

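Note that ddof=0 is passed to both np.cov and np.var. Theta is a ratio, so the normalization cancels as long as both calls use the same ddof. The defaults do not match: np.cov divides by n - 1 while np.var divides by n, so leaving both at their defaults inflates theta by a factor of n / (n - 1). Compute theta on the pooled y and x across both arms, not per group.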
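
For completeness, the t-test step can be sketched in the same style. The example below runs on simulated data, so the distributions, the 0.3 effect, and the helper name welch_t are all illustrative assumptions rather than anything measured.

```python
import numpy as np


def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    theta = np.cov(y, x, ddof=0)[0, 1] / np.var(x, ddof=0)
    return y - theta * (x - x.mean())


def welch_t(values: np.ndarray, treated: np.ndarray):
    """Return (treatment minus control difference, standard error, t-statistic)."""
    t_vals, c_vals = values[treated], values[~treated]
    diff = t_vals.mean() - c_vals.mean()
    se = np.sqrt(t_vals.var(ddof=1) / t_vals.size + c_vals.var(ddof=1) / c_vals.size)
    return diff, se, diff / se


# Simulated experiment, purely for illustration.
rng = np.random.default_rng(0)
n = 20_000
x = rng.gamma(shape=2.0, scale=6.0, size=n)       # pre-period metric
treated = rng.random(n) < 0.5                     # random assignment
y = 0.8 * x + rng.normal(0.0, 5.0, size=n) + 0.3 * treated

raw = welch_t(y, treated)
adj = welch_t(cuped_adjust(y, x), treated)        # theta from the pooled sample
print(f"raw:   diff={raw[0]:.3f}  se={raw[1]:.3f}  t={raw[2]:.2f}")
print(f"cuped: diff={adj[0]:.3f}  se={adj[1]:.3f}  t={adj[2]:.2f}")
```

If SciPy is available, scipy.stats.ttest_ind with equal_var=False computes the same Welch statistic and also returns a p-value.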

Common pitfalls

Computing theta separately for control and treatment is the single most common bug. It looks innocuous when your ORM hands you a per-group aggregator, but it introduces bias. Each group's theta is fit to its own data, and the treatment effect leaks into the slope. Only the pooled estimator is unbiased. If you cannot pool, fall back to the unadjusted t-test rather than ship a biased CUPED.

Using a pre-experiment window that overlaps the experiment is the second classic mistake. The covariate must be measured strictly before random assignment. If even a day of the pre-period falls inside the test window, the treatment can influence X, which means X is no longer independent of assignment, and the adjustment becomes a thin disguise for double-dipping. Lock the pre-period boundary at the assignment timestamp.

Choosing a weak covariate is the most expensive pitfall. Teams sometimes pick a generic feature like "days since signup" because it is easy to pull, then are disappointed when variance drops four percent. The covariate should be the same metric you are testing, measured over a pre-period long enough that user-level noise averages out. For a two-week revenue test, four to eight weeks of pre-period revenue is the sweet spot.

Running CUPED on a sample full of brand-new users is the fourth stumble. Their X is zero, so every one of them gets the same constant adjustment, and adjusted variance on that slice equals raw variance. If your experiment is 80 percent new users, you bought yourself essentially nothing. Segment the analysis or report new users separately.

Forgetting to validate that the covariate distribution is balanced across arms is the last trap. Random assignment should give you mean(X) approximately equal across treatment and control; if it does not, your randomization is broken or your assignment table is bugged. Check this on day one of every test. It is the cheapest sanity check you can run.
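
A balance check along these lines could look like the sketch below. The function name and the returned keys are hypothetical, and the threshold for raising an alarm is left to the reader.

```python
import numpy as np


def covariate_balance(x: np.ndarray, treated: np.ndarray) -> dict:
    """Compare the pre-period covariate between arms.

    Returns the raw difference in means (treatment minus control), a Welch
    z-score for that difference, and the difference in pooled-SD units.
    """
    x = np.asarray(x, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    x_t, x_c = x[treated], x[~treated]

    diff = x_t.mean() - x_c.mean()
    se = np.sqrt(x_t.var(ddof=1) / x_t.size + x_c.var(ddof=1) / x_c.size)
    pooled_sd = np.sqrt((x_t.var(ddof=1) + x_c.var(ddof=1)) / 2.0)
    return {"diff": diff, "z": diff / se, "std_diff": diff / pooled_sd}
```

Under healthy randomization the z-score behaves like a standard normal draw, so a value far outside that range on day one points at the assignment pipeline rather than at the treatment.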

What interviewers ask

"What is CUPED in one sentence." A variance-reduction method that subtracts off the predictable part of an experiment metric using its pre-experiment value, so the t-test becomes more powerful at the same sample size.

"Why does it work." Because the experiment metric Y has two sources of variance — user-level heterogeneity and treatment-related noise — and the pre-experiment covariate X captures the heterogeneity component. Subtracting theta * (X - mean(X)) removes that chunk without changing the expected treatment effect.

"Why pool theta across groups." Per-group theta absorbs the treatment effect into the regression slope and biases the adjustment. The pooled estimator is unbiased under random assignment.

"What variance reduction do you actually see in practice." For revenue and engagement metrics with a strong pre-period correlate, 30 to 70 percent is typical. For binary metrics, single digits. The fraction of variance removed is rho^2 (the fraction left is 1 - rho^2), so it is bounded by the correlation between Y and X.

"What if a user has no pre-period data." Set X to the mean of the observed pre-period values and the adjustment for that user is zero; set X to zero and the adjustment is the same constant for every such user. Either way, variance reduction does not apply to that slice, so report new-user and returning-user results separately if you have a meaningful split.

If you want to drill A/B testing and experimentation questions like this in interview format, NAILDD is launching with 500+ SQL and statistics problems covering exactly this pattern.

FAQ

Does CUPED work for ratio metrics like conversion rate or CTR?

Vanilla CUPED works on user-level continuous metrics. For ratio metrics where the denominator varies across users — clicks per impression, orders per session — you need the delta-method extension that handles numerator and denominator separately and combines them with a linearization. The variance reduction is usually smaller than on continuous metrics, but the procedure is the same idea applied to a transformed metric.
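
The linearization step can be sketched as follows. This is one common first-order form, not necessarily the one any particular platform uses. The function name is hypothetical, and the reference ratio r is computed over whatever sample is passed in; whether to take it from the pooled sample or from control only is a design choice the sketch leaves open.

```python
import numpy as np


def linearize_ratio(num: np.ndarray, den: np.ndarray) -> np.ndarray:
    """Per-user linearized values for the ratio sum(num) / sum(den).

    First-order (delta-method) expansion around the sample means:
        r + (num_i - r * den_i) / mean(den)
    The mean of the returned values equals the ratio itself.
    """
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    r = num.sum() / den.sum()
    return r + (num - r * den) / den.mean()
```

The output is an ordinary user-level metric, so the single-covariate adjustment can then be applied to it, for example with the linearized pre-period ratio as X.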

How long should the pre-experiment window be?

For weekly-cyclic metrics like revenue or sessions, a four to eight week pre-period balances noise reduction against drift. Shorter windows mean noisy X and weaker theta; longer windows pull in stale user behavior that no longer reflects current preferences. Some teams use exactly the same length as the experiment window itself, which is a defensible default if you do not want to think about it.

Can I combine CUPED with stratification?

Yes, and many platforms do. Stratify on the obvious dimensions like country and platform, then apply CUPED inside each stratum. The variance reduction stacks: stratification cuts the cross-stratum variance, CUPED cuts within-stratum variance. The math is just CUPED applied to each cell and then a pooled t-test, weighted by stratum size.

Does CUPED change the treatment effect estimate?

In expectation, no. CUPED is unbiased under random assignment, so the estimated lift E[Y_treat - Y_control] is unchanged. In any single sample the point estimate will move a little, because the arms differ slightly in mean(X) and theta is itself estimated, but there is no systematic shift and the variance is smaller whenever X and Y are correlated. That is the entire pitch.

What about regression adjustment with multiple covariates?

CUPED with a single covariate is a special case of regression adjustment. If you have a vector of pre-experiment features, you can run an OLS on the pooled sample and use the residuals as your adjusted metric. The fraction of variance removed becomes R^2 instead of rho^2, leaving 1 - R^2 of the original variance. Most teams find one well-chosen covariate captures 80 percent of the available variance reduction; adding more features yields diminishing returns and more failure modes.
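
A minimal numpy sketch of the multi-covariate version is shown below. The function name is hypothetical. It centers the covariates so that the mean of the metric is left intact, mirroring the single-covariate formula.

```python
import numpy as np


def regression_adjust(y: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Adjust y with several pre-experiment covariates (columns of X).

    Fits OLS of centered y on centered X over the pooled sample and subtracts
    the fitted covariate contribution. With one column this reduces to
    y - theta * (x - mean(x)).
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # treat a 1-D input as a single covariate

    Xc = X - X.mean(axis=0)
    beta, *_ = np.linalg.lstsq(Xc, y - y.mean(), rcond=None)
    return y - Xc @ beta
```

As with theta, the coefficients should be fit on both arms together, and every column of X has to be measured before assignment.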

When is CUPED the wrong tool?

When you have no pre-period data, when the metric is intrinsically low-variance like a 99th-percentile latency capped by an SLO, or when the experiment spans a structural break — a pricing change, an outage, a marketing burst — that decorrelates pre-period from in-experiment behavior. Reach for stratification, sequential testing, or a longer run instead.