CUPED: variance reduction for faster A/B tests

Why variance reduction matters

The higher the variance of a metric, the longer you wait for an A/B test to call a winner. Two ways to speed up an experiment: more traffic or less noise. Traffic is capped by your user base; noise can be peeled off mathematically.

CUPED (Controlled-experiment Using Pre-Experiment Data), published in 2013 by Deng, Xu, Kohavi, and Walker at Microsoft, takes the variation predictable from pre-experiment behavior, subtracts it out, and runs the t-test on what remains. The treatment effect is preserved because pre-experiment behavior is independent of random assignment. The CI shrinks because you removed variance that had nothing to do with the treatment.

CUPED typically cuts required experiment duration by 20 to 50 percent, depending on how strongly the pre-period covariate correlates with the outcome. Platforms at Microsoft, Netflix, Meta, Uber, Airbnb, and DoorDash ship it as a default. Commercial platforms like Eppo and Statsig ship the same recipe.

The math intuition

CUPED is regression adjustment with a covariate that correlates with the outcome but is independent of the treatment.

Let Y be the metric measured during the experiment for a given user — revenue per user across the test window — and let X be the same metric measured before the experiment started. The adjusted metric is:

Y_cuped = Y - theta * (X - E[X])

where theta is the coefficient that minimizes the variance of Y_cuped:

theta = Cov(Y, X) / Var(X)

That is the slope of the linear regression of Y on X. Subtracting theta * (X - E[X]) strips out the share of variation in Y that is explained by pre-experiment behavior.

E[X] is the load-bearing piece. Under random assignment, the expected value of X is the same in treatment and control. Subtracting a quantity with zero expectation in both groups does not bias the estimated treatment effect — it only changes the variance. We strip noise, not signal.
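One way to see the "noise, not signal" claim is to repeat a synthetic experiment many times and compare the raw and adjusted estimates. The sketch below is illustrative only: the data-generating process, the arm size, and the number of repetitions are assumptions, not part of the original method.

```python
import numpy as np

rng = np.random.default_rng(0)
true_effect = 2.0
n = 2000  # users per arm (assumed, for illustration)
raw_diffs, cuped_diffs = [], []

for _ in range(500):
    baseline = rng.normal(50, 20, 2 * n)
    x = baseline + rng.normal(0, 10, 2 * n)   # pre-period metric
    y = baseline + rng.normal(0, 10, 2 * n)   # in-experiment metric
    treated = np.arange(2 * n) >= n           # second half is the treatment arm
    y = y + true_effect * treated

    # Pooled theta, same ddof in numerator and denominator
    theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
    y_cuped = y - theta * (x - x.mean())

    raw_diffs.append(y[treated].mean() - y[~treated].mean())
    cuped_diffs.append(y_cuped[treated].mean() - y_cuped[~treated].mean())

# Both estimators should center on the true effect; CUPED should spread less.
print(f"mean raw diff:   {np.mean(raw_diffs):.3f}")
print(f"mean CUPED diff: {np.mean(cuped_diffs):.3f}")
print(f"std raw diff:    {np.std(raw_diffs):.3f}")
print(f"std CUPED diff:  {np.std(cuped_diffs):.3f}")
```

Across repetitions the two averages should land near the simulated effect, while the spread of the CUPED estimate should be visibly smaller.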

The variance after adjustment is:

Var(Y_cuped) = Var(Y) * (1 - corr(Y, X)^2)
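Both the formula for theta and the variance formula follow from a short expansion. The sketch below treats E[X] as a constant, so it drops out of the variance:

```latex
\begin{aligned}
\operatorname{Var}(Y_{\text{cuped}})
  &= \operatorname{Var}(Y) - 2\theta\,\operatorname{Cov}(Y, X) + \theta^{2}\operatorname{Var}(X) \\
\frac{d}{d\theta}\operatorname{Var}(Y_{\text{cuped}}) = 0
  &\;\Rightarrow\; \theta = \frac{\operatorname{Cov}(Y, X)}{\operatorname{Var}(X)} \\
\operatorname{Var}(Y_{\text{cuped}})
  &= \operatorname{Var}(Y) - \frac{\operatorname{Cov}(Y, X)^{2}}{\operatorname{Var}(X)}
   = \operatorname{Var}(Y)\,\bigl(1 - \operatorname{corr}(Y, X)^{2}\bigr)
\end{aligned}
```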

A correlation of 0.7 drops variance by 49 percent. A correlation of 0.9 drops it by 81 percent. The stronger the link between pre-period and in-experiment metrics, the bigger the win. This is why CUPED shines on revenue and session-count metrics: those are stable per user week over week.
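The arithmetic above can be tabulated for any correlation. The helper below is a hypothetical convenience function, not part of any library. It also converts the variance reduction into a standard-error reduction and into the fraction of the original sample needed for the same precision, using the fact that required sample size scales with variance.

```python
import math

def cuped_gain(corr: float) -> dict:
    """Translate corr(Y, X) into the gains implied by Var(Y) * (1 - corr^2)."""
    var_reduction = corr ** 2                         # share of variance removed
    se_reduction = 1 - math.sqrt(1 - var_reduction)   # SE scales with sqrt(variance)
    sample_fraction = 1 - var_reduction               # sample needed for the same SE
    return {
        "var_reduction": var_reduction,
        "se_reduction": se_reduction,
        "sample_fraction": sample_fraction,
    }

for corr in (0.3, 0.5, 0.7, 0.9):
    gain = cuped_gain(corr)
    print(
        f"corr={corr:.1f}  variance -{gain['var_reduction']:.0%}  "
        f"SE -{gain['se_reduction']:.0%}  sample needed x{gain['sample_fraction']:.2f}"
    )
```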

The CUPED algorithm step by step

Pick a covariate X — the same or a closely related metric measured before the experiment. Compute theta on the pooled dataset across treatment and control: theta = Cov(Y, X) / Var(X). Build the adjusted metric per user: Y_cuped_i = Y_i - theta * (X_i - mean(X)).

Run your usual statistical test on Y_cuped instead of Y. The mean difference is preserved in expectation; the standard error shrinks. A 95 percent CI of [0.92, 2.70] on the raw metric might collapse to [1.49, 2.53] on the CUPED metric.
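The steps above can be wrapped in a small helper. This is a minimal sketch, assuming per-user NumPy arrays pooled across both arms; the name `cuped_adjust` and the synthetic data are illustrative assumptions.

```python
import numpy as np

def cuped_adjust(y, x):
    """Return (y_cuped, theta) for pooled per-user arrays y and x."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    # One theta from the pooled data; same ddof in numerator and denominator.
    theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean()), theta

# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
n = 4000
treated = rng.random(n) < 0.5
baseline = rng.normal(50, 20, n)
x = baseline + rng.normal(0, 10, n)                  # pre-period metric
y = baseline + rng.normal(0, 10, n) + 2.0 * treated  # in-experiment metric

y_cuped, theta = cuped_adjust(y, x)
diff_raw = y[treated].mean() - y[~treated].mean()
diff_cuped = y_cuped[treated].mean() - y_cuped[~treated].mean()
print(f"theta={theta:.3f}  raw diff={diff_raw:.3f}  CUPED diff={diff_cuped:.3f}")
print(f"Var(Y)={y.var():.1f}  Var(Y_cuped)={y_cuped.var():.1f}")
```

The adjusted array can then be passed to whatever test the pipeline already runs on the raw metric.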

How to choose covariates

The rule: the covariate must be measured strictly before the experiment and must not be a function of group assignment. Once that holds, pick the signal that correlates most strongly with the outcome.

The same metric over the pre-period is almost always best. If Y is revenue across a 14 day test, X is revenue across the 14 days before. Session count for a session-count outcome. Order count for order count. Anything else and the correlation usually falls off a cliff.
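When several pre-period signals are available, ranking them by correlation with the outcome is a quick way to check this rule of thumb. The sketch below uses made-up candidate names (`pre_revenue`, `pre_sessions`, `account_age_days`) and synthetic data; every candidate is assumed to be measured strictly before the experiment.

```python
import numpy as np

def rank_covariates(y, candidates):
    """Sort candidate covariates by |corr(y, x)|, strongest first."""
    scores = [(name, float(np.corrcoef(y, x)[0, 1])) for name, x in candidates.items()]
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)

# Synthetic data, for illustration only.
rng = np.random.default_rng(2)
n = 4000
baseline = rng.normal(50, 20, n)
y = baseline + rng.normal(0, 10, n)  # in-experiment revenue
candidates = {
    "pre_revenue": baseline + rng.normal(0, 10, n),        # same metric, pre-period
    "pre_sessions": 0.1 * baseline + rng.normal(0, 5, n),  # related but noisier
    "account_age_days": rng.normal(400, 100, n),           # unrelated to spend here
}

for name, corr in rank_covariates(y, candidates):
    print(f"{name:18s} corr={corr:+.3f}  variance reduction={corr**2:.1%}")
```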

Pre-period length should match experiment length. Too short loses precision on the covariate; too long dilutes the correlation as behavior drifts.

CUPED generalizes to multiple covariates via OLS of Y ~ X1 + X2 + ... + Xk, but the first one or two capture most of the gain. A third covariate rarely justifies the complexity.
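A multi-covariate version can be sketched with ordinary least squares on centered covariates. The function name `cuped_multi` and the second covariate are assumptions for illustration. Note that, measured on the same data used to fit the coefficients, adding a covariate can never increase the adjusted variance, so the in-sample gain overstates the real benefit slightly.

```python
import numpy as np

def cuped_multi(y, covariates):
    """Adjust y with several pre-period covariates (columns of `covariates`)."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(covariates, dtype=float)
    Xc = X - X.mean(axis=0)  # center each covariate on its pooled mean
    # OLS of centered y on centered covariates gives one theta per column
    thetas, *_ = np.linalg.lstsq(Xc, y - y.mean(), rcond=None)
    return y - Xc @ thetas, thetas

# Synthetic data, for illustration only.
rng = np.random.default_rng(3)
n = 4000
baseline = rng.normal(50, 20, n)
y = baseline + rng.normal(0, 10, n)
pre_revenue = baseline + rng.normal(0, 10, n)
pre_orders = 0.05 * baseline + rng.normal(0, 1, n)
X = np.column_stack([pre_revenue, pre_orders])

y_one, _ = cuped_multi(y, X[:, :1])  # first covariate only
y_two, thetas = cuped_multi(y, X)    # both covariates
print(f"Var(Y)={y.var():.1f}  one covariate={y_one.var():.1f}  two covariates={y_two.var():.1f}")
```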

New users have no pre-period behavior. Set their covariate to the population mean so the adjustment collapses to zero. The alternative is to analyze new and existing users separately.
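A minimal sketch of the mean-imputation approach, assuming missing pre-period values are encoded as NaN (an assumption about the data layout, not something the method requires):

```python
import numpy as np

def fill_new_users(x):
    """Replace missing pre-period values with the mean of the observed ones."""
    x = np.asarray(x, dtype=float)
    return np.where(np.isnan(x), np.nanmean(x), x)

x = np.array([40.0, 60.0, np.nan, 50.0, np.nan])  # NaN marks a new user
x_filled = fill_new_users(x)
centered = x_filled - x_filled.mean()  # the term that gets multiplied by theta

print(x_filled)
print(centered)
```

Filling with the observed mean leaves the overall mean unchanged, so the centered covariate is exactly zero for new users and their adjusted metric equals the raw one.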

Python worked example

Simulate 10 thousand users with a true effect of plus 2 dollars per user, then compare CIs before and after CUPED.

import numpy as np
from scipy import stats

np.random.seed(42)
n_control = 5000
n_treatment = 5000

# Each user has a latent "willingness to spend"
user_baseline_control = np.random.normal(50, 20, n_control)
user_baseline_treatment = np.random.normal(50, 20, n_treatment)

# Pre-period metric: revenue before the experiment
X_control = user_baseline_control + np.random.normal(0, 10, n_control)
X_treatment = user_baseline_treatment + np.random.normal(0, 10, n_treatment)

# In-experiment metric: treatment adds +2 to revenue per user
noise_control = np.random.normal(0, 10, n_control)
noise_treatment = np.random.normal(0, 10, n_treatment)

Y_control = user_baseline_control + noise_control
Y_treatment = user_baseline_treatment + 2.0 + noise_treatment

# Raw t-test
t_stat, p_value = stats.ttest_ind(Y_treatment, Y_control)
diff_raw = Y_treatment.mean() - Y_control.mean()
se_raw = np.sqrt(Y_control.var()/n_control + Y_treatment.var()/n_treatment)

print("Without CUPED:")
print(f"  Mean diff: {diff_raw:.3f}")
print(f"  SE: {se_raw:.3f}")
print(f"  95% CI: [{diff_raw - 1.96*se_raw:.3f}, {diff_raw + 1.96*se_raw:.3f}]")
print(f"  p-value: {p_value:.4f}")

# CUPED adjustment
X_all = np.concatenate([X_control, X_treatment])
Y_all = np.concatenate([Y_control, Y_treatment])

# np.cov uses the sample (n-1) estimator by default, so use ddof=1 in np.var to match
theta = np.cov(Y_all, X_all)[0, 1] / np.var(X_all, ddof=1)
X_mean = X_all.mean()

Y_cuped_control = Y_control - theta * (X_control - X_mean)
Y_cuped_treatment = Y_treatment - theta * (X_treatment - X_mean)

t_stat_c, p_value_c = stats.ttest_ind(Y_cuped_treatment, Y_cuped_control)
diff_c = Y_cuped_treatment.mean() - Y_cuped_control.mean()
se_c = np.sqrt(
    Y_cuped_control.var()/n_control + Y_cuped_treatment.var()/n_treatment
)

print("\nWith CUPED:")
print(f"  theta: {theta:.3f}")
print(f"  Mean diff: {diff_c:.3f}")
print(f"  SE: {se_c:.3f}")
print(f"  95% CI: [{diff_c - 1.96*se_c:.3f}, {diff_c + 1.96*se_c:.3f}]")
print(f"  p-value: {p_value_c:.4f}")

reduction = 1 - (se_c / se_raw)
print(f"\nSE reduction: {reduction:.1%}")

corr = np.corrcoef(Y_all, X_all)[0, 1]
print(f"corr(Y, X): {corr:.3f}")
print(f"Theoretical variance reduction: {corr**2:.1%}")

The output on seed 42:

Without CUPED:
  Mean diff: 1.808
  SE: 0.452
  95% CI: [0.922, 2.695]
  p-value: 0.0001

With CUPED:
  theta: 0.819
  Mean diff: 2.013
  SE: 0.266
  95% CI: [1.492, 2.534]
  p-value: 0.0000

SE reduction: 41.2%
corr(Y, X): 0.808
Theoretical variance reduction: 65.3%

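The CI narrows by 41 percent, from a width of 1.77 down to 1.04. That is consistent with the 65 percent variance reduction: standard error scales with the square root of variance, and 1 - sqrt(1 - 0.653) is about 0.41. The point estimate stays close to the true plus 2 effect, as expected when CUPED removes noise without adding bias.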

When CUPED does not help

No pre-period data. If every user is new — a landing page test for paid acquisition — there is no history to anchor the covariate, and CUPED does not apply.

Low correlation. If corr(Y, X) is below 0.3, variance reduction is under 9 percent. Not worth the pipeline complexity.

Non-stationary metrics. If pre-period and experiment straddle a regime change — Black Friday, a product launch — the correlation collapses. CUPED stays unbiased but the variance reduction evaporates.

Ratio metrics. Conversion rate is orders / visits. Applying CUPED directly to a user-level ratio is incorrect because the variance of a ratio is not the ratio of variances. The fix is the delta method to linearize, then CUPED on the linearized series.
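One common way to linearize a ratio is to replace each user's pair (orders, visits) with orders minus a reference ratio times visits, then run CUPED on that per-user series. The sketch below follows that idea. The choice of the control-arm ratio as the reference, the pooled pre-period ratio for the covariate, and all names and data are assumptions for illustration, not the only valid recipe.

```python
import numpy as np

def linearize_ratio(num, den, ratio_ref):
    """Per-user linearized value: num_i - ratio_ref * den_i."""
    return np.asarray(num, dtype=float) - ratio_ref * np.asarray(den, dtype=float)

# Synthetic data, for illustration only.
rng = np.random.default_rng(3)
n = 5000
treated = rng.random(n) < 0.5
propensity = rng.beta(2, 8, n)  # latent per-user conversion propensity
pre_visits = rng.poisson(5, n) + 1
pre_orders = rng.binomial(pre_visits, propensity)
visits = rng.poisson(5, n) + 1
orders = rng.binomial(visits, np.clip(propensity + 0.02 * treated, 0, 1))

# Reference ratios: control arm for the outcome, pooled pre-period for the covariate
ratio_ctrl = orders[~treated].sum() / visits[~treated].sum()
pre_ratio = pre_orders.sum() / pre_visits.sum()
lin_y = linearize_ratio(orders, visits, ratio_ctrl)
lin_x = linearize_ratio(pre_orders, pre_visits, pre_ratio)

# Standard CUPED on the linearized series
theta = np.cov(lin_y, lin_x)[0, 1] / np.var(lin_x, ddof=1)
lin_y_cuped = lin_y - theta * (lin_x - lin_x.mean())

diff = lin_y_cuped[treated].mean() - lin_y_cuped[~treated].mean()
print(f"theta={theta:.3f}")
print(f"approx. effect on the ratio scale: {diff / visits.mean():.4f}")
print(f"Var(lin_y)={lin_y.var():.3f}  Var(lin_y_cuped)={lin_y_cuped.var():.3f}")
```

By construction the linearized outcome averages to zero in the control arm, so the difference of means carries the effect; dividing by the mean visits per user puts it back on the ratio scale approximately.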

Broken randomization. CUPED corrects variance, not bias. If groups already differ on the covariate, the adjustment shifts the point estimate by theta times the gap in covariate means, which can mask the underlying problem rather than fix it. Inspect group balance on the covariate before trusting the adjusted estimate.
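A covariate balance check can be as simple as a standardized mean difference between the arms. The sketch below is one possible check. The function name and the synthetic data are assumptions, and any alerting threshold is a judgment call for the team rather than a fixed rule.

```python
import numpy as np

def standardized_mean_difference(x_treatment, x_control):
    """Difference in covariate means, in units of the pooled standard deviation."""
    xt = np.asarray(x_treatment, dtype=float)
    xc = np.asarray(x_control, dtype=float)
    pooled_sd = np.sqrt((xt.var(ddof=1) + xc.var(ddof=1)) / 2)
    return float((xt.mean() - xc.mean()) / pooled_sd)

# Synthetic data, for illustration only.
rng = np.random.default_rng(4)
n = 5000
x_control = rng.normal(50, 22, n)
x_treatment_ok = rng.normal(50, 22, n)   # properly randomized arm
x_treatment_bad = rng.normal(60, 22, n)  # arm that skews toward heavier spenders

smd_ok = standardized_mean_difference(x_treatment_ok, x_control)
smd_bad = standardized_mean_difference(x_treatment_bad, x_control)
print(f"balanced arms:   SMD={smd_ok:+.3f}")
print(f"imbalanced arms: SMD={smd_bad:+.3f}")
```

With thousands of users per arm, a healthy randomization should give a value very close to zero; a clearly non-zero value is a reason to investigate assignment before reading any adjusted result.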

CUPED versus other variance reduction methods

| Method | Idea | Variance reduction | Complexity | Constraints |
| --- | --- | --- | --- | --- |
| CUPED | Subtract a pre-period covariate adjustment | 20 to 80 percent depending on correlation | Medium | Needs pre-period data |
| Post-stratification | Compute weighted estimate over strata | 10 to 30 percent | Low | Limited by number of strata |
| CUPAC | Regression adjustment with ML-predicted covariate | 30 to 90 percent | High | Risk of leakage if model is fit on test data |
| Winsorization | Cap extreme values | 5 to 40 percent | Low | Loses information in the tails |
| Delta method | Linearize a ratio metric | Depends on metric | Medium | Only for ratio metrics |

CUPED versus stratification. Post-stratification slices users into buckets by stable attributes — platform, country, plan tier — and computes a weighted effect. Simpler to implement, smaller variance reduction. They compose: stratify first, then CUPED inside each stratum.
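The weighted effect described above can be sketched in a few lines. The function name is hypothetical, and the sketch assumes every stratum contains users from both arms and that strata are weighted by their share of all users.

```python
import numpy as np

def post_stratified_effect(y, treated, strata):
    """Weighted average of per-stratum mean differences (weights = stratum share)."""
    y = np.asarray(y, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    strata = np.asarray(strata)
    effect = 0.0
    for s in np.unique(strata):
        in_s = strata == s
        weight = in_s.mean()  # share of all users in this stratum
        diff = y[in_s & treated].mean() - y[in_s & ~treated].mean()
        effect += weight * diff
    return float(effect)

# Tiny hand-checkable example: effect is 2 in stratum "a" and 3 in stratum "b".
y = [10, 12, 10, 12, 100, 103, 100, 103]
treated = [0, 1, 0, 1, 0, 1, 0, 1]
strata = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(post_stratified_effect(y, treated, strata))
```

In this toy input both strata hold half the users, so the estimate is 0.5 * 2 + 0.5 * 3 = 2.5.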

CUPED versus CUPAC. CUPAC replaces the raw pre-period metric with the prediction of an ML model trained on pre-period features. The predicted score is a stronger covariate because the model captures non-linearities. The cost is operational: managing the model, preventing leakage, explaining variance shifts across re-trainings.

Common pitfalls

The most frequent pitfall is computing theta on treatment and control separately and applying each group's theta to its own users. That is two unrelated regressions, not CUPED, and the estimate is biased because the groups are no longer on the same scale. Always compute one theta from the pooled data and apply it to every user.

A second trap is letting the covariate be a function of the treatment. The classic version uses "revenue in the first three days of the experiment" as X and "revenue across the full two weeks" as Y. Both are post-randomization, independence breaks, the estimate is biased. The covariate window must end strictly before the experiment window begins.

A subtler pitfall is reporting segment-level Y_cuped raw means. Y_cuped has the same expected group difference as Y but a different scale of levels — a segment with high X can show negative Y_cuped in both arms. Report the difference of means rather than raw means when communicating CUPED results.

Skipping validation is its own pitfall. Teams plug CUPED into a pipeline and never check whether the variance reduction was worth the engineering cost. A one-line check per experiment — print theta, the realized SE reduction, and the correlation — keeps everyone honest.
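Such a check might look like the sketch below. The function name `cuped_report` and the returned keys are illustrative assumptions; the standard error here is the unpooled two-sample form.

```python
import numpy as np

def cuped_report(y, x, treated):
    """Return theta, corr(Y, X), and the realized SE reduction for one experiment."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
    y_cuped = y - theta * (x - x.mean())

    def diff_se(values):
        # Standard error of the difference in means between the two arms
        t, c = values[treated], values[~treated]
        return np.sqrt(t.var(ddof=1) / t.size + c.var(ddof=1) / c.size)

    return {
        "theta": float(theta),
        "corr": float(np.corrcoef(y, x)[0, 1]),
        "se_reduction": float(1 - diff_se(y_cuped) / diff_se(y)),
    }

# Synthetic data, for illustration only.
rng = np.random.default_rng(6)
n = 4000
treated = rng.random(n) < 0.5
baseline = rng.normal(50, 20, n)
x = baseline + rng.normal(0, 10, n)
y = baseline + rng.normal(0, 10, n) + 2.0 * treated

report = cuped_report(y, x, treated)
print(", ".join(f"{key}={value:.3f}" for key, value in report.items()))
```

Logging these three numbers per experiment makes it easy to spot metrics where the realized reduction is too small to justify the adjustment.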

The last pitfall is treating CUPED as a substitute for design quality. If the experiment is underpowered, has broken randomization, or measures the wrong metric, CUPED will not save it. Variance reduction is a multiplier on sound design, not a rescue mission.

Interview questions on CUPED

"Explain CUPED in plain language." Users differ before the experiment starts. Some spend a hundred dollars a week, some ten thousand. That spread drowns out the treatment effect. CUPED subtracts the predictable part of the spread using pre-experiment behavior. You are stripping noise the treatment did not cause.

"Why is CUPED unbiased?" The covariate is measured before the experiment and is independent of assignment. Under random assignment, E[X] matches in both groups, so subtracting a function of X - E[X] does not shift the group-mean difference.

"How do you pick the covariate?" The same metric over a pre-period of equal length to the test. Try a few candidates and keep the one with the highest correlation, as long as each is measured strictly before the experiment window.

"What about new users?" Set their covariate to the population mean so the adjustment collapses to zero. The alternative is to split the analysis into new versus existing users.

"Can you apply CUPED to conversion rate?" Not directly — conversion is a ratio metric. Use the delta method to linearize it, then apply CUPED. If the denominator is fixed by design, apply CUPED to the numerator alone.

"How is CUPED different from adding a covariate to OLS?" Mathematically the same — CUPED is OLS of Y ~ treatment + X rewritten to slot into a t-test pipeline. The framing separates metric adjustment from the test step.

"When is CUPED useless?" When corr(Y, X) is below 0.3, when there is no pre-period data, when the metric distribution shifts dramatically between pre-period and experiment, or when pre-period behavior carries no signal for the test window.

FAQ

What is CUPED in A/B testing?

CUPED — Controlled-experiment Using Pre-Experiment Data — uses each user's behavior before the experiment to strip predictable noise out of the in-experiment metric. The confidence interval tightens at the same sample size, which translates to 20 to 50 percent shorter experiments or smaller minimum detectable effects.

How much variance does CUPED remove?

The variance reduction equals corr(Y, X)^2. A correlation of 0.7 cuts variance by 49 percent and 0.9 cuts it by 81 percent. Correlations of 0.5 to 0.8 are typical for revenue and session-count metrics over short test windows.

How do you choose the covariate?

The same metric over a pre-period of equal length to the experiment is almost always the best covariate. If the test runs 14 days, use the same metric across the 14 days before. The covariate must be measured strictly before the experiment.

Can CUPED be used for conversion rate?

Not directly — conversion is a ratio metric and the variance algebra breaks. Use the delta method to linearize the ratio, then apply CUPED to the linearized series. When the denominator is fixed by design, apply CUPED to the numerator alone.

When does CUPED fail to deliver?

When there are no pre-period data, when corr(Y, X) is below 0.3, or when seasonality or external shocks make the pre-period a poor predictor of in-experiment behavior. CUPED also cannot rescue an experiment with broken randomization, since it corrects variance but not bias.

Is CUPED the same as adding a covariate to OLS?

Essentially yes. CUPED is equivalent, up to small finite-sample differences in how theta is estimated, to OLS of Y ~ treatment + X. CUPED is preferred when the platform already runs t-tests on the adjusted metric, because it separates metric adjustment from the test step.
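The relationship to OLS can be checked numerically. The sketch below fits Y ~ treatment + X with NumPy's least-squares solver on synthetic data (the data-generating process is an assumption) and compares the treatment coefficient with the CUPED difference in means.

```python
import numpy as np

# Synthetic data, for illustration only.
rng = np.random.default_rng(5)
n = 10000
treated = rng.random(n) < 0.5
baseline = rng.normal(50, 20, n)
x = baseline + rng.normal(0, 10, n)
y = baseline + rng.normal(0, 10, n) + 2.0 * treated

# CUPED: pooled theta, then difference in adjusted means
theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
y_cuped = y - theta * (x - x.mean())
cuped_effect = y_cuped[treated].mean() - y_cuped[~treated].mean()

# OLS: columns are intercept, treatment indicator, covariate
design = np.column_stack([np.ones(n), treated.astype(float), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
ols_effect = coef[1]

print(f"CUPED effect: {cuped_effect:.4f}   theta:       {theta:.4f}")
print(f"OLS effect:   {ols_effect:.4f}   OLS slope on X: {coef[2]:.4f}")
```

The two effect estimates should agree to within a small fraction of their standard error; they are not bit-identical because CUPED estimates theta without the treatment indicator in the regression.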