Variance reduction techniques for A/B tests
Why variance reduction matters
In an A/B test you compare the mean of a metric between treatment and control. The larger the variance, the harder it is to detect an effect. The standard answer — collect more samples — sounds cheap until you remember that traffic is finite. Senior data scientists at Netflix, Meta, Microsoft, and DoorDash run thousands of experiments per quarter and each one fights for a slice of the same user pool. Variance reduction buys back power without asking product to slow the roadmap.
The idea behind every variance reduction method is the same: identify variation in the outcome that is unrelated to the treatment, and subtract it out before computing the test statistic. The treatment effect is preserved because the subtracted noise is independent of random assignment. The confidence interval shrinks because you stripped away variation that was never about the experiment. In practice you see 20 to 50 percent variance cuts on stable metrics like revenue per user, sessions, or time spent. Required sample size scales with the variance and the minimum detectable effect scales with its square root, so a 50 percent cut means running the test in half the time, or detecting an effect roughly 30 percent smaller at the same sample size.
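The arithmetic behind that last claim can be sketched in a few lines. This is a minimal illustration, assuming the usual two-sample power approximation in which required sample size is proportional to the metric variance; the function names are hypothetical.

```python
import math


def sample_size_ratio(variance_reduction: float) -> float:
    """Fraction of the original sample needed for the same power."""
    return 1.0 - variance_reduction


def mde_ratio(variance_reduction: float) -> float:
    """Fraction of the original minimum detectable effect at a fixed sample size."""
    return math.sqrt(1.0 - variance_reduction)


for cut in (0.20, 0.35, 0.50):
    print(
        f"variance cut={cut:.0%}  "
        f"sample needed={sample_size_ratio(cut):.0%} of original  "
        f"MDE={mde_ratio(cut):.1%} of original"
    )
```

At the low end of the range the gain is modest: a 20 percent variance cut shrinks the detectable effect by only about 11 percent.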
Once a platform standardizes on a method like CUPED, every future experiment benefits with zero extra work from the experimenter. This is one of the highest-leverage investments in an experimentation program, and it shows up in senior data scientist interviews at Stripe, Airbnb, and Uber as a way to separate candidates who run experiments from candidates who understand them.
CUPED — the workhorse method
CUPED stands for Controlled-experiment Using Pre-Experiment Data. Microsoft published the original paper in 2013 (Deng, Xu, Kohavi, Walker) and the method has become the default across mature experimentation platforms. The intuition: if a user spent $200 per week before the experiment, they are likely to spend a lot during the experiment too. That baseline has nothing to do with assignment, so it is pure noise from the experiment's perspective. Strip it out and you are left with cleaner signal.
The adjusted outcome is Y_adjusted = Y - theta * (X - mean(X)), where Y is the experiment-window metric, X is a pre-experiment covariate, and theta = Cov(Y, X) / Var(X), estimated on the pooled treatment and control data. The mean of Y_adjusted equals the mean of Y, so the treatment effect estimate stays unbiased. The variance becomes Var(Y) * (1 - rho^2), where rho is the Pearson correlation between Y and X, so the fractional reduction is rho^2. A correlation of 0.7 yields a 49 percent variance reduction. A correlation of 0.3 yields only 9 percent: still free, but less dramatic.
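A quick numerical check of those two properties, as a sketch on synthetic data (the distribution and coefficients below are assumptions chosen so that the correlation is close to 0.7):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(0.0, 1.0, n)                        # synthetic pre-period covariate
y = 0.7 * x + rng.normal(0.0, np.sqrt(0.51), n)    # outcome with corr(x, y) near 0.7

theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
y_adj = y - theta * (x - x.mean())

rho = np.corrcoef(y, x)[0, 1]
observed = 1 - y_adj.var(ddof=1) / y.var(ddof=1)

print(f"mean(Y)={y.mean():.4f}  mean(Y_adjusted)={y_adj.mean():.4f}")
print(f"rho={rho:.3f}  rho^2={rho**2:.3f}  observed variance reduction={observed:.3f}")
```

The observed reduction matches rho^2 up to floating-point error, because with the least-squares theta the identity holds in-sample, not just in expectation.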
CUPED shines when the metric is stable and a pre-period exists for most users — revenue per user, sessions per day, watch time. It breaks when there is no pre-period behavior to anchor on: brand new users, conversion to first purchase, or experiments on features that did not exist before. For a deeper dive into the algorithm, see CUPED variance reduction in A/B tests.
Stratified randomization
Stratified randomization balances the distribution of users across treatment and control on observable characteristics before randomization. Without it, simple Bernoulli randomization can accidentally over-assign mobile users to treatment, or heavy spenders to control. With large samples these imbalances are small but non-zero, and on noisy metrics they meaningfully inflate variance.
The mechanics: partition users into strata defined by features that correlate with the outcome, such as device class, platform, country, or tenure bucket. Inside each stratum, assign half of the users to each arm, for example with permuted blocks, rather than flipping an independent coin per user; independent coin flips leave chance imbalances inside each stratum. Every stratum is then balanced across treatment and control, up to one user when the stratum size is odd. The fractional variance reduction on the stratified estimator is roughly R^2_strata, the fraction of outcome variance explained by stratum membership.
In practice this buys 5 to 20 percent variance reduction on most metrics, less than CUPED but stackable. Implementation is harder than CUPED because the randomization service has to know stratum membership at assignment time. Most internal A/B platforms at Snowflake, Airbnb, and Linear support a small list of stratification keys (country, device, locale) and not arbitrary ones.
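As an illustration of the mechanics, here is a minimal batch sketch that splits every stratum evenly. It assumes all users are known up front; a live assignment service would more likely use permuted blocks per stratum. The function name and the toy user table are hypothetical.

```python
import numpy as np
import pandas as pd


def stratified_assign(df: pd.DataFrame, strata: list[str], seed: int = 0) -> pd.Series:
    """Assign treatment (1) or control (0) so each stratum is split as evenly as possible."""
    rng = np.random.default_rng(seed)
    assignment = pd.Series(0, index=df.index, dtype=int)
    for _, idx in df.groupby(strata).groups.items():
        # Half ones, half zeros (an odd-sized stratum gets one extra control), then shuffle.
        labels = np.zeros(len(idx), dtype=int)
        labels[: len(idx) // 2] = 1
        assignment.loc[idx] = rng.permutation(labels)
    return assignment


users = pd.DataFrame({
    "device": ["ios", "ios", "android", "android", "web", "web", "web", "web"],
    "country": ["US", "US", "US", "US", "DE", "DE", "DE", "DE"],
})
users["treat"] = stratified_assign(users, ["device", "country"])

# Treated count and total per stratum: each stratum is split 50/50.
print(users.groupby(["device", "country"])["treat"].agg(["sum", "count"]))
```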
Post-stratification
Post-stratification achieves the same goal without touching the assignment logic. The experiment runs with plain Bernoulli randomization, and the analysis stage reweights the data so that each stratum carries its population share. The estimator becomes ATE_post = sum over strata of weight_strata * (mean_T_strata - mean_C_strata).
The advantage: you can apply it to experiments already in flight, or retroactively to historical ones. No coordination with the platform team. The disadvantage: you cannot stratify on a feature you did not record at assignment time, and the variance reduction is slightly smaller than pre-experiment stratification because the strata are reweighted, not perfectly balanced.
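The post-stratified estimator can be sketched in pandas as follows. The column names, the synthetic platform baselines, and the true effect of 2.0 are assumptions made for the illustration, and the weights are taken as each stratum's share of the experiment sample.

```python
import numpy as np
import pandas as pd


def post_stratified_ate(df: pd.DataFrame, outcome: str, treat: str, stratum: str) -> float:
    """Weighted sum over strata of (treatment mean - control mean)."""
    cell_means = df.groupby([stratum, treat])[outcome].mean().unstack(treat)
    diffs = cell_means[1] - cell_means[0]                  # per-stratum effect
    weights = df[stratum].value_counts(normalize=True)     # stratum share of the sample
    return float((weights.reindex(diffs.index) * diffs).sum())


# Synthetic experiment: platform shifts the baseline, treatment adds 2.0.
rng = np.random.default_rng(1)
n = 20_000
platform = rng.choice(["ios", "android", "web"], size=n, p=[0.5, 0.3, 0.2])
treat = rng.integers(0, 2, n)
baseline = pd.Series(platform).map({"ios": 50.0, "android": 30.0, "web": 10.0}).to_numpy()
y = baseline + 2.0 * treat + rng.normal(0, 5, n)
df = pd.DataFrame({"platform": platform, "treat": treat, "y": y})

naive = df.loc[df["treat"] == 1, "y"].mean() - df.loc[df["treat"] == 0, "y"].mean()
post = post_stratified_ate(df, "y", "treat", "platform")
print(f"naive difference in means: {naive:.2f}")
print(f"post-stratified estimate:  {post:.2f}")
```

Both estimators target the same effect; the post-stratified one is less sensitive to chance imbalance in the platform mix between arms.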
A common production pattern is to combine post-stratification on country and platform with CUPED on the metric covariate. The adjustments are nearly orthogonal, so their reductions compound on the remaining variance rather than adding: 40 percent from CUPED and 10 percent from post-stratification leave 0.60 * 0.90 = 0.54 of the original variance, about 46 percent total.
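The 46 percent figure comes from multiplying the fractions of variance that remain after each adjustment. A small sketch, assuming the adjustments are orthogonal (the helper name is hypothetical):

```python
def combined_reduction(*reductions: float) -> float:
    """Total variance reduction when each adjustment removes a share of what remains."""
    remaining = 1.0
    for r in reductions:
        remaining *= 1.0 - r
    return 1.0 - remaining


print(f"{combined_reduction(0.40, 0.10):.0%}")  # 46%
```

When the covariates overlap in what they explain, the combined gain is smaller than this product suggests.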
Regression adjustment
Regression adjustment generalizes both CUPED and stratification. Fit Y = beta_0 + beta_1 * T + beta_2 * X_1 + beta_3 * X_2 + ... + epsilon, where T is the treatment indicator and X_i are covariates. The coefficient beta_1 is the treatment effect adjusted for everything in the model. CUPED is the special case of one continuous covariate; stratification is the special case of one categorical covariate as dummies.
The cost is interpretability and a small risk of bias. With many covariates and modest sample sizes the regression can overfit. The Lin (2013) result shows that with centered covariates and full treatment-by-covariate interactions the regression estimator is consistent, with a finite-sample bias that vanishes as the sample grows, and asymptotically at least as efficient as the simple difference in means. In practice keep the covariate list short, pre-register it before unblinding, and rerun sensitivity checks.
Regression adjustment is most useful when several strong predictors exist that no single CUPED covariate captures — pre-period spend, device class, geography, tenure bucket combined. It is also the natural framework for heterogeneous treatment effects (CATE).
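A minimal sketch of the interacted regression, using plain least squares from NumPy. The function name, the two synthetic covariates, and the true effect of 5.0 are assumptions for the illustration. Standard errors are omitted; in practice they should be heteroskedasticity-robust.

```python
import numpy as np


def lin_adjusted_effect(y: np.ndarray, t: np.ndarray, X: np.ndarray) -> float:
    """Coefficient on T from Y ~ 1 + T + Xc + T * Xc, with Xc the centered covariates."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    Xc = X - X.mean(axis=0)  # centering makes the coefficient on T the average effect
    design = np.column_stack([np.ones(len(y)), t, Xc, t[:, None] * Xc])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(beta[1])


# Synthetic data: two predictive covariates, true treatment effect of 5.0.
rng = np.random.default_rng(7)
n = 20_000
x1 = rng.normal(0, 1, n)          # e.g. standardized pre-period spend
x2 = rng.normal(0, 1, n)          # e.g. standardized tenure
t = rng.integers(0, 2, n)
y = 50 + 5.0 * t + 8.0 * x1 + 4.0 * x2 + rng.normal(0, 10, n)

raw = y[t == 1].mean() - y[t == 0].mean()
adjusted = lin_adjusted_effect(y, t, np.column_stack([x1, x2]))
print(f"difference in means: {raw:.2f}")
print(f"regression-adjusted: {adjusted:.2f}")
```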
Which method to choose
| Situation | Method |
|---|---|
| Stable metric with strong pre-period data | CUPED |
| Need balance on device or country | Pre-experiment stratification |
| Experiment already running, balance is off | Post-stratification |
| Multiple strong predictors | Regression adjustment |
| Studying heterogeneous effects | Regression with interactions |
| Brand new users, no pre-period | None of the above — increase sample |
The default at most large experimentation programs is CUPED for the primary metric, layered with stratification or post-stratification on one or two structural keys (platform, country). Regression adjustment is reserved for cases where the covariate set is rich and the team has bandwidth to validate the model carefully.
A worked Python example
Here is a minimal CUPED implementation that you can adapt directly:
```python
import numpy as np
import pandas as pd


def cuped_adjust(df: pd.DataFrame, y: str, x: str) -> pd.Series:
    """Return CUPED-adjusted outcome y using covariate x."""
    # theta = Cov(Y, X) / Var(X), estimated on the pooled sample
    theta = np.cov(df[y], df[x], ddof=1)[0, 1] / np.var(df[x], ddof=1)
    return df[y] - theta * (df[x] - df[x].mean())


# Simulated experiment data
rng = np.random.default_rng(42)
n = 10_000
pre = rng.normal(100, 30, n)  # pre-period spend
treat = rng.integers(0, 2, n)  # random 50/50 assignment
post = pre * 0.8 + rng.normal(0, 20, n) + 5 * treat  # true effect is 5

df = pd.DataFrame({"treat": treat, "post": post, "pre": pre})
df["post_adj"] = cuped_adjust(df, y="post", x="pre")

# Compare within-arm variance before and after the adjustment
raw_var = df.groupby("treat")["post"].var().mean()
adj_var = df.groupby("treat")["post_adj"].var().mean()
print(f"variance reduction: {1 - adj_var / raw_var:.1%}")
```

You should see a variance cut of about 59 percent in this synthetic setup: within each arm the raw variance is roughly 0.8^2 * 30^2 + 20^2 = 976, and the adjusted variance is roughly 20^2 = 400. That is optimistic compared to production but illustrates the math. In production, simulate on AA data first to confirm the covariate works before applying it live.
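To see what the variance cut means for the readout, the following sketch regenerates the same synthetic data and compares the raw and CUPED-adjusted effect estimates with normal-approximation 95 percent confidence intervals. The helper `diff_in_means` is a hypothetical name, and the interval treats theta as fixed, which is a common large-sample simplification.

```python
import numpy as np


def diff_in_means(y: np.ndarray, t: np.ndarray) -> tuple[float, float]:
    """Difference in means between arms and its standard error (unequal variances)."""
    y1, y0 = y[t == 1], y[t == 0]
    est = y1.mean() - y0.mean()
    se = np.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
    return float(est), float(se)


rng = np.random.default_rng(42)
n = 10_000
pre = rng.normal(100, 30, n)
treat = rng.integers(0, 2, n)
post = 0.8 * pre + rng.normal(0, 20, n) + 5 * treat  # true effect is 5

theta = np.cov(post, pre, ddof=1)[0, 1] / np.var(pre, ddof=1)
post_adj = post - theta * (pre - pre.mean())

for name, y in (("raw", post), ("cuped", post_adj)):
    est, se = diff_in_means(y, treat)
    print(f"{name:>5}: effect={est:.2f}  95% CI=({est - 1.96 * se:.2f}, {est + 1.96 * se:.2f})")
```

Both intervals are centered near the true effect; the adjusted one is narrower.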
Common pitfalls
The first trap is using a covariate that is correlated with the treatment itself. CUPED and regression adjustment both assume the covariate is measured strictly before randomization. If you accidentally use a metric from inside the experiment window — even partially — you introduce a feedback loop that biases the treatment effect estimate. The fix is to lock the covariate definition to a pre-period that ends before any user has been assigned, and to validate this on the experiment dataset by computing the correlation between the covariate and the treatment indicator. It should be statistically indistinguishable from zero.
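The balance check described above can be sketched as a correlation test between the covariate and the treatment indicator. The function name is hypothetical, the p-value uses a large-sample normal approximation, and the "leaky" covariate is a deliberately contaminated example.

```python
import math

import numpy as np


def covariate_balance(x: np.ndarray, t: np.ndarray) -> tuple[float, float]:
    """Correlation between covariate and treatment, with a two-sided p-value."""
    r = float(np.corrcoef(x, t)[0, 1])
    n = len(x)
    z = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)  # approximately N(0, 1) for large n
    p = math.erfc(abs(z) / math.sqrt(2))
    return r, p


rng = np.random.default_rng(3)
n = 10_000
treat = rng.integers(0, 2, n)
clean = rng.normal(100, 30, n)          # measured strictly before assignment
leaky = clean + 10 * treat              # window overlaps the experiment, so it absorbs the effect

for name, x in (("clean", clean), ("leaky", leaky)):
    r, p = covariate_balance(x, treat)
    print(f"{name}: corr with treatment={r:+.3f}  p={p:.3g}")
```

A clean covariate shows a correlation near zero; the leaky one is flagged with a very small p-value.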
The second trap is a pre-period that overlaps the live experiment window. When users enter the experiment at different times, a pre-period defined loosely for a late entrant can include days on which the experiment was already running, for that user or for others. You need either a single global pre-period that ends before the experiment starts, or per-user pre-periods that strictly precede each user's assignment timestamp. Both approaches work; the global pre-period is simpler and almost always sufficient.
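For the per-user variant, here is a minimal pandas sketch. The table layouts, column names (`user_id`, `ts`, `value`, `assigned_at`), and the 14-day window are assumptions for the illustration, not a prescribed schema.

```python
import pandas as pd


def per_user_pre_period(events: pd.DataFrame, assignments: pd.DataFrame, days: int = 14) -> pd.Series:
    """Sum each user's metric over the `days` strictly before their own assignment."""
    merged = events.merge(assignments, on="user_id", how="inner")
    window_start = merged["assigned_at"] - pd.Timedelta(days=days)
    # Half-open window: includes window_start, excludes the assignment timestamp itself.
    in_window = (merged["ts"] >= window_start) & (merged["ts"] < merged["assigned_at"])
    pre = merged[in_window].groupby("user_id")["value"].sum()
    # Users with no pre-period activity get 0 rather than being dropped.
    return pre.reindex(assignments["user_id"], fill_value=0.0).rename("pre_value")


assignments = pd.DataFrame({
    "user_id": ["a", "b", "c"],
    "assigned_at": pd.to_datetime(["2024-03-10", "2024-03-15", "2024-03-12"]),
})
events = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b"],
    "ts": pd.to_datetime(["2024-02-01", "2024-03-01", "2024-03-10", "2024-03-12", "2024-03-16"]),
    "value": [5.0, 10.0, 99.0, 7.0, 50.0],
})

result = per_user_pre_period(events, assignments)
print(result)
```

User `a` keeps only the 2024-03-01 event (the older one is outside the window and the one at the assignment timestamp is excluded), user `b` keeps the event three days before assignment, and user `c` has no activity.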
The third trap is reporting variance reduction without checking sample ratio mismatch. A 40 percent variance reduction on a sample with SRM is meaningless — the experiment is broken before the adjustment runs. Always confirm SRM is absent before pulling variance reduction numbers into a decision. Treat the AA check, SRM check, and covariate-balance check as the three gates the data must pass before the CUPED-adjusted CI shows up in a launch review.
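The SRM gate can be sketched as a chi-square goodness-of-fit test on the arm counts, using only the standard library. The function name and the example counts are hypothetical, and any alert threshold (0.001 is a common choice) is a convention rather than something fixed by the method.

```python
import math


def srm_p_value(n_treat: int, n_control: int, expected_treat_share: float = 0.5) -> float:
    """p-value of a chi-square test (1 degree of freedom) for sample ratio mismatch."""
    total = n_treat + n_control
    exp_treat = total * expected_treat_share
    exp_control = total - exp_treat
    chi2 = (n_treat - exp_treat) ** 2 / exp_treat + (n_control - exp_control) ** 2 / exp_control
    # Survival function of chi-square with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


for counts in ((50_000, 50_400), (50_000, 51_500)):
    p = srm_p_value(*counts)
    print(f"treat={counts[0]}  control={counts[1]}  p={p:.3g}")
```

The first split is consistent with 50/50 assignment; the second is far too lopsided to be chance and should block the readout.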
The fourth trap is forgetting that variance reduction does not fix a peeking habit. If the team peeks at p-values daily, the false positive rate inflates whether or not you applied CUPED. Variance reduction makes correct tests faster; it cannot rescue tests that violated their stopping rule. The peeking problem is a separate, equally important problem with its own fixes.
The fifth trap is over-engineering. Some teams try every covariate in the warehouse and end up with a regression that captures 70 percent of variance on AA data and 10 percent on live data. Pick one or two strong covariates with clear causal interpretation, document them, and stop. The marginal gain from squeezing the last few percent of variance is almost never worth the audit overhead of explaining a black-box adjustment to a launch review.
Related reading
- CUPED variance reduction in A/B tests
- A/B test sample size calculator and guide
- Sample ratio mismatch (SRM)
- The peeking problem in A/B testing
- Guardrail metrics in A/B testing
- How to design an A/B test step by step
FAQ
How much variance reduction does CUPED actually give?
In production, CUPED typically delivers 20 to 50 percent variance reduction on stable metrics like revenue per user or session count. The exact number is the squared correlation between the pre-period covariate and the outcome. A correlation of 0.5 yields 25 percent, 0.7 yields 49 percent, 0.9 yields 81 percent. New-user-heavy experiments see much smaller gains because the pre-period covariate barely exists.
Can I combine CUPED with stratification?
Yes, and most mature platforms do. Pre-experiment stratification on a categorical key (platform, country) balances the assignment; CUPED on a continuous covariate (pre-period spend) cleans up the remaining noise. The two adjustments are approximately orthogonal, so the reductions compound on the remaining variance: expect 30 percent from CUPED and 10 percent from stratification to yield about 37 percent total (1 - 0.70 * 0.90).
Does the pre-period need to be longer than the experiment window?
No. The pre-period needs to be long enough for the covariate to be a stable, low-noise summary — typically one to four weeks. Too short and the covariate is noisy with low correlation; too long and it captures stale behavior. One to two weeks works for most session-level metrics; revenue metrics often want four weeks because of weekly cycles.
Regression adjustment versus CUPED — which is better?
Regression adjustment generalizes CUPED, so a well-specified regression cannot be worse and can be better when multiple covariates carry independent signal. In practice CUPED wins on simplicity: a single covariate, a closed-form theta, easy to validate, easy to explain to a launch review. Most teams default to CUPED and only reach for regression when multiple strong predictors exist.
When should I bother with variance reduction at all?
When sample size is constrained, the minimum detectable effect is small, the metric is noisy, and pre-period data exists. If your experiment has tens of millions of users and the expected lift is double-digit, variance reduction is a rounding error. If you are chasing a 1 percent lift on a niche feature with a few hundred thousand users, it can be the difference between a conclusive readout and three more weeks of waiting.
Does variance reduction change the treatment effect estimate?
In expectation, no. CUPED, post-stratification, and regression adjustment all produce unbiased estimates when their assumptions hold. What changes is the variance of the estimator, which shrinks the confidence interval. If the point estimate moves meaningfully after adjustment, that is a red flag — usually the covariate is correlated with treatment, the pre-period leaked, or an SRM is lurking. Investigate before trusting the adjusted result.