How to compute A/B test sample size
Contents:
- Why size the test before you launch
- The four inputs that drive N
- Formula for proportions
- Formula for means
- Worked example for a checkout conversion test
- Python calculator with statsmodels
- How each parameter moves N
- Common pitfalls
- Practical runtime advice
- Interview prompts you should be ready for
- Related reading
- FAQ
Why size the test before you launch
Shipping an A/B test without a sample-size calculation is the experimentation equivalent of opening an order book with no position size. Arms that are too small return "not significant" even when the new variant is genuinely better, and the team rolls a winner back to control. Arms that are too large burn traffic you did not need while your roadmap slips on a settled question.
The harshest version is silent. At Stripe, Airbnb, or DoorDash, the most expensive A/B tests are not the ones that broke production — they are the ones that returned an inconclusive readout because the design could never have caught the effect the team cared about. A 30% powered test misses real effects seven times out of ten, and nobody flags it.
Sizing the test up front fixes the rules of the game before you see any data. You commit to a minimum effect worth shipping, an error tolerance, and a number of users per arm. That number drives runtime, traffic share, and the conversation with the PM about whether the experiment is worth running. It is also the cleanest defence against peeking, which we cover in "A/B testing peeking mistake".
The four inputs that drive N
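Every sample-size formula takes four inputs and returns N. Treat N as a fifth quantity and the relationship runs both ways: fix any four and the fifth falls out, so a fixed traffic budget implies the MDE you can actually detect.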
Significance level (alpha) is the false-positive rate. The convention is 5%, with a two-sided critical z of 1.96. Tightening to 0.01 at 80% power increases N by roughly 50%.
Power (1 - beta) is the probability of detecting a real effect. Default 80%. Critical decisions such as pricing or billing deserve 90% or 95%. Moving from 80% to 90% bumps N by about 30%; moving to 95% raises it by about 65%.
Minimum detectable effect (MDE) is the smallest lift worth shipping, not the lift you hope for. This is a product decision: if checkout conversion lifting by less than 0.3 pp would not justify the engineering cost, set MDE at 0.3 pp and let the math determine runtime.
Baseline variance is the noise in the metric. For a proportion it is p times (1 minus p), maxed at p = 0.5. For a continuous metric like revenue per user, estimate variance from a recent historical window on the same traffic source.
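As a quick check on how alpha and power feed into N, the sketch below computes the (z_alpha/2 + z_beta)^2 factor that appears in every formula in this article. It uses only the Python standard library; the function name `z_multiplier` is illustrative and not part of any library.

```python
from statistics import NormalDist


def z_multiplier(alpha: float, power: float, two_sided: bool = True) -> float:
    """Return (z_alpha + z_beta)^2, the factor by which alpha and power scale N."""
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)  # critical value for the chosen alpha
    z_beta = NormalDist().inv_cdf(power)      # quantile for the chosen power
    return (z_alpha + z_beta) ** 2


baseline = z_multiplier(alpha=0.05, power=0.80)
for alpha, power in [(0.05, 0.80), (0.01, 0.80), (0.05, 0.90), (0.05, 0.95)]:
    ratio = z_multiplier(alpha, power) / baseline
    print(f"alpha={alpha:.2f} power={power:.2f} -> N x {ratio:.2f}")
```

Because MDE and variance enter the formulas separately, these ratios hold regardless of the metric being tested.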
Formula for proportions
For a binary outcome — converted or not, clicked or not — sample size per arm follows from the normal approximation to the binomial:
n = (z_alpha/2 + z_beta)^2 * (p1*(1-p1) + p2*(1-p2)) / (p2 - p1)^2
Here z_alpha/2 = 1.96 for two-sided alpha = 0.05; z_beta = 0.842 for power = 0.80 (or 1.282 for power = 0.90); p1 is the control rate; p2 is the target after applying MDE.
When p1 and p2 are close — almost always true at production scale — the formula simplifies using pooled p = (p1 + p2) / 2:
n ≈ (z_alpha/2 + z_beta)^2 * 2 * p * (1 - p) / delta^2
The quadratic delta in the denominator is the single most important fact about A/B sizing: halving the MDE quadruples the sample size.
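The two proportion formulas above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production calculator; the function name `n_per_arm_proportion` and its `pooled` flag are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm_proportion(p1, p2, alpha=0.05, power=0.80, pooled=False):
    """Per-arm sample size for a two-sided test on two proportions."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    delta = p2 - p1
    if pooled:
        p = (p1 + p2) / 2
        variance_sum = 2 * p * (1 - p)                    # simplified form
    else:
        variance_sum = p1 * (1 - p1) + p2 * (1 - p2)      # unpooled form
    return ceil(z ** 2 * variance_sum / delta ** 2)


unpooled = n_per_arm_proportion(0.05, 0.055)
simplified = n_per_arm_proportion(0.05, 0.055, pooled=True)
half_mde = n_per_arm_proportion(0.05, 0.0525, pooled=True)
print(f"unpooled: {unpooled}, pooled: {simplified}")
print(f"halving the MDE multiplies N by {half_mde / simplified:.2f}")
```

When p1 and p2 are close, the two forms differ by a handful of users. The halving ratio lands slightly under 4 because the pooled rate also shifts when p2 moves.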
Formula for means
For a continuous metric — revenue per user, session duration, items per order — replace the proportion variance with the metric's own variance:
n = (z_alpha/2 + z_beta)^2 * 2 * sigma^2 / delta^2
Here sigma is the historical standard deviation and delta is the MDE in the metric's units. Revenue distributions are heavy-tailed, so sigma is usually large relative to the mean, which is why teams use variance reduction, covered in "CUPED variance reduction A/B testing". Because N is linear in sigma^2, a 30% reduction in variance cuts sample size by 30%, turning a 60-day test into about 42 days.
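A minimal sketch of the means formula, again with the standard library only. The inputs (sigma = 24, delta = 1 dollar) are borrowed from the statsmodels example later in this article; the function name `n_per_arm_mean` and the 30% variance-reduction figure are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm_mean(sigma, delta, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided test on a difference in means."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(2 * z ** 2 * sigma ** 2 / delta ** 2)


sigma = 24.0   # historical standard deviation of revenue per user
delta = 1.0    # MDE in dollars

n_raw = n_per_arm_mean(sigma, delta)
# A 30% variance reduction scales sigma^2 by 0.7, so sigma by sqrt(0.7).
n_reduced = n_per_arm_mean(sigma * 0.7 ** 0.5, delta)
print(f"raw: {n_raw} per arm, after 30% variance reduction: {n_reduced} per arm")
```

This normal-approximation figure should sit within a unit or two of a t-test based calculator at sample sizes this large.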
Worked example for a checkout conversion test
A PM at a DoorDash-style marketplace wants to test a new payment-method selector. Baseline cart-to-paid conversion is 5%. The PM picks 0.5 absolute percentage points (relative 10% lift) as the smallest improvement worth shipping. Alpha = 0.05 two-sided, power = 0.80.
The z sum is 1.96 + 0.842 = 2.802. With p1 = 0.05 and p2 = 0.055, pooled p = 0.0525, so p * (1 - p) = 0.04974. With delta = 0.005, delta squared = 0.000025. Plugging in:
n = (2.802)^2 * 2 * 0.04974 / 0.000025
= 7.851 * 0.09949 / 0.000025
= 31,244
The test needs about 31,244 per arm, or 62,488 total. At 1,000 new checkout sessions per day that is 63 days. The PM renegotiates: at MDE = 1 pp, delta = 0.01 and pooled p = 0.055, which drops N to ~8,160 per arm, or about 17 days. That tradeoff between sensitivity and runtime is the whole point of the calculation.
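To make the sensitivity-versus-runtime negotiation concrete, the sketch below sweeps a few candidate MDEs for the same 5% baseline and 1,000 sessions per day. It recomputes the pooled rate for each MDE rather than holding it fixed, and it uses unrounded z-values, so results differ slightly from the hand calculation. The names `n_per_arm` and `DAILY_SESSIONS` are illustrative.

```python
from math import ceil
from statistics import NormalDist

DAILY_SESSIONS = 1_000  # new checkout sessions per day, split across both arms


def n_per_arm(p1, delta, alpha=0.05, power=0.80):
    """Per-arm N from the pooled-proportion approximation."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    p = p1 + delta / 2  # pooled rate, i.e. the midpoint of p1 and p2
    return ceil(z ** 2 * 2 * p * (1 - p) / delta ** 2)


for mde_pp in (0.25, 0.5, 0.75, 1.0):
    n = n_per_arm(0.05, mde_pp / 100)
    days = ceil(2 * n / DAILY_SESSIONS)
    print(f"MDE {mde_pp:.2f} pp -> {n:>7,} per arm, {days:>3} days")
```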
Python calculator with statsmodels
You should never do this math by hand in production. statsmodels ships calculators that match the formulas above, using Cohen's effect-size scaling for a consistent API.
For a proportion test:
import numpy as np
from statsmodels.stats.power import NormalIndPower
analysis = NormalIndPower()
p1 = 0.05 # baseline conversion
p2 = 0.055 # target conversion (absolute MDE = 0.5 pp)
alpha = 0.05
power = 0.80
p_avg = (p1 + p2) / 2
effect_size = abs(p2 - p1) / np.sqrt(p_avg * (1 - p_avg))
n_per_group = analysis.solve_power(
    effect_size=effect_size,
    alpha=alpha,
    power=power,
    alternative="two-sided",
)
print(f"Cohen effect size: {effect_size:.4f}")
print(f"Per group: {int(np.ceil(n_per_group))}")
print(f"Total: {int(np.ceil(n_per_group) * 2)}")
Output:
Cohen effect size: 0.0224
Per group: 31234
Total: 62468
The small gap from the hand calculation arises because the hand calculation rounds the z-values to three decimals, while statsmodels uses full-precision normal quantiles. The two agree closely enough to trust.
For a continuous metric, swap to TTestIndPower and pass Cohen's d:
import numpy as np
from statsmodels.stats.power import TTestIndPower
analysis = TTestIndPower()
sigma = 24.0 # historical standard deviation of revenue per user
delta = 1.0 # MDE in dollars
alpha = 0.05
power = 0.80
effect_size = delta / sigma
n_per_group = analysis.solve_power(
    effect_size=effect_size,
    alpha=alpha,
    power=power,
    alternative="two-sided",
)
print(f"Cohen d: {effect_size:.4f}")
print(f"Per group: {int(np.ceil(n_per_group))}")
print(f"Total: {int(np.ceil(n_per_group) * 2)}")
Wrap both into a helper and version-control the inputs alongside your design doc so you can defend the chosen N at the readout review.
How each parameter moves N
The four inputs do not move N evenly. Knowing the sensitivities saves you from arguing about the wrong knob.
| Change | Effect on N | Why |
|---|---|---|
| Alpha tightens (0.05 → 0.01) | N rises ~50% | Stricter false-positive bar requires more evidence |
| Power rises (0.80 → 0.90) | N rises ~30% | Lower tolerance for missing real effects |
| MDE halves | N quadruples | Quadratic relation: N ~ 1 / delta^2 |
| Variance doubles | N doubles | Linear relation: N ~ sigma^2 |
| Baseline closer to 50% (fixed absolute MDE) | N rises | Proportion variance maxes at p = 0.5 |
| One-sided test | N falls ~20% | Lower bar at the cost of detecting regressions |
The dominant lever is always MDE. If a stakeholder insists on detecting a 0.1 pp change and you lack months of traffic, the conversation should not be about formulas — it should be about whether 0.1 pp is really the bar for shipping. Once they see the runtime math, the answer is usually no.
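The MDE, variance, baseline, and one-sided rows of the table can be reproduced with a short sketch. It uses the normal-approximation formula for a difference in means with unit baseline values, so only the ratios are meaningful. The function name `required_n` is illustrative.

```python
from statistics import NormalDist


def required_n(delta, variance, alpha=0.05, power=0.80, two_sided=True):
    """Per-arm N from the normal-approximation formula (not rounded)."""
    tail = alpha / 2 if two_sided else alpha
    z = NormalDist().inv_cdf(1 - tail) + NormalDist().inv_cdf(power)
    return 2 * z ** 2 * variance / delta ** 2


base = required_n(delta=1.0, variance=1.0)
scenarios = {
    "MDE halves": required_n(delta=0.5, variance=1.0),
    "Variance doubles": required_n(delta=1.0, variance=2.0),
    "One-sided test": required_n(delta=1.0, variance=1.0, two_sided=False),
    # Proportion variance p*(1-p) at p=0.50 relative to p=0.05, same absolute MDE.
    "Baseline 5% -> 50%": required_n(delta=1.0, variance=(0.5 * 0.5) / (0.05 * 0.95)),
}
for name, n in scenarios.items():
    print(f"{name:<20} N x {n / base:.2f}")
```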
Common pitfalls
The most common mistake is treating sizing as a one-time formality. A team sizes the test on a Tuesday assuming 5% baseline and 1,000 daily users, the PM revises the experience two weeks in, traffic dips on a holiday week, and nobody re-runs the math. The new design has different variance and baseline, but the original N still anchors the roadmap. Re-running sample size when assumptions change is cheap; ignoring drift costs you the entire test.
Another trap is using observed standard deviation from the test itself rather than from a holdout. Estimating sigma on the same data you plan to test biases the readout. Always pull sigma from a pre-test window of the same length and seasonality. For heavy-tailed metrics like revenue, winsorize at the 99th percentile before computing sigma so a handful of whales do not balloon the required N.
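The effect of winsorizing on the required N can be sketched with synthetic data. The lognormal revenue sample below is made up purely for illustration and stands in for a real pre-test window; `winsorize_upper` is a hypothetical helper, not a library function.

```python
import random
from statistics import quantiles, stdev


def winsorize_upper(values, pct=99):
    """Cap every value at the given upper percentile of the sample."""
    cap = quantiles(values, n=100)[pct - 1]  # the pct-th percentile cut point
    return [min(v, cap) for v in values]


random.seed(7)
# Synthetic heavy-tailed "revenue per user" sample (illustrative only).
revenue = [random.lognormvariate(2.0, 1.2) for _ in range(50_000)]

raw_sigma = stdev(revenue)
winsorized_sigma = stdev(winsorize_upper(revenue))

# N is proportional to sigma^2, so the ratio of variances is the ratio of N.
n_ratio = (winsorized_sigma / raw_sigma) ** 2
print(f"raw sigma: {raw_sigma:.2f}, winsorized sigma: {winsorized_sigma:.2f}")
print(f"required N shrinks to {n_ratio:.0%} of the unwinsorized figure")
```

If sigma is computed on winsorized data, the readout should apply the same cap so that the planned and analysed metric match.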
The third trap is sizing for a single primary metric while ignoring guardrails. You may need 30k per arm to detect the primary lift but only 5k to detect a critical regression on crash rate. The test must size to the strictest binding constraint — usually primary — and the guardrails inform downside power. Note both in the design doc so reviewers see the tradeoffs.
A fourth trap is computing N for two arms when you plan three or four. Multi-arm tests need a multiple-comparisons correction like Bonferroni or Holm, which translates to roughly 20-35% more sample per arm at 80% power. Ignoring this turns a nominal 5% alpha into a family-wise rate closer to 10-14% across the two or three variant comparisons.
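A rough sketch of the multi-arm cost under a Bonferroni correction, where each variant is compared against control at alpha divided by the number of variants. The family-wise error figure treats the comparisons as independent, which is an approximation because they share a control arm. Function names are illustrative.

```python
from statistics import NormalDist


def z_multiplier(alpha, power=0.80):
    """(z_alpha/2 + z_beta)^2 for a two-sided test."""
    return (NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)) ** 2


def bonferroni_inflation(n_variants, alpha=0.05, power=0.80):
    """Factor by which per-arm N grows when alpha is split across variants."""
    return z_multiplier(alpha / n_variants, power) / z_multiplier(alpha, power)


def familywise_error(n_variants, alpha=0.05):
    """Chance of at least one false positive with no correction (independence assumed)."""
    return 1 - (1 - alpha) ** n_variants


for k in (2, 3):
    print(
        f"{k} variants vs control: N per arm x {bonferroni_inflation(k):.2f}, "
        f"uncorrected family-wise alpha = {familywise_error(k):.1%}"
    )
```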
The final trap is forgetting attrition. The formula gives valid units in the analysis. In production, 5-10% of traffic is dropped by bot filters, segment mismatches, or instrumentation gaps. Pad the target by 10-15% so you actually reach N once bookkeeping is done.
Practical runtime advice
Compute runtime in days, not users. Divide total N by daily exposed traffic at the entrypoint, not monthly actives. At 2,000 daily users with N = 60,000 total, runtime is 30 days. Skipping this step is how teams sign up for tests they cannot finish before the quarter ends.
Round up to full weeks. Traffic mix shifts dramatically between weekdays and weekends, and any test shorter than a week risks day-of-week bias. A 10-day test is worse than a 14-day test even when the math says 10 is enough.
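The runtime arithmetic, including the attrition padding from the pitfalls section and the round-up to full weeks, fits in one small helper. This is a sketch under the assumption that traffic is roughly constant day to day; `runtime_days` and its `drop_rate` parameter are hypothetical names.

```python
from math import ceil


def runtime_days(total_n, daily_traffic, drop_rate=0.0):
    """Return (raw_days, days_rounded_up_to_full_weeks).

    total_n is the analysable sample needed across all arms; drop_rate is the
    share of exposed traffic expected to be filtered out before analysis.
    """
    exposed_needed = ceil(total_n / (1 - drop_rate))  # pad for attrition
    raw_days = ceil(exposed_needed / daily_traffic)
    full_weeks = ceil(raw_days / 7) * 7
    return raw_days, full_weeks


print(runtime_days(60_000, 2_000))                  # no attrition
print(runtime_days(60_000, 2_000, drop_rate=0.10))  # 10% of traffic dropped
```

In both cases the 30-odd raw days round up to 35, which is the figure worth putting in the design doc.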
Resist calling the test early. If the formula says 14 days and day 5 shows p = 0.04, that is not a finish line. Peeking without a sequential correction inflates real alpha to 15-25%. If you genuinely need early stopping, use group-sequential designs or always-valid p-values, not ad-hoc dashboards. The post "A/B testing peeking mistake" covers why peeking breaks the math.
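A small Monte Carlo sketch shows the inflation directly. It simulates A/A tests (no true effect) where each day contributes an equal-sized batch, so the z-statistic at look k is the running sum of standardized daily increments divided by sqrt(k). The test stops at the first look where |z| exceeds the critical value. This is a simplified model, not a replay of any real experiment, and `false_positive_rate` is an illustrative name.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(n_looks, n_sims=20_000, alpha=0.05, seed=0):
    """Share of A/A tests declared significant when checked at every look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_sims):
        running = 0.0
        for look in range(1, n_looks + 1):
            running += rng.gauss(0.0, 1.0)          # one day's standardized increment
            if abs(running / sqrt(look)) > z_crit:  # "significant" at this peek
                hits += 1
                break
    return hits / n_sims


print(f"single look at the end: {false_positive_rate(1):.3f}")
print(f"daily looks for 14 days: {false_positive_rate(14):.3f}")
```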
Interview prompts you should be ready for
Senior analyst and product data scientist loops at Meta, Amazon, Netflix, and Uber almost always include a sample-size question. The interviewer is checking whether you have actually sized a test, not whether you can recite the formula. Start with the four parameters and tie them to a concrete scenario.
A common follow-up is "what if the metric is heavy-tailed?" The right answer covers winsorizing, CUPED, and switching to a more sensitive proxy if the primary is too noisy. A weaker answer jumps to nonparametric tests, which lack a clean sizing formula.
Another favorite is "how would you cut runtime in half?" The honest answer: relax MDE (biggest lever), reduce variance via CUPED or stratification, increase traffic share, or accept a slightly higher alpha. "Use a one-sided test" buys only 20% and trades regression detection — guardrails will reject that.
Related reading
- Power analysis explained simply
- A/B testing peeking mistake
- CUPED variance reduction A/B testing
- Effect size explained simply
- How to design an A/B test step by step
FAQ
What is the minimum sample size for an A/B test?
There is no universal minimum. Required N depends on baseline rate, variance, MDE, alpha, and power. A 50% baseline with a 10% relative MDE needs roughly 1,600 per arm at 80% power, while a 2% baseline with a 5% relative MDE needs over 300,000. Compute N every time.
How do I pick MDE for an A/B test?
Treat MDE as a product decision, not a statistical one. Ask: what is the smallest lift that justifies shipping? If 0.5 pp in checkout conversion is the floor, set MDE = 0.5 pp. If only 2% would be worth the engineering cost, set MDE = 2%, and state whether that figure is relative or absolute so the sizing uses the right delta. The math then tells you whether the test is feasible.
Why does halving MDE quadruple the sample size?
MDE sits squared in the denominator: N is proportional to 1 / delta^2. Standard error shrinks as 1 / sqrt(N), so detecting an effect half as large with the same precision needs four times the sample size.
Can I stop the test when the p-value goes significant?
No — that is peeking, and it badly inflates false positives. Daily checks over two weeks can push real alpha from 5% to 20-30%. Either fix N up front and look once, or use a sequential framework with corrected stopping rules. Fixed N is type-I-error control, not an inconvenience.
What if I do not have enough traffic for the required sample size?
Relax MDE with the PM (biggest lever). Reduce variance via CUPED, stratification, or a more sensitive proxy metric like add-to-cart. Increase traffic share to the test. Accept a longer runtime if the experiment is critical. The wrong answer is running an underpowered test and treating the negative readout as meaningful.
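Before committing to a short test, it helps to compute the power the available traffic actually buys. The sketch below inverts the pooled-proportion formula under the normal approximation and ignores the negligible chance of a significant result in the wrong direction. The function name `achieved_power` and the 5,000-per-arm budget are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def achieved_power(p1, p2, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test at a given N."""
    p = (p1 + p2) / 2
    se = sqrt(2 * p * (1 - p) / n_per_arm)        # standard error of the difference
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p2 - p1) / se - z_crit)


# Checkout example: 5% baseline, 0.5 pp MDE.
print(f"5,000 per arm:  power = {achieved_power(0.05, 0.055, 5_000):.0%}")
print(f"31,235 per arm: power = {achieved_power(0.05, 0.055, 31_235):.0%}")
```

A budget of 5,000 per arm gives roughly one chance in five of detecting the lift, which is the kind of number that makes the case for relaxing MDE or extending the run.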
How long should an A/B test run?
Long enough to hit N and cover a full weekly cycle. For most consumer products that is at least two weeks even if N is reached earlier, and at least one full week for high-volume entrypoints. Cover known seasonality — paydays, weekends, promotional cycles — that could bias the readout.