Power analysis explained simply

The one-line definition

Statistical power is the probability that your A/B test will flag a real effect as significant when that effect actually exists. Formally it is 1 − β, where β is the false-negative rate. The industry default is 80%, which means that even a perfectly designed test still misses a real effect of the planned size one time in five. Once you accept that framing, every sample-size argument suddenly makes sense.

A power analysis translates "we want to know if a change works" into "we need N users per arm". You pick the alpha you tolerate, the power you want to hit, and the smallest effect worth shipping. The math gives you N. Skip this step and you are running a coin flip with a dashboard.

The most expensive mistake at Meta, Airbnb, or DoorDash is not a bad test — it is a test statistically incapable of detecting the effect it targets. Leadership reads "not significant" and concludes the feature does nothing. The truth is usually that the test was underpowered.

The four parameters

Power analysis is a four-way relationship. Fix three of them and the fourth falls out of the equation. Internalize this and you will stop arguing about formulas.

Alpha (α) is the false-positive rate you accept. Almost every team uses 5% by convention. If you run dozens of experiments at α = 5%, you should expect roughly one bogus winner per twenty null tests just from noise. That is why guardrail metrics and replication matter.

Beta (β) is the false-negative rate. Power equals 1 − β. The 80% convention is not sacred — it is just a balance between "we keep missing real wins" and "we cannot afford to run 200k users through every test". Critical decisions (pricing, billing, anything regulated) deserve 90% or 95% power.

MDE is the minimum detectable effect — the smallest lift you actually care about. This is the parameter analysts under-think the most. MDE is a product decision, not a statistical one. If a 0.2% lift in checkout conversion would still be worth the engineering cost, your MDE is 0.2%. If you would only ship at +1%, set MDE = 1%.

N is the sample size per arm. This is usually what you solve for. Smaller MDE and higher power both push N up sharply: N grows with the inverse square of the MDE, so halving the MDE roughly quadruples the traffic. A test designed for a 1% lift typically needs about four times the traffic of one designed for a 2% lift.
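The relationship between the four parameters can be sketched with the standard library alone. This is a minimal illustration using the normal-approximation formula for two proportions; the function name `n_per_arm` and the 5% baseline are assumptions chosen for the example, not part of any library.

```python
from statistics import NormalDist


def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Per-arm N for a two-sided two-proportion z-test (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_beta = z.inv_cdf(power)           # quantile matching the power target
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2


baseline = 0.05  # hypothetical baseline conversion rate
n_small_mde = n_per_arm(baseline, baseline + 0.01)  # 1pp lift
n_large_mde = n_per_arm(baseline, baseline + 0.02)  # 2pp lift
ratio = n_small_mde / n_large_mde

print(f"N per arm at 1pp MDE: {n_small_mde:,.0f}")
print(f"N per arm at 2pp MDE: {n_large_mde:,.0f}")
print(f"Ratio: {ratio:.2f}x")
```

The ratio lands close to four rather than exactly four because the variance term also shifts slightly when the target rate changes.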

When you actually run it

The standard scenario is designing a new experiment. You know the baseline rate, you have agreed on an MDE with the PM, and you commit to α = 5% and 80% power. The calculation tells you how long the test must run at your daily traffic. If the number is "11 weeks", talk to the PM before shipping the splitter.

The second scenario is mid-flight diagnostics. The test is running and the lead asks whether the current sample can catch the effect they care about. Fix α and N, solve for power at the relevant effect size. This is also how you kill tests early when they cannot succeed even in the best case.
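A minimal sketch of the mid-flight check: hold α and N fixed and solve for power at the effect size of interest. The helper `achieved_power` and the 5% → 6% rates are assumptions for illustration; the calculation uses the same normal approximation as a two-proportion z-test.

```python
from math import sqrt
from statistics import NormalDist


def achieved_power(p1, p2, n, alpha=0.05):
    """Power of a two-sided two-proportion z-test with n users per arm."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
    shift = abs(p2 - p1) / se  # true effect measured in standard errors
    # Probability of landing in either rejection region under the alternative.
    return z.cdf(shift - z_alpha) + z.cdf(-shift - z_alpha)


# Hypothetical check: 5% baseline, 6% target, various per-arm samples so far.
for n in (2_000, 4_000, 8_000, 16_000):
    print(f"n per arm = {n:>6,}: power = {achieved_power(0.05, 0.06, n):.2f}")
```

If the power at the full planned sample is still well below the target, that is the signal to stop or redesign rather than wait.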

The third scenario is the post-mortem. A test came back not significant and people want to know whether the design was capable of finding something. This is post-hoc analysis, with limits we cover below.

Worked example for a conversion test

Imagine a checkout-button experiment at a Stripe-like billing flow. Baseline conversion is 5%. The product team would ship at a 1 percentage-point lift, so MDE = 1pp (5% → 6%). You commit to α = 5% two-sided and power = 80%.

The formula for two proportions, using a normal approximation, is below. It is just the standard two-sample z-test rearranged for N.

n = (z_{α/2} + z_β)^2 × (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2

Where z_{α/2} = 1.96 for two-sided α = 5% and z_β = 0.84 for 80% power. Plugging in:

n ≈ (1.96 + 0.84)^2 × (0.05 × 0.95 + 0.06 × 0.94) / (0.01)^2
  ≈ 7.84 × (0.0475 + 0.0564) / 0.0001
  ≈ 7.84 × 0.1039 / 0.0001
  ≈ 8,146 per arm

You need roughly 8,150 users in control and 8,150 in treatment, so about 16,300 total. If the page sees 2,000 eligible users per day at a 50/50 split, the test needs about eight to nine days of clean traffic — and you should round up to a full two weeks to cover day-of-week and novelty effects.
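The hand calculation and the run-time estimate can be reproduced in a few lines. This sketch uses only the standard library and unrounded z-scores, so it lands a few users above the 8,146 obtained with 1.96 and 0.84; the variable names are illustrative.

```python
from math import ceil
from statistics import NormalDist

z = NormalDist()
p1, p2 = 0.05, 0.06        # baseline and target conversion from the example
alpha, power = 0.05, 0.80
daily_users = 2_000        # eligible users per day, split 50/50

z_alpha = z.inv_cdf(1 - alpha / 2)
z_beta = z.inv_cdf(power)
variance = p1 * (1 - p1) + p2 * (1 - p2)

n = ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)
days = ceil(2 * n / daily_users)  # both arms draw from the same daily traffic

print(f"N per arm: {n:,}")
print(f"Minimum days of traffic: {days}")
```

The day count is a statistical floor only; the two-week recommendation for day-of-week and novelty effects still applies on top of it.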

In Python, the canonical implementation is statsmodels. It uses Cohen's h for proportions, which is an arcsine-transformed effect size, so the result differs slightly from the formula above; for this example the two agree to within a fraction of a percent.

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.06, 0.05)  # Cohen's h
analysis = NormalIndPower()

n = analysis.solve_power(
    effect_size=effect,
    alpha=0.05,
    power=0.80,
    ratio=1.0,
    alternative='two-sided',
)
print(f"N per arm: {n:.0f}")

For continuous metrics like revenue per user, swap to a t-test power function and supply the standard deviation. Revenue distributions usually have long right tails, so a naive proportion-style calculation will understate N badly. Treat any heavy-tailed metric with a bootstrap-based estimate or a CUPED-adjusted variance.
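For a continuous metric, a rough first pass is the two-sample formula n = 2 (z_{α/2} + z_β)^2 σ^2 / δ^2. The sketch below uses the normal approximation to the t-test, which is very close at sample sizes in the thousands. The standard deviation of 40 and the detectable difference of 1 are hypothetical values, and `n_per_arm_continuous` is an illustrative name.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm_continuous(sd, delta, alpha=0.05, power=0.80):
    """Per-arm N to detect a mean difference `delta` given metric SD `sd`."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2)


# Hypothetical revenue-per-user metric: SD of 40, smallest lift worth shipping is 1.
n_raw = n_per_arm_continuous(sd=40.0, delta=1.0)
# Halving the SD (for example through variance reduction) cuts N to a quarter.
n_reduced = n_per_arm_continuous(sd=20.0, delta=1.0)

print(f"N per arm at SD=40: {n_raw:,}")
print(f"N per arm at SD=20: {n_reduced:,}")
```

Because N scales with the square of the SD, a few extreme spenders in the estimation window can move the answer a lot, which is why the SD should come from a recent, representative sample.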

Tradeoffs between the parameters

Pushing N up is the safest lever. More users means more power at any fixed effect size and tighter confidence intervals on the readout. The cost is time on the calendar and the opportunity cost of holding back a winning treatment from the control group. Most growth teams cap individual experiment durations at four weeks; if your power analysis demands eight, the MDE is probably too small.

Raising alpha shrinks N at the cost of more false positives. Going from 5% to 10% alpha cuts N by roughly 20% at 80% power. Almost no respectable team does this in practice because false positives accumulate across an experimentation program. The cleaner move is to lower N requirements through variance reduction (CUPED, stratification) rather than relaxing alpha.

Raising MDE is the most underused lever. If you genuinely only ship at a 2% lift, do not design a test for 0.5%. Be honest with the PM. The conversation "we cannot detect 0.5% with our traffic, what is the smallest lift you would actually ship at" usually ends with an MDE that makes the timeline reasonable.

Lowering power below 80% is rarely defensible. It looks attractive on a sample-size calculator but it just means most of your "negative" results are inconclusive noise. Better to invest in variance reduction or scope the test to a higher-traffic surface.
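To compare the levers side by side, it helps to look at the multiplier (z_{α/2} + z_β)^2, which is the only part of the sample-size formula that depends on alpha and power. The sketch below reports each scenario relative to the α = 5%, 80% power default; `z_factor` is an illustrative helper, and the MDE row assumes the variance term stays roughly constant.

```python
from statistics import NormalDist


def z_factor(alpha, power):
    """The (z_alpha/2 + z_beta)^2 multiplier in the sample-size formula."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2


base = z_factor(0.05, 0.80)

scenarios = {
    "alpha 5% -> 10%, power 80%": z_factor(0.10, 0.80) / base,
    "alpha 5%, power 80% -> 90%": z_factor(0.05, 0.90) / base,
    "alpha 5%, power 80% -> 95%": z_factor(0.05, 0.95) / base,
    "alpha 5%, power 80% -> 70%": z_factor(0.05, 0.70) / base,
    "MDE doubled (variance held fixed)": 1 / 2 ** 2,
}

for label, multiplier in scenarios.items():
    print(f"{label}: N x {multiplier:.2f}")
```

Doubling the MDE moves N far more than any plausible change to alpha or power, which is why the MDE conversation with the PM usually matters most.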

A priori vs post-hoc

A priori power analysis is the planning step described above. You run it before the experiment, you log the assumptions in the test brief, and you commit to a stopping rule. This is the standard practice at every serious experimentation team and the only version your interviewer wants to hear about by default.

Post-hoc power analysis is what you do after the test, using the observed effect size in place of the planned MDE. The math works, but the interpretation is fraught. Power computed at the observed effect is a direct function of the p-value: a result sitting exactly at the significance threshold has about 50% observed power, and any non-significant result has less, sliding toward α as the observed effect approaches zero. That is a tautology, not a finding. The cleaner instrument after the fact is precision analysis: report the width of the 95% confidence interval and ask whether it rules out effects you would have shipped at.
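The circularity is easy to see numerically. For a two-sided z-test, power evaluated at the observed effect depends only on the p-value, so it adds no information beyond the p-value itself. The sketch below assumes a z-test readout; `observed_power` is an illustrative name.

```python
from statistics import NormalDist


def observed_power(p_value, alpha=0.05):
    """Power evaluated at the observed effect, for a two-sided z-test."""
    z = NormalDist()
    z_obs = z.inv_cdf(1 - p_value / 2)   # |z| statistic implied by the p-value
    z_alpha = z.inv_cdf(1 - alpha / 2)
    return z.cdf(z_obs - z_alpha) + z.cdf(-z_obs - z_alpha)


for p in (0.05, 0.10, 0.30, 0.60, 0.99):
    print(f"p = {p:.2f} -> observed power = {observed_power(p):.2f}")
```

Every row is just a relabeling of the p-value in the left column, which is why the confidence interval is the more useful thing to report.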

If a stakeholder asks "what was the post-hoc power?", reframe the question. The useful version is "given our final sample, what is the smallest lift we could still have detected at 80% power?" That number tells you whether the test was capable of finding business-relevant effects, without the circular logic of plugging the observed estimate back in.
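The reframed question can be answered by inverting the sample-size formula. Because the variance term depends on the target rate, the sketch below uses bisection on the lift rather than a closed form. The names `n_per_arm` and `detectable_lift` are illustrative, and the 5% baseline is an assumption for the example.

```python
from statistics import NormalDist


def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Per-arm N for a two-sided two-proportion z-test (normal approximation)."""
    z = NormalDist()
    factor = (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2
    return factor * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2


def detectable_lift(baseline, n, alpha=0.05, power=0.80):
    """Smallest absolute lift detectable with n users per arm."""
    lo, hi = 1e-6, 1 - baseline - 1e-6
    for _ in range(100):
        mid = (lo + hi) / 2
        if n_per_arm(baseline, baseline + mid, alpha, power) > n:
            lo = mid  # this lift would need more users than we have
        else:
            hi = mid  # this lift is detectable; try a smaller one
    return hi


for n in (2_000, 8_155, 30_000):
    lift = detectable_lift(0.05, n)
    print(f"n per arm = {n:>6,}: smallest detectable lift = {lift * 100:.2f}pp")
```

If the detectable lift at the final N is larger than the lift the team would have shipped at, the test could not have answered the business question regardless of how the readout looked.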

Common pitfalls

The first pitfall is skipping the calculation entirely. Teams launch a test because "we have the traffic" and then panic when the readout is ambiguous. Without a planned N, you have no clean way to say whether you ran long enough. Always log the target sample size in the experiment doc before the splitter goes live, even if it is a back-of-the-envelope estimate.

The second pitfall is pulling MDE out of thin air. A 1% MDE chosen because "it sounds reasonable" is a red flag that nobody asked the product team what lift would actually justify shipping. Tie MDE to a real business floor — engineering cost, risk tolerance, revenue threshold. Otherwise you are optimizing a number nobody owns.

The third pitfall is ignoring metric variance. The two-proportion formula works for binary metrics like conversion. The moment you switch to revenue per user or any continuous metric with a fat tail, that formula understates N dramatically. Either run a t-test power calculation with a real SD estimate, or simulate sample size via bootstrap on a recent window.

The fourth pitfall is using post-hoc power as a defense. "We had 30% power, that is why we did not find anything" is not an argument — it is an admission that the test should never have launched. The right response is to plan better next time, often by raising MDE or stacking multiple weeks of data before the readout. For the related and more common sin of stopping early, see the peeking mistake in A/B testing.

The fifth pitfall is mis-handling novelty effects. The first three days of an experiment frequently show an exaggerated treatment effect because curious users click everything new. Power analysis assumes a stable effect over the test window, so plan a minimum of two weeks regardless of what the calculator says. Anything shorter risks reading a transient lift as a permanent one.

Interview prompts you should be ready for

When an interviewer asks "what is power?", the answer is the probability that the test reaches significance given a true effect of a specified size, usually the MDE. Add that the default 80% is conventional, not principled, and that critical decisions deserve more.

When they ask "why 80%?", say it is a working balance between false-negative tolerance and the cost of larger samples. Strong candidates note that the cost-of-a-miss should drive the choice, not folklore.

When they ask "how do you size a test?", walk through the four parameters in order: alpha is fixed by policy, power is a target, MDE comes from product, and N falls out. Mention that for continuous metrics you also need the metric standard deviation.

When they ask "1% MDE versus 5% MDE?", the right answer is that 1% needs roughly 25x the sample of 5% (N scales with 1/MDE^2), so the choice is a business decision about how small a lift is worth shipping. Then ask what the company's actual shipping bar is — interviewers like candidates who push the question back.

If you want to drill scenarios like "we have 4,000 daily users and a 4% baseline, can we detect a 0.5% lift in two weeks", NAILDD is launching with a deep library of A/B testing and SQL interview drills built around exactly this pattern.
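As a sketch of how that drill might be worked, the prompt is ambiguous about whether "0.5%" means an absolute lift (4.0% → 4.5%) or a relative one (4.00% → 4.02%), and the answer flips depending on the reading. The code below computes the power under both interpretations, assuming a 50/50 split and a two-sided α of 5%; `achieved_power` is an illustrative helper.

```python
from math import sqrt
from statistics import NormalDist


def achieved_power(p1, p2, n, alpha=0.05):
    """Power of a two-sided two-proportion z-test with n users per arm."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
    shift = abs(p2 - p1) / se
    return z.cdf(shift - z_alpha) + z.cdf(-shift - z_alpha)


daily_users, days, baseline = 4_000, 14, 0.04
n = daily_users * days // 2  # users per arm after two weeks at 50/50

power_absolute = achieved_power(baseline, baseline + 0.005, n)  # +0.5pp
power_relative = achieved_power(baseline, baseline * 1.005, n)  # +0.5% relative

print(f"Users per arm: {n:,}")
print(f"Power for +0.5pp absolute lift: {power_absolute:.2f}")
print(f"Power for +0.5% relative lift:  {power_relative:.2f}")
```

Asking which reading the interviewer intends before computing anything is usually part of a strong answer.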

FAQ

What is beta in plain English?

Beta is the probability of missing a real effect when you run the test — a false negative. Power is just 1 − β. Teams obsess over alpha (false positives) because it sounds dangerous, but in growth work β is usually the bigger source of pain because underpowered tests look like "no effect" and quietly kill good ideas.

Can I increase power without raising N?

Yes, by reducing the variance of the metric. CUPED is the standard trick: regress the metric on a pre-experiment covariate and analyze residuals. Stratified sampling and switchback designs also help. Raising MDE is the other lever, but only if the product team genuinely accepts a higher shipping bar.
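A small simulation shows the CUPED mechanics described above. This is a sketch on synthetic data, not a production implementation: the correlation of 0.6 between the pre-experiment covariate and the metric is a hypothetical value, and the helper `cov` is defined here for illustration. The variance of the adjusted metric shrinks by a factor of 1 − ρ², where ρ is the correlation between covariate and metric.

```python
import random
from statistics import fmean

random.seed(7)
rho = 0.6      # hypothetical correlation between pre-period and in-test metric
n = 20_000

# Synthetic users: pre-experiment covariate and a correlated in-test metric.
pre = [random.gauss(0, 1) for _ in range(n)]
post = [rho * x + (1 - rho ** 2) ** 0.5 * random.gauss(0, 1) for x in pre]


def cov(a, b):
    """Sample covariance; cov(a, a) is the sample variance."""
    ma, mb = fmean(a), fmean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


theta = cov(pre, post) / cov(pre, pre)  # regression slope of metric on covariate
pre_mean = fmean(pre)
adjusted = [y - theta * (x - pre_mean) for x, y in zip(pre, post)]

var_ratio = cov(adjusted, adjusted) / cov(post, post)
sample_rho = cov(pre, post) / (cov(pre, pre) * cov(post, post)) ** 0.5

print(f"Sample correlation: {sample_rho:.3f}")
print(f"Variance after CUPED / variance before: {var_ratio:.3f}")
print(f"1 - rho^2: {1 - sample_rho ** 2:.3f}")
```

Since required N is proportional to the metric variance, the same ratio applies to the sample size, so the payoff depends entirely on how well the pre-experiment covariate predicts the metric.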

Is 80% power really mandatory?

It is conventional, not mandatory. Use 80% for typical experimentation. Push to 90% or 95% when the decision is hard to reverse — pricing changes, billing changes, anything that touches paid retention. For exploratory A/B tests on a button color, 80% is plenty.

Is post-hoc power useful at all?

It is more confusing than helpful when computed at the observed effect. The form that pays off is "given our final N, what was the smallest lift we could have detected at 80% power?". That number tells you whether the test was capable of resolving a business-relevant effect, without the circular reasoning of plugging the observed estimate back in.

How do I size a test for revenue per user instead of conversion?

Use a t-test power calculation and supply the empirical standard deviation from a recent stable window of your data. Revenue distributions are heavy-tailed, so the SD will look large and N will spike. Variance reduction via CUPED is almost always worth it for revenue metrics — it can cut required N by 30% to 50% on real datasets.

What if the calculator says I need more users than I have in a year?

The MDE is too small relative to your traffic. Three honest options: raise MDE to a level the team would actually ship at, run on a higher-traffic surface, or cut metric variance with CUPED. Running the underpowered test anyway is the worst option: you get an ambiguous result and a burned calendar.