Peeking problem in A/B testing
Why peeking matters in real teams
You launched an A/B test on Monday. By Thursday, your dashboard shows p = 0.04 for the conversion lift, and the PM asks whether you can ship before the weekend launch window closes. This is the moment that quietly destroys experimentation programs at companies like Stripe, DoorDash, and Airbnb. The temptation to act on early significance feels rational, but it is the single most expensive mistake in applied statistics.
Frequent peeking with early stopping at p < 0.05 inflates the false positive rate from the nominal 5% to somewhere between 20% and 30%, depending on how aggressively you check. If most of the ideas you test have little or no real effect, a large share of the "significant" experiments you ship is noise, and your roadmap quietly fills with neutral or negative features the team believes are wins.
The peeking problem is also a near-guaranteed interview topic for mid-level and senior data science roles. If you cannot articulate why repeated significance testing breaks the math, you are getting downleveled. A clear answer signals that you understand the difference between running an A/B test and running it correctly.
The math behind inflated false positives
A fixed-horizon test sets α = 0.05, meaning a 5% probability of declaring a winner when the variants are identical, if and only if you check the result exactly once at a pre-specified endpoint. The guarantee is about the procedure as a whole, not about each individual peek.
Under the null hypothesis, where the true effect is zero, the p-value at any single pre-specified look follows a uniform(0, 1) distribution. Successive peeks are not independent draws, because each one reuses the earlier data, but every peek is still another chance for noise to dip below 0.05. If you stop the moment any look crosses the threshold, the cumulative probability of a false positive keeps climbing and approaches 1 as the number of peeks grows without bound. The fixed-horizon guarantee does not protect you, because you are no longer running a fixed-horizon test.
Here is the rough scaling of false positive rate (FPR) against the number of equally spaced peeks, each tested naively at p < 0.05:
1 peek (correct): ~5%
5 peeks: ~14%
10 peeks: ~19%
20 peeks: ~25%
30 peeks (daily): ~30%
Continuous monitoring: approaches 100% as N grows
That last line is the killer. With unbounded sample size and unlimited peeks, you are mathematically guaranteed to cross p < 0.05 even when the true effect is zero. The longer you watch, the more certain the false win becomes.
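The rows above can be checked with a short Monte Carlo sketch. It assumes equally sized batches of data between peeks and a z-test with known variance; the function name `naive_peeking_fpr` and its parameters are illustrative, not taken from any library.

```python
import numpy as np

def naive_peeking_fpr(n_peeks, n_sims=100_000, z_crit=1.96, seed=0):
    """Estimate the false positive rate of stopping at the first |z| > z_crit
    across n_peeks equally spaced looks when the true effect is zero."""
    rng = np.random.default_rng(seed)
    # Each column is the standardized sum of one equally sized batch of data.
    increments = rng.standard_normal((n_sims, n_peeks))
    # The z-statistic at look k uses all data so far: S_k / sqrt(k).
    looks = np.arange(1, n_peeks + 1)
    z = np.cumsum(increments, axis=1) / np.sqrt(looks)
    # A simulated test is a false positive if any look crosses the threshold.
    return float(np.mean(np.any(np.abs(z) > z_crit, axis=1)))

for k in (1, 5, 10, 20, 30):
    print(f"{k:>2} peeks: FPR ~ {naive_peeking_fpr(k):.3f}")
```

The estimates should land close to the rough figures listed above, with some Monte Carlo noise in the third decimal.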
A simulation that proves the trap
The easiest way to convince a skeptical PM is to run the simulation yourself. The code below generates 10,000 A/A experiments where both variants are identical, then applies a naive "stop at first significant peek" rule with a peek every 100 observations. The empirical FPR should come out in the 30-40% range, not 5%.
import numpy as np

np.random.seed(42)
trials = 10_000
false_positives = 0

for _ in range(trials):
    # Null hypothesis: no real effect, both variants share one distribution
    data_a = np.random.normal(0, 1, 10_000)
    data_b = np.random.normal(0, 1, 10_000)
    # Peek every 100 observations (100 peeks per experiment)
    for n in range(100, 10_001, 100):
        diff = data_b[:n].mean() - data_a[:n].mean()
        se = np.sqrt(2.0 / n)  # known unit variance in each arm
        z = diff / se
        if abs(z) > 1.96:  # naive p < 0.05
            false_positives += 1
            break  # stop at the first "significant" peek

print(f"Empirical FPR: {false_positives / trials:.3f}")
# Expected: roughly 0.3 to 0.4 with these 100 peeks, not 0.05

Run this once and you will see the headline number jump from 5% to 30% or more. Walk a stakeholder through it and the abstract math becomes a concrete number on their laptop. This is the single most effective tool for changing experimentation culture in a team that has been quietly peeking for months.
Five proven fixes
Most major experimentation platforms, from Optimizely to the internal tools at Meta, Netflix, and Microsoft, implement some form of the corrections below. Pick the one that matches your team's tolerance for math complexity and runtime cost.
The first option is pre-registering the sample size. Decide N in advance based on the minimum detectable effect, baseline rate, α, and target power (1 − β), then look exactly once at the end. This is the simplest correct approach and requires no special tooling. The downside is that you must wait for the full sample even when the effect is obviously large or zero, which feels wasteful under tight product timelines.
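As a rough illustration of that calculation, here is a minimal sketch using the standard normal-approximation formula for a two-sided test on two proportions. The function name, the 10% baseline, and the one-point absolute MDE are assumptions chosen for the example.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users needed per variant to detect an absolute lift of
    mde_abs over the baseline conversion rate (two-sided z-test)."""
    p1 = baseline
    p2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for target power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of per-arm variances
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

# Hypothetical inputs: 10% baseline conversion, 1 percentage point MDE.
print(sample_size_per_arm(0.10, 0.01))
```

Halving the MDE roughly quadruples the required sample, which is why the MDE is worth negotiating before launch rather than after.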
The second option is sequential testing, including SPRT, mSPRT, and always-valid inference. These procedures explicitly allow peeking while controlling the type-I error rate, adjusting the rejection boundary so the procedure as a whole still has α = 0.05. The cost is more complex code and a small efficiency loss versus a perfectly-sized fixed-horizon test, but you get the ability to stop early on a true win or a true flat.
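To make the idea concrete, here is a minimal mSPRT sketch for a difference in means. It is a simplification under stated assumptions: observations are paired into treatment-minus-control differences, the variance `sigma2` of a difference is treated as known, and the mixing prior variance `tau2` is an arbitrary illustrative choice. Production implementations handle these details differently.

```python
import numpy as np

def msprt_first_rejection(diffs, sigma2, tau2=1.0, alpha=0.05, peek_every=100):
    """Return the sample size at which the mSPRT rejects, or None if it never does.

    diffs  : 1-D array of paired differences (treatment minus control).
    sigma2 : variance of one difference, assumed known here.
    tau2   : variance of the N(0, tau2) mixing prior over the true effect.
    """
    cumulative = np.cumsum(diffs)
    for n in range(peek_every, len(diffs) + 1, peek_every):
        mean_n = cumulative[n - 1] / n
        # Log of the mixture likelihood ratio for a normal mean
        # with a normal mixing prior centered on zero.
        log_lr = (0.5 * np.log(sigma2 / (sigma2 + n * tau2))
                  + (n ** 2 * tau2 * mean_n ** 2)
                  / (2 * sigma2 * (sigma2 + n * tau2)))
        # Reject as soon as the likelihood ratio reaches 1 / alpha.
        if log_lr >= np.log(1 / alpha):
            return n
    return None

rng = np.random.default_rng(7)
sigma2 = 2.0   # difference of two unit-variance observations
trials = 1_000
false_positives = sum(
    msprt_first_rejection(rng.normal(0.0, np.sqrt(sigma2), 5_000), sigma2) is not None
    for _ in range(trials)
)
print(f"mSPRT false positive rate over {trials} A/A tests: {false_positives / trials:.3f}")
```

Even with 50 peeks per experiment, the empirical false positive rate should stay at or below 5%, because the rejection threshold is calibrated for the whole monitoring procedure rather than for a single look.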
The third option is Bonferroni correction. If you plan exactly K peeks in advance, test each at α/K. With 5 peeks at α = 0.05, each look uses α = 0.01. Easy to explain and requires no specialist library. The downside: Bonferroni is conservative, so you lose statistical power and need a larger sample to detect the same effect.
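The arithmetic is simple enough to sketch directly. The helper below (an illustrative name, not a library function) converts a planned number of peeks into the per-look z threshold for a two-sided test.

```python
from statistics import NormalDist

def bonferroni_z_threshold(alpha, n_peeks):
    """Two-sided z threshold for each look when alpha is split evenly over n_peeks."""
    per_look_alpha = alpha / n_peeks
    return NormalDist().inv_cdf(1 - per_look_alpha / 2)

for k in (1, 2, 5, 10):
    print(f"{k:>2} planned peeks: per-look alpha = {0.05 / k:.4f}, "
          f"|z| threshold = {bonferroni_z_threshold(0.05, k):.3f}")
```

With one look the threshold is the familiar 1.96; with five looks it rises to about 2.58, which is where the loss of power comes from.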
The fourth option is alpha spending, the Lan-DeMets generalization of group sequential designs such as O'Brien-Fleming and Pocock. You allocate your α budget across planned interim analyses more efficiently than Bonferroni; with an O'Brien-Fleming-type spending function you spend very little early and most of it at the final look. This is the standard in clinical trials and is implemented in R packages like rpact and gsDesign.
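A small sketch can show what "spending" means in practice. The two functions below are the commonly used Lan-DeMets spending functions of O'Brien-Fleming type and Pocock type; they return the cumulative α allowed by information fraction t (the share of the planned sample collected so far). Turning these budgets into exact z boundaries needs numerical integration, which is what dedicated packages do, so this sketch stops at the budget itself.

```python
from math import e, log
from statistics import NormalDist

def obf_type_spending(t, alpha=0.05):
    """O'Brien-Fleming-type spending: cumulative alpha at information fraction t."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / t ** 0.5))

def pocock_type_spending(t, alpha=0.05):
    """Pocock-type spending: cumulative alpha at information fraction t."""
    return alpha * log(1 + (e - 1) * t)

# Four equally spaced analyses at 25%, 50%, 75%, and 100% of the sample.
for t in (0.25, 0.50, 0.75, 1.00):
    print(f"t = {t:.2f}: OBF-type {obf_type_spending(t):.4f}, "
          f"Pocock-type {pocock_type_spending(t):.4f}")
```

Both functions reach the full α at t = 1, but the O'Brien-Fleming-type curve holds almost all of it back until late in the test, which keeps the final threshold close to the fixed-horizon one.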
The fifth option is Bayesian A/B testing. Instead of a p-value, compute the posterior probability that variant B beats variant A. The posterior is a well-defined function of the data and your prior, and does not depend on how many times you stopped to look. That makes the posterior itself peek-resistant, though you still have to specify a prior and define decision rules in terms of probability thresholds before launch.
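For conversion metrics, the posterior probability can be sketched with a Beta-Binomial model and Monte Carlo draws. The uniform Beta(1, 1) prior and the example counts are assumptions for illustration; a real analysis would pick the prior deliberately.

```python
import numpy as np

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, prior=(1.0, 1.0),
                   draws=200_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent Beta priors."""
    rng = np.random.default_rng(seed)
    a0, b0 = prior
    # Beta posterior: prior successes + conversions, prior failures + non-conversions.
    post_a = rng.beta(a0 + conv_a, b0 + n_a - conv_a, draws)
    post_b = rng.beta(a0 + conv_b, b0 + n_b - conv_b, draws)
    return float(np.mean(post_b > post_a))

# Hypothetical counts: 100/1000 conversions in A, 120/1000 in B.
print(f"P(B > A) = {prob_b_beats_a(100, 1000, 120, 1000):.3f}")
```

A decision rule such as "ship if P(B > A) > 0.95" would then be applied to this number, with the threshold fixed before the test starts.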
When stopping early is actually allowed
Not every early stop is peeking in the bad sense. There are two legitimate reasons to halt a test before the planned sample size.
The first is obvious user harm. If a guardrail metric like crash rate, latency, or revenue collapses by 30%+ in the treatment, stopping the test is an ethical decision, not a statistical one. Most experimentation platforms hard-code automated harm detection on guardrails so this stop happens without anyone needing to peek at the primary metric.
The second is business or external constraints, such as a fixed-date campaign launch, a regulatory change, or an engineering rollback. Stopping the test is fine; what is not fine is using the mid-experiment data to declare a winner. If you stopped for business reasons, you stopped, period. You do not get to retrofit a significance claim onto the truncated data.
The wrong pattern most teams accidentally fall into is "p = 0.049, looks significant, let's stop." That is not a harm stop and not a business stop. That is the peeking problem in its purest form.
Common pitfalls
The "just a quick check" pitfall feels innocent and is the most common entry point for peeking culture. A single unplanned look at the p-value with intent to stop early raises your FPR from 5% to roughly 8%, and closer to 10% if the look comes very early. The fix is not "peek less"; it is to use a procedure that explicitly accounts for the peek, either alpha spending with one interim analysis or a sequential test.
A second pitfall is stopping on a visual trend. Watching the lift line go up for three days and concluding "the effect is real, ship it" is statistically identical to peeking on p-value. Random walks routinely produce three-day trends under the null. Define stopping rules through a formal sequential procedure before launch, not through subjective dashboard intuition during the test.
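How routine are such trends? The sketch below simplifies "the lift line goes up for three days" to "the daily lift is positive three days in a row" and assumes a 14-day test with independent, symmetric daily noise. Both are modelling assumptions made for the example.

```python
import numpy as np

def share_with_three_day_run(n_days=14, n_sims=50_000, seed=0):
    """Share of A/A tests showing at least one run of three consecutive
    positive daily lifts, when the true effect is zero."""
    rng = np.random.default_rng(seed)
    # Under the null, each day's lift is equally likely to be positive or negative.
    positive = rng.standard_normal((n_sims, n_days)) > 0
    # A three-day window counts as a "trend" if all three days are positive.
    runs = positive[:, :-2] & positive[:, 1:-1] & positive[:, 2:]
    return float(np.mean(runs.any(axis=1)))

print(f"A/A tests with a three-day positive run: {share_with_three_day_run():.2f}")
```

Under these assumptions roughly two out of three null experiments show such a run at some point in two weeks, so a three-day streak carries very little evidence on its own.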
The third pitfall is ignoring peeking in the post-hoc analysis. People say "I only looked twice, that can't matter much." Two peeks take your FPR from 5% to about 8%, and three peeks push it to nearly 11%. Log every interim look at the primary metric and treat the analysis as a multiple-testing problem.
A fourth pitfall is re-launching failed experiments. Re-running a flat test on the same metric with a slight tweak and hoping for a win is a multiple comparisons problem across experiments. Treat the program as a portfolio, set a global false-discovery threshold, and use Benjamini-Hochberg to adjust across related tests.
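The Benjamini-Hochberg step-up procedure is short enough to write out. This is a plain-Python sketch; the p-values in the usage line are made up for illustration and stand in for the final p-values of related experiments.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking which hypotheses are rejected
    while controlling the false discovery rate at level q."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected

# Hypothetical p-values from ten related experiments on the same metric.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_vals))
```

In this made-up portfolio only the two smallest p-values survive, even though five of the ten are below 0.05 when read in isolation.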
The fifth pitfall is mixing Bayesian and frequentist reporting. Some teams compute a Bayesian posterior to "avoid peeking" but then quote p-values in the readout. This combines the weaknesses of both frameworks. Pick one inference framework, document the decision rule before launch, and report against that rule consistently.
How to spot peeking in your org
You usually do not need to audit logs — peeking shows up in how people talk about experiments in standups and Slack. Phrases like "the experiment is looking promising, let's ship early" are a near-certain signal that someone is reading the p-value before the planned endpoint. So is any "is this significant yet" discussion during the live test window.
A second symptom is stopping experiments early on a single threshold crossing without any pre-registered interim analysis plan. If your team has no document defining when peeks are allowed and what the adjusted thresholds are, every early stop is a peek by definition.
A third symptom is re-running tests on the same hypothesis with minor variations until one "works." This is the garden-of-forking-paths problem — structurally identical to peeking, just spread across experiments. Watch for the same metric appearing in three or four experiment writeups within a quarter.
The correct process is mechanical and boring: pre-register the metric, MDE, sample size, and stopping rules; launch; monitor only guardrails (crashes, latency, severe revenue loss); do not look at the primary metric until the end; then analyze and decide. If you need mid-experiment looks, adopt sequential testing or alpha spending up front rather than peeking and rationalizing afterwards.
Related reading
- A/B testing peeking — the mistake that fails 40% of junior PMs
- Bayesian A/B testing
- Guardrail metrics in A/B testing
- How to design an A/B test step by step
- How to calculate Bonferroni correction in SQL
FAQ
If I peek just once, how bad is the false positive rate?
A single unplanned interim look at the primary metric with intent to stop early pushes your FPR from the nominal 5% to roughly 8-10%, depending on when in the test the peek happens. It feels like a small sin, but it has already raised your error rate by more than half. If you know you need to look once, plan the look in advance and use alpha spending with two analyses so the math actually supports the decision.
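The dependence on timing can be sketched with a small simulation. It assumes a z-test with known variance, one interim look at fraction `t` of the planned sample, and a final look at the end; the function name is illustrative.

```python
import numpy as np

def one_peek_fpr(t, n_sims=400_000, z_crit=1.96, seed=0):
    """False positive rate with one naive interim look at information
    fraction t plus the final analysis, when the true effect is zero."""
    rng = np.random.default_rng(seed)
    # Scaled sums of the data before and after the interim look.
    first = rng.normal(0.0, np.sqrt(t), n_sims)
    second = rng.normal(0.0, np.sqrt(1.0 - t), n_sims)
    z_interim = first / np.sqrt(t)   # z-statistic at the peek
    z_final = first + second         # z-statistic at the planned end
    crossed = (np.abs(z_interim) > z_crit) | (np.abs(z_final) > z_crit)
    return float(np.mean(crossed))

for t in (0.1, 0.5, 0.9):
    print(f"peek at {t:.0%} of the sample: FPR ~ {one_peek_fpr(t):.3f}")
```

An early peek behaves almost like an independent second test, while a late peek overlaps so heavily with the final analysis that it adds less error.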
Can I check once per week if my test runs four weeks?
Four equally-spaced peeks under naive significance testing inflate the FPR to around 12-13%. That is unambiguously worse than the textbook 5% guarantee. If weekly checks are a hard product requirement, use a sequential testing method like mSPRT or group sequential boundaries from the start. These give you the weekly flexibility while keeping the overall α at 5%.
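For a fixed number of planned looks, one simple group sequential design uses the same stricter z threshold at every look (a Pocock-style constant boundary). The sketch below calibrates that threshold by simulation rather than by the exact numerical methods that dedicated packages use; it assumes equally spaced looks and a known-variance z-test, and the function name is hypothetical.

```python
import numpy as np

def constant_boundary(n_looks, alpha=0.05, n_sims=400_000, seed=0):
    """Monte Carlo estimate of the constant |z| threshold that keeps the
    overall false positive rate at alpha across n_looks equally spaced looks."""
    rng = np.random.default_rng(seed)
    increments = rng.standard_normal((n_sims, n_looks))
    # z-statistic at each look, using all data collected so far.
    z = np.cumsum(increments, axis=1) / np.sqrt(np.arange(1, n_looks + 1))
    # The threshold is the (1 - alpha) quantile of the largest |z| under the null.
    max_abs_z = np.abs(z).max(axis=1)
    return float(np.quantile(max_abs_z, 1 - alpha))

print(f"|z| threshold for 4 weekly looks: {constant_boundary(4):.2f}")
```

The calibrated threshold for four looks comes out well above 1.96, which is the price of being allowed to stop at any of the weekly checks.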
Does Bayesian A/B testing really avoid the peeking problem?
Yes, with two caveats. The Bayesian posterior probability that B beats A is a well-defined function of the observed data and the prior, and it does not depend on the number of times you computed it. So in that narrow sense, Bayesian inference is peek-invariant. The caveats are that the prior matters — a wildly optimistic prior will pull conclusions toward early data — and that you still need a pre-committed decision rule like "ship if P(B > A) > 0.95" to avoid moving the goalposts mid-experiment.
What is the difference between peeking and sequential testing?
Peeking is repeated naive significance testing at α = 0.05 without any correction. Sequential testing is a formal procedure where the rejection boundary is calibrated specifically to allow continuous monitoring while still controlling the overall type-I error at α. Mechanically the analyst is looking at the data multiple times in both cases, but only one of them has valid statistical guarantees.
How do big tech companies actually handle this in production?
Most large experimentation platforms implement sequential or always-valid methods under the hood. Microsoft's experimentation platform uses mSPRT, Optimizely Stats Engine uses always-valid p-values, and Netflix has published on sequential confidence intervals. The analyst sees a dashboard where the displayed p-value or confidence interval is already corrected, so peeking is safe by construction. If your in-house platform shows raw frequentist p-values updated in real time, you have a peeking risk surface and should either retrain the team or switch to a corrected statistic.
My PM still wants to ship at p = 0.049 after three days. How do I push back?
Run the simulation from earlier in this post on a real laptop with the PM watching. Seeing the empirical FPR climb from 5% to 30% or more over 10,000 simulated experiments is more persuasive than any argument about the central limit theorem. Then offer a concrete alternative: either wait to the pre-registered N, or adopt a sequential procedure that lets them peek legitimately. The conversation almost always ends with the second option, which is the right outcome.