A/B testing peeking — the mistake that fails 40% of junior PMs in interviews

Picture this: you're in a PM or product analytics interview, the interviewer asks you to walk through how you'd analyze a 2-week A/B test, and you say something like “I'd check the p-value each day to see if it's significant.”

You just failed the question. Here's why — and what to say instead.

What “peeking” means

Peeking is looking at the p-value of an A/B test before the planned sample size is reached, and stopping the test as soon as p < 0.05.

It feels rational: why wait if the data already shows significance?

The problem is that classical p-values assume one comparison at one fixed time. Every additional look is a new comparison. The math behind significance no longer applies.

The actual math

At a single planned endpoint, your false-positive rate is exactly your α — usually 5%. That means in a world where the two variants are identical, you'd falsely declare a winner 5% of the time.

Now imagine you check every day for 14 days. Even if both variants are truly identical, the probability that at least one of those 14 checks crosses p < 0.05 is:

$$P(\text{at least one false positive}) = 1 - (1 - 0.05)^{14} \approx 51\%$$

The true rate is lower than this naive calculation, because consecutive looks share most of their data and are therefore correlated. But in practice, daily peeking still pushes the effective false positive rate to 30-40%. Analyses by Optimizely and Evan Miller both confirmed this empirically.

So when you peek and stop early, you're not running “a 5% false positive test.” You're running “a 30-40% false positive test.”
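
You don't have to take the 30-40% number on faith. Here's a minimal Monte Carlo sketch that runs A/A tests, peeks daily with a two-proportion z-test, and stops at the first p < 0.05; the traffic numbers and the 10% base conversion rate are assumptions for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

SIMS = 2_000           # simulated A/A tests
DAYS = 14              # one peek per day
USERS_PER_DAY = 500    # per variant per day (assumed for the sketch)
RATE = 0.10            # identical conversion rate in both variants

false_positives = 0
for _ in range(SIMS):
    a = rng.binomial(1, RATE, DAYS * USERS_PER_DAY)
    b = rng.binomial(1, RATE, DAYS * USERS_PER_DAY)
    for day in range(1, DAYS + 1):
        n = day * USERS_PER_DAY
        # Two-proportion z-test on everything accumulated so far
        pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        p = 2 * stats.norm.sf(abs((a[:n].mean() - b[:n].mean()) / se))
        if p < 0.05:
            false_positives += 1  # the peeker "ships the winner" and stops
            break

print(f"Effective false positive rate: {false_positives / SIMS:.1%}")
# Lands far above the nominal 5% -- roughly 25-35% with these settings,
# and higher with more frequent looks or longer tests.
```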

Why interviewers love this question

It separates candidates who memorized “p < 0.05 means significant” from candidates who understand why the threshold matters. It's a single concept that exposes whether you actually ran A/B tests in production or just read about them.

It also has a clean follow-up: “How would you fix it?” — which lets the interviewer see how deep your statistics knowledge goes.

How a senior PM answers

A strong answer covers three things:

  1. Recognize the problem. “Peeking inflates the false positive rate. If I check daily for two weeks, my effective alpha is more like 30-40%, not 5%.”
  2. Explain why. “Classical p-values assume one comparison at one pre-registered time. Every additional look is an opportunity for noise to cross the threshold.”
  3. Offer a fix. Choose one of:
    • Pre-register the sample size and don't look until you hit it.
    • Sequential testing with alpha-spending — methods like O'Brien-Fleming or Pocock boundaries adjust the threshold for each check.
    • Bayesian A/B testing — uses posterior probability instead of p-values; peeking is mathematically valid (see the sketch after this list).
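
To make the third option concrete, here's a minimal sketch of the Bayesian check using Beta-Bernoulli conjugacy; the counts and the flat Beta(1, 1) prior are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical accumulated data at the moment you look
conv_a, n_a = 480, 5_000   # control: conversions, users
conv_b, n_b = 535, 5_000   # treatment

# Beta(1, 1) prior + binomial likelihood -> Beta posterior per variant
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)

# Probability that treatment truly beats control, given the data so far
print(f"P(B > A | data) = {(post_b > post_a).mean():.1%}")
```

Because the posterior is a statement about the parameters given whatever data you've seen so far, recomputing it at each look doesn't invalidate it the way repeated p-value checks do.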

The interview trap inside the trap

Many candidates know about peeking but then suggest fixing it with Bonferroni correction. This is wrong.

Bonferroni is designed for multiple distinct comparisons (e.g., testing 10 metrics in one experiment). Peeking is sequential: the same comparison repeated on accumulating data, where each look shares most of its data with the last. Bonferroni ignores that correlation, so it controls the error rate only by being far too conservative. The math is different.

If you mention Bonferroni for peeking, you'll lose senior-level credibility. Use alpha-spending or sequential methods instead.
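
A quick way to see the difference: under the null, the z-statistic at look k behaves like a scaled random walk, so consecutive looks are strongly correlated. A rough sketch, assuming equal daily traffic and a normal approximation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
LOOKS, SIMS = 14, 100_000

# Under H0 with equal daily batches, the z-statistic at look k is
# (sum of k iid standard normals) / sqrt(k): a scaled random walk.
walk = rng.standard_normal((SIMS, LOOKS)).cumsum(axis=1)
z = np.abs(walk / np.sqrt(np.arange(1, LOOKS + 1)))

def overall_alpha(per_look_p):
    """Chance that any of the 14 looks crosses a constant two-sided threshold."""
    return (z > stats.norm.ppf(1 - per_look_p / 2)).any(axis=1).mean()

print(overall_alpha(0.05))       # naive peeking: ~30%, not 5%
print(overall_alpha(0.05 / 14))  # Bonferroni per look: well under 5%
```

Bonferroni keeps you under 5%, but only by overshooting: because the looks are heavily correlated, it leaves alpha, and therefore power, on the table. Alpha-spending boundaries such as O'Brien-Fleming are calibrated to spend exactly 5% across the planned looks.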

Drill this before your next interview

Practice articulating the peeking concept in 60 seconds out loud:

  • “Peeking inflates false positives because each check is a fresh shot at noise crossing the threshold.”
  • “At 14 daily checks, effective alpha is 30-40%, not 5%.”
  • “Fix it with pre-registered sample size, sequential testing with alpha-spending, or a Bayesian approach.”

If you can say that in your sleep, you're ahead of most candidates.

Other A/B testing traps to know

Once you nail peeking, interviewers go deeper. Make sure you can also handle:

  • Sample Ratio Mismatch (SRM) — when your 50/50 split actually produced 48/52. It's a bug signal, not noise (see the chi-square check after this list).
  • Multiple comparisons — testing 10 metrics at alpha = 0.05 gives a ~40% chance of at least one false positive. Fix with Bonferroni or FDR control.
  • Confidence interval overlap — overlapping CIs do not mean “no significant difference.” Look at the CI of the difference instead.
  • Simpson's paradox — segment-level effects can reverse the aggregate. Always check key segments.
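
For the SRM check mentioned above, a chi-square goodness-of-fit test against the intended split is the standard tool; the counts here are hypothetical:

```python
from scipy import stats

# Hypothetical assignment counts: a 50/50 split that came out 48/52
observed = [9_600, 10_400]      # users who landed in control, treatment
expected = [10_000, 10_000]     # what the intended 50/50 split predicts

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"SRM check: chi2 = {chi2:.1f}, p = {p:.2e}")
# A tiny p-value means the assignment itself is broken (redirect bugs,
# bot filtering, logging loss): investigate before reading any metric.
```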

For 300+ A/B testing questions with worked solutions across these traps, join the waitlist for naildd.

FAQ

Is peeking always wrong?

Not if you use a method designed for repeated looks (alpha-spending, mSPRT, or a Bayesian approach). Those methods adjust for the fact that you're looking multiple times. What's wrong is pairing classical fixed-horizon p-values with peeking.

What if my test reached significance very early — can I stop?

In classical frequentist testing: no. Stick to your pre-registered sample size. In sequential testing: yes, but the threshold for “significant” is stricter than 0.05 — that's how the math accounts for early looks.

Does this apply to one-sided tests too?

Yes. Peeking is a problem regardless of one-sided or two-sided tests. The false positive rate inflates the same way.