Sequential testing explained simply
Why this matters
Classic A/B doctrine says fix your sample size in advance and resist the urge to look at the dashboard. That advice is correct for a fixed-horizon test, but it is also expensive. If a feature is a runaway winner on day three, waiting two more weeks for the pre-registered N hurts the control arm and slows the roadmap. If it is a clear loser, every extra day is wasted exposure for the variant arm.
Sequential testing is the family of methods that lets you safely look at results as they accumulate and stop the experiment as soon as the evidence is conclusive. Netflix, Microsoft, and Stripe all use sequential approaches in production, and Optimizely built its Stats Engine around them. When you run thousands of experiments per quarter, every day saved compounds.
A senior data scientist or product analyst interview at Meta, Airbnb, or DoorDash will probe this area with some version of "what is wrong with peeking" followed by "how would you fix it". If you can name two methods and explain what they correct, you are ahead of most candidates.
The short version
In a standard fixed-horizon test, you decide the sample size up front, you run until you hit it, and you compute one p-value at the end. The 5 percent false positive rate is preserved because you make exactly one decision.
Sequential testing replaces that single decision with a procedure that adapts as data arrives. The math compensates for the fact that you are making many decisions, so the overall false positive rate is still controlled at 5 percent even if you check the dashboard every morning. The price is some statistical power at any given sample size, or equivalently a somewhat larger maximum sample size. In exchange you gain the right to stop early when the effect is large or obviously absent.
The peeking problem
If you run a standard t-test with alpha set to 0.05 and you only ever check once at the end, your long-run false positive rate is 0.05. That is the entire promise of the procedure.
Now imagine you peek and apply the same 0.05 threshold every day. Each check is correlated with the previous one, but each adds a new opportunity to cross the line. With five peeks the cumulative false positive rate climbs to roughly 14 percent. With ten peeks it is around 19 percent. If you peek every single day across a four-week experiment, the rate is in the neighborhood of 25 to 30 percent, and it keeps climbing toward certainty as the number of looks grows without bound. Run long enough and look often enough, and you are essentially guaranteed to declare a winner somewhere along the way even when both arms are identical.
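The inflation is easy to check with a small simulation. The sketch below is an illustration under simplifying assumptions, not a production tool: it models an A/A test with equally sized batches and a known-variance z-test, so the z-statistic at look k is a sum of k independent standard normal increments divided by the square root of k. The function name and its parameters are made up for this example.

```python
import math
import random


def peeking_false_positive_rate(n_looks, n_sims=20000, z_crit=1.96, seed=0):
    """Estimate how often an A/A test is declared significant at ANY of
    n_looks equally spaced looks when every look reuses the same threshold."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        running_sum = 0.0
        for look in range(1, n_looks + 1):
            # Under the null, each batch adds an independent N(0, 1) increment,
            # so the z-statistic at this look is running_sum / sqrt(look).
            running_sum += rng.gauss(0.0, 1.0)
            if abs(running_sum) / math.sqrt(look) > z_crit:
                false_positives += 1
                break  # the experimenter stops at the first "significant" look
    return false_positives / n_sims


if __name__ == "__main__":
    for looks in (1, 5, 10, 28):
        print(looks, round(peeking_false_positive_rate(looks), 3))
```

With one look the estimate should sit near the nominal 5 percent. Adding looks pushes it up even though nothing about the data has changed, which is the whole peeking problem in a dozen lines.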
This is the formal version of the p-hacking story. You are not cheating on purpose. You are just applying a procedure that was designed for one decision to a situation where you make many. The fix is not to summon willpower and stop peeking. The fix is to use a procedure that is built for many decisions in the first place. That is what sequential testing buys you. For a deeper walkthrough of the failure mode, see the A/B testing peeking mistake explainer.
Methods of sequential testing
Always Valid Inference
AVI produces a p-value process that is valid at every moment of the experiment, no matter how many times you look. The p-value at time t is constructed using a mixture or martingale that already accounts for the optional stopping. You can check after every user, every hour, or only at the end, and the 5 percent guarantee holds. Optimizely Stats Engine is built on this idea.
SPRT
The Sequential Probability Ratio Test was proposed by Abraham Wald in the 1940s. After each observation you update a log-likelihood ratio comparing two simple hypotheses, typically a specified effect against the null. You stop when the ratio crosses an upper or lower bound chosen to give you the desired type I and type II error rates. SPRT is optimal when both hypotheses are point hypotheses, but real A/B tests usually have a composite alternative, so it is rarely used in its pure form today.
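To make the mechanics concrete, here is a minimal sketch of Wald's SPRT for a binary outcome, assuming two point hypotheses on the conversion rate (p0 under the null, p1 under the alternative) and Wald's approximate boundaries log((1 - beta) / alpha) and log(beta / (1 - alpha)). The function name and return labels are illustrative choices, not taken from any library.

```python
import math


def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.20):
    """Wald's SPRT for H0: p = p0 against H1: p = p1 on a stream of 0/1 outcomes.

    Returns (decision, n_used), where decision is one of
    "reject_h0", "accept_h0", or "continue" (data ran out before a boundary).
    """
    upper = math.log((1 - beta) / alpha)  # crossing above: reject H0
    lower = math.log(beta / (1 - alpha))  # crossing below: accept H0
    step_success = math.log(p1 / p0)
    step_failure = math.log((1 - p1) / (1 - p0))

    llr = 0.0  # running log-likelihood ratio of H1 versus H0
    n = 0
    for n, outcome in enumerate(observations, start=1):
        llr += step_success if outcome else step_failure
        if llr >= upper:
            return "reject_h0", n
        if llr <= lower:
            return "accept_h0", n
    return "continue", n
```

For example, `sprt_bernoulli([0] * 50, p0=0.10, p1=0.20)` accepts the null after a short run of failures, without ever reading the rest of the stream. The weakness described above is visible in the signature: you have to commit to a single p1 up front.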
Alpha spending
Alpha spending divides your total alpha budget across a planned set of interim looks. If you allow yourself five looks, you might spend 0.01 at each, or you can shape the spend with a Lan-DeMets spending function, such as the O'Brien-Fleming-type function that saves most of the alpha for later looks. This approach is standard in clinical trials, where the looks are scheduled by an independent monitoring board.
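The difference between spending shapes is easier to see with numbers. The sketch below evaluates the two textbook Lan-DeMets spending functions, the O'Brien-Fleming type and the Pocock type, at equally spaced information fractions. It only reports how much alpha is spent at each look. Turning those amounts into critical values takes numerical integration over the joint distribution of the looks, which is what packages like gsDesign do and which is not attempted here. The function names are illustrative.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def obrien_fleming_spend(t, alpha=0.05):
    """Cumulative alpha spent at information fraction t (0 < t <= 1),
    Lan-DeMets O'Brien-Fleming-type spending function."""
    z = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 2 * (1 - _STD_NORMAL.cdf(z / math.sqrt(t)))


def pocock_spend(t, alpha=0.05):
    """Cumulative alpha spent at information fraction t (0 < t <= 1),
    Lan-DeMets Pocock-type spending function."""
    return alpha * math.log(1 + (math.e - 1) * t)


def incremental_spend(spend_fn, n_looks, alpha=0.05):
    """Alpha newly spent at each of n_looks equally spaced looks."""
    cumulative = [spend_fn(k / n_looks, alpha) for k in range(1, n_looks + 1)]
    previous = [0.0] + cumulative[:-1]
    return [c - p for c, p in zip(cumulative, previous)]


if __name__ == "__main__":
    for fn in (obrien_fleming_spend, pocock_spend):
        print(fn.__name__, [round(a, 5) for a in incremental_spend(fn, 5)])
```

Both schedules add up to the full 0.05 at the final look. The O'Brien-Fleming type spends almost nothing early and most of the budget late, while the Pocock type front-loads the spend.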
mSPRT
Mixture SPRT replaces the single point alternative of classical SPRT with a prior over effect sizes, typically a normal mixture centered on zero. The result is always valid, similar in spirit to AVI, but easier to reason about and tune. mSPRT is the workhorse inside several modern experimentation platforms, including the one used by Stripe.
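As a rough illustration of how the mixture turns into an always-valid p-value, the sketch below handles the simplest case: a stream of observations (for example paired treatment-minus-control differences) assumed normal with known standard deviation `sigma`, and a normal mixing prior with standard deviation `tau` centered on the null value. Under those assumptions the mixture likelihood ratio has a closed form, and the p-value is the running minimum of its reciprocal. The function and parameter names are made up for this example, and real platforms add a lot around this core (variance estimation, binary metrics, two-sample bookkeeping).

```python
import math


def msprt_p_values(observations, sigma, tau, theta0=0.0):
    """Always-valid p-values from a normal-mixture SPRT.

    Assumes observations are i.i.d. N(theta, sigma^2) with sigma known, and
    mixes the alternative over theta - theta0 ~ N(0, tau^2).
    Returns the p-value after each observation.
    """
    p_value = 1.0
    total = 0.0  # running sum of (x - theta0)
    p_values = []
    for n, x in enumerate(observations, start=1):
        total += x - theta0
        denom = sigma ** 2 + n * tau ** 2
        # Closed-form log of the mixture likelihood ratio after n observations.
        log_lr = (0.5 * math.log(sigma ** 2 / denom)
                  + (tau ** 2 * total ** 2) / (2 * sigma ** 2 * denom))
        # The p-value never increases: running minimum of 1 / likelihood ratio.
        p_value = min(p_value, math.exp(-log_lr))
        p_values.append(p_value)
    return p_values
```

You can reject the null the first time the returned p-value drops below alpha, whenever that happens. The choice of `tau` does not affect validity, only how quickly effects of a given size are detected.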
Worked example: mSPRT
Below is a pseudo-code sketch that captures the daily-update loop most teams write around an mSPRT primitive. The exact statistic depends on whether your metric is binary, continuous, or a ratio, but the control flow is the same.
```python
# Daily update loop for an mSPRT-based A/B test
from sequential_testing import mSPRT

test = mSPRT(
    alpha=0.05,           # overall type I error
    beta=0.20,            # overall type II error at the prior mean
    metric="conversion",  # binary outcome here
    prior_sd=0.02,        # expected lift scale
)

while not test.decided():
    yesterdays_users = fetch_assignment_and_outcome(date=yesterday())
    test.update(yesterdays_users)
    if test.reject_null():
        notify("Stop: lift detected with always-valid p < 0.05")
        break
    if test.accept_null():
        notify("Stop: futility boundary crossed, no detectable lift")
        break
    if days_running() > max_days:
        notify("Stop: max horizon reached, report current confidence sequence")
        break
```

Three details matter in production. The update step should be idempotent so the daily job can rerun safely. The metric must be locked before the test starts, because changing it mid-test invalidates the always-valid guarantee. The stopping rule and the maximum horizon together determine effective power, so you still need a power analysis at design time. See power analysis explained simply.
When to use and when to skip
Sequential testing pays off when you run a lot of experiments, when traffic accumulates quickly, and when the cost of waiting is real. A consumer product at DoorDash or Uber that ships dozens of pricing experiments per quarter is the canonical use case. A platform team at Snowflake or Databricks that experiments on developer-facing changes with high traffic and clear telemetry is another.
It is the wrong tool when traffic is thin, when the effect of interest is small and slow to emerge, or when novelty effects dominate the first week of any new feature. With low traffic, the sequential machinery still requires real sample sizes to detect small lifts, and you end up running the test almost as long as the fixed-horizon alternative anyway. With novelty effects, an early stop on day three would catch the curiosity bump rather than the steady state, which is exactly the wrong reading. Some teams handle this by enforcing a minimum lookback window before the sequential rule is allowed to fire.
A useful intermediate option is a holdout. If your goal is post-launch measurement rather than a go or no-go decision, see holdout vs A/B testing in practice for the trade-offs.
Common pitfalls
The first common pitfall is mixing stopping rules midway through an experiment. Teams sometimes start with a fixed-horizon plan, peek anyway because the dashboard is right there, then switch to sequential language after the fact to justify an early stop. This is a worse procedure than either pure approach, because the actual stopping rule is whatever felt natural at the time. The fix is to commit to a sequential rule in the experiment design document, including the maximum horizon and the metric, and to enforce it through tooling rather than discipline.
The second pitfall is ignoring novelty effects when stopping early. Conversion rates, click-through rates, and engagement metrics on new features often spike in the first few days as curious users try the variant, then settle to a lower steady-state level. A sequential rule that fires on day two will lock in the curiosity reading. The standard fix is a minimum window of seven to fourteen days before the rule is allowed to declare a winner, plus a guardrail that re-checks the metric over the second week of any winner before full rollout.
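A minimum window is straightforward to enforce in the daily job. The sketch below is one hypothetical way to wire it, assuming you already have an always-valid p-value for the current day. The function name, the return labels, and the default of 7 and 28 days are illustrative choices.

```python
def stopping_decision(day, p_value, alpha=0.05, min_days=7, max_days=28):
    """Decide what the daily job should do, given an always-valid p-value.

    day is the number of full days the experiment has been running.
    """
    if day < min_days:
        # Too early: novelty effects may still dominate, so never stop here.
        return "keep_running"
    if p_value < alpha:
        return "stop_reject_null"
    if day >= max_days:
        return "stop_max_horizon"
    return "keep_running"
```

Because an always-valid p-value holds at any stopping time, refusing to act on it during the first week does not break the false positive guarantee. It only makes the procedure more conservative.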
The third pitfall is using sequential methods on metrics that are not designed for them. Ratio metrics like average order value or revenue per user have variance that depends on the assignment and that compounds across days. Naive sequential procedures assume independent and identically distributed observations and will misfire. The fix is to use the delta method or a paired bootstrap for the variance, or to switch to a metric definition that respects the design. The bootstrap in A/B testing explainer covers the variance side in more depth.
The fourth pitfall is treating sequential testing as a substitute for variance reduction. If your metric is noisy, sequential machinery cannot rescue you; the test will just take a long time. Variance reduction techniques like CUPED apply cleanly to sequential tests, and the two are complementary. See CUPED explained simply for the basic mechanic.
The fifth pitfall is forgetting that the futility boundary matters as much as the rejection boundary. Many teams configure rejection thresholds carefully and leave the futility side at defaults, then wonder why their tests run for the full horizon when the effect is clearly zero. Tighten both sides at design time.
Interview answers
"What is sequential testing?" It is a family of A/B procedures that let you check results repeatedly during an experiment and stop early when the evidence is strong, without inflating the false positive rate above the agreed level.
"Why can you not peek in a standard test?" Each look adds a new chance to cross the rejection threshold. With ten daily peeks at alpha 0.05, your effective false positive rate is closer to 19 percent than 5 percent.
"How does it correct for that?" Three flavors. Always Valid Inference builds p-values that are valid at every moment by design. SPRT and mSPRT compare likelihood ratios against pre-specified boundaries. Alpha spending divides a fixed budget across a planned schedule of looks.
"When would you use it in practice?" High-traffic platforms running many experiments per quarter where shipping speed matters. I would still use a fixed-horizon test for low-traffic cases, novelty-prone metrics, and first-of-its-kind features where I want a clean post-mortem.
"What is the main risk?" Stopping early on noise when novelty effects dominate the first week. Mitigate with a minimum observation window, a maximum horizon, and a held-back validation slice that confirms the effect after the rule fires.
Related reading
- A/B testing peeking mistake
- Peeking problem A/B test
- P-value explained simply
- CUPED explained simply
- Power analysis explained simply
- Bayesian A/B test interview recipe
FAQ
Is sequential testing always safer than a fixed-horizon test?
Only if it is implemented correctly and the stopping rule is committed to before the test starts. A poorly implemented sequential procedure is worse than a clean fixed-horizon test, because the apparent license to peek encourages the behavior the math is supposed to prevent. The most common failure mode is teams switching rules mid-test, which destroys the false-positive guarantee.
Is Bayesian A/B testing the same thing?
They overlap but are not identical. A Bayesian test produces a posterior at every moment, so you can stop when the posterior probability of a positive effect crosses a threshold. The differences are that you specify a prior, your stopping criterion is a probability rather than a p-value, and frequentist error rates depend on your prior and threshold combination. See the Bayesian A/B test interview recipe for the Bayesian framing.
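For a binary metric, the posterior probability in question can be estimated in a few lines. This is a minimal sketch assuming independent Beta priors on each arm's conversion rate (uniform by default) and Monte Carlo draws from the two posteriors. The function name and arguments are made up for this example.

```python
import random


def prob_variant_beats_control(conv_c, n_c, conv_v, n_v,
                               draws=20000, seed=0, prior=(1.0, 1.0)):
    """Monte Carlo estimate of P(variant rate > control rate | data)
    under independent Beta(prior) priors on each arm's conversion rate."""
    rng = random.Random(seed)
    a, b = prior
    wins = 0
    for _ in range(draws):
        # Beta posterior: prior pseudo-counts plus observed successes/failures.
        rate_c = rng.betavariate(a + conv_c, b + n_c - conv_c)
        rate_v = rng.betavariate(a + conv_v, b + n_v - conv_v)
        wins += rate_v > rate_c
    return wins / draws
```

A team might stop when this probability crosses, say, 0.95. As noted above, what that threshold implies for the frequentist false positive rate depends on the prior and on how often you check.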
What open-source libraries can I use?
For Python, confseq implements confidence sequences and always-valid inference. The expan package covers basic sequential procedures. For R, gsDesign covers group sequential designs and alpha spending, the standard in clinical trials. Most large companies write a thin wrapper around one of these primitives plus their internal metric pipeline, because the orchestration is usually more code than the statistic.
How is this different from group sequential design in clinical trials?
Group sequential design is the clinical-trial flavor of the same idea, with pre-planned interim analyses, an alpha spending function, and an independent monitoring committee that executes the looks. Modern web A/B testing inherits the math but loosens the operations: looks happen continuously, the monitor is a dashboard, and the spending function is replaced by an always-valid procedure or an mSPRT statistic.
Can I combine sequential testing with variance reduction like CUPED?
Yes, and you usually should. CUPED regresses out a pre-experiment covariate to shrink the confidence interval at every point in time, so the sequential procedure crosses its boundaries sooner. The two techniques are mathematically independent and compose cleanly. The only catch is that the covariate window must be frozen before the test starts.
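The adjustment itself is short. Here is a minimal sketch, assuming one pre-experiment covariate value per user and a coefficient estimated on the pooled data from both arms. The function name is illustrative.

```python
from statistics import fmean, pvariance


def cuped_adjust(metric, covariate):
    """Return CUPED-adjusted metric values.

    metric: in-experiment value per user.
    covariate: pre-experiment value for the same users, in the same order.
    """
    mean_x = fmean(covariate)
    mean_y = fmean(metric)
    cov_xy = fmean([(x - mean_x) * (y - mean_y)
                    for x, y in zip(covariate, metric)])
    theta = cov_xy / pvariance(covariate)
    # Subtract the part of the metric explained by the covariate.
    # Centering on mean_x keeps the overall mean unchanged.
    return [y - theta * (x - mean_x) for x, y in zip(covariate, metric)]
```

The adjusted values then feed the sequential statistic in place of the raw metric. The stronger the correlation between covariate and metric, the larger the variance reduction and the sooner a boundary is crossed.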
Does it work for ratio metrics like ARPU or AOV?
It can, but you have to handle the variance correctly. Ratio metrics have variance that depends on numerator, denominator, and their correlation, so a naive standard error understates the noise. The delta method gives a closed-form approximation and is the most common production approach. A paired bootstrap is the safer fallback when the delta method assumptions are shaky.
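As a sketch of the delta-method calculation for a single arm, assume you have one numerator total and one denominator total per user (for example revenue and order count), with users independent of each other. The function name is illustrative.

```python
import math
from statistics import fmean


def ratio_metric_se(numerators, denominators):
    """Delta-method standard error for the ratio mean(num) / mean(den),
    using one (numerator, denominator) pair per randomization unit."""
    n = len(numerators)
    mean_x = fmean(numerators)
    mean_y = fmean(denominators)
    var_x = sum((x - mean_x) ** 2 for x in numerators) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in denominators) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(numerators, denominators)) / (n - 1)
    ratio = mean_x / mean_y
    # First-order Taylor expansion of X-bar / Y-bar around the true means.
    var_ratio = (var_x - 2 * ratio * cov_xy + ratio ** 2 * var_y) / (n * mean_y ** 2)
    return ratio, math.sqrt(max(var_ratio, 0.0))  # guard against tiny negatives
```

When every denominator equals 1, this reduces to the usual standard error of a mean, which is a quick sanity check on any implementation.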