May 18, 2026·13 min read

Holdout vs A/B testing in practice

Prep A/B testing and statistics

300+ questions on experiment design, sample size, p-values, and pitfalls.

Contents:

What a holdout actually is
Why a holdout exists at all
Designing a holdout experiment
When to pick A/B vs holdout
Worked examples by team
Operating a holdout day-to-day
Common pitfalls
Related reading
FAQ

What a holdout actually is

A holdout is a slice of users, usually 1 to 5 percent, held back from a feature (or from a whole set of launches) while the remaining 95 to 99 percent see the rollout. The same users stay in the holdout for months or years, and the team compares headline metrics between holdout and exposed users at quarterly check-ins.

In a classic A/B, the treatment and control groups are roughly balanced, the test runs two to four weeks, and the question is "which version do we ship". A holdout makes the split deliberately asymmetric, with a tiny control and a huge treatment, and the question becomes "how much is the feature actually worth, three months and six months from now". A/B-only programs get good at picking winners on short windows but cannot answer whether the last six months of shipping moved the company-level needle.

Why a holdout exists at all

The honest reason holdouts exist is that short-window A/B tests systematically over-credit launches. Four failure modes recur.

Long-term effect drift. A new push notification can lift week-one retention by 4 percent and look clean. Six months later, push fatigue has accumulated and app retention is down 3 percent. The A/B was valid — it just measured a window where the negative effect had not surfaced. A six-month holdout catches the reversal.

Cumulative effect of many features. A team at Notion or Linear might ship 50 small features in a half-year. Each A/B reads +0.5 to +2 percent. By naive arithmetic, stacking 50 such lifts should make the product at least 25 to 30 percent better, and far more if most launches land near the top of that range. In practice gains overlap, launches cannibalize each other, and total retention is often flat. Without a holdout, the counterfactual disappeared the day each feature shipped.

Incrementality of paid surfaces. Performance marketing teams at DoorDash, Uber, and Stripe live and die on this. An ad campaign reports +10 percent attributed conversions, but a meaningful share would have happened anyway. A geo or user-level holdout that gets zero ads is the only credible way to separate incremental sales from sales that would have happened without the spend.

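SUTVA violations. The stable unit treatment value assumption (one user's outcome is unaffected by what other users see) underpins every A/B. Social features, marketplace pricing, recommendations, and anything users discuss outside the app break it. A small persistent holdout limits the contamination rather than pretending it does not exist: with 95 to 99 percent of users exposed, nearly everyone a treated user interacts with is also treated. Holdout users can still be influenced by treated peers, so when interference is heavy a geo or switchback design is the safer choice.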
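The naive arithmetic in the cumulative-effect case above can be made concrete. The sketch below assumes every one of the 50 launches keeps its full A/B lift forever, which is exactly the assumption a holdout is meant to test; the function name is illustrative.

```python
# Naive stacking of per-launch A/B lifts, using the illustrative numbers from the text.

def stacked_lift(per_launch_lift: float, launches: int) -> tuple[float, float]:
    """Return (additive, compounded) total lift if every launch kept its A/B lift."""
    additive = per_launch_lift * launches
    compounded = (1.0 + per_launch_lift) ** launches - 1.0
    return additive, compounded


low_add, low_comp = stacked_lift(0.005, 50)   # every launch at +0.5 percent
high_add, high_comp = stacked_lift(0.02, 50)  # every launch at +2 percent

print(f"+0.5% x 50: additive {low_add:.0%}, compounded {low_comp:.0%}")
print(f"+2.0% x 50: additive {high_add:.0%}, compounded {high_comp:.0%}")
```

Even the low end of the range implies a lift well above 20 percent, so a flat company-level metric is strong evidence that the individual A/B reads do not add up.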

Designing a holdout experiment

Picking the group

The holdout is a random slice of the eligible population, 1 to 5 percent depending on scale. At Netflix, Meta, or Amazon scale, 1 percent is still millions of users. At a Series B startup with 200,000 monthly actives, you need 5 percent or you will not detect anything below a 5 percent lift.

The group must be locked in. A user randomized into the holdout this quarter stays there next quarter, until you explicitly refresh the pool. The group must also be isolated from other experiments by default — every A/B treatment a holdout user receives muddies the signal.
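One common way to get a locked-in group is to derive membership from a salted hash of the user ID, so the same user always lands in the same place. This is a minimal sketch, not a description of any particular company's bucketing service: the salt, the 5 percent fraction, and the function names are all assumptions. In practice the result would typically be computed once and stored on the user record.

```python
import hashlib

HOLDOUT_SALT = "holdout-2026"   # hypothetical salt; change it only on a deliberate refresh
HOLDOUT_FRACTION = 0.05         # 5 percent, the upper end of the range in the text


def in_holdout(user_id: str, salt: str = HOLDOUT_SALT,
               fraction: float = HOLDOUT_FRACTION) -> bool:
    """Map a user to a stable point in [0, 1) and compare it with the holdout fraction."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    point = int(digest[:15], 16) / 16**15  # first 60 bits as a uniform fraction
    return point < fraction


def eligible_for_ab(user_id: str) -> bool:
    """Holdout users are excluded from regular A/B tests by default."""
    return not in_holdout(user_id)


users = [f"user-{i}" for i in range(100_000)]
share = sum(in_holdout(u) for u in users) / len(users)
print(f"holdout share: {share:.3%}")
```

Because the assignment depends only on the salt and the user ID, re-running the function never moves a user between arms. Rotating the salt is the explicit pool refresh.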

Duration

Three months is the practical minimum. Six to twelve months is standard. For incrementality on paid marketing, holdouts often run for two or more years on an evergreen basis, rotating membership annually but keeping the slot permanent.

Metrics

Read out the company's headline metrics: D30/D90 retention, weekly actives, revenue per user, NPS, complaint rate. The interesting reads are trends, not snapshots — does the gap between holdout and exposed widen or narrow over time? A widening gap means the feature stack keeps compounding. A flat gap after month three means the early wins faded.
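One simple way to turn "widening or flat" into a number is the least-squares slope of the monthly gap. The sketch below uses made-up gap series and a hypothetical helper name; a real readout would also need uncertainty on each monthly gap.

```python
def gap_slope(gaps: list[float]) -> float:
    """Least-squares slope of the exposed-minus-holdout gap, per month."""
    n = len(gaps)
    if n < 2:
        raise ValueError("need at least two monthly gaps to fit a slope")
    x_mean = (n - 1) / 2
    y_mean = sum(gaps) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(gaps))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


# Made-up monthly gaps in D30 retention (exposed minus holdout, absolute points).
widening = [0.004, 0.007, 0.011, 0.014, 0.018, 0.021]
flat_after_three = [0.004, 0.009, 0.012, 0.012, 0.012, 0.012]

print(f"widening slope: {gap_slope(widening):.4f} per month")
print(f"slope from month three on: {gap_slope(flat_after_three[2:]):.4f} per month")
```

Fitting the slope only on the months after the initial ramp separates "still compounding" from "early wins that stopped growing".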

Power

Holdouts are usually underpowered for small effects, and that is fine — they target company-level drift, not 0.3 percent changes. Before launch, compute the MDE at 80 percent power against your actual baseline. If your MDE is 3 percent and you are hunting sub-1 percent moves, you need a bigger holdout or a longer window.
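The MDE check can be sketched with the standard normal approximation for a two-sided test on a proportion with unequal arms. The 30 percent D30 baseline is an assumption chosen for illustration, and the function name is hypothetical; substitute your own baseline and arm sizes.

```python
from math import sqrt
from statistics import NormalDist


def mde_relative(baseline_rate: float, n_holdout: int, n_exposed: int,
                 alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate relative MDE for a two-sided test on a proportion with unequal arms."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = baseline_rate * (1 - baseline_rate)
    se = sqrt(variance * (1 / n_holdout + 1 / n_exposed))
    return (z_alpha + z_power) * se / baseline_rate


# Assumed 30 percent D30 baseline; 200,000 monthly actives as in the startup example.
for share in (0.01, 0.05):
    n_h = round(200_000 * share)
    mde = mde_relative(0.30, n_h, 200_000 - n_h)
    print(f"{share:.0%} holdout: relative MDE ~ {mde:.1%}")
```

Under these assumptions a 5 percent holdout lands at a relative MDE in the 4 to 5 percent range, while a 1 percent holdout is roughly twice as coarse. The small arm dominates the standard error, so growing the exposed side barely helps.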

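```sql
-- Quarterly readout sketch: monthly ARPU and D30 retention by arm.
-- Note: only users with at least one event in a month are counted,
-- so "arpu" here is revenue per active user.
WITH per_user AS (
  -- Collapse events to one row per user per month so heavy users
  -- do not get extra weight in the averages.
  SELECT
    user_id,
    CASE WHEN bucket = 'holdout' THEN 'holdout' ELSE 'exposed' END AS arm,
    DATE_TRUNC('month', activity_date) AS month,
    SUM(revenue_usd) AS revenue_usd,
    MAX(CASE WHEN is_retained_d30 THEN 1.0 ELSE 0.0 END) AS retained_d30
  FROM events
  WHERE activity_date >= DATE '2026-01-01'
  GROUP BY 1, 2, 3
),
agg AS (
  SELECT arm, month,
         COUNT(*) AS users,
         AVG(revenue_usd) AS arpu,
         AVG(retained_d30) AS d30
  FROM per_user
  GROUP BY 1, 2
)
SELECT
  month,
  MAX(CASE WHEN arm = 'exposed' THEN users END) AS exposed_users,
  MAX(CASE WHEN arm = 'holdout' THEN users END) AS holdout_users,
  MAX(CASE WHEN arm = 'exposed' THEN arpu END) AS exposed_arpu,
  MAX(CASE WHEN arm = 'holdout' THEN arpu END) AS holdout_arpu,
  MAX(CASE WHEN arm = 'exposed' THEN arpu END)
    - MAX(CASE WHEN arm = 'holdout' THEN arpu END) AS arpu_lift,
  MAX(CASE WHEN arm = 'exposed' THEN d30 END)
    - MAX(CASE WHEN arm = 'holdout' THEN d30 END) AS d30_lift
FROM agg
GROUP BY month
ORDER BY month;
```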
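A point estimate of the lift is hard to act on without an interval around it. The sketch below puts a normal-approximation confidence interval on the D30 retention gap for one month. The counts are made up and the function name is an assumption.

```python
from math import sqrt
from statistics import NormalDist


def retention_lift_ci(retained_exposed: int, n_exposed: int,
                      retained_holdout: int, n_holdout: int,
                      confidence: float = 0.95) -> tuple[float, float, float]:
    """Return (lift, lower, upper) for exposed minus holdout retention, in absolute points."""
    p_e = retained_exposed / n_exposed
    p_h = retained_holdout / n_holdout
    # Unpooled standard error of the difference between two proportions.
    se = sqrt(p_e * (1 - p_e) / n_exposed + p_h * (1 - p_h) / n_holdout)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    lift = p_e - p_h
    return lift, lift - z * se, lift + z * se


# Made-up counts for one month of the readout.
lift, lower, upper = retention_lift_ci(58_900, 190_000, 3_000, 10_000)
print(f"D30 lift: {lift:+.2%} (95% CI {lower:+.2%} to {upper:+.2%})")
```

With a 10,000-user holdout the interval is wide even for a one-point gap, which is why the readout cadence is quarterly rather than weekly.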

When to pick A/B vs holdout

The choice is operational, not philosophical.

Situation	Pick
Short-term effect of one change (1 to 4 weeks)	A/B
Long-term effect (3+ months)	Holdout
Cumulative effect of many launches in a half-year	Holdout
Incrementality of ads, promos, or push	Holdout
Effect that violates SUTVA (network, marketplace)	Switchback or holdout
Change users see across devices or discuss offline	Holdout
Pricing experiment with cross-arm leakage risk	Holdout or geo split
Algorithm change with fast user-level signal	A/B

These are not mutually exclusive. Mature programs at Stripe, Airbnb, and Netflix run an A/B layer for short-window decisions on top of a persistent holdout for long-window measurement. Reading them together is the whole point.

Worked examples by team

Netflix long-term holdout. Around 1 percent of accounts do not receive the latest recommendation, UI, and onboarding launches. Quarterly readouts compare retention and viewing-hours deltas. This answers "did six months of personalization actually move retention" rather than "did this ranking change win its two-week A/B".

Marketing holdouts at DoorDash and Uber. A geo or user-level slice receives zero retargeting and zero paid ads for a defined window. The sales difference versus the exposed group is the incremental value of the spend. Teams routinely discover that 30 to 50 percent of attributed conversions would have happened without the ad.

Push notifications. Push is the canonical "great in week one, terrible in month six" feature. A 5 percent holdout that receives no marketing push gives a clean read on net effect across retention, DAU, and unsubscribe rate.

Promo and coupon programs. A holdout that never sees promo codes lets the revenue team estimate true incremental revenue — usually positive but smaller than gross attribution suggested.

Recommender baselines. A small holdout on a non-personalized baseline ranking — popularity, recency, editorial — is the floor against which the personalized stack is measured. Without it you only compare new recommenders to old ones and compound errors over time.
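The marketing-holdout arithmetic from the DoorDash and Uber example can be sketched as follows. All counts are made up, and the function and key names are assumptions. The idea is to scale the holdout's conversion rate up to the exposed population and treat that as the no-ads baseline.

```python
def incrementality(conv_exposed: int, n_exposed: int,
                   conv_holdout: int, n_holdout: int,
                   attributed: int) -> dict[str, float]:
    """Estimate incremental conversions from an ads holdout and compare with attribution."""
    rate_holdout = conv_holdout / n_holdout
    # Conversions the exposed group would have produced with no ads at all.
    organic_baseline = rate_holdout * n_exposed
    incremental = conv_exposed - organic_baseline
    return {
        "incremental_conversions": incremental,
        "incremental_share_of_attributed": incremental / attributed,
    }


# Made-up numbers: 950,000 exposed users, 50,000 in the no-ads holdout.
result = incrementality(conv_exposed=47_500, n_exposed=950_000,
                        conv_holdout=2_200, n_holdout=50_000,
                        attributed=9_500)
print(result)
```

In this invented example the attribution system claims 9,500 conversions but the holdout supports only about 5,700, so roughly 40 percent of attributed conversions would have happened without the ads.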

Operating a holdout day-to-day

Designing the holdout is easy. Operating it across the organization is what kills programs.

Product managers will lobby to release to the holdout the moment their A/B reads positive, usually framed as "the holdout is missing out on a clear win". Every release to the holdout erodes the long-window measurement, because the holdout stops representing the pre-launch experience. The experimentation team therefore needs an explicit policy and a senior sponsor (a VP of Product or Chief Data Officer) who can say no.

The pool drifts in representativeness. Three years in, users locked into the holdout have aged differently than the rest of the base: fewer recommended sessions, lower notification volume, different cohort entry distributions. The fix is a periodic refresh every 12 to 24 months — retire the current pool, randomize a new one, accept some loss of continuity for representativeness.

Analysis hygiene matters more than people expect. Holdout flags must be stamped on the user record at randomization and never recomputed. If the bucketing service crashes and re-randomizes, the holdout is dead. Every analyst should be able to join users to experiment_assignments on a stable key and see the same arm for the same user every time.
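A cheap guard for this is a recurring check that no user ever appears under more than one arm. The sketch below runs over hypothetical (user_id, arm) rows of the kind an experiment_assignments table might hold; the function name and sample data are assumptions.

```python
from collections import defaultdict


def users_with_conflicting_arms(assignments: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Return users that appear with more than one arm in the assignment log."""
    arms_by_user: dict[str, set[str]] = defaultdict(set)
    for user_id, arm in assignments:
        arms_by_user[user_id].add(arm)
    return {user: arms for user, arms in arms_by_user.items() if len(arms) > 1}


# Hypothetical rows: (user_id, arm). u3 was re-randomized, which is the failure to catch.
log = [
    ("u1", "holdout"),
    ("u2", "exposed"),
    ("u1", "holdout"),
    ("u3", "exposed"),
    ("u3", "holdout"),
]
conflicts = users_with_conflicting_arms(log)
print(f"{len(conflicts)} user(s) with conflicting arms: {sorted(conflicts)}")
```

An empty result is the healthy state. Anything else means some users have seen both experiences and should be excluded from the long-window read.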

The framing that works internally is "we keep a tiny slice on the previous experience so we can honestly measure whether the new experience is better". Most executives accept this once it is framed as a measurement investment rather than a deprivation.

Common pitfalls

When teams stand up their first holdout, the most common mistake is running it for too short a window. Three months is the floor, not the target. A six-week holdout is a slow A/B that costs more to operate and gives less signal. Commit upfront to a six or twelve month readout and budget patience as carefully as compute.

A second trap is letting the holdout leak into the regular A/B pool. If a holdout user gets bucketed into ten unrelated A/B treatments over a quarter, the long-run signal is contaminated by short-run noise. Exclude holdout users from the experimentation pool by default, with a deliberate exception list for tests that genuinely need them.

A third pitfall is sizing too small. Below 0.5 percent of an audience you usually cannot detect anything under a 5 percent effect, so the holdout reads "no significant difference" regardless of the truth. Run a power calculation against your headline metric and size to an MDE that is meaningful for the business.

A fourth is comparing arms without checking baseline equivalence. Even with proper randomization, occasional imbalances happen: platform mix, country distribution, tenure profile. Before reading the lift, run a pre-period balance check, using the same kind of pre-experiment data that CUPED relies on, to confirm the arms looked the same before treatment. Skipping this is how teams report a 2 percent lift that turns out to be platform mix drift.

A fifth pitfall is treating the holdout as permanent without a refresh policy. Three to five years in, users still in the holdout no longer represent new acquisition cohorts. Document a refresh cadence every 12 to 24 months with a clear handover plan so long-horizon trend data is not lost in transition.
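The baseline-equivalence check from the fourth pitfall can be sketched with a standardized mean difference (SMD) on a pre-period covariate. The sample data is made up, the function name is hypothetical, and the 0.1 threshold is a common rule of thumb rather than anything the source prescribes.

```python
from math import sqrt
from statistics import mean, pvariance


def standardized_mean_difference(holdout: list[float], exposed: list[float]) -> float:
    """SMD of a pre-period covariate: mean difference over the pooled standard deviation."""
    pooled_sd = sqrt((pvariance(holdout) + pvariance(exposed)) / 2)
    if pooled_sd == 0:
        return 0.0
    return (mean(exposed) - mean(holdout)) / pooled_sd


# Made-up pre-period weekly sessions per user, measured before the holdout started.
holdout_sessions = [3.0, 5.0, 4.0, 6.0, 2.0, 5.0, 4.0, 3.0]
exposed_sessions = [4.0, 5.0, 3.0, 6.0, 2.0, 5.0, 4.0, 4.0]

smd = standardized_mean_difference(holdout_sessions, exposed_sessions)
flag = "investigate" if abs(smd) > 0.1 else "ok"  # 0.1 is an assumed rule-of-thumb cutoff
print(f"pre-period SMD: {smd:+.3f} ({flag})")
```

Running the same check per platform, country, and tenure band is what surfaces the mix drift described above before it is reported as a lift.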

If you want to drill experimentation and product-analytics problems like this every day, NAILDD is launching with 500+ SQL and stats problems built around exactly these decisions.

FAQ

What percent of users should sit in a holdout?

For most companies the right range is 1 to 5 percent. At Netflix or Meta scale, 1 percent is still millions of users with very tight confidence intervals. Earlier-stage companies with a few hundred thousand monthly actives usually need 3 to 5 percent so headline metrics have enough power to detect company-level moves. Always confirm with a power calculation against the actual baseline and the smallest effect that would matter.

Can a holdout and A/B tests run at the same time?

Yes, and at any reasonable scale they should. The cleanest setup is a small persistent holdout excluded from the regular A/B pool, while the remaining audience flows through normal A/B tests for short-window decisions. A/B tells you which version wins this week, holdout tells you whether the last six months of shipping moved the business. Mature programs at Airbnb, Stripe, and Uber run both layers concurrently with explicit governance about which experiments can touch the holdout slice.

How do I explain to leadership that 5 percent of users are locked out of new features?

Frame it as a measurement investment, not a deprivation. The pitch that works: "without this small holdout we cannot prove that new features are net positive over the long run — we are protecting our ability to make honest decisions about what to keep building". Pair this with a concrete past example where a feature looked positive in A/B but turned out neutral or negative over six months. Holdouts are how you avoid that pattern repeating.

When does a holdout not make sense?

Holdouts struggle at very small scale. If your monthly active base is under fifty thousand, a 5 percent holdout is at most 2,500 users, rarely enough to detect anything below a 10 percent lift on most product metrics. A stronger pre-registered A/B program plus a quarterly retrospective is usually a better investment at that scale. Holdouts also do not fit products with very short feature lifecycles, where a feature may be retired before a long-window readout arrives.

What do I do if the holdout shows the overall effect is negative?

First, do not panic and do not immediately roll back. A negative holdout readout is a diagnostic, not a verdict. Pull the list of features that shipped during the window and segment the readout by surface — onboarding, ranking, notifications, monetization — to find where the loss is concentrated. Then run targeted A/B tests that turn off specific candidate features for the exposed group and see which off-switches recover the metric. The holdout flags that something is wrong at the company level; the A/B layer finds the specific cause.

How is a holdout different from a difference-in-differences study?

A holdout is a randomized comparison: users are assigned to holdout or exposed at the start, stay there, and the lift is the difference in outcomes. A difference-in-differences design is a quasi-experimental method for when you cannot randomize — a feature rolls out by country or by cohort — and you estimate the causal effect by comparing change-over-time in the treated group against change-over-time in an untreated comparison group. Holdouts are cleaner because randomization removes selection bias; DiD is what you reach for when randomization is unavailable.
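The two estimators differ by one subtraction, which a small sketch makes visible. The retention figures are made up and the function names are assumptions. The difference-in-differences estimate is only causal under the usual parallel-trends assumption, which randomization makes unnecessary for the holdout.

```python
def holdout_lift(exposed_post: float, holdout_post: float) -> float:
    """Randomized holdout: the lift is the plain difference in outcomes."""
    return exposed_post - holdout_post


def diff_in_diff(treated_pre: float, treated_post: float,
                 control_pre: float, control_post: float) -> float:
    """DiD: change in the treated group minus change in the untreated comparison group."""
    return (treated_post - treated_pre) - (control_post - control_pre)


# Made-up D30 retention. Countries that got the feature vs. countries that did not.
did = diff_in_diff(treated_pre=0.30, treated_post=0.34,
                   control_pre=0.28, control_post=0.30)
# Made-up randomized readout for comparison.
lift = holdout_lift(exposed_post=0.34, holdout_post=0.32)

print(f"DiD estimate: {did:+.3f}, holdout lift: {lift:+.3f}")
```

DiD has to subtract the pre-period gap because the groups were never comparable to begin with. In a holdout that gap is zero in expectation by construction.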