SRM in A/B testing

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

What SRM means and why it kills tests

Sample Ratio Mismatch (SRM) is the situation where the observed split between control and treatment in an A/B test differs from the split you planned. You designed a 50/50 experiment, but after a week of traffic you see 51.3/48.7. With small samples that gap could be random noise. With six-figure samples, even a one-percent gap is not random — it is a structural signal that the plumbing of the experiment is broken, and any verdict you draw is contaminated by whatever is causing the imbalance.

Picture the scenario most senior data scientists have lived through. Your PM ships a redesigned checkout. The test reads a 3% lift in conversion. Then someone notices that treatment has 2% fewer users than control. The lift is no longer trustworthy: maybe the treatment page loads more slowly and filtered out impatient users, or a bot-detection rule fired more aggressively in one arm. The two arms are no longer exchangeable populations.

Interviewers at Meta, Stripe, Netflix, and Airbnb increasingly probe SRM because it separates candidates who can crunch p-values from candidates who understand the operational reality of experiments. A clean p-value computed on dirty assignment is worse than no test at all — it gives false confidence. SRM catches the dirty assignment before the false confidence ships.

Why SRM is a hard stop

If the actual distribution across arms does not match the design, randomization itself has failed. Once randomization fails, the two groups are no longer probabilistically equivalent on the unobserved confounders. Any metric difference could be the effect of the feature, the effect of the composition shift, or some interaction of the two. There is no clean way to disentangle them after the fact, because the very thing that made the groups comparable — random assignment — is what broke.

Suppose a checkout redesign reports a 3% conversion lift, and treatment also lost 2% of its assigned users between assignment and exposure. The plausible story is that a slow-loading hero image caused some users to bounce before the experiment instrumentation fired. Those users disproportionately include slow-connection and mobile traffic, which also tends to have lower baseline conversion. Their absence from treatment artificially inflates the treatment rate. The reported lift is partly real and partly a survivorship illusion.
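A back-of-the-envelope calculation makes the survivorship effect concrete. Every number below is an invented assumption for illustration (an 80/20 fast/slow traffic mix, 5% and 2% baseline conversion, and a treatment with no real effect that loses 10% of its slow users before logging); none comes from a real test.

```python
# Hypothetical traffic mix: all numbers are illustrative assumptions.
USERS_PER_ARM = 100_000
FAST_SHARE, SLOW_SHARE = 0.80, 0.20
FAST_CONV, SLOW_CONV = 0.05, 0.02   # slow/mobile users convert less
SLOW_DROP_RATE = 0.10               # share of slow treatment users lost before exposure


def arm_conversion(fast_users: float, slow_users: float) -> float:
    """Blended conversion rate for an arm with the given segment sizes."""
    conversions = fast_users * FAST_CONV + slow_users * SLOW_CONV
    return conversions / (fast_users + slow_users)


fast = USERS_PER_ARM * FAST_SHARE
slow = USERS_PER_ARM * SLOW_SHARE

control_rate = arm_conversion(fast, slow)
# Treatment has zero true effect; it only loses some of its slow users.
treatment_rate = arm_conversion(fast, slow * (1 - SLOW_DROP_RATE))

users_lost = slow * SLOW_DROP_RATE / USERS_PER_ARM
apparent_lift = treatment_rate / control_rate - 1

print(f"treatment lost {users_lost:.1%} of its users")
print(f"control {control_rate:.3%}, treatment {treatment_rate:.3%}")
print(f"apparent lift with no real effect: {apparent_lift:.2%}")
```

Under these assumptions, losing 2% of treatment users is enough to produce a lift of roughly 1% with no change in anyone's behavior.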

The operating rule mature experimentation teams enforce is brutally simple: if SRM fires, do not interpret the metrics until the root cause is identified and fixed.

How to detect SRM with chi-square

The standard SRM test is a chi-square goodness-of-fit test comparing observed group sizes to expected sizes given the planned ratio. The formula sums over arms of (observed minus expected) squared divided by expected, compared against a chi-square distribution with degrees of freedom equal to arms minus one.
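Written out, the statistic looks like this. The symbol names are chosen here for illustration: O_i is the observed count in arm i, E_i the expected count, N the total number of users, p_i the designed share of arm i, and k the number of arms.

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i},
\qquad E_i = N \, p_i,
\qquad \mathrm{df} = k - 1
```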

Take a worked example. You designed a 50/50 split and collected 105,200 users: 53,400 in control and 51,800 in treatment. Under the null hypothesis of a clean 50/50 split, each arm should have 52,600 users:

X2 = (53400 - 52600)^2 / 52600 + (51800 - 52600)^2 / 52600
   = 640000 / 52600 + 640000 / 52600
   = 12.167 + 12.167
   = 24.33

With one degree of freedom and an alpha of 0.05, the critical value is 3.84. Our statistic of 24.33 blows past it. The p-value is roughly one in a million. This is unambiguous SRM. The experiment is invalid until you find the cause. Most teams use a stricter alpha of 0.001 for SRM: the check runs repeatedly across many experiments, so a loose threshold would bury the team in false alarms, and at production sample sizes a real assignment bug usually pushes the p-value far below 0.001 anyway.

Python: SRM check in three lines

The chi-square test is in scipy, and the SRM check is one of the shortest pieces of useful code you will ever write:

from scipy.stats import chisquare

observed = [53400, 51800]   # actual arm sizes
expected = [52600, 52600]   # design at 50/50

stat, p_value = chisquare(observed, f_exp=expected)
print(f"X2 = {stat:.2f}, p = {p_value:.6f}")
# X2 = 24.33, p = 0.000001

The p-value is effectively zero, so randomization is broken. For unequal splits like an 80/20 ramp, compute the expected counts proportionally:

observed = [82100, 19300]
total = sum(observed)
ratio = [0.8, 0.2]
expected = [total * r for r in ratio]

stat, p_value = chisquare(observed, f_exp=expected)
print(f"X2 = {stat:.2f}, p = {p_value:.6f}")

For a drop-in helper used across every experiment dashboard, wrap it in a function that flags SRM when p < 0.001:

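from scipy.stats import chisquare

def srm_check(observed, ratio, alpha=0.001):
    """Flag SRM when the chi-square p-value falls below alpha."""
    total = sum(observed)
    expected = [total * r for r in ratio]  # ratio entries must sum to 1
    stat, p = chisquare(observed, f_exp=expected)
    return {"stat": stat, "p_value": p, "srm": p < alpha}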

Stick this in the same report that renders treatment effects. Platforms like Statsig, Eppo, and GrowthBook all do exactly this out of the box.

Common pitfalls

The first pitfall is checking SRM only at the end of the test. By that point the budget is spent and the team is psychologically committed to the readout. Run the check daily, ideally hourly during the ramp window. Many root causes — a bad redirect, a broken tracking pixel, a misconfigured bot rule — get worse the longer they run, so early detection limits data loss.

A second pitfall is forgetting that the relevant denominator for SRM is "users exposed to the experiment," not "users in the assignment log." If you assign users at session start but fire telemetry only after a feature flag has been read, you have created a triggered experiment, and the right population for chi-square is the triggered one. Running the test on the wrong population either masks real SRM or invents fake SRM.

A third trap is treating SRM as a binary signal you can ignore once it is "not significant." Always inspect the chi-square statistic and the directional skew. A p-value that hovers near 0.05 while the gap between arms widens for three days straight is not clean; it is a slow-motion failure. Pair the p-value with a chart of group sizes over time and flag monotonic divergence even when the formal test has not yet fired.
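One possible way to flag that kind of drift is sketched below. It assumes a two-arm test, so the chi-square tail with one degree of freedom can be computed from the standard library as erfc(sqrt(stat / 2)). The function names, the three-day window, and the daily cumulative counts are all hypothetical.

```python
import math


def two_arm_srm_p(control, treatment, control_share=0.5):
    """Chi-square goodness-of-fit p-value for two arms (df = 1)."""
    total = control + treatment
    exp_control = total * control_share
    exp_treatment = total - exp_control
    stat = ((control - exp_control) ** 2 / exp_control
            + (treatment - exp_treatment) ** 2 / exp_treatment)
    # For one degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2))


def gap_is_widening(snapshots, control_share=0.5, min_days=3):
    """True if the absolute gap between observed and designed control share
    strictly increased across the last `min_days` snapshots."""
    gaps = [abs(c / (c + t) - control_share) for c, t in snapshots]
    recent = gaps[-min_days:]
    return len(recent) == min_days and all(a < b for a, b in zip(recent, recent[1:]))


# Hypothetical cumulative (control, treatment) counts at the end of each day.
snapshots = [(10_050, 9_950), (20_160, 19_840), (30_300, 29_700)]

p_values = [two_arm_srm_p(c, t) for c, t in snapshots]
for day, ((c, t), p) in enumerate(zip(snapshots, p_values), start=1):
    print(f"day {day}: control share {c / (c + t):.4f}, p = {p:.4f}")
print("widening gap:", gap_is_widening(snapshots))
```

In this made-up series no daily p-value crosses 0.001, yet the gap grows every day, which is the pattern worth surfacing on a dashboard.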

A fourth pitfall is global SRM-clean but local SRM-broken. The overall split looks fine, but iOS shows a 53/47 imbalance and Android shows 47/53, which cancel at the top level. Run SRM checks per major segment — platform, country, device class — and treat a segment-level failure as seriously as a global one, because segment bias still poisons the segment metric reads product teams care about.
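The cancellation is easy to reproduce. The sketch below uses made-up per-platform counts that mirror the 53/47 and 47/53 example, assumes two arms per segment (one degree of freedom), and uses hypothetical function names.

```python
import math


def two_arm_srm_p(control, treatment, control_share=0.5):
    """Chi-square goodness-of-fit p-value for two arms (df = 1)."""
    total = control + treatment
    exp_control = total * control_share
    exp_treatment = total - exp_control
    stat = ((control - exp_control) ** 2 / exp_control
            + (treatment - exp_treatment) ** 2 / exp_treatment)
    # For one degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2))


def srm_by_segment(segments, alpha=0.001):
    """Run the SRM check per segment and on the pooled totals."""
    report = {}
    total_control = total_treatment = 0
    for name, (control, treatment) in segments.items():
        p = two_arm_srm_p(control, treatment)
        report[name] = {"p_value": p, "srm": p < alpha}
        total_control += control
        total_treatment += treatment
    p_global = two_arm_srm_p(total_control, total_treatment)
    report["global"] = {"p_value": p_global, "srm": p_global < alpha}
    return report


# Hypothetical (control, treatment) counts per platform.
segments = {"ios": (53_000, 47_000), "android": (47_000, 53_000)}

report = srm_by_segment(segments)
for name, result in report.items():
    print(f"{name:8s} p = {result['p_value']:.3g}  srm = {result['srm']}")
```

With these counts the pooled split is exactly 50/50 and passes, while both platform-level checks fail.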

A fifth pitfall is "post-hoc weighting will save us." Some teams try to rescue an SRM-affected test by reweighting groups to match the design ratio. The cause of SRM is usually correlated with the outcome — fast users stayed, slow users dropped — so reweighting on observed dimensions cannot recover the unobserved confounding. The honest move is to fix the cause and rerun.

What to do when SRM fires

Stop interpreting metrics immediately, no matter how clean the readout looks. Communicate clearly to stakeholders: the test is invalid, not negative. A simple template helps: "SRM detected at p = X. Suspending readouts until root cause is identified."

Then look at where users could have leaked between assignment and exposure. Redirects are the classic culprit. If treatment loads through a redirect and control loads directly, the redirect itself filters out slow connections, mobile users, and ad blockers disproportionately. Align the load mechanism so both arms have the same opportunity to render.
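One way to localize the leak, assuming counts are logged at more than one point in the funnel, is to run the same check at each stage and find the first one that fails. The stage names and counts below are hypothetical, and the two-arm shortcut (one degree of freedom) is an assumption of this sketch.

```python
import math


def two_arm_srm_p(control, treatment, control_share=0.5):
    """Chi-square goodness-of-fit p-value for two arms (df = 1)."""
    total = control + treatment
    exp_control = total * control_share
    exp_treatment = total - exp_control
    stat = ((control - exp_control) ** 2 / exp_control
            + (treatment - exp_treatment) ** 2 / exp_treatment)
    # For one degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2))


def first_broken_stage(stages, alpha=0.001):
    """Return the name of the earliest stage whose split fails the SRM check,
    or None if every stage passes."""
    for name, (control, treatment) in stages:
        if two_arm_srm_p(control, treatment) < alpha:
            return name
    return None


# Hypothetical (control, treatment) counts at each logging stage, in funnel order.
stages = [
    ("assigned", (52_640, 52_560)),
    ("page_requested", (52_610, 52_530)),
    ("exposure_logged", (52_580, 51_300)),
]

for name, (control, treatment) in stages:
    print(f"{name:16s} p = {two_arm_srm_p(control, treatment):.3g}")
print("first failing stage:", first_broken_stage(stages))
```

If assignment is balanced but exposure is not, the problem sits between the two, which points at redirects, page load, or logging rather than at the randomizer.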

Inspect bot filtering next. Most platforms apply bot heuristics after assignment but before exposure logging. If the treatment changes the request pattern — an extra XHR call on page load, for example — the bot filter may flag treatment traffic at a different rate. Move the bot-filtering pipeline to run before assignment.

Audit the randomization function itself. Hash collisions, modulo arithmetic on biased input spaces, and reseeding the RNG per session instead of per user are common imbalance sources. A 30-line unit test asserting uniform distribution on a million synthetic user IDs catches most of these before they reach production.
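A minimal sketch of such a uniformity check follows. The assigner here is a hypothetical SHA-256-based stand-in, not any particular platform's implementation; the salt, the ID format, and the deliberately broken variant are all invented for illustration. The default of 200,000 synthetic IDs keeps the run fast and can be raised to a million.

```python
import hashlib
import math


def assign_arm(user_id: str, salt: str = "checkout-redesign") -> str:
    """Hypothetical assigner: hash salt plus user ID, split buckets 50/50."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100          # buckets 0..99
    return "control" if bucket < 50 else "treatment"


def assign_arm_buggy(user_id: str, salt: str = "checkout-redesign") -> str:
    """Deliberately broken variant: three buckets mapped onto two arms."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 3            # buckets 0..2
    return "control" if bucket < 2 else "treatment"   # about 2/3 vs 1/3


def two_arm_srm_p(control, treatment, control_share=0.5):
    """Chi-square goodness-of-fit p-value for two arms (df = 1)."""
    total = control + treatment
    exp_control = total * control_share
    exp_treatment = total - exp_control
    stat = ((control - exp_control) ** 2 / exp_control
            + (treatment - exp_treatment) ** 2 / exp_treatment)
    # For one degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2))


def uniformity_check(assigner, n_users=200_000):
    """Assign synthetic user IDs and return (counts, SRM p-value)."""
    counts = {"control": 0, "treatment": 0}
    for i in range(n_users):
        counts[assigner(f"user-{i}")] += 1
    return counts, two_arm_srm_p(counts["control"], counts["treatment"])


good_counts, good_p = uniformity_check(assign_arm)
bad_counts, bad_p = uniformity_check(assign_arm_buggy)
print("sha256 % 100:", good_counts, f"p = {good_p:.3g}")
print("sha256 % 3  :", bad_counts, f"p = {bad_p:.3g}")
```

In a test suite, the natural assertion is that the p-value for the production assigner stays above the SRM threshold, while a planted bug like the three-bucket variant fails it.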

Finally, check whether the experiment is implicitly triggered. If users only enter when they click a specific button, and treatment changes the probability of that click, SRM is structurally baked in. Move the trigger upstream of variant rendering, or switch to an intention-to-treat analysis. If none of these turn up a cause, escalate — persistent unexplained SRM signals platform-level decay that will corrupt every future experiment.

Interview questions on SRM

What is SRM and why is it a problem? SRM is a divergence between observed and designed group sizes in an experiment. It indicates randomization failed somewhere between assignment and measurement. Once randomization fails, the groups are no longer probabilistically equivalent, so any metric difference is confounded by composition shift. You cannot causally attribute the observed lift to the tested change, which defeats the purpose of running the test.

How do you check for SRM? Run a chi-square goodness-of-fit test against the design ratio. Compute expected counts as total times design proportions, get the chi-square statistic, and look up the p-value with k minus one degrees of freedom. The conventional threshold is p < 0.001, much stricter than metric inference, because the check runs repeatedly on large samples: false alarms must stay rare, and a real assignment bug typically lands far below that threshold.

A test shows a significant positive effect, but you find SRM. What do you do? Refuse to report the lift. Communicate the test is invalid, not negative. Investigate redirects, bot filters, randomization integrity, and triggered-experiment effects. Once fixed, rerun from scratch. Do not reweight post hoc — the bias is almost always on unobserved confounders.

Name three common causes of SRM. Redirects that drop users in one arm, bot filtering after assignment firing unevenly, and triggered experiments where the variant changes trigger probability. A fourth: randomization keyed to session instead of user, letting users switch arms by reopening the app.

SRM only shows on iOS; Android is clean. How do you interpret? The Android readout is still suspect: whatever broke iOS may have a subtler footprint on Android, especially if both share a backend assignment path. The priority is to localize the iOS failure, most likely in the iOS SDK, a webview redirect, or an iOS-specific bot rule. Do not treat Android as safe just because its own SRM check passed.

If you want to drill experimentation questions like this every day, NAILDD is launching with a library of SQL and statistics problems built around real platform diagnostics — SRM, peeking, guardrails, CUPED — not toy textbook questions.

FAQ

What p-value threshold should I use for SRM?

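Most teams use p < 0.001, and some go to p < 0.0001 in high-traffic environments where even modest skew is unambiguously structural. The threshold is stricter than the metric test because SRM is an infrastructure check that runs on every experiment, often many times per experiment. At an alpha of 0.05 those repeated checks would raise constant false alarms, each costing an investigation and a paused readout. A missed SRM is worse, since it ships a wrong product call, but with large samples a genuine assignment bug usually drives the p-value far below 0.001, so the stricter threshold gives up little detection power. Weigh both error costs against your traffic volume when picking the threshold.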

Does SRM apply to unequal traffic splits?

Yes. SRM is the divergence between observed and designed ratios, whatever the design is. The chi-square test handles any ratio — pass the expected counts derived from the design. Ramp tests at 1/99 or 10/90 all have SRM exposure, and aggressive ramps amplify risk because the smaller arm is more sensitive to small absolute imbalances.

Can I salvage a test with SRM by reweighting groups?

Almost never. Reweighting only corrects bias on the dimensions you reweight on, and SRM almost always introduces bias on unobserved dimensions correlated with the outcome. The bias survives reweighting and corrupts the conclusion. Fix the root cause and rerun. Treat any "fixed" SRM result as directional, not decision-grade.

How do I automate SRM detection?

Wire the chi-square check into the experiment reporting pipeline. After each minimum-sample milestone, compute the p-value and raise a visible banner if it crosses the threshold. Most managed platforms ship this out of the box; if you are building in-house, replicate the behavior on day one. Run the check per major segment, not just globally.

How often should I check SRM during a running test?

Daily at minimum, hourly during the first 48 hours of any ramp, and continuously if your platform supports it. Early detection lets you abort and fix before the experiment consumes its full budget. Catching SRM on day one costs a day of traffic. Catching it on day fourteen costs two weeks of decisions waiting on a test you now have to throw out.