A/B testing complete guide for data analysts
Why A/B testing dominates analyst interviews
If you are interviewing for a product analyst, data scientist, or growth role at Meta, Stripe, Airbnb, DoorDash, or Netflix, expect at least one A/B testing question per loop. Experimentation is how these companies decide what to ship — a junior analyst who cannot explain peeking, sample-size math, or SRM will ship the wrong thing on day twelve. The surface area is finite. Once you have a single mental model, the variations stop feeling new.
What an A/B test really is
An A/B test compares two product variants on randomly assigned users. Control sees the current version; treatment sees the new one. You compare a chosen metric and decide whether the change made things better, worse, or no measurable difference. Everything else is bookkeeping that protects you from fooling yourself.
The load-bearing word is random. Random assignment is the only mechanism that gives you two groups that are comparable, on average, on every variable you measured and every one you did not. That is what lets you attribute the metric difference to the change rather than to seasonality, marketing pushes, or which users happened to log in that morning. If assignment is not random, your conclusion is not causal.
With multiple changes tested simultaneously the design becomes a multivariate test, which costs much more traffic because you are estimating interactions on top of main effects. Stick to A/B unless you can justify the cost.
The seven-step workflow
1. Write the hypothesis
A hypothesis is a falsifiable statement with a mechanism: "Changing the primary CTA from gray to green will lift checkout conversion by at least 1.5 percentage points because contrast against our white background increases click intent on mobile." Change, metric, expected effect size, causal story. Without all four, the experiment becomes a fishing trip. H0: treatment equals control. H1: they differ.
2. Choose the primary metric
One primary metric. Not two. The primary is what makes the ship decision; everything else is context. Define it as a SQL expression today, before launch. Guardrails — latency, crash rate, refund rate — are tracked separately and must not regress. Secondary metrics inform diagnosis but never override the primary.
3. Calculate sample size
Sample size depends on four inputs: baseline rate, minimum detectable effect (MDE), alpha (typically 0.05), and power (typically 0.80). The smaller the MDE, the more users you need; sample size scales roughly with 1 over MDE squared. This calculation tells you how many days the test must run; it is not a knob to wiggle once the test is live.
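The four inputs can be turned into a number with the standard normal-approximation formula for a two-sided two-proportion z-test. The sketch below is one common textbook form, not the only one; the function name `sample_size_per_arm` and the example baseline of 10 percent are illustrative assumptions.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users per arm for a two-sided two-proportion z-test.

    baseline: control conversion rate, e.g. 0.10
    mde_abs:  minimum detectable effect in absolute terms, e.g. 0.01 for 1 pp
    """
    p1 = baseline
    p2 = baseline + mde_abs
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for the power target
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / mde_abs ** 2)


# Hypothetical inputs: 10% baseline, MDE of 1 pp versus 2 pp.
n_small_mde = sample_size_per_arm(baseline=0.10, mde_abs=0.01)
n_large_mde = sample_size_per_arm(baseline=0.10, mde_abs=0.02)
print(n_small_mde, n_large_mde, round(n_small_mde / n_large_mde, 2))
```

Halving the MDE roughly quadruples the required sample, which is the 1-over-MDE-squared scaling in practice. Dividing the total across both arms by daily eligible traffic gives the planned duration.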
4. Randomize users
Each user is assigned to one group via a hash of user ID plus an experiment salt. The unit is almost always the user, not the session — session-level assignment lets one human see both variants and contaminates the comparison. Products with strong network effects (marketplaces, social) need cluster randomization at the market or community level.
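A minimal sketch of hash-based assignment, assuming SHA-256 as the hash and a string salt per experiment. The function name `assign_variant` and the salt `"cta_color_v1"` are hypothetical; real platforms differ in hash choice and bucket granularity.

```python
import hashlib


def assign_variant(user_id, experiment_salt, treatment_share=0.5):
    """Deterministically map a user to 'control' or 'treatment'."""
    key = f"{experiment_salt}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # First 8 hex characters -> integer in [0, 2**32), scaled to [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "treatment" if bucket < treatment_share else "control"


# Hypothetical user IDs 0..99,999 under one experiment salt.
assignments = [assign_variant(uid, "cta_color_v1") for uid in range(100_000)]
observed_share = assignments.count("treatment") / len(assignments)
print(f"treatment share: {observed_share:.3f}")
```

Because the bucket depends only on the salt and the user ID, the same user always lands in the same group, and changing the salt reshuffles users independently for the next experiment.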
5. Run the experiment
Start the test, then leave it alone. Minimum: one full business cycle (seven days) to absorb day-of-week seasonality. Two to three weeks is healthier because it absorbs the novelty effect.
6. Analyze the results
Compute the test statistic and p-value. If p is below alpha, reject the null. If p is at or above alpha, you do not have evidence to reject — which is not "there is no effect." Always report the confidence interval alongside the point estimate; the interval tells you how much the data actually pinned down.
7. Make the decision
Statistical significance is necessary but not sufficient. Is the effect practically significant? Is the CI narrow enough that both bounds point at "ship"? Did guardrails hold? A red on any of those is a stop.
Core concepts every analyst must know
A p-value is the probability of observing data at least as extreme as what you saw, assuming the null is true. It is not the probability that the null is true. Misreading p-values is the most common error in writeups.
Alpha is the false-positive tolerance. At alpha = 0.05 you accept a 5 percent chance of declaring a winner when the change truly has no effect. Some teams run 0.01 for high-cost rollouts and 0.10 for cheap exploratory tests.
Power is the probability of detecting a real effect when one exists. At 0.80 you miss real effects 20 percent of the time. Underpowered tests are worse than no tests because non-significant results get read as "no effect" when they actually mean "we never had a chance."
MDE is the smallest lift the test is built to find. Set it from business logic: what lift justifies the engineering and maintenance cost of the new code path?
SRM (Sample Ratio Mismatch) is the check you run before trusting the result. You planned 50/50 and got 52/48. On a five-million-user test, a chi-square goodness-of-fit test against the planned split rejects decisively, which tells you bucketing or logging is broken. An SRM failure invalidates everything downstream.
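The SRM check can be sketched as a chi-square goodness-of-fit test with one degree of freedom. The function name `srm_check` and the 0.001 flagging threshold are assumptions for illustration; teams choose their own cutoff.

```python
import math


def srm_check(n_control, n_treatment, expected_control_share=0.5, threshold=0.001):
    """Chi-square goodness-of-fit test of the observed split against the plan."""
    total = n_control + n_treatment
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = (
        (n_control - expected_control) ** 2 / expected_control
        + (n_treatment - expected_treatment) ** 2 / expected_treatment
    )
    # With 1 degree of freedom, the chi-square tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < threshold


# The 52/48 split from the text on five million users.
chi2_big, p_big, flagged_big = srm_check(2_600_000, 2_400_000)
# A hypothetical small test where a 50.2/49.8 split is ordinary noise.
chi2_small, p_small, flagged_small = srm_check(5_020, 4_980)
print(flagged_big, flagged_small)
```

The same percentage imbalance that is noise on ten thousand users is decisive on five million, which is why the check uses counts rather than eyeballed percentages.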
A worked Python example
A z-test for proportions is the workhorse for conversion tests. Control: 480 conversions out of 5,000. Treatment: 540 out of 5,000.
import numpy as np
from scipy import stats
# Observed users and conversions per arm
n_ctrl, conv_ctrl = 5000, 480
n_test, conv_test = 5000, 540
p_ctrl = conv_ctrl / n_ctrl
p_test = conv_test / n_test
# Pooled standard error under H0 (both arms share one rate), used for the z-test
p_pool = (conv_ctrl + conv_test) / (n_ctrl + n_test)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_ctrl + 1 / n_test))
z = (p_test - p_ctrl) / se
p_value = 2 * stats.norm.sf(abs(z))  # two-sided tail probability
# Unpooled standard error for the 95% CI on the difference
lift = p_test - p_ctrl
se_diff = np.sqrt(p_ctrl * (1 - p_ctrl) / n_ctrl + p_test * (1 - p_test) / n_test)
ci_low, ci_high = lift - 1.96 * se_diff, lift + 1.96 * se_diff
print(f"control={p_ctrl:.2%} treatment={p_test:.2%} z={z:.2f} p={p_value:.4f}")
print(f"lift={lift:+.2%} 95% CI=[{ci_low:+.2%}, {ci_high:+.2%}]")
Output: control 9.60 percent, treatment 10.80 percent, z = 1.98, p = 0.047, lift +1.20 pp with a 95 percent CI of roughly [+0.01 pp, +2.39 pp]. Significant at alpha = 0.05, but only just: the lower bound is essentially zero. For continuous metrics like revenue per user, swap in Welch's t-test; for heavy-tailed, money-shaped metrics, bootstrap CIs are more honest than the normal approximation.
Bayesian vs frequentist in one page
A frequentist test answers "how surprising is this data if there is no real effect?" Output is a p-value and CI. Peeking is forbidden because every extra look inflates the false-positive rate.
A Bayesian test answers "given the data and a prior, what is the probability that treatment beats control?" Output is a posterior: "treatment beats control with 96 percent probability and expected lift 1.4 pp." Bayesian analysis states its priors explicitly, and the posterior stays interpretable however often you look at it. That does not make optional stopping free: shipping the first time the posterior crosses a threshold still raises the rate of false wins, so the stopping rule should be fixed up front.
Both are valid. Frequentist is the default at most companies because the math is standardized. Pick what your platform supports; do not switch mid-experiment because the other tab shows the answer you wanted.
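To make the Bayesian output concrete, here is a minimal sketch that reuses the counts from the worked example (480 of 5,000 versus 540 of 5,000). It assumes a uniform Beta(1, 1) prior on each conversion rate and estimates the posterior by Monte Carlo; the function name, the prior, and the number of draws are illustrative choices rather than a platform's actual method.

```python
import random


def prob_treatment_beats_control(conv_c, n_c, conv_t, n_t, draws=20_000, seed=0):
    """Monte Carlo estimate of P(treatment rate > control rate) and expected lift."""
    rng = random.Random(seed)
    wins = 0
    lift_sum = 0.0
    for _ in range(draws):
        # Beta(1 + successes, 1 + failures) is the posterior under a uniform prior.
        rate_c = rng.betavariate(1 + conv_c, 1 + n_c - conv_c)
        rate_t = rng.betavariate(1 + conv_t, 1 + n_t - conv_t)
        wins += rate_t > rate_c
        lift_sum += rate_t - rate_c
    return wins / draws, lift_sum / draws


prob_win, expected_lift = prob_treatment_beats_control(480, 5000, 540, 5000)
print(f"P(treatment > control)={prob_win:.3f} expected lift={expected_lift:+.2%}")
```

With a flat prior and this much data, the posterior probability lands close to one minus the one-sided frequentist p-value, so the two frameworks mostly disagree on interpretation rather than on the numbers.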
Common pitfalls
The single most expensive mistake is peeking. An analyst watches the dashboard daily, sees p drop below 0.05 on day six, and ships. Under daily peeking the true false-positive rate jumps from 5 percent to roughly 20 to 30 percent. The fix is to commit to a sample size at launch and not read significance until the test reaches it. If business pressure makes that impossible, use a sequential testing design that builds peeking into the math.
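The inflation from peeking can be checked with a small A/A simulation. The sketch below assumes equal traffic every day and a normally distributed test statistic, so the cumulative z-score after k days is a sum of k standard-normal increments divided by the square root of k. The 14 daily looks and the function name are assumptions for illustration.

```python
import math
import random
from statistics import NormalDist


def peeking_false_positive_rates(n_looks=14, n_sims=4000, alpha=0.05, seed=0):
    """Compare 'stop at first significant look' with 'read only the final look'."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        running_sum = 0.0
        crossed = False
        for look in range(1, n_looks + 1):
            # Under the null, each day contributes an independent N(0, 1) increment.
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / math.sqrt(look)
            if abs(z) > z_crit:
                crossed = True  # a peeking analyst would have shipped here
        peeking_hits += crossed
        fixed_hits += abs(z) > z_crit  # fixed-horizon analyst reads only the last z
    return peeking_hits / n_sims, fixed_hits / n_sims


peek_rate, fixed_rate = peeking_false_positive_rates()
print(f"peeking: {peek_rate:.1%}  fixed horizon: {fixed_rate:.1%}")
```

There is no real effect in any simulated test, so every significant result is a false positive. The fixed-horizon rate stays near alpha while the peeking rate is several times higher, and it keeps growing with the number of looks.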
Multiple comparisons without correction is the second classic trap. You check fifteen metrics, one shows p = 0.03, and you call a win. At alpha = 0.05, the probability that at least one of fifteen independent null metrics flashes significance is 1 - 0.95^15, roughly 54 percent. Pre-register one primary metric and require Bonferroni or Benjamini-Hochberg correction on the rest.
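The arithmetic and both corrections fit in a few lines. This is a sketch under the assumption that the fifteen metrics are independent; the p-values are made up to mirror the scenario above (one metric at 0.03, fourteen unremarkable ones).

```python
def familywise_error_rate(n_metrics, alpha=0.05):
    """P(at least one false positive) across independent null metrics."""
    return 1 - (1 - alpha) ** n_metrics


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


fwer = familywise_error_rate(15)
# Hypothetical p-values: one at 0.03 and fourteen between 0.20 and 0.85.
p_values = [0.03] + [0.20 + 0.05 * i for i in range(14)]
bonferroni = [p <= 0.05 / len(p_values) for p in p_values]
bh = benjamini_hochberg(p_values)
print(f"FWER={fwer:.1%} bonferroni wins={sum(bonferroni)} BH wins={sum(bh)}")
```

Under either correction the lone p = 0.03 does not survive, which matches the intuition that one hit in fifteen looks is what chance alone would produce.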
The novelty effect bites teams running short tests on UI changes. Users click the new variant more in week one because it is new; by week three the lift collapses and the team explains why retention dropped after launch. Run UI tests for at least two to three weeks and compare week-one to week-three metrics.
Ignoring SRM means your bucketing is broken and you do not know it. A 50/50 split that arrives as 49.2/50.8 on a large test is a red flag, not noise. Run a chi-square SRM check before you look at the metric; if it fails, stop, debug, and re-run. Trusting a metric on top of a broken split is how teams ship changes that have nothing to do with the result they read.
A subtler trap is the wrong unit of analysis. Randomizing on user but analyzing on session breaks independence and undercounts variance. The unit you analyze on must be the unit you randomized on, or you must adjust with cluster-robust errors.
Interview answers, in your own voice
Explain A/B testing in plain English. A controlled experiment: split users randomly into two groups, show one the current product and the other the new variant, compare a metric. Random assignment is what lets us call the difference causal — anything systematic that affects one group affects the other, on average.
How do you decide sample size? Four inputs: baseline rate, MDE, alpha, power. Plug them into a standard formula or a power calculator. If pre-experiment data is available, a variance-reduction technique such as CUPED can shrink the required sample. Then check the resulting duration: if it says eight months, the test is not viable and I argue for a different method or a larger MDE.
The test came back p = 0.08. Check power first. If the test was underpowered, p = 0.08 is consistent with a real effect we did not have enough users to catch, so I argue for a re-run with a larger sample. Then read the CI: at p = 0.08 the 95 percent interval spans zero, but if most of it sits above zero and the upper bound clears the MDE, the data is hinting at a real lift even though formal significance was missed.
Can you stop a test early? Not with a classic fixed-sample design — that is the peeking mistake and it inflates alpha. If the business needs the option, build it in from day one with sequential testing or always-valid p-values.
Statistical vs practical significance. Statistical significance says the result is unlikely under the null. Practical significance says the effect is large enough to matter. A 0.05 percent lift can be statistically significant on a hundred-million-user test and still be useless because it is below the engineering cost of the new path. Always report point estimate and CI.
FAQ
What is A/B testing in one paragraph?
A/B testing is a controlled experiment in which users are randomly split into two groups: control sees the current product, treatment sees a variant. Comparing a pre-chosen metric across groups isolates the causal effect of the change from background noise. Random assignment neutralizes the confounders you know about and the ones you have not thought of. This is why A/B tests are the default decision tool at Meta, Stripe, Airbnb, and almost every product-driven company.
How do I calculate sample size?
Four inputs: baseline conversion rate, MDE, alpha (typically 0.05), and power (typically 0.80). Smaller effects and noisier metrics require larger samples. Run the math with a calculator or a Python statsmodels call before launch, and use the resulting duration as a sanity check — if the test needs to run longer than the feature will be relevant, reach for a different method.
How long should an A/B test run?
Minimum is one full business cycle, normally seven days, so day-of-week seasonality washes out evenly. Practical default is two to three weeks to absorb the novelty effect. The rule that matters more than any number: do not stop before the planned sample size, even if the dashboard looks green or red.
A/B testing vs multivariate testing?
A/B compares two variants and isolates a single change. Multivariate varies several elements at once and estimates main effects plus interactions, needing much more traffic. Decomposing into a series of A/B tests is usually cheaper.
What if I do not have enough traffic?
Three levers. Raise the MDE. Reduce variance through CUPED, stratification, or a less noisy metric. Switch frameworks: pre-post, difference-in-differences, or synthetic control give weaker but real causal claims when an RCT is infeasible. Be honest about what each method can and cannot conclude.
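Of the variance-reduction options, CUPED is the easiest to sketch: subtract from each user's metric the part that a pre-experiment covariate already predicts. The example below uses synthetic data in which in-experiment spend correlates with pre-experiment spend; the data, the function name, and the coefficients are assumptions for illustration. In a real test, theta is typically estimated on both arms pooled together.

```python
import random
from statistics import mean, variance


def cuped_adjust(metric, covariate):
    """Return CUPED-adjusted values: y - theta * (x - mean(x))."""
    x_bar = mean(covariate)
    y_bar = mean(metric)
    cov_xy = sum(
        (x - x_bar) * (y - y_bar) for x, y in zip(covariate, metric)
    ) / (len(metric) - 1)
    theta = cov_xy / variance(covariate)  # regression slope of metric on covariate
    return [y - theta * (x - x_bar) for x, y in zip(covariate, metric)]


# Synthetic users: spend during the test depends on spend before the test.
rng = random.Random(0)
pre_spend = [rng.gauss(50, 10) for _ in range(5000)]
spend = [0.8 * x + rng.gauss(0, 5) for x in pre_spend]

adjusted = cuped_adjust(spend, pre_spend)
variance_ratio = variance(adjusted) / variance(spend)
print(f"variance after CUPED is {variance_ratio:.0%} of the original")
```

The adjustment leaves the mean unchanged and removes the variance explained by the covariate, so the same MDE needs proportionally fewer users. The gain depends entirely on how strongly the pre-period covariate correlates with the metric.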
Frequentist or Bayesian?
Pick what your platform supports. Frequentist is the industry default because the math is standardized and the ship rule is unambiguous. Bayesian shines with mature continuous-monitoring infrastructure. The wrong answer is to switch mid-experiment because the other one is closer to "significant."