A/B test vs multivariate test
The short answer
An A/B test compares two (or a few) versions of one element: control versus variant. A multivariate test (MVT) runs every combination of several elements at once. A/B is simpler and needs less traffic; MVT exposes interactions but burns through users. Teams at Stripe, Airbnb, and DoorDash run hundreds of A/B tests for every MVT, because traffic is the bottleneck and interactions are rarer than people expect.
One rule to remember: if your product cannot finish an MVT in two weeks, do not run one. Run sequential A/B tests instead. The rest of this guide walks through the math, the analysis code, and the questions product analyst recruiters at Meta and Booking ask.
What an A/B test actually does
An A/B test is a randomized controlled experiment. Traffic splits into groups, each sees a different product version, and you measure the difference in a target metric until it is statistically significant under a pre-registered design. Randomization is what makes it causal — without it you are looking at correlations and selection effects.
Hypothesis: a green Buy button increases checkout conversion
Control (A): blue Buy button
Variant (B): green Buy button
Traffic split: 50/50
Primary metric: order conversion rate
A/B answers one product question at a time. A/B/n is a slight extension: ship A, B, C, D as whole versions and apply a multiple-comparison correction (Bonferroni, Holm, Benjamini-Hochberg) at readout. Each arm is still a complete page.
The analysis itself is boring — a two-proportion z-test or a Welch t-test. The hard part is everything around it: minimum detectable effect, metric choice, unit of randomization, network effects, and resisting the urge to peek. The math behind peeking is in why peeking ruins your A/B test results — most experiment damage happens before anyone fits a model.
What a multivariate test actually does
MVT treats your page as a factorial experiment. You pick several elements, define levels for each, and ship every combination as its own variant. Each user sees one combination. At analysis time you decompose the result into main effects and interaction effects.
Element 1: button color (blue, green)
Element 2: button copy ("Buy", "Add to cart")
Element 3: placement (top, bottom)
Combinations: 2 x 2 x 2 = 8 variants
MVT does something A/B cannot: it surfaces interactions. Green might beat blue on average, but green plus "Add to cart" could underperform blue plus "Add to cart". You only see that pattern if an interaction term is in your model. Without it, you ship a globally good change that hurts a specific segment of the funnel.
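To make the interaction concrete, here is a minimal sketch using made-up conversion rates for the four color-by-copy cells. The numbers are hypothetical, chosen only to reproduce the pattern described above: green wins on average but loses when paired with "Add to cart".

```python
# Hypothetical per-cell conversion rates (illustrative only, not real data).
rates = {
    ("blue", "Buy"): 0.050,
    ("green", "Buy"): 0.060,
    ("blue", "Add to cart"): 0.055,
    ("green", "Add to cart"): 0.051,
}


def color_effect(copy):
    """Effect of switching blue -> green while holding the copy fixed."""
    return rates[("green", copy)] - rates[("blue", copy)]


effect_buy = color_effect("Buy")
effect_cart = color_effect("Add to cart")

# Main effect of color: the color effect averaged over both copies.
main_effect_color = (effect_buy + effect_cart) / 2

# Interaction contrast: how much the color effect changes with the copy.
interaction = effect_cart - effect_buy

# The cell you would actually ship if these rates were real.
best_cell = max(rates, key=rates.get)
```

With these rates the main effect of color is positive while the color effect under "Add to cart" is negative, so shipping the marginal winner of each factor would not pick the best cell.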
Side-by-side differences
| | A/B test | Multivariate test |
|---|---|---|
| What it tests | One change | Combinations of several elements |
| Number of arms | 2-4 | k1 x k2 x ... x kn (multiplicative) |
| Required traffic | Moderate | Large (scales with combinations) |
| Analysis | Simple (z-test, t-test) | Factorial (ANOVA, regression) |
| Interactions | Invisible | Visible |
| Duration | Days to weeks | Weeks to months |
| Best for | Most product questions | Landing page and email optimization |
The decision almost always reduces to traffic. Even when you suspect interactions, you usually cannot afford the user count, so you fall back to sequential A/B tests and accept the risk of a hidden interaction in the funnel.
Traffic math you cannot dodge
The most concrete reason MVT is rarer than blog posts suggest: combinations explode multiplicatively.
A/B test: 2 variants
-> ~10,000 users per arm (round illustrative figure at alpha=0.05, beta=0.2; the real number depends on baseline rate and MDE)
-> total: ~20,000
MVT (2 x 2): 4 cells -> total: ~40,000
MVT (2 x 2 x 2): 8 cells -> total: ~80,000
MVT (3 x 3 x 2): 18 cells -> total: ~180,000
Most products lack the volume. A consumer app with 50,000 monthly actives finishes a clean A/B in a week. The same app needs over a month for an eight-cell MVT, and that month exposes you to seasonality, marketing pushes, an iOS update, a competitor launch, all confounders that do not exist on a one-week test.
Rule of thumb: if traffic finishes an MVT in two weeks, run it. Otherwise run sequential A/B tests. The bias from sequential testing is usually smaller than the noise you absorb from a six-week experiment.
Per-cell sample size is not just total users divided by cells. It is set by the MDE on the smallest sub-effect you care about. To detect a 2% interaction, calibrate per-cell sample size to 2%, not to a 5% main effect. Evan Miller's and Statsig's calculators make this less painful.
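The arithmetic can be sketched as follows. This is a rough approximation for a two-sided two-proportion z-test, not a replacement for a proper calculator; the function names `users_per_arm` and `total_users` are hypothetical, and the MDE here is an absolute lift (an assumption, since the text does not say whether its MDE is absolute or relative).

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(baseline, mde_abs, alpha=0.05, power=0.8):
    """Approximate per-arm n for a two-sided two-proportion z-test.

    baseline: control conversion rate; mde_abs: absolute lift to detect.
    """
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


def total_users(per_cell, levels):
    """Total users when every cell of a full factorial gets per_cell users."""
    cells = 1
    for k in levels:
        cells *= k
    return per_cell * cells


# Per-cell size calibrated to a main effect vs. a smaller interaction
# (baseline and lifts are illustrative assumptions).
n_main = users_per_arm(baseline=0.05, mde_abs=0.0025)
n_interaction = users_per_arm(baseline=0.05, mde_abs=0.001)
```

Under these assumptions `n_main` comes out on the order of 120,000 per cell, and the smaller effect needs several times more. The result is very sensitive to the baseline rate and to whether the MDE is absolute or relative, so round figures such as 10,000 per arm are best read as placeholders.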
When to pick each one
A/B is the default for almost everything. Use it for one focused hypothesis ("$19 monthly outperforms $24", "shorter onboarding raises week-one retention"). Use it with limited traffic, when you need results quickly, when the change is backend (algorithm, pricing, infrastructure), and when this is the first experiment on a surface. The first experiment on any surface should be A/B — you want a stable baseline before slicing into factors.
MVT earns its place when three conditions hold. You are optimizing visual elements where headline, hero, and CTA interact. You have a lot of traffic — more than 100,000 users per test period. You suspect interactions, usually because previous A/B results showed a winning element on one page losing on a different layout.
Pricing pages and signup landers are the canonical MVT use case. Marketing teams at Vercel and Notion run MVTs on the homepage hero because the surface gets enough top-of-funnel traffic and the elements are visually entangled. Do not run MVT on a checkout funnel handling 5,000 sessions a day — you will not finish in time, and the result will be poisoned by everything else that changed in eight weeks.
Analyzing results
A/B analysis is a two-proportion z-test (binary) or a Welch t-test (continuous). For conversion data, the closed form is what most product analyst interviews expect you to write from memory:
from scipy import stats
# Observed conversions and sample sizes per arm
control_conversions, control_total = 520, 10000
variant_conversions, variant_total = 580, 10000
# Conversion rate in each arm
p_c = control_conversions / control_total
p_v = variant_conversions / variant_total
# Pooled rate under the null hypothesis of no difference
p_pooled = (
    (control_conversions + variant_conversions)
    / (control_total + variant_total)
)
# Standard error of the difference under the pooled rate
se = (
    p_pooled
    * (1 - p_pooled)
    * (1 / control_total + 1 / variant_total)
) ** 0.5
z = (p_v - p_c) / se
# Two-sided p-value from the standard normal distribution
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
Production stacks at Airbnb, DoorDash, and Booking wrap this in a CUPED-style variance reducer. CUPED knocks roughly 30-50% off the required sample size by regressing the outcome on pre-experiment covariates.
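The CUPED adjustment itself is short. Below is a minimal sketch on synthetic data, assuming a single pre-experiment covariate per user; the helper names and the simulated numbers are illustrative, not taken from any of the production stacks named above.

```python
import random


def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes: y - theta * (x - mean(x)).

    y: in-experiment metric per user.
    x: the same user's pre-experiment covariate.
    """
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    var_x = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)
    theta = cov_xy / var_x  # slope of the regression of y on x
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]


def sample_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


# Synthetic data (assumption): pre-period spend predicts in-experiment spend.
rng = random.Random(0)
pre = [rng.gauss(100, 20) for _ in range(5000)]
post = [0.8 * p + rng.gauss(0, 10) for p in pre]

adjusted = cuped_adjust(post, pre)
reduction = 1 - sample_variance(adjusted) / sample_variance(post)
```

The adjustment leaves the mean unchanged and shrinks the variance by roughly the squared correlation between covariate and outcome. In a real experiment theta would typically be estimated on data pooled across arms, and the adjusted means are then compared with the usual test.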
MVT analysis is factorial. You fit a regression or ANOVA with main effects and interactions, then read off the main effect of each factor, the pairwise interactions, and (rarely) the three-way interaction.
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
# df: a pandas DataFrame with one row per user and columns
# color, copy, position (factor levels) and conversion (0/1)
model = ols(
    "conversion ~ C(color) * C(copy) * C(position)",
    data=df,
).fit()
# Type II ANOVA table: one row per main effect and interaction term
anova_table = anova_lm(model, typ=2)
An interaction is significant when the effect of one factor depends on the level of another. If the p-value on C(color):C(copy) is below 0.05, you cannot interpret main effects on their own; you read them in context. That is the moment MVT pays off: you ship the specific combination that wins, not the marginal winner of each factor.
Common pitfalls
The biggest MVT mistake is running one without the traffic to finish cleanly. The experiment limps along for six weeks, seasonality shifts, the product ships unrelated changes, and the result is a polluted dataset. The fix is mechanical: compute per-cell sample size before launch, divide by daily traffic, and if the answer exceeds 14 days, run sequential A/B tests instead.
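That pre-launch check fits in a few lines. A minimal sketch, assuming traffic is split evenly across cells and that `daily_users` counts newly eligible users per day (both assumptions; the function names are hypothetical):

```python
from math import ceil


def days_to_finish(per_cell, cells, daily_users):
    """Days needed for every cell to reach per_cell users at an even split."""
    return ceil(per_cell * cells / daily_users)


def recommend(per_cell, cells, daily_users, max_days=14):
    """Apply the 14-day rule: MVT if it finishes in time, else sequential A/B."""
    days = days_to_finish(per_cell, cells, daily_users)
    design = "MVT" if days <= max_days else "sequential A/B tests"
    return days, design


# Eight cells at 10,000 users each, on two hypothetical traffic levels.
low_traffic = recommend(per_cell=10_000, cells=8, daily_users=5_000)
high_traffic = recommend(per_cell=10_000, cells=8, daily_users=10_000)
```

Returning users inflate raw daily counts, so in practice the denominator should be unique new entrants to the experiment rather than sessions.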
Closely related is using MVT when sequential A/B would do the job. Three two-level factors as an MVT need around 80,000 users and six weeks; the same three factors as sequential A/B tests need around 60,000 users, which at the same traffic is about four and a half weeks, and you can stop early if one factor does not matter. Take the MVT hit only with a strong prior on interactions, which is rarer than blog posts imply.
A third trap is ignoring multiple comparisons. Three factors give three main effects, three pairwise interactions, and one three-way interaction — seven tests on the same data. At alpha 0.05 you expect 0.35 false positives by chance. Apply Bonferroni or Benjamini-Hochberg before declaring any term significant, or pre-register a single primary effect.
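As a sketch of what the correction does, here are plain-Python versions of Holm (a uniformly more powerful variant of Bonferroni) and Benjamini-Hochberg applied to seven made-up p-values. The p-values are hypothetical; a library routine would be the usual choice in production.

```python
def holm(p_values, alpha=0.05):
    """Holm step-down: returns a reject flag per p-value, in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up: controls the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0  # largest rank whose p-value clears its threshold
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= cutoff
    return reject


# Hypothetical p-values for the seven terms of a 2 x 2 x 2 design.
p_values = [0.001, 0.012, 0.020, 0.030, 0.200, 0.450, 0.700]

uncorrected = [p < 0.05 for p in p_values]
expected_false_positives = len(p_values) * 0.05       # under the global null
family_wise_error = 1 - (1 - 0.05) ** len(p_values)   # if tests were independent
```

On these inputs four terms pass the uncorrected 0.05 threshold, Benjamini-Hochberg keeps three, and Holm keeps one, which shows how much the choice of correction matters.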
A fourth pitfall is a randomization unit that does not match the metric. If the metric is per-user conversion and you randomize by session, the same user lands in different cells and contaminates the comparison. Pick the coarsest stable unit — usually user_id, sometimes device_id — and hash that into the bucket. MVT is more brittle than A/B because contamination splits across more cells.
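Hashing the unit into a bucket can be sketched as below. The function name, the experiment label, and the choice of SHA-256 are assumptions for illustration; any stable hash with good mixing would serve.

```python
import hashlib


def assign_cell(user_id, experiment, n_cells):
    """Deterministically map a user to one of n_cells buckets.

    Salting with the experiment name keeps assignments independent
    across experiments that share the same user base.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return int(digest, 16) % n_cells


# The same user always lands in the same cell of the same experiment.
cell = assign_cell("user-42", "hero-mvt", n_cells=8)

# Rough balance check over hypothetical user ids.
counts = [0] * 8
for i in range(8000):
    counts[assign_cell(f"user-{i}", "hero-mvt", n_cells=8)] += 1
```

Because the assignment depends only on the id and the experiment name, a user who returns in a new session sees the same combination, which is what keeps a per-user metric uncontaminated.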
The fifth pitfall is treating A/B/n as if it were MVT. A/B/C is three whole pages; you cannot decompose them into factors at analysis time because the factors were never separated at design time. If you wanted to know "did the headline or the image drive the lift?", you needed an MVT from the start.
Interview answers
What is the difference between an A/B test and a multivariate test? A/B compares one change (control vs variant); MVT runs all combinations of several elements. A/B is simpler and needs less traffic; MVT exposes interactions but needs traffic proportional to the number of combinations. A/B is the default; MVT is reserved for high-traffic surfaces where you suspect interactions.
When would you pick MVT over A/B? When the surface has more than 100,000 users in the test window, I am optimizing visual elements that plausibly interact, and previous A/B results suggest interactions exist. A pricing page or landing hero on a high-traffic site is the typical fit.
How much traffic do you need for an MVT with three two-level factors? Two cubed is eight cells. At roughly 10,000 users per cell (a round illustrative figure; the real number depends on the baseline rate and the MDE) you need 80,000. To detect a 2% interaction you need substantially more. Evan Miller's or Statsig's calculator gives the per-cell number.
What is an interaction effect? When the effect of one factor depends on the level of another. Example: a green button with "Buy" converts well, but a green button with "Add to cart" converts poorly. Without an MVT and an explicit interaction term, the conditional pattern is invisible.
Can MVT be replaced by sequential A/B tests? Often yes. Three two-level factors as sequential A/Bs need fewer users and less calendar time, and you can drop a factor early. The sequential approach assumes interactions are weak — if they are strong, you find a local optimum and miss the global one.
Related reading
- A/B testing peeking mistake
- How to design an A/B test step by step
- Why run an A/A test before A/B testing
- A/B test vs holdout
- How to A/B test product pricing
FAQ
What is full factorial vs fractional factorial MVT?
Full factorial tests every combination — three two-level factors give eight cells. Fractional factorial tests a subset chosen so you can still estimate main effects (and sometimes two-way interactions), but not higher-order ones. Fractional designs save traffic when you only care about main effects. High-volume sites use Plackett-Burman or Taguchi designs to screen many factors quickly before running a full factorial on the survivors.
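A minimal sketch of the smallest useful case, a half fraction of three two-level factors, shows both the saving and the price. The generator C = A*B is one standard choice, used here as an assumption; factor levels are coded as -1 and +1.

```python
from itertools import product

# Factors coded as -1 / +1 (for example blue = -1, green = +1).
full = list(product([-1, 1], repeat=3))  # 8 runs: (A, B, C)

# Half fraction with generator C = A*B (defining relation I = ABC): 4 runs.
half = [(a, b, a * b) for a, b in product([-1, 1], repeat=2)]


def column(runs, idx):
    return [run[idx] for run in runs]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


a, b, c = (column(half, i) for i in range(3))

# The A:B interaction column in the half fraction.
ab = [x * y for x, y in zip(a, b)]
```

The three main-effect columns stay balanced and mutually orthogonal, so main effects remain estimable with half the cells. The `ab` column is identical to `c`, which means the main effect of C cannot be separated from the A:B interaction in this design; that aliasing is the traffic saving's cost.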
How are multi-armed bandits different from A/B tests?
Multi-armed bandits (MAB) are adaptive experiments that shift traffic toward the arm that looks best while the test is still running. A/B tests fix the split until the pre-registered sample size is reached. Bandits are great for short-lived content and recommendation slots where regret minimization matters more than precise effect estimates. If you need to report "the new feature lifted conversion by X%", run an A/B and accept the regret cost.
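The adaptive behavior can be sketched with Thompson sampling, one common bandit policy (picked here as an illustration, not as the method any particular platform uses). The true conversion rates and user count are made up.

```python
import random


def thompson_pick(successes, failures, rng):
    """Pick the arm with the highest draw from its Beta posterior."""
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return draws.index(max(draws))


def run_bandit(true_rates, n_users, seed=0):
    """Simulate a Bernoulli bandit and return how many users each arm got."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(n_users):
        arm = thompson_pick(successes, failures, rng)
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


# Hypothetical true rates: arm 1 is clearly better than arm 0.
pulls = run_bandit([0.05, 0.10], n_users=10_000)
```

Traffic drifts toward the better arm as evidence accumulates, which lowers regret but leaves the weaker arm with few observations. That unequal allocation is why the effect estimate from a bandit is less precise than one from a fixed-split A/B test.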
Are MVT and A/B/n the same thing?
No. A/B/n is an A/B test with several whole-page variants, each shipped end to end. MVT decomposes the page into factors and tests their combinations. Three A/B/C variants are three complete pages with no factor decomposition possible at analysis time. An MVT of headline (2 levels) by CTA (2 levels) is four cells with two interpretable factors and one interaction. You cannot retrofit factor structure onto an A/B/n result after the fact.
Which tools support MVT in 2026?
After Google Optimize shut down in September 2023, the practical options are VWO, Optimizely, AB Tasty, and Convert for visual MVT on landing pages. For server-side experimentation at scale, Statsig, Eppo, GrowthBook, and Split.io support factorial designs with main-effects and interaction analysis. Feature-flag platforms (LaunchDarkly, Unleash) handle the targeting layer but expect you to bring your own analysis pipeline.
Should an MVT include a "do nothing" control cell?
Yes, always include a cell that matches the current production experience. Otherwise you compare new combinations against each other with no baseline, and you cannot tell whether the best of the bunch is actually better than what users see today. The control cell anchors the entire readout. Skipping it is one of the most common MVT design mistakes.