Bayesian A/B test interview recipe
Contents:
- Why interviewers ask about Bayesian A/B
- Frequentist vs Bayesian: the actual difference
- Prior, likelihood, posterior in plain language
- Credible interval vs confidence interval
- Probability to be best — the metric stakeholders want
- Python recipe: Beta-Binomial in 20 lines
- When Bayesian wins, when frequentist wins
- Common pitfalls
- Related reading
- FAQ
Why interviewers ask about Bayesian A/B
Senior data analyst loops at Stripe, Airbnb, Netflix, and DoorDash usually include "Walk me through how you'd call this test on day five." If you reach for a t-test and a fixed sample-size calculator, you fail. The expected answer is "I'd run it Bayesian, watch the posterior, and stop when probability-to-be-best clears 95%." Interviewers want to hear why peeking is fine in one framework and catastrophic in the other.
The second reason is product framing. A frequentist p-value answers a question almost nobody asked: "How surprising is this data if the variant did nothing?" The PM wants "What's the chance B beats A?" Bayesian A/B testing gives that number directly. If you can explain the swap in two minutes, you've moved past junior into senior territory.
The third reason is small samples. A startup running 800 visitors per arm cannot detect a five-percent lift with frequentist tools at reasonable power. A weakly informative prior plus a PBB-based stopping rule makes a real call on the same data. Series-B interviewers probe this hard — it's how their team operates.
Frequentist vs Bayesian: the actual difference
The frequentist test works with P(data | H0): the probability of data at least as extreme as what was observed, assuming the null is true. The Bayesian test answers P(H1 | data): the probability the alternative is true given the data. Both are valid; only the second is what the product side wants. State this swap explicitly in interviews and add "that's why frequentist p-values are so often misread as the probability that the variant works."
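The mechanical differences fall out. A frequentist test fixes sample size up front to control type-I error at alpha. Once running, peeking inflates false positives: every interim look adds invisible alpha cost unless you correct with Pocock or O'Brien-Fleming boundaries. A Bayesian posterior has no such dependence on the sampling plan: it is valid at every step because it isn't conditioned on "infinite repetitions of this experiment." It's an updated belief about a single parameter given the data so far. One caveat worth stating in the interview: stopping the first time PBB crosses a threshold can still raise the long-run false-positive rate, which is why the stop rule is usually paired with an expected-loss threshold and a sample cap.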
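The cost of uncorrected peeking in a frequentist test can be seen in a small simulation. The sketch below runs simulated A/A tests (no true difference) and compares the false-positive rate of a single final look against "declare a winner at the first look with p < 0.05". The traffic numbers, number of looks, and seed are illustrative assumptions, not values from any real platform.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_experiments = 2000   # simulated A/A tests (both arms share the same true rate)
n_looks = 10           # interim looks per experiment
n_per_look = 500       # visitors added to each arm between looks
true_rate = 0.10       # hypothetical conversion rate for both arms
alpha = 0.05

# Cumulative conversions per arm at each look: shape (n_experiments, n_looks)
conv_a = rng.binomial(n_per_look, true_rate, size=(n_experiments, n_looks)).cumsum(axis=1)
conv_b = rng.binomial(n_per_look, true_rate, size=(n_experiments, n_looks)).cumsum(axis=1)
n_seen = n_per_look * np.arange(1, n_looks + 1)  # visitors per arm at each look

# Pooled two-proportion z-test evaluated at every look
rate_a, rate_b = conv_a / n_seen, conv_b / n_seen
pooled = (conv_a + conv_b) / (2 * n_seen)
se = np.sqrt(pooled * (1 - pooled) * 2 / n_seen)
p_values = 2 * stats.norm.sf(np.abs(rate_b - rate_a) / se)

fpr_single_look = (p_values[:, -1] < alpha).mean()          # test once, at the end
fpr_with_peeking = (p_values < alpha).any(axis=1).mean()    # stop at first "significant" look

print(f"False-positive rate, one final look: {fpr_single_look:.3f}")
print(f"False-positive rate, stop at first p < {alpha}: {fpr_with_peeking:.3f}")
```

The single-look rate should land near alpha, while the peeking rate comes out several times higher; that gap is what Pocock or O'Brien-Fleming boundaries are designed to close.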
Frequentist tests produce a point estimate plus a confidence interval whose interpretation requires a probability lecture. Bayesian tests produce a posterior distribution, from which credible intervals, PBB, expected lift, and risk of regret fall out directly. The cost: you must choose a prior, and non-conjugate models need MCMC (PyMC, Stan, NumPyro) — minutes per analysis, not milliseconds.
Prior, likelihood, posterior in plain language
Bayes' theorem in symbols:
P(theta | data) = P(data | theta) * P(theta) / P(data)
Prior P(theta) is your belief about the parameter before the experiment. For checkout conversion, this might be "around 8%, but 3% to 15% wouldn't shock me," encoded as Beta(4, 46) or similar.
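Before committing to a prior it helps to check what it actually says. A minimal sketch, assuming the Beta(4, 46) example above, that prints the prior mean and the central 95% range so you can compare them against the belief you meant to encode:

```python
from scipy import stats

# Prior from the text: Beta(4, 46), intended to mean "around 8%"
prior = stats.beta(4, 46)

prior_mean = prior.mean()                          # alpha / (alpha + beta)
prior_low, prior_high = prior.ppf([0.025, 0.975])  # central 95% of prior mass

print(f"Prior mean: {prior_mean:.3f}")
print(f"Central 95% of prior mass: [{prior_low:.3f}, {prior_high:.3f}]")
```

If the printed range is wider or narrower than the belief you can defend, adjust alpha and beta while keeping alpha / (alpha + beta) at the intended mean.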
Likelihood P(data | theta) is the probability of observed outcomes given a specific parameter. For a binary metric, this is the Binomial likelihood: K converted out of N visitors at true rate theta. The likelihood is identical in frequentist and Bayesian analysis — that's not where the worldviews differ.
Posterior P(theta | data) is what you use to decide. With enough data the prior gets swamped: with 10,000 visitors converting at 10%, a Beta(1, 1) prior gives a posterior mean of about 10.0% and a deliberately strong Beta(50, 50) prior gives about 10.4%, and the gap keeps shrinking as traffic grows. That property is the answer to the standard interview pushback ("isn't choosing a prior subjective?"). At production sample sizes, it mostly stops mattering.
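How quickly the prior gets swamped is plain arithmetic on the posterior mean. The sketch below assumes a hypothetical observed rate of 10% and compares a flat Beta(1, 1) prior with a strong Beta(50, 50) prior at three sample sizes; the helper name `posterior_mean` is illustrative.

```python
def posterior_mean(alpha_prior, beta_prior, conversions, visitors):
    """Mean of the Beta posterior after a Binomial observation."""
    return (alpha_prior + conversions) / (alpha_prior + beta_prior + visitors)

observed_rate = 0.10  # hypothetical observed conversion rate
gaps = {}
for visitors in (100, 10_000, 1_000_000):
    conversions = round(observed_rate * visitors)
    flat = posterior_mean(1, 1, conversions, visitors)
    strong = posterior_mean(50, 50, conversions, visitors)
    gaps[visitors] = strong - flat
    print(f"n={visitors:>9,}: flat prior {flat:.4f}, strong prior {strong:.4f}, gap {gaps[visitors]:.4f}")
```

The gap between the two posteriors shrinks roughly in proportion to 1/n, which is the quantitative version of "the prior stops mattering."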
For binary metrics, Beta is the conjugate prior for the Binomial likelihood. If your prior is Beta(alpha, beta) and you observe K successes in N trials, your posterior is Beta(alpha + K, beta + N - K). No MCMC. No sampling. Just arithmetic. That's why every production Bayesian A/B framework starts with Beta-Binomial — fast enough to recompute on every dashboard refresh.
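One way to convince yourself (or an interviewer) that the conjugate update is just Bayes' theorem is to compare it against a brute-force grid computation. This is a sketch with hypothetical counts (30 conversions out of 200 visitors) and the Beta(4, 46) prior from earlier; the grid resolution is an arbitrary choice.

```python
import numpy as np
from scipy import stats

alpha_prior, beta_prior = 4, 46
conversions, visitors = 30, 200  # hypothetical observed data

# Closed form: Beta(alpha + K, beta + N - K)
closed_form = stats.beta(alpha_prior + conversions,
                         beta_prior + visitors - conversions)

# Brute force: prior * likelihood on a grid of theta values, then normalize
theta = np.linspace(0.0005, 0.9995, 2000)
step = theta[1] - theta[0]
unnormalized = (stats.beta.pdf(theta, alpha_prior, beta_prior)
                * stats.binom.pmf(conversions, visitors, theta))
grid_posterior = unnormalized / (unnormalized.sum() * step)

max_abs_error = np.max(np.abs(grid_posterior - closed_form.pdf(theta)))
print(f"Largest density difference between grid and closed form: {max_abs_error:.2e}")
```

The two densities agree up to numerical error, and the closed form needs only two additions instead of a grid.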
Credible interval vs confidence interval
The fastest way to fail this question is "credible interval and confidence interval are basically the same." They are not.
A 95% credible interval: given the data and prior, there is a 95% probability the parameter lies in this interval. That is a direct probabilistic statement about the parameter — what every PM and executive assumes "confidence interval" means.
A 95% confidence interval: if you repeated the experiment infinitely many times, 95% of constructed intervals would contain the true parameter. The realized interval either contains the parameter or doesn't; no probability is attached. Technically correct, operationally useless when explaining a result to a non-statistician.
Numerically the two are usually close with a weak prior and large sample. So why does it matter? Because if you say "there's a 95% chance the lift is between 1.2% and 4.8%" while running a frequentist test, you've stated something false. The whole point of Bayesian is so that sentence becomes true.
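The numerical closeness is easy to check. The sketch below computes an equal-tailed 95% credible interval under a Beta(1, 1) prior and a frequentist Wald confidence interval for the same hypothetical arm (560 conversions out of 5,000 visitors). The Wald interval is one of several possible frequentist constructions and is chosen here only for simplicity.

```python
import numpy as np
from scipy import stats

conversions, visitors = 560, 5000  # hypothetical single-arm data

# Bayesian: equal-tailed 95% credible interval from the Beta posterior
posterior = stats.beta(1 + conversions, 1 + visitors - conversions)
cred_low, cred_high = posterior.ppf([0.025, 0.975])

# Frequentist: Wald 95% confidence interval around the sample proportion
p_hat = conversions / visitors
se = np.sqrt(p_hat * (1 - p_hat) / visitors)
z = stats.norm.ppf(0.975)
conf_low, conf_high = p_hat - z * se, p_hat + z * se

print(f"Credible interval:   [{cred_low:.4f}, {cred_high:.4f}]")
print(f"Confidence interval: [{conf_low:.4f}, {conf_high:.4f}]")
```

The endpoints differ only in the later decimals; what differs is the sentence you are allowed to say about them.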
Probability to be best — the metric stakeholders want
PBB (probability to be best) is the single number that drives Bayesian A/B decisions in production. Draw samples from each variant's posterior and count the fraction where B exceeds A. With 100,000 draws from Beta(561, 4441) for B and Beta(501, 4501) for A, the number of draws where samples_b > samples_a, divided by 100,000, gives PBB directly.
The PBB threshold plays the role alpha plays in frequentist testing. Common choices: 0.95 for "confident before shipping," 0.90 for low-risk changes, 0.99 for irreversible decisions like pricing. The threshold is a business call about the cost of false positives.
PBB composes naturally for multi-arm tests. With three variants you compute P(A best), P(B best), P(C best), summing to 1. Frequentist multiple comparisons require explicit corrections (Bonferroni, Benjamini-Hochberg). Bayesian methods handle this through the posterior structure — no separate correction.
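For more than two arms, the same Monte Carlo idea applies: draw from every posterior and record which arm wins each draw. A minimal sketch with three hypothetical arms and a Beta(1, 1) prior; the counts and seed are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical (conversions, visitors) per arm
arms = {"A": (500, 5000), "B": (560, 5000), "C": (530, 5000)}
n_samples = 100_000

# One column of posterior draws per arm, Beta(1, 1) prior
samples = np.column_stack([
    rng.beta(1 + conv, 1 + visitors - conv, size=n_samples)
    for conv, visitors in arms.values()
])

# For each draw, find the arm with the highest sampled rate
winners = samples.argmax(axis=1)
prob_best = {name: (winners == i).mean() for i, name in enumerate(arms)}

for name, prob in prob_best.items():
    print(f"P({name} is best) = {prob:.3f}")
```

Because exactly one arm wins each draw, the probabilities sum to 1 by construction.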
Report expected lift and expected-loss-if-wrong alongside PBB. Expected loss is a common stop criterion in Bayesian A/B tooling (VWO's published methodology is one example): keep running until the expected loss of declaring B the winner is below a small threshold like 0.5% of baseline. This catches the edge case where PBB is high but lift magnitude is trivial.
Python recipe: Beta-Binomial in 20 lines
For binary outcomes — conversion, click-through, signup completion, retention indicators — the entire Bayesian A/B test fits in twenty lines with no external dependencies beyond NumPy and SciPy.
import numpy as np
from scipy import stats
# Observed counts
visitors_a, conversions_a = 5000, 500  # 10.0%
visitors_b, conversions_b = 5000, 560  # 11.2%
# Prior: Beta(1, 1) is uniform on [0, 1], i.e. flat and uninformative
alpha_prior, beta_prior = 1, 1
# Posterior is closed-form for Beta-Binomial conjugacy
post_a = stats.beta(alpha_prior + conversions_a,
                    beta_prior + visitors_a - conversions_a)
post_b = stats.beta(alpha_prior + conversions_b,
                    beta_prior + visitors_b - conversions_b)
# Probability that B beats A, estimated by Monte Carlo from the posteriors
n_samples = 100_000
samples_a = post_a.rvs(n_samples)
samples_b = post_b.rvs(n_samples)
prob_b_better = (samples_b > samples_a).mean()
print(f"P(B > A) = {prob_b_better:.3f}")
# P(B > A) is approximately 0.974 (the last digit varies between runs)
# 95% credible interval for the absolute lift (difference in conversion rates)
diff = samples_b - samples_a
ci_low, ci_high = np.percentile(diff, [2.5, 97.5])
print(f"95% credible interval for lift: [{ci_low:.4f}, {ci_high:.4f}]")
# Approximately [-0.0001, 0.0241]; the lower bound sits right around zero
# because P(B > A) is just under 0.975
# Expected loss if we pick B (and B is actually worse)
loss_if_choose_b = np.maximum(samples_a - samples_b, 0).mean()
print(f"Expected loss picking B: {loss_if_choose_b:.5f}")
On the same data a frequentist two-proportion z-test would yield p around 0.051, just on the "not significant" side of alpha = 0.05. That's the scenario where a PM pressures you to keep running another week, inflating type-I error without sequential corrections. The Bayesian framing: "posterior probability that B is better is 97.4%, with a 95% credible interval for the absolute lift running from about zero to 2.4 percentage points." That's a decision you can actually make.
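The frequentist comparison quoted above can be reproduced in a few lines. This sketch uses a pooled two-proportion z-test written out by hand with SciPy's normal distribution; it is one standard formulation, not necessarily the exact test a given platform runs.

```python
import numpy as np
from scipy import stats

visitors_a, conversions_a = 5000, 500
visitors_b, conversions_b = 5000, 560

rate_a = conversions_a / visitors_a
rate_b = conversions_b / visitors_b

# Pooled rate under the null hypothesis of no difference
pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
se = np.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))

z_stat = (rate_b - rate_a) / se
p_value = 2 * stats.norm.sf(abs(z_stat))  # two-sided

print(f"z = {z_stat:.3f}, p = {p_value:.4f}")
```

The p-value lands just above 0.05, which is the borderline situation the paragraph above describes.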
For revenue per user or any continuous metric, Beta-Binomial doesn't apply. Switch to a Normal likelihood with a Normal-Gamma prior (closed form) or a hierarchical model in PyMC with a log-Normal likelihood. Sample with NUTS and apply identical PBB and expected-loss calculations. Structure is the same; only the likelihood and sampling cost change.
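As a rough illustration of the closed-form route, the sketch below applies a Normal likelihood with a Normal-Gamma prior to log-revenue and reuses the PBB and expected-loss calculations. Everything here is an assumption made for the example: the data are synthetic, the prior hyperparameters (`mu0`, `kappa0`, `alpha0`, `beta0`) are arbitrary weak values, the function name is hypothetical, and the comparison is on mean log-revenue rather than mean revenue.

```python
import numpy as np

def sample_posterior_mean(x, rng, n_samples=50_000,
                          mu0=0.0, kappa0=1e-3, alpha0=1e-3, beta0=1e-3):
    """Posterior draws of the mean under a Normal likelihood with a
    Normal-Gamma prior on (mean, precision). Hyperparameters are weak defaults."""
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    # Standard conjugate updates for the Normal-Gamma model
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    alpha_n = alpha0 + n / 2
    beta_n = (beta0 + 0.5 * np.sum((x - xbar) ** 2)
              + kappa0 * n * (xbar - mu0) ** 2 / (2 * kappa_n))
    # Draw the precision first, then the mean conditional on that precision
    precision = rng.gamma(shape=alpha_n, scale=1 / beta_n, size=n_samples)
    return rng.normal(mu_n, 1 / np.sqrt(kappa_n * precision))

rng = np.random.default_rng(7)
# Synthetic log-revenue per paying user for each arm
log_rev_a = rng.normal(3.00, 1.0, size=2000)
log_rev_b = rng.normal(3.20, 1.0, size=2000)

mean_a = sample_posterior_mean(log_rev_a, rng)
mean_b = sample_posterior_mean(log_rev_b, rng)

prob_b_better = (mean_b > mean_a).mean()
expected_loss_b = np.maximum(mean_a - mean_b, 0).mean()
print(f"P(B > A) on mean log-revenue: {prob_b_better:.3f}")
print(f"Expected loss picking B (log scale): {expected_loss_b:.5f}")
```

Real revenue data usually has many zeros and a heavy tail, so a production model would likely need more structure than this; the point is only that the decision metrics are computed the same way once posterior draws exist.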
When Bayesian wins, when frequentist wins
Bayesian wins when traffic is scarce. A B2B SaaS test with 200 trials per arm per week will never reach frequentist significance short of a massive lift. A weakly informative prior built from last quarter's data makes the test answerable in three weeks instead of seven months. Bayesian also wins when stakeholders are in the call — PBB translates to a decision without a probability lecture.
Bayesian wins for stopping rules. If you need to look daily and ship as soon as the answer is clear, you cannot do that frequentist without a sequential correction that costs power. The posterior is valid every time you compute it. Pair with a max sample size to avoid the "I'll just keep running it" trap.
Frequentist wins when your organization has an A/B testing platform already deployed, when stakeholders prefer p-values, and when samples are large enough that the choice barely affects the result. At Google scale the frameworks converge numerically, and frequentist tooling is more standardized. Don't fight the platform team on a test where it changes nothing.
Frequentist also wins for regulated work — pharma trials, financial product testing, external audits expect p-values. Outside those domains, the choice is mostly cultural.
Common pitfalls
The most damaging pitfall is a strong prior with a small sample, yielding a result that mostly reflects the prior. Set Beta(50, 50) and run 100 visitors per arm at a true 5% rate, and the posterior mean lands near 27%, pulled far away from the data toward the prior's 50%. Fix: for production tests start with Beta(1, 1) (uniform) or Beta(0.5, 0.5) (Jeffreys) unless you can defend an informative prior with prior-period data. Inspect prior vs posterior on a plot before reporting.
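A quick numerical check makes the failure mode concrete. The sketch below assumes one arm with 100 visitors and exactly 5 conversions, then asks whether the 95% credible interval under each prior still contains the true 5% rate.

```python
from scipy import stats

visitors, conversions = 100, 5  # hypothetical arm observed at exactly 5%
true_rate = 0.05

priors = {
    "Beta(50, 50)": (50, 50),      # strong prior centered on 50%
    "Beta(1, 1)": (1, 1),          # uniform
    "Beta(0.5, 0.5)": (0.5, 0.5),  # Jeffreys
}

results = {}
for name, (a, b) in priors.items():
    posterior = stats.beta(a + conversions, b + visitors - conversions)
    low, high = posterior.ppf([0.025, 0.975])
    results[name] = (posterior.mean(), low, high)
    covered = low <= true_rate <= high
    print(f"{name:>14}: mean {posterior.mean():.3f}, "
          f"95% interval [{low:.3f}, {high:.3f}], contains 5%: {covered}")
```

Under the strong prior the interval excludes the rate that generated the data, while the two weak priors keep it comfortably inside.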
A related trap is treating PBB as a synonym for "expected lift is large." PBB measures probability of any positive lift, not magnitude. A test can sit at PBB = 0.97 with expected lift of 0.05% — statistically near-certain, operationally a rounding error. Always report expected lift alongside PBB and apply an expected-loss stop criterion. Otherwise you'll ship cosmetic wins and burn engineering quarters.
Many teams compute PBB once at the end of a fixed-sample test and call it Bayesian. That doesn't capture the main practical benefit. If your stopping rule is "run for two weeks, look once," you've used different math at the end but kept the frequentist constraints. To get the early-stopping benefit, define the stop rule (PBB threshold, expected-loss threshold, or max sample) up front and respect it.
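A stop rule defined up front can be as small as one function that is called on every look. The sketch below is one possible shape, not a standard implementation: the function name, the thresholds (PBB of 0.95, absolute expected loss of 0.0005, cap of 20,000 visitors per arm), and the Beta(1, 1) prior are all illustrative assumptions.

```python
import numpy as np

def evaluate_stop_rule(conv_a, n_a, conv_b, n_b, rng,
                       pbb_threshold=0.95, loss_threshold=0.0005,
                       max_visitors_per_arm=20_000, n_samples=100_000):
    """Return (decision, pbb, expected_loss_b) for the data seen so far."""
    samples_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=n_samples)
    samples_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=n_samples)

    pbb = (samples_b > samples_a).mean()
    loss_b = np.maximum(samples_a - samples_b, 0).mean()  # regret if we ship B
    loss_a = np.maximum(samples_b - samples_a, 0).mean()  # regret if we keep A

    if pbb >= pbb_threshold and loss_b <= loss_threshold:
        decision = "ship B"
    elif (1 - pbb) >= pbb_threshold and loss_a <= loss_threshold:
        decision = "keep A"
    elif max(n_a, n_b) >= max_visitors_per_arm:
        decision = "stop: sample cap reached, no clear winner"
    else:
        decision = "keep running"
    return decision, pbb, loss_b

rng = np.random.default_rng(1)
early_decision, early_pbb, _ = evaluate_stop_rule(50, 500, 56, 500, rng)
later_decision, later_pbb, _ = evaluate_stop_rule(500, 5000, 560, 5000, rng)
print(f"Early look: {early_decision} (PBB {early_pbb:.3f})")
print(f"Later look: {later_decision} (PBB {later_pbb:.3f})")
```

Writing the thresholds as function arguments forces them to be chosen before the data arrives, which is the discipline the paragraph above asks for.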
Stakeholders sometimes hear "94% probability B is better" as "B is better by a lot." It isn't necessarily. Educate your PM with one line: "PBB measures direction, expected lift measures magnitude, expected loss measures regret-if-wrong." Putting all three on the same dashboard prevents the most common misread.
The final trap: any A/B framework depends on proper randomization, SRM checks, and pre-registered metrics. The math is downstream of experimental hygiene. If assignment is broken or your primary metric drifted mid-test, neither framework saves you. Run an A/A test periodically to confirm the pipeline is healthy.
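A sample ratio mismatch (SRM) check is typically a chi-square goodness-of-fit test on the assignment counts. A minimal sketch using `scipy.stats.chisquare`; the helper name `srm_p_value`, the intended 50/50 split, the example counts, and the 0.001 alert level are assumptions for illustration.

```python
from scipy import stats

def srm_p_value(visitors_a, visitors_b, expected_split=(0.5, 0.5)):
    """Chi-square goodness-of-fit p-value for the observed traffic split."""
    total = visitors_a + visitors_b
    expected = [total * share for share in expected_split]
    return stats.chisquare([visitors_a, visitors_b], f_exp=expected).pvalue

healthy = srm_p_value(5000, 5000)     # exactly the intended split
suspicious = srm_p_value(5000, 5400)  # 400 extra visitors in one arm

SRM_ALERT_LEVEL = 0.001  # hypothetical alert threshold
print(f"Balanced split:   p = {healthy:.4f}")
print(f"Imbalanced split: p = {suspicious:.6f}, alert: {suspicious < SRM_ALERT_LEVEL}")
```

A very small p-value here says the assignment mechanism is likely broken, in which case the posterior or p-value on the primary metric should not be trusted.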
Related reading
- A/B testing complete guide
- Bayesian methods for data science interviews
- Bayes theorem explained simply
- A/B testing peeking mistake
- Why run an A/A test
FAQ
When should I use Bayesian instead of frequentist A/B testing?
Use Bayesian when traffic is scarce, when stakeholders need a direct probability statement, or when you genuinely need to peek and stop early. Closed-form Beta-Binomial is cheap enough to recompute on every dashboard refresh, so you can build a real-time stop rule around PBB and expected loss. Use frequentist when your platform is already frequentist, when traffic is abundant, or when regulators expect a p-value.
What prior should I use for a conversion A/B test?
For default production, Beta(1, 1) (uniform) is safe and uninformative. Beta(0.5, 0.5) (Jeffreys) is also fine. Use an informative prior only when you have a defensible prior-period baseline, and weight it modestly — Beta(10, 90) for historical 10% conversion is plenty. Strong priors plus small samples give belief-dominated posteriors, the failure mode that gives Bayesian methods a bad reputation among skeptical teams.
Do I need MCMC for a Bayesian A/B test?
Not for binary metrics. Beta-Binomial conjugacy gives a closed-form posterior in two arithmetic operations. You need MCMC (PyMC, Stan, NumPyro) only for non-conjugate likelihoods — typically continuous metrics like revenue per user, modeled with log-Normal or Gamma. Runtime is minutes with modern NUTS samplers. Variational inference is faster but harder to defend in interviews.
Can I peek at Bayesian A/B results without inflating false positives?
Yes, in the sense that the posterior is a valid statement of belief at every step, which is the headline practical advantage. Compute it daily, plot the PBB trajectory, and stop when PBB crosses your threshold (commonly 0.95). Stopping on the first crossing can still raise the long-run false-positive rate, so pair the threshold with a max sample cap and an expected-loss threshold; that also avoids pathological never-ending tests. The frequentist analog requires Pocock, O'Brien-Fleming, or alpha-spending corrections that cost statistical power for the same flexibility.
What is probability to be best (PBB) and how do I compute it?
PBB is the posterior probability that one variant has a higher true metric than the others. Draw samples from each variant's posterior and count the fraction where the variant of interest exceeds all others. For two Beta posteriors, draw 100,000 samples each and compute (samples_b > samples_a).mean(). Use PBB with expected loss or expected lift to avoid shipping cosmetic wins that clear the probability threshold but have trivial magnitude.