Bayesian A/B testing — practical guide for product teams

Two ways to read an experiment

Your PM at Stripe pings you Friday afternoon. The checkout team shipped a new pricing layout and wants to know if it is safe to ramp on Monday. You open the dashboard and see two numbers: a p-value of 0.07 and a 95 percent confidence interval that straddles zero. The PM does not want a stats lecture. She wants to know how confident you are that the new layout is at least neutral, ideally a small win. The frequentist toolkit answers a different question entirely, and that mismatch is exactly why Bayesian A/B testing has quietly become the default at Airbnb, Booking, Netflix, and most consumer platforms with a serious experimentation function.

The split is philosophical. Frequentist methods ask: assuming the null hypothesis is true, how surprising is this data? Bayesian methods ask: given this data, how plausible is each value of the effect? Both are valid. Only the second maps cleanly to what business owners actually need to decide. When the head of growth at DoorDash asks if variant B is better than A, she wants a probability, not a hypothesis-test verdict. The Bayesian framework hands that back directly.

The cost of switching frameworks is mostly cultural. The math is not harder for the cases product teams care about, and conjugate models are cheap to compute because the posterior has a closed form. But you have to retrain stakeholders to read posteriors, expected loss, and credible intervals instead of stars next to a p-value. Skip that retraining and you ship the same false positives in a fancier wrapper.

How the Bayesian loop works

The Bayesian update is four steps and the same every time. You set a prior, which is a probability distribution over the parameter you care about before you see the data. You collect observations from the experiment. You combine the prior with the likelihood of the data to produce a posterior distribution. You make decisions off that posterior.

The output is not a single point estimate. It is a full distribution over plausible values of the conversion rate, the revenue per user, or whatever metric you wired up. A posterior tells you both the most likely lift and how much uncertainty is left, and lets you compute any decision quantity you want: probability that B beats A, expected loss if you pick the wrong variant, or the chance that the lift is at least 1 percent.

This loop also extends naturally to sequential analysis. You can update the posterior every day and it remains a valid summary of what you know, because the posterior is the entire belief state, not a single test statistic that requires a fixed sample size to behave correctly. Teams at Vercel, Figma, and Anthropic lean on this property to shorten experiments by 30 to 50 percent on average. The caveat is that daily updating is only safe when it is paired with a stopping rule fixed up front: a lopsided rule still biases the lift you report, even though the posterior itself stays valid.
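
The daily-update property is easy to check on a conjugate model. The sketch below assumes a Beta(1, 1) prior and made-up daily counts (the `daily_counts` values are hypothetical) and shows that updating day by day lands on the same posterior as one batch update over the pooled data.

```python
# Hypothetical daily (trials, conversions) counts for a single variant.
daily_counts = [(200, 9), (180, 11), (220, 8), (210, 12), (190, 10)]

# Start from a flat Beta(1, 1) prior.
alpha, beta_param = 1.0, 1.0

for day, (trials, conversions) in enumerate(daily_counts, start=1):
    # Conjugate update: successes add to alpha, failures add to beta.
    alpha += conversions
    beta_param += trials - conversions
    posterior_mean = alpha / (alpha + beta_param)
    print(f"Day {day}: Beta({alpha:.0f}, {beta_param:.0f}), mean {posterior_mean:.2%}")

# One batch update over the pooled data, for comparison.
total_trials = sum(t for t, _ in daily_counts)
total_conversions = sum(c for _, c in daily_counts)
batch_alpha = 1.0 + total_conversions
batch_beta = 1.0 + total_trials - total_conversions

same_posterior = (alpha, beta_param) == (batch_alpha, batch_beta)
print(f"Sequential and batch posteriors match: {same_posterior}")
```

The order and granularity of the updates do not matter for the posterior. What still matters is the rule you use to decide when to stop looking.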

Choosing a prior without lying to yourself

The prior is where Bayesian analysis gets a bad reputation. Critics say it is subjective. They are not wrong, but the frequentist alternative is not assumption-free either: a maximum-likelihood estimate matches what you would get from a flat prior, and the assumption is just less visible. The right framing is to be honest about what you know and pressure-test the prior against historical data.

An uninformative prior such as Beta(1, 1) is flat over conversion rates from 0 to 100 percent. It says you know nothing about the parameter and you want the data to do all the work. This is a safe default when you have no history, but it forces you to collect more data before the posterior tightens.

A weakly informative prior nudges the posterior toward sensible values without dominating it. For a checkout-conversion experiment at a consumer subscription business, Beta(2, 50) puts most of the mass between 1 and 8 percent, which is a defensible range for the category. The posterior still moves substantially with a few thousand observations.

An informative prior pulls from real history. If your last 12 months of identical experiments produced conversion rates clustered around 4 percent, you can encode that with Beta(40, 960), which carries the weight of 1,000 prior observations at a 4 percent rate. The prior speeds up convergence and protects against small-sample noise. Rule of thumb most platforms enforce: a prior is too strong if the posterior at 10x the planned sample size still looks like the prior. Run that sanity check before trusting any informative prior.
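
A quick way to pressure-test these three priors is to print what each one actually claims. The sketch below uses `scipy.stats.beta`; the helper names `describe_prior` and `prior_weight` and the `planned_n` value are illustrative assumptions, not part of any platform's API. The weight formula follows from the Beta-Binomial update: the posterior mean is a weighted average of the prior mean and the observed rate, with the prior weighted by its pseudo-observation count.

```python
from scipy.stats import beta

# The three priors discussed above, as (alpha, beta) pairs.
priors = {
    "flat": (1, 1),
    "weak": (2, 50),
    "informative": (40, 960),
}

def describe_prior(a, b):
    """Summarize what a Beta(a, b) prior says about a conversion rate."""
    dist = beta(a, b)
    low, high = dist.interval(0.95)
    return {
        "mean": dist.mean(),
        "low_95": low,
        "high_95": high,
        "mass_1_to_8_pct": dist.cdf(0.08) - dist.cdf(0.01),
        "pseudo_observations": a + b,
    }

def prior_weight(a, b, n_trials):
    """Share of the posterior mean contributed by the prior after n_trials."""
    return (a + b) / (a + b + n_trials)

for name, (a, b) in priors.items():
    s = describe_prior(a, b)
    print(
        f"{name:12s} mean={s['mean']:.3f} "
        f"95%=[{s['low_95']:.3f}, {s['high_95']:.3f}] "
        f"mass in 1-8%={s['mass_1_to_8_pct']:.2f} "
        f"pseudo-obs={s['pseudo_observations']}"
    )

planned_n = 1000  # hypothetical planned sample size per variant
for multiple in (1, 10):
    w = prior_weight(40, 960, planned_n * multiple)
    print(f"Beta(40, 960) weight at {multiple}x planned sample: {w:.0%}")
```

If the prior still carries a large share of the posterior mean at 10x the planned sample, the data never had a real chance to move it.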

Beta-Binomial in Python

The Beta-Binomial pair is the workhorse model for conversion-rate experiments. Beta is the conjugate prior to the Binomial likelihood, so the posterior is also Beta and you can update in closed form. No MCMC, no waiting.

import numpy as np
from scipy.stats import beta

# Observed data from a checkout experiment
a_trials, a_conversions = 1000, 50
b_trials, b_conversions = 1000, 65

# Uninformative prior Beta(1, 1)
a_posterior = beta(1 + a_conversions, 1 + a_trials - a_conversions)
b_posterior = beta(1 + b_conversions, 1 + b_trials - b_conversions)

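# Sample the posteriors to estimate P(B > A); seed the generator so runs are reproducible
n_samples = 100_000
rng = np.random.default_rng(42)
a_samples = a_posterior.rvs(n_samples, random_state=rng)
b_samples = b_posterior.rvs(n_samples, random_state=rng)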

p_b_better = (b_samples > a_samples).mean()
print(f"P(B > A): {p_b_better:.3f}")

# Expected uplift and 95 percent credible interval
uplift = (b_samples - a_samples) / a_samples
print(f"Expected uplift: {uplift.mean():.1%}")
print(
    f"95% credible interval: "
    f"[{np.percentile(uplift, 2.5):.1%}, {np.percentile(uplift, 97.5):.1%}]"
)

The output tells you the probability that B is better, the expected percentage lift, and the credible interval for that lift. Hand any one of those numbers to a stakeholder and they will understand it without a stats refresher. Try doing that with a p-value and a confidence interval and watch the meeting derail.

For revenue-per-user or other continuous metrics, swap the Beta-Binomial for a Normal-Normal model; for count metrics such as orders per user, use a Gamma-Poisson. The loop is identical. Libraries like PyMC and Bambi let you wire that up in about 20 lines.
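
For a continuous metric the closed-form version looks like this. It is a minimal sketch of a Normal-Normal update on mean revenue per user, under loud assumptions: the sample standard deviation is treated as known, the sample is large enough for the mean to be roughly Normal even though revenue is skewed, and every number (prior, means, standard deviations, sample sizes) is made up for illustration.

```python
import numpy as np

def normal_posterior(prior_mean, prior_sd, sample_mean, sample_sd, n):
    """Posterior (mean, sd) of a Normal mean under a Normal prior.

    Treats sample_sd as the known standard deviation of one observation.
    """
    prior_precision = 1.0 / prior_sd**2
    data_precision = n / sample_sd**2
    post_precision = prior_precision + data_precision
    post_mean = (
        prior_precision * prior_mean + data_precision * sample_mean
    ) / post_precision
    return post_mean, np.sqrt(1.0 / post_precision)

# Hypothetical prior on mean revenue per user, in dollars.
prior_mean, prior_sd = 10.0, 5.0

# Hypothetical summary statistics for each variant.
a_mean, a_sd = normal_posterior(prior_mean, prior_sd, sample_mean=10.2, sample_sd=25.0, n=5000)
b_mean, b_sd = normal_posterior(prior_mean, prior_sd, sample_mean=11.0, sample_sd=26.0, n=5000)

# Same decision quantity as the conversion case: P(B > A) from posterior draws.
rng = np.random.default_rng(0)
a_samples = rng.normal(a_mean, a_sd, size=100_000)
b_samples = rng.normal(b_mean, b_sd, size=100_000)
p_b_better = (b_samples > a_samples).mean()

print(f"Posterior mean A: {a_mean:.2f}, B: {b_mean:.2f}")
print(f"P(B > A): {p_b_better:.3f}")
```

Only the update step changed. Sampling the two posteriors and comparing them works exactly as it does for conversion rates.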

Credible interval vs confidence interval

This is the slide where senior interviewers separate the candidates who actually understand Bayesian thinking from the ones who memorized the Wikipedia page.

A frequentist 95 percent confidence interval has a hard-to-explain definition: if you ran the same experiment infinitely many times and built a CI each time, 95 percent of those intervals would contain the true parameter. A given CI either contains the truth or it does not. You cannot say there is a 95 percent probability the true value lies in this specific interval.

A Bayesian 95 percent credible interval is exactly what people think a confidence interval should be: given the data and the prior, there is a 95 percent probability the true value lies in this interval. That is the interpretation product managers, executives, and analysts naturally reach for. The frequentist framework explicitly does not deliver it, and the mismatch causes a lot of wrong decision-making across the industry.
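
Computed on the same data, the two intervals are often numerically close, which is part of why they get confused. The sketch below assumes variant B's counts from the earlier example, a Wald interval on the frequentist side, and a flat Beta(1, 1) prior on the Bayesian side. The final line is a statement only the posterior supports.

```python
import numpy as np
from scipy.stats import beta, norm

trials, conversions = 1000, 65
p_hat = conversions / trials

# Frequentist 95% Wald confidence interval for the conversion rate.
z = norm.ppf(0.975)
std_err = np.sqrt(p_hat * (1 - p_hat) / trials)
wald_ci = (p_hat - z * std_err, p_hat + z * std_err)

# Bayesian 95% equal-tailed credible interval under a flat Beta(1, 1) prior.
posterior = beta(1 + conversions, 1 + trials - conversions)
credible = (posterior.ppf(0.025), posterior.ppf(0.975))

# A direct probability statement about the parameter itself.
p_above_5pct = posterior.sf(0.05)

print(f"Wald CI:           [{wald_ci[0]:.4f}, {wald_ci[1]:.4f}]")
print(f"Credible interval: [{credible[0]:.4f}, {credible[1]:.4f}]")
print(f"P(rate > 5%):      {p_above_5pct:.3f}")
```

The endpoints barely differ at this sample size. The difference is in what you are allowed to say about them.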

Hybrid reporting is dangerous. Showing both a p-value and a credible interval in the same report invites stakeholders to cherry-pick whichever number tells the story they want. Pick a framework, document your decision rule, and stick to it.

Decision rules that actually scale

Bayesian outputs let you build decision rules that fit the actual business risk. The three patterns that show up in every mature experimentation platform are probability thresholds, expected loss, and credible-interval width.

Probability thresholds are the simplest pattern: ship B when P(B beats A) crosses 95 percent. This is intuitive and works for most binary-outcome experiments. The risk is that you ignore effect size. A variant that wins with 96 percent probability but with an expected lift of 0.01 percent is statistically a winner and practically irrelevant.

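Expected loss closes that gap. For each variant you compute, averaged over the posterior, how much conversion or revenue you would give up by shipping it in the scenarios where it is actually the worse option. If that loss is below a tolerance threshold you set ahead of time, you ship and move on. This rule is symmetric, which matters: the same calculation runs for A and for B, so it can tell you to ship, to roll back, or that the variants are close enough that the choice costs almost nothing. Because the tolerance is stated in business units, it also forces teams to say up front how small a difference is too small to matter.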

Credible-interval width is the patience rule. If your 95 percent credible interval on the lift is still wider than the smallest effect you care about, you do not have enough data. Keep running. This protects against the asymmetric stopping rule that wrecks naive Bayesian dashboards: "stop and ship if P(B > A) > 95 percent, otherwise keep going forever." That rule biases your estimates upward in the same way that frequentist peeking biases p-values downward.
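
All three rules can be read off the same posterior samples. The sketch below reuses the counts from the Beta-Binomial example with a flat prior. The thresholds `PROB_THRESHOLD`, `LOSS_TOLERANCE`, and `MIN_EFFECT` are placeholder values chosen for illustration, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 200_000

# Posterior draws under a flat Beta(1, 1) prior: A = 50/1000, B = 65/1000.
a_samples = rng.beta(1 + 50, 1 + 950, size=n_samples)
b_samples = rng.beta(1 + 65, 1 + 935, size=n_samples)
diff = b_samples - a_samples

# Rule 1: probability threshold.
p_b_better = (diff > 0).mean()

# Rule 2: expected loss, in absolute conversion-rate points.
# Loss of shipping B is how far B falls short of A, in the cases where it does.
loss_if_ship_b = np.maximum(-diff, 0).mean()
loss_if_keep_a = np.maximum(diff, 0).mean()

# Rule 3: width of the 95% credible interval on the absolute lift.
low, high = np.percentile(diff, [2.5, 97.5])
interval_width = high - low

# Hypothetical thresholds a team would fix before the experiment starts.
PROB_THRESHOLD = 0.95
LOSS_TOLERANCE = 0.0005
MIN_EFFECT = 0.01

print(f"P(B > A) = {p_b_better:.3f}, passes: {p_b_better > PROB_THRESHOLD}")
print(f"Expected loss if shipping B = {loss_if_ship_b:.5f}, passes: {loss_if_ship_b < LOSS_TOLERANCE}")
print(f"Expected loss if keeping A  = {loss_if_keep_a:.5f}")
print(f"Interval width = {interval_width:.4f}, tight enough: {interval_width < MIN_EFFECT}")
```

The rules can disagree on the same data, which is the argument for choosing one and writing it down before the experiment starts.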

Hierarchical models for segments

Most experiments are not single-metric, single-segment affairs. The growth team at DoorDash wants to know if the new checkout works across web, iOS, and Android. Running independent Beta-Binomial models per segment burns data fast: each segment has its own posterior, and small segments stay noisy forever.

A hierarchical Bayesian model pools information across segments through a shared prior. The model assumes each segment-level conversion rate is drawn from a population distribution, then learns both the population parameters and the per-segment parameters jointly. Segments with lots of data anchor the population estimate. Segments with little data borrow strength from the others and get tighter posteriors than they would in isolation.

PyMC, Stan via CmdStanPy, and Bambi make this practical. Twenty lines of model code gets you partial pooling across segments, so small segments stop returning useless answers and you can ship segment-aware decisions without waiting for each one to hit individual significance.
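
To see what partial pooling does without installing a sampler, here is a deliberately simplified stand-in. It is not a full hierarchical model: it centers a shared Beta prior on the pooled rate and fixes its strength by hand (`PRIOR_STRENGTH` is a hypothetical value), whereas PyMC or Stan would learn that strength from the data. The segment counts are made up.

```python
import numpy as np

# Hypothetical (trials, conversions) per segment for one variant.
segments = {"web": (12_000, 540), "ios": (6_000, 300), "android": (400, 30)}
names = list(segments)
trials = np.array([segments[n][0] for n in names], dtype=float)
conversions = np.array([segments[n][1] for n in names], dtype=float)

# No pooling: each segment stands alone.
raw_rate = conversions / trials

# Complete pooling: one rate for everyone.
pooled_rate = conversions.sum() / trials.sum()

# Partial pooling: a shared Beta prior centered on the pooled rate.
# PRIOR_STRENGTH is the pseudo-observation count, fixed by hand here;
# a real hierarchical model would estimate it jointly with the rates.
PRIOR_STRENGTH = 500.0
prior_alpha = pooled_rate * PRIOR_STRENGTH
prior_beta = (1 - pooled_rate) * PRIOR_STRENGTH

# Posterior mean per segment under the shared prior.
shrunk_rate = (prior_alpha + conversions) / (PRIOR_STRENGTH + trials)

for name, raw, shrunk in zip(names, raw_rate, shrunk_rate):
    print(f"{name:8s} raw={raw:.4f} shrunk={shrunk:.4f} pooled={pooled_rate:.4f}")
```

The large segment barely moves, while the small one is pulled noticeably toward the pooled rate. That is the borrowing-strength behavior described above.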

Common pitfalls

The most expensive Bayesian mistake is using a prior that is too strong. When the posterior at the end of the experiment looks nearly identical to the prior, the data did not move your beliefs. That means your prior is dominating. The fix is to either widen the prior or run a sensitivity analysis where you re-run the decision with a noninformative prior and confirm you would have shipped either way. If the prior changes the call, you have a problem.
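
A sensitivity check along those lines can be a few lines of code. The sketch below assumes the counts from the earlier example and a 95 percent ship threshold, and compares a flat prior against the Beta(40, 960) informative prior. The function name `prob_b_beats_a` is illustrative.

```python
import numpy as np

def prob_b_beats_a(prior_a, prior_b, data_a, data_b, n_samples=200_000, seed=0):
    """Monte Carlo estimate of P(B > A) with a shared Beta(prior_a, prior_b) prior."""
    rng = np.random.default_rng(seed)
    (trials_a, conv_a), (trials_b, conv_b) = data_a, data_b
    a = rng.beta(prior_a + conv_a, prior_b + trials_a - conv_a, size=n_samples)
    b = rng.beta(prior_a + conv_b, prior_b + trials_b - conv_b, size=n_samples)
    return (b > a).mean()

# (trials, conversions) per variant.
data_a, data_b = (1000, 50), (1000, 65)
SHIP_THRESHOLD = 0.95  # assumed decision rule

p_flat = prob_b_beats_a(1, 1, data_a, data_b)
p_informative = prob_b_beats_a(40, 960, data_a, data_b)

ship_flat = p_flat > SHIP_THRESHOLD
ship_informative = p_informative > SHIP_THRESHOLD
prior_changes_call = ship_flat != ship_informative

print(f"Flat prior:        P(B > A) = {p_flat:.3f}, ship = {ship_flat}")
print(f"Informative prior: P(B > A) = {p_informative:.3f}, ship = {ship_informative}")
print(f"Prior changes the call: {prior_changes_call}")
```

With 1,000 observations per arm, a prior worth 1,000 pseudo-observations pulls both arms toward 4 percent and lowers P(B > A). Whether that flips the decision depends on the threshold, and that is the thing to check.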

A second trap is silent priors. When you share results in a Slack thread, the prior is part of the analysis, not a hidden assumption. Always write down the prior, the family, and the hyperparameters so the next analyst can reproduce the result. Teams that skip this step end up with credibility problems the first time a result fails to replicate at scale.

The third pitfall is asymmetric stopping rules. Stopping the experiment as soon as P(B > A) crosses 95 percent but continuing forever otherwise looks reasonable until you simulate it. The expected lift you ship will be biased upward because you preferentially stop on lucky samples. Bayesian methods do not magically immunize you against this. You need a symmetric rule, usually expected loss, that lets you stop early to ship or stop early to roll back.
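
Simulating it takes a few lines. The sketch below runs A/A experiments (both arms share the same true rate, so the true lift is zero), checks P(B > A) daily, and stops the first time it crosses 95 percent. To keep it fast it uses a Normal approximation to the Beta posteriors, and the traffic numbers are made up.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
n_sims, n_days, users_per_day = 2000, 20, 500
true_rate = 0.05  # identical for A and B, so the true lift is zero

# Cumulative conversions per simulated experiment and day.
conv_a = rng.binomial(users_per_day, true_rate, size=(n_sims, n_days)).cumsum(axis=1)
conv_b = rng.binomial(users_per_day, true_rate, size=(n_sims, n_days)).cumsum(axis=1)
trials = users_per_day * np.arange(1, n_days + 1)

def posterior_moments(conv, trials):
    """Mean and variance of the Beta posterior under a flat Beta(1, 1) prior."""
    a = 1.0 + conv
    b = 1.0 + trials - conv
    total = a + b
    return a / total, a * b / (total**2 * (total + 1.0))

mean_a, var_a = posterior_moments(conv_a, trials)
mean_b, var_b = posterior_moments(conv_b, trials)

# Normal approximation to P(B > A) at every daily look.
p_b_better = norm.cdf((mean_b - mean_a) / np.sqrt(var_a + var_b))

# Asymmetric rule: stop and ship the first day P(B > A) > 0.95.
crossed = p_b_better > 0.95
stopped = crossed.any(axis=1)
first_day = crossed.argmax(axis=1)
lift_at_stop = (mean_b - mean_a)[np.arange(n_sims), first_day][stopped]

early_stop_rate = stopped.mean()
fixed_horizon_rate = crossed[:, -1].mean()

print(f"Shipped under daily peeking: {early_stop_rate:.1%}")
print(f"Would ship at the final look only: {fixed_horizon_rate:.1%}")
print(f"Mean estimated lift among shipped: {lift_at_stop.mean():.4f} (true lift is 0)")
```

Every shipped experiment in this simulation reports a positive lift even though none exists, and far more experiments ship than would under a single look at the end.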

The fourth pitfall is mixing p-values and posteriors in the same dashboard. Stakeholders will read whichever number supports the conclusion they already had. Pick one framework, document the decision rule, and remove the other from reports.

If you want to drill experimentation and stats questions like this every day, NAILDD ships SQL and analytics interview problems built around the exact patterns senior teams ask about.

FAQ

Will Bayesian A/B testing replace frequentist methods entirely?

Probably in consumer tech, where teams already lean on probability-of-best and expected-loss framings. Regulated industries like pharma and finance will keep frequentist methods because the regulatory frameworks were written around p-values and pre-registered sample sizes. The realistic answer is coexistence: Bayesian for product experimentation, frequentist for anything where a regulator needs a hypothesis test on file.

What prior should I use for conversion rate when I have no history?

Beta(1, 1) is the safe default, equivalent to a uniform prior on the conversion rate. It is uninformative and lets the data do the work. If you have rough domain knowledge, a weakly informative prior like Beta(2, 50) for typical consumer conversion rates is fine. Avoid Beta(0.5, 0.5), the Jeffreys prior for a binomial rate, unless you understand how it behaves near the boundaries, since it pushes mass toward 0 and 1 in ways that can surprise you on small samples.

Do I need a data scientist to run Bayesian A/B tests?

No. Beta-Binomial and Normal-Normal models are closed-form and can be implemented in a Python notebook or SQL pipeline by any analyst comfortable with basic stats. You need a data scientist for hierarchical models, custom likelihoods, or anything that requires MCMC. Most product experimentation ships with closed-form models for the first year, then migrates to richer models once the team is comfortable.

How does Bayesian A/B testing handle peeking?

Bayesian methods do not require fixed sample sizes, so checking the posterior daily is fine as long as your decision rule is symmetric and stated up front. Expected-loss rules are the most robust because they let you stop early to ship or to roll back. Pure probability-threshold rules with no rollback condition still produce upward-biased estimates, so the framework alone is not a free lunch. Pair it with a sensible stopping rule and the peeking problem largely disappears.

Which tools and platforms support Bayesian A/B testing out of the box?

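VWO SmartStats and Dynamic Yield ship Bayesian engines, and Statsig offers a Bayesian analysis option. Optimizely's Stats Engine is built on sequential frequentist testing rather than a Bayesian model, so check each vendor's documentation before assuming which framework a dashboard number comes from. Most in-house platforms at Booking, Airbnb, and Netflix have moved toward Bayesian primitives over the last decade. For homegrown stacks, PyMC and Bambi cover almost everything you need, and closed-form code in SQL handles the rest. Pick the lightest tool that lets your team express the decision rule clearly.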