Bayes theorem explained simply

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Why Bayes matters

Picture a Monday standup at Stripe. A risk PM drops a screenshot of a fraud signal that fires on 0.5 percent of transactions. The vendor claims it is 95 percent accurate. The PM asks whether the team should auto-decline every transaction that trips it. Without Bayes you will quietly nod and ship a rule that blocks tens of thousands of legitimate customers per week, because the signal looks impressive in isolation and falls apart the moment you condition on how rare actual fraud is.

Bayes theorem lets you flip the question from "given fraud, how loud is the signal" into "given a loud signal, how likely is fraud". That flip is the job of a senior analyst when a stakeholder hands you a noisy detector. It also powers naive Bayes spam filters, the priors layer in Bayesian A/B testing, and the back-of-the-envelope risk questions asked in interview loops at Meta, Amazon, or DoorDash.

For a mid-level or senior analyst, fluency with Bayes is what separates "I can run a t-test" from "I can defend a launch decision when the metric is rare and the detector is correlated with the outcome but not equal to it".

The formula in plain English

Bayes theorem updates the probability of a hypothesis after you see new evidence:

P(A | B) = P(B | A) * P(A) / P(B)

The four pieces have names you must know cold. The prior P(A) is your belief about A before the data lands. The likelihood P(B | A) is how likely the evidence is when A is true. The marginal P(B) is the total probability of seeing B across every hypothesis. The posterior P(A | B) is the updated belief about A after the evidence.

The trickiest piece is P(B), because in real problems you have to expand it with the law of total probability across both the hypothesis and its complement.

P(B) = P(B | A) * P(A) + P(B | not A) * P(not A)

Drilling this expansion is the difference between a candidate who passes the Bayes question and one who freezes. The expansion makes the role of base rates inescapable, which is the point.

The classic disease-test problem

Most probability interviews at top US tech companies include some version of this problem. A disease affects 1 percent of the population. A test has 99 percent sensitivity, meaning it returns positive 99 percent of the time when the patient is sick. The test has a 5 percent false-positive rate, meaning it returns positive 5 percent of the time when the patient is healthy. A random patient tests positive. What is the probability that the patient is actually sick?

Most candidates answer 99 percent because the test is "99 percent accurate". The correct answer is closer to 17 percent.

Let A be "sick" and B be "positive test". Translate the verbal facts into four quantities, then expand the marginal.

P(A) = 0.01
P(B | A) = 0.99
P(B | not A) = 0.05

P(B) = 0.99 * 0.01 + 0.05 * (1 - 0.01)
     = 0.0099 + 0.0495 = 0.0594

P(A | B) = (0.99 * 0.01) / 0.0594 ~= 0.167 = 16.7 percent

About 17 percent, not 99 percent. The disease is rare and healthy people outnumber sick people 99 to 1. Even with a low false-positive rate, the sheer count of healthy people generates more positives than the sick population does. This is the base-rate fallacy, the most common probability mistake in product analytics, fraud reviews, and ML model launches. The intuition: a low prior is hard to overturn with a single noisy observation.
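The same result can be checked by counting people instead of multiplying probabilities. The sketch below assumes a hypothetical population of 10,000 (any size gives the same ratio); the function name and its parameters are illustrative and not part of the original problem statement.

```python
def positive_test_counts(population, prevalence, sensitivity, false_positive_rate):
    """Return (true positives, false positives, share of positives who are sick)."""
    sick = population * prevalence
    healthy = population - sick
    true_positives = sick * sensitivity              # sick people who test positive
    false_positives = healthy * false_positive_rate  # healthy people who test positive
    ppv = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, ppv


# Hypothetical population of 10,000 with the numbers from the problem above.
tp, fp, ppv = positive_test_counts(10_000, 0.01, 0.99, 0.05)
print(f"{tp:.0f} true positives, {fp:.0f} false positives, posterior = {ppv:.3f}")
# 99 true positives, 495 false positives, posterior = 0.167
```

Roughly five healthy positives turn up for every sick one, which is why the posterior lands near one in six.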

Bayes in fraud and spam

Same trick, fraud edition. Suppose the base rate of fraud at a payments company like Stripe is 0.1 percent. A model flags transactions as "suspicious": 90 percent of actual fraud is flagged, and 5 percent of legitimate transactions are also flagged. A flag comes in. What is the probability of fraud?

P(fraud) = 0.001
P(flag | fraud) = 0.90
P(flag | not fraud) = 0.05

P(flag) = 0.90 * 0.001 + 0.05 * 0.999 = 0.05085
P(fraud | flag) = (0.90 * 0.001) / 0.05085 ~= 0.0177 = 1.77 percent

Even with a model that catches 90 percent of fraud, a flagged transaction is fraudulent only about 1.8 percent of the time. Auto-declining on this signal alone is a customer-experience catastrophe. The right product decision is to route into step-up authentication, not a hard block, because the base rate is too low for any single noisy detector to act on.
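One way to make the decision concrete is to ask how low the false-positive rate would have to fall before a flag means fraud more often than not. The sketch below rearranges Bayes theorem for that rate; the 50 percent target precision is an illustrative assumption, and both function names are hypothetical.

```python
def precision(prior, sensitivity, false_positive_rate):
    """P(fraud | flag) from the base rate and the two detector rates."""
    flagged_fraud = sensitivity * prior
    flagged_legit = false_positive_rate * (1 - prior)
    return flagged_fraud / (flagged_fraud + flagged_legit)


def max_false_positive_rate(prior, sensitivity, target_precision):
    """Largest false-positive rate that still reaches target_precision.

    Solves target = s*p / (s*p + f*(1 - p)) for f.
    """
    return sensitivity * prior * (1 - target_precision) / (target_precision * (1 - prior))


# Fraud example from above: 0.1 percent base rate, 90 percent of fraud flagged.
fpr = max_false_positive_rate(prior=0.001, sensitivity=0.90, target_precision=0.50)
print(f"false-positive rate must be at most {fpr:.4%}")   # roughly 0.09 percent
print(round(precision(0.001, 0.90, fpr), 3))              # 0.5
```

Under these assumptions the detector would need a false-positive rate about 55 times lower than the stated 5 percent before a hard block is even a coin flip.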

The spam-filter version is almost identical. 10 percent of incoming emails are spam. The word "free" appears in 70 percent of spam and 5 percent of legitimate messages.

P(spam | free) = (0.70 * 0.10) / (0.70 * 0.10 + 0.05 * 0.90)
              = 0.07 / 0.115 ~= 0.609 = 60.9 percent

About 61 percent. The base rate of spam is higher than fraud, so a single moderately predictive token already shifts the posterior past 50 percent. The same formula yields wildly different actionable probabilities depending on the prior. This is why naive Bayes spam filters are often trained per user: the spam base rate and the frequency of "free" in your inbox are not your neighbor's.

Prior, likelihood, posterior in A/B testing

Bayesian A/B testing is the most common production setting where these three quantities appear by name. Instead of computing a p-value against a frequentist null, you specify a prior over the conversion rate, observe successes and failures, and update to a posterior from which you read off the probability that variant B beats variant A directly.

A typical setup uses a Beta prior, because Beta is conjugate to the Bernoulli likelihood. If your prior is Beta(alpha, beta) and you observe s successes and f failures, the posterior is Beta(alpha + s, beta + f). No integral or Monte Carlo for that step. P(B > A) is then a two-dimensional integral over the joint posterior, usually computed by sampling.

posterior_A = Beta(alpha_A + s_A, beta_A + f_A)
posterior_B = Beta(alpha_B + s_B, beta_B + f_B)
P(B > A) = P(theta_B > theta_A), where theta_A ~ posterior_A and theta_B ~ posterior_B

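The interview answer that wins points: the Bayesian framework gives you a direct probability that B beats A under your model and prior, which the frequentist p-value never does. Peeking is less damaging than under a fixed-horizon frequentist test, because the posterior is a valid summary of the data seen so far, but stopping the moment P(B > A) crosses a threshold can still inflate false launches. The trade-off is that your prior is now a load-bearing modeling choice that has to be defensible.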
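The sampling step can be sketched in a few lines of NumPy. This is a minimal illustration, not a production readout: the uniform Beta(1, 1) prior, the conversion counts, the number of draws, and the function name are all assumptions chosen for the example.

```python
import numpy as np


def prob_b_beats_a(s_a, f_a, s_b, f_b, alpha=1.0, beta=1.0, draws=200_000, seed=0):
    """Monte Carlo estimate of P(theta_B > theta_A) under independent Beta posteriors."""
    rng = np.random.default_rng(seed)
    # Conjugate update: Beta(alpha, beta) prior plus s successes and f failures.
    theta_a = rng.beta(alpha + s_a, beta + f_a, size=draws)
    theta_b = rng.beta(alpha + s_b, beta + f_b, size=draws)
    # Share of joint draws in which B's conversion rate exceeds A's.
    return float(np.mean(theta_b > theta_a))


# Hypothetical test: A converts 100 of 1,000 visitors, B converts 130 of 1,000.
print(prob_b_beats_a(s_a=100, f_a=900, s_b=130, f_b=870))
```

With these made-up counts the estimate comes out close to 1, and with identical data for both arms it hovers around 0.5, which is a quick sanity check on the sampler.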


Naive Bayes classifier

When evidence is a feature vector rather than a single observation, the classifier extends naturally. The "naive" assumption is that features are conditionally independent given the class, which lets you factor the likelihood into a product.

P(class | x1, ..., xn) is proportional to P(class) * product over i of P(xi | class)

The classifier picks the class with the highest posterior. The independence assumption is almost always violated, but naive Bayes still classifies well because the decision rule cares about which class scores highest, not about calibrated probabilities. It is the canonical example of a model that is wrong about the joint distribution and right about the boundary.

import math

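def naive_bayes_predict(prior, likelihoods):
    """Return the class with the highest log-posterior score.

    prior: dict of class -> P(class)
    likelihoods: dict of class -> list of P(x_i | class), all strictly positive
    """
    scores = {}
    for cls, p_cls in prior.items():
        # Sum logs instead of multiplying probabilities to avoid underflow.
        log_score = math.log(p_cls)
        for p_x_cls in likelihoods[cls]:
            log_score += math.log(p_x_cls)
        scores[cls] = log_score
    return max(scores, key=scores.get)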

The log-space trick avoids underflow when multiplying many small probabilities, and it is the version every senior candidate is expected to write on a whiteboard.
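When the probabilities themselves are needed rather than just the winning class, the log scores can be normalized. The sketch below is one way to do that, assuming strictly positive inputs; the function name is hypothetical, and the inputs reuse the "free" token numbers from the spam example.

```python
import math


def naive_bayes_posteriors(prior, likelihoods):
    """Return normalized P(class | features) computed in log space."""
    log_scores = {
        cls: math.log(p_cls) + sum(math.log(p) for p in likelihoods[cls])
        for cls, p_cls in prior.items()
    }
    # Subtract the largest log score before exponentiating to keep exp() stable.
    top = max(log_scores.values())
    unnormalized = {cls: math.exp(score - top) for cls, score in log_scores.items()}
    total = sum(unnormalized.values())
    return {cls: value / total for cls, value in unnormalized.items()}


prior = {"spam": 0.10, "ham": 0.90}
likelihoods = {"spam": [0.70], "ham": [0.05]}  # P("free" | class)
posteriors = naive_bayes_posteriors(prior, likelihoods)
print(round(posteriors["spam"], 4))  # 0.6087
```

With a single feature this reproduces the 60.9 percent from the spam calculation, which makes it a convenient unit test before adding more features.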

Bayes in Python

For the disease-test problem and any single-evidence flip, a three-line function is enough.

def bayes(prior_a, p_b_given_a, p_b_given_not_a):
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / p_b


print(bayes(0.01, 0.99, 0.05))   # 0.1667  -> disease test
print(bayes(0.001, 0.90, 0.05))  # 0.0177  -> fraud
print(bayes(0.10, 0.70, 0.05))   # 0.6087  -> spam

The same function covers all three worked examples. For a vectorized version over a Pandas Series, swap scalars for NumPy arrays.

import numpy as np

def bayes_vectorized(prior_a, p_b_given_a, p_b_given_not_a):
    prior_a = np.asarray(prior_a, dtype=float)
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / p_b


priors = np.array([0.001, 0.01, 0.10, 0.50])
print(bayes_vectorized(priors, 0.90, 0.05))
# [0.01769912 0.15384615 0.66666667 0.94736842]

As the prior climbs from one in a thousand to one in two, the posterior moves from 1.8 percent to 95 percent on the same evidence. Showing this sweep in an onsite case is the kind of move that makes interviewers nod.

Common pitfalls

The biggest trap is ignoring the prior. Candidates plug the false-positive rate and the sensitivity into a formula, forget to multiply by P(A), and confidently announce a number that is really the likelihood P(B | A) rather than the posterior P(A | B). The fix is muscle memory: every Bayes calculation starts with writing P(A), P(B | A), P(B | not A), and P(not A) on a fresh line before anything else.

A second trap is confusing P(A | B) with P(B | A). A test that is "99 percent accurate" usually means P(positive | sick) = 0.99, which is the sensitivity, not the predictive value of a positive test. The two quantities are linked by the factor P(A) / P(B), so they can be far apart when the base rate is low. When the interviewer says "the test is 99 percent accurate", your follow-up is "do you mean sensitivity, specificity, or positive predictive value", and that question alone signals seniority.

A third pitfall is the independence assumption inside naive Bayes. Strongly correlated features pull the posterior in the same direction multiple times, overcounting evidence and producing overconfident probabilities. The decision rule is often still correct, but the calibration is not, so anyone using the raw probabilities downstream needs to recalibrate. Platt scaling and isotonic regression are the standard fixes when calibration matters.
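The overcounting is easy to see with a toy case. The sketch below feeds the "free" token from the spam example into the naive Bayes product one, two, and three times, as if perfectly correlated copies of the same feature were treated as independent evidence. The function name and the idea of exact duplicates are illustrative assumptions.

```python
def spam_posterior(prior_spam, p_token_spam, p_token_ham, copies):
    """P(spam | token) when the same token is counted `copies` times as if independent."""
    spam_score = prior_spam * p_token_spam ** copies
    ham_score = (1 - prior_spam) * p_token_ham ** copies
    return spam_score / (spam_score + ham_score)


# Same evidence, counted once, twice, and three times.
for copies in (1, 2, 3):
    print(copies, round(spam_posterior(0.10, 0.70, 0.05, copies), 4))
# 1 0.6087
# 2 0.9561
# 3 0.9967
```

The predicted class never changes, but the reported confidence climbs from 61 percent to over 99 percent without any new information.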

A fourth trap is plugging in numbers without a sanity check on orders of magnitude. If the prior is one in a thousand and the posterior comes out at 0.83, treat it as a red flag: in odds form, posterior odds equal prior odds times the likelihood ratio P(B | A) / P(B | not A), so that jump requires a likelihood ratio of roughly 4,900. Unless the detector really is that strong, the formula or one of the inputs is wrong. Senior candidates always sketch the marginals and check that the four probabilities are consistent before reading the result aloud.
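A quick order-of-magnitude check is the odds form of Bayes theorem: posterior odds equal prior odds times the likelihood ratio P(B | A) / P(B | not A). The sketch below uses that identity both ways; the function names are hypothetical helpers, and the inputs reuse the fraud numbers from earlier.

```python
def posterior_from_odds(prior, likelihood_ratio):
    """Posterior probability via posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


def required_likelihood_ratio(prior, posterior):
    """Likelihood ratio needed to move `prior` to `posterior`."""
    return (posterior / (1 - posterior)) / (prior / (1 - prior))


# Fraud example: likelihood ratio 0.90 / 0.05 = 18 on a 0.1 percent prior.
print(round(posterior_from_odds(0.001, 0.90 / 0.05), 4))  # 0.0177
# Evidence strength needed to reach 0.83 from the same prior.
print(round(required_likelihood_ratio(0.001, 0.83)))      # 4877
```

If the detector's likelihood ratio is nowhere near the required value, the suspicious posterior points to a setup error rather than a discovery.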

A fifth pitfall is the "uninformative prior" cop-out. When the interviewer asks where the prior comes from and the candidate says "we use an uninformative prior", that is a dodge. The prior is a modeling choice defended with historical conversion data, similar past experiments, or domain expertise. Uninformative priors are fine when data overwhelms them, but during early experiments with low traffic the prior is doing a lot of the work.
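How much work the prior does can be shown with the Beta posterior mean, (alpha + s) / (alpha + beta + s + f). The sketch below compares a weak Beta(1, 1) prior with an informative Beta(20, 180) prior (centered on a 10 percent conversion rate) at two traffic levels. Every number here is a made-up illustration, and the function name is hypothetical.

```python
def beta_posterior_mean(alpha, beta, successes, failures):
    """Mean of the Beta(alpha + successes, beta + failures) posterior."""
    return (alpha + successes) / (alpha + beta + successes + failures)


priors = {"weak Beta(1, 1)": (1, 1), "informative Beta(20, 180)": (20, 180)}
traffic = {"low traffic": (3, 17), "high traffic": (300, 1700)}  # (successes, failures)

for label, (s, f) in traffic.items():
    for name, (a, b) in priors.items():
        print(label, name, round(beta_posterior_mean(a, b, s, f), 3))
# low traffic weak Beta(1, 1) 0.182
# low traffic informative Beta(20, 180) 0.105
# high traffic weak Beta(1, 1) 0.15
# high traffic informative Beta(20, 180) 0.145
```

With 20 observations the two priors disagree by almost eight percentage points; with 2,000 they are within half a point, which is the sense in which data overwhelms the prior.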

To drill probability and Bayes questions every day, NAILDD is launching with 500+ analytics problems across this pattern.

FAQ

Why learn Bayes when ML libraries already implement it?

Because the libraries make decisions you have to defend on a whiteboard. Picking a prior, calibrating posteriors, interpreting a Bayesian A/B readout, and explaining why a fraud model produces 1.8 percent precision at the operating point all require the underlying math. Senior interviews at Meta, Amazon, and Anthropic test the math directly, not whether you can call a scikit-learn fit method.

Where do priors come from in practice?

From historical data, similar past experiments, domain expertise translated into Beta parameters, or a weak prior when you want the data to dominate. The strongest interview answer names two or three sources and explains the trade-off. A prior anchored to last quarter's conversion rate is defensible until product changes invalidate it, at which point you widen it.

Is naive Bayes still relevant in 2026?

Yes. As a baseline classifier it is fast to train, fast to score, and hard to beat on small text problems. As an interview reference, it grounds the vocabulary of priors, likelihoods, and posteriors you reuse when discussing Bayesian deep learning, calibration, and decision theory.

Is Bayes always correct?

The math is always correct. The result is only as good as the inputs. If your prior is wrong by an order of magnitude or your likelihoods come from a biased sample, the posterior will inherit those errors. The discipline of writing down the four quantities forces the assumptions into the open, which is what makes Bayesian reasoning useful in interviews: every disagreement becomes a disagreement about a named input.

How does Bayes connect to Bayesian versus frequentist A/B?

Bayes turns a prior over conversion plus observed successes and failures into a posterior, from which you read P(B > A) directly. Frequentist A/B answers a different question: "in a world with no effect, how surprising is this data". Both are valid, but the Bayesian framing communicates better to product partners and softens the peeking problem, at the cost of defending the prior.

Fastest way to drill base-rate problems for interviews?

Build a flashcard set of three to five worked problems at different priors: 0.1 percent, 1 percent, 10 percent. Practice flipping P(B | A) into P(A | B) out loud in under sixty seconds. Once you can rattle off the four-quantity setup and the marginal expansion without notes, you have the part of the question that most candidates fumble.