P-value explained simply

What p-value actually is

A PM pings you on a Friday: "Variant B is up 1.2 points on conversion, p = 0.051. Can we ship?" Half your job is the answer. The other half is explaining what p = 0.051 actually means without falling into one of the wrong interpretations that show up on a Snowflake or Stripe interview loop the next week. Most analysts who fail those loops do not fail because they cannot compute a z-statistic. They fail because they say "there is a 5% probability the result is random" and the interviewer goes quiet.

P-value is the probability of seeing data at least as extreme as what you observed, assuming the null hypothesis is true. The null hypothesis (H0) is the boring claim — usually "no difference between control and treatment." P-value answers one question: how surprising are these numbers if the world is as flat as H0 says? A small p-value means the data are uncomfortable to explain under H0, so you have grounds to reject it. A large p-value means the data are fully compatible with the no-effect world. The phrase "probability that the effect is real" does not appear in the definition, and that is the most common reason candidates lose this interview question.

How to read it without lying to yourself

P-value is the probability of the data given the hypothesis. It is not the probability of the hypothesis given the data.

That one line is what every clean answer rotates around. P-value = 0.03 does not mean "there is a 3% probability that there is no effect." It means: if there is truly no effect, the probability of seeing a difference this large or larger by chance alone is 3%. These look similar in English. They are not the same in math. P(data | H0) is what you compute from a z-test, t-test, or chi-square. P(H0 | data) is what most stakeholders think you reported, and it requires Bayes' theorem plus a prior. P-value carries none of that prior. Walk into a Meta data-science loop with this confusion and the rubric catches it instantly.
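
To see how far apart the two quantities can sit, here is a minimal sketch of the Bayes step. It assumes a test run at alpha = 0.05 with 80% power, and the prior values are purely hypothetical. Note that it computes P(H0 | the test came back significant), which is a simplification of conditioning on the exact p-value. The function name is illustrative, not from any library.

```python
def prob_null_given_significant(prior_null, alpha=0.05, power=0.80):
    """P(H0 | test came back significant), via Bayes' theorem."""
    # P(significant and H0): the false-positive rate times the prior on H0.
    false_pos = alpha * prior_null
    # P(significant and H1): the power times the prior on a real effect.
    true_pos = power * (1.0 - prior_null)
    return false_pos / (false_pos + true_pos)


# Hypothetical priors: how often ideas like this one turn out to do nothing.
for prior in (0.5, 0.8, 0.95):
    post = prob_null_given_significant(prior)
    print(f"prior P(H0) = {prior:.2f} -> P(H0 | significant) = {post:.3f}")
```

The same significant result maps to very different posterior probabilities depending on the prior, which is why a p-value alone cannot be read as P(H0 | data).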

What p-value does tell you: how compatible the observed data are with H0, as a tail probability. What it does not tell you: the probability that H0 is true or false, the size of the effect (a tiny p-value can come from a huge sample as easily as from a large effect), the practical importance, or how likely the result is to replicate.

Worked example: conversion A/B test

You launched an A/B test. Control: 5,000 users, 500 conversions (10.0%). Treatment: 5,000 users, 560 conversions (11.2%). Observed lift: +1.2 pp. Real signal or random noise?

H0 says conversion is identical in both groups. Pooled proportion under H0 is p_pool = 1060 / 10,000 = 0.106. Standard error and z-statistic:

SE = sqrt(0.106 * 0.894 * (1/5000 + 1/5000)) = 0.00616
z  = (0.112 - 0.100) / 0.00616 = 1.95

Two-sided p-value = 2 * P(Z > 1.95) = 0.051. At alpha = 0.05, the result is formally not significant: a +1.2 pp gap is consistent with chance. But the p-value sits right at the boundary, and if the observed lift held, a modest sample-size increase would push it under 0.05. That is the conversation to have with the PM, not a yes-or-no shipping decision.
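
A rough way to see how sensitive the verdict is to sample size: hold the observed rates fixed and vary n. This is a sketch under a strong assumption, namely that the 10.0% and 11.2% rates would stay exactly the same at other sample sizes, which real data will not do. The helper name `two_sided_p` is made up for the example.

```python
import math


def two_sided_p(x1, n1, x2, n2):
    """Two-sided p-value for a pooled two-proportion z-test."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    # For a standard normal, 2 * P(Z > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical: the same 10.0% and 11.2% rates at other sample sizes.
for n in (4000, 5000, 6000, 8000):
    p = two_sided_p(round(0.100 * n), n, round(0.112 * n), n)
    print(f"n per arm = {n}: p = {p:.4f}")
```

The effect size never changes in this loop. Only n does, and the p-value moves across the 0.05 line with it.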

Alpha and the 0.05 convention

Alpha is the threshold below which you agree to reject H0. The standard value is 0.05, a 5% acceptable false-positive rate. Alpha must be chosen before the test, not after you see the p-value. The framework breaks the moment you change the threshold to match the result you wanted. The 0.05 cutoff is a convention. Fisher proposed it in the 1920s and it stuck. Product teams at Airbnb, DoorDash, and Uber default to 0.05 because it balances false positives against the time cost of larger experiments. Some medical trials tighten to 0.01. Particle physics demands 5 sigma (a one-sided p of about 3e-7). The right threshold depends on what each false positive costs.

P-value and confidence interval

P-value and the confidence interval are two views of the same test. The decision rule lines up: if the 95% CI for the difference does not contain 0, then p < 0.05, and vice versa (for two proportions the match is approximate right at the boundary, because the z-test pools the variance and the usual interval does not). But the confidence interval is more informative to read. A significance verdict is binary. The CI gives the range of plausible effect sizes with their uncertainty. In the worked example, the 95% CI is roughly [-0.01 pp, +2.41 pp]. The interval contains 0, lining up with p = 0.051. But the upper bound of +2.41 pp says the true lift could plausibly be over two points, a sizeable product win. That nuance is invisible in a bare "not significant" report. Senior analysts at Stripe and Notion default to reporting both and lead with the interval.
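
The interval for the worked example can be reproduced with a few lines. This is a minimal sketch of the Wald interval with an unpooled standard error, which is one common choice and not the only one. The function name `diff_ci` is illustrative.

```python
import math
from statistics import NormalDist


def diff_ci(x1, n1, x2, n2, level=0.95):
    """Wald confidence interval for p2 - p1 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    diff = p2 - p1
    return diff - z_crit * se, diff + z_crit * se


low, high = diff_ci(500, 5000, 560, 5000)
print(f"95% CI for the lift: [{low * 100:+.2f} pp, {high * 100:+.2f} pp]")
```

Because the z-test pools the variance and this interval does not, the two can disagree by a hair in borderline cases like this one.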

Type I and Type II errors

A Type I error (false positive) is rejecting H0 when no effect exists. Its probability is alpha. At alpha = 0.05 you accept that 5% of pure-noise experiments look "significant." The cost is shipped features that do nothing. A Type II error (false negative) is failing to reject H0 when an effect is real. Its probability is beta, and power = 1 - beta is the probability of detecting a true effect. Industry default for power is 80%. The cost is winning ideas killed by underpowered tests.

Alpha and beta trade off at fixed sample size. Tighten one and you raise the other. The only way to reduce both is to recruit more users. A favorite Apple and Anthropic interview question: "Which is worse, Type I or Type II?" The honest answer is "it depends on what gets shipped." For a checkout button copy change, a false positive costs a small revenue dip. For a model gating loan approvals, a false positive can mean real harm at scale. The cost of being wrong sets the threshold.
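
The trade-off between alpha, power, and sample size can be made concrete with the standard normal-approximation formula for two proportions. Treat this as a sketch: it assumes the true rates are exactly 10.0% and 11.2% (the observed values from the worked example, used here only for illustration), equal arm sizes, and a two-sided test. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate users per arm needed to detect p1 vs p2 (two-sided z-test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


print("alpha=0.05, power=0.80:", n_per_arm(0.100, 0.112))
print("alpha=0.01, power=0.80:", n_per_arm(0.100, 0.112, alpha=0.01))
print("alpha=0.05, power=0.90:", n_per_arm(0.100, 0.112, power=0.90))
```

Tightening alpha or raising power both push the required sample up. Under these assumptions the default setting asks for on the order of ten thousand users per arm, about double the 5,000 in the worked example.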

Multiple comparisons

If you test 20 hypotheses at alpha = 0.05 and none of the effects are real, the expected number of false positives is 1. The probability of at least one false positive across 20 independent tests:

P(at least 1 error) = 1 - (1 - 0.05)^20 = 0.642

That is a 64% chance of finding at least one "significant" result when no real effects exist. This is the multiple-comparisons problem and the single biggest source of bad shipping decisions in fast-moving product teams. Bonferroni correction is the simplest fix: divide alpha by the number of tests, so alpha_corrected = 0.05 / 20 = 0.0025. It is conservative and cuts power, but it reliably caps the family-wise error. False Discovery Rate (FDR) control, via Benjamini-Hochberg, limits the expected fraction of false positives among rejected hypotheses rather than the family-wise error. It is less conservative, preserves more power, and is the practical default when screening many metrics. Experimentation platforms at Linear, Vercel, and Figma default to some FDR variant on the secondary-metric panel.
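
The family-wise arithmetic is easy to tabulate. A small sketch, assuming independent tests and that every null hypothesis is true. The helper name is made up for the example.

```python
def familywise_error(m, alpha=0.05):
    """P(at least one false positive) across m independent true-null tests."""
    return 1 - (1 - alpha) ** m


for m in (1, 5, 10, 20):
    print(
        f"{m:>2} tests: P(at least one false positive) = {familywise_error(m):.3f}, "
        f"Bonferroni per-test alpha = {0.05 / m:.4f}"
    )
```

The inflation is already substantial at five or ten metrics, which is the size of a typical secondary-metric panel.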

Python: computing p-value

Z-test for two proportions, the standard A/B test setup:

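import numpy as np
from scipy import stats

# Users and conversions per arm (control, then treatment)
n1, x1 = 5000, 500
n2, x2 = 5000, 560
p1, p2 = x1 / n1, x2 / n2

# Pooled rate and standard error under H0 (no difference)
p_pool = (x1 + x2) / (n1 + n2)
se = np.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))

z_stat = (p2 - p1) / se
# Two-sided p-value from the upper tail of the standard normal
p_value = 2 * stats.norm.sf(abs(z_stat))

print(f"z = {z_stat:.3f}, p-value = {p_value:.4f}")
# z = 1.949, p-value = 0.0513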

T-test for two means (continuous metric, e.g. AOV):

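from scipy import stats

control = [450, 520, 380, 490, 510, 470, 430, 460, 500, 440]
test    = [510, 540, 470, 530, 560, 490, 520, 550, 480, 510]

# Student's t-test (equal variances assumed by default);
# pass equal_var=False for Welch's test when the variances differ
stat, p_value = stats.ttest_ind(test, control)
print(f"t = {stat:.3f}, p-value = {p_value:.4f}")
# t = 3.104, p-value of about 0.006 (two-sided, 18 degrees of freedom)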

Bonferroni correction across multiple metrics:

from statsmodels.stats.multitest import multipletests

p_values = [0.03, 0.12, 0.04, 0.001, 0.08]
rejected, corrected, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')
print("Corrected:", [round(p, 4) for p in corrected])
print("Reject H0:", rejected.tolist())
# Corrected: [0.15, 0.6, 0.2, 0.005, 0.4]
# Reject H0: [False, False, False, True, False]

After Bonferroni, only one of the five candidates survives (p = 0.001, corrected 0.005). The raw p = 0.03 and p = 0.04 results no longer clear the bar. That does not prove they are false positives, but screening five metrics at alpha = 0.05 gives roughly a 23% chance of at least one spurious hit, so treat them as exploratory.
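
For comparison, here is a hand-rolled sketch of the Benjamini-Hochberg step-up rule in plain Python. It is meant to show the mechanics, not to replace a library implementation (statsmodels exposes the same procedure through `multipletests` with `method='fdr_bh'`). The second list of p-values is hypothetical, chosen to show a case where the two corrections disagree.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a reject flag per p-value using the BH step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    reject = [False] * m
    for idx in order[:cutoff_rank]:
        reject[idx] = True
    return reject


# Same p-values as the Bonferroni example above.
print("BH:        ", benjamini_hochberg([0.03, 0.12, 0.04, 0.001, 0.08]))

# Hypothetical panel where BH keeps more findings than Bonferroni.
hypothetical = [0.001, 0.008, 0.012, 0.03, 0.2]
print("BH:        ", benjamini_hochberg(hypothetical))
print("Bonferroni:", [p <= 0.05 / len(hypothetical) for p in hypothetical])
```

On the original five p-values both methods keep only p = 0.001. On the hypothetical panel BH rejects four hypotheses where Bonferroni rejects two, which is the extra power the FDR approach buys.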

Common pitfalls

The most common pitfall is reading p-value as the probability of the hypothesis. P-value = 0.02 does not mean "2% chance there is no effect." It is the probability of the data conditional on H0, not the other way around. This misreading is why analyst interviews at Google and Microsoft return so often to the basic definition — the wrong intuition is sticky and interviewers want to hear you push back on it explicitly.

The second pitfall is conflating statistical significance with practical significance. A p-value of 0.001 on a conversion gap of 0.01% is statistically airtight and practically useless. Large samples make tiny effects "significant" because the standard error shrinks with n. Always read the effect size and confidence interval next to the p-value, and ask whether the effect is large enough to justify shipping and maintaining the change.

The third pitfall is peeking. Checking the p-value every day and stopping the test the moment p < 0.05 is a reliable way to generate false positives. Each look at the data inflates the effective alpha: two weeks of daily checks at nominal alpha = 0.05 can push the real false-positive rate to 20% or higher. If you need early stopping, use sequential testing or group-sequential designs.

The fourth pitfall is p-hacking — slicing metrics, segments, and time windows until something crosses the threshold. If you scan twenty segments and find one significant result, that is multiple comparisons in disguise, not a discovery. Fix the hypothesis, the metric, and the analysis plan before the experiment starts, and report any post-hoc slicing as exploratory.

The fifth pitfall is "p > 0.05, so there is no effect." Failure to reject is not proof of zero effect — it can just mean the test was underpowered. "Not proven" and "proven absent" are different statements. Always quote the confidence interval so readers can see what effect sizes the data rule out. And do not compare p-values across experiments: p = 0.04 in one test and p = 0.06 in another does not mean the effect exists in one and is absent in the other.
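
The peeking pitfall is easy to check by simulation. The sketch below runs A/A tests (no true effect) and stops at the first daily look where |z| > 1.96. It rests on simplifying assumptions: equal traffic every day and a normal approximation, so each day contributes one standard-normal increment to the running statistic. The function name and parameters are illustrative.

```python
import math
import random


def peeking_false_positive_rate(looks=14, sims=20_000, z_crit=1.96, seed=0):
    """Share of no-effect experiments declared significant at any daily look."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        running_sum = 0.0
        for day in range(1, looks + 1):
            # Each day adds one unit of information under H0.
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / math.sqrt(day)
            if abs(z) > z_crit:
                hits += 1  # stopped early on a spurious "win"
                break
    return hits / sims


print("single look at the end:", peeking_false_positive_rate(looks=1))
print("14 daily looks:        ", peeking_false_positive_rate(looks=14))
```

With a single look the rate stays near the nominal 5%. With fourteen looks it comes out several times higher, even though nothing in the simulated data is real.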

Interview questions

The test shows p = 0.001 on a 0.05% conversion lift. Ship?

Not automatically. Statistical significance is not practical significance. A 0.05% lift at high traffic can easily clear p < 0.001, but the effect may be too small to justify the cost of building, shipping, and maintaining the change. Look at the effect size and confidence interval to see whether the plausible win covers your costs.

You checked 10 metrics, one came back at p = 0.04. Verdict?

At 10 metrics and alpha = 0.05, the probability of at least one false positive is around 40%. A single p = 0.04 out of ten is more plausibly chance than discovery. Apply a multiple-comparison correction — Bonferroni gives alpha_corrected = 0.005, FDR is less aggressive. After either correction, p = 0.04 no longer survives the bar. The right move is to either pre-register the metric as primary or treat the finding as exploratory and re-test.

FAQ

What is p-value in simple words?

P-value is the probability of seeing data at least as extreme as yours, assuming there is no real effect. A small p-value means the data are uncomfortable to explain in a no-effect world, which is grounds to reject the null hypothesis. It is not the probability that the null hypothesis is true — that requires a Bayesian setup with a prior, neither of which p-value carries on its own.

Does p-value = 0.03 mean a 3% chance of error?

No. p = 0.03 means: if there is no effect, the probability of seeing a difference this large or larger by chance alone is 3%. It is not the probability the null hypothesis is correct. That quantity requires Bayes' theorem and a prior on the hypothesis, which p-value does not contain.

Statistical vs practical significance?

Statistical significance (p < 0.05) tells you the data would be unusual if there were no effect. Practical significance tells you whether the effect is large enough to matter for the business. A 0.01% conversion gap can be highly statistically significant at large sample sizes while being economically meaningless. Always read effect size and confidence interval next to the p-value.

Why can't I check the p-value daily during an A/B test?

Repeated checks inflate the false-positive rate. Daily checks across a two-week test can push the real alpha from 5% to 20-30%, because each look gives you another shot at crossing the threshold by chance. Fix the sample size upfront, use sequential testing, or apply group-sequential boundaries that adjust for the planned looks.