Effect size explained simply
Why effect size matters
It is Monday at Stripe and the PM pings you in Slack. The new checkout copy hit p < 0.001 on a 14 million user sample, the launch review is in two hours, and the dashboard is green. You open the report: the conversion lift is 0.03 percentage points on a 4.2 percent base. The p-value is screaming "significant" because the sample is enormous, but the magnitude is invisible. Without effect size you ship every rounding-error win the platform finds, and the roadmap fills with microscopic moves that do not register on the business side.
Effect size is the standardized magnitude of a difference. P-value answers "is there an effect at all"; effect size answers "how big is it, in units I can compare across experiments". When the PM at DoorDash asks whether to ship a 0.5 percent uplift, the senior analyst quotes Cohen's h or a standardized mean difference and ties it to a meaningful raw delta. The junior analyst quotes the p-value and ships a coin flip.
This is also one of the most reliable interview questions for mid-level and senior analyst roles at Meta, Airbnb, Uber, Netflix, and Linear. The prompt is "your test is significant but the effect is small, do you ship?" The expected answer cites an effect size statistic, an interpretation threshold, and the relationship to the minimum detectable effect you planned for.
The one-paragraph intuition
Effect size strips the units off a difference so you can compare it across studies, products, and metrics. A revenue lift of two dollars per user means one thing at Netflix and a completely different thing at Spotify. Divide that two dollars by the standard deviation of revenue, and you get a number around 0.1 or 0.4 that any analyst on any product can interpret immediately. That number is Cohen's d. The same trick works for proportions with Cohen's h, for correlations with Pearson's r, and for ANOVA with eta-squared. For Cohen's d the unit is "standard deviations of noise", and the rule of thumb is simple: small effects are around 0.2, medium around 0.5, large around 0.8.
Cohen's d for continuous metrics
For comparing two group means, Cohen's d is the workhorse. The formula divides the raw mean difference by the pooled standard deviation of the two groups:
d = (mean_B - mean_A) / s_pooled
s_pooled = sqrt(((n_A - 1) * var_A + (n_B - 1) * var_B) / (n_A + n_B - 2))

Pooled SD weights each group's variance by its degrees of freedom. Both groups estimate the same underlying noise under the null, so pooling gives a tighter estimate than picking one group's SD arbitrarily. The denominator n_A + n_B - 2 reflects the two degrees of freedom you spent estimating two group means.
A worked example: control mean is 100, treatment mean is 105, pooled SD is 20. Cohen's d is (105 - 100) / 20 = 0.25. By the standard thresholds that is a small effect. On a 5 million user sample it will produce a tiny p-value, and it is your job to translate the 0.25 into "five extra dollars per user out of a 100 dollar baseline, with a 20 dollar standard deviation". That sentence is what gets a launch shipped or shelved.
In Python the implementation is a handful of lines:

import numpy as np

def cohens_d(control, treatment):
    """Standardized mean difference: (mean_B - mean_A) / pooled SD."""
    n_a, n_b = len(control), len(treatment)
    dof = n_a + n_b - 2
    # ddof=1 gives the unbiased sample variance for each group
    var_a, var_b = np.var(control, ddof=1), np.var(treatment, ddof=1)
    s_pooled = np.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / dof)
    # Treatment minus control, so a positive d means the treatment is higher
    return (np.mean(treatment) - np.mean(control)) / s_pooled

The 0.2 / 0.5 / 0.8 thresholds are heuristics from the 1988 textbook, not laws of physics. In ad-tech a d of 0.05 can be enormous in dollar terms because the base is huge. In medical trials a d of 0.3 might be the difference between a useful drug and a useless one. Always anchor the standardized number back to a raw business-meaningful delta.
Other effect size measures
Hedges' g is Cohen's d multiplied by a small-sample correction factor, 1 - 3 / (4 * df - 1), where df = n_A + n_B - 2. Above a few hundred users per arm the correction is invisible; below that, default to g.
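To make the correction concrete, here is a minimal sketch of Hedges' g built on the pooled-SD formula from the previous section. The function name `hedges_g` and the control/treatment argument order are illustrative choices, not part of any library.

```python
import numpy as np

def hedges_g(control, treatment):
    """Cohen's d times the small-sample correction 1 - 3 / (4 * df - 1)."""
    control = np.asarray(control, dtype=float)
    treatment = np.asarray(treatment, dtype=float)
    n_c, n_t = len(control), len(treatment)
    df = n_c + n_t - 2
    # Pooled SD, weighting each group's sample variance by its degrees of freedom
    s_pooled = np.sqrt(
        ((n_c - 1) * control.var(ddof=1) + (n_t - 1) * treatment.var(ddof=1)) / df
    )
    d = (treatment.mean() - control.mean()) / s_pooled
    correction = 1 - 3 / (4 * df - 1)
    return d * correction

# The correction factor shrinks toward 1 as the sample grows
for n_per_arm in (10, 50, 500):
    df = 2 * n_per_arm - 2
    print(n_per_arm, round(1 - 3 / (4 * df - 1), 4))
```

With 10 users per arm the factor is about 0.96, so g is roughly 4 percent smaller than d; with 500 per arm the two are indistinguishable for practical purposes.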
Pearson's r is the effect size for the relationship between two continuous variables. Its thresholds (0.1 small, 0.3 medium, 0.5 large) sit on a different scale than Cohen's d, so do not compare directly. R-squared is the share of variance one variable explains in the other.
Cohen's h is the effect size for proportions, used in A/B tests on conversion, retention, or churn. The formula uses an arcsine transform: h = 2 * arcsin(sqrt(p_B)) - 2 * arcsin(sqrt(p_A)). The arcsine stabilizes variance when proportions sit near zero or one. The same 0.2 / 0.5 / 0.8 thresholds apply.
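The formula translates directly into code. This is a small sketch using only the standard library; the function name `cohens_h` and the argument order (A first, B second) are choices made for this example.

```python
import math

def cohens_h(p_a, p_b):
    """Cohen's h for two proportions, following h = 2*arcsin(sqrt(p_B)) - 2*arcsin(sqrt(p_A))."""
    return 2 * math.asin(math.sqrt(p_b)) - 2 * math.asin(math.sqrt(p_a))

# The same one-point absolute gap is a bigger standardized effect near zero
# than in the middle of the range, which is what the arcsine transform encodes.
print(cohens_h(0.01, 0.02))
print(cohens_h(0.50, 0.51))
```

The first call returns a value around 0.08 and the second around 0.02, even though both gaps are one percentage point.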
Odds ratio shows up in logistic regression and risk modeling. An OR of 2 means twice the odds, not twice the probability — confusing the two is a classic interview slip. Eta-squared belongs with ANOVA. Cramer's V is the chi-square equivalent of Cohen's d for two categorical variables, scaled zero to one. Cliff's delta is the non-parametric alternative when distributions have heavy tails.
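The odds-versus-probability slip is easy to check numerically. The sketch below applies an odds ratio to a baseline probability; the helper names are made up for this illustration.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

def prob_from_odds(o):
    """Convert odds back to a probability."""
    return o / (1 + o)

def apply_odds_ratio(p_base, odds_ratio):
    """Probability after multiplying the baseline odds by the odds ratio."""
    return prob_from_odds(odds(p_base) * odds_ratio)

# An OR of 2 nearly doubles a rare probability but not a common one
for p_base in (0.01, 0.20, 0.50):
    print(p_base, round(apply_odds_ratio(p_base, 2.0), 4))
```

At a 50 percent baseline, an OR of 2 moves the probability to about 67 percent, not 100 percent. Only for rare events does the odds ratio approximate the ratio of probabilities.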
p-value vs effect size
The standard two-by-two grid locks in the relationship:
                 | Small effect     | Large effect
-----------------+------------------+------------------
p < 0.05         | Significant,     | Significant,
                 | not meaningful   | meaningful
p >= 0.05        | Not significant, | Not significant,
                 | not meaningful   | power likely low

Scenario one: p = 0.001, d = 0.04. Significant on a massive sample, magnitude essentially zero. Ship only if the change is free.
Scenario two: p = 0.12, d = 0.6. Not significant, but the standardized effect is substantial. Most likely explanation: low power. Extend the experiment, do not kill the feature.
Scenario three: p = 0.001, d = 0.7. Both significant and meaningful. Ship.
Scenario four: p = 0.4, d = 0.05. Both insignificant and tiny. Kill it.
Junior analysts collapse this grid into a single column. Senior analysts walk through both axes and match the decision to the cell.
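The grid follows from one mechanical fact: for a fixed standardized effect, the p-value is driven by sample size. A rough sketch, assuming equal arm sizes and a normal approximation to the two-sample test (both simplifications for illustration):

```python
import math

def two_sided_p_from_d(d, n_per_arm):
    """Approximate two-sided p-value for an observed d with n users in each arm."""
    # Under the normal approximation the test statistic is z = d * sqrt(n / 2)
    z = d * math.sqrt(n_per_arm / 2)
    # erfc(|z| / sqrt(2)) equals the two-sided tail probability of a standard normal
    return math.erfc(abs(z) / math.sqrt(2))

# A tiny effect becomes "significant" once the sample is large enough
print(two_sided_p_from_d(0.04, 100))
print(two_sided_p_from_d(0.04, 20_000))
# A substantial effect can miss the 0.05 cutoff on a small sample
print(two_sided_p_from_d(0.6, 14))
```

The d of 0.04 goes from clearly non-significant at 100 users per arm to well below 0.001 at 20,000 per arm, while the d of 0.6 sits above 0.05 with only 14 users per arm. The effect size did not change in either case; only the sample did.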
Effect size, MDE, and practical significance
Minimum detectable effect (MDE) is the smallest effect your experiment is powered to find. It is the effect size you commit to detecting before launch, and it determines the sample size you need:
n proportional to 1 / MDE^2

Halve the MDE and you need four times the sample. Senior teams pick MDE from a business floor ("this feature is only worth shipping if it lifts conversion by at least 0.2 percentage points") rather than the smallest number the platform can technically detect. Picking too small an MDE is the most expensive mistake in experimentation.
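The scaling can be checked against the textbook normal-approximation formula for a two-sided, two-sample test, n per arm = 2 * (z_alpha/2 + z_beta)^2 / d^2. The defaults of alpha = 0.05 and 80 percent power below are common conventions assumed for this sketch, not values from any particular platform.

```python
from statistics import NormalDist

def n_per_arm(d, alpha=0.05, power=0.80):
    """Approximate users per arm needed to detect a standardized effect d."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    return 2 * (z_alpha + z_beta) ** 2 / d ** 2

# Each halving of the MDE multiplies the required sample by four
for mde in (0.2, 0.1, 0.05):
    print(f"MDE d = {mde}: about {n_per_arm(mde):,.0f} users per arm")
```

At d = 0.2 this comes out to roughly 390 users per arm, and each halving of the MDE quadruples that figure.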
A worked link to Cohen's d: if baseline conversion is 10 percent with a binomial SD around sqrt(0.1 * 0.9) = 0.3, and your MDE is 1 percentage point, the MDE in Cohen's d terms is 0.01 / 0.3 = 0.033. Tiny. That tells you you will need a giant sample, and it lets you sanity-check the platform's power calculation against textbook formulas.
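The same worked example as a sanity-check script. It converts an absolute MDE on a conversion rate into Cohen's d terms using the binomial SD, then plugs that into the normal-approximation sample size formula. The function names and the alpha = 0.05, power = 0.80 defaults are assumptions for this sketch; a real platform may use a different variance estimate or test.

```python
import math
from statistics import NormalDist

def mde_as_d(baseline, mde_abs):
    """Express an absolute MDE on a proportion in standard-deviation units."""
    sd = math.sqrt(baseline * (1 - baseline))  # binomial SD of a single user
    return mde_abs / sd

def n_per_arm_for_mde(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users per arm needed to detect mde_abs at the given baseline."""
    d = mde_as_d(baseline, mde_abs)
    z_total = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return 2 * z_total ** 2 / d ** 2

d = mde_as_d(0.10, 0.01)
print(f"MDE in d terms: {d:.3f}")
print(f"Users per arm: {n_per_arm_for_mde(0.10, 0.01):,.0f}")
```

Under these assumptions a 1 percentage point MDE on a 10 percent baseline needs on the order of 14,000 users per arm. If the platform's calculator reports a number far from that, the inputs are worth a second look.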
Practical significance is another name for effect size in the context of business decisions. Conversion A is 10.0 percent, B is 10.1 percent, p is 0.03, relative lift is 1 percent, Cohen's h is roughly 0.003: statistically significant, practically invisible. Whether to ship depends on three factors most analysts forget to enumerate. Implementation cost: a free copy change ships for a tiny lift; a six-week engineering project does not. Scale: 1 percent on a billion-dollar revenue line is a ten-million-dollar win; the same 1 percent on a hundred-thousand-dollar line is a thousand dollars, a rounding error. Risk to neighboring metrics: a 0.1 percent conversion lift that costs 0.5 percent on retention is a net loss the dashboard will not show you unless you ask.
What interviewers ask
"How is effect size different from p-value?" P-value tests for the existence of a difference; effect size measures the magnitude. At small samples a real effect can show large p-values; at huge samples a trivial effect still produces tiny p-values.
"What is Cohen's d?" Standardized mean difference. Mean delta divided by pooled standard deviation. Thresholds 0.2 / 0.5 / 0.8 as a rough guide, always anchored back to a business unit.
"When is effect size more important than p-value?" Two cases. First, when the sample is huge and every test is significant. Second, when the sample is small and the p-value is misleading because power is low. In both, the magnitude carries the decision.
"Is MDE the same as effect size?" Not quite. MDE is an effect size, but a planned one: the smallest effect you commit to detecting before the experiment runs. Observed effect size is what you find after, and it can land above or below the MDE.
"What is the difference between absolute and relative lift?" Absolute lift is measured in percentage points (10% to 11% is +1 pp). Relative lift is measured as a fraction of the baseline (10% to 10.1% is +1%). At a 10 percent baseline, reading "+1" one way or the other changes the result by a factor of ten, and confusing the two is the fastest way to lose credibility in a launch review.
Common pitfalls
Reporting only the p-value is the most common trap. A significant p-value with a microscopic effect size is the default state of any A/B platform running on a giant user base, and shipping on p-value alone produces a portfolio of indistinguishable launches. The fix is to require effect size, raw delta, and confidence interval in every launch doc.
Reporting only the effect size is the mirror trap. A large effect size on a small sample is mostly noise. You can compute Cohen's d on twenty users and get 1.2, and it will mean very little because twenty observations cannot pin down either the mean difference or the pooled SD. Pair every effect size with a confidence interval and refuse to interpret point estimates without one.
Ignoring the confidence interval on the effect size itself is a related mistake. A narrow interval means your estimate is precise; an interval crossing zero means you do not know the sign. Senior analysts read the interval first and the point estimate second.
Comparing effect sizes across methods is a trap intermediate analysts fall into when they switch metrics. Cohen's d, Pearson's r, Cramer's V, and Cohen's h are on different scales. A d of 0.4 and an r of 0.4 are not "the same size". Within a method, comparison works; across methods, convert or stop comparing.
Confusing absolute and relative lift is the trap that ends careers. A PM who reads "+1 percent" and assumes percentage points when you meant relative will ship a feature expecting ten times the actual lift. State the unit every time: "Conversion lifted from 10.0 to 10.1 percent (+0.1 pp absolute, +1 percent relative)" is the only acceptable phrasing.
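One way to make the unit impossible to misread is to compute both lifts together and always print them side by side. A minimal sketch; the function name `lifts` and the output format are illustrative.

```python
def lifts(p_control, p_treatment):
    """Return (absolute lift in percentage points, relative lift in percent)."""
    absolute_pp = (p_treatment - p_control) * 100
    relative_pct = (p_treatment - p_control) / p_control * 100
    return absolute_pp, relative_pct

abs_pp, rel_pct = lifts(0.100, 0.101)
print(
    f"Conversion lifted from 10.0 to 10.1 percent "
    f"({abs_pp:+.1f} pp absolute, {rel_pct:+.0f} percent relative)"
)
```

Returning the pair as one value means a launch doc template cannot easily quote one number without the other.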
Related reading
- How to calculate effect size in SQL
- How to design an A/B test step by step
- A/B testing peeking mistake
- CUPED explained simply
- Bootstrap explained simply
- How to calculate confidence interval in SQL
FAQ
Is Cohen's d valid when the data is not normal?
The point estimate of Cohen's d is defined regardless of distribution because it is just a ratio of sample statistics. The interpretation thresholds and confidence intervals around d, however, assume roughly normal distributions and similar variances. For heavily skewed or heavy-tailed data, two safer choices exist. Cliff's delta is a non-parametric effect size based on the probability that a value from one group exceeds a value from the other. Alternatively, bootstrap a confidence interval around d itself by resampling the two groups thousands of times.
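Both options fit in a short script. The sketch below is one possible implementation: Cliff's delta computed as P(treatment > control) minus P(treatment < control) over all pairs, and a percentile bootstrap interval around Cohen's d. The function names, the 2,000 resamples, and the fixed seed are arbitrary choices for illustration.

```python
import numpy as np

def cohens_d(control, treatment):
    """Standardized mean difference using the pooled SD."""
    n_c, n_t = len(control), len(treatment)
    pooled_var = (
        (n_c - 1) * np.var(control, ddof=1) + (n_t - 1) * np.var(treatment, ddof=1)
    ) / (n_c + n_t - 2)
    return (np.mean(treatment) - np.mean(control)) / np.sqrt(pooled_var)

def cliffs_delta(control, treatment):
    """P(treatment > control) - P(treatment < control) over all cross-group pairs."""
    c = np.asarray(control)[:, None]
    t = np.asarray(treatment)[None, :]
    # Builds an n_c by n_t comparison matrix, so this is meant for modest samples
    return (np.sum(t > c) - np.sum(t < c)) / (c.size * t.size)

def bootstrap_ci_d(control, treatment, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for Cohen's d, resampling each group separately."""
    rng = np.random.default_rng(seed)
    control = np.asarray(control, dtype=float)
    treatment = np.asarray(treatment, dtype=float)
    estimates = np.empty(n_boot)
    for i in range(n_boot):
        c = rng.choice(control, size=control.size, replace=True)
        t = rng.choice(treatment, size=treatment.size, replace=True)
        estimates[i] = cohens_d(c, t)
    tail = (1 - level) / 2
    lower, upper = np.quantile(estimates, [tail, 1 - tail])
    return lower, upper

# Synthetic skewed data, for demonstration only
rng = np.random.default_rng(1)
control = rng.lognormal(mean=0.0, sigma=1.0, size=200)
treatment = rng.lognormal(mean=0.3, sigma=1.0, size=200)
print("Cliff's delta:", cliffs_delta(control, treatment))
print("Bootstrap CI for d:", bootstrap_ci_d(control, treatment))
```

The percentile method is the simplest bootstrap interval; other variants exist and may behave better on very small or very skewed samples.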
Are the 0.2 / 0.5 / 0.8 thresholds strict?
No. They are heuristics from Cohen's 1988 textbook, derived from the social-science effect sizes he saw at the time. A Cohen's h of 0.02 on a five-billion-impression ad platform can move hundreds of millions of dollars. A Cohen's d of 0.9 in a small clinical trial can collapse to nothing on replication. Always interpret the standardized number alongside the raw business delta, the sample size, and the cost of the change.
Should effect size be reported in percent?
Only if you are reporting a lift, not a standardized effect size. Lifts come in absolute (percentage points) and relative (percent) forms, and they are not interchangeable. Standardized effect sizes like Cohen's d and Cohen's h are unitless. In a launch doc, report the standardized effect size, the raw absolute delta, the relative lift, and the confidence interval.
Do you need a confidence interval for the effect size?
Yes, always. The point estimate is unstable on small samples and seductive on large ones. The standard format is d = 0.3 [95% CI: 0.1, 0.5]. If the interval crosses zero, the direction of the effect is unclear and you should not lead with the point estimate.
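When the raw data is not at hand, a common large-sample approximation gives an interval from d and the two group sizes alone: the standard error is taken as sqrt((n_A + n_B) / (n_A * n_B) + d^2 / (2 * (n_A + n_B))). The sketch below assumes that approximation and a normal critical value; it is not exact for small samples.

```python
import math
from statistics import NormalDist

def d_confidence_interval(d, n_a, n_b, level=0.95):
    """Approximate confidence interval for Cohen's d (large-sample normal approximation)."""
    se = math.sqrt((n_a + n_b) / (n_a * n_b) + d ** 2 / (2 * (n_a + n_b)))
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return d - z * se, d + z * se

# Same point estimate, very different certainty
for n in (20, 100, 1000):
    low, high = d_confidence_interval(0.3, n, n)
    print(f"n = {n} per arm: d = 0.3 [95% CI: {low:.2f}, {high:.2f}]")
```

With 20 users per arm the interval around d = 0.3 crosses zero; with 1,000 per arm it is tight enough to lead with the point estimate.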
How do effect size and statistical power relate?
Power is the probability of detecting a true effect of a given size at a given sample size and significance level. Effect size is the input; sample size is the output. Halve the effect you want to detect and the required sample roughly quadruples, because n scales like 1 / d^2. That is the single most useful fact to memorize before an experimentation interview.