Normal distribution
What it is
The normal distribution (also called Gaussian, or "the bell curve") is the most useful shape in applied statistics. If a recruiter at Stripe or Meta hands you an experiment to analyze, the default assumption you make about a metric is usually "approximately normal in the mean, given enough samples." Knowing what that means, and when it breaks, is the difference between a t-test you can defend and a p-value that gets torn apart in code review.
Picture it as a hill. Most of the mass sits near the center, the slopes fall off symmetrically on both sides, and the tails stretch out toward plus and minus infinity but never quite touch zero. The center tells you where typical values land; the width tells you how spread out they are. Those two numbers — mean and standard deviation — are everything you need to fully specify a normal curve. No third parameter for skew, no fourth for kurtosis. Just two.
That parsimony is why hundreds of statistical procedures — t-tests, z-tests, linear regression, confidence intervals, ANOVA, Kalman filters, Gaussian processes — assume something in the model is normal. When it actually is, the math is clean and the answers are sharp. When it is not, you fix the data, switch to a different test, or accept that your p-values are lies. This post walks through how to tell which world you are in.
Properties
Four properties are worth memorizing before you ever touch a test.
It is symmetric. Left tail and right tail are mirror images, with no skew. If your histogram has a long tail on one side and a wall on the other, you are not looking at a normal distribution.
It is fully described by two parameters. The mean (mu) anchors the center of the bell. The standard deviation (sigma) controls how fat or thin it is. A small sigma means a tall, narrow curve clustered tightly around the mean; a large sigma means a short, wide curve. Same shape, different scale.
Mean, median, and mode all coincide. Because the curve is symmetric and unimodal, the average, the middle value, and the most common value land on the same point. The moment those three diverge in your data, you are looking at something non-normal — usually skewed.
The support is unbounded. A normal random variable can mathematically take any value from minus infinity to plus infinity. In practice almost all probability sits within a few sigmas of the mean, but the tails never end. This matters when you model quantities that cannot be negative, such as revenue, age, or latency. Strictly speaking, normal is the wrong model for those, even if it fits okay in the middle.
The shorthand for "X is normal with mean mu and standard deviation sigma" is X ~ N(mu, sigma^2). The special case N(0, 1) is called the standard normal, and it is where z-scores live. Any normal variable becomes standard normal by subtracting the mean and dividing by sigma:
z = (x - mu) / sigma
That single line is the engine behind every z-table you have ever squinted at.
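A minimal sketch of standardization, using Python's standard-library `statistics.NormalDist`. The mean of 100, sigma of 15, and observed value of 130 are hypothetical numbers picked only to make the arithmetic easy to follow.

```python
from statistics import NormalDist

# Hypothetical parameters chosen for illustration only.
mu, sigma = 100.0, 15.0
x = 130.0

# Standardize: how many sigmas is x from the mean?
z = (x - mu) / sigma

# A z-table lookup is the standard normal CDF evaluated at z.
p_below = NormalDist(0, 1).cdf(z)

# The same probability computed directly on the original scale.
p_below_direct = NormalDist(mu, sigma).cdf(x)

print(f"z = {z:.2f}, P(X <= {x}) = {p_below:.4f}")
```

Both routes give the same probability, which is the whole point of the z-score: one table (or one CDF) serves every normal distribution.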
The 68-95-99.7 rule
Memorize these three numbers. They will save you minutes in interviews and hours in real analysis.
In any normal distribution, about 68% of values fall within one sigma of the mean, about 95% within two sigmas, and about 99.7% within three sigmas. Strictly, two sigmas covers about 95.4%; exactly 95% corresponds to 1.96 sigmas. The rule, sometimes called the empirical rule, holds for every normal distribution regardless of its mean or sigma.
Worked example. Suppose checkout amounts at an e-commerce store are normal with mean $80 and sigma $20. Then 68% of orders sit in [$60, $100], 95% in [$40, $120], and 99.7% in [$20, $140]. A $200 order would be six sigmas above the mean, with probability under one in a billion — strong evidence either the assumption is wrong or the order is exotic.
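The worked example can be checked in a few lines. This is a sketch using the standard-library `statistics.NormalDist` with the $80 mean and $20 sigma from the text; the helper names `interval` and `coverage` are made up for this illustration.

```python
from statistics import NormalDist

# Parameters from the worked example: mean $80, sigma $20.
orders = NormalDist(mu=80, sigma=20)

def interval(k):
    """Return the (low, high) band k sigmas either side of the mean."""
    return (orders.mean - k * orders.stdev, orders.mean + k * orders.stdev)

def coverage(k):
    """Probability that an order lands inside the k-sigma band."""
    low, high = interval(k)
    return orders.cdf(high) - orders.cdf(low)

for k in (1, 2, 3):
    low, high = interval(k)
    print(f"{k} sigma: [${low:.0f}, ${high:.0f}] covers {coverage(k):.4f}")

# Upper tail beyond $200, i.e. six sigmas above the mean.
# By symmetry P(X > 200) equals P(X < 80 - 120), which avoids
# subtracting two nearly equal numbers.
z = (200 - orders.mean) / orders.stdev
p_tail = orders.cdf(orders.mean - (200 - orders.mean))
print(f"z = {z:.1f}, P(order > $200) = {p_tail:.2e}")
```

The tail probability comes out just under one in a billion, matching the claim above, but only under the assumption that order values really are normal.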
The rule is also the source of the "two-sigma" outlier cutoff and the 1.96 multiplier in 95% confidence intervals. Both come from the same fact: the area under the standard normal curve between minus 1.96 and plus 1.96 equals 0.95.
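If you want to see where 1.96 comes from rather than take it on faith, here is a small sketch with the standard-library `statistics.NormalDist`. The variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, sigma 1

# A two-sided 95% band leaves 2.5% in each tail,
# so the multiplier is the 97.5th percentile.
z_95 = std_normal.inv_cdf(0.975)

# Area between -1.96 and +1.96.
area_196 = std_normal.cdf(1.96) - std_normal.cdf(-1.96)

# Area between -2 and +2, the rounded "two-sigma" version.
area_2 = std_normal.cdf(2) - std_normal.cdf(-2)

print(f"z for 95%: {z_95:.4f}")
print(f"area within 1.96 sigmas: {area_196:.4f}")
print(f"area within 2 sigmas:    {area_2:.4f}")
```

Swapping 0.975 for 0.995 gives the multiplier for a 99% interval in the same way.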
Where it shows up in practice
Three buckets cover most of what you will see at work.
A/B test analysis. The Welch t-test and z-test for difference of means assume the sampling distribution of the difference is approximately normal. Thanks to the Central Limit Theorem (CLT), this holds for almost any reasonable metric once your sample is large enough. The mean of a sample, even when raw values are skewed, becomes increasingly normal as N grows. That is why teams at DoorDash, Airbnb, and Uber can run t-tests on revenue per session even though revenue per session is wildly non-normal.
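The CLT claim is easy to watch in a simulation. The sketch below assumes NumPy is available and uses an exponential distribution as a stand-in for a skewed metric; the sample sizes, replication count, and helper names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def skewness(values):
    """Sample skewness: the mean cubed z-score (0 for a symmetric shape)."""
    centered = values - values.mean()
    return float(np.mean(centered**3) / values.std() ** 3)

def skew_of_sample_means(n, reps=5000):
    """Draw `reps` samples of size n and return the skewness of their means."""
    samples = rng.exponential(scale=1.0, size=(reps, n))
    return skewness(samples.mean(axis=1))

# Skewness of the raw values versus skewness of sample means.
raw_skew = skewness(rng.exponential(scale=1.0, size=100_000))
skew_n5 = skew_of_sample_means(5)
skew_n400 = skew_of_sample_means(400)

print(f"raw values:       skew = {raw_skew:.2f}")
print(f"means of n = 5:   skew = {skew_n5:.2f}")
print(f"means of n = 400: skew = {skew_n400:.2f}")
```

The raw values stay heavily skewed while the skewness of the sample means shrinks toward zero as n grows. How large n must be depends on how heavy the tail is, so treat the exact sizes here as illustrative.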
Confidence intervals. A 95% interval is mean plus or minus 1.96 times SE, where SE is the standard error. The 1.96 comes straight from the standard normal. The interval is valid whenever the sampling distribution is approximately normal — once N is large enough, CLT does the heavy lifting.
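A sketch of the interval formula using only the standard library. The function name `mean_ci` and the per-session values are hypothetical. With only ten observations a t multiplier would be more appropriate than 1.96; the list is kept tiny so the example stays readable.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_ci(values, confidence=0.95):
    """Normal-approximation confidence interval for the mean."""
    n = len(values)
    se = stdev(values) / math.sqrt(n)  # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    center = mean(values)
    return center - z * se, center + z * se

# Hypothetical per-session values, for illustration only.
sessions = [12.0, 15.5, 9.0, 22.0, 13.5, 18.0, 11.0, 16.5, 14.0, 19.5]
low, high = mean_ci(sessions)
print(f"95% CI for the mean: [{low:.2f}, {high:.2f}]")
```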
Modeling assumptions in ML. Linear regression inference assumes residuals are normal with constant variance. Gaussian mixture models, many Bayesian priors, and the probabilistic formulation of PCA assume the underlying variables are normal or close. When the assumption holds, the math gives crisp closed-form answers. When it does not, the model still fits but its confidence intervals and parameter p-values get unreliable.
Where it does not apply
A surprising amount of real-world data is not normal. Catching this early saves you from running tests that look fine but quietly produce wrong answers.
Revenue, income, and order value are almost always right-skewed. A few large customers drag the tail out, mean and median diverge sharply, and log-normal is a far better baseline than normal.
Latency and response time are right-skewed by construction. Most requests are fast, a small fraction are slow, and the worst tail is what users feel. Reporting mean latency hides exactly the problem you are trying to measure — that is why every SRE team reports p50, p95, and p99 instead.
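To see how far the mean can drift from what a typical request experiences, here is a sketch on simulated data. It assumes NumPy and uses a log-normal draw as a stand-in for real latencies; the parameters are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated latencies in milliseconds; log-normal is an assumed
# stand-in for a right-skewed latency distribution.
latency_ms = rng.lognormal(mean=4.0, sigma=0.8, size=50_000)

p50, p95, p99 = np.percentile(latency_ms, [50, 95, 99])
mean_ms = latency_ms.mean()

print(f"mean = {mean_ms:.0f} ms")
print(f"p50 = {p50:.0f} ms, p95 = {p95:.0f} ms, p99 = {p99:.0f} ms")
```

The mean sits above the median yet says nothing about the p99, which is the number that describes the slow requests.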
Lifetime value (LTV) is dominated by whales. A handful of customers spend orders of magnitude more than the rest, so treating LTV as normal yields confidence intervals that are too tight and a t-test that will lie to you.
Retention curves decay exponentially, not symmetrically. You cannot center a bell on them no matter how hard you squint.
Counts and proportions are not normal in raw form — they are discrete and bounded. A landing-page conversion rate lives in [0, 1] and follows a binomial. With large samples, CLT lets you approximate the sampling distribution of the rate as normal, but the underlying variable is not.
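For a conversion rate, the large-sample normal approximation looks like the sketch below. It uses only the standard library; the function name `conversion_ci` and the counts are hypothetical. This is the simple Wald interval, which gets unreliable when the rate is near 0 or 1 or the sample is small.

```python
import math
from statistics import NormalDist

def conversion_ci(conversions, visitors, confidence=0.95):
    """Normal-approximation (Wald) interval for a conversion rate."""
    p = conversions / visitors
    se = math.sqrt(p * (1 - p) / visitors)  # binomial standard error
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return p - z * se, p + z * se

# Hypothetical counts for illustration: 480 conversions from 12,000 visitors.
low, high = conversion_ci(conversions=480, visitors=12_000)
print(f"rate = {480 / 12_000:.3f}, 95% CI = [{low:.4f}, {high:.4f}]")
```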
Checking for normality
You have two kinds of tools: visual and statistical. Use both.
The fastest visual check is a histogram. If it looks symmetric and bell-shaped, you are probably fine. A Q-Q plot is more precise: it plots the sorted values of your data against the quantiles of a theoretical normal distribution. If the points fall on a straight diagonal line, the data is approximately normal. An S-shape, where the two ends bend away from the line in opposite directions, means tails that are fatter or thinner than normal. A single bow, where both ends bend the same way, means skew.
import matplotlib.pyplot as plt
import scipy.stats as stats
# data: a 1-D array or list of numeric observations
stats.probplot(data, dist='norm', plot=plt)
plt.show()
For a statistical test, you have three common choices. Shapiro-Wilk is the standard for small to medium samples, up to a few thousand observations. Anderson-Darling is more sensitive in the tails, which is where most real data deviates from normal. Kolmogorov-Smirnov is the classic but tends to be less powerful in practice.
from scipy.stats import shapiro
stat, p = shapiro(data)
print(f"Shapiro-Wilk p-value: {p:.4f}")
One trap. On very large samples, say N above 5,000, these tests will reject normality almost every time, even when the data looks visually fine. The test becomes sensitive enough to flag deviations too small to matter. At that scale, trust the Q-Q plot and a histogram more than the p-value.
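You can reproduce the large-sample trap on simulated data. The sketch below assumes NumPy and SciPy and uses a Student's t distribution with 10 degrees of freedom as a stand-in for data that looks bell-shaped but is not exactly normal. The exact p-values depend on the random draw, so none are quoted here.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)

# Student's t with 10 degrees of freedom: bell-shaped, with slightly
# heavier tails than a normal. An assumed stand-in for "close to normal".
population = rng.standard_t(df=10, size=5000)

# Same test, same data source, two sample sizes.
_, p_small = shapiro(population[:50])
_, p_large = shapiro(population)

print(f"N = 50:   p = {p_small:.4f}")
print(f"N = 5000: p = {p_large:.4g}")
```

The deviation from normality is identical in both runs; only the test's ability to detect it changes with N.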
Working with non-normal data
If your data is not normal, you have four solid moves.
Transform. A log transformation — np.log1p(y) — pulls in the right tail and often turns skewed positive data like revenue, latency, or session length into something close to normal. Box-Cox is the generalized version: it finds the best power transformation for your specific data.
import numpy as np
from scipy.stats import boxcox
# y: a 1-D NumPy array of non-negative values
y_log = np.log1p(y)
# Box-Cox needs strictly positive input; shifting by 1 mirrors log1p
# and avoids the extreme values a tiny shift such as 1e-9 produces at zero
y_bc, lam = boxcox(y + 1)
Switch tests. If you cannot transform, swap parametric tests for non-parametric equivalents. Use Mann-Whitney U instead of t-test, Wilcoxon signed-rank instead of paired t-test, and Kruskal-Wallis instead of ANOVA. These rely on ranks rather than raw values, so the underlying distribution does not need to be normal.
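As a sketch of the rank-based swap, here is Mann-Whitney U on two simulated experiment arms. It assumes NumPy and SciPy; the log-normal parameters and arm sizes are made up for illustration.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(7)

# Simulated skewed metric for two experiment arms (assumed data).
control = rng.lognormal(mean=3.0, sigma=1.0, size=500)
treatment = rng.lognormal(mean=3.3, sigma=1.0, size=500)

# Two-sided rank-based comparison; no normality assumption on raw values.
stat, p = mannwhitneyu(control, treatment, alternative="two-sided")
print(f"U = {stat:.0f}, p = {p:.4g}")
```

Keep in mind that Mann-Whitney asks whether one group tends to produce larger values than the other, which is not the same question as whether the means differ. If the business metric is a mean, report it alongside.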
Bootstrap. Resample your data with replacement several thousand times, compute the statistic on each resample, and read confidence intervals straight from the resulting distribution. Bootstrap makes no parametric assumptions about the shape of the data, which is why it is the workhorse method for skewed metrics like revenue per user.
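A minimal percentile-bootstrap sketch, assuming NumPy. The function name `bootstrap_ci`, the resample count, and the simulated revenue data are illustrative choices, and the percentile method is only one of several bootstrap interval variants.

```python
import numpy as np

rng = np.random.default_rng(3)

def bootstrap_ci(values, stat=np.mean, n_resamples=2000, confidence=0.95):
    """Percentile bootstrap confidence interval for `stat`."""
    values = np.asarray(values)
    n = len(values)
    # Each row of idx is one resample of size n, drawn with replacement.
    idx = rng.integers(0, n, size=(n_resamples, n))
    estimates = stat(values[idx], axis=1)
    alpha = (1 - confidence) / 2
    low, high = np.quantile(estimates, [alpha, 1 - alpha])
    return float(low), float(high)

# Simulated revenue per user (assumed log-normal for illustration).
revenue = rng.lognormal(mean=2.0, sigma=1.2, size=500)
low, high = bootstrap_ci(revenue)
print(f"mean = {revenue.mean():.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Passing `stat=np.median` gives an interval for the median with no other changes, which is the main practical appeal of the method.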
Lean on CLT. If you only care about the mean and your N is large, the sampling distribution of the mean is approximately normal regardless of underlying shape. That is why you can still run a t-test on revenue per session at N = 100,000 even though revenue per session is heavily skewed.
Common pitfalls
Treating every metric as normal is the most expensive mistake in this category. Revenue, LTV, and latency are reliably non-normal, and running an unmodified t-test on them can produce p-values that are too small and intervals that are too tight, especially when N is not large enough for the CLT to take over. The fix is to either log-transform, switch to a non-parametric test, or bootstrap. Pick the one that matches what you are reporting.
A subtler trap is applying the 68-95-99.7 rule outside of normal data. A finance partner once told me "this $1,800 order is four sigmas above mean — kill the integration, something is broken." But order value was log-normal, not normal. In a log-normal world a four-sigma value is not exotic at all. Always confirm the underlying distribution before invoking sigma thresholds. The rule is conditional on the bell shape; without it, the numbers mean nothing.
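To put a number on "not exotic at all", the sketch below compares the chance of exceeding mean plus four standard deviations under a normal and under a log-normal. It uses only the standard library. The log-scale parameters (0 and 1) are assumptions for illustration, not the parameters of the order data in the anecdote.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()

# Assumed log-normal parameters: log(X) ~ N(mu_log, s_log^2).
mu_log, s_log = 0.0, 1.0

# Mean and standard deviation of X itself (standard log-normal formulas).
mean_x = math.exp(mu_log + s_log**2 / 2)
sd_x = math.sqrt((math.exp(s_log**2) - 1) * math.exp(2 * mu_log + s_log**2))

# "Four sigmas above the mean" on the raw scale.
threshold = mean_x + 4 * sd_x

# P(X > threshold) = P(log X > log threshold), a normal tail on the log scale.
p_lognormal = std_normal.cdf(-(math.log(threshold) - mu_log) / s_log)

# What the empirical rule implies if X were normal.
p_normal = std_normal.cdf(-4)

print(f"normal:     P(> mean + 4 sd) = {p_normal:.2e}")
print(f"log-normal: P(> mean + 4 sd) = {p_lognormal:.2e}")
```

Under these assumed parameters the log-normal tail is a few hundred times heavier than the normal one at the same threshold.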
Confusing sigma and standard error is another classic. Sigma measures the spread of the raw data. Standard error (sigma divided by the square root of N) measures the spread of the sample mean. SE shrinks as N grows; sigma does not. Reporting sigma when you mean SE makes your confidence intervals look sqrt(N) times wider than they should be, which is 100 times at N = 10,000. Reporting SE when you mean sigma makes outlier thresholds look impossibly tight.
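The gap between sigma and SE is plain arithmetic, sketched here with the standard library. The sigma of 20 and the sample sizes are assumed values, and `standard_error` is a helper named for this example.

```python
import math

def standard_error(sigma, n):
    """Spread of the sample mean for n independent observations."""
    return sigma / math.sqrt(n)

sigma = 20.0  # spread of the raw data (assumed value)

for n in (100, 10_000, 1_000_000):
    se = standard_error(sigma, n)
    print(f"N = {n:>9,}: sigma = {sigma:.1f}, SE = {se:.3f}, sigma / SE = {sigma / se:.0f}")
```

The ratio between the two is always the square root of N, so the cost of mixing them up grows with your sample size.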
Finally, watch out for normality tests on huge samples. Shapiro-Wilk on 200,000 rows will reject normality even if your data is visually flawless. The test is right that the data is not exactly normal — almost no real data is — but the deviation is small enough that any normal-based procedure will still work. Pair the test with a Q-Q plot and use judgment.
FAQ
Is all data normally distributed?
No. Some variables — measurement error, IQ scores by construction, adult height — are close to normal. Most of the metrics you care about at work — revenue, LTV, latency, time-on-task — are not. The safest default with a new metric is to plot it before you assume anything. The shape almost always surprises people the first time they look.
How do I quickly check whether a variable is normal?
Plot a histogram and a Q-Q plot. The histogram tells you whether the shape is roughly symmetric and bell-like; the Q-Q plot tells you whether the tails behave. If both look reasonable, run a Shapiro-Wilk or Anderson-Darling for a quantitative check — but at large N, trust your eyes more than the p-value, since the tests over-reject on big samples.
What do I do if my data is not normal?
Pick one of four moves. Log-transform skewed positive data and rerun a parametric test. Switch to a non-parametric test like Mann-Whitney U or Wilcoxon. Bootstrap your statistic and read intervals from the resampled distribution. Or, if you only care about the mean and have a large enough sample, lean on CLT — the mean will be approximately normal even when the underlying variable is not.
Does the 68-95-99.7 rule work for any distribution?
No, it is specific to the normal. For other distributions the percentages within one, two, and three sigmas can be very different. Chebyshev's inequality gives a general bound — at least 75% of values fall within two sigmas — but that is much looser than 95%. Use the empirical rule only after confirming the data is approximately normal.
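A quick side-by-side of the Chebyshev guarantee and the normal coverage, as a standard-library sketch. The helper names are illustrative.

```python
from statistics import NormalDist

def chebyshev_bound(k):
    """Minimum share of values within k sigmas, for any finite-variance distribution."""
    return 1 - 1 / k**2

def normal_coverage(k):
    """Share of values within k sigmas when the distribution is normal."""
    std = NormalDist()
    return std.cdf(k) - std.cdf(-k)

for k in (2, 3):
    print(f"{k} sigmas: Chebyshev >= {chebyshev_bound(k):.3f}, normal = {normal_coverage(k):.4f}")
```

Chebyshev is a floor that holds for any shape, so it is always the weaker statement. The normal figures apply only once you have confirmed the bell shape.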
Why does the Central Limit Theorem matter here?
Because it explains why so many statistical procedures still work even when raw data is not normal. CLT says that as sample size grows, the sampling distribution of the mean approaches normal regardless of underlying shape. That is why a t-test on heavily skewed revenue data still gives sensible answers at N = 50,000, and why most production A/B pipelines check normality of the mean, not the raw metric.