Normal distribution: z-score and the 3-sigma rule
What the normal distribution is
The normal distribution, also called the Gaussian, is a continuous probability distribution whose density curve has the shape of a symmetric bell. It is the single most useful distribution in applied statistics: confidence intervals, hypothesis tests, and A/B test math all lean on it. Explaining it cleanly in an interview at Stripe, Netflix, or DoorDash covers a third of the typical statistics screen.
Two parameters define the shape. The mean, mu, decides where the peak sits. The standard deviation, sigma, decides how wide the bell is. The notation N(mu, sigma squared) means a normal with that mean and a variance equal to sigma squared. For instance, N(170, 100) describes adult heights with a mean of 170 cm and a standard deviation of 10 cm.
The density function looks like this:
f(x) = (1 / (sigma * sqrt(2 * pi))) * exp( -(x - mu)^2 / (2 * sigma^2) )
You will rarely write this formula by hand at work. What matters in interviews is recognizing it, knowing the parameters, and reasoning about the shape. The squared distance in the exponent is also why values far from the mean become vanishingly unlikely.
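As a quick check on the formula, here is a minimal sketch that evaluates the density by hand and compares it with scipy.stats.norm, the library the article uses later. The helper name normal_pdf and the evaluation point of 180 cm are illustrative choices, not part of the original text.

```python
import math
from scipy import stats

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x, written directly from the formula."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)

# Heights example from the text: mean 170 cm, standard deviation 10 cm
by_hand = normal_pdf(180, mu=170, sigma=10)
from_scipy = stats.norm(loc=170, scale=10).pdf(180)

print(by_hand)     # about 0.0242
print(from_scipy)  # same value, up to floating-point error
```

The two numbers agreeing is a useful habit: it confirms that loc and scale map to mu and sigma the way you expect.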
Core properties to memorize
The curve is symmetric around mu, so mean, median, and mode all coincide. The left half is a mirror image of the right half, which is why z-tables only list half the entries.
The 68-95-99.7 rule, also called the three-sigma rule, is the fastest sanity check in any analytics review.
| Interval | Share of values |
|---|---|
| mu plus or minus 1 sigma | 68.3% |
| mu plus or minus 2 sigma | 95.4% |
| mu plus or minus 3 sigma | 99.7% |
If heights follow N(170, 10 squared), then about 95% of people fall between 150 and 190 cm. The tails decay faster than exponentially, so values four sigma or more from the mean, counting both sides, have a combined probability of roughly 0.006%.
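The shares in the table can be reproduced in a few lines. This is a small sketch using scipy.stats.norm; the helper name share_within is illustrative.

```python
from scipy import stats

z = stats.norm()  # standard normal N(0, 1)

def share_within(k):
    """Probability mass within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))  # 68.3, 95.4, 99.7

# Two-sided tail beyond four sigma
beyond_4 = 2 * z.sf(4)
print(beyond_4)  # roughly 6.3e-05, i.e. about 0.006%
```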
Linear combinations of independent normals stay normal. If X is N(mu1, sigma1 squared) and Y is N(mu2, sigma2 squared) and the two are independent, then X plus Y is N(mu1 plus mu2, sigma1 squared plus sigma2 squared). This is why A/B test math is clean: the difference of two sample means stays normal as long as each component is, and standard errors add in quadrature.
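A simulation makes the "add in quadrature" claim concrete. The sketch below uses made-up parameters (means 10 and 4, standard deviations 3 and 4) and looks at the difference X minus Y, whose standard deviation should be close to sqrt(3^2 + 4^2) = 5.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 200_000

# Hypothetical independent normals: X ~ N(10, 3^2), Y ~ N(4, 4^2)
x = rng.normal(loc=10.0, scale=3.0, size=n)
y = rng.normal(loc=4.0, scale=4.0, size=n)

diff = x - y
print(diff.mean())  # close to 10 - 4 = 6
print(diff.std())   # close to sqrt(3**2 + 4**2) = 5
```

Note that the variances add even for a difference; only the means subtract.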
Z-score and standardization
A z-score expresses the distance between an observation and the mean in units of standard deviations:
z = (x - mu) / sigma
Standardizing turns any normal into the standard normal N(0, 1). A single table or function call covers every case, and you can compare values measured on incompatible scales.
Suppose a candidate scores 82 on a SQL screener (mean 70, sigma 8) and 65 on a Python screener (mean 55, sigma 5).
- z for SQL is (82 - 70) / 8 = 1.50
- z for Python is (65 - 55) / 5 = 2.00
The Python result is the stronger signal even though the raw score is lower. Recruiters at companies like Linear or Notion routinely standardize panel scores this way.
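The same comparison in code, as a minimal sketch. The function name z_score is illustrative, and the percentile step assumes screener scores are roughly normal, which the example does not state.

```python
from scipy import stats

def z_score(x, mu, sigma):
    """Distance from the mean in units of standard deviations."""
    return (x - mu) / sigma

sql_z = z_score(82, mu=70, sigma=8)
python_z = z_score(65, mu=55, sigma=5)
print(sql_z, python_z)  # 1.5 2.0

# Assumption: scores are approximately normal, so the CDF gives a percentile
print(stats.norm.cdf(sql_z))     # about 0.933
print(stats.norm.cdf(python_z))  # about 0.977
```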
| z-score | Share below | Share above |
|---|---|---|
| -2.0 | 2.3% | 97.7% |
| -1.0 | 15.9% | 84.1% |
| 0.0 | 50.0% | 50.0% |
| 1.0 | 84.1% | 15.9% |
| 1.96 | 97.5% | 2.5% |
| 2.0 | 97.7% | 2.3% |
| 3.0 | 99.9% | 0.1% |
The value 1.96 is the magic number behind the 95% confidence interval: 2.5% of the mass lies in each tail beyond it, which leaves 95% in the middle. That is why the multiplier 1.96 appears whenever you compute a normal-based 95% confidence interval. Z-scores are also a standard ingredient for outlier detection, where points with absolute z greater than 3 are typically flagged.
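The table above does not need to be memorized, since it can be regenerated on demand. A small sketch with scipy.stats.norm, where cdf gives the share below and sf (the survival function, 1 minus the CDF) gives the share above:

```python
from scipy import stats

z = stats.norm()  # standard normal N(0, 1)

rows = []
for value in (-2.0, -1.0, 0.0, 1.0, 1.96, 2.0, 3.0):
    below = z.cdf(value)  # share of mass to the left
    above = z.sf(value)   # share of mass to the right
    rows.append((value, below, above))
    print(f"{value:5.2f}  below={below:6.1%}  above={above:6.1%}")

# Going the other way: which z leaves 97.5% of the mass to its left?
z_95 = z.ppf(0.975)
print(round(z_95, 2))  # 1.96
```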
Why normal shows up everywhere
The short answer is the Central Limit Theorem, or CLT. It states that the arithmetic mean of a large number of independent, identically distributed random variables with finite variance is approximately normal, whatever the shape of the underlying distribution. For moderately skewed data, convergence is usually visible by the time the sample size reaches a few dozen; heavily skewed data needs more.
Height is a useful intuition: it is the sum of hundreds of small genetic and environmental contributions. None is individually normal, but their sum approaches a normal shape. The same logic applies to measurement error and IQ scores.
For analysts the connection is direct. Sample conversion rates, which are sample means of 0/1 outcomes, are approximately normal at typical sample sizes. The difference between two sample means is approximately normal because the sum or difference of normals stays normal. Both the z-test and the t-test lean on the normality of the sample mean rather than the data.
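The CLT is easy to see in a simulation. The sketch below draws from an exponential distribution, which is strongly right-skewed, and looks at means of 50 observations. The distribution, sample size, and repetition count are illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n, reps = 50, 20_000

# Raw data: exponential with mean 1 and standard deviation 1
raw = rng.exponential(scale=1.0, size=(reps, n))
means = raw.mean(axis=1)  # one sample mean per row

print(stats.skew(raw.ravel()))  # near 2: the raw data is far from normal
print(stats.skew(means))        # much smaller: the means are close to symmetric

# Standardize the means with the known mean (1) and standard error (1 / sqrt(n))
z = (means - 1.0) / (1.0 / np.sqrt(n))
coverage = np.mean(np.abs(z) <= 1.96)
print(coverage)  # close to 0.95, as the normal model predicts
```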
The standard normal and its three functions
The standard normal is the special case N(0, 1). Every other normal reduces to it through the z transform, which is why textbooks only list the standard normal.
You will reach for three functions repeatedly. The PDF gives the height of the curve at a point x; by itself it is not a probability, since the probability of any exact value for a continuous distribution is zero. The CDF returns P(X less than or equal to x). It answers questions like "what fraction of users complete checkout in under five seconds." The PPF is the inverse of the CDF: given a probability, it returns the value with that much mass to its left.
Interviewers love to ask which function answers a disguised question. "How many sigmas correspond to a 99% confidence level" is a PPF question. "What fraction of orders take longer than ten minutes" is a CDF question.
Python with scipy.stats.norm
In Python the normal lives in scipy.stats.norm. The three main methods mirror the three functions above.
from scipy import stats
# Standard normal N(0, 1)
dist = stats.norm(loc=0, scale=1)
# PDF -- density at a point
print(dist.pdf(0)) # 0.3989 -- peak of the bell
print(dist.pdf(1.96)) # 0.0584
# CDF -- P(X <= x)
print(dist.cdf(1.96)) # 0.975 -- 97.5% to the left
print(dist.cdf(-1.96)) # 0.025 -- 2.5% to the left
# PPF -- inverse CDF: probability -> x
print(dist.ppf(0.975)) # 1.96
print(dist.ppf(0.995)) # 2.576 -- 99.5% quantile, used for 99% CI
For any non-standard normal, set loc to mu and scale to sigma. A common gotcha: scale is the standard deviation, not the variance. People who learned the formula with sigma squared sometimes pass the variance, which silently corrupts every probability they compute.
# Heights: N(170, 10^2)
height = stats.norm(loc=170, scale=10)
# Probability of a height below 155 cm
print(height.cdf(155)) # 0.0668
# Height at the 90th percentile
print(height.ppf(0.90)) # 182.8 cm
# Probability of a height between 160 and 180 cm
print(height.cdf(180) - height.cdf(160)) # 0.6827
When data is not normal
The normal model is powerful but not universal. A senior analyst should know when it breaks down.
Skewed distributions are the most common counter-example. Revenue per user, average order value, and session duration are almost always right skewed: many small values and a long heavy tail. The lognormal often fits this kind of data better.
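To illustrate, the sketch below simulates a right-skewed metric from a lognormal distribution and shows that taking logs removes the skew. The parameters (mean 3.0 and sigma 1.0 on the log scale) and the variable name revenue are made up for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)

# Hypothetical revenue per user: lognormal, so log(revenue) is normal
revenue = rng.lognormal(mean=3.0, sigma=1.0, size=50_000)

print(stats.skew(revenue))                 # strongly positive
print(stats.skew(np.log(revenue)))         # near zero
print(revenue.mean(), np.median(revenue))  # mean sits well above the median
```

A mean far above the median is often the first visible hint of right skew in a dashboard.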
Count data is another case. Purchases per user, page views per session, or clicks per email are discrete non-negative integers, usually modeled with Poisson or negative binomial. Pretending count data is normal makes interval estimates symmetric around the mean even when the support starts at zero.
Binary outcomes are technically Bernoulli, but the sample proportion across a large group is approximately normal thanks to the CLT, which is why normal-based formulas work for conversion-rate confidence intervals at typical experiment sizes.
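A minimal sketch of the normal-approximation interval for a conversion rate follows. This is the simple Wald form, not the only option, and the counts (480 conversions out of 4,000 users) and the function name proportion_ci are hypothetical.

```python
import math
from scipy import stats

def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation (Wald) interval for a sample proportion."""
    p_hat = successes / n
    z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)  # 1.96 for 95%
    se = math.sqrt(p_hat * (1 - p_hat) / n)            # standard error of p_hat
    return p_hat - z_crit * se, p_hat + z_crit * se

# Hypothetical experiment arm: 480 conversions out of 4,000 users
low, high = proportion_ci(successes=480, n=4000)
print(round(low, 4), round(high, 4))  # 0.1099 0.1301
```

The approximation gets shaky when the expected number of conversions or non-conversions is small, which is the proportion version of "the sample is not large enough for the CLT".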
Heavy-tailed data is the most dangerous case. Financial returns, insurance losses, and viral content metrics have tails that decay much slower than the normal. The three-sigma rule no longer holds, and models built on normal assumptions will systematically underprice these risks. Check normality with histograms and Q-Q plots, and remember that most inference relies on normality of the average rather than the data.
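To get a feel for how much the three-sigma rule can understate risk, the sketch below compares the mass beyond three standard deviations under a normal and under a Student t with 3 degrees of freedom. The t distribution is only an assumed stand-in for heavy-tailed data, not a model the article proposes.

```python
from scipy import stats

normal = stats.norm()
heavy = stats.t(df=3)  # assumption: Student t with 3 df as a heavy-tailed example

def beyond_three_sigma(dist):
    """Two-sided probability of landing more than 3 standard deviations from the mean."""
    mean, sd = dist.mean(), dist.std()
    return dist.cdf(mean - 3 * sd) + dist.sf(mean + 3 * sd)

tail_normal = beyond_three_sigma(normal)
tail_heavy = beyond_three_sigma(heavy)

print(tail_normal)  # about 0.0027
print(tail_heavy)   # roughly 0.014, several times the normal figure
```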
Common pitfalls
The first trap is confusing the distribution of the data with the distribution of the sample mean. Candidates often declare a t-test invalid because the raw observations are not normal. That is almost always wrong at moderate sample sizes. The t-test cares about the distribution of the sample mean, which the CLT makes approximately normal. Anchor every normality claim to a specific quantity: individual values, the sample mean, or the difference of two sample means.
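One way to convince yourself is to simulate the test under a true null with clearly non-normal data and check the false-positive rate. The sketch below uses exponential data, 100 observations per group, and Welch's t-test; all of these settings are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=4)
reps, n = 5_000, 100

# Both groups come from the same skewed distribution, so the null is true
a = rng.exponential(scale=1.0, size=(reps, n))
b = rng.exponential(scale=1.0, size=(reps, n))

# One Welch t-test per row
result = stats.ttest_ind(a, b, axis=1, equal_var=False)
false_positive_rate = np.mean(result.pvalue < 0.05)

print(false_positive_rate)  # close to the nominal 0.05
```

If the rate lands near 5%, the test is behaving as advertised despite the skewed raw data.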
The second trap is treating the three-sigma rule as a universal law. It holds only under the normal model. On revenue, latency, or viral coefficients, three-sigma events happen weekly. If you use a z threshold of three to flag outliers on right-skewed data, you will flag many ordinary values in the long right tail and almost nothing on the left. Log-transform first, or switch to a robust method like the interquartile range.
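The sketch below shows how a z threshold of three behaves on right-skewed data before and after a log transform. The data is simulated from a lognormal, and the name latency and the helper z_flags are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Hypothetical right-skewed metric, e.g. request latency
latency = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

def z_flags(values, threshold=3.0):
    """Boolean mask of points whose absolute z-score exceeds the threshold."""
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

raw_rate = z_flags(latency).mean()
log_rate = z_flags(np.log(latency)).mean()

# On the raw scale, mean - 3 * sd is negative, so nothing on the low side can be flagged
low_side_flags = np.sum(latency < latency.mean() - 3 * latency.std())

print(raw_rate)        # far above the 0.27% a normal model would predict
print(log_rate)        # close to 0.27% after the log transform
print(low_side_flags)  # 0
```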
The third trap is passing variance instead of standard deviation into a function that expects sigma. Both scipy and numpy use scale for the standard deviation, but plenty of textbooks write the parameter as sigma squared. Sanity-check with the 68-95-99.7 rule: query the CDF at mu plus sigma and confirm the result is near 0.84.
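The sanity check takes two lines. In the sketch below, the heights example from earlier is built once correctly and once with the variance passed by mistake.

```python
from scipy import stats

mu, sigma = 170, 10

right = stats.norm(loc=mu, scale=sigma)       # scale is the standard deviation
wrong = stats.norm(loc=mu, scale=sigma ** 2)  # variance passed by mistake

print(right.cdf(mu + sigma))  # about 0.8413, as the 68-95-99.7 rule predicts
print(wrong.cdf(mu + sigma))  # about 0.5398, a clear sign something is off
```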
The fourth trap is forgetting that the PDF is not a probability. A common slip is to say "the probability that x equals 1.96 is 0.0584." For a continuous distribution that probability is exactly zero. What 0.0584 represents is a density. Interviewers at quant-heavy shops listen for this distinction.
Interview-style questions
Statistics interviews at companies like Meta, Amazon, and Snowflake recycle a small set of normal-distribution prompts.
What is the normal distribution and what are its main properties?
A continuous symmetric distribution shaped like a bell, fully determined by mean mu and standard deviation sigma. Key properties: symmetry around mu, mean equals median equals mode, the 68-95-99.7 rule, rapidly decaying tails, and sums of independent normals stay normal.
Explain the three-sigma rule.
Roughly 68% of observations fall within one sigma of the mean, 95% within two, and 99.7% within three. A normal observation beyond three sigma is so unlikely that it is usually flagged for inspection. The rule only holds under the normal model and breaks badly on heavy-tailed data.
What is a z-score and why is it useful?
A z-score is the number of standard deviations between an observation and the mean. It standardizes any normal into the standard normal, so one table or function call covers every case. Analysts use z-scores to compare values on different scales, compute tail probabilities, and flag outliers.
Why is the normal distribution so common in practice?
Largely because of the Central Limit Theorem. Quantities that are the sum or average of many small independent contributions tend to be approximately normal. Most inference is about sample averages, which become approximately normal at moderate sample sizes for any underlying distribution with finite variance.
Can you run a z-test if the data is not normal?
Often yes, provided the sample is large enough. The z-test cares about the normality of the test statistic, which by the CLT is approximately normal once each group has at least several dozen observations. For small samples with non-normal data, use a non-parametric alternative like Mann-Whitney U.
Related reading
- T-test vs z-test in statistics
- Confidence intervals for data science interviews
- How to calculate a confidence interval in SQL
- How to calculate IQR outliers in SQL
FAQ
What is the normal distribution in plain English?
A symmetric bell-shaped distribution defined by two numbers, the mean and the standard deviation. Most observations cluster near the mean, and the chance of an extreme value drops off very quickly. It is the default model in classical statistics because so many real-world quantities and almost every sample average end up looking approximately normal.
What does the 68-95-99.7 rule say?
Under a normal distribution, about 68% of values fall within one standard deviation of the mean, 95% within two, and 99.7% within three. Values beyond three standard deviations are rare enough that they are usually flagged as outlier candidates. The rule is only exact for normal data, so applying it to skewed metrics like revenue can be misleading.
What is a z-score and how is it used in analytics?
A z-score is the number of standard deviations an observation sits from the mean. Analysts use it to compare measurements on different scales, read tail probabilities from the standard normal table, and surface outliers in approximately normal data. It is the building block of confidence intervals and z-tests.
Can I use statistical tests when my data is not normally distributed?
Usually yes, as long as your sample is reasonably large. By the Central Limit Theorem, the sample mean is approximately normal even when the raw data is not, which is what most parametric tests actually require. For small samples with heavily skewed data, switch to a non-parametric test such as Mann-Whitney or Wilcoxon, or use a bootstrap interval.
How do I check if my data is approximately normal?
Start with a histogram for shape and a Q-Q plot for tail behavior. Add a numerical check: skewness near zero and kurtosis near three are consistent with normality. Formal tests like Shapiro-Wilk help on very small samples, but at large n they reject almost any real-world data.
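These checks can be sketched with scipy.stats on a simulated sample. One detail worth knowing: stats.kurtosis returns excess kurtosis by default (normal gives 0), so fisher=False is needed to get the "near three" convention used above. The sample here is synthetic and only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=5)
sample = rng.normal(loc=0.0, scale=1.0, size=500)  # synthetic data for illustration

skewness = stats.skew(sample)
kurt = stats.kurtosis(sample, fisher=False)  # Pearson definition: normal data gives about 3
print(skewness, kurt)

# Shapiro-Wilk: a small p-value is evidence against normality
shapiro_stat, shapiro_p = stats.shapiro(sample)
print(shapiro_stat, shapiro_p)

# Q-Q plot ingredients without plotting: r near 1 means the points hug the line
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(sample, dist="norm")
print(r)
```

To draw the Q-Q plot itself, stats.probplot accepts a plot argument such as a matplotlib axes object.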