Standard deviation explained simply
Why standard deviation matters
Standard deviation, written SD or σ, is the base statistic for describing spread. Together with the mean it gives a first picture of a numeric variable: where the center is and how wide the cloud around it is. Without SD you cannot build a confidence interval, size an A/B test, or decide whether a value is "weirdly high" or normal noise.
For a junior or mid-level data analyst interview at Google, Meta, Stripe, or Airbnb, SD shows up in almost every loop. Common openers: "what is the difference between SD and variance", "explain the 68-95-99.7 rule", "how do you use SD when sizing an A/B test". Candidates who can only recite the formula get filtered. Candidates who can sketch the intuition, write the SQL, and connect SD to sample size pass.
The intuition in one paragraph
SD measures how far values typically sit from the mean. Small SD means the data is packed close to the center. Large SD means the data is spread out. Two datasets with the same mean can tell very different stories: a mean salary of 100k with SD 5k describes a narrow band of senior ICs at one company, while 100k with SD 40k describes a mix of interns and staff engineers. Always show both.
The formula, worked out
Take five salaries: 50, 60, 70, 80, 90 (in thousands of USD). The mean is 70. The deviations from the mean are -20, -10, 0, 10, 20. Square them so positives and negatives do not cancel: 400, 100, 0, 100, 400. The average of those squares is the variance. For a population, divide by n = 5 to get 200. The SD is the square root, about 14.14k. Interpretation: salaries typically deviate from the mean by about 14k (strictly a root-mean-square deviation), in the same units as the original data.
For a sample, divide by n-1 instead of n. The same numbers give 250 for variance and about 15.81k for SD. The bump from n to n-1 is Bessel's correction and it is the default in nearly every analytics tool.
variance = sum( (x_i - mean)^2 ) / (n - 1) # for a sample
SD = sqrt(variance)
Squaring is what makes SD sensitive to outliers: a single value 5 SDs from the mean contributes 25 times as much to variance as a value 1 SD away. That is why one billionaire wrecks the SD of household income.
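The arithmetic above can be reproduced step by step. This is a minimal sketch in plain Python using only the standard library; the variable names are illustrative, and the data is the five-salary example from the text.

```python
import math

salaries = [50, 60, 70, 80, 90]  # thousands of USD, from the worked example

n = len(salaries)
mean = sum(salaries) / n                       # 70.0
deviations = [x - mean for x in salaries]      # -20, -10, 0, 10, 20
squared = [d ** 2 for d in deviations]         # 400, 100, 0, 100, 400

var_population = sum(squared) / n              # 200.0, divides by n
var_sample = sum(squared) / (n - 1)            # 250.0, divides by n - 1
sd_population = math.sqrt(var_population)      # about 14.14
sd_sample = math.sqrt(var_sample)              # about 15.81

# Outlier sensitivity: a point k SDs from the mean adds (k * SD)^2 to the
# sum of squares, so 5 SDs away contributes 25 times what 1 SD away does.
contribution_ratio = (5 * sd_sample) ** 2 / (1 * sd_sample) ** 2
```

The only difference between the two versions is the divisor; everything before the division is shared.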
SD vs variance
Variance and SD carry the same information. The only practical difference is units. Variance is in squared units, so for salaries in thousands of USD the variance comes out in "thousands squared" which means nothing to a human. SD is in the original units, so "salaries vary by about 14k from the mean" is something you can put in a deck.
Variance is the natural quantity for the math. It is additive for independent variables and plugs into the central limit theorem cleanly. SD is the natural quantity for reporting. The pipeline is almost always: compute variance, take the square root at the end, report SD. Squaring deviations instead of taking absolute values penalizes large deviations more and gives a differentiable, convex loss with closed-form solutions.
The 68-95-99.7 rule
For a normal distribution, about 68 percent of values sit within one SD of the mean, about 95 percent within two SDs (exactly 95 percent falls within 1.96 SDs), and about 99.7 percent within three SDs. This empirical rule is the fastest mental check on whether a value is unusual.
Worked example. Checkout values on an ecommerce site: mean 1000, SD 200. Under the empirical rule, 68 percent of orders fall between 800 and 1200, 95 percent between 600 and 1400, and 99.7 percent between 400 and 1600. An 1800 dollar order sits 4 SDs above the mean. Under a normal model, an order that high or higher has a probability of roughly 1 in 31,600 (about 1 in 15,800 for a deviation that large in either direction). Likely an outlier, possibly a fraud signal.
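The checkout numbers can be checked with the standard library's `statistics.NormalDist`. This is a sketch that takes the normal model at face value, which is an assumption rather than something the data guarantees; `share_within` is a hypothetical helper name.

```python
from statistics import NormalDist

orders = NormalDist(mu=1000, sigma=200)  # assumed normal model for order value

def share_within(k):
    """Fraction of the distribution within k SDs of the mean."""
    low = orders.mean - k * orders.stdev
    high = orders.mean + k * orders.stdev
    return orders.cdf(high) - orders.cdf(low)

within_1, within_2, within_3 = (share_within(k) for k in (1, 2, 3))
# about 0.6827, 0.9545, 0.9973

z = (1800 - orders.mean) / orders.stdev   # 4.0 SDs above the mean
upper_tail = 1 - orders.cdf(1800)         # about 3.2e-05, roughly 1 in 31,600
two_sided = 2 * upper_tail                # about 6.3e-05, roughly 1 in 15,800
```

Whether to quote the one-sided or the two-sided figure depends on the question: "this high or higher" is one-sided, "this far from the mean in either direction" is two-sided.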
The catch is "normal distribution". Empirical revenue distributions are rarely normal: they are right-skewed and bounded at zero. Applied to such data, the empirical rule misstates how much mass sits near the center and understates how far the upper tail reaches. Always plot the distribution before quoting the rule. For heavy-tailed metrics use percentiles or a log transform.
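One way to see the problem is to simulate a right-skewed metric and compare it with what the rule predicts. The sketch below uses lognormal draws as a stand-in for revenue; the data is synthetic and the distribution parameters are arbitrary assumptions, not a model of any real product.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable
# Hypothetical right-skewed "revenue": lognormal draws, not real orders.
revenue = [random.lognormvariate(0, 1) for _ in range(50_000)]

mean = statistics.fmean(revenue)
sd = statistics.stdev(revenue)          # sample SD, divides by n - 1
median = statistics.median(revenue)

# Share of observations within one SD of the mean.
# The empirical rule would predict about 0.68 for normal data.
within_1sd = sum(mean - sd <= x <= mean + sd for x in revenue) / len(revenue)

# Percentile summary: quantiles(n=100) returns the 99 cut points p1..p99.
cuts = statistics.quantiles(revenue, n=100)
p50, p90, p99 = cuts[49], cuts[89], cuts[98]

print(f"mean={mean:.2f} median={median:.2f} sd={sd:.2f}")
print(f"within 1 SD: {within_1sd:.1%}  p50={p50:.2f} p90={p90:.2f} p99={p99:.2f}")
```

On data like this the mean sits well above the median and the one-SD band holds far more than 68 percent of the points, so the percentile summary is the more honest description.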
Sample vs population (n vs n-1)
A population is the full set: every employee, every user, every order. For a population you divide the sum of squared deviations by N. A sample is a slice: last week's orders, one department, an A/B test bucket. For a sample you divide by n minus 1.
The reason is bias. When you estimate population variance from a sample, using the sample mean (itself estimated from the same data) systematically underestimates spread. Dividing by n-1 corrects for that. The expected value of the n-1 version equals the true population variance. This is Bessel's correction.
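The bias can be shown exactly, without simulation, by enumerating every possible sample. The sketch below treats the five salaries as the whole population and draws samples of size 3 with replacement (an assumption made so that the draws are independent); the sample size is arbitrary.

```python
from itertools import product
from statistics import pvariance, variance

population = [50, 60, 70, 80, 90]
true_variance = pvariance(population)   # 200, divides by N

# Every ordered sample of size 3 drawn with replacement (5^3 = 125 samples).
samples = list(product(population, repeat=3))

# Average the two estimators over all possible samples.
avg_divide_by_n = sum(pvariance(s) for s in samples) / len(samples)
avg_divide_by_n_minus_1 = sum(variance(s) for s in samples) / len(samples)

print(true_variance, avg_divide_by_n, avg_divide_by_n_minus_1)
```

Averaged over all samples, the divide-by-n version comes out at two thirds of the true variance, which is (n-1)/n for n = 3, while the n-1 version matches the true variance.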
In analytics you almost always have a sample. Even the "full" users table is a sample from the population of users who could have shown up. Default to n-1. NumPy defaults to n (population) and you have to pass ddof=1 to get the sample version. Pandas, R, and most stats libraries default to n-1. Mixing up which default applies to which tool is a routine source of off-by-one bugs in interview live coding.
SD in Python and SQL
import numpy as np
import pandas as pd
data = [50, 60, 70, 80, 90]
np.std(data) # 14.14 -- population, divides by n
np.std(data, ddof=1) # 15.81 -- sample, divides by n - 1
pd.Series(data).std() # 15.81 -- pandas default is sample
pd.Series(data).std(ddof=0) # 14.14 -- population if you ask for it
-- Postgres, BigQuery, Snowflake all support both
SELECT
STDDEV_SAMP(salary) AS sd_sample, -- divides by n - 1
STDDEV_POP(salary) AS sd_population, -- divides by n
VAR_SAMP(salary) AS var_sample,
AVG(salary) AS mean_salary
FROM employees;
-- In Postgres, plain STDDEV() is an alias for STDDEV_SAMP
SELECT STDDEV(salary) FROM employees;
When pasting results into a dashboard or a writeup, label the version. A reviewer who sees "SD = 14.14" with no context will assume it is the sample SD and may only catch the discrepancy on a second look. Better to write "sample SD = 15.81 (n=5)" so the math is reproducible.
How SD drives A/B testing
The standard error of the mean is SE = SD / sqrt(n). Smaller SE means a more precise estimate of the mean. SE shrinks in proportion to 1 / sqrt(n), so doubling your sample cuts SE by only about 29 percent (a factor of 1 / sqrt(2) ≈ 0.71), and halving SE takes four times the sample. This is the core reason A/B tests need large sample sizes.
A 95 percent confidence interval for the mean is mean plus or minus 1.96 times SE. If a treatment lifts revenue per user by 1.20 dollars and SE is 0.40, the 95 percent CI is roughly [0.42, 1.98] and the result is significant. If SE is 0.80, the CI is roughly [-0.37, 2.77] and the result is not significant despite the same point estimate. SD drives SE, SE drives CI width, and CI width drives whether a lift "ships" or not.
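The two intervals above follow directly from the formulas. This is a minimal sketch using the normal approximation with z = 1.96; the function names are hypothetical, and the SD of 10 in the last line is an arbitrary illustrative value.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: SD / sqrt(n)."""
    return sd / math.sqrt(n)

def ci95(estimate, se, z=1.96):
    """Normal-approximation 95 percent interval: estimate +/- z * SE."""
    return estimate - z * se, estimate + z * se

def excludes_zero(interval):
    """True when the whole interval sits on one side of zero."""
    low, high = interval
    return low > 0 or high < 0

tight = ci95(1.20, 0.40)   # about (0.42, 1.98), excludes zero
wide = ci95(1.20, 0.80)    # about (-0.37, 2.77), includes zero

# Doubling n shrinks SE by a factor of 1 / sqrt(2), about 0.71.
shrink = standard_error(10, 2000) / standard_error(10, 1000)
```

Same point estimate, different SE, opposite decision: only the first interval excludes zero.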
For sample sizing, n is proportional to SD^2 / MDE^2. Doubling the MDE quarters the sample. Doubling the SD quadruples it. This is why variance reduction techniques like CUPED matter: cutting SD by 30 percent cuts required sample size by about half (0.7^2 ≈ 0.49). For a Bernoulli outcome the variance is p(1-p), so SD is sqrt(p(1-p)), which is the form behind sample size formulas for conversion-rate experiments.
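To make the proportionality concrete, the sketch below plugs it into one common sizing formula: a two-sided z-test on a difference in means with equal arms, n per arm = 2 (z_alpha/2 + z_beta)^2 SD^2 / MDE^2. The formula choice, the defaults of alpha = 0.05 and power = 0.80, the function name, and every input value are assumptions for illustration, not something the text specifies.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(sd, mde, alpha=0.05, power=0.80):
    """Approximate n per arm for a two-sided z-test on a difference in means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96
    z_beta = NormalDist().inv_cdf(power)            # about 0.84
    return 2 * (z_alpha + z_beta) ** 2 * sd ** 2 / mde ** 2

base = sample_size_per_arm(sd=20, mde=1)
double_mde = sample_size_per_arm(sd=20, mde=2)   # a quarter of base
double_sd = sample_size_per_arm(sd=40, mde=1)    # four times base
sd_cut_30 = sample_size_per_arm(sd=14, mde=1)    # 0.49 times base

# Bernoulli outcome: SD = sqrt(p * (1 - p)); p and the MDE are hypothetical.
p = 0.10
sd_conversion = math.sqrt(p * (1 - p))           # about 0.3
n_conversion = math.ceil(sample_size_per_arm(sd_conversion, mde=0.01))
```

The constants in front change with alpha and power, but the ratios between scenarios depend only on SD and MDE, which is the point of the proportionality.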
Coefficient of variation
CV is SD divided by mean. It expresses spread as a fraction of the center, letting you compare datasets with different scales. Salaries of 50 plus or minus 10 (CV 0.2) are more variable in relative terms than prices of 1000 plus or minus 100 (CV 0.1), even though absolute spread is larger for prices.
CV is useful for benchmarking variability across metrics or teams. Revenue per user on a B2B SaaS product might have CV 1.5 (long tail of enterprise deals); revenue per session on consumer ecommerce might have CV 0.6. CV breaks down when the mean is near zero or negative, so use it for strictly positive metrics.
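A small helper makes the positivity requirement explicit. This is a sketch with a hypothetical function name; it reuses the salary and price figures from the comparison above.

```python
def coefficient_of_variation(sd, mean):
    """SD divided by mean; refuses a mean at or below zero."""
    if mean <= 0:
        raise ValueError("CV is only meaningful for a strictly positive mean")
    return sd / mean

cv_salaries = coefficient_of_variation(10, 50)     # 0.2
cv_prices = coefficient_of_variation(100, 1000)    # 0.1

# Salaries vary more in relative terms even though prices have the larger SD.
more_variable = "salaries" if cv_salaries > cv_prices else "prices"
```

Raising an error instead of returning a huge or negative ratio keeps a near-zero mean from silently producing a misleading number on a dashboard.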
Common pitfalls
Confusing SD with variance is the most common stumble. Variance has squared units and is not directly interpretable. SD is in the original units. If your answer to "by how much do users vary in revenue" is a number in squared dollars, you have answered the wrong question. Practice flipping between the two and always report SD when the audience is a human.
Using n instead of n-1 on a sample is a quiet bug. The numerical difference is small for large samples but it compounds in chained calculations and gets flagged on careful review. Default to ddof=1 in NumPy and STDDEV_SAMP in SQL, and state which version you used. In an interview, name-drop Bessel's correction when you choose n-1.
Applying the 68-95-99.7 rule to a non-normal distribution is the third trap. Right-skewed revenue, heavy-tailed click counts, and bounded conversion rates all break the rule. Plot the distribution first. If it is skewed, switch to percentile-based summaries (p50, p90, p99) or apply a log transform before quoting empirical-rule numbers.
Reporting SD without the mean is the fourth pitfall. SD on its own does not tell you whether the data is centered at 5 or at 5000. Always pair the two. Lead with mean and immediately follow with SD or SE.
The fifth pitfall is forgetting that SD does not "add". If X has SD 10 and Y has SD 15 and the two are independent, the SD of X + Y is not 25. Variances add for independent variables: Var(X+Y) = Var(X) + Var(Y) = 100 + 225 = 325, so the SD of the sum is about 18. Pooling two groups works the same way: average the group variances (weighted by degrees of freedom) first, then take the square root. Summing SDs directly produces wrong sample size numbers.
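The two correct ways of combining SDs can be sketched in a few lines. The group sizes of 50 are hypothetical, and `pooled_sd` uses the usual degrees-of-freedom weighting, which measures spread around each group's own mean.

```python
import math

sd_a, sd_b = 10, 15

naive = sd_a + sd_b                            # 25, the wrong answer

# SD of the sum of two independent variables: add variances, then root.
sd_of_sum = math.sqrt(sd_a ** 2 + sd_b ** 2)   # about 18.03

def pooled_sd(sd_1, n_1, sd_2, n_2):
    """Within-group SD of two samples, weighting variances by n - 1."""
    pooled_var = ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / (n_1 + n_2 - 2)
    return math.sqrt(pooled_var)

pooled = pooled_sd(sd_a, 50, sd_b, 50)         # about 12.75
```

Neither result is 25. The SD of a sum is larger than either input, while a pooled SD always lands between the two group SDs.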
Interview cheatsheet
"What is SD?" The square root of variance — the average distance from the mean in the original units.
"Why divide by n-1 for a sample?" The sample mean is estimated from the same data, so using n underestimates population variance. The n-1 version is unbiased. This is Bessel's correction.
"What is the 68-95-99.7 rule?" For a normal distribution, the fractions of mass within 1, 2, and 3 SDs of the mean. Sanity check only; do not apply to skewed distributions.
"How does SD relate to A/B sample size?" n ∝ SD² / MDE². Cutting SD in half cuts required sample size by a factor of four. This is why variance reduction techniques like CUPED matter.
"SD vs SE?" SD describes spread of individual data points. SE describes spread of the sample mean. SE = SD / sqrt(n).
Related reading
- Variance and standard deviation
- Normal distribution explained simply
- Median explained simply
- How to calculate confidence interval in SQL
- CUPED variance reduction in A/B testing
FAQ
Does SD work for skewed distributions?
Mathematically yes: you can always compute it. Interpretively no, because the intuition that "most of the mass is within 1-2 SDs" assumes a roughly symmetric, bell-shaped distribution. For right-skewed metrics like revenue or session duration, SD will be inflated by the long tail and the empirical rule will not apply. Use interquartile range or percentile cuts for those metrics, or log-transform first and quote SD on the log scale.
Can I add standard deviations directly?
No. Variances of independent random variables add: Var(X + Y) = Var(X) + Var(Y). SD does not. To get the SD of a sum or difference of independent quantities, add the variances and take the square root. To pool two groups into one within-group SD, take the weighted average of the group variances and then the square root. A common mistake in sample size calculations is to sum the SDs of the two arms; the correct move is to sum variances and root the result.
What does it mean if SD equals zero?
Every value in the dataset is identical. In practice this happens with constants, with a single observation (population SD is 0, while sample SD is undefined because n - 1 = 0 and typically comes back as NaN or NULL), or when you accidentally group by the column you were trying to compute SD over. If you see SD = 0 in production, investigate the input before trusting any downstream confidence interval.
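The edge cases are easy to reproduce with the standard library's `statistics` module. This sketch shows its behavior only; other tools may signal the undefined single-observation case differently.

```python
import statistics

constant = [7, 7, 7, 7]
sd_constant_pop = statistics.pstdev(constant)     # 0.0
sd_constant_sample = statistics.stdev(constant)   # 0.0

single = [7]
sd_single_pop = statistics.pstdev(single)         # 0.0, divides by n = 1

# The sample SD of one observation would divide by n - 1 = 0,
# so the statistics module raises instead of returning a number.
try:
    statistics.stdev(single)
    single_sample_sd = "defined"
except statistics.StatisticsError:
    single_sample_sd = "undefined"
```

A zero from a single-row group is therefore a statement about the formula, not about the data, which is another reason to check group sizes before trusting the number.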
How do I compare SDs across metrics with different units?
Use the coefficient of variation: SD divided by the mean. CV is dimensionless, so you can put revenue, session count, and time-on-site on the same axis. CV breaks down when the mean is near zero or negative, so reserve it for strictly positive metrics.
Sample SD vs population SD — which do I report?
Sample SD with n-1 unless you genuinely have the entire population. In analytics, even a "complete" table is a sample from the population of users or sessions that could have shown up. Default to sample SD, label it as such, and only switch to population SD when the prompt is something like "compute the SD of these five specific values, treating them as the whole world".
How does SD interact with CUPED and variance reduction?
CUPED uses a pre-experiment covariate to subtract predictable variance from the response variable. The residual has lower SD, which means smaller SE for a fixed n, which means tighter confidence intervals and shorter experiments. A 30-40 percent SD reduction is common on metrics with strong pre-period signal, and because required sample size scales with SD squared, it cuts the required sample by roughly half to two thirds.
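The adjustment can be sketched on synthetic data. The version below uses the common form adjusted = Y - theta * (X - mean(X)) with theta = Cov(X, Y) / Var(X); the simulated pre-period and post-period values, their parameters, and all variable names are assumptions for illustration, and a real experiment would usually estimate theta on data pooled across arms.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable
n = 5_000
# Hypothetical data: a pre-period metric and a correlated in-experiment metric.
pre = [random.gauss(100, 20) for _ in range(n)]
post = [0.8 * x + random.gauss(0, 10) for x in pre]

mean_pre = statistics.fmean(pre)
mean_post = statistics.fmean(post)
cov = sum((x - mean_pre) * (y - mean_post) for x, y in zip(pre, post)) / (n - 1)
theta = cov / statistics.variance(pre)

# CUPED-style adjustment: remove the part of post explained by pre.
adjusted = [y - theta * (x - mean_pre) for x, y in zip(pre, post)]

sd_raw = statistics.stdev(post)
sd_cuped = statistics.stdev(adjusted)
corr = cov / (statistics.stdev(pre) * sd_raw)

print(f"SD raw={sd_raw:.2f}  SD adjusted={sd_cuped:.2f}  corr={corr:.2f}")
```

The adjusted metric keeps the same mean but its SD shrinks by a factor of sqrt(1 - corr^2), so the stronger the pre-period correlation, the larger the saving in sample size.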