Confidence intervals for the data science interview
Why CIs show up in every DS loop
Walk into any senior data science loop at Stripe, Netflix, Airbnb, or DoorDash and at least one question will hinge on confidence intervals. The hiring manager at Databricks asks "your A/B variant beat control by 3 percent, what is the 95 percent CI on the lift." The staff scientist at Anthropic gives you a 50-row sample of revenue per user and asks you to defend a bootstrap interval on the median. They all want the same signal: do you understand what an interval really represents, or do you treat 1.96 as a magic number from a textbook.
Confidence intervals are the daily currency of analytics and experimentation. When a PM at Linear asks "what is the conversion rate of variant B," the right answer is "5.4 percent, with a 95 percent CI of 4.8 to 6.1 percent," not a bare "5.4 percent." The interval tells the reader whether the decision is safe at the current sample size or whether you need another week of traffic. The trap is that most candidates can write the formula but cannot interpret the result correctly, and a senior interviewer will catch sloppy phrasing in two seconds.
What a confidence interval actually is
A 95 percent confidence interval is a range of values for a parameter, computed from data, with the property that the procedure used to build it would capture the true parameter in roughly 95 of every 100 repeated experiments. The interval is random; the parameter is fixed. That phrasing is the only one that is mathematically defensible at a whiteboard.
The interval is not "a 95 percent probability that the true parameter sits in this range." That second phrasing describes a Bayesian credible interval and requires a prior. Frequentist intervals do not assign probability to the parameter; they assign coverage to the procedure. In an interview, reach for the precise version first and soften it only if asked for a layperson translation.
The width of a CI is driven by three things: the confidence level (95 percent is the default, 99 widens by roughly 30 percent), the variability of the data, and the sample size (the standard error shrinks as sqrt(n) grows). Doubling the sample size narrows the CI by about 30 percent, not 50 percent — a fact every senior DS should be able to quote without computing.
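The two scaling facts above are easy to verify numerically. The following is a minimal sketch using only the Python standard library; `z_multiplier` is a helper name introduced here for illustration, not something from a particular package.

```python
from math import sqrt
from statistics import NormalDist


def z_multiplier(confidence: float) -> float:
    """Two-sided standard-normal multiplier for a confidence level in (0, 1)."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


z95 = z_multiplier(0.95)
z99 = z_multiplier(0.99)

# Relative widening when moving from a 95 to a 99 percent interval.
widen_99_vs_95 = z99 / z95 - 1

# Relative narrowing when n doubles, because SE scales as 1 / sqrt(n).
narrow_on_double_n = 1 - 1 / sqrt(2)

print(f"z95 = {z95:.3f}, z99 = {z99:.3f}")
print(f"99 vs 95 percent: {widen_99_vs_95:.1%} wider")
print(f"doubling n: {narrow_on_double_n:.1%} narrower")
```

Both ratios hold for any normal-approximation interval, whatever the mean or standard deviation of the data.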
Parametric CI
The parametric CI assumes the sampling distribution of your statistic is approximately normal. For a mean with large n, the central limit theorem makes this work whether the underlying data is normal or not. The closed-form interval is the workhorse you will write on the whiteboard:
CI = mean ± z * SE
SE = stddev / sqrt(n)
Here z = 1.96 for 95 percent confidence, 1.645 for 90, and 2.576 for 99. In Python you grab the multiplier and the interval in three lines:
import numpy as np
import scipy.stats as stats
# data: 1-D NumPy array of observations, assumed to be defined already
mean = data.mean()
se = data.std(ddof=1) / np.sqrt(len(data))  # ddof=1 divides by n - 1
ci = stats.norm.interval(0.95, loc=mean, scale=se)  # (lower, upper)
ddof=1 gives the sample standard deviation, which divides by n - 1 (Bessel's correction). NumPy defaults to ddof=0, the population version, which is wrong when your data is a sample. The mistake is small for n > 1000 and meaningful for n < 100; interviewers ask about it.
For small samples — under 30 observations — swap the normal multiplier for a t-distribution multiplier. The t-quantile at 95 percent for n = 10 is roughly 2.26, widening the interval by 15 percent versus the naive 1.96. Hard-coding 1.96 on a 12-row segment produces an interval that is too narrow and a recommendation that quietly oversells precision.
t_mult = stats.t.ppf(0.975, df=len(data) - 1)
ci = (mean - t_mult * se, mean + t_mult * se)
For proportions (conversion rates, opt-in rates, click-through rates), use the binomial standard error and switch to Wilson when the proportion is near zero or one. Wald with p = 0.02 on n = 50 returns a lower bound below zero, which is the kind of result that ends interview rounds.
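The Wald failure at p = 0.02 and n = 50 can be reproduced directly. The sketch below implements both intervals from their textbook formulas using only the standard library; the function names `wald_interval` and `wilson_interval` are illustrative and do not come from a specific package.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(successes, n, confidence=0.95):
    """Normal-approximation (Wald) interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval; stays inside [0, 1] by construction."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# One conversion out of 50 visitors, i.e. p = 0.02.
wald_lo, wald_hi = wald_interval(1, 50)
wilson_lo, wilson_hi = wilson_interval(1, 50)

print(f"Wald:   [{wald_lo:.4f}, {wald_hi:.4f}]")
print(f"Wilson: [{wilson_lo:.4f}, {wilson_hi:.4f}]")
```

The Wald lower bound comes out negative, while the Wilson bounds stay strictly between zero and one.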
Bootstrap CI
Bootstrap is the non-parametric tool you reach for when the distributional assumption breaks. Revenue per user is heavy-tailed. The median, the 90th percentile, and AUC have no clean closed-form CI. Bootstrap handles all of them with the same three-step recipe.
Resample the data with replacement to build B synthetic samples of the original size. Compute the statistic on each resample. Read off the 2.5th and 97.5th percentiles of the resulting distribution as the 95 percent CI bounds. B between 1,000 and 10,000 is standard.
import numpy as np
rng = np.random.default_rng(seed=42)
boot_stats = np.empty(10000)
for i in range(10000):
    # resample with replacement, same size as the original sample
    sample = rng.choice(data, size=len(data), replace=True)
    boot_stats[i] = np.mean(sample)
ci = np.percentile(boot_stats, [2.5, 97.5])  # 95 percent percentile interval
Bootstrap shines on three fronts. It works for any statistic: means, medians, ratios, AUC, Gini, anything you can compute on a sample. It makes no distributional assumption beyond "the sample is representative of the population." And it generalizes to functions of multiple variables; a CI on revenue / users is a one-line tweak of the snippet above.
The trade-offs are worth naming. Bootstrap is computationally expensive: 10,000 resamples of a one-million-row table needs vectorization, not a Python for loop. It tends to understate the variance of a heavy-tailed mean when the tail is undersampled; the bias-corrected and accelerated (BCa) variant corrects for bias and skew in the bootstrap distribution, though it cannot recover tail observations the sample never saw. Naive bootstrap is wrong for dependent data; block bootstrap is the patch for time series and is asked about at staff-level loops.
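One way to take the Python loop out of the resampling step is to draw a whole matrix of indices at once and reduce along an axis. This is a minimal sketch that assumes NumPy is available; the function name `bootstrap_mean_ci` and the `chunk` parameter are introduced here for illustration, and the lognormal demo data is synthetic.

```python
import numpy as np


def bootstrap_mean_ci(data, n_boot=10_000, chunk=1_000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean, resampling in vectorized chunks."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    n = data.size
    boot_means = np.empty(n_boot)
    for start in range(0, n_boot, chunk):
        stop = min(start + chunk, n_boot)
        # Each row of idx is one resample, drawn with replacement.
        idx = rng.integers(0, n, size=(stop - start, n))
        boot_means[start:stop] = data[idx].mean(axis=1)
    lo, hi = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)


# Synthetic heavy-tailed "revenue per user" sample, for demonstration only.
sample = np.random.default_rng(0).lognormal(mean=3.0, sigma=1.0, size=500)
lo, hi = bootstrap_mean_ci(sample)
print(f"mean = {sample.mean():.2f}, 95 percent CI = [{lo:.2f}, {hi:.2f}]")
```

Each chunk allocates a `chunk` by `n` index matrix, so on a very large table the chunk size has to shrink to fit in memory.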
Interpretation that wins points
The single best thing you can do in an interview is phrase the CI precisely. A 95 percent CI of [10, 14] means: the procedure that produced this interval would, over many repeated experiments, capture the true parameter in roughly 95 percent of those experiments. It is a property of the procedure, not of this specific interval.
The wrong phrasing is "there is a 95 percent probability that the true parameter is between 10 and 14." That sentence treats the parameter as random, which is a Bayesian framing and requires a prior. Frequentists treat the parameter as a fixed unknown number, and probability statements about it are not well defined.
In a dashboard for a product manager, the loose phrasing is fine and arguably better — your reader hears "the answer is somewhere around 12 plus or minus 2." In an interview, the loose phrasing flags you as someone who has not thought carefully about the framework. Reach for the strict version first, then offer the loose translation if the interviewer signals they want it.
A second precision trap: do not say "this interval contains the true mean." Say "this interval was constructed by a procedure with 95 percent coverage." The first phrasing tempts the reader into assigning probability to the interval itself, which is exactly the Bayesian misreading you are trying to avoid.
Common pitfalls
The most common interview failure is applying a normal-approximation CI to skewed data. The mean of 10,000 purchases is roughly normal by the central limit theorem, but on n = 100 with one whale representing 30 percent of revenue, the standard error explodes and the normal approximation can no longer be trusted. The diagnostic is the ratio of standard deviation to mean (the coefficient of variation); when that ratio crosses 2 or 3, switch to bootstrap on a robust statistic like the trimmed mean or the median.
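The whale scenario can be turned into a quick check. The sketch below builds a hypothetical sample of 99 identical purchases plus one whale sized to be 30 percent of revenue, then computes the ratio of standard deviation to mean; the purchase values are made up for illustration.

```python
from statistics import mean, stdev

# Hypothetical sample: 99 ordinary purchases plus one whale sized to be
# 30 percent of total revenue.
ordinary = [10.0] * 99
whale = sum(ordinary) * 0.30 / 0.70
purchases = ordinary + [whale]

whale_share = whale / sum(purchases)
cv = stdev(purchases) / mean(purchases)  # stdev divides by n - 1

# Threshold of 2 follows the rule of thumb in the text above.
use_bootstrap = cv > 2

print(f"whale share = {whale_share:.0%}, sd / mean = {cv:.2f}")
print("switch to bootstrap on a robust statistic" if use_bootstrap
      else "normal approximation is plausible")
```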
The second trap is treating overlap between two CIs as the significance test. The rule is asymmetric: for independent samples, non-overlapping 95 percent CIs do imply the difference is significant well below the 5 percent level, but overlapping CIs can still hide a significant difference. The overlap check implicitly compares the gap to z * (SE1 + SE2), while the standard error of the difference between two independent means is sqrt(SE1^2 + SE2^2), which is always smaller. For A/B comparisons, compute the CI on the delta directly rather than eyeballing the two-bar chart.
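A small numeric case shows how the two checks can disagree. The summary statistics below are hypothetical and chosen so that the per-arm intervals overlap while the interval on the difference excludes zero; the arms are assumed to be independent.

```python
from math import sqrt
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)

# Hypothetical summary statistics for two independent arms.
mean_a, se_a = 10.0, 1.0
mean_b, se_b = 13.0, 1.0

# Per-arm 95 percent intervals, as drawn on a two-bar chart.
ci_a = (mean_a - z * se_a, mean_a + z * se_a)
ci_b = (mean_b - z * se_b, mean_b + z * se_b)
intervals_overlap = ci_a[1] >= ci_b[0] and ci_b[1] >= ci_a[0]

# Interval on the difference, using the SE of a difference of independent means.
delta = mean_b - mean_a
se_delta = sqrt(se_a**2 + se_b**2)
ci_delta = (delta - z * se_delta, delta + z * se_delta)
delta_excludes_zero = ci_delta[0] > 0 or ci_delta[1] < 0

print(f"overlap: {intervals_overlap}, delta CI: "
      f"[{ci_delta[0]:.2f}, {ci_delta[1]:.2f}], "
      f"excludes zero: {delta_excludes_zero}")
```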
The third trap is hard-coding 1.96 for any sample size. For n < 30, the t-multiplier is the honest answer. For proportions near zero or one, Wald produces bounds outside [0, 1] and you should reach for Wilson. For the median or other non-mean statistics, the simple mean ± z * SE formula does not apply, and bootstrap is usually the most practical tool. Memorizing "1.96 for 95 percent" without knowing the boundary conditions is a senior-level red flag.
The fourth trap is confusing confidence with credibility on the whiteboard. Senior interviewers at Meta, Google, and Apple deliberately phrase questions so the candidate has a chance to slip between frequentist and Bayesian framings. Each framework is internally consistent; mixing them produces nonsense. If you build a frequentist CI, describe it in frequentist terms.
The fifth trap is ignoring sample dependence. Web analytics often counts events, not users — a single power user can contribute 200 sessions to a 1,000-row sample, and the effective sample size is much smaller than the row count. The naive CI is too narrow and the recommendation it backs is overconfident. Aggregate to one observation per user before computing the CI, or compute cluster-robust standard errors.
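The gap between the row-level and user-level standard errors can be illustrated with a toy log. The data below is entirely hypothetical: one power user owns 200 of 1,000 session rows and every user's sessions share the same outcome, which is the extreme case of within-user dependence.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev

# Hypothetical session log of (user_id, converted). One power user owns
# 200 of the 1,000 rows; 40 other users own 20 rows each.
rows = [("power", 1.0)] * 200
for j in range(40):
    rows += [(f"user_{j}", float(j % 2))] * 20

# Naive: treat every session row as an independent observation.
values = [v for _, v in rows]
naive_se = stdev(values) / sqrt(len(values))

# User-level: collapse to one observation per user before computing the SE.
by_user = defaultdict(list)
for user_id, v in rows:
    by_user[user_id].append(v)
per_user = [mean(vs) for vs in by_user.values()]
user_se = stdev(per_user) / sqrt(len(per_user))

print(f"rows = {len(rows)}, users = {len(per_user)}")
print(f"naive SE = {naive_se:.4f}, user-level SE = {user_se:.4f}")
```

Aggregating also changes the estimand from a per-session rate to a per-user average, so the point estimate shifts as well as the interval width.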
Where this shows up in production
A/B test platforms at Microsoft, Apple, and Uber attach a CI to every metric that ships to the experiment dashboard. The CI decides whether a result moves to "ready to ship," "needs more data," or "stop, this is a false positive." The internal review process at most of these companies will not approve a launch on a point estimate without a CI that excludes zero by a comfortable margin.
Forecasting teams at Snowflake and Databricks present every revenue projection as a fan chart: the central line is the point forecast, the shaded band is the 80 or 95 percent prediction interval. The width of that band drives every "how confident are you in next quarter's number" follow-up from finance. These intervals combine model uncertainty with the residual variance of the data, and the fan widens with the forecast horizon, which is the visual everyone in the room understands instantly.
Related reading
- Confidence interval in SQL
- Bootstrap CI in SQL
- Bootstrap explained simply
- Bayesian methods for the data science interview
- A/B testing peeking mistake
FAQ
Is bootstrap valid for time series?
Naive bootstrap is not valid for time series because it destroys the temporal dependence that defines the data. Any statistic that depends on autocorrelation, seasonality, or trend will be wrong. The fix is block bootstrap: resample contiguous blocks instead of individual rows. Block length is the lever: blocks that are too short break the dependence you are trying to preserve, and blocks that are too long leave too few distinct blocks and inflate variance. The stationary bootstrap variant randomizes block length and is a common choice in time-series packages.
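The block idea can be sketched in a few lines. The code below is a simple moving block bootstrap for the mean with a fixed block length, written against the standard library; the function name, the default of 2,000 resamples, and the synthetic AR(1) series are assumptions made for illustration, and it is not the stationary bootstrap.

```python
import random
from statistics import fmean


def moving_block_bootstrap_ci(series, block_len, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean using overlapping fixed-length blocks."""
    n = len(series)
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and len(series)")
    rng = random.Random(seed)
    n_blocks = -(-n // block_len)  # ceiling division
    max_start = n - block_len
    boot_means = []
    for _ in range(n_boot):
        resample = []
        for _ in range(n_blocks):
            start = rng.randint(0, max_start)  # inclusive on both ends
            resample.extend(series[start:start + block_len])
        boot_means.append(fmean(resample[:n]))  # trim to the original length
    boot_means.sort()
    lo = boot_means[int(n_boot * alpha / 2)]
    hi = boot_means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi


# Synthetic AR(1) series with positive autocorrelation, for demonstration only.
gen = random.Random(1)
series = [0.0]
for _ in range(199):
    series.append(0.7 * series[-1] + gen.gauss(0.0, 1.0))

block_lo, block_hi = moving_block_bootstrap_ci(series, block_len=10)
print(f"block bootstrap 95 percent CI for the mean: [{block_lo:.3f}, {block_hi:.3f}]")
```

Setting `block_len=1` reduces this to the naive bootstrap, which makes it easy to compare interval widths on the same series.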
Should I report 90, 95, or 99 percent CI?
95 percent is the default in nearly every dashboard, paper, and A/B platform. Use 90 percent when you want a tighter interval and your stakeholder accepts more false positives — early-stage product decisions are the canonical case. Use 99 percent for high-stakes calls like pricing changes or fraud thresholds. The trade-off is mechanical: 99 percent widens the interval by about 30 percent over 95 percent, which means you need roughly 70 percent more data to keep the same precision.
When does the central limit theorem kick in for means?
The textbook answer is n = 30, and that holds for moderately skewed data. For heavy-tailed distributions like revenue per user, you may need n = 1,000 or more before the sampling distribution of the mean is near normal. The diagnostic is to bootstrap the mean and look at the histogram — if it is symmetric and unimodal, the normal approximation is fine; if it is still skewed, the parametric CI is misleading and you should report the bootstrap percentile interval instead.
What is the difference between a confidence interval and a prediction interval?
A confidence interval is about a parameter — the true mean, the true conversion rate, the true regression coefficient. A prediction interval is about a future observation — the value of the next data point. Prediction intervals are always wider because they include both the uncertainty in the parameter and the residual variance of individual observations. Forecasting fan charts are prediction intervals; A/B test dashboards report confidence intervals on the average treatment effect.
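The difference in width follows from the two formulas. The sketch below uses a hypothetical 20-point sample and, to keep it short, a normal multiplier with roughly normal data assumed; with n this small a t multiplier would be the more careful choice.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical sample of a roughly symmetric metric.
sample = [9.8, 10.4, 10.1, 9.5, 10.9, 10.2, 9.7, 10.6, 10.0, 10.3,
          9.9, 10.5, 9.6, 10.8, 10.1, 9.4, 10.7, 10.2, 9.8, 10.0]
n = len(sample)
m, s = mean(sample), stdev(sample)
z = NormalDist().inv_cdf(0.975)

# Confidence interval for the true mean: parameter uncertainty only.
ci_half = z * s / sqrt(n)

# Prediction interval for one new observation: parameter uncertainty
# plus the residual variance of an individual data point.
pi_half = z * s * sqrt(1 + 1 / n)

print(f"CI: {m:.2f} ± {ci_half:.2f}")
print(f"PI: {m:.2f} ± {pi_half:.2f}")
```

Under these assumptions the ratio of the two half-widths is sqrt(n + 1), so the confidence interval keeps shrinking as data accumulates while the prediction interval levels off near z times the standard deviation.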
How do I get a CI on the median or any percentile?
Bootstrap is the practical default. The mean ± z * SE formula does not carry over to the median. An exact, distribution-free interval built from order statistics and the binomial distribution exists for the median and for other percentiles, but it does not extend to statistics like the Gini coefficient. Resample your data 1,000 to 10,000 times, compute the statistic of interest on each resample, and take the 2.5th and 97.5th percentiles of the resulting distribution. The same pattern handles the 90th percentile of session duration, the Gini coefficient of revenue, or any other custom statistic your team reports.
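The recipe generalizes by passing the statistic in as a function. This is a minimal standard-library sketch; `bootstrap_ci` and `p90` are names introduced here, the data is a toy sequence, and `p90` is a simple order-statistic estimate rather than any particular library's percentile definition.

```python
import random
from statistics import median


def bootstrap_ci(data, stat_fn, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for an arbitrary statistic stat_fn(list)."""
    rng = random.Random(seed)
    n = len(data)
    # rng.choices samples with replacement.
    boot = sorted(stat_fn(rng.choices(data, k=n)) for _ in range(n_boot))
    lo = boot[int(n_boot * alpha / 2)]
    hi = boot[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi


def p90(xs):
    """Simple order-statistic estimate of the 90th percentile."""
    ordered = sorted(xs)
    return ordered[int(0.9 * (len(ordered) - 1))]


# Toy data: the values 1.0 through 101.0.
data = [float(x) for x in range(1, 102)]

med_lo, med_hi = bootstrap_ci(data, median)
p90_lo, p90_hi = bootstrap_ci(data, p90)

print(f"median = {median(data)}, 95 percent CI = [{med_lo}, {med_hi}]")
print(f"p90 = {p90(data)}, 95 percent CI = [{p90_lo}, {p90_hi}]")
```

Swapping in a different `stat_fn` is the only change needed for a trimmed mean or a Gini coefficient.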
How do I check whether my CI procedure has the correct coverage?
Simulate. Generate many samples from a known distribution, compute the CI on each, and count what fraction contain the true parameter. A correctly calibrated 95 percent CI should capture the true value in 94 to 96 percent of simulations. Coverage below 90 percent means your interval is too narrow — typically a violated assumption like independence or normality. Coverage simulations are how analytics teams validate their experiment platform is honest.
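A coverage check of this kind fits in a short script. The sketch below tests the normal-approximation interval for the mean on two synthetic populations chosen for illustration: a standard normal with n = 50 and a heavy-tailed lognormal with n = 20. The function name `coverage` and the simulation count of 2,000 are assumptions made here.

```python
import random
from math import exp, sqrt
from statistics import NormalDist, fmean

Z95 = NormalDist().inv_cdf(0.975)


def coverage(draw, true_mean, n, n_sims=2000, seed=0):
    """Fraction of normal-approximation 95 percent CIs that contain true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        sample = [draw(rng) for _ in range(n)]
        m = fmean(sample)
        var = sum((x - m) ** 2 for x in sample) / (n - 1)  # sample variance
        half = Z95 * sqrt(var / n)
        hits += (m - half) <= true_mean <= (m + half)
    return hits / n_sims


# Well-behaved case: standard normal data, true mean 0.
normal_cov = coverage(lambda rng: rng.gauss(0.0, 1.0), true_mean=0.0, n=50)

# Heavy-tailed case: lognormal(0, 1.5), whose true mean is exp(sigma^2 / 2).
skewed_cov = coverage(lambda rng: rng.lognormvariate(0.0, 1.5),
                      true_mean=exp(1.5**2 / 2), n=20)

print(f"normal data, n = 50:    coverage = {normal_cov:.3f}")
print(f"lognormal data, n = 20: coverage = {skewed_cov:.3f}")
```

The simulated coverage itself carries Monte Carlo noise of about half a percentage point at 2,000 runs, so a tighter tolerance needs more simulations.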