May 18, 2026·14 min read

T-tests in the data science interview

Q: Why does Welch's t-test beat Student's as the default?

Student's assumes equal population variance across groups, which fails routinely in A/B tests where the treatment changes behavior. Welch's drops the assumption and pays only a tiny power cost when variances are equal, so using it as a default has negligible downside and real robustness upside. Modern statistics texts and most experimentation platforms now name Welch's as the standard choice, and top-tech interviewers expect you to do the same.

Q: What sample size do I need for a t-test to be valid?

There is no single threshold, but the practical rule is roughly 30 per group for symmetric distributions and 50 to 100 for moderately skewed ones. For very heavy-tailed metrics — revenue per user, retention days, latency tails — the safe threshold can be in the thousands, because a single outlier can dominate the mean. Below those thresholds, switch to Mann-Whitney U or a bootstrap confidence interval.

Q: Should I run Levene's test first to decide between Student's and Welch's?

No. The pretest-then-test pipeline distorts the joint false-positive rate, and the gain from picking Student's when variances happen to be equal is small. Just run Welch's by default. If an interviewer pushes back, note that the pretest itself has Type I and Type II errors, and conditioning the second test on its outcome breaks the clean interpretation of the final p-value.

Q: Can I use a t-test on conversion rate?

By convention, no: conversion rate is a proportion, and chi-square or a two-proportion z-test is the standard tool. With very large samples a Welch's t-test on the 0/1 indicator gives a nearly identical p-value because the central limit theorem kicks in, but the proportion test builds its standard error from the Bernoulli variance p(1 - p) rather than from a freely estimated one. The safe answer is "two-proportion z-test or chi-square, not a t-test"; that signals you understand the difference between a continuous mean and a proportion.

Q: How do I report a t-test result to a non-technical stakeholder?

Lead with the effect size and confidence interval, then mention the p-value as a footnote. The line that lands with a PM at Notion or Linear is "treatment lifted revenue per user by 0.4 percent with a 95 percent CI of 0.3 to 0.5 percent (p = 0.001)," not "the t-statistic was 3.2 with 47,000 degrees of freedom." Stakeholders care whether the effect is real, how big it is, and how precise the estimate is — the test statistic itself is plumbing.

Prep A/B testing and statistics

300+ questions on experiment design, sample size, p-values, and pitfalls.

Join the waitlist

Contents:

Why t-tests show up in every DS loop
One-sample t-test
Two-sample t-test
Paired t-test
Welch's t-test — the modern default
Assumptions and what to do when they break
Common pitfalls
Where this shows up in production
Related reading
FAQ

Why t-tests show up in every DS loop

Walk into a senior DS loop at Stripe, Netflix, Airbnb, DoorDash, or Snowflake and at least one question will turn on a t-test. The hiring manager at Databricks asks whether your A/B treatment moved average revenue per user. The staff scientist at Anthropic hands you two columns of latency numbers and asks how to test whether the new model serves faster. The technical screen at Linear gives you a paired before/after dataset for a UI change.

Candidates lose points not because they cannot recite the formula but because they reach for the wrong variant, blur the assumptions, or muddle the p-value. A senior interviewer listens for whether you can pick between one-sample, two-sample independent, paired, and Welch's variants in three seconds, then explain why the choice matches the design.

One-sample t-test

A one-sample t-test compares the mean of a single sample against a fixed reference value mu_0. The classic prompt is "your checkout latency SLA is 4 seconds, here is a sample of 500 sessions, is the team meeting the SLA?" The reference value comes from the business and is not estimated from the data.

The mechanics: compute the sample mean x_bar, standard deviation s, and standard error s / sqrt(n). The test statistic scales the gap between x_bar and mu_0 by the standard error and follows a t-distribution with n - 1 degrees of freedom under the null.

H0: mu = mu_0
H1: mu != mu_0

t = (x_bar - mu_0) / (s / sqrt(n))

from scipy.stats import ttest_1samp

stat, p = ttest_1samp(sample, popmean=4.0)
if p < 0.05:
    print('reject H0 — mean differs from 4 seconds')
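
To make the mechanics concrete, here is a minimal sketch that computes the statistic by hand using only the standard library. The helper name one_sample_t and the simulated latencies are illustrative assumptions, not part of any library or real dataset.

```python
import math
import random
import statistics

def one_sample_t(sample, mu_0):
    """Return (t, df) for a one-sample t-test against the fixed value mu_0."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)      # sample standard deviation (n - 1 denominator)
    se = s / math.sqrt(n)             # standard error of the mean
    return (x_bar - mu_0) / se, n - 1

# Hypothetical data: 500 simulated checkout latencies in seconds.
random.seed(0)
sample = [random.gauss(4.1, 1.0) for _ in range(500)]
t_stat, dof = one_sample_t(sample, mu_0=4.0)
print(f"t = {t_stat:.2f} on {dof} degrees of freedom")
```

The t value should agree with the statistic returned by ttest_1samp on the same data; SciPy additionally converts it to a p-value using the t-distribution with n - 1 degrees of freedom.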

The trap is forgetting that the reference value must be external. Candidates sometimes compute mu_0 from a different slice of the same data. That is really a two-sample design: treating an estimated mu_0 as fixed ignores its sampling error, understates the standard error, and inflates the false-positive rate. If an interviewer at Notion asks you to "compare the latency of the new release against the historical baseline," ask whether the baseline is a fixed published number or a separate sample, because the answer dictates the test.

Two-sample t-test

A two-sample t-test compares the means of two independent groups — the bread-and-butter test of A/B testing. The data is two columns of numbers from separately drawn samples, usually control and treatment, and the question is whether the population means are equal.

H0: mu_A = mu_B
H1: mu_A != mu_B

t = (x_bar_A - x_bar_B) / SE

The standard error is where the variants split. Student's t-test pools variance across both groups, assuming a common population variance. Welch's does not pool and uses the per-group variances directly. For modern A/B tests on user-level metrics, Welch's is the safer default — more on that two sections down.

from scipy.stats import ttest_ind

stat, p = ttest_ind(control, treatment, equal_var=False)  # Welch's

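A common mistake is misreading the alternative hypothesis. ttest_ind is two-sided by default, so a p-value of 0.04 is evidence of a difference in either direction. If an interviewer at Vercel asks "is the new variant faster than control," you need a one-sided test: either halve the two-sided p-value (valid only when the observed direction matches the hypothesis) or pass the alternative argument. In SciPy the alternative describes the first sample relative to the second, so for a latency metric where lower is faster, ttest_ind(treatment, control, alternative='less') tests whether treatment is faster; with the (control, treatment) order used above, the same hypothesis is alternative='greater'. State the direction before you run the test; running both and picking the smaller p-value is textbook p-hacking.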
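
The direction of a one-sided test depends on argument order, which is easy to get backwards. The sketch below uses simulated latency data (an assumption for illustration, where lower is faster) to show the equivalent calls and their relationship to the two-sided p-value.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
# Hypothetical latency samples in milliseconds; lower is faster.
control = rng.normal(loc=200.0, scale=30.0, size=2000)
treatment = rng.normal(loc=197.0, scale=30.0, size=2000)

# H1: mean(treatment) < mean(control). The alternative refers to the FIRST argument.
res_less = ttest_ind(treatment, control, equal_var=False, alternative="less")

# The same hypothesis with the arguments swapped needs alternative="greater".
res_greater = ttest_ind(control, treatment, equal_var=False, alternative="greater")

# Two-sided test for comparison.
res_two = ttest_ind(treatment, control, equal_var=False)

print(res_less.pvalue, res_greater.pvalue, res_two.pvalue)
```

When the observed difference points in the hypothesized direction, the one-sided p-value is half the two-sided one; when it points the other way, the one-sided p-value is above 0.5 and halving would be wrong.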

Paired t-test

A paired t-test compares two measurements taken on the same units, where each unit contributes one observation in each condition. The columns are matched pairs, not independent samples. Classic prompts: "we shipped a UI change to all users on the same day, here is revenue per user before and after, did revenue move?" or "each user saw both layouts in a within-subjects study, did engagement change?"

The test does not operate on the two columns directly — you compute the per-row difference and run a one-sample t-test against zero on that vector. The pairing absorbs between-user variance and leaves the within-user change, which is usually a much smaller, more sensitive quantity.

diff_i = X_after_i - X_before_i
t = mean(diff) / (sd(diff) / sqrt(n))

from scipy.stats import ttest_rel

stat, p = ttest_rel(after, before)

Because repeated measurements on the same unit are usually positively correlated, pairing typically yields more power than the equivalent two-sample design at the same N, sometimes dramatically more for sticky user-level metrics like spend or session count. CUPED is a regression-based generalization of the same idea: it uses a pre-experiment covariate to remove between-user variance. If a panel at Airbnb asks "how would you increase power without changing sample size," paired analysis or CUPED-style covariate adjustment is the answer.
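
A rough illustration of where the extra power comes from, assuming simulated per-user spend with a large stable user effect (all numbers are hypothetical). The unpaired statistic is computed only for comparison; it is not a valid analysis of paired data.

```python
import math
import random
import statistics

random.seed(1)
n = 200
# Hypothetical per-user spend: a large stable user effect plus small noise.
user_level = [random.gauss(50.0, 20.0) for _ in range(n)]
before = [u + random.gauss(0.0, 2.0) for u in user_level]
after = [u + 1.0 + random.gauss(0.0, 2.0) for u in user_level]  # true lift of 1.0

# Paired statistic: one-sample t on the per-user differences.
diff = [a - b for a, b in zip(after, before)]
t_paired = statistics.mean(diff) / (statistics.stdev(diff) / math.sqrt(n))

# Welch-style statistic on the same columns, ignoring the pairing.
se_unpaired = math.sqrt(statistics.variance(after) / n + statistics.variance(before) / n)
t_unpaired = (statistics.mean(after) - statistics.mean(before)) / se_unpaired

print(f"paired t = {t_paired:.2f}, unpaired t = {t_unpaired:.2f}")
```

Both statistics share the same numerator; the paired version has a much smaller standard error because the between-user variance cancels in the differences.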

The pitfall is using a paired test on data that is not actually paired. In a true A/B test where each user was randomly assigned to one variant, observations are independent across users, not paired across conditions — you must run ttest_ind. Match the test to the design.

Welch's t-test — the modern default

Welch's is the two-sample t-test with one key change: it does not assume equal variance across groups. The denominator uses each group's own variance, and the degrees of freedom are approximated by the Welch-Satterthwaite equation.

t = (x_bar_A - x_bar_B) / sqrt(s2_A / n_A + s2_B / n_B)

from scipy.stats import ttest_ind

stat, p = ttest_ind(control, treatment, equal_var=False)
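
For intuition about what equal_var=False changes, here is a small standard-library sketch of the statistic and the Welch-Satterthwaite degrees of freedom. The function name welch_t and the sample values are illustrative assumptions.

```python
import math
import statistics

def welch_t(a, b):
    """Return (t, df) for Welch's t-test from two lists of numbers."""
    n_a, n_b = len(a), len(b)
    v_a, v_b = statistics.variance(a), statistics.variance(b)  # sample variances
    se2_a, se2_b = v_a / n_a, v_b / n_b                        # squared standard errors
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical samples with unequal sizes and unequal spread.
control = [10, 12, 11, 13, 9, 11]
treatment = [8, 20, 2, 15, 25, 5, 12, 30]
t_stat, df = welch_t(control, treatment)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

With equal group sizes and equal sample variances the approximation returns n_A + n_B - 2, the Student's value; as the variances diverge the degrees of freedom shrink, which is how Welch's pays for dropping the equal-variance assumption.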

Welch's is the modern default for three reasons. First, A/B treatments routinely change the variance of the metric as well as its mean: a discount that converts more people shifts the mix of $0 and paying sessions, so the treatment-arm variance rarely matches control. Second, group sizes are often unbalanced in production experiments, and Student's t-test is most sensitive to unequal variances precisely when the groups are unbalanced. Third, when variances genuinely are equal, Welch's loses almost no power, so the cost of using it as a blanket default is negligible.

If an interviewer at Meta or OpenAI asks "which t-test would you run by default," the right answer is Welch's, with a one-sentence justification about unequal variance being the realistic regime. "I'd run Levene's test first and pick" is a yellow flag — the pretest-then-test pipeline distorts the false-positive rate and is no longer recommended.

Assumptions and what to do when they break

The classical assumptions are independence of observations, approximate normality of the sampling distribution of the mean, and (for Student's) equal variance across groups. Listing them is table stakes — what wins points is knowing which ones bend gracefully and which break the test outright.

Normality matters less than candidates fear. The central limit theorem makes the sampling distribution of the mean approximately normal once samples reach the hundreds for moderately skewed data, though very heavy-tailed metrics can need thousands. Revenue per user is wildly non-normal at the row level, but with 50,000 users per arm the sample mean is usually well-behaved and the t-test is fine. Normality genuinely matters with small samples, roughly below 30 observations per group, and with heavy tails at moderate N; there you reach for a non-parametric alternative or a bootstrap.

Independence matters a lot and bends nothing. If users are clustered — multiple sessions per user, multiple users per company, multiple measurements per device — the standard errors are wrong and the p-values are too small. The fix is a cluster-aware design: bootstrap with the user as the resampling unit, the delta method on user-level aggregates, or a mixed-effects model. If a Stripe panel asks "what if the same user shows up multiple times in your data," they are testing whether you spot the independence violation.
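
A simplified sketch of why clustering matters, assuming a simulated session log with exactly five sessions per user (hypothetical data). It collapses sessions to one value per user, which is the simplest cluster-aware fix; ratio metrics with varying session counts would need the delta method or a user-level bootstrap instead.

```python
import math
import random
import statistics
from collections import defaultdict

random.seed(2)
# Hypothetical session log: (user_id, metric) with 5 sessions per user.
rows = []
for user_id in range(300):
    user_effect = random.gauss(20.0, 10.0)        # stable per-user level
    for _ in range(5):
        rows.append((user_id, user_effect + random.gauss(0.0, 2.0)))

# Naive: treat every session as an independent observation.
values = [v for _, v in rows]
naive_se = statistics.stdev(values) / math.sqrt(len(values))

# Cluster-aware: collapse to one number per user, then compute the standard error.
per_user = defaultdict(list)
for user_id, v in rows:
    per_user[user_id].append(v)
user_means = [statistics.mean(vs) for vs in per_user.values()]
user_se = statistics.stdev(user_means) / math.sqrt(len(user_means))

print(f"naive SE = {naive_se:.3f}, user-level SE = {user_se:.3f}")
```

The naive standard error is too small because sessions from the same user carry mostly the same information, so the effective sample size is closer to the number of users than the number of rows.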

When the assumptions are too far gone, the right replacement depends on the failure mode. Heavy-tailed continuous data with small N points toward Mann-Whitney U for two independent samples or Wilcoxon signed-rank for paired data. Binary outcomes belong with chi-square or a two-proportion z-test, not a t-test on the 0/1 indicator.

Common pitfalls

The most common pitfall at staff DS loops is reaching for a t-test on a binary outcome. Conversion rate is a proportion, not a continuous mean, and the right tool is chi-square or a two-proportion z-test. A t-test on a 0/1 column will run and return a number, and at large N it lands close to the proportion test, but it estimates the variance from the sample instead of using the Bernoulli form p(1 - p) implied by the null, and the two can diverge on small samples. If a DoorDash interviewer asks "did treatment lift conversion from 5.0 to 5.3 percent," do not reach for ttest_ind on the indicator vector.
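
As a sketch of the recommended alternative, the pooled two-proportion z-test fits in a few lines of standard-library Python. The function name and the per-arm sample size of 100,000 are assumptions for illustration; the prompt above gives only the rates.

```python
import math
from statistics import NormalDist

def two_proportion_ztest(x_a, n_a, x_b, n_b):
    """Two-sided pooled z-test for H0: p_a == p_b. Returns (z, p_value)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)            # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 5.0% vs 5.3% conversion with 100,000 users per arm.
z, p = two_proportion_ztest(5000, 100_000, 5300, 100_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The standard error comes entirely from the pooled rate, which is the Bernoulli variance structure the t-test does not use.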

A second pitfall is treating a low p-value as proof of practical importance. A 0.001 p-value on a 0.2 percent lift across 10 million users is statistically detectable and commercially uninteresting. Senior interviewers want you to couple the test with an effect size — Cohen's d for continuous metrics, absolute and relative lift for proportions — and a confidence interval. The line that wins points is "p = 0.001, effect = +0.4 percent with a 95 percent CI of 0.3 to 0.5 percent, robust but small," not "p < 0.001 so ship it."
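
One way to package the numbers a reviewer actually wants, sketched with the standard library. The function name effect_summary is hypothetical, and the interval uses a normal approximation that is reasonable only for large samples; with small samples a t quantile would be the safer choice.

```python
import math
import statistics
from statistics import NormalDist

def effect_summary(control, treatment, confidence=0.95):
    """Difference in means, relative lift, Cohen's d, and a large-sample CI."""
    n_c, n_t = len(control), len(treatment)
    m_c, m_t = statistics.mean(control), statistics.mean(treatment)
    v_c, v_t = statistics.variance(control), statistics.variance(treatment)
    diff = m_t - m_c
    # Cohen's d with the pooled standard deviation.
    pooled_sd = math.sqrt(((n_c - 1) * v_c + (n_t - 1) * v_t) / (n_c + n_t - 2))
    # Normal-approximation CI using the unpooled (Welch) standard error.
    se = math.sqrt(v_c / n_c + v_t / n_t)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return {
        "diff": diff,
        "relative_lift": diff / m_c,
        "cohens_d": diff / pooled_sd,
        "ci": (diff - z * se, diff + z * se),
    }

# Tiny hypothetical example; real A/B data would have thousands of rows per arm.
print(effect_summary([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))
```

Reporting the dictionary alongside the p-value makes it hard to confuse "detectable" with "important".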

A third pitfall is multiple testing without correction. If you run t-tests on twenty secondary metrics with no true effects, you expect about one false positive at the 0.05 level (20 × 0.05 = 1), and the chance of at least one is roughly 64 percent if the metrics are independent. Specify the primary metric before the test, then apply Bonferroni or Benjamini-Hochberg to the secondaries. If a candidate at Meta says "we ran t-tests on every metric in the dashboard and found three that moved," the interviewer will probe for the correction.
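
Both corrections are short enough to sketch directly. The p-values below are made up for illustration, and the family-wise error calculation assumes independent tests.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= k / m * alpha.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

# Chance of at least one false positive across 20 independent tests at alpha = 0.05.
fwer = 1 - (1 - 0.05) ** 20

# Hypothetical p-values from ten secondary metrics.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(f"uncorrected FWER for 20 tests: {fwer:.2f}")
print("Bonferroni rejections:", sum(bonferroni(p_values)))
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg(p_values)))
```

Bonferroni is the stricter of the two; Benjamini-Hochberg tolerates a controlled share of false discoveries in exchange for more power, which tends to suit exploratory secondary metrics.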

A fourth pitfall is sequential peeking — checking the t-test p-value daily and stopping as soon as it dips below 0.05. The frequentist t-test is built around a fixed sample size, and repeated looks inflate the false-positive rate well above the nominal level. The fix is a fixed sample size, an alpha-spending procedure like O'Brien-Fleming, or an always-valid sequential framework. Senior loops at Netflix and Linear specifically test for this trap.
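
A small A/A simulation makes the inflation visible. This sketch makes simplifying assumptions that are not from the article: both arms are standard normal, the variance is treated as known (so a z statistic stands in for the t-test), and there are ten equally spaced looks.

```python
import math
import random
from statistics import NormalDist

def simulate_peeking(n_sims=1000, looks=10, per_look=50, alpha=0.05, seed=3):
    """A/A simulation: both arms are N(0, 1), so every rejection is a false positive."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peek_hits = fixed_hits = 0
    for _ in range(n_sims):
        sum_a = sum_b = 0.0
        n = 0
        crossed = False
        for _ in range(looks):
            for _ in range(per_look):
                sum_a += rng.gauss(0.0, 1.0)
                sum_b += rng.gauss(0.0, 1.0)
            n += per_look
            # Known-variance z statistic for the difference in means at this look.
            z = (sum_a / n - sum_b / n) / math.sqrt(2.0 / n)
            if abs(z) > z_crit:
                crossed = True
        peek_hits += crossed            # significant at any look
        fixed_hits += abs(z) > z_crit   # significant at the final look only
    return peek_hits / n_sims, fixed_hits / n_sims

peek_rate, fixed_rate = simulate_peeking()
print(f"peeking: {peek_rate:.3f}, fixed horizon: {fixed_rate:.3f}")
```

The fixed-horizon rate should sit near the nominal 5 percent, while stopping at the first significant look rejects far more often even though there is no true effect.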

Where this shows up in production

In production, the t-test is the engine of almost every classical A/B platform. The default analysis screen on the platforms used by Uber, Airbnb, and DoorDash is a Welch's t-test on the per-user metric, with a delta-method correction for ratio metrics. The frontend shows a lift, a p-value, and a confidence interval — under the hood it is the formulas from this article.

The same machinery powers monitoring. A latency regression alert at Vercel is a paired t-test in disguise, comparing post-deploy to pre-deploy. A model drift alert at Anthropic on an offline eval is a t-test on the per-example score, with Bonferroni across dashboard metrics. The t-test is the daily currency of analytics and observability work.

If you want to drill statistics questions like these every day, NAILDD is launching with curated DS interview problems that cover t-tests, chi-square, bootstrap, and the full A/B testing toolkit.

FAQ

Why does Welch's t-test beat Student's as the default?