T-test vs z-test
The 30-second answer
A z-test assumes you know the population standard deviation (sigma) or that the metric is a proportion whose variance is closed-form from the rate. A t-test estimates the standard deviation from the sample itself and pays for that uncertainty with a wider reference distribution. In practice, analysts almost always reach for a t-test on continuous metrics because true population sigma is virtually never known — you only have the sample in front of you.
For sample sizes above roughly 30, the two tests produce nearly identical p-values, so the choice rarely changes the conclusion in real A/B testing where N is in the thousands. Where it does matter is interviews: saying "z-test because N is large" signals you have not thought about whether you know sigma. The cleaner answer is t-test by default, switch to a two-proportion z-test when the metric is a Bernoulli rate.
When to use a z-test
A z-test is the right tool when the population standard deviation is known a priori — not estimated from the same sample you are testing. This shows up mostly in textbooks: IQ tests calibrated to sigma=15, standardized exams with published population variance, factory measurements with long-validated tolerance. In analytics the condition is so rare that recognizing the test by name is more useful than running it.
The other place a z-test shines is comparing proportions. A conversion rate is a Bernoulli variable whose variance is fully determined by the rate itself: p(1-p). There is no separate variance parameter to estimate; once the rate is fixed under the null hypothesis, so is the variance. That makes the two-proportion z-test mathematically clean, and it pairs naturally with a confidence interval on the lift. Many A/B testing platforms, from in-house tooling at companies such as Meta, Netflix, and Airbnb to off-the-shelf systems like Statsig or Eppo, commonly report conversion experiments using this test or a close equivalent.
z = (x_bar - mu_0) / (sigma / sqrt(n))

Here x_bar is the sample mean, mu_0 is the hypothesized mean, sigma is the known population standard deviation, and n is the sample size. The result is compared against N(0,1). If |z| exceeds 1.96, the two-sided p-value is below 0.05.
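As a worked instance of the formula, the sketch below runs a one-sample z-test using only the Python standard library. The numbers are hypothetical, chosen for illustration: a sample of 36 test-takers with mean score 104, tested against mu_0 = 100 with the textbook sigma = 15.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical inputs: sigma is treated as known (IQ scale), not estimated.
x_bar = 104.0   # sample mean
mu_0 = 100.0    # hypothesized population mean
sigma = 15.0    # known population standard deviation
n = 36          # sample size

z = (x_bar - mu_0) / (sigma / sqrt(n))
# Two-sided p-value from the standard normal reference distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z={z:.3f}, p={p_value:.4f}")
```

With these inputs z works out to 1.6, which is below 1.96, so the two-sided p-value stays above 0.05.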
For the two-proportion case, statsmodels exposes the test as proportions_ztest:

```python
from statsmodels.stats.proportion import proportions_ztest

count = [120, 145]   # conversions in control, treatment
nobs = [2000, 2000]  # users per arm
stat, p = proportions_ztest(count, nobs)
print(f'z={stat:.3f}, p={p:.4f}')
```

No estimate of variance is fed in; the function computes it directly from the rates. That is the hallmark of a z-test on proportions: the variance formula is closed-form from the parameter you are testing.
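To make the closed-form variance concrete, here is a sketch that computes a two-proportion z statistic by hand for the same counts, using only the standard library. It assumes the pooled-rate standard error for the test and the unpooled standard error for the confidence interval on the lift. Treat it as an illustration of the arithmetic, not a description of what any particular library does internally.

```python
from math import sqrt
from statistics import NormalDist

conv_c, n_c = 120, 2000   # control conversions, users
conv_t, n_t = 145, 2000   # treatment conversions, users

p_c, p_t = conv_c / n_c, conv_t / n_t
p_pool = (conv_c + conv_t) / (n_c + n_t)

# Under H0 both arms share one rate, so the variance is p_pool * (1 - p_pool).
se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
z = (p_c - p_t) / se_pooled
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# 95% confidence interval on the absolute lift uses each arm's own rate.
se_unpooled = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
z_crit = NormalDist().inv_cdf(0.975)
lift = p_t - p_c
ci = (lift - z_crit * se_unpooled, lift + z_crit * se_unpooled)

print(f"z={z:.3f}, p={p_value:.4f}, lift CI=({ci[0]:.4f}, {ci[1]:.4f})")
```

With these counts the interval on the lift spans zero, which is consistent with a p-value above 0.05.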
When to use a t-test
Use a t-test for continuous metrics where you estimate the standard deviation from the sample itself: average order value, time on site, latency, revenue per active user, session duration. Because you replaced a known sigma with a noisy sample s, the test statistic must account for the extra uncertainty, and it does so by switching the reference distribution from normal to Student's t with n-1 degrees of freedom.
t = (x_bar - mu_0) / (s / sqrt(n))

The only formula change from the z-test is the swap of sigma for s. With low degrees of freedom, say n=5, the t-distribution has heavier tails than the normal, so the two-sided critical value for alpha=0.05 jumps from 1.96 to about 2.78. By n=30 the critical value is back to roughly 2.05, and by n=120 you cannot tell the two distributions apart on a chart.
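The critical values quoted above can be checked with a few lines of scipy. This is a small sketch assuming a one-sample, two-sided test at alpha = 0.05, so the degrees of freedom are n - 1.

```python
from scipy.stats import norm, t

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)  # two-sided normal critical value

# One-sample test, so degrees of freedom are n - 1.
t_crit = {n: t.ppf(1 - alpha / 2, df=n - 1) for n in (5, 30, 120)}

print(f"normal: {z_crit:.3f}")
for n, crit in t_crit.items():
    print(f"n={n:>3}: t critical = {crit:.3f}")
```

The gap between the t and normal cutoffs shrinks quickly as n grows, which is why the choice rarely changes conclusions at A/B-test sample sizes.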
For a two-group comparison, the scipy call is:

```python
from scipy.stats import ttest_ind

# control and treatment are arrays of per-user metric values, one entry per user
stat, p = ttest_ind(control, treatment, equal_var=False)  # Welch's
print(f't={stat:.3f}, p={p:.4f}')
```

The equal_var=False switch enables Welch's t-test, which drops the assumption that the two groups share a variance. In A/B testing, treatments routinely change both the mean and the spread (a discount that converts more users also changes the mix of zero-revenue and paying users), so Welch's is the safer default. Use Student's only when you have explicitly checked that variances match.
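To show what dropping the equal-variance assumption means in arithmetic terms, the sketch below computes a Welch t statistic and the Welch-Satterthwaite degrees of freedom by hand with the standard library. The two small samples are hypothetical and deliberately have very different spreads.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical per-user values; treatment is visibly more spread out.
control = [10, 12, 9, 11, 13]
treatment = [14, 18, 9, 20, 16, 25]

n_c, n_t = len(control), len(treatment)
v_c, v_t = variance(control), variance(treatment)  # sample variances (n - 1)

# Each arm contributes its own variance term; nothing is pooled.
se = sqrt(v_c / n_c + v_t / n_t)
t_stat = (mean(control) - mean(treatment)) / se

# Welch-Satterthwaite approximation to the degrees of freedom.
df = (v_c / n_c + v_t / n_t) ** 2 / (
    (v_c / n_c) ** 2 / (n_c - 1) + (v_t / n_t) ** 2 / (n_t - 1)
)

print(f"t={t_stat:.3f}, df={df:.2f}")
```

Here the approximate degrees of freedom come out near 6, well under the n_c + n_t - 2 = 9 that a pooled-variance test would use, which is how Welch's version pays for the unequal spreads.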
Why the t-distribution has fatter tails
When you estimate sigma from the same sample you are testing, you add a second source of noise on top of the noise in the sample mean. The t-distribution accounts for that by making extreme values more probable than they would be under a normal distribution. The practical consequence is wider confidence intervals and larger p-values — the correct, more conservative answer when you do not know the true spread.
The shape depends on degrees of freedom. With df=1 the t-distribution is exactly the standard Cauchy distribution, with tails so heavy that it has no finite mean or variance. By df=10 the tails are visibly heavier than normal but the bell shape is clear. By df=30 you need a magnifying glass to spot the difference, which is the origin of the folk rule "use a z-test when N is over 30". The real reason analysts use t is that sigma is unknown, not that N is small.
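A quick simulation makes the cost of ignoring the heavier tails visible. The sketch below assumes normally distributed data with a true null, n = 5, and a fixed random seed. It counts how often a statistic built with the sample s crosses the normal cutoff of 1.96 versus the t cutoff of 2.776 for df = 4.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)
n, reps = 5, 20000
z_crit, t_crit = 1.96, 2.776  # normal vs t (df=4) two-sided 5% cutoffs

reject_z = reject_t = 0
for _ in range(reps):
    # The null is true: data come from N(0, 1) and we test mu_0 = 0.
    sample = [random.gauss(0, 1) for _ in range(n)]
    stat = mean(sample) / (stdev(sample) / sqrt(n))
    reject_z += abs(stat) > z_crit
    reject_t += abs(stat) > t_crit

rate_z, rate_t = reject_z / reps, reject_t / reps
print(f"false positive rate with 1.96: {rate_z:.3f}, with 2.776: {rate_t:.3f}")
```

The rate under the normal cutoff should land well above the nominal 5%, while the t cutoff should sit close to it. That is the "more conservative" behaviour described above.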
Flavors of the t-test you should know
The one-sample t-test compares a single group's mean to a fixed reference, like "is average checkout time slower than the 4-second SLA?". The reference value mu_0 is fixed from a contract or product target, not estimated.
The two-sample independent t-test compares two groups made up of different units, and it is the workhorse for A/B testing. Each user contributes one observation. The Welch variant (equal_var=False) does not assume equal variance, which is almost always the right call in production because treatments shift both center and spread.
The paired t-test compares two measurements on the same units, like revenue per user before and after a UI change. It eliminates between-user variance by subtracting paired values. Confusing paired with independent is one of the most common errors in writeups for retention or pricing experiments where you held the user set constant.
```python
from scipy.stats import ttest_1samp, ttest_ind, ttest_rel

# times, control, treatment, before, after: arrays of per-unit measurements
ttest_1samp(times, popmean=4.0)                 # one-sample
ttest_ind(control, treatment, equal_var=False)  # Welch's
ttest_rel(before, after)                        # paired
```

The function names in scipy.stats map cleanly to the three flavors. The mistake is reaching for ttest_ind on before/after data: it ignores the pairing structure and inflates the standard error, costing you significance on real effects.
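The effect of ignoring pairing is easiest to see in the arithmetic. The sketch below computes both statistics by hand with the standard library on hypothetical before/after revenue for five users. Between-user spread is large, while the per-user change is small and consistent.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical revenue for the same five users before and after a change.
before = [100, 150, 200, 250, 300]
after = [104, 153, 206, 252, 305]
n = len(before)

# Paired: work on per-user differences, so between-user spread cancels out.
diffs = [a - b for a, b in zip(after, before)]
t_paired = mean(diffs) / sqrt(variance(diffs) / n)

# Independent (Welch form): between-user spread stays in the standard error.
se_ind = sqrt(variance(before) / n + variance(after) / n)
t_ind = (mean(after) - mean(before)) / se_ind

print(f"paired t={t_paired:.2f}, independent t={t_ind:.2f}")
```

The mean change is the same in both calculations; only the standard error differs. On this data the paired statistic is above 5 while the independent one is below 0.1.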
A/B testing recipe, worked end-to-end
A typical interview scenario: control got 2,000 users with 120 conversions, treatment got 2,000 users with 145 conversions. The metric is a proportion, so chi-square or its sibling the two-proportion z-test is the canonical choice. T-test on the 0/1 indicator works numerically for large N but signals weak fundamentals because the variance of a Bernoulli is p(1-p), not s^2.
```python
from statsmodels.stats.proportion import proportions_ztest

count = [120, 145]
nobs = [2000, 2000]
stat, p = proportions_ztest(count, nobs)
```

If the same experiment had average revenue per user as its metric, the recipe flips to Welch's t-test. Revenue per user is continuous, you do not know the population variance, and a single user can spend anywhere from zero to thousands of dollars. The standard error has to be estimated from the sample, which is exactly what a t-test does.
Sample size matters less than metric type. Even at N=10,000 per arm, you would not run a t-test on conversion rate, because the variance assumption is wrong for a Bernoulli outcome. The choice is dictated by the measurement scale — the same logic that drives the t-test vs chi-square decision.
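The recipe calls chi-square the sibling of the two-proportion z-test. The sketch below checks that relationship on the same counts, using only the standard library. It assumes Pearson's chi-square without continuity correction and the pooled-rate form of the z statistic.

```python
from math import sqrt

# Same counts as the recipe: rows are arms, columns are converted / not converted.
table = [[120, 2000 - 120], [145, 2000 - 145]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
total = sum(row_totals)

# Pearson chi-square without continuity correction.
chi2 = sum(
    (table[i][j] - row_totals[i] * col_totals[j] / total) ** 2
    / (row_totals[i] * col_totals[j] / total)
    for i in range(2)
    for j in range(2)
)

# Pooled two-proportion z statistic on the same data.
p_c, p_t = 120 / 2000, 145 / 2000
p_pool = col_totals[0] / total
z = (p_c - p_t) / sqrt(p_pool * (1 - p_pool) * (1 / 2000 + 1 / 2000))

print(f"chi2={chi2:.4f}, z^2={z * z:.4f}")
```

The chi-square statistic equals z squared here, so on a 2x2 table the two tests give the same two-sided p-value.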
Side-by-side comparison
| | z-test | t-test |
|---|---|---|
| Variance assumption | Known population sigma | Estimated from sample as s |
| Reference distribution | Standard normal N(0,1) | Student's t with n-1 df |
| Typical sample size | Large or known sigma | Any, especially n < 30 |
| Tails of reference | Lighter | Heavier (more conservative) |
| In practice | Conversion rate, IQ scores | Continuous A/B metrics |
| Python call | proportions_ztest | ttest_ind(equal_var=False) |
The row that trips up candidates most often is "variance assumption". Saying "z-test because N is large" misses the deeper reason analysts use a t-test by default: even with millions of rows, sigma is unknown, so you are estimating it from the sample. The t-distribution is the honest reference, and at large N it converges to the normal anyway.
Interview questions you will actually get
"When can you use a z-test instead of a t-test?" When the population variance is genuinely known a priori — rare outside textbook problems — or when the metric is a proportion and variance is closed-form from the rate. For continuous metrics in A/B testing, default to t-test because you are always estimating sigma from the sample.
"What is Welch's t-test?" A modification that drops the equal-variance assumption Student's t makes. In production A/B testing the treatment usually shifts both mean and variance, so equal-variance is fragile. Welch's costs nothing extra — same call with equal_var=False.
"At what sample size do t and z give the same answer?" Beyond roughly n=30 per group the difference is negligible. By n=120 the two reference distributions are visually indistinguishable. The honest answer is "use t whenever you estimate sigma, regardless of N".
"Can I use a t-test for conversion rate?" Numerically yes for large N, but the standard error formula is wrong in principle. A Bernoulli has variance p(1-p), not a sample variance. Senior interviewers will dock you for this even if the p-values match.
"What are the assumptions of the t-test?" Independence of observations, approximate normality of the sampling distribution of the mean (free with N>30 by the CLT), and — for Student's — equal variances. Welch's drops the equal-variance requirement.
Common pitfalls
When teams first run statistical tests on experiment data, the most frequent error is reaching for a z-test on a continuous metric because "we have a lot of users, CLT covers us". The CLT argument is right that the sampling distribution of the mean is approximately normal, but you still need a standard error, and you are estimating it from the sample. That is the t-test. Framing it as a z-test signals you have not internalized the difference between a known sigma and an estimated s.
Another trap is running Student's t-test instead of Welch's by default. Student's assumes equal variance, almost never true in A/B testing because the treatment moves both mean and spread. Welch's drops that assumption at zero cost — same call, just equal_var=False. Reach for Student's only after you have verified equal variance.
Using a t-test on a 0/1 conversion indicator is a classic mistake that survives because the numbers come close to the two-proportion z-test at large N. They come close because the sample variance of a 0/1 column is just p(1-p) scaled by n/(n-1), so the t-test re-estimates a quantity the rate already determines, and it does not pool the rate under the null the way the z-test does. Use the two-proportion z-test or chi-square for conversion rate, and reserve the t-test for continuous metrics.
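To see why the numbers come so close, the following sketch compares the closed-form Bernoulli variance with the sample variance a t-test would compute on the 0/1 indicator. It uses the control arm counts from the earlier recipe and only the standard library.

```python
from statistics import variance

n, conversions = 2000, 120
indicator = [1] * conversions + [0] * (n - conversions)

p_hat = conversions / n
bernoulli_var = p_hat * (1 - p_hat)   # closed-form from the rate
sample_var = variance(indicator)      # what a t-test would compute (n - 1 divisor)

print(f"p(1-p)={bernoulli_var:.6f}, s^2={sample_var:.6f}")
print(f"ratio s^2 / p(1-p) = {sample_var / bernoulli_var:.6f}, n/(n-1) = {n / (n - 1):.6f}")
```

The two variances differ only by the factor n/(n-1), which is negligible at this sample size. That is why the p-values match in practice even though the z-test is the principled choice.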
Ignoring the pairing structure in before-and-after designs is the fourth frequent mistake. Running ttest_ind on data where the same user contributes both observations inflates the standard error and costs you significance on real effects. The fix is ttest_rel.
Finally, watch out for hidden non-independence. If your unit of randomization is user but you analyze at the session level, you have inflated effective sample size and the p-values are too small. The fix is to aggregate to the randomization unit first, or use clustering-aware methods — delta method, bootstrap by user, cluster-robust standard errors.
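The "aggregate to the randomization unit first" fix can be sketched as follows. The session rows, user IDs, and arm labels are hypothetical; the point is only that each user ends up as a single observation before any test is run.

```python
from collections import defaultdict

# Hypothetical session-level rows: (user_id, arm, revenue).
sessions = [
    ("u1", "control", 0.0), ("u1", "control", 12.0),
    ("u2", "control", 5.0),
    ("u3", "treatment", 0.0), ("u3", "treatment", 0.0), ("u3", "treatment", 30.0),
    ("u4", "treatment", 8.0),
]

# Collapse to one row per user, because users are the randomization unit.
revenue_by_user = defaultdict(float)
arm_by_user = {}
for user_id, arm, revenue in sessions:
    revenue_by_user[user_id] += revenue
    arm_by_user[user_id] = arm

per_user = {"control": [], "treatment": []}
for user_id, total in revenue_by_user.items():
    per_user[arm_by_user[user_id]].append(total)

print(per_user)
```

Seven session rows collapse to four user-level observations. The per-arm lists are what you would then pass to a Welch's t-test, so the sample size in the standard error reflects users, not sessions.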
FAQ
What is the difference between t-test and z-test in one sentence?
A t-test estimates the standard deviation from the sample and uses Student's t-distribution to account for that extra uncertainty, while a z-test assumes the population standard deviation is known and uses the standard normal as its reference. In practice the t-test is the default for continuous metrics because true population sigma is virtually never known.
Do I need to check normality before running a t-test?
For sample sizes above roughly 30 per group, usually no. The CLT makes the sampling distribution of the mean approximately normal for most metric shapes, as long as the variance is finite and the skew is not extreme. For smaller samples or heavily skewed metrics like revenue with a few whales, prefer Mann-Whitney U or a bootstrap confidence interval. In production analytics the bigger threat to t-test validity is non-independence, not non-normality.
Which test should I use for conversion rates?
Chi-square on the 2x2 contingency table or the equivalent two-proportion z-test. The variance of a Bernoulli is p(1-p), closed-form from the rate, so a z-test is the principled choice. A t-test on the 0/1 indicator works numerically for large N but uses the wrong standard error formula.
What if my data is non-normal and the sample is small?
Switch to a non-parametric test. Mann-Whitney U is the analog of the two-sample t-test and compares ranks rather than means. The Wilcoxon signed-rank test is the analog of the paired t-test. These tests do not require normality and are robust to outliers, at slightly lower power when the data really is normal.
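As a rough illustration of what "compares ranks rather than means" looks like, the sketch below computes the Mann-Whitney U statistic by direct pair counting on two hypothetical samples, one of which contains a single extreme outlier. The helper u_statistic is written for this example and is not a library function.

```python
# Hypothetical small samples; group_a contains one extreme outlier.
group_a = [1, 2, 3, 4, 100]
group_b = [5, 6, 7, 8, 9]

def u_statistic(x, y):
    """Count pairs where x beats y, with ties counted as half."""
    return sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)

u_a = u_statistic(group_a, group_b)
u_b = u_statistic(group_b, group_a)

mean_a = sum(group_a) / len(group_a)
mean_b = sum(group_b) / len(group_b)

print(f"U_a={u_a}, U_b={u_b}, mean_a={mean_a}, mean_b={mean_b}")
```

The outlier pulls the mean of group_a above that of group_b, yet group_a wins only 5 of the 25 pairwise comparisons. A rank-based test sees the second fact, which is what makes it robust to outliers. For an actual p-value, scipy.stats.mannwhitneyu is the usual entry point.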
What is the difference between Student's and Welch's t-test?
Student's assumes the two groups share a single variance, while Welch's estimates each group's variance separately and adjusts the degrees of freedom. In A/B testing the treatment usually shifts both mean and spread, so equal-variance is fragile. Welch's is the recommended default for A/B work, but note that scipy's ttest_ind defaults to equal_var=True (Student's), so you have to pass equal_var=False explicitly.