How to calculate Confidence Interval in SQL

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Why a confidence interval beats a point estimate

Your PM walks over Monday morning: "AOV in April was $50. Should we lower the free-shipping threshold from $75 to $50?" You pull the query, see mean(total) = 50.12, and report it. Two weeks later marketing rolls out the change and revenue per visitor drops 4%. The data was not wrong — $50 was a single number with no honest error bar, and the true April AOV was somewhere in $46-$54 depending on how the long tail of high-ticket orders happened to land.

That is the entire point of a confidence interval. A 95% CI tells your reader the range of values consistent with the data at a stated level of certainty — $50 ± $3. When that range is wider than the business decision can tolerate, you need more data before you act. Confidence intervals are how analysts at Stripe, DoorDash, and Netflix turn raw query outputs into recommendations a director can defend two weeks later.

In SQL interviews, CI questions come up in two flavors: "compute AOV with a 95% CI" (means) and "what is the conversion rate of variant B with error bars" (proportions). The formulas differ, the SQL differs, and confusing them on a whiteboard is a fast way to lose a senior analyst role. We will walk through both, plus Wilson and bootstrap, in working PostgreSQL.

The formula in one minute

For a mean, the standard 95% normal-approximation CI is:

CI = mean ± z * SE
SE = stddev / sqrt(n)

For 95% confidence, z = 1.96. For 90% use z = 1.645. For 99% use z = 2.576. The standard error SE shrinks as sqrt(n) grows, which is why doubling your sample size only narrows the interval by about 30%, not 50%.

For a proportion (conversion rate, opt-in rate, click-through rate), the formula uses the binomial standard error:

CI = p ± z * sqrt(p * (1 - p) / n)

These two formulas — Wald for means, Wald for proportions — cover the vast majority of analyst dashboards. Both assume the sampling distribution is approximately normal, which is true for large n and not-too-extreme p. We will hit the edge cases in the pitfalls section.

CI for means in SQL

Suppose the table orders has columns order_id, user_id, total, status, created_at. Compute the AOV with a 95% CI over the trailing 30 days:

WITH stats AS (
    SELECT
        AVG(total)            AS mean_aov,
        STDDEV_SAMP(total)    AS std_aov,
        COUNT(*)              AS n
    FROM orders
    WHERE status = 'paid'
      AND created_at >= CURRENT_DATE - INTERVAL '30 days'
)
SELECT
    mean_aov,
    std_aov,
    n,
    mean_aov - 1.96 * std_aov / SQRT(n) AS ci_lower,
    mean_aov + 1.96 * std_aov / SQRT(n) AS ci_upper,
    1.96 * std_aov / SQRT(n)             AS ci_radius
FROM stats;

STDDEV_SAMP is the sample standard deviation (divides by n-1), the unbiased estimator when your data is a sample of a larger population — which it always is in production. Some dialects default STDDEV to the population version (STDDEV_POP); for n > 1000 the difference is negligible, but on a 200-row segment it matters. Be explicit. The ci_radius column is useful in dashboards: "AOV: $50.12 ± $1.20 (95% CI)" beats forcing the reader to subtract ci_upper - mean_aov in their head.

For the CI per segment — say, per acquisition channel — add a GROUP BY:

SELECT
    channel,
    AVG(total)                                            AS mean_aov,
    AVG(total) - 1.96 * STDDEV_SAMP(total) / SQRT(COUNT(*)) AS ci_lower,
    AVG(total) + 1.96 * STDDEV_SAMP(total) / SQRT(COUNT(*)) AS ci_upper,
    COUNT(*)                                              AS n
FROM orders
WHERE status = 'paid'
  AND created_at >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY channel
ORDER BY n DESC;

Watch for tiny segments. If n = 12 for the referral channel, the CI explodes and the normal approximation breaks down. Filter out segments with n < 30 or switch to a t-distribution multiplier — interviewers ask about this.

CI for proportions in SQL

For a conversion rate over the trailing 30 days:

WITH stats AS (
    SELECT
        COUNT(*)                                                                AS n,
        COUNT(*) FILTER (WHERE event = 'purchase')                              AS conversions,
        COUNT(*) FILTER (WHERE event = 'purchase')::NUMERIC
            / NULLIF(COUNT(*), 0)                                               AS p
    FROM funnel_events
    WHERE event_date >= CURRENT_DATE - INTERVAL '30 days'
)
SELECT
    p,
    n,
    conversions,
    p - 1.96 * SQRT(p * (1 - p) / n) AS ci_lower,
    p + 1.96 * SQRT(p * (1 - p) / n) AS ci_upper,
    1.96 * SQRT(p * (1 - p) / n)     AS ci_radius
FROM stats;

Cast at least one side of the division to NUMERIC — integer division in PostgreSQL silently returns zero and you will wonder why every CR is 0%. NULLIF(COUNT(*), 0) defends against an empty result set returning a divide-by-zero error.

This Wald CI is fine for n >= 1000 and p between roughly 0.1 and 0.9. Outside that range, switch to Wilson.

Wilson interval for small samples

When the conversion rate is below 1% or above 99%, or n is small, the Wald CI produces nonsense — a lower bound below zero or an upper bound above one. Wilson handles this gracefully:

WITH stats AS (
    SELECT
        COUNT(*)                                  AS n,
        COUNT(*) FILTER (WHERE event = 'purchase') AS k
    FROM funnel_events
    WHERE event_date >= CURRENT_DATE - INTERVAL '30 days'
),
wilson AS (
    SELECT
        n,
        k,
        k::NUMERIC / NULLIF(n, 0)                                AS p,
        (k + 1.96 * 1.96 / 2.0) / (n + 1.96 * 1.96)              AS center,
        1.96 * SQRT(
            (k::NUMERIC * (n - k) / NULLIF(n, 0) + 1.96 * 1.96 / 4.0)
            / NULLIF(n + 1.96 * 1.96, 0)
        ) / (n + 1.96 * 1.96)                                    AS radius
    FROM stats
)
SELECT
    p,
    n,
    center - radius AS wilson_lower,
    center + radius AS wilson_upper
FROM wilson;

The Wilson interval is asymmetric around p — that asymmetry is the feature. For 2/50 = 4% conversions, Wald gives roughly -1.4% to 9.4%, which makes no sense. Wilson gives 1.1% to 13.5%, which is honest. If you remember one proportion CI variant in interviews, remember Wilson.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Bootstrap alternative

The normal-approximation CI assumes the sample mean behaves normally. Revenue per user does not — it has a heavy right tail with whales pulling the mean up. Bootstrap is the right tool: resample your data with replacement many times, compute the statistic on each resample, and read off the 2.5th and 97.5th percentiles of the resulting distribution.

True bootstrap in SQL is awkward because most engines lack random sampling with replacement. You can fake it with generate_series and random():

WITH base AS (
    SELECT total
    FROM orders
    WHERE status = 'paid'
      AND created_at >= CURRENT_DATE - INTERVAL '30 days'
),
resamples AS (
    SELECT
        b AS bootstrap_id,
        (SELECT AVG(total)
         FROM base
         TABLESAMPLE BERNOULLI(100) REPEATABLE(b)
        ) AS resampled_mean
    FROM generate_series(1, 1000) AS b
)
SELECT
    PERCENTILE_CONT(0.025) WITHIN GROUP (ORDER BY resampled_mean) AS ci_lower,
    PERCENTILE_CONT(0.975) WITHIN GROUP (ORDER BY resampled_mean) AS ci_upper
FROM resamples;

In practice, run bootstrap outside SQL — Python with numpy.random.choice or R with boot is faster and more correct. The SQL version above is a teaching aid, not a production pattern. For a deeper walkthrough see bootstrap CI in SQL.

Common pitfalls

The most common interview failure is applying a normal-approximation CI to skewed revenue data. The mean of 10,000 purchases looks roughly normal thanks to the central limit theorem, but the CI on individual purchase amounts is unreliable when one whale represents 8% of revenue. If your standard deviation is two or three times the mean, that is a red flag — switch to bootstrap or report median with a percentile-based CI. Junior candidates rarely look at the std-to-mean ratio before quoting an interval.

The second trap is small-sample CIs. For n < 30, the t-distribution multiplier is materially larger than 1.96 — at n = 10 it is roughly 2.26, widening the interval by 15%. SQL engines have no built-in t-quantile, so analysts hard-code 1.96 and produce an interval that is too narrow. Compute the t-multiplier externally, or refuse to produce a CI for n < 30 and report the raw count.

Third, confusing confidence with credible intervals. A frequentist 95% CI says: "if we ran this experiment 100 times, roughly 95 of the calculated intervals would contain the true parameter". It does not say "there is a 95% probability the true value lies in this specific interval" — that is a Bayesian credible interval and requires a prior. Be precise on whiteboards, pragmatic in dashboards.

Fourth, ignoring correlation. If "30 days of orders" includes 100 orders per day from the same user, those are not independent — they are autocorrelated. The naive n = 3000 overstates the effective sample size, and the CI is too narrow. For time-series and clustered data, aggregate to one observation per cluster before computing the CI, or use cluster-robust standard errors outside SQL.

Fifth, specific to A/B testing: the CI for a difference of two means is not the same as the CIs of each group plotted side by side. The SE of the difference is sqrt(SE1^2 + SE2^2), not the larger of the two SEs. Two overlapping CIs can still hide a significant difference. If you are comparing groups, compute the CI of the delta directly.

Optimization tips

Pre-aggregate. If you compute AOV-with-CI for the same date window across ten segments, your query repeats WHERE status = 'paid' AND created_at >= ... ten times. Materialize a daily orders_daily_agg table with sum(total), sum(total*total), and count(*) per day per segment, then compute mean and stddev for any window with one scan:

SELECT
    SUM(sum_total) / NULLIF(SUM(n), 0)                          AS mean_aov,
    SQRT(
        (SUM(sum_total_sq) - POWER(SUM(sum_total), 2) / SUM(n))
        / NULLIF(SUM(n) - 1, 0)
    )                                                            AS std_aov,
    SUM(n)                                                       AS n
FROM orders_daily_agg
WHERE day >= CURRENT_DATE - INTERVAL '30 days';

The sum(x^2) trick recovers stddev from the aggregated table without rescanning raw orders. This shaves CI dashboards from a 30-second scan over millions of orders to a 50-ms lookup — Snowflake and Databricks queries that timed out return instantly.

Index created_at and any segment column you slice by. For dashboards that recompute on page load, cache the CI numbers in a materialized view refreshed hourly. The CI is cheap once you have the summary stats; the expensive part is always the scan over raw events.

If you want to drill questions like this every day, NAILDD is launching with 500+ SQL problems built around exactly this kind of analyst interview pattern.

FAQ

Should I report 90%, 95%, or 99% CI?

95% is the default everywhere — papers, dashboards, A/B test tooling. Use 90% when you want a tighter interval and your stakeholder accepts more false positives (early-stage product decisions, fast iteration). Use 99% for high-stakes decisions — pricing changes, fraud thresholds, regulated systems. The trade-off is mechanical: 99% widens the interval by about 30% versus 95%, so you need roughly 70% more data to keep the same precision.

What does a "95% confidence interval" actually mean?

The strict frequentist reading: if you repeated the experiment 100 times and computed a 95% CI each time, roughly 95 of those intervals would contain the true parameter. It is a statement about the long-run behavior of the procedure, not about this specific interval. The common-language reading — "there is a 95% chance the true value is in this interval" — is the Bayesian credible interval. Name the frequentist version in interviews; write what your audience understands in dashboards.

If two CIs do not overlap, is the difference significant?

Roughly yes, but it is a one-way test. Non-overlapping 95% CIs imply the difference is significant well below the 5% level. The converse is false — two CIs can overlap and the difference can still be significant. For a proper comparison, run a z-test or t-test on the delta. The overlap heuristic is fine for storytelling, not for a doc.

How do I compute a CI for a median?

The bootstrap percentile method. Resample your data 1,000 to 10,000 times, compute the median of each resample, and report the 2.5th and 97.5th percentiles. There is a closed-form CI for medians from the binomial distribution, but bootstrap generalizes to any percentile, trimmed mean, or robust statistic.

Narrow or wide CI — which is "better"?

Narrow means confident, wide means not. For "did our experiment move the metric?" you want a narrow CI that excludes zero. For "what is the true churn rate of new signups?" you want narrow because product decisions depend on the exact number. The lever you control is n: doubling sample size narrows the CI by about 30%. If your CI is too wide for the decision, the answer is "collect more data", not "loosen the confidence level".

When should I prefer Wilson over Wald for proportions?

Whenever p is below 5% or above 95%, or whenever n is below 100. Wald can produce bounds outside [0, 1] and undercovers badly at the tails. Wilson stays inside [0, 1], has better coverage at extreme proportions, and costs one extra line of SQL. Default to Wilson in interviews; mention Wald as the simpler alternative for well-behaved cases.