Chi-square test in SQL
Contents:
Why chi-square matters
The chi-square test (chi^2) is the workhorse statistic for binary A/B experiments. You ship a checkout redesign at Stripe, you split traffic 50/50, and a week later your PM walks over with a question: variant B converts at 14.5%, control at 12.0%, is that a real lift or are we looking at noise? Chi-square answers that question. It also generalizes cleanly to three or more variants, three or more categories of outcome, and even segment-by-variant interaction tests.
The math is older than the SaaS industry and the implementation in SQL is straightforward once you have built the contingency table. The trap is that most analysts copy a formula off Wikipedia, forget to cast to numeric, get integer division silently, and ship a dashboard that always reports "not significant." That is the bug we will debug in this post.
A realistic interview question at Airbnb, DoorDash, or Linear goes like this: "Here is an ab_experiment_users table with one row per user, columns experiment_id, variant, converted. Write a single SQL query that returns the chi-square statistic and a yes/no verdict at alpha equals 0.05." This guide gives you the recipe and explains every step.
The contingency table
The starting point for any chi-square test is a contingency table. For a two-variant A/B experiment on a binary outcome the table is two-by-two:
| Converted | Did not convert | |
|---|---|---|
| Variant A (control) | 120 | 880 |
| Variant B (treatment) | 145 | 855 |
Each row is a variant, each column is an outcome state. The total of every row, every column, and the grand total are needed for the calculation. In SQL you build this table with a single GROUP BY variant query that pivots the boolean outcome into two columns with conditional sums.
You will see analysts skip the explicit contingency-table step and try to compute the chi-square directly off the raw event table. That works for a 2x2 case if you are careful, but it falls apart the moment you need three categories or a Yates correction. Always materialize the table first, then compute the statistic on the materialized cells.
The formula
The chi-square statistic is the sum of squared standardized residuals across all cells of the contingency table:
chi^2 = sum_i (O_i - E_i)^2 / E_iO_i is the observed count in cell i. E_i is the expected count under the null hypothesis of independence between row and column. For any cell, the expected count is the product of its row total and column total divided by the grand total:
E_i = (row_total_i * col_total_i) / grand_totalDegrees of freedom are (rows - 1) * (columns - 1). For a 2x2 table that is df = 1. The chi-square statistic follows a chi-square distribution with df degrees of freedom under the null. You reject the null when the observed statistic exceeds the critical value for your chosen alpha:
df = 1, alpha = 0.05 -> critical value 3.841
df = 1, alpha = 0.01 -> critical value 6.635
df = 1, alpha = 0.001 -> critical value 10.828Postgres does not ship a chi-square CDF, so most teams compare the statistic against the critical value inside SQL and compute the exact p-value outside SQL with scipy.stats.chi2.sf or an equivalent R function.
Chi-square for an A/B test
Here is the full query for a two-variant binary experiment. It builds the contingency table, computes expected counts, and returns the statistic plus a verdict in one pass:
WITH cells AS (
SELECT
variant,
SUM(CASE WHEN converted THEN 1 ELSE 0 END) AS conv,
SUM(CASE WHEN NOT converted THEN 1 ELSE 0 END) AS non_conv,
COUNT(*) AS row_total
FROM ab_experiment_users
WHERE experiment_id = 'checkout_v2'
GROUP BY variant
),
totals AS (
SELECT
SUM(conv) AS total_conv,
SUM(non_conv) AS total_non_conv,
SUM(row_total) AS grand_total
FROM cells
),
expected AS (
SELECT
c.variant,
c.conv AS o_conv,
c.non_conv AS o_non_conv,
c.row_total::NUMERIC * t.total_conv / NULLIF(t.grand_total, 0) AS e_conv,
c.row_total::NUMERIC * t.total_non_conv / NULLIF(t.grand_total, 0) AS e_non_conv
FROM cells c
CROSS JOIN totals t
)
SELECT
SUM(POWER(o_conv - e_conv, 2) / NULLIF(e_conv, 0))
+ SUM(POWER(o_non_conv - e_non_conv, 2) / NULLIF(e_non_conv, 0)) AS chi_square,
1 AS df,
CASE
WHEN SUM(POWER(o_conv - e_conv, 2) / NULLIF(e_conv, 0))
+ SUM(POWER(o_non_conv - e_non_conv, 2) / NULLIF(e_non_conv, 0)) > 3.841
THEN 'significant at 0.05'
ELSE 'NOT significant'
END AS verdict
FROM expected;A few things to call out. The ::NUMERIC cast on row_total is load-bearing. Without it, Postgres performs integer division and silently truncates expected counts to zero for small cells, which blows up the statistic or returns NaN. The NULLIF(..., 0) guards against an empty variant — if you ramp a holdout to zero percent you will still get a clean query result. The verdict column is convenience for dashboards, but for any serious decision you should compute the exact p-value outside SQL.
Plug in the contingency table from the previous section and the query returns chi_square = 2.96, df = 1, verdict not significant. That is the right answer — a 2.5 percentage-point lift on a thousand-user-per-arm experiment is not enough to call.
More than two variants
For a multi-variant experiment with k variants the contingency table is k x 2 and degrees of freedom become k - 1. The math is identical, you just sum over more cells:
WITH cells AS (
SELECT
variant,
SUM(CASE WHEN converted THEN 1 ELSE 0 END) AS conv,
SUM(CASE WHEN NOT converted THEN 1 ELSE 0 END) AS non_conv,
COUNT(*) AS row_total
FROM ab_experiment_users
WHERE experiment_id = 'multivariate_v1'
GROUP BY variant
),
totals AS (
SELECT
COUNT(*) AS k,
SUM(conv) AS total_conv,
SUM(non_conv) AS total_non_conv,
SUM(row_total) AS grand_total
FROM cells
),
chi AS (
SELECT
SUM(POWER(c.conv - c.row_total::NUMERIC * t.total_conv / t.grand_total, 2)
/ NULLIF(c.row_total::NUMERIC * t.total_conv / t.grand_total, 0))
+ SUM(POWER(c.non_conv - c.row_total::NUMERIC * t.total_non_conv / t.grand_total, 2)
/ NULLIF(c.row_total::NUMERIC * t.total_non_conv / t.grand_total, 0)) AS chi_square,
MAX(t.k) - 1 AS df
FROM cells c
CROSS JOIN totals t
)
SELECT chi_square, df FROM chi;The query returns one chi-square value across all variants. A significant result tells you "at least one variant differs from the rest" but not which one. To localize the difference you run pairwise chi-square tests and adjust for multiple comparisons. For that step the exact p-value matters, so move to Python or R and feed chi_square plus df into scipy.stats.chi2.sf(chi_square, df).
Common pitfalls
The first trap is applying chi-square when expected cell counts are very small. The chi-square distribution is an asymptotic approximation. Once any expected count drops below five, the approximation breaks and p-values become unreliable. The standard fix is Fisher's exact test, which computes the exact hypergeometric probability and works for any cell size. Switch to Fisher whenever you see an expected count under five — most stats libraries flag this automatically, but in raw SQL you will not get a warning.
The second trap is integer division. Postgres treats bigint / bigint as integer division and truncates. Your expected count row_total * total_conv / grand_total will silently return zero for cells where total_conv is smaller than grand_total divided by row_total, and then the chi-square calculation either explodes via division by NULLIF guards or produces nonsense large numbers. Always cast to NUMERIC before the divide. The same trap shows up in BigQuery (INT64 math) and Snowflake (less aggressive but still present in some legacy column types).
The third trap is treating dependent observations as independent. Chi-square assumes one observation per row and that rows are independent. If your dataset has multiple events per user — event_count rows from PostHog, multiple sessions per visitor — and you compute conversion at the event level, you are double-counting heavy users and inflating the statistic. Aggregate to one row per user before the test, on the user identifier you randomized at.
The fourth trap is the false equivalence with z-tests. For a 2x2 contingency table the chi-square test and the two-proportion z-test are mathematically identical — chi-square equals z squared and they produce the same p-value. Analysts sometimes report both as if they are corroborating evidence; they are not, they are the same test. If you want a sanity check use Fisher's exact, not z.
The fifth trap is applying Yates' continuity correction by reflex. Yates subtracts 0.5 from each |O - E| before squaring. It was designed for very small 2x2 tables to reduce the over-estimation bias of the continuous approximation. On modern experiment sizes (anything above a few hundred per cell) Yates is overly conservative and depresses your statistic, costing you statistical power. Drop it unless your cell counts are tiny and you know why you need it.
Optimization tips
Chi-square in SQL is bounded by the cost of the underlying GROUP BY variant, not by the arithmetic. On Snowflake or BigQuery, partition or cluster ab_experiment_users by experiment_id so the WHERE clause prunes to a single micropartition. On Postgres, a composite index on (experiment_id, variant) plus a partial index on WHERE converted = true keeps the conditional sums cheap even on hundred-million-row event tables.
If you run chi-square repeatedly for dashboards or alerts, materialize the cells CTE as a view or a Databricks delta table refreshed nightly. The contingency table is small — at most a few rows per experiment — so the statistic itself is a millisecond computation off the cached cells. Avoid recomputing the underlying counts on every page load.
For very large experiments where you want both the statistic and the exact p-value inside SQL, push the chi-square value to a Python UDF: BigQuery, Snowflake, and Databricks all support inline scipy. The UDF takes chi_square and df, calls scipy.stats.chi2.sf, and returns the p-value as a column. This removes the round-trip to a notebook without sacrificing precision.
Related reading
- A/B testing peeking mistake
- Bonferroni correction in SQL
- Bootstrap confidence intervals in SQL
- SQL window functions interview questions
If you want to drill A/B-testing SQL questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.
FAQ
When do I use chi-square instead of a t-test?
Chi-square is for categorical outcomes — converted yes or no, plan tier picked, churned or retained. A t-test is for continuous outcomes — revenue per user, time on page, latency. If your metric is a proportion you want chi-square or the equivalent two-proportion z-test. If your metric is a mean of a continuous variable you want a t-test or its non-parametric cousin Mann-Whitney. Mixing them is a common interview red flag.
What is the minimum cell count for chi-square?
The rule of thumb is that every expected count should be at least five. Some textbooks soften this to "no more than 20% of cells below five and none below one." When you violate the rule, switch to Fisher's exact test, which has no minimum cell-count requirement because it computes the exact hypergeometric distribution rather than approximating with the chi-square curve. Most analysts hit this constraint with rare-event metrics like fraud flag rate or churn at week one.
Where do I get the p-value if Postgres has no chi-square CDF?
Three options. First, compare against the critical value table — 3.841 for df=1 at alpha=0.05 is the one most analysts memorize. Second, run scipy outside SQL: scipy.stats.chi2.sf(chi_square, df) returns the survival function which equals the right-tail p-value. Third, use a Python UDF inside BigQuery, Snowflake, or Databricks to compute the p-value as a query column. For dashboards the UDF approach is cleanest because the result stays in the same data layer as the rest of your experiment metrics.
Is Yates' continuity correction mandatory?
Only for very small 2x2 tables. The correction adjusts for the fact that chi-square approximates a discrete distribution with a continuous one, which over-estimates the statistic when cells are tiny. For modern experiment sizes — anything above a few hundred per cell — Yates is too conservative and shrinks your effective power. Most A/B platforms (GrowthBook, Statsig, Eppo) skip Yates by default and so should you, unless an interviewer specifically asks about it.
How do I handle multiple comparisons across more than two variants?
Run pairwise chi-square tests between each pair of variants and then adjust the p-values for the family-wise error rate. The simplest correction is Bonferroni: divide alpha by the number of comparisons. A more powerful alternative is Holm-Bonferroni or Benjamini-Hochberg (FDR control) if you have many comparisons. For a three-variant test that means three pairwise comparisons and alpha at 0.05/3 = 0.0167 under Bonferroni. The omnibus chi-square across all variants is the right first step — only run pairwise tests when the omnibus is significant.
Can I use chi-square for non-binary categorical outcomes?
Yes. Chi-square generalizes to any r x c contingency table — rows are experiment variants, columns are categories of outcome. If you are testing whether a checkout redesign changes the mix of payment methods picked, the columns are credit-card, debit-card, PayPal, Apple Pay, and chi-square tells you whether the variant changes the distribution. Degrees of freedom become (r - 1) * (c - 1). The same expected-count and integer-division pitfalls apply.