How to calculate conversion uplift in SQL
Contents:
What conversion uplift is and why it matters
Conversion uplift is the difference in conversion rate between your treatment and control variants in an A/B test. It is the number that turns an experiment into a decision: ship, kill, or iterate. A p-value tells you whether the difference is unlikely to come from random noise; uplift tells you how much the change is worth in business terms. A tiny p-value with a 0.02 percentage-point lift on millions of checkout users is still a decision; a noisy 30% relative lift on 800 users is not.
Picture the scenario most analysts at Stripe, DoorDash, or Airbnb live once a week. A PM pings you Friday afternoon. They want to know whether the new pricing page beat the old one, what the lift looked like for US mobile users, and whether to roll out globally. You have a few hours, a shared Snowflake warehouse, and a CEO meeting Monday. You need a SQL query that matches your experimentation platform, broken down the way the PM asked, and you need to defend each line of it when the engineering manager pushes back.
This guide walks through the SQL you would paste into a Snowflake or Postgres console, then layers on confidence intervals, segment splits, and the five mistakes that show up in almost every interview answer.
The SQL formula
Conversion uplift has two flavors and you should compute both. Absolute uplift is the raw difference between the two conversion rates, expressed in percentage points. Relative uplift is the absolute uplift divided by the control conversion rate, expressed as a percentage. The same lift looks very different depending on which one you report, so showing both is how you stay honest.
absolute_uplift_pp = conv_treatment - conv_control
relative_uplift_pct = (conv_treatment - conv_control) / conv_control * 100Worked example: control converts at 5.0%, treatment at 5.5%. Absolute uplift is 0.5 percentage points. Relative uplift is 10%. Both numbers describe the same experiment; both belong in the report.
A baseline query that returns the headline numbers for one test. It assumes an ab_test table with one row per user-test assignment, columns test_id, user_id, variant, and converted. Adjust to your warehouse.
WITH stats AS (
SELECT
variant,
COUNT(DISTINCT user_id) AS users,
COUNT(DISTINCT user_id) FILTER (WHERE converted) AS converters,
COUNT(DISTINCT user_id) FILTER (WHERE converted)::NUMERIC * 100
/ COUNT(DISTINCT user_id) AS conv_pct
FROM ab_test
WHERE test_id = 123
GROUP BY variant
)
SELECT
MAX(CASE WHEN variant = 'A' THEN users END) AS users_a,
MAX(CASE WHEN variant = 'A' THEN conv_pct END) AS conv_a_pct,
MAX(CASE WHEN variant = 'B' THEN users END) AS users_b,
MAX(CASE WHEN variant = 'B' THEN conv_pct END) AS conv_b_pct,
MAX(CASE WHEN variant = 'B' THEN conv_pct END)
- MAX(CASE WHEN variant = 'A' THEN conv_pct END) AS absolute_uplift_pp,
(MAX(CASE WHEN variant = 'B' THEN conv_pct END)
- MAX(CASE WHEN variant = 'A' THEN conv_pct END))
/ NULLIF(MAX(CASE WHEN variant = 'A' THEN conv_pct END), 0) * 100
AS relative_uplift_pct
FROM stats;COUNT(DISTINCT user_id) is the safe choice when a user can appear more than once for the same variant. If your assignment table is already one row per user, swap it for COUNT(*) and the query runs faster. The NULLIF guard prevents divide-by-zero when control has no converters, which sounds unlikely until you run it on a new flow on day one.
Absolute vs relative uplift
The reason both numbers matter is that one of them is almost always misleading on its own. Compare three experiments that all moved conversion by exactly one percentage point in absolute terms:
| Baseline | Treatment | Absolute | Relative |
|---|---|---|---|
| 5% | 6% | 1 pp | 20% |
| 50% | 51% | 1 pp | 2% |
| 1% | 1.5% | 0.5 pp | 50% |
The first row is a normal funnel improvement; the second is a tiny relative lift on a flow that was already converting well; the third looks enormous in relative terms but represents very little extra revenue. Reporting both forces the right conversation.
Rule of thumb on which to lead with. If baseline conversion is below 5%, lead with absolute, because relative will be noisy and easy to overstate. If baseline is above 20%, lead with relative, because absolute will look unimpressive even when user impact is real. In the middle, show both side by side.
Confidence interval for uplift
Reporting a point estimate without a confidence interval is the fastest way to lose credibility on a launch decision. A 0.5 percentage-point uplift means something very different when the 95% interval is [0.45, 0.55] versus when it is [-0.2, 1.2]. The first is a clean win; the second is a coin flip. The standard approach for a difference of two proportions uses the normal approximation, which works well when both groups have at least a few hundred converters.
WITH stats AS (
SELECT
COUNT(*) FILTER (WHERE variant = 'A') AS n_a,
COUNT(*) FILTER (WHERE variant = 'A' AND converted) AS x_a,
COUNT(*) FILTER (WHERE variant = 'B') AS n_b,
COUNT(*) FILTER (WHERE variant = 'B' AND converted) AS x_b
FROM ab_test
WHERE test_id = 123
)
SELECT
x_a::NUMERIC / n_a AS p_a,
x_b::NUMERIC / n_b AS p_b,
x_b::NUMERIC / n_b - x_a::NUMERIC / n_a AS uplift_abs,
SQRT(
(x_a::NUMERIC / n_a) * (1 - x_a::NUMERIC / n_a) / n_a
+ (x_b::NUMERIC / n_b) * (1 - x_b::NUMERIC / n_b) / n_b
) AS se,
(x_b::NUMERIC / n_b - x_a::NUMERIC / n_a) - 1.96 * SQRT(
(x_a::NUMERIC / n_a) * (1 - x_a::NUMERIC / n_a) / n_a
+ (x_b::NUMERIC / n_b) * (1 - x_b::NUMERIC / n_b) / n_b
) AS ci_low,
(x_b::NUMERIC / n_b - x_a::NUMERIC / n_a) + 1.96 * SQRT(
(x_a::NUMERIC / n_a) * (1 - x_a::NUMERIC / n_a) / n_a
+ (x_b::NUMERIC / n_b) * (1 - x_b::NUMERIC / n_b) / n_b
) AS ci_high
FROM stats;Interpretation: if ci_low and ci_high are both above zero, treatment is significantly better at the 5% level. If they straddle zero, you cannot reject the null at that level. The width of the interval is your power story: a tight interval near zero means a well-powered test with a small effect; a wide interval means you need more sample. See the dedicated post on confidence intervals in SQL for the math.
Uplift by segment
Top-line uplift hides the most important question a PM will ask: is this lift uniform across users, or is one segment dragging the average up while another segment is silently being hurt? A query that splits by country, device, or acquisition channel is usually the difference between a clean launch and a rollback after a week of support tickets.
SELECT
u.country,
COUNT(*) FILTER (WHERE a.variant = 'A' AND a.converted)::NUMERIC * 100
/ NULLIF(COUNT(*) FILTER (WHERE a.variant = 'A'), 0) AS conv_a_pct,
COUNT(*) FILTER (WHERE a.variant = 'B' AND a.converted)::NUMERIC * 100
/ NULLIF(COUNT(*) FILTER (WHERE a.variant = 'B'), 0) AS conv_b_pct,
COUNT(*) FILTER (WHERE a.variant = 'B' AND a.converted)::NUMERIC * 100
/ NULLIF(COUNT(*) FILTER (WHERE a.variant = 'B'), 0)
- COUNT(*) FILTER (WHERE a.variant = 'A' AND a.converted)::NUMERIC * 100
/ NULLIF(COUNT(*) FILTER (WHERE a.variant = 'A'), 0) AS abs_uplift_pp
FROM ab_test a
JOIN users u ON u.user_id = a.user_id
WHERE a.test_id = 123
GROUP BY u.country
ORDER BY abs_uplift_pp DESC;Two things to watch. First, small segments swing wildly; cap interpretation to segments with at least a few thousand users per arm. Second, a negative-uplift segment while overall is positive is the conversation to have with the PM before launch — not automatic kill, but it means a country-specific rollout or a follow-up experiment.
Common pitfalls
When teams calculate conversion uplift for the first time, the most common mistake is reporting only the relative number because it sounds bigger. A 20% relative uplift on a 1% baseline is a 0.2 percentage-point change, barely above the daily noise floor on most flows. The fix is to always report both numbers, and to lead with absolute when baseline is low.
A second trap is pooling every user into a single estimate and ignoring segments. You can ship a feature that improves overall conversion by 5% while degrading conversion in your highest-LTV segment by 8%, and the rollback meeting will be brutal. Run the segment query alongside the headline query, and flag any segment-level uplift that is more than two standard errors away from the overall number.
A third pitfall is reporting a point estimate without a confidence interval. The PM will treat 0.50 percentage points the same as 0.45 percentage points, and the engineering manager will not push back unless they see the bounds. Always run the CI query, and if the interval crosses zero, say so in the first sentence of the writeup.
A fourth mistake is calling a winner during a novelty effect. If the new variant is visually distinct, users click on it just because it is new, and the lift decays over the next two to four weeks. Track uplift over time, not just as a single number at the end, and require at least two weekly cycles before declaring victory on anything UI-related. The A/B testing peeking mistake post covers the related sin of calling the test as soon as p drops below 0.05.
A fifth pitfall is using different denominators for the two arms. You compute conversion for arm A as converters per assigned user, but for arm B you accidentally compute converters per session because of a join that fanned out. The numerator looks the same shape but the units differ, and the uplift is meaningless. Make the denominator definition a named CTE and reuse it for both arms.
Optimization tips
If your assignment table has tens of millions of rows per test, the headline query gets expensive on Snowflake or Redshift. Cluster or partition the table on test_id so each test's query only scans the relevant micro-partitions. On Snowflake a CLUSTER BY (test_id) can drop query time by an order of magnitude for active experiments.
Second, replace COUNT(DISTINCT user_id) with COUNT(*) if you can guarantee one row per user per test in your pipeline. Distinct counting forces a sort or hash on user IDs, which is much more expensive than a streaming count. Bake the deduplication into the ETL job that loads the assignment table once, instead of paying for it on every analyst query.
Third, materialize a daily summary table with one row per (test_id, variant, day, segment) storing users and converters. Your headline query becomes an aggregation over hundreds of rows instead of millions, so dashboards stay responsive as the test ramps. On Databricks or BigQuery, an incremental materialized view that refreshes every fifteen minutes takes this further.
Related reading
- How to calculate confidence interval in SQL
- How to calculate chi-square test in SQL
- How to calculate bootstrap CI in SQL
- A/B testing peeking mistake
- SQL window functions interview questions
If you want to drill A/B testing and SQL questions like this every day, NAILDD is launching with 500+ SQL problems built around exactly these patterns.
FAQ
Should I report absolute or relative uplift?
Both, every time. Relative is the number that ends up in the launch announcement because it sounds bigger and is easier to compare across products. Absolute is what finance should use to decide whether to ship, because it maps directly to the count of extra conversions and the dollar impact. If you only have room for one, lead with absolute when baseline is low and relative when baseline is high.
What if the confidence interval crosses zero?
You cannot reject the null at the conventional 5% level, which usually means do not ship on the strength of this test alone. That said, "not significant" is not the same as "no effect" — a wide interval often means the test was underpowered. Look at the width of the interval relative to the effect size you would care about, and decide whether to ship, kill, or rerun with more sample.
What does a negative uplift mean?
Treatment is hurting conversion. Before killing the feature, sanity-check the assignment logic, look for a bug that breaks the new variant for a subset of users, and split by segment to see whether the damage is concentrated. If the negative effect is real and uniform, kill it; if it is concentrated in one segment, the feature may work for most users but break for a specific browser or country.
Why does uplift decay over time?
The two most common causes are novelty effect and seasonality. Novelty is the bump from anything new on the screen and typically fades within two to four weeks. Seasonality is the test catching a holiday weekend or a marketing push in one arm but not the other. The fix for both is to track uplift weekly and require at least two full weekly cycles before declaring the test final.
My segments contradict the overall uplift — what is going on?
Simpson's paradox. The mix of segments in the two arms is slightly different — often a randomization bug or a staggered rollout — and the segment-level effect points one direction while the pooled effect points the other. Always check segments before announcing a winner; if you see the paradox, recompute uplift with a covariate-adjusted estimator or rerun the test with stricter randomization.