May 18, 2026·13 min read

How to A/B test product pricing

Prep A/B testing and statistics

300+ questions on experiment design, sample size, p-values, and pitfalls.

Contents:

Why pricing is the highest-leverage experiment you will ever run
What you are actually testing
Designing the experiment
Which metrics to track
Ethics and legal guardrails
Common pitfalls
Related reading
FAQ

Why pricing is the highest-leverage experiment you will ever run

Price is the single biggest lever on profit a product team controls. A 10% price increase that keeps more than about 91% of buyers still wins on revenue, and a 5% drop that doubles conversion can change the trajectory of a company. The catch is that demand elasticity is non-linear, and reactions to price depend on positioning, the moment in the customer journey, and the way the number is framed. Picking a price on a hunch is how teams quietly leave a large share of their annual revenue on the table.

That is why mature product teams ship pricing changes through experiments. A pricing A/B test is a normal A/B test with one twist: the primary metric is not conversion, it is revenue per user. You can raise the price, lose conversions, and still come out clearly ahead on dollars. You can also keep conversion flat and burn long-term LTV because the new buyers churn faster. Picking the wrong primary metric is the most expensive mistake in this category.
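
The "lose conversions and still come out ahead" claim is simple arithmetic, and a minimal sketch makes the break-even point explicit. It assumes revenue per visitor is just price times conversion and ignores retention and refunds; the function names and the example numbers are hypothetical.

```python
def breakeven_conversion_ratio(old_price: float, new_price: float) -> float:
    """Fraction of baseline conversion the new price must keep so that
    revenue per visitor (price * conversion) does not fall."""
    if old_price <= 0 or new_price <= 0:
        raise ValueError("prices must be positive")
    return old_price / new_price


def revenue_per_visitor(price: float, conversion: float) -> float:
    """First-purchase revenue per visitor; ignores retention and refunds."""
    return price * conversion


# Hypothetical numbers: a 10 -> 12 dollar increase on a 4% baseline conversion.
ratio = breakeven_conversion_ratio(10.0, 12.0)
print(f"new price must keep at least {ratio:.1%} of baseline conversion")
print(revenue_per_visitor(10.0, 0.040), revenue_per_visitor(12.0, 0.035))
```

This only covers the first purchase. Whether the buyers at the higher price churn faster is exactly what the LTV guardrail is for.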

The other thing to flag up front is ethics. Showing the same price to all comparable users is fine. Showing different prices to indistinguishable users without telling them is a gray zone that Amazon learned about the hard way in 2000, when customers comparing notes online noticed different DVD prices for the same titles; the backlash made national news, and Amazon ended the test and refunded the difference. Pricing tests carry reputation risk that UI tests do not, and the playbook below bakes in the guardrails Stripe, Netflix, and DoorDash use to keep that risk near zero.

What you are actually testing

The unit you call "the pricing test" rarely turns out to be a clean change to one number. The biggest wins almost always come from changing the structure around the number, not the digit itself. Before you wire up the experiment, write down which of these you are varying — each has different sample-size math behind it.

The simple version is the headline number: 10 dollars versus 12 versus 15. Useful, but often not the biggest mover. Next is the tier structure — three tiers versus two, or a premium tier added above the existing top. Then the billing period, a sneaky favorite at SaaS companies because annual versus monthly default can shift average contract value by 40 to 60 percent. Anchors matter too: a strike-through price changes perceived value without changing the dollars you collect. Framing works the same way — "less than a dollar a day" reads differently from "29 dollars per month" even though the math is identical.

Two structural variables that get under-tested deserve attention. The first is the billing cadence discount — annual at 20% off versus monthly at full price. The second is the payment-method stack: card-only versus card plus Apple Pay plus Google Pay plus a regional method like SEPA or iDEAL. Adding Apple Pay alone can move mobile conversion by 5 to 10 percent, which teams routinely misattribute to price. If you are testing more than one of these at once, write the attribution rule down before launch.

Designing the experiment

Run the test on new users only, unless you have a very specific reason not to. Existing customers see a price change as a broken promise, and the retention damage usually swamps any incremental revenue. The standard setup at Netflix, Stripe, and Notion is the same: new visitors are bucketed at first paywall view, the assignment is sticky, and existing paid users keep the unchanged control price until a separate, explicitly communicated change ships.
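
One common way to get sticky bucketing is to hash the user ID together with the experiment name. The sketch below is an assumption about how that could look, not a description of any named company's system; `assign_variant`, the variant labels, and the weights are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str,
                   variants=("control", "price_12"),
                   weights=(0.5, 0.5)) -> str:
    """Deterministically map a user to a variant.

    Hashing experiment + user_id returns the same answer on every call,
    so a user never flips between prices as long as the weights are fixed.
    """
    if len(variants) != len(weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must match variants and sum to 1")
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it to [0, 1].
    point = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for variant, weight in zip(variants, weights):
        cumulative += weight
        if point < cumulative:
            return variant
    return variants[-1]  # guards against floating-point rounding


print(assign_variant("user-123", "pricing_v3"))
```

The hash alone is only sticky while the weights stay fixed. If the split changes during a ramp, the assignment written to the assignment table at first paywall view should take precedence over a fresh hash lookup.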

The hypothesis should be specific and falsifiable. "Raising the price from 10 to 12 dollars will lift ARPU by at least 8% without dropping 30-day retention by more than 1 percentage point" is useful. "Pricing change will improve revenue" is not. Nailing down the exact bar is how you pre-register the primary metric, the guardrail, and the minimum effect size.
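
A pre-registered hypothesis like the one above can be written down as a small, checkable ship rule. This is a sketch under stated assumptions: the `PreRegistration` fields, the `evaluate` function, and the choice to also require the confidence interval for the lift to exclude zero are illustrative, not a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreRegistration:
    primary_metric: str
    min_relative_lift: float      # 0.08 means "at least 8%"
    guardrail_metric: str
    max_guardrail_drop_pp: float  # allowed drop, in percentage points


def evaluate(plan, lift, lift_ci_low, guardrail_control, guardrail_variant):
    """Return (ship, reasons) for observed results against the plan."""
    reasons = []
    if lift < plan.min_relative_lift:
        reasons.append(f"{plan.primary_metric} lift is below the pre-registered bar")
    if lift_ci_low <= 0:
        reasons.append("confidence interval for the lift includes zero")
    drop_pp = (guardrail_control - guardrail_variant) * 100
    if drop_pp > plan.max_guardrail_drop_pp:
        reasons.append(f"{plan.guardrail_metric} dropped by {drop_pp:.1f} pp")
    return (not reasons, reasons)


plan = PreRegistration("arpu_30d", 0.08, "retention_30d", 1.0)
# Hypothetical readout: +10% ARPU, CI lower bound +2%, retention 60.0% -> 59.5%.
print(evaluate(plan, 0.10, 0.02, 0.600, 0.595))
```

Committing the plan object before launch makes it harder to move the bar after the data arrives.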

Sample size is where pricing tests differ most from UI tests. Revenue per user has a long, skewed tail — most users pay zero, a few pay a lot, and the variance is dominated by the right tail. Plug historical revenue-per-user variance into a sample-size calculator and you typically need 5 to 20 times the traffic of an equivalent conversion test. If the calculation says 30,000 users per arm and you have 8,000, run longer rather than ship under-powered.
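
To see why the traffic requirement grows, here is a rough planning sketch using the standard normal-approximation formula for comparing two means, n per arm = 2 (z_alpha + z_power)^2 * sigma^2 / delta^2. With a heavy right tail this is a planning figure, not a guarantee, and the mean and standard deviation below are hypothetical placeholders for your own historical values.

```python
import math
from statistics import NormalDist


def users_per_arm(mean: float, std: float, relative_lift: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Users per arm to detect a relative lift in a mean (two-sided test)."""
    if mean <= 0 or std <= 0 or relative_lift <= 0:
        raise ValueError("mean, std and relative_lift must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    delta = mean * relative_lift  # absolute difference to detect
    return math.ceil(2 * (z_alpha + z_power) ** 2 * std ** 2 / delta ** 2)


# Hypothetical inputs: revenue per user vs. a 4% conversion rate,
# both sized for the same 8% relative lift.
arpu_n = users_per_arm(mean=0.60, std=7.0, relative_lift=0.08)
conv_n = users_per_arm(mean=0.04, std=math.sqrt(0.04 * 0.96), relative_lift=0.08)
print(arpu_n, conv_n, round(arpu_n / conv_n, 1))
```

The ratio between the two numbers is driven by how much larger the coefficient of variation of revenue is than that of a binary conversion flag.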

Duration should be at least two weeks plus one full billing cycle, so you can observe refunds and first-cycle churn. For monthly subscriptions that means 30 to 45 days. Annual contracts are harder: pre-register a 90-day proxy metric and validate later. Traffic split is typically 50/50, but for prices you suspect might tank conversion, ramping at 90/10 for the first week, with only 10% of new users seeing the new price, is a sane way to bound downside.

For the statistical test, do not blindly reach for Welch's t-test on revenue per user — the distribution is too skewed. Use a non-parametric bootstrap on the mean. If you want to keep the t-test, decompose ARPU into conversion times average ticket and test each piece separately.

-- ARPU by variant for new users assigned in the experiment,
-- truncated to a 30-day window from each user's assignment.
SELECT
  eu.variant,
  COUNT(DISTINCT eu.user_id)                                           AS users,
  SUM(COALESCE(p.amount_usd, 0))                                       AS revenue_usd,
  SUM(COALESCE(p.amount_usd, 0))::NUMERIC
    / NULLIF(COUNT(DISTINCT eu.user_id), 0)                            AS arpu_usd
FROM experiment_users eu
LEFT JOIN payments p
  ON p.user_id = eu.user_id
 AND p.created_at >= eu.assigned_at
 AND p.created_at <  eu.assigned_at + INTERVAL '30 days'
 AND p.status = 'captured'
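WHERE eu.experiment = 'pricing_v3'
  AND eu.assigned_at >= DATE '2026-04-01'
  AND eu.assigned_at <  DATE '2026-05-01'
  -- Assumption: keep only users whose full 30-day window has elapsed,
  -- so late assignees are not counted with partial revenue.
  AND eu.assigned_at + INTERVAL '30 days' <= CURRENT_TIMESTAMP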
GROUP BY eu.variant
ORDER BY eu.variant;

The query is intentionally boring. The interesting work is upstream: refunds subtracted, currency converted to a single denomination, and the experiment_users table treated as the source of truth for who saw which variant. Most pricing-test debates are really about which table to trust, and the answer is almost always "the assignment table, joined left."

Which metrics to track

The primary metric is ARPU on new users over a fixed window — usually 30 days for monthly products and 90 days for longer-cycle products. That number is what you ship on. Everything else is a guardrail.

Guardrails fall into three buckets. The first is conversion from visitor to paid, which tells you whether price is repelling buyers or shifting them between tiers. The second is retention and refund quality: first-month refund rate, 30-day churn, and the share of users who downgrade in the first cycle. The third is downstream LTV, read off historical cohort decay curves applied to the new variant — see how to calculate LTV by cohort in SQL for the recipe and how to calculate ARPU in SQL for the decomposition.

The trap that catches teams over and over is reading the test on one week of conversion data. Conversion drops, ARPU is up, the team says "we won" and ships. Three months later, retention on the more expensive variant is worse and the LTV story flips. Read the test on cohorts, not weekly aggregates, and require the LTV proxy to move with ARPU. If LTV and ARPU disagree, you do not have a winner yet — you have a question.
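
One simple way to build an LTV proxy is to rescale a historical retention curve so it passes through the variant's observed first-cycle retention, then sum expected revenue over the horizon. This is a sketch of that idea under a strong assumption (the shape of decay after cycle one matches history); the function name, the curve, and the inputs are hypothetical.

```python
def projected_ltv_per_user(conversion, price, observed_r1, historical_curve):
    """Projected revenue per assigned user over len(historical_curve) cycles.

    historical_curve[k] is the share of initial payers still paying in
    cycle k, with historical_curve[0] == 1.0.
    """
    if len(historical_curve) < 2 or historical_curve[0] != 1.0:
        raise ValueError("curve must start at 1.0 and cover at least two cycles")
    hist_r1 = historical_curve[1]
    # Keep the historical shape, anchored to the variant's observed retention.
    curve = [1.0] + [observed_r1 * r / hist_r1 for r in historical_curve[1:]]
    return conversion * price * sum(curve)


history = [1.0, 0.80, 0.70, 0.63, 0.58, 0.54]  # hypothetical cohort curve
control = projected_ltv_per_user(0.040, 10.0, 0.80, history)
variant = projected_ltv_per_user(0.036, 12.0, 0.74, history)
print(round(control, 3), round(variant, 3))
```

If the projected ordering of the variants differs from the ARPU ordering, that is the disagreement the paragraph above says to treat as an open question.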

For ARPU, pair the point estimate with a bootstrap confidence interval rather than a single p-value. With a long right tail dominating the variance, a normal-theory t-test can have a true false-positive rate well away from the nominal level at typical sample sizes. The bootstrap CI in SQL recipe handles this with a few thousand resamples.

Ethics and legal guardrails

The bad version of pricing experimentation is showing different prices to indistinguishable users with no disclosure. That is the version that ends up on the front page of the news, and the playbook to avoid it is well-established.

Five rules cover almost every case. First, run pricing tests only on new users. Second, hold the price stable for any individual user — once they see a number, that is the number for the rest of the session and ideally for the lifetime of the account. Third, never move a user from one variant to another mid-experiment. Fourth, when a variant loses, roll back the price for that segment. Fifth, if the test involves discounts, make the rules public in the offer terms.

There is also a regulatory layer. In the EU, consumer-protection and price-transparency rules can apply to pricing experiments that touch EU users; for example, the Consumer Rights Directive, as amended by the 2019 Omnibus Directive, requires disclosure when a price is personalised through automated decision-making. In several US states, attorneys general have flagged personalized pricing as a potential discrimination issue. Involve legal review before any test that segments on attributes correlated with protected classes, even implicitly. Skipping this step can expose the company to legal costs that dwarf any revenue the test could have produced.

Common pitfalls

The first pitfall is running the test on existing paying users to "save sample size." Your paid base is bigger, but the moment one of those users notices the price change in their billing portal, support tickets and social posts eclipse any revenue gain. Pricing tests run on new users — plan for the longer calendar time.

The second pitfall is reading the test on conversion alone and declaring victory. A variant that converts better at a lower price can easily lose on ARPU and LTV, because the funnel fills with users who pay less and churn more. Pre-register ARPU as the primary metric and require LTV to move with it before shipping.

The third pitfall is under-powering the test because nobody ran the variance calculation on revenue per user. Revenue variance is dominated by a small number of high-value buyers, and detecting a 5% ARPU lift often takes 10x the sample of a similar conversion test. Pull historical variance, plug it into a power calculator, and either commit to the longer test or raise the minimum detectable effect.

The fourth pitfall is using a normal-theory t-test on raw ARPU and trusting the p-value. The distribution is too heavy-tailed at typical sample sizes. Bootstrap the mean directly, or decompose ARPU into conversion times average ticket and test each piece separately.

The fifth pitfall is conflating the price change with a payment-method change. Teams routinely add Apple Pay as part of a "pricing test" and then attribute the conversion lift to price. Test one change at a time, or run a factorial design with enough sample to read each cell.

The sixth pitfall is launching without a peer review of the analysis plan. Pricing experiments are the most politically loaded experiments in any consumer product, and a reviewer outside the team catches assumptions you stopped questioning weeks ago. Write a one-page pre-registration naming the primary metric, the guardrails, the sample size, the duration, and the ship rule.

If you want to drill the SQL and statistics behind pricing experiments daily, NAILDD launches with 500+ analytics problems across this pattern.

FAQ

Can I run a pricing test on existing paying users?

Almost always no. Existing customers expect their price to be a stable contract, and showing them a different number triggers refund requests, social posts, and support escalations that wipe out experimental revenue. The exception is a decision to end grandfathering, that is, to migrate legacy users to a new plan, and even then it should be communicated, opt-in, and reversible.

What is a normal duration for a pricing test?

At least two weeks plus one full billing cycle — for monthly subscriptions, 30 to 45 days. For annual contracts, use a 90-day proxy metric blending first-cycle ARPU with a renewal-rate estimate from prior cohorts, validated against the full-cycle outcome later. Calling a pricing test in seven days is the most common reason teams ship a winner that turns out to be a loser.

What should I use instead of a t-test for ARPU?

A non-parametric bootstrap on the mean. Resample per-user revenue with replacement a few thousand times within each variant, compute the difference of means, and read the 2.5th and 97.5th percentiles as your confidence interval. If you want to stay parametric, decompose ARPU into conversion times ARPPU and test each component separately — both pieces are closer to well-behaved.
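
The resampling procedure described above can be sketched with the Python standard library. This is a minimal percentile bootstrap, assuming one revenue value per assigned user (zeros included); `bootstrap_diff_ci` and the sample data are hypothetical.

```python
import random
from statistics import fmean


def bootstrap_diff_ci(control, variant, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(variant) - mean(control).

    Returns (point_estimate, ci_low, ci_high).
    """
    if not control or not variant:
        raise ValueError("both arms need at least one user")
    rng = random.Random(seed)  # fixed seed keeps the readout reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample users with replacement, independently within each arm.
        c = rng.choices(control, k=len(control))
        v = rng.choices(variant, k=len(variant))
        diffs.append(fmean(v) - fmean(c))
    diffs.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return fmean(variant) - fmean(control), diffs[lo_idx], diffs[hi_idx]


# Hypothetical per-user revenue: mostly zeros plus a few payers.
control_rev = [0.0] * 960 + [10.0] * 40
variant_rev = [0.0] * 964 + [12.0] * 36
print(bootstrap_diff_ci(control_rev, variant_rev))
```

Resampling whole users, rather than individual payments, keeps each user's revenue together, which matches the per-user unit of the experiment.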

Can I test the price and the pricing page at the same time?

Yes, as long as the test is scoped as one combined "offer" and the analysis plan treats it that way. The danger is running them as two separate experiments with overlapping traffic and attributing the lift after the fact — interaction effects make that attribution unreliable. If you need to isolate the two changes, run them sequentially.

What do I do with the losing variant after the test ends?

Roll it back to the control price. Leaving the more expensive variant in production for users who already saw it — "they did not complain" — starts a slow trust erosion that compounds across future tests. Treating rollback as automatic is what lets you keep running pricing experiments without burning trust capital.

How do I compute significance when ARPU has a lot of zeros?

Use the bootstrap, because it makes no parametric distributional assumption and works on the empirical distribution as it is. As a complement, decompose: test conversion (a two-proportion z-test or chi-square test) and ARPPU (Welch's t-test on paying users) separately. If both move in the same direction, you have a clean story. If they disagree, the bootstrap on combined ARPU is what you trust.
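
The decomposition can be sketched as two small tests. Note one swapped-in simplification: the Welch function below computes the usual t statistic and Welch-Satterthwaite degrees of freedom, but approximates the p-value with the normal distribution, which is only reasonable when each arm has at least a few hundred payers. Function names and inputs are hypothetical.

```python
import math
from statistics import NormalDist, fmean, variance


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


def welch_t(payers_a, payers_b):
    """Welch's t statistic, degrees of freedom, and a normal-approximation
    p-value for a difference in mean revenue among paying users."""
    v_a = variance(payers_a) / len(payers_a)
    v_b = variance(payers_b) / len(payers_b)
    t = (fmean(payers_b) - fmean(payers_a)) / math.sqrt(v_a + v_b)
    df = (v_a + v_b) ** 2 / (
        v_a ** 2 / (len(payers_a) - 1) + v_b ** 2 / (len(payers_b) - 1)
    )
    p_approx = 2 * (1 - NormalDist().cdf(abs(t)))  # assumes large samples
    return t, df, p_approx


# Hypothetical readout: 400 of 10,000 convert in control, 340 of 10,000 in variant.
print(two_proportion_z(400, 10000, 340, 10000))
```

With small payer counts, an exact t-distribution p-value from a statistics library would be the safer choice than the approximation used here.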