Percentile in SQL cheat sheet

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Why percentiles beat averages

Averages lie on every metric with a long tail. Order value, session length, API response time, customer lifetime revenue — plot the histogram and you see a fat right tail that drags the mean far above what a typical user experiences. One whale spending fifty thousand on a Stripe-like checkout, or one outage producing ten-second responses, is enough to make the AVG true on paper and useless in practice.

Percentiles fix this. P50 is the median: half below, half above. P95 says ninety-five percent of users had an experience at least this good. P99 is the value your worst one percent of requests crossed. These numbers stay stable when outliers explode — exactly what you want when an Uber PM asks if the new map screen is faster, or a Netflix SRE checks if streaming start regressed after a deploy.

This cheat sheet covers the patterns that show up in interviews and on-call shifts. Every snippet is copy-paste ready for Postgres unless noted, with notes for BigQuery, Snowflake, Databricks, and MySQL where syntax diverges. If you want to drill these as interview problems, NAILDD ships with hundreds of SQL problems built around exactly these patterns.

The core functions

The family is small. PERCENTILE_CONT(p) returns the interpolated value at percentile p between zero and one. PERCENTILE_DISC(p) returns the closest actual value. NTILE(n) does not return a percentile at all — it tags every row with a bucket number from one to n, splitting the dataset into roughly equal groups.

Wrap PERCENTILE_CONT and PERCENTILE_DISC with WITHIN GROUP (ORDER BY col) — they are ordered-set aggregates. Wrap NTILE with OVER (ORDER BY col) — it is a window function. Mixing the two syntaxes is the most common take-home mistake.

Median and quartiles

The median is PERCENTILE_CONT(0.5). In Postgres, Snowflake, Oracle, Databricks:

SELECT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY order_total) AS median
FROM orders;

In BigQuery the analytic version uses positional arguments and OVER (), because BigQuery treats it as a window function:

SELECT DISTINCT
    PERCENTILE_CONT(order_total, 0.5) OVER () AS median
FROM orders;

Quartiles are three calls at 0.25, 0.50, 0.75 — one scan, three statistics:

SELECT
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY order_total) AS p25,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY order_total) AS p50,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY order_total) AS p75
FROM orders;

For an even number of rows, PERCENTILE_CONT(0.5) linearly interpolates between the two middle values — the standard statistical definition of the median. If you need an actual data point closest to the median (a real order ID you can join back to), use PERCENTILE_DISC(0.5).

P95, P99, P99.9 for tail latency

Latency is the canonical use case for high percentiles. The mean response time hides slow tails; P95 surfaces them; P99 and P99.9 surface the cliff SREs need to know about.

SELECT
    PERCENTILE_CONT(0.50)  WITHIN GROUP (ORDER BY response_time_ms) AS p50,
    PERCENTILE_CONT(0.95)  WITHIN GROUP (ORDER BY response_time_ms) AS p95,
    PERCENTILE_CONT(0.99)  WITHIN GROUP (ORDER BY response_time_ms) AS p99,
    PERCENTILE_CONT(0.999) WITHIN GROUP (ORDER BY response_time_ms) AS p999
FROM api_logs
WHERE created_at >= NOW() - INTERVAL '1 hour';

For dashboards, bin by minute or five-minute interval and compute percentiles per bucket. That turns the query into a time series, pairing naturally with anomaly detection on the P99 line.

Percentiles by group

Add GROUP BY for one row per group — useful for per-category, per-region, or per-endpoint summaries:

SELECT
    category,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY order_total) AS median,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY order_total) AS p95
FROM orders
GROUP BY category;

If you need percentiles alongside row-level data — for example, tagging every order with the median of its category — use the window form (Postgres, Snowflake, BigQuery all support it):

SELECT
    order_id,
    category,
    order_total,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY order_total)
        OVER (PARTITION BY category) AS category_median
FROM orders;

PERCENTILE_CONT vs PERCENTILE_DISC

The two diverge when the percentile does not land exactly on a row. Take [10, 20, 30, 40] and ask for PERCENTILE_CONT(0.5): it interpolates between 20 and 30 and returns 25. PERCENTILE_DISC(0.5) returns 20 — the highest actual value at or below the 50th percentile.

-- Sample distribution: [10, 20, 30, 40]
PERCENTILE_CONT(0.5)  -- 25 (interpolated)
PERCENTILE_DISC(0.5)  -- 20 (closest real value at-OR-below)

For most analytics — revenue medians, latency percentiles, AOV by cohort — PERCENTILE_CONT is the right default. Reach for PERCENTILE_DISC when the percentile must correspond to a real row: a customer at P75 spend, or a request at P99 latency you want to open in the trace viewer.

NTILE buckets for deciles and RFM

NTILE(n) splits the result set into n roughly equal buckets by an ordering column. This is how you build RFM segments, deciles, and equal-frequency histograms.

-- Quartiles: four buckets of 25% each
SELECT
    user_id,
    order_total,
    NTILE(4) OVER (ORDER BY order_total) AS quartile
FROM orders;

-- Deciles: ten buckets of 10% each
SELECT
    user_id,
    order_total,
    NTILE(10) OVER (ORDER BY order_total) AS decile
FROM orders;

NTILE returns the bucket index, not the boundary value. To get the boundaries, aggregate after bucketing:

SELECT
    decile,
    COUNT(*)        AS orders_cnt,
    MIN(order_total) AS min_total,
    MAX(order_total) AS max_total,
    AVG(order_total) AS avg_total
FROM (
    SELECT
        order_total,
        NTILE(10) OVER (ORDER BY order_total) AS decile
    FROM orders
) t
GROUP BY decile
ORDER BY decile;

A common trick: "top twenty percent of spenders" equals NTILE(5) = 5 ascending or = 1 descending. Pick the direction that reads naturally downstream and document it in a comment.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

IQR outlier detection

IQR is Q3 - Q1. The Tukey rule flags values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]. Solid default for cleaning and sanity checks.

WITH stats AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY order_total) AS q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY order_total) AS q3
    FROM orders
)
SELECT o.*
FROM orders o
CROSS JOIN stats s
WHERE o.order_total < s.q1 - 1.5 * (s.q3 - s.q1)
   OR o.order_total > s.q3 + 1.5 * (s.q3 - s.q1);

For long-tailed distributions like revenue, the Tukey rule is too aggressive on the upper side — it flags every legitimate whale. Switch to a MAD-based robust z-score, or flag only the bottom tail.

MySQL workarounds

MySQL 8 does not ship PERCENTILE_CONT. The standard workaround uses row-number plus total count, filtering at the middle:

-- Median in MySQL 8+ using window functions
SELECT AVG(order_total) AS median
FROM (
    SELECT
        order_total,
        ROW_NUMBER() OVER (ORDER BY order_total) AS rn,
        COUNT(*)     OVER ()                     AS cnt
    FROM orders
) t
WHERE rn IN (FLOOR((cnt + 1) / 2), CEIL((cnt + 1) / 2));

For arbitrary percentiles, replace (cnt + 1) / 2 with ROUND(cnt * p) and drop the AVG if you accept the discrete value. On MySQL older than 8 there are no window functions — fall back to LIMIT 1 OFFSET k after sorting, which only scales on small indexed tables.

Approximate percentiles at scale

Exact percentiles need a full sort, which gets expensive on billion-row tables. Every major warehouse ships an approximate alternative based on streaming sketches like t-digest or KLL.

-- Snowflake
SELECT APPROX_PERCENTILE(response_time_ms, 0.99) AS p99
FROM api_logs;

-- BigQuery
SELECT APPROX_QUANTILES(response_time_ms, 100)[OFFSET(99)] AS p99
FROM api_logs;

-- Databricks / Spark SQL
SELECT approx_percentile(response_time_ms, 0.99, 10000) AS p99
FROM api_logs;

-- ClickHouse
SELECT quantileTDigest(0.99)(response_time_ms) AS p99
FROM api_logs;

Accuracy depends on sketch size. BigQuery's default APPROX_QUANTILES(_, 100) is good to roughly one percentage point; Snowflake's t-digest is tighter. For minute-refresh SLO dashboards over billions of rows, the approximate variant is the only practical choice. For a quarterly executive report where the exact P95 matters, run PERCENTILE_CONT against a materialized day-level table instead.

Common pitfalls

The most common failure when moving from notebook to production is forgetting WITHIN GROUP. The query parses, the function name is recognized, but the planner has no sort key and errors out. Always pair PERCENTILE_CONT and PERCENTILE_DISC with WITHIN GROUP (ORDER BY col). For the window form the WITHIN GROUP clause comes before OVER, with BigQuery as the positional-argument exception.

A second trap is mixing up NTILE and PERCENTILE_CONT. NTILE(4) does not return the 25th percentile — it returns a bucket index from 1 to 4. If a Stripe interviewer asks for median order value and you write NTILE(2), you return a bucket number, not a dollar amount. Remember the shape of the answer: percentiles return a value on the input scale; NTILE returns a small integer.

A third pitfall is treating NULL as zero. Percentile functions ignore NULL by default, so a half-NULL column returns percentiles computed only over the non-null half. If NULL should mean zero — "users with no purchases had zero spend, not unknown spend" — wrap the column in COALESCE(col, 0) inside the ORDER BY. Otherwise you are silently filtering the population.

A fourth pitfall appears when comparing distributions: means look identical, but P95 has shifted. That gap is the single most useful signal in an A/B test on a long-tailed metric. Always include P50, P95, and AVG side by side when scoring experiments on revenue, session length, or latency — the spread tells you whether the average is being yanked by outliers.

A fifth trap is comparing percentiles across non-aligned time windows. If control covers seven days and treatment covers six, P99 will differ purely from sample size: rarer percentiles are more sensitive to observation count. Align the windows or bootstrap a confidence interval before declaring the change real. The peeking problem in A/B testing is closely related.

Performance and optimization tips

Exact percentiles require a full sort of the target column, so they scale O(N log N) worst case. On Postgres, an index on the percentile column helps the planner but does not eliminate the sort — the function needs the full distribution materialized or spilled. The most effective fix is pushing computation down to a pre-aggregated table — daily or hourly rollups storing percentiles alongside row counts and sums.

On Snowflake, BigQuery, and Databricks, percentile computations parallelize across micro-partitions but still scan and sort. Calling PERCENTILE_CONT over a billion-row event table to power a live dashboard burns credits and frustrates users. Switch to approximate variants, or materialize percentile output into a metrics table the dashboard reads from.

For window-form percentiles with PARTITION BY, the planner sorts within each partition. With a high-cardinality key like user_id, sort cost per row stays constant but you fan out many small sorts. A shuffle on top of the partition in the explain plan usually points at a join-order problem upstream.

FAQ

How do I calculate the median in SQL?

Use PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col) in Postgres, Snowflake, Oracle, and Databricks. In BigQuery, use PERCENTILE_CONT(col, 0.5) OVER () wrapped in SELECT DISTINCT. In MySQL 8, use a window-function trick with ROW_NUMBER() and total count, averaging the middle one or two rows. MySQL older than 8 has no window functions — fall back to a subquery with ORDER BY ... LIMIT 1 OFFSET k, which only scales for small tables.

Should I report P95 or P99?

P95 is the default for product metrics and most SLOs because it is stable enough to track day over day. P99 and P99.9 belong in site reliability work where the worst-case experience matters: payment processing, real-time auctions, streaming start. The deeper into the tail, the more samples you need before the number stabilizes — P99 over a thousand requests is noisy; P99 over a million is solid.

What is IQR and when do I use it?

IQR is Q3 - Q1 — the spread of the middle fifty percent. It is the building block of the Tukey outlier rule and of robust dispersion measures generally. Use it when you need a spread statistic that is not warped by outliers, and as a quick screen for points to investigate. For very skewed distributions, prefer MAD-based outlier detection or domain-specific bounds.

Can I compute percentiles per group?

Yes. Add GROUP BY for one row per group, or use the window form PERCENTILE_CONT(...) WITHIN GROUP (ORDER BY col) OVER (PARTITION BY group_col) to tag every row with its group's percentile. BigQuery uses positional PERCENTILE_CONT(col, p) OVER (PARTITION BY group_col). The partition column should be indexed or clustered if the query runs frequently.

How do approximate percentiles work?

They use streaming sketch data structures — t-digest, KLL, or GK — summarizing the distribution in bounded memory as data streams through. Small bounded error in exchange for big constant-factor gains in time and memory. Sketches merge across nodes, so they scale to trillions of rows. For sub-minute dashboards or queries scanning billions of rows, switch to APPROX_PERCENTILE, APPROX_QUANTILES, or quantileTDigest, and reserve the exact functions for pre-aggregated tables.

Why do my percentiles change when I rerun the query?

Exact PERCENTILE_CONT over static data is deterministic. Approximate variants drift slightly because sketches are non-deterministic across shuffles. Over a streaming or append-only table, numbers move as new data lands — correct behavior, but dashboard snapshots need timestamps. For reproducible analytics, materialize percentiles into a metrics table at fixed windows and read from there.