Outliers in data: detection and handling
Contents:
What an outlier actually is
An outlier is an observation that sits visibly far from the rest of the sample. No single mathematical definition exists; every team picks a rule (1.5xIQR, z-score above three, a domain threshold) and defends it. An outlier is whatever your detector flags, and your detector should match the question you are answering.
Three scenarios sharpen the idea. A Stripe storefront averages 80 dollars per order; a single Tuesday order lands at 32,000 — corporate procurement, data error, or fraud? A Notion marketing page averages 2,000 daily visits and one morning shows 45,000 — viral Hacker News pickup or a botnet? A salary spreadsheet contains a row of 12 dollars where 120,000 belongs. Same fingerprint, different decisions.
"Outlier" is rarely a label you act on directly. It is a flag that triggers a second pass — explain the row, then decide.
Why outliers matter for analysts
Outliers distort the statistics that product and finance decisions are built on. The mean is the most famous victim: a single 1.2-million-dollar transaction in a basket of 3,000-dollar orders shifts the average so far that the metric stops describing a typical customer. Variance and standard deviation react even more violently because each deviation is squared — one rogue row inflates the spread, widens confidence intervals, and erodes the power of every downstream test.
A/B tests carry that erosion into product decisions. When a single whale lands in the treatment bucket of a checkout test, the observed revenue per user can swing several percent on its own — manufacturing a fake winner or hiding a real one. Interviewers at Airbnb, Uber, and DoorDash love the scenario: "one user contributed eighty percent of the treatment revenue, what do you do?" Acceptable answers involve winsorization or bootstrap confidence intervals rather than blindly dropping the row.
Regression has its own pathology: ordinary least squares minimizes squared residuals, so outliers get disproportionate weight and bend the fitted line toward themselves. A single high-leverage point can flip the sign of a coefficient.
Detection methods
Two families show up in almost every analyst interview: distribution-free methods built on quantiles, and parametric methods built on the mean and standard deviation.
IQR. The gap between Q1 (25th percentile) and Q3 (75th percentile). The 1.5xIQR rule flags anything below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. The 1.5 is a Tukey convention; tighten it to 3.0 for extreme outliers only. Because the rule discards the top and bottom 25 percent before measuring spread, it is robust against the very outliers it is finding — the property that makes the boxplot durable for skewed distributions.
Z-score. How many standard deviations a value sits from the mean: z = (x - mean) / std. |z| > 3 flags roughly 0.27 percent of a normal distribution; |z| > 2 flags about 5 percent. The catch is that the estimator is poisoned by the outliers you want to detect — both mean and standard deviation get pulled by the tail. On right-skewed revenue, z-score misses real outliers in the long tail and falsely flags ordinary low-value rows.
Modified z-score and MAD. Swap the mean for the median and the standard deviation for 1.4826 * MAD. The MAD detector is the workhorse for monitoring at Snowflake and Databricks scale because it does not break when a deploy lands a handful of extreme rows.
Visual checks. A boxplot shows IQR, median, whiskers, and outliers in one glance. A scatterplot reveals two-dimensional outliers that one-dimensional rules will never see — the customer who is normal on revenue and normal on session count but bizarre on the combination.
SQL recipe: IQR outliers in revenue
Assume an orders table with a revenue column. The query computes Q1, Q3, and the 1.5xIQR bounds, then returns every order outside those bounds with a label:
WITH quartiles AS (
SELECT
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY revenue) AS q1,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY revenue) AS q3
FROM orders
WHERE revenue > 0
),
bounds AS (
SELECT
q1, q3,
q3 - q1 AS iqr,
q1 - 1.5 * (q3 - q1) AS lower_bound,
q3 + 1.5 * (q3 - q1) AS upper_bound
FROM quartiles
)
SELECT
o.order_id,
o.revenue,
CASE
WHEN o.revenue < b.lower_bound THEN 'low_outlier'
WHEN o.revenue > b.upper_bound THEN 'high_outlier'
END AS outlier_type
FROM orders o
CROSS JOIN bounds b
WHERE o.revenue < b.lower_bound
OR o.revenue > b.upper_bound
ORDER BY o.revenue DESC;PERCENTILE_CONT is supported in Snowflake, BigQuery, Postgres, and Redshift with minor syntactic differences. For very large tables use APPROX_PERCENTILE. For dashboards, materialize the quartiles into a daily table and join in the bounds rather than recomputing per query.
Python recipe: z-score and pandas
The numpy version is what interviewers usually want whiteboarded:
import numpy as np
from scipy import stats
data = np.array([120, 130, 125, 128, 135, 122, 980, 127, 131, 126])
z_scores = stats.zscore(data)
outliers = np.where(np.abs(z_scores) > 2)
print("Outlier indices:", outliers[0])
print("Outlier values:", data[outliers])
# Outlier indices: [6]
# Outlier values: [980]The 980 sits about seven standard deviations from the mean of 130, so the detector lights up. On real data replace the threshold of 2 with 3 and cross-check with IQR.
For dataframes, the pandas idiom keeps row identifiers intact:
import pandas as pd
df = pd.DataFrame({
"revenue": [120, 130, 125, 128, 135, 122, 980, 127, 131, 126]
})
q1 = df["revenue"].quantile(0.25)
q3 = df["revenue"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
mask = (df["revenue"] < lower) | (df["revenue"] > upper)
print(df[mask])For a per-group detector — outliers within each cohort, country, or experiment arm — wrap the logic in groupby().transform() so each row is compared against the percentiles of its own group.
Drop, keep, or transform
The hardest part of outlier work is the decision, not the math.
Drop or correct when the outlier is a defect — typos, duplicate transactions from a webhook retry, confirmed fraud that belongs in a separate table rather than the revenue series.
Keep and segment when the outlier is a meaningful business event. Whale customers deserve their own segment; collapsing them into the average hides the shape of the business. A viral spike from a tweet is a real acquisition event. Black Friday is a season, not an outlier.
Transform when you want the influence to shrink without losing the row. Log-transform right-skewed metrics before regression. Apply winsorization at the 1st and 99th percentiles before computing a treatment effect. Use a bootstrap confidence interval when the metric is a ratio with a heavy tail.
Robust alternatives
A trimmed mean drops the top and bottom p percent of values and averages what remains — usually 5 or 10 percent, available as scipy.stats.trim_mean(data, 0.05). It cuts variance materially on heavy-tailed metrics.
Winsorization keeps the sample size constant. Instead of deleting rows below the 5th and above the 95th percentile, replace them with the percentile values. The mean of the winsorized series captures most of the variance reduction of a trimmed mean without dropping rows.
Both transformations are common pre-processing for A/B tests because they shrink variance and increase power without discarding users. CUPED is the same idea applied to a pre-experiment signal.
Skewness and outliers
Positive skew is the default shape for most business metrics: revenue, session length, items per order. Values are bounded by zero on the left and unbounded on the right, so the tail only grows in one direction. That is why z-score detection routinely fails on revenue: the mean is dragged right by the tail, the standard deviation is inflated by the same rows, and the bounds end up both too lenient on the right and too strict on the left.
Negative skew is rarer — rating distributions on Apple's App Store, satisfaction scores capped above. Same logic, mirrored geometry. For strongly skewed distributions, switch from z-score to IQR or MAD detection, or log-transform first. log(revenue + 1) pulls the tail in and makes the distribution closer to normal.
Common pitfalls
The most common pitfall is deleting outliers before understanding them. Analysts under deadline filter out the top 1 percent and ship the dashboard, then discover three weeks later that the deleted rows were an enterprise tier launching a new plan. The fix is procedural: every outlier removal step should log a row count and a sample of affected rows so the next reviewer can see what was thrown away.
A second trap is using a single global threshold across a heterogeneous population. A 5,000-dollar order is enormous for a hobby segment and unremarkable for an enterprise segment. One set of IQR bounds across both systematically over-flags enterprise and under-flags suspicious hobby orders. Compute bounds within meaningful segments — country, tier, channel — and apply each segment's bounds to its own rows.
A third pitfall is treating refunds and chargebacks as zero. Both are negative cash events that the database may surface as NULL revenue or in a separate table. Filter on revenue > 0 to compute bounds and you have implicitly removed every refund, so a 10,000-dollar refund never shows up in the high-outlier list. Decide on the metric definition first — net or gross, refunds in or out — and document the choice in the query header.
A fourth trap is double-counting outlier detection. A nightly job removes the top and bottom 1 percent, the dashboard reads the cleaned table, then a downstream A/B test analysis applies its own 99th-percentile winsorization on top — trimming the data twice. Trust the upstream cleaning, or read from a raw view and apply cleaning exactly once.
Interview questions
Six prompts that come up in statistics, A/B testing, and SQL rounds:
- What is an outlier and how do you define one? Strong answers name a concrete detector (IQR with 1.5 multiplier, z-score with threshold 3) and call out the trade-off between distribution-free and parametric methods.
- How do outliers affect mean and median? Mean sensitive, median robust. Demonstrate with 1, 2, 3, 4, 100 versus 1, 2, 3, 4, 5.
- One user delivered eighty percent of treatment revenue — what do you do? Winsorization at the 99th percentile, trimmed mean, bootstrap CIs, switching to a conversion metric, or CUPED.
- Write a SQL query that flags outliers in a revenue table. The
PERCENTILE_CONTIQR pattern above; bonus points for handling negative revenue and windowing by segment. - When is it wrong to delete outliers? When the rows carry business signal — whales, fraud needed for investigation, real market events.
- What does positive skew mean? Long right tail, mean greater than median, z-score biased. Most business metrics are right-skewed by structure.
Related reading
- Outliers explained simply
- How to calculate IQR outliers in SQL
- How to calculate MAD outliers in SQL
- Bootstrap in A/B testing
- Statistics for data analysts
If you want to drill outlier and A/B questions in interview form, NAILDD is launching with hundreds of SQL and statistics problems on exactly this pattern.
FAQ
What is an outlier in data?
An outlier is an observation that sits visibly far from the rest of the sample by whatever rule you have chosen — typically 1.5 times the interquartile range from the quartiles, or more than three standard deviations from the mean. The label is operational, not absolute: it triggers an investigation into whether the row is an error, a technical artifact, fraud, or a genuinely extreme event that belongs in the analysis.
How do you detect outliers in data?
The two workhorse detectors are the 1.5xIQR rule and the z-score test. IQR flags anything below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR and is robust because percentiles are not affected by extreme rows. Z-score flags anything more than three standard deviations from the mean, precise on near-normal distributions and unreliable on heavy-tailed ones. For skewed metrics like revenue, IQR or the MAD variant usually wins.
When should you not delete outliers?
When the outlier carries business signal. Whale customers are a real segment that deserves its own dashboard cut. Viral acquisition spikes are real events that should be measured. Black Friday peaks are seasonality. Confirmed fraud is its own investigation stream.
How do outliers affect A/B tests?
They inflate variance, widen confidence intervals, and reduce statistical power. A single user with anomalous revenue can manufacture a false positive or hide a real effect. Standard responses are winsorization at the 99th percentile, trimmed means at 5 to 10 percent, bootstrap CIs, or switching from revenue to conversion so the influence of any single row is bounded.
Why does z-score fail on revenue data?
Revenue is bounded by zero on the left and unbounded on the right, so the distribution is right-skewed by structure. The mean is dragged right by the tail and the standard deviation is inflated by the same rows, so both estimators are biased by the very outliers you want to detect. IQR and MAD-based detectors sidestep the bias because they use percentiles instead of moments.
Should I log-transform before outlier detection?
Often, yes. log(x + 1) on a right-skewed metric pulls the tail in and makes the distribution closer to normal, after which z-score becomes a reasonable detector. Only log-transform strictly positive metrics, and reverse the transformation when reporting.