Outliers explained simply

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

The short answer

An outlier is a value that sits far away from the rest of your sample. Ninety-nine percent of your Stripe customers spend between $20 and $500 a month, and one enterprise account spends $48,000. That single account is an outlier.

Outliers come in two flavours: real (a whale, a viral post, an authentic crash spike) and artefacts (an event pipeline bug, a duplicated row, a $99.99 charge stored as $9,999). Deleting a real whale distorts your revenue forecast. Keeping a bugged row distorts your model. The whole game is figuring out which is which before you touch the data.

Most candidates fail this question because they reach for "drop anything above three standard deviations" and never explain why. A grown-up answer separates detection from action. This post walks through both.

Where outliers come from

The five sources you will see in practice:

  • Real rare cases: enterprise whales, a celebrity tweet spike, a Black Friday order that's 200x the median basket.
  • Data entry errors: a support agent typing 100,000 instead of 10,000, or a unit mismatch between cents and dollars upstream.
  • Technical glitches: double charges from a retry storm, duplicated events from a faulty SDK, a session that ran for 32 days because the user never closed their laptop.
  • Fraud and bots: card-testing rings, scrapers hitting your API 5,000 times an hour, click farms boosting an ad campaign.
  • Wrong context: a B2C product accidentally being used by a B2B team, so the per-user metrics look extreme.

Knowing the source matters more than knowing the value. A million-dollar order from an enterprise account and one from a bot look identical in a histogram. Enrich the data — join the customer table, check the IP, check the device — before deciding.

Why outliers matter

They wreck the mean. Five engineer salaries: 130, 140, 150, 160, 500 (in $k). Mean = $216k. Median = $150k. No one earns $216k, so the mean is useless. See median vs mean.

They wreck A/B tests. A single whale landing in treatment can push the revenue lift from "no effect" to "p < 0.01". Run the test next week without the whale and the effect evaporates. That's a heavy-tailed distribution tricking a t-test.

They wreck ML models. Linear regression, k-means, PCA, and gradient descent are all sensitive to extreme values. A single point at 1,000x the rest of the sample can dominate the loss function. Tree-based models are more robust, but even XGBoost will overfit to a few flashy training points.

Sometimes outliers are the insight. The top 1% of customers often drive 30-50% of revenue. Deleting these "for cleanliness" is how teams optimise away their golden goose. Segment, don't delete.

How to detect outliers

There are four detection methods you should know cold for interviews, plus a couple of ML ones for harder cases.

Z-score

The Z-score measures how many standard deviations a point sits away from the mean.

Z = (x - mean) / SD

Rule of thumb: |Z| > 2 is suspicious, |Z| > 3 is almost certainly an outlier in a roughly normal distribution.

import numpy as np

z = (x - np.mean(x)) / np.std(x)
outliers = np.where(np.abs(z) > 3)

Both the mean and the SD are themselves dragged around by outliers. One whale moves the mean, inflates the SD, and suddenly the whale doesn't look so extreme. Z-score is fine for clean, near-normal data and useless for heavy-tailed financial or engagement data.

IQR (Interquartile Range)

The default robust method. Compute Q1 (25th percentile), Q3 (75th percentile), then IQR = Q3 - Q1. Anything below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is flagged.

import numpy as np

Q1 = np.percentile(x, 25)
Q3 = np.percentile(x, 75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = x[(x < lower) | (x > upper)]

Because percentiles are robust, IQR doesn't get fooled by the whale the way Z-score does. Reach for this first on any business metric — revenue, session length, basket size. To run the same logic in your warehouse, see how to calculate IQR outliers in SQL.

Custom percentile thresholds

When 1.5 x IQR is too aggressive or too lenient, pick percentiles directly: top 1% and bottom 1%, or top 0.5%. Pragmatic choice for product teams who want a defensible cap without arguing about formulas.

lower = np.percentile(x, 1)
upper = np.percentile(x, 99)
outliers = x[(x < lower) | (x > upper)]

For warehouse syntax, quantiles and percentiles covers Snowflake, BigQuery, and Postgres.

Visual checks

Box plots show outliers beyond the whiskers. Scatter plots reveal multivariate anomalies. Histograms show heavy tails. Always plot the distribution before you compute a single number — a thirty-second look at a box plot saves an hour of arguing with stakeholders about a "weird average".

ML-based detection

For high-dimensional data, single-column rules break down. Three workhorses:

  • Isolation Forest — fast, scalable, used in production fraud pipelines.
  • DBSCAN — clustering that flags points outside any cluster.
  • Local Outlier Factor (LOF) — catches points normal globally but anomalous locally.

For a warehouse-native baseline, how to detect anomalies in SQL covers a rolling-window approach that catches most real incidents without ML.

What to do with an outlier

Six options. Pick deliberately.

Delete when the value is provably wrong — a duplicate, a known data-entry error, a QA test transaction that leaked into prod. Prefer marking a dq_flag column over hard-deleting, so you can audit later.

Winsorize when you want to keep the row but cap the value. Everything above the 99th percentile gets clipped to the 99th percentile.

from scipy.stats.mstats import winsorize

winsorized = winsorize(x, limits=[0.01, 0.01])

Standard move before a t-test on heavy-tailed revenue data. Preserves sample size and most of the shape, just softens the tails.

Log-transform when the distribution is right-skewed and you care about multiplicative effects. log1p is safe for zero values.

x_log = np.log1p(x)  # log(1 + x), safe when x = 0

Works beautifully for basket size, session length, and revenue. Often cleaner than winsorization because it doesn't introduce a spike at the cap.

Use the median instead of the mean for reporting. For two-group comparisons with heavy tails, swap the t-test for a Mann-Whitney U-test.

Segment. Split your population into "whales" (top 1-5%) and "everyone else", and analyse them separately. Pricing and customer success usually want the whales analysed in isolation anyway.

Keep as-is. If the outlier is real and material — a whale that drives 30% of revenue — deleting it isn't cleaning, it's lying. Document the decision and move on.

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Outliers in A/B tests

Heavy tails are the silent killer of revenue and engagement A/B tests. One whale landing on either side can flip the result. The toolkit:

  • Winsorize the metric before the test. Cap based on the control distribution, not the combined one, to avoid leaking treatment info.
  • Mann-Whitney U-test instead of a t-test on heavily skewed metrics.
  • CUPED — variance reduction using a pre-period covariate. Shrinks residual variance so remaining outliers matter less. See CUPED explained simply.
  • Stratified sampling by pre-period activity bucket, so whales are balanced across arms by construction.
  • Bootstrap confidence intervals instead of normal-theory ones. No assumption about the tail. Bootstrap explained simply walks through it.

The right combination depends on the metric. For purchase revenue, winsorize plus bootstrap. For session length, log-transform plus t-test. For binary conversion you have no outlier problem at all.

In the interview

The question you will get, in some form, at almost every data interview:

"How do you handle outliers?"

A confident answer:

  1. Check whether the value is a real signal or an artefact. Pull the row, look at the customer, check the upstream pipeline.
  2. Use IQR or percentile cutoffs for detection. Z-score only on clean, near-normal data.
  3. Decide the action by context: delete artefacts, winsorize for tests, log-transform for skewed metrics, segment for product analysis, keep for revenue reporting.
  4. In A/B tests, prefer Mann-Whitney or winsorized t-test on heavy-tailed metrics, and consider CUPED.
  5. Document the rule. Three months later, no one remembers why you dropped 247 rows.

The bad answer — "drop anything more than three standard deviations from the mean" — signals you memorised a heuristic without understanding when it breaks. Interviewers at Stripe, Airbnb, and Meta all ask follow-ups to catch this.

Common pitfalls

The trap of deleting anything that looks weird. Junior analysts equate "outlier" with "noise" and reach for the delete key. Real outliers carry real information — whales, viral hits, fraud signatures, edge-case bugs. Before you drop a row, spend thirty seconds investigating: who is this customer, when did this happen, does the value match anything in the upstream log? If you can't find a reason to call it a mistake, leave it in and segment it out.

The trap of using Z-score on everything. Standard deviation is itself sensitive to outliers, so the extreme values you're trying to detect are inflating the threshold you use to detect them. On heavy-tailed business data — revenue, time spent, basket size — IQR or percentile cutoffs are robust where Z-score isn't. Z-score is a clean choice only on data you already know to be approximately normal, like residuals from a well-fit model.

The trap of not documenting your decision. Two months after you ran the cohort analysis, your PM asks why the numbers no longer match production. Without a SQL comment and a note in the report explaining "rows with order_total > $50,000 excluded as suspected B2B contamination", you spend an afternoon reconstructing the reasoning. Write it down at the time.

The trap of believing "cleaner data is better data". Real revenue includes the whales. Real engagement includes the power users. A "clean" mean that strips them out is tidier but tells the business a story that isn't true. The right output for a revenue dashboard is usually the unmodified total, with a separate note showing top-N concentration so the reader can judge the tail.

The trap of one-size-fits-all handling. For an A/B test, winsorize. For a segmentation deck, segment. For tree-based ML features, often do nothing. For linear regression, transform or cap. The right answer is context-specific, and an analyst who applies the same rule everywhere hasn't thought about the downstream consumer.

If you want to drill outlier and A/B testing questions against real interview prompts, NAILDD is launching with hundreds of SQL and statistics problems.

FAQ

How do I pick the "right" detection method?

For one-dimensional business metrics, default to IQR with the 1.5x rule, and switch to percentile cutoffs (top 1% / bottom 1%) when stakeholders want something simpler to explain. For high-dimensional data — fraud signals, behavioural vectors, sensor data — use Isolation Forest or LOF. Z-score is a clean choice only on roughly normal data, which is rarer in practice than textbooks suggest.

Can I delete outliers from an A/B test after the fact?

No. Dropping rows that "look extreme" after you've seen the result is a common way to manufacture false-positive lifts. Pre-register the outlier rule before the test starts — winsorize at P99, or apply Mann-Whitney — and stick to it. If you find a genuine pipeline bug mid-test, document it and apply the rule symmetrically to both arms.

Should the threshold be 2 SD or 3 SD?

For roughly normal data, 3 SD catches about 0.3% of points — a sensible "almost certainly extreme" bar. 2 SD catches about 5%, which is too lenient and produces many false positives. For heavy-tailed data, neither threshold is appropriate because normality has already failed. Use IQR or percentiles instead.

Can I just use the median everywhere and skip outlier handling?

For reporting, yes. For inference, no. The median is fine for "what's the typical user spending", but a t-test on means is more powerful than a Mann-Whitney on medians when both are valid. Match the summary to the audience — median for product dashboards, winsorized mean for finance, raw total with tail concentration for revenue.

What's the difference between an outlier and an anomaly?

Conversationally, synonyms. In practice, "outlier" means a single extreme value in a static distribution ("this order is 50x the median"), while "anomaly" means a deviation from an expected pattern over time ("signup rate dropped 40% at 03:00 UTC"). Detection toolkits overlap — IQR, Z-score, Isolation Forest work for both — but anomaly detection usually involves seasonal baselines.

How big does a sample need to be for IQR to work?

At least 50-100 observations for the quartiles to stabilise. Below 30, look at the raw distribution and talk to a domain expert. Above 1,000, IQR is essentially noise-free as a detection rule, and the question becomes what to do with the flagged points.