EXISTS vs IN in SQL
Contents:
Why this comparison matters
If you have written SQL for more than a week, you have probably typed both IN and EXISTS and assumed they were interchangeable. They look similar, they often return the same rows, and most query planners will even rewrite one as the other behind your back. So why do interviewers at Stripe, Snowflake, and Databricks ask about the difference?
Because the NOT IN NULL trap silently breaks production queries, and because correlated subqueries are the single most useful pattern an analyst can learn after window functions. A senior data analyst is expected to read a WHERE NOT IN (SELECT ...) and immediately ask "what happens if the subquery returns NULL". That instinct separates someone who copies StackOverflow snippets from someone who owns the dashboard.
This post walks through both operators with runnable SQL, the NULL gotcha that ruins reconciliation queries, and the planner behavior you should expect on Postgres, Snowflake, and BigQuery. If you only remember one line from this article, make it the blockquote in the NULL section.
How IN works
IN compares a value to a set. The set can be a hand-typed list of literals or the result of a subquery.
-- List of constants
SELECT *
FROM orders
WHERE status IN ('paid', 'shipped', 'delivered');
-- Subquery: orders from US-based users
SELECT *
FROM orders
WHERE user_id IN (
SELECT user_id
FROM users
WHERE country = 'US'
);When IN is used with a subquery, the engine typically runs the subquery once, materializes the result into a hash set, and probes that hash for every row in the outer query. The subquery is non-correlated — it does not reference any column from the outer table.
A useful mental model: IN answers "is this value one of those values?". The set on the right has to exist before the comparison can happen — which is why IN lists with millions of elements get slow.
How EXISTS works
EXISTS checks whether a subquery produces at least one row. The subquery is almost always correlated — it references a column from the outer query, creating a "for each outer row, does anything match" pattern.
-- Users who placed at least one order
SELECT *
FROM users u
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.user_id = u.user_id
);
-- Products that were never ordered
SELECT *
FROM products p
WHERE NOT EXISTS (
SELECT 1
FROM order_items oi
WHERE oi.product_id = p.product_id
);EXISTS short-circuits. The moment the subquery finds one matching row, the engine stops scanning and returns TRUE for that outer row. This is why EXISTS plus an index on the join column is so cheap — most outer rows resolve after a single index seek.
The SELECT 1 is a convention, not a requirement. You can write SELECT *, SELECT 42, or SELECT NULL — EXISTS only inspects whether the result set has rows, never what is inside them. The planner ignores the projection entirely.
Side-by-side comparison
| Dimension | IN |
EXISTS |
|---|---|---|
| What it checks | Value membership in a set | Whether at least one row exists |
| Typical subquery shape | Non-correlated | Correlated |
| NULL behavior | Dangerous with NOT IN |
Always safe |
| Faster when | Subquery is tiny, no index on outer | Outer is large and join column is indexed |
| Readability | Cleaner for literal lists | Cleaner for "has-related-row" checks |
| Negation | NOT IN — broken with NULLs |
NOT EXISTS — always correct |
| Short-circuits | No, builds full set | Yes, stops at first match |
The two operators are semantically equivalent only when no NULLs are involved in the candidate set. The moment a NULL enters the subquery, NOT IN and NOT EXISTS diverge — sometimes spectacularly.
The NOT IN NULL trap
This is the gotcha that gets asked in roughly every other senior analyst interview. The rule is short: if your NOT IN subquery returns even one NULL, the entire NOT IN returns an empty result set.
-- Suppose orders has a row where user_id IS NULL
-- NOT IN returns ZERO rows
SELECT *
FROM users
WHERE user_id NOT IN (
SELECT user_id FROM orders
);
-- Empty result, even though plenty of users have no orders
-- NOT EXISTS returns the correct list
SELECT *
FROM users u
WHERE NOT EXISTS (
SELECT 1
FROM orders o
WHERE o.user_id = u.user_id
);
-- Returns users without orders, as expectedWhy does this happen? SQL uses three-valued logic: TRUE, FALSE, UNKNOWN. Any comparison against NULL produces UNKNOWN. NOT IN (a, b, NULL) expands into <> a AND <> b AND <> NULL. The last clause is UNKNOWN, the whole expression collapses to UNKNOWN, and rows where WHERE is UNKNOWN are silently dropped.
Load-bearing rule: Never write NOT IN (SELECT ...) against a nullable column without an explicit IS NOT NULL filter inside the subquery. The safer default is to reach for NOT EXISTS and forget NOT IN ever existed for subqueries.
The trap hides during development. Your dev sample has no NULLs, the query returns sensible rows, you ship it. Two weeks later a single bad event lands with a NULL join key, and your churn dashboard quietly reports zero churned users for a week.
Performance under the hood
The popular myth is "EXISTS is always faster than IN". That was true on Oracle in the 1990s. On modern engines, the planner rewrites between the two forms freely, and the actual choice of plan depends on three things: cardinality of the subquery, presence of an index on the join column, and whether the outer table is the larger side.
| Scenario | Typical winner | Why |
|---|---|---|
| Subquery returns <1k unique values, no index | IN |
Hash probe over tiny set is nearly free |
| Subquery scans 100M rows, indexed join column | EXISTS |
Short-circuit on first match per outer row |
| Outer table is small, subquery is huge | EXISTS |
Avoid materializing entire subquery set |
| Filter on literal list of values | IN |
Cleaner code, planner unfolds to OR |
| Anti-join with possible NULLs | NOT EXISTS |
Correctness, not just speed |
On Postgres 13+, IN with a subquery and EXISTS produce identical plans in roughly 80% of cases. Snowflake similarly treats them as interchangeable for non-correlated forms. BigQuery is a bit pickier — IN UNNEST(...) patterns get special treatment, and EXISTS correlated subqueries occasionally force a less-optimal join order.
The headline: write whichever form expresses the intent more clearly, then check EXPLAIN when the query actually feels slow.
When to pick IN vs EXISTS
Reach for IN when you have a literal list of values or a known-small lookup set: WHERE status IN ('active', 'trial', 'past_due') or WHERE country_code IN (SELECT code FROM enabled_markets). The intent — value is one of these — reads naturally.
Reach for EXISTS when you are asking "does this row have any related row matching condition X". Especially when the condition has multiple filters layered on the related table.
-- EXISTS shines when the related-row predicate is rich
SELECT *
FROM users u
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.user_id = u.user_id
AND o.created_at >= '2026-01-01'
AND o.status = 'paid'
AND o.amount > 100
);Trying to express the same query with IN requires either a clunky multi-column tuple or a nested subquery that pre-filters into a single column. The EXISTS form is shorter, more readable, and lets the planner push the filter into the join.
If you want to drill this pattern across 500+ real interview SQL problems, NAILDD ships a daily question feed that hammers exactly these correlated-subquery reflexes until they become muscle memory.
Common pitfalls
The first pitfall is the one we already covered: NOT IN with NULL in the subquery. The fix is either WHERE col IS NOT NULL inside the subquery or a switch to NOT EXISTS. Make NOT EXISTS your default for anti-joins and you will never get burned.
A second trap is using IN against a subquery that returns duplicates. The query is still correct, but the engine may end up building a much larger hash than necessary. Wrapping the subquery in SELECT DISTINCT sometimes helps the planner, sometimes hurts it by forcing an extra sort. Test on real data volumes before deciding.
A third pitfall is performance regression from index changes. An EXISTS query that runs in 50ms today can degrade to 30 seconds after someone drops a "redundant" index on the join column — the short-circuit advantage evaporates and you need a full scan per outer row. Document the index dependency in a comment next to the query.
A fourth, sneakier pitfall is mixing IN with multi-column predicates. Postgres supports WHERE (a, b) IN (SELECT x, y FROM t), but MySQL handles it inconsistently and BigQuery does not support tuple-IN at all. If you write cross-dialect SQL — common in ELT layers that ship to multiple warehouses — prefer the EXISTS form, which is universally portable.
The final pitfall is assuming the planner will save you. It usually does, until it doesn't. The day a stats refresh flips your row estimate and the query goes from 200ms to 4 minutes, you will want to know which form you wrote and why. Treat planner equivalence as a default, not a guarantee.
Interview questions
What is the difference between EXISTS and IN? IN compares a value to a set; EXISTS checks whether a subquery returns any rows. EXISTS is typically used with a correlated subquery and short-circuits on the first match. The most important practical difference is that NOT IN breaks silently if the subquery contains NULL, while NOT EXISTS always returns the correct result.
What happens if the NOT IN subquery returns a NULL? The whole NOT IN returns an empty set. SQL's three-valued logic turns value <> NULL into UNKNOWN, the AND chain collapses to UNKNOWN, and rows where WHERE is UNKNOWN get filtered out. Fix it by filtering NULLs in the subquery or, better, switch to NOT EXISTS.
When is EXISTS faster than IN? When the subquery would return a large set and the join column is indexed. EXISTS short-circuits on the first matching row per outer row, while IN typically builds a hash of the entire subquery result. On modern Postgres, Snowflake, and Databricks the planner often rewrites between the two, so the difference is sometimes zero.
Can IN and EXISTS be rewritten as each other? Yes, when no NULLs are involved. WHERE x IN (SELECT y FROM t) is equivalent to WHERE EXISTS (SELECT 1 FROM t WHERE t.y = x). But NOT IN and NOT EXISTS are not equivalent in the presence of NULL — that asymmetry is the whole point of the comparison.
What is a correlated subquery? A subquery that references a column from the outer query. Conceptually it runs once per outer row, though the planner often rewrites it into a join. EXISTS is almost always used with a correlated subquery; IN is usually used with a non-correlated one.
Related reading
- SQL window functions interview questions
- CTE vs subquery in SQL
- Anti and semi joins for data engineering interviews
- NULL in SQL guide
- SQL antipatterns guide
- How to read EXPLAIN ANALYZE
FAQ
Are EXISTS and IN really the same thing?
Semantically they are close for "find rows that have a match" queries, and the planner often rewrites one into the other. The critical exception is the negated form: NOT IN against a subquery that can contain NULL will silently return zero rows, while NOT EXISTS returns the expected anti-join. Treat the two operators as siblings, not twins.
Which is faster, EXISTS or IN?
On modern engines the answer is usually "it does not matter, the planner picks." When it does matter, IN tends to win for small literal lists or tiny non-correlated subqueries, and EXISTS wins for large correlated subqueries with an index on the join column. Confirm with EXPLAIN ANALYZE on production-sized data before optimizing.
When should I use NOT EXISTS instead of LEFT JOIN + IS NULL?
Both patterns produce the same anti-join result. NOT EXISTS is usually more readable because the intent — "no matching row in the other table" — is right there in the operator name. Performance is identical on Postgres and Snowflake because both forms compile to the same anti-join physical operator. Pick on style grounds, and stay consistent across the codebase.
Can I use IN with a very large list of literal values?
Yes, but there are practical ceilings. Postgres handles IN lists up to roughly 10,000 elements efficiently before the planner overhead starts dominating. For larger sets, load the values into a temp table or CTE and join against it. Snowflake and BigQuery scale further but still benefit from the temp-table pattern once you cross 100,000 values.
Is SELECT 1 faster than SELECT * inside EXISTS?
No. The query planner ignores the projection list inside EXISTS — it only cares that the subquery has rows. SELECT 1 is a style convention that signals intent to the reader. SELECT *, SELECT NULL, and SELECT 42 all produce identical execution plans.