EXPLAIN SQL cheat sheet
Contents:
Why analysts get asked about EXPLAIN
On a senior analyst loop at Stripe, Snowflake, or Databricks, the warmest moment of the SQL round is "this report runs in eight minutes — why is it slow?". A junior guesses. A senior answers in five words: "let's look at EXPLAIN ANALYZE". That is the difference between proving a fix and shipping a hope.
EXPLAIN is the conversation the planner has with you about your query: read order, join algorithm, expected row counts, work per step. Without it you cannot reason about performance — only re-shuffle SQL until it accidentally gets faster. Candidates who only know the writing side hit a ceiling around senior; the ones who can read a plan move past it.
EXPLAIN vs EXPLAIN ANALYZE
-- Plan without execution (fast, safe)
EXPLAIN SELECT * FROM orders WHERE user_id = 1;
-- Plan plus real execution plus wall-clock time
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 1;
-- Plan plus buffers and timing per node
EXPLAIN (ANALYZE, BUFFERS, TIMING) SELECT * FROM orders WHERE user_id = 1;
-- Machine-readable plan
EXPLAIN (ANALYZE, FORMAT JSON) SELECT * FROM orders WHERE user_id = 1;Plain EXPLAIN only asks the planner what it intends to do — nothing runs. EXPLAIN ANALYZE actually executes the statement and reports predicted versus observed numbers side by side. That second column is gold: a planner that expects ten rows and produces ten million is the root cause of most "mystery" slow queries.
The footgun is running EXPLAIN ANALYZE on a DELETE, UPDATE, or INSERT in production — ANALYZE runs the statement. Use BEGIN; EXPLAIN ANALYZE ...; ROLLBACK; if you need real numbers on a write, or skip ANALYZE and read the estimated plan only.
How to read a plan tree
Sort (cost=123.45..125.67 rows=100 width=32) (actual time=5.1..5.8 rows=100 loops=1)
Sort Key: created_at
-> Index Scan using idx_orders_user on orders (cost=0.42..100.00 rows=100 width=32)
Index Cond: (user_id = 1)Plans read bottom-up and inside-out: the deepest indented node runs first, then its parent consumes its rows up to the root. Above, the Index Scan pulls 100 rows for user_id = 1, those feed Sort, and Sort emits the final ordered result.
Each line carries two cost-and-row blocks. The first parenthesis is the planner's estimate. The second appears only under ANALYZE and is reality. When those diverge by an order of magnitude, you are looking at the bug.
Scan types
Seq Scan
Seq Scan on orders (cost=0..1000 rows=10000 width=32)
Filter: (user_id = 1)A sequential scan reads every row and applies the filter. Fine on small tables or filters that return more than ten to twenty percent of rows — a full read can be cheaper than random index I/O. A red flag on a hundred-million-row fact table when the filter matches a thousand rows.
Index Scan
Index Scan using idx_orders_user on orders (cost=0.42..10.00 rows=100 width=32)
Index Cond: (user_id = 1)The planner walks the B-tree to find matching keys, then visits the heap to fetch each row. The bread-and-butter access path for filters on high-cardinality columns. Notice Index Cond versus Filter: the first means the index actually limited the rows; the second means rows came back and were discarded after the fact.
Index Only Scan
Index Only Scan using idx_orders_user_covering on orders
Index Cond: (user_id = 1)
Heap Fetches: 0When the index contains every column the query reads, Postgres never touches the heap. The fastest access pattern, achieved with a covering index using the INCLUDE clause. Heap Fetches: 0 is the tell — anything above zero means the visibility map needed a check.
Bitmap Heap Scan
Bitmap Heap Scan on orders
Recheck Cond: (status = 'paid')
-> Bitmap Index Scan on idx_orders_statusA hybrid: the index builds a bitmap of matching block pointers, then the heap is read in block order. Shines when an index returns thousands of scattered matches — random I/O collapses into sequential I/O.
Join algorithms
Nested Loop
Nested Loop (cost=0.42..500.00 rows=100 width=64)
-> Seq Scan on users u
-> Index Scan on orders o
Index Cond: (user_id = u.user_id)For every outer row, the inner side is probed. Fast when the outer side has few rows and the inner side has an index on the join key. Catastrophic when the outer side has a million rows and no inner index — your query disappears into the void for an hour.
Hash Join
Hash Join (cost=10.00..1000.00 rows=1000 width=64)
Hash Cond: (o.user_id = u.user_id)
-> Seq Scan on orders o
-> Hash
-> Seq Scan on users uThe smaller side is read into an in-memory hash table; the larger side streams past and probes the hash. The workhorse for equi-joins on tables of comparable size with no useful index on either join key.
Merge Join
Merge Join (cost=100.00..2000.00 rows=10000 width=64)
Merge Cond: (o.user_id = u.user_id)
-> Index Scan on orders o
-> Index Scan on users uBoth sides are read in sorted order and zipped together. Merge join wins on very large joins where the sort cost is already paid by an index. Rare but unbeatable when it fits.
Cost, rows, and width
(cost=1000.00..1500.00 rows=100 width=32)The two cost numbers are startup and total cost in arbitrary planner units. Startup is paid before the first row is emitted; total is what it takes to drain every row. Absolute values are meaningless — they exist only to compare alternative plans for the same query on the same machine.
rows is the planner's estimate of how many rows this node produces; width is the average row width in bytes. When EXPLAIN ANALYZE shows rows=1234567 next to rows=100, the estimate is off by four orders of magnitude — almost always stale statistics. The fix is ANALYZE schema.table_name;.
Common pitfalls
The first trap is reading cost as a wall-clock estimate. It is an internal score the planner uses to rank alternatives — a plan with cost 1500 is not "1.5 seconds slow", it is "more expensive than a plan with cost 800 on this machine". Always pair cost with actual time from EXPLAIN ANALYZE before concluding anything about real performance.
The second trap is treating Seq Scan as automatically bad. On a thousand-row dimension table a sequential scan is the right answer — random index lookups would be slower. The question to ask is "what fraction of the table is being returned?". Above roughly ten percent the planner prefers a scan, and it is usually right.
The third trap is missing the difference between Index Cond and Filter under an Index Scan. Index Cond means the index navigated directly to the rows; Filter means the index brought back extra rows that were then thrown away. Your index is leading with the wrong column, or the predicate cannot be pushed into it. Composite-index column order matters.
The fourth trap is calling EXPLAIN on a query with now(), random(), or any volatile function and assuming the plan you see is the plan the next run gets. Plans for parameterized queries can flip when the runtime value differs from the planner's assumed average. Use GENERIC_PLAN to inspect the prepared-statement form.
The fifth trap is forgetting BUFFERS. Base EXPLAIN ANALYZE shows time but not I/O. BUFFERS adds shared hit and shared read counts per node — that is how you tell "fast because cached" apart from "fast because the plan is good". Eighty milliseconds with all buffers hit becomes ten seconds on a cold cache.
Ten interview situations
A senior interview rarely asks a textbook EXPLAIN question. It asks "here is a slow query, walk me through what you'd do". The next ten patterns cover most prompts.
The first is a single-table predicate running slow. Run EXPLAIN ANALYZE, check the access node, ask whether it is a Seq Scan on a large table with a narrow filter. If so, does an index exist, and if it does, why is the planner ignoring it.
The second is a join that slowed down after a migration. Compare plans before and after — a switch from Hash Join to Nested Loop on a multi-million-row table is the usual villain, stale statistics the usual cause.
The third is GROUP BY getting expensive. HashAggregate is fast for narrow group counts; GroupAggregate requires sorted input and is chosen when many groups are expected or a sort already exists.
The fourth is ORDER BY ... LIMIT taking three seconds for ten rows. A Sort node above the access means Postgres pulled every matching row, sorted them all, and returned the top ten. A matching index lets the planner walk the B-tree and stop after ten.
The fifth is DISTINCT performance. It becomes either Unique over a sort or HashAggregate. The first is slow on wide rows; the second is memory-bound. Replacing DISTINCT with GROUP BY plus the right index is the standard fix.
The sixth is a CTE that materializes when it should inline. Before Postgres 12, CTEs were always materialized — an optimization fence. From 12 onward they inline unless you write WITH ... AS MATERIALIZED.
The seventh is a multi-column filter with only one indexed column. A composite index (user_id, status) gives Index Cond on both; two single-column indexes typically use one and filter the rest. Watch for BitmapAnd.
The eighth is COUNT(*) on a huge table. Postgres stores no row count; exact COUNT(*) is a Seq Scan. For analytics use pg_class.reltuples.
The ninth is a five-table join with strange join order. Set join_collapse_limit = 1 to force the listed order. "The planner picks badly on complex joins" is a real failure mode, not user error.
The tenth is CTE versus derived table. In modern Postgres they are equivalent unless you write MATERIALIZED.
Optimization tips
Add BUFFERS to every EXPLAIN ANALYZE when diagnosing real performance — cache effects dwarf plan effects on warm databases. Cold-cache timing is what matters for the first hit; warm-cache for the steady-state cost of a dashboard refreshed every minute.
Refresh statistics aggressively when planned and actual rows diverge. ANALYZE schema.table is cheap and idempotent; an autovacuum schedule that lags behind a bulk-load job is the single most common cause of bad plans in self-hosted Postgres.
Use covering indexes (CREATE INDEX ... ON tbl(a, b) INCLUDE (c, d)) for hot-dashboard queries. The visibility-map plus Index Only Scan combo skips the heap fetch entirely and turns a hundred-millisecond query into a two-millisecond one.
Partition very large fact tables on the column you filter or join on most often. The plan shows partition pruning explicitly — if every partition is scanned, the predicate is not pruning and the scheme is not helping. Pruning is the only reason to partition.
Related reading
- SQL window functions interview questions
- CTE SQL cheat sheet
- CTE vs subquery in SQL
- CTE vs temp table in SQL
- NULL in SQL cheat sheet
If you want to drill query-plan reading and SQL optimization questions like these every day, NAILDD is launching with hundreds of analyst-loop SQL problems exactly in this format.
FAQ
When should I use EXPLAIN versus EXPLAIN ANALYZE?
Use plain EXPLAIN to peek at the plan without running the query, particularly on writes. Use EXPLAIN ANALYZE when you need real numbers to compare against estimates — that comparison tells you whether statistics are stale or the planner picked badly. Never run EXPLAIN ANALYZE on an INSERT, UPDATE, or DELETE in production unless wrapped in BEGIN; ... ROLLBACK;.
How can I tell if an index is being used?
Look for Index Scan or Index Only Scan on the access path, and check that the predicate appears under Index Cond rather than Filter. A filter under an Index Scan means the index returned more rows than necessary and a post-filter discarded the rest. If you see only Seq Scan, the planner declined the index — usually stale statistics, a function on the column, or an implicit cast.
What does cost actually measure?
An internal score in arbitrary units built from seq_page_cost, random_page_cost, cpu_tuple_cost, and others. The absolute number does not translate to milliseconds, but on a single machine the plan with lower cost is the one Postgres expects to run faster. Always confirm with EXPLAIN ANALYZE actual time.
Why is Postgres ignoring my index?
Usual suspects: a function or cast on the indexed column (WHERE LOWER(email) = ...), implicit type mismatch, stale statistics, a table small enough that Seq Scan is cheaper, or an index whose leading column is not in the predicate. Run ANALYZE first, then check for function or cast issues.
How do I read the actual time numbers?
actual time=5.1..5.8 means the node took 5.1 ms before the first row and 5.8 ms in total. For nodes that run in a loop — typically the inner side of a Nested Loop — multiply by loops=N for total time. actual time=1.0..2.0 loops=1000000 is two million milliseconds, not two.
Should I memorize cost formulas?
No. Senior interviewers want you to read a plan, spot the bottleneck, and propose a fix — not to compute seq_page_cost * pages in your head. The model that helps: estimated vs actual rows points to statistics, scan type to indexes, join algorithm to sizes and join-key indexes, buffers to cache. That covers every interview prompt.