EXPLAIN and query plan on a DE interview

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Why this comes up on a DE interview

Every production data engineering ticket eventually collapses into one sentence: "this query is slow, fix it." Without EXPLAIN you are guessing — adding indexes hoping something sticks, rewriting joins on vibes, blaming the warehouse. That is why interviewers at Stripe, Snowflake, Databricks, Airbnb, and Uber put a plan in front of you and ask, "what is wrong here." They want to see whether you can debug a slow query the same way you would debug a slow service: with evidence.

The classic failure mode on the job, and the one interviewers love to expose, is the engineer who adds an index "and it did not help." The planner did not pick the index, and the only way to know that is to read the plan. If you cannot tell a Seq Scan from a Bitmap Heap Scan, or estimate when Hash Join beats Nested Loop, the rest of the interview is fighting the question instead of answering it. Reading a plan is the same skill as reading a stack trace — it shows you have actually shipped Postgres in production.

EXPLAIN vs EXPLAIN ANALYZE

EXPLAIN SELECT * FROM users WHERE email = 'a@b';
-- planner's predicted plan, query is not executed

EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'a@b';
-- actually runs the query and reports real timings

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT ...;
-- adds page-level cache vs disk reads

The options worth memorising: ANALYZE runs the query and reports real rows and timings; BUFFERS shows how many pages came from shared_buffers (cache) versus disk; VERBOSE adds qualifier details; FORMAT JSON is machine-readable for scripted parsing; WAL reports WAL volume for write statements; SETTINGS lists non-default GUCs that influenced the plan.

The trap with ANALYZE is that it really runs the statement. If the statement is a DELETE or UPDATE, it really deletes or updates. The only safe way to time a destructive statement is inside a transaction you immediately roll back:

BEGIN;
EXPLAIN ANALYZE DELETE FROM big_table WHERE created_at < '2020-01-01';
ROLLBACK;

If an interviewer hands you a workbook and asks you to "just run EXPLAIN ANALYZE on this UPDATE," pause and wrap it in a transaction before you do. They are watching for that habit.

How to read the plan tree

A plan is a tree of nodes. The rule that trips up beginners: read it bottom-up, inside-out. The deepest indented node runs first; its output feeds the parent; the topmost node is what the client gets back.

Sort  (cost=10.5..10.6 rows=10 width=40) (actual time=0.2..0.2 rows=12 loops=1)
  Sort Key: u.created_at
  ->  Hash Join  (cost=2.0..10.4 rows=10 width=40)
        Hash Cond: (o.user_id = u.id)
        ->  Seq Scan on orders o  (cost=0..6.0 rows=100)
        ->  Hash  (cost=1.5..1.5 rows=10)
              ->  Index Scan on users u  (cost=0.1..1.5 rows=10)
                    Index Cond: (status = 'active')

Per line you get: cost=A..B (planner's estimate, A is startup cost, B is total; units anchored to a sequential page read = 1.0), rows (estimated row count), width (average row size in bytes), actual time=A..B (real milliseconds, only with ANALYZE), and loops (how many times the node ran — critical for Nested Loop, since total time is actual time × loops).

The single most important habit: compare estimated rows to actual rows. If they differ by more than ~10x, the optimiser is flying blind. That usually points to stale statistics, correlated predicates, or a planner-defeating expression on an indexed column. Run ANALYZE, check pg_stats, and if columns are correlated consider CREATE STATISTICS for multi-column estimates.

Scan methods

A Seq Scan reads the entire heap in physical order. Interviewers love a candidate who says "Seq Scan is not automatically bad." On a small table, or when you are returning more than 5-20% of rows, a Seq Scan beats an index lookup because streaming pages costs less than random IO plus heap visits.

An Index Scan walks the B-tree, then visits the heap for each match. It wins when selectivity is high — pulling a handful of rows out of millions. The catch: every match is a random page read; if you match a third of the table, an Index Scan is slower than a Seq Scan.

An Index Only Scan is the same walk, but the columns needed are fully covered by the index, and the visibility map says it can skip the heap. You get this from a covering index (INCLUDE columns) plus regular VACUUM. If the plan shows Heap Fetches: > 0, the visibility map is stale; run VACUUM to recover the win.

A Bitmap Heap Scan with a child Bitmap Index Scan is the compromise: walk the index to build a bitmap of needed heap pages, then read those pages once each in physical order. This is what you want for medium selectivity, OR-ed predicates, and combining two indexes (BitmapAnd / BitmapOr nodes).

-- fast and tight
Index Scan using idx_user_id (cost=0.1..8.5 rows=1)

-- estimate=1 but actual=1M
-- → statistics are stale, or the predicate defeats the index

Join methods

A Nested Loop iterates the outer relation and, for each outer row, probes the inner — ideally through an index.

Nested Loop
  ->  Seq Scan on a (rows=10)
  ->  Index Scan on b using idx_b_a_id (rows=1, loops=10)

This wins when the outer set is small and there is a good index on the inner join key. The interview trap: a Nested Loop with one million loops and no inner index is a runaway query, even if each per-loop time looks tiny.

A Hash Join builds a hash table on the inner relation, then streams the outer through it.

Hash Join
  ->  Seq Scan on a (rows=1M)
  ->  Hash
        ->  Seq Scan on b (rows=10K)

It is the workhorse for equality joins on large inputs. The gotcha is work_mem: if the hash does not fit, Postgres spills to disk and you see multiple Batches.

A Merge Join scans both inputs in sorted order on the join key — either because indexes provide ordering or because the planner inserted explicit Sort nodes. Competitive for very large, already-sorted inputs. If the planner has to sort both sides, Hash Join usually wins.

Interview shortcut: if a join query is slow, the join method is almost always wrong because the row estimate is wrong. Open EXPLAIN ANALYZE, find the first node where estimated and actual rows diverge, and walk upward from there.

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Sort and aggregate nodes

A Sort node runs as in-memory quicksort or external merge on disk. The phrase you do not want to see is external merge Disk: 200000kB — the sort overflowed work_mem and is writing temp files. Fix: bump work_mem for the session, add an index matching the ORDER BY, or reformulate to remove the sort.

A Hash Aggregate groups rows through an in-memory hash; needs to fit in work_mem. A Group Aggregate runs after a Sort and streams groups one by one — slower per row, bounded memory. Postgres 13+ spills Hash Aggregate to disk instead of OOM-ing, but with the same penalty.

A Limit node helps top-N queries: with a matching index, ORDER BY indexed_col LIMIT 10 returns ten rows without scanning the table.

Pattern-matching real problems

The plan output is full of single-line hints that map to specific fixes:

Rows Removed by Filter: 9990 after Seq Scan on big_table (rows=10). You read 10 000 rows, threw away 9 990, kept 10. Fix: an index on the filter column, or a partial index if the predicate matches a selective constant.

rows=1 ... actual rows=1000000. The planner thinks the predicate is selective; reality says it is not. Run ANALYZE, inspect pg_stats.most_common_vals, and if columns are correlated, CREATE STATISTICS (dependencies) ON col_a, col_b FROM t.

Buckets: 32768 Batches: 4. Batches > 1 means the hash spilled to disk. Raise work_mem for the session, or pre-filter the inner relation.

Nested Loop (loops=10000000). Tens of millions of loops with no inner index is the "this query never returns" pattern. Add the index on the inner join column, or coerce the planner into a Hash Join — bumping work_mem is often enough, since the planner avoids Hash Join when it predicts a spill.

Common pitfalls

The pitfall most likely to lose you the offer is analysing EXPLAIN without ANALYZE. Plain EXPLAIN is a forecast based on statistics — it never executes, so rows numbers are estimates. Senior interviewers ask "is this an estimate or a measurement," and candidates who confuse the two end up on the bench.

A close cousin is running EXPLAIN ANALYZE on a destructive statement without a wrapping transaction. The query actually executes; the rows actually disappear. The fix is muscle memory: BEGIN; EXPLAIN ANALYZE ...; ROLLBACK; for any INSERT, UPDATE, DELETE, or DDL.

Ignoring loops in a Nested Loop is the third trap. People see actual time=0.1..0.2 and call the node fast. Total time is actual time × loops — ten million loops at 0.2 ms each is over half an hour. Read the plan node twice: per-iteration time, then the multiplier.

A subtler mistake is assuming Seq Scan is always bad. On a small table, or a query returning 30% of rows, Seq Scan is the right plan. Suggesting an index on a 1 000-row lookup table burns write-cost budget for microseconds of read time, and a good interviewer will push back.

Skipping ANALYZE after a bulk load is the final classic. After a COPY or backfill, statistics still reflect the old data shape, and the planner picks the wrong plan. Make ANALYZE part of the load pipeline. Also: parse FORMAT JSON, never the TEXT plan, and always include BUFFERSshared hit=10000 means warm cache, read=10000 means the same plan runs 100x slower cold.

If you want to drill plan-reading and SQL debugging questions every day, NAILDD is launching with 500+ SQL and data engineering problems shaped exactly like this one.

FAQ

What does cost actually measure?

An arbitrary unit anchored to one sequential page read = 1.0, tuned by GUCs like seq_page_cost, random_page_cost, cpu_tuple_cost. The number on its own is meaningless — only compare costs between candidate plans for the same query. On NVMe storage, lowering random_page_cost from the default 4.0 to ~1.1 is a one-line tune that often flips bad Seq Scans into Index Scans.

How do I force Postgres to use a specific index?

The right interview answer is "I do not — I give the planner better statistics." Run ANALYZE, confirm the column is in pg_stats, add extended statistics for correlated columns, and make sure the predicate is sargable. The wrong answer is SET enable_seqscan = OFF — useful for debugging that an index can be used, but shipping it is a flag for the on-call.

What does external merge Disk: 200000kB mean?

The Sort node did not fit in work_mem and is writing temp files. The literal fix is SET LOCAL work_mem = '512MB' inside a transaction, but the better fix is to ask whether the sort is necessary — often you can replace ORDER BY with an indexed scan or move the sort behind a pre-aggregated table.

Does this transfer to other engines?

The principles, not the syntax. Snowflake exposes plans through EXPLAIN USING JSON and the query profile UI; BigQuery shows a stage-by-stage execution graph; ClickHouse uses EXPLAIN PIPELINE and thinks in pipelines of streams rather than a tree of nodes. Scans, joins, aggregates, cardinality estimation, memory budgets — those carry over everywhere.

What is auto_explain and when should I turn it on?

A Postgres extension that logs the plan for any query exceeding auto_explain.log_min_duration (for example, 500 ms). The cheapest way to capture slow plans in production without reproducing them locally. Turn it on with BUFFERS and ANALYZE only on a slow-traffic replica or behind a feature flag — every-query overhead on a hot OLTP table is noticeable.