Recursive CTE for Data Engineering Interviews
Contents:
Why recruiters ask it
Recursive CTE is the SQL construct that separates analysts who write reports from data engineers who design systems. It shows up on DE screens at Snowflake, Databricks, Stripe, and Airbnb because three production problems — walking an org chart, filling missing dates, traversing a transaction graph — all collapse to one idiom. Write it on a whiteboard and the recruiter knows you understand fixpoint computation, termination, and accumulator semantics. Miss it and the fallback is a Python loop that pulls 50,000 rows over the wire.
The scenario that usually appears: the interviewer drops an employees(id, name, manager_id) table on the shared editor and asks you to return every report under person 1, with their depth. They are testing whether your first instinct is set-based recursion or a procedural loop. Pick the loop and the rest of the interview gets harder. Pick the CTE and you move on to the optimization round.
The same idiom solves a question that sounds different but is actually identical: a product manager wants daily revenue and a handful of days have no orders. A LEFT JOIN to a generated calendar fills the gaps; recursive CTE builds that calendar when generate_series is not available. Recruiters love this one because it disguises hierarchy thinking inside a reporting task.
Base syntax
Every recursive CTE has the same shape: an anchor query that produces the seed rows, a UNION ALL, and a recursive query that joins back to the CTE itself with a termination condition. The engine evaluates the anchor once, appends each new batch, and stops when the recursive part returns zero rows.
WITH RECURSIVE cte_name AS (
-- anchor: seed rows
SELECT ...
UNION ALL
-- recursive: references cte_name
SELECT ...
FROM cte_name
WHERE <termination condition>
)
SELECT * FROM cte_name;The four properties that matter on a whiteboard: the anchor must be non-recursive and return at least one row, the operator must be UNION ALL rather than UNION (the latter forces a sort and silent dedup that hides bugs), the recursive part must reference the CTE name exactly once, and the termination predicate must eventually return empty. Postgres and MySQL require the RECURSIVE keyword; SQL Server omits it but still demands UNION ALL. Snowflake follows the Postgres flavour. Knowing which dialect your interviewer uses changes the first line of the answer.
The mental model worth carrying in is a working table and a result table. The engine seeds the working table with anchor rows, copies them to the result, then on each step runs the recursive part against the current working table, copies new rows to both, and discards the previous working table. Termination is when the recursive part adds nothing.
Hierarchies: the org chart
The classic question: given employees(id, name, manager_id), return every direct and indirect report of person 1, tagged with their depth.
WITH RECURSIVE subordinates AS (
-- anchor: direct reports of person 1
SELECT id, name, manager_id, 1 AS depth
FROM employees
WHERE manager_id = 1
UNION ALL
-- recursive: reports of the people we already found
SELECT e.id, e.name, e.manager_id, s.depth + 1
FROM employees e
JOIN subordinates s ON e.manager_id = s.id
)
SELECT id, name, depth
FROM subordinates
ORDER BY depth, id;The depth column is the cheap upgrade that separates good answers from great ones. It costs one extra column and gives you free pretty-printing, a stop condition for unbalanced trees, and an interview talking point about why fixpoint computation needs an invariant. Recruiters at Databricks specifically prompt for it: "now print it indented", and the candidate who already has depth answers with LPAD('', depth * 2) while the candidate who does not has to rewrite.
The reverse traversal — walk up the chain to the CEO — is the same shape with the join flipped. Anchor on the target employee, then join employees against the CTE on e.id = m.manager_id.
WITH RECURSIVE managers AS (
SELECT id, name, manager_id, 1 AS lvl
FROM employees
WHERE id = 42
UNION ALL
SELECT e.id, e.name, e.manager_id, m.lvl + 1
FROM employees e
JOIN managers m ON e.id = m.manager_id
)
SELECT id, name, lvl FROM managers WHERE id <> 42;If the company seeds you with a real dataset and asks for "the path", concatenate names into an ARRAY or string at each step. That detail comes up in graph-style follow-ups and is also the cycle defence you need in the next section.
Date series generation
Postgres and Greenplum ship generate_series; ClickHouse has range and arrayJoin; Snowflake has GENERATOR. When the dialect lacks any of these, or the interviewer asks you to do it from scratch to test your understanding, recursive CTE is the portable answer.
WITH RECURSIVE date_series AS (
SELECT DATE '2026-01-01' AS d
UNION ALL
SELECT d + INTERVAL '1 day'
FROM date_series
WHERE d < DATE '2026-12-31'
)
SELECT d FROM date_series;The recruiter follow-up is always the same: left-join this against an orders table so days without sales appear as zero rows. This is the exact pattern that backs "metrics with no gaps" dashboards.
WITH RECURSIVE date_series AS (
SELECT DATE '2026-01-01' AS d
UNION ALL
SELECT d + INTERVAL '1 day'
FROM date_series
WHERE d < DATE '2026-12-31'
)
SELECT
ds.d AS event_date,
COALESCE(SUM(o.amount), 0) AS revenue
FROM date_series ds
LEFT JOIN orders o ON DATE(o.created_at) = ds.d
GROUP BY ds.d
ORDER BY ds.d;Two interview details to flag: cast both sides to the same date type or the join silently produces nulls, and prefer COALESCE(SUM(...), 0) to a join filter so empty days survive the aggregation.
Graphs and BFS
Given edges(from_id, to_id), find every node reachable from node 1. This is a breadth-first traversal expressed as fixpoint iteration.
WITH RECURSIVE reachable AS (
SELECT 1 AS node, 0 AS depth
UNION ALL
SELECT e.to_id, r.depth + 1
FROM edges e
JOIN reachable r ON e.from_id = r.node
)
SELECT DISTINCT node FROM reachable;This works on a tree. On a graph with cycles it does not terminate, because the recursive step keeps rediscovering the same nodes. The defence is to carry the visited path as an array and refuse to re-enter it.
WITH RECURSIVE reachable AS (
SELECT 1 AS node, 0 AS depth, ARRAY[1] AS path
UNION ALL
SELECT e.to_id, r.depth + 1, r.path || e.to_id
FROM edges e
JOIN reachable r ON e.from_id = r.node
WHERE NOT (e.to_id = ANY(r.path))
AND r.depth < 10
)
SELECT DISTINCT node FROM reachable;The r.depth < 10 is a belt-and-braces cap that makes the query safe even if the cycle check is wrong. In production, replace 10 with the diameter of your graph or a tuned constant. Interviewers at Stripe and Snowflake reward candidates who add both checks without being prompted.
Performance reality
Recursive CTE is not magic and it is not a query optimizer's friend. Three behaviours catch candidates out.
Recursive CTEs are evaluated iteratively, not as a single set operation. Each step pays the cost of a fresh join against the accumulator, so an O(n) traversal becomes O(depth * cost_per_step). On Postgres the recursive accumulator is not inlined into the outer plan; it materializes a working table per iteration. On Snowflake the engine spills the accumulator to a temporary if it grows past warehouse memory. On a one-million-row hierarchy you will hit out-of-memory before you hit a correct answer.
Postgres-specific footnote: pre-12, all CTEs were optimization fences and always materialized. From 12 onwards, plain CTEs are inlined when used once, but recursive CTEs are still a nested loop and never inline. If your hierarchy fits in memory and the join keys are indexed, this is fine. If they are not, every step rescans.
The escape hatches in real systems: generate_series for dates and numbers, a procedural loop in PL/pgSQL for branching logic that does not fit set semantics, a closure table — a precomputed (ancestor, descendant, depth) triple — for read-heavy hierarchy queries, and a graph database such as Neo4j for graphs beyond a few hundred thousand edges. The closure-table answer is the one Databricks recruiters want to hear, because it explains the read-write trade.
Common pitfalls
Forgetting the RECURSIVE keyword in Postgres or MySQL fails parsing immediately, but candidates who started on SQL Server often miss it on a Snowflake screen. The keyword is a hint to the planner about which CTEs may self-reference, not just a marker for the reader. Memorize the four-dialect cheat: Postgres, MySQL, Snowflake require WITH RECURSIVE; SQL Server uses WITH and infers recursion from the self-reference.
Using UNION instead of UNION ALL is the trap that does not throw, only silently changes results. UNION deduplicates by sorting, which means the recursive expansion stops adding new rows the moment a duplicate appears — even one you did not expect, like the root node showing up via two paths in a diamond hierarchy. The dedup hides the bug behind a plausible-looking answer. Use UNION ALL everywhere and dedup with SELECT DISTINCT on the outer query if needed.
Missing the termination predicate is the textbook fail. A recursive part with no WHERE will run until the engine kills it for blowing memory or depth. Postgres has no default max_recursion_depth, so the symptom is a hung connection. The fix is twofold: a logical predicate that defines done (d < final_date, depth < max_depth, node NOT IN visited) and a numeric safety cap so a wrong predicate cannot lock the warehouse.
Forgetting cycle protection on graph queries does the same in slower motion. Even if the graph is acyclic today, the moment someone inserts a self-loop or a back-edge, the query runs forever. Carrying the path as an array and filtering NOT (e.to_id = ANY(r.path)) costs one column and one predicate.
Expecting an ordering on the result is the silent killer that breaks downstream reporting. Recursive CTEs do not guarantee row order across iterations. If the consumer needs depth-first or by-name order, add an explicit ORDER BY on the outer select, or expose a depth column and sort by it. The mistake usually surfaces in code review months later when the report's row order changes after a planner update.
Related reading
- CTE vs subquery in SQL
- CTE WITH SQL guide
- Window functions for data engineering interviews
- EXPLAIN and query plan for data engineering interviews
- Materialized views for data engineering interviews
- SQL window functions interview questions
If you want to drill recursive CTE and the rest of the SQL data engineering screen on real recruiter prompts, NAILDD is launching with 1,500+ problems sliced by exactly this pattern.
FAQ
Does recursive CTE work in ClickHouse?
ClickHouse added partial support in 22.x, but the engine is not optimized for it and most teams avoid the pattern. The idiomatic ClickHouse answer for hierarchies is a hierarchical dictionary loaded from another source, a denormalized path column maintained at write time, or a precomputed closure table joined from an external store. If you face a ClickHouse interviewer, flag this distinction immediately — they are testing whether you know not to fight the engine.
What is a closure table and when do you reach for it?
A closure table stores every ancestor-descendant pair explicitly, with one row per pair and a depth column. Querying "all descendants of X" becomes a simple WHERE ancestor = X filter against an indexed table, with no recursion at runtime. The cost is on writes: inserting a new node requires inserting one row per ancestor it inherits. It pays off on read-heavy hierarchies that change rarely — org charts, category trees, comment threads — and not on graphs that mutate every second.
Can I use recursive CTE to generate test data?
Yes, and it is one of the few legitimate uses on production warehouses. WITH RECURSIVE n AS (SELECT 1 AS i UNION ALL SELECT i + 1 FROM n WHERE i < 10000) SELECT i FROM n returns ten thousand rows, and you can multiply it with cross joins to seed millions for benchmarks. On Snowflake the GENERATOR function is faster; on Postgres generate_series(1, 10000) is faster. The recursive form is the portable fallback.
How do I stop recursion on a non-standard condition?
Any predicate in the recursive part's WHERE works. Stop by depth (r.depth < 10), by value (d < final_date), by inclusion (node NOT IN (SELECT node FROM accumulator)), or by combinations of all three. The only rule is the predicate has to eventually flip to false for every traversal path the engine could take.
Does Spark SQL support recursive CTE?
Spark 3.4 added experimental support behind the spark.sql.legacy.allowRecursionInWith flag. It is not part of the standard Spark optimizer path and most teams who run hierarchies on Spark use GraphFrames or write the iterative joins by hand. If a Databricks interviewer asks you to write recursive CTE on Spark, the right move is to write it and then say "in production I would use GraphFrames for graphs and a closure table for hierarchies" — they want to see both the SQL fluency and the platform awareness.
Is this an official source?
No. The patterns are drawn from the SQL:1999 standard and the public documentation of Postgres, MySQL 8, SQL Server, Snowflake, and Spark 3.4. Every dialect has its own quirks at the edges; verify against your engine's docs before shipping production code.