SQL on the data analyst interview

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Why SQL still decides the loop

SQL is the single most predictive signal for an analyst loop. Across pipelines at Google, Meta, Stripe, Airbnb, and DoorDash, a SQL screen sits in front of every onsite, and a SQL-heavy round shows up at roughly 90% of mid-and-above interviews even when the day job is mostly Python. You can be brilliant at framing and still get downleveled because you froze on a ROW_NUMBER question with two joins.

The reason is mechanical: SQL is the lowest-variance way to test whether you can move from a vague request to a correct, runnable artifact in under 25 minutes. Python is harder to grade fairly in a 45-minute slot. SQL has tight rubrics, deterministic outputs, and the interviewer can read your join order in five seconds.

The good news: the repertoire is bounded. 20-30 patterns cover 90% of what gets asked. Drilling 30 well-chosen problems beats grinding 300 random ones.

Formats you will face

Format determines what the interviewer can grade, which determines what you should rehearse. The three you see in 2026:

Format Time What it tests Where it shows up
Live coding in a shared editor 30-45 min, 2-3 questions Syntax fluency, debugging under pressure, talking through joins Most onsites at Meta, Stripe, Airbnb, DoorDash
Shared-screen whiteboard or chat pad 30 min, 1-2 questions Logic and decomposition, less about exact syntax Phone screens, Google early rounds, some Microsoft loops
Take-home with a synthetic dataset 2-24 hours End-to-end query design, comments, sometimes a writeup Smaller startups, some Snowflake and Databricks roles

The principle is identical across all three: understand the question, narrate as you decompose, sanity-check the output. Silent LeetCode runs lose points in live coding; take-home-only practice misses the syntax-recall reps you need under pressure.

The 8 question types

Almost every analyst SQL question slots into one of eight buckets. Recognize the bucket and your candidate solutions narrow to two or three.

# Pattern Typical prompt Core construct
1 Filter + aggregate "Top 10 customers by spend last year" GROUP BY, ORDER BY, LIMIT
2 Joins (1:1, 1:N, N:M) "Users and their most recent order" LEFT JOIN, LATERAL, DISTINCT ON
3 Window functions "Second order per user", "running total" ROW_NUMBER, SUM OVER, LAG
4 Date arithmetic "DAU last 30 days", "Day 7 returners" DATE_TRUNC, INTERVAL
5 CTEs / subqueries Structuring multi-step logic WITH ... AS (...)
6 Self-joins "Consecutive purchases", "manager-report pairs" aliasing the same table twice
7 Pivots "Revenue by platform per month, columns = platforms" CASE WHEN + SUM
8 Cohorts / retention "Day-N retention by signup cohort" join cohort to activity, CASE WHEN

The frequency is heavily skewed. In a sample of 60 published analyst onsite questions from levels.fyi and Glassdoor across 2024-2025, window functions and cohort/retention accounted for more than half of medium-and-hard slots. Filter-aggregate is table stakes for screens; pivots show up at Snowflake and Databricks.

Load-bearing trick: If the prompt says "for each X, find the top/second/last Y", reach for ROW_NUMBER() OVER (PARTITION BY x ORDER BY y). That single pattern unlocks types 2, 3, and 8 in this table.

Worked recipes

Four recipes that, between them, cover roughly 70% of the medium-hard slots. Each is what you would type in the editor — not pseudocode, not a hand-wave.

Top N per group

The most-asked window-function question. Prompt: "For each department, return the two highest-paid employees."

WITH ranked AS (
    SELECT
        department,
        name,
        salary,
        ROW_NUMBER() OVER (
            PARTITION BY department
            ORDER BY salary DESC
        ) AS rn
    FROM employees
)
SELECT department, name, salary
FROM ranked
WHERE rn <= 2;

The variants matter. Use RANK() if ties should share a rank and consume slots; use DENSE_RANK() if ties share a rank but the next value gets the next integer. Interviewers love asking the difference because it separates rote memorization from understanding.

Running totals and rolling MAU

Prompt: "Cumulative revenue by day."

SELECT
    DATE,
    revenue,
    SUM(revenue) OVER (ORDER BY DATE) AS cumulative_revenue
FROM daily_revenue
ORDER BY DATE;

Drop the PARTITION BY and the window spans everything. Add it back when the prompt says "per customer" or "per cohort".

Rolling MAU — a 30-day distinct count per day — is the question that catches people. The naive instinct is COUNT(DISTINCT user_id) OVER (ORDER BY day ROWS BETWEEN 29 PRECEDING AND CURRENT ROW). That throws DISTINCT is not implemented for window functions in Postgres. Use a correlated subquery instead:

WITH days AS (
    SELECT DISTINCT DATE(event_time) AS day
    FROM events
    WHERE event_time >= CURRENT_DATE - INTERVAL '60 days'
)
SELECT
    d.day,
    (SELECT COUNT(DISTINCT e.user_id)
     FROM events e
     WHERE e.event_time >= d.day - INTERVAL '29 days'
       AND e.event_time <  d.day + INTERVAL '1 day') AS mau_rolling
FROM days d
ORDER BY d.day;

Gotcha: In BigQuery and Snowflake, COUNT(DISTINCT) OVER is allowed in newer versions, but the safe answer in an interview is the correlated subquery — it runs everywhere and signals you know the portability gotcha.

Cohort retention

Prompt: "For each signup cohort, return Day-1 and Day-7 retention."

WITH cohorts AS (
    SELECT
        user_id,
        DATE(signup_time) AS signup_date
    FROM users
),
activity AS (
    SELECT DISTINCT
        user_id,
        DATE(event_time) AS active_date
    FROM events
)
SELECT
    c.signup_date AS cohort,
    COUNT(DISTINCT c.user_id) AS cohort_size,
    COUNT(DISTINCT CASE
        WHEN a.active_date = c.signup_date + INTERVAL '1 day' THEN c.user_id
    END) * 1.0 / COUNT(DISTINCT c.user_id) AS day_1_retention,
    COUNT(DISTINCT CASE
        WHEN a.active_date = c.signup_date + INTERVAL '7 days' THEN c.user_id
    END) * 1.0 / COUNT(DISTINCT c.user_id) AS day_7_retention
FROM cohorts c
LEFT JOIN activity a ON c.user_id = a.user_id
WHERE c.signup_date >= CURRENT_DATE - INTERVAL '60 days'
GROUP BY c.signup_date
ORDER BY c.signup_date;

Multiply by 1.0 to force the division to floating point. Forgetting that and returning 0 because of integer division is the most common silent failure on this question.

Consecutive-day streaks

Prompt: "Find users with three consecutive days of activity." The trick is to subtract a row number from the date — consecutive days collapse into the same group.

WITH daily_activity AS (
    SELECT DISTINCT user_id, DATE(event_time) AS active_day
    FROM events
),
numbered AS (
    SELECT
        user_id,
        active_day,
        active_day - INTERVAL '1 day' * ROW_NUMBER() OVER (
            PARTITION BY user_id ORDER BY active_day
        ) AS streak_group
    FROM daily_activity
),
streaks AS (
    SELECT
        user_id,
        streak_group,
        COUNT(*) AS streak_length
    FROM numbered
    GROUP BY user_id, streak_group
)
SELECT DISTINCT user_id
FROM streaks
WHERE streak_length >= 3;

When you walk this through aloud, name the trick: "row number subtracted from the date is constant within a streak". Interviewers grade narration as much as code.

Median by group

SELECT
    department,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary
FROM employees
GROUP BY department;

If the dialect lacks PERCENTILE_CONT (older MySQL), fall back to the symmetric rank trick — but flag the dialect question first. Asking "is this Postgres, MySQL, or something else?" is a free signal that you think portably.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Live-coding behavior that wins

Interviewers grade three axes: correctness, decomposition, and communication. Most candidates over-index on correctness and ignore the other two.

Narrate before typing. Twenty seconds of "start from orders, left-join activity, partition by user, rank" buys a free sanity check. If you go down the wrong path, the interviewer course-corrects before you waste five minutes.

Start from a tiny example. "Say three users with two orders each." Walk through expected output by hand, write the query, verify it matches. This catches off-by-one date errors and integer-division bugs before the interviewer sees them.

Ask about formats and edge cases. "Are timestamps UTC?" "Can a user have duplicate rows on the same day?" "Is signup_date inclusive in the cohort?" Each question is worth a quarter-grade. Senior candidates ask; juniors charge in.

Read your own code back at the end. "Here I partition by user, order by date, take the second row, join back for amount." If you trip over your narration, you have a bug. Fixing your own bug live is a strong signal.

Common pitfalls

The first pitfall is writing the final query immediately without discussing the approach. The query may even be correct, but you have given the interviewer nothing to grade except syntax. Senior interviewers explicitly mark "did not decompose the problem" as a downgrade, even on correct solutions. The fix is a 30-second verbal outline before you type anything.

A second trap is panic-typing syntax errors — a missing comma, a GROUP BY that omits a non-aggregated column, an unbalanced parenthesis in a window clause. The fix is mechanical: read the query top to bottom out loud before running it. The act of vocalizing surfaces three out of four typos.

A third pitfall is avoiding CTEs in favor of nested subqueries. Deeply nested subqueries are technically equivalent but they are unreadable, and unreadable code drags your communication score down. As a rule, if a subquery is referenced more than once or is more than five lines, lift it into a CTE. Modern planners optimize CTEs the same as inline subqueries in Postgres 12+, Snowflake, and BigQuery — there is no performance cost.

A fourth is forgetting how NULL interacts with aggregations. COUNT(*) counts rows. COUNT(column) skips NULLs. COUNT(DISTINCT column) skips NULLs and dedupes. The three return different numbers on the same data, and interviewers probe specifically to see if you know it. If the question is "how many users did X", default to COUNT(DISTINCT user_id) and say why.

A fifth is ignoring performance questions on senior loops. For L4-and-above analyst roles, expect "what does this query cost on a billion-row table?" The answer is rarely a number — it is a vocabulary check. Mention join order, whether the window forces a sort, whether the filter is sargable, and whether you would materialize the CTE if it ran nightly. Three sentences of cost reasoning beats a perfect query you cannot reason about.

If you want to drill exactly these patterns under interview-realistic time pressure, NAILDD is launching with 500+ SQL problems organized by the eight patterns above, each with a graded narration walkthrough.

FAQ

Which SQL dialect should I prepare in?

Default to PostgreSQL — it is the most portable, the most-used in interview platforms, and the closest to the ANSI standard. Once you are fluent in Postgres, the deltas to Snowflake, BigQuery, and Redshift are small enough to pick up in a weekend. The exception is if the role explicitly lists a stack — a Databricks role wants you fluent in Spark SQL, a fintech with a ClickHouse warehouse will probe ClickHouse-specific functions. Read the JD and ask the recruiter what the live editor uses.

Are SQL certifications worth listing on a resume?

For analyst roles in 2026, certifications are weakly positive at best. What recruiters and hiring managers actually scan for is runnable, public evidence: a GitHub with cleaned-up project queries, a portfolio post that walks through a non-trivial analysis, or LeetCode-style problem completions visible on your profile. A certificate without that evidence reads as box-checking. With that evidence, the certificate is redundant.

Do I need to memorize every window function?

Memorize five: ROW_NUMBER, RANK, DENSE_RANK, LAG, and SUM OVER. Be able to name them, write the syntax, and explain the difference between RANK and DENSE_RANK without notes. Beyond those five — NTILE, PERCENT_RANK, FIRST_VALUE, LAST_VALUE, NTH_VALUE — recognize them when prompted but you do not need to write them from memory. If the interviewer asks for NTH_VALUE, asking "is that the one that returns the nth row in the window?" is a fine signal.

What if I forget the exact syntax in the live round?

In live coding, write what you remember and narrate the gap: "I want a rolling 30-day window — the syntax is ROWS BETWEEN 29 PRECEDING AND CURRENT ROW, I think, let me check the structure." Most interviewers will confirm the syntax or let you pseudo-code it and move on. Frozen silence is the only wrong move. In a take-home, looking up syntax is expected and not penalized.

How long should I prepare?

For a strong baseline analyst, four to six weeks of consistent daily practice — say 60-90 minutes per day — is enough to handle most loops. The schedule that works: two weeks on the eight patterns above, two weeks of timed medium problems, then one to two weeks of mock interviews under live conditions. Cram a week and you will pass screens and fail onsites; the differentiation happens under time pressure.

Should I practice on paper or in an editor?

Both. Editor practice builds syntax muscle memory and lets you verify correctness. Paper or whiteboard practice — no autocomplete, no run button — builds the decomposition reflex that wins whiteboard rounds. Aim for an 80/20 split favoring the editor, but never skip paper rounds entirely. Editor-only candidates are the ones who freeze when the interviewer pastes the schema into a Google Doc.