How to do gap analysis in SQL

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Why gap analysis matters

A product manager pings you on a Wednesday morning: the daily revenue dashboard "looks weird" around late March, and they want a list of every date where the data is missing or the metric did not refresh. Five minutes later the engineering lead drops a separate request — they need the longest login streak per user for a retention campaign shipping Friday. Both questions live in the same SQL family. They are gap analysis problems, solved by the same pattern: find consecutive runs of events and the breaks between them.

The technical name is "gaps and islands." Islands are stretches where something is continuous — a user logging in every day, an order ID sequence with no holes, a status that stayed the same for two weeks. Gaps are the holes between. Once you can label every row with the island it belongs to, you can count streak length, find the first and last day, or compute exactly how many days of data are missing. The pattern shows up under a dozen interview prompts at Stripe, Snowflake, Netflix, Airbnb, Uber — streaks, session boundaries, log gaps, retention runs.

This post walks the four most common variants with runnable SQL that ports to Postgres, Snowflake, BigQuery, or Databricks with minor syntax tweaks, then collects the traps that get candidates dinged.

The gaps and islands trick

The core insight is small and a little magical the first time you see it. If a user is active on consecutive calendar days, then active_date - ROW_NUMBER() produces the same constant value for every row in the streak. The moment a gap appears, the constant changes, and a new island begins. That constant becomes your group key — GROUP BY it and you get one row per island with start date, end date, and length.

WITH activity AS (
  SELECT
    user_id,
    active_date,
    active_date - INTERVAL '1 day' * ROW_NUMBER() OVER (
      PARTITION BY user_id ORDER BY active_date
    ) AS streak_anchor
  FROM user_activity
)
SELECT
  user_id,
  streak_anchor,
  COUNT(*)              AS streak_length,
  MIN(active_date)      AS streak_start,
  MAX(active_date)      AS streak_end
FROM activity
GROUP BY user_id, streak_anchor
ORDER BY user_id, streak_start;

Read that subquery carefully. For a user who logged in on May 1, 2, 3, the row numbers are 1, 2, 3 and active_date - n days collapses to April 30 for every row. Skip May 4 and log in May 5, 6, and the next two rows produce May 1, May 1 — a different anchor, a new island. The outer query groups on streak_anchor and counts. No correlated subqueries, no self-joins, one window function. BigQuery spells the arithmetic as DATE_SUB(active_date, INTERVAL n DAY); Snowflake accepts the Postgres form. Only the function call changes.

Missing dates in a daily series

The cleanest formulation of "find dates with no data" uses LAG() to compare each row to its predecessor and emits a row only when the gap is larger than one day.

WITH ordered AS (
  SELECT
    metric_date,
    LAG(metric_date) OVER (ORDER BY metric_date) AS prev_date,
    metric_date - LAG(metric_date) OVER (ORDER BY metric_date) AS gap_days
  FROM daily_metrics
)
SELECT
  prev_date              AS gap_start,
  metric_date            AS gap_end,
  gap_days - 1           AS missing_days
FROM ordered
WHERE gap_days > 1
ORDER BY gap_start;

The output is one row per gap with the dates on either side and the count of calendar days in the hole. If you need the actual list of missing dates, generate them with generate_series (Postgres) or GENERATE_DATE_ARRAY (BigQuery) and left-join to your fact table — every row where the fact side is null is a missing date. That approach is heavier but it lets you fill the holes with zeros instead of pretending those days never existed, which matters for any time series you average over. For an "orders placed" table you only care about gaps in days the system was supposed to be processing — filter the spine to business days first or you will create false alarms every weekend.

Consecutive active days per user

Streak length per user — the metric Duolingo built half its product around — extends the basic gaps-and-islands recipe with one more aggregation layer. First compute every streak the user has ever had, then pick the longest one and, separately, the one that is still alive today.

WITH grouped AS (
  SELECT
    user_id,
    active_date,
    active_date - INTERVAL '1 day' * ROW_NUMBER() OVER (
      PARTITION BY user_id ORDER BY active_date
    ) AS group_key
  FROM user_activity
),
streaks AS (
  SELECT
    user_id,
    group_key,
    COUNT(*)              AS streak_length,
    MIN(active_date)      AS start_date,
    MAX(active_date)      AS end_date
  FROM grouped
  GROUP BY user_id, group_key
)
SELECT
  user_id,
  MAX(streak_length)                                              AS longest_streak,
  MAX(streak_length) FILTER (WHERE end_date = CURRENT_DATE - 1)   AS current_streak
FROM streaks
GROUP BY user_id
ORDER BY longest_streak DESC;

The FILTER (WHERE end_date = CURRENT_DATE - 1) clause keeps the current streak alive only if the user was active yesterday — most products consider a streak broken once you miss a day. If your product treats "today not yet visited" as still on streak, change the predicate to end_date >= CURRENT_DATE - 1. If you support a grace period (some apps give one freeze per week), join in a streak_freezes table and bridge the gap accordingly.

Gaps in a sequential ID column

Sometimes the gap question is not about time. An interviewer at Stripe will ask whether your events.id sequence has missing values — a missing event ID can mean a dropped queue message, which can mean lost revenue. Same LAG() recipe, slightly different math.

WITH ordered AS (
  SELECT
    id,
    LAG(id) OVER (ORDER BY id) AS prev_id
  FROM events
)
SELECT
  prev_id              AS last_id_before_gap,
  id                   AS first_id_after_gap,
  id - prev_id - 1     AS missing_count
FROM ordered
WHERE id - prev_id > 1
ORDER BY missing_count DESC;

This works on any monotonically increasing integer — order IDs, invoice numbers, autoincrement primary keys. It does not work on UUIDs or snowflake IDs where consecutive logical events do not produce consecutive numeric values, so check the ID generator before promising this query. If your sequence resets every day (ticket 1 to N within each support shift), add PARTITION BY shift_date to the window so LAG does not bleed across days.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Islands by changing attribute

This variant comes up when an attribute changes over time and you need to find runs where it held one value. A user's subscription status goes trial, trial, paid, paid, paid, paused, paid, paid — how long was the longest paid run? The row-number trick does not work directly because the values are not consecutive integers. The standard fix is a "starts" flag plus a running sum, which assigns a fresh island ID every time the value changes.

WITH status_changes AS (
  SELECT
    user_id,
    status_date,
    status,
    LAG(status) OVER (PARTITION BY user_id ORDER BY status_date) AS prev_status,
    SUM(CASE
      WHEN LAG(status) OVER (PARTITION BY user_id ORDER BY status_date) = status THEN 0
      ELSE 1
    END) OVER (PARTITION BY user_id ORDER BY status_date) AS island_id
  FROM user_status_history
)
SELECT
  user_id,
  status,
  island_id,
  MIN(status_date)     AS island_start,
  MAX(status_date)     AS island_end,
  COUNT(*)             AS island_length_days
FROM status_changes
GROUP BY user_id, status, island_id
ORDER BY user_id, island_start;

When the current row's status matches the previous row's status, the flag is zero and the running sum does not advance — same island. When it differs, the flag is one and the sum jumps — new island. The first row of each user gets a one because LAG returns null, making the comparison fail.

Common pitfalls

The most common mistake is reaching for RANK() instead of ROW_NUMBER(). The gaps and islands trick depends on every row receiving a strictly sequential integer, with no ties. RANK() assigns the same number to ties, so two rows with the same date both get rank 5, and your date - rank constant collapses islands together that should be separate. Always switch to ROW_NUMBER() and resolve ties explicitly in the ORDER BY with a tiebreaker like ORDER BY active_date, event_id.

Another trap is forgetting PARTITION BY on the window function. Without it, the row numbers run across all users in one global sequence, which makes the date-minus-rank constant meaningless and merges every user's activity into one chaotic island. The error is silent — the query returns rows, just nonsense rows. Always partition by whatever entity owns the streak (user_id, account_id, device_id), and double-check with a small WHERE user_id IN (1, 2, 3) filter before trusting aggregate output.

Date arithmetic mismatches catch people too. date - integer returns a date in Postgres but errors in BigQuery, where you need DATE_SUB(date, INTERVAL n DAY). The date - INTERVAL '1 day' * n form is safer cross-engine, though Snowflake prefers DATEADD. Always be explicit about the type — convert to a date at the end of the expression if the engine returns a timestamp.

A subtler trap shows up at the edges. LAG() returns null for the first row, and null > 1 is null, which fails the WHERE filter. That is usually what you want for gap detection. But if you flip the logic to "find rows where the streak resets," handle nulls explicitly with COALESCE(prev_date, '1900-01-01') or OR prev_date IS NULL. Walk through the first and last row by hand whenever you build a gap query.

The gap threshold itself is a judgment call interviewers love to probe. A one-day gap in a daily metric usually means a pipeline blip and should be flagged; a one-hour gap in a session log might be the user closing the app for a phone call. Surface the threshold as a parameter (gap_days > :alert_threshold) and document the business reason for the value you pick.

Optimization tips

The most expensive part of gap analysis is sorting — every window function needs an ordered partition. If your activity table is huge, cluster or sort the underlying storage by (user_id, active_date) so the engine skips the sort step. Snowflake calls this a clustering key, BigQuery calls it clustering plus partitioning, Postgres calls it a covering index with the right column order. A billion-row activity table goes from a multi-minute sort to a streaming scan.

Date spines for missing-date queries should be generated lazily, not materialized. Postgres generate_series and BigQuery GENERATE_DATE_ARRAY produce on the fly and join cheaply against indexed fact tables. If you build a permanent calendar_days table, prune it — most analyses do not need every day from 1970 to 2099.

If you want to drill SQL questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.

FAQ

What is the difference between gaps and islands?

Islands are contiguous stretches where a condition holds — a user active every day, an order sequence with no missing IDs, a status that stays the same. Gaps are the breaks between islands — the dates where activity stopped, the IDs that were skipped. The two are duals; the same query usually produces both, since computing islands implicitly tells you where every gap starts and ends.

Why does date - ROW_NUMBER() produce a constant for consecutive dates?

Because consecutive dates differ by one day and consecutive row numbers differ by one. Subtract them and the difference is constant — say, April 30 for the entire run. Skip a day, and the row number still ticks by one while the date jumps by two, so the difference shifts to a new value, marking the start of a new island.

Should I use ROW_NUMBER or RANK for this?

Always ROW_NUMBER. RANK assigns the same number to ties, which breaks the strictly-increasing assumption that makes the date arithmetic work. If your data has true ties on the ordering column, resolve them in the ORDER BY with a tiebreaker like a primary key. Reach for RANK or DENSE_RANK only when computing leaderboards or top-N analysis, not streaks.

Does this work for non-date columns?

Yes. Anything that increments by a known step — order IDs, invoice numbers, hourly buckets — can be gap-analyzed with the same recipe. The math changes from date - INTERVAL '1 day' * row_number to numeric_id - row_number or hour_bucket - INTERVAL '1 hour' * row_number. The pattern is identical; only the step size shifts.

What is a real use case interviewers expect me to mention?

Streaks at Duolingo or Strava — both companies built campaigns around longest active runs. Pipeline freshness monitoring at any data platform team — alerting on dates a daily job did not produce a row. Subscription status runs at Netflix, Notion, or Stripe — finding the longest paid stretch before churn. Picking one concrete example up front signals you have shipped this in production, not just read about it.