How to calculate D1, D7, D30 Retention in SQL
Contents:
What D1, D7, D30 actually measure
Your PM ducks into the standup with a single question: "What is our D1, D7, and D30 retention for the April signup cohort, and is the trend holding?" These three numbers are the load-bearing health metric for almost every consumer product on the market. D1 is the fraction of new users who return on the next calendar day after signup, D7 the fraction who return on day seven, and D30 the fraction who return on day thirty. Together they describe the early shape of the retention curve: onboarding strength at D1, habit formation at D7, and durable value at D30.
The teams at Meta, Snap, Notion, Duolingo, and DoorDash run on some version of this triple, and it is one of the most common SQL questions in data analyst interviews at Stripe, Airbnb, Linear, Uber, and Netflix — usually phrased as "write the query, then explain what you would investigate if D7 dropped two points week over week". Industry baselines: thirty to fifty percent for D1, fifteen to thirty for D7, and eight to fifteen for D30 on consumer mobile, with games higher and B2B SaaS lower on D7 but higher on D30. Treat these as orientation — your own historical cohort is the only benchmark that matters.
Classic versus rolling retention
Classic retention is strict — a user counts toward D7 if and only if they were active on exactly the seventh day after signup. Day six does not count and day eight does not count. Classic is the format every public benchmark and most board decks use, because it gives a clean comparable number across products.
Rolling retention is generous. A user counts toward D7 if they were active any day in a window from day seven through day thirteen. Rolling D7 is always at least as high as classic D7 — the window covers more chances to be active. Rolling is the right choice for B2B and low-frequency consumer apps where missing exactly day seven is normal behavior, not a churn signal.
A worked example helps. Imagine three users with the same signup date and the following activity log:
Day 0 Day 1 Day 7 Day 30
User A signup open open — classic D1, D7; missed D30
User B signup — open open classic D7, D30; missed D1
User C signup — — — churned immediatelyUnder classic accounting, D1 of this micro-cohort is one out of three, D7 is two out of three, and D30 is one out of three. Under rolling D7 with a seven-to-thirteen-day window, User B counts if they returned anywhere in that range. Always confirm the definition in an interview before you start writing SQL — the same data produces different numbers.
The SQL formula for classic D1
Start with two tables every analytics warehouse has in some shape: a users table with at least user_id and signup_date, and an events table with at least user_id and created_at. The first CTE pins each user to their signup day, also known as day zero. The second collapses raw activity into a single boolean — "did this user fire any event exactly one calendar day after their signup". The final SELECT aggregates the cohort and computes the ratio.
WITH cohort AS (
SELECT
user_id,
signup_date::DATE AS day_0
FROM users
WHERE signup_date BETWEEN '2026-04-01' AND '2026-04-30'
),
returns AS (
SELECT
c.user_id,
EXISTS (
SELECT 1
FROM events e
WHERE e.user_id = c.user_id
AND e.created_at::DATE = c.day_0 + INTERVAL '1 day'
) AS d1_retained
FROM cohort c
)
SELECT
COUNT(*) AS cohort_size,
SUM(CASE WHEN d1_retained THEN 1 ELSE 0 END) AS retained_d1,
ROUND(
SUM(CASE WHEN d1_retained THEN 1.0 ELSE 0 END)
/ NULLIF(COUNT(*), 0),
4
) AS d1_retention
FROM returns;The EXISTS is the right shape because it answers "did the user have at least one qualifying event" without dragging the entire event stream into the aggregate. The NULLIF(COUNT(*), 0) guards against zero-division on an empty cohort window. The ROUND(..., 4) keeps a dashboard readable while preserving enough precision to spot a small week-over-week move.
The day_0 + INTERVAL '1 day' term is the safest cross-database way to add days on Postgres and Snowflake. On BigQuery, switch to DATE_ADD(day_0, INTERVAL 1 DAY). On Redshift, use DATEADD(day, 1, day_0). Interviewers love pinning you to a specific dialect — memorize all three.
Extending to D7 and D30 in one pass
The classic-retention query generalizes to D7 and D30 by repeating the EXISTS block with a different offset. Three columns of booleans, one aggregate per offset, all in a single pass over the cohort.
WITH cohort AS (
SELECT
user_id,
signup_date::DATE AS day_0
FROM users
WHERE signup_date BETWEEN '2026-04-01' AND '2026-04-30'
),
retention AS (
SELECT
c.user_id,
EXISTS (
SELECT 1 FROM events e
WHERE e.user_id = c.user_id
AND e.created_at::DATE = c.day_0 + INTERVAL '1 day'
) AS d1,
EXISTS (
SELECT 1 FROM events e
WHERE e.user_id = c.user_id
AND e.created_at::DATE = c.day_0 + INTERVAL '7 days'
) AS d7,
EXISTS (
SELECT 1 FROM events e
WHERE e.user_id = c.user_id
AND e.created_at::DATE = c.day_0 + INTERVAL '30 days'
) AS d30
FROM cohort c
)
SELECT
COUNT(*) AS cohort_size,
ROUND(SUM(CASE WHEN d1 THEN 1.0 ELSE 0 END) / NULLIF(COUNT(*), 0), 4) AS d1_retention,
ROUND(SUM(CASE WHEN d7 THEN 1.0 ELSE 0 END) / NULLIF(COUNT(*), 0), 4) AS d7_retention,
ROUND(SUM(CASE WHEN d30 THEN 1.0 ELSE 0 END) / NULLIF(COUNT(*), 0), 4) AS d30_retention
FROM retention;Three subqueries against the same events table is fine on small to medium event volumes. On a billion-row log, the planner will scan the events table three times and the query gets ugly. The faster shape is one scan and one join — tag each event with the offset it belongs to using MAX(CASE WHEN e.created_at::DATE = c.day_0 + INTERVAL '7 days' THEN 1 ELSE 0 END), and constrain the join with e.created_at::DATE BETWEEN c.day_0 + INTERVAL '1 day' AND c.day_0 + INTERVAL '30 days'. The BETWEEN predicate lets the planner use a partition or index on created_at and skip irrelevant ranges entirely.
Rolling retention SQL
Rolling retention swaps the equality on a single day for a range. The condensed version below computes both classic and rolling D7 side by side, which is the shape most interviewers want to see — they will ask "okay, now show me the rolling version" right after the classic one, and a single query saves you fifteen seconds of cognitive switching.
WITH cohort AS (
SELECT user_id, signup_date::DATE AS day_0
FROM users
WHERE signup_date BETWEEN '2026-04-01' AND '2026-04-30'
),
flags AS (
SELECT
c.user_id,
MAX(CASE
WHEN e.created_at::DATE = c.day_0 + INTERVAL '7 days' THEN 1 ELSE 0
END) AS d7_classic,
MAX(CASE
WHEN e.created_at::DATE BETWEEN c.day_0 + INTERVAL '7 days'
AND c.day_0 + INTERVAL '13 days' THEN 1 ELSE 0
END) AS d7_rolling
FROM cohort c
LEFT JOIN events e
ON e.user_id = c.user_id
AND e.created_at::DATE BETWEEN c.day_0 + INTERVAL '7 days'
AND c.day_0 + INTERVAL '13 days'
GROUP BY c.user_id
)
SELECT
COUNT(*) AS cohort_size,
ROUND(SUM(d7_classic)::NUMERIC / NULLIF(COUNT(*), 0), 4) AS d7_classic,
ROUND(SUM(d7_rolling)::NUMERIC / NULLIF(COUNT(*), 0), 4) AS d7_rolling
FROM flags;The rolling number will always be greater than or equal to the classic number on the same cohort. The interesting signal is the ratio between them — if rolling D7 is forty percent and classic D7 is fifteen percent, your users are clearly using the product weekly, just not on the calendar day matched to signup. That points to a re-engagement opportunity in product, not a retention failure.
Common pitfalls
The first pitfall is the time-zone smash on day boundaries. A user signs up at twenty-three thirty UTC and opens the app at zero thirty UTC the next calendar day — strictly speaking they returned thirty minutes later, but your D1 query will count it as a D1. The fix is to convert both signup_date and created_at to the same zone before truncating to day. Pick one rule and document it; mixing the two produces silent bias.
The second pitfall is reporting on immature cohorts. A user who signed up three days ago cannot possibly be D7-retained yet, but a careless query will divide D7-retained users by the full cohort including those three-day-olds. Filter the cohort to signup dates at least N days in the past for a Dn calculation, or render the D7 and D30 cells as em-dashes for immature rows.
The third pitfall is using DAU as the proxy for "retained". If your DAU pipeline counts silent push deliveries, lock-screen impressions, or background sync as activity, your retention numbers are inflated and meaningless. Define an active event as a deliberate user action — session start tied to a foreground app open, content view, or business-meaningful click — and join against that, not the raw firehose.
The fourth pitfall is mixing classic and rolling without naming them. Two teams report D7 retention, one is rolling and one is classic, the numbers do not match, and a whole offsite is spent debugging the dashboard. The fix is etiquette, not SQL — label every chart with the flavor and window, and in interviews confirm which one the interviewer means before writing the first CTE.
The fifth pitfall is collapsing cohorts with different acquisition mixes. A cohort dominated by paid Instagram traffic and a cohort dominated by organic search are not directly comparable at the same Dn — the channels select for different intent. If retention dropped two points week over week, segment by acquisition channel before assuming the product broke. The drop is usually a budget shift, not an onboarding regression.
Optimization tips
The first lever is partitioning the events table on created_at. On Snowflake and BigQuery this is native clustering or partitioning DDL; on Postgres it is monthly partitions with pg_partman. A D30 query then scans roughly thirty-one days of partitions instead of the full history — on a billion-row stream, the difference is seconds versus minutes.
The second lever is pre-aggregating a daily user activity rollup, shaped as (user_id, active_date). The retention queries join cohort against this rollup instead of the raw event stream, which compresses storage by an order of magnitude and turns the EXISTS checks into single-key lookups. This is the standard pattern at Databricks and Snowflake shops shipping retention dashboards without melting the warehouse.
The third lever is materializing the cohort summary itself. If twenty stakeholders refresh the dashboard each morning, run the rollup once on a nightly schedule and store the result. Confirm with consumers that overnight cadence is acceptable — regressions caught at nine a.m. are cheaper than the ones found in the Friday review.
Related reading
- How to calculate Cohort Retention in SQL
- How to calculate Active Cohort in SQL
- How to calculate Churn in SQL
- How to calculate MAU in SQL
- How to calculate Active Days in SQL
- How to find the Aha Moment in SQL
- SQL window functions interview questions
If you want to drill retention SQL questions like this daily, NAILDD is launching with 500+ SQL problems on exactly this pattern.
FAQ
Is D7 retention of thirty percent good?
It depends entirely on category and acquisition mix. For consumer social and messaging apps, thirty percent at D7 is below normal — leaders sit at fifty percent and up. For B2B SaaS with a weekly cadence, thirty percent is solid. For a product used once a month by design, thirty percent at D7 is overperforming. Always frame the answer against your own historical cohorts; category baselines drift by an order of magnitude.
Which is more important — D1 or D7?
D1 measures onboarding mechanics and the first session. D7 measures whether the product earned a second visit when motivation is no longer fresh, which is the closest proxy for habit formation. If forced to pick one, prioritize D7 — a great D1 with a weak D7 means onboarding is sticky but the product does not pull users back, which is the harder problem to solve.
Should the cohort be a single day or a range?
Single-day cohorts are noisy on small products — one acquisition spike skews the entire curve. A weekly or monthly cohort window smooths the noise without hiding genuine signal. Single-day cohorts only stabilize at the scale of a Notion or a Duolingo, where ten thousand signups per day is the floor.
How do I handle users who return on D1 but never again?
Include them in D1 and exclude them from D7 and D30 — the query does this automatically because each retention column is independently computed against the offset. The interesting follow-up is the conversion from D1-retained to D7-retained, which isolates the second-session funnel. If forty percent of D1-retained users do not make it to D7, the problem lives between session one and session seven, not in the first session itself.
What is the difference between retention and engagement?
Retention is binary at the offset: was the user active on day N, yes or no. Engagement is intensity inside the active period — how many sessions, how many minutes, how many events. A user can be retained for thirty consecutive days but engagement-thin, one push-driven session per day with no real activity, and the retention number alone will not flag it. Pair retention with sessions per active day or active minutes per day when the goal is product depth, not just survival.