SQL for cohort retention
Contents:
- What cohort retention is and why it shows up in every product interview
- The base pattern: two CTEs in long format
- Retention as a percentage
- Classic vs rolling vs unbounded retention
- Retention matrix in a single query
- Retention with segmentation
- Common pitfalls
- Advanced patterns
- Pandas for visualization
- Related reading
- FAQ
What cohort retention is and why it shows up in every product interview
Cohort retention is the most-asked product analytics question in SQL screens at Meta, DoorDash, Airbnb, and Notion. The interviewer hands you an events table, asks for weekly retention by signup cohort, and the way you structure the query in ninety seconds tells them whether you have shipped this metric or just read about it.
The metric is simple. Group users by when they first showed up, then for each cohort count what fraction was still active N days later. A cohort of 1,000 users who signed up April 1, of whom 350 came back during the week starting seven days later, gives D7 retention of 35 percent. Stack that across every cohort and offset and the matrix shows whether the last release helped and whether paid decays faster than organic.
The question forces four things in one query: an event-derived dimension, a join of the event stream against a sized base, normalization without double counting, and a decision on what "active" means. The dataset assumption is an events table with user_id and event_time, plus a users table with registered_at and acquisition_channel. Dialect is Postgres; everything ports to BigQuery and Snowflake with a one-line DATE_TRUNC tweak.
The base pattern: two CTEs in long format
The canonical structure is two CTEs and a final aggregation. The first CTE assigns every user to a cohort. The second joins the event stream back so each row carries the cohort label and the week of activity. The final SELECT counts distinct users per cohort and offset.
WITH cohort AS (
SELECT
user_id,
DATE_TRUNC('week', MIN(event_time))::DATE AS cohort_week
FROM events
GROUP BY user_id
),
activity AS (
SELECT
c.cohort_week,
c.user_id,
DATE_TRUNC('week', e.event_time)::DATE AS active_week
FROM cohort c
JOIN events e USING (user_id)
)
SELECT
cohort_week,
EXTRACT(WEEK FROM active_week - cohort_week)::int AS weeks_since_signup,
COUNT(DISTINCT user_id) AS active_users
FROM activity
GROUP BY cohort_week, weeks_since_signup
ORDER BY cohort_week, weeks_since_signup;The output is long format: one row per cohort per offset. Long format stays cheap on wide retention windows and pivots cleanly in pandas or any BI tool. Resist pivoting in SQL unless your downstream layer cannot — SQL pivots get ugly fast past eight offsets.
Two details interviewers probe. The cohort definition uses MIN(event_time) rather than a registration timestamp; that survives a botched signup pipeline, but prefer users.registered_at if reliable. The offset uses week subtraction inside EXTRACT; in BigQuery write DATE_DIFF(active_week, cohort_week, WEEK) — that unit conversion is the most common dialect gotcha.
Retention as a percentage
Absolute counts are useless for cross-cohort comparison. A cohort of 50,000 with 12,500 active at W4 looks bigger than a cohort of 1,000 with 600 active, even though the second cohort retained twice as well. Always normalize by cohort size — a third CTE holds the denominator.
WITH cohort AS (
SELECT user_id, DATE_TRUNC('week', MIN(event_time))::DATE AS cohort_week
FROM events GROUP BY user_id
),
cohort_size AS (
SELECT cohort_week, COUNT(*) AS size
FROM cohort
GROUP BY cohort_week
),
activity AS (
SELECT
c.cohort_week,
c.user_id,
EXTRACT(WEEK FROM DATE_TRUNC('week', e.event_time)::DATE - c.cohort_week)::int AS weeks_since
FROM cohort c
JOIN events e USING (user_id)
)
SELECT
a.cohort_week,
a.weeks_since,
COUNT(DISTINCT a.user_id) AS active_users,
cs.size AS cohort_size,
ROUND(COUNT(DISTINCT a.user_id) * 100.0 / cs.size, 1) AS retention_pct
FROM activity a
JOIN cohort_size cs USING (cohort_week)
GROUP BY a.cohort_week, a.weeks_since, cs.size
ORDER BY a.cohort_week, a.weeks_since;Week 0 always reads as 100 percent because everyone in the cohort was active during the week they joined. That is a useful sanity check — if week 0 drops below 100 percent, the join key is wrong or the cohort and activity definitions disagree. The 100.0 literal forces floating-point division in Postgres; without the .0 you get integer-divided zeros.
Classic vs rolling vs unbounded retention
This is where most candidates lose points. Three competing definitions of retention produce wildly different numbers on the same data.
Classic N-day retention counts a user as retained on day N only if active on that exact day. Strict, and the only definition where a user active on day 6 and day 8 is missing from day 7 retention.
-- D7 classic retention
SELECT
COUNT(DISTINCT a.user_id) * 100.0 / (SELECT COUNT(*) FROM cohort) AS d7_classic
FROM cohort c
JOIN activity a ON a.user_id = c.user_id
AND a.active_day = c.cohort_day + INTERVAL '7 day';Rolling retention counts a user as retained on day N if active on day N or after. Strictly dominates classic. Most consumer apps report this because it captures users who came back on day 9 but missed day 7.
-- D7 rolling retention
SELECT
COUNT(DISTINCT a.user_id) * 100.0 / (SELECT COUNT(*) FROM cohort) AS d7_rolling
FROM cohort c
JOIN activity a ON a.user_id = c.user_id
AND a.active_day >= c.cohort_day + INTERVAL '7 day';Unbounded retention is the loosest: the user came back at least once after signup. Useful as a binary "did this user ever return" flag, useless as a curve.
Same dataset, classic D7 reads 20 percent, rolling 35, unbounded 45. The numbers answer different questions. Pin down the definition before publishing and label it on every dashboard.
Retention matrix in a single query
For wide format in a single query — the shape you paste into a deck — use boolean aggregates over windowed time ranges. Each BOOL_OR collapses a user's activity in a given week into a single true/false.
WITH cohort AS (
SELECT user_id, DATE_TRUNC('week', MIN(event_time))::DATE AS cohort_week
FROM events GROUP BY user_id
),
retention AS (
SELECT
c.cohort_week,
c.user_id,
BOOL_OR(e.event_time >= c.cohort_week + INTERVAL '1 week'
AND e.event_time < c.cohort_week + INTERVAL '2 week') AS w1,
BOOL_OR(e.event_time >= c.cohort_week + INTERVAL '2 week'
AND e.event_time < c.cohort_week + INTERVAL '3 week') AS w2,
BOOL_OR(e.event_time >= c.cohort_week + INTERVAL '3 week'
AND e.event_time < c.cohort_week + INTERVAL '4 week') AS w3,
BOOL_OR(e.event_time >= c.cohort_week + INTERVAL '4 week'
AND e.event_time < c.cohort_week + INTERVAL '5 week') AS w4
FROM cohort c
LEFT JOIN events e USING (user_id)
GROUP BY c.cohort_week, c.user_id
)
SELECT
cohort_week,
COUNT(*) AS cohort_size,
SUM(CASE WHEN w1 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS w1_retention,
SUM(CASE WHEN w2 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS w2_retention,
SUM(CASE WHEN w3 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS w3_retention,
SUM(CASE WHEN w4 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS w4_retention
FROM retention
GROUP BY cohort_week
ORDER BY cohort_week;Rows are cohorts, columns are W1 through W4 retention. The downside is explicit column expansion — every new offset means another BOOL_OR and another SUM(CASE WHEN ...) pair. Past eight columns the query becomes unwieldy and you should keep long format in SQL, pivoting in pandas.
The LEFT JOIN events matters. With an inner join, cohorts where some users never produced a follow-up event vanish entirely — the denominator silently shrinks and retention reads artificially high.
Retention with segmentation
Real teams never look at one retention curve. They look at one per acquisition channel, platform, country, or pricing tier. Fold the segment column into the cohort CTE so every downstream join carries it along.
WITH cohort AS (
SELECT
u.user_id,
u.acquisition_channel,
DATE_TRUNC('week', u.registered_at)::DATE AS cohort_week
FROM users u
)
SELECT
c.cohort_week,
c.acquisition_channel,
COUNT(DISTINCT c.user_id) AS cohort_size,
COUNT(DISTINCT CASE
WHEN a.active_date >= c.cohort_week + INTERVAL '7 day'
AND a.active_date < c.cohort_week + INTERVAL '14 day'
THEN a.user_id END
) AS d7_active
FROM cohort c
LEFT JOIN activity a USING (user_id)
GROUP BY c.cohort_week, c.acquisition_channel
ORDER BY c.cohort_week, d7_active DESC;Now you can compare retention across paid social, paid search, organic, and referral in one table. Organic and referral typically retain meaningfully better than paid, sometimes by a factor of two — which is why blended LTV/CAC misleads. When segment cardinality is high, bucket with CASE WHEN acquisition_channel IN ('google_ads', 'facebook_ads') THEN 'paid' ELSE 'organic' END so samples stay statistically meaningful.
Common pitfalls
The newest cohort never has the full retention window. A cohort that signed up four days ago cannot have W4 retention because W4 has not happened yet. The naive matrix query shows 0 or NULL in the right-hand columns and your stakeholder emails asking why retention "collapsed". Filter cohorts by age in the final SELECT — WHERE cohort_week <= CURRENT_DATE - INTERVAL '4 week' for a 4-week view — or mark immature cells as N/A. Reporting 0 percent for a cohort that has not had time to retain is statistically wrong and politically expensive.
Timezone mismatches silently corrupt the cohort boundary. If event timestamps are in UTC and your business runs on Pacific time, a user signing up at 11:30 PM Pacific lands in tomorrow's UTC cohort, and their D7 anniversary sits a day off from what the product team expects. Convert to business time once, at the bottom of the stack, with AT TIME ZONE 'America/Los_Angeles' — never mix raw UTC and converted timestamps in the same query.
Activity does not equal value. Retention against the "session start" event can read 50 percent at D7 while retention against "made a purchase" reads 8 percent. Pick the retention event that gates revenue or core value — a transaction for marketplaces, a message sent for messaging apps, a workspace edit for productivity tools — not the vanity session event.
Distinct count without partitioning melts the warehouse. On a 50-million row events table, COUNT(DISTINCT user_id) can scan the whole table per cohort. Cluster by event_time on Snowflake, BRIN index on Postgres, partition by DATE(event_time) on BigQuery — otherwise this query goes from minutes to hours as volume grows.
Advanced patterns
Weekly versus monthly cohorts matters more than people admit. For a young consumer app, weekly smooths daily noise without hiding the curve. For B2B SaaS with annual contracts, weekly is pure noise and monthly is the only grain where signal beats variance. Pick a grain that gives 200 to 500 members per cohort.
Rolling cohorts smooth shock events: instead of "the April 1 cohort", define "users who signed up between April 1 and April 7" and average via AVG(d7_retention) OVER (ORDER BY cohort_day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW).
Pre- versus post-release comparison is the highest-value use of cohort retention. Split cohorts at the release date, plot curves separately, look for divergence in the first three weeks. Overlap means the release did not move the metric and that line should die in the roadmap deck.
Pandas for visualization
SQL gives long format. Pandas turns it into a heatmap.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_sql("""
WITH cohort AS (...)
SELECT cohort_week, weeks_since, retention_pct
FROM ...
""", conn)
cohort_matrix = df.pivot(
index='cohort_week',
columns='weeks_since',
values='retention_pct'
)
plt.figure(figsize=(14, 8))
sns.heatmap(cohort_matrix, annot=True, fmt='.0f', cmap='YlGnBu')
plt.title('Retention cohort heatmap')
plt.xlabel('Weeks since signup')
plt.ylabel('Cohort week')
plt.show()The heatmap is the canonical visualization. Rows are cohorts, columns are weeks since signup, color encodes retention. The N/A triangle in the top right is the immature-cohort artifact — leave those blank rather than zero so the eye is not drawn to a fake collapse.
If you want to drill SQL questions exactly like this one, NAILDD is launching with 500+ SQL problems mapped to real interview loops.
Related reading
- How to calculate cohort retention in SQL
- How to calculate cohort decay in SQL
- How to calculate D1 D7 D30 retention in SQL
- SQL window functions interview questions
FAQ
Weekly or monthly cohorts?
Weekly works for young consumer products under a year old, where lifecycles are days and signup volume is high enough for each weekly cohort to be statistically meaningful. Monthly fits mature B2B SaaS where the behavior cycle is months. Daily makes sense only for high-frequency products like mobile games.
How do I handle users with multiple "first events" of different types?
It depends on the business question. If you are evaluating onboarding, cohort by signup; if monetization, by first purchase; if the aha moment, by the first action that defines value. Document the choice in a dashboard caption so the next analyst inherits the convention rather than re-litigating it.
What about bots and synthetic traffic?
Filter inside the cohort CTE, not at the activity stage. Most event tables have an is_bot flag or a user_type column that distinguishes real users from internal QA and scrapers. Filtering at the cohort stage keeps numerator and denominator consistent — filter only on activity and the denominator stays inflated, so retention reads artificially low.
Retention dropped — is that always bad?
Not always, and this is the most useful nuance to internalize before an interview. If you are scaling paid acquisition into lower-intent traffic, new cohorts retain worse than old cohorts even when nothing about the product changed. The product can be healthier in absolute terms — more active users, more revenue — while headline retention falls. Pair every retention chart with absolute counts, and segment by channel before concluding the product regressed.
Distinct users or active events in the numerator?
Distinct users, almost always. Active events double-count engaged users and inflate retention in ways that mask churn — a single power user opening the app twenty times wipes out the signal from twenty dormant users. The exception is reporting engagement intensity rather than retention; call that "sessions per cohort", not retention.