Grace period recovery in SQL

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Why grace period recovery matters

A grace period is the window — usually three to thirty days — after a failed payment when a subscription is still considered active. During that window, the billing system retries the charge, dunning emails go out, and the customer sees a "your card was declined" banner inside the product. Grace period recovery is the share of those failing customers who pay before the window closes. At a healthy consumer SaaS shop, smart dunning recovers between forty and sixty percent of failed payments, and at the top of the market that figure crosses eighty. The gap between a passive retry schedule and a tuned one is measured in real eight-figure ARR for any company past Series B.

Picture the scene a senior data analyst hits in their first week at a subscription company. The growth PM pings them on Slack at 4pm on a Thursday: "Finance closes the month tomorrow. They are arguing with payments about how much of last month's involuntary churn was recovered inside grace. Can you get me a number by 9am?" This post is the SQL recipe you run for that ask, plus the variants the PM will follow up with — time to recovery, breakdown by failure reason, weekly trend — without a second round of clarifying questions. It is also a frequent loop interview question at Stripe, Shopify, and Notion.

The metric is sometimes called involuntary churn defense, dunning recovery, or simply payment recovery. The names differ, the math is the same: failed-payment customers who came back to good standing inside a defined window. The trick is that every join, every window, and every deduplication step has a way to overcount or undercount.

Data model assumptions

The minimum table you need is a payment event log with one row per attempt and an outcome flag. Different stacks model this differently, but the canonical shape from a Stripe webhook ingestion or a Snowflake event table looks roughly like the rows below.

user_id | event_type      | event_date  | failure_reason
42      | payment_failed  | 2026-04-01  | insufficient_funds
42      | payment_retry   | 2026-04-03  | NULL
42      | payment_success | 2026-04-03  | NULL
57      | payment_failed  | 2026-04-02  | expired_card
57      | grace_expired   | 2026-04-16  | NULL

Most production systems also surface a separate subscription state table. For grace period recovery you can usually get away with the event log alone, as long as the event types are consistent and you have a failure_reason column hanging off the failed rows.

The SQL formula

The core query has two CTEs. The first picks each customer's first failed payment in the analysis window. The second joins forward to find any successful payment that landed inside the grace window. The final SELECT divides the two counts.

WITH failed_payments AS (
    SELECT
        user_id,
        MIN(event_date) AS first_failure
    FROM payment_events
    WHERE event_type = 'payment_failed'
      AND event_date >= CURRENT_DATE - INTERVAL '90 days'
    GROUP BY user_id
),
recovered AS (
    SELECT
        f.user_id,
        MIN(p.event_date) AS recovery_date
    FROM failed_payments f
    JOIN payment_events p
      ON p.user_id = f.user_id
     AND p.event_type = 'payment_success'
     AND p.event_date BETWEEN f.first_failure
                          AND f.first_failure + INTERVAL '14 days'
    GROUP BY f.user_id
)
SELECT
    (SELECT COUNT(*) FROM failed_payments)             AS failed,
    (SELECT COUNT(*) FROM recovered)                   AS recovered,
    (SELECT COUNT(*) FROM recovered)::NUMERIC * 100
        / NULLIF((SELECT COUNT(*) FROM failed_payments), 0)
                                                       AS recovery_rate_pct;

The INTERVAL '14 days' literal is the only product-specific knob in this query. Set it to whatever your billing system actually allows before it expires the subscription. If you parameterize the dashboard, expose this as a dropdown — finance will want to slice on it.

Notice the MIN(event_date) in failed_payments. Customers who fail multiple times inside the analysis window are still counted once, anchored on the first failure. That is the right grain for "what share of failures got recovered" — every subsequent failure inside the same dunning episode is already part of the same recovery attempt. If you skip the MIN and group on every failed row, your denominator inflates by the average number of retries per episode and the recovery rate looks much lower than it really is. This is the number-one bug in dashboards built by analysts who have not lived through a dunning cycle.

Time to recovery

The next question after "what is the recovery rate" is always "how fast." This matters because dunning email cadence, retry schedules, and in-product banners all have to be designed around the recovery distribution, not the average.

WITH events AS (
    SELECT
        user_id,
        MIN(event_date) AS first_failure
    FROM payment_events
    WHERE event_type = 'payment_failed'
    GROUP BY user_id
),
recovery_time AS (
    SELECT
        f.user_id,
        EXTRACT(EPOCH FROM (
            MIN(p.event_date) - f.first_failure
        )) / 86400 AS days_to_recovery
    FROM events f
    JOIN payment_events p
      ON p.user_id = f.user_id
     AND p.event_type = 'payment_success'
     AND p.event_date > f.first_failure
    GROUP BY f.user_id, f.first_failure
)
SELECT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY days_to_recovery) AS median_days,
    AVG(days_to_recovery)                                         AS avg_days,
    COUNT(*) FILTER (WHERE days_to_recovery <= 1)                 AS recovered_in_1_day,
    COUNT(*) FILTER (WHERE days_to_recovery <= 3)                 AS recovered_in_3_days,
    COUNT(*) FILTER (WHERE days_to_recovery <= 7)                 AS recovered_in_7_days
FROM recovery_time;

In practice the distribution is heavily front-loaded. Across most consumer subscriptions the bulk of recoveries happen inside the first seventy-two hours, when automatic retries fire on a 1-3-5 day schedule. After day seven the daily recovery rate collapses toward zero — anyone who has not paid by then almost never pays at all, even with extended grace.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Segmenting by failure reason

The reason a charge fails dictates how recoverable it is. Insufficient funds clears the moment payday lands, so those failures recover at a very high rate. Expired cards require a new card on file and recover only when the dunning flow surfaces a clean "update card" CTA. Lost or stolen cards almost never recover inside grace because the original card is dead.

SELECT
    pf.failure_reason,
    COUNT(DISTINCT pf.user_id)                      AS failures,
    COUNT(DISTINCT ps.user_id)                      AS recoveries,
    COUNT(DISTINCT ps.user_id) * 100.0
        / NULLIF(COUNT(DISTINCT pf.user_id), 0)     AS recovery_rate_pct
FROM payment_failures pf
LEFT JOIN payment_successes ps
       ON ps.user_id = pf.user_id
      AND ps.event_date BETWEEN pf.event_date
                            AND pf.event_date + INTERVAL '14 days'
GROUP BY pf.failure_reason
ORDER BY recovery_rate_pct DESC;

The output of this query is what unblocks product investment. If expired_card recovers at twenty percent and insufficient_funds at sixty-five percent, you do not invest more in retry scheduling — you invest in a card-update flow with magic links and Apple Pay token refresh. If both lines are low, your dunning emails are landing in spam. The metric is the same number either way; the action it implies depends entirely on this segmentation.

Common pitfalls

When teams calculate grace period recovery for the first time, the most common mistake is to count any successful payment after a failure as a recovery. Counting a payment that landed sixty days after the original failure is double-counting — that customer was already churned and then resubscribed, which is a separate funnel. The fix is to bound the join with the actual grace window your billing system enforces. If you do not know the window, ask payments for the retry schedule and use the last retry date plus 24 hours.

Another trap is treating multiple failures from the same dunning episode as independent events. One customer with a card that failed three times in a row should appear once in the denominator, anchored on the first failure. If you group on every failed row you will see a recovery rate that looks shockingly low and finance will spend two days reconciling against the payment processor dashboard before someone realizes the denominator is wrong. Always anchor on MIN(event_date) per user per episode.

A subtler pitfall is conflating voluntary cancellations with grace-period expiries. A customer who clicks "cancel subscription" inside the app and then lets the grace period lapse is not the same as a customer whose card simply went silent. Voluntary cancels should be excluded from both the numerator and the denominator of recovery rate — they were never going to recover regardless of dunning quality. Pull the cancel reason from the subscription state table and filter them out before the join.

Pre-dunning email programs also break the naive query if you ignore them. A growing number of teams send a "your card is about to expire" reminder three to seven days before the next charge. If a customer updates their card based on that reminder, you will see a successful payment with no preceding failure — that is great for revenue but it does not belong in this metric. Make sure your payment_failed filter only captures actual decline events, not pre-emptive nudges.

Finally, do not forget to segment by payment provider. Stripe, PayPal, Apple Pay in-app purchase, and Google Play all have their own retry logic and their own recovery curves. A blended recovery rate masks the fact that your Apple IAP cohort is recovering at fifteen percent while your Stripe cohort is at sixty. The product fix in each case is completely different, and a single rolled-up number will lead leadership to invest in the wrong place.

Optimization tips

Payment event tables grow fast — a million customers churn out tens of millions of payment events per quarter. Three changes turn this from a five-minute query into a sub-second one.

First, range-partition the event table on event_date. Recovery analysis almost always asks for the trailing thirty or ninety days, so partition pruning gives you a one-order-of-magnitude speedup for free in Snowflake, BigQuery, and Databricks alike. Second, create a clustered or sort key on (user_id, event_date). The join between failures and successes is a textbook nested-loop on those two columns and clustering makes the join keys colocated. Third, if you serve this query from a daily dashboard, materialize the failed_payments and recovered CTEs as nightly tables. The full query then collapses to two count-stars plus a divide, which runs instantly even on years of history.

For ad-hoc analysis on top of the materialized layer, expose a parameterized view that takes the grace window length as an argument. Finance, payments, and product will each want a different window — give them a knob instead of three near-identical queries.

If you want to drill SQL questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.

FAQ

What counts as a "good" grace period recovery rate?

For consumer SaaS without any retry tuning, the floor is roughly thirty percent — that is the share that recovers on automatic retries alone. With a tuned dunning sequence (smart retry timing, branded emails, in-product banners, magic-link card update), most product-led companies land between fifty and seventy. Above eighty is rare and usually involves a card-tokenization partnership with the major networks plus a dedicated dunning vendor.

How long should the grace period be?

The answer is set by the unit economics, not by best practice. For a $10/month consumer subscription, fourteen days of grace is conventional — beyond that the cost of serving a non-paying customer eats the recovery upside. For enterprise contracts with annual billing and net-30 terms, thirty days is the floor and sixty is common. The key is that the SQL window should match whatever your billing system actually enforces, because customers who recover after the technical grace expiry are by definition resubscriptions.

How much of recovery happens on day one?

In consumer subscriptions, between forty and sixty percent of all recoveries land inside the first twenty-four hours, almost entirely from automatic retries the payment processor fires when the issuer rejects the first attempt for a soft reason like insufficient funds. This is also why a poorly configured retry schedule (one retry, then nothing) leaves a huge amount of money on the table compared to a smart schedule that retries on paydays and after holiday weekends.

Should we count email retries the same as automatic retries?

For the headline recovery rate, no — recovery is recovery, regardless of which channel triggered the payment. For attribution and product investment, yes — split the recovery cohort by which surface was last touched before the successful charge (automatic retry, email link, in-app banner, push notification). That split is what tells you which channel deserves more investment, and it is the natural follow-up question after the recovery rate dashboard is live.

What should not count as recovery?

Voluntary cancellations, chargebacks, and expired-card situations where the customer never returned with a new card on file. All three are different failure modes from "the card declined, we retried, the customer paid." Lumping them together blurs the signal and pushes the team toward the wrong fix — usually more dunning emails, when the real problem is the card-update flow or a UX gap on the cancel screen.

How often should the dashboard refresh?

Daily is fine for trend monitoring. Finance only needs monthly close numbers. Payments engineering, however, may want near-real-time during a dunning experiment, because a regression in recovery rate is a five-figure-per-day revenue hit and waiting twenty-four hours to spot it is unacceptable. Build the dashboard with both cadences in mind: daily aggregates for everyone, raw event stream for the payments on-call.