Crash rate in SQL

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

What is crash rate and why it matters

It is Friday afternoon and a PM at DoorDash pings you on Slack. The 4.18.0 mobile release went out at 10% rollout last night and the support queue is loud. "Is the new build crashing more than 4.17.2? Can we ship to 50% over the weekend or do we need a hotfix?" That is the kind of question crash rate is designed to answer, and it is the kind of question mobile analysts at Uber, Airbnb, and Linear get asked every week.

Crash rate is the share of sessions or users that experienced an unhandled exception in a given window. It is a leading indicator for store rating, reinstall rate, and short-term revenue. Apple and Google both surface high-crash builds and throttle promotion of apps whose crash rate spikes above industry norms. A consumer app that drifts above 1 percent crashed sessions for a week will see its rating slide; above 2 percent and it will start losing daily actives faster than acquisition can refill the funnel.

The metric matters because it is the cheapest objective signal you can give engineering. Bug reports are noisy and biased toward articulate users. A clean crash rate query, grouped by version and OS, points engineers at the specific build, device class, and signature that needs a fix. The rest of this post walks through the SQL you actually need: per-session crash rate, crash-free users, slicing by device and version, top signatures, and the recovery cohort. Then we cover the traps that catch first-time mobile analysts.

The SQL formula

Two definitions show up in production. Per-session crash rate is crashed sessions divided by total sessions. Crash-free users is users with zero crashes in the window divided by total users. Sessions tells you how stable the app is per use; users tells you how many humans had a bad time at all in the window.

Crash rate (per session) = crashed_sessions / total_sessions * 100%
Crash-free users (24h)   = users_without_crash / total_users * 100%

For per-session crash rate over 30 days, assuming a fact table app_sessions with a boolean crashed column, the daily query is:

SELECT
    DATE_TRUNC('day', session_time) AS day,
    COUNT(*) AS total_sessions,
    COUNT(*) FILTER (WHERE crashed) AS crashed_sessions,
    COUNT(*) FILTER (WHERE crashed)::NUMERIC * 100 / COUNT(*) AS crash_rate_pct
FROM app_sessions
WHERE session_time >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY 1
ORDER BY 1;

A healthy consumer app sits well under 0.5 percent. A B2B app with a narrower device matrix and more controlled session length usually sits under 0.2 percent. Anything above 1 percent for 48 hours is an incident, not a metric.

Crash-free users

Crash-free users is the metric leadership usually wants on a slide: "how many of our users had a bad day on the app." The computation is a two-step aggregation: collapse sessions to users, then take the share that never crashed.

WITH user_crashes AS (
    SELECT
        user_id,
        BOOL_OR(crashed) AS had_crash
    FROM app_sessions
    WHERE session_time >= CURRENT_DATE - INTERVAL '1 day'
    GROUP BY user_id
)
SELECT
    COUNT(*) AS users,
    COUNT(*) FILTER (WHERE NOT had_crash) AS crash_free_users,
    COUNT(*) FILTER (WHERE NOT had_crash)::NUMERIC * 100 / COUNT(*) AS crash_free_pct
FROM user_crashes;

The industry benchmark from Crashlytics and the Apple App Store Connect dashboard is 99.5 percent crash-free users over a 24-hour window for a stable build. Below 99 percent and the store reviewers and the rating tab will start flagging it. The number drifts down with longer windows because the universe of users grows: a 7-day crash-free-users figure is always lower than a 1-day figure for the same app.

Slicing by device, OS, and app version

A flat crash rate is almost never actionable. What you actually want is the cross of app version times OS version, with enough sample to be trustworthy. The HAVING COUNT(*) >= 100 clause below filters out the long tail of low-volume combinations where two crashed sessions would push the rate to 50 percent.

SELECT
    app_version,
    device_os,
    COUNT(*) AS sessions,
    COUNT(*) FILTER (WHERE crashed) AS crashes,
    COUNT(*) FILTER (WHERE crashed)::NUMERIC * 100 / COUNT(*) AS crash_rate_pct
FROM app_sessions
WHERE session_time >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY app_version, device_os
HAVING COUNT(*) >= 100
ORDER BY crash_rate_pct DESC
LIMIT 30;

A real result might look like: app 4.5.1 on iOS 18.2 at 3.4 percent crash rate over 1,200 sessions, while every other combination sits under 0.4 percent. That row is the hotfix priority. Ship 4.5.2 with the iOS 18.2 codepath patched, and the global rate drops on its own.

Top crash signatures

Once you know which build and OS combination is bad, the next question is which specific bug. Crash SDKs like Crashlytics, Sentry, and Bugsnag emit a crash_signature that hashes the top of the stack trace. Grouping by signature gives you the engineering work list.

SELECT
    crash_type,
    crash_signature,
    COUNT(*) AS occurrences,
    COUNT(DISTINCT user_id) AS affected_users,
    MIN(crash_time) AS first_seen,
    MAX(crash_time) AS last_seen
FROM crashes
WHERE crash_time >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY crash_type, crash_signature
ORDER BY occurrences DESC
LIMIT 20;

A useful rule of thumb is that the top signature accounts for 20 to 40 percent of all crash occurrences in a stable app, and the top five account for over 70 percent. If you have a long flat distribution where the top signature is only 3 percent of crashes, you probably have a deduplication problem in your signature generation, not a uniformly buggy app.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Crash recovery cohort

Crashes only matter operationally if users abandon. The recovery cohort tells you how many users who crashed yesterday opened the app again today. A high recovery rate means crashes are annoying but tolerable; a low recovery rate means the crash is also a churn event.

WITH crash_users AS (
    SELECT DISTINCT user_id
    FROM crashes
    WHERE crash_time::DATE = CURRENT_DATE - 1
),
returned AS (
    SELECT DISTINCT user_id
    FROM app_sessions
    WHERE session_time::DATE = CURRENT_DATE
)
SELECT
    COUNT(c.user_id) AS crashed_yesterday,
    COUNT(r.user_id) AS returned_today,
    COUNT(r.user_id)::NUMERIC * 100 / NULLIF(COUNT(c.user_id), 0) AS return_pct
FROM crash_users c
LEFT JOIN returned r USING (user_id);

For a healthy app, day-after return after a single crash sits around 80 to 90 percent. Below 60 percent and the crash is probably hitting a critical flow such as onboarding, checkout, or first message send, and users are giving up rather than retrying.

Common pitfalls

The trap that catches the most first-time mobile analysts is conflating per-session crash rate with per-user crash rate. A session-weighted number can look beautiful at 0.3 percent while a user-weighted number reveals that 8 percent of daily actives hit a crash at least once, because power users have many sessions and absorb most of the denominator. The fix is to publish both numbers on the same dashboard and use crash-free users for leadership reports and per-session for engineering prioritization.

A related mistake is treating Application Not Responding events on Android as the same thing as a crash. An ANR is the OS killing the app because the main thread was blocked for more than five seconds; the code did not throw an exception, but the user experience is identical. If you bundle ANRs into your crash count you will overstate the metric on Android; if you ignore them you understate user pain. Track them in separate columns with the same denominator of total sessions.

Stack trace deduplication is the third common trap. The same underlying bug can produce dozens of distinct signatures because of obfuscated symbols, inlined functions, or different call paths into the same broken method. If your top crash list looks like a long flat tail rather than a steep curve, you almost certainly have under-deduplicated signatures. Lean on your crash SDK's grouping logic or write a normalization step that strips line numbers and merges signatures by top frame method name.

App version bias is the fourth trap, and it is the one that bites analysts during phased rollouts. A new build at 10 percent rollout has a self-selected sample: staff devices first, then users on faster networks, then aggressive updaters. Comparing that slice to the 100 percent baseline of the previous version is unfair to either side. Wait until rollout is at 50 percent or higher before declaring the new build stable, and segment the baseline to the same OS and device mix when you can.

Finally, background crashes deserve their own category. A crash in a silent push handler or a background fetch is invisible to the user, yet your SDK reports it identically to a foreground crash. A refactor of a background service can then make your crash rate look like an incident when users see nothing. Add a was_foreground boolean to the crashes table and report foreground crash rate as the user-facing number.

Optimization tips

The sessions and crashes tables grow fast. A consumer app with one million daily actives generates ten to twenty million session rows per day, and even a healthy crashes table grows by tens of thousands of rows per day. Partition both by day on the timestamp column. Postgres native partitioning, Snowflake clustering on session date, or BigQuery time-unit partitioning all work. The crash rate query then prunes to a handful of partitions instead of scanning full history.

For the per-version, per-OS cross, build a daily materialized aggregate. A table with one row per (day, app_version, device_os) and counts of sessions, crashes, foreground crashes, and ANRs is small enough to fit in memory and powers every chart on the mobile health dashboard. Refresh hourly, or every fifteen minutes during a rollout window. Raw fact tables stay for ad-hoc forensics.

Use approximate count distinct where you can. COUNT(DISTINCT user_id) on a multi-billion row sessions table is expensive everywhere. Snowflake's APPROX_COUNT_DISTINCT, BigQuery's APPROX_COUNT_DISTINCT, and Postgres HyperLogLog give a within-one-percent answer at a fraction of the cost. For an executive dashboard, that error is invisible.

If you want to drill SQL questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.

FAQ

What is a healthy crash rate for a mobile app?

Rough benchmarks from the Apple App Store Connect dashboard and the Google Play console: below 0.5 percent crashed sessions is excellent, 0.5 to 1 percent is acceptable for most consumer apps, and above 1 percent is a signal to investigate within the day. Above 2 percent the store algorithms start to throttle your app's promotion in search and recommendations, and your rating will slide within a week. B2B apps with narrower device matrices should hold themselves to tighter numbers, usually below 0.2 percent.

Should I report crash-free users or crash-free sessions to leadership?

Crash-free users is what most leadership teams adopt because it maps cleanly to a human experience: what share of our daily actives had a bad moment today. Sessions is the better operational metric for engineering because it scales with app usage and is more sensitive to subtle regressions. Publish both on the same mobile health dashboard, with crash-free users as the headline and crashed sessions as the supporting chart.

How is crash rate different on iOS and Android?

iOS runs on a narrower set of hardware and OS versions, and the platform is aggressive about killing misbehaving processes before they crash. Android runs on thousands of device models with very different memory and CPU profiles, plus ANRs as a separate failure mode. A well-instrumented app will usually report Android crash rate that is one and a half to three times the iOS number, and that gap is normal rather than alarming.

Do I count background crashes the same as foreground crashes?

No. Track them in separate columns and report foreground crash rate as the user-facing number. A background crash inside a silent push handler or a periodic data sync is invisible to the user and will not affect store rating or session abandonment. It still matters to engineering because it usually points at a real bug, but mixing background events into your foreground rate distorts the metric users actually feel.

What should I do when crash rate spikes right after a release?

First, confirm the spike is real by checking both per-session and crash-free users definitions, sliced to the new app version only. Then check the top crash signature for the new version and see whether it concentrates on a specific OS or device class. If the top signature is more than 30 percent of new-version crashes and is OS-specific, you have a hotfix candidate. If the spike is diffuse across many signatures, it is more often a telemetry change or a symbol-mapping issue than a real regression, and you should investigate the build pipeline before rolling back.

How long should the crash window be?

The default window is 24 hours for the headline crash-free users number, and 7 days for the engineering prioritization view by signature and version. A 24-hour window catches a bad rollout within a day; a 7-day window gathers enough volume to make per-version, per-OS slices statistically meaningful. For very small apps with low daily volume, stretch the headline window to 7 days so the number is not dominated by random variance.