How to calculate customer journey in SQL

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

What customer journey is and why it matters

Imagine a Monday morning where the head of growth at a Notion-sized SaaS company asks why upgrades from the free plan have stalled. The funnel dashboard looks fine, paid traffic is steady, signups are flat. But conversions to paid plans dropped from 4.1% to 2.7% in three weeks, and no single funnel step looks responsible. The funnel report breaks the world into fixed slots, and the answer rarely lives inside one slot. The answer lives in the order people visited pages, the loops they ran, and the pages they revisited right before they bounced.

A customer journey is the ordered sequence of touchpoints a user went through before converting or abandoning. In SQL, the standard output is a string like landing -> blog -> pricing -> checkout -> purchase per user, then aggregations on top of those strings. Hiring managers at Stripe, Airbnb, DoorDash, and Linear ask candidates to write this query because it tests three skills at once: window functions for ordering, string aggregation for compression, and pattern matching for segmentation.

The business value is not the path string itself; it is the comparison between paths. The journey landing -> pricing -> checkout converts at one rate; landing -> blog -> blog -> pricing -> support -> checkout converts at a very different rate. Once you have the path per user, you can answer questions a flat funnel cannot: which loop correlates with a refund, which content sequence predicts a trial upgrade, which path lengths are likely bots.

The SQL formula

The base table needs three columns: user_id, event_name, and event_timestamp. A channel column is optional but useful for marketing attribution. One user produces many rows, ordered by time. The job is to fold those rows into one row per user, with the events concatenated in the right order. Here is the worked example on Postgres-flavored SQL. Snowflake, BigQuery, and Databricks support the same shape with minor function renames such as LISTAGG instead of STRING_AGG.

WITH user_events AS (
    SELECT
        user_id,
        event_timestamp,
        event_name,
        ROW_NUMBER() OVER (
            PARTITION BY user_id
            ORDER BY event_timestamp
        ) AS event_rank
    FROM events
    WHERE event_timestamp >= CURRENT_DATE - INTERVAL '30 days'
      AND event_name IN (
          'landing', 'blog', 'pricing', 'signup',
          'trial_start', 'checkout', 'purchase', 'support'
      )
),
converted AS (
    SELECT DISTINCT user_id
    FROM user_events
    WHERE event_name = 'purchase'
),
journey AS (
    SELECT
        ue.user_id,
        STRING_AGG(ue.event_name, ' -> ' ORDER BY ue.event_rank) AS path,
        COUNT(*) AS path_length,
        MAX(ue.event_timestamp) - MIN(ue.event_timestamp) AS journey_duration
    FROM user_events ue
    JOIN converted c ON c.user_id = ue.user_id
    GROUP BY ue.user_id
)
SELECT
    path,
    COUNT(*) AS users,
    AVG(path_length) AS avg_steps,
    AVG(EXTRACT(EPOCH FROM journey_duration) / 3600) AS avg_hours
FROM journey
GROUP BY path
ORDER BY users DESC
LIMIT 25;

The most-traveled paths are usually short and obvious: landing -> pricing -> checkout -> purchase. The interesting rows are below the top five. A common path like landing -> pricing -> blog -> pricing -> checkout tells you that pricing is failing as a first read, sending users to a blog post that re-sells the value, then bringing them back. That is a copy problem on the pricing page, and you would never see it in a funnel report. The filter on event_name IN (...) is doing real work; if you let every click into the path, the strings explode and the segmentation downstream becomes meaningless.

Path length and time to convert

The aggregate path string is one half of the picture; the other half is how long the path was and how long the user took. Path length tells you whether the product is converting fast or grinding. Time-to-convert tells you whether to pay for a one-week conversion window or a one-month conversion window in your ad platform. The query below produces both in one pass and is the second snippet hiring managers usually ask for after the path string.

WITH converted AS (
    SELECT DISTINCT user_id
    FROM events
    WHERE event_name = 'purchase'
      AND event_timestamp >= CURRENT_DATE - INTERVAL '90 days'
),
journey AS (
    SELECT
        e.user_id,
        COUNT(*) AS path_length,
        EXTRACT(EPOCH FROM (MAX(e.event_timestamp) - MIN(e.event_timestamp)))
            / 3600 AS hours_to_convert
    FROM events e
    JOIN converted c ON c.user_id = e.user_id
    WHERE e.event_timestamp >= CURRENT_DATE - INTERVAL '90 days'
    GROUP BY e.user_id
)
SELECT
    path_length,
    COUNT(*) AS users,
    ROUND(AVG(hours_to_convert)::numeric, 1) AS avg_hours,
    ROUND(
        PERCENTILE_CONT(0.5) WITHIN GROUP (
            ORDER BY hours_to_convert
        )::numeric,
        1
    ) AS median_hours,
    ROUND(
        PERCENTILE_CONT(0.9) WITHIN GROUP (
            ORDER BY hours_to_convert
        )::numeric,
        1
    ) AS p90_hours
FROM journey
WHERE path_length BETWEEN 2 AND 50
GROUP BY path_length
ORDER BY path_length;

A healthy product usually shows a hump at path length four to eight, with median time-to-convert under an hour for short paths and rising sharply after step ten. A path length above thirty almost always means a tracking bug, a server-side replay, or a bot. The BETWEEN 2 AND 50 filter cuts the noise without trimming the long tail of genuinely-slow buyers; skip it and a few users with five thousand events each drag the averages into garbage. The percentile columns matter more than the average in any product where most users convert quickly and a few take forever. Reporting only the mean is the fastest way to lose credibility with a product manager who knows the shape of the distribution.

Segmentation by journey type

Once the path string per user exists, you can bucket users into named journey archetypes with simple LIKE patterns. This is the third query interviewers ask for and the one that drives the most product decisions, because product managers think in archetypes, not in raw paths. The pattern below buckets every user into one of five named journeys and counts them.

WITH journey AS (
    SELECT
        user_id,
        STRING_AGG(
            event_name,
            ' -> '
            ORDER BY event_timestamp
        ) AS path,
        MAX(CASE WHEN event_name = 'purchase' THEN 1 ELSE 0 END) AS converted
    FROM events
    WHERE event_timestamp >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY user_id
)
SELECT
    CASE
        WHEN path LIKE '%pricing%checkout%purchase%' THEN 'direct_intent'
        WHEN path LIKE '%blog%pricing%purchase%'    THEN 'content_to_paid'
        WHEN path LIKE '%support%'                   THEN 'pre_sale_question'
        WHEN path LIKE '%trial_start%'               THEN 'trial_user'
        ELSE 'other'
    END AS journey_type,
    COUNT(*) AS users,
    SUM(converted) AS converters,
    ROUND(100.0 * SUM(converted) / COUNT(*), 2) AS conversion_rate_pct
FROM journey
GROUP BY journey_type
ORDER BY users DESC;

In real datasets, content_to_paid usually converts higher than direct_intent, often by a factor of two. That is not because the blog is magical; it is because users who read a blog post are already further down the consideration funnel. Order the output by users when you want to see traffic distribution, by conversion rate when you want to decide where marketing should invest. The five-bucket schema is a starting point. A mature implementation has fifteen to twenty named journeys with overlapping rules, scored in priority order so each user lands in one bucket.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Common pitfalls

The most frequent mistake is forgetting ORDER BY inside STRING_AGG. Postgres does not guarantee aggregation order without it, so two runs of the same query produce different path strings, and downstream pattern matching becomes nondeterministic. The fix is to always specify STRING_AGG(event_name, ' -> ' ORDER BY event_timestamp) even when the outer query has an ORDER BY. Snowflake and BigQuery have the same trap with LISTAGG WITHIN GROUP. Reviewers on whiteboards catch this within ten seconds, and it is the most common reason a candidate loses points.

A subtler issue is collapsing identity across devices. A user typically visits on mobile first, signs up later on desktop, and converts a week after that on a third device. Without an identity resolution layer that maps device_id to user_id post-login, each device session looks like a separate user, and every metric you compute will be wrong. The fix is to resolve identity before the journey CTE, using a stitching table that maps anonymous IDs to logged-in user IDs at first login, then backfilling that user ID onto every earlier event in the chain.

A third trap is keeping every page view in the path. Most production event streams have fifty or more event types, many of which fire on every page render. If you let them all into the path, the resulting strings are unique per user, and you cannot aggregate across users at all. The fix is to whitelist seven to ten meaningful events at the source CTE, exactly like the event_name IN (...) filter above, and ignore the rest until you have a specific question that needs them.

The fourth pitfall is treating duplicate consecutive events as separate steps. Three landing events in a row almost always mean the user reloaded the page or the tracker double-fired. Left in, they inflate path length and skew time-to-convert. The fix uses LAG(event_name) OVER (PARTITION BY user_id ORDER BY event_timestamp) to flag rows where the current event equals the previous event, then filters them out.

The last trap is computing journey on raw events with no conversion endpoint defined. Without an endpoint, strings grow without bound and segmentation logic cannot match patterns reliably. The fix is to truncate every path at the first occurrence of the target conversion event, using a window function to flag the conversion row and a WHERE event_rank <= conversion_rank filter to cut everything after.

Optimization tips

For tables under a hundred million rows, the query as written runs fine on Postgres with an index on (user_id, event_timestamp). Above that, STRING_AGG becomes the bottleneck because it has to materialize a sorted list per user before concatenating. The fastest practical optimization is to precompute the per-user journey in a daily incremental table partitioned by acquisition date, then query the precomputed table for downstream analysis. The increment job touches only the previous twenty-four hours of events plus the rolling thirty-day window for active users.

On Snowflake and BigQuery, LISTAGG and STRING_AGG have a hard length cap, often four megabytes per group. Users with extremely long journeys will silently truncate, and your pattern matching will fail in the rows that matter most. Cap the path length with QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_timestamp) <= 100 before aggregating. On Databricks, prefer collect_list followed by concat_ws for the same reason. A second optimization is to denormalize event types into integer codes before aggregating: concatenating ten-character strings is roughly four times slower than concatenating two-character codes, and at a few hundred million rows the difference is measured in minutes.

If you want to drill SQL problems like this one every day, NAILDD is launching with 500+ SQL questions covering exactly this style of journey, funnel, and attribution work.

FAQ

How many events should a normal customer journey have?

For SaaS products with a buying cycle measured in days, five to fifteen events is the healthy range. For e-commerce with single-session purchases, three to eight is normal. Anything above twenty events on a converted user is usually a tracking artifact, a session that was never closed, or a bot. The right answer for your specific product comes from looking at the path-length histogram and finding the elbow, then setting your bot filter just past it.

How do I handle users who convert more than once?

There are three valid choices and you should pick based on the question. If you care about acquisition, use the first conversion only and cut the path there; the rest is retention behavior. If you care about retention loops, keep all conversions and define a session that starts after each one. If you care about lifetime value, drop the journey question entirely and use a cohort revenue query. Mixing all three produces strings that look right but cannot be aggregated meaningfully.

Which events belong in the path?

Whitelist seven to ten events that map to real product moments: landing, key content, pricing, signup, trial start, checkout, purchase, and one or two support or activation events. Every other click stays out. The temptation to include everything is the single biggest reason journey queries produce unreadable output. If a stakeholder asks about a specific click, add it as a new whitelist entry and rerun rather than expanding the whitelist permanently.

What is the difference between customer journey and funnel?

A funnel is a fixed ordered set of steps, computed as the conversion rate from step N to step N+1. A journey is the actual ordered path a user took, in whatever order it happened. The funnel report is a special case of the journey: the journey query filtered to users whose path matches a specific sequence. If you only ever need a funnel, build a funnel; if you need to understand variance or find loops, build the journey query.

How do I handle anonymous-to-identified transitions?

Build an identity resolution table that maps every anonymous_id to a user_id at first login, then backfill that user_id onto every prior event for that anonymous chain. The table needs to be updated daily, ideally in near real time. Once it exists, the journey query joins through it and treats anonymous and logged-in events as a single user. Without this layer, every metric that depends on user-level paths is wrong by twenty to forty percent.