How to build a lookalike audience in SQL

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Why lookalike audiences land in your SQL queue

A growth PM at DoorDash or Airbnb pings the marketing analyst Friday afternoon: "we need an in-house audience of users who look like our top spenders so the targeting team can upload it to Meta and Google Ads." That ask lands in the analytics SQL queue rather than the data science roadmap, because the product needs the list by Monday and a feature store is two sprints away. The platform lookalike — Meta or Google trained on embeddings — is more accurate but only covers users on that platform. Building it in SQL lets you screen against your own active base, exclude already-targeted segments, and ship the same logic to email, push, and CRM.

The pattern is also a common Stripe and Snowflake interview prompt because it forces you to reason about cohort definition, feature engineering, normalization, and thresholds without a Python notebook. Examples below run on Postgres with minor edits for Snowflake, BigQuery, and Databricks.

The core idea in four steps

Every lookalike pipeline boils down to four moves. First, define a seed of users whose behavior you want to clone — usually high-revenue customers, long-retained subscribers, or users who hit a specific activation event. Second, profile the seed by the features you believe drive that behavior — country, device, signup source, engagement metrics, plan tier. Third, score every non-seed user by how closely their features match the seed profile, using bucket-matching or vector similarity. Fourth, calibrate the score into a likelihood (or pick a top-K cutoff) and hand the list to the targeting team.

The hard parts are not the SQL — they are seed definition and feature choice. Treat the queries below as a skeleton, then iterate against actual conversion lift.

Define the seed cohort

A workable seed is statistically stable (500+ users) and tight enough that behavior is consistent. The most common definition is high-revenue customers over a recent window — Stripe's marketing analytics, for example, defines a seed as accounts above a revenue percentile in the trailing 90 days.

WITH seed_revenue AS (
    SELECT
        user_id,
        SUM(amount) AS lifetime_revenue
    FROM transactions
    WHERE status = 'paid'
      AND created_at >= CURRENT_DATE - INTERVAL '180 days'
    GROUP BY user_id
    HAVING SUM(amount) >= 500
),
seed AS (
    SELECT
        u.user_id,
        u.country,
        u.device,
        u.signup_source,
        u.plan_tier,
        u.age_bucket,
        s.lifetime_revenue
    FROM users u
    JOIN seed_revenue s USING (user_id)
    WHERE u.account_status = 'active'
)
SELECT COUNT(*) AS seed_size, AVG(lifetime_revenue) AS avg_rev
FROM seed;

The SELECT COUNT(*) exists for a reason: every lookalike post-mortem starts with someone realizing the seed had 47 users after a revenue tweak. A seed under 200 is too small for stable profiles; if the numbers look thin, loosen the threshold or widen the window. The account_status = 'active' clause keeps churned accounts out — cloning users who already left is one of the most common ways to ship an audience that converts well and retains terribly.

Profile the seed by features

Once the seed is fixed, the profile is the distribution of features inside it. The simplest version is a count of seed users in each feature bucket; that count becomes the lookalike score for any non-seed user matching the bucket.

WITH seed AS (
    SELECT
        user_id,
        country,
        device,
        signup_source,
        age_bucket
    FROM users
    WHERE user_id IN (SELECT user_id FROM high_value_users)
),
seed_profile AS (
    SELECT
        country,
        device,
        signup_source,
        age_bucket,
        COUNT(*) AS seed_count,
        COUNT(*)::NUMERIC / SUM(COUNT(*)) OVER () AS seed_share
    FROM seed
    GROUP BY country, device, signup_source, age_bucket
)
SELECT *
FROM seed_profile
ORDER BY seed_count DESC
LIMIT 20;

The window SUM(COUNT(*)) OVER () turns the raw count into a share of the seed — useful for picking out buckets where the seed over-indexes against the base rate. A bucket holding 8% of the seed but 0.4% of the base is strong signal; one at 30% seed and 28% base is barely informative.

Score lookalikes with categorical matching

The most direct scorer joins every non-seed user to the seed profile on the feature buckets and uses seed count (or share) as the score.

WITH seed_profile AS (
    SELECT
        country,
        device,
        signup_source,
        age_bucket,
        COUNT(*) AS seed_count
    FROM users
    WHERE is_seed = TRUE
    GROUP BY country, device, signup_source, age_bucket
)
SELECT
    u.user_id,
    u.email,
    sp.seed_count AS lookalike_score
FROM users u
JOIN seed_profile sp
    ON sp.country = u.country
   AND sp.device = u.device
   AND sp.signup_source = u.signup_source
   AND sp.age_bucket = u.age_bucket
WHERE u.is_seed = FALSE
  AND u.account_status = 'active'
  AND u.last_active_at >= CURRENT_DATE - INTERVAL '30 days'
ORDER BY sp.seed_count DESC
LIMIT 50000;

The last_active_at filter turns this from a feature-engineering exercise into a usable marketing list. Ad platforms will happily charge to retarget users who haven't opened the app in a year, and conversion on those impressions is close to zero. Bound to an active window — 30 days for consumer, 90 for B2B SaaS — before exporting. The account_status = 'active' clause does the same for hard exclusions; without it, marketing teams end up emailing users who unsubscribed two years ago.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Cosine similarity on feature vectors

Categorical matching is cheap and stops working once features go numeric. For continuous signals — engagement minutes per week, AOV, days since signup — cosine similarity against the seed centroid is the natural upgrade.

WITH seed_centroid AS (
    SELECT
        AVG(engagement_minutes) AS f1_mean,
        AVG(orders_last_30d)    AS f2_mean,
        AVG(avg_order_value)    AS f3_mean,
        SQRT(SUM(POW(engagement_minutes, 2))) AS f1_norm,
        SQRT(SUM(POW(orders_last_30d, 2)))    AS f2_norm,
        SQRT(SUM(POW(avg_order_value, 2)))    AS f3_norm
    FROM user_features
    WHERE is_seed = TRUE
),
user_norms AS (
    SELECT
        user_id,
        SQRT(
            POW(engagement_minutes, 2)
          + POW(orders_last_30d, 2)
          + POW(avg_order_value, 2)
        ) AS user_norm,
        engagement_minutes,
        orders_last_30d,
        avg_order_value
    FROM user_features
    WHERE is_seed = FALSE
)
SELECT
    u.user_id,
    (
        u.engagement_minutes * s.f1_mean
      + u.orders_last_30d    * s.f2_mean
      + u.avg_order_value    * s.f3_mean
    )
    / NULLIF(
        u.user_norm
        * SQRT(POW(s.f1_mean, 2) + POW(s.f2_mean, 2) + POW(s.f3_mean, 2)),
        0
    ) AS similarity_score
FROM user_norms u, seed_centroid s
ORDER BY similarity_score DESC
LIMIT 10000;

The trap that catches every analyst: features have different scales. Engagement minutes range 0 to 5,000, orders 0 to 50, AOV $5 to $500. Without normalization, engagement minutes dominate the dot product and the lookalike collapses to "users who use the app a lot." The fix is to z-score each feature with (x - AVG(x) OVER ()) / STDDEV(x) OVER () before the centroid. The query above assumes pre-standardized columns; if features are raw, add normalization first or the ranking is meaningless.

Calibrate a likelihood score

Raw cosine and bucket counts are fine for ranking but rough for thresholding. A calibrated likelihood — the empirical probability that a user in a bucket is a seed — is more interpretable and combines naturally with revenue forecasts.

WITH features AS (
    SELECT
        user_id,
        country,
        device,
        signup_source,
        age_bucket,
        is_seed
    FROM users
    WHERE account_status = 'active'
),
bucket_rates AS (
    SELECT
        country,
        device,
        signup_source,
        age_bucket,
        COUNT(*) FILTER (WHERE is_seed) AS seed_count,
        COUNT(*) AS total,
        COUNT(*) FILTER (WHERE is_seed)::NUMERIC
            / NULLIF(COUNT(*), 0) AS seed_rate
    FROM features
    GROUP BY country, device, signup_source, age_bucket
    HAVING COUNT(*) >= 100
)
SELECT
    f.user_id,
    br.seed_rate AS lookalike_likelihood,
    br.total AS bucket_size
FROM features f
JOIN bucket_rates br USING (country, device, signup_source, age_bucket)
WHERE f.is_seed = FALSE
ORDER BY br.seed_rate DESC, br.total DESC
LIMIT 20000;

The HAVING COUNT(*) >= 100 is the support filter the dashboard always needs. A bucket with one user who happens to be a seed registers seed_rate = 1.0 and rockets to the top while saying nothing. For a million-user base, 100 is the default; for sparser B2B, drop to 25 and accept higher variance. Returning bucket_size alongside the likelihood gives the targeting team a feel for confidence per row.

Common pitfalls

The first common mistake is a seed that is too narrow. Fifty handpicked "best customers" feels like quality over quantity, but the profile gets dominated by anecdotes — one user from a particular signup campaign skews the country-and-source bucket, and the lookalike clones the campaign rather than the behavior. Floor at 500 seed users; 1,000 to 5,000 is the sweet spot. If you cannot reach 500, loosen the threshold or widen the window before tightening features.

The second trap is updating the seed too aggressively. Daily refresh sounds responsible and produces a noisy audience that swings with whoever paid last week. Monthly is the right cadence; quarterly for seasonal businesses. Hold the lookalike fixed for one campaign window when the seed changes — otherwise you cannot tell whether a conversion drop came from audience or creative.

The third trap is mixing geographies into one pool. A US whale and an India whale share the label "high revenue" and almost nothing else; the country bucket dominates every feature and the cross-country lookalike degenerates to "users whose country has higher ARPU." Build separate lookalikes per major market — US, EU, APAC at minimum — or stratify the seed. The same applies to web vs mobile and free vs paid.

The fourth trap is keeping already-targeted users in the candidate set. A lookalike that ranks existing high spenders highly is technically correct and operationally useless — the team spends money reminding your best customers they are your best customers. Exclude seed users, recent ad-clickers, and retargeting segments via a do_not_include table joined with NOT EXISTS.

The fifth trap is raw event flags. "Signed up on Tuesday" or "device version contains 14.2.1" are technically features and pure noise. Bucket time features to weekday-vs-weekend, group device versions into major releases, and prefer semantically meaningful aggregates (sessions per week, orders per month, days active in 30). Junk features inflate rank correlation in validation and disappear in production.

Optimization tips

Materialize the seed and seed profile as tables refreshed on a schedule rather than recomputing inside every scoring query. The seed-definition CTE often aggregates 180 days across hundreds of millions of transactions, and rebuilding it on every export burns warehouse credits. A nightly job landing seed_users and seed_profile cuts scoring to a single join against a small, indexed dimension.

Partition the user-features table by the freshness column you filter on — typically last_active_at or feature_snapshot_date. Lookalike scoring is always range-bounded by recency, and partition pruning takes the scan from the full warehouse to one or two partitions. Combine with clustering on country for geo loops or signup_source for channel audiences.

For bases over 50M active users the pairwise cosine becomes the long pole. The cheap win is scoring against the seed centroid rather than every seed user — one dot product per candidate instead of N. Beyond that, push similarity to pgvector or a dedicated vector store and let SQL handle filtering, exclusions, and metadata joins.

If you want to drill SQL like this daily, NAILDD is launching with hundreds of analytics and ML-flavored problems.

FAQ

What is a lookalike audience, exactly?

A lookalike audience is a list of users who resemble a seed of known-valuable users along the features you believe drive that value. The classic use case is paid acquisition — upload to Meta, Google Ads, or TikTok and let the platform optimize toward similar users. Owned channels (email, push, CRM) work the same way: rank by similarity, take top N, route the campaign there first.

How big does the seed need to be?

At least 500 users; 1,000 to 5,000 is the sweet spot. Meta accepts seeds as small as 100, but at that size the profile is dominated by noise and the lookalike inherits it. If you cannot reach 500, loosen the revenue floor or widen the window before tightening features.

Meta Ads lookalike vs SQL?

Meta and Google build lookalikes on their platform graph (interests, social, in-platform behavior) — signal you cannot reproduce in SQL, and more accurate for users on those platforms. SQL wins when you target owned channels, screen against your own active base, explain audience inclusion, or build offline against rules that block uploading raw user data.

How often should the lookalike refresh?

Monthly for consumer products and most B2B SaaS, quarterly for seasonal businesses. Daily refresh is almost never right — the audience swings with last week's noise and you cannot run campaigns against a moving list. Hold the previous audience fixed for one campaign window to attribute lift cleanly.

Lookalike vs retargeting — what's the difference?

Retargeting contains users who already interacted — visitors, cart abandoners, free-tier accounts. Lookalike contains users who have not interacted yet but resemble the ones who did. They are complementary: retargeting closes warm leads, lookalikes open the top of the funnel. Most growth teams run both with hard exclusions across channels.

How do I validate the lookalike before spending?

Run a holdout. Take the top 10% by score and compare conversion to a randomly selected matched group from the eligible base. If the lookalike converts at 1.5x to 3x the random base over four weeks, you have a real signal; below 20% lift, the features are not separating signal from noise. Logging the score percentile per user before spend is the cheapest insurance against shipping an audience that quietly underperforms.