Jaccard similarity in SQL

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Why Jaccard matters

Picture this Monday morning. The growth PM at a DoorDash-sized marketplace pings you in Slack: "We launched two parallel push campaigns last week, and finance wants to know how much the audiences overlap. Can you have a number by stand-up?" You do not need a model or cosine vectors — you need a single SQL query that takes two user sets and returns one number between zero and one. That number is Jaccard similarity.

Jaccard similarity (or Jaccard index) is the size of the intersection of two sets divided by the size of their union. It is the canonical similarity metric for situations where presence matters and frequency does not. Did the user receive the campaign or not? Did the basket include the SKU or not? When the data is naturally binary, Jaccard is the right tool, and it lives entirely inside SQL.

Analysts at Meta, Stripe, and Airbnb use Jaccard for the boring-but-critical work: deduplicating mailing lists, finding overlapping ad targeting segments, building cheap lookalike audiences for cold-start recommendations, comparing reach sets of two feature rollouts. This post walks the recipe from a two-set comparison to an all-pairs segment matrix to a per-user neighbor lookup, then closes with the traps that quietly corrupt the number.

The formula

The math is one line:

Jaccard(A, B) = |A intersection B| / |A union B|

The range is the closed interval from zero to one. Zero means the sets share no elements. One means the sets are identical. Anything in between is a continuous measure of how much overlap you have relative to the total ground covered by both sets.

The identity that makes Jaccard cheap in SQL is the inclusion-exclusion shortcut:

|A union B| = |A| + |B| - |A intersection B|

That means you only need three counts to compute Jaccard: the size of A, the size of B, and the size of the intersection. You never have to materialize the union explicitly. Every snippet below leans on this identity, because computing the union as a separate UNION subquery doubles the work without adding information.

Unlike cosine similarity, Jaccard ignores how many times each element appears. A user who bought a product once and a user who bought it ten times look identical to Jaccard — both are in the set. If frequency matters for your problem, switch to cosine. If presence is what you care about, Jaccard is the honest answer.

Jaccard between two sets

Start with the simplest case: two named campaigns, one similarity number. You want to know what fraction of the combined reach was shared between a spring sale campaign and a spring promo campaign on the same week.

WITH set_a AS (
    SELECT DISTINCT user_id
    FROM events
    WHERE campaign = 'spring_sale'
),
set_b AS (
    SELECT DISTINCT user_id
    FROM events
    WHERE campaign = 'spring_promo'
),
metrics AS (
    SELECT
        (SELECT COUNT(*) FROM set_a) AS size_a,
        (SELECT COUNT(*) FROM set_b) AS size_b,
        (SELECT COUNT(*)
         FROM set_a
         INNER JOIN set_b USING (user_id)) AS intersection
)
SELECT
    size_a,
    size_b,
    intersection,
    size_a + size_b - intersection AS union_size,
    intersection::NUMERIC
        / NULLIF(size_a + size_b - intersection, 0) AS jaccard
FROM metrics;

The DISTINCT in each set CTE is non-negotiable. Event tables typically log one row per impression, so a single user who saw the campaign three times would otherwise be triple-counted in the intersection. The NULLIF on the denominator protects you from a divide-by-zero error when both sets are empty — a rare case in production, but a real one during back-testing or sandbox runs.

If you read the result and see a Jaccard of 0.42, that means 42% of the combined audience was hit by both campaigns. Whether that is good or bad depends on the goal. For brand-lift testing, high overlap is contamination. For loyalty re-engagement, high overlap might be the point.

Audience overlap across many segments

The two-set query is fine when someone asks about a specific pair. The harder version, and the one that turns up in onsite interviews at Snowflake and Databricks, is all-pairs overlap across an entire segment library. Marketing has 80 named segments and wants a heatmap of which ones are nearly duplicates.

WITH user_segments AS (
    SELECT DISTINCT user_id, segment_id
    FROM segment_assignments
    WHERE assigned_at >= CURRENT_DATE - INTERVAL '30 days'
),
pair_stats AS (
    SELECT
        a.segment_id AS seg_a,
        b.segment_id AS seg_b,
        COUNT(*) AS intersection
    FROM user_segments a
    JOIN user_segments b USING (user_id)
    WHERE a.segment_id < b.segment_id
    GROUP BY a.segment_id, b.segment_id
),
seg_sizes AS (
    SELECT segment_id, COUNT(*) AS size
    FROM user_segments
    GROUP BY segment_id
)
SELECT
    p.seg_a,
    p.seg_b,
    sa.size AS size_a,
    sb.size AS size_b,
    p.intersection,
    p.intersection::NUMERIC
        / NULLIF(sa.size + sb.size - p.intersection, 0) AS jaccard
FROM pair_stats p
JOIN seg_sizes sa ON sa.segment_id = p.seg_a
JOIN seg_sizes sb ON sb.segment_id = p.seg_b
ORDER BY jaccard DESC
LIMIT 50;

The trick is the self-join on user_id. For every user, you produce all pairs of segments they belong to, then count by segment pair. The a.segment_id < b.segment_id filter both eliminates the diagonal (a segment compared to itself, which is always one) and avoids counting each pair twice in opposite orders. Without that filter, an 80-segment library produces 6,400 rows; with it, you get the proper 3,160 unique pairs.

A high Jaccard between two named segments means the segments are nearly the same audience under different labels. That is a signal to merge them, prioritize one, or run a deduplicated send. A growth team that maintains 80 segments often has six or seven that are 0.9+ similar to another segment — they accumulate over years and nobody prunes them.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Lookalike users by purchases

The third pattern is per-user similarity: given a target user, find the most similar users in the rest of the population. This is the cheapest possible lookalike audience and it is often the first iteration before someone reaches for matrix factorization.

WITH target_items AS (
    SELECT DISTINCT product_id
    FROM purchases
    WHERE user_id = 42
),
candidate_items AS (
    SELECT user_id, product_id
    FROM purchases
    WHERE user_id <> 42
),
overlap AS (
    SELECT
        c.user_id,
        COUNT(*) FILTER (
            WHERE c.product_id IN (SELECT product_id FROM target_items)
        ) AS intersection,
        COUNT(DISTINCT c.product_id) AS user_size
    FROM candidate_items c
    GROUP BY c.user_id
),
target_size AS (
    SELECT COUNT(*) AS size FROM target_items
)
SELECT
    o.user_id,
    o.intersection,
    o.intersection::NUMERIC
        / NULLIF(t.size + o.user_size - o.intersection, 0) AS jaccard
FROM overlap o
CROSS JOIN target_size t
WHERE o.intersection > 0
ORDER BY jaccard DESC
LIMIT 20;

A few things to notice. The intersection is computed with a COUNT(*) FILTER clause, which is the Postgres idiom for a conditional count and faster than a CASE WHEN ... THEN 1 END sum on modern planners. The WHERE o.intersection > 0 guard drops every user who shares nothing with the target, which removes the long zero-Jaccard tail and shrinks the sort. Without it, a marketplace with 10 million users will spend most of its time sorting zeros.

For a single target user this query is interactive on most warehouses. For batch lookalike across every user against every other user, you have crossed into MinHash territory and SQL is no longer the right tool — see the optimization section.

Common pitfalls

The most frequent error is forgetting to deduplicate the input sets. Event tables almost never store one row per user — they store one row per event, click, or impression. If you skip DISTINCT in the set CTEs, a single user who clicked the campaign five times contributes five to the size, and the intersection inflates proportionally. The resulting Jaccard is mathematically a similarity between multisets, a different quantity with different properties. Always deduplicate at the set boundary, and only then count.

A second trap is the empty-set case. Two empty sets have an undefined Jaccard because the denominator is zero. In production this surfaces as a division by zero error mid-query, often during back-fills when the data has not arrived yet for a given day. Wrap the denominator in NULLIF(..., 0) so the result becomes NULL instead of erroring out.

A third pitfall is treating Jaccard as the right metric when frequency actually matters. If you are comparing baskets where a user who bought the SKU 20 times is meaningfully different from one who bought it once, Jaccard collapses that information. Cosine similarity over a count vector, or weighted Jaccard with the Tanimoto formulation, will give you more signal. Pick the metric that matches the question.

A fourth pitfall is comparing sets of wildly different sizes. Jaccard punishes size imbalance — a small set fully contained inside a much larger one still scores low because the union dominates. If the right question is "is A a subset of B?", use the overlap coefficient, which is the intersection divided by the smaller set, not the union.

The fifth pitfall hits in the all-pairs query. Without the a.segment_id < b.segment_id filter, you compute every pair twice and the diagonal once, wasting compute and polluting the output with self-matches that always score one. On a thousand-segment library this is a million spurious rows.

Optimization tips

Reduce the input set with a WHERE clause as early as possible. Filter to the relevant time window inside each CTE, not after the join. A 30-day window on a five-year events table is two orders of magnitude smaller, and the planner cannot always push the filter down when it lives in an outer query.

Materialize the per-user-segment relationship as a daily table. If you compute audience overlap weekly, do not re-derive user_segments from raw events every time. A nightly job that writes a slim (user_id, segment_id, day) table makes the all-pairs query cheap forever. On Snowflake or BigQuery, cluster that table on segment_id so the self-join scans the relevant partitions.

When sets become huge, switch to MinHash. It is a probabilistic sketch that approximates Jaccard with constant memory per set. Compute MinHash signatures in a Python or Spark job, store them as fixed-length arrays in the warehouse, and compute Jaccard estimates in SQL with array-overlap functions. The estimate is within a few percent of true Jaccard and the query is orders of magnitude faster.

For exploratory work, swap in APPROX_COUNT_DISTINCT (available on BigQuery, Snowflake, and most modern warehouses). It can cut runtime in half on billion-row event tables. Just label the output column so nobody downstream assumes it is exact.

If you want to drill SQL questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.

FAQ

Jaccard vs cosine — which one should I pick?

Use Jaccard when your data is naturally binary: did the event happen or not, was the user in the segment or not, did the basket contain the SKU or not. Use cosine when you have counts or weights that carry information — purchase quantities, time spent, click frequencies. Cosine treats each element as a dimension with magnitude; Jaccard treats each element as a flag. If you are unsure, run both on the same pair and look at how the rankings change.

Can Jaccard be used for document deduplication?

Yes, and it is the classic application. You split each document into overlapping n-grams (called shingles, typically 3 to 7 tokens long), treat the shingle set as your input to Jaccard, and any document pair with Jaccard above 0.8 is almost certainly a near-duplicate. This is the basis of every web-scale dedup pipeline. The trick at scale is MinHash with locality-sensitive hashing so you do not compare every pair — but the underlying similarity metric is still Jaccard.

What Jaccard value counts as "similar"?

It depends entirely on the domain and the size of your sets. In recommendation systems where users typically interact with a few dozen items out of millions, a Jaccard of 0.05 between two users is already a strong signal. In document deduplication where shingle sets overlap heavily for any real document, anything below 0.8 is considered distinct. Build the histogram on your own data first, look at where the long tail thickens, and set the threshold there.

How do I make Jaccard work on hundreds of millions of users?

Switch to MinHash. A signature of 128 or 256 hashes per user gives you Jaccard estimates within a few percent of the true value while using constant memory per set. For all-pairs similarity, layer locality-sensitive hashing on top so you only compare candidate pairs that share at least one MinHash band. The exact Jaccard query in SQL is fine up to maybe ten million users; past that, MinHash is mandatory.

Why am I getting a Jaccard greater than one?

You should never see Jaccard greater than one with correct math, so this is always a bug. The most common cause is computing the union as a literal UNION ALL (which double-counts intersection elements) instead of using the inclusion-exclusion shortcut. The second most common cause is missing DISTINCT on the set CTEs. Check the intermediate counts before dividing: if the union size is smaller than the intersection size, your union calculation is wrong, not your formula.