How to calculate cosine similarity in SQL
Contents:
Why cosine similarity belongs in your SQL toolbox
"Recommend three more items the user is likely to buy" lands in a DoorDash or Amazon standup, and the senior data scientist asks the same follow-up every time: "what similarity metric and computed where?" Cosine similarity is the workhorse answer because it measures the angle between two vectors and ignores their length — two shoppers can have wildly different basket sizes and still count as similar if their preference distribution lines up. The same property makes it the default for document matching: a 200-word abstract and a 2,000-word paper can sit close together if the term frequencies are proportional.
Most analysts compute cosine in pandas or numpy. Fine when the matrix fits in memory; wrong tool when a Snowflake or Databricks table holds tens of millions of users and a recommendation job runs nightly next to revenue dashboards. This post walks the queries you will actually ship: user-user, item-item, user-item scoring, and adjusted cosine for ratings. Examples run in Postgres with minor edits for Snowflake, BigQuery, and Redshift.
The formula
Cosine similarity between vectors A and B is the dot product divided by the product of their L2 norms:
cos(A, B) = (A . B) / (||A|| * ||B||)
= sum(a_i * b_i) / sqrt(sum(a_i^2)) / sqrt(sum(b_i^2))For non-negative vectors (counts, term frequencies, purchase quantities) the output is bounded to [0, 1]. For signed values (mean-centered ratings, embeddings) the range is [-1, 1]. Cosine ignores magnitude, so two vectors (2, 0) and (10, 0) are perfectly similar; the metric cares only about direction.
The data model
A single long-format fact table is enough for every query below. Rows store the interaction strength between an actor (user, document) and a feature (product, term, item).
purchases (user_id, product_id, quantity, purchase_date)quantity is the interaction weight. For documents it would be term frequency. For ratings it is the 1-5 star value. Keep it long, not wide — a user_id x product_id pivot blows up on tens of thousands of products and forces every join through a sparse matrix that SQL is not built for.
User-user cosine similarity
Two users, each represented by a vector of product quantities. Cosine between them:
WITH user_a AS (
SELECT product_id, SUM(quantity) AS qty
FROM purchases
WHERE user_id = 42
GROUP BY product_id
),
user_b AS (
SELECT product_id, SUM(quantity) AS qty
FROM purchases
WHERE user_id = 99
GROUP BY product_id
),
dot AS (
SELECT SUM(a.qty * b.qty) AS dot_product
FROM user_a a
JOIN user_b b USING (product_id)
),
norm_a AS (
SELECT SQRT(SUM(qty * qty)) AS n FROM user_a
),
norm_b AS (
SELECT SQRT(SUM(qty * qty)) AS n FROM user_b
)
SELECT
COALESCE(dot.dot_product, 0)::NUMERIC
/ NULLIF(na.n * nb.n, 0) AS cosine_similarity
FROM dot, norm_a na, norm_b nb;The norms are computed on each user's full vector, not on the intersection. That is the subtle point most candidates miss: if you compute the norm on the joined CTE alone, you are only measuring agreement on products both users bought, and the similarity becomes artificially high. The NULLIF guards a brand-new user with zero purchases — without it the query blows up on a divide-by-zero the first time someone runs it for a brand-new account.
Item-item recommendations
For each pair of products, how similar are they by the users who bought them? This is the core of collaborative filtering: items co-purchased by overlapping users get high cosine, and you serve "people who bought X also bought Y" from the top of the list.
WITH item_user AS (
SELECT product_id, user_id, SUM(quantity) AS qty
FROM purchases
WHERE purchase_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY product_id, user_id
),
norms AS (
SELECT product_id, SQRT(SUM(qty * qty)) AS norm
FROM item_user
GROUP BY product_id
),
dots AS (
SELECT
a.product_id AS item_a,
b.product_id AS item_b,
SUM(a.qty * b.qty) AS dot_product
FROM item_user a
JOIN item_user b USING (user_id)
WHERE a.product_id < b.product_id
GROUP BY a.product_id, b.product_id
)
SELECT
d.item_a,
d.item_b,
d.dot_product::NUMERIC / NULLIF(na.norm * nb.norm, 0) AS cosine
FROM dots d
JOIN norms na ON na.product_id = d.item_a
JOIN norms nb ON nb.product_id = d.item_b
WHERE d.dot_product > 0
ORDER BY cosine DESC
LIMIT 100;The a.product_id < b.product_id predicate is doing two jobs. It removes self-pairs (a product paired with itself, which always returns 1.0 and would dominate the top), and it halves the row count by emitting each pair only once instead of A-B and B-A. The 90-day window is a deliberate choice: pairwise affinity drifts as catalog and seasonality change, and a year-long window blurs winter parkas into summer sandals.
User-item similarity scoring
The recommendation flavor most interviews ask about: given a target user, score every candidate item by similarity to the items the user already bought. Sort, take the top N, exclude already-purchased.
WITH user_items AS (
SELECT DISTINCT product_id FROM purchases WHERE user_id = 42
),
item_norms AS (
SELECT product_id, SQRT(SUM(quantity * quantity)) AS norm
FROM purchases
GROUP BY product_id
),
similarity_scores AS (
SELECT
b.product_id AS candidate,
SUM(a.quantity * b.quantity)
/ NULLIF(MAX(na.norm) * MAX(nb.norm), 0) AS cosine
FROM purchases a
JOIN purchases b USING (user_id)
JOIN item_norms na ON na.product_id = a.product_id
JOIN item_norms nb ON nb.product_id = b.product_id
WHERE EXISTS (SELECT 1 FROM user_items u WHERE u.product_id = a.product_id)
AND NOT EXISTS (SELECT 1 FROM user_items u WHERE u.product_id = b.product_id)
GROUP BY b.product_id
)
SELECT candidate, cosine
FROM similarity_scores
WHERE cosine > 0.1
ORDER BY cosine DESC
LIMIT 20;The two EXISTS clauses are the recommendation logic: candidate items are the products the target user has not bought, and the anchor items are the ones they have. The cosine > 0.1 filter is taste-dependent — set it too low and the tail fills with noisy near-zero matches that look smart in offline metrics and feel random in the product. Most teams pick the floor by eyeballing the score distribution and cutting at the first elbow.
Adjusted cosine for ratings
Raw cosine on 1-5 star ratings rewards users who agree on direction but treats a 1-star and a 5-star as more similar than they should be, because both are positive numbers. Adjusted cosine centers each user's ratings on their personal mean before the dot product, so "I gave a 5" is rescaled to "I rated this above my average" and the metric captures preference shape rather than rating optimism.
WITH user_means AS (
SELECT user_id, AVG(rating) AS mean_rating
FROM ratings
GROUP BY user_id
),
centered AS (
SELECT
r.user_id,
r.product_id,
r.rating - um.mean_rating AS centered_rating
FROM ratings r
JOIN user_means um USING (user_id)
),
item_norms AS (
SELECT product_id, SQRT(SUM(centered_rating * centered_rating)) AS norm
FROM centered
GROUP BY product_id
)
SELECT
a.product_id AS item_a,
b.product_id AS item_b,
SUM(a.centered_rating * b.centered_rating)
/ NULLIF(MAX(na.norm) * MAX(nb.norm), 0) AS adjusted_cosine
FROM centered a
JOIN centered b USING (user_id)
JOIN item_norms na ON na.product_id = a.product_id
JOIN item_norms nb ON nb.product_id = b.product_id
WHERE a.product_id < b.product_id
GROUP BY a.product_id, b.product_id
HAVING COUNT(*) >= 5
ORDER BY adjusted_cosine DESC
LIMIT 100;The HAVING COUNT(*) >= 5 is the support filter the dashboard always needs: a pair rated by two overlapping users can register adjusted_cosine = 1.0 and dominate the top of the list while saying nothing. Movie-rating benchmarks usually set the floor between 5 and 20 co-raters depending on catalog density.
Common pitfalls
The first trap is computing norms on the joined intersection rather than the full vector. JOIN ... USING (product_id) keeps only products both users bought, and if you compute SQRT(SUM(qty * qty)) inside that join the norm is wrong by construction. Two users who agree on three of three shared products get a perfect 1.0 even when one of them bought another two hundred items the other never touched. Always compute each norm on the full user vector in its own CTE, then divide.
The second trap is divide-by-zero on cold-start vectors. A brand-new user with no purchases has norm zero, and a brand-new product no one has bought has norm zero on the item side. Without NULLIF the query crashes the first time recommendations are requested for the most important customer of the launch week. Wrap every denominator in NULLIF(x, 0) and decide whether downstream consumers want NULL or a sentinel.
The third trap is raw count vectors when one user has 10,000 logged events of a single product. The dot product blows up, the norm blows up faster, and the similarity gets dragged by one item. Sublinear weighting like LOG(1 + quantity) or full TF-IDF (term-frequency times inverse-document-frequency) flattens the long tail and stabilizes the score for power users.
The fourth trap is self-similarity sneaking into the top of the list. In item-item recommendations, a product joined to itself always returns cosine of 1.0, and it drowns out every real recommendation. The a.product_id < b.product_id predicate (or <> if you still want both directions) is not optional. The same trap shows up in user-user when you forget to exclude the target from the candidate set.
The fifth trap is treating cosine on signed values like cosine on counts. Mean-centered ratings, neural embeddings, and PCA components can be negative, and cosine ranges over [-1, 1] for them. A naive WHERE cosine > 0.1 silently drops meaningfully negative correlations a recommender might want to surface as anti-recommendations.
Optimization tips
Materialize per-product norms in a daily-refreshed table. Norms are the part of the formula every query needs and the only part that does not depend on which pair you are scoring. Recomputing them on every run is the biggest source of slow dashboards on this pattern — once cached, even a million-row pairwise join collapses to two lookups per row.
Partition the interaction table by date. Every cosine query is range-bounded by a recency window, and partition pruning drops the scan from billions of rows to the partitions inside the window. Combine with a sort key on user_id for user-anchored queries and on product_id for item-anchored ones.
For catalogs above a few hundred thousand items, pure SQL hits an O(N^2) wall on the pairwise join. Production systems push the heavy similarity computation to specialized engines: pgvector with HNSW indexes for embeddings, locality-sensitive hashing (LSH) for approximate nearest neighbors, or FAISS as a sidecar service. SQL stays the place where the candidates land for filtering, ranking, and business rules.
A Python sanity check
When the SQL cosine looks off, recompute on a sample in scikit-learn and compare. The two should match to the floating-point limit on the same vectors.
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
a = np.array([[2.0, 0.0, 1.0, 3.0]])
b = np.array([[1.0, 0.0, 2.0, 1.0]])
print(cosine_similarity(a, b))If the numbers differ, the cause is almost always norm scope (joined intersection vs full vector) or NULL handling — scikit-learn errors on NaN, SQL silently drops the row. Reconcile on the same non-null subset and the values line up.
Related reading
- How to calculate AUC ROC in SQL — ranking quality once the similarity scores feed a binary outcome.
- How to calculate Brier score in SQL — calibration of probability outputs that often sit downstream of a similarity feature.
- SQL window functions interview questions —
ROW_NUMBERandRANKare the patterns behind the top-N recommendation queries above.
If you want to drill SQL like this every day, NAILDD is launching with hundreds of analytics and ML-flavored problems.
FAQ
Cosine vs Euclidean — when do I pick which?
Cosine measures direction, Euclidean measures absolute distance. For sparse, high-dimensional vectors (text, user-item interactions, embeddings) cosine is the default because Euclidean gets noisy when magnitude is dominated by vector length. For dense, low-dimensional vectors where absolute scale carries signal (geographic coordinates, sensor data) Euclidean is the better choice.
Cosine vs Jaccard — what's the difference?
Jaccard measures set overlap — it only cares whether the user bought a product, not how often. Cosine on quantity vectors uses frequency, so a 10x repeat purchase weighs more than a one-off. On strictly binary data the two give similar rankings; Jaccard wins when the data really is set membership (tags, follows, page visits without dwell).
Does cosine similarity work for text?
Yes — TF-IDF vectorization followed by cosine is the textbook approach to near-duplicate detection and semantic search before transformer embeddings became cheap. The same SQL works: replace product_id with term_id, quantity with tf_idf_weight, and you have a text similarity engine. For modern retrieval, the vectors are dense embeddings from a model and the SQL above runs inside pgvector or a dedicated vector store.
Is the range really negative for some inputs?
For non-negative inputs (term frequencies, counts, purchase quantities) cosine is bounded to [0, 1]. For inputs with signed values — mean-centered ratings, neural embeddings, PCA scores — the range extends to [-1, 1]. A negative cosine means the two vectors point in opposite directions, which is genuine signal in a recommender and worth surfacing rather than filtering out.
How do I pick the similarity threshold for recommendations?
Plot the score distribution on a representative day and cut at the first visible elbow. For user-item cosine the floor sits between 0.05 and 0.2 — below that the tail is dominated by accidental overlaps that look like signal offline and feel random in the product. Revisit when catalog sparsity shifts.