Cosine vs Euclidean distance for the DS interview

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Why distance metrics show up on every DS loop

If you are interviewing for a Data Scientist role at Google, Meta, Stripe, or Airbnb, the distance-metric question is almost guaranteed to show up — usually inside a bigger task. It hides under kNN classifiers, k-means clustering, recsys embedding retrieval, vector search for RAG, and anomaly detection. The interviewer rarely asks "define cosine similarity" out of nowhere. Instead you get something like: "You have user embeddings from a two-tower model. How do you find the top-100 most similar users for a candidate ad?" Pick the wrong metric and the answer falls apart.

The three questions that recruiters at Notion, Databricks, and OpenAI use to separate strong candidates from people who memorized definitions are: what is the difference between cosine and Euclidean, why do we normalize before k-means, and when does Manhattan beat Euclidean? This post walks through each metric, the math, the geometric intuition, and the real failure modes. If you came here for the one-liner answer: normalize your embeddings and cosine vs Euclidean give you the same neighbors — but the rest of the post is where the interview points come from.

Metric Range Cares about magnitude? Triangle inequality? Best fit
Euclidean (L2) [0, ∞) Yes Yes Low-dim, physical coords, k-means
Cosine similarity [-1, 1] No No (cosine distance) High-dim sparse, text, recsys
Manhattan (L1) [0, ∞) Yes Yes Outlier-robust, grid data
Jaccard [0, 1] N/A (sets) Yes (1-J) Tags, shingles, sets
Hamming [0, n] N/A (bits) Yes Binary hashes, LSH
Mahalanobis [0, ∞) Scaled by Σ Yes Correlated features

Euclidean distance

The straight-line geometric distance. Also called the L2 norm of the difference vector.

d(a, b) = sqrt( sum_i (a_i - b_i)^2 )

In Python:

import numpy as np
d = np.linalg.norm(a - b)

Three properties that interviewers want you to bring up unprompted. First, Euclidean is sensitive to scale — a feature measured in dollars (range 0 to 100,000) will dominate a feature measured in fractions (range 0 to 1). That is why StandardScaler is mandatory before k-means or kNN on raw tabular features. Second, Euclidean suffers from the curse of dimensionality: in high-dim spaces, the ratio between the nearest and farthest point shrinks toward 1, so "nearest neighbor" stops being meaningful. Third, it is a true metric — non-negative, symmetric, satisfies the triangle inequality — which means data structures like KD-trees and Ball-trees can index it efficiently.

Sanity check: if your raw features are not on the same scale, never use Euclidean distance without standardizing first. This is the single most common mistake on take-home tasks.

Cosine similarity

Cosine measures the angle between two vectors, ignoring their magnitudes.

cos(a, b) = (a . b) / (||a|| * ||b||)

The output sits in [-1, 1] for general vectors, or [0, 1] if all components are non-negative (which is the case for TF-IDF, count vectors, and most non-negative embedding schemes). A value of 1 means same direction, 0 means orthogonal, -1 means opposite.

from scipy.spatial.distance import cosine
similarity = 1 - cosine(a, b)

The classic gotcha: cosine distance is defined as 1 - cos(a, b), but it is not a true metric because it violates the triangle inequality. In practice nobody cares — vector databases like Pinecone, Weaviate, and pgvector all expose cosine as a first-class distance and the math just works. But if the interviewer is being pedantic, that is the right answer.

Cosine is the default for text embeddings, sentence transformers, and most recsys two-tower setups. The reason is geometric: in high-dim sparse spaces (think a 50,000-dim TF-IDF vector), the magnitude of a vector mostly reflects document length, not topic. Two documents about the same topic but of different lengths look far apart in Euclidean space and close in cosine space. That intuition alone gets you most of the partial credit on the question.

Connecting cosine and Euclidean

A relationship that catches most candidates off guard. If both vectors have unit L2 norm (||a|| = ||b|| = 1), then:

||a - b||^2 = 2 - 2 * cos(a, b)

Which means Euclidean distance and cosine similarity rank neighbors identically on L2-normalized vectors. The top-K nearest neighbors by Euclidean equals the top-K by cosine. This is exactly why FAISS, ScaNN, and most vector indexes recommend L2-normalizing your embeddings and then using Euclidean — you get cosine semantics with the indexing efficiency of a true metric.

Load-bearing trick: L2-normalize embeddings once at write time, then use Euclidean at query time. Same neighbors, faster index, simpler code.

Manhattan distance

The L1 norm of the difference. Walking from corner to corner on a city grid.

d(a, b) = sum_i |a_i - b_i|

Manhattan has two practical advantages over Euclidean. It is less sensitive to outliers because it does not square the deltas — a single huge coordinate-wise difference does not dominate the sum quadratically. And it shows up naturally in L1 regularization (Lasso), where the L1 penalty on the weight vector encourages sparsity. That is why Lasso zeroes out features while Ridge only shrinks them.

d = np.sum(np.abs(a - b))

Manhattan is also a true metric. Some research has suggested that in very high dimensions, L1 distances are more discriminative than L2 — the curse of dimensionality hits L1 a bit less brutally. In practice, for most production workloads you will reach for cosine before Manhattan, but it is worth knowing for the interview.

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Hamming, Jaccard, Mahalanobis, Edit

Hamming distance counts the positions at which two equal-length binary vectors differ.

d_H("10110", "10011") = 3

Where it matters in industry: error-correcting codes, Locality Sensitive Hashing (LSH) for billion-scale duplicate detection, and SimHash for near-duplicate document detection at Google and Stripe.

Jaccard similarity is the standard metric for sets, not vectors.

J(A, B) = |A intersect B| / |A union B|

If you need to compare users by the set of tags they follow, or articles by their set of n-gram shingles, Jaccard is the right tool. The trick on large sets is to avoid materializing the intersection — that is what MinHash is for. MinHash gives you an unbiased estimate of Jaccard with a few hundred hash signatures per set, which then plugs into LSH for sub-linear search.

Mahalanobis distance rescales by the inverse covariance matrix.

d^2 = (a - b)^T * Sigma^-1 * (a - b)

This corrects for correlated features. Mahalanobis is what you reach for in anomaly detection on tabular data when your features are not independent. The gotcha is that Sigma must be invertible — if you have collinear features, the inversion blows up. Add a small ridge to the diagonal, or use the pseudo-inverse.

Edit distance (Levenshtein) is the minimum number of single-character insertions, deletions, or substitutions to turn one string into another. Used for fuzzy matching and dedup pipelines at DoorDash and Uber.

When to choose which metric

Use case Recommended metric Reason
Text similarity (TF-IDF, sentence embeddings) Cosine Magnitude reflects length, not topic
Two-tower recsys retrieval Cosine or L2-normalized Euclidean Equivalent neighbors, choose by index support
k-means on tabular features Euclidean after StandardScaler k-means objective minimizes squared L2
Geo coordinates (lat/lon, short range) Euclidean or Haversine Local plane approximation
Outlier-heavy tabular Manhattan Less penalty on extreme deltas
Set similarity (tags, followers) Jaccard / MinHash Sets, not vectors
Binary hash retrieval Hamming Native bit-level XOR is fast
Anomaly detection on correlated features Mahalanobis Decorrelates dimensions

The rule of thumb that experienced ICs at Anthropic, Meta, and Snowflake will repeat: when in doubt on embeddings, L2-normalize and use cosine. When in doubt on raw tabular, standardize and use Euclidean. Everything else is a special case you should be able to justify in one sentence.

Common pitfalls

The most expensive mistake at the take-home stage is running k-means with Euclidean distance on un-normalized features. The largest-scale column will dominate the cluster assignments, and your clusters will end up being essentially a histogram of that one column. Always pipe your features through StandardScaler, MinMaxScaler, or a robust scaler before clustering. If the interviewer asks why your clusters look weird, this is the first thing to check.

A close second is applying cosine to embeddings that are already L2-normalized without realizing it. Most modern sentence transformers (SBERT, OpenAI text-embedding-3, Cohere embed-v3) return unit-norm vectors by default. Computing cosine on already-normalized vectors is just a dot product — fine, but if you then also normalize again you are doing redundant work, and if you forget that they are normalized you might apply a second normalization to the difference vector and get nonsense. Check the model card before you write the retrieval pipeline.

Another trap is using cosine distance inside a KD-tree. The scikit-learn KDTree and most spatial indexes assume a true metric, and cosine distance is not one. If you need approximate nearest neighbor search with cosine semantics, use BallTree, Annoy, HNSWLIB, FAISS with IndexFlatIP, or convert to L2-normalized Euclidean. The same applies to pgvector — pick the correct operator (<=>, <->, <#>) for your metric.

Mahalanobis on a singular covariance matrix breaks silently. If you have perfectly collinear features, numpy.linalg.inv(cov) will either blow up or return garbage from numerical instability. Add a small regularization term to the diagonal (cov + 1e-6 * np.eye(n)) or use numpy.linalg.pinv and document the choice.

Finally, using Euclidean in very high dimensions without thinking. Above roughly 30-50 dimensions of meaningful signal, the concentration of measure phenomenon kicks in: the gap between nearest and farthest neighbor shrinks toward zero. Switch to cosine, reduce dimensionality first with PCA or UMAP, or move to learned indexes like ScaNN.

If you want to drill DS interview questions on metrics, embeddings, and retrieval every day, NAILDD is launching with hundreds of problems built around exactly these patterns.

FAQ

Is cosine distance a real metric?

Cosine similarity is bounded in [-1, 1]. Cosine distance is defined as 1 - cos(a, b) and ranges over [0, 2]. It violates the triangle inequality, so it is not a true metric in the mathematical sense. In practice this rarely matters — vector databases and retrieval libraries treat it as a distance and the math is well-behaved. But if your interviewer is testing whether you know the difference between an industrial definition and a mathematical one, the right answer is: not a true metric, used as one.

What metric should I use for word and sentence embeddings?

Cosine is the de facto standard. word2vec, GloVe, BERT, SBERT, OpenAI text-embedding-3, and Cohere embed-v3 are all designed with cosine similarity in mind, and most of them produce L2-normalized vectors by default. If your vectors are already unit-norm, cosine and Euclidean give identical neighbor rankings, so pick whichever your index library supports natively for performance.

Why do we normalize features before k-means?

k-means minimizes the sum of squared Euclidean distances between points and centroids. If one feature has a range of 0 to 100,000 and another has a range of 0 to 1, the optimization is essentially clustering on the large-scale feature alone — the small-scale feature contributes negligible variance. StandardScaler (zero mean, unit variance) puts all features on a comparable footing. Skip this and your clusters become a histogram of your largest-scale column.

When should I prefer Manhattan over Euclidean?

Three situations. First, when your data has heavy outliers — Manhattan does not square the deltas, so a single extreme coordinate does not dominate the distance. Second, when you want sparse solutions in linear regression, where Lasso (L1 penalty) zeroes out coefficients. Third, when the data is naturally grid-aligned — city street distances, board games, or some types of count data. For most embedding work, cosine still wins.

How do I handle high-dimensional data where all distances look the same?

This is the curse of dimensionality in action. Three escape hatches: switch from Euclidean to cosine (less affected, though not immune); reduce dimensionality first with PCA, UMAP, or learned embeddings before computing distances; or use a metric like Mahalanobis that adapts to the actual data covariance. In practice, the cleanest fix for production retrieval is to learn lower-dimensional embeddings end-to-end with a two-tower model and then use cosine on those.

Cosine vs dot product — what is the difference?

The dot product is a . b, with no normalization. Cosine similarity is the dot product divided by the product of magnitudes. If both vectors are unit-norm, dot product equals cosine similarity exactly. FAISS often exposes IndexFlatIP (inner product) rather than a cosine index — L2-normalize at write time and the inner product is cosine. Most production embedding pipelines normalize once at indexing time and treat it as a one-time cost.