May 18, 2026·13 min read

Clustering on a data science interview

Q: How do I scale k-means to 10 million points?

`MiniBatchKMeans` is the first answer — batches of a few thousand points per update, fraction of the time, quality close to full k-means. For GPU workloads, `faiss` is orders of magnitude faster and standard in embedding pipelines. Spark MLlib provides distributed `KMeans`; Snowflake and BigQuery have native `ML.KMEANS` callable from SQL.

Train for your next tech interview

1,500+ real interview questions across engineering, product, design, and data — with worked solutions.

Join the waitlist

Contents:

Why clustering shows up
k-means: algorithm and gotchas
Picking K: elbow, silhouette, BIC
DBSCAN and HDBSCAN
Hierarchical clustering
Gaussian mixture models
Algorithm comparison
Common pitfalls
Related reading
FAQ

Why clustering shows up

Clustering is the canonical unsupervised question on a data science loop because it forces you to talk about distance metrics, assumptions about cluster shape, and how to validate a model without labels — three places where weak candidates hand-wave. At Stripe, Airbnb, and DoorDash you see it framed as user segmentation; at Tesla and Uber it shows up as geo-clustering for fleet patterns; at Snowflake and Databricks it lives inside customer success scoring.

The most common failure mode is reaching for k-means on geo data with unequal density and producing four convex blobs that split a single dense city in half. Cluster shape matters more than the algorithm name. Knowing when to grab DBSCAN over k-means and why Ward linkage needs Euclidean distance separates a candidate who has shipped segmentation from one who memorized a blog post.

Load-bearing trick: the interviewer tests whether you can match the cluster shape your data has to the shape the algorithm assumes. k-means assumes convex, equal-variance, equal-size. DBSCAN assumes uniform density. GMM assumes elliptical. Saying this out loud is half the answer.

k-means: algorithm and gotchas

The standard procedure is Lloyd's algorithm: pick K centroids, assign each point to the nearest centroid, recompute centroids as the cluster mean, repeat until convergence. The objective is the within-cluster sum of squares (WCSS), also called inertia:

WCSS = Σ_k Σ_{i ∈ C_k} ||x_i - μ_k||²

This optimization is non-convex, so the result depends on initialization. That is why production code uses k-means++ — it samples seed centroids with probability proportional to squared distance from already-chosen seeds, spreading them out and reducing the chance of a bad local minimum. scikit-learn defaults to init='k-means++' and n_init=10, keeping the lowest-inertia run.

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)
km = KMeans(
    n_clusters=5,
    init="k-means++",
    n_init=10,
    max_iter=300,
    random_state=42,
)
labels = km.fit_predict(X_scaled)

Complexity is O(N · K · d · iters) — essentially linear when K and dimensionality are fixed. That is why k-means scales to tens of millions of rows where DBSCAN and hierarchical do not. For very large data, MiniBatchKMeans is a roughly 10x speedup with a tiny quality hit.

k-means produces a Voronoi partition, so cluster boundaries are linear hyperplanes — it cannot recover a crescent moon or two interlocking spirals. It is also sensitive to feature scale: if one column ranges 0-1 and another ranges 0-1,000,000, the second owns the distance metric. Always standardize before fitting. The mean is sensitive to outliers, so a single extreme point can drag a centroid noticeably; k-medoids (PAM) is the robust alternative.

Picking K: elbow, silhouette, BIC

The elbow method plots WCSS against K and looks for the kink where adding clusters stops paying off. It works on textbook examples and is famously ambiguous on real data — sometimes there is no clear elbow, sometimes there are two. State this honestly in an interview rather than pretending elbow is a clean answer.

import matplotlib.pyplot as plt

wcss = []
for k in range(1, 15):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    wcss.append(km.inertia_)
plt.plot(range(1, 15), wcss, marker="o")

The silhouette score is more rigorous. For each point compute a(i) (mean distance to its own cluster) and b(i) (to the nearest other cluster), then s(i) = (b - a) / max(a, b). Range -1 to +1: above 0.5 is strong, 0.25-0.5 reasonable, below 0.25 weak.

from sklearn.metrics import silhouette_score
score = silhouette_score(X_scaled, labels, sample_size=10_000, random_state=42)

The Davies-Bouldin index (lower is better) blends intra-cluster scatter and inter-cluster distance. The Calinski-Harabasz index (higher is better) is the ratio of between-cluster to within-cluster dispersion. For probabilistic models like GMM, BIC and AIC are the right tools because they penalize parameter count.

The most senior answer is to acknowledge that K is often dictated by the product spec — three pricing tiers, four onboarding segments — and the metric just sanity-checks that the prescribed K is not absurd.

DBSCAN and HDBSCAN

DBSCAN classifies each point as core, border, or noise based on local density. A core point has at least min_samples neighbors within radius eps; a border point has fewer neighbors but at least one core neighbor; everything else is noise (label -1). Clusters are connected components of core points plus their borders.

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5, metric="euclidean", n_jobs=-1).fit(X_scaled)
labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = (labels == -1).sum()

DBSCAN does not require K, recovers non-convex shapes, and identifies anomalies. The teaching example — two concentric rings — is invisible to k-means and trivial here. The weakness: it assumes roughly uniform density, so a galaxy with a dense core and sparse arm gets either merged or shredded depending on eps.

The recipe for choosing eps is the k-distance plot: for each point compute the distance to its k-th nearest neighbor (k = min_samples), sort descending, look for the knee. Setting min_samples to roughly 2 × dimensionality is a defensible heuristic from the original paper. Never accept the scikit-learn default eps=0.5 as a final answer.

HDBSCAN generalizes DBSCAN by building a hierarchy of density levels and extracting stable clusters. It removes the eps parameter and handles variable-density clusters cleanly. For exploratory analysis on messy real-world data — geo or embedding spaces — HDBSCAN is usually the better default.

Hierarchical clustering

Agglomerative clustering starts with each point as its own cluster and merges the two closest clusters at each step until one root remains. The output is a dendrogram; cutting it horizontally at any height gives a specific K. Hierarchical is the only family that lets you defer the K decision until after seeing the structure.

from scipy.cluster.hierarchy import linkage, fcluster

Z = linkage(X_scaled, method="ward")
labels = fcluster(Z, t=5, criterion="maxclust")

The choice of linkage determines what "distance between two clusters" means:

Linkage	Cluster distance	Tends to produce	Notes
Single	Min pairwise distance	Long chains, "snakes"	Good for non-convex, fragile to noise
Complete	Max pairwise distance	Tight, compact clusters	Sensitive to outliers
Average	Mean pairwise distance	Balanced compactness	Decent general default
Ward	Minimizes WCSS increase	Roughly equal-size convex clusters	Default in most libs, requires Euclidean

Complexity is O(N²) memory and at least O(N² log N) time, capping practical use around 10k-50k points. Above that: sample down and project the rest with nearest-centroid assignment, use BIRCH (linear-time clustering feature tree), or switch to mini-batch k-means. Divisive (top-down) hierarchical exists but is rarely used.

Train for your next tech interview

1,500+ real interview questions across engineering, product, design, and data — with worked solutions.

Join the waitlist

Gaussian mixture models

GMM treats the data as drawn from a mixture of K Gaussians, each with its own μ, Σ, and weight π. The EM algorithm alternates an E-step computing posterior responsibilities (soft probability each point belongs to each component) and an M-step updating parameters using those responsibilities as weights.

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(
    n_components=5,
    covariance_type="full",
    n_init=5,
    random_state=42,
)
labels = gmm.fit_predict(X_scaled)
probs = gmm.predict_proba(X_scaled)
bic = gmm.bic(X_scaled)

The covariance_type knob trades flexibility for parameters: full gives every component its own covariance, tied shares one across all (close to k-means), diag uses axis-aligned ellipses (stable in high dimensions), spherical is essentially soft k-means. The dangerous failure mode is a singularity, where one component collapses onto a single point; regularize with reg_covar or fewer components.

GMM is right when you need soft assignments — "70% power user, 30% casual" instead of a hard label — or when the data is plausibly elliptical, which is common after PCA projection.

Algorithm comparison

Algorithm	Cluster shape	Scales to N=1M	Needs K?	Handles noise	Best for
k-means	Convex, equal size	Yes (Mini-Batch)	Yes	No	Large datasets, known K, fast prototyping
DBSCAN	Arbitrary	Hard (memory)	No	Yes (`-1`)	Geo data, anomaly detection, uniform density
HDBSCAN	Arbitrary, variable density	Moderate	No	Yes	Exploratory analysis on messy real-world data
Hierarchical (Ward)	Convex-ish	No (O(N²))	Decide after	Sensitive	Small datasets, dendrogram interpretation
GMM	Elliptical	Yes	Yes (BIC)	No	Soft assignment, statistical model
Mini-Batch k-means	Convex, equal size	Yes (very)	Yes	No	Streaming, very large data

A useful move is to talk through which row matches your dataset. 5M rows of behavioral features? Mini-batch k-means. Fraud rings on transaction embeddings? HDBSCAN — cluster size varies and noise is interesting. A gene expression matrix? Ward hierarchical, because biologists want the dendrogram.

Common pitfalls

Forgetting to standardize features is the most common error. Distance-based algorithms let the largest-scale feature dominate. If one column is account age in years (0-15) and another is lifetime revenue in dollars (0-50,000), revenue owns 99.99% of the distance metric and your clusters become a percentile split on revenue. Apply StandardScaler or RobustScaler first.

Running k-means once and trusting the answer is fragile because results depend on initialization. The fix is n_init=10 or higher to keep the lowest-inertia run. In recent scikit-learn versions the default changed to n_init='auto', so check your version. If results swing wildly across runs, that itself is a finding — the data may not have natural clusters at the K you picked.

Using k-means on categorical features by one-hot encoding and praying is a smell. The arithmetic mean of one-hot vectors has no meaningful interpretation. Use k-modes for pure categorical data, k-prototypes for mixed, or learn a dense embedding via an autoencoder and cluster that.

Computing silhouette on millions of points hangs because the calculation is O(N²). Use the sample_size argument in sklearn.metrics.silhouette_score to estimate on a random subsample, or lean on Davies-Bouldin and bootstrap stability instead.

Treating DBSCAN's -1 label as a regular cluster breaks downstream code. The noise label is meaningful — those are points the algorithm refused to assign — and frequently the most interesting output for anomaly detection. Filter labels != -1 before computing per-cluster aggregates, and surface the noise fraction as its own KPI.

Comparing two clustering results by label number is a category error: cluster labels are arbitrary permutations. To compare partitions, use Adjusted Rand Index (ARI) or Normalized Mutual Information (NMI), which are invariant to label permutations.

If you want to drill DS interview questions every day, NAILDD is launching with hundreds of problems on this pattern — clustering, A/B testing, SQL, ML system design.

FAQ

Can I use k-means on categorical features?

Not directly — the mean of a one-hot vector is not a meaningful category. Use k-modes for purely categorical data (mode instead of mean, mismatch count instead of distance) or k-prototypes for mixed features. In production it is also common to learn a dense embedding via an autoencoder and cluster that with k-means.

How do I validate clustering without labels?

Use intrinsic metrics that score the partition against the geometry of the data: silhouette, Davies-Bouldin, Calinski-Harabasz. If you have even a few hundred ground-truth labels, switch to extrinsic metrics like ARI, NMI, or V-measure — invariant to label permutations. Bootstrap stability is another strong unlabeled signal.

How do I scale k-means to 10 million points?

MiniBatchKMeans is the first answer — batches of a few thousand points per update, fraction of the time, quality close to full k-means. For GPU workloads, faiss is orders of magnitude faster and standard in embedding pipelines. Spark MLlib provides distributed KMeans; Snowflake and BigQuery have native ML.KMEANS callable from SQL.

What is cluster stability and how do I measure it?

If I re-ran clustering on a slightly different sample, would I get the same partition? Draw B bootstrap samples (B = 100 is common), cluster each, and for every pair of points compute the fraction of runs they shared a cluster. Stable clusterings have a co-membership matrix close to block-diagonal — also a principled way to choose K.

Does hierarchical clustering work with non-Euclidean distance?

Most linkage methods do. Single, complete, and average linkage accept any metric — Manhattan, cosine, Hamming, or precomputed distance matrices. Ward is the exception: it minimizes WCSS increase, which is only well-defined for squared Euclidean. For Ward-like behavior with another metric, use average linkage.

When should I use HDBSCAN instead of DBSCAN?

Default to HDBSCAN when cluster densities vary, when no single eps works for the whole dataset, or when you want a stability-based notion of which clusters are real. The cost is slightly higher complexity and an extra dependency. The win is removing the most fragile parameter and getting back a soft membership score that downstream code can use as a confidence signal.

Is this content official or affiliated with any vendor?

No. It is a study guide based on canonical references — Lloyd 1982 (k-means), Ester et al. 1996 (DBSCAN), Campello et al. 2013 (HDBSCAN), Murtagh and Contreras 2011 (hierarchical) — and the scikit-learn docs. Validate behavior against your own data and library version before shipping.