K-Means Clustering — Empty Clusters from Outlier Injection
Empty clusters in K-Means from outlier injection cause NaN centroids and broken dashboards.
20+ years shipping production ML systems and the infrastructure behind them. Everything here is grounded in real deployments.
- K-Means partitions data into K clusters by minimizing within-cluster variance
- Iterates between assignment and update steps until convergence
- Results depend heavily on initialization — k-means++ reduces bad starts
- Scales O(n·K·d·I) — fast for millions of points with few clusters
- Production trap: unequal cluster sizes or outliers warp centroids silently
- Biggest mistake: using Euclidean distance on unscaled features
Imagine you dump 500 LEGO bricks on the floor and ask three kids to sort them into piles however they like — each kid keeps pulling bricks closer to their pile until no brick wants to move anymore. That's K-Means. It automatically groups data points into K clusters by repeatedly asking 'which group is this point closest to?' until the groupings stop changing. No labels, no supervision — just geometry and repetition.
Every major tech company uses unsupervised learning to find patterns nobody explicitly told them to look for. Spotify groups listeners by taste without knowing your favourite genre upfront. Netflix segments users into viewer personas without anyone labelling them. Retailers spot buying behaviour clusters before a single marketing email is sent. K-Means is the workhorse behind most of these — fast, interpretable, and battle-tested across decades of real deployments.
The core problem K-Means solves is deceptively simple: given a pile of unlabelled data, can we automatically discover natural groupings? Traditional supervised learning needs labels — someone has to say 'this email is spam' before the model learns. K-Means asks no such thing. You hand it raw data points and a number K, and it figures out the rest by minimising the distance between each point and its assigned group's centre.
By the end of this article you'll understand exactly how the algorithm moves through its update loop, why your choice of K matters enormously, how to pick a good K using the Elbow Method, and how to avoid the two mistakes that silently wreck most beginners' results. You'll also have complete, runnable Python code you can drop into a real project today.
How K-Means Clustering Actually Works
K-means clustering partitions n observations into k clusters by minimizing within-cluster variance. Each observation belongs to the cluster with the nearest mean (centroid). The algorithm iterates between assignment and update steps until convergence — typically O(n·k·d·i) where i is iterations, often < 100.
Initialization matters: random seeds can produce different results. The standard algorithm (Lloyd's) converges to a local optimum, not global. Empty clusters occur when no points are assigned to a centroid — common with outlier injection or poor initialization. Reinitializing the empty centroid to a distant point is a practical fix.
Use k-means when you need fast, scalable clustering on numeric data with spherical cluster shapes. It's the default for customer segmentation, image compression, and anomaly detection baselines. But it assumes equal cluster sizes and fails on non-convex shapes — know when to reach for DBSCAN or hierarchical methods instead.
The K-Means Algorithm: Assignment and Update
K-Means works in two repeating steps. First, assign each data point to the nearest centroid (using Euclidean distance). Second, recompute each centroid as the mean of all points assigned to it. Stop when assignments no longer change or after a fixed number of iterations.
This is a coordinate descent optimisation — it converges to a local minimum of the within-cluster sum of squares (inertia). The guarantee? Only local, not global. Different initial centroids lead to different results.
Here's a pure Python implementation that makes the loop explicit:
Choosing K: Elbow Method vs Silhouette Score
The number of clusters K must be specified upfront — this is K-Means' biggest limitation. The Elbow Method runs the algorithm for a range of K values and plots inertia vs K. The 'elbow' is where the rate of decrease flattens — that's your best K.
But inertia always decreases as K increases. At K=N, inertia reaches zero. That's why the Elbow Method is subjective. A more rigorous alternative is the Silhouette Score, which measures how similar a point is to its own cluster compared to other clusters. Higher is better. Scores range from -1 to 1.
Neither method is perfect. Structural ground truth rarely exists in unsupervised problems — choose K based on business interpretability as much as metrics.
- Elbow gives a range, not a single answer.
- Silhouette prefers compact, well-separated clusters — can be misleading with elongated shapes.
- Use business validation: plot the centroids or inspect cluster characteristics with domain experts.
Initialization Methods: Random vs k-means++
Initial centroids determine which local minimum the algorithm converges to. Pure random initialization can produce empty clusters or long convergence time. k-means++ spreads out the initial centroids by sampling probabilistically — points farther from existing centroids are more likely to be chosen.
scikit-learn's init='k-means++' (default) uses this. It's a small overhead for dramatically better results. Never use pure random (init='random') unless you're running many seeds and picking the best.
Other initialization strategies: furthest-first (rare), or using a small subset of data (e.g., k-means|| for distributed systems).
init='random' with random_state fixed — but then you're stuck with that local minimum.Feature Scaling: The Hidden Assumption
K-Means uses Euclidean distance internally. If one feature has a range 0–1000 and another is 0–1, the first feature dominates the distance calculation. This is the number one silent mistake beginners make.
Standardise (z-score) or min-max scale all features before clustering. Use StandardScaler unless you know your features should have different importance.
Categorical features are a bigger problem — one-hot encoding introduces many binary dimensions, making distances meaningless. Consider Gower distance or use K-Prototypes (mixed-type clustering) instead.
X.var(axis=0) to see variance per feature. If they differ by more than a factor of 10, you must scale.Common Pitfalls and How to Avoid Them
Beyond scaling and initialization, K-Means has three more traps: (1) Outliers — a single extreme point can pull a centroid away from the true cluster. Use DBSCAN or remove outliers first. (2) Non-spherical clusters — K-Means assumes spherical clusters of similar size. If your data has elongated or interlocking shapes, K-Means will split them arbitrarily. (3) The 'curse of dimensionality' — Euclidean distance becomes less meaningful in high dimensions. Consider PCA first to reduce to 10–50 dimensions.
Production tip: never trust the cluster labels blindly. Plot them (using PCA/t-SNE) and inspect a sample of each cluster. The model's 'pattern' may be meaningless noise.
Real-World Uses: Where K-Means Actually Earns Its Keep
K-Means isn't a toy. It's a workhorse you deploy when you need structure from chaos—fast. Customer segmentation is the classic: hand a marketing team five clusters of user behavior and watch them build campaigns that don't suck. But the real money is in production pipelines where latency matters.
Image compression. You take every pixel (R, G, B) as a point in 3D space. Run K-Means with K=16 or 256. Replace each pixel with its nearest centroid color. You just shrank a 24-bit image to 4 or 8 bits per pixel with surprisingly tolerable visual loss. That's not academic—that's how early web thumbnails worked.
Anomaly detection is another hidden gem. Points that land far from any centroid after convergence? Those are outliers. In fraud detection, that's your signal. In sensor data, that's the impending failure. K-Means clusters the normal, and the stragglers tell you something broke. It's not perfect—it assumes spherical clusters—but for a first pass at 10 million records? Unbeatable.
Bottom line: K-Means is your cheap, fast clustering hammer. Use it when you need answers in seconds, not after a neural net finishes training.
Implementation: From Scratch So You Actually Understand the Math
Production libraries hide the brittle mechanics of convergence. Before you call KMeans(n_clusters=3), understand that the algorithm is just two loops: assign points to nearest centroid, then recompute centroids as the mean of their assigned points. That second step is why this works only for Euclidean distance — means minimize squared error. The real gotcha? Initialization. Random seeds cause empty clusters and local minima. The fix is k-means++: choose first centroid uniformly, then each subsequent centroid with probability proportional to squared distance from the nearest existing centroid. This dramatically improves both speed and final inertia. The code below implements vanilla k-means with k-means++ initialization. Watch for numerical instability when a cluster gets zero points — we log a warning and skip recomputation. That edge case isn't academic; it will crash a production pipeline if handled naively.
Why K-Means Fails on Categorical Data (And What to Do)
K-Means computes centroids by averaging coordinates, which only works for continuous numeric features. Categorical variables like color, country, or product category break the algorithm because mean is meaningless for labels. One-hot encoding creates a high-dimensional sparse space where Euclidean distance distorts similarity — two red apples and two blue cars appear equally distant. Instead, use K-Modes (replaces mean with mode) or Gower distance for mixed data. If you must use K-Means, apply dimensionality reduction like PCA after one-hot encoding, but expect degraded cluster quality. The hidden assumption: every feature must be numeric and ratio-scaled. When your data includes zip codes, gender, or yes/no fields, K-Means silently produces garbage clusters. Validate by examining cluster centroids — if they contain fractional values for binary features, your algorithm is lying to you. Always separate categorical preprocessing as a non-negotiable step before clustering.
Scalability: Why K-Means Crashes on 100M Rows and How to Fix It
Standard K-Means loads all data into memory and iterates until convergence. At 100 million rows, this causes memory blowout and hours of training. The fix: Mini-Batch K-Means. It samples small random batches (e.g., 1000 rows) per iteration, updating centroids online with a learning rate. Convergence is faster (10x-100x) and memory stays constant. But there's a trade-off: centroids oscillate near the final solution instead of converging exactly. Set batch_size to 10% of data or 1000 (whichever is smaller). For true big data, use Spark MLlib's K-Means which distributes the assignment step across nodes. Another trick: reduce dimensionality first with random projection or PCA on 50M+ rows. Never run Elbow Method on full data — sample 1% to pick K, then train on the rest. The rule: if your dataset exceeds RAM, Mini-Batch K-Means is the only viable off-the-shelf option.
Evaluating Cluster Quality When You Have No Ground Truth
Without labels, you cannot use accuracy. Instead, rely on internal validation metrics. Silhouette Score (range -1 to 1) measures how similar a point is to its own cluster vs others — values above 0.5 indicate reasonable separation. Davies-Bouldin Index (lower is better) compares cluster similarity — below 1.0 suggests compact, well-separated clusters. Calinski-Harabasz Index (higher is better) is the ratio of between-cluster variance to within-cluster variance. All three have flaws: Silhouette favors spherical clusters, Davies-Bouldin gets unstable with many clusters, Calinski-Harabasz assumes equal-sized clusters. Never trust a single metric. Run all three, plus visual inspection via PCA or t-SNE if dimensionality allows. A production rule: if Silhouette < 0.3 or Davies-Bouldin > 1.5, re-examine your K and preprocessing. These metrics also help choose K — use the 'elbow' of the Silhouette curve or minimum Davies-Bouldin. But remember: good metrics don't guarantee actionable clusters. Always ground-truth check with domain experts.
Empty Clusters in Production: The Silent Model Degradation
- Always verify cluster membership counts after fitting — empty clusters are a silent signal of bad initialization or data distribution shifts.
- Use k-means++ or deterministic initialization (e.g., from prior run) instead of pure random seed.
- Monitor feature distributions in production — outlier injection can orphan centroids without any explicit error.
kmeans.cluster_centers_. If all centroids overlap, reduce K or try different initialization.silhouette_score per cluster. Look for clusters with negative silhouette values — they overlap.random_state and switch to k-means++. If still unstable, your data may not have natural clusters — try DBSCAN or hierarchical clustering.python -c "from sklearn.cluster import KMeans; km = KMeans(n_clusters=5, init='k-means++', random_state=42)"km.fit(X); cluster_counts = pd.Series(km.labels_).value_counts()numpy.clip(X, lower, upper)Key takeaways
Common mistakes to avoid
4 patternsNot scaling features before clustering
Using a single random initialisation without comparing
Choosing K arbitrarily without validation
Ignoring outliers
Interview Questions on This Topic
Explain how K-Means clustering works at a high level. What are the key steps?
Frequently Asked Questions
20+ years shipping production ML systems and the infrastructure behind them. Everything here is grounded in real deployments.
That's Algorithms. Mark it forged?
7 min read · try the examples if you haven't