K-Means Clustering — Empty Clusters from Outlier Injection
Empty clusters in K-Means from outlier injection cause NaN centroids and broken dashboards.
- K-Means partitions data into K clusters by minimizing within-cluster variance
- Iterates between assignment and update steps until convergence
- Results depend heavily on initialization — k-means++ reduces bad starts
- Scales O(n·K·d·I) — fast for millions of points with few clusters
- Production trap: unequal cluster sizes or outliers warp centroids silently
- Biggest mistake: using Euclidean distance on unscaled features
Imagine you dump 500 LEGO bricks on the floor and ask three kids to sort them into piles however they like — each kid keeps pulling bricks closer to their pile until no brick wants to move anymore. That's K-Means. It automatically groups data points into K clusters by repeatedly asking 'which group is this point closest to?' until the groupings stop changing. No labels, no supervision — just geometry and repetition.
Every major tech company uses unsupervised learning to find patterns nobody explicitly told them to look for. Spotify groups listeners by taste without knowing your favourite genre upfront. Netflix segments users into viewer personas without anyone labelling them. Retailers spot buying behaviour clusters before a single marketing email is sent. K-Means is the workhorse behind most of these — fast, interpretable, and battle-tested across decades of real deployments.
The core problem K-Means solves is deceptively simple: given a pile of unlabelled data, can we automatically discover natural groupings? Traditional supervised learning needs labels — someone has to say 'this email is spam' before the model learns. K-Means asks no such thing. You hand it raw data points and a number K, and it figures out the rest by minimising the distance between each point and its assigned group's centre.
By the end of this article you'll understand exactly how the algorithm moves through its update loop, why your choice of K matters enormously, how to pick a good K using the Elbow Method, and how to avoid the two mistakes that silently wreck most beginners' results. You'll also have complete, runnable Python code you can drop into a real project today.
The K-Means Algorithm: Assignment and Update
K-Means works in two repeating steps. First, assign each data point to the nearest centroid (using Euclidean distance). Second, recompute each centroid as the mean of all points assigned to it. Stop when assignments no longer change or after a fixed number of iterations.
This is a coordinate descent optimisation — it converges to a local minimum of the within-cluster sum of squares (inertia). The guarantee? Only local, not global. Different initial centroids lead to different results.
Here's a pure Python implementation that makes the loop explicit:
Choosing K: Elbow Method vs Silhouette Score
The number of clusters K must be specified upfront — this is K-Means' biggest limitation. The Elbow Method runs the algorithm for a range of K values and plots inertia vs K. The 'elbow' is where the rate of decrease flattens — that's your best K.
But inertia always decreases as K increases. At K=N, inertia reaches zero. That's why the Elbow Method is subjective. A more rigorous alternative is the Silhouette Score, which measures how similar a point is to its own cluster compared to other clusters. Higher is better. Scores range from -1 to 1.
Neither method is perfect. Structural ground truth rarely exists in unsupervised problems — choose K based on business interpretability as much as metrics.
- Elbow gives a range, not a single answer.
- Silhouette prefers compact, well-separated clusters — can be misleading with elongated shapes.
- Use business validation: plot the centroids or inspect cluster characteristics with domain experts.
Initialization Methods: Random vs k-means++
Initial centroids determine which local minimum the algorithm converges to. Pure random initialization can produce empty clusters or long convergence time. k-means++ spreads out the initial centroids by sampling probabilistically — points farther from existing centroids are more likely to be chosen.
scikit-learn's init='k-means++' (default) uses this. It's a small overhead for dramatically better results. Never use pure random (init='random') unless you're running many seeds and picking the best.
Other initialization strategies: furthest-first (rare), or using a small subset of data (e.g., k-means|| for distributed systems).
init='random' with random_state fixed — but then you're stuck with that local minimum.Feature Scaling: The Hidden Assumption
K-Means uses Euclidean distance internally. If one feature has a range 0–1000 and another is 0–1, the first feature dominates the distance calculation. This is the number one silent mistake beginners make.
Standardise (z-score) or min-max scale all features before clustering. Use StandardScaler unless you know your features should have different importance.
Categorical features are a bigger problem — one-hot encoding introduces many binary dimensions, making distances meaningless. Consider Gower distance or use K-Prototypes (mixed-type clustering) instead.
X.var(axis=0) to see variance per feature. If they differ by more than a factor of 10, you must scale.Common Pitfalls and How to Avoid Them
Beyond scaling and initialization, K-Means has three more traps: (1) Outliers — a single extreme point can pull a centroid away from the true cluster. Use DBSCAN or remove outliers first. (2) Non-spherical clusters — K-Means assumes spherical clusters of similar size. If your data has elongated or interlocking shapes, K-Means will split them arbitrarily. (3) The 'curse of dimensionality' — Euclidean distance becomes less meaningful in high dimensions. Consider PCA first to reduce to 10–50 dimensions.
Production tip: never trust the cluster labels blindly. Plot them (using PCA/t-SNE) and inspect a sample of each cluster. The model's 'pattern' may be meaningless noise.
Empty Clusters in Production: The Silent Model Degradation
- Always verify cluster membership counts after fitting — empty clusters are a silent signal of bad initialization or data distribution shifts.
- Use k-means++ or deterministic initialization (e.g., from prior run) instead of pure random seed.
- Monitor feature distributions in production — outlier injection can orphan centroids without any explicit error.
kmeans.cluster_centers_. If all centroids overlap, reduce K or try different initialization.silhouette_score per cluster. Look for clusters with negative silhouette values — they overlap.random_state and switch to k-means++. If still unstable, your data may not have natural clusters — try DBSCAN or hierarchical clustering.numpy.clip(X, lower, upper)Key takeaways
Common mistakes to avoid
4 patternsNot scaling features before clustering
Using a single random initialisation without comparing
Choosing K arbitrarily without validation
Ignoring outliers
Interview Questions on This Topic
Explain how K-Means clustering works at a high level. What are the key steps?
Frequently Asked Questions
That's Algorithms. Mark it forged?
3 min read · try the examples if you haven't