Senior 3 min · March 06, 2026

K-Means Clustering — Empty Clusters from Outlier Injection

Empty clusters in K-Means from outlier injection cause NaN centroids and broken dashboards.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • K-Means partitions data into K clusters by minimizing within-cluster variance
  • Iterates between assignment and update steps until convergence
  • Results depend heavily on initialization — k-means++ reduces bad starts
  • Scales O(n·K·d·I) — fast for millions of points with few clusters
  • Production trap: unequal cluster sizes or outliers warp centroids silently
  • Biggest mistake: using Euclidean distance on unscaled features
Plain-English First

Imagine you dump 500 LEGO bricks on the floor and ask three kids to sort them into piles however they like — each kid keeps pulling bricks closer to their pile until no brick wants to move anymore. That's K-Means. It automatically groups data points into K clusters by repeatedly asking 'which group is this point closest to?' until the groupings stop changing. No labels, no supervision — just geometry and repetition.

Every major tech company uses unsupervised learning to find patterns nobody explicitly told them to look for. Spotify groups listeners by taste without knowing your favourite genre upfront. Netflix segments users into viewer personas without anyone labelling them. Retailers spot buying behaviour clusters before a single marketing email is sent. K-Means is the workhorse behind most of these — fast, interpretable, and battle-tested across decades of real deployments.

The core problem K-Means solves is deceptively simple: given a pile of unlabelled data, can we automatically discover natural groupings? Traditional supervised learning needs labels — someone has to say 'this email is spam' before the model learns. K-Means asks no such thing. You hand it raw data points and a number K, and it figures out the rest by minimising the distance between each point and its assigned group's centre.

By the end of this article you'll understand exactly how the algorithm moves through its update loop, why your choice of K matters enormously, how to pick a good K using the Elbow Method, and how to avoid the two mistakes that silently wreck most beginners' results. You'll also have complete, runnable Python code you can drop into a real project today.

The K-Means Algorithm: Assignment and Update

K-Means works in two repeating steps. First, assign each data point to the nearest centroid (using Euclidean distance). Second, recompute each centroid as the mean of all points assigned to it. Stop when assignments no longer change or after a fixed number of iterations.

This is a coordinate descent optimisation — it converges to a local minimum of the within-cluster sum of squares (inertia). The guarantee? Only local, not global. Different initial centroids lead to different results.

Here's a pure Python implementation that makes the loop explicit:

kmeans_scratch.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
import numpy as np

def k_means(X, K, max_iters=100):
    n_samples, n_features = X.shape
    centroids = X[np.random.choice(n_samples, K, replace=False)]
    for _ in range(max_iters):
        distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
        labels = np.argmin(distances, axis=1)
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return labels, centroids
Output
Returns cluster labels (shape (n,)) and final centroids (shape (K, n_features))
Algorithmic Insight
Each step reduces the inertia monotonically, so convergence is guaranteed in finite steps. But the quality of the solution depends entirely on where you start.
Production Insight
The loop is trivially parallelisable — sklearn's implementation uses Cython and OpenMP.
For large datasets, the recomputation of centroids is a bottleneck. Use mini-batch K-Means for speed.
Empty clusters usually mean a centroid was never nearest to any point — check membership counts after training.
Key Takeaway
Assignment and update steps monotonically reduce inertia.
The algorithm converges to a local minimum — never assume you found the global one.
Always run multiple initialisations and pick the lowest inertia result.
Choosing Between Batch and Mini-Batch K-Means
IfDataset fits in memory (< 100k points)
UseUse standard K-Means (batch). Lower variance in final clusters.
IfDataset > 100k points or streaming data
UseUse Mini-Batch K-Means. Tune batch_size (default 100) and max_iter.
IfNeed exact reproducibility and deterministic output
UseUse batch K-Means with random_state set and n_init large.

Choosing K: Elbow Method vs Silhouette Score

The number of clusters K must be specified upfront — this is K-Means' biggest limitation. The Elbow Method runs the algorithm for a range of K values and plots inertia vs K. The 'elbow' is where the rate of decrease flattens — that's your best K.

But inertia always decreases as K increases. At K=N, inertia reaches zero. That's why the Elbow Method is subjective. A more rigorous alternative is the Silhouette Score, which measures how similar a point is to its own cluster compared to other clusters. Higher is better. Scores range from -1 to 1.

Neither method is perfect. Structural ground truth rarely exists in unsupervised problems — choose K based on business interpretability as much as metrics.

choose_k.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

inertias = []
scores = []
K_range = range(2, 11)
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)
    scores.append(silhouette_score(X, km.labels_))

# Plot both side by side
import matplotlib.pyplot as plt
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12,4))
ax1.plot(K_range, inertias, 'bo-')
ax1.set_xlabel('K'); ax1.set_ylabel('Inertia'); ax1.set_title('Elbow Method')
ax2.plot(K_range, scores, 'ro-')
ax2.set_xlabel('K'); ax2.set_ylabel('Silhouette Score'); ax2.set_title('Silhouette Analysis')
Output
Two plots: inertia curve (decreasing) and silhouette curve (peak shows best K)
No Free K
  • Elbow gives a range, not a single answer.
  • Silhouette prefers compact, well-separated clusters — can be misleading with elongated shapes.
  • Use business validation: plot the centroids or inspect cluster characteristics with domain experts.
Production Insight
Automated K selection (e.g., Gap Statistic) is computationally expensive — O(K * n_init) runs.
For production pipelines, cache elbow results and only re-evaluate when data distribution shifts.
Never deploy with K chosen solely by metric — a human must approve the segment interpretation.
Key Takeaway
Elbow method is heuristic; silhouette is metric-driven.
Both fail when clusters are not spherical or have different sizes.
The best K is the one that makes business sense — not the one that maximises a score.

Initialization Methods: Random vs k-means++

Initial centroids determine which local minimum the algorithm converges to. Pure random initialization can produce empty clusters or long convergence time. k-means++ spreads out the initial centroids by sampling probabilistically — points farther from existing centroids are more likely to be chosen.

scikit-learn's init='k-means++' (default) uses this. It's a small overhead for dramatically better results. Never use pure random (init='random') unless you're running many seeds and picking the best.

Other initialization strategies: furthest-first (rare), or using a small subset of data (e.g., k-means|| for distributed systems).

init_comparison.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
from sklearn.cluster import KMeans

# k-means++ (default, recommended)
km_pp = KMeans(n_clusters=5, init='k-means++', random_state=42)

# Random init (needs many runs)
km_rand = KMeans(n_clusters=5, init='random', n_init=20, random_state=42)

# Compare inertia after fitting
km_pp.fit(X)
km_rand.fit(X)
print(f"k-means++ inertia: {km_pp.inertia_}")
print(f"Random (20 runs) inertia: {km_rand.inertia_}")
Output
k-means++ almost always yields lower inertia and more stable clusters.
Don't Skip Initialization Tuning
A single random run can produce a completely different segmentation. Always compare multiple seeds. k-means++ is not magic — it only reduces the variance, not eliminates it.
Production Insight
In distributed settings, k-means++ requires computing distances across partitions — communication overhead.
Use deterministic initialisation from a previous run's centroids to ensure stable cluster definitions over time.
If clusters must be absolutely deterministic, use init='random' with random_state fixed — but then you're stuck with that local minimum.
Key Takeaway
k-means++ reduces bad initialisations but does not guarantee global optimum.
Always run with multiple random seeds and pick the one with lowest inertia.
For production, store the fitted centroids and reuse them for subsequent retraining — this keeps cluster definitions consistent.

Feature Scaling: The Hidden Assumption

K-Means uses Euclidean distance internally. If one feature has a range 0–1000 and another is 0–1, the first feature dominates the distance calculation. This is the number one silent mistake beginners make.

Standardise (z-score) or min-max scale all features before clustering. Use StandardScaler unless you know your features should have different importance.

Categorical features are a bigger problem — one-hot encoding introduces many binary dimensions, making distances meaningless. Consider Gower distance or use K-Prototypes (mixed-type clustering) instead.

scale_before_cluster.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

km = KMeans(n_clusters=3, random_state=42)
km.fit(X_scaled)

# To interpret centroids in original scale:
centroids_original = scaler.inverse_transform(km.cluster_centers_)
Output
Clusters are now formed based on relative shape, not absolute magnitude.
Scale Check
Run X.var(axis=0) to see variance per feature. If they differ by more than a factor of 10, you must scale.
Production Insight
Scaling based on training statistics changes when new data arrives — pipeline must recompute scaler periodically.
If your features are distances or ratios, standardisation may destroy interpretability. Consider min-max scaling instead.
Never scale categorical encoded features the same way as continuous — you'll distort the encoding representation.
Key Takeaway
Euclidean distance is scale-sensitive — always scale features before K-Means.
StandardScaler is the default choice, but min-max works for bounded features.
Categorical data requires special treatment — one-hot + scaling only works for high-cardinality.

Common Pitfalls and How to Avoid Them

Beyond scaling and initialization, K-Means has three more traps: (1) Outliers — a single extreme point can pull a centroid away from the true cluster. Use DBSCAN or remove outliers first. (2) Non-spherical clusters — K-Means assumes spherical clusters of similar size. If your data has elongated or interlocking shapes, K-Means will split them arbitrarily. (3) The 'curse of dimensionality' — Euclidean distance becomes less meaningful in high dimensions. Consider PCA first to reduce to 10–50 dimensions.

Production tip: never trust the cluster labels blindly. Plot them (using PCA/t-SNE) and inspect a sample of each cluster. The model's 'pattern' may be meaningless noise.

cluster_inspection.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.figure(figsize=(8,6))
plt.scatter(X_pca[:,0], X_pca[:,1], c=km.labels_, cmap='viridis', alpha=0.6)
plt.scatter(pca.transform(km.cluster_centers_)[:,0], pca.transform(km.cluster_centers_)[:,1], marker='X', s=200, c='red')
plt.title('Cluster Visualization with PCA')
plt.show()
Output
2D scatter plot of data points colored by cluster, with centroids marked as red X's.
Trust but Verify
Always visualise clusters after fitting. A high silhouette score can hide overlapping clusters.
Production Insight
Outlier removal must be robust — mean ± 3*std is not robust to heavy tails. Use IQR or MAD methods.
In high dimensions, consider using cosine similarity instead of Euclidean (by normalising each vector to unit length).
If clusters drift over time (concept drift), retrain periodically and monitor the centroid shift magnitude.
Key Takeaway
K-Means assumes spherical clusters of similar size — don't use it when that's false.
Outliers dominate centroid locations — remove them before clustering.
Always inspect clusters with a 2D projection — numbers alone are not enough.
When NOT to Use K-Means
IfClusters have irregular shapes or varying densities
UseUse DBSCAN or HDBSCAN instead.
IfData contains categorical or mixed types
UseUse K-Prototypes or spectral clustering on similarity matrix.
IfNumber of clusters is unknown and large
UseUse hierarchical clustering with dendrogram or affinity propagation.
IfNeed probabilistic cluster assignment or overlapping clusters
UseUse Gaussian Mixture Models (GMM).
● Production incidentPOST-MORTEMseverity: high

Empty Clusters in Production: The Silent Model Degradation

Symptom
One or more clusters contain zero data points after training. The model still runs, but the centroids for empty clusters become NaN or remain at their initial random position, causing downstream dashboards to show nonsensical segments.
Assumption
The team assumed K-Means always produces balanced clusters because the algorithm minimises inertia — they didn't realise that poor initialization or outlier drift can leave centroids orphaned.
Root cause
The data pipeline began including a new feature with extreme outliers that pulled centroids away from the smaller cluster's region. Combined with random initialization (not k-means++), one centroid was never assigned any points and stayed static.
Fix
Switch to k-means++ initialization, add outlier clipping (cap at 99th percentile), and implement a check after training: if any cluster has zero members, re-run with a different seed or reduce K.
Key lesson
  • Always verify cluster membership counts after fitting — empty clusters are a silent signal of bad initialization or data distribution shifts.
  • Use k-means++ or deterministic initialization (e.g., from prior run) instead of pure random seed.
  • Monitor feature distributions in production — outlier injection can orphan centroids without any explicit error.
Production debug guideCommon symptoms, root causes, and immediate actions4 entries
Symptom · 01
Model returns same cluster for all points
Fix
Check if K is too small or data is highly imbalanced. Examine centroid spread via kmeans.cluster_centers_. If all centroids overlap, reduce K or try different initialization.
Symptom · 02
Inertia drops but silhouette score is low
Fix
Inertia favours compact clusters, not separation. Compute silhouette_score per cluster. Look for clusters with negative silhouette values — they overlap.
Symptom · 03
Results change drastically between runs
Fix
Fix random seed with random_state and switch to k-means++. If still unstable, your data may not have natural clusters — try DBSCAN or hierarchical clustering.
Symptom · 04
Centroids contain NaN after fit
Fix
Check for missing values in the input data. K-Means does not handle NaN natively. Impute or drop missing rows. Also verify no integer overflow in distance computations (scale features).
★ Quick Debug Cheat Sheet: K-Means FailuresDiagnose and fix the most common production issues in minutes.
Empty clusters appear
Immediate action
Restart with k-means++ and random_state=42
Commands
python -c "from sklearn.cluster import KMeans; km = KMeans(n_clusters=5, init='k-means++', random_state=42)"
km.fit(X); cluster_counts = pd.Series(km.labels_).value_counts()
Fix now
If cluster_counts has zeros, increase K or reduce outliers with numpy.clip(X, lower, upper)
Different runs give different labels+
Immediate action
Set random_state and use n_init=10 (default)
Commands
km = KMeans(n_clusters=3, n_init=10, random_state=42)
km.fit(X); print(km.inertia_)
Fix now
If i in range(10): run with different seeds and pick the one with lowest inertia.
Silhouette score is negative or near zero+
Immediate action
Check cluster separation with PCA or t-SNE
Commands
from sklearn.metrics import silhouette_score; score = silhouette_score(X, km.labels_)
import matplotlib.pyplot as plt; plt.scatter(pca[:,0], pca[:,1], c=km.labels_)
Fix now
Try K=2 first, then increase. If no improvement, data may not have clusters — use DBSCAN instead.
K-Means vs Other Clustering Algorithms
AlgorithmAssumptionScalabilityOutputK required?
K-MeansSpherical clusters, equal sizeVery fast (O(n·K·d·I))Hard assignmentYes
DBSCANDensity-connected regionsFast (O(n·log n)) with indexNoise points, arbitrary shapesNo
Gaussian MixtureElliptical clustersModerate (O(n·K·d²))Probabilistic assignmentYes
Hierarchical (Agglomerative)Hierarchical nestingSlow (O(n³) naive)Dendrogram, any granularityNo
Spectral ClusteringGraph connectivitySlow (O(n³))Non-convex clustersYes (or use eigen-gap)

Key takeaways

1
K-Means iteratively assigns points to nearest centroid and recomputes centroids
converges to a local minimum of inertia.
2
Always scale features before clustering; distance-based algorithms are sensitive to magnitude differences.
3
k-means++ initialization reduces variance but does not guarantee global optimum
run multiple seeds.
4
Choose K using both Elbow Method and Silhouette Score, then validate with business context.
5
Outliers and non-spherical clusters break K-Means
inspect your data before and after clustering.
6
Never deploy a K-Means model without checking cluster membership counts and visualizing the result.

Common mistakes to avoid

4 patterns
×

Not scaling features before clustering

Symptom
Clusters are dominated by features with larger numeric range (e.g., salary vs age). The centroid positions become meaningless.
Fix
Apply StandardScaler or MinMaxScaler before fitting. Verify feature variances post-scaling are within the same order of magnitude.
×

Using a single random initialisation without comparing

Symptom
Different runs produce wildly different clusters. No reproducibility.
Fix
Set n_init to at least 10 (default is 10) and fix random_state for reproducibility. Compare inertia across seeds and pick the lowest one.
×

Choosing K arbitrarily without validation

Symptom
Clusters do not align with any business interpretability — they split obvious groups or merge unrelated points.
Fix
Use the Elbow Method and Silhouette Score together, then validate with domain experts. Run K=2,3,4 and inspect each result.
×

Ignoring outliers

Symptom
One or two extreme points pull a centroid away, and that cluster becomes a catch-all for outliers.
Fix
Remove or cap outliers before clustering. Use robust scaling or quantile clipping (e.g., clip at 1st and 99th percentile).
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01JUNIOR
Explain how K-Means clustering works at a high level. What are the key s...
Q02SENIOR
How do you choose the value of K in K-Means? Compare two methods and the...
Q03SENIOR
K-Means uses Euclidean distance. How would you adapt it for high-dimensi...
Q01 of 03JUNIOR

Explain how K-Means clustering works at a high level. What are the key steps?

ANSWER
K-Means is an iterative clustering algorithm. Step 1: Initialise K centroids (often with k-means++). Step 2: Assign each data point to the nearest centroid based on Euclidean distance. Step 3: Recompute each centroid as the mean of all points assigned to it. Repeat steps 2 and 3 until convergence (assignments stop changing or inertia change is below a threshold). The algorithm minimises the within-cluster sum of squares (inertia). It converges to a local minimum, not global.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
What is K-Means Clustering in simple terms?
02
Why does K-Means sometimes produce different results on the same data?
03
Can K-Means handle categorical data?
04
What is the difference between K-Means and DBSCAN?
🔥

That's Algorithms. Mark it forged?

3 min read · try the examples if you haven't

Previous
K-Nearest Neighbours
7 / 14 · Algorithms
Next
Naive Bayes Classifier