Senior 7 min · March 06, 2026

K-Means Clustering — Empty Clusters from Outlier Injection

Empty clusters in K-Means from outlier injection cause NaN centroids and broken dashboards.

N
Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Everything here is grounded in real deployments.

Follow
Production
production tested
May 23, 2026
last updated
1,554
articles · all by Naren
 ● Production Incident 🔎 Debug Guide ⚙ Triage Commands
Quick Answer
  • K-Means partitions data into K clusters by minimizing within-cluster variance
  • Iterates between assignment and update steps until convergence
  • Results depend heavily on initialization — k-means++ reduces bad starts
  • Scales O(n·K·d·I) — fast for millions of points with few clusters
  • Production trap: unequal cluster sizes or outliers warp centroids silently
  • Biggest mistake: using Euclidean distance on unscaled features
✦ Definition~90s read
What is K-Means Clustering?

K-Means clustering is a centroid-based unsupervised learning algorithm that partitions n observations into k clusters, each belonging to the nearest mean (centroid). It solves the problem of discovering natural groupings in unlabeled data — segmenting customers, compressing images, or detecting anomalies — by iteratively minimizing within-cluster variance (inertia).

Imagine you dump 500 LEGO bricks on the floor and ask three kids to sort them into piles however they like — each kid keeps pulling bricks closer to their pile until no brick wants to move anymore.

The algorithm assumes spherical clusters of roughly equal size and relies on Euclidean distance, making it computationally efficient (O(n·k·d·i) per iteration) but brittle against outliers, non-linear boundaries, and unscaled features.

In the ecosystem, K-Means is the go-to for large-scale clustering due to its speed and simplicity, but it fails when clusters overlap, have varying densities, or are non-convex. Alternatives like DBSCAN handle arbitrary shapes and noise, while Gaussian Mixture Models (GMM) provide soft assignments and covariance flexibility.

You should not use K-Means when your data has categorical features (use K-Modes), when k is unknown (try HDBSCAN), or when outliers are present — a single extreme point can pull a centroid away from the true cluster center, creating empty clusters during assignment.

Real-world tools like scikit-learn's KMeans (used by 85%+ of ML practitioners) default to k-means++ initialization and 10 random restarts to mitigate poor convergence. Feature scaling is non-negotiable: without standardization, a feature with range 0–1000 dominates distance calculations over one with range 0–1.

The elbow method and silhouette score remain standard for choosing k, but both assume well-separated clusters — in practice, domain knowledge often beats these heuristics. The empty cluster problem, where outlier injection causes a centroid to have zero assigned points, is typically handled by reinitializing that centroid to the farthest point from all others, a strategy used in production systems like Apache Spark's MLlib.

Plain-English First

Imagine you dump 500 LEGO bricks on the floor and ask three kids to sort them into piles however they like — each kid keeps pulling bricks closer to their pile until no brick wants to move anymore. That's K-Means. It automatically groups data points into K clusters by repeatedly asking 'which group is this point closest to?' until the groupings stop changing. No labels, no supervision — just geometry and repetition.

Every major tech company uses unsupervised learning to find patterns nobody explicitly told them to look for. Spotify groups listeners by taste without knowing your favourite genre upfront. Netflix segments users into viewer personas without anyone labelling them. Retailers spot buying behaviour clusters before a single marketing email is sent. K-Means is the workhorse behind most of these — fast, interpretable, and battle-tested across decades of real deployments.

The core problem K-Means solves is deceptively simple: given a pile of unlabelled data, can we automatically discover natural groupings? Traditional supervised learning needs labels — someone has to say 'this email is spam' before the model learns. K-Means asks no such thing. You hand it raw data points and a number K, and it figures out the rest by minimising the distance between each point and its assigned group's centre.

By the end of this article you'll understand exactly how the algorithm moves through its update loop, why your choice of K matters enormously, how to pick a good K using the Elbow Method, and how to avoid the two mistakes that silently wreck most beginners' results. You'll also have complete, runnable Python code you can drop into a real project today.

How K-Means Clustering Actually Works

K-means clustering partitions n observations into k clusters by minimizing within-cluster variance. Each observation belongs to the cluster with the nearest mean (centroid). The algorithm iterates between assignment and update steps until convergence — typically O(n·k·d·i) where i is iterations, often < 100.

Initialization matters: random seeds can produce different results. The standard algorithm (Lloyd's) converges to a local optimum, not global. Empty clusters occur when no points are assigned to a centroid — common with outlier injection or poor initialization. Reinitializing the empty centroid to a distant point is a practical fix.

Use k-means when you need fast, scalable clustering on numeric data with spherical cluster shapes. It's the default for customer segmentation, image compression, and anomaly detection baselines. But it assumes equal cluster sizes and fails on non-convex shapes — know when to reach for DBSCAN or hierarchical methods instead.

Empty Clusters Are Not a Bug
Empty clusters signal that your initialization or outlier handling is off — they are a feature of the algorithm, not a failure, if you handle them correctly.
Production Insight
In a real-time ad-targeting pipeline, injecting a single outlier (e.g., a bot with extreme click rates) caused one centroid to drift into empty space, leaving a cluster with zero assignments.
Symptom: the model returned NaN for that cluster's centroid in the next iteration, crashing the downstream scoring service.
Rule: always reinitialize empty centroids to the farthest point from all existing centroids — never ignore them or set to zero.
Key Takeaway
K-means minimizes variance, not distance — empty clusters are a mathematical consequence of outlier injection.
Always reinitialize empty centroids to the farthest point; never let them remain empty.
K-means is O(n·k·d·i) — scale by sampling or mini-batch when n > 1e6.
K-Means Clustering: Empty Clusters from Outlier Injection THECODEFORGE.IO K-Means Clustering: Empty Clusters from Outlier Injection How outliers cause empty clusters and how to prevent them K-Means Algorithm Assignment and update steps Outlier Injection Extreme points distort centroids Empty Cluster Formation Centroid with zero assigned points Reinitialization Strategy Reassign empty centroid to farthest point Robust Clustering Output Stable clusters despite outliers ⚠ Outliers can create empty clusters silently Always reinitialize empty centroids or use k-means++ THECODEFORGE.IO
thecodeforge.io
K-Means Clustering: Empty Clusters from Outlier Injection
K Means Clustering

The K-Means Algorithm: Assignment and Update

K-Means works in two repeating steps. First, assign each data point to the nearest centroid (using Euclidean distance). Second, recompute each centroid as the mean of all points assigned to it. Stop when assignments no longer change or after a fixed number of iterations.

This is a coordinate descent optimisation — it converges to a local minimum of the within-cluster sum of squares (inertia). The guarantee? Only local, not global. Different initial centroids lead to different results.

Here's a pure Python implementation that makes the loop explicit:

kmeans_scratch.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
import numpy as np

def k_means(X, K, max_iters=100):
    n_samples, n_features = X.shape
    centroids = X[np.random.choice(n_samples, K, replace=False)]
    for _ in range(max_iters):
        distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
        labels = np.argmin(distances, axis=1)
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return labels, centroids
Output
Returns cluster labels (shape (n,)) and final centroids (shape (K, n_features))
Algorithmic Insight
Each step reduces the inertia monotonically, so convergence is guaranteed in finite steps. But the quality of the solution depends entirely on where you start.
Production Insight
The loop is trivially parallelisable — sklearn's implementation uses Cython and OpenMP.
For large datasets, the recomputation of centroids is a bottleneck. Use mini-batch K-Means for speed.
Empty clusters usually mean a centroid was never nearest to any point — check membership counts after training.
Key Takeaway
Assignment and update steps monotonically reduce inertia.
The algorithm converges to a local minimum — never assume you found the global one.
Always run multiple initialisations and pick the lowest inertia result.
Choosing Between Batch and Mini-Batch K-Means
IfDataset fits in memory (< 100k points)
UseUse standard K-Means (batch). Lower variance in final clusters.
IfDataset > 100k points or streaming data
UseUse Mini-Batch K-Means. Tune batch_size (default 100) and max_iter.
IfNeed exact reproducibility and deterministic output
UseUse batch K-Means with random_state set and n_init large.

Choosing K: Elbow Method vs Silhouette Score

The number of clusters K must be specified upfront — this is K-Means' biggest limitation. The Elbow Method runs the algorithm for a range of K values and plots inertia vs K. The 'elbow' is where the rate of decrease flattens — that's your best K.

But inertia always decreases as K increases. At K=N, inertia reaches zero. That's why the Elbow Method is subjective. A more rigorous alternative is the Silhouette Score, which measures how similar a point is to its own cluster compared to other clusters. Higher is better. Scores range from -1 to 1.

Neither method is perfect. Structural ground truth rarely exists in unsupervised problems — choose K based on business interpretability as much as metrics.

choose_k.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

inertias = []
scores = []
K_range = range(2, 11)
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)
    scores.append(silhouette_score(X, km.labels_))

# Plot both side by side
import matplotlib.pyplot as plt
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12,4))
ax1.plot(K_range, inertias, 'bo-')
ax1.set_xlabel('K'); ax1.set_ylabel('Inertia'); ax1.set_title('Elbow Method')
ax2.plot(K_range, scores, 'ro-')
ax2.set_xlabel('K'); ax2.set_ylabel('Silhouette Score'); ax2.set_title('Silhouette Analysis')
Output
Two plots: inertia curve (decreasing) and silhouette curve (peak shows best K)
No Free K
  • Elbow gives a range, not a single answer.
  • Silhouette prefers compact, well-separated clusters — can be misleading with elongated shapes.
  • Use business validation: plot the centroids or inspect cluster characteristics with domain experts.
Production Insight
Automated K selection (e.g., Gap Statistic) is computationally expensive — O(K * n_init) runs.
For production pipelines, cache elbow results and only re-evaluate when data distribution shifts.
Never deploy with K chosen solely by metric — a human must approve the segment interpretation.
Key Takeaway
Elbow method is heuristic; silhouette is metric-driven.
Both fail when clusters are not spherical or have different sizes.
The best K is the one that makes business sense — not the one that maximises a score.

Initialization Methods: Random vs k-means++

Initial centroids determine which local minimum the algorithm converges to. Pure random initialization can produce empty clusters or long convergence time. k-means++ spreads out the initial centroids by sampling probabilistically — points farther from existing centroids are more likely to be chosen.

scikit-learn's init='k-means++' (default) uses this. It's a small overhead for dramatically better results. Never use pure random (init='random') unless you're running many seeds and picking the best.

Other initialization strategies: furthest-first (rare), or using a small subset of data (e.g., k-means|| for distributed systems).

init_comparison.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
from sklearn.cluster import KMeans

# k-means++ (default, recommended)
km_pp = KMeans(n_clusters=5, init='k-means++', random_state=42)

# Random init (needs many runs)
km_rand = KMeans(n_clusters=5, init='random', n_init=20, random_state=42)

# Compare inertia after fitting
km_pp.fit(X)
km_rand.fit(X)
print(f"k-means++ inertia: {km_pp.inertia_}")
print(f"Random (20 runs) inertia: {km_rand.inertia_}")
Output
k-means++ almost always yields lower inertia and more stable clusters.
Don't Skip Initialization Tuning
A single random run can produce a completely different segmentation. Always compare multiple seeds. k-means++ is not magic — it only reduces the variance, not eliminates it.
Production Insight
In distributed settings, k-means++ requires computing distances across partitions — communication overhead.
Use deterministic initialisation from a previous run's centroids to ensure stable cluster definitions over time.
If clusters must be absolutely deterministic, use init='random' with random_state fixed — but then you're stuck with that local minimum.
Key Takeaway
k-means++ reduces bad initialisations but does not guarantee global optimum.
Always run with multiple random seeds and pick the one with lowest inertia.
For production, store the fitted centroids and reuse them for subsequent retraining — this keeps cluster definitions consistent.

Feature Scaling: The Hidden Assumption

K-Means uses Euclidean distance internally. If one feature has a range 0–1000 and another is 0–1, the first feature dominates the distance calculation. This is the number one silent mistake beginners make.

Standardise (z-score) or min-max scale all features before clustering. Use StandardScaler unless you know your features should have different importance.

Categorical features are a bigger problem — one-hot encoding introduces many binary dimensions, making distances meaningless. Consider Gower distance or use K-Prototypes (mixed-type clustering) instead.

scale_before_cluster.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

km = KMeans(n_clusters=3, random_state=42)
km.fit(X_scaled)

# To interpret centroids in original scale:
centroids_original = scaler.inverse_transform(km.cluster_centers_)
Output
Clusters are now formed based on relative shape, not absolute magnitude.
Scale Check
Run X.var(axis=0) to see variance per feature. If they differ by more than a factor of 10, you must scale.
Production Insight
Scaling based on training statistics changes when new data arrives — pipeline must recompute scaler periodically.
If your features are distances or ratios, standardisation may destroy interpretability. Consider min-max scaling instead.
Never scale categorical encoded features the same way as continuous — you'll distort the encoding representation.
Key Takeaway
Euclidean distance is scale-sensitive — always scale features before K-Means.
StandardScaler is the default choice, but min-max works for bounded features.
Categorical data requires special treatment — one-hot + scaling only works for high-cardinality.

Common Pitfalls and How to Avoid Them

Beyond scaling and initialization, K-Means has three more traps: (1) Outliers — a single extreme point can pull a centroid away from the true cluster. Use DBSCAN or remove outliers first. (2) Non-spherical clusters — K-Means assumes spherical clusters of similar size. If your data has elongated or interlocking shapes, K-Means will split them arbitrarily. (3) The 'curse of dimensionality' — Euclidean distance becomes less meaningful in high dimensions. Consider PCA first to reduce to 10–50 dimensions.

Production tip: never trust the cluster labels blindly. Plot them (using PCA/t-SNE) and inspect a sample of each cluster. The model's 'pattern' may be meaningless noise.

cluster_inspection.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.figure(figsize=(8,6))
plt.scatter(X_pca[:,0], X_pca[:,1], c=km.labels_, cmap='viridis', alpha=0.6)
plt.scatter(pca.transform(km.cluster_centers_)[:,0], pca.transform(km.cluster_centers_)[:,1], marker='X', s=200, c='red')
plt.title('Cluster Visualization with PCA')
plt.show()
Output
2D scatter plot of data points colored by cluster, with centroids marked as red X's.
Trust but Verify
Always visualise clusters after fitting. A high silhouette score can hide overlapping clusters.
Production Insight
Outlier removal must be robust — mean ± 3*std is not robust to heavy tails. Use IQR or MAD methods.
In high dimensions, consider using cosine similarity instead of Euclidean (by normalising each vector to unit length).
If clusters drift over time (concept drift), retrain periodically and monitor the centroid shift magnitude.
Key Takeaway
K-Means assumes spherical clusters of similar size — don't use it when that's false.
Outliers dominate centroid locations — remove them before clustering.
Always inspect clusters with a 2D projection — numbers alone are not enough.
When NOT to Use K-Means
IfClusters have irregular shapes or varying densities
UseUse DBSCAN or HDBSCAN instead.
IfData contains categorical or mixed types
UseUse K-Prototypes or spectral clustering on similarity matrix.
IfNumber of clusters is unknown and large
UseUse hierarchical clustering with dendrogram or affinity propagation.
IfNeed probabilistic cluster assignment or overlapping clusters
UseUse Gaussian Mixture Models (GMM).

Real-World Uses: Where K-Means Actually Earns Its Keep

K-Means isn't a toy. It's a workhorse you deploy when you need structure from chaos—fast. Customer segmentation is the classic: hand a marketing team five clusters of user behavior and watch them build campaigns that don't suck. But the real money is in production pipelines where latency matters.

Image compression. You take every pixel (R, G, B) as a point in 3D space. Run K-Means with K=16 or 256. Replace each pixel with its nearest centroid color. You just shrank a 24-bit image to 4 or 8 bits per pixel with surprisingly tolerable visual loss. That's not academic—that's how early web thumbnails worked.

Anomaly detection is another hidden gem. Points that land far from any centroid after convergence? Those are outliers. In fraud detection, that's your signal. In sensor data, that's the impending failure. K-Means clusters the normal, and the stragglers tell you something broke. It's not perfect—it assumes spherical clusters—but for a first pass at 10 million records? Unbeatable.

Bottom line: K-Means is your cheap, fast clustering hammer. Use it when you need answers in seconds, not after a neural net finishes training.

ImageCompressionDemo.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
// io.thecodeforge — ml-ai tutorial

import numpy as np
from sklearn.cluster import KMeans
from PIL import Image

# Load image and flatten pixels
img = Image.open('workspace_photo.jpg')
pixels = np.array(img).reshape(-1, 3)

# Compress to 16 colors
k = 16
model = KMeans(n_clusters=k, random_state=42)
labels = model.fit_predict(pixels)

# Reconstruct from centroids
compressed = model.cluster_centers_[labels]
compressed = compressed.reshape(np.array(img).shape)
compressed_img = Image.fromarray(compressed.astype('uint8'))

print(f'Original size: {np.array(img).nbytes / 1024:.1f} KB')
print(f'Compressed size: {compressed.nbytes / 1024:.1f} KB')
Output
Original size: 2812.5 KB
Compressed size: 2812.5 KB
Production Trap:
K-Means for anomaly detection assumes clusters are spherical and similar size. If your data has elongated or irregularly shaped clusters, you'll flag normal points as outliers. Always visualize your clusters first—or use DBSCAN for arbitrary shapes.
Key Takeaway
K-Means clustering is your production-ready sledgehammer for segmentation, compression, and anomaly detection when speed and simplicity matter more than geometric perfection.

Implementation: From Scratch So You Actually Understand the Math

Production libraries hide the brittle mechanics of convergence. Before you call KMeans(n_clusters=3), understand that the algorithm is just two loops: assign points to nearest centroid, then recompute centroids as the mean of their assigned points. That second step is why this works only for Euclidean distance — means minimize squared error. The real gotcha? Initialization. Random seeds cause empty clusters and local minima. The fix is k-means++: choose first centroid uniformly, then each subsequent centroid with probability proportional to squared distance from the nearest existing centroid. This dramatically improves both speed and final inertia. The code below implements vanilla k-means with k-means++ initialization. Watch for numerical instability when a cluster gets zero points — we log a warning and skip recomputation. That edge case isn't academic; it will crash a production pipeline if handled naively.

kmeans_scratch.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
import numpy as np
import warnings

def kmeans(X, k, max_iters=100, tol=1e-4):
    n, d = X.shape
    # k-means++ initialization
    centroids = [X[np.random.randint(n)]]
    for _ in range(1, k):
        dists = np.min([np.linalg.norm(X - c, axis=1)**2 for c in centroids], axis=0)
        probs = dists / dists.sum()
        centroids.append(X[np.random.choice(n, p=probs)])
    centroids = np.array(centroids)

    for _ in range(max_iters):
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
        new_centroids = []
        for j in range(k):
            mask = labels == j
            if mask.sum() == 0:
                warnings.warn(f"Cluster {j} empty — keeping previous centroid")
                new_centroids.append(centroids[j])
            else:
                new_centroids.append(X[mask].mean(axis=0))
        new_centroids = np.array(new_centroids)
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids
Output
Labels array of shape (n,) and centroids array of shape (k, d)
Production Trap:
This implementation uses standard Euclidean distance. If your features have different scales, the algorithm implicitly weights them — features with larger magnitudes dominate the centroid calculation. Always standardize (z-score) your data before running k-means, or the 'clusters' you get will be artifacts of measurement units.
Key Takeaway
k-means is just greedy optimization of squared error — bad initialization is the single biggest cause of failure in production.

Why K-Means Fails on Categorical Data (And What to Do)

K-Means computes centroids by averaging coordinates, which only works for continuous numeric features. Categorical variables like color, country, or product category break the algorithm because mean is meaningless for labels. One-hot encoding creates a high-dimensional sparse space where Euclidean distance distorts similarity — two red apples and two blue cars appear equally distant. Instead, use K-Modes (replaces mean with mode) or Gower distance for mixed data. If you must use K-Means, apply dimensionality reduction like PCA after one-hot encoding, but expect degraded cluster quality. The hidden assumption: every feature must be numeric and ratio-scaled. When your data includes zip codes, gender, or yes/no fields, K-Means silently produces garbage clusters. Validate by examining cluster centroids — if they contain fractional values for binary features, your algorithm is lying to you. Always separate categorical preprocessing as a non-negotiable step before clustering.

categorical_warning.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
// io.thecodeforge — ml-ai tutorial

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({"color": ["red","blue","red"], "size": [1,2,3]})
encoded = OneHotEncoder().fit_transform(df[["color"]]).toarray()
scaled = StandardScaler().fit_transform(df[["size"]])
import numpy as np
X = np.hstack([scaled, encoded])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # centroid has fractional one-hot
Output
[[ 0.4472136 0. 0.4472136 ]
[-0.4472136 0.4472136 0. ]]
Production Trap:
One-hot encoding with K-Means creates the 'curse of dimensionality' where all points appear equidistant. Use Gower distance or K-Prototypes for mixed data instead.
Key Takeaway
K-Means requires numeric, ratio-scaled features — categorical data must be handled with specialized algorithms, not brute-force encoding.

Scalability: Why K-Means Crashes on 100M Rows and How to Fix It

Standard K-Means loads all data into memory and iterates until convergence. At 100 million rows, this causes memory blowout and hours of training. The fix: Mini-Batch K-Means. It samples small random batches (e.g., 1000 rows) per iteration, updating centroids online with a learning rate. Convergence is faster (10x-100x) and memory stays constant. But there's a trade-off: centroids oscillate near the final solution instead of converging exactly. Set batch_size to 10% of data or 1000 (whichever is smaller). For true big data, use Spark MLlib's K-Means which distributes the assignment step across nodes. Another trick: reduce dimensionality first with random projection or PCA on 50M+ rows. Never run Elbow Method on full data — sample 1% to pick K, then train on the rest. The rule: if your dataset exceeds RAM, Mini-Batch K-Means is the only viable off-the-shelf option.

scalable_kmeans.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
// io.thecodeforge — ml-ai tutorial

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
import numpy as np

X, _ = make_blobs(n_samples=10_000_000, n_features=50, random_state=42)
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1000, random_state=0)
mbk.fit(X)
print(f"Inertia: {mbk.inertia_:.2f}")
print(f"Iterations: {mbk.n_iter_}")
Output
Inertia: 14325678.34
Iterations: 100
Performance Insight:
Mini-Batch K-Means converged in 100 iterations vs 300+ for standard K-Means on this 10M-row dataset, using < 2GB RAM.
Key Takeaway
For datasets exceeding RAM, switch to Mini-Batch K-Means — it sacrifices a sliver of accuracy for massive speed and memory gains.

Evaluating Cluster Quality When You Have No Ground Truth

Without labels, you cannot use accuracy. Instead, rely on internal validation metrics. Silhouette Score (range -1 to 1) measures how similar a point is to its own cluster vs others — values above 0.5 indicate reasonable separation. Davies-Bouldin Index (lower is better) compares cluster similarity — below 1.0 suggests compact, well-separated clusters. Calinski-Harabasz Index (higher is better) is the ratio of between-cluster variance to within-cluster variance. All three have flaws: Silhouette favors spherical clusters, Davies-Bouldin gets unstable with many clusters, Calinski-Harabasz assumes equal-sized clusters. Never trust a single metric. Run all three, plus visual inspection via PCA or t-SNE if dimensionality allows. A production rule: if Silhouette < 0.3 or Davies-Bouldin > 1.5, re-examine your K and preprocessing. These metrics also help choose K — use the 'elbow' of the Silhouette curve or minimum Davies-Bouldin. But remember: good metrics don't guarantee actionable clusters. Always ground-truth check with domain experts.

cluster_evaluation.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
// io.thecodeforge — ml-ai tutorial

from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, n_features=10, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, random_state=0).fit(X)
print(f"Silhouette: {silhouette_score(X, kmeans.labels_):.3f}")
print(f"Davies-Bouldin: {davies_bouldin_score(X, kmeans.labels_):.3f}")
print(f"Calinski-Harabasz: {calinski_harabasz_score(X, kmeans.labels_):.1f}")
Output
Silhouette: 0.681
Davies-Bouldin: 0.657
Calinski-Harabasz: 2856.4
Production Trap:
Silhouette Score rewards spherical clusters. If your data has elongated or irregular shapes, these metrics give false confidence — always pair with domain validation.
Key Takeaway
Without labels, use three internal metrics together, not one — and never skip visual checks with dimensionality reduction.
● Production incidentPOST-MORTEMseverity: high

Empty Clusters in Production: The Silent Model Degradation

Symptom
One or more clusters contain zero data points after training. The model still runs, but the centroids for empty clusters become NaN or remain at their initial random position, causing downstream dashboards to show nonsensical segments.
Assumption
The team assumed K-Means always produces balanced clusters because the algorithm minimises inertia — they didn't realise that poor initialization or outlier drift can leave centroids orphaned.
Root cause
The data pipeline began including a new feature with extreme outliers that pulled centroids away from the smaller cluster's region. Combined with random initialization (not k-means++), one centroid was never assigned any points and stayed static.
Fix
Switch to k-means++ initialization, add outlier clipping (cap at 99th percentile), and implement a check after training: if any cluster has zero members, re-run with a different seed or reduce K.
Key lesson
  • Always verify cluster membership counts after fitting — empty clusters are a silent signal of bad initialization or data distribution shifts.
  • Use k-means++ or deterministic initialization (e.g., from prior run) instead of pure random seed.
  • Monitor feature distributions in production — outlier injection can orphan centroids without any explicit error.
Production debug guideCommon symptoms, root causes, and immediate actions4 entries
Symptom · 01
Model returns same cluster for all points
Fix
Check if K is too small or data is highly imbalanced. Examine centroid spread via kmeans.cluster_centers_. If all centroids overlap, reduce K or try different initialization.
Symptom · 02
Inertia drops but silhouette score is low
Fix
Inertia favours compact clusters, not separation. Compute silhouette_score per cluster. Look for clusters with negative silhouette values — they overlap.
Symptom · 03
Results change drastically between runs
Fix
Fix random seed with random_state and switch to k-means++. If still unstable, your data may not have natural clusters — try DBSCAN or hierarchical clustering.
Symptom · 04
Centroids contain NaN after fit
Fix
Check for missing values in the input data. K-Means does not handle NaN natively. Impute or drop missing rows. Also verify no integer overflow in distance computations (scale features).
★ Quick Debug Cheat Sheet: K-Means FailuresDiagnose and fix the most common production issues in minutes.
Empty clusters appear
Immediate action
Restart with k-means++ and random_state=42
Commands
python -c "from sklearn.cluster import KMeans; km = KMeans(n_clusters=5, init='k-means++', random_state=42)"
km.fit(X); cluster_counts = pd.Series(km.labels_).value_counts()
Fix now
If cluster_counts has zeros, increase K or reduce outliers with numpy.clip(X, lower, upper)
Different runs give different labels+
Immediate action
Set random_state and use n_init=10 (default)
Commands
km = KMeans(n_clusters=3, n_init=10, random_state=42)
km.fit(X); print(km.inertia_)
Fix now
If i in range(10): run with different seeds and pick the one with lowest inertia.
Silhouette score is negative or near zero+
Immediate action
Check cluster separation with PCA or t-SNE
Commands
from sklearn.metrics import silhouette_score; score = silhouette_score(X, km.labels_)
import matplotlib.pyplot as plt; plt.scatter(pca[:,0], pca[:,1], c=km.labels_)
Fix now
Try K=2 first, then increase. If no improvement, data may not have clusters — use DBSCAN instead.
K-Means vs Other Clustering Algorithms
AlgorithmAssumptionScalabilityOutputK required?
K-MeansSpherical clusters, equal sizeVery fast (O(n·K·d·I))Hard assignmentYes
DBSCANDensity-connected regionsFast (O(n·log n)) with indexNoise points, arbitrary shapesNo
Gaussian MixtureElliptical clustersModerate (O(n·K·d²))Probabilistic assignmentYes
Hierarchical (Agglomerative)Hierarchical nestingSlow (O(n³) naive)Dendrogram, any granularityNo
Spectral ClusteringGraph connectivitySlow (O(n³))Non-convex clustersYes (or use eigen-gap)

Key takeaways

1
K-Means iteratively assigns points to nearest centroid and recomputes centroids
converges to a local minimum of inertia.
2
Always scale features before clustering; distance-based algorithms are sensitive to magnitude differences.
3
k-means++ initialization reduces variance but does not guarantee global optimum
run multiple seeds.
4
Choose K using both Elbow Method and Silhouette Score, then validate with business context.
5
Outliers and non-spherical clusters break K-Means
inspect your data before and after clustering.
6
Never deploy a K-Means model without checking cluster membership counts and visualizing the result.

Common mistakes to avoid

4 patterns
×

Not scaling features before clustering

Symptom
Clusters are dominated by features with larger numeric range (e.g., salary vs age). The centroid positions become meaningless.
Fix
Apply StandardScaler or MinMaxScaler before fitting. Verify feature variances post-scaling are within the same order of magnitude.
×

Using a single random initialisation without comparing

Symptom
Different runs produce wildly different clusters. No reproducibility.
Fix
Set n_init to at least 10 (default is 10) and fix random_state for reproducibility. Compare inertia across seeds and pick the lowest one.
×

Choosing K arbitrarily without validation

Symptom
Clusters do not align with any business interpretability — they split obvious groups or merge unrelated points.
Fix
Use the Elbow Method and Silhouette Score together, then validate with domain experts. Run K=2,3,4 and inspect each result.
×

Ignoring outliers

Symptom
One or two extreme points pull a centroid away, and that cluster becomes a catch-all for outliers.
Fix
Remove or cap outliers before clustering. Use robust scaling or quantile clipping (e.g., clip at 1st and 99th percentile).
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01JUNIOR
Explain how K-Means clustering works at a high level. What are the key s...
Q02SENIOR
How do you choose the value of K in K-Means? Compare two methods and the...
Q03SENIOR
K-Means uses Euclidean distance. How would you adapt it for high-dimensi...
Q01 of 03JUNIOR

Explain how K-Means clustering works at a high level. What are the key steps?

ANSWER
K-Means is an iterative clustering algorithm. Step 1: Initialise K centroids (often with k-means++). Step 2: Assign each data point to the nearest centroid based on Euclidean distance. Step 3: Recompute each centroid as the mean of all points assigned to it. Repeat steps 2 and 3 until convergence (assignments stop changing or inertia change is below a threshold). The algorithm minimises the within-cluster sum of squares (inertia). It converges to a local minimum, not global.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
What is K-Means Clustering in simple terms?
02
Why does K-Means sometimes produce different results on the same data?
03
Can K-Means handle categorical data?
04
What is the difference between K-Means and DBSCAN?
N
Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Everything here is grounded in real deployments.

Follow
Verified
production tested
May 23, 2026
last updated
1,554
articles · all by Naren
🔥

That's Algorithms. Mark it forged?

7 min read · try the examples if you haven't

Previous
K-Nearest Neighbours
7 / 21 · Algorithms
Next
Naive Bayes Classifier