Intermediate 6 min · March 06, 2026

K-Means Clustering — Empty Clusters from Outlier Injection

Q: What is K-Means Clustering in simple terms?

K-Means Clustering is a fundamental concept in ML / AI. Think of it as a tool — once you understand its purpose, you'll reach for it constantly.

Q: Why does K-Means sometimes produce different results on the same data?

Because it depends on random initial centroids. Each run lands in a different local minimum of the inertia function. Use k-means++ and set a fixed `random_state` to make results reproducible. Even with k-means++, the randomness in the first centroid selection can still lead to variation, so running n_init=10 and picking the best is recommended.

Q: Can K-Means handle categorical data?

Not directly. Euclidean distance on one-hot encoded categoricals creates equal distances between all categories, which doesn't reflect true similarity. For mixed data, use K-Prototypes or Gower distance. For pure categorical data, use K-Modes.

Q: What is the difference between K-Means and DBSCAN?

K-Means requires you to specify K and assumes spherical clusters of similar size. DBSCAN does not require K — it groups points based on density and can identify noise (outliers). DBSCAN works well for arbitrary shapes and varying densities, but it's sensitive to its two parameters (eps and min_samples).

Empty clusters in K-Means from outlier injection cause NaN centroids and broken dashboards.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Everything here is grounded in real deployments.

✓ Production

production tested

July 19, 2026

last updated

2,466

articles · all by Naren

Before you start⏱ 25 min

✓Solid grasp of fundamentals
✓Comfortable reading code examples
✓Basic production concepts

● Production Incident 🔎 Debug Guide ⚙ Triage Commands

⚡Quick Answer

K-Means partitions data into K clusters by minimizing within-cluster variance
Iterates between assignment and update steps until convergence
Results depend heavily on initialization — k-means++ reduces bad starts
Scales O(n·K·d·I) — fast for millions of points with few clusters
Production trap: unequal cluster sizes or outliers warp centroids silently
Biggest mistake: using Euclidean distance on unscaled features

✦ Definition~90s read

What is K-Means Clustering?

K-Means clustering is a centroid-based unsupervised learning algorithm that partitions n observations into k clusters, each belonging to the nearest mean (centroid). It solves the problem of discovering natural groupings in unlabeled data — segmenting customers, compressing images, or detecting anomalies — by iteratively minimizing within-cluster variance (inertia).

★

Imagine you dump 500 LEGO bricks on the floor and ask three kids to sort them into piles however they like — each kid keeps pulling bricks closer to their pile until no brick wants to move anymore.

The algorithm assumes spherical clusters of roughly equal size and relies on Euclidean distance, making it computationally efficient (O(n·k·d·i) per iteration) but brittle against outliers, non-linear boundaries, and unscaled features.

In the ecosystem, K-Means is the go-to for large-scale clustering due to its speed and simplicity, but it fails when clusters overlap, have varying densities, or are non-convex. Alternatives like DBSCAN handle arbitrary shapes and noise, while Gaussian Mixture Models (GMM) provide soft assignments and covariance flexibility.

You should not use K-Means when your data has categorical features (use K-Modes), when k is unknown (try HDBSCAN), or when outliers are present — a single extreme point can pull a centroid away from the true cluster center, creating empty clusters during assignment.

Real-world tools like scikit-learn's KMeans (used by 85%+ of ML practitioners) default to k-means++ initialization and 10 random restarts to mitigate poor convergence. Feature scaling is non-negotiable: without standardization, a feature with range 0–1000 dominates distance calculations over one with range 0–1.

The elbow method and silhouette score remain standard for choosing k, but both assume well-separated clusters — in practice, domain knowledge often beats these heuristics. The empty cluster problem, where outlier injection causes a centroid to have zero assigned points, is typically handled by reinitializing that centroid to the farthest point from all others, a strategy used in production systems like Apache Spark's MLlib.

Plain-English First

Imagine you dump 500 LEGO bricks on the floor and ask three kids to sort them into piles however they like — each kid keeps pulling bricks closer to their pile until no brick wants to move anymore. That's K-Means. It automatically groups data points into K clusters by repeatedly asking 'which group is this point closest to?' until the groupings stop changing. No labels, no supervision — just geometry and repetition.

Every major tech company uses unsupervised learning to find patterns nobody explicitly told them to look for. Spotify groups listeners by taste without knowing your favourite genre upfront. Netflix segments users into viewer personas without anyone labelling them. Retailers spot buying behaviour clusters before a single marketing email is sent. K-Means is the workhorse behind most of these — fast, interpretable, and battle-tested across decades of real deployments.

The core problem K-Means solves is deceptively simple: given a pile of unlabelled data, can we automatically discover natural groupings? Traditional supervised learning needs labels — someone has to say 'this email is spam' before the model learns. K-Means asks no such thing. You hand it raw data points and a number K, and it figures out the rest by minimising the distance between each point and its assigned group's centre.

By the end of this article you'll understand exactly how the algorithm moves through its update loop, why your choice of K matters enormously, how to pick a good K using the Elbow Method, and how to avoid the two mistakes that silently wreck most beginners' results. You'll also have complete, runnable Python code you can drop into a real project today.

How K-Means Clustering Actually Works

K-means clustering partitions n observations into k clusters by minimizing within-cluster variance. Each observation belongs to the cluster with the nearest mean (centroid). The algorithm iterates between assignment and update steps until convergence — typically O(n·k·d·i) where i is iterations, often < 100.

Initialization matters: random seeds can produce different results. The standard algorithm (Lloyd's) converges to a local optimum, not global. Empty clusters occur when no points are assigned to a centroid — common with outlier injection or poor initialization. Reinitializing the empty centroid to a distant point is a practical fix.

Use k-means when you need fast, scalable clustering on numeric data with spherical cluster shapes. It's the default for customer segmentation, image compression, and anomaly detection baselines. But it assumes equal cluster sizes and fails on non-convex shapes — know when to reach for DBSCAN or hierarchical methods instead.

⚠ Empty Clusters Are Not a Bug

Empty clusters signal that your initialization or outlier handling is off — they are a feature of the algorithm, not a failure, if you handle them correctly.

📊 Production Insight

In a real-time ad-targeting pipeline, injecting a single outlier (e.g., a bot with extreme click rates) caused one centroid to drift into empty space, leaving a cluster with zero assignments.

Symptom: the model returned NaN for that cluster's centroid in the next iteration, crashing the downstream scoring service.

Rule: always reinitialize empty centroids to the farthest point from all existing centroids — never ignore them or set to zero.

🎯 Key Takeaway

K-means minimizes variance, not distance — empty clusters are a mathematical consequence of outlier injection.

Always reinitialize empty centroids to the farthest point; never let them remain empty.

K-means is O(n·k·d·i) — scale by sampling or mini-batch when n > 1e6.

thecodeforge.io

K Means Clustering

The K-Means Algorithm: Assignment and Update

K-Means works in two repeating steps. First, assign each data point to the nearest centroid (using Euclidean distance). Second, recompute each centroid as the mean of all points assigned to it. Stop when assignments no longer change or after a fixed number of iterations.

This is a coordinate descent optimisation — it converges to a local minimum of the within-cluster sum of squares (inertia). The guarantee? Only local, not global. Different initial centroids lead to different results.

Here's a pure Python implementation that makes the loop explicit:

kmeans_scratch.pyPYTHON

import numpy as np

def k_means(X, K, max_iters=100):
    n_samples, n_features = X.shape
    centroids = X[np.random.choice(n_samples, K, replace=False)]
    for _ in range(max_iters):
        distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
        labels = np.argmin(distances, axis=1)
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return labels, centroids

Output

Returns cluster labels (shape (n,)) and final centroids (shape (K, n_features))

🔥Algorithmic Insight

Each step reduces the inertia monotonically, so convergence is guaranteed in finite steps. But the quality of the solution depends entirely on where you start.

📊 Production Insight

The loop is trivially parallelisable — sklearn's implementation uses Cython and OpenMP.

For large datasets, the recomputation of centroids is a bottleneck. Use mini-batch K-Means for speed.

Empty clusters usually mean a centroid was never nearest to any point — check membership counts after training.

🎯 Key Takeaway

Assignment and update steps monotonically reduce inertia.

The algorithm converges to a local minimum — never assume you found the global one.

Always run multiple initialisations and pick the lowest inertia result.

Choosing Between Batch and Mini-Batch K-Means

IfDataset fits in memory (< 100k points)

→

UseUse standard K-Means (batch). Lower variance in final clusters.

IfDataset > 100k points or streaming data

→

UseUse Mini-Batch K-Means. Tune batch_size (default 100) and max_iter.

IfNeed exact reproducibility and deterministic output

→

UseUse batch K-Means with random_state set and n_init large.

Choosing K: Elbow Method vs Silhouette Score

The number of clusters K must be specified upfront — this is K-Means' biggest limitation. The Elbow Method runs the algorithm for a range of K values and plots inertia vs K. The 'elbow' is where the rate of decrease flattens — that's your best K.

But inertia always decreases as K increases. At K=N, inertia reaches zero. That's why the Elbow Method is subjective. A more rigorous alternative is the Silhouette Score, which measures how similar a point is to its own cluster compared to other clusters. Higher is better. Scores range from -1 to 1.

Neither method is perfect. Structural ground truth rarely exists in unsupervised problems — choose K based on business interpretability as much as metrics.

choose_k.pyPYTHON

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

inertias = []
scores = []
K_range = range(2, 11)
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)
    scores.append(silhouette_score(X, km.labels_))

# Plot both side by side
import matplotlib.pyplot as plt
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12,4))
ax1.plot(K_range, inertias, 'bo-')
ax1.set_xlabel('K'); ax1.set_ylabel('Inertia'); ax1.set_title('Elbow Method')
ax2.plot(K_range, scores, 'ro-')
ax2.set_xlabel('K'); ax2.set_ylabel('Silhouette Score'); ax2.set_title('Silhouette Analysis')

Output

Two plots: inertia curve (decreasing) and silhouette curve (peak shows best K)

Mental Model

No Free K

K is not a parameter you can 'learn' — it's a hyperparameter you must decide based on domain knowledge.

Elbow gives a range, not a single answer.
Silhouette prefers compact, well-separated clusters — can be misleading with elongated shapes.
Use business validation: plot the centroids or inspect cluster characteristics with domain experts.

📊 Production Insight

Automated K selection (e.g., Gap Statistic) is computationally expensive — O(K * n_init) runs.

For production pipelines, cache elbow results and only re-evaluate when data distribution shifts.

Never deploy with K chosen solely by metric — a human must approve the segment interpretation.

🎯 Key Takeaway

Elbow method is heuristic; silhouette is metric-driven.

Both fail when clusters are not spherical or have different sizes.

The best K is the one that makes business sense — not the one that maximises a score.

thecodeforge.io

K Means Clustering

Initialization Methods: Random vs k-means++

Initial centroids determine which local minimum the algorithm converges to. Pure random initialization can produce empty clusters or long convergence time. k-means++ spreads out the initial centroids by sampling probabilistically — points farther from existing centroids are more likely to be chosen.

scikit-learn's init='k-means++' (default) uses this. It's a small overhead for dramatically better results. Never use pure random (init='random') unless you're running many seeds and picking the best.

Other initialization strategies: furthest-first (rare), or using a small subset of data (e.g., k-means|| for distributed systems).

init_comparison.pyPYTHON

from sklearn.cluster import KMeans

# k-means++ (default, recommended)
km_pp = KMeans(n_clusters=5, init='k-means++', random_state=42)

# Random init (needs many runs)
km_rand = KMeans(n_clusters=5, init='random', n_init=20, random_state=42)

# Compare inertia after fitting
km_pp.fit(X)
km_rand.fit(X)
print(f"k-means++ inertia: {km_pp.inertia_}")
print(f"Random (20 runs) inertia: {km_rand.inertia_}")

Output

k-means++ almost always yields lower inertia and more stable clusters.

⚠ Don't Skip Initialization Tuning

A single random run can produce a completely different segmentation. Always compare multiple seeds. k-means++ is not magic — it only reduces the variance, not eliminates it.

📊 Production Insight

In distributed settings, k-means++ requires computing distances across partitions — communication overhead.

Use deterministic initialisation from a previous run's centroids to ensure stable cluster definitions over time.

If clusters must be absolutely deterministic, use init='random' with random_state fixed — but then you're stuck with that local minimum.

🎯 Key Takeaway

k-means++ reduces bad initialisations but does not guarantee global optimum.

Always run with multiple random seeds and pick the one with lowest inertia.

For production, store the fitted centroids and reuse them for subsequent retraining — this keeps cluster definitions consistent.

Feature Scaling: The Hidden Assumption

K-Means uses Euclidean distance internally. If one feature has a range 0–1000 and another is 0–1, the first feature dominates the distance calculation. This is the number one silent mistake beginners make.

Standardise (z-score) or min-max scale all features before clustering. Use StandardScaler unless you know your features should have different importance.

Categorical features are a bigger problem — one-hot encoding introduces many binary dimensions, making distances meaningless. Consider Gower distance or use K-Prototypes (mixed-type clustering) instead.

scale_before_cluster.pyPYTHON

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

km = KMeans(n_clusters=3, random_state=42)
km.fit(X_scaled)

# To interpret centroids in original scale:
centroids_original = scaler.inverse_transform(km.cluster_centers_)

Output

Clusters are now formed based on relative shape, not absolute magnitude.

🔥Scale Check

Run X.var(axis=0) to see variance per feature. If they differ by more than a factor of 10, you must scale.

📊 Production Insight

Scaling based on training statistics changes when new data arrives — pipeline must recompute scaler periodically.

If your features are distances or ratios, standardisation may destroy interpretability. Consider min-max scaling instead.

Never scale categorical encoded features the same way as continuous — you'll distort the encoding representation.

🎯 Key Takeaway

Euclidean distance is scale-sensitive — always scale features before K-Means.

StandardScaler is the default choice, but min-max works for bounded features.

Categorical data requires special treatment — one-hot + scaling only works for high-cardinality.

Common Pitfalls and How to Avoid Them

Beyond scaling and initialization, K-Means has three more traps: (1) Outliers — a single extreme point can pull a centroid away from the true cluster. Use DBSCAN or remove outliers first. (2) Non-spherical clusters — K-Means assumes spherical clusters of similar size. If your data has elongated or interlocking shapes, K-Means will split them arbitrarily. (3) The 'curse of dimensionality' — Euclidean distance becomes less meaningful in high dimensions. Consider PCA first to reduce to 10–50 dimensions.

Production tip: never trust the cluster labels blindly. Plot them (using PCA/t-SNE) and inspect a sample of each cluster. The model's 'pattern' may be meaningless noise.

cluster_inspection.pyPYTHON

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.figure(figsize=(8,6))
plt.scatter(X_pca[:,0], X_pca[:,1], c=km.labels_, cmap='viridis', alpha=0.6)
plt.scatter(pca.transform(km.cluster_centers_)[:,0], pca.transform(km.cluster_centers_)[:,1], marker='X', s=200, c='red')
plt.title('Cluster Visualization with PCA')
plt.show()

Output

2D scatter plot of data points colored by cluster, with centroids marked as red X's.

💡Trust but Verify

Always visualise clusters after fitting. A high silhouette score can hide overlapping clusters.

📊 Production Insight

Outlier removal must be robust — mean ± 3*std is not robust to heavy tails. Use IQR or MAD methods.

In high dimensions, consider using cosine similarity instead of Euclidean (by normalising each vector to unit length).

If clusters drift over time (concept drift), retrain periodically and monitor the centroid shift magnitude.

🎯 Key Takeaway

K-Means assumes spherical clusters of similar size — don't use it when that's false.

Outliers dominate centroid locations — remove them before clustering.

Always inspect clusters with a 2D projection — numbers alone are not enough.

When NOT to Use K-Means

IfClusters have irregular shapes or varying densities

→

UseUse DBSCAN or HDBSCAN instead.

IfData contains categorical or mixed types

→

UseUse K-Prototypes or spectral clustering on similarity matrix.

IfNumber of clusters is unknown and large

→

UseUse hierarchical clustering with dendrogram or affinity propagation.

IfNeed probabilistic cluster assignment or overlapping clusters

→

UseUse Gaussian Mixture Models (GMM).

Real-World Uses: Where K-Means Actually Earns Its Keep

K-Means isn't a toy. It's a workhorse you deploy when you need structure from chaos—fast. Customer segmentation is the classic: hand a marketing team five clusters of user behavior and watch them build campaigns that don't suck. But the real money is in production pipelines where latency matters.

Image compression. You take every pixel (R, G, B) as a point in 3D space. Run K-Means with K=16 or 256. Replace each pixel with its nearest centroid color. You just shrank a 24-bit image to 4 or 8 bits per pixel with surprisingly tolerable visual loss. That's not academic—that's how early web thumbnails worked.

Anomaly detection is another hidden gem. Points that land far from any centroid after convergence? Those are outliers. In fraud detection, that's your signal. In sensor data, that's the impending failure. K-Means clusters the normal, and the stragglers tell you something broke. It's not perfect—it assumes spherical clusters—but for a first pass at 10 million records? Unbeatable.

Bottom line: K-Means is your cheap, fast clustering hammer. Use it when you need answers in seconds, not after a neural net finishes training.

ImageCompressionDemo.pyPYTHON

// io.thecodeforge — ml-ai tutorial

import numpy as np
from sklearn.cluster import KMeans
from PIL import Image

# Load image and flatten pixels
img = Image.open('workspace_photo.jpg')
pixels = np.array(img).reshape(-1, 3)

# Compress to 16 colors
k = 16
model = KMeans(n_clusters=k, random_state=42)
labels = model.fit_predict(pixels)

# Reconstruct from centroids
compressed = model.cluster_centers_[labels]
compressed = compressed.reshape(np.array(img).shape)
compressed_img = Image.fromarray(compressed.astype('uint8'))

print(f'Original size: {np.array(img).nbytes / 1024:.1f} KB')
print(f'Compressed size: {compressed.nbytes / 1024:.1f} KB')

Output

Original size: 2812.5 KB

Compressed size: 2812.5 KB

⚠ Production Trap:

K-Means for anomaly detection assumes clusters are spherical and similar size. If your data has elongated or irregularly shaped clusters, you'll flag normal points as outliers. Always visualize your clusters first—or use DBSCAN for arbitrary shapes.

🎯 Key Takeaway

K-Means clustering is your production-ready sledgehammer for segmentation, compression, and anomaly detection when speed and simplicity matter more than geometric perfection.

Implementation: From Scratch So You Actually Understand the Math

Production libraries hide the brittle mechanics of convergence. Before you call KMeans(n_clusters=3), understand that the algorithm is just two loops: assign points to nearest centroid, then recompute centroids as the mean of their assigned points. That second step is why this works only for Euclidean distance — means minimize squared error. The real gotcha? Initialization. Random seeds cause empty clusters and local minima. The fix is k-means++: choose first centroid uniformly, then each subsequent centroid with probability proportional to squared distance from the nearest existing centroid. This dramatically improves both speed and final inertia. The code below implements vanilla k-means with k-means++ initialization. Watch for numerical instability when a cluster gets zero points — we log a warning and skip recomputation. That edge case isn't academic; it will crash a production pipeline if handled naively.

kmeans_scratch.pyPYTHON

import numpy as np
import warnings

def kmeans(X, k, max_iters=100, tol=1e-4):
    n, d = X.shape
    # k-means++ initialization
    centroids = [X[np.random.randint(n)]]
    for _ in range(1, k):
        dists = np.min([np.linalg.norm(X - c, axis=1)**2 for c in centroids], axis=0)
        probs = dists / dists.sum()
        centroids.append(X[np.random.choice(n, p=probs)])
    centroids = np.array(centroids)

    for _ in range(max_iters):
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
        new_centroids = []
        for j in range(k):
            mask = labels == j
            if mask.sum() == 0:
                warnings.warn(f"Cluster {j} empty — keeping previous centroid")
                new_centroids.append(centroids[j])
            else:
                new_centroids.append(X[mask].mean(axis=0))
        new_centroids = np.array(new_centroids)
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids

Output

Labels array of shape (n,) and centroids array of shape (k, d)

⚠ Production Trap:

This implementation uses standard Euclidean distance. If your features have different scales, the algorithm implicitly weights them — features with larger magnitudes dominate the centroid calculation. Always standardize (z-score) your data before running k-means, or the 'clusters' you get will be artifacts of measurement units.

🎯 Key Takeaway

k-means is just greedy optimization of squared error — bad initialization is the single biggest cause of failure in production.

Why K-Means Fails on Categorical Data (And What to Do)

K-Means computes centroids by averaging coordinates, which only works for continuous numeric features. Categorical variables like color, country, or product category break the algorithm because mean is meaningless for labels. One-hot encoding creates a high-dimensional sparse space where Euclidean distance distorts similarity — two red apples and two blue cars appear equally distant. Instead, use K-Modes (replaces mean with mode) or Gower distance for mixed data. If you must use K-Means, apply dimensionality reduction like PCA after one-hot encoding, but expect degraded cluster quality. The hidden assumption: every feature must be numeric and ratio-scaled. When your data includes zip codes, gender, or yes/no fields, K-Means silently produces garbage clusters. Validate by examining cluster centroids — if they contain fractional values for binary features, your algorithm is lying to you. Always separate categorical preprocessing as a non-negotiable step before clustering.

categorical_warning.pyPYTHON

// io.thecodeforge — ml-ai tutorial

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({"color": ["red","blue","red"], "size": [1,2,3]})
encoded = OneHotEncoder().fit_transform(df[["color"]]).toarray()
scaled = StandardScaler().fit_transform(df[["size"]])
import numpy as np
X = np.hstack([scaled, encoded])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # centroid has fractional one-hot

Output

[[ 0.4472136 0. 0.4472136 ]

[-0.4472136 0.4472136 0. ]]

⚠ Production Trap:

One-hot encoding with K-Means creates the 'curse of dimensionality' where all points appear equidistant. Use Gower distance or K-Prototypes for mixed data instead.

🎯 Key Takeaway

K-Means requires numeric, ratio-scaled features — categorical data must be handled with specialized algorithms, not brute-force encoding.

Scalability: Why K-Means Crashes on 100M Rows and How to Fix It

Standard K-Means loads all data into memory and iterates until convergence. At 100 million rows, this causes memory blowout and hours of training. The fix: Mini-Batch K-Means. It samples small random batches (e.g., 1000 rows) per iteration, updating centroids online with a learning rate. Convergence is faster (10x-100x) and memory stays constant. But there's a trade-off: centroids oscillate near the final solution instead of converging exactly. Set batch_size to 10% of data or 1000 (whichever is smaller). For true big data, use Spark MLlib's K-Means which distributes the assignment step across nodes. Another trick: reduce dimensionality first with random projection or PCA on 50M+ rows. Never run Elbow Method on full data — sample 1% to pick K, then train on the rest. The rule: if your dataset exceeds RAM, Mini-Batch K-Means is the only viable off-the-shelf option.

scalable_kmeans.pyPYTHON

// io.thecodeforge — ml-ai tutorial

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
import numpy as np

X, _ = make_blobs(n_samples=10_000_000, n_features=50, random_state=42)
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1000, random_state=0)
mbk.fit(X)
print(f"Inertia: {mbk.inertia_:.2f}")
print(f"Iterations: {mbk.n_iter_}")

Output

Inertia: 14325678.34

Iterations: 100

💡Performance Insight:

Mini-Batch K-Means converged in 100 iterations vs 300+ for standard K-Means on this 10M-row dataset, using < 2GB RAM.

🎯 Key Takeaway

For datasets exceeding RAM, switch to Mini-Batch K-Means — it sacrifices a sliver of accuracy for massive speed and memory gains.

thecodeforge.io

K Means Clustering

Evaluating Cluster Quality When You Have No Ground Truth

Without labels, you cannot use accuracy. Instead, rely on internal validation metrics. Silhouette Score (range -1 to 1) measures how similar a point is to its own cluster vs others — values above 0.5 indicate reasonable separation. Davies-Bouldin Index (lower is better) compares cluster similarity — below 1.0 suggests compact, well-separated clusters. Calinski-Harabasz Index (higher is better) is the ratio of between-cluster variance to within-cluster variance. All three have flaws: Silhouette favors spherical clusters, Davies-Bouldin gets unstable with many clusters, Calinski-Harabasz assumes equal-sized clusters. Never trust a single metric. Run all three, plus visual inspection via PCA or t-SNE if dimensionality allows. A production rule: if Silhouette < 0.3 or Davies-Bouldin > 1.5, re-examine your K and preprocessing. These metrics also help choose K — use the 'elbow' of the Silhouette curve or minimum Davies-Bouldin. But remember: good metrics don't guarantee actionable clusters. Always ground-truth check with domain experts.

cluster_evaluation.pyPYTHON

// io.thecodeforge — ml-ai tutorial

from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, n_features=10, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, random_state=0).fit(X)
print(f"Silhouette: {silhouette_score(X, kmeans.labels_):.3f}")
print(f"Davies-Bouldin: {davies_bouldin_score(X, kmeans.labels_):.3f}")
print(f"Calinski-Harabasz: {calinski_harabasz_score(X, kmeans.labels_):.1f}")

Output

Silhouette: 0.681

Davies-Bouldin: 0.657

Calinski-Harabasz: 2856.4

⚠ Production Trap:

Silhouette Score rewards spherical clusters. If your data has elongated or irregular shapes, these metrics give false confidence — always pair with domain validation.

🎯 Key Takeaway

Without labels, use three internal metrics together, not one — and never skip visual checks with dimensionality reduction.

● Production incidentPOST-MORTEMseverity: high

Empty Clusters in Production: The Silent Model Degradation

Symptom

One or more clusters contain zero data points after training. The model still runs, but the centroids for empty clusters become NaN or remain at their initial random position, causing downstream dashboards to show nonsensical segments.

Assumption

The team assumed K-Means always produces balanced clusters because the algorithm minimises inertia — they didn't realise that poor initialization or outlier drift can leave centroids orphaned.

Root cause

The data pipeline began including a new feature with extreme outliers that pulled centroids away from the smaller cluster's region. Combined with random initialization (not k-means++), one centroid was never assigned any points and stayed static.

Fix

Switch to k-means++ initialization, add outlier clipping (cap at 99th percentile), and implement a check after training: if any cluster has zero members, re-run with a different seed or reduce K.

Key lesson

Always verify cluster membership counts after fitting — empty clusters are a silent signal of bad initialization or data distribution shifts.
Use k-means++ or deterministic initialization (e.g., from prior run) instead of pure random seed.
Monitor feature distributions in production — outlier injection can orphan centroids without any explicit error.

Production debug guideCommon symptoms, root causes, and immediate actions4 entries

Symptom · 01

Model returns same cluster for all points

→

Fix

Check if K is too small or data is highly imbalanced. Examine centroid spread via kmeans.cluster_centers_. If all centroids overlap, reduce K or try different initialization.

Symptom · 02

Inertia drops but silhouette score is low

→

Fix

Inertia favours compact clusters, not separation. Compute silhouette_score per cluster. Look for clusters with negative silhouette values — they overlap.

Symptom · 03

Results change drastically between runs

→

Fix

Fix random seed with random_state and switch to k-means++. If still unstable, your data may not have natural clusters — try DBSCAN or hierarchical clustering.

Symptom · 04

Centroids contain NaN after fit

→

Fix

Check for missing values in the input data. K-Means does not handle NaN natively. Impute or drop missing rows. Also verify no integer overflow in distance computations (scale features).

★ Quick Debug Cheat Sheet: K-Means FailuresDiagnose and fix the most common production issues in minutes.

Empty clusters appear−

Immediate action

Restart with k-means++ and random_state=42

Commands

python -c "from sklearn.cluster import KMeans; km = KMeans(n_clusters=5, init='k-means++', random_state=42)"

km.fit(X); cluster_counts = pd.Series(km.labels_).value_counts()

Fix now

If cluster_counts has zeros, increase K or reduce outliers with numpy.clip(X, lower, upper)

Different runs give different labels+

Silhouette score is negative or near zero+

K-Means vs Other Clustering Algorithms

Algorithm	Assumption	Scalability	Output	K required?
K-Means	Spherical clusters, equal size	Very fast (O(n·K·d·I))	Hard assignment	Yes
DBSCAN	Density-connected regions	Fast (O(n·log n)) with index	Noise points, arbitrary shapes	No
Gaussian Mixture	Elliptical clusters	Moderate (O(n·K·d²))	Probabilistic assignment	Yes
Hierarchical (Agglomerative)	Hierarchical nesting	Slow (O(n³) naive)	Dendrogram, any granularity	No
Spectral Clustering	Graph connectivity	Slow (O(n³))	Non-convex clusters	Yes (or use eigen-gap)

⚙ Quick Reference

10 commands from this guide

File	Command / Code	Purpose
kmeans_scratch.py	def k_means(X, K, max_iters=100):	The K-Means Algorithm
choose_k.py	from sklearn.cluster import KMeans	Choosing K
init_comparison.py	from sklearn.cluster import KMeans	Initialization Methods
scale_before_cluster.py	from sklearn.preprocessing import StandardScaler	Feature Scaling
cluster_inspection.py	from sklearn.decomposition import PCA	Common Pitfalls and How to Avoid Them
ImageCompressionDemo.py	from sklearn.cluster import KMeans	Real-World Uses
kmeans_scratch.py	def kmeans(X, k, max_iters=100, tol=1e-4):	Implementation
categorical_warning.py	from sklearn.cluster import KMeans	Why K-Means Fails on Categorical Data (And What to Do)
scalable_kmeans.py	from sklearn.cluster import MiniBatchKMeans	Scalability
cluster_evaluation.py	from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_har...	Evaluating Cluster Quality When You Have No Ground Truth

Key takeaways

K-Means iteratively assigns points to nearest centroid and recomputes centroids

converges to a local minimum of inertia.

Always scale features before clustering; distance-based algorithms are sensitive to magnitude differences.

k-means++ initialization reduces variance but does not guarantee global optimum

run multiple seeds.

Choose K using both Elbow Method and Silhouette Score, then validate with business context.

Outliers and non-spherical clusters break K-Means

inspect your data before and after clustering.

Never deploy a K-Means model without checking cluster membership counts and visualizing the result.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01JUNIOR

Explain how K-Means clustering works at a high level. What are the key s...

Q02SENIOR

How do you choose the value of K in K-Means? Compare two methods and the...

Q03SENIOR

K-Means uses Euclidean distance. How would you adapt it for high-dimensi...

Q01 of 03JUNIOR

Explain how K-Means clustering works at a high level. What are the key steps?

ANSWER

K-Means is an iterative clustering algorithm. Step 1: Initialise K centroids (often with k-means++). Step 2: Assign each data point to the nearest centroid based on Euclidean distance. Step 3: Recompute each centroid as the mean of all points assigned to it. Repeat steps 2 and 3 until convergence (assignments stop changing or inertia change is below a threshold). The algorithm minimises the within-cluster sum of squares (inertia). It converges to a local minimum, not global.

FAQ · 4 QUESTIONS

Frequently Asked Questions

What is K-Means Clustering in simple terms?

Why does K-Means sometimes produce different results on the same data?

Can K-Means handle categorical data?

What is the difference between K-Means and DBSCAN?

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Everything here is grounded in real deployments.

✓ Verified

production tested

July 19, 2026

last updated

2,466

articles · all by Naren

🔥

That's Algorithms. Mark it forged?

6 min read · try the examples if you haven't