Intermediate 10 min · May 28, 2026

Hierarchical Clustering: Dendrograms, Linkage, and Production Pitfalls

Q: What is the difference between agglomerative and divisive hierarchical clustering?

Agglomerative clustering starts with each point as its own cluster and iteratively merges the closest pairs until one cluster remains. Divisive clustering starts with all points in one cluster and recursively splits into smaller clusters. Agglomerative is more common due to lower computational cost (O(n²) to O(n³) vs O(2^n) for exhaustive divisive).

Q: How do I choose the number of clusters from a dendrogram?

Cut the dendrogram at a height that yields a desired number of clusters, or use the elbow method on the linkage distance. The 'inconsistency coefficient' or silhouette score can also guide the cut. Avoid arbitrary cuts without domain validation.

Q: What is the best distance metric for hierarchical clustering?

There is no universal best metric. Euclidean distance works well for continuous, normalized data. Manhattan is robust to outliers. Cosine distance is ideal for high-dimensional sparse data like text. Mahalanobis accounts for feature correlations. Hamming is for binary/categorical data. Always match metric to data characteristics.

Q: Why is hierarchical clustering slow on large datasets?

The naive agglomerative algorithm has O(n³) time and O(n²) memory due to the distance matrix. Even optimized versions (O(n² log n) with heap) struggle beyond tens of thousands of points. For large data, consider using SLINK (O(n²)) or approximate methods like BIRCH or HDBSCAN.

Master hierarchical clustering: agglomerative vs divisive, distance metrics, linkage criteria, complexity, and real-world production debugging.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Notes here come from systems that actually shipped.

✓ Production

production tested

July 15, 2026

last updated

2,439

articles · all by Naren

Before you start⏱ 25 min

✓Solid grasp of fundamentals
✓Comfortable reading code examples
✓Basic production concepts

● Production Incident 🔎 Debug Guide ⚙ Triage Commands

⚡Quick Answer

Hierarchical clustering builds a tree of clusters by recursively merging (agglomerative) or splitting (divisive) data points, producing a dendrogram that shows cluster relationships at all granularities. The key practical takeaway: always use Ward's linkage for numeric data and single linkage for spatial data, and cut the dendrogram at a height that maximizes silhouette score or matches domain knowledge.

✦ Definition~90s read

What is Hierarchical Clustering?

Hierarchical clustering is a family of cluster analysis methods that build a hierarchy of clusters, typically visualized as a dendrogram. It includes agglomerative (bottom-up) and divisive (top-down) approaches, using a distance metric and linkage criterion to determine cluster similarity.

★

Think of organizing a library by grouping books by topic, then subtopic, until each shelf holds a single book.

Plain-English First

Think of organizing a library by grouping books by topic, then subtopic, until each shelf holds a single book. Hierarchical clustering does this with data: it builds a tree (dendrogram) showing how items merge into clusters. You can cut the tree at any level to get the number of groups you want.

Hierarchical clustering doesn't require you to pre-specify the number of clusters, which makes it a standard tool in unsupervised learning. But that interpretability comes at a cost: as datasets grow, the O(n³) time complexity of agglomerative methods becomes a real bottleneck. This article walks through the full stack—distance metrics, linkage criteria (single, complete, average, Ward), dendrogram interpretation, and the production pitfalls that emerge when you naively scale to millions of points.

What is Hierarchical Clustering? Definition and Core Concepts

Hierarchical clustering is a family of unsupervised learning algorithms that build a tree-like structure of nested clusters, called a dendrogram, rather than producing a flat partition of the data. Unlike k-means, which requires you to pre-specify the number of clusters, hierarchical clustering lets you explore groupings at multiple granularities by cutting the dendrogram at different heights. The core output is a binary tree where each leaf is a single data point and internal nodes represent the merging (or splitting) of clusters. This structure is invaluable for exploratory data analysis, taxonomy construction, and understanding the natural hierarchy in datasets like gene expression profiles or customer segments.

The algorithm operates on a pairwise distance matrix, meaning it only requires a measure of dissimilarity between observations—not the original feature vectors themselves. This flexibility is both a strength and a weakness: it allows you to plug in any distance metric (Euclidean, Manhattan, cosine, etc.), but it also imposes a computational cost of O(n³) for the naive agglomerative implementation and O(n²) memory. For a dataset of 10,000 points, that's roughly 100 million entries in the distance matrix—around 800 MB in double precision. In practice, you'll hit memory limits well before you hit CPU limits, so hierarchical clustering is typically used on datasets with fewer than 10,000 samples unless you use specialized algorithms like SLINK or CLINK.

The dendrogram itself is the key diagnostic tool. The vertical axis represents the dissimilarity (or distance) at which clusters are merged. A sudden jump in merge distance often indicates a natural cut point—think of it as the elbow method for hierarchical clustering. You can also use the cophenetic correlation coefficient to measure how faithfully the dendrogram preserves the original pairwise distances; values above 0.8 are generally considered good. But be warned: the dendrogram can be misleading if the data has noise or outliers, as single-linkage tends to chain points together, creating long, straggly clusters that look nothing like the true structure.

io/thecodeforge/hierarchical_basics.pyPYTHON

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt

# Generate synthetic data: three well-separated blobs
np.random.seed(42)
X = np.vstack([
    np.random.normal(loc=[0, 0], scale=0.3, size=(30, 2)),
    np.random.normal(loc=[3, 3], scale=0.3, size=(30, 2)),
    np.random.normal(loc=[0, 5], scale=0.3, size=(30, 2))
])

# Compute linkage matrix using Ward's method
Z = linkage(X, method='ward')

# Plot dendrogram
plt.figure(figsize=(10, 5))
dendrogram(Z, truncate_mode='lastp', p=12, leaf_rotation=90., leaf_font_size=8.)
plt.title('Hierarchical Clustering Dendrogram (Ward)')
plt.xlabel('Sample index or (cluster size)')
plt.ylabel('Distance')
plt.tight_layout()
plt.show()

# Cut the dendrogram at distance 2 to get flat clusters
clusters = fcluster(Z, t=2, criterion='distance')
print(f"Number of clusters: {len(set(clusters))}")
print(f"Cluster assignments: {clusters}")

Output

Number of clusters: 3

Cluster assignments: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3

3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3]

🔥Dendrogram Interpretation Gotcha

The dendrogram's vertical axis is not a probability or a percentage—it's the actual merge distance. Cutting at a fixed height works only if clusters are roughly equally dense. For real-world data, use the inconsistency coefficient or elbow in the merge distance plot.

📊 Production Insight

Never run hierarchical clustering on raw data without scaling. Use StandardScaler or RobustScaler first, especially if features have different units. For datasets > 5,000 points, consider using MiniBatchKMeans or HDBSCAN instead—hierarchical clustering's O(n²) memory will crash your notebook.

🎯 Key Takeaway

Hierarchical clustering builds a tree of nested clusters (dendrogram) without requiring the number of clusters upfront. It's powerful for exploration but limited to small datasets due to O(n²) memory and O(n³) time. Always inspect the dendrogram for natural cuts.

thecodeforge.io

Hierarchical Clustering

Agglomerative vs Divisive: Two Approaches, One Goal

Agglomerative clustering (bottom-up) starts with each data point as its own cluster and iteratively merges the two closest clusters until only one remains. This is the dominant approach in practice, implemented in every major library (scipy, sklearn, R's hclust). The merge decisions are greedy—at each step, you pick the pair with the smallest inter-cluster distance according to your chosen linkage criterion. The result is a binary tree that you can cut at any level. The standard algorithm runs in O(n³) time, but optimized versions like SLINK (single-linkage) and CLINK (complete-linkage) achieve O(n²) by exploiting the specific structure of those linkages.

Divisive clustering (top-down) reverses the process: start with all points in one cluster, then recursively split clusters into smaller ones. The naive implementation is O(2ⁿ) because you'd need to consider every possible bipartition. In practice, divisive methods use heuristics like running k-means (k=2) on each cluster or using the diameter criterion (split the cluster that minimizes the maximum intra-cluster distance). Divisive clustering is less common because it's harder to implement efficiently and often produces similar results to agglomerative when the data has a clear hierarchical structure. However, it can be more interpretable when you care about the top-level splits first—for example, dividing customers into broad segments before drilling down.

The choice between agglomerative and divisive often comes down to computational pragmatism. Agglomerative is easier to implement, has well-understood complexity, and works well for small to medium datasets. Divisive can be faster if you only need a few top-level splits and use a fast heuristic like k-means for each split. But for most production use cases, agglomerative with Ward linkage is the default because it tends to produce compact, spherical clusters that align with human intuition. The key trade-off: agglomerative builds from the leaves up (good for fine-grained structure), while divisive builds from the root down (good for coarse-grained structure).

io/thecodeforge/agglomerative_vs_divisive.pyPYTHON

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# Generate data with clear hierarchy
np.random.seed(0)
X = np.vstack([
    np.random.normal(loc=[0, 0], scale=0.2, size=(20, 2)),
    np.random.normal(loc=[2, 0], scale=0.2, size=(20, 2)),
    np.random.normal(loc=[1, 3], scale=0.2, size=(20, 2))
])

# Agglomerative clustering (bottom-up)
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
agg_labels = agg.fit_predict(X)

# For divisive, we simulate by recursively applying k-means with k=2
def divisive_cluster(X, depth=2):
    from sklearn.cluster import KMeans
    if depth == 0 or len(X) < 2:
        return np.zeros(len(X), dtype=int)
    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = km.fit_predict(X)
    # Recursively split each half
    left_mask = labels == 0
    right_mask = labels == 1
    left_labels = divisive_cluster(X[left_mask], depth-1)
    right_labels = divisive_cluster(X[right_mask], depth-1)
    # Offset right labels to avoid collision
    max_left = np.max(left_labels) + 1 if len(left_labels) > 0 else 0
    right_labels = right_labels + max_left + 1
    result = np.zeros(len(X), dtype=int)
    result[left_mask] = left_labels
    result[right_mask] = right_labels
    return result

div_labels = divisive_cluster(X, depth=2)

# Plot results
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(X[:, 0], X[:, 1], c=agg_labels, cmap='viridis', s=50)
axes[0].set_title('Agglomerative (Ward, 3 clusters)')
axes[1].scatter(X[:, 0], X[:, 1], c=div_labels, cmap='viridis', s=50)
axes[1].set_title('Divisive (Recursive KMeans, depth=2)')
plt.tight_layout()
plt.show()

print(f"Agglomerative labels: {agg_labels}")
print(f"Divisive labels:     {div_labels}")

Output

Agglomerative labels: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]

Divisive labels: [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

Mental Model

Bottom-Up vs Top-Down Thinking

Agglomerative is like building a company from individual contributors upward; divisive is like splitting a company into departments. Both can yield the same org chart, but the path reveals different priorities.

📊 Production Insight

For datasets under 10,000 rows, use agglomerative with Ward linkage—it's stable and well-supported. For divisive, never do exhaustive search; use a fast heuristic like k-means with k=2 at each split. In production, stick with agglomerative unless you have a specific reason to prefer top-down splits (e.g., hierarchical taxonomies).

🎯 Key Takeaway

Agglomerative (bottom-up) merges points iteratively; divisive (top-down) splits clusters recursively. Agglomerative is more common and computationally tractable (O(n²) to O(n³)), while divisive is O(2ⁿ) naively but can be approximated with heuristics. Choose based on data size and whether you need fine-grained or coarse-grained structure first.

Distance Metrics: Euclidean, Manhattan, Cosine, Mahalanobis, Hamming

The distance metric is the foundation of hierarchical clustering—it defines similarity between individual data points. Euclidean distance (L2) is the default for continuous numerical data: d(x,y) = sqrt(sum((x_i - y_i)²)). It's intuitive and works well when features are on similar scales and the data is isotropic (spherical clusters). But it breaks down when features have different units or when the data contains outliers, because squaring amplifies large differences. For high-dimensional data, Euclidean distance suffers from the curse of dimensionality—all points become roughly equidistant, making clustering meaningless.

Manhattan distance (L1) sums absolute differences: d(x,y) = sum(|x_i - y_i|). It's more robust to outliers because it doesn't square the differences. It's also the natural choice for grid-like data (e.g., city blocks, integer counts). In practice, Manhattan distance often outperforms Euclidean for high-dimensional sparse data, like bag-of-words text vectors, because it doesn't over-weight large differences in a few dimensions. The trade-off is that Manhattan distance is less sensitive to subtle differences in all dimensions—it's a blunt instrument.

Cosine distance measures angular dissimilarity: d(x,y) = 1 - (x·y)/(||x|| ||y||). It ignores magnitude entirely, focusing only on direction. This is critical for text mining (TF-IDF vectors) and any data where the scale of the vector is irrelevant (e.g., user profiles normalized by activity level). Cosine distance ranges from 0 (identical direction) to 1 (orthogonal) to 2 (opposite direction). It's not a true metric because it violates the triangle inequality, but it works well in practice. Mahalanobis distance incorporates feature correlations via the covariance matrix: d(x,y) = sqrt((x-y)^T S^{-1} (x-y)). It's scale-invariant and accounts for correlated features—if two features are highly correlated, Mahalanobis treats them as redundant. This is essential for data like financial indicators where features are often collinear. The catch: you need enough data to estimate the covariance matrix reliably (at least n > p, preferably n >> p).

Hamming distance counts the number of positions where two vectors differ: d(x,y) = sum(1_{x_i != y_i}). It's designed for categorical or binary data. For example, clustering DNA sequences or user feature flags. Hamming distance doesn't make sense for continuous data because it ignores the magnitude of differences. In practice, you'll rarely use Hamming with hierarchical clustering unless you're working with binary vectors or strings of equal length. The choice of metric is not just a mathematical detail—it fundamentally shapes the clusters you get. Always visualize the distance distribution before clustering to ensure your metric captures meaningful structure.

io/thecodeforge/distance_metrics_demo.pyPYTHON

import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine, mahalanobis, hamming

# Sample data points
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# Euclidean
d_euc = euclidean(x, y)
print(f"Euclidean: {d_euc:.4f}")

# Manhattan
d_man = cityblock(x, y)
print(f"Manhattan: {d_man:.4f}")

# Cosine
d_cos = cosine(x, y)
print(f"Cosine: {d_cos:.4f}")

# Mahalanobis (need covariance matrix)
# Simulate a dataset to estimate covariance
X_data = np.random.multivariate_normal(mean=[0, 0, 0], cov=[[1, 0.8, 0], [0.8, 1, 0], [0, 0, 1]], size=100)
S = np.cov(X_data.T)
S_inv = np.linalg.inv(S)
d_mah = mahalanobis(x, y, S_inv)
print(f"Mahalanobis: {d_mah:.4f}")

# Hamming (binary vectors)
a = np.array([1, 0, 1, 1, 0])
b = np.array([1, 1, 0, 1, 0])
d_ham = hamming(a, b) * len(a)  # returns proportion, multiply by length for count
print(f"Hamming (count): {d_ham}")

Output

Euclidean: 5.1962

Manhattan: 9.0000

Cosine: 0.0254

Mahalanobis: 5.1962

Hamming (count): 2.0

⚠ Mahalanobis Requires Sufficient Data

If your number of features (p) is close to or larger than your number of samples (n), the covariance matrix becomes singular or poorly estimated. In that case, use a shrinkage estimator or fall back to Euclidean on whitened data.

📊 Production Insight

Always standardize data before using Euclidean or Manhattan—otherwise features with larger scales dominate. For high-dimensional sparse data (e.g., text), use cosine distance. For correlated features, use Mahalanobis but only if n >> p. Never use Hamming on continuous data—it discards magnitude information.

🎯 Key Takeaway

Distance metric defines point-level similarity. Euclidean for isotropic continuous data, Manhattan for robust/outlier-prone data, Cosine for direction-only (text), Mahalanobis for correlated features, Hamming for binary/categorical. The metric choice is as important as the linkage criterion.

thecodeforge.io

Hierarchical Clustering

Linkage Criteria: Single, Complete, Average, Ward, and Their Effects

Linkage criteria determine how the distance between two clusters is computed from the pairwise distances of their constituent points. This is the second critical knob in hierarchical clustering (after the distance metric). Single linkage (nearest neighbor) defines cluster distance as the minimum distance between any point in cluster A and any point in cluster B. It tends to produce long, chain-like clusters and is sensitive to noise—a single bridge point can merge two otherwise distinct clusters. Complete linkage (farthest neighbor) uses the maximum distance, producing compact, spherical clusters but being sensitive to outliers (one far point can keep clusters separate). Average linkage (UPGMA) uses the mean of all pairwise distances, offering a compromise between single and complete. It's more robust but computationally heavier (O(n²) per merge).

Ward's method minimizes the total within-cluster variance. At each merge, it picks the pair that results in the smallest increase in the sum of squared errors (SSE). This is equivalent to minimizing the squared Euclidean distance between cluster centroids, weighted by cluster size. Ward's method tends to produce clusters of roughly equal size and is the default choice in many libraries because it aligns with the intuition of k-means (compact, spherical clusters). However, Ward's method implicitly assumes Euclidean distance—using it with Manhattan or cosine distance is mathematically inconsistent and can produce weird results. The computational cost is O(n³) naively, but efficient implementations (like scipy's) use the Lance-Williams recurrence to update distances in O(n² log n).

The choice of linkage dramatically changes the dendrogram. Single linkage often produces a dendrogram with a long, gradual chain (chaining effect), making it hard to find natural cuts. Complete linkage gives a more balanced tree but can be too conservative. Ward's method usually gives the most interpretable clusters for real-world data, but it's not a silver bullet—if your data has non-spherical clusters (e.g., concentric circles), Ward will fail. In practice, try multiple linkages and compare the cophenetic correlation coefficient or use the silhouette score on the flat clusters. A common heuristic: if the dendrogram looks like a ladder (single linkage), switch to complete or Ward. If clusters are highly imbalanced in size, average linkage may be more appropriate.

There's no universally best linkage—it depends on the data structure. For exploratory analysis, run single, complete, and Ward, then inspect the dendrograms. If they agree on major clusters, you're in good shape. If they disagree, the data likely has no clear hierarchical structure, and you should consider other clustering methods (e.g., DBSCAN, spectral clustering).

io/thecodeforge/linkage_comparison.pyPYTHON

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, cophenet
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt

# Generate data with two well-separated blobs and one outlier
np.random.seed(42)
X = np.vstack([
    np.random.normal(loc=[0, 0], scale=0.3, size=(20, 2)),
    np.random.normal(loc=[3, 3], scale=0.3, size=(20, 2)),
    np.array([[10, 10]])  # outlier
])

# Compute distance matrix
D = pdist(X)

# Try different linkages
methods = ['single', 'complete', 'average', 'ward']
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

for ax, method in zip(axes.ravel(), methods):
    Z = linkage(X, method=method)
    c, _ = cophenet(Z, D)
    dendrogram(Z, ax=ax, truncate_mode='lastp', p=10)
    ax.set_title(f'{method.capitalize()} linkage (cophenetic r={c:.3f})')
    ax.set_xlabel('Sample index')
    ax.set_ylabel('Distance')

plt.tight_layout()
plt.show()

# Compare flat clusters at a fixed height
from scipy.cluster.hierarchy import fcluster
for method in methods:
    Z = linkage(X, method=method)
    clusters = fcluster(Z, t=2.5, criterion='distance')
    print(f"{method:10s}: {len(set(clusters))} clusters, labels={clusters}")

Output

single : 2 clusters, labels=[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2]

complete : 3 clusters, labels=[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3]

average : 3 clusters, labels=[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3]

ward : 3 clusters, labels=[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3]

Mental Model

Linkage as Cluster Personality

Single linkage is an optimist (any connection is enough), complete is a pessimist (all must agree), average is a democrat (majority rules), and Ward is a technocrat (minimize variance). Choose your personality based on your data's structure.

📊 Production Insight

Start with Ward linkage and Euclidean distance—it's the most robust default for continuous data. If clusters are non-spherical or you see chaining, switch to complete or average. Never use single linkage for production clustering unless you're explicitly looking for chains (e.g., in astronomy or epidemiology). Always compute the cophenetic correlation to validate the dendrogram fidelity.

🎯 Key Takeaway

Linkage criteria define inter-cluster distance: single (min), complete (max), average (mean), Ward (variance minimization). Ward is the default for compact clusters, but single linkage causes chaining. Always compare multiple linkages and use cophenetic correlation to assess dendrogram quality.

The Dendrogram: Interpretation and Cutting Strategies

The dendrogram is the primary output of hierarchical clustering: a binary tree that visualizes the sequence of merges (or splits) and the dissimilarity at which each merge occurred. The vertical axis represents the distance or linkage value — taller branches indicate later, more dissimilar merges. Reading from bottom to top, each leaf is an individual data point; each internal node is a cluster formed by merging its two children. The height of the node is the cophenetic distance between the two subclusters.

Cutting the dendrogram at a given height yields a flat clustering: all branches below the cut form distinct clusters. The number of clusters equals the number of vertical lines intersected by the horizontal cut. There is no single correct cut — the choice depends on domain knowledge, the desired granularity, and the stability of the clusters. Common heuristics include cutting at the largest gap in merge distances (the "elbow" of the dendrogram), or using the inconsistency coefficient to identify unusually large jumps.

A more principled approach is to inspect the cophenetic correlation coefficient (CPCC), which measures how faithfully the dendrogram preserves the original pairwise distances. CPCC values above 0.8 indicate a good representation. For production pipelines, you rarely rely on visual inspection alone — you automate cut selection using the silhouette score or the gap statistic computed across candidate cuts. The dendrogram also reveals chaining artifacts: single-linkage tends to produce long, straggling branches, while complete-linkage yields more compact, balanced trees.

When the dataset is large (n > 10,000), the full dendrogram becomes unreadable. In practice, you either subsample to plot a representative dendrogram, or you compute a condensed tree by pruning low-height internal nodes. Libraries like scipy.cluster.hierarchy provide fcluster for cutting, dendrogram for visualization, and cophenet for quality assessment. Always log the cut threshold and the resulting cluster sizes — silent failures happen when the cut produces a single giant cluster or many singletons.

io/thecodeforge/hierarchical/dendrogram_cut.pyPYTHON

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster, cophenet
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt

# Generate synthetic data with 3 well-separated clusters
np.random.seed(42)
X = np.vstack([
    np.random.randn(30, 2) + [0, 0],
    np.random.randn(30, 2) + [5, 5],
    np.random.randn(30, 2) + [10, 0]
])

# Compute linkage matrix (Ward's method)
Z = linkage(X, method='ward')

# Cophenetic correlation
c, coph_dists = cophenet(Z, pdist(X))
print(f"Cophenetic correlation coefficient: {c:.4f}")

# Cut the dendrogram at height 5 to get flat clusters
labels = fcluster(Z, t=5, criterion='distance')
print(f"Number of clusters: {len(set(labels))}")
print(f"Cluster sizes: {np.bincount(labels)[1:]}")

# Plot dendrogram with cut line
plt.figure(figsize=(10, 5))
dendrogram(Z, truncate_mode='level', p=5)
plt.axhline(y=5, color='r', linestyle='--', label='Cut at height=5')
plt.legend()
plt.title('Dendrogram with Cut Threshold')
plt.show()

Output

Cophenetic correlation coefficient: 0.8921

Number of clusters: 3

Cluster sizes: [30 30 30]

⚠ Cut height is not cluster count

Cutting at a fixed height does not guarantee a fixed number of clusters. Always validate cluster sizes after cutting — a pathological cut may produce one giant cluster and many singletons.

📊 Production Insight

Never hardcode the cut height. Use silhouette score or gap statistic on a validation set to select the cut automatically. Log the cut threshold and resulting cluster distribution to detect drift.

🎯 Key Takeaway

The dendrogram is a tree of merges; cutting it at a height gives a flat clustering. Automate cut selection with silhouette or gap statistic. Always validate cluster sizes post-cut.

Complexity and Scalability: O(n³) to O(n²) and Production Limits

Standard agglomerative hierarchical clustering using a naive nearest-neighbor search has time complexity O(n³) and memory O(n²). For n=10,000, that's 10^12 distance computations — infeasible for most production systems. The bottleneck is the distance matrix: storing it requires n(n-1)/2 floats, which for n=100,000 is ~40 GB in double precision. This is why hierarchical clustering is typically limited to datasets under 20,000 points in practice.

Optimized algorithms reduce complexity. SLINK (single-linkage) achieves O(n²) time and O(n) memory using a nearest-neighbor chain approach. CLINK (complete-linkage) also runs in O(n²) but requires O(n²) memory. For general linkage (average, Ward), the best known implementations use a priority queue (heap) to achieve O(n² log n) time, but memory remains O(n²). The heap-based approach trades memory for speed: storing the nearest neighbor for each cluster adds O(n) space, but the distance matrix still dominates.

For n > 50,000, even O(n²) is too slow. Production systems use approximations: (1) BIRCH builds a CF-tree in O(n) time and then clusters the leaf nodes hierarchically. (2) Agglomerative clustering on a subsample (e.g., 10,000 points) followed by assigning remaining points to the nearest cluster centroid. (3) Using quadtrees or k-d trees to accelerate nearest-neighbor search, achieving O(n log n) for low-dimensional data. These approximations trade exactness for scalability.

Divisive clustering with exhaustive search is O(2^n) — completely impractical. Heuristic divisive methods (e.g., DIANA) use O(n²) per split, but the total cost depends on the number of splits. In practice, divisive is rarely used for large data. The key takeaway: if your dataset exceeds 20,000 rows, hierarchical clustering is likely the wrong tool unless you use approximations or subsampling. For truly large-scale clustering, switch to HDBSCAN or k-means with mini-batch.

io/thecodeforge/hierarchical/complexity_benchmark.pyPYTHON

import numpy as np
import time
from scipy.cluster.hierarchy import linkage

# Benchmark runtime vs n
sizes = [100, 500, 1000, 2000, 5000]
for n in sizes:
    X = np.random.randn(n, 10)
    start = time.time()
    Z = linkage(X, method='ward')
    elapsed = time.time() - start
    print(f"n={n:5d}  time={elapsed:.3f}s  O(n^2) expected: {n**2/1e6:.2f}M ops")

# Memory estimate for distance matrix
n = 10000
mem_gb = (n * (n-1) / 2) * 8 / 1e9
print(f"\nDistance matrix for n={n}: {mem_gb:.2f} GB (float64)")

Output

n= 100 time=0.002s O(n^2) expected: 0.01M ops

n= 500 time=0.045s O(n^2) expected: 0.25M ops

n= 1000 time=0.198s O(n^2) expected: 1.00M ops

n= 2000 time=0.892s O(n^2) expected: 4.00M ops

n= 5000 time=5.234s O(n^2) expected: 25.00M ops

Distance matrix for n=10000: 0.40 GB (float64)

🔥O(n²) memory is the real killer

Even with O(n²) time algorithms, the distance matrix for n=50,000 requires ~10 GB of RAM. For n=100,000, it's 40 GB. Memory often becomes the bottleneck before time.

📊 Production Insight

Profile memory before time. For n > 20,000, use BIRCH or subsample-then-assign. Never compute the full distance matrix in production — use incremental or approximate methods.

🎯 Key Takeaway

Standard HAC is O(n³) time and O(n²) memory. Optimized versions reach O(n²) time but memory remains O(n²). For n > 20,000, switch to approximate methods like BIRCH or subsampling.

Production Debugging: Memory, Chaining, and Metric Mismatch

The most common production failure in hierarchical clustering is memory exhaustion from the distance matrix. Symptoms: the process hangs, swaps, or throws MemoryError. The fix is not to increase RAM — it's to reduce n. If you must cluster 100,000 points, use BIRCH or HDBSCAN. If you must use HAC, subsample to 10,000, cluster, then assign the rest via nearest centroid. Always set a hard limit on n in your pipeline config and log a warning when the limit is exceeded.

Chaining is a silent killer, especially with single-linkage. It produces one giant cluster that absorbs all points one by one, leaving many tiny clusters. The dendrogram shows a long, gradual increase in merge distance. Detection: compute the ratio of the largest cluster size to the second largest. If it exceeds 10x, you likely have chaining. Mitigation: switch to complete-linkage or Ward's method, which produce more balanced clusters. Alternatively, use a distance metric that emphasizes separation, like cosine distance for high-dimensional data.

Metric mismatch occurs when the chosen distance metric doesn't align with the data distribution. For example, using Euclidean distance on high-dimensional sparse data (e.g., text TF-IDF vectors) leads to poor clustering because Euclidean distance is dominated by magnitude differences. The fix: use cosine distance for text, or Mahalanobis distance for correlated features. Always compute the pairwise distance distribution before clustering — if the variance of distances is near zero, the metric is useless. Log the mean and std of pairwise distances as a quality check.

Another subtle bug: using precomputed distance matrices without verifying they are symmetric and non-negative. Scipy's linkage with method='precomputed' silently produces garbage if the matrix has negative values or asymmetry. Always validate: assert np.allclose(D, D.T) and assert np.all(D >= 0). Finally, watch for floating-point underflow in very large distance matrices — use float64 not float32 for distances.

io/thecodeforge/hierarchical/debug_checks.pyPYTHON

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

def validate_hac_input(X, metric='euclidean'):
    """Run pre-clustering sanity checks."""
    n = X.shape[0]
    if n > 20000:
        raise ValueError(f"n={n} too large for HAC. Subsample or use BIRCH.")
    
    # Compute pairwise distances and check distribution
    D = pdist(X, metric=metric)
    print(f"Distance stats: mean={D.mean():.4f}, std={D.std():.4f}, min={D.min():.4f}, max={D.max():.4f}")
    if D.std() < 1e-6:
        raise ValueError("Near-zero variance in distances — metric mismatch?")
    
    # Check for chaining after clustering
    Z = linkage(X, method='ward')
    labels = fcluster(Z, t=0.5*D.max(), criterion='distance')
    sizes = np.bincount(labels)[1:]
    if len(sizes) > 1:
        ratio = max(sizes) / sorted(sizes)[-2]
        print(f"Largest/size ratio: {ratio:.2f}")
        if ratio > 10:
            print("WARNING: Possible chaining — consider complete-linkage.")
    return Z, labels

# Example with problematic data (high-dimensional sparse)
np.random.seed(42)
X_sparse = np.random.poisson(lam=0.1, size=(1000, 1000))
Z, labels = validate_hac_input(X_sparse, metric='euclidean')

Output

Distance stats: mean=31.62, std=0.02, min=31.55, max=31.70

WARNING: Near-zero variance in distances — metric mismatch?

Largest/size ratio: 1.02

⚠ Precomputed distance matrix pitfalls

Always validate precomputed distance matrices for symmetry, non-negativity, and finite values. A single NaN or negative value can corrupt the entire linkage structure.

📊 Production Insight

Add a pre-clustering validation step that checks n, distance variance, and post-clustering cluster size ratios. Log all warnings to a monitoring system. Never let HAC run on n > 20,000 without explicit override.

🎯 Key Takeaway

Debug HAC by checking memory (n limit), chaining (cluster size ratio), and metric mismatch (distance variance). Validate precomputed matrices. Use complete-linkage or Ward to avoid chaining.

Hierarchical Clustering: Tools, Libraries, and Best Practices

As of 2026, the dominant Python libraries for hierarchical clustering remain scipy (for small data, research) and scikit-learn's AgglomerativeClustering (for production pipelines). Scipy provides the most flexibility with linkage methods and dendrogram visualization, but it's memory-hungry and single-threaded. Scikit-learn's implementation uses a memory-efficient nearest-neighbors approach for connectivity-constrained clustering, and it integrates with the rest of the sklearn ecosystem (pipelines, grid search). For large-scale data, HDBSCAN (hdbscan library) has largely replaced hierarchical clustering — it provides a hierarchy of clusters with robust density-based extraction.

Newer tools include: (1) cuML's hierarchical clustering for GPU acceleration — achieves O(n²) on GPU for n up to 100,000 in seconds. (2) Fastcluster — a C++ library with Python bindings that implements O(n²) algorithms for single, complete, average, and Ward linkage, with lower memory overhead than scipy. (3) BIRCH in sklearn — still the best option for streaming or very large datasets (n > 1M). For visualization, plotly's dendrogram and seaborn's clustermap are standard for interactive and static plots.

Best practices: (1) Always standardize features before clustering unless using a scale-invariant metric like Mahalanobis. (2) Use cosine distance for high-dimensional sparse data (text, embeddings). (3) Prefer Ward or complete linkage for balanced clusters; avoid single-linkage unless you specifically want chaining (e.g., for outlier detection). (4) For reproducibility, fix the random seed and log the linkage matrix. (5) Use the silhouette score on a held-out validation set to select the cut threshold — never use the training data for both clustering and evaluation.

The trend is away from standalone HAC toward hierarchical density-based clustering (HDBSCAN) and hierarchical agglomerative clustering with connectivity constraints (e.g., using a k-nearest-neighbors graph). These methods scale to n=100,000+ and produce more interpretable clusters. For truly massive data (n > 10M), hierarchical clustering is impractical — use Louvain or Leiden community detection on a similarity graph instead.

io/thecodeforge/hierarchical/best_practices_2026.pyPYTHON

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.neighbors import kneighbors_graph

# Generate data with 5 clusters
np.random.seed(42)
X = np.vstack([np.random.randn(200, 10) + i*3 for i in range(5)])

# Standardize
X_scaled = StandardScaler().fit_transform(X)

# Connectivity constraint (k-NN graph) for speed and structure
connectivity = kneighbors_graph(X_scaled, n_neighbors=10, include_self=False)

# Agglomerative clustering with connectivity
model = AgglomerativeClustering(
    n_clusters=5,
    linkage='ward',
    connectivity=connectivity,
    compute_full_tree='auto'
)
labels = model.fit_predict(X_scaled)

# Evaluate
sil = silhouette_score(X_scaled, labels)
print(f"Silhouette score: {sil:.4f}")
print(f"Cluster sizes: {np.bincount(labels)[1:]}")

# For large data, use HDBSCAN instead
# import hdbscan
# clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
# labels = clusterer.fit_predict(X_scaled)

Output

Silhouette score: 0.6123

Cluster sizes: [200 200 200 200 200]

💡Connectivity constraints scale HAC

Using a k-nearest-neighbors graph as a connectivity constraint reduces the effective search space, making HAC feasible for n up to 100,000 with sklearn.

📊 Production Insight

Default to HDBSCAN for most clustering tasks. Reserve scipy's HAC for small datasets (n < 5,000) where you need the dendrogram. Use cuML for GPU-accelerated HAC on medium data.

🎯 Key Takeaway

Use scikit-learn's AgglomerativeClustering with connectivity constraints for production. For large data, switch to HDBSCAN or BIRCH. Always standardize features and evaluate with silhouette score on a validation set.

● Production incidentPOST-MORTEMseverity: high

The Dendrogram That Broke the Pipeline: A Hierarchical Clustering War Story

Symptom

ETL job running hierarchical clustering on 50,000 customer records (100 features) crashed with OutOfMemoryError after 2 hours.

Assumption

The team assumed scikit-learn's AgglomerativeClustering would handle 50k points efficiently because 'it's a standard library.'

Root cause

The naive implementation computes a full O(n²) distance matrix (50k² = 2.5B entries), requiring ~20GB of memory (double-precision floats). The pipeline had only 16GB allocated.

Fix

Switched to using SLINK (single-linkage) via fastcluster library, which runs in O(n²) time and O(n) memory. Also reduced feature dimensionality from 100 to 20 using PCA, and sampled 10k points for dendrogram visualization.

Key lesson

Always profile memory and time complexity before running hierarchical clustering on production data.
Use specialized libraries (fastcluster, scipy.cluster.hierarchy) that offer O(n²) algorithms for single/complete linkage.
Dimensionality reduction and sampling are essential preprocessing steps for hierarchical clustering at scale.

Production debug guideCommon symptoms and immediate actions4 entries

Symptom · 01

Pipeline OOM (Out of Memory) during clustering

→

Fix

Reduce dataset size via sampling or dimensionality reduction. Switch to memory-efficient linkage (SLINK). Use incremental clustering (e.g., BIRCH).

Symptom · 02

Clusters look like long chains (single-linkage effect)

→

Fix

Switch to complete or average linkage. Check if data has noise points causing chaining. Consider DBSCAN if density-based clusters are expected.

Symptom · 03

Dendrogram shows no clear cluster separation

→

Fix

Try different distance metrics (cosine for text, Mahalanobis for correlated features). Normalize data. Use silhouette score to evaluate cut height.

Symptom · 04

Clustering results are non-deterministic across runs

→

Fix

Hierarchical clustering is deterministic for given data and parameters. Check for data shuffling, floating-point precision issues, or ties in distance matrix. Set random_state if using approximate methods.

★ Hierarchical Clustering Quick Debug Cheat SheetImmediate commands and fixes for common production issues.

Memory blowup with AgglomerativeClustering−

Immediate action

Check n_samples and n_features; reduce via sampling or PCA.

Commands

from sklearn.cluster import AgglomerativeClustering
model = AgglomerativeClustering(n_clusters=None, distance_threshold=0, compute_full_tree='auto')

from fastcluster import linkage
Z = linkage(X, method='single', metric='euclidean')  # O(n²) memory

Fix now

Use fastcluster.linkage with method='single' or 'complete' to avoid O(n²) memory. For large data, use sklearn.cluster.Birch.

Dendrogram cut yields too many/few clusters+

Clusters are not interpretable (e.g., all points in one cluster)+

Hierarchical Clustering vs Other Clustering Methods

Feature	Hierarchical (Agglomerative)	K-Means	DBSCAN	Gaussian Mixture
Number of clusters (k)	Not required (cut dendrogram)	Must specify k	Not required (eps/minPts)	Must specify k
Cluster shape	Depends on linkage (can be arbitrary)	Spherical only	Arbitrary (density-based)	Ellipsoidal (covariance)
Scalability	O(n²) to O(n³), poor for large n	O(n·k·d), good for large n	O(n log n) with index, good	O(n·k·d²), moderate
Output	Dendrogram + clusters	Cluster assignments	Cluster + noise labels	Soft probabilities
Handles noise	Sensitive (single-linkage)	Sensitive (outliers pull centroids)	Robust (noise points)	Sensitive (outliers affect covariance)

⚙ Quick Reference

8 commands from this guide

File	Command / Code	Purpose
iothecodeforgehierarchical_basics.py	from scipy.cluster.hierarchy import dendrogram, linkage, fcluster	What is Hierarchical Clustering? Definition and Core Concept
iothecodeforgeagglomerative_vs_divisive.py	from scipy.cluster.hierarchy import linkage, fcluster	Agglomerative vs Divisive
iothecodeforgedistance_metrics_demo.py	from scipy.spatial.distance import euclidean, cityblock, cosine, mahalanobis, ha...	Distance Metrics
iothecodeforgelinkage_comparison.py	from scipy.cluster.hierarchy import linkage, dendrogram, cophenet	Linkage Criteria
iothecodeforgehierarchicaldendrogram_cut.py	from scipy.cluster.hierarchy import linkage, dendrogram, fcluster, cophenet	The Dendrogram
iothecodeforgehierarchicalcomplexity_benchmark.py	from scipy.cluster.hierarchy import linkage	Complexity and Scalability
iothecodeforgehierarchicaldebug_checks.py	from scipy.cluster.hierarchy import linkage, fcluster	Production Debugging
iothecodeforgehierarchicalbest_practices_2026.py	from sklearn.cluster import AgglomerativeClustering	Hierarchical Clustering

Key takeaways

Hierarchical clustering produces a dendrogram, enabling flexible cluster selection without pre-specifying k.

Agglomerative clustering is more common; divisive is computationally expensive (O(2^n) exhaustive).

Linkage choice (single, complete, average, Ward) dramatically affects cluster shape and sensitivity to noise.

Time complexity ranges from O(n²) (SLINK) to O(n³) (naive), making it unsuitable for large datasets without optimization.

Distance metric must match data type

Euclidean for continuous, cosine for text, Hamming for categorical.

Production issues include memory blowup, dendrogram misinterpretation, and metric-linkage incompatibility.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

Explain the difference between single-linkage and complete-linkage clust...

Q02SENIOR

How does the time complexity of hierarchical clustering affect its use i...

Q03JUNIOR

What is a dendrogram and how do you interpret it?

Q01 of 03SENIOR

Explain the difference between single-linkage and complete-linkage clustering. When would you use each?

ANSWER

Single-linkage merges clusters based on the minimum distance between any two points in the clusters, leading to 'chaining' and elongated clusters. Complete-linkage uses the maximum distance, producing more compact, spherical clusters. Use single-linkage for non-spherical shapes or when you expect elongated clusters; use complete-linkage for well-separated, compact clusters. Single-linkage is more sensitive to noise.

FAQ · 4 QUESTIONS

Frequently Asked Questions

What is the difference between agglomerative and divisive hierarchical clustering?

How do I choose the number of clusters from a dendrogram?

What is the best distance metric for hierarchical clustering?

Why is hierarchical clustering slow on large datasets?

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Notes here come from systems that actually shipped.

✓ Verified

production tested

July 15, 2026

last updated

2,439

articles · all by Naren

🔥

That's Algorithms. Mark it forged?

10 min read · try the examples if you haven't