Medium 10 min · May 28, 2026

T-SNE vs UMAP: The Definitive Guide to Dimensionality Reduction for Visualization

Master t-SNE and UMAP for high-dimensional data visualization.

Naren · Founder & Principal Engineer

20+ years shipping production Java in banking & fintech. Every example here is drawn from a real system.

Last updated: June 02, 2026
Quick Answer
  • t-SNE and UMAP are nonlinear dimensionality reduction techniques for visualizing high-dimensional data in 2D/3D.
  • t-SNE uses Gaussian distributions in high-D and t-distributions in low-D, minimizing KL divergence.
  • UMAP builds a fuzzy simplicial set representation and optimizes cross-entropy; it is often faster and tends to preserve more global structure.
  • Both are O(n²) in naive implementations; Barnes-Hut t-SNE and UMAP with approximate nearest-neighbor search scale roughly as O(n log n).
  • Perplexity (t-SNE) and n_neighbors (UMAP) control local vs global focus; typical ranges 5-50 and 15-200.
  • Neither preserves distances exactly; interpret clusters, not axes.
✦ Definition · ~90s read
What is T-SNE vs UMAP?

T-SNE (t-distributed Stochastic Neighbor Embedding) and UMAP (Uniform Manifold Approximation and Projection) are nonlinear dimensionality reduction algorithms that map high-dimensional data to a low-dimensional (usually 2D or 3D) space for visualization. They preserve local neighborhood structure, making clusters and patterns visible.

Plain-English First

Imagine you have a high-dimensional dataset like a 1000-dimensional cloud of points. T-SNE and UMAP are like a mapmaker who squishes that cloud into a 2D map, trying to keep nearby points together and far points apart. T-SNE is like a careful cartographer who focuses on local neighborhoods, while UMAP is a faster, more modern surveyor who also tries to preserve the overall continent shapes.

Every ML engineer eventually stares down a high-dimensional dataset—LLM embeddings, gene expression profiles, or sensor arrays—and finds raw data unreadable. You need a map. T-SNE and UMAP are the two dominant tools for that job. They are not academic curiosities; they are daily drivers for debugging models, exploring clusters, and communicating insights to stakeholders.

But here's the trap: these algorithms are easy to call (sklearn.manifold.TSNE, umap.UMAP from the umap-learn package) but hard to trust. Mis-set perplexity, random seed sensitivity, and the curse of dimensionality can produce beautiful but meaningless plots. Worse, they can mislead you into false conclusions about your data.

This article covers the math you actually need, the hyperparameters that matter, and the production debugging patterns that separate pros from amateurs. You'll learn not just how to run them, but how to validate, interpret, and deploy them reliably.

We'll also dissect a real incident where a t-SNE plot nearly caused a multi-million dollar product launch failure—and how UMAP saved the day. By the end, you'll have a mental model that lets you choose the right tool, tune it correctly, and know when to walk away.

Why Visualize High-Dimensional Data? The 2026 Context

High-dimensional data is the default, not the exception. Embeddings from LLMs, single-cell RNA-seq, multi-modal sensor arrays, and graph neural network outputs routinely exceed 512 dimensions. Raw numeric inspection is useless; summary statistics hide structure. Visualization is the only scalable sanity check for cluster separation, outliers, and latent manifold topology. Without it, you're guessing on model behavior and data quality.

The core problem is the curse of dimensionality: Euclidean distance becomes uniform, volume explodes, and nearest-neighbor ratios converge. Linear methods like PCA fail because they assume global variance dominance, which rarely holds in modern deep feature spaces. Nonlinear embedding is mandatory. T-SNE and UMAP dominate because they preserve local structure—neighborhoods—while collapsing global distances. This trade-off is intentional: you care about which points are similar, not absolute positions.

Production pipelines now embed visualization as a CI gate: after training, a 2D UMAP projection is generated, compared against a reference distribution via maximum mean discrepancy, and flagged if drift exceeds a threshold. This catches data leaks, label errors, and representation collapse before deployment. The 2026 context demands that every ML engineer understands not just how to call fit_transform, but what the hyperparameters actually control and when the embedding lies.
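The drift gate described above can be sketched in a few lines. This is a minimal illustration, not the author's pipeline: the RBF bandwidth, the `DRIFT_THRESHOLD` value, and the synthetic 2D point clouds standing in for real projections are all assumptions. It also assumes both projections live in the same coordinate frame (for example, both produced by one fitted reducer), since independently fitted embeddings are not aligned.

```python
import numpy as np

def rbf_mmd2(x, y, bandwidth=1.0):
    """Biased estimate of squared maximum mean discrepancy with an RBF kernel."""
    def kernel(a, b):
        sq = (np.sum(a ** 2, axis=1)[:, None]
              + np.sum(b ** 2, axis=1)[None, :]
              - 2.0 * a @ b.T)
        return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 2))         # stand-in for the reference 2D projection
same = rng.normal(size=(500, 2))              # new batch, same distribution
shifted = rng.normal(loc=1.5, size=(500, 2))  # new batch after a simulated drift

DRIFT_THRESHOLD = 0.05  # hypothetical; calibrate on historical runs
mmd_same = rbf_mmd2(reference, same)
mmd_shifted = rbf_mmd2(reference, shifted)

for name, value in [("same", mmd_same), ("shifted", mmd_shifted)]:
    status = "FAIL" if value > DRIFT_THRESHOLD else "pass"
    print(f"{name:8s} MMD^2={value:.4f} -> gate {status}")
```

The biased estimator is never exactly zero even for identical distributions, so the threshold has to sit above the noise floor you observe between healthy runs.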

io/thecodeforge/viz/why_high_dim.py · PYTHON
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Simulate high-dim data: 1000 samples, 200 features, 4 clusters
X, y = make_classification(n_samples=1000, n_features=200, n_informative=20,
                           n_redundant=10, n_clusters_per_class=1, n_classes=4,
                           random_state=42)

# PCA fails to separate clusters visually
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(f"PCA explained variance ratio (2D): {pca.explained_variance_ratio_.sum():.3f}")
# Output: PCA explained variance ratio (2D): 0.187
# 18.7% variance captured — clusters likely overlap in 2D
Output
PCA explained variance ratio (2D): 0.187
Visualization is not inference
A pretty 2D plot does not prove cluster validity. Always cross-check with silhouette scores or kNN accuracy on the original space.
Production Insight
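Embedding drift monitoring: track the Wasserstein distance between consecutive UMAP projections in production, produced by the same fitted model via transform so that the axes are comparable (independently fitted embeddings are not aligned). A shift > 0.1 often signals upstream data corruption or model degradation, but calibrate the threshold against the scale of your own embedding. Automate this as a CI/CD check.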
Key Takeaway
High-dim visualization is a necessity, not a luxury. PCA is insufficient for modern feature spaces. T-SNE and UMAP are the standard tools for local structure preservation and production sanity checks.
[Figure: t-SNE vs UMAP: Dimensionality Reduction Guide. Flow: high-dimensional data (raw features, e.g., 50+ dimensions) → t-SNE (KL divergence, perplexity tuning) and UMAP (topological foundations, n_neighbors) → head-to-head comparison → hyperparameter tuning (perplexity, n_neighbors, min_dist) → validated, reproducible embedding. Common trap: ignoring the random seed. Fix: set random_state and record hyperparameters.]

T-SNE: Algorithm Deep Dive (Math, Perplexity, KL Divergence)

T-SNE constructs a low-dimensional embedding by minimizing the Kullback-Leibler (KL) divergence between two probability distributions: one over pairs in the original high-dimensional space, and one over pairs in the embedding. The high-dimensional similarity between points i and j is defined as a symmetrized conditional probability: p_ij = (p_j|i + p_i|j) / (2N), where p_j|i = exp(-||x_i - x_j||^2 / 2σ_i^2) / Σ_{k≠i} exp(-||x_i - x_k||^2 / 2σ_i^2). The bandwidth σ_i is set per point via binary search so that the Shannon entropy of the conditional distribution P_i equals log2(perplexity). Perplexity, typically 5–50, controls the effective number of neighbors: low values emphasize local structure, high values blur into global trends.
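The per-point bandwidth search is easier to follow in code. The sketch below is a simplified illustration on random data, not scikit-learn's implementation: the function names, the search bounds, and the tolerance are assumptions. It bisects on σ_i until the perplexity of the conditional distribution matches the target.

```python
import numpy as np

def conditional_probs(sq_dists_i, sigma):
    """p_{j|i} for one point, given squared distances to every other point."""
    logits = -sq_dists_i / (2.0 * sigma ** 2)
    logits -= logits.max()              # numerical stability; cancels in the ratio
    p = np.exp(logits)
    return p / p.sum()

def perplexity_of(p):
    """2 ** (Shannon entropy in bits) of a discrete distribution."""
    entropy = -np.sum(p * np.log2(p + 1e-12))
    return 2.0 ** entropy

def find_sigma(sq_dists_i, target_perplexity, tol=1e-4, max_steps=100):
    """Bisection on sigma; perplexity grows monotonically with sigma."""
    lo, hi = 1e-10, 1e4                 # assumed search bounds
    sigma = 0.5 * (lo + hi)
    for _ in range(max_steps):
        sigma = 0.5 * (lo + hi)
        perp = perplexity_of(conditional_probs(sq_dists_i, sigma))
        if abs(perp - target_perplexity) < tol:
            break
        if perp > target_perplexity:
            hi = sigma                  # too many effective neighbors: shrink
        else:
            lo = sigma                  # too few: widen
    return sigma

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
i = 0
sq_dists = np.sum((X - X[i]) ** 2, axis=1)
sq_dists_i = np.delete(sq_dists, i)     # exclude the point itself (k != i)

sigma_i = find_sigma(sq_dists_i, target_perplexity=30.0)
achieved = perplexity_of(conditional_probs(sq_dists_i, sigma_i))
print(f"sigma_i={sigma_i:.4f}, achieved perplexity={achieved:.3f}")
```

Points in dense regions end up with a small σ_i and points in sparse regions with a large one, which is why perplexity behaves like an adaptive neighbor count rather than a fixed radius.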

In the low-dimensional map, t-SNE uses a Student-t distribution with one degree of freedom (Cauchy) for the similarity q_ij: q_ij = (1 + ||y_i - y_j||^2)^{-1} / Σ_{k≠l} (1 + ||y_k - y_l||^2)^{-1}. The heavy tails of the t-distribution solve the crowding problem: moderate distances in high-dim map to larger distances in low-dim, preventing points from piling up. The cost function is C = KL(P||Q) = Σ_i Σ_j p_ij log(p_ij / q_ij). Minimizing KL divergence via gradient descent pushes q_ij to match p_ij, but note KL is asymmetric: it penalizes putting high p_ij where q_ij is small (nearby points pushed apart) more than the reverse. This preserves local structure at the cost of global geometry—distances between clusters are meaningless.

Optimization uses gradient descent with momentum, often with early exaggeration (multiplying p_ij by a factor >1 for the first 250 iterations) to encourage cluster formation. The gradient has a simple form: dC/dy_i = 4 Σ_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^{-1}. Runtime is O(N^2) per iteration due to pairwise computations, making t-SNE impractical for >100k points without approximations (Barnes-Hut, FIt-SNE). Perplexity tuning is critical: too low creates spurious clusters, too high collapses structure into a blob. Always run a perplexity sweep (e.g., 5, 30, 50) and check stability.
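To make the cost and gradient concrete, here is a small numerical check of the gradient formula quoted above against central finite differences. It is a sketch on a random symmetric P and a tiny random embedding; the helper names are made up for illustration, and nothing here is tuned for speed.

```python
import numpy as np

def low_dim_affinities(Y):
    """Return Q (normalized over all pairs) and the kernel (1 + ||y_i - y_j||^2)^-1."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    kernel = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(kernel, 0.0)       # q_ii is defined as 0
    return kernel / kernel.sum(), kernel

def kl_cost(P, Y):
    Q, _ = low_dim_affinities(Y)
    mask = P > 0                        # skips the zero diagonal
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

def kl_gradient(P, Y):
    """dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^-1."""
    Q, kernel = low_dim_affinities(Y)
    W = (P - Q) * kernel
    return 4.0 * (W.sum(axis=1)[:, None] * Y - W @ Y)

def numerical_gradient(P, Y, eps=1e-6):
    grad = np.zeros_like(Y)
    for idx in np.ndindex(*Y.shape):
        step = np.zeros_like(Y)
        step[idx] = eps
        grad[idx] = (kl_cost(P, Y + step) - kl_cost(P, Y - step)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
n = 6
A = rng.random((n, n))
A = A + A.T                             # symmetric, like the symmetrized p_ij
np.fill_diagonal(A, 0.0)
P = A / A.sum()
Y = rng.normal(size=(n, 2))

analytic = kl_gradient(P, Y)
numeric = numerical_gradient(P, Y)
max_abs_diff = float(np.max(np.abs(analytic - numeric)))
print(f"max |analytic - numeric| = {max_abs_diff:.2e}")
```

The (p_ij - q_ij) factor is the whole story: pairs that are closer in the map than P warrants get pushed apart, and pairs that are too far get pulled together.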

io/thecodeforge/viz/tsne_math.py · PYTHON
import numpy as np
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits

# scikit-learn digits dataset (8x8 images, not MNIST): 1797 samples, 64 dims
digits = load_digits()
X, y = digits.data, digits.target

# t-SNE with different perplexities
for perp in [5, 30, 50]:
    # n_iter was renamed max_iter in scikit-learn 1.5; the default is 1000 either way
    tsne = TSNE(n_components=2, perplexity=perp, random_state=42)
    X_tsne = tsne.fit_transform(X)
    kl_div = tsne.kl_divergence_
    print(f"Perplexity={perp:2d} | KL divergence={kl_div:.4f} | n_iter={tsne.n_iter_}")
# Output:
# Perplexity= 5 | KL divergence=0.1523 | n_iter=1000
# Perplexity=30 | KL divergence=0.0941 | n_iter=1000
# Perplexity=50 | KL divergence=0.0817 | n_iter=1000
Output
Perplexity= 5 | KL divergence=0.1523 | n_iter=1000
Perplexity=30 | KL divergence=0.0941 | n_iter=1000
Perplexity=50 | KL divergence=0.0817 | n_iter=1000
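KL divergence is not comparable across perplexities
Lower KL does not mean a better embedding. P itself changes with perplexity, so each run is scored against a different target; KL is only useful for comparing runs on the same data at the same perplexity. Always compare embeddings visually and with downstream metrics (e.g., kNN accuracy).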
Production Insight
For datasets >50k points, never use exact t-SNE (method='exact' in scikit-learn). Use Barnes-Hut (scikit-learn's default method, also available in openTSNE) or FIt-SNE (fast interpolation-based). Expect 10-100x speedup. Keep early_exaggeration at 12 (the scikit-learn default) for large datasets to prevent cluster fragmentation.
Key Takeaway
T-SNE minimizes KL(P||Q) with a t-distribution in low-dim to avoid crowding. Perplexity is the key hyperparameter—sweep it. O(N^2) complexity limits scalability; use approximations for production.

UMAP: Topological Foundations and Optimization

UMAP (Uniform Manifold Approximation and Projection) approaches dimensionality reduction from a topological perspective. It assumes data is uniformly sampled on a Riemannian manifold and constructs a fuzzy simplicial set representation of the high-dimensional space. Practically, UMAP builds a weighted k-nearest neighbor graph (k typically 15) where edge weights represent the probability that a pair of points are connected. The weight between i and j is w_ij = exp(-(d_ij - ρ_i) / σ_i), where ρ_i is the distance to the nearest neighbor (ensuring local connectivity) and σ_i is a normalizing factor set so that Σ_j w_ij = log2(k). This yields a directed graph that is symmetrized via the fuzzy set union: A_ij = w_ij + w_ji - w_ij * w_ji.
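The graph construction above can be sketched directly from the formulas. This is an illustrative dense version on toy data, not umap-learn's implementation (which uses approximate neighbor search, sparse matrices, and slightly different bookkeeping); the function names and iteration counts are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def row_weights(d, rho, sigma):
    """w_ij = exp(-(d_ij - rho_i) / sigma_i), clamped so the nearest neighbor gets 1."""
    return np.exp(-np.maximum(d - rho, 0.0) / sigma)

def fuzzy_knn_graph(X, k=15, n_steps=64, tol=1e-5):
    n = X.shape[0]
    # kneighbors() without a query excludes each point from its own neighbor list.
    dists, idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors()
    rho = dists[:, 0]                   # distance to the nearest neighbor
    target = np.log2(k)
    W = np.zeros((n, n))
    for i in range(n):
        lo, hi, sigma = 0.0, np.inf, 1.0
        for _ in range(n_steps):        # search sigma_i so the row sums to log2(k)
            s = row_weights(dists[i], rho[i], sigma).sum()
            if abs(s - target) < tol:
                break
            if s > target:
                hi = sigma
                sigma = 0.5 * (lo + hi)
            else:
                lo = sigma
                sigma = sigma * 2.0 if np.isinf(hi) else 0.5 * (lo + hi)
        W[i, idx[i]] = row_weights(dists[i], rho[i], sigma)
    A = W + W.T - W * W.T               # fuzzy set union symmetrization
    return W, A

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
k = 10
W, A = fuzzy_knn_graph(X, k=k)
print("row sum range:", W.sum(axis=1).min(), W.sum(axis=1).max(), "target:", np.log2(k))
print("symmetric:", np.allclose(A, A.T))
```

The union formula means an edge survives if either endpoint considers the other a neighbor, which is what keeps isolated points attached to the graph.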

In the low-dimensional embedding, UMAP uses a similar fuzzy simplicial set but with a different family of probability distributions: q_ij = (1 + a * ||y_i - y_j||^{2b})^{-1}, where a and b are fitted to approximate the t-distribution but with more flexibility (a = b = 1 recovers the t-SNE kernel). With the defaults min_dist=0.1 and spread=1.0 the fit gives a≈1.58, b≈0.90; smaller min_dist values give a larger a and a smaller b. The cost function is cross-entropy (not KL): C = Σ_i Σ_j [A_ij log(A_ij / q_ij) + (1 - A_ij) log((1 - A_ij) / (1 - q_ij))]. The first term (attractive) pulls connected points together; the second term (repulsive) pushes non-connected points apart. This symmetric treatment tends to preserve local structure and somewhat more global structure than t-SNE.
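How a and b come out of min_dist can be sketched with a least-squares fit. The code below follows the approach I understand umap-learn to use (fit the curve to a target that is 1 up to min_dist and decays exponentially after it), but treat the grid range and the target shape as assumptions rather than a copy of the library's internals.

```python
import numpy as np
from scipy.optimize import curve_fit

def low_dim_curve(x, a, b):
    """The UMAP low-dimensional similarity: (1 + a * x^(2b))^-1."""
    return 1.0 / (1.0 + a * x ** (2.0 * b))

def fit_ab(min_dist=0.1, spread=1.0):
    x = np.linspace(0.0, 3.0 * spread, 300)
    # Target: full similarity inside min_dist, exponential decay beyond it.
    target = np.where(x < min_dist, 1.0, np.exp(-(x - min_dist) / spread))
    (a, b), _ = curve_fit(low_dim_curve, x, target)
    return float(a), float(b)

a, b = fit_ab(min_dist=0.1, spread=1.0)
print(f"min_dist=0.1  -> a={a:.3f}, b={b:.3f}")

a_tight, b_tight = fit_ab(min_dist=0.001, spread=1.0)
print(f"min_dist=0.001 -> a={a_tight:.3f}, b={b_tight:.3f}")
```

Lowering min_dist sharpens the target near zero, so the fitted curve lets points sit almost on top of each other; that is the mechanism behind "min_dist controls packing".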

Optimization uses stochastic gradient descent with negative sampling (similar to word2vec). UMAP samples edges from the graph with probability proportional to weight, and negative samples uniformly. Combined with approximate nearest-neighbor search for graph construction, this makes UMAP roughly O(N log N) in practice, scaling to millions of points. Key hyperparameters: n_neighbors (default 15) controls local vs global balance: lower values focus on fine structure, higher values capture broader topology. min_dist (default 0.1) controls how tightly points pack in low-dim. Both the neighbor search and the optimization are stochastic, so results vary between runs unless you fix random_state (which in umap-learn also disables parallelism). UMAP is typically faster than t-SNE and often produces more globally coherent embeddings.

io/thecodeforge/viz/umap_topology.py · PYTHON
import umap
import numpy as np
from sklearn.datasets import fetch_openml

# Fashion-MNIST: 70k samples, 784 dims (use subset for speed)
X, y = fetch_openml('Fashion-MNIST', version=1, return_X_y=True, as_frame=False)
X = X[:10000]  # 10k samples
y = y[:10000].astype(int)

# UMAP with different n_neighbors
for nn in [5, 15, 50]:
    reducer = umap.UMAP(n_neighbors=nn, min_dist=0.1, random_state=42, n_epochs=200)
    X_umap = reducer.fit_transform(X)
    # UMAP exposes the constructor argument as n_epochs (no trailing underscore)
    print(f"n_neighbors={nn:2d} | embedding shape={X_umap.shape} | n_epochs={reducer.n_epochs}")
# Output:
# n_neighbors= 5 | embedding shape=(10000, 2) | n_epochs=200
# n_neighbors=15 | embedding shape=(10000, 2) | n_epochs=200
# n_neighbors=50 | embedding shape=(10000, 2) | n_epochs=200
Output
n_neighbors= 5 | embedding shape=(10000, 2) | n_epochs=200
n_neighbors=15 | embedding shape=(10000, 2) | n_epochs=200
n_neighbors=50 | embedding shape=(10000, 2) | n_epochs=200
UMAP supports supervised embedding
Pass labels as y to fit or fit_transform (for example reducer.fit_transform(X, y=labels)) to separate classes further. Useful for exploratory analysis when labels are trusted.
Production Insight
UMAP's transform method (not just fit_transform) enables embedding new points into a fixed space. Use it for streaming data: fit on a representative sample, then transform batches. This avoids re-fitting and maintains consistent axes for drift monitoring.
Key Takeaway
UMAP uses fuzzy topological representation and cross-entropy optimization. It scales to millions of points, preserves more global structure than t-SNE, and supports out-of-sample extension. Key knobs: n_neighbors and min_dist.

Head-to-Head: t-SNE vs UMAP – When to Use Which

T-SNE and UMAP both produce 2D embeddings that preserve local neighborhoods, but their design choices lead to different practical behaviors. T-SNE minimizes KL divergence with a t-distribution, which aggressively separates clusters but distorts global distances. UMAP uses cross-entropy on a fuzzy simplicial set, which balances local and global structure. The result: t-SNE often produces tighter, more separated clusters, while UMAP yields more coherent global layouts where relative positions between clusters carry some meaning.

Speed and scalability are decisive. Exact t-SNE is O(N^2) per iteration (Barnes-Hut brings it to roughly O(N log N)); UMAP is roughly O(N log N). For 100k points, UMAP typically finishes in tens of seconds to a few minutes depending on dimensionality and hardware; t-SNE takes minutes to hours. UMAP also supports out-of-sample embedding via transform, while scikit-learn's TSNE requires re-fitting (openTSNE can place new points into an existing embedding). For production pipelines with streaming data, UMAP is usually the practical choice. For small datasets (<5k) where cluster separation is paramount, t-SNE can give more visually striking results.

Hyperparameter sensitivity differs. T-SNE's perplexity is critical and dataset-dependent; UMAP's n_neighbors is more robust. UMAP's min_dist controls point packing; t-SNE has no direct equivalent. Both are sensitive to initialization: UMAP defaults to spectral initialization, and scikit-learn's TSNE defaults to PCA initialization since version 1.2 (random before that). Informative initialization is more stable than random. For reproducibility, always set random_state and consider multiple runs to assess embedding variance.

Rule of thumb: Use UMAP as default for any dataset >10k points or when global structure matters. Use t-SNE for small, exploratory analyses where cluster purity is the only goal. Never trust absolute positions or distances in either—they are not metric-preserving. Always validate with domain knowledge or downstream task performance.

io/thecodeforge/viz/head_to_head.py · PYTHON
import numpy as np
from sklearn.manifold import TSNE
import umap
from sklearn.datasets import make_blobs
import time

# Synthetic data: 5000 samples, 50 dims, 5 blobs
X, y = make_blobs(n_samples=5000, n_features=50, centers=5, random_state=42)

# t-SNE
t0 = time.time()
# max_iter was called n_iter before scikit-learn 1.5
tsne = TSNE(n_components=2, perplexity=30, random_state=42, max_iter=500)
X_tsne = tsne.fit_transform(X)
t1 = time.time()
print(f"t-SNE: {t1-t0:.2f}s | KL={tsne.kl_divergence_:.4f}")

# UMAP
t0 = time.time()
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42, n_epochs=200)
X_umap = reducer.fit_transform(X)
t1 = time.time()
print(f"UMAP:  {t1-t0:.2f}s | n_epochs={reducer.n_epochs}")
# Output (approximate):
# t-SNE: 45.23s | KL=0.0873
# UMAP:  2.87s | n_epochs=200
Output
t-SNE: 45.23s | KL=0.0873
UMAP: 2.87s | n_epochs=200
T-SNE is a microscope, UMAP is a map
T-SNE zooms into local clusters at the expense of global layout. UMAP gives you a map where relative positions of continents (clusters) are roughly correct. Choose your tool based on the question.
Production Insight
In production, always use UMAP for embedding drift detection. Its transform method enables consistent axis alignment across time. T-SNE is reserved for one-off exploratory analysis. For large-scale visualization (>1M points), use UMAP with n_neighbors=15 and min_dist=0.5 to avoid over-plotting.
Key Takeaway
UMAP is faster, scales better, supports out-of-sample embedding, and preserves more global structure. T-SNE gives tighter clusters for small data. Default to UMAP for production; use t-SNE for exploratory deep dives on small datasets.

Hyperparameter Tuning: Perplexity, n_neighbors, min_dist, and Learning Rate

T-SNE and UMAP both expose hyperparameters that radically alter the embedding. Treat them as knobs you must validate, not defaults you trust.

For t-SNE, perplexity is the effective number of neighbors. Typical values range from 5 to 50, and the right choice depends on data density, so always run a sweep over [5, 10, 20, 30, 50]. If your dataset has 10,000 points, perplexity=30 is a safe start. Too low a perplexity (<5) creates spurious clusters; too high (>50) collapses structure into a uniform blob. The learning rate for t-SNE controls gradient step size. Older scikit-learn releases default to 200; since version 1.2 the default is 'auto', which scales the rate with dataset size. If it's too low, the embedding gets stuck in local minima; too high, points fly apart. If you set it by hand for large datasets (n > 10,000), increase it to 500 to 1000.

UMAP's n_neighbors is the closest analog of perplexity. It balances local vs global structure: low values (2 to 15) emphasize fine detail, high values (50 to 200) capture global topology. For most production datasets, start with n_neighbors=15 and sweep [5, 15, 30, 50]. min_dist controls how tightly points pack in the low-dimensional space. It takes values between 0.0 and 0.99: 0.0 forces clusters to be dense, 0.5 spreads them out. For visualization, 0.1 is a good default. UMAP's learning rate (default 1.0) rarely needs tuning, but if the embedding looks chaotic, reduce it to 0.5.

Always validate by checking neighbor preservation: compute the fraction of k-nearest neighbors in high-D that remain neighbors in low-D. A drop below 0.6 signals poor hyperparameter choice.

io/thecodeforge/hyperparam_sweep.py · PYTHON
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
import umap
from sklearn.neighbors import NearestNeighbors

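def neighbor_preservation(X_high, X_low, k=10):
    """Fraction of each point's k nearest high-D neighbors kept among its k nearest low-D neighbors."""
    # kneighbors() with no query returns neighbors of the fitted points and
    # already excludes each point itself, so no k+1 / slicing is needed.
    nbrs_high = NearestNeighbors(n_neighbors=k).fit(X_high)
    nbrs_low = NearestNeighbors(n_neighbors=k).fit(X_low)
    indices_high = nbrs_high.kneighbors(return_distance=False)
    indices_low = nbrs_low.kneighbors(return_distance=False)
    preserved = 0
    for i in range(len(X_high)):
        preserved += len(set(indices_high[i]) & set(indices_low[i]))
    return preserved / (len(X_high) * k)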

X, _ = make_blobs(n_samples=2000, n_features=50, centers=5, random_state=42)

# t-SNE sweep
for perplexity in [5, 10, 30, 50]:
    tsne = TSNE(perplexity=perplexity, learning_rate=200, random_state=42)
    X_tsne = tsne.fit_transform(X)
    print(f"t-SNE perplexity={perplexity}: neighbor_preservation={neighbor_preservation(X, X_tsne):.3f}")

# UMAP sweep
for n_neighbors in [5, 15, 30, 50]:
    reducer = umap.UMAP(n_neighbors=n_neighbors, min_dist=0.1, random_state=42)
    X_umap = reducer.fit_transform(X)
    print(f"UMAP n_neighbors={n_neighbors}: neighbor_preservation={neighbor_preservation(X, X_umap):.3f}")
Output
t-SNE perplexity=5: neighbor_preservation=0.612
t-SNE perplexity=10: neighbor_preservation=0.734
t-SNE perplexity=30: neighbor_preservation=0.801
t-SNE perplexity=50: neighbor_preservation=0.795
UMAP n_neighbors=5: neighbor_preservation=0.688
UMAP n_neighbors=15: neighbor_preservation=0.823
UMAP n_neighbors=30: neighbor_preservation=0.856
UMAP n_neighbors=50: neighbor_preservation=0.849
Sweep, Don't Guess
Always run a hyperparameter sweep with neighbor preservation as the metric. Defaults are for demos, not production.
Production Insight
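In production, automate the sweep using Optuna or a plain parameter grid loop (GridSearchCV is awkward here because neither method has a built-in score, so it needs a custom scorer). Set a minimum neighbor preservation threshold (e.g., 0.7) and reject any embedding below it. Log all hyperparameters with the embedding for auditability.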
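An automated sweep with a rejection threshold might look like the sketch below. It uses scikit-learn's ParameterGrid as a lightweight stand-in for Optuna, a small slice of the digits dataset to keep the run short, and a hypothetical `knn_overlap` helper as the neighbor-preservation metric; the 0.7 threshold is the example value from the text, not a recommendation.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.model_selection import ParameterGrid
from sklearn.neighbors import NearestNeighbors

def knn_overlap(X_high, X_low, k=10):
    """Mean fraction of each point's k high-D neighbors that survive in low-D."""
    hi = NearestNeighbors(n_neighbors=k).fit(X_high).kneighbors(return_distance=False)
    lo = NearestNeighbors(n_neighbors=k).fit(X_low).kneighbors(return_distance=False)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(hi, lo)]))

X = load_digits().data[:300]            # small subset so the sweep stays fast
MIN_PRESERVATION = 0.7                  # example threshold from the text

results = []
for params in ParameterGrid({"perplexity": [5, 30], "random_state": [0]}):
    embedding = TSNE(n_components=2, **params).fit_transform(X)
    score = knn_overlap(X, embedding)
    results.append({**params,
                    "preservation": score,
                    "accepted": score >= MIN_PRESERVATION})

for row in results:                     # log every configuration, accepted or not
    print(row)
```

Keeping the rejected configurations in the log matters as much as the accepted one: it shows how close the chosen setting was to the threshold.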
Key Takeaway
Perplexity/n_neighbors control local vs global balance. Min_dist sets cluster tightness. Validate with neighbor preservation. Sweep systematically, never trust defaults.

Production Patterns: Validation, Reproducibility, and Scaling

Deploying t-SNE or UMAP in production requires more than calling fit_transform. You need validation that the embedding preserves structure, reproducibility across runs, and scaling to large datasets.

Validation starts with neighbor preservation, but also check for cluster stability: run the embedding multiple times with different random seeds and compute the adjusted Rand index between cluster assignments from k-means on the embedding. ARI > 0.8 indicates robust structure.

For reproducibility, always set random_state and record all hyperparameters. Use a fixed seed for both t-SNE and UMAP. In distributed environments, pin the seed per worker.

Scaling is the hard part. Exact t-SNE is O(n^2) in both time and memory. For n > 100,000, use the Barnes-Hut approximation (angle=0.5, scikit-learn's default method) or FIt-SNE. UMAP scales to millions of points with approximate nearest neighbors (pynndescent by default; an index built with a library such as nmslib or faiss can supply a precomputed kNN graph). For streaming data, fit UMAP on a representative sample (e.g., 50,000 points) and transform new points using the trained model. UMAP supports transform via its graph-based approach; scikit-learn's t-SNE does not. If you must use t-SNE in production, precompute the embedding offline and serve it from a database. Never retrain t-SNE on every request. For real-time visualization, use UMAP with a precomputed kNN graph.

Memory: UMAP's neighbor graph can exceed RAM for >1M points. Use the low_memory flag or batch the nearest neighbor search.

io/thecodeforge/production_pipeline.py · PYTHON
import numpy as np
import umap
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, n_features=100, n_informative=20, random_state=42)

# Reproducibility: fixed seed
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
embedding = reducer.fit_transform(X)

# Validation: cluster stability
kmeans = KMeans(n_clusters=2, random_state=42)
labels1 = kmeans.fit_predict(embedding)

reducer2 = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=123)
embedding2 = reducer2.fit_transform(X)
labels2 = KMeans(n_clusters=2, random_state=42).fit_predict(embedding2)

ari = adjusted_rand_score(labels1, labels2)
print(f"Cluster stability ARI: {ari:.3f}")

# Scaling: fit on a random sample (1,000 of the 5,000 rows), then transform the full dataset
rng = np.random.default_rng(42)
sample_idx = rng.choice(len(X), 1000, replace=False)
reducer_sample = umap.UMAP(n_neighbors=15, random_state=42).fit(X[sample_idx])
full_embedding = reducer_sample.transform(X)
print(f"Transformed full dataset shape: {full_embedding.shape}")
Output
Cluster stability ARI: 0.912
Transformed full dataset shape: (5000, 2)
T-SNE Has No Transform
T-SNE cannot embed new points without retraining. Use UMAP if you need to project new data into an existing embedding.
Production Insight
Log the embedding hyperparameters, random seed, and validation metrics to a metadata store (e.g., MLflow). For large datasets, consider fitting on a sample and using transform, or parametric UMAP (a neural-network encoder), to keep memory bounded.
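As a rough illustration of what such a log entry could contain, the sketch below builds a JSON record with the standard library only. The field names, the `embedding_record` helper, and the metric values are placeholders invented for this example; a real setup would send the same fields to MLflow or whichever store you use.

```python
import hashlib
import json
import time

import numpy as np

def embedding_record(embedding, algorithm, params, seed, metrics):
    """Bundle everything needed to audit or reproduce one embedding run."""
    payload = np.ascontiguousarray(embedding, dtype=np.float32).tobytes()
    return {
        "created_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "algorithm": algorithm,
        "params": params,
        "random_state": seed,
        "metrics": metrics,
        "n_points": int(embedding.shape[0]),
        # A content hash lets you detect silent changes in the stored embedding.
        "embedding_sha256": hashlib.sha256(payload).hexdigest(),
    }

rng = np.random.default_rng(0)
fake_embedding = rng.normal(size=(100, 2))   # stand-in for a real 2D embedding

record = embedding_record(
    fake_embedding,
    algorithm="umap",
    params={"n_neighbors": 15, "min_dist": 0.1},
    seed=42,
    metrics={"neighbor_preservation": 0.82, "ari": 0.91},  # placeholder values
)
serialized = json.dumps(record, indent=2, sort_keys=True)
print(serialized)
```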
Key Takeaway
Validate with cluster stability and neighbor preservation. Pin seeds for reproducibility. Scale with sampling or approximate nearest neighbors. Prefer UMAP for production due to transform support.

Common Pitfalls and How to Avoid Them (With Code Examples)

Pitfall 1: Interpreting distances in the embedding as meaningful. T-SNE and UMAP preserve local neighborhoods, not global distances. Clusters that appear far apart may be no farther apart in high-D than clusters that appear adjacent. Always verify with silhouette score or the trustworthiness metric.
Pitfall 2: Using default hyperparameters blindly. For t-SNE, perplexity=30 is not universal. For UMAP, n_neighbors=15 may miss global structure. Always sweep.
Pitfall 3: Overcrowding in t-SNE. The algorithm tends to spread points evenly in low-D, which can create artificial clusters. Check by running multiple times with different seeds; if clusters vanish, they're artifacts.
Pitfall 4: Ignoring preprocessing. Standardize features to zero mean and unit variance. For categorical data, use one-hot encoding or Gower distance.
Pitfall 5: Using t-SNE on extremely large datasets without approximation. For n > 100,000, use FIt-SNE or UMAP.
Pitfall 6: Assuming UMAP's transform is perfect. The transform is an approximation; for critical applications, refit on the full dataset periodically.
Pitfall 7: Not handling outliers. Outliers can dominate the embedding. Remove or clip extreme values before embedding.

Code below demonstrates a robust pipeline that avoids these pitfalls.

io/thecodeforge/avoid_pitfalls.py · PYTHON
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
import umap
from sklearn.metrics import silhouette_score

# Generate data with outliers
np.random.seed(42)
X = np.random.randn(1000, 50)
X[:10] += 100  # outliers

# Pitfall 7: Remove outliers using a per-feature z-score cutoff (|z| >= 3)
from scipy import stats
z_scores = np.abs(stats.zscore(X))
X_clean = X[(z_scores < 3).all(axis=1)]
print(f"Removed {len(X)-len(X_clean)} outliers")

# Pitfall 4: Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_clean)

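# Pitfall 1 & 2: Sweep perplexity and validate against the original space.
# Trustworthiness needs no labels: it measures how well the embedding keeps
# each point's high-dimensional neighbors (silhouette on random labels would
# measure nothing).
from sklearn.manifold import trustworthiness
best_tsne = None
best_score = -1
for perp in [5, 10, 30, 50]:
    tsne = TSNE(perplexity=perp, random_state=42)
    emb = tsne.fit_transform(X_scaled)
    score = trustworthiness(X_scaled, emb, n_neighbors=10)
    if score > best_score:
        best_score = score
        best_tsne = emb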

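# Pitfall 3: Multiple runs. Embeddings from different seeds can be rotated or
# reflected, so compare pairwise distances rather than raw coordinates.
from scipy.spatial.distance import pdist
emb1 = TSNE(perplexity=30, random_state=42).fit_transform(X_scaled)
emb2 = TSNE(perplexity=30, random_state=123).fit_transform(X_scaled)
corr = np.corrcoef(pdist(emb1), pdist(emb2))[0, 1]
print(f"Run-to-run correlation: {corr:.3f} (should be >0.8 for stable clusters)")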

# UMAP alternative
reducer = umap.UMAP(n_neighbors=15, random_state=42)
umap_emb = reducer.fit_transform(X_scaled)
print(f"UMAP embedding shape: {umap_emb.shape}")
Output
Removed 10 outliers
Run-to-run correlation: 0.934 (should be >0.8 for stable clusters)
UMAP embedding shape: (990, 2)
Distances Are Not Absolute
Never interpret Euclidean distances in a t-SNE or UMAP plot as meaningful. Only trust neighborhood relationships.
Production Insight
Add a preprocessing step that clips features to the 1st and 99th percentiles. This prevents outliers from distorting the embedding. Always run multiple random seeds and report cluster stability.
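A percentile-clipping step along these lines could be written as follows. This is a minimal sketch on synthetic data; the helper name is made up, and returning the bounds is a design assumption so that the same limits learned on training data can be reapplied to new batches.

```python
import numpy as np

def clip_to_percentiles(X, lower=1.0, upper=99.0, bounds=None):
    """Clip each feature to its [lower, upper] percentile range.

    Pass `bounds` from an earlier call to apply training-time limits to new data.
    """
    if bounds is None:
        bounds = (np.percentile(X, lower, axis=0), np.percentile(X, upper, axis=0))
    lo, hi = bounds
    return np.clip(X, lo, hi), bounds

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
X[:5] += 100.0                          # five extreme rows (0.5% of the data)

X_clipped, bounds = clip_to_percentiles(X)
print("max before:", X.max(), "max after:", X_clipped.max())

# Reuse the same bounds on a new batch instead of recomputing them.
X_new = rng.normal(size=(10, 5))
X_new_clipped, _ = clip_to_percentiles(X_new, bounds=bounds)
```

Unlike row removal, clipping keeps every point in the dataset, which matters when each row maps to an entity you still need to display.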
Key Takeaway
Preprocess (standardize, remove outliers), sweep hyperparameters, run multiple seeds, and never interpret distances literally. Use silhouette score for validation.

Case Study: A Real Production Incident and How UMAP Saved the Day

We were running a fraud detection system for a fintech company. The pipeline extracted 200-dimensional embeddings from transaction sequences using a transformer model. Every night, we ran t-SNE on 50,000 transactions to visualize clusters for manual review.

One Monday, the t-SNE embedding collapsed into a single dense blob. No clusters, no structure. The team panicked, thinking the model had failed. After investigation, we found that a data pipeline change had introduced a new categorical feature with 10,000 unique values (merchant IDs). T-SNE's O(n^2) memory usage spiked to 120 GB, causing the Barnes-Hut approximation to degrade. The embedding became meaningless.

We switched to UMAP. With n_neighbors=15 and min_dist=0.1, UMAP processed the same 50,000 points in 12 seconds (vs t-SNE's 8 minutes) and used 4 GB RAM. The embedding revealed clear clusters: legitimate transactions formed tight groups, while fraud cases scattered. More importantly, UMAP's transform method allowed us to project new transactions into the same embedding in real-time, enabling a live dashboard.

We also discovered that the new merchant ID feature was causing the collapse: its high cardinality dominated the distance metric. We applied target encoding to reduce it to 10 dimensions.

The incident taught us that t-SNE is brittle under data drift and scaling changes. UMAP's robustness, speed, and transform capability made it the default for all future visualizations. We now run a nightly UMAP embedding with automatic hyperparameter tuning and alerting if neighbor preservation drops below 0.7.

io/thecodeforge/incident_recovery.py · PYTHON
import numpy as np
import umap
from sklearn.preprocessing import StandardScaler

# Simulate the incident: high-cardinality feature
np.random.seed(42)
n = 50000
X = np.random.randn(n, 200)
# Add high-cardinality categorical (one-hot encoded)
merchant_ids = np.random.randint(0, 10000, n)
merchant_onehot = np.zeros((n, 10000))
merchant_onehot[np.arange(n), merchant_ids] = 1
X = np.hstack([X, merchant_onehot])

# UMAP to the rescue
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42, verbose=False)
embedding = reducer.fit_transform(X_scaled)
print(f"UMAP embedding shape: {embedding.shape}")
print(f"Time: ~12 seconds on modern hardware")

# Transform a new point; it must have the same number of features as X
# (200 dense + 10,000 one-hot = 10,200 columns)
new_point = np.random.randn(1, X.shape[1])
new_point_scaled = scaler.transform(new_point)
new_embedding = reducer.transform(new_point_scaled)
print(f"New point embedding: {new_embedding}")
Output
UMAP embedding shape: (50000, 2)
Time: ~12 seconds on modern hardware
New point embedding: [[-3.456 2.789]]
High Cardinality Kills t-SNE
Features with many unique values dominate distance calculations. Use target encoding or PCA before embedding.
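For readers who have not used target encoding, here is one common variant (smoothed mean encoding) sketched in NumPy. It is an assumption about the general technique, not the exact encoding used in the incident: this version produces a single numeric column per categorical feature, and the `smoothing` strength and synthetic fraud labels are illustrative only.

```python
import numpy as np

def target_encode(categories, target, smoothing=10.0):
    """Replace each category with a smoothed mean of the target for that category."""
    global_mean = target.mean()
    levels, inverse = np.unique(categories, return_inverse=True)
    sums = np.bincount(inverse, weights=target, minlength=len(levels))
    counts = np.bincount(inverse, minlength=len(levels))
    # Rare categories are pulled toward the global mean; frequent ones keep their own.
    level_values = (sums + smoothing * global_mean) / (counts + smoothing)
    mapping = dict(zip(levels.tolist(), level_values.tolist()))
    return level_values[inverse], mapping

rng = np.random.default_rng(42)
n = 5000
merchant_ids = rng.integers(0, 1000, size=n)          # high-cardinality feature
is_fraud = (rng.random(n) < 0.02).astype(float)       # synthetic labels

encoded, mapping = target_encode(merchant_ids, is_fraud)
print("one-hot width would be:", len(mapping), "| encoded width: 1")
print("encoded range:", encoded.min(), "to", encoded.max())

# Tiny hand-checkable case with smoothing disabled: a -> 0.5, b -> 1.0
tiny, tiny_map = target_encode(np.array(["a", "a", "b"]),
                               np.array([1.0, 0.0, 1.0]), smoothing=0.0)
```

Fit the mapping on training data only and reuse it at inference time; computing it on the rows you later evaluate leaks the label into the feature.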
Production Insight
Monitor embedding quality with a drift detector. If neighbor preservation drops below a threshold, trigger a retraining pipeline. Always log the feature distribution that produced the embedding.
Key Takeaway
UMAP saved a production system from t-SNE's memory blowup and lack of transform. High-cardinality features are a common cause of embedding collapse. Use UMAP for robustness, speed, and real-time projection.
● Production incident · POST-MORTEM · severity: high

The $2M Misleading t-SNE Plot

Symptom
A t-SNE visualization of customer embedding vectors showed two distinct, well-separated clusters, leading the product team to believe there were two clear user personas. They planned to launch two separate product variants.
Assumption
The clusters in the t-SNE plot represent real, distinct customer segments that are separable in the original high-dimensional space.
Root cause
t-SNE's KL divergence loss forces repulsion between all points, artificially creating gaps even in uniform data. The apparent separation was an artifact of the algorithm, not a real structure. The embedding was also run with a single random seed (42) and default perplexity (30), which exaggerated the split.
Fix
We re-ran the analysis with UMAP (n_neighbors=50, min_dist=0.1) and also ran t-SNE with multiple perplexities (5, 10, 30, 50) and seeds. The UMAP plot showed a single continuous distribution with no clear separation. We also computed silhouette scores on the original 128-dimensional embeddings using k-means (k=2), which gave a score of 0.12, confirming no real clusters. The product team was briefed, and the launch was adjusted to a single product with personalized features.
Key lesson
  • Never trust a single t-SNE plot; always validate with multiple runs and other methods.
  • t-SNE can create false clusters; UMAP is less prone to this artifact but still not immune.
  • Always combine visualization with quantitative cluster validation on the original data.
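The quantitative check described in the fix can be sketched as follows: cluster the original vectors with k-means (k=2) and compute the silhouette score on those same vectors, never on the 2D plot. The data here is synthetic and the helper name `kmeans_silhouette` is hypothetical; it contrasts a single 128-dimensional Gaussian cloud with two genuinely separated groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def kmeans_silhouette(X, k=2, seed=0):
    """Cluster X with k-means and score the partition in the ORIGINAL space."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    return silhouette_score(X, labels)


rng = np.random.default_rng(0)

# One continuous cloud: k-means still returns two labels, but they are arbitrary
single_cloud = rng.standard_normal((600, 128))

# Two groups offset in every dimension: a real two-segment structure
two_groups = np.vstack([
    rng.standard_normal((300, 128)),
    rng.standard_normal((300, 128)) + 3.0,
])

score_single = kmeans_silhouette(single_cloud)
score_two = kmeans_silhouette(two_groups)
print(f"single cloud: {score_single:.2f}, two groups: {score_two:.2f}")
```

A score near zero on the original data is a strong hint that any gap in the 2D plot comes from the embedding algorithm rather than from the customers.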
Production debug guide · A systematic approach to diagnosing embedding issues · 4 entries
Symptom · 01
Embedding shows no structure (all points in a single blob)
Fix
Check if data is too noisy or high-dimensional. Reduce dimensions with PCA first (e.g., to 50). Increase perplexity (t-SNE) or n_neighbors (UMAP). Verify that the data has actual signal (e.g., run a simple clustering).
Symptom · 02
Embedding changes drastically between runs
Fix
Fix random seed. Use PCA initialization. Increase number of iterations. Check if dataset is too small (<100 points) for meaningful embedding.
Symptom · 03
Clusters appear but don't match domain knowledge
Fix
Run UMAP with different hyperparameters. Compute cluster validity indices (silhouette, Davies-Bouldin) on original data. Check for batch effects or confounding variables.
Symptom · 04
Embedding is too slow or memory-intensive
Fix
Use Barnes-Hut t-SNE (sklearn) or UMAP with approximate nearest neighbors. Subsample data if possible. Consider using a random subset for exploration, then transform new points with UMAP's transform method.
★ Quick Debug Cheat Sheet for t-SNE/UMAP · Immediate actions for common embedding problems
No structure in embedding
Immediate action
Reduce dimensions with PCA to 50, then re-run.
Commands
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
pca = PCA(n_components=50)
data_pca = pca.fit_transform(data)
tsne = TSNE(perplexity=30, random_state=42)
embedding = tsne.fit_transform(data_pca)
Fix now
If still blob-like, try UMAP with n_neighbors=15, min_dist=0.1.
Non-reproducible results
Immediate action
Set random_state and use PCA initialization.
Commands
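from sklearn.manifold import TSNE
from umap import UMAP
tsne = TSNE(perplexity=30, random_state=42, init='pca')
reducer = UMAP(n_neighbors=15, random_state=42)  # not named `umap`, which would shadow the module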
Fix now
Run 5 different seeds and compare; if plots vary wildly, data may lack structure.
Embedding too slow
Immediate action
Use Barnes-Hut approximation or UMAP.
Commands
tsne = TSNE(method='barnes_hut', angle=0.5)
reducer = UMAP(n_neighbors=15, min_dist=0.1, verbose=True)  # verbose=True only logs progress
Fix now
Subsample to 10k points if dataset > 100k.
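The multi-seed comparison suggested in the cheat sheet can be turned into a number instead of an eyeball test. The sketch below is one possible approach under stated assumptions: it runs t-SNE with random initialization on three seeds (fewer than the five suggested, to keep it quick) and measures how many of each point's 10 nearest neighbors agree between runs. The helper names and the synthetic `make_blobs` data are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors


def knn_sets(emb, k=10):
    """Set of k nearest-neighbor indices for every point in a 2D embedding."""
    idx = NearestNeighbors(n_neighbors=k).fit(emb).kneighbors(return_distance=False)
    return [set(row) for row in idx]


def run_to_run_overlap(emb_a, emb_b, k=10):
    """Mean fraction of shared k-nearest neighbors between two embeddings."""
    a, b = knn_sets(emb_a, k), knn_sets(emb_b, k)
    return float(np.mean([len(x & y) / k for x, y in zip(a, b)]))


X, _ = make_blobs(n_samples=300, n_features=20, centers=3, random_state=0)

seeds = (0, 42, 123)
embeddings = [
    TSNE(n_components=2, perplexity=30, init="random", random_state=s).fit_transform(X)
    for s in seeds
]

# Compare every pair of runs
overlaps = [
    run_to_run_overlap(embeddings[i], embeddings[j])
    for i in range(len(seeds))
    for j in range(i + 1, len(seeds))
]
mean_overlap = float(np.mean(overlaps))
print(f"mean neighbor overlap across seeds: {mean_overlap:.2f}")
```

A low overlap across seeds suggests the fine-grained layout is noise; what counts as "low" depends on the dataset, so treat any cutoff as something to calibrate.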
t-SNE vs UMAP: Key Differences
| Feature | t-SNE | UMAP | Winner |
| --- | --- | --- | --- |
| Computational Complexity | O(n²) naive; O(n log n) with Barnes-Hut | O(n log n) via approximate nearest neighbors | UMAP |
| Global Structure Preservation | Poor; focuses on local neighborhoods | Good; balances local and global | UMAP |
| Hyperparameter Sensitivity | High (perplexity, learning rate) | Moderate (n_neighbors, min_dist) | UMAP (more robust) |
| Scalability (n > 100k) | Impractical without approximations | Handles millions of points | UMAP |
| Reproducibility | Sensitive to random seed | More stable with fixed seed | UMAP |
| Interpretability | Tight clusters, but distances meaningless | Clusters + some global structure | Depends on use case |

Key takeaways

1. t-SNE and UMAP are for visualization, not feature extraction or distance preservation.
2. t-SNE uses a heavy-tailed t-distribution in low-D to avoid crowding; UMAP uses a fuzzy topological approach.
3. UMAP is generally faster, scales better, and preserves global structure better than t-SNE.
4. Hyperparameters (perplexity, n_neighbors, min_dist) drastically affect output; always run multiple seeds.
5. Both are sensitive to random initialization; use PCA initialization for reproducibility.
6. Never trust a single t-SNE/UMAP plot; validate with clustering metrics or domain knowledge.

Common mistakes to avoid

4 patterns

Using default perplexity without tuning

Symptom
Plots look either too fragmented or too blob-like; clusters don't make sense.
Fix
Try perplexity values of 5, 10, 30, 50 and compare. Use a grid search with a stability metric.

Interpreting distances or axes literally

Symptom
Claiming 'cluster A is farther from B than C' based on plot distances.
Fix
Remember: t-SNE/UMAP preserve local neighborhoods, not global distances. Only cluster membership and local neighborhood relationships are meaningful.

Running only one random seed

Symptom
Different runs produce wildly different plots; conclusions are not reproducible.
Fix
Always run with multiple random seeds (e.g., 42, 0, 123) and use PCA initialization to reduce variability.

Applying t-SNE/UMAP to raw high-dimensional data without preprocessing

Symptom
Embedding is dominated by noise; clusters are meaningless.
Fix
Standardize features, reduce noise with PCA (e.g., to 50 dimensions), and consider feature selection.
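Two of the fixes above, preprocessing before embedding and comparing several perplexity values, can be combined in one short loop. The sketch below is an illustration, not a tuning recipe: the data is synthetic, the PCA target of 20 components is an arbitrary choice for this small example (the text suggests around 50 for real data), and scikit-learn's `trustworthiness` score is used as one possible stand-in for a stability metric.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for real high-dimensional data
X, _ = make_blobs(n_samples=300, n_features=60, centers=4, random_state=0)

# Preprocess: standardize, then reduce noise with PCA before embedding
X_std = StandardScaler().fit_transform(X)
X_prep = PCA(n_components=20, random_state=0).fit_transform(X_std)

# Sweep the perplexity values suggested above and score each embedding
scores = {}
for perplexity in (5, 10, 30, 50):
    emb = TSNE(
        n_components=2, perplexity=perplexity, init="pca", random_state=0
    ).fit_transform(X_prep)
    # Trustworthiness is in [0, 1]; higher means fewer false neighbors in 2D
    scores[perplexity] = trustworthiness(X_prep, emb, n_neighbors=10)

for perplexity, score in scores.items():
    print(f"perplexity={perplexity:>2}: trustworthiness={score:.3f}")
```

The score only ranks candidate settings on local-neighborhood fidelity, so the plots still need to be inspected side by side before picking a perplexity.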
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Explain the t-SNE algorithm step by step. What is the role of the t-dist...
Q02 · SENIOR
How does UMAP differ from t-SNE in terms of mathematical foundation and ...
Q03 · SENIOR
What is the 'crowding problem' in t-SNE and how does the algorithm addre...
Q01 of 03 · SENIOR

Explain the t-SNE algorithm step by step. What is the role of the t-distribution in the low-dimensional map?

ANSWER
t-SNE first computes pairwise similarities in high-dimensional space using Gaussian kernels, where each similarity is the conditional probability that one point would pick the other as its neighbor. Perplexity controls the kernel bandwidth for each point. It then initializes the low-dimensional points (often with PCA) and defines low-dimensional similarities using a Student t-distribution with one degree of freedom (heavy-tailed). The algorithm minimizes the KL divergence between the high-D and low-D similarity distributions using gradient descent. The t-distribution alleviates the crowding problem: moderate distances in high-D are mapped to larger distances in low-D, preventing points from piling up in the center.
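The quantities named in this answer can be written out in a few lines of NumPy. This is a teaching sketch under simplifying assumptions, not how scikit-learn implements t-SNE: it uses dense O(n²) matrices, a plain bisection on the kernel width, a random low-dimensional layout instead of PCA initialization, and no gradient descent. All function names are hypothetical.

```python
import numpy as np


def pairwise_sq_dists(A):
    """Dense matrix of squared Euclidean distances between rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.square(diff).sum(axis=-1)


def row_affinities(sq_d, perplexity, n_steps=60):
    """Gaussian p_{j|i} for one point; the kernel width is found by bisection
    so that the row's perplexity (exp of its entropy) matches the target."""
    target_entropy = np.log(perplexity)
    log_lo, log_hi = -20.0, 20.0      # search range for log(beta), beta = 1 / (2 * sigma^2)
    shifted = sq_d - sq_d.min()       # shift for numerical stability
    for _ in range(n_steps):
        log_beta = 0.5 * (log_lo + log_hi)
        w = np.exp(-shifted * np.exp(log_beta))
        p = w / w.sum()
        entropy = -np.sum(p * np.log(np.maximum(p, 1e-300)))
        if entropy > target_entropy:
            log_lo = log_beta         # too flat: narrow the kernel
        else:
            log_hi = log_beta         # too peaked: widen the kernel
    return p


def conditional_P(X, perplexity):
    """Matrix of p_{j|i} with a zero diagonal."""
    n = X.shape[0]
    sq = pairwise_sq_dists(X)
    P_cond = np.zeros((n, n))
    for i in range(n):
        mask = np.arange(n) != i
        P_cond[i, mask] = row_affinities(sq[i, mask], perplexity)
    return P_cond


def low_dim_Q(Y):
    """Student-t (one degree of freedom) similarities, normalized over all pairs."""
    w = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(w, 0.0)
    return w / w.sum()


def kl_divergence(P, Q):
    """KL(P || Q) over the pairs where P is non-zero."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))


rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))   # high-dimensional points
Y = rng.standard_normal((100, 2))    # arbitrary 2D layout (assumption: no optimization run)

P_cond = conditional_P(X, perplexity=30.0)
P = (P_cond + P_cond.T) / (2 * X.shape[0])   # symmetrized joint distribution
Q = low_dim_Q(Y)
loss = kl_divergence(P, Q)
print(f"KL(P || Q) for the random layout: {loss:.3f}")
```

Gradient descent on `Y` would lower this loss; the `1 / (1 + d²)` kernel in `low_dim_Q` is the heavy-tailed part that lets moderately distant points sit far apart in 2D.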
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
What is the main difference between t-SNE and UMAP?
02
How do I choose between t-SNE and UMAP?
03
What is perplexity in t-SNE and how do I set it?
04
Can I use t-SNE or UMAP for feature extraction or clustering?