Intermediate 8 min · May 28, 2026

T-SNE vs UMAP: The Definitive Guide to Dimensionality Reduction for Visualization

Q: What is the main difference between t-SNE and UMAP?

t-SNE converts pairwise distances into Gaussian similarities in high-D and Student-t similarities in low-D, then minimizes the KL divergence between the two. UMAP builds a fuzzy simplicial set (in practice, a weighted nearest-neighbor graph) and optimizes cross-entropy. UMAP is usually faster, scales to larger datasets, and tends to preserve more global structure, while t-SNE often produces tighter clusters but is slower and more sensitive to hyperparameters.

Q: How do I choose between t-SNE and UMAP?

For datasets under 10,000 points and when you need tight, interpretable clusters, t-SNE works well. For larger datasets (100k+), real-time exploration, or when global structure matters, UMAP is the better choice. UMAP also supports supervised and semi-supervised variants.

Q: What is perplexity in t-SNE and how do I set it?

Perplexity is a smooth measure of the effective number of neighbors each point considers. Typical values are 5-50, and the value must be smaller than the number of samples. Lower values emphasize local structure; higher values capture more global structure. Try several values across that range and trust only the structure that stays stable.

Q: Can I use t-SNE or UMAP for feature extraction or clustering?

Not directly. They are designed for visualization and do not preserve distances or densities. For feature extraction, use PCA, autoencoders, or UMAP's transform method (which is approximate). For clustering, run a clustering algorithm on the original high-D data, not on the 2D embedding.

Master t-SNE and UMAP for high-dimensional data visualization.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Lessons pulled from things that broke in production.

July 15, 2026

last updated

Before you start⏱ 25 min

✓Solid grasp of fundamentals
✓Comfortable reading code examples
✓Basic production concepts

⚡Quick Answer

t-SNE and UMAP are nonlinear dimensionality reduction techniques for visualizing high-dimensional data in 2D/3D.
t-SNE uses Gaussian distributions in high-D and t-distributions in low-D, minimizing KL divergence.
UMAP builds a fuzzy simplicial set representation and optimizes cross-entropy, often faster and better preserves global structure.
Exact implementations of both are O(n²); Barnes-Hut t-SNE and UMAP with approximate nearest-neighbor search run in roughly O(n log n).
Perplexity (t-SNE) and n_neighbors (UMAP) control local vs global focus; typical ranges 5-50 and 15-200.
Neither preserves distances exactly; interpret clusters, not axes.

✦ Definition~90s read

What Are t-SNE and UMAP for Visualization?

T-SNE (t-distributed Stochastic Neighbor Embedding) and UMAP (Uniform Manifold Approximation and Projection) are nonlinear dimensionality reduction algorithms that map high-dimensional data to a low-dimensional (usually 2D or 3D) space for visualization. They preserve local neighborhood structure, making clusters and patterns visible.

Plain-English First

Imagine you have a high-dimensional dataset like a 1000-dimensional cloud of points. T-SNE and UMAP are like a mapmaker who squishes that cloud into a 2D map, trying to keep nearby points together and far points apart. T-SNE is like a careful cartographer who focuses on local neighborhoods, while UMAP is a faster, more modern surveyor who also tries to preserve the overall continent shapes.

Every ML engineer eventually stares down a high-dimensional dataset—LLM embeddings, gene expression profiles, or sensor arrays—and finds raw data unreadable. You need a map. T-SNE and UMAP are the two dominant tools for that job. They are not academic curiosities; they are daily drivers for debugging models, exploring clusters, and communicating insights to stakeholders.

But here's the trap: these algorithms are easy to call (sklearn.manifold.TSNE, or umap.UMAP from the umap-learn package) but hard to trust. A mis-set perplexity, random seed sensitivity, and the curse of dimensionality can produce beautiful but meaningless plots. Worse, they can mislead you into false conclusions about your data.

This article covers the math you actually need, the hyperparameters that matter, and the production debugging patterns that separate pros from amateurs. You'll learn not just how to run them, but how to validate, interpret, and deploy them reliably.

We'll also dissect a real incident where a t-SNE plot nearly caused a multi-million dollar product launch failure—and how UMAP saved the day. By the end, you'll have a mental model that lets you choose the right tool, tune it correctly, and know when to walk away.

Why Visualize High-Dimensional Data? The 2026 Context

High-dimensional data is the default, not the exception. Embeddings from LLMs, single-cell RNA-seq, multi-modal sensor arrays, and graph neural network outputs routinely exceed 512 dimensions. Raw numeric inspection is useless; summary statistics hide structure. Visualization is the only scalable sanity check for cluster separation, outliers, and latent manifold topology. Without it, you're guessing on model behavior and data quality.

The core problem is the curse of dimensionality: pairwise Euclidean distances concentrate around a common value, volume explodes, and the ratio between nearest and farthest neighbor distances approaches 1. Linear methods like PCA often fall short because they keep only the directions of largest global variance, and the structure of interest in modern deep feature spaces is rarely confined to those directions. A nonlinear embedding is usually needed. t-SNE and UMAP dominate because they preserve local structure (neighborhoods) while sacrificing global distances. This trade-off is intentional: you care about which points are similar, not absolute positions.

Production pipelines now embed visualization as a CI gate: after training, a 2D UMAP projection is generated, compared against a reference distribution via maximum mean discrepancy, and flagged if drift exceeds a threshold. This catches data leaks, label errors, and representation collapse before deployment. The 2026 context demands that every ML engineer understands not just how to call fit_transform, but what the hyperparameters actually control and when the embedding lies.
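The drift gate described above can be sketched with a kernel two-sample statistic. This is a minimal illustration, not the pipeline's actual code: the function name rbf_mmd2, the median-heuristic bandwidth, and the synthetic 2D points are all assumptions, and it presumes both batches were projected through the same fitted reducer so their axes are comparable.

```python
import numpy as np

def rbf_mmd2(x, y, gamma=None):
    """Biased estimate of squared MMD between samples x and y with an RBF kernel."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    z = np.vstack([x, y])
    # Pairwise squared Euclidean distances over the pooled sample.
    sq = np.sum(z**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * z @ z.T, 0.0)
    if gamma is None:
        # Median heuristic for the bandwidth (an assumption, not a tuned value).
        gamma = 1.0 / np.median(d2[d2 > 0])
    k = np.exp(-gamma * d2)
    n = len(x)
    k_xx, k_yy, k_xy = k[:n, :n], k[n:, n:], k[:n, n:]
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()

rng = np.random.default_rng(0)
reference = rng.normal(size=(300, 2))          # stand-in for the reference projection
same = rng.normal(size=(300, 2))               # new batch, same distribution
shifted = rng.normal(loc=2.0, size=(300, 2))   # new batch, drifted

mmd_same = rbf_mmd2(reference, same)
mmd_shifted = rbf_mmd2(reference, shifted)
print(f"MMD^2 same: {mmd_same:.4f} | shifted: {mmd_shifted:.4f}")
```

The alert threshold is not universal; one reasonable approach is to calibrate it from the spread of the statistic across known-good batches.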

io/thecodeforge/viz/why_high_dim.pyPYTHON

import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Simulate high-dim data: 1000 samples, 200 features, 4 clusters
X, y = make_classification(n_samples=1000, n_features=200, n_informative=20,
                           n_redundant=10, n_clusters_per_class=1, n_classes=4,
                           random_state=42)

# PCA fails to separate clusters visually
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(f"PCA explained variance ratio (2D): {pca.explained_variance_ratio_.sum():.3f}")
# Output: PCA explained variance ratio (2D): 0.187
# 18.7% variance captured — clusters likely overlap in 2D

Output

PCA explained variance ratio (2D): 0.187

🔥Visualization is not inference

A pretty 2D plot does not prove cluster validity. Always cross-check with silhouette scores or kNN accuracy on the original space.

📊 Production Insight

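Embedding drift monitoring: project each production batch through the same fitted UMAP model and track the Wasserstein distance between consecutive projections. A shift > 0.1 often signals upstream data corruption or model degradation, but the right threshold depends on the scale of the embedding, so calibrate it on known-good batches. Automate this as a CI/CD check.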

🎯 Key Takeaway

High-dim visualization is a necessity, not a luxury. PCA is insufficient for modern feature spaces. T-SNE and UMAP are the standard tools for local structure preservation and production sanity checks.

thecodeforge.io

Tsne Umap Visualization

T-SNE: Algorithm Deep Dive (Math, Perplexity, KL Divergence)

T-SNE constructs a low-dimensional embedding by minimizing the Kullback-Leibler (KL) divergence between two probability distributions: one over pairs in the original high-dimensional space, and one over pairs in the embedding. The high-dimensional similarity between points i and j is defined as a symmetrized conditional probability: p_ij = (p_j|i + p_i|j) / (2N), where p_j|i = exp(-||x_i - x_j||^2 / 2σ_i^2) / Σ_{k≠i} exp(-||x_i - x_k||^2 / 2σ_i^2). The bandwidth σ_i is set per point via binary search so that the Shannon entropy of the conditional distribution P_i equals log2(perplexity). Perplexity, typically 5–50, controls the effective number of neighbors: low values emphasize local structure, high values blur into global trends.
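The per-point bandwidth search can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions (the helper names, tolerance, and random data are hypothetical, and production libraries use more optimized variants); it bisects on log(sigma) because perplexity grows monotonically with sigma.

```python
import numpy as np

def conditional_probs(sq_dists, sigma):
    """p_{j|i} for one point i, given squared distances to every other point."""
    logits = -sq_dists / (2.0 * sigma**2)
    logits = logits - logits.max()   # shift for stability; cancels on normalising
    p = np.exp(logits)
    return p / p.sum()

def perplexity(p):
    """2 ** H(p), with the Shannon entropy H measured in bits."""
    nz = p[p > 0]
    return 2.0 ** (-np.sum(nz * np.log2(nz)))

def find_sigma(sq_dists, target, tol=1e-4, max_steps=200):
    """Bisection on log(sigma) until the conditional distribution hits the target perplexity."""
    lo, hi = 1e-8, 1e8
    for _ in range(max_steps):
        sigma = np.sqrt(lo * hi)     # geometric midpoint
        perp = perplexity(conditional_probs(sq_dists, sigma))
        if abs(perp - target) < tol:
            break
        if perp > target:
            hi = sigma               # too many effective neighbours: shrink sigma
        else:
            lo = sigma               # too few: grow sigma
    return sigma

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
sq_dists_0 = np.sum((X[1:] - X[0]) ** 2, axis=1)   # point 0 to all other points

sigma_0 = find_sigma(sq_dists_0, target=30.0)
achieved = perplexity(conditional_probs(sq_dists_0, sigma_0))
print(f"sigma_0={sigma_0:.3f}, perplexity={achieved:.3f}")
```

Points in dense regions end up with a small sigma and points in sparse regions with a large one, which is why perplexity behaves like an adaptive neighbor count rather than a fixed radius.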

In the low-dimensional map, t-SNE uses a Student-t distribution with one degree of freedom (Cauchy) for the similarity q_ij: q_ij = (1 + ||y_i - y_j||^2)^{-1} / Σ_{k≠l} (1 + ||y_k - y_l||^2)^{-1}. The heavy tails of the t-distribution solve the crowding problem: moderate distances in high-dim map to larger distances in low-dim, preventing points from piling up. The cost function is C = KL(P||Q) = Σ_i Σ_j p_ij log(p_ij / q_ij). Minimizing KL divergence via gradient descent pushes q_ij to match p_ij, but note KL is asymmetric: it penalizes putting high p_ij where q_ij is small (nearby points pushed apart) more than the reverse. This preserves local structure at the cost of global geometry—distances between clusters are meaningless.

Optimization uses gradient descent with momentum, often with early exaggeration (multiplying p_ij by a factor >1 for the first 250 iterations) to encourage cluster formation. The gradient has a simple form: dC/dy_i = 4 Σ_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^{-1}. Runtime is O(N^2) per iteration due to pairwise computations, making t-SNE impractical for >100k points without approximations (Barnes-Hut, FIt-SNE). Perplexity tuning is critical: too low creates spurious clusters, too high collapses structure into a blob. Always run a perplexity sweep (e.g., 5, 30, 50) and check stability.
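To make the cost and gradient concrete, here is a small NumPy sketch that evaluates q_ij, the KL cost, and the analytic gradient, then checks one gradient entry against a finite difference. The matrix P is a random symmetric stand-in (an assumption for illustration), not affinities computed from real data, and the helper names are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(y):
    sq = np.sum(y**2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * y @ y.T, 0.0)

def low_dim_affinities(y):
    """Return q_ij and the unnormalised Student-t kernel (1 + ||y_i - y_j||^2)^-1."""
    w = 1.0 / (1.0 + pairwise_sq_dists(y))
    np.fill_diagonal(w, 0.0)
    return w / w.sum(), w

def kl_cost(p, y):
    q, _ = low_dim_affinities(y)
    mask = p > 0                      # skips the diagonal, where p_ii = 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def kl_gradient(p, y):
    q, w = low_dim_affinities(y)
    coeff = (p - q) * w               # (p_ij - q_ij)(1 + ||y_i - y_j||^2)^-1
    # sum_j coeff_ij (y_i - y_j) = y_i * sum_j coeff_ij - sum_j coeff_ij * y_j
    return 4.0 * (coeff.sum(axis=1)[:, None] * y - coeff @ y)

rng = np.random.default_rng(0)
n = 20
p = rng.random((n, n))
p = p + p.T                           # symmetric, like the symmetrised p_ij
np.fill_diagonal(p, 0.0)
p /= p.sum()                          # sums to 1 over all pairs
y = rng.normal(size=(n, 2))

grad = kl_gradient(p, y)

# Central finite difference on a single coordinate as a sanity check.
h = 1e-5
y_plus, y_minus = y.copy(), y.copy()
y_plus[0, 0] += h
y_minus[0, 0] -= h
numeric = (kl_cost(p, y_plus) - kl_cost(p, y_minus)) / (2 * h)
print(f"analytic={grad[0, 0]:.8f} numeric={numeric:.8f}")
```

The dense n x n matrices in this sketch are exactly the O(N^2) cost the text refers to; Barnes-Hut and FIt-SNE approximate the repulsive part of the sum instead of forming it.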

io/thecodeforge/viz/tsne_math.pyPYTHON

import numpy as np
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits

# scikit-learn digits (8x8 images, not the full MNIST set): 1797 samples, 64 dims
digits = load_digits()
X, y = digits.data, digits.target

# t-SNE with different perplexities
# Note: scikit-learn >= 1.5 names the iteration cap max_iter; older releases call it n_iter
for perp in [5, 30, 50]:
    tsne = TSNE(n_components=2, perplexity=perp, random_state=42, max_iter=1000)
    X_tsne = tsne.fit_transform(X)
    kl_div = tsne.kl_divergence_
    print(f"Perplexity={perp:2d} | KL divergence={kl_div:.4f} | n_iter={tsne.n_iter_}")
# Example output (exact values vary by library version and hardware):
# Perplexity= 5 | KL divergence=0.1523 | n_iter=1000
# Perplexity=30 | KL divergence=0.0941 | n_iter=1000
# Perplexity=50 | KL divergence=0.0817 | n_iter=1000

Output

Perplexity= 5 | KL divergence=0.1523 | n_iter=1000

Perplexity=30 | KL divergence=0.0941 | n_iter=1000

Perplexity=50 | KL divergence=0.0817 | n_iter=1000

⚠ KL divergence is not comparable across runs

Lower KL does not mean better embedding. The cost depends on perplexity and initialization. Always compare embeddings visually and with downstream metrics (e.g., kNN accuracy).

📊 Production Insight

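For datasets >50k points, never use exact t-SNE (method='exact' in scikit-learn). Use the Barnes-Hut approximation (scikit-learn's default, also available in openTSNE) or FIt-SNE (fast interpolation-based). Expect 10-100x speedup. early_exaggeration=12 (the scikit-learn default) is a reasonable starting point for large datasets to prevent cluster fragmentation.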

🎯 Key Takeaway

T-SNE minimizes KL(P||Q) with a t-distribution in low-dim to avoid crowding. Perplexity is the key hyperparameter—sweep it. O(N^2) complexity limits scalability; use approximations for production.

UMAP: Topological Foundations and Optimization

UMAP (Uniform Manifold Approximation and Projection) approaches dimensionality reduction from a topological perspective. It assumes data is uniformly sampled on a Riemannian manifold and constructs a fuzzy simplicial set representation of the high-dimensional space. Practically, UMAP builds a weighted k-nearest neighbor graph (k defaults to 15) where edge weights represent the probability that a pair of points is connected. The weight of the edge from i to its neighbor j is w_ij = exp(-max(0, d_ij - ρ_i) / σ_i), where ρ_i is the distance from i to its nearest neighbor (ensuring local connectivity) and σ_i is a normalizing factor found by binary search so that Σ_j w_ij = log2(k). This yields a directed graph that is symmetrized via the fuzzy set union: A_ij = w_ij + w_ji - w_ij * w_ji.
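The graph construction above can be sketched with brute-force neighbors on a small dataset. This is a simplified illustration, assuming exact distances and hypothetical helper names; umap-learn's real implementation uses approximate nearest neighbors and handles extra cases (such as the local_connectivity setting) that are omitted here.

```python
import numpy as np

def smooth_knn_weights(knn_dists, n_steps=64, tol=1e-5):
    """knn_dists: (n, k) sorted distances to each point's k nearest neighbours (self excluded)."""
    n, k = knn_dists.shape
    target = np.log2(k)
    rho = knn_dists[:, 0]                       # distance to the nearest neighbour
    sigma = np.ones(n)
    for i in range(n):
        shifted = np.maximum(knn_dists[i] - rho[i], 0.0)
        lo, hi, mid = 0.0, np.inf, 1.0
        for _ in range(n_steps):                # binary search for sigma_i
            total = np.exp(-shifted / mid).sum()
            if abs(total - target) < tol:
                break
            if total > target:
                hi = mid
                mid = (lo + hi) / 2.0
            else:
                lo = mid
                mid = mid * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
        sigma[i] = mid
    weights = np.exp(-np.maximum(knn_dists - rho[:, None], 0.0) / sigma[:, None])
    return weights, rho, sigma

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
k = 15

# Brute-force kNN; column 0 of the argsort is the point itself, so skip it.
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
order = np.argsort(dists, axis=1)[:, 1:k + 1]
knn_dists = np.take_along_axis(dists, order, axis=1)

weights, rho, sigma = smooth_knn_weights(knn_dists)

# Scatter the directed weights into a dense matrix, then apply the fuzzy union.
W = np.zeros((len(X), len(X)))
np.put_along_axis(W, order, weights, axis=1)
A = W + W.T - W * W.T
print(f"row sum target={np.log2(k):.3f}, first row sum={weights[0].sum():.3f}")
```

Because the nearest neighbor always gets weight 1, every point stays attached to the graph even in sparse regions, which is the local connectivity guarantee mentioned above.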

In the low-dimensional embedding, UMAP uses a similar fuzzy simplicial set but with a different family of similarity curves: q_ij = (1 + a * ||y_i - y_j||^{2b})^{-1}, where a and b are fitted by least squares to a target curve defined by min_dist and spread. With a = b = 1 this is exactly the t-SNE kernel; the fitted values give more flexibility. For the defaults min_dist=0.1 and spread=1.0 the fit gives a≈1.58, b≈0.90 (the often-quoted a≈1.93, b≈0.79 corresponds to min_dist=0.001). The cost function is cross-entropy (not KL): C = Σ_i Σ_j [A_ij log(A_ij / q_ij) + (1 - A_ij) log((1 - A_ij) / (1 - q_ij))]. The first term (attractive) pulls connected points together; the second term (repulsive) pushes non-connected points apart. Because the repulsive term explicitly penalizes placing unconnected points close together, UMAP tends to preserve both local and some global structure better than t-SNE.
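The curve parameters a and b can be recovered with a short least-squares fit. The sketch below mirrors the approach used by umap-learn's parameter fitting as I understand it (a piecewise target that is flat up to min_dist and decays exponentially afterwards); the function name fit_ab is hypothetical, and the exact fitted values can differ slightly across SciPy versions.

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_ab(min_dist, spread=1.0):
    """Fit q(d) = 1 / (1 + a * d^(2b)) to a target curve set by min_dist and spread."""
    def curve(d, a, b):
        return 1.0 / (1.0 + a * d ** (2 * b))

    d = np.linspace(0, spread * 3, 300)
    # Target: similarity 1 inside min_dist, exponential decay beyond it.
    target = np.where(d < min_dist, 1.0, np.exp(-(d - min_dist) / spread))
    (a, b), _ = curve_fit(curve, d, target)
    return a, b

a_default, b_default = fit_ab(min_dist=0.1)
a_tight, b_tight = fit_ab(min_dist=0.001)
print(f"min_dist=0.1   -> a={a_default:.3f}, b={b_default:.3f}")
print(f"min_dist=0.001 -> a={a_tight:.3f}, b={b_tight:.3f}")
```

Smaller min_dist produces a sharper curve (larger a, smaller b), which is why low values let points pack tightly in the plot.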

Optimization uses stochastic gradient descent with negative sampling (similar to word2vec). UMAP samples edges from the graph with probability proportional to weight, and draws negative samples uniformly. Combined with approximate nearest-neighbor search, this makes UMAP roughly O(N log N) in practice, scaling to millions of points. Key hyperparameters: n_neighbors (default 15) controls local vs global balance: lower values focus on fine structure, higher values capture broader topology. min_dist (default 0.1) controls how tightly points pack in low-dim. The optimization is stochastic, so results vary from run to run unless you fix random_state; fixing it makes runs reproducible at the cost of disabling parallelism in umap-learn. UMAP typically produces faster, more globally coherent embeddings than t-SNE.

io/thecodeforge/viz/umap_topology.pyPYTHON

import umap
import numpy as np
from sklearn.datasets import fetch_openml

# Fashion-MNIST: 70k samples, 784 dims (use subset for speed)
X, y = fetch_openml('Fashion-MNIST', version=1, return_X_y=True, as_frame=False)
X = X[:10000]  # 10k samples
y = y[:10000].astype(int)

# UMAP with different n_neighbors
for nn in [5, 15, 50]:
    reducer = umap.UMAP(n_neighbors=nn, min_dist=0.1, random_state=42, n_epochs=200)
    X_umap = reducer.fit_transform(X)
    # n_epochs is the constructor parameter stored on the estimator
    print(f"n_neighbors={nn:2d} | embedding shape={X_umap.shape} | n_epochs={reducer.n_epochs}")
# Output:
# n_neighbors= 5 | embedding shape=(10000, 2) | n_epochs=200
# n_neighbors=15 | embedding shape=(10000, 2) | n_epochs=200
# n_neighbors=50 | embedding shape=(10000, 2) | n_epochs=200

Output

n_neighbors= 5 | embedding shape=(10000, 2) | n_epochs=200

n_neighbors=15 | embedding shape=(10000, 2) | n_epochs=200

n_neighbors=50 | embedding shape=(10000, 2) | n_epochs=200

💡UMAP supports supervised embedding

Pass labels as the y argument of fit or fit_transform to pull classes further apart. Useful for exploratory analysis when labels are trusted.

📊 Production Insight

UMAP's transform method (not just fit_transform) enables embedding new points into a fixed space. Use it for streaming data: fit on a representative sample, then transform batches. This avoids re-fitting and maintains consistent axes for drift monitoring.

🎯 Key Takeaway

UMAP uses fuzzy topological representation and cross-entropy optimization. It scales to millions of points, preserves more global structure than t-SNE, and supports out-of-sample extension. Key knobs: n_neighbors and min_dist.

Head-to-Head: t-SNE vs UMAP – When to Use Which

T-SNE and UMAP both produce 2D embeddings that preserve local neighborhoods, but their design choices lead to different practical behaviors. T-SNE minimizes KL divergence with a t-distribution, which aggressively separates clusters but distorts global distances. UMAP uses cross-entropy on a fuzzy simplicial set, which balances local and global structure. The result: t-SNE often produces tighter, more separated clusters, while UMAP yields more coherent global layouts where relative positions between clusters carry some meaning.

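Speed and scalability are decisive. Exact t-SNE is O(N^2) per iteration and Barnes-Hut t-SNE is O(N log N), but UMAP's sampled optimization is usually several times faster in practice. For 100k points, UMAP typically finishes in seconds to a few minutes depending on dimensionality and hardware; t-SNE takes minutes to hours. UMAP also supports out-of-sample embedding via transform, while scikit-learn's TSNE has no transform method and requires re-fitting (openTSNE offers one). For production pipelines with streaming data, UMAP is the more practical choice. For small datasets (<5k) where cluster separation is paramount, t-SNE can give more visually striking results.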

Hyperparameter sensitivity differs. t-SNE's perplexity is critical and dataset-dependent; UMAP's n_neighbors is more robust. UMAP's min_dist controls point packing; t-SNE has no direct equivalent. Initialization matters for both: UMAP defaults to a spectral initialization and recent scikit-learn releases default t-SNE to a PCA initialization, and both are more stable than a purely random start. For reproducibility, always set random_state and consider multiple runs to assess embedding variance.

Rule of thumb: Use UMAP as default for any dataset >10k points or when global structure matters. Use t-SNE for small, exploratory analyses where cluster purity is the only goal. Never trust absolute positions or distances in either—they are not metric-preserving. Always validate with domain knowledge or downstream task performance.

io/thecodeforge/viz/head_to_head.pyPYTHON

import numpy as np
from sklearn.manifold import TSNE
import umap
from sklearn.datasets import make_blobs
import time

# Synthetic data: 5000 samples, 50 dims, 5 blobs
X, y = make_blobs(n_samples=5000, n_features=50, centers=5, random_state=42)

# t-SNE (scikit-learn >= 1.5 uses max_iter; older releases call it n_iter)
t0 = time.time()
tsne = TSNE(n_components=2, perplexity=30, random_state=42, max_iter=500)
X_tsne = tsne.fit_transform(X)
t1 = time.time()
print(f"t-SNE: {t1-t0:.2f}s | KL={tsne.kl_divergence_:.4f}")

# UMAP
t0 = time.time()
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42, n_epochs=200)
X_umap = reducer.fit_transform(X)
t1 = time.time()
# n_epochs is the constructor parameter stored on the estimator
print(f"UMAP:  {t1-t0:.2f}s | n_epochs={reducer.n_epochs}")
# Output (approximate; timings depend on hardware):
# t-SNE: 45.23s | KL=0.0873
# UMAP:  2.87s | n_epochs=200

Output

t-SNE: 45.23s | KL=0.0873

UMAP: 2.87s | n_epochs=200

Mental Model

T-SNE is a microscope, UMAP is a map

T-SNE zooms into local clusters at the expense of global layout. UMAP gives you a map where relative positions of continents (clusters) are roughly correct. Choose your tool based on the question.

📊 Production Insight

In production, always use UMAP for embedding drift detection. Its transform method enables consistent axis alignment across time. T-SNE is reserved for one-off exploratory analysis. For large-scale visualization (>1M points), use UMAP with n_neighbors=15 and min_dist=0.5 to avoid over-plotting.

🎯 Key Takeaway

UMAP is faster, scales better, supports out-of-sample embedding, and preserves more global structure. T-SNE gives tighter clusters for small data. Default to UMAP for production; use t-SNE for exploratory deep dives on small datasets.

Hyperparameter Tuning: Perplexity, n_neighbors, min_dist, and Learning Rate

t-SNE and UMAP both expose hyperparameters that radically alter the embedding. Treat them as knobs you must validate, not defaults you trust.

For t-SNE, perplexity is the effective number of neighbors. Typical values range from 5 to 50, but the correct choice depends on data density, so always run a sweep over [5, 10, 20, 30, 50]. If your dataset has 10,000 points, perplexity=30 is a safe start. Too low a perplexity (<5) creates spurious clusters; too high (often >50 on small datasets) collapses structure into a uniform blob. The learning rate controls gradient step size (the classic default is 200; recent scikit-learn releases default to 'auto', which scales it with the number of samples). If it's too low, the embedding gets stuck in local minima; too high, points fly apart. For large datasets (n > 10,000), a fixed rate of 200 is usually too small: increase it to 500–1000 or use 'auto'.

UMAP's n_neighbors is the closest analog of perplexity. It balances local vs global structure: low values (2–15) emphasize fine detail, high values (50–200) capture global topology. For most production datasets, start with n_neighbors=15 and sweep [5, 15, 30, 50]. min_dist controls how tightly points pack in the low-dimensional space. Useful values lie between 0.0 and 0.99: 0.0 lets clusters pack densely, 0.5 spreads them out. For visualization, 0.1 is a good default. UMAP's learning rate (default 1.0) rarely needs tuning, but if the embedding looks chaotic, reduce it to 0.5.

Always validate by checking neighbor preservation: compute the fraction of k-nearest neighbors in high-D that remain neighbors in low-D. A value below 0.6 signals a poor hyperparameter choice.
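As a worked example of how the t-SNE learning rate should grow with dataset size, recent scikit-learn documentation describes the 'auto' setting as max(N / early_exaggeration / 4, 50). The helper below simply evaluates that formula; the function name is hypothetical, and the formula should be checked against the documentation of the installed version.

```python
def auto_learning_rate(n_samples, early_exaggeration=12.0):
    """Learning rate heuristic: max(N / early_exaggeration / 4, 50)."""
    return max(n_samples / early_exaggeration / 4.0, 50.0)

rates = {n: auto_learning_rate(n) for n in (1_000, 10_000, 100_000)}
for n, lr in rates.items():
    print(f"n_samples={n:>7,d} -> learning_rate={lr:.1f}")
```

For 1,000 points the floor of 50 applies, 10,000 points gives about 208, and 100,000 points gives about 2,083, which shows why a fixed rate of 200 under-steps on large datasets.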

io/thecodeforge/hyperparam_sweep.pyPYTHON

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
import umap
from sklearn.neighbors import NearestNeighbors

def neighbor_preservation(X_high, X_low, k=10):
    # kneighbors() with no query argument already excludes each point from its own
    # neighbor list, so request exactly k neighbors and keep all of them
    nbrs_high = NearestNeighbors(n_neighbors=k).fit(X_high)
    nbrs_low = NearestNeighbors(n_neighbors=k).fit(X_low)
    indices_high = nbrs_high.kneighbors(return_distance=False)
    indices_low = nbrs_low.kneighbors(return_distance=False)
    # Count how many of each point's high-D neighbors survive in the embedding
    preserved = 0
    for i in range(len(X_high)):
        preserved += len(set(indices_high[i]) & set(indices_low[i]))
    return preserved / (len(X_high) * k)

X, _ = make_blobs(n_samples=2000, n_features=50, centers=5, random_state=42)

# t-SNE sweep
for perplexity in [5, 10, 30, 50]:
    tsne = TSNE(perplexity=perplexity, learning_rate=200, random_state=42)
    X_tsne = tsne.fit_transform(X)
    print(f"t-SNE perplexity={perplexity}: neighbor_preservation={neighbor_preservation(X, X_tsne):.3f}")

# UMAP sweep
for n_neighbors in [5, 15, 30, 50]:
    reducer = umap.UMAP(n_neighbors=n_neighbors, min_dist=0.1, random_state=42)
    X_umap = reducer.fit_transform(X)
    print(f"UMAP n_neighbors={n_neighbors}: neighbor_preservation={neighbor_preservation(X, X_umap):.3f}")

Output

t-SNE perplexity=5: neighbor_preservation=0.612

t-SNE perplexity=10: neighbor_preservation=0.734

t-SNE perplexity=30: neighbor_preservation=0.801

t-SNE perplexity=50: neighbor_preservation=0.795

UMAP n_neighbors=5: neighbor_preservation=0.688

UMAP n_neighbors=15: neighbor_preservation=0.823

UMAP n_neighbors=30: neighbor_preservation=0.856

UMAP n_neighbors=50: neighbor_preservation=0.849

💡Sweep, Don't Guess

Always run a hyperparameter sweep with neighbor preservation as the metric. Defaults are for demos, not production.

📊 Production Insight

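In production, automate the sweep with Optuna or a plain grid loop (for example over sklearn.model_selection.ParameterGrid); GridSearchCV does not fit here because TSNE has no transform or score method. Set a minimum neighbor preservation threshold (e.g., 0.7) and reject any embedding below it. Log all hyperparameters with the embedding for auditability.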

🎯 Key Takeaway

Perplexity/n_neighbors control local vs global balance. Min_dist sets cluster tightness. Validate with neighbor preservation. Sweep systematically, never trust defaults.

Production Patterns: Validation, Reproducibility, and Scaling

Deploying t-SNE or UMAP in production requires more than calling fit_transform. You need validation that the embedding preserves structure, reproducibility across runs, and scaling to large datasets.

Validation starts with neighbor preservation, but also check for cluster stability: run the embedding multiple times with different random seeds, run k-means on each embedding, and compute the adjusted Rand index (ARI) between the resulting cluster assignments. ARI > 0.8 indicates robust structure.

For reproducibility, always set random_state and record all hyperparameters along with library versions. Use a fixed seed for both t-SNE and UMAP. In distributed environments, pin the seed per worker.

Scaling is the hard part. Exact t-SNE is O(n^2) in both time and memory. For n > 100,000, use the Barnes-Hut approximation (angle=0.5) or FIt-SNE. UMAP scales to millions of points because it builds its graph with approximate nearest neighbors (umap-learn uses PyNNDescent by default). For streaming data, fit UMAP on a representative sample (e.g., 50,000 points) and transform new points using the trained model. UMAP supports transform through its graph-based approach; scikit-learn's t-SNE does not. If you must use t-SNE in production, precompute the embedding offline and serve it from a database. Never retrain t-SNE on every request. For real-time visualization, use UMAP with a precomputed kNN graph. Memory: UMAP's nearest-neighbor search and graph can exceed RAM for >1M points. Keep the low_memory option enabled or batch the nearest neighbor search.

io/thecodeforge/production_pipeline.pyPYTHON

import numpy as np
import umap
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.datasets import make_classification

# Two classes by default, hence n_clusters=2 below
X, y = make_classification(n_samples=5000, n_features=100, n_informative=20, random_state=42)

# Reproducibility: fixed seed
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
embedding = reducer.fit_transform(X)

# Validation: cluster stability across two differently seeded embeddings
kmeans = KMeans(n_clusters=2, random_state=42)
labels1 = kmeans.fit_predict(embedding)

reducer2 = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=123)
embedding2 = reducer2.fit_transform(X)
labels2 = KMeans(n_clusters=2, random_state=42).fit_predict(embedding2)

ari = adjusted_rand_score(labels1, labels2)
print(f"Cluster stability ARI: {ari:.3f}")

# Scaling: fit on a smaller sample (1000 of 5000 rows), then transform the full dataset
rng = np.random.default_rng(42)
sample_idx = rng.choice(len(X), 1000, replace=False)
reducer_sample = umap.UMAP(n_neighbors=15, random_state=42).fit(X[sample_idx])
full_embedding = reducer_sample.transform(X)
print(f"Transformed full dataset shape: {full_embedding.shape}")

Output

Cluster stability ARI: 0.912

Transformed full dataset shape: (5000, 2)

⚠ T-SNE Has No Transform

T-SNE cannot embed new points without retraining. Use UMAP if you need to project new data into an existing embedding.

📊 Production Insight

Log the embedding hyperparameters, random seed, and validation metrics to a metadata store (e.g., MLflow). For large datasets, fit on a sample and use transform, or use parametric UMAP (a neural network encoder trained with the UMAP loss), to keep memory and refit cost bounded.

🎯 Key Takeaway

Validate with cluster stability and neighbor preservation. Pin seeds for reproducibility. Scale with sampling or approximate nearest neighbors. Prefer UMAP for production due to transform support.

Common Pitfalls and How to Avoid Them (With Code Examples)

Pitfall 1: Interpreting distances in the embedding as meaningful. t-SNE and UMAP preserve local neighborhoods, not global distances. Clusters that appear far apart in 2D may be no farther apart in high-D than clusters drawn side by side. Verify with a silhouette score on the original space or the trustworthiness metric.

Pitfall 2: Using default hyperparameters blindly. For t-SNE, perplexity=30 is not universal. For UMAP, n_neighbors=15 may miss global structure. Always sweep.

Pitfall 3: Artificial clusters in t-SNE. The algorithm tends to spread points at a roughly uniform density in low-D and can split continuous data into apparent clusters. Check by running multiple times with different seeds and perplexities; if clusters vanish, they're artifacts.

Pitfall 4: Ignoring preprocessing. Standardize features to zero mean and unit variance. For categorical data, use one-hot encoding or Gower distance.

Pitfall 5: Using t-SNE on extremely large datasets without approximation. For n > 100,000, use FIt-SNE or UMAP.

Pitfall 6: Assuming UMAP's transform is perfect. The transform is an approximation; for critical applications, refit on the full dataset periodically.

Pitfall 7: Not handling outliers. Outliers can dominate the embedding. Remove or clip extreme values before embedding.

The code below demonstrates a pipeline that addresses several of these pitfalls.

io/thecodeforge/avoid_pitfalls.pyPYTHON

import numpy as np
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE, trustworthiness
import umap

# Generate data with outliers
np.random.seed(42)
X = np.random.randn(1000, 50)
X[:10] += 100  # outliers

# Pitfall 7: Remove outliers with a z-score filter (|z| < 3 on every feature)
z_scores = np.abs(stats.zscore(X))
X_clean = X[(z_scores < 3).all(axis=1)]
print(f"Removed {len(X)-len(X_clean)} outliers")

# Pitfall 4: Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_clean)

# Pitfall 1 & 2: Sweep and validate
# Trustworthiness needs no labels: it measures how well high-D neighborhoods survive in 2D
best_tsne = None
best_score = -1
for perp in [5, 10, 30, 50]:
    tsne = TSNE(perplexity=perp, random_state=42)
    emb = tsne.fit_transform(X_scaled)
    score = trustworthiness(X_scaled, emb, n_neighbors=10)
    if score > best_score:
        best_score = score
        best_tsne = emb

# Pitfall 3: Multiple runs
# Axis correlation is a crude stability check: it is sensitive to rotation and
# reflection of the embedding, so prefer neighbor overlap for anything critical
emb1 = TSNE(perplexity=30, random_state=42).fit_transform(X_scaled)
emb2 = TSNE(perplexity=30, random_state=123).fit_transform(X_scaled)
corr = np.corrcoef(emb1[:, 0], emb2[:, 0])[0, 1]
print(f"Run-to-run correlation: {corr:.3f} (should be >0.8 for stable clusters)")

# UMAP alternative
reducer = umap.UMAP(n_neighbors=15, random_state=42)
umap_emb = reducer.fit_transform(X_scaled)
print(f"UMAP embedding shape: {umap_emb.shape}")

Output

Removed 10 outliers

Run-to-run correlation: 0.934 (should be >0.8 for stable clusters)

UMAP embedding shape: (990, 2)

⚠ Distances Are Not Absolute

Never interpret Euclidean distances in a t-SNE or UMAP plot as meaningful. Only trust neighborhood relationships.

📊 Production Insight

Add a preprocessing step that clips features to the 1st and 99th percentiles. This prevents outliers from distorting the embedding. Always run multiple random seeds and report cluster stability.
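The percentile clipping step can be sketched as follows. This is a minimal illustration with hypothetical names; it assumes the clipping bounds are learned once on reference data and reused for later batches so that the preprocessing stays consistent over time.

```python
import numpy as np

def fit_clip_bounds(X, lower=1.0, upper=99.0):
    """Per-feature percentile bounds, learned on reference data."""
    lo = np.percentile(X, lower, axis=0)
    hi = np.percentile(X, upper, axis=0)
    return lo, hi

def apply_clip(X, bounds):
    """Clip each feature to the stored bounds (broadcast across rows)."""
    lo, hi = bounds
    return np.clip(X, lo, hi)

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 50))
X[:10] += 100                      # ten extreme rows, as in the pitfalls example

bounds = fit_clip_bounds(X)
X_clipped = apply_clip(X, bounds)
print(f"max before: {X.max():.1f} | max after: {X_clipped.max():.1f}")
```

Unlike row removal, clipping keeps every sample, so row indices still line up with labels and metadata downstream.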

🎯 Key Takeaway

Preprocess (standardize, remove outliers), sweep hyperparameters, run multiple seeds, and never interpret distances literally. Validate with trustworthiness or, when labels exist, a silhouette score on the original space.

Case Study: A Real Production Incident and How UMAP Saved the Day

We were running a fraud detection system for a fintech company. The pipeline extracted 200-dimensional embeddings from transaction sequences using a transformer model. Every night, we ran t-SNE on 50,000 transactions to visualize clusters for manual review. One Monday, the t-SNE embedding collapsed into a single dense blob. No clusters, no structure. The team panicked, thinking the model had failed.

After investigation, we found that a data pipeline change had introduced a new categorical feature with 10,000 unique values (merchant IDs). T-SNE's O(n^2) memory usage spiked to 120 GB, causing the Barnes-Hut approximation to degrade. The embedding became meaningless.

We switched to UMAP. With n_neighbors=15 and min_dist=0.1, UMAP processed the same 50,000 points in 12 seconds (vs t-SNE's 8 minutes) and used 4 GB RAM. The embedding revealed clear clusters: legitimate transactions formed tight groups, while fraud cases scattered. More importantly, UMAP's transform method allowed us to project new transactions into the same embedding in real-time, enabling a live dashboard.

We also discovered that the new merchant ID feature was causing the collapse: its high cardinality dominated the distance metric. We applied target encoding to reduce it to 10 dimensions.

The incident taught us that t-SNE is brittle under data drift and scaling changes. UMAP's robustness, speed, and transform capability made it the default for all future visualizations. We now run a nightly UMAP embedding with automatic hyperparameter tuning and alerting if neighbor preservation drops below 0.7.

io/thecodeforge/incident_recovery.pyPYTHON

import numpy as np
import umap
from sklearn.preprocessing import StandardScaler

# Simulate the incident: high-cardinality feature
np.random.seed(42)
n = 50000
X = np.random.randn(n, 200)
# Add high-cardinality categorical (one-hot encoded)
# Note: the dense one-hot block alone needs about 4 GB of RAM
merchant_ids = np.random.randint(0, 10000, n)
merchant_onehot = np.zeros((n, 10000))
merchant_onehot[np.arange(n), merchant_ids] = 1
X = np.hstack([X, merchant_onehot])  # 200 + 10000 = 10200 features

# UMAP to the rescue
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42, verbose=False)
embedding = reducer.fit_transform(X_scaled)
print(f"UMAP embedding shape: {embedding.shape}")
# Figure reported from the incident, not measured by this script
print(f"Time: ~12 seconds on modern hardware")

# Transform new point (must have the same number of features as the training data)
new_point = np.random.randn(1, X.shape[1])
new_point_scaled = scaler.transform(new_point)
new_embedding = reducer.transform(new_point_scaled)
print(f"New point embedding: {new_embedding}")

Output

UMAP embedding shape: (50000, 2)

Time: ~12 seconds on modern hardware

New point embedding: [[-3.456 2.789]]

🔥 High Cardinality Kills t-SNE

One-hot encoding a feature with many unique values adds thousands of sparse columns that dominate distance calculations. Use target encoding or PCA before embedding.
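A minimal sketch of the target-encoding option, assuming a binary fraud label is available. The function name `target_encode`, the smoothing strength `alpha`, and the synthetic data are illustrative assumptions, not part of the original pipeline. It replaces a wide one-hot block with a single numeric column using plain NumPy.

```python
import numpy as np

def target_encode(categories, target, n_categories, alpha=20.0):
    """Smoothed mean of `target` per category (alpha is an illustrative choice)."""
    counts = np.bincount(categories, minlength=n_categories)
    sums = np.bincount(categories, weights=target, minlength=n_categories)
    prior = target.mean()
    # Shrink rare categories toward the global mean so they do not get extreme values.
    encoding = (sums + alpha * prior) / (counts + alpha)
    return encoding[categories]

rng = np.random.default_rng(42)
n, n_merchants = 5000, 1000
merchant_ids = rng.integers(0, n_merchants, size=n)
is_fraud = (rng.random(n) < 0.05).astype(float)  # synthetic labels, illustration only

merchant_te = target_encode(merchant_ids, is_fraud, n_merchants)
print(merchant_te.shape)  # one column instead of n_merchants one-hot columns
```

In a real pipeline the encoding would be fit on training folds only, since computing it on the same rows it is applied to leaks the label into the feature.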

📊 Production Insight

Monitor embedding quality with a drift detector. If neighbor preservation drops below a threshold, trigger a retraining pipeline. Always log the feature distribution that produced the embedding.
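The guide does not define "neighbor preservation", so the sketch below assumes one common reading: the average fraction of each point's k nearest neighbors in the original space that are still among its k nearest neighbors in the embedding. The helper names and the brute-force distance computation are illustrative; the 0.7 alert threshold is the one quoted in the case study.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbors of each row (brute force, excludes self)."""
    sq_norms = (points ** 2).sum(axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

def neighbor_preservation(X_high, X_low, k=15):
    """Mean fraction of each point's k high-D neighbors that are also low-D neighbors."""
    nn_high = knn_indices(X_high, k)
    nn_low = knn_indices(X_low, k)
    overlap = [len(set(a) & set(b)) for a, b in zip(nn_high, nn_low)]
    return float(np.mean(overlap)) / k

rng = np.random.default_rng(0)
X_high = rng.normal(size=(300, 20))
X_low_random = rng.normal(size=(300, 2))  # a layout unrelated to X_high

score_self = neighbor_preservation(X_high, X_high)          # best case
score_random = neighbor_preservation(X_high, X_low_random)  # worst case

ALERT_THRESHOLD = 0.7
for name, score in [("self", score_self), ("random", score_random)]:
    status = "ALERT" if score < ALERT_THRESHOLD else "ok"
    print(f"{name}: {score:.2f} [{status}]")
```

The brute-force distance matrix is O(n²) in memory, so for a nightly job on tens of thousands of points the score would more plausibly be computed on a random sample or with an approximate nearest-neighbor index.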

🎯 Key Takeaway

UMAP saved a production system from t-SNE's memory blowup and lack of transform. High-cardinality features are a common cause of embedding collapse. Use UMAP for robustness, speed, and real-time projection.

● Production incident · POST-MORTEM · severity: high

The $2M Misleading t-SNE Plot

Symptom

A t-SNE visualization of customer embedding vectors showed two distinct, well-separated clusters, leading the product team to believe there were two clear user personas. They planned to launch two separate product variants.

Assumption

The clusters in the t-SNE plot represent real, distinct customer segments that are separable in the original high-dimensional space.

Root cause

t-SNE's KL divergence loss forces repulsion between all points, artificially creating gaps even in uniform data. The apparent separation was an artifact of the algorithm, not a real structure. The embedding was also run with a single random seed (42) and default perplexity (30), which exaggerated the split.

Fix

We re-ran the analysis with UMAP (n_neighbors=50, min_dist=0.1) and also ran t-SNE with multiple perplexities (5, 10, 30, 50) and seeds. The UMAP plot showed a single continuous distribution with no clear separation. We also computed silhouette scores on the original 128-dimensional embeddings using k-means (k=2), which gave a score of 0.12. A score that close to zero is consistent with overlapping groups rather than two well-separated clusters. The product team was briefed, and the launch was adjusted to a single product with personalized features.

Key lesson

Never trust a single t-SNE plot; always validate with multiple runs and other methods.
t-SNE can create false clusters; UMAP is less prone to this artifact but still not immune.
Always combine visualization with quantitative cluster validation on the original data.
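The quantitative check from the fix can be sketched as follows. This is not the team's actual analysis: the data is synthetic, the dimensionality is reduced to 16 to keep the example fast, and the helper name `kmeans_silhouette` is an assumption. It contrasts a single continuous population with two genuinely separated ones, scoring the k-means partition in the original feature space rather than in a 2D embedding.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def kmeans_silhouette(X, k=2, seed=0):
    """Silhouette of a k-means partition computed in the original feature space."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    return silhouette_score(X, labels)

rng = np.random.default_rng(0)
# One continuous population: any two-way split is arbitrary.
X_single = rng.normal(size=(500, 16))
# Two genuinely separated populations, for contrast.
X_two = np.vstack([
    rng.normal(size=(250, 16)),
    rng.normal(loc=10.0, size=(250, 16)),
])

score_single = kmeans_silhouette(X_single)
score_two = kmeans_silhouette(X_two)
print(f"single population: {score_single:.2f}")
print(f"two populations:   {score_two:.2f}")
```

A score near zero for the k=2 partition suggests the split is not supported by the data, whatever the 2D plot appears to show.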

Production debug guide · A systematic approach to diagnosing embedding issues · 4 entries

Symptom · 01

Embedding shows no structure (all points in a single blob)

→

Fix

Check if data is too noisy or high-dimensional. Reduce dimensions with PCA first (e.g., to 50). Increase perplexity (t-SNE) or n_neighbors (UMAP). Verify that the data has actual signal (e.g., run a simple clustering).

Symptom · 02

Embedding changes drastically between runs

→

Fix

Fix random seed. Use PCA initialization. Increase number of iterations. Check if dataset is too small (<100 points) for meaningful embedding.

Symptom · 03

Clusters appear but don't match domain knowledge

→

Fix

Run UMAP with different hyperparameters. Compute cluster validity indices (silhouette, Davies-Bouldin) on original data. Check for batch effects or confounding variables.

Symptom · 04

Embedding is too slow or memory-intensive

→

Fix

Use Barnes-Hut t-SNE (sklearn) or UMAP with approximate nearest neighbors. Subsample data if possible. Consider using a random subset for exploration, then transform new points with UMAP's transform method.
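For Symptom 02, a quick way to confirm that instability comes from the seed rather than the data is to embed twice with the same settings and compare. The sketch below uses scikit-learn's `TSNE` with `init="pca"` and a fixed `random_state`; the blob data and the `embed` helper are illustrative assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic stand-in data: three Gaussian blobs in 30 dimensions.
centers = rng.normal(scale=8.0, size=(3, 30))
X = np.vstack([c + rng.normal(size=(100, 30)) for c in centers])

def embed(data, seed):
    # PCA initialization plus a fixed seed, as suggested for unstable embeddings.
    tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=seed)
    return tsne.fit_transform(data)

run_a = embed(X, seed=42)
run_b = embed(X, seed=42)
max_diff = float(np.abs(run_a - run_b).max())
print(f"max coordinate difference with identical seed: {max_diff:.3g}")
```

If two runs with identical settings still differ noticeably, look for nondeterminism upstream (shuffled input order, changing preprocessing) before tuning the embedding itself.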

★ Quick Debug Cheat Sheet for t-SNE/UMAP · Immediate actions for common embedding problems

No structure in embedding

Immediate action

Reduce dimensions with PCA to 50, then re-run.

Commands

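from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# `data` is your (n_samples, n_features) array
# Reduce to 50 components first to denoise and speed up t-SNE
pca = PCA(n_components=50)
data_pca = pca.fit_transform(data)

tsne = TSNE(perplexity=30, random_state=42)
embedding = tsne.fit_transform(data_pca)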

Fix now

If still blob-like, try UMAP with n_neighbors=15, min_dist=0.1.

t-SNE vs UMAP: Key Differences

Feature	t-SNE	UMAP	Winner
Computational Complexity	O(n²) naive; O(n log n) with Barnes-Hut	O(n log n) via approximate nearest neighbors	UMAP
Global Structure Preservation	Poor; focuses on local neighborhoods	Good; balances local and global	UMAP
Hyperparameter Sensitivity	High (perplexity, learning rate)	Moderate (n_neighbors, min_dist)	UMAP (more robust)
Scalability (n > 100k)	Impractical without approximations	Handles millions of points	UMAP
Reproducibility	Sensitive to random seed	More stable with fixed seed	UMAP
Interpretability	Tight clusters, but distances meaningless	Clusters + some global structure	Depends on use case

⚙ Quick Reference

8 commands from this guide

File	Command / Code	Purpose
io/thecodeforge/viz/why_high_dim.py	from sklearn.datasets import make_classification	Why Visualize High-Dimensional Data? The 2026 Context
io/thecodeforge/viz/tsne_math.py	from sklearn.manifold import TSNE	t-SNE
io/thecodeforge/viz/umap_topology.py	from sklearn.datasets import fetch_openml	UMAP
io/thecodeforge/viz/head_to_head.py	from sklearn.manifold import TSNE	Head-to-Head
io/thecodeforge/hyperparam_sweep.py	from sklearn.datasets import make_blobs	Hyperparameter Tuning
io/thecodeforge/production_pipeline.py	from sklearn.cluster import KMeans	Production Patterns
io/thecodeforge/avoid_pitfalls.py	from sklearn.preprocessing import StandardScaler	Common Pitfalls and How to Avoid Them (With Code Examples)
io/thecodeforge/incident_recovery.py	from sklearn.preprocessing import StandardScaler	Case Study

Key takeaways

t-SNE and UMAP are for visualization, not feature extraction or distance preservation.

t-SNE uses a heavy-tailed t-distribution in low-D to avoid crowding; UMAP uses a fuzzy topological approach.

UMAP is generally faster, scales better, and better preserves global structure than t-SNE.

Hyperparameters (perplexity, n_neighbors, min_dist) drastically affect output; always run multiple seeds.

Both are sensitive to random initialization; fix the random seed and use PCA initialization for more stable layouts.

Never trust a single t-SNE/UMAP plot; validate with clustering metrics or domain knowledge.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

Explain the t-SNE algorithm step by step. What is the role of the t-dist...

Q02 · SENIOR

How does UMAP differ from t-SNE in terms of mathematical foundation and ...

Q03 · SENIOR

What is the 'crowding problem' in t-SNE and how does the algorithm addre...

Q01 of 03 · SENIOR

Explain the t-SNE algorithm step by step. What is the role of the t-distribution in the low-dimensional map?

ANSWER

t-SNE first computes pairwise similarities in high-dimensional space using Gaussian kernels, where similarity is proportional to conditional probability of being neighbors. Perplexity controls the bandwidth. Then it initializes low-dimensional points (often with PCA). It defines low-dimensional similarities using a Student t-distribution with one degree of freedom (heavy-tailed). The algorithm minimizes the KL divergence between the high-D and low-D similarity distributions using gradient descent. The t-distribution alleviates the crowding problem: moderate distances in high-D are mapped to larger distances in low-D, preventing points from piling up in the center.
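Written out, the steps in this answer correspond to the standard t-SNE formulation. The notation is introduced here for illustration: $x_i$ are the high-dimensional points, $y_i$ their low-dimensional images, $n$ the number of points, and $\sigma_i$ the per-point Gaussian bandwidth, chosen so that the perplexity of $P_i$ matches the user-supplied value.

```latex
% High-dimensional similarities (Gaussian kernel, symmetrized)
p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

% Perplexity constraint that fixes each \sigma_i
\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad
H(P_i) = -\sum_{j \neq i} p_{j|i} \log_2 p_{j|i}

% Low-dimensional similarities (Student t, one degree of freedom)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Objective minimized by gradient descent over the y_i
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

The heavy tail of $q_{ij}$ is what lets a moderate $p_{ij}$ be matched by a comparatively large map distance, which is the mechanism behind the crowding fix described above.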
