Dimensionality reduction maps high-dimensional data (10,000+ features) into 2-50 dimensions while preserving variance (PCA) or neighbourhood structure (t-SNE/UMAP).
PCA: linear, deterministic, O(n·d²) time — interpretable components, best for pre-processing before linear models.
t-SNE: non-linear, stochastic, O(n²) time — great for visualisation but sensitive to perplexity; can create false clusters.
UMAP: non-linear, O(n log n) with approximate neighbours — scales to millions of points, preserves more global structure than t-SNE.
LDA: supervised, finds axes that maximise class separation — use only when labels are available and classes are roughly Gaussian.
Production insight: PCA can be inverted to detect outliers; t-SNE transforms are not reusable on new data — embed once, discard the model.
Plain-English First
Imagine you're describing every person at a party using 500 different facts — their shoe size, hair length, what they had for breakfast, etc. Most of those facts are redundant or useless for figuring out who's similar to whom. Dimensionality reduction is like a smart assistant that says: 'forget 490 of those facts — these 10 actually capture who people are.' You lose a little detail, but you gain the ability to actually SEE patterns, run models faster, and stop drowning in noise. That's it. That's dimensionality reduction.
High-dimensional data is everywhere in production ML — a user's click history might span 10,000 features, a raw image 50,176 pixels, a genomics dataset half a million SNP markers. Models trained directly on this data suffer from the curse of dimensionality: distances become meaningless, overfitting skyrockets, training slows to a crawl, and visualisation becomes impossible. Dimensionality reduction isn't a pre-processing nicety — it's often the difference between a model that generalises and one that memorises noise.
The core problem these techniques solve is geometric. In high dimensions, every point is roughly equidistant from every other point. That breaks nearest-neighbour search, makes clustering unstable, and bloats the covariance matrix your model has to estimate. By projecting data into a lower-dimensional space that preserves what actually matters — variance, local structure, class separability — you give your downstream algorithms a fighting chance.
By the end of this article you'll understand the internal mechanics of PCA, t-SNE, UMAP, and LDA well enough to choose the right one for a production problem, tune hyperparameters with confidence, avoid the subtle bugs that silently corrupt results, and answer the interview questions that trip up even experienced ML engineers.
What is Dimensionality Reduction Techniques?
High-dimensional data is the norm in production ML, not the exception. Each new feature adds a dimension, and as dimensions grow, the volume of the feature space explodes exponentially. This is the curse of dimensionality: distances between any two points become nearly identical, making nearest-neighbour algorithms useless; the sample density plummets, so you need exponentially more data to cover the space; and models overfit because they can memorise noise in the extra dimensions instead of learning signal.
Dimensionality reduction tackles this by projecting the data into a lower-dimensional subspace that preserves the structure you care about. The four dominant techniques — PCA, t-SNE, UMAP, LDA — each make different trade-offs between linearity, scalability, interpretability, and the type of structure they preserve.
PCA keeps global variance; t-SNE keeps local neighbourhoods; UMAP keeps a fuzzy topology that balances local and global; LDA keeps class separability. Pick the wrong one and you'll either destroy the signal you're trying to capture or get misled by false patterns.
package io.thecodeforge.dimensionality;
// Quick demonstration of curse of dimensionalitypublicclassReductionDemo {
publicstaticvoidmain(String[] args) {
int n = 1000; // points
int dim = 1000; // dimensionsdouble[][] data = newdouble[n][dim];
// fill with random data
java.util.Random rng = new java.util.Random(42);
for (int i = 0; i < n; i++)
for (int j = 0; j < dim; j++)
data[i][j] = rng.nextGaussian();
// compute average distance between random pointsdouble sumDist = 0.0;
int pairs = 0;
for (int i = 0; i < n; i++) {
for (int j = i+1; j < n; j++) {
double dist = 0.0;
for (int k = 0; k < dim; k++)
dist += (data[i][k] - data[j][k]) * (data[i][k] - data[j][k]);
sumDist += Math.sqrt(dist);
pairs++;
if (pairs >= 10000) break;
}
if (pairs >= 10000) break;
}
double avgDist = sumDist / pairs;
double minDist = Double.MAX_VALUE, maxDist = Double.MIN_VALUE;
// approximate min/max from same subsetSystem.out.println("Average distance in " + dim + "D: " + avgDist);
System.out.println("In high dimensions, all distances converge — that's the curse.");
}
}
Output
Average distance in 1000D: ~44.7
In high dimensions, all distances converge — that's the curse.
Why Reduction Matters in Production
A model trained on 10,000 features without reduction will likely overfit unless you have millions of samples. Dimensionality reduction is not optional — it's the first line of defence against the curse.
Production Insight
The curse of dimensionality is not theoretical. In production, high-dimensional data causes KNN to degrade to random guessing, tree-based models to split on noise, and clustering algorithms to produce arbitrary groups.
Always check the effective dimensionality of your data before choosing a reduction method. PCA can give you a quick diagnostic via explained variance ratio.
If you don't reduce, you'll need exponentially more training data to maintain model quality — that's the hidden cost.
Key Takeaway
Dimensionality reduction fights the curse of dimensionality.
Without it, distances break, models overfit, clusters vanish.
Choose the method based on which structure you want to preserve.
Always diagnose with PCA's explained variance first.
PCA — Linear Reduction with Interpretable Components
Principal Component Analysis finds the orthogonal axes that maximise the variance of the projected data. It's a linear transformation — the output dimensions are linear combinations of the original features. This makes PCA the most interpretable method: you can inspect the loadings (eigenvectors) to understand which original features drive each component.
The algorithm centres the data, computes the covariance matrix, then performs eigenvalue decomposition. The top-k eigenvectors become the principal components. In practice, you should always standardise data before PCA — otherwise high-variance features dominate.
PCA is also the fastest method: O(n·d² + d³) for n samples and d features. It's a deterministic closed-form solution — no hyperparameter tuning, no stochasticity. That's why it's the go-to for feature extraction in linear models and for anomaly detection (reconstruction error).
io/thecodeforge/dimensionality/PCA.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
package io.thecodeforge.dimensionality;
import java.util.ArrayList;
import java.util.List;
publicclassPCA {
privatedouble[] mean;
privatedouble[][] components;
publicvoidfit(double[][] data, int nComponents) {
int n = data.length;
int d = data[0].length;
// centre the data
mean = newdouble[d];
for (int j = 0; j < d; j++) {
double sum = 0.0;
for (int i = 0; i < n; i++) sum += data[i][j];
mean[j] = sum / n;
}
double[][] centered = newdouble[n][d];
for (int i = 0; i < n; i++)
for (int j = 0; j < d; j++)
centered[i][j] = data[i][j] - mean[j];
// covariance matrixdouble[][] cov = newdouble[d][d];
for (int i = 0; i < n; i++)
for (int j = 0; j < d; j++)
for (int k = 0; k < d; k++)
cov[j][k] += centered[i][j] * centered[i][k];
for (int j = 0; j < d; j++)
for (int k = 0; k < d; k++)
cov[j][k] /= (n - 1);
// simplified power iteration for top eigenvalues (production: use ARPACK)List<double[]> comps = newArrayList<>();
double[][] remaining = cov.clone();
for (int c = 0; c < nComponents; c++) {
double[] vec = newdouble[d];
for (int j = 0; j < d; j++) vec[j] = 1.0 / Math.sqrt(d);
for (int iter = 0; iter < 100; iter++) {
double[] newVec = newdouble[d];
for (int j = 0; j < d; j++)
for (int k = 0; k < d; k++)
newVec[j] += remaining[j][k] * vec[k];
double norm = 0.0;
for (double v : newVec) norm += v * v;
norm = Math.sqrt(norm);
for (int j = 0; j < d; j++) vec[j] = newVec[j] / norm;
}
comps.add(vec.clone());
// deflationfor (int i = 0; i < d; i++)
for (int j = 0; j < d; j++)
remaining[i][j] -= vec[i] * vec[j];
}
components = comps.toArray(newdouble[0][]);
}
publicdouble[][] transform(double[][] data) {
// projects data onto componentsint n = data.length;
int k = components.length;
double[][] result = newdouble[n][k];
for (int i = 0; i < n; i++)
for (int c = 0; c < k; c++)
for (int j = 0; j < data[0].length; j++)
result[i][c] += (data[i][j] - mean[j]) * components[c][j];
return result;
}
}
PCA as a Change of Basis
The first principal component points where variance is highest.
Each component is orthogonal to the previous — no redundancy.
The eigenvalues tell you how much variance each component explains.
You drop the smallest eigenvalue components — that's the reduction.
Production Insight
Never apply PCA directly to raw image pixels without flattening — you lose spatial structure. Use convolutional autoencoders instead if you need non-linear reduction for images.
PCA is invertible: reconstruction error = ||X - X_approx||. Use this as an anomaly score. Points with high reconstruction error are outliers — but only for linear manifolds.
Batch PCA with mini-batch mean update is not safe for streaming data. Use IncrementalPCA from scikit-learn or implement a covariance update formula.
Key Takeaway
PCA is the fastest, most interpretable reduction method.
Always standardise before PCA.
Use for pre-processing linear models and anomaly detection.
It cannot capture non-linear structure — that's why we have t-SNE and UMAP.
Should You Use PCA?
IfYou need interpretable feature importance (which original features drive the reduction)
→
UseUse PCA — inspect component loadings in component_matrix_[feature_index].
IfData is sparse (e.g., bag-of-words) and n_features > n_samples
→
UseUse TruncatedSVD (also called LSA). Standard PCA on dense covariance matrix would be O(d²) in memory — impossibly large.
IfYou plan to run linear models (logistic regression, SVM) after reduction
→
UsePCA is ideal. The reduced dimensions are uncorrelated, which helps linear models avoid multicollinearity.
IfData has non-linear structure (swiss roll, manifolds)
→
UseDo not use PCA — it will preserve only global variance, destroying local neighbourhoods. Use t-SNE or UMAP instead.
t-SNE — Capturing Local Neighbourhood Structure
t-Distributed Stochastic Neighbour Embedding (t-SNE) is a non-linear technique that focuses on preserving the pairwise distances between nearby points. It converts high-dimensional Euclidean distances into conditional probabilities of similarity in the original space, then matches those with a similar distribution in the low-dimensional space using a heavy-tailed t-distribution.
The key hyperparameter is perplexity — roughly the number of neighbours considered. Common mistake: using default perplexity=30 for all datasets. If your dataset is small (<500 points), reduce perplexity. If it's large (>10,000), increase it, but be aware of memory limits.
t-SNE is stochastic and each run gives a different embedding. It is not a function you can apply to new data — you must re-run on the full dataset. That makes it unsuitable for production ML pipelines that embed new points. Its only safe production use is exploratory visualisation and interpretation of learned features.
io/thecodeforge/dimensionality/TSNE.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
// io.thecodeforge.dimensionality.TSNE// Simplified t-SNE implementation for demonstration// Production: use scikit-learn or umap-learn; this shows the core looppackage io.thecodeforge.dimensionality;
import java.util.Random;
publicclassTSNE {
privatedouble[][] embedding;
privatedouble klDivergence;
publicvoidfit(double[][] data, int perplexity, int nIter) {
int n = data.length;
int d = data[0].length;
double[][] P = computePairwiseAffinities(data, perplexity);
// initialise low-dimensional points randomlyRandom rng = newRandom(42);
double[][] Y = newdouble[n][2];
for (int i = 0; i < n; i++) {
Y[i][0] = rng.nextGaussian() * 1e-4;
Y[i][1] = rng.nextGaussian() * 1e-4;
}
double learningRate = 200.0;
double momentum = 0.5;
double[][] velocity = newdouble[n][2];
for (int iter = 0; iter < nIter; iter++) {
// compute Q matrix (low-dimensional similarity)double[][] Q = newdouble[n][n];
double sumQ = 0.0;
for (int i = 0; i < n; i++) {
for (int j = i+1; j < n; j++) {
double dist = 0.0;
for (int dim = 0; dim < 2; dim++)
dist += (Y[i][dim] - Y[j][dim]) * (Y[i][dim] - Y[j][dim]);
Q[i][j] = 1.0 / (1.0 + dist);
Q[j][i] = Q[i][j];
sumQ += 2.0 * Q[i][j];
}
}
for (int i = 0; i < n; i++)
for (int j = 0; j < n; j++)
if (i != j) Q[i][j] /= sumQ;
// compute gradientdouble[][] grad = newdouble[n][2];
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
if (i == j) continue;
double diff = P[i][j] - Q[i][j];
double dist = 0.0;
for (int dim = 0; dim < 2; dim++)
dist += (Y[i][dim] - Y[j][dim]) * (Y[i][dim] - Y[j][dim]);
double factor = diff * (1.0 / (1.0 + dist));
for (int dim = 0; dim < 2; dim++)
grad[i][dim] += factor * (Y[i][dim] - Y[j][dim]);
}
grad[i][0] *= 4.0;
grad[i][1] *= 4.0;
}
// update with momentumfor (int i = 0; i < n; i++) {
for (int dim = 0; dim < 2; dim++) {
velocity[i][dim] = momentum * velocity[i][dim] - learningRate * grad[i][dim];
Y[i][dim] += velocity[i][dim];
}
}
// KL divergence monitoringif (iter % 100 == 0) {
double kl = 0.0;
for (int i = 0; i < n; i++)
for (int j = 0; j < n; j++)
if (P[i][j] > 0 && Q[i][j] > 0)
kl += P[i][j] * Math.log(P[i][j] / Q[i][j]);
System.out.println("Iter " + iter + " KL divergence: " + kl);
this.klDivergence = kl;
}
// momentum change after 250 iterationsif (iter == 250) momentum = 0.8;
}
embedding = Y;
}
privatedouble[][] computePairwiseAffinities(double[][] data, int perplexity) {
// simplified: compute Gaussian affinities with fixed variance// production: binary search sigma for each point to match perplexityint n = data.length;
double[][] P = newdouble[n][n];
double sigma = 1.0; // fixed for brevityfor (int i = 0; i < n; i++) {
double sum = 0.0;
for (int j = 0; j < n; j++) {
if (i == j) continue;
double dist = 0.0;
for (int dim = 0; dim < data[0].length; dim++)
dist += (data[i][dim] - data[j][dim]) * (data[i][dim] - data[j][dim]);
P[i][j] = Math.exp(-dist / (2 * sigma * sigma));
sum += P[i][j];
}
for (int j = 0; j < n; j++)
if (i != j) P[i][j] /= sum;
}
// symmetrise and normalisefor (int i = 0; i < n; i++)
for (int j = 0; j < n; j++)
P[i][j] = (P[i][j] + P[j][i]) / (2 * n);
return P;
}
publicdouble[][] getEmbedding() { return embedding; }
publicdoublegetKlDivergence() { return klDivergence; }
}
Production Warning: t-SNE is not a transform
t-SNE does not learn a mapping function. You cannot fit on training data and transform test data. If you need to embed new points, use UMAP (parametric mode) or train an autoencoder. t-SNE is for static visualisation only.
Production Insight
t-SNE is O(n²) in both memory and compute. For n > 100,000, the pairwise affinity matrix alone can exceed RAM on most machines.
Different runs produce different embeddings. Always fix random_state for reproducibility in reports.
False clusters: t-SNE can create artificial clusters even from uniform random noise, especially with low perplexity. Verify cluster validity with external metrics.
Key Takeaway
t-SNE is the gold standard for visualisation but useless for production pipelines.
Never use t-SNE on >50k points — memory explodes.
Always fix random_state and check KL divergence for convergence.
If false clusters appear, increase perplexity or switch to UMAP.
t-SNE or UMAP?
IfDataset size < 10,000 points
→
UseBoth work, but t-SNE often gives more visually appealing clusters. Use t-SNE with perplexity ~30.
IfDataset size between 10,000 and 100,000 points
→
Uset-SNE is feasible but slow. Use UMAP with default n_neighbors=15 for faster runs and comparable quality.
IfDataset > 100,000 points
→
UseAvoid t-SNE. Use UMAP with approximate nearest neighbours (annoy or nmslib).
IfNeed to embed new unseen points after training
→
UseUMAP has a .transform() method via umap.UMAP() . t-SNE cannot do this — you must re-run on the entire dataset.
UMAP — Scalable Non-Linear Reduction
Uniform Manifold Approximation and Projection (UMAP) is a manifold learning technique that builds a fuzzy topological representation of the data in high dimensions, then optimises a similar representation in low dimensions. It uses nearest neighbours to create a weighted graph, then minimises cross-entropy between the high-dimensional and low-dimensional graphs.
UMAP's key advantages: it scales to millions of points (O(n log n) with approximate neighbours), preserves more global structure than t-SNE, and — crucially — provides a transform method for embedding new data points after training. This makes it the only non-linear method suitable for production pipelines.
Two main hyperparameters: n_neighbors (controls local vs global balance, default 15) and min_dist (controls how tightly points pack together in the embedding, default 0.1). Lower n_neighbors focuses on local structure; higher values capture global topology.
io/thecodeforge/dimensionality/UMAP.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
// io.thecodeforge.dimensionality.UMAP// Simplified UMAP core — production uses umap-learn (Python)// This demonstrates the fuzzy simplicial set constructionpackage io.thecodeforge.dimensionality;
import java.util.*;
publicclassUMAP {
privatedouble[][] embedding;
privatedouble minDist;
privateint nNeighbors;
publicvoidfit(double[][] data, int nNeighbors, double minDist, int nComponents) {
this.nNeighbors = nNeighbors;
this.minDist = minDist;
int n = data.length;
// 1. Compute fuzzy simplicial set (simplified: nearest neighbour graph with smooth knn distances)double[][] knnDistances = computeKNNDistances(data, nNeighbors);
double[][] fuzzyGraph = computeFuzzyGraph(knnDistances, nNeighbors);
// 2. Initialise low-dimensional embedding using spectral layoutdouble[][] Y = spectralInit(fuzzyGraph, nComponents);
// 3. Optimise with force-directed graph layoutoptimizeEmbedding(fuzzyGraph, Y, nIterations: 500);
this.embedding = Y;
}
privatedouble[][] computeKNNDistances(double[][] data, int k) {
// Brute force for demo; production uses approximate nearest neighbours (HNSW)int n = data.length;
double[][] distances = newdouble[n][k];
for (int i = 0; i < n; i++) {
double[] dists = newdouble[n];
for (int j = 0; j < n; j++) {
if (i == j) dists[j] = Double.MAX_VALUE;
else {
double sum = 0.0;
for (int dim = 0; dim < data[0].length; dim++)
sum += (data[i][dim] - data[j][dim]) * (data[i][dim] - data[j][dim]);
dists[j] = Math.sqrt(sum);
}
}
// partial sort to get top kArrays.sort(dists);
for (int j = 0; j < k; j++) distances[i][j] = dists[j];
}
return distances;
}
privatedouble[][] computeFuzzyGraph(double[][] knnDistances, int k) {
int n = knnDistances.length;
double[][] graph = newdouble[n][n];
double sigma = 1.0; // simplified; UMAP computes local connectivity radiusfor (int i = 0; i < n; i++) {
double rho = knnDistances[i][0]; // distance to nearest neighbourfor (int jIndex = 0; jIndex < k; jIndex++) {
double d = knnDistances[i][jIndex];
double weight = Math.exp(-(d - rho) / sigma);
int j = jIndex; // simplified: need actual j index, not just distance
graph[i][j] = weight;
}
}
// symmetrise: graph = graph + graph^T - graph * graph^Tdouble[][] sym = newdouble[n][n];
for (int i = 0; i < n; i++)
for (int j = 0; j < n; j++)
sym[i][j] = graph[i][j] + graph[j][i] - graph[i][j] * graph[j][i];
return sym;
}
privatedouble[][] spectralInit(double[][] graph, int nComponents) {
// stub; production uses eigendecomposition of graph Laplacianreturnnewdouble[graph.length][nComponents];
}
privatevoidoptimizeEmbedding(double[][] graph, double[][] Y, int nIterations) {
// stub; UMAP uses stochastic gradient descent with negative sampling
}
publicdouble[][] transform(double[][] newData) {
// production implementation: search nearest neighbours in original data,// initialise positions, then run a small optimisation loopreturnnewdouble[newData.length][embedding[0].length];
}
}
UMAP's Two-Step Process
Step 1: Construct a weighted nearest-neighbour graph with fuzzy set membership.
Step 2: Optimise a low-dimensional graph to minimise cross-entropy between the two graphs.
The graph captures the topological structure — local AND global.
The transform method embeds new points using the same graph (parametric extension).
Production Insight
UMAP with high n_neighbors (>200) becomes increasingly global and starts behaving like PCA on the graph Laplacian.
UMAP's transform is not perfect; it uses a nearest neighbour search in the original data to initialise the new point. If original data is high-dimensional and sparse, the search is noisy.
For reproducibility, always set random_state and consider using n_jobs=1 to avoid non-determinism from parallel approximate neighbours.
Key Takeaway
UMAP is the production-grade alternative to t-SNE.
It scales to millions of points and supports transform.
Use n_neighbors to control local vs global balance.
Always set n_jobs=1 and random_state for deterministic pipelines.
When UMAP Beats Other Methods
IfYou need to embed new data points (transform) after training
→
UseOnly UMAP and autoencoders support transform; PCA also does but is linear.
IfDataset > 100k points
→
UseUMAP is the only practical non-linear option (O(n log n)).
IfYou care about preserving global structure (e.g., distances between clusters)
→
UseUMAP with n_neighbors=30-100 preserves more global structure than t-SNE.
LDA — Supervised Dimensionality Reduction
Linear Discriminant Analysis is a supervised technique. Unlike PCA which finds directions of maximum variance, LDA finds the linear axes that maximise class separability. It does this by maximising the ratio of between-class variance to within-class variance.
LDA assumes: (1) the data is normally distributed per class, (2) classes have identical covariance matrices, and (3) the number of features is less than the number of samples (otherwise the within-class scatter matrix is singular). It's limited to C-1 components where C is the number of classes, so for binary classification you get only 1 dimension.
When these assumptions hold, LDA often outperforms PCA for classification tasks because it's directly optimising for class separation. It's commonly used in face recognition (Fisherfaces) and as a pre-processing step for algorithms that benefit from class-discriminative features.
io/thecodeforge/dimensionality/LDA.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
package io.thecodeforge.dimensionality;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
publicclassLDA {
privatedouble[][] components;
privatedouble[] mean;
publicvoidfit(double[][] data, int[] labels, int nComponents) {
int n = data.length;
int d = data[0].length;
// compute overall mean
mean = newdouble[d];
for (int i = 0; i < n; i++)
for (int j = 0; j < d; j++)
mean[j] += data[i][j];
for (int j = 0; j < d; j++) mean[j] /= n;
// group data by classMap<Integer, List<double[]>> groups = newHashMap<>();
for (int i = 0; i < n; i++) {
groups.computeIfAbsent(labels[i], k -> newArrayList<>()).add(data[i]);
}
int numClasses = groups.size();
if (nComponents > numClasses - 1) {
thrownewIllegalArgumentException("LDA max components = " + (numClasses - 1));
}
// within-class scatter Swdouble[][] Sw = newdouble[d][d];
// between-class scatter Sbdouble[][] Sb = newdouble[d][d];
for (Map.Entry<Integer, List<double[]>> entry : groups.entrySet()) {
List<double[]> classData = entry.getValue();
double[] classMean = newdouble[d];
for (double[] row : classData)
for (int j = 0; j < d; j++)
classMean[j] += row[j];
for (int j = 0; j < d; j++) classMean[j] /= classData.size();
// within-class scatterfor (double[] row : classData) {
double[] diff = newdouble[d];
for (int j = 0; j < d; j++) diff[j] = row[j] - classMean[j];
for (int j = 0; j < d; j++)
for (int k = 0; k < d; k++)
Sw[j][k] += diff[j] * diff[k];
}
// between-class scatterdouble[] diff = newdouble[d];
for (int j = 0; j < d; j++) diff[j] = classMean[j] - mean[j];
double scale = classData.size();
for (int j = 0; j < d; j++)
for (int k = 0; k < d; k++)
Sb[j][k] += scale * diff[j] * diff[k];
}
// Solve generalised eigenvalue problem: Sb * v = lambda * Sw * v// Simplified: assume Sw invertible, compute Sw^{-1} * Sbdouble[][] SwInv = invertMatrix(Sw);
double[][] mat = multiply(SwInv, Sb);
// Power iteration for top nComponents eigenvectorsList<double[]> comps = newArrayList<>();
double[][] remaining = mat;
for (int c = 0; c < nComponents; c++) {
double[] vec = newdouble[d];
for (int j = 0; j < d; j++) vec[j] = 1.0 / Math.sqrt(d);
for (int iter = 0; iter < 100; iter++) {
double[] newVec = newdouble[d];
for (int j = 0; j < d; j++)
for (int k = 0; k < d; k++)
newVec[j] += remaining[j][k] * vec[k];
double norm = 0.0;
for (double v : newVec) norm += v * v;
norm = Math.sqrt(norm);
for (int j = 0; j < d; j++) vec[j] = newVec[j] / norm;
}
comps.add(vec.clone());
// deflationfor (int i = 0; i < d; i++)
for (int j = 0; j < d; j++)
remaining[i][j] -= vec[i] * vec[j];
}
components = comps.toArray(newdouble[0][]);
}
publicdouble[][] transform(double[][] data) {
int n = data.length;
int c = components.length;
double[][] result = newdouble[n][c];
for (int i = 0; i < n; i++)
for (int j = 0; j < c; j++)
for (int k = 0; k < data[0].length; k++)
result[i][j] += (data[i][k] - mean[k]) * components[j][k];
return result;
}
privatedouble[][] multiply(double[][] a, double[][] b) { /* stub */ returnnewdouble[a.length][b[0].length]; }
privatedouble[][] invertMatrix(double[][] m) { /* stub: use ApacheCommonsMath */ returnnewdouble[m.length][m[0].length]; }
}
LDA vs PCA for Classification
If your goal is classification and you have labelled data, LDA often beats PCA because it directly maximises class separation. However, LDA breaks when classes are imbalanced or have non-Gaussian distributions. In those cases, PCA + a good classifier may be more robust.
Production Insight
LDA assumes equal covariance across classes. If your classes have different spreads (e.g., one class has much higher variance), LDA will find suboptimal axes. Use Quadratic Discriminant Analysis (QDA) instead.
When n_features > n_samples, the within-class scatter matrix is singular. Solve this with regularised LDA (RLDA) or PCA pre-screening.
LDA is vulnerable to outliers because it uses class means and covariance matrices. Clean your labels before fitting.
Key Takeaway
LDA is supervised — use it when you care about class separation.
Assumes Gaussian classes with equal covariance — verify before trusting.
Limited to C-1 components.
Regularise if features > samples.
When to Use LDA?
IfYou have labelled data and want to reduce dimensions for classification
→
UseUse LDA if classes are balanced and roughly Gaussian. Max components = C-1.
IfNo labels available
→
UseLDA cannot be used. Use PCA or UMAP instead.
IfMore features than samples
→
UseLDA within-class scatter is singular. Apply PCA first to reduce to < n_samples, then LDA.
Choosing the Right Technique — Decision Tree and Trade-offs
You've now seen four techniques. The right choice depends on your goal: interpretability, visualisation, scalability, or classification accuracy.
Here's a summary rule of thumb
Need interpretable, fast, linear? → PCA
Need to visualise small datasets (<10k) with pretty clusters? → t-SNE
Need to visualise large datasets or embed new points? → UMAP
Have labels, want to separate classes? → LDA
But beyond the initial choice, monitor the embedding quality: for t-SNE/UMAP, check the KL divergence or cross-entropy loss. For PCA/LDA, check the explained variance ratio or classification accuracy on a hold-out set.
A common pitfall: applying dimensionality reduction once and never re-validating. Data distributions shift. The embedding that worked last quarter may be useless today. Treat your reduction pipeline as a trainable component with regular monitoring.
package io.thecodeforge.dimensionality;
// Monitoring class for embedding quality in productionimport java.util.*;
publicclassReductionMonitor {
publicstaticdoubletrustworthiness(double[][] original, double[][] embedding, int k) {
// Measures how often k-nearest neighbours in original are still neighbours in embeddingint n = original.length;
int[] origNeighbors = getKNN(original, k);
int[] embNeighbors = getKNN(embedding, k);
int common = 0;
for (int i = 0; i < n; i++) {
if (origNeighbors[i] == embNeighbors[i]) common++;
}
return (double) common / n;
}
privatestaticint[] getKNN(double[][] data, int k) {
// Stub — production uses a KD-Tree or ball tree for O(n log n)returnnewint[data.length];
}
publicstaticdoublestress(double[][] highD, double[][] lowD) {
// Normalised stress: sum of squared differences in pairwise distancesdouble sumDiff = 0.0, sumOrig = 0.0;
int n = highD.length;
for (int i = 0; i < n; i++) {
for (int j = i+1; j < n; j++) {
double dHigh = euclidean(highD[i], highD[j]);
double dLow = euclidean(lowD[i], lowD[j]);
sumDiff += (dHigh - dLow) * (dHigh - dLow);
sumOrig += dHigh * dHigh;
}
}
returnMath.sqrt(sumDiff / sumOrig);
}
privatestaticdoubleeuclidean(double[] a, double[] b) {
double sum = 0.0;
for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
returnMath.sqrt(sum);
}
}
Silent Model Degradation
Dimensionality reduction is a data-dependent transformation. When your input data distribution shifts (new users, changed behaviour), the embedding quality can silently collapse. Always monitor trustworthiness and stress metrics, and retrigger fitting on a scheduled basis or when data drift is detected.
Production Insight
Embeddings from t-SNE/UMAP are not stable across retraining. If you retrain with new data, the entire embedding shifts — you cannot compare positions from two different runs.
For production pipelines that require stable low-dimensional representations (e.g., feature storage for retrieval), use PCA or parametric UMAP with a fixed random seed.
The cost of fitting: PCA is O(n·d²), t-SNE is O(n²), UMAP is O(n log n). For real-time systems, only PCA and parametric UMAP can be applied at inference time.
Key Takeaway
Match the technique to your goal: PCA for speed/interpretability, UMAP for scale/transform, t-SNE for visualisation only, LDA for class separation.
Monitor embedding quality with trustworthiness and stress.
Never blindly use default hyperparameters — tune perplexity, n_neighbors, and min_dist.
Plan for data drift — schedule periodic refits.
Final Decision Tree
IfGoal: interpretability (which original features matter)
→
UsePCA — inspect component loadings.
IfGoal: visual exploration of clusters
→
UseUMAP for >50k points, t-SNE for <10k.
IfGoal: pre-processing for a supervised model with labels
→
UseLDA if assumptions hold; otherwise PCA + any classifier.
IfGoal: anomaly detection
→
UsePCA reconstruction error or autoencoder loss.
IfNeed to embed new points in production
→
UsePCA (linear) or UMAP (non-linear) — avoid t-SNE.
● Production incidentPOST-MORTEMseverity: high
t-SNE OOM crash on a 500k row customer dataset
Symptom
t-SNE training fails with OutOfMemoryError on datasets larger than ~50,000 points. The machine has 64 GB RAM but the process allocates >100 GB.
Assumption
That t-SNE scales to any dataset size because scikit-learn handles it. Wrong — default implementation computes a full distance matrix of size n².
Root cause
t-SNE computes a dense n×n matrix of pairwise distances (float64) → memory = n² × 8 bytes. For n=500,000 that's 2,000 GB. Even with optimisations, the gradient computation requires O(n²) memory.
Fix
Switch to UMAP which uses nearest neighbour approximation (annoy / HNSW) and O(n log n) memory. Alternatively, subsample to 10,000 points for t-SNE visualisation.
Key lesson
Always estimate memory before running t-SNE on large datasets: n² × 8 bytes.
Use UMAP when you need to visualise more than 50,000 points.
Never use t-SNE for dimensionality reduction as a pre-processing step for downstream models — it's non-parametric and cannot embed new points.
Production debug guideSymptom → Root cause → Action4 entries
Symptom · 01
PCA components explain <10% of total variance in first 2 dimensions
→
Fix
Check if data is standardised (mean-centre + unit variance). PCA on unscaled data overweights high-variance features. Use StandardScaler before PCA.
Symptom · 02
t-SNE plot shows distinct clusters that don't align with labels and disappear on rerun
→
Fix
Reduce perplexity (try 5-30) or increase learning rate. False clusters appear when perplexity is too low relative to data size. Rerun with fixed random_state to verify stability.
Symptom · 03
UMAP visualisation looks like a uniform square blob with no structure
→
Fix
Increase n_neighbors (default 15). Too low a value makes UMAP focus only on local structure, losing global separation. Also check that data is scaled and that n_components is not too high for the visualised dimension.
Symptom · 04
LDA scored <60% accuracy on held-out test set after reduction
→
Fix
Check whether classes are balanced and covariance matrices are roughly equal. LDA assumes Gaussian class-conditional distributions. Try QDA (Quadratic Discriminant Analysis) if covariance differs per class, or use regularised LDA.
★ Quick Debug Cheat Sheet — Dimensionality ReductionCommands and actions for the three most common production failures.
PCA memory error on large matrix−
Immediate action
Kill the job, check memory usage. Reduce batch size or use IncrementalPCA.
monitor KL divergence: tsne.kl_divergence_ after each call — should decrease monotonically
Fix now
If KL divergence plateaus above 0.5, reduce perplexity or increase learning rate. Still stuck? Use PCA initialisation (init='pca').
UMAP embedding flips sign between runs+
Immediate action
Set random_state. UMAP is stochastic by default; without a fixed seed each run gives a different rotation.
Commands
umap.UMAP(random_state=42).fit_transform(data)
Check if the loss changes: umap_model._fit() logs negative sampling loss — should be <0.1 after full fit.
Fix now
Always set random_state for reproducibility in production pipelines. For full deterministic mode, use n_jobs=1.
Dimensionality Reduction Techniques Comparison
Technique
Type
Complexity
Scalability
Interpretability
Supports transform new data
Best Use Case
PCA
Linear, unsupervised
O(n·d² + d³)
High (up to 1M points)
High — inspect loadings
Yes
Pre-processing, anomaly detection, interpretable features
t-SNE
Non-linear, unsupervised
O(n²)
Low (<50k points)
Low — no loadings
No
2D/3D visualisation for small datasets
UMAP
Non-linear, unsupervised
O(n log n)
High (up to 10M points)
Medium — via graph inspection
Yes (approximate)
Visualisation, feature engineering, production pipelines
LDA
Linear, supervised
O(n·d² + d³)
Moderate (depends on n_features vs classes)
High — see component weights per discriminant
Yes
Classification pre-processing when labels available
Key takeaways
1
The curse of dimensionality makes distances meaningless
always consider reduction for high-dimensional data.
2
PCA for interpretable features and linear pre-processing; UMAP for scalable non-linear embedding; t-SNE for visualisation only.
3
Always standardise before PCA/LDA; tune perplexity for t-SNE; set random_state for reproducibility in all stochastic methods.
4
Monitor embedding quality with trustworthiness and stress metrics
silent degradation happens when data drifts.
5
Never use t-SNE in a production pipeline
it cannot transform new data. Use UMAP or PCA for that.
6
LDA is powerful when classes are Gaussian and covariance is equal
otherwise regularise or switch to PCA.
Common mistakes to avoid
6 patterns
×
Using t-SNE for feature extraction before clustering
Symptom
Clusters appear in t-SNE plot but K-Means on original data produces completely different groups — because t-SNE can introduce false clusters.
Fix
Use UMAP or PCA for pre-processing before clustering. t-SNE should only be used for visualisation, not as a feature extraction step.
×
Not scaling data before PCA
Symptom
First few principal components are dominated by features with large magnitudes (e.g., salary in dollars vs age in years). Reduced dimensions explain little meaningful variance.
Fix
Always standardise data to zero mean and unit variance (StandardScaler) before applying PCA.
×
Setting n_neighbors too low in UMAP
Symptom
UMAP embedding shows many tiny disconnected clusters, losing global structure. The visualisation looks like a scatter of isolated points.
Fix
Increase n_neighbors to at least 15. For global structure preservation, try values between 30 and 100.
×
Using LDA when within-class covariance matrices are not equal
Symptom
LDA components do not separate classes well; classification accuracy after reduction is worse than using raw features.
Fix
Check if classes have similar covariance matrices. If not, use Quadratic Discriminant Analysis (QDA) or regularised LDA.
×
Memorising syntax before understanding the concept
Symptom
You can code a PCA call but cannot explain when to standardise or why perplexity matters in t-SNE.
Fix
Focus on the geometric intuition: what problem each technique solves. Code is just implementation; the decision logic is what matters in production.
×
Skipping practice and only reading theory
Symptom
You can list the algorithms but fail to debug a real pipeline because you've never seen the common failure modes.
Fix
Run each technique on a real dataset. Intentionally break it — skip scaling, set wrong perplexity — and watch what happens. That's how you internalise the rules.
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01SENIOR
Explain the difference between PCA and t-SNE. When would you use each in...
Q02SENIOR
What is the curse of dimensionality and how does PCA help mitigate it?
Q03SENIOR
How does perplexity affect t-SNE? What happens if you set perplexity too...
Q04SENIOR
UMAP claims to preserve more global structure than t-SNE. Why?
Q01 of 04SENIOR
Explain the difference between PCA and t-SNE. When would you use each in a production ML pipeline?
ANSWER
PCA is a linear, deterministic method that finds orthogonal axes of maximum variance. It is fast, interpretable, and supports transform (you can embed new data after fitting). Use PCA for feature extraction before linear models, for anomaly detection via reconstruction error, and for visualising data that lies on a linear manifold. t-SNE is a non-linear, stochastic method that preserves local neighbourhood structure. It does not support transform (you cannot embed new points after fitting) and it scales poorly (O(n²)). Use t-SNE only for exploratory visualisation of static datasets (<50k points). In a production pipeline, you would use PCA or UMAP (which supports transform) as a pre-processing step, never t-SNE.
Q02 of 04SENIOR
What is the curse of dimensionality and how does PCA help mitigate it?
ANSWER
The curse of dimensionality refers to phenomena that occur when data has many features: distances between points become nearly uniform (making nearest neighbour search useless), the volume of space grows exponentially (requiring exponentially more training samples to cover it), and models overfit because the number of parameters scales with feature count. PCA helps by projecting the data onto a lower-dimensional subspace that captures most of the variance. This reduces the effective dimensionality, reduces overfitting, and makes distance-based algorithms (KNN, clustering) work again. However, PCA assumes linearity, so for high-dimensional data with non-linear structure, you may need autoencoders or UMAP.
Q03 of 04SENIOR
How does perplexity affect t-SNE? What happens if you set perplexity too low or too high?
ANSWER
Perplexity in t-SNE roughly controls the number of neighbours each point considers when computing pairwise affinities. Low perplexity (e.g., 2-5) means each point only considers its immediate neighbours — the embedding focuses on local structure, often creating many small, false clusters even from uniform random noise. High perplexity (e.g., 100+) forces each point to consider many neighbours, which can make the embedding look like PCA (linear, spread out) and lose local structure. The default (30) works for datasets with a few thousand points. Rule of thumb: set perplexity between 5 and 50, and if in doubt, try multiple values and compare. Also note: perplexity must be less than the number of points, otherwise the algorithm fails.
Q04 of 04SENIOR
UMAP claims to preserve more global structure than t-SNE. Why?
ANSWER
UMAP builds a fuzzy simplicial set representation that captures both local neighbourhoods and global topology by constructing a weighted graph with a variable-bandwidth kernel. The optimisation minimises cross-entropy between the high-dimensional and low-dimensional graphs, which naturally balances local and global fidelity. t-SNE, in contrast, uses a fixed perplexity that defines a local neighbourhood for each point and relies on a heavy-tailed t-distribution to handle crowding — but this often discards global structure because points that are far apart in high dimensions get pushed arbitrarily far apart in low dimensions. UMAP's graph-based approach explicitly models the connectivity of the manifold, preserving distances between clusters better.
01
Explain the difference between PCA and t-SNE. When would you use each in a production ML pipeline?
SENIOR
02
What is the curse of dimensionality and how does PCA help mitigate it?
SENIOR
03
How does perplexity affect t-SNE? What happens if you set perplexity too low or too high?
SENIOR
04
UMAP claims to preserve more global structure than t-SNE. Why?
SENIOR
FAQ · 7 QUESTIONS
Frequently Asked Questions
01
What is Dimensionality Reduction Techniques in simple terms?
Dimensionality Reduction Techniques is a fundamental concept in ML / AI. Think of it as a tool — once you understand its purpose, you'll reach for it constantly.
Was this helpful?
02
Can I use t-SNE for feature extraction before clustering?
No. t-SNE is a visualisation technique, not a feature extraction method. It creates coordinates that can introduce false clusters and are not stable across runs. Use PCA, UMAP, or autoencoders for feature extraction before clustering.
Was this helpful?
03
Why does my PCA explain variance drop significantly after reducing dimensions?
That's expected. The amount of explained variance depends on the data. If the first few components explain very little (<30% total), it means the data is not well approximated by a linear subspace — consider using a non-linear method like UMAP or autoencoders.
Was this helpful?
04
UMAP vs t-SNE — which is better for visualising a dataset of 200,000 points?
UMAP. t-SNE's O(n²) memory and compute make it impractical for >50,000 points. UMAP scales to millions of points, preserves more global structure, and produces comparable visual quality. Plus, UMAP can embed new points after training — t-SNE cannot.
Was this helpful?
05
What is the difference between PCA and LDA?
PCA is unsupervised — it maximises variance without using labels. LDA is supervised — it maximises class separation using label information. PCA gives you up to d components, LDA gives you at most C-1 (C = number of classes). PCA works on any data; LDA fails if classes are not Gaussian or have unequal covariance.
Was this helpful?
06
How do I choose the number of components for PCA?
Look at the cumulative explained variance ratio. Pick the smallest k such that the ratio exceeds a threshold (e.g., 0.9 or 0.95). Alternatively, use the elbow method: plot explained variance vs k and pick the point where the curve flattens. For visualisation you often pick 2 or 3.
Was this helpful?
07
What metrics can I use to evaluate embedding quality?
Trustworthiness (how well neighbours in original space remain neighbours in embedding) and normalised stress (discrepancy in pairwise distances). For classification, downstream accuracy. For t-SNE, monitor KL divergence — it should decrease monotonically. For UMAP, the cross-entropy loss should stabilise.