Senior 5 min · March 06, 2026

t-SNE Memory Crash — Why 500K Rows Needs 2TB RAM

t-SNE allocates n²×8 bytes for pairwise distances — 2TB for 500K points.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • Dimensionality reduction maps high-dimensional data (10,000+ features) into 2-50 dimensions while preserving variance (PCA) or neighbourhood structure (t-SNE/UMAP).
  • PCA: linear, deterministic, O(n·d²) time — interpretable components, best for pre-processing before linear models.
  • t-SNE: non-linear, stochastic, O(n²) time — great for visualisation but sensitive to perplexity; can create false clusters.
  • UMAP: non-linear, O(n log n) with approximate neighbours — scales to millions of points, preserves more global structure than t-SNE.
  • LDA: supervised, finds axes that maximise class separation — use only when labels are available and classes are roughly Gaussian.
  • Production insight: PCA can be inverted to detect outliers; t-SNE transforms are not reusable on new data — embed once, discard the model.
Plain-English First

Imagine you're describing every person at a party using 500 different facts — their shoe size, hair length, what they had for breakfast, etc. Most of those facts are redundant or useless for figuring out who's similar to whom. Dimensionality reduction is like a smart assistant that says: 'forget 490 of those facts — these 10 actually capture who people are.' You lose a little detail, but you gain the ability to actually SEE patterns, run models faster, and stop drowning in noise. That's it. That's dimensionality reduction.

High-dimensional data is everywhere in production ML — a user's click history might span 10,000 features, a raw image 50,176 pixels, a genomics dataset half a million SNP markers. Models trained directly on this data suffer from the curse of dimensionality: distances become meaningless, overfitting skyrockets, training slows to a crawl, and visualisation becomes impossible. Dimensionality reduction isn't a pre-processing nicety — it's often the difference between a model that generalises and one that memorises noise.

The core problem these techniques solve is geometric. In high dimensions, every point is roughly equidistant from every other point. That breaks nearest-neighbour search, makes clustering unstable, and bloats the covariance matrix your model has to estimate. By projecting data into a lower-dimensional space that preserves what actually matters — variance, local structure, class separability — you give your downstream algorithms a fighting chance.

By the end of this article you'll understand the internal mechanics of PCA, t-SNE, UMAP, and LDA well enough to choose the right one for a production problem, tune hyperparameters with confidence, avoid the subtle bugs that silently corrupt results, and answer the interview questions that trip up even experienced ML engineers.

What is Dimensionality Reduction Techniques?

High-dimensional data is the norm in production ML, not the exception. Each new feature adds a dimension, and as dimensions grow, the volume of the feature space explodes exponentially. This is the curse of dimensionality: distances between any two points become nearly identical, making nearest-neighbour algorithms useless; the sample density plummets, so you need exponentially more data to cover the space; and models overfit because they can memorise noise in the extra dimensions instead of learning signal.

Dimensionality reduction tackles this by projecting the data into a lower-dimensional subspace that preserves the structure you care about. The four dominant techniques — PCA, t-SNE, UMAP, LDA — each make different trade-offs between linearity, scalability, interpretability, and the type of structure they preserve.

PCA keeps global variance; t-SNE keeps local neighbourhoods; UMAP keeps a fuzzy topology that balances local and global; LDA keeps class separability. Pick the wrong one and you'll either destroy the signal you're trying to capture or get misled by false patterns.

io/thecodeforge/dimensionality/ReductionDemo.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
package io.thecodeforge.dimensionality;

// Quick demonstration of curse of dimensionality
public class ReductionDemo {
    public static void main(String[] args) {
        int n = 1000;   // points
        int dim = 1000; // dimensions
        double[][] data = new double[n][dim];
        // fill with random data
        java.util.Random rng = new java.util.Random(42);
        for (int i = 0; i < n; i++)
            for (int j = 0; j < dim; j++)
                data[i][j] = rng.nextGaussian();

        // compute average distance between random points
        double sumDist = 0.0;
        int pairs = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i+1; j < n; j++) {
                double dist = 0.0;
                for (int k = 0; k < dim; k++)
                    dist += (data[i][k] - data[j][k]) * (data[i][k] - data[j][k]);
                sumDist += Math.sqrt(dist);
                pairs++;
                if (pairs >= 10000) break;
            }
            if (pairs >= 10000) break;
        }
        double avgDist = sumDist / pairs;
        double minDist = Double.MAX_VALUE, maxDist = Double.MIN_VALUE;
        // approximate min/max from same subset
        System.out.println("Average distance in " + dim + "D: " + avgDist);
        System.out.println("In high dimensions, all distances converge — that's the curse.");
    }
}
Output
Average distance in 1000D: ~44.7
In high dimensions, all distances converge — that's the curse.
Why Reduction Matters in Production
A model trained on 10,000 features without reduction will likely overfit unless you have millions of samples. Dimensionality reduction is not optional — it's the first line of defence against the curse.
Production Insight
The curse of dimensionality is not theoretical. In production, high-dimensional data causes KNN to degrade to random guessing, tree-based models to split on noise, and clustering algorithms to produce arbitrary groups.
Always check the effective dimensionality of your data before choosing a reduction method. PCA can give you a quick diagnostic via explained variance ratio.
If you don't reduce, you'll need exponentially more training data to maintain model quality — that's the hidden cost.
Key Takeaway
Dimensionality reduction fights the curse of dimensionality.
Without it, distances break, models overfit, clusters vanish.
Choose the method based on which structure you want to preserve.
Always diagnose with PCA's explained variance first.

PCA — Linear Reduction with Interpretable Components

Principal Component Analysis finds the orthogonal axes that maximise the variance of the projected data. It's a linear transformation — the output dimensions are linear combinations of the original features. This makes PCA the most interpretable method: you can inspect the loadings (eigenvectors) to understand which original features drive each component.

The algorithm centres the data, computes the covariance matrix, then performs eigenvalue decomposition. The top-k eigenvectors become the principal components. In practice, you should always standardise data before PCA — otherwise high-variance features dominate.

PCA is also the fastest method: O(n·d² + d³) for n samples and d features. It's a deterministic closed-form solution — no hyperparameter tuning, no stochasticity. That's why it's the go-to for feature extraction in linear models and for anomaly detection (reconstruction error).

io/thecodeforge/dimensionality/PCA.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
package io.thecodeforge.dimensionality;

import java.util.ArrayList;
import java.util.List;

public class PCA {
    private double[] mean;
    private double[][] components;

    public void fit(double[][] data, int nComponents) {
        int n = data.length;
        int d = data[0].length;
        // centre the data
        mean = new double[d];
        for (int j = 0; j < d; j++) {
            double sum = 0.0;
            for (int i = 0; i < n; i++) sum += data[i][j];
            mean[j] = sum / n;
        }
        double[][] centered = new double[n][d];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < d; j++)
                centered[i][j] = data[i][j] - mean[j];

        // covariance matrix
        double[][] cov = new double[d][d];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < d; j++)
                for (int k = 0; k < d; k++)
                    cov[j][k] += centered[i][j] * centered[i][k];
        for (int j = 0; j < d; j++)
            for (int k = 0; k < d; k++)
                cov[j][k] /= (n - 1);

        // simplified power iteration for top eigenvalues (production: use ARPACK)
        List<double[]> comps = new ArrayList<>();
        double[][] remaining = cov.clone();
        for (int c = 0; c < nComponents; c++) {
            double[] vec = new double[d];
            for (int j = 0; j < d; j++) vec[j] = 1.0 / Math.sqrt(d);
            for (int iter = 0; iter < 100; iter++) {
                double[] newVec = new double[d];
                for (int j = 0; j < d; j++)
                    for (int k = 0; k < d; k++)
                        newVec[j] += remaining[j][k] * vec[k];
                double norm = 0.0;
                for (double v : newVec) norm += v * v;
                norm = Math.sqrt(norm);
                for (int j = 0; j < d; j++) vec[j] = newVec[j] / norm;
            }
            comps.add(vec.clone());
            // deflation
            for (int i = 0; i < d; i++)
                for (int j = 0; j < d; j++)
                    remaining[i][j] -= vec[i] * vec[j];
        }
        components = comps.toArray(new double[0][]);
    }

    public double[][] transform(double[][] data) {
        // projects data onto components
        int n = data.length;
        int k = components.length;
        double[][] result = new double[n][k];
        for (int i = 0; i < n; i++)
            for (int c = 0; c < k; c++)
                for (int j = 0; j < data[0].length; j++)
                    result[i][c] += (data[i][j] - mean[j]) * components[c][j];
        return result;
    }
}
PCA as a Change of Basis
  • The first principal component points where variance is highest.
  • Each component is orthogonal to the previous — no redundancy.
  • The eigenvalues tell you how much variance each component explains.
  • You drop the smallest eigenvalue components — that's the reduction.
Production Insight
Never apply PCA directly to raw image pixels without flattening — you lose spatial structure. Use convolutional autoencoders instead if you need non-linear reduction for images.
PCA is invertible: reconstruction error = ||X - X_approx||. Use this as an anomaly score. Points with high reconstruction error are outliers — but only for linear manifolds.
Batch PCA with mini-batch mean update is not safe for streaming data. Use IncrementalPCA from scikit-learn or implement a covariance update formula.
Key Takeaway
PCA is the fastest, most interpretable reduction method.
Always standardise before PCA.
Use for pre-processing linear models and anomaly detection.
It cannot capture non-linear structure — that's why we have t-SNE and UMAP.
Should You Use PCA?
IfYou need interpretable feature importance (which original features drive the reduction)
UseUse PCA — inspect component loadings in component_matrix_[feature_index].
IfData is sparse (e.g., bag-of-words) and n_features > n_samples
UseUse TruncatedSVD (also called LSA). Standard PCA on dense covariance matrix would be O(d²) in memory — impossibly large.
IfYou plan to run linear models (logistic regression, SVM) after reduction
UsePCA is ideal. The reduced dimensions are uncorrelated, which helps linear models avoid multicollinearity.
IfData has non-linear structure (swiss roll, manifolds)
UseDo not use PCA — it will preserve only global variance, destroying local neighbourhoods. Use t-SNE or UMAP instead.

t-SNE — Capturing Local Neighbourhood Structure

t-Distributed Stochastic Neighbour Embedding (t-SNE) is a non-linear technique that focuses on preserving the pairwise distances between nearby points. It converts high-dimensional Euclidean distances into conditional probabilities of similarity in the original space, then matches those with a similar distribution in the low-dimensional space using a heavy-tailed t-distribution.

The key hyperparameter is perplexity — roughly the number of neighbours considered. Common mistake: using default perplexity=30 for all datasets. If your dataset is small (<500 points), reduce perplexity. If it's large (>10,000), increase it, but be aware of memory limits.

t-SNE is stochastic and each run gives a different embedding. It is not a function you can apply to new data — you must re-run on the full dataset. That makes it unsuitable for production ML pipelines that embed new points. Its only safe production use is exploratory visualisation and interpretation of learned features.

io/thecodeforge/dimensionality/TSNE.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
// io.thecodeforge.dimensionality.TSNE
// Simplified t-SNE implementation for demonstration
// Production: use scikit-learn or umap-learn; this shows the core loop
package io.thecodeforge.dimensionality;

import java.util.Random;

public class TSNE {
    private double[][] embedding;
    private double klDivergence;

    public void fit(double[][] data, int perplexity, int nIter) {
        int n = data.length;
        int d = data[0].length;
        double[][] P = computePairwiseAffinities(data, perplexity);
        // initialise low-dimensional points randomly
        Random rng = new Random(42);
        double[][] Y = new double[n][2];
        for (int i = 0; i < n; i++) {
            Y[i][0] = rng.nextGaussian() * 1e-4;
            Y[i][1] = rng.nextGaussian() * 1e-4;
        }

        double learningRate = 200.0;
        double momentum = 0.5;
        double[][] velocity = new double[n][2];

        for (int iter = 0; iter < nIter; iter++) {
            // compute Q matrix (low-dimensional similarity)
            double[][] Q = new double[n][n];
            double sumQ = 0.0;
            for (int i = 0; i < n; i++) {
                for (int j = i+1; j < n; j++) {
                    double dist = 0.0;
                    for (int dim = 0; dim < 2; dim++)
                        dist += (Y[i][dim] - Y[j][dim]) * (Y[i][dim] - Y[j][dim]);
                    Q[i][j] = 1.0 / (1.0 + dist);
                    Q[j][i] = Q[i][j];
                    sumQ += 2.0 * Q[i][j];
                }
            }
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    if (i != j) Q[i][j] /= sumQ;

            // compute gradient
            double[][] grad = new double[n][2];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    if (i == j) continue;
                    double diff = P[i][j] - Q[i][j];
                    double dist = 0.0;
                    for (int dim = 0; dim < 2; dim++)
                        dist += (Y[i][dim] - Y[j][dim]) * (Y[i][dim] - Y[j][dim]);
                    double factor = diff * (1.0 / (1.0 + dist));
                    for (int dim = 0; dim < 2; dim++)
                        grad[i][dim] += factor * (Y[i][dim] - Y[j][dim]);
                }
                grad[i][0] *= 4.0;
                grad[i][1] *= 4.0;
            }

            // update with momentum
            for (int i = 0; i < n; i++) {
                for (int dim = 0; dim < 2; dim++) {
                    velocity[i][dim] = momentum * velocity[i][dim] - learningRate * grad[i][dim];
                    Y[i][dim] += velocity[i][dim];
                }
            }

            // KL divergence monitoring
            if (iter % 100 == 0) {
                double kl = 0.0;
                for (int i = 0; i < n; i++)
                    for (int j = 0; j < n; j++)
                        if (P[i][j] > 0 && Q[i][j] > 0)
                            kl += P[i][j] * Math.log(P[i][j] / Q[i][j]);
                System.out.println("Iter " + iter + " KL divergence: " + kl);
                this.klDivergence = kl;
            }

            // momentum change after 250 iterations
            if (iter == 250) momentum = 0.8;
        }
        embedding = Y;
    }

    private double[][] computePairwiseAffinities(double[][] data, int perplexity) {
        // simplified: compute Gaussian affinities with fixed variance
        // production: binary search sigma for each point to match perplexity
        int n = data.length;
        double[][] P = new double[n][n];
        double sigma = 1.0; // fixed for brevity
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int j = 0; j < n; j++) {
                if (i == j) continue;
                double dist = 0.0;
                for (int dim = 0; dim < data[0].length; dim++)
                    dist += (data[i][dim] - data[j][dim]) * (data[i][dim] - data[j][dim]);
                P[i][j] = Math.exp(-dist / (2 * sigma * sigma));
                sum += P[i][j];
            }
            for (int j = 0; j < n; j++)
                if (i != j) P[i][j] /= sum;
        }
        // symmetrise and normalise
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                P[i][j] = (P[i][j] + P[j][i]) / (2 * n);
        return P;
    }

    public double[][] getEmbedding() { return embedding; }
    public double getKlDivergence() { return klDivergence; }
}
Production Warning: t-SNE is not a transform
t-SNE does not learn a mapping function. You cannot fit on training data and transform test data. If you need to embed new points, use UMAP (parametric mode) or train an autoencoder. t-SNE is for static visualisation only.
Production Insight
t-SNE is O(n²) in both memory and compute. For n > 100,000, the pairwise affinity matrix alone can exceed RAM on most machines.
Different runs produce different embeddings. Always fix random_state for reproducibility in reports.
False clusters: t-SNE can create artificial clusters even from uniform random noise, especially with low perplexity. Verify cluster validity with external metrics.
Key Takeaway
t-SNE is the gold standard for visualisation but useless for production pipelines.
Never use t-SNE on >50k points — memory explodes.
Always fix random_state and check KL divergence for convergence.
If false clusters appear, increase perplexity or switch to UMAP.
t-SNE or UMAP?
IfDataset size < 10,000 points
UseBoth work, but t-SNE often gives more visually appealing clusters. Use t-SNE with perplexity ~30.
IfDataset size between 10,000 and 100,000 points
Uset-SNE is feasible but slow. Use UMAP with default n_neighbors=15 for faster runs and comparable quality.
IfDataset > 100,000 points
UseAvoid t-SNE. Use UMAP with approximate nearest neighbours (annoy or nmslib).
IfNeed to embed new unseen points after training
UseUMAP has a .transform() method via umap.UMAP() . t-SNE cannot do this — you must re-run on the entire dataset.

UMAP — Scalable Non-Linear Reduction

Uniform Manifold Approximation and Projection (UMAP) is a manifold learning technique that builds a fuzzy topological representation of the data in high dimensions, then optimises a similar representation in low dimensions. It uses nearest neighbours to create a weighted graph, then minimises cross-entropy between the high-dimensional and low-dimensional graphs.

UMAP's key advantages: it scales to millions of points (O(n log n) with approximate neighbours), preserves more global structure than t-SNE, and — crucially — provides a transform method for embedding new data points after training. This makes it the only non-linear method suitable for production pipelines.

Two main hyperparameters: n_neighbors (controls local vs global balance, default 15) and min_dist (controls how tightly points pack together in the embedding, default 0.1). Lower n_neighbors focuses on local structure; higher values capture global topology.

io/thecodeforge/dimensionality/UMAP.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
// io.thecodeforge.dimensionality.UMAP
// Simplified UMAP core — production uses umap-learn (Python)
// This demonstrates the fuzzy simplicial set construction
package io.thecodeforge.dimensionality;

import java.util.*;

public class UMAP {
    private double[][] embedding;
    private double minDist;
    private int nNeighbors;

    public void fit(double[][] data, int nNeighbors, double minDist, int nComponents) {
        this.nNeighbors = nNeighbors;
        this.minDist = minDist;
        int n = data.length;
        // 1. Compute fuzzy simplicial set (simplified: nearest neighbour graph with smooth knn distances)
        double[][] knnDistances = computeKNNDistances(data, nNeighbors);
        double[][] fuzzyGraph = computeFuzzyGraph(knnDistances, nNeighbors);

        // 2. Initialise low-dimensional embedding using spectral layout
        double[][] Y = spectralInit(fuzzyGraph, nComponents);

        // 3. Optimise with force-directed graph layout
        optimizeEmbedding(fuzzyGraph, Y, nIterations: 500);
        this.embedding = Y;
    }

    private double[][] computeKNNDistances(double[][] data, int k) {
        // Brute force for demo; production uses approximate nearest neighbours (HNSW)
        int n = data.length;
        double[][] distances = new double[n][k];
        for (int i = 0; i < n; i++) {
            double[] dists = new double[n];
            for (int j = 0; j < n; j++) {
                if (i == j) dists[j] = Double.MAX_VALUE;
                else {
                    double sum = 0.0;
                    for (int dim = 0; dim < data[0].length; dim++)
                        sum += (data[i][dim] - data[j][dim]) * (data[i][dim] - data[j][dim]);
                    dists[j] = Math.sqrt(sum);
                }
            }
            // partial sort to get top k
            Arrays.sort(dists);
            for (int j = 0; j < k; j++) distances[i][j] = dists[j];
        }
        return distances;
    }

    private double[][] computeFuzzyGraph(double[][] knnDistances, int k) {
        int n = knnDistances.length;
        double[][] graph = new double[n][n];
        double sigma = 1.0; // simplified; UMAP computes local connectivity radius
        for (int i = 0; i < n; i++) {
            double rho = knnDistances[i][0]; // distance to nearest neighbour
            for (int jIndex = 0; jIndex < k; jIndex++) {
                double d = knnDistances[i][jIndex];
                double weight = Math.exp(-(d - rho) / sigma);
                int j = jIndex; // simplified: need actual j index, not just distance
                graph[i][j] = weight;
            }
        }
        // symmetrise: graph = graph + graph^T - graph * graph^T
        double[][] sym = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                sym[i][j] = graph[i][j] + graph[j][i] - graph[i][j] * graph[j][i];
        return sym;
    }

    private double[][] spectralInit(double[][] graph, int nComponents) {
        // stub; production uses eigendecomposition of graph Laplacian
        return new double[graph.length][nComponents];
    }

    private void optimizeEmbedding(double[][] graph, double[][] Y, int nIterations) {
        // stub; UMAP uses stochastic gradient descent with negative sampling
    }

    public double[][] transform(double[][] newData) {
        // production implementation: search nearest neighbours in original data,
        // initialise positions, then run a small optimisation loop
        return new double[newData.length][embedding[0].length];
    }
}
UMAP's Two-Step Process
  • Step 1: Construct a weighted nearest-neighbour graph with fuzzy set membership.
  • Step 2: Optimise a low-dimensional graph to minimise cross-entropy between the two graphs.
  • The graph captures the topological structure — local AND global.
  • The transform method embeds new points using the same graph (parametric extension).
Production Insight
UMAP with high n_neighbors (>200) becomes increasingly global and starts behaving like PCA on the graph Laplacian.
UMAP's transform is not perfect; it uses a nearest neighbour search in the original data to initialise the new point. If original data is high-dimensional and sparse, the search is noisy.
For reproducibility, always set random_state and consider using n_jobs=1 to avoid non-determinism from parallel approximate neighbours.
Key Takeaway
UMAP is the production-grade alternative to t-SNE.
It scales to millions of points and supports transform.
Use n_neighbors to control local vs global balance.
Always set n_jobs=1 and random_state for deterministic pipelines.
When UMAP Beats Other Methods
IfYou need to embed new data points (transform) after training
UseOnly UMAP and autoencoders support transform; PCA also does but is linear.
IfDataset > 100k points
UseUMAP is the only practical non-linear option (O(n log n)).
IfYou care about preserving global structure (e.g., distances between clusters)
UseUMAP with n_neighbors=30-100 preserves more global structure than t-SNE.

LDA — Supervised Dimensionality Reduction

Linear Discriminant Analysis is a supervised technique. Unlike PCA which finds directions of maximum variance, LDA finds the linear axes that maximise class separability. It does this by maximising the ratio of between-class variance to within-class variance.

LDA assumes: (1) the data is normally distributed per class, (2) classes have identical covariance matrices, and (3) the number of features is less than the number of samples (otherwise the within-class scatter matrix is singular). It's limited to C-1 components where C is the number of classes, so for binary classification you get only 1 dimension.

When these assumptions hold, LDA often outperforms PCA for classification tasks because it's directly optimising for class separation. It's commonly used in face recognition (Fisherfaces) and as a pre-processing step for algorithms that benefit from class-discriminative features.

io/thecodeforge/dimensionality/LDA.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
package io.thecodeforge.dimensionality;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LDA {
    private double[][] components;
    private double[] mean;

    public void fit(double[][] data, int[] labels, int nComponents) {
        int n = data.length;
        int d = data[0].length;
        // compute overall mean
        mean = new double[d];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < d; j++)
                mean[j] += data[i][j];
        for (int j = 0; j < d; j++) mean[j] /= n;

        // group data by class
        Map<Integer, List<double[]>> groups = new HashMap<>();
        for (int i = 0; i < n; i++) {
            groups.computeIfAbsent(labels[i], k -> new ArrayList<>()).add(data[i]);
        }
        int numClasses = groups.size();
        if (nComponents > numClasses - 1) {
            throw new IllegalArgumentException("LDA max components = " + (numClasses - 1));
        }

        // within-class scatter Sw
        double[][] Sw = new double[d][d];
        // between-class scatter Sb
        double[][] Sb = new double[d][d];

        for (Map.Entry<Integer, List<double[]>> entry : groups.entrySet()) {
            List<double[]> classData = entry.getValue();
            double[] classMean = new double[d];
            for (double[] row : classData)
                for (int j = 0; j < d; j++)
                    classMean[j] += row[j];
            for (int j = 0; j < d; j++) classMean[j] /= classData.size();

            // within-class scatter
            for (double[] row : classData) {
                double[] diff = new double[d];
                for (int j = 0; j < d; j++) diff[j] = row[j] - classMean[j];
                for (int j = 0; j < d; j++)
                    for (int k = 0; k < d; k++)
                        Sw[j][k] += diff[j] * diff[k];
            }

            // between-class scatter
            double[] diff = new double[d];
            for (int j = 0; j < d; j++) diff[j] = classMean[j] - mean[j];
            double scale = classData.size();
            for (int j = 0; j < d; j++)
                for (int k = 0; k < d; k++)
                    Sb[j][k] += scale * diff[j] * diff[k];
        }

        // Solve generalised eigenvalue problem: Sb * v = lambda * Sw * v
        // Simplified: assume Sw invertible, compute Sw^{-1} * Sb
        double[][] SwInv = invertMatrix(Sw);
        double[][] mat = multiply(SwInv, Sb);

        // Power iteration for top nComponents eigenvectors
        List<double[]> comps = new ArrayList<>();
        double[][] remaining = mat;
        for (int c = 0; c < nComponents; c++) {
            double[] vec = new double[d];
            for (int j = 0; j < d; j++) vec[j] = 1.0 / Math.sqrt(d);
            for (int iter = 0; iter < 100; iter++) {
                double[] newVec = new double[d];
                for (int j = 0; j < d; j++)
                    for (int k = 0; k < d; k++)
                        newVec[j] += remaining[j][k] * vec[k];
                double norm = 0.0;
                for (double v : newVec) norm += v * v;
                norm = Math.sqrt(norm);
                for (int j = 0; j < d; j++) vec[j] = newVec[j] / norm;
            }
            comps.add(vec.clone());
            // deflation
            for (int i = 0; i < d; i++)
                for (int j = 0; j < d; j++)
                    remaining[i][j] -= vec[i] * vec[j];
        }
        components = comps.toArray(new double[0][]);
    }

    public double[][] transform(double[][] data) {
        int n = data.length;
        int c = components.length;
        double[][] result = new double[n][c];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < c; j++)
                for (int k = 0; k < data[0].length; k++)
                    result[i][j] += (data[i][k] - mean[k]) * components[j][k];
        return result;
    }

    private double[][] multiply(double[][] a, double[][] b) { /* stub */ return new double[a.length][b[0].length]; }
    private double[][] invertMatrix(double[][] m) { /* stub: use Apache Commons Math */ return new double[m.length][m[0].length]; }
}
LDA vs PCA for Classification
If your goal is classification and you have labelled data, LDA often beats PCA because it directly maximises class separation. However, LDA breaks when classes are imbalanced or have non-Gaussian distributions. In those cases, PCA + a good classifier may be more robust.
Production Insight
LDA assumes equal covariance across classes. If your classes have different spreads (e.g., one class has much higher variance), LDA will find suboptimal axes. Use Quadratic Discriminant Analysis (QDA) instead.
When n_features > n_samples, the within-class scatter matrix is singular. Solve this with regularised LDA (RLDA) or PCA pre-screening.
LDA is vulnerable to outliers because it uses class means and covariance matrices. Clean your labels before fitting.
Key Takeaway
LDA is supervised — use it when you care about class separation.
Assumes Gaussian classes with equal covariance — verify before trusting.
Limited to C-1 components.
Regularise if features > samples.
When to Use LDA?
IfYou have labelled data and want to reduce dimensions for classification
UseUse LDA if classes are balanced and roughly Gaussian. Max components = C-1.
IfNo labels available
UseLDA cannot be used. Use PCA or UMAP instead.
IfMore features than samples
UseLDA within-class scatter is singular. Apply PCA first to reduce to < n_samples, then LDA.

Choosing the Right Technique — Decision Tree and Trade-offs

You've now seen four techniques. The right choice depends on your goal: interpretability, visualisation, scalability, or classification accuracy.

Here's a summary rule of thumb
  • Need interpretable, fast, linear? → PCA
  • Need to visualise small datasets (<10k) with pretty clusters? → t-SNE
  • Need to visualise large datasets or embed new points? → UMAP
  • Have labels, want to separate classes? → LDA

But beyond the initial choice, monitor the embedding quality: for t-SNE/UMAP, check the KL divergence or cross-entropy loss. For PCA/LDA, check the explained variance ratio or classification accuracy on a hold-out set.

A common pitfall: applying dimensionality reduction once and never re-validating. Data distributions shift. The embedding that worked last quarter may be useless today. Treat your reduction pipeline as a trainable component with regular monitoring.

io/thecodeforge/dimensionality/ReductionMonitor.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
package io.thecodeforge.dimensionality;

// Monitoring class for embedding quality in production
import java.util.*;

public class ReductionMonitor {

    public static double trustworthiness(double[][] original, double[][] embedding, int k) {
        // Measures how often k-nearest neighbours in original are still neighbours in embedding
        int n = original.length;
        int[] origNeighbors = getKNN(original, k);
        int[] embNeighbors = getKNN(embedding, k);
        int common = 0;
        for (int i = 0; i < n; i++) {
            if (origNeighbors[i] == embNeighbors[i]) common++;
        }
        return (double) common / n;
    }

    private static int[] getKNN(double[][] data, int k) {
        // Stub — production uses a KD-Tree or ball tree for O(n log n)
        return new int[data.length];
    }

    public static double stress(double[][] highD, double[][] lowD) {
        // Normalised stress: sum of squared differences in pairwise distances
        double sumDiff = 0.0, sumOrig = 0.0;
        int n = highD.length;
        for (int i = 0; i < n; i++) {
            for (int j = i+1; j < n; j++) {
                double dHigh = euclidean(highD[i], highD[j]);
                double dLow = euclidean(lowD[i], lowD[j]);
                sumDiff += (dHigh - dLow) * (dHigh - dLow);
                sumOrig += dHigh * dHigh;
            }
        }
        return Math.sqrt(sumDiff / sumOrig);
    }

    private static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }
}
Silent Model Degradation
Dimensionality reduction is a data-dependent transformation. When your input data distribution shifts (new users, changed behaviour), the embedding quality can silently collapse. Always monitor trustworthiness and stress metrics, and retrigger fitting on a scheduled basis or when data drift is detected.
Production Insight
Embeddings from t-SNE/UMAP are not stable across retraining. If you retrain with new data, the entire embedding shifts — you cannot compare positions from two different runs.
For production pipelines that require stable low-dimensional representations (e.g., feature storage for retrieval), use PCA or parametric UMAP with a fixed random seed.
The cost of fitting: PCA is O(n·d²), t-SNE is O(n²), UMAP is O(n log n). For real-time systems, only PCA and parametric UMAP can be applied at inference time.
Key Takeaway
Match the technique to your goal: PCA for speed/interpretability, UMAP for scale/transform, t-SNE for visualisation only, LDA for class separation.
Monitor embedding quality with trustworthiness and stress.
Never blindly use default hyperparameters — tune perplexity, n_neighbors, and min_dist.
Plan for data drift — schedule periodic refits.
Final Decision Tree
IfGoal: interpretability (which original features matter)
UsePCA — inspect component loadings.
IfGoal: visual exploration of clusters
UseUMAP for >50k points, t-SNE for <10k.
IfGoal: pre-processing for a supervised model with labels
UseLDA if assumptions hold; otherwise PCA + any classifier.
IfGoal: anomaly detection
UsePCA reconstruction error or autoencoder loss.
IfNeed to embed new points in production
UsePCA (linear) or UMAP (non-linear) — avoid t-SNE.
● Production incidentPOST-MORTEMseverity: high

t-SNE OOM crash on a 500k row customer dataset

Symptom
t-SNE training fails with OutOfMemoryError on datasets larger than ~50,000 points. The machine has 64 GB RAM but the process allocates >100 GB.
Assumption
That t-SNE scales to any dataset size because scikit-learn handles it. Wrong — default implementation computes a full distance matrix of size n².
Root cause
t-SNE computes a dense n×n matrix of pairwise distances (float64) → memory = n² × 8 bytes. For n=500,000 that's 2,000 GB. Even with optimisations, the gradient computation requires O(n²) memory.
Fix
Switch to UMAP which uses nearest neighbour approximation (annoy / HNSW) and O(n log n) memory. Alternatively, subsample to 10,000 points for t-SNE visualisation.
Key lesson
  • Always estimate memory before running t-SNE on large datasets: n² × 8 bytes.
  • Use UMAP when you need to visualise more than 50,000 points.
  • Never use t-SNE for dimensionality reduction as a pre-processing step for downstream models — it's non-parametric and cannot embed new points.
Production debug guideSymptom → Root cause → Action4 entries
Symptom · 01
PCA components explain <10% of total variance in first 2 dimensions
Fix
Check if data is standardised (mean-centre + unit variance). PCA on unscaled data overweights high-variance features. Use StandardScaler before PCA.
Symptom · 02
t-SNE plot shows distinct clusters that don't align with labels and disappear on rerun
Fix
Reduce perplexity (try 5-30) or increase learning rate. False clusters appear when perplexity is too low relative to data size. Rerun with fixed random_state to verify stability.
Symptom · 03
UMAP visualisation looks like a uniform square blob with no structure
Fix
Increase n_neighbors (default 15). Too low a value makes UMAP focus only on local structure, losing global separation. Also check that data is scaled and that n_components is not too high for the visualised dimension.
Symptom · 04
LDA scored <60% accuracy on held-out test set after reduction
Fix
Check whether classes are balanced and covariance matrices are roughly equal. LDA assumes Gaussian class-conditional distributions. Try QDA (Quadratic Discriminant Analysis) if covariance differs per class, or use regularised LDA.
★ Quick Debug Cheat Sheet — Dimensionality ReductionCommands and actions for the three most common production failures.
PCA memory error on large matrix
Immediate action
Kill the job, check memory usage. Reduce batch size or use IncrementalPCA.
Commands
python -c "import numpy as np; n=500000; d=1000; print('Memory (GB):', (n*d*8)/1e9)"
from sklearn.decomposition import IncrementalPCA; ipca = IncrementalPCA(n_components=50, batch_size=1000)
Fix now
Switch to IncrementalPCA or use TruncatedSVD on sparse matrix if data is sparse (e.g., TF-IDF).
t-SNE training never converges+
Immediate action
Check iteration count and gradient norm. Increase early_exaggeration to spread points faster.
Commands
tsne = TSNE(perplexity=30, n_iter=1000, early_exaggeration=12)
monitor KL divergence: tsne.kl_divergence_ after each call — should decrease monotonically
Fix now
If KL divergence plateaus above 0.5, reduce perplexity or increase learning rate. Still stuck? Use PCA initialisation (init='pca').
UMAP embedding flips sign between runs+
Immediate action
Set random_state. UMAP is stochastic by default; without a fixed seed each run gives a different rotation.
Commands
umap.UMAP(random_state=42).fit_transform(data)
Check if the loss changes: umap_model._fit() logs negative sampling loss — should be <0.1 after full fit.
Fix now
Always set random_state for reproducibility in production pipelines. For full deterministic mode, use n_jobs=1.
Dimensionality Reduction Techniques Comparison
TechniqueTypeComplexityScalabilityInterpretabilitySupports transform new dataBest Use Case
PCALinear, unsupervisedO(n·d² + d³)High (up to 1M points)High — inspect loadingsYesPre-processing, anomaly detection, interpretable features
t-SNENon-linear, unsupervisedO(n²)Low (<50k points)Low — no loadingsNo2D/3D visualisation for small datasets
UMAPNon-linear, unsupervisedO(n log n)High (up to 10M points)Medium — via graph inspectionYes (approximate)Visualisation, feature engineering, production pipelines
LDALinear, supervisedO(n·d² + d³)Moderate (depends on n_features vs classes)High — see component weights per discriminantYesClassification pre-processing when labels available

Key takeaways

1
The curse of dimensionality makes distances meaningless
always consider reduction for high-dimensional data.
2
PCA for interpretable features and linear pre-processing; UMAP for scalable non-linear embedding; t-SNE for visualisation only.
3
Always standardise before PCA/LDA; tune perplexity for t-SNE; set random_state for reproducibility in all stochastic methods.
4
Monitor embedding quality with trustworthiness and stress metrics
silent degradation happens when data drifts.
5
Never use t-SNE in a production pipeline
it cannot transform new data. Use UMAP or PCA for that.
6
LDA is powerful when classes are Gaussian and covariance is equal
otherwise regularise or switch to PCA.

Common mistakes to avoid

6 patterns
×

Using t-SNE for feature extraction before clustering

Symptom
Clusters appear in t-SNE plot but K-Means on original data produces completely different groups — because t-SNE can introduce false clusters.
Fix
Use UMAP or PCA for pre-processing before clustering. t-SNE should only be used for visualisation, not as a feature extraction step.
×

Not scaling data before PCA

Symptom
First few principal components are dominated by features with large magnitudes (e.g., salary in dollars vs age in years). Reduced dimensions explain little meaningful variance.
Fix
Always standardise data to zero mean and unit variance (StandardScaler) before applying PCA.
×

Setting n_neighbors too low in UMAP

Symptom
UMAP embedding shows many tiny disconnected clusters, losing global structure. The visualisation looks like a scatter of isolated points.
Fix
Increase n_neighbors to at least 15. For global structure preservation, try values between 30 and 100.
×

Using LDA when within-class covariance matrices are not equal

Symptom
LDA components do not separate classes well; classification accuracy after reduction is worse than using raw features.
Fix
Check if classes have similar covariance matrices. If not, use Quadratic Discriminant Analysis (QDA) or regularised LDA.
×

Memorising syntax before understanding the concept

Symptom
You can code a PCA call but cannot explain when to standardise or why perplexity matters in t-SNE.
Fix
Focus on the geometric intuition: what problem each technique solves. Code is just implementation; the decision logic is what matters in production.
×

Skipping practice and only reading theory

Symptom
You can list the algorithms but fail to debug a real pipeline because you've never seen the common failure modes.
Fix
Run each technique on a real dataset. Intentionally break it — skip scaling, set wrong perplexity — and watch what happens. That's how you internalise the rules.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
Explain the difference between PCA and t-SNE. When would you use each in...
Q02SENIOR
What is the curse of dimensionality and how does PCA help mitigate it?
Q03SENIOR
How does perplexity affect t-SNE? What happens if you set perplexity too...
Q04SENIOR
UMAP claims to preserve more global structure than t-SNE. Why?
Q01 of 04SENIOR

Explain the difference between PCA and t-SNE. When would you use each in a production ML pipeline?

ANSWER
PCA is a linear, deterministic method that finds orthogonal axes of maximum variance. It is fast, interpretable, and supports transform (you can embed new data after fitting). Use PCA for feature extraction before linear models, for anomaly detection via reconstruction error, and for visualising data that lies on a linear manifold. t-SNE is a non-linear, stochastic method that preserves local neighbourhood structure. It does not support transform (you cannot embed new points after fitting) and it scales poorly (O(n²)). Use t-SNE only for exploratory visualisation of static datasets (<50k points). In a production pipeline, you would use PCA or UMAP (which supports transform) as a pre-processing step, never t-SNE.
FAQ · 7 QUESTIONS

Frequently Asked Questions

01
What is Dimensionality Reduction Techniques in simple terms?
02
Can I use t-SNE for feature extraction before clustering?
03
Why does my PCA explain variance drop significantly after reducing dimensions?
04
UMAP vs t-SNE — which is better for visualising a dataset of 200,000 points?
05
What is the difference between PCA and LDA?
06
How do I choose the number of components for PCA?
07
What metrics can I use to evaluate embedding quality?
🔥

That's Algorithms. Mark it forged?

5 min read · try the examples if you haven't

Previous
DBSCAN Clustering
12 / 14 · Algorithms
Next
Ensemble Methods in ML