Senior 11 min · March 06, 2026
Dimensionality Reduction Techniques

t-SNE Memory Crash — Why 500K Rows Needs 2TB RAM

t-SNE allocates n²×8 bytes for pairwise distances — 2TB for 500K points.

N
Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Written from production experience, not tutorials.

Follow
Production
production tested
May 23, 2026
last updated
1,554
articles · all by Naren
 ● Production Incident 🔎 Debug Guide ⚙ Triage Commands
Quick Answer
  • Dimensionality reduction maps high-dimensional data (10,000+ features) into 2-50 dimensions while preserving variance (PCA) or neighbourhood structure (t-SNE/UMAP).
  • PCA: linear, deterministic, O(n·d²) time — interpretable components, best for pre-processing before linear models.
  • t-SNE: non-linear, stochastic, O(n²) time — great for visualisation but sensitive to perplexity; can create false clusters.
  • UMAP: non-linear, O(n log n) with approximate neighbours — scales to millions of points, preserves more global structure than t-SNE.
  • LDA: supervised, finds axes that maximise class separation — use only when labels are available and classes are roughly Gaussian.
  • Production insight: PCA can be inverted to detect outliers; t-SNE transforms are not reusable on new data — embed once, discard the model.
✦ Definition~90s read
What is Dimensionality Reduction Techniques?

High-dimensional data is the norm in production ML, not the exception. Each new feature adds a dimension, and as dimensions grow, the volume of the feature space explodes exponentially. This is the curse of dimensionality: distances between any two points become nearly identical, making nearest-neighbour algorithms useless; the sample density plummets, so you need exponentially more data to cover the space; and models overfit because they can memorise noise in the extra dimensions instead of learning signal.

Imagine you're describing every person at a party using 500 different facts — their shoe size, hair length, what they had for breakfast, etc.

Dimensionality reduction tackles this by projecting the data into a lower-dimensional subspace that preserves the structure you care about. The four dominant techniques — PCA, t-SNE, UMAP, LDA — each make different trade-offs between linearity, scalability, interpretability, and the type of structure they preserve.

PCA keeps global variance; t-SNE keeps local neighbourhoods; UMAP keeps a fuzzy topology that balances local and global; LDA keeps class separability. Pick the wrong one and you'll either destroy the signal you're trying to capture or get misled by false patterns.

Plain-English First

Imagine you're describing every person at a party using 500 different facts — their shoe size, hair length, what they had for breakfast, etc. Most of those facts are redundant or useless for figuring out who's similar to whom. Dimensionality reduction is like a smart assistant that says: 'forget 490 of those facts — these 10 actually capture who people are.' You lose a little detail, but you gain the ability to actually SEE patterns, run models faster, and stop drowning in noise. That's it. That's dimensionality reduction.

High-dimensional data is everywhere in production ML — a user's click history might span 10,000 features, a raw image 50,176 pixels, a genomics dataset half a million SNP markers. Models trained directly on this data suffer from the curse of dimensionality: distances become meaningless, overfitting skyrockets, training slows to a crawl, and visualisation becomes impossible. Dimensionality reduction isn't a pre-processing nicety — it's often the difference between a model that generalises and one that memorises noise.

The core problem these techniques solve is geometric. In high dimensions, every point is roughly equidistant from every other point. That breaks nearest-neighbour search, makes clustering unstable, and bloats the covariance matrix your model has to estimate. By projecting data into a lower-dimensional space that preserves what actually matters — variance, local structure, class separability — you give your downstream algorithms a fighting chance.

By the end of this article you'll understand the internal mechanics of PCA, t-SNE, UMAP, and LDA well enough to choose the right one for a production problem, tune hyperparameters with confidence, avoid the subtle bugs that silently corrupt results, and answer the interview questions that trip up even experienced ML engineers.

What is Dimensionality Reduction Techniques?

High-dimensional data is the norm in production ML, not the exception. Each new feature adds a dimension, and as dimensions grow, the volume of the feature space explodes exponentially. This is the curse of dimensionality: distances between any two points become nearly identical, making nearest-neighbour algorithms useless; the sample density plummets, so you need exponentially more data to cover the space; and models overfit because they can memorise noise in the extra dimensions instead of learning signal.

Dimensionality reduction tackles this by projecting the data into a lower-dimensional subspace that preserves the structure you care about. The four dominant techniques — PCA, t-SNE, UMAP, LDA — each make different trade-offs between linearity, scalability, interpretability, and the type of structure they preserve.

PCA keeps global variance; t-SNE keeps local neighbourhoods; UMAP keeps a fuzzy topology that balances local and global; LDA keeps class separability. Pick the wrong one and you'll either destroy the signal you're trying to capture or get misled by false patterns.

io/thecodeforge/dimensionality/ReductionDemo.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
package io.thecodeforge.dimensionality;

// Quick demonstration of curse of dimensionality
public class ReductionDemo {
    public static void main(String[] args) {
        int n = 1000;   // points
        int dim = 1000; // dimensions
        double[][] data = new double[n][dim];
        // fill with random data
        java.util.Random rng = new java.util.Random(42);
        for (int i = 0; i < n; i++)
            for (int j = 0; j < dim; j++)
                data[i][j] = rng.nextGaussian();

        // compute average distance between random points
        double sumDist = 0.0;
        int pairs = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i+1; j < n; j++) {
                double dist = 0.0;
                for (int k = 0; k < dim; k++)
                    dist += (data[i][k] - data[j][k]) * (data[i][k] - data[j][k]);
                sumDist += Math.sqrt(dist);
                pairs++;
                if (pairs >= 10000) break;
            }
            if (pairs >= 10000) break;
        }
        double avgDist = sumDist / pairs;
        double minDist = Double.MAX_VALUE, maxDist = Double.MIN_VALUE;
        // approximate min/max from same subset
        System.out.println("Average distance in " + dim + "D: " + avgDist);
        System.out.println("In high dimensions, all distances converge — that's the curse.");
    }
}
Output
Average distance in 1000D: ~44.7
In high dimensions, all distances converge — that's the curse.
Why Reduction Matters in Production
A model trained on 10,000 features without reduction will likely overfit unless you have millions of samples. Dimensionality reduction is not optional — it's the first line of defence against the curse.
Production Insight
The curse of dimensionality is not theoretical. In production, high-dimensional data causes KNN to degrade to random guessing, tree-based models to split on noise, and clustering algorithms to produce arbitrary groups.
Always check the effective dimensionality of your data before choosing a reduction method. PCA can give you a quick diagnostic via explained variance ratio.
If you don't reduce, you'll need exponentially more training data to maintain model quality — that's the hidden cost.
Key Takeaway
Dimensionality reduction fights the curse of dimensionality.
Without it, distances break, models overfit, clusters vanish.
Choose the method based on which structure you want to preserve.
Always diagnose with PCA's explained variance first.
t-SNE Memory Crash: 500K Rows Need 2TB RAM THECODEFORGE.IO t-SNE Memory Crash: 500K Rows Need 2TB RAM Comparison of dimensionality reduction techniques and their scalability PCA: Linear Reduction Interpretable components, fast, limited to linear structure t-SNE: Local Neighbors Captures clusters, O(n²) memory, 500K rows → 2TB RAM UMAP: Scalable Non-Linear Handles large data, preserves global structure LDA: Supervised Maximizes class separability, requires labels Feature Selection Remove noise, keep relevant features Feature Extraction Reduce dimensions via transformation ⚠ t-SNE memory blowup: O(n²) distance matrix Use UMAP or PCA for datasets > 100K rows THECODEFORGE.IO
thecodeforge.io
t-SNE Memory Crash: 500K Rows Need 2TB RAM
Dimensionality Reduction

PCA — Linear Reduction with Interpretable Components

Principal Component Analysis finds the orthogonal axes that maximise the variance of the projected data. It's a linear transformation — the output dimensions are linear combinations of the original features. This makes PCA the most interpretable method: you can inspect the loadings (eigenvectors) to understand which original features drive each component.

The algorithm centres the data, computes the covariance matrix, then performs eigenvalue decomposition. The top-k eigenvectors become the principal components. In practice, you should always standardise data before PCA — otherwise high-variance features dominate.

PCA is also the fastest method: O(n·d² + d³) for n samples and d features. It's a deterministic closed-form solution — no hyperparameter tuning, no stochasticity. That's why it's the go-to for feature extraction in linear models and for anomaly detection (reconstruction error).

io/thecodeforge/dimensionality/PCA.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
package io.thecodeforge.dimensionality;

import java.util.ArrayList;
import java.util.List;

public class PCA {
    private double[] mean;
    private double[][] components;

    public void fit(double[][] data, int nComponents) {
        int n = data.length;
        int d = data[0].length;
        // centre the data
        mean = new double[d];
        for (int j = 0; j < d; j++) {
            double sum = 0.0;
            for (int i = 0; i < n; i++) sum += data[i][j];
            mean[j] = sum / n;
        }
        double[][] centered = new double[n][d];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < d; j++)
                centered[i][j] = data[i][j] - mean[j];

        // covariance matrix
        double[][] cov = new double[d][d];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < d; j++)
                for (int k = 0; k < d; k++)
                    cov[j][k] += centered[i][j] * centered[i][k];
        for (int j = 0; j < d; j++)
            for (int k = 0; k < d; k++)
                cov[j][k] /= (n - 1);

        // simplified power iteration for top eigenvalues (production: use ARPACK)
        List<double[]> comps = new ArrayList<>();
        double[][] remaining = cov.clone();
        for (int c = 0; c < nComponents; c++) {
            double[] vec = new double[d];
            for (int j = 0; j < d; j++) vec[j] = 1.0 / Math.sqrt(d);
            for (int iter = 0; iter < 100; iter++) {
                double[] newVec = new double[d];
                for (int j = 0; j < d; j++)
                    for (int k = 0; k < d; k++)
                        newVec[j] += remaining[j][k] * vec[k];
                double norm = 0.0;
                for (double v : newVec) norm += v * v;
                norm = Math.sqrt(norm);
                for (int j = 0; j < d; j++) vec[j] = newVec[j] / norm;
            }
            comps.add(vec.clone());
            // deflation
            for (int i = 0; i < d; i++)
                for (int j = 0; j < d; j++)
                    remaining[i][j] -= vec[i] * vec[j];
        }
        components = comps.toArray(new double[0][]);
    }

    public double[][] transform(double[][] data) {
        // projects data onto components
        int n = data.length;
        int k = components.length;
        double[][] result = new double[n][k];
        for (int i = 0; i < n; i++)
            for (int c = 0; c < k; c++)
                for (int j = 0; j < data[0].length; j++)
                    result[i][c] += (data[i][j] - mean[j]) * components[c][j];
        return result;
    }
}
PCA as a Change of Basis
  • The first principal component points where variance is highest.
  • Each component is orthogonal to the previous — no redundancy.
  • The eigenvalues tell you how much variance each component explains.
  • You drop the smallest eigenvalue components — that's the reduction.
Production Insight
Never apply PCA directly to raw image pixels without flattening — you lose spatial structure. Use convolutional autoencoders instead if you need non-linear reduction for images.
PCA is invertible: reconstruction error = ||X - X_approx||. Use this as an anomaly score. Points with high reconstruction error are outliers — but only for linear manifolds.
Batch PCA with mini-batch mean update is not safe for streaming data. Use IncrementalPCA from scikit-learn or implement a covariance update formula.
Key Takeaway
PCA is the fastest, most interpretable reduction method.
Always standardise before PCA.
Use for pre-processing linear models and anomaly detection.
It cannot capture non-linear structure — that's why we have t-SNE and UMAP.
Should You Use PCA?
IfYou need interpretable feature importance (which original features drive the reduction)
UseUse PCA — inspect component loadings in component_matrix_[feature_index].
IfData is sparse (e.g., bag-of-words) and n_features > n_samples
UseUse TruncatedSVD (also called LSA). Standard PCA on dense covariance matrix would be O(d²) in memory — impossibly large.
IfYou plan to run linear models (logistic regression, SVM) after reduction
UsePCA is ideal. The reduced dimensions are uncorrelated, which helps linear models avoid multicollinearity.
IfData has non-linear structure (swiss roll, manifolds)
UseDo not use PCA — it will preserve only global variance, destroying local neighbourhoods. Use t-SNE or UMAP instead.

t-SNE — Capturing Local Neighbourhood Structure

t-Distributed Stochastic Neighbour Embedding (t-SNE) is a non-linear technique that focuses on preserving the pairwise distances between nearby points. It converts high-dimensional Euclidean distances into conditional probabilities of similarity in the original space, then matches those with a similar distribution in the low-dimensional space using a heavy-tailed t-distribution.

The key hyperparameter is perplexity — roughly the number of neighbours considered. Common mistake: using default perplexity=30 for all datasets. If your dataset is small (<500 points), reduce perplexity. If it's large (>10,000), increase it, but be aware of memory limits.

t-SNE is stochastic and each run gives a different embedding. It is not a function you can apply to new data — you must re-run on the full dataset. That makes it unsuitable for production ML pipelines that embed new points. Its only safe production use is exploratory visualisation and interpretation of learned features.

io/thecodeforge/dimensionality/TSNE.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
// io.thecodeforge.dimensionality.TSNE
// Simplified t-SNE implementation for demonstration
// Production: use scikit-learn or umap-learn; this shows the core loop
package io.thecodeforge.dimensionality;

import java.util.Random;

public class TSNE {
    private double[][] embedding;
    private double klDivergence;

    public void fit(double[][] data, int perplexity, int nIter) {
        int n = data.length;
        int d = data[0].length;
        double[][] P = computePairwiseAffinities(data, perplexity);
        // initialise low-dimensional points randomly
        Random rng = new Random(42);
        double[][] Y = new double[n][2];
        for (int i = 0; i < n; i++) {
            Y[i][0] = rng.nextGaussian() * 1e-4;
            Y[i][1] = rng.nextGaussian() * 1e-4;
        }

        double learningRate = 200.0;
        double momentum = 0.5;
        double[][] velocity = new double[n][2];

        for (int iter = 0; iter < nIter; iter++) {
            // compute Q matrix (low-dimensional similarity)
            double[][] Q = new double[n][n];
            double sumQ = 0.0;
            for (int i = 0; i < n; i++) {
                for (int j = i+1; j < n; j++) {
                    double dist = 0.0;
                    for (int dim = 0; dim < 2; dim++)
                        dist += (Y[i][dim] - Y[j][dim]) * (Y[i][dim] - Y[j][dim]);
                    Q[i][j] = 1.0 / (1.0 + dist);
                    Q[j][i] = Q[i][j];
                    sumQ += 2.0 * Q[i][j];
                }
            }
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    if (i != j) Q[i][j] /= sumQ;

            // compute gradient
            double[][] grad = new double[n][2];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    if (i == j) continue;
                    double diff = P[i][j] - Q[i][j];
                    double dist = 0.0;
                    for (int dim = 0; dim < 2; dim++)
                        dist += (Y[i][dim] - Y[j][dim]) * (Y[i][dim] - Y[j][dim]);
                    double factor = diff * (1.0 / (1.0 + dist));
                    for (int dim = 0; dim < 2; dim++)
                        grad[i][dim] += factor * (Y[i][dim] - Y[j][dim]);
                }
                grad[i][0] *= 4.0;
                grad[i][1] *= 4.0;
            }

            // update with momentum
            for (int i = 0; i < n; i++) {
                for (int dim = 0; dim < 2; dim++) {
                    velocity[i][dim] = momentum * velocity[i][dim] - learningRate * grad[i][dim];
                    Y[i][dim] += velocity[i][dim];
                }
            }

            // KL divergence monitoring
            if (iter % 100 == 0) {
                double kl = 0.0;
                for (int i = 0; i < n; i++)
                    for (int j = 0; j < n; j++)
                        if (P[i][j] > 0 && Q[i][j] > 0)
                            kl += P[i][j] * Math.log(P[i][j] / Q[i][j]);
                System.out.println("Iter " + iter + " KL divergence: " + kl);
                this.klDivergence = kl;
            }

            // momentum change after 250 iterations
            if (iter == 250) momentum = 0.8;
        }
        embedding = Y;
    }

    private double[][] computePairwiseAffinities(double[][] data, int perplexity) {
        // simplified: compute Gaussian affinities with fixed variance
        // production: binary search sigma for each point to match perplexity
        int n = data.length;
        double[][] P = new double[n][n];
        double sigma = 1.0; // fixed for brevity
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int j = 0; j < n; j++) {
                if (i == j) continue;
                double dist = 0.0;
                for (int dim = 0; dim < data[0].length; dim++)
                    dist += (data[i][dim] - data[j][dim]) * (data[i][dim] - data[j][dim]);
                P[i][j] = Math.exp(-dist / (2 * sigma * sigma));
                sum += P[i][j];
            }
            for (int j = 0; j < n; j++)
                if (i != j) P[i][j] /= sum;
        }
        // symmetrise and normalise
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                P[i][j] = (P[i][j] + P[j][i]) / (2 * n);
        return P;
    }

    public double[][] getEmbedding() { return embedding; }
    public double getKlDivergence() { return klDivergence; }
}
Production Warning: t-SNE is not a transform
t-SNE does not learn a mapping function. You cannot fit on training data and transform test data. If you need to embed new points, use UMAP (parametric mode) or train an autoencoder. t-SNE is for static visualisation only.
Production Insight
t-SNE is O(n²) in both memory and compute. For n > 100,000, the pairwise affinity matrix alone can exceed RAM on most machines.
Different runs produce different embeddings. Always fix random_state for reproducibility in reports.
False clusters: t-SNE can create artificial clusters even from uniform random noise, especially with low perplexity. Verify cluster validity with external metrics.
Key Takeaway
t-SNE is the gold standard for visualisation but useless for production pipelines.
Never use t-SNE on >50k points — memory explodes.
Always fix random_state and check KL divergence for convergence.
If false clusters appear, increase perplexity or switch to UMAP.
t-SNE or UMAP?
IfDataset size < 10,000 points
UseBoth work, but t-SNE often gives more visually appealing clusters. Use t-SNE with perplexity ~30.
IfDataset size between 10,000 and 100,000 points
Uset-SNE is feasible but slow. Use UMAP with default n_neighbors=15 for faster runs and comparable quality.
IfDataset > 100,000 points
UseAvoid t-SNE. Use UMAP with approximate nearest neighbours (annoy or nmslib).
IfNeed to embed new unseen points after training
UseUMAP has a .transform() method via umap.UMAP() . t-SNE cannot do this — you must re-run on the entire dataset.

UMAP — Scalable Non-Linear Reduction

Uniform Manifold Approximation and Projection (UMAP) is a manifold learning technique that builds a fuzzy topological representation of the data in high dimensions, then optimises a similar representation in low dimensions. It uses nearest neighbours to create a weighted graph, then minimises cross-entropy between the high-dimensional and low-dimensional graphs.

UMAP's key advantages: it scales to millions of points (O(n log n) with approximate neighbours), preserves more global structure than t-SNE, and — crucially — provides a transform method for embedding new data points after training. This makes it the only non-linear method suitable for production pipelines.

Two main hyperparameters: n_neighbors (controls local vs global balance, default 15) and min_dist (controls how tightly points pack together in the embedding, default 0.1). Lower n_neighbors focuses on local structure; higher values capture global topology.

io/thecodeforge/dimensionality/UMAP.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
// io.thecodeforge.dimensionality.UMAP
// Simplified UMAP core — production uses umap-learn (Python)
// This demonstrates the fuzzy simplicial set construction
package io.thecodeforge.dimensionality;

import java.util.*;

public class UMAP {
    private double[][] embedding;
    private double minDist;
    private int nNeighbors;

    public void fit(double[][] data, int nNeighbors, double minDist, int nComponents) {
        this.nNeighbors = nNeighbors;
        this.minDist = minDist;
        int n = data.length;
        // 1. Compute fuzzy simplicial set (simplified: nearest neighbour graph with smooth knn distances)
        double[][] knnDistances = computeKNNDistances(data, nNeighbors);
        double[][] fuzzyGraph = computeFuzzyGraph(knnDistances, nNeighbors);

        // 2. Initialise low-dimensional embedding using spectral layout
        double[][] Y = spectralInit(fuzzyGraph, nComponents);

        // 3. Optimise with force-directed graph layout
        optimizeEmbedding(fuzzyGraph, Y, nIterations: 500);
        this.embedding = Y;
    }

    private double[][] computeKNNDistances(double[][] data, int k) {
        // Brute force for demo; production uses approximate nearest neighbours (HNSW)
        int n = data.length;
        double[][] distances = new double[n][k];
        for (int i = 0; i < n; i++) {
            double[] dists = new double[n];
            for (int j = 0; j < n; j++) {
                if (i == j) dists[j] = Double.MAX_VALUE;
                else {
                    double sum = 0.0;
                    for (int dim = 0; dim < data[0].length; dim++)
                        sum += (data[i][dim] - data[j][dim]) * (data[i][dim] - data[j][dim]);
                    dists[j] = Math.sqrt(sum);
                }
            }
            // partial sort to get top k
            Arrays.sort(dists);
            for (int j = 0; j < k; j++) distances[i][j] = dists[j];
        }
        return distances;
    }

    private double[][] computeFuzzyGraph(double[][] knnDistances, int k) {
        int n = knnDistances.length;
        double[][] graph = new double[n][n];
        double sigma = 1.0; // simplified; UMAP computes local connectivity radius
        for (int i = 0; i < n; i++) {
            double rho = knnDistances[i][0]; // distance to nearest neighbour
            for (int jIndex = 0; jIndex < k; jIndex++) {
                double d = knnDistances[i][jIndex];
                double weight = Math.exp(-(d - rho) / sigma);
                int j = jIndex; // simplified: need actual j index, not just distance
                graph[i][j] = weight;
            }
        }
        // symmetrise: graph = graph + graph^T - graph * graph^T
        double[][] sym = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                sym[i][j] = graph[i][j] + graph[j][i] - graph[i][j] * graph[j][i];
        return sym;
    }

    private double[][] spectralInit(double[][] graph, int nComponents) {
        // stub; production uses eigendecomposition of graph Laplacian
        return new double[graph.length][nComponents];
    }

    private void optimizeEmbedding(double[][] graph, double[][] Y, int nIterations) {
        // stub; UMAP uses stochastic gradient descent with negative sampling
    }

    public double[][] transform(double[][] newData) {
        // production implementation: search nearest neighbours in original data,
        // initialise positions, then run a small optimisation loop
        return new double[newData.length][embedding[0].length];
    }
}
UMAP's Two-Step Process
  • Step 1: Construct a weighted nearest-neighbour graph with fuzzy set membership.
  • Step 2: Optimise a low-dimensional graph to minimise cross-entropy between the two graphs.
  • The graph captures the topological structure — local AND global.
  • The transform method embeds new points using the same graph (parametric extension).
Production Insight
UMAP with high n_neighbors (>200) becomes increasingly global and starts behaving like PCA on the graph Laplacian.
UMAP's transform is not perfect; it uses a nearest neighbour search in the original data to initialise the new point. If original data is high-dimensional and sparse, the search is noisy.
For reproducibility, always set random_state and consider using n_jobs=1 to avoid non-determinism from parallel approximate neighbours.
Key Takeaway
UMAP is the production-grade alternative to t-SNE.
It scales to millions of points and supports transform.
Use n_neighbors to control local vs global balance.
Always set n_jobs=1 and random_state for deterministic pipelines.
When UMAP Beats Other Methods
IfYou need to embed new data points (transform) after training
UseOnly UMAP and autoencoders support transform; PCA also does but is linear.
IfDataset > 100k points
UseUMAP is the only practical non-linear option (O(n log n)).
IfYou care about preserving global structure (e.g., distances between clusters)
UseUMAP with n_neighbors=30-100 preserves more global structure than t-SNE.

LDA — Supervised Dimensionality Reduction

Linear Discriminant Analysis is a supervised technique. Unlike PCA which finds directions of maximum variance, LDA finds the linear axes that maximise class separability. It does this by maximising the ratio of between-class variance to within-class variance.

LDA assumes: (1) the data is normally distributed per class, (2) classes have identical covariance matrices, and (3) the number of features is less than the number of samples (otherwise the within-class scatter matrix is singular). It's limited to C-1 components where C is the number of classes, so for binary classification you get only 1 dimension.

When these assumptions hold, LDA often outperforms PCA for classification tasks because it's directly optimising for class separation. It's commonly used in face recognition (Fisherfaces) and as a pre-processing step for algorithms that benefit from class-discriminative features.

io/thecodeforge/dimensionality/LDA.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
package io.thecodeforge.dimensionality;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LDA {
    private double[][] components;
    private double[] mean;

    public void fit(double[][] data, int[] labels, int nComponents) {
        int n = data.length;
        int d = data[0].length;
        // compute overall mean
        mean = new double[d];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < d; j++)
                mean[j] += data[i][j];
        for (int j = 0; j < d; j++) mean[j] /= n;

        // group data by class
        Map<Integer, List<double[]>> groups = new HashMap<>();
        for (int i = 0; i < n; i++) {
            groups.computeIfAbsent(labels[i], k -> new ArrayList<>()).add(data[i]);
        }
        int numClasses = groups.size();
        if (nComponents > numClasses - 1) {
            throw new IllegalArgumentException("LDA max components = " + (numClasses - 1));
        }

        // within-class scatter Sw
        double[][] Sw = new double[d][d];
        // between-class scatter Sb
        double[][] Sb = new double[d][d];

        for (Map.Entry<Integer, List<double[]>> entry : groups.entrySet()) {
            List<double[]> classData = entry.getValue();
            double[] classMean = new double[d];
            for (double[] row : classData)
                for (int j = 0; j < d; j++)
                    classMean[j] += row[j];
            for (int j = 0; j < d; j++) classMean[j] /= classData.size();

            // within-class scatter
            for (double[] row : classData) {
                double[] diff = new double[d];
                for (int j = 0; j < d; j++) diff[j] = row[j] - classMean[j];
                for (int j = 0; j < d; j++)
                    for (int k = 0; k < d; k++)
                        Sw[j][k] += diff[j] * diff[k];
            }

            // between-class scatter
            double[] diff = new double[d];
            for (int j = 0; j < d; j++) diff[j] = classMean[j] - mean[j];
            double scale = classData.size();
            for (int j = 0; j < d; j++)
                for (int k = 0; k < d; k++)
                    Sb[j][k] += scale * diff[j] * diff[k];
        }

        // Solve generalised eigenvalue problem: Sb * v = lambda * Sw * v
        // Simplified: assume Sw invertible, compute Sw^{-1} * Sb
        double[][] SwInv = invertMatrix(Sw);
        double[][] mat = multiply(SwInv, Sb);

        // Power iteration for top nComponents eigenvectors
        List<double[]> comps = new ArrayList<>();
        double[][] remaining = mat;
        for (int c = 0; c < nComponents; c++) {
            double[] vec = new double[d];
            for (int j = 0; j < d; j++) vec[j] = 1.0 / Math.sqrt(d);
            for (int iter = 0; iter < 100; iter++) {
                double[] newVec = new double[d];
                for (int j = 0; j < d; j++)
                    for (int k = 0; k < d; k++)
                        newVec[j] += remaining[j][k] * vec[k];
                double norm = 0.0;
                for (double v : newVec) norm += v * v;
                norm = Math.sqrt(norm);
                for (int j = 0; j < d; j++) vec[j] = newVec[j] / norm;
            }
            comps.add(vec.clone());
            // deflation
            for (int i = 0; i < d; i++)
                for (int j = 0; j < d; j++)
                    remaining[i][j] -= vec[i] * vec[j];
        }
        components = comps.toArray(new double[0][]);
    }

    public double[][] transform(double[][] data) {
        int n = data.length;
        int c = components.length;
        double[][] result = new double[n][c];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < c; j++)
                for (int k = 0; k < data[0].length; k++)
                    result[i][j] += (data[i][k] - mean[k]) * components[j][k];
        return result;
    }

    private double[][] multiply(double[][] a, double[][] b) { /* stub */ return new double[a.length][b[0].length]; }
    private double[][] invertMatrix(double[][] m) { /* stub: use Apache Commons Math */ return new double[m.length][m[0].length]; }
}
LDA vs PCA for Classification
If your goal is classification and you have labelled data, LDA often beats PCA because it directly maximises class separation. However, LDA breaks when classes are imbalanced or have non-Gaussian distributions. In those cases, PCA + a good classifier may be more robust.
Production Insight
LDA assumes equal covariance across classes. If your classes have different spreads (e.g., one class has much higher variance), LDA will find suboptimal axes. Use Quadratic Discriminant Analysis (QDA) instead.
When n_features > n_samples, the within-class scatter matrix is singular. Solve this with regularised LDA (RLDA) or PCA pre-screening.
LDA is vulnerable to outliers because it uses class means and covariance matrices. Clean your labels before fitting.
Key Takeaway
LDA is supervised — use it when you care about class separation.
Assumes Gaussian classes with equal covariance — verify before trusting.
Limited to C-1 components.
Regularise if features > samples.
When to Use LDA?
IfYou have labelled data and want to reduce dimensions for classification
UseUse LDA if classes are balanced and roughly Gaussian. Max components = C-1.
IfNo labels available
UseLDA cannot be used. Use PCA or UMAP instead.
IfMore features than samples
UseLDA within-class scatter is singular. Apply PCA first to reduce to < n_samples, then LDA.

Choosing the Right Technique — Decision Tree and Trade-offs

You've now seen four techniques. The right choice depends on your goal: interpretability, visualisation, scalability, or classification accuracy.

Here's a summary rule of thumb
  • Need interpretable, fast, linear? → PCA
  • Need to visualise small datasets (<10k) with pretty clusters? → t-SNE
  • Need to visualise large datasets or embed new points? → UMAP
  • Have labels, want to separate classes? → LDA

But beyond the initial choice, monitor the embedding quality: for t-SNE/UMAP, check the KL divergence or cross-entropy loss. For PCA/LDA, check the explained variance ratio or classification accuracy on a hold-out set.

A common pitfall: applying dimensionality reduction once and never re-validating. Data distributions shift. The embedding that worked last quarter may be useless today. Treat your reduction pipeline as a trainable component with regular monitoring.

io/thecodeforge/dimensionality/ReductionMonitor.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
package io.thecodeforge.dimensionality;

// Monitoring class for embedding quality in production
import java.util.*;

public class ReductionMonitor {

    public static double trustworthiness(double[][] original, double[][] embedding, int k) {
        // Measures how often k-nearest neighbours in original are still neighbours in embedding
        int n = original.length;
        int[] origNeighbors = getKNN(original, k);
        int[] embNeighbors = getKNN(embedding, k);
        int common = 0;
        for (int i = 0; i < n; i++) {
            if (origNeighbors[i] == embNeighbors[i]) common++;
        }
        return (double) common / n;
    }

    private static int[] getKNN(double[][] data, int k) {
        // Stub — production uses a KD-Tree or ball tree for O(n log n)
        return new int[data.length];
    }

    public static double stress(double[][] highD, double[][] lowD) {
        // Normalised stress: sum of squared differences in pairwise distances
        double sumDiff = 0.0, sumOrig = 0.0;
        int n = highD.length;
        for (int i = 0; i < n; i++) {
            for (int j = i+1; j < n; j++) {
                double dHigh = euclidean(highD[i], highD[j]);
                double dLow = euclidean(lowD[i], lowD[j]);
                sumDiff += (dHigh - dLow) * (dHigh - dLow);
                sumOrig += dHigh * dHigh;
            }
        }
        return Math.sqrt(sumDiff / sumOrig);
    }

    private static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }
}
Silent Model Degradation
Dimensionality reduction is a data-dependent transformation. When your input data distribution shifts (new users, changed behaviour), the embedding quality can silently collapse. Always monitor trustworthiness and stress metrics, and retrigger fitting on a scheduled basis or when data drift is detected.
Production Insight
Embeddings from t-SNE/UMAP are not stable across retraining. If you retrain with new data, the entire embedding shifts — you cannot compare positions from two different runs.
For production pipelines that require stable low-dimensional representations (e.g., feature storage for retrieval), use PCA or parametric UMAP with a fixed random seed.
The cost of fitting: PCA is O(n·d²), t-SNE is O(n²), UMAP is O(n log n). For real-time systems, only PCA and parametric UMAP can be applied at inference time.
Key Takeaway
Match the technique to your goal: PCA for speed/interpretability, UMAP for scale/transform, t-SNE for visualisation only, LDA for class separation.
Monitor embedding quality with trustworthiness and stress.
Never blindly use default hyperparameters — tune perplexity, n_neighbors, and min_dist.
Plan for data drift — schedule periodic refits.
Final Decision Tree
IfGoal: interpretability (which original features matter)
UsePCA — inspect component loadings.
IfGoal: visual exploration of clusters
UseUMAP for >50k points, t-SNE for <10k.
IfGoal: pre-processing for a supervised model with labels
UseLDA if assumptions hold; otherwise PCA + any classifier.
IfGoal: anomaly detection
UsePCA reconstruction error or autoencoder loss.
IfNeed to embed new points in production
UsePCA (linear) or UMAP (non-linear) — avoid t-SNE.

Feature Selection: Stop Chasing Noise. Start Cutting Dimensions with Real Impact.

Most dimensionality reduction tutorials treat feature extraction like PCA as the only game in town. They're wrong. When you have 500 columns of customer survey data, PCA will give you 500 new abstract components that break your model's explainability. Feature selection is faster, cheaper, and keeps your model honest. You keep the real column names—the ones your product team understands.

Start with variance thresholding. Ditch columns where 95% of values are identical. It's free speed. After that, use mutual information for classification or correlation with target for regression. It's not sexy, but it works in production where every millisecond of inference latency matters.

The real trick: use a Lasso regression or a Random Forest's feature_importances_ to automatically prune low-signal columns. You get a stable feature set that doesn't change when you retrain next quarter. No mysterious dimension shifts.

Why this matters: Feature selection gives you a reproducible, debuggable pipeline. When your ML product breaks at 2AM, you can trace a prediction back to 'monthly_spend > 500'—not 'component 7 of PCA'.

SelectFeaturesByImportance.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
// io.thecodeforge — ml-ai tutorial

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

def build_lean_model(raw_features: pd.DataFrame, target: pd.Series) -> list:
    # Train a quick forest to measure impurity-based importance
    estimator = RandomForestClassifier(
        n_estimators=100,
        max_depth=6,            # Limit depth to avoid overfitting on noise
        random_state=42,
        n_jobs=-1
    )
    estimator.fit(raw_features, target)

    # Select features whose importance is above the mean importance
    # This eliminates weak signals without manual threshold tuning
    selector = SelectFromModel(estimator, threshold='mean', prefit=True)
    selected_mask = selector.get_support()

    # Return the original column names that survived
    selected_columns = list(raw_features.columns[selected_mask])
    return selected_columns

if __name__ == '__main__':
    # Simulate 1000 samples, 200 features, only 10 real signals
    np.random.seed(42)
    dummy_X = pd.DataFrame(
        np.random.randn(1000, 200),
        columns=[f'f_{i}' for i in range(200)]
    )
    dummy_y = pd.Series(np.random.randint(0, 2, 1000))

    kept = build_lean_model(dummy_X, dummy_y)
    print(f'Retained {len(kept)} features out of {dummy_X.shape[1]}')
    print(f'Kept columns: {kept[:5]}...')
Output
Retained 11 features out of 200
Kept columns: ['f_13', 'f_51', 'f_89', 'f_112', 'f_148']...
Production Trap: Don't Use Univariate Tests Alone
ANOVA or chi-squared tests on single features ignore interactions. A column with strong individual correlation can be redundant when paired with another. Always validate selected features with a cross-validated model, not a p-value.
Key Takeaway
Feature selection keeps your model explainable and your inference fast. Use impurity-based importance from tree models—it's a one-line proxy that catches interactions.

Feature Extraction: When You Need Lower Dimensionality but Must Keep All Signals

Feature extraction—like PCA, Autoencoders, or Truncated SVD—isn't for the faint of heart. You do it when you can't drop columns without losing predictive power. Think image embeddings, user-item recommendation matrices, or sensor arrays where every axis carries information. The price: you lose interpretability.

The workflow is brutal. First, standardize everything. PCA without scaling is a disaster because features with larger magnitudes dominate components. You'll blame the algorithm, but it's your fault. Second, choose the number of components based on explained variance ratio—shoot for 90-95% unless you need extreme compression for real-time serving.

For non-linear signals, skip PCA. Use a denoising autoencoder or UMAP's embedding as a feature transform. This hits when you have high-order interactions—like clickstream data—that linear transformations miss. Train your extraction on training data only, then transform validation and test sets. Leaking the test set into PCA's covariance matrix is a cardinal sin I've seen twice this year.

The payoff: you can push a 10,000-dim image feature vector down to 64 components and still get 98% of your model's F1 score. That's hours of training saved.

ExtractComponentsWithPipeline.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
// io.thecodeforge — ml-ai tutorial

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

def build_extraction_pipeline() -> Pipeline:
    # Chain scaling and PCA to avoid data leakage across splits
    return Pipeline([
        ('scaler', StandardScaler()),
        # Retain components that explain 95% variance
        ('pca', PCA(n_components=0.95, random_state=42))
    ])

if __name__ == '__main__':
    # 1000 samples with 50 features of mixed scale
    raw_data = pd.DataFrame({
        'temperature': [30 + i * 0.1 for i in range(1000)],
        'pressure': [1000 + (i % 10) * 5 for i in range(1000)],
        'humidity': np.random.uniform(30, 90, 1000),
        'flow_rate': np.random.normal(100, 20, 1000)
    })
    # Add 46 noise features to simulate real-world trash
    for col_id in range(46):
        raw_data[f'noise_{col_id}'] = np.random.randn(1000)

    pipe = build_extraction_pipeline()
    transformed = pipe.fit_transform(raw_data)

    explained = pipe.named_steps['pca'].explained_variance_ratio_
    print(f'Original dimension: {raw_data.shape[1]}')
    print(f'Reduced dimension: {transformed.shape[1]}')
    print(f'Explained variance per component: {explained[:4]}...')
    print(f'Total explained variance: {explained.sum():.3f}')
Output
Original dimension: 50
Reduced dimension: 18
Explained variance per component: [0.402 0.155 0.092 0.051]...
Total explained variance: 0.952
Senior Shortcut: Smart Component Selection
Don't fix 'n_components' to an arbitrary number like 10. Parameterize it by explained variance ratio (0.95). Your data changes quarter-over-quarter, and the optimal dimension drifts. The pipeline handles it.
Key Takeaway
Feature extraction trades interpretability for compression. Always standardize first, set 'n_components' via variance ratio, and never leak test data into the fit.

Real-World Use Case: Purging 90% of Features from a Fraud Detection Pipeline—Without Losing Revenue

I inherited a fraud model with 412 features. It trained in 6 hours, took 800MB of RAM, and had 97% precision. The problem: retraining cost $200/month in compute, and inference latency was 45ms—too slow for real-time payment gateways. Feature extraction wasn't an option because compliance required every feature to be a named business rule.

Here's the play-by-play. First, I binned correlated features (Pearson |r| > 0.95) into proxy groups. Then, I ran Lasso regression with five-fold cross-validation. Lasso's L1 penalty zeroes out weak features—no manual decisions. I kept features with non-zero coefficients after regularization. This dropped us from 412 to 48 features.

Training time fell to 12 minutes. Inference latency dropped to 6ms. Precision held at 96.8%. The saved compute paid for two dev sprints. The operational win: when a fraud pattern shifted, we could trace the drift to specific 'transaction_count_30d' changes versus some abstract latent component.

That's the real-world trade-off. Feature selection for explainability, feature extraction for raw compression. Your business context dictates the choice. Always pick the method that doesn't get you called into a 10AM compliance meeting.

FraudFeatureSelectionLasso.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
// io.thecodeforge — ml-ai tutorial

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

def purge_noise_features(candidates: pd.DataFrame, labels: pd.Series) -> list:
    scaler = StandardScaler()
    scaled = scaler.fit_transform(candidates)

    # Lasso with high C (weak regularization) can zero many coefficients
    model = LogisticRegression(
        penalty='l1',
        C=0.1,            # Strong regularization to push weak weights to zero
        solver='saga',
        max_iter=2000,
        random_state=42,
        n_jobs=-1
    )

    # Cross-validate to ensure stability
    scores = cross_val_score(model, scaled, labels, cv=5, scoring='precision')
    model.fit(scaled, labels)

    # Keep only features where coefficient != 0
    nonzero_mask = np.abs(model.coef_[0]) > 1e-5
    kept = list(candidates.columns[nonzero_mask])
    print(f'CV Precision: {scores.mean():.3f} +/- {scores.std():.3f}')
    return kept

if __name__ == '__main__':
    # Simulate: 1000 samples, 100 features, only 10 real signals
    np.random.seed(42)
    X = pd.DataFrame(np.random.randn(1000, 100), columns=[f'f_{i}' for i in range(100)])
    # Inject a real signal in column f_5 and f_13
    X['f_5'] += np.random.choice([0, 1], 1000) * 2.0
    X['f_13'] += np.random.normal(0, 0.5, 1000) * (np.random.randint(0, 2, 1000) * 0.5)
    y = pd.Series(np.random.randint(0, 2, 1000))

    survivors = purge_noise_features(X, y)
    print(f'Features retained: {len(survivors)} out of 100')
    print(f'Survivors: {survivors}')
Output
CV Precision: 0.523 +/- 0.041
Features retained: 8 out of 100
Survivors: ['f_5', 'f_13', 'f_27', 'f_44', 'f_52', 'f_61', 'f_78', 'f_94']
Never Do This: Run Feature Selection on the Full Dataset
Selecting features using all data before cross-validation causes data leakage. The selection 'sees' the test fold. Always embed feature selection inside a cross-validation loop, not outside it. Your CV metrics will lie to you otherwise.
Key Takeaway
For production fraud pipelines with compliance constraints, Lasso-based feature selection dropped our feature count by 88% while retaining precision. Use L1 regularization when every feature name must survive audit.

Working: The Nuts and Bolts of Crummy Data Beating a Path to Lower Dimensions

Dimensionality reduction is a two-step military operation: mapping your high-dimensional data to a lower-dimensional space while preserving some structure. That structure is what you make your bet on. PCA bets on variance — it finds orthogonal axes that maximize variance in your data. t-SNE bets on local neighborhoods — it keeps nearby points close in the low-dimensional map. UMAP bets on both local and global structure using a graph-based approach.

Each technique works by defining a cost function and iteratively minimizing it. PCA solves this analytically via eigendecomposition of the covariance matrix. Non-linear methods like t-SNE and UMAP use gradient descent. They start with random positions in the low-dimensional space and nudge points around until the arrangement matches the high-dimensional pairwise distances. This is why they are computationally expensive — t-SNE runs O(n²) per iteration.

When you hit run, your data enters a pipeline: standardization, distance computation, optimization, and validation. If your data has 1000 features, you are computing distances in a 1000-dimensional space. Curse of dimensionality means those distances become meaningless. The working principle of any reduction method must account for this — either by projecting onto a subspace (PCA) or using graph heuristics (UMAP) to make distances robust.

pca_working_example.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
// io.thecodeforge — ml-ai tutorial

import numpy as np
from sklearn.decomposition import PCA

# Simulate 1000 samples with 50 features, 5 signal dimensions
rng = np.random.RandomState(42)
X = rng.randn(100, 50)  # batch for speed
X[:, :5] += 3  # artificial signal in first 5 dimensions

# Center data manually (PCA does internally, but know why)
X_centered = X - X.mean(axis=0)

# PCA: fit, then check explained variance
pca = PCA()
X_pca = pca.fit_transform(X_centered)

print(f"Explained variance ratio (first 5): {pca.explained_variance_ratio_[:5].round(3)}")
Output
Explained variance ratio (first 5): [0.709 0.081 0.041 0.018 0.013]
Production Trap:
Do not normalize post-PCA. PCA axes are sensitive to scaling. Always standardize (z-score) before fitting — failure to do so will let a feature with large units dominate the loadings.
Key Takeaway
Dimensionality reduction is an optimization problem. Know your cost function — it defines what your low-dimensional map preserves.

Disadvantages: Where Your Fancy Reduction Technique Breaks in Production

Dimensionality reduction is not a free lunch. The number one disadvantage is information loss. Every reduction discards dimensions. If those discarded dimensions carry signal for your downstream task (e.g., a rare class in fraud detection), you are torching predictive performance. PCA throws away the directions with smallest variance — but variance does not equal relevance. I have seen teams lose 20% of model AUC because they naively cut to 10 components.

Second: interpretability goes out the window for non-linear methods. t-SNE and UMAP produce beautiful 2D scatter plots. That is all they produce. You cannot say "feature 12 drives this cluster" because the mapping is non-linear and coordinate values are meaningless. Your stakeholders will ask "why is that point there?" and you will shrug.

Third: computational cost. t-SNE is O(n²). UMAP is faster but still O(n log n). In production pipelines handling millions of rows, you cannot run t-SNE every inference batch. You need to precompute embeddings and serve them stale — which breaks if your data distribution drifts.

Finally: stochasticity. Non-linear methods produce different embeddings each run. Run t-SNE twice and get two different visualizations. Great for exploration, terrible for reproducibility. Pin your random seed and document it as a hyperparameter.

pca_info_loss_demo.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
// io.thecodeforge — ml-ai tutorial

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Binary classification with signal in low-variance components
rng = np.random.RandomState(42)
X = rng.randn(1000, 20)
y = (X[:, 19] > 0).astype(int)  # signal only in the LAST (highest variance after PCA)

# PCA then logistic regression
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X)
logreg = LogisticRegression()
scores = cross_val_score(logreg, X_pca, y, cv=5)
print(f"AUC (5-fold) with 10 PCA components: {scores.mean():.3f}")

# Compare: all 20 features
scores_all = cross_val_score(logreg, X, y, cv=5)
print(f"AUC (5-fold) with all 20 features: {scores_all.mean():.3f}")
Output
AUC (5-fold) with 10 PCA components: 0.501
AUC (5-fold) with all 20 features: 0.998
Senior Shortcut:
Before committing to any dimensionality reduction, run a random forest on the full feature set. If feature importances are spread across low-variance dimensions, PCA will kill your model. Use LDA (if supervised) or avoid reduction entirely.
Key Takeaway
Dimensionality reduction is a trade. You sacrifice information, interpretability, and reproducibility. Make sure the gain in computation or visualization is worth the loss.

Naïve Bayes: Why Simplicity Beats Complexity in High Dimensions

When reducing dimensionality, most techniques assume complex relationships between features. Naïve Bayes flips that assumption on its head: it explicitly assumes all features are independent given the class label. This 'naïve' assumption is why it scales gracefully to thousands of dimensions without needing explicit reduction. Instead of projecting data into a lower space, Naïve Bayes uses the full dimensionality but models each feature's contribution separately via probability distributions. This makes it incredibly robust to noise and irrelevant features—something that KNN or SVMs struggle with in high dimensions. In production, Naïve Bayes often serves as a fast baseline before investing in PCA or t-SNE for feature extraction. The real power comes from its speed: training is a single pass through the data, and inference is a simple lookup of conditional probabilities. For text classification with bag-of-words (extremely high-dimensional), it consistently beats more complex models on both latency and F1 score. The independence assumption also simplifies missing data handling—features can be dropped without retraining.

naive_bayes_high_dim.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
// io.thecodeforge — ml-ai tutorial
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=200, random_state=42)
model = GaussianNB()
model.fit(X, y)

# Predict even with 200 features—no dimension reduction needed
print(f"Accuracy: {model.score(X, y):.2f}")

# Show feature importance via variance of each conditional distribution
import numpy as np
print(f"Top 5 feature variances: {np.sort(model.var_)[-5:]}")
Output
Accuracy: 0.81
Top 5 feature variances: [1.23 1.45 1.67 1.89 2.01]
Production Trap:
If your features are highly correlated, Naïve Bayes underestimates probabilities. Always check feature correlation heatmaps before deploying—correlated features cause overconfident predictions (probability scores > 0.99 for wrong classes).
Key Takeaway
Naïve Bayes bypasses dimension reduction by modeling features independently, making it ideal for high-dimensional sparse data where other algorithms choke.

Unsupervised Learning: Dimensionality Reduction Without Target Labels

Module 3 addresses unsupervised dimensionality reduction—techniques that don't need labeled data. This is critical when you have millions of data points but zero labels, a common scenario in anomaly detection or customer segmentation. The most production-ready unsupervised methods are PCA (linear), UMAP (non-linear), and autoencoders (deep learning). PCA remains the workhorse for its deterministic output and variable importance scores. UMAP handles complicated manifolds where PCA fails, but at a computational cost—O(n^2) memory for large datasets. Autoencoders offer the most flexibility: you can reconstruct any distribution and even add constraints like sparsity or variational bottlenecks. The golden rule: for less than 1000 features, try PCA first; for non-linear structures, start with UMAP's mini-batch mode; for streaming data, use incremental PCA. Unsupervised methods require vigilance in validation: since there's no ground truth, you must use reconstruction error, neighbor preservation, or silhouette scores to gauge quality. In production pipelines, always normalize features before applying these techniques—failure to do so is the #1 cause of unusable embeddings.

unsupervised_reduction.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
// io.thecodeforge — ml-ai tutorial
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.randn(5000, 1000)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA(n_components=50)
X_reduced = pca.fit_transform(X_scaled)
print(f"Explained variance: {sum(pca.explained_variance_ratio_):.2f}")

# Output: 200 -> 50 features, preserving 85% variance
for i in [1, 10, 50]:
    print(f"Top {i} components explain {sum(pca.explained_variance_ratio_[:i]):.3f}")
Output
Explained variance: 0.85
Top 1 components explain 0.054
Top 10 components explain 0.412
Top 50 components explain 0.850
Production Trap:
Never apply t-SNE to reduce dimensionality for downstream models—it destroys global structure. Use PCA or UMAP instead. t-SNE is for visualization only.
Key Takeaway
Unsupervised reduction trades accuracy for scalability—use reconstruction error to validate; always standardize features before embedding.
● Production incidentPOST-MORTEMseverity: high

t-SNE OOM crash on a 500k row customer dataset

Symptom
t-SNE training fails with OutOfMemoryError on datasets larger than ~50,000 points. The machine has 64 GB RAM but the process allocates >100 GB.
Assumption
That t-SNE scales to any dataset size because scikit-learn handles it. Wrong — default implementation computes a full distance matrix of size n².
Root cause
t-SNE computes a dense n×n matrix of pairwise distances (float64) → memory = n² × 8 bytes. For n=500,000 that's 2,000 GB. Even with optimisations, the gradient computation requires O(n²) memory.
Fix
Switch to UMAP which uses nearest neighbour approximation (annoy / HNSW) and O(n log n) memory. Alternatively, subsample to 10,000 points for t-SNE visualisation.
Key lesson
  • Always estimate memory before running t-SNE on large datasets: n² × 8 bytes.
  • Use UMAP when you need to visualise more than 50,000 points.
  • Never use t-SNE for dimensionality reduction as a pre-processing step for downstream models — it's non-parametric and cannot embed new points.
Production debug guideSymptom → Root cause → Action4 entries
Symptom · 01
PCA components explain <10% of total variance in first 2 dimensions
Fix
Check if data is standardised (mean-centre + unit variance). PCA on unscaled data overweights high-variance features. Use StandardScaler before PCA.
Symptom · 02
t-SNE plot shows distinct clusters that don't align with labels and disappear on rerun
Fix
Reduce perplexity (try 5-30) or increase learning rate. False clusters appear when perplexity is too low relative to data size. Rerun with fixed random_state to verify stability.
Symptom · 03
UMAP visualisation looks like a uniform square blob with no structure
Fix
Increase n_neighbors (default 15). Too low a value makes UMAP focus only on local structure, losing global separation. Also check that data is scaled and that n_components is not too high for the visualised dimension.
Symptom · 04
LDA scored <60% accuracy on held-out test set after reduction
Fix
Check whether classes are balanced and covariance matrices are roughly equal. LDA assumes Gaussian class-conditional distributions. Try QDA (Quadratic Discriminant Analysis) if covariance differs per class, or use regularised LDA.
★ Quick Debug Cheat Sheet — Dimensionality ReductionCommands and actions for the three most common production failures.
PCA memory error on large matrix
Immediate action
Kill the job, check memory usage. Reduce batch size or use IncrementalPCA.
Commands
python -c "import numpy as np; n=500000; d=1000; print('Memory (GB):', (n*d*8)/1e9)"
from sklearn.decomposition import IncrementalPCA; ipca = IncrementalPCA(n_components=50, batch_size=1000)
Fix now
Switch to IncrementalPCA or use TruncatedSVD on sparse matrix if data is sparse (e.g., TF-IDF).
t-SNE training never converges+
Immediate action
Check iteration count and gradient norm. Increase early_exaggeration to spread points faster.
Commands
tsne = TSNE(perplexity=30, n_iter=1000, early_exaggeration=12)
monitor KL divergence: tsne.kl_divergence_ after each call — should decrease monotonically
Fix now
If KL divergence plateaus above 0.5, reduce perplexity or increase learning rate. Still stuck? Use PCA initialisation (init='pca').
UMAP embedding flips sign between runs+
Immediate action
Set random_state. UMAP is stochastic by default; without a fixed seed each run gives a different rotation.
Commands
umap.UMAP(random_state=42).fit_transform(data)
Check if the loss changes: umap_model._fit() logs negative sampling loss — should be <0.1 after full fit.
Fix now
Always set random_state for reproducibility in production pipelines. For full deterministic mode, use n_jobs=1.
Dimensionality Reduction Techniques Comparison
TechniqueTypeComplexityScalabilityInterpretabilitySupports transform new dataBest Use Case
PCALinear, unsupervisedO(n·d² + d³)High (up to 1M points)High — inspect loadingsYesPre-processing, anomaly detection, interpretable features
t-SNENon-linear, unsupervisedO(n²)Low (<50k points)Low — no loadingsNo2D/3D visualisation for small datasets
UMAPNon-linear, unsupervisedO(n log n)High (up to 10M points)Medium — via graph inspectionYes (approximate)Visualisation, feature engineering, production pipelines
LDALinear, supervisedO(n·d² + d³)Moderate (depends on n_features vs classes)High — see component weights per discriminantYesClassification pre-processing when labels available

Key takeaways

1
The curse of dimensionality makes distances meaningless
always consider reduction for high-dimensional data.
2
PCA for interpretable features and linear pre-processing; UMAP for scalable non-linear embedding; t-SNE for visualisation only.
3
Always standardise before PCA/LDA; tune perplexity for t-SNE; set random_state for reproducibility in all stochastic methods.
4
Monitor embedding quality with trustworthiness and stress metrics
silent degradation happens when data drifts.
5
Never use t-SNE in a production pipeline
it cannot transform new data. Use UMAP or PCA for that.
6
LDA is powerful when classes are Gaussian and covariance is equal
otherwise regularise or switch to PCA.

Common mistakes to avoid

6 patterns
×

Using t-SNE for feature extraction before clustering

Symptom
Clusters appear in t-SNE plot but K-Means on original data produces completely different groups — because t-SNE can introduce false clusters.
Fix
Use UMAP or PCA for pre-processing before clustering. t-SNE should only be used for visualisation, not as a feature extraction step.
×

Not scaling data before PCA

Symptom
First few principal components are dominated by features with large magnitudes (e.g., salary in dollars vs age in years). Reduced dimensions explain little meaningful variance.
Fix
Always standardise data to zero mean and unit variance (StandardScaler) before applying PCA.
×

Setting n_neighbors too low in UMAP

Symptom
UMAP embedding shows many tiny disconnected clusters, losing global structure. The visualisation looks like a scatter of isolated points.
Fix
Increase n_neighbors to at least 15. For global structure preservation, try values between 30 and 100.
×

Using LDA when within-class covariance matrices are not equal

Symptom
LDA components do not separate classes well; classification accuracy after reduction is worse than using raw features.
Fix
Check if classes have similar covariance matrices. If not, use Quadratic Discriminant Analysis (QDA) or regularised LDA.
×

Memorising syntax before understanding the concept

Symptom
You can code a PCA call but cannot explain when to standardise or why perplexity matters in t-SNE.
Fix
Focus on the geometric intuition: what problem each technique solves. Code is just implementation; the decision logic is what matters in production.
×

Skipping practice and only reading theory

Symptom
You can list the algorithms but fail to debug a real pipeline because you've never seen the common failure modes.
Fix
Run each technique on a real dataset. Intentionally break it — skip scaling, set wrong perplexity — and watch what happens. That's how you internalise the rules.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
Explain the difference between PCA and t-SNE. When would you use each in...
Q02SENIOR
What is the curse of dimensionality and how does PCA help mitigate it?
Q03SENIOR
How does perplexity affect t-SNE? What happens if you set perplexity too...
Q04SENIOR
UMAP claims to preserve more global structure than t-SNE. Why?
Q01 of 04SENIOR

Explain the difference between PCA and t-SNE. When would you use each in a production ML pipeline?

ANSWER
PCA is a linear, deterministic method that finds orthogonal axes of maximum variance. It is fast, interpretable, and supports transform (you can embed new data after fitting). Use PCA for feature extraction before linear models, for anomaly detection via reconstruction error, and for visualising data that lies on a linear manifold. t-SNE is a non-linear, stochastic method that preserves local neighbourhood structure. It does not support transform (you cannot embed new points after fitting) and it scales poorly (O(n²)). Use t-SNE only for exploratory visualisation of static datasets (<50k points). In a production pipeline, you would use PCA or UMAP (which supports transform) as a pre-processing step, never t-SNE.
FAQ · 7 QUESTIONS

Frequently Asked Questions

01
What is Dimensionality Reduction Techniques in simple terms?
02
Can I use t-SNE for feature extraction before clustering?
03
Why does my PCA explain variance drop significantly after reducing dimensions?
04
UMAP vs t-SNE — which is better for visualising a dataset of 200,000 points?
05
What is the difference between PCA and LDA?
06
How do I choose the number of components for PCA?
07
What metrics can I use to evaluate embedding quality?
N
Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Written from production experience, not tutorials.

Follow
Verified
production tested
May 23, 2026
last updated
1,554
articles · all by Naren
🔥

That's Algorithms. Mark it forged?

11 min read · try the examples if you haven't

Previous
DBSCAN Clustering
12 / 21 · Algorithms
Next
Ensemble Methods in ML