Senior 6 min · March 06, 2026

PCA Failure — Unscaled Feature Skews Segmentation

A feature with values in the 1e6–1e9 range caused the first principal component to capture only that column, breaking segmentation.

Naren · Founder
Quick Answer
  • PCA transforms correlated features into uncorrelated principal components ranked by variance
  • Components are eigenvectors of the covariance matrix; eigenvalues give variance explained
  • SVD is numerically stable; scikit-learn uses SVD by default, not eigendecomposition
  • Always standardize features (zero mean, unit variance) before PCA — or the first component captures scale, not structure
  • Explained variance ratio tells you how many components keep 90-95% of information
  • Inverse transform reconstructs data with compression error; monitor reconstruction loss in prod
Plain-English First

Imagine you have 50 photos of the same person's face taken from slightly different angles, lighting and distances. Instead of storing all 50 photos, you find the 3 or 4 'directions of change' that capture almost everything interesting — like how much the face tilts, how bright the light is, how close the camera is. PCA does exactly that for data: it finds the fewest possible 'directions' that still tell you almost the whole story. You throw away the boring, repetitive directions and keep only the ones that carry real information.

Modern datasets are wide. A genomics study might have 20,000 gene expression columns per patient. A recommendation engine might embed every user into a 512-dimensional vector. Feeding that raw width into a model is slow, noisy, and often actively harmful — the curse of dimensionality makes distances meaningless in very high-dimensional spaces, and correlated features dilute the signal that actually drives predictions. PCA is the tool the industry reaches for first when dimensionality is the problem.

PCA solves this by finding a new coordinate system for your data — one where the axes are ranked by how much variance they explain. The first axis points in the direction of greatest spread in the data. The second axis is perpendicular to the first and captures the next greatest spread. And so on. Because real-world datasets are almost always redundant (height and weight are correlated, pixel 47 and pixel 48 are almost identical), the first handful of these new axes typically capture 90-99% of all the information in the original hundreds of columns. You can then drop the rest without losing much.

By the end of this article you'll understand the full mathematical mechanism: eigendecomposition, the covariance matrix, and why SVD is what scikit-learn actually relies on under the hood. You'll run production-quality Python that handles scaling, explained variance, inverse transforms, and reconstruction error. And you'll know exactly when PCA helps, when it hurts, and the mistakes that cause even experienced engineers to get wrong answers silently.

What is Principal Component Analysis?

Skip the dry definition. Here's how PCA works and why it exists.

At its heart, PCA finds a set of orthogonal axes — principal components — that capture the maximum variance of your data. The first PC points in the direction of greatest spread. The second PC is orthogonal to the first and captures the next most variance, and so on. For correlated data, the first few PCs typically explain 90%+ of the total variance. You drop the rest and compress your dataset with minimal information loss.

When your model is overfitting from too many features, PCA is the tool. It's also your first stop when you need to visualize high-dimensional data in 2D or 3D. But it's not magic — if your features are on different scales, PCA will focus on the high-magnitude ones and ignore the rest. That's why we standardize first.

pca_basics.py · Python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Simulated data: 100 samples, 10 features (some correlated)
X = np.random.randn(100, 10)
# Add correlation: column 2 ≈ 2 * column 0 + noise
X[:, 2] = 2 * X[:, 0] + 0.5 * np.random.randn(100)

# Always standardize before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA()
X_pca = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("First 3 components explain:", sum(pca.explained_variance_ratio_[:3]))
Forge Tip
Write the code yourself. Typing it builds muscle memory.
Production Insight
Skipping standardization is a silent killer. Without it, PCA compares raw variances and ignores units, so the same quantity recorded in micrograms instead of kilograms carries vastly more weight, and the first component aligns with the feature that has the largest absolute variance.
In practice, always apply StandardScaler before PCA, especially when features have different units.
Rule: scale before you transform.
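To make the scale effect concrete, here is a minimal sketch on synthetic data. The column layout and the 1e6 multiplier are assumptions chosen for illustration: columns 0 and 1 carry the real correlation, while column 3 is pure noise that happens to be recorded in a much larger unit.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
X[:, 1] = X[:, 0] + 0.3 * rng.normal(size=500)  # real structure: columns 0 and 1 move together
X[:, 3] *= 1e6                                   # pure noise, but in a huge unit

# Without scaling, the first component is almost exactly the large-unit column
raw_loadings = np.abs(PCA(n_components=1).fit(X).components_[0])

# After standardization, the first component follows the correlated pair instead
X_scaled = StandardScaler().fit_transform(X)
scaled_loadings = np.abs(PCA(n_components=1).fit(X_scaled).components_[0])

print("Raw loadings:   ", raw_loadings.round(3))
print("Scaled loadings:", scaled_loadings.round(3))
```

Reading the absolute loadings of the first component is a quick way to spot this problem in an existing pipeline: a loading near 1 on a single column usually points to scale, not structure.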
Key Takeaway
PCA finds orthogonal axes of maximum variance.
Standardization is mandatory when features have different scales.
The first few components capture the signal; the rest is noise.

The Math Behind PCA: Eigenvectors, Eigenvalues, and Covariance Matrix

Mathematically, PCA solves for the eigenvectors and eigenvalues of the covariance matrix of your (standardized) data.

Let X be the centered data matrix (each column has mean 0). The covariance matrix C = (1/(n-1)) * X^T X is a d×d symmetric matrix. Its eigenvectors v_i are the principal component directions, and the corresponding eigenvalues λ_i give the variance explained by each component.

Why does this work? The eigenvector with the largest eigenvalue points in the direction where the data is most spread out. The second eigenvector (orthogonal) points in the next most spread direction, etc. So by projecting data onto the top k eigenvectors, you preserve the maximum possible variance.

The covariance matrix only captures linear relationships. If your data has nonlinear structure, PCA will miss it — that's when you need t-SNE or UMAP instead.

pca_eigendecomposition.py · Python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Simulate data
np.random.seed(42)
X = np.random.randn(100, 5)
X[:, 2] = 3 * X[:, 0] + 0.2 * np.random.randn(100)  # strong correlation

# Center and scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Covariance matrix
C = np.cov(X_scaled, rowvar=False)
print("Covariance matrix shape:", C.shape)

# Eigendecomposition
eigenvals, eigenvecs = np.linalg.eigh(C)  # eigh for symmetric
# Sort descending
idx = np.argsort(eigenvals)[::-1]
eigenvals = eigenvals[idx]
eigenvecs = eigenvecs[:, idx]

print("Eigenvalues (variance explained):", eigenvals)
print("Variance ratio:", eigenvals / eigenvals.sum())

# Project onto first 2 eigenvectors
X_pca_manual = X_scaled @ eigenvecs[:, :2]
print("Projected shape:", X_pca_manual.shape)
Mental Model: Finding the Longest Axis of a Cloud
  • The covariance matrix measures how each pair of features varies together.
  • Eigenvectors are the directions of the axes; eigenvalues are the variances along them (the square root gives the spread, or length).
  • Largest eigenvalue → direction of maximum spread (first principal component).
  • Orthogonality ensures no redundancy between components.
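The mental model above can be checked numerically. The sketch below uses small synthetic data (an assumption for illustration) and verifies two things: the variance of the data projected onto each eigenvector equals the matching eigenvalue, and the projected columns are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
X[:, 1] = 2 * X[:, 0] + 0.5 * rng.normal(size=200)
Xc = X - X.mean(axis=0)                      # center each column

C = np.cov(Xc, rowvar=False)                 # 3x3 covariance matrix
eigenvals, eigenvecs = np.linalg.eigh(C)     # eigh returns ascending eigenvalues
order = np.argsort(eigenvals)[::-1]
eigenvals, eigenvecs = eigenvals[order], eigenvecs[:, order]

scores = Xc @ eigenvecs                      # data expressed in the new axes
score_variances = scores.var(axis=0, ddof=1)

# Variance along each eigenvector equals its eigenvalue
print(np.allclose(score_variances, eigenvals))
# Projected columns are uncorrelated: their covariance matrix is diagonal
print(np.allclose(np.cov(scores, rowvar=False), np.diag(eigenvals)))
```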
Production Insight
eigh is numerically more stable than eig for symmetric matrices. In production pipelines, always use eigh or svd.
If your data has singular covariance (features perfectly correlated), PCA will still work via SVD, but eigenvalues will be zero and cause division-by-zero issues in some downstream tasks.
Rule: prefer SVD for production; use eigh only for small, well-conditioned datasets.
Key Takeaway
PCA = eigendecomposition of the covariance matrix.
Eigenvalues quantify variance; eigenvectors give component directions.
SVD is the production-safe way to compute PCA.

PCA via SVD: Why Scikit-learn Uses Singular Value Decomposition

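In practice, scikit-learn's PCA does not need to compute the covariance matrix explicitly. Its classic solvers use Singular Value Decomposition (SVD) of the centered data matrix. Recent releases (1.5 and later) also ship a 'covariance_eigh' solver, which svd_solver='auto' may pick for tall, narrow data, so check which solver your version selects.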

The SVD factorizes X (centered) into U Σ V^T. The right singular vectors V are exactly the principal component directions (eigenvectors of covariance). The singular values σ_i relate to eigenvalues by λ_i = σ_i^2 / (n-1). SVD is more numerically stable because it avoids computing the covariance matrix, which squares the condition number.
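The relation λ_i = σ_i^2 / (n-1) is easy to confirm. The following sketch, on synthetic data chosen only for illustration, computes the variances three ways (covariance eigenvalues, raw singular values, and scikit-learn with the exact solver forced) and also shows the condition number being squared when X^T X is formed.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
X[:, 3] = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=300)
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Route 1: eigenvalues of the covariance matrix, sorted descending
eigenvals = np.sort(np.linalg.eigvalsh(Xc.T @ Xc / (n - 1)))[::-1]

# Route 2: singular values of the centered data, no covariance matrix formed
s = np.linalg.svd(Xc, compute_uv=False)
from_svd = s**2 / (n - 1)

# Route 3: scikit-learn, forcing the exact SVD solver
from_sklearn = PCA(svd_solver="full").fit(X).explained_variance_

print(np.allclose(eigenvals, from_svd), np.allclose(from_svd, from_sklearn))

# Forming X^T X squares the condition number
cond_X = np.linalg.cond(Xc)
cond_XtX = np.linalg.cond(Xc.T @ Xc)
print(f"cond(X)^2 = {cond_X**2:.2f}, cond(X^T X) = {cond_XtX:.2f}")
```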

Additionally, SVD handles rank-deficient matrices gracefully — if your data has fewer samples than features (n < d), the covariance matrix is singular, but SVD still works. This is the so-called "tall vs wide" data problem.

Scikit-learn's PCA also offers a 'randomized' solver for large datasets — it uses truncated SVD with random projections, which is much faster when you only need the top k components.
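As a rough illustration of the randomized solver, the sketch below compares it with the exact solver on synthetic low-rank data (5 latent factors spread over 200 features, an assumption made so the top components are well separated). On data with a flatter spectrum the agreement can be looser.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic low-rank data: 5 latent factors over 200 features, plus small noise
latent = rng.normal(size=(2000, 5))
mixing = rng.normal(size=(5, 200))
X = latent @ mixing + 0.01 * rng.normal(size=(2000, 200))

full = PCA(n_components=5, svd_solver="full").fit(X)
rand = PCA(n_components=5, svd_solver="randomized", random_state=0).fit(X)

print("Full:      ", full.explained_variance_ratio_.round(4))
print("Randomized:", rand.explained_variance_ratio_.round(4))
```

Setting random_state keeps the randomized result reproducible between runs, which matters if component values are compared across deployments.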

pca_via_svd.py · Python
import numpy as np
from sklearn.decomposition import PCA

# Wide data: fewer samples than features
np.random.seed(0)
X = np.random.randn(20, 100)  # 20 samples, 100 features (wide)

# Center manually
X_centered = X - X.mean(axis=0)

# SVD
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
# Principal components = rows of Vt
components_svd = Vt.T  # each column is a PC direction

# Compare with sklearn PCA
pca = PCA()
pca.fit(X_centered)

# They should be the same up to sign
print("Are components aligned?", np.allclose(np.abs(components_svd[:, :3]), np.abs(pca.components_.T[:, :3]), atol=1e-6))

# Explained variance from SVD singular values
explained_var_ratio = (s**2) / (X.shape[0] - 1)
explained_var_ratio /= explained_var_ratio.sum()
print("Explained variance ratio (SVD):", explained_var_ratio[:5])
Forge Insight
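The default solver is svd_solver='auto'. It selects randomized SVD when the input is larger than 500×500 and n_components is below 80% of the smaller dimension; otherwise it uses an exact solver. The randomized solver follows the Halko-Martinsson-Tropp algorithm with oversampling. When you need only a handful of components from a large matrix, 'randomized' is usually much faster than full SVD with little loss of accuracy, but benchmark on your own data and set random_state for reproducible results.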
Production Insight
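When n_features > n_samples, the covariance matrix is rank-deficient. PCA via SVD still works. Eigendecomposition of the covariance also runs, but it returns zero (or, through round-off, slightly negative) eigenvalues, and any step that inverts the covariance or divides by an eigenvalue then fails.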
In a 2022 incident at a financial firm, a team used eigendecomposition on a wide dataset (2000 stocks, 500 days). The covariance matrix was non-invertible, causing numerical failures in their risk model. Switched to SVD, problem solved.
Rule: for production pipelines with potentially wide data, always use SVD-based PCA.
Key Takeaway
SVD avoids computing the covariance matrix — more stable.
SVD handles n_samples < n_features.
Use randomized SVD for large datasets.

Scaling, Explained Variance, and Choosing the Number of Components

After fitting PCA, you get explained_variance_ratio_, which tells you the fraction of total variance each component captures. Plotting these values per component gives a scree plot; their cumulative sum shows how much variance the first k components retain. A common rule: keep enough components to capture 90–95% of variance. But that's not always optimal: sometimes 80% is enough for denoising, and sometimes 99% is needed for reconstruction accuracy.

How to choose k automatically? You can use a threshold on cumulative variance, the "elbow" in the scree plot, or cross-validation with a downstream model. In scikit-learn, PCA(n_components=0.95) will keep the minimum number of components that explain at least 95% variance.

But here's the gotcha: variance explained is a linear measure. If your data has nonlinear structure, 95% variance might still miss critical patterns. And if your data has a lot of noise, the first few components might capture that noise instead of signal — especially if you didn't standardize properly.

Production decision: never hardcode n_components. Compute it dynamically based on explained variance threshold.

choose_components.py · Python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Realistic: 5000 samples, 50 features
X = np.random.randn(5000, 50)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA()
pca.fit(X_scaled)

cumsum = np.cumsum(pca.explained_variance_ratio_)
# Find number of components for 95% variance
k_95 = np.searchsorted(cumsum, 0.95) + 1
print(f"Components needed for 95% variance: {k_95}")

# Or use built-in threshold
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X_scaled)
print(f"Reduced shape: {X_reduced.shape}")

# Cross-validation approach: score a downstream classifier for each k.
# PCA sits inside the pipeline so it is re-fit on each training fold (no leakage).
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic binary target
best_k = 1
best_score = 0.0
for k in range(1, 20):
    model = make_pipeline(PCA(n_components=k), LogisticRegression(max_iter=1000))
    score = cross_val_score(model, X_scaled, y, cv=5).mean()
    if score > best_score:
        best_score = score
        best_k = k
print(f"Best k for classification: {best_k}, CV score: {best_score:.3f}")
Warning: Variance Threshold Can Mislead
If your dataset has a few dominant features (e.g., age vs. income), the first component might capture >90% of variance but be uninformative for your task. Always validate PCA with the downstream model's performance, not just variance explained.
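A small synthetic case shows how this can happen. In the sketch below (all data and names are illustrative assumptions), two strongly correlated columns have nothing to do with the label, while a third independent column fully determines it. The first component takes the largest share of variance yet carries almost no signal for the classifier.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
shared = rng.normal(size=n)
X = np.column_stack([
    shared + 0.1 * rng.normal(size=n),   # columns 0 and 1: correlated, unrelated to y
    shared + 0.1 * rng.normal(size=n),
    rng.normal(size=n),                  # column 2: the only feature that drives the label
])
y = (X[:, 2] > 0).astype(int)

def cv_accuracy(k):
    """5-fold accuracy of logistic regression on the first k components."""
    model = make_pipeline(StandardScaler(), PCA(n_components=k), LogisticRegression())
    return cross_val_score(model, X, y, cv=5).mean()

ratio = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_
acc_1, acc_2 = cv_accuracy(1), cv_accuracy(2)
print("Variance ratio:", ratio.round(3))
print(f"Accuracy with 1 component: {acc_1:.3f}, with 2 components: {acc_2:.3f}")
```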
Production Insight
Hardcoding n_components causes silent failures when data distribution shifts. If new data has different variance structure, your chosen k may capture too little or too much.
Set a dynamic threshold (e.g., 0.95) that automatically adjusts. Monitor the actual number of components over time as a drift signal.
Rule: never hardcode the number of components.
Key Takeaway
Explained variance ratio guides component selection.
Use a threshold (0.95) or cross-validation to choose k.
Variance explained != task performance — validate with your model.
How to Choose the Number of Components?
If: Cumulative explained variance >= 0.95 at threshold k
Use: Use k components
If: Downstream model accuracy plateaus at lower k
Use: Use lower k to reduce overfitting; more variance isn't always better
If: Reconstruction error is critical (e.g., anomaly detection)
Use: Keep components explaining 90-95% variance, but validate with holdout set
If: Data is high-dimensional with noise
Use: Use cross-validation to find the elbow where validation performance peaks

Production Pitfalls: Scaling, Outliers, and Inverse Transform Gotchas

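PCA is sensitive to outliers because the covariance matrix is influenced by extreme values. A single extreme outlier can rotate the first principal component by 30 degrees or more. Solution: remove or clip outliers before PCA, and consider robust scaling (e.g., RobustScaler). Note that RobustScaler only protects the center and scale estimates; the outlier itself stays in the data and can still dominate the covariance, so scaling alone is not enough.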
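The rotation is easy to reproduce. This sketch uses synthetic data with one corrupted record (values chosen for illustration) and measures the angle between the first component before and after the record is added. It then applies a simple IQR filter, with a 3×IQR cutoff picked as an assumption, and measures the angle again.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def first_pc(X):
    """First principal direction of standardized data."""
    return PCA(n_components=1).fit(StandardScaler().fit_transform(X)).components_[0]

def angle_deg(u, v):
    """Angle between two directions, ignoring sign."""
    cos = np.clip(abs(u @ v), 0.0, 1.0)
    return np.degrees(np.arccos(cos))

rng = np.random.default_rng(0)
x = rng.normal(size=200)
clean = np.column_stack([x, x + 0.3 * rng.normal(size=200), rng.normal(size=200)])
dirty = np.vstack([clean, [50.0, -50.0, 0.0]])   # one corrupted record

rotation = angle_deg(first_pc(clean), first_pc(dirty))

# IQR filter: drop rows with any value more than 3 IQRs outside the quartiles
q1, q3 = np.percentile(dirty, [25, 75], axis=0)
iqr = q3 - q1
keep = np.all((dirty >= q1 - 3 * iqr) & (dirty <= q3 + 3 * iqr), axis=1)
residual = angle_deg(first_pc(clean), first_pc(dirty[keep]))

print(f"Rotation caused by one outlier: {rotation:.1f} degrees")
print(f"Rotation after IQR filtering:   {residual:.1f} degrees")
```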

Another common pitfall: forgetting to apply the same scaling to new data before transformation. The scaler must be fit on training data and reused on test/inference data. If you re-fit scaler on each batch, you'll get different PCA coordinates — that's a subtle bug that corrupts your pipeline.
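One way to avoid this bug is to keep the scaler and PCA in a single fitted pipeline and only ever call transform at inference. The sketch below contrasts that with the re-fit anti-pattern; the data and the shifted batch mean are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50, scale=10, size=(1000, 5))
batch = rng.normal(loc=60, scale=10, size=(20, 5))   # inference batch with a shifted mean

# Fit scaler and PCA once, on training data only
pipe = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X_train)
coords_reused = pipe.transform(batch)

# Anti-pattern: re-fitting the scaler on the batch before projecting
batch_rescaled = StandardScaler().fit_transform(batch)
coords_refit = pipe.named_steps["pca"].transform(batch_rescaled)

# With the fitted pipeline, a row gets the same coordinates whatever batch it arrives in
print(np.allclose(pipe.transform(batch[:1]), coords_reused[:1]))
# Re-fitting changes the coordinates and hides the shift in the batch
print(np.allclose(coords_reused, coords_refit))
```

Persisting the whole pipeline as one object (for example with joblib) keeps the scaler and PCA from drifting apart between training and serving.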

Inverse transform is useful for denoising: reduce dimensions, then reconstruct. But reconstruction error grows as you drop more components. Monitor reconstruction_error on a holdout set to detect data drift or a bad scaling choice.
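The trade-off between compression and reconstruction error can be tabulated directly. The sketch below sweeps k on synthetic data (3 latent factors over 8 features, an illustrative assumption) and records the mean squared reconstruction error for each setting.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 8)) + 0.2 * rng.normal(size=(500, 8))
X_scaled = StandardScaler().fit_transform(X)

errors = []
for k in range(1, 9):
    pca = PCA(n_components=k).fit(X_scaled)
    X_hat = pca.inverse_transform(pca.transform(X_scaled))
    errors.append(np.mean((X_scaled - X_hat) ** 2))

for k, err in enumerate(errors, start=1):
    print(f"k={k}: reconstruction MSE = {err:.4f}")
```

On the data used for fitting, the error never increases as k grows and reaches zero (up to round-off) when k equals the number of features. On held-out data the curve is the one worth monitoring.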

Finally, PCA assumes linearity and orthogonality. If your data lies on a nonlinear manifold, PCA will fail to capture its structure. You might need Kernel PCA or an autoencoder.

pca_production_pitfalls.py · Python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Data with an outlier
X = np.random.randn(100, 5)
# Inject outlier
X[0, :] = 1000  # huge value

# Standardize without handling outlier
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA()
pca.fit(X_scaled)
print("First component (with outlier):", pca.components_[0])

# Use RobustScaler instead
from sklearn.preprocessing import RobustScaler
rscaler = RobustScaler()
X_robust = rscaler.fit_transform(np.delete(X, 0, axis=0))  # remove outlier
pca_robust = PCA()
pca_robust.fit(X_robust)
print("First component (without outlier):", pca_robust.components_[0])

# Inverse transform and reconstruction error on held-out data
X_test = np.random.randn(10, 5)
X_test_robust = rscaler.transform(X_test)  # reuse the fitted scaler, do not re-fit
pca_3 = PCA(n_components=3)
pca_3.fit(X_robust)
X_reconstructed = pca_3.inverse_transform(pca_3.transform(X_test_robust))
reconstruction_error = np.mean((X_test_robust - X_reconstructed)**2)
print(f"Reconstruction error (mean squared): {reconstruction_error:.4f}")
Reconstruction Error as Drift Detector
If reconstruction error on new data exceeds 1.2x the training baseline, your pipeline needs retraining. This is the canary in the coal mine for PCA-based systems.
Production Insight
A single outlier in a dataset of 10,000 points can skew the first PC by over 15 degrees. This is not a theoretical edge case — it happens in production when a sensor glitch or data entry error passes through.
Always run outlier detection before PCA. Use z-score or IQR method, or use RobustScaler as a first line of defense.
Rule: outliers corrupt PCA components; detect and remove or use robust scaling.
Key Takeaway
Outliers skew PCA components — always check for them.
Apply the same scaler to training and inference — don't re-fit.
Monitor reconstruction error to catch data drift.

Real-World Production Incident: The PCA Pipeline That Broke at 3 AM

A team at a retail company built a PCA-based feature reduction pipeline for customer segmentation. It worked perfectly for 6 months. Then one night, the model started outputting garbage — customers were assigned to wrong segments, and the marketing team started sending irrelevant offers.

What happened? A new data source was added without re-fitting the scaler and PCA. The new data had features on a completely different scale — one feature had values in the range 1e6 to 1e9, while existing features were around 0–100. The scaler was not re-fitted, so the new feature dominated, and the first principal component became almost entirely that column. The explained variance dropped, and the segmentation lost all signal.

Fix: The team added a validation check: after each transformation, compute the reconstruction error on the incoming batch and compare it with the baseline measured on the training set. If the error exceeds the baseline by more than 20%, alert and trigger a pipeline retraining. This check flags the scale mismatch immediately.

pca_production_monitor.py · Python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assume we have a trained pipe
X_train = np.random.randn(1000, 10)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
pca = PCA(n_components=5)
pca.fit(X_train_scaled)

# Reconstruction error on training as baseline
X_train_recon = pca.inverse_transform(pca.transform(X_train_scaled))
baseline_error = np.mean((X_train_scaled - X_train_recon)**2)
print(f"Baseline reconstruction error: {baseline_error:.6f}")

# New data arrives
X_new = np.random.randn(100, 10)
# Apply the scaler that was fitted on training data (do not re-fit on the new batch)
X_new_scaled = scaler.transform(X_new)
# Check reconstruction error
X_new_recon = pca.inverse_transform(pca.transform(X_new_scaled))
new_error = np.mean((X_new_scaled - X_new_recon)**2)
print(f"New data reconstruction error: {new_error:.6f}")

if new_error > baseline_error * 1.2:
    print("ALERT: Reconstruction error spike detected — data distribution may have changed.")
Real Incident at RetailCo
This exact scenario happened at a major retailer. The PCA segmentation model silently degraded over a weekend because a new data source injected unscaled features. The marketing team sent wrong offers to 2 million customers before the bug was caught. Lesson: always monitor reconstruction error in production.
Production Insight
Reconstruction error is your early-warning system for PCA drift. Set a threshold based on training error + 20%. Monitor it as a time series.
In the RetailCo incident, the reconstruction error jumped from 0.005 to 0.8 — a 160x increase — but no one was watching.
Rule: if you use PCA in prod, monitor reconstruction error. Period.
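To see how sharply this signal reacts to a unit change, the sketch below simulates one. Everything here is an assumption for illustration: the synthetic data generator, the choice of column 4, and the 1e6 multiplier. The 1.2x threshold is the one used throughout this article.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
mixing = rng.normal(size=(3, 10))

def make_batch(n):
    """Synthetic batch: 3 latent factors spread over 10 features, plus noise."""
    return rng.normal(size=(n, 3)) @ mixing + 0.1 * rng.normal(size=(n, 10))

def reconstruction_error(X, scaler, pca):
    """Mean squared reconstruction error in the scaled feature space."""
    Z = scaler.transform(X)
    return np.mean((Z - pca.inverse_transform(pca.transform(Z))) ** 2)

X_train = make_batch(2000)
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=3).fit(scaler.transform(X_train))
baseline = reconstruction_error(X_train, scaler, pca)

clean = make_batch(500)
broken = clean.copy()
broken[:, 4] *= 1e6          # same feature, delivered in a different unit

ratios = {}
for name, batch in [("clean", clean), ("broken", broken)]:
    ratios[name] = reconstruction_error(batch, scaler, pca) / baseline
    status = "ALERT" if ratios[name] > 1.2 else "ok"
    print(f"{name}: error is {ratios[name]:.2f}x baseline -> {status}")
```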
Key Takeaway
Reconstruction error catches scaling mismatches and data drift.
Threshold: alert when error > 1.2x baseline.
Don't just trust the model — instrument the pipeline.

PCA in Production: When to Use It and When to Avoid It

PCA is not a silver bullet. It works well when your data has a strong linear structure and you need to compress or denoise. But it fails when the data lies on a nonlinear manifold, when outliers are present, or when the task requires preserving distances in the original space (e.g., clustering with Euclidean distance after PCA can distort relationships).

Before applying PCA, check: are the relationships between features roughly linear? Is the data free of extreme outliers? Can you live without interpretable components (PCA doesn't guarantee interpretability)? If the answer to any of these is no, consider alternatives: Kernel PCA for nonlinearity, autoencoders for deep compression, t-SNE/UMAP for visualization, or just regularized models (L1/L2) that handle collinearity directly.

In production, always treat PCA as a preprocessing step, not a black box. Log the explained variance ratio over time, monitor reconstruction error, and validate with downstream model performance. Do not hardcode the number of components or assume the training scaler is valid forever.

pca_pipeline_monitor.py · Python
from io.thecodeforge.pca import PCAPipeline  # project-specific wrapper, not part of scikit-learn
from sklearn.datasets import load_iris

# Example: wrap standard scaler + PCA with monitoring
pca_pipe = PCAPipeline(n_components=0.95, threshold_factor=1.2)
X, y = load_iris(return_X_y=True)

pca_pipe.fit(X, y)  # internally fits scaler, PCA, and computes baseline error

# On new data (the first 10 rows stand in for an incoming batch)
new_data = X[:10]
error_ok, msg = pca_pipe.infer(new_data)
if not error_ok:
    print(f"ALERT: {msg}")
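The wrapper used above is not a scikit-learn class. A minimal sketch of what such a wrapper might look like follows; the class name MonitoredPCA, its fit/infer interface, and the simulated unit change are all assumptions made for illustration, not the original implementation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


class MonitoredPCA:
    """Hypothetical wrapper: StandardScaler + PCA + reconstruction-error check."""

    def __init__(self, n_components=0.95, threshold_factor=1.2):
        self.scaler = StandardScaler()
        self.pca = PCA(n_components=n_components)
        self.threshold_factor = threshold_factor

    def _error(self, X):
        Z = self.scaler.transform(X)
        Z_hat = self.pca.inverse_transform(self.pca.transform(Z))
        return float(np.mean((Z - Z_hat) ** 2))

    def fit(self, X, y=None):
        self.pca.fit(self.scaler.fit_transform(X))
        self.baseline_error_ = self._error(X)
        return self

    def infer(self, X):
        """Return (ok, message); ok is False when the error exceeds the limit."""
        error = self._error(X)
        limit = self.threshold_factor * self.baseline_error_
        return error <= limit, f"reconstruction error {error:.4f} (limit {limit:.4f})"


X, y = load_iris(return_X_y=True)
monitor = MonitoredPCA(n_components=0.95, threshold_factor=1.2).fit(X)

shifted = X[:10] * np.array([1.0, 1.0, 1000.0, 1.0])   # simulate one feature in a new unit
print(monitor.infer(X))
print(monitor.infer(shifted))
```

Keeping the baseline error as a fitted attribute means the threshold travels with the model when it is serialized, so the check cannot silently fall out of sync with the PCA it guards.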
PCA Do's and Don'ts
Do: standardize, monitor reconstruction error, validate with downstream model. Don't: use PCA on nonlinear data without checking, hardcode n_components, skip outlier detection.
Production Insight
Many production teams throw PCA at every high-dimensional problem. That's a mistake. In one case, a fraud detection team applied PCA to transaction features that were mostly non-linear — the model's precision dropped by 15% and they blamed the classifier, not the preprocessing.
Start with PCA only after confirming that linear correlations dominate. A quick check: train a linear model (e.g., logistic regression) and see if it performs reasonably. If not, your data likely needs non-linear reduction.
Rule: validate linearity before committing to PCA.
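The classic illustration of this failure is two concentric rings. The sketch below compares linear PCA with Kernel PCA as preprocessing for logistic regression. The RBF kernel, gamma=10, and 5 components are assumptions chosen for this toy dataset, not recommended defaults.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Two concentric rings: the class boundary is a circle, not a line
X, y = make_circles(n_samples=1000, factor=0.3, noise=0.05, random_state=0)

# Linear PCA can only rotate the 2D plane, so the rings stay inseparable
linear = make_pipeline(PCA(n_components=2), LogisticRegression())
# Kernel PCA can produce more components than input features
kernel = make_pipeline(
    KernelPCA(n_components=5, kernel="rbf", gamma=10), LogisticRegression()
)

acc_linear = cross_val_score(linear, X, y, cv=5).mean()
acc_kernel = cross_val_score(kernel, X, y, cv=5).mean()
print(f"PCA + logistic regression:        {acc_linear:.3f}")
print(f"Kernel PCA + logistic regression: {acc_kernel:.3f}")
```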
Key Takeaway
PCA is a linear tool for linear data.
Check assumptions before use.
Monitor reconstruction error and downstream performance.
Alternatives exist — choose based on data structure, not habit.
When to Use PCA vs. Alternatives
If: Data is high-dimensional and features are linearly correlated
Use: Use PCA for dimensionality reduction or denoising
If: Data lies on a nonlinear manifold (e.g., swiss roll)
Use: Use Kernel PCA, t-SNE, UMAP, or autoencoders
If: You need to preserve local distances
Use: Use t-SNE or UMAP; PCA keeps the global variance structure but can blur local neighborhoods
If: Interpretability of components is critical
Use: Use sparse PCA or other methods that produce interpretable axes
If: You have limited samples and many features but want a linear model
Use: PCA can help but consider regularization (Ridge, Lasso) first
● Production incident · POST-MORTEM · severity: high

PCA Pipeline Failure at RetailCo: The 3 AM Segmentation Meltdown

Symptom
Customer segmentation model started assigning wrong segments — high-value customers were classified as low-value, and marketing campaigns failed.
Assumption
The team assumed re-running the pipeline with the existing scaler and PCA model would work because the new data had similar structure.
Root cause
A new data source had features with values 1e6 to 1e9, while existing features were 0–100. The old scaler (fit on 0–100 range) did not center/scale the new feature properly, so the first principal component became almost entirely that single column.
Fix
Added a validation step: after each batch transform, compute reconstruction error on a holdout sample. If error exceeds 1.2x baseline, trigger retraining of scaler and PCA. Also enforced standard data type checks on incoming features.
Key lesson
  • Always monitor reconstruction error in production PCA pipelines.
  • Never assume new data has the same distribution as training data — validate.
  • Add automatic alerts when reconstruction error spikes.
  • Standardize data source integration with validation gates before ingestion.
Production debug guide
Symptom → Root cause → Action flow for common PCA issues (5 entries)
Symptom · 01
First component captures >80% variance but model performance drops
Fix
Check if a single feature dominates due to scale. Verify standardization is applied to all features identically.
Symptom · 02
Explained variance ratio changes drastically between training and inference
Fix
Compare feature statistics (mean, std) between train and inference. Re-fit scaler if drift detected.
Symptom · 03
Reconstruction error spikes ( >2x baseline )
Fix
Check for new features, missing values, or outliers. Re-run outlier detection and re-fit PCA.
Symptom · 04
Inverse transform output is completely wrong (e.g., negative values for positive-only features)
Fix
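Ensure no data leakage: scaler and PCA must be fit only on training data. Check for inconsistent preprocessing, and confirm that scaler.inverse_transform is applied after pca.inverse_transform so the output is back in original units. A low-rank reconstruction is not constrained to the original value range, so small negative values can appear even in a correct pipeline.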
Symptom · 05
PCA components change sign between runs
Fix
Sign is arbitrary in PCA; it's normal. But if magnitude changes significantly, check for unstable training (random seed, solver).
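If downstream code compares components across runs, library versions, or solvers, one option is to normalize signs first. The helper below is a sketch with a hypothetical name (align_signs) and one possible convention: flip each component so that its largest-magnitude loading is positive.

```python
import numpy as np
from sklearn.decomposition import PCA

def align_signs(components):
    """Flip each component so its largest-magnitude loading is positive."""
    components = np.asarray(components, dtype=float)
    idx = np.argmax(np.abs(components), axis=1)
    signs = np.sign(components[np.arange(components.shape[0]), idx])
    return components * signs[:, None]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
components = PCA(n_components=2).fit(X).components_

flipped = components * np.array([[-1.0], [1.0]])   # same subspace, first sign reversed
aligned_a = align_signs(components)
aligned_b = align_signs(flipped)
print(np.allclose(aligned_a, aligned_b))
```

If the signs of the components are flipped, the projected scores for those components must be flipped too, otherwise inverse_transform no longer matches.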
★ PCA Troubleshooting Quick Reference
Fast commands and checks for common PCA production issues
First PC mostly one feature
Immediate action
Check feature scales
Commands
print(scaler.mean_, scaler.scale_)
pca.components_[0] # look at loadings
Fix now
Re-fit StandardScaler on all features
Reconstruction error high
Immediate action
Compute error on training baseline
Commands
np.mean((X_train_scaled - pca.inverse_transform(pca.transform(X_train_scaled)))**2)
Compare with threshold; if >1.2x, alert
Fix now
Add monitoring alert; re-fit scaler/PCA
n_components dynamic fails
Immediate action
Check eigenvalues for near-zero values
Commands
pca.explained_variance_ratio_.cumsum()
np.sum(pca.explained_variance_ratio_ > 0.001)
Fix now
Use n_components='mle' or set threshold >0.95
Components are NaNs
Immediate action
Check for constant features
Commands
np.any(np.std(X, axis=0) == 0)
pd.DataFrame(X).isnull().sum().any()
Fix now
Remove constant features; impute missing values
PCA Computation Methods
Method | Numerical Stability | Handles Wide Data (n < d) | Speed on Large Data | scikit-learn Solver
Covariance Eigendecomposition | Weaker (squares condition number) | Runs, but cov matrix is singular | Fast for small d | 'covariance_eigh' (scikit-learn 1.5+)
Full SVD | Excellent | Yes | Slow for large matrices | 'full'
Randomized SVD | Good (approximate) | Yes | Very fast for high d | 'randomized' (picked by 'auto' for large data with few components)

Key takeaways

1. PCA finds orthogonal directions of maximum variance via eigendecomposition or SVD.
2. Always standardize features before PCA to avoid scale-based dominance.
3. Use SVD (or randomized SVD) for production: it's numerically stable and handles wide data.
4. Choose the number of components dynamically based on explained variance ratio threshold (e.g., 0.95).
5. Monitor reconstruction error to detect data drift or scaling mismatches in production.
6. PCA is linear: use Kernel PCA or autoencoders for non-linear manifolds.
7. Outliers corrupt PCA components; apply robust scaling or outlier removal first.

Common mistakes to avoid

6 patterns

Forgetting to standardize features before PCA

Symptom
First principal component captures the feature with the largest absolute scale, not the most important structure. Model performance degrades silently.
Fix
Always apply StandardScaler (or RobustScaler) before PCA. Fit on training data only, then transform test/inference data.

Hardcoding n_components as a fixed number

Symptom
When data distribution changes, a fixed number may capture too little or too much variance, leading to degraded model performance.
Fix
Use n_components=0.95 (or a threshold) to dynamically select the number based on explained variance. Monitor the actual number over time.

Applying PCA to non-linear data without considering alternatives

Symptom
PCA finds linear axes; if data lies on a curved manifold, it will distort distances and fail to capture structure.
Fix
Use Kernel PCA, t-SNE, UMAP, or an autoencoder for non-linear dimensionality reduction.

Not removing outliers before PCA

Symptom
A single outlier can rotate the first principal component by 30 degrees or more, corrupting all downstream projections.
Fix
Use RobustScaler, remove outliers via IQR/z-score, or apply PCA after outlier detection.

Reusing the same scaler for training and inference without re-fitting when data distribution shifts

Symptom
New data with different scale will be transformed incorrectly; reconstruction error spikes; model predictions degrade.
Fix
Monitor reconstruction error. If error exceeds 1.2x baseline, trigger retraining of scaler and PCA.

Using PCA without validating linearity assumptions

Symptom
PCA returns high explained variance but downstream model performance is poor because important nonlinear patterns are lost.
Fix
Before PCA, check if linear correlations dominate. If not, use non-linear reduction (Kernel PCA, autoencoders).
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Explain how PCA works mathematically. What is the covariance matrix, and...
Q02 · SENIOR
Why does scikit-learn's PCA use SVD by default instead of eigendecomposi...
Q03 · SENIOR
What is the purpose of standardization before PCA? What happens if you s...
Q04 · SENIOR
How do you choose the number of components to retain in PCA? What are th...
Q05 · SENIOR
Explain how PCA can be used for anomaly detection. What are the limitati...
Q06 · SENIOR
How would you detect if PCA is appropriate for a given dataset before ap...
Q01 of 06 · SENIOR

Explain how PCA works mathematically. What is the covariance matrix, and why does its eigendecomposition give principal components?

ANSWER
PCA finds orthogonal axes that maximize variance. The covariance matrix C = (1/(n-1))X^T X captures pairwise feature covariances. Its eigenvectors are the directions of maximum variance; eigenvalues give the amount of variance captured. The top k eigenvectors form the projection matrix. In practice, we use SVD instead of eigendecomposition for numerical stability. SVD computes UΣV^T = X (centered), and the right singular vectors V are exactly the principal component directions.
FAQ · 6 QUESTIONS

Frequently Asked Questions

01
What is Principal Component Analysis in simple terms?
02
Do I need to scale my data before PCA?
03
How many principal components should I keep?
04
Can PCA be used for non-linear data?
05
What is the difference between PCA and SVD?
06
Can PCA be used for feature selection?