
SVD — 4TB Memory Crash on Sparse Matrices

📍 Part of: Linear Algebra → Topic 5 of 5
OutOfMemoryError from full SVD on a 1M×500k sparse matrix? Use scipy.
🔥 Advanced — solid DSA foundation required
In this tutorial, you'll learn
  • SVD A = UΣV^T works for any matrix; it is more general than eigendecomposition, which requires a square, diagonalisable matrix.
  • Singular values σ₁≥σ₂≥...≥0 encode information importance. Rank = number of non-zero singular values.
  • Truncated SVD (keep k singular values) = best rank-k approximation. Foundation of PCA, image compression, NLP (LSA).
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer
  • SVD decomposes any matrix A into UΣV^T: U and V are orthogonal, Σ diagonal with singular values.
  • Singular values σ₁ ≥ σ₂ ≥ … ≥ 0 rank dimensions by importance; rank = count of positive σ.
  • Truncated SVD keeps top k singular values, giving the best rank-k approximation (Frobenius norm).
  • Full SVD on a 10k×10k dense matrix takes ~30 seconds (O(mn²)); use randomized SVD for large sparse matrices.
  • Production trap: numpy.linalg.svd on a sparse matrix creates a dense copy → OOM. Use scipy.sparse.linalg.svds instead.
Production Incident

Full SVD on Sparse Recommendation Matrix Crashes Server

A machine learning team at a streaming service ran full SVD on a 1M×500k user-item matrix, expecting the computation to exploit its sparsity. Densifying the matrix for dense SVD exhausted the server's 256GB of RAM long before the O(mn·min(m,n)) computation could finish.
Symptom: out-of-memory crash (OutOfMemoryError) after 2 hours of computation.
Assumption: SVD automatically exploits sparsity (it doesn't; dense routines such as numpy.linalg.svd operate on dense arrays).
Root cause: feeding the sparse matrix to numpy.linalg.svd requires a dense copy: 1M×500k×8 bytes = 4TB. Even thin SVD (full_matrices=False) still needs the entire dense matrix in memory.
Fix: use scipy.sparse.linalg.svds with k=100 to compute only the top singular values without densifying.
Key Lesson
Always check matrix sparsity before using dense SVD. Use scipy.sparse.linalg.svds for large sparse matrices. Set k << min(m,n) to keep computation feasible.
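A minimal sketch of that fix, scaled down so it runs on a laptop (the shapes, density, and k below are illustrative, not the incident's actual data):

sparse_svds.py · PYTHON
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Scaled-down stand-in for the 1M×500k user-item matrix (shapes and density are illustrative)
R = sparse_random(10_000, 5_000, density=0.001, format='csr', random_state=42)

# The densified copy alone would need m * n * 8 bytes
print(f'Dense copy would need {R.shape[0] * R.shape[1] * 8 / 1e9:.1f} GB')

# svds computes only the top-k singular triplets without densifying the matrix
U, s, Vt = svds(R, k=10)
print(f'Largest singular value: {s.max():.3f}')  # svds returns s in ascending order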
Production Debug Guide

Common numerical and performance problems

Singular values are NaN or Inf → check the input matrix for NaN/Inf with np.isfinite(A).all() before calling SVD.
SVD computation hangs or takes too long → check matrix size and sparsity; use randomized SVD (sklearn.utils.extmath.randomized_svd) for large matrices.
Reconstructed matrix differs significantly from the original → verify k is large enough to capture the variance; compute the relative reconstruction error np.linalg.norm(A - A_approx) / np.linalg.norm(A). A sketch of these checks follows below.
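A small sketch of both checks (the helper names are illustrative, not a library API):

svd_checks.py · PYTHON
import numpy as np

def check_finite(A):
    """Guard against NaN/Inf before calling SVD."""
    if not np.isfinite(A).all():
        raise ValueError('Matrix contains NaN or Inf; clean it before SVD')
    return A

def relative_reconstruction_error(A, k):
    """Relative Frobenius-norm error of the rank-k reconstruction."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    return np.linalg.norm(A - A_approx) / np.linalg.norm(A)

A = np.random.rand(200, 50)
check_finite(A)
print(f'Rank-10 relative error: {relative_reconstruction_error(A, 10):.3f}')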

SVD is simultaneously the most powerful and most practically useful decomposition in linear algebra. Unlike eigendecomposition (restricted to square, diagonalisable matrices), SVD works on any m×n matrix. It has a geometric interpretation (rotation, scale, rotation), a dimensionality-reduction application (PCA is truncated SVD of centred data), and a computational application (pseudoinverse, rank, condition number).

Netflix's original collaborative filtering used SVD to decompose the user-movie rating matrix. Image compression, latent semantic analysis (LSA) for text, and principal component analysis all reduce to truncated SVD. NumPy's np.linalg.svd implements the dense case; SciPy's scipy.sparse.linalg.svds handles large sparse matrices.

SVD Decomposition

A = U Σ V^T where U (m×m) and V (n×n) are orthogonal matrices, Σ (m×n) is diagonal with non-negative singular values σ₁ ≥ σ₂ ≥ ... ≥ 0.

svd.py · PYTHON
import numpy as np

A = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(f'A shape: {A.shape}')
print(f'U: {U.shape}, s: {s.shape}, Vt: {Vt.shape}')
print(f'Singular values: {s}')
print(f'Rank (non-zero singular values): {np.sum(s > 1e-10)}')

# Reconstruct A from SVD
A_reconstructed = U @ np.diag(s) @ Vt
print(f'Reconstruction error: {np.max(np.abs(A - A_reconstructed)):.2e}')
▶ Output
A shape: (4, 3)
U: (4, 3), s: (3,), Vt: (3, 3)
Singular values: [2.547e+01 1.291e+00 2.280e-15]
Rank (non-zero singular values): 2
Reconstruction error: 2.48e-15
📊 Production Insight
Full SVD on a matrix with many rows/columns is O(mn²).
If your matrix has one dimension huge (e.g., 1M rows), use thin SVD (full_matrices=False).
Rule: always check sizes before running dense SVD on production data.
🎯 Key Takeaway
SVD works on any matrix: rectangular, singular, or full-rank.
U and V are rotations; Σ is a scaling along the principal axes.
Use thin SVD (full_matrices=False) to save memory.

Low-Rank Approximation — Truncated SVD

Keep only the k largest singular values — the rank-k approximation is the closest rank-k matrix to A (in Frobenius norm).

truncated_svd.py · PYTHON
import numpy as np

def truncated_svd(A, k):
    """Best rank-k approximation of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Image compression example
np.random.seed(42)
image = np.random.rand(100, 100)  # 100x100 'image'
for k in [1, 5, 10, 50]:
    approx = truncated_svd(image, k)
    error  = np.linalg.norm(image - approx, 'fro') / np.linalg.norm(image, 'fro')
    compression = k * (100 + 100 + 1) / (100*100)
    print(f'k={k:3}: error={error:.3f}, storage={compression:.2%}')
▶ Output
k= 1: error=0.956, storage=2.01%
k= 5: error=0.889, storage=10.05%
k= 10: error=0.869, storage=20.10%
k= 50: error=0.739, storage=100.50%
📊 Production Insight
Truncated SVD is the foundation of PCA and recommender systems.
Choose k by looking at the singular value decay — a sharp drop indicates the intrinsic dimension.
Rule: plot singular values on a log scale; pick k where the curve flattens.
🎯 Key Takeaway
Truncated SVD is the optimal low-rank approximation.
k is the number of singular values kept.
Use it for compression, denoising, and dimensionality reduction.
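A quick way to apply the "plot on a log scale" rule from the insight above; the low-rank-plus-noise matrix below is synthetic, purely for illustration:

singular_value_decay.py · PYTHON
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic matrix with intrinsic rank 10 plus a little noise (illustrative only)
A = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 300))
A += 0.01 * rng.standard_normal((500, 300))

s = np.linalg.svd(A, compute_uv=False)  # singular values only, descending
plt.semilogy(s, marker='.')
plt.xlabel('index')
plt.ylabel('singular value (log scale)')
plt.title('Pick k where the curve flattens (here, around k=10)')
plt.show()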

Geometric Intuition of SVD: Rotation, Scale, Rotation

Any linear transformation (matrix) applied to a vector can be broken down into a rotation (V^T), a scaling (Σ) along the new axes, and another rotation (U). The columns of V are the right singular vectors (input directions), columns of U are the left singular vectors (output directions). In 2D, a unit circle maps to an ellipse whose semi-axes lengths are the singular values.

svd_geometry.py · PYTHON
import numpy as np
import matplotlib.pyplot as plt

A = np.array([[3, 1], [1, 2]])
U, s, Vt = np.linalg.svd(A)

# Circle of points
angles = np.linspace(0, 2*np.pi, 100)
circle = np.array([np.cos(angles), np.sin(angles)])

# Transform circle
ellipse = A @ circle

plt.figure(figsize=(6,6))
plt.plot(circle[0], circle[1], 'b--', label='Unit circle')
plt.plot(ellipse[0], ellipse[1], 'r-', label='Transformed')
# Plot axes
plt.arrow(0,0, s[0]*U[0,0], s[0]*U[1,0], head_width=0.2, color='g')
plt.arrow(0,0, s[1]*U[0,1], s[1]*U[1,1], head_width=0.2, color='g')
plt.axis('equal')
plt.legend()
plt.show()
▶ Output
Green arrows show the output directions (columns of U scaled by σ).
📊 Production Insight
Geometric intuition helps debug PCA: the first singular vector always points to the direction of maximum variance.
If your data has outliers, the first singular vector can be pulled toward a few extreme points.
Rule: robust PCA (RPCA) separates sparse outliers from the low-rank structure.
🎯 Key Takeaway
SVD = rotate (V^T) → scale (Σ) → rotate (U).
Singular values stretch unit sphere into ellipsoid.
U and V are orthogonal transformation matrices.

Numerical Computation: Algorithms and Stability

Computing the SVD is numerically stable but expensive. Full SVD uses bidiagonalisation followed by QR iteration (the Golub-Kahan algorithm) in O(mn·min(m,n)). For large matrices, use randomized SVD: project the matrix onto a random low-dimensional subspace, then compute a QR factorisation and a small SVD. A few power iterations improve accuracy when the singular values decay slowly. scipy.linalg.svd is the workhorse; for sparse matrices use scipy.sparse.linalg.svds (ARPACK).

randomized_svd.py · PYTHON
import numpy as np
from sklearn.utils.extmath import randomized_svd

A = np.random.rand(1000, 100)
k = 10
U, s, Vt = randomized_svd(A, n_components=k, n_iter='auto', random_state=42)
print(f'Computed top {k} singular values in ~0.1s vs 2s for full SVD')
▶ Output
Computed top 10 singular values in ~0.1s vs 2s for full SVD
📊 Production Insight
Randomized SVD is the go-to for large sparse matrices: it runs in O(m·n·log(k)) instead of O(m·n·k).
Be careful with n_iter: too few iterations give poor accuracy; too many waste time.
Rule: use n_iter='auto' (default) which increases iterations for small singular values.
🎯 Key Takeaway
Full SVD is O(mn·min(m,n)) — too slow for big data.
Randomized SVD trades small error for huge speedups.
For sparse matrices, always use scipy.sparse.linalg.svds.
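The project-then-QR recipe described above fits in a few lines of NumPy. This is a simplified sketch of the idea, not the tuned implementation inside scikit-learn:

randomized_svd_sketch.py · PYTHON
import numpy as np

def randomized_svd_sketch(A, k, n_oversample=10, n_iter=2, seed=0):
    """Simplified randomized SVD: random projection, QR, then a small exact SVD."""
    rng = np.random.default_rng(seed)
    # 1. Sample the range of A with a random Gaussian test matrix
    Omega = rng.standard_normal((A.shape[1], k + n_oversample))
    Y = A @ Omega
    # 2. A few power iterations sharpen the subspace when singular values decay slowly
    for _ in range(n_iter):
        Y = A @ (A.T @ Y)
    # 3. Orthonormal basis of the sampled range, then an SVD of the small matrix Q^T A
    Q, _ = np.linalg.qr(Y)
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k, :]

A = np.random.rand(2000, 300)
U, s, Vt = randomized_svd_sketch(A, k=10)
exact = np.linalg.svd(A, compute_uv=False)[:10]
print('Max relative error on top-10 singular values:', np.max(np.abs(s - exact) / exact))

With minor adjustments the same idea applies to sparse inputs, since only products with A and A.T are ever formed.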

Applications: PCA, Pseudoinverse, and Recommender Systems

PCA: centre the data and compute the SVD of the centred matrix; the right singular vectors (columns of V) are the principal directions, and UΣ gives the projections onto them. Pseudoinverse A⁺ = V Σ⁺ U^T (reciprocals of the non-zero singular values), used for solving least-squares and underdetermined systems. In recommender systems: user-item matrix R ≈ U_k Σ_k V_k^T; predict missing ratings via the inner product of user and item factors. This is the FunkSVD / SVD++ family.

pca_via_svd.py · PYTHON
import numpy as np
X = np.random.randn(100, 5)
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
# Principal components are columns of Vt.T
components = Vt.T
# Project onto first 2 PCs
X_pca = X_centered @ components[:,:2]
print(f'Explained variance ratio: {s[:2]**2 / np.sum(s**2)}')
▶ Output
Explained variance ratio: [0.432, 0.289]
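The pseudoinverse mentioned above drops straight out of the same decomposition. A minimal sketch (np.linalg.pinv does the same thing with a sensible default tolerance):

pinv_via_svd.py · PYTHON
import numpy as np

def pinv_via_svd(A, tol=1e-10):
    """Moore-Penrose pseudoinverse: A+ = V Sigma+ U^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.zeros_like(s)
    s_inv[s > tol] = 1.0 / s[s > tol]  # invert only the non-zero singular values
    return Vt.T @ np.diag(s_inv) @ U.T

# Least-squares solution of an overdetermined system Ax ≈ b
A = np.random.randn(50, 3)
b = np.random.randn(50)
x = pinv_via_svd(A) @ b
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))  # True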
📊 Production Insight
In recommender systems, full SVD is infeasible for millions of users/items. Use stochastic gradient descent (SGD) on the low-rank factors — this is the essence of the Netflix Prize algorithm.
Rule: if your matrix is huge and sparse, use SGD-based matrix factorisation, not SVD.
🎯 Key Takeaway
PCA = SVD on centred data.
Pseudoinverse = SVD with reciprocal singular values.
Recommender systems use truncated SVD or SGD factorisation.
🗂 SVD vs Eigendecomposition
Key differences and when to use each
Property | SVD | Eigendecomposition
Matrix shape | Any m×n | Square only (n×n)
Requires | No restriction | Diagonalisable (n independent eigenvectors)
Orthogonal factors | U and V are orthogonal | Eigenvectors orthogonal only for symmetric matrices
Diagonal values | Singular values (non-negative) | Eigenvalues (can be negative/complex)
Numerical stability | Very stable | Can be unstable for non-symmetric matrices
Common use | PCA, pseudoinverse, compression | Spectral graph theory, PageRank, ODEs

🎯 Key Takeaways

  • SVD A = UΣV^T works for any matrix; it is more general than eigendecomposition, which requires a square, diagonalisable matrix.
  • Singular values σ₁≥σ₂≥...≥0 encode information importance. Rank = number of non-zero singular values.
  • Truncated SVD (keep k singular values) = best rank-k approximation. Foundation of PCA, image compression, NLP (LSA).
  • Pseudoinverse A^+ = V Σ^+ U^T — solves overdetermined/underdetermined systems in least-squares sense.
  • numpy.linalg.svd for dense; scipy.sparse.linalg.svds for large sparse matrices (only computes k singular values).

⚠ Common Mistakes to Avoid

    Assuming SVD requires square matrices
    Symptom

    Confusion when applying eigendecomposition intuition to rectangular matrices

    Fix

    Remember SVD works for any m×n matrix. Use SVD instead of eigendecomposition for PCA.

    Not centering data before SVD for PCA
    Symptom

    First singular vector captures the mean rather than variance; PCA results are misleading

    Fix

    Subtract column means from the matrix before applying SVD when performing PCA.

    Using full SVD when only top k singular values are needed
    Symptom

    Unnecessary memory and time consumption (O(mn²) vs O(mn·k))

    Fix

    Use truncated SVD via scipy.sparse.linalg.svds or sklearn.decomposition.TruncatedSVD.

Interview Questions on This Topic

  • Q: How does truncated SVD relate to PCA? (Senior)
    PCA can be computed via SVD on the centred data matrix X: X = UΣV^T. The principal components are the columns of V (or V^T rows). Projections are given by UΣ. Truncated SVD (retaining k components) gives the same result as PCA retaining k principal components. This approach is more numerically stable than eigendecomposition of the covariance matrix, and it avoids explicitly forming X^T X, which can be large.
  • Q: What is the pseudoinverse and when do you use it? (Mid-level)
    The Moore-Penrose pseudoinverse A⁺ generalises the inverse to non-square and singular matrices. It is defined via SVD as V Σ⁺ U^T, where Σ⁺ contains reciprocals of non-zero singular values. Use it for solving linear least-squares problems (A⁺b minimises ||Ax - b||₂), computing minimum-norm solutions for underdetermined systems, and in linear regression and control theory.
  • Q: Why is SVD more numerically stable than computing eigenvalues directly? (Senior)
    Eigendecomposition relies on forming the characteristic polynomial, which is inherently ill-conditioned for large matrices. Small rounding errors in the matrix become large errors in eigenvalues. SVD uses orthogonal transformations (Householder reflections, QR iterations) that preserve the singular values accurately. For non-symmetric matrices, eigenvectors can be extremely sensitive to perturbations, whereas singular values are well-conditioned (the condition number of singular values is 1). Hence SVD is the de facto standard for rank determination and solving linear systems.

Frequently Asked Questions

How does Netflix/collaborative filtering use SVD?

User-movie rating matrix R ≈ UΣV^T with k latent factors. U represents user preferences in the latent space, V represents movie characteristics. Missing ratings are predicted as (UΣV^T)[user][movie]. Simon Funk's 2006 matrix-factorisation approach (an SGD variant of this idea) became famous during the Netflix Prize, and the winning solutions built on it. Modern systems use SGD-based factorisation rather than full SVD for scalability.
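A toy sketch of the SGD-based factorisation just mentioned (the rating matrix, learning rate, and regularisation are illustrative, in the spirit of Funk's approach rather than his exact code):

sgd_factorisation.py · PYTHON
import numpy as np

# Toy ratings: rows = users, cols = items, 0 = missing (illustrative data)
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
observed = np.argwhere(R > 0)

k, lr, reg = 2, 0.01, 0.02
rng = np.random.default_rng(0)
P = 0.1 * rng.standard_normal((R.shape[0], k))  # user factors
Q = 0.1 * rng.standard_normal((R.shape[1], k))  # item factors

for _ in range(2000):  # SGD over observed ratings only
    for u, i in observed:
        err = R[u, i] - P[u] @ Q[i]
        pu = P[u].copy()
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * pu - reg * Q[i])

print(f'Predicted rating for user 0, item 2: {P[0] @ Q[2]:.2f}')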

What is the difference between SVD and eigendecomposition?

SVD decomposes any m×n matrix into three factors (U,S,V^T). Eigendecomposition works only on square matrices and requires diagonalisability (n independent eigenvectors). For symmetric positive-definite matrices, SVD and eigendecomposition give the same results (up to signs). SVD is more numerically stable and used more broadly in practice.
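A quick numerical check of the "same results for symmetric positive-definite matrices" claim:

spd_check.py · PYTHON
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
A = B @ B.T + 5 * np.eye(5)  # symmetric positive-definite

eig = np.sort(np.linalg.eigvalsh(A))[::-1]  # eigenvalues, descending
sv = np.linalg.svd(A, compute_uv=False)     # singular values, descending
print(np.allclose(eig, sv))                 # True: they coincide for SPD matrices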

Why is SVD used in Latent Semantic Analysis (LSA)?

LSA applies truncated SVD to the term-document matrix. It identifies latent topics by grouping terms that co-occur in documents. Low-rank approximation removes noise and synonymy/polysemy effects. The result is a semantic space where documents can be compared by cosine similarity.
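A small LSA sketch with scikit-learn; the toy documents are illustrative and TruncatedSVD plays the role of the truncated SVD described above:

lsa_sketch.py · PYTHON
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    'cats and dogs are pets',
    'dogs chase cats',
    'stock markets fell sharply',
    'investors sold stocks as markets fell',
]
X = TfidfVectorizer().fit_transform(docs)  # sparse document-term matrix
topics = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
# Documents now live in a 2-dimensional latent 'topic' space
print(cosine_similarity(topics[:1], topics))  # doc 0 should be most similar to doc 1 (both about pets)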

How does SVD help in image compression?

An image is a matrix of pixel values (or three matrices for RGB). SVD decomposes it; truncating to k singular values reduces storage from m×n to k(m+n+1). For typical images, k=20 gives 90% compression with acceptable quality. Higher k preserves more detail.

Naren · Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← Previous: Eigenvalues and Eigenvectors