Linear Algebra · Numerical Methods · Machine Learning
SVD
Singular Value Decomposition
The most powerful factorization in all of mathematics.
Works on every matrix. Reveals hidden geometry.
Underlies PCA, recommender systems, NLP, and compression.
A = UΣVᵀGeometric RotationLow-Rank ApproxEckart-YoungPseudoinversePCA BridgeLSARecommenderCondition Number
// every matrix · no square assumption · the fundamental factorization
Most matrix factorizations — eigendecomposition, Cholesky, LU — require special conditions: square matrices,
positive definiteness, non-singularity. SVD requires nothing. It works for every matrix, regardless of shape,
rank, or properties.
The Theorem
★
SVD Existence Theorem
For every matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\) of rank \(r\), there exist orthogonal matrices
\(\mathbf{U} \in \mathbb{R}^{m \times m}\) and \(\mathbf{V} \in \mathbb{R}^{n \times n}\), and a matrix
\(\boldsymbol{\Sigma} \in \mathbb{R}^{m \times n}\) with non-negative entries on the diagonal and zeros
elsewhere, such that:
\[\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T\]
The diagonal entries \(\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > \sigma_{r+1} = \cdots = 0\) are
the singular values of \(\mathbf{A}\). This decomposition always exists and is essentially unique
(up to signs).
Why "Singular" Values?
A matrix \(\mathbf{A} - \sigma\mathbf{I}\) (for square \(\mathbf{A}\)) becomes singular (non-invertible) at
\(\sigma = \sigma_i\). For rectangular matrices, this generalizes to the values at which
\(\mathbf{A}^T\mathbf{A} - \sigma^2\mathbf{I}\) is singular — equivalently, \(\sigma^2 =
\lambda_i(\mathbf{A}^T\mathbf{A})\). The name comes from classical matrix theory where these were called
"singular points" of the transformation.
U columns = left singular vectors. V columns = right singular vectors. Σ diagonal =
singular values. The r nonzero singular values equal the rank of A.
The Outer Product Expansion — Column by Column
The most revealing form of SVD is the sum of rank-1 matrices:
Each term σᵢ uᵢ vᵢᵀ is a rank-1 matrix — an outer product of the i-th left and right
singular vectors, weighted by σᵢ. This decomposition orders terms from most to least important by
singular value magnitude. Truncating to k terms gives the best rank-k approximation.
◈
Why This Form Matters
The outer product expansion reveals that every matrix is a weighted sum of
rank-1 components, each capturing a "mode" of variation. The first term explains the most;
subsequent terms add finer detail. Dropping the last terms gives optimal compression. This is the
mathematical basis of every compression, denoising, and latent representation method in ML.
03
Geometric Interpretation
// rotation → scaling → rotation · unit sphere → ellipse · three operations
SVD is not just an algebraic factorization — it is a complete geometric description of what every linear
transformation does. Any matrix \(\mathbf{A}: \mathbb{R}^n \to \mathbb{R}^m\) maps the unit sphere in
\(\mathbb{R}^n\) to an ellipse in \(\mathbb{R}^m\). SVD finds that ellipse exactly.
Step 1: V^T — rotation in ℝⁿ (change to the right singular vector basis). Step 2: Σ —
axis-aligned scaling by σ₁,…,σᵣ (stretch/compress each axis independently). Step 3: U — rotation in ℝᵐ
(rotate scaled result to final orientation). Every linear map is just these three operations.
Fig 1. SVD as three geometric operations. The unit sphere is rotated (V^T), stretched along
coordinate axes by σᵢ (Σ), then rotated again (U). The resulting ellipsoid has semi-axes of length σ₁ ≥ σ₂ ≥
···.
∿
The Ellipse Theorem
The image of the unit sphere \(\{\mathbf{x}: \|\mathbf{x}\|=1\}\) under \(\mathbf{A}\)
is an ellipse with semi-axes of length \(\sigma_1, \ldots, \sigma_r\) aligned with the columns of
\(\mathbf{U}\). The maximum stretch \(\max_{\|\mathbf{x}\|=1}\|\mathbf{A}\mathbf{x}\| = \sigma_1\) (the
spectral norm). The minimum nonzero stretch is \(\sigma_r\). Zero singular values
correspond to directions that are completely collapsed.
04
Full, Thin & Compact SVD
// three variants · when to use which · memory trade-offs
All singular vectors included, even those corresponding to zero singular values. Necessary for solving
linear systems and computing orthonormal bases for all four fundamental subspaces.
Drops the "extra" singular vectors that correspond to zero singular values. Same information for the
factorization; much more memory efficient when \(m \gg n\) or \(n \gg m\).
Compact SVD
\(\mathbf{U}_r\in\mathbb{R}^{m\times r}\), \(\boldsymbol{\Sigma}_r\in\mathbb{R}^{r\times r}\),
\(\mathbf{V}_r\in\mathbb{R}^{n\times r}\) where \(r = \text{rank}(\mathbf{A})\).
Only the \(r\) nonzero singular values and their vectors. Exact representation; most storage efficient.
numpy.linalg.svd(full_matrices=False).
Approximate, not exact. The best rank-\(k\) approximation (Eckart-Young theorem, §08). Used for
compression, PCA, denoising. sklearn.utils.extmath.randomized_svd.
05
Singular Values — Deep Meaning
// spectral norm · Frobenius norm · information content · energy
The singular values are the "energy spectrum" of the matrix. σᵢ² is the energy of the
i-th mode. Σᵢσᵢ² = total energy. The fraction σᵢ²/‖A‖_F² is the proportion of total energy in mode i —
exactly the explained variance ratio in PCA.
Singular Values and Matrix Properties
Rank: \(\text{rank}(\mathbf{A})\) = number of nonzero singular values.
Invertibility: \(\mathbf{A}\) is invertible iff all \(n\) singular values are strictly
positive iff \(\sigma_n > 0\).
Similarity invariance: Singular values are invariant under orthogonal similarity
transformations — they are intrinsic properties of the linear map, not the coordinate representation.
SVD and eigendecomposition are deeply connected. The singular values and singular vectors of \(\mathbf{A}\) are
exactly the square roots of eigenvalues and the eigenvectors of the related square symmetric matrices
\(\mathbf{A}^T\mathbf{A}\) and \(\mathbf{A}\mathbf{A}^T\).
// Deriving U, Σ, V from eigendecomposition of A^T A and A A^T
Since \(\boldsymbol{\Sigma}^T\boldsymbol{\Sigma} = \text{diag}(\sigma_1^2,\ldots,\sigma_r^2,0,\ldots)\),
this is the eigendecomposition of \(\mathbf{A}^T\mathbf{A}\):
\[\mathbf{A}^T\mathbf{A}\mathbf{v}_i = \sigma_i^2\mathbf{v}_i \quad\Rightarrow\quad \text{V = eigenvectors
of }\mathbf{A}^T\mathbf{A},\; \sigma_i^2 = \text{eigenvalues}\]
A^T A is n×n symmetric PSD — all eigenvalues ≥ 0, guaranteed real, eigenvectors
orthogonal (Spectral Theorem).
A A^T is m×m symmetric PSD — same nonzero eigenvalues as A^T A (though different sizes).
This is why the nonzero singular values of A^T equal those of A.
4
The singular values: \(\sigma_i = \sqrt{\lambda_i(\mathbf{A}^T\mathbf{A})} =
\sqrt{\lambda_i(\mathbf{A}\mathbf{A}^T)} > 0\)
This is the practical way to compute SVD: eigendecompose A^T A (or A A^T for tall
matrices), take square roots of eigenvalues, compute corresponding left/right singular vectors. ∎
SVD vs EigendecompositionComparison
\[\underbrace{\mathbf{A} = \mathbf{P}\boldsymbol{\Lambda}\mathbf{P}^{-1}}_{\text{eigendecomposition: square
A only}} \qquad \underbrace{\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T}_{\text{SVD: any A,
orthogonal U and V}}\]
Eigendecomposition requires square matrix, non-defective. P may not be orthogonal. SVD
works for any m×n matrix. U and V are always orthogonal. For symmetric A: eigendecomposition is SVD (U = V
= eigenvectors, Σ = |Λ|). SVD strictly generalizes eigendecomposition.
07
Rank & The Four Fundamental Subspaces
// column space · null space · row space · left null space · SVD reveals all four
Gilbert Strang's "fundamental theorem of linear algebra" describes four subspaces associated with every matrix.
SVD reveals all four simultaneously and in one shot.
Column space and left null space are orthogonal complements in ℝᵐ. Row space and null
space are orthogonal complements in ℝⁿ. All four are orthonormal (U and V are orthogonal). SVD gives
orthonormal bases for all four simultaneously.
⊕
Rank-Nullity Theorem via SVD
\(\dim(\text{Col}(\mathbf{A})) + \dim(\text{Null}(\mathbf{A}^T)) = m\) and
\(\dim(\text{Row}(\mathbf{A})) + \dim(\text{Null}(\mathbf{A})) = n\). Both are immediate from SVD: exactly
\(r\) nonzero singular values give dimension \(r\) for column/row spaces; the remaining \(m-r\) and \(n-r\)
singular vectors span the null spaces. The rank-nullity theorem is just counting singular values.
08
Low-Rank Approximation — Eckart-Young-Mirsky
// best rank-k approximation · Frobenius and spectral · optimality proof
★
Eckart-Young-Mirsky Theorem (1936)
For any matrix norm induced by a unitarily invariant norm (including Frobenius and spectral norms), the best
rank-\(k\) approximation of \(\mathbf{A}\) is the truncated SVD:
\[\hat{\mathbf{A}}_k = \sum_{i=1}^k\sigma_i\mathbf{u}_i\mathbf{v}_i^T =
\mathbf{U}_k\boldsymbol{\Sigma}_k\mathbf{V}_k^T\]
The approximation error in the Frobenius norm is \(\|\mathbf{A} - \hat{\mathbf{A}}_k\|_F^2 =
\sum_{i=k+1}^r\sigma_i^2\), and in the spectral norm is \(\|\mathbf{A}-\hat{\mathbf{A}}_k\|_2 =
\sigma_{k+1}\).
For any rank-\(k\) matrix \(\mathbf{B}\), write \(\mathbf{A} - \mathbf{B}\) and use the fact that
\(\|\mathbf{A}\|_F^2 = \sum_i\sigma_i^2\):
\[\|\mathbf{A}-\mathbf{B}\|_F^2 = \|\mathbf{U}^T(\mathbf{A}-\mathbf{B})\mathbf{V}\|_F^2
\quad\text{(unitary invariance)}\]
2
Let \(\mathbf{C} = \boldsymbol{\Sigma} - \mathbf{U}^T\mathbf{B}\mathbf{V}\). Since \(\mathbf{B}\) has rank
\(\leq k\), \(\mathbf{U}^T\mathbf{B}\mathbf{V}\) has rank \(\leq k\), and \(\boldsymbol{\Sigma}\) is
diagonal. The minimum of \(\|\mathbf{C}\|_F^2\) over rank-\(k\) matrices subtracted from
\(\boldsymbol{\Sigma}\) is achieved by zeroing out the \(k\) largest diagonal entries — i.e., keeping
\(\sigma_{k+1},\ldots,\sigma_r\).
This is because taking the best rank-k approximation of a diagonal matrix means keeping
the k largest diagonal entries. ∎
Energy Interpretation
Energy Retained by Rank-k SVDCompression
\[\text{Energy retained} = \frac{\sum_{i=1}^k\sigma_i^2}{\sum_{i=1}^r\sigma_i^2} =
\frac{\|\hat{\mathbf{A}}_k\|_F^2}{\|\mathbf{A}\|_F^2}\]
\[\text{Storage ratio} = \frac{k(m+n+1)}{mn} \quad\text{(U}_k\text{, }\boldsymbol{\Sigma}_k\text{,
V}_k\text{ vs full A)}\]
Compression is beneficial when k(m+n+1) < mn, i.e., k < mn/(m+n+1) ≈ min(m,n)/2
roughly. For an image with σ₁²/‖A‖_F² ≈ 0.90, keeping just k=1 retains 90% of the information.
For non-square or rank-deficient matrices, the regular inverse doesn't exist. The pseudoinverse provides the
generalized inverse that gives the least-squares minimum-norm solution to \(\mathbf{A}\mathbf{x} = \mathbf{b}\).
A⁺b is the minimum Euclidean norm least-squares solution to Ax = b. If A has full
column rank: A⁺ = (AᵀA)⁻¹Aᵀ (left inverse, least-squares solution). If A has full row rank: A⁺ = Aᵀ(AAᵀ)⁻¹
(right inverse, minimum norm solution). SVD gives A⁺ in all cases.
Regularized Pseudoinverse
For ill-conditioned systems, small singular values \(\sigma_i \approx 0\) cause \(1/\sigma_i\) to explode,
amplifying noise. Tikhonov regularization replaces the pseudoinverse with:
This is the SVD solution to min_{x} ‖Ax−b‖² + λ²‖x‖² — Ridge regression! As λ → 0:
reduces to pseudoinverse. As λ → ∞: shrinks solution toward zero. The filter factor σᵢ/(σᵢ²+λ²) smoothly
downweights small singular values rather than inverting them.
10
SVD ↔ PCA — The Complete Bridge
// covariance eigendecomposition · identical solutions · which to use when
PCA and SVD are different presentations of the same underlying mathematics. Understanding their exact
relationship prevents the confusion that arises from seeing two algorithms that produce identical results
through apparently different computations.
SVD of Data Matrix → PCAEquivalence
Proof
\[\tilde{\mathbf{X}} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T \quad\text{(SVD of centered data matrix }
m\times d\text{)}\]
\[\mathbf{S} = \frac{1}{m-1}\tilde{\mathbf{X}}^T\tilde{\mathbf{X}} =
\frac{1}{m-1}\mathbf{V}\boldsymbol{\Sigma}^2\mathbf{V}^T \quad\text{(eigendecomposition of covariance)}\]
\[\lambda_j = \frac{\sigma_j^2}{m-1}, \quad \mathbf{v}_j = \text{j-th right singular vector} = \text{j-th PC
loading}\]
\[\mathbf{Z} = \tilde{\mathbf{X}}\mathbf{V}_k = \mathbf{U}_k\boldsymbol{\Sigma}_k \quad\text{(PC scores =
left singular vectors × singular values)}\]
SVD of the data matrix X̃ directly gives PCA without explicitly forming S = X̃^T X̃ /
(m-1). This is numerically superior (avoids squaring the condition number). Always call SVD on the data
matrix, not eigendecomposition on the covariance matrix.
Aspect
PCA formulation
SVD formulation
Central object
Covariance matrix S = X̃^T X̃/(m-1)
Data matrix X̃
Decomposition
Eigendecomposition of S
SVD of X̃
Loadings (directions)
Eigenvectors V of S
Right singular vectors V of X̃
Variance per component
Eigenvalues λⱼ of S
σⱼ² / (m-1) from SVD
Scores (projections)
Z = X̃ V_k
Z = U_k Σ_k
Numerical cost
O(d³) for d×d matrix
O(md min(m,d))
Preferred when
d ≪ m, covariance is available
General; always more stable
11
Condition Number & Numerical Stability
// sensitivity of linear systems · ill-conditioning · regularization insight
κ(A) measures how sensitive Ax = b is to perturbations: ‖δx‖/‖x‖ ≤ κ(A) ‖δb‖/‖b‖. κ = 1
is perfectly conditioned (orthogonal). κ ≫ 1 is ill-conditioned — small changes in b cause large changes
in x. κ → ∞ means nearly singular. Double-precision arithmetic has ~16 digits; κ ≈ 10⁸ leaves only ~8
reliable digits.
SVD avoids squaring: \(\kappa(\mathbf{A}^T\mathbf{A}) = \kappa(\mathbf{A})^2\).
Eigendecomposing \(\mathbf{A}^T\mathbf{A}\) squares the condition number, losing twice as many digits. SVD on
\(\mathbf{A}\) directly is always more numerically stable.
Regularization changes conditioning: Ridge regression solves
\((\mathbf{A}^T\mathbf{A}+\lambda\mathbf{I})\mathbf{x} = \mathbf{A}^T\mathbf{b}\). The condition number
becomes \((\sigma_1^2+\lambda)/(\sigma_r^2+\lambda) \ll \sigma_1^2/\sigma_r^2\) for large \(\lambda\).
Singular matrices: If any \(\sigma_i = 0\), the system \(\mathbf{A}\mathbf{x} =
\mathbf{b}\) is either inconsistent (no solution) or under-determined (infinite solutions). SVD via
pseudoinverse gives the correct minimum-norm solution in both cases.
12
Computing SVD — Algorithms
// Golub-Reinsch · bidiagonalization · power iteration · randomized SVD
Golub-Reinsch Algorithm (Classical)
The standard full SVD algorithm operates in two phases:
Phase 1 — Bidiagonalization: Use Householder reflections to reduce \(\mathbf{A}\) to
bidiagonal form \(\mathbf{B}\) (nonzero only on diagonal and superdiagonal). Cost: \(O(mn^2)\) flops.
Phase 2 — Iterative diagonalization: Apply implicit QR iterations to the bidiagonal
\(\mathbf{B}\) until it converges to diagonal \(\boldsymbol{\Sigma}\). Cost: \(O(n)\) per iteration,
typically converges in few iterations.
Randomized SVD — The Modern Standard for Large Matrices
Randomized SVD (Halko-Martinsson-Tropp 2011)
# Given: A ∈ ℝ^{m×n}, target rank k, oversampling p=10 # Step 1: Form random test matrix and sketch the range of A Ω = randn(n, k+p) # random Gaussian — O(n(k+p)) Y = A @ Ω#
sketch: O(mn(k+p))
# Step 2: Power iteration for better accuracy (optional) for _ in range(n_power_iter): Y = A @ (A.T @ Y) # amplifies dominant singular vectors
# Step 3: Orthonormalize the sketch Q, _ = qr(Y, mode='reduced')
# Q ∈ ℝ^{m×(k+p)}
# Step 4: Project A into small space and SVD B = Q.T @ A
# B ∈
ℝ^{(k+p)×n} — tiny! Û, Σ, Vt = svd(B, full=False) U = Q @ Û# full left singular vectors returnU[:, :k], Σ[:k], Vt[:k]
# Total cost: O(mn(k+p)) vs O(mn min(m,n)) for full SVD
LSA (Deerwester et al. 1990) was the first major application of SVD to natural language processing. It
discovers latent semantic structure in text by decomposing the term-document matrix.
The Term-Document Matrix
Build matrix \(\mathbf{A} \in \mathbb{R}^{t \times d}\) where \(\mathbf{A}_{ij} =
\text{tf-idf}(\text{term}_i, \text{doc}_j)\). Then apply SVD and truncate to rank \(k\):
Columns of
U_k: term embeddings (t terms → k dimensions). Rows of V_k^T: document embeddings (d documents → k
dimensions). Cosine similarity in k-dimensional space ≈ semantic similarity.
Query embedding: q_new = A_query^T U_k Σ_k^{-1}. The SVD "folds" both terms and
documents into the same k-dimensional latent space where synonyms cluster together even if they never
co-occur in the same document.
λ
Why LSA Works — The Key Insight
The terms "car" and "automobile" may never appear in the same document, but both
co-occur with "drive", "road", "engine". SVD captures this indirect relationship — the truncated
reconstruction implicitly smooths over lexical variation and reveals the underlying semantic
structure. This is because the low-rank SVD effectively clusters terms that appear in similar
contexts. LSA directly inspired word2vec and modern distributional semantics.
Netflix, Spotify, Amazon — their core recommendation engines are matrix factorization algorithms, all rooted in
SVD. The user-item ratings matrix is sparse and incomplete; SVD (or its variants) fills in the missing entries.
P_u = k-dimensional latent user preference vector. Q_i = k-dimensional latent item
feature vector. The dot product p_u^T q_i predicts user u's rating of item i. Optimized via SGD or ALS
(Alternating Least Squares) over the observed (non-missing) entries only.
Connection to SVD
If all entries of \(\mathbf{R}\) were observed, the optimal rank-\(k\) factorization
\(\mathbf{P}\mathbf{Q}^T\) would be exactly the truncated SVD
\(\mathbf{U}_k\boldsymbol{\Sigma}_k\mathbf{V}_k^T\). With missing entries, we minimize only over the observed
entries — this is the matrix completion problem. The SVD gives the unique closed-form
solution when all entries are present; SGD/ALS generalizes this to the missing-data case.
★
SVD++ — Simon Funk's Netflix Solution
Simon Funk (2006) won the Netflix Prize leaderboard with regularized matrix
factorization optimized by SGD — essentially stochastic gradient descent on the SVD objective with missing
data. The key insight: the SVD structure \(\mathbf{p}_u^T\mathbf{q}_i\) captures latent factors (genre
affinity, production era, etc.) even though these factors are never explicitly labeled. SVD++ adds
implicit feedback (which items a user interacted with, regardless of rating) to the model.
Fine-tune large language models by decomposing weight updates into low-rank SVD form: \(\Delta\mathbf{W} =
\mathbf{A}\mathbf{B}\) with \(\mathbf{A}\in\mathbb{R}^{d\times r}\), \(\mathbf{B}\in\mathbb{R}^{r\times
k}\), \(r \ll \min(d,k)\). Only A and B are trained; the pre-trained W is frozen. Uses 0.1% of parameters.
Directly uses the insight that weight updates are low-rank (SVD motivation).
◈
Spectral Normalization
Normalize each weight matrix by its largest singular value \(\sigma_1\) to enforce Lipschitz continuity:
\(\mathbf{W}_{\text{sn}} = \mathbf{W}/\sigma_1(\mathbf{W})\). Used in Spectral Normalization GAN (SN-GAN) to
stabilize discriminator training. Estimated cheaply using the power iteration method for \(\sigma_1\)
without full SVD.
≈
Weight Pruning via SVD
Compress neural network weight matrices by replacing each \(\mathbf{W}^{[\ell]}\) with its rank-\(k\) SVD
approximation \(\mathbf{U}_k\boldsymbol{\Sigma}_k\mathbf{V}_k^T\). Reduces inference FLOPs by \(d_1 d_2 /
(d_1 + d_2)k\). Works well for transformer attention projection matrices and embedding tables.
∥
Nuclear Norm Regularization
The nuclear norm \(\|\mathbf{W}\|_* = \sum_i\sigma_i\) is the convex relaxation of rank. Minimizing it
penalizes large singular values, encouraging low-rank weight matrices. Used as a regularizer when the
ground-truth relationship is expected to be low-rank: matrix completion, multi-task learning, collaborative
filtering.
∇
Gradient Analysis
SVD of gradient matrices during training diagnoses optimization health: effective rank (number of
significant singular values) measures gradient diversity. Sudden collapse of singular values indicates
training instability. Gradient clipping by spectral norm (\(\sigma_1\)) rather than Frobenius norm is more
principled and less aggressive.
⊕
Attention Matrix Analysis
In Transformers, the attention matrix \(\text{softmax}(\mathbf{Q}\mathbf{K}^T/\sqrt{d})\) has an intrinsic
rank much lower than its dimension in well-trained models. SVD of attention matrices reveals interpretable
"attention heads" — some attend to positional patterns, others to syntactic roles. Used extensively in
mechanistic interpretability research.
16
Image Compression & Signal Applications
// rank-k images · denoising · eigenfaces · signal separation
Image Compression
Treat an \(m \times n\) grayscale image as a matrix \(\mathbf{A}\). Its SVD compresses it by keeping only the
top \(k\) singular triplets:
For most natural images, >95% of the energy is captured in the top 50 singular
values. The rank-1 approximation alone gives a blurry but recognizable thumbnail. Ranks 10-50 give
visually acceptable quality at 10-50× compression. Note: JPEG is not SVD-based (it uses DCT blocks) but
shares the "energy compaction" philosophy.
Eigenfaces
Stack \(m\) face images (each of dimension \(p \times q = n\) pixels) as rows of data matrix \(\mathbf{X}\).
The left singular vectors of centered \(\tilde{\mathbf{X}}\) are the eigenfaces — the
principal modes of variation in human faces. Any face can be encoded as linear combination of the top \(k\)
eigenfaces. Sirovich and Kirby (1987), Turk and Pentland (1991) — the foundation of computer vision face
recognition pre-CNN.
17
The Complete Mental Model
// everything unified · one theorem · infinite reach
Fig 2. Complete SVD mental map — from the fundamental factorization through mathematical
properties to all major applications.
The complete story in one thread:
Existence: Every matrix \(\mathbf{A}\in\mathbb{R}^{m\times n}\) has an SVD \(\mathbf{A} =
\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T\) with orthogonal \(\mathbf{U},\mathbf{V}\) and non-negative
diagonal \(\boldsymbol{\Sigma}\). No conditions required.
Geometry: The three matrices have clear geometric meaning — \(\mathbf{V}^T\) rotates in
domain space, \(\boldsymbol{\Sigma}\) scales along coordinate axes, \(\mathbf{U}\) rotates in range space.
Every linear map maps the unit sphere to an ellipse; SVD finds that ellipse exactly.
Rank and subspaces: The number of nonzero singular values equals the rank. The first \(r\)
columns of \(\mathbf{U}\) span the column space; the last \(m-r\) span the left null space. Vice versa for
\(\mathbf{V}\). All four fundamental subspaces fall out simultaneously.
Optimal approximation: The Eckart-Young theorem proves that the truncated SVD
\(\hat{\mathbf{A}}_k = \sum_{i=1}^k\sigma_i\mathbf{u}_i\mathbf{v}_i^T\) is the best rank-\(k\) approximation
under any unitarily invariant norm. No other rank-\(k\) matrix is closer to \(\mathbf{A}\).
Connection to eigenvalues: \(\sigma_i^2 = \lambda_i(\mathbf{A}^T\mathbf{A}) =
\lambda_i(\mathbf{A}\mathbf{A}^T)\). SVD generalizes eigendecomposition to rectangular matrices while
guaranteeing orthogonal decomposition matrices.
Pseudoinverse: \(\mathbf{A}^+ = \mathbf{V}\boldsymbol{\Sigma}^+\mathbf{U}^T\) gives the
minimum-norm least-squares solution to any linear system — well-defined for any matrix, with or without full
rank.
PCA: SVD of the centered data matrix gives PCA directly — right singular vectors are the
principal components, and \(\sigma_i^2/(m-1)\) are the explained variances. Always compute PCA via SVD.
Applications span all of ML: LSA discovers semantic structure in text. Matrix factorization
powers recommendation engines. LoRA uses low-rank SVD structure to fine-tune LLMs with 1000× fewer parameters.
Spectral normalization stabilizes GAN training. Image compression uses rank-\(k\) SVD. Every major ML system
uses SVD somewhere.
★
Why SVD Is the Most Important Matrix Factorization
Eigendecomposition requires a square matrix. Cholesky requires positive definiteness. LU requires no zero
pivots. SVD requires nothing. It works for every matrix, reveals all its intrinsic
properties, provides the best possible low-rank approximation, generalizes both the inverse and the
projection, and connects to PCA, least squares, optimization, and signal processing in one unified
factorization. If you can only know one matrix decomposition — this is the one.