SVD | Shahid Ul Islam

01

What Is SVD? The Existence Theorem

// every matrix · no square assumption · the fundamental factorization

Most matrix factorizations — eigendecomposition, Cholesky, LU — require special conditions: square matrices, positive definiteness, non-singularity. SVD requires nothing. It works for every matrix, regardless of shape, rank, or properties.

The Theorem

★

SVD Existence Theorem

For every matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\) of rank \(r\), there exist orthogonal matrices \(\mathbf{U} \in \mathbb{R}^{m \times m}\) and \(\mathbf{V} \in \mathbb{R}^{n \times n}\), and a matrix \(\boldsymbol{\Sigma} \in \mathbb{R}^{m \times n}\) with non-negative entries on the diagonal and zeros elsewhere, such that: \[\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T\] The diagonal entries \(\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > \sigma_{r+1} = \cdots = 0\) are the singular values of \(\mathbf{A}\). This decomposition always exists and is essentially unique (up to signs).

Why "Singular" Values?

A matrix \(\mathbf{A} - \sigma\mathbf{I}\) (for square \(\mathbf{A}\)) becomes singular (non-invertible) at \(\sigma = \sigma_i\). For rectangular matrices, this generalizes to the values at which \(\mathbf{A}^T\mathbf{A} - \sigma^2\mathbf{I}\) is singular — equivalently, \(\sigma^2 = \lambda_i(\mathbf{A}^T\mathbf{A})\). The name comes from classical matrix theory where these were called "singular points" of the transformation.

02

The Factorization in Detail

// component matrices · orthogonality · outer product expansion

The Three Matrices

Full SVD DecompositionDefinition

\[\underbrace{\mathbf{A}}_{m \times n} = \underbrace{\mathbf{U}}_{m \times m}\underbrace{\Sigma}_{m \times n}\underbrace{\mathbf{V}^T}_{n \times n}\] \[\mathbf{U}^T\mathbf{U} = \mathbf{I}_m, \quad \mathbf{V}^T\mathbf{V} = \mathbf{I}_n \quad \text{(orthogonal/unitary)}\] \[\boldsymbol{\Sigma} = \begin{pmatrix}\sigma_1 & & \\ & \ddots & \\ & & \sigma_r & \\ & & & \mathbf{0}\end{pmatrix}, \quad \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0\]

U columns = left singular vectors. V columns = right singular vectors. Σ diagonal = singular values. The r nonzero singular values equal the rank of A.

The Outer Product Expansion — Column by Column

The most revealing form of SVD is the sum of rank-1 matrices:

Outer Product / Dyadic ExpansionCore Form

\[\mathbf{A} = \sum_{i=1}^{r} \sigma_i \mathbf{u}_i \mathbf{v}_i^T\]

Each term σᵢ uᵢ vᵢᵀ is a rank-1 matrix — an outer product of the i-th left and right singular vectors, weighted by σᵢ. This decomposition orders terms from most to least important by singular value magnitude. Truncating to k terms gives the best rank-k approximation.

◈

Why This Form Matters

The outer product expansion reveals that every matrix is a weighted sum of rank-1 components, each capturing a "mode" of variation. The first term explains the most; subsequent terms add finer detail. Dropping the last terms gives optimal compression. This is the mathematical basis of every compression, denoising, and latent representation method in ML.

03

Geometric Interpretation

// rotation → scaling → rotation · unit sphere → ellipse · three operations

SVD is not just an algebraic factorization — it is a complete geometric description of what every linear transformation does. Any matrix \(\mathbf{A}: \mathbb{R}^n \to \mathbb{R}^m\) maps the unit sphere in \(\mathbb{R}^n\) to an ellipse in \(\mathbb{R}^m\). SVD finds that ellipse exactly.

The Three Operations

SVD as Three Geometric StepsGeometry

\[\mathbf{A}\mathbf{x} = \mathbf{U}\underbrace{(\boldsymbol{\Sigma}(\mathbf{V}^T\mathbf{x}))}_{\text{scaled rotated coords}}\] \[\mathbf{x} \xrightarrow{\mathbf{V}^T} \mathbf{V}^T\mathbf{x} \xrightarrow{\boldsymbol{\Sigma}} \boldsymbol{\Sigma}\mathbf{V}^T\mathbf{x} \xrightarrow{\mathbf{U}} \mathbf{A}\mathbf{x}\]

Step 1: V^T — rotation in ℝⁿ (change to the right singular vector basis). Step 2: Σ — axis-aligned scaling by σ₁,…,σᵣ (stretch/compress each axis independently). Step 3: U — rotation in ℝᵐ (rotate scaled result to final orientation). Every linear map is just these three operations.

Fig 1. SVD as three geometric operations. The unit sphere is rotated (V^T), stretched along coordinate axes by σᵢ (Σ), then rotated again (U). The resulting ellipsoid has semi-axes of length σ₁ ≥ σ₂ ≥ ···.

∿

The Ellipse Theorem

The image of the unit sphere \(\{\mathbf{x}: \|\mathbf{x}\|=1\}\) under \(\mathbf{A}\) is an ellipse with semi-axes of length \(\sigma_1, \ldots, \sigma_r\) aligned with the columns of \(\mathbf{U}\). The maximum stretch \(\max_{\|\mathbf{x}\|=1}\|\mathbf{A}\mathbf{x}\| = \sigma_1\) (the spectral norm). The minimum nonzero stretch is \(\sigma_r\). Zero singular values correspond to directions that are completely collapsed.

04

Full, Thin & Compact SVD

// three variants · when to use which · memory trade-offs

Full SVD

\(\mathbf{U}\in\mathbb{R}^{m\times m}\), \(\boldsymbol{\Sigma}\in\mathbb{R}^{m\times n}\), \(\mathbf{V}\in\mathbb{R}^{n\times n}\).

All singular vectors included, even those corresponding to zero singular values. Necessary for solving linear systems and computing orthonormal bases for all four fundamental subspaces.

Thin (Economy) SVD

\(\mathbf{U}\in\mathbb{R}^{m\times\min(m,n)}\), \(\boldsymbol{\Sigma}\in\mathbb{R}^{\min(m,n)\times\min(m,n)}\), \(\mathbf{V}\in\mathbb{R}^{n\times\min(m,n)}\).

Drops the "extra" singular vectors that correspond to zero singular values. Same information for the factorization; much more memory efficient when \(m \gg n\) or \(n \gg m\).

Compact SVD

\(\mathbf{U}_r\in\mathbb{R}^{m\times r}\), \(\boldsymbol{\Sigma}_r\in\mathbb{R}^{r\times r}\), \(\mathbf{V}_r\in\mathbb{R}^{n\times r}\) where \(r = \text{rank}(\mathbf{A})\).

Only the \(r\) nonzero singular values and their vectors. Exact representation; most storage efficient. numpy.linalg.svd(full_matrices=False).

Truncated SVD (Rank-k)

\(\mathbf{U}_k\in\mathbb{R}^{m\times k}\), \(\boldsymbol{\Sigma}_k\in\mathbb{R}^{k\times k}\), \(\mathbf{V}_k\in\mathbb{R}^{n\times k}\), \(k < r\).

Approximate, not exact. The best rank-\(k\) approximation (Eckart-Young theorem, §08). Used for compression, PCA, denoising. sklearn.utils.extmath.randomized_svd.

05

Singular Values — Deep Meaning

// spectral norm · Frobenius norm · information content · energy

Singular Values — Key PropertiesTheorems

\[\sigma_1 = \|\mathbf{A}\|_2 = \max_{\|\mathbf{x}\|=1}\|\mathbf{A}\mathbf{x}\| \quad\text{(spectral / operator norm)}\] \[\|\mathbf{A}\|_F^2 = \sum_{i=1}^{r}\sigma_i^2 = \text{tr}(\mathbf{A}^T\mathbf{A}) \quad\text{(Frobenius norm = sum of squares)}\] \[\|\mathbf{A}\|_* = \sum_{i=1}^{r}\sigma_i \quad\text{(nuclear / trace norm)}\] \[\det(\mathbf{A}) = \prod_{i=1}^n\sigma_i \quad\text{(for square } \mathbf{A}\text{)}\]

The singular values are the "energy spectrum" of the matrix. σᵢ² is the energy of the i-th mode. Σᵢσᵢ² = total energy. The fraction σᵢ²/‖A‖_F² is the proportion of total energy in mode i — exactly the explained variance ratio in PCA.

Singular Values and Matrix Properties

Rank: \(\text{rank}(\mathbf{A})\) = number of nonzero singular values.
Invertibility: \(\mathbf{A}\) is invertible iff all \(n\) singular values are strictly positive iff \(\sigma_n > 0\).
Orthogonal matrix: All singular values equal 1 (orthogonal maps preserve lengths).
Positive definite: For symmetric \(\mathbf{A} \succeq 0\), singular values = eigenvalues.
Similarity invariance: Singular values are invariant under orthogonal similarity transformations — they are intrinsic properties of the linear map, not the coordinate representation.

06

Spectral Theorem Connection

// AᵀA and AAᵀ · eigendecomposition · why SVD generalizes eigendecomposition

SVD and eigendecomposition are deeply connected. The singular values and singular vectors of \(\mathbf{A}\) are exactly the square roots of eigenvalues and the eigenvectors of the related square symmetric matrices \(\mathbf{A}^T\mathbf{A}\) and \(\mathbf{A}\mathbf{A}^T\).

// Deriving U, Σ, V from eigendecomposition of A^T A and A A^T

1

From \(\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T\), compute \(\mathbf{A}^T\mathbf{A}\): \[\mathbf{A}^T\mathbf{A} = (\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T)^T(\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T) = \mathbf{V}\boldsymbol{\Sigma}^T\mathbf{U}^T\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T = \mathbf{V}\boldsymbol{\Sigma}^T\boldsymbol{\Sigma}\mathbf{V}^T\]

2

Since \(\boldsymbol{\Sigma}^T\boldsymbol{\Sigma} = \text{diag}(\sigma_1^2,\ldots,\sigma_r^2,0,\ldots)\), this is the eigendecomposition of \(\mathbf{A}^T\mathbf{A}\): \[\mathbf{A}^T\mathbf{A}\mathbf{v}_i = \sigma_i^2\mathbf{v}_i \quad\Rightarrow\quad \text{V = eigenvectors of }\mathbf{A}^T\mathbf{A},\; \sigma_i^2 = \text{eigenvalues}\]

A^T A is n×n symmetric PSD — all eigenvalues ≥ 0, guaranteed real, eigenvectors orthogonal (Spectral Theorem).

3

Similarly \(\mathbf{A}\mathbf{A}^T = \mathbf{U}\boldsymbol{\Sigma}\boldsymbol{\Sigma}^T\mathbf{U}^T\): \[\mathbf{A}\mathbf{A}^T\mathbf{u}_i = \sigma_i^2\mathbf{u}_i \quad\Rightarrow\quad \text{U = eigenvectors of }\mathbf{A}\mathbf{A}^T\]

A A^T is m×m symmetric PSD — same nonzero eigenvalues as A^T A (though different sizes). This is why the nonzero singular values of A^T equal those of A.

4

The singular values: \(\sigma_i = \sqrt{\lambda_i(\mathbf{A}^T\mathbf{A})} = \sqrt{\lambda_i(\mathbf{A}\mathbf{A}^T)} > 0\)

This is the practical way to compute SVD: eigendecompose A^T A (or A A^T for tall matrices), take square roots of eigenvalues, compute corresponding left/right singular vectors. ∎

SVD vs EigendecompositionComparison

\[\underbrace{\mathbf{A} = \mathbf{P}\boldsymbol{\Lambda}\mathbf{P}^{-1}}_{\text{eigendecomposition: square A only}} \qquad \underbrace{\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T}_{\text{SVD: any A, orthogonal U and V}}\]

Eigendecomposition requires square matrix, non-defective. P may not be orthogonal. SVD works for any m×n matrix. U and V are always orthogonal. For symmetric A: eigendecomposition is SVD (U = V = eigenvectors, Σ = |Λ|). SVD strictly generalizes eigendecomposition.

07

Rank & The Four Fundamental Subspaces

// column space · null space · row space · left null space · SVD reveals all four

Gilbert Strang's "fundamental theorem of linear algebra" describes four subspaces associated with every matrix. SVD reveals all four simultaneously and in one shot.

Four Fundamental Subspaces via SVDTheorem

\[\underbrace{\text{Col}(\mathbf{A})}_{\text{column space}} = \text{span}\{\mathbf{u}_1,\ldots,\mathbf{u}_r\} \subset \mathbb{R}^m \qquad \underbrace{\text{Null}(\mathbf{A}^T)}_{\text{left null space}} = \text{span}\{\mathbf{u}_{r+1},\ldots,\mathbf{u}_m\} \subset \mathbb{R}^m\] \[\underbrace{\text{Row}(\mathbf{A})}_{\text{row space}} = \text{span}\{\mathbf{v}_1,\ldots,\mathbf{v}_r\} \subset \mathbb{R}^n \qquad \underbrace{\text{Null}(\mathbf{A})}_{\text{null space}} = \text{span}\{\mathbf{v}_{r+1},\ldots,\mathbf{v}_n\} \subset \mathbb{R}^n\]

Column space and left null space are orthogonal complements in ℝᵐ. Row space and null space are orthogonal complements in ℝⁿ. All four are orthonormal (U and V are orthogonal). SVD gives orthonormal bases for all four simultaneously.

⊕

Rank-Nullity Theorem via SVD

\(\dim(\text{Col}(\mathbf{A})) + \dim(\text{Null}(\mathbf{A}^T)) = m\) and \(\dim(\text{Row}(\mathbf{A})) + \dim(\text{Null}(\mathbf{A})) = n\). Both are immediate from SVD: exactly \(r\) nonzero singular values give dimension \(r\) for column/row spaces; the remaining \(m-r\) and \(n-r\) singular vectors span the null spaces. The rank-nullity theorem is just counting singular values.

08

Low-Rank Approximation — Eckart-Young-Mirsky

// best rank-k approximation · Frobenius and spectral · optimality proof

★

Eckart-Young-Mirsky Theorem (1936)

For any matrix norm induced by a unitarily invariant norm (including Frobenius and spectral norms), the best rank-\(k\) approximation of \(\mathbf{A}\) is the truncated SVD: \[\hat{\mathbf{A}}_k = \sum_{i=1}^k\sigma_i\mathbf{u}_i\mathbf{v}_i^T = \mathbf{U}_k\boldsymbol{\Sigma}_k\mathbf{V}_k^T\] The approximation error in the Frobenius norm is \(\|\mathbf{A} - \hat{\mathbf{A}}_k\|_F^2 = \sum_{i=k+1}^r\sigma_i^2\), and in the spectral norm is \(\|\mathbf{A}-\hat{\mathbf{A}}_k\|_2 = \sigma_{k+1}\).

Proof Sketch

// Why truncated SVD minimizes Frobenius reconstruction error

1

For any rank-\(k\) matrix \(\mathbf{B}\), write \(\mathbf{A} - \mathbf{B}\) and use the fact that \(\|\mathbf{A}\|_F^2 = \sum_i\sigma_i^2\): \[\|\mathbf{A}-\mathbf{B}\|_F^2 = \|\mathbf{U}^T(\mathbf{A}-\mathbf{B})\mathbf{V}\|_F^2 \quad\text{(unitary invariance)}\]

2

Let \(\mathbf{C} = \boldsymbol{\Sigma} - \mathbf{U}^T\mathbf{B}\mathbf{V}\). Since \(\mathbf{B}\) has rank \(\leq k\), \(\mathbf{U}^T\mathbf{B}\mathbf{V}\) has rank \(\leq k\), and \(\boldsymbol{\Sigma}\) is diagonal. The minimum of \(\|\mathbf{C}\|_F^2\) over rank-\(k\) matrices subtracted from \(\boldsymbol{\Sigma}\) is achieved by zeroing out the \(k\) largest diagonal entries — i.e., keeping \(\sigma_{k+1},\ldots,\sigma_r\).

This is because taking the best rank-k approximation of a diagonal matrix means keeping the k largest diagonal entries. ∎

Energy Interpretation

Energy Retained by Rank-k SVDCompression

\[\text{Energy retained} = \frac{\sum_{i=1}^k\sigma_i^2}{\sum_{i=1}^r\sigma_i^2} = \frac{\|\hat{\mathbf{A}}_k\|_F^2}{\|\mathbf{A}\|_F^2}\] \[\text{Storage ratio} = \frac{k(m+n+1)}{mn} \quad\text{(U}_k\text{, }\boldsymbol{\Sigma}_k\text{, V}_k\text{ vs full A)}\]

Compression is beneficial when k(m+n+1) < mn, i.e., k < mn/(m+n+1) ≈ min(m,n)/2 roughly. For an image with σ₁²/‖A‖_F² ≈ 0.90, keeping just k=1 retains 90% of the information.

09

The Moore-Penrose Pseudoinverse

// least-squares solution · minimum norm · regularization · SVD formula

For non-square or rank-deficient matrices, the regular inverse doesn't exist. The pseudoinverse provides the generalized inverse that gives the least-squares minimum-norm solution to \(\mathbf{A}\mathbf{x} = \mathbf{b}\).

Moore-Penrose Pseudoinverse via SVDDefinition

\[\mathbf{A}^+ = \mathbf{V}\boldsymbol{\Sigma}^+\mathbf{U}^T \quad\text{where}\quad \boldsymbol{\Sigma}^+_{ii} = \begin{cases}1/\sigma_i & \sigma_i > 0 \\ 0 & \sigma_i = 0\end{cases}\] \[\hat{\mathbf{x}} = \mathbf{A}^+\mathbf{b} = \sum_{i:\sigma_i>0}\frac{\mathbf{u}_i^T\mathbf{b}}{\sigma_i}\mathbf{v}_i\]

A⁺b is the minimum Euclidean norm least-squares solution to Ax = b. If A has full column rank: A⁺ = (AᵀA)⁻¹Aᵀ (left inverse, least-squares solution). If A has full row rank: A⁺ = Aᵀ(AAᵀ)⁻¹ (right inverse, minimum norm solution). SVD gives A⁺ in all cases.

Regularized Pseudoinverse

For ill-conditioned systems, small singular values \(\sigma_i \approx 0\) cause \(1/\sigma_i\) to explode, amplifying noise. Tikhonov regularization replaces the pseudoinverse with:

Tikhonov Regularized InverseStable Inversion

\[\mathbf{A}^+_\lambda = \mathbf{V}\text{diag}\!\left(\frac{\sigma_i}{\sigma_i^2 + \lambda^2}\right)\mathbf{U}^T\]

This is the SVD solution to min_{x} ‖Ax−b‖² + λ²‖x‖² — Ridge regression! As λ → 0: reduces to pseudoinverse. As λ → ∞: shrinks solution toward zero. The filter factor σᵢ/(σᵢ²+λ²) smoothly downweights small singular values rather than inverting them.

10

SVD ↔ PCA — The Complete Bridge

// covariance eigendecomposition · identical solutions · which to use when

PCA and SVD are different presentations of the same underlying mathematics. Understanding their exact relationship prevents the confusion that arises from seeing two algorithms that produce identical results through apparently different computations.

SVD of Data Matrix → PCAEquivalence Proof

\[\tilde{\mathbf{X}} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T \quad\text{(SVD of centered data matrix } m\times d\text{)}\] \[\mathbf{S} = \frac{1}{m-1}\tilde{\mathbf{X}}^T\tilde{\mathbf{X}} = \frac{1}{m-1}\mathbf{V}\boldsymbol{\Sigma}^2\mathbf{V}^T \quad\text{(eigendecomposition of covariance)}\] \[\lambda_j = \frac{\sigma_j^2}{m-1}, \quad \mathbf{v}_j = \text{j-th right singular vector} = \text{j-th PC loading}\] \[\mathbf{Z} = \tilde{\mathbf{X}}\mathbf{V}_k = \mathbf{U}_k\boldsymbol{\Sigma}_k \quad\text{(PC scores = left singular vectors × singular values)}\]

SVD of the data matrix X̃ directly gives PCA without explicitly forming S = X̃^T X̃ / (m-1). This is numerically superior (avoids squaring the condition number). Always call SVD on the data matrix, not eigendecomposition on the covariance matrix.

Aspect	PCA formulation	SVD formulation
Central object	Covariance matrix S = X̃^T X̃/(m-1)	Data matrix X̃
Decomposition	Eigendecomposition of S	SVD of X̃
Loadings (directions)	Eigenvectors V of S	Right singular vectors V of X̃
Variance per component	Eigenvalues λⱼ of S	σⱼ² / (m-1) from SVD
Scores (projections)	Z = X̃ V_k	Z = U_k Σ_k
Numerical cost	O(d³) for d×d matrix	O(md min(m,d))
Preferred when	d ≪ m, covariance is available	General; always more stable

11

Condition Number & Numerical Stability

// sensitivity of linear systems · ill-conditioning · regularization insight

Condition Number via Singular ValuesStability

\[\kappa(\mathbf{A}) = \frac{\sigma_1}{\sigma_r} = \|\mathbf{A}\|_2 \cdot \|\mathbf{A}^{-1}\|_2 \quad\text{(for invertible square A)}\]

κ(A) measures how sensitive Ax = b is to perturbations: ‖δx‖/‖x‖ ≤ κ(A) ‖δb‖/‖b‖. κ = 1 is perfectly conditioned (orthogonal). κ ≫ 1 is ill-conditioned — small changes in b cause large changes in x. κ → ∞ means nearly singular. Double-precision arithmetic has ~16 digits; κ ≈ 10⁸ leaves only ~8 reliable digits.

SVD avoids squaring: \(\kappa(\mathbf{A}^T\mathbf{A}) = \kappa(\mathbf{A})^2\). Eigendecomposing \(\mathbf{A}^T\mathbf{A}\) squares the condition number, losing twice as many digits. SVD on \(\mathbf{A}\) directly is always more numerically stable.
Regularization changes conditioning: Ridge regression solves \((\mathbf{A}^T\mathbf{A}+\lambda\mathbf{I})\mathbf{x} = \mathbf{A}^T\mathbf{b}\). The condition number becomes \((\sigma_1^2+\lambda)/(\sigma_r^2+\lambda) \ll \sigma_1^2/\sigma_r^2\) for large \(\lambda\).
Singular matrices: If any \(\sigma_i = 0\), the system \(\mathbf{A}\mathbf{x} = \mathbf{b}\) is either inconsistent (no solution) or under-determined (infinite solutions). SVD via pseudoinverse gives the correct minimum-norm solution in both cases.

12

Computing SVD — Algorithms

// Golub-Reinsch · bidiagonalization · power iteration · randomized SVD

Golub-Reinsch Algorithm (Classical)

The standard full SVD algorithm operates in two phases:

Phase 1 — Bidiagonalization: Use Householder reflections to reduce \(\mathbf{A}\) to bidiagonal form \(\mathbf{B}\) (nonzero only on diagonal and superdiagonal). Cost: \(O(mn^2)\) flops.
Phase 2 — Iterative diagonalization: Apply implicit QR iterations to the bidiagonal \(\mathbf{B}\) until it converges to diagonal \(\boldsymbol{\Sigma}\). Cost: \(O(n)\) per iteration, typically converges in few iterations.

Randomized SVD — The Modern Standard for Large Matrices

Randomized SVD (Halko-Martinsson-Tropp 2011)

# Given: A ∈ ℝ^{m×n}, target rank k, oversampling p=10
# Step 1: Form random test matrix and sketch the range of A
Ω = randn(n, k+p)          # random Gaussian — O(n(k+p))
Y = A @ Ω               # sketch: O(mn(k+p))

# Step 2: Power iteration for better accuracy (optional)
for _ in range(n_power_iter):
  Y = A @ (A.T @ Y)    # amplifies dominant singular vectors

# Step 3: Orthonormalize the sketch
Q, _ = qr(Y, mode='reduced') # Q ∈ ℝ^{m×(k+p)}

# Step 4: Project A into small space and SVD
B = Q.T @ A              # B ∈ ℝ^{(k+p)×n} — tiny!
Û, Σ, Vt = svd(B, full=False)
U = Q @ Û              # full left singular vectors
return U[:, :k], Σ[:k], Vt[:k]

# Total cost: O(mn(k+p)) vs O(mn min(m,n)) for full SVD

Algorithm	Cost	Memory	Use When
Golub-Reinsch	\(O(mn^2)\) for \(m\geq n\)	\(O(mn)\)	Small/medium matrices; full SVD needed
Divide & Conquer	\(O(mn^2)\) but faster constant	\(O(mn)\)	Standard in LAPACK; default in NumPy
Lanczos / Arnoldi	\(O(mnk)\)	\(O((m+n)k)\)	Sparse matrices; top-k SVD
Randomized SVD	\(O(mn(k+p))\)	\(O((m+n)k)\)	Large dense matrices; top-k approximation
Incremental SVD	\(O(k^2(m+n))\) per update	\(O((m+n)k)\)	Streaming data; online updates

13

Latent Semantic Analysis (LSA)

// term-document matrix · tf-idf · latent topics · semantic similarity

LSA (Deerwester et al. 1990) was the first major application of SVD to natural language processing. It discovers latent semantic structure in text by decomposing the term-document matrix.

The Term-Document Matrix

Build matrix \(\mathbf{A} \in \mathbb{R}^{t \times d}\) where \(\mathbf{A}_{ij} = \text{tf-idf}(\text{term}_i, \text{doc}_j)\). Then apply SVD and truncate to rank \(k\):

LSA via Truncated SVDNLP

\[\underbrace{\mathbf{A}}_{t\times d} \approx \underbrace{\mathbf{U}_k}_{t\times k}\underbrace{\boldsymbol{\Sigma}_k}_{k\times k}\underbrace{\mathbf{V}_k^T}_{k\times d}\]

Columns of U_k: term embeddings (t terms → k dimensions). Rows of V_k^T: document embeddings (d documents → k dimensions). Cosine similarity in k-dimensional space ≈ semantic similarity.

Query embedding: q_new = A_query^T U_k Σ_k^{-1}. The SVD "folds" both terms and documents into the same k-dimensional latent space where synonyms cluster together even if they never co-occur in the same document.

λ

Why LSA Works — The Key Insight

The terms "car" and "automobile" may never appear in the same document, but both co-occur with "drive", "road", "engine". SVD captures this indirect relationship — the truncated reconstruction implicitly smooths over lexical variation and reveals the underlying semantic structure. This is because the low-rank SVD effectively clusters terms that appear in similar contexts. LSA directly inspired word2vec and modern distributional semantics.

14

Recommender Systems — Matrix Factorization

// user-item matrix · collaborative filtering · SVD++ · missing data

Netflix, Spotify, Amazon — their core recommendation engines are matrix factorization algorithms, all rooted in SVD. The user-item ratings matrix is sparse and incomplete; SVD (or its variants) fills in the missing entries.

The Collaborative Filtering Setup

Matrix Completion via FactorizationRecommendation

\[\underbrace{\mathbf{R}}_{m\text{ users}\times n\text{ items}} \approx \mathbf{P}\mathbf{Q}^T, \quad \mathbf{P}\in\mathbb{R}^{m\times k},\; \mathbf{Q}\in\mathbb{R}^{n\times k}\] \[\hat{R}_{ui} = \mathbf{p}_u^T\mathbf{q}_i = \sum_{f=1}^k p_{uf}\cdot q_{if}\] \[\min_{\mathbf{P},\mathbf{Q}}\sum_{(u,i)\in\text{observed}}\!(R_{ui}-\mathbf{p}_u^T\mathbf{q}_i)^2 + \lambda(\|\mathbf{P}\|_F^2 + \|\mathbf{Q}\|_F^2)\]

P_u = k-dimensional latent user preference vector. Q_i = k-dimensional latent item feature vector. The dot product p_u^T q_i predicts user u's rating of item i. Optimized via SGD or ALS (Alternating Least Squares) over the observed (non-missing) entries only.

Connection to SVD

If all entries of \(\mathbf{R}\) were observed, the optimal rank-\(k\) factorization \(\mathbf{P}\mathbf{Q}^T\) would be exactly the truncated SVD \(\mathbf{U}_k\boldsymbol{\Sigma}_k\mathbf{V}_k^T\). With missing entries, we minimize only over the observed entries — this is the matrix completion problem. The SVD gives the unique closed-form solution when all entries are present; SGD/ALS generalizes this to the missing-data case.

★

SVD++ — Simon Funk's Netflix Solution

Simon Funk (2006) won the Netflix Prize leaderboard with regularized matrix factorization optimized by SGD — essentially stochastic gradient descent on the SVD objective with missing data. The key insight: the SVD structure \(\mathbf{p}_u^T\mathbf{q}_i\) captures latent factors (genre affinity, production era, etc.) even though these factors are never explicitly labeled. SVD++ adds implicit feedback (which items a user interacted with, regardless of rating) to the model.

15

SVD in Deep Learning

// weight matrices · LoRA · spectral normalization · nuclear norm

⊗

LoRA — Low-Rank Adaptation

Fine-tune large language models by decomposing weight updates into low-rank SVD form: \(\Delta\mathbf{W} = \mathbf{A}\mathbf{B}\) with \(\mathbf{A}\in\mathbb{R}^{d\times r}\), \(\mathbf{B}\in\mathbb{R}^{r\times k}\), \(r \ll \min(d,k)\). Only A and B are trained; the pre-trained W is frozen. Uses 0.1% of parameters. Directly uses the insight that weight updates are low-rank (SVD motivation).

◈

Spectral Normalization

Normalize each weight matrix by its largest singular value \(\sigma_1\) to enforce Lipschitz continuity: \(\mathbf{W}_{\text{sn}} = \mathbf{W}/\sigma_1(\mathbf{W})\). Used in Spectral Normalization GAN (SN-GAN) to stabilize discriminator training. Estimated cheaply using the power iteration method for \(\sigma_1\) without full SVD.

≈

Weight Pruning via SVD

Compress neural network weight matrices by replacing each \(\mathbf{W}^{[\ell]}\) with its rank-\(k\) SVD approximation \(\mathbf{U}_k\boldsymbol{\Sigma}_k\mathbf{V}_k^T\). Reduces inference FLOPs by \(d_1 d_2 / (d_1 + d_2)k\). Works well for transformer attention projection matrices and embedding tables.

∥

Nuclear Norm Regularization

The nuclear norm \(\|\mathbf{W}\|_* = \sum_i\sigma_i\) is the convex relaxation of rank. Minimizing it penalizes large singular values, encouraging low-rank weight matrices. Used as a regularizer when the ground-truth relationship is expected to be low-rank: matrix completion, multi-task learning, collaborative filtering.

∇

Gradient Analysis

SVD of gradient matrices during training diagnoses optimization health: effective rank (number of significant singular values) measures gradient diversity. Sudden collapse of singular values indicates training instability. Gradient clipping by spectral norm (\(\sigma_1\)) rather than Frobenius norm is more principled and less aggressive.

⊕

Attention Matrix Analysis

In Transformers, the attention matrix \(\text{softmax}(\mathbf{Q}\mathbf{K}^T/\sqrt{d})\) has an intrinsic rank much lower than its dimension in well-trained models. SVD of attention matrices reveals interpretable "attention heads" — some attend to positional patterns, others to syntactic roles. Used extensively in mechanistic interpretability research.

16

Image Compression & Signal Applications

// rank-k images · denoising · eigenfaces · signal separation

Image Compression

Treat an \(m \times n\) grayscale image as a matrix \(\mathbf{A}\). Its SVD compresses it by keeping only the top \(k\) singular triplets:

Image Compression via Rank-k SVDApplication

\[\hat{\mathbf{A}}_k = \sum_{i=1}^k\sigma_i\mathbf{u}_i\mathbf{v}_i^T\] \[\text{Compression ratio} = \frac{k(m+n+1)}{mn}, \quad \text{PSNR} = 20\log_{10}\frac{\max(\mathbf{A})}{\|\mathbf{A}-\hat{\mathbf{A}}_k\|_F/\sqrt{mn}}\]

For most natural images, >95% of the energy is captured in the top 50 singular values. The rank-1 approximation alone gives a blurry but recognizable thumbnail. Ranks 10-50 give visually acceptable quality at 10-50× compression. Note: JPEG is not SVD-based (it uses DCT blocks) but shares the "energy compaction" philosophy.

Eigenfaces

Stack \(m\) face images (each of dimension \(p \times q = n\) pixels) as rows of data matrix \(\mathbf{X}\). The left singular vectors of centered \(\tilde{\mathbf{X}}\) are the eigenfaces — the principal modes of variation in human faces. Any face can be encoded as linear combination of the top \(k\) eigenfaces. Sirovich and Kirby (1987), Turk and Pentland (1991) — the foundation of computer vision face recognition pre-CNN.

17

The Complete Mental Model

// everything unified · one theorem · infinite reach

Fig 2. Complete SVD mental map — from the fundamental factorization through mathematical properties to all major applications.

The complete story in one thread:

Existence: Every matrix \(\mathbf{A}\in\mathbb{R}^{m\times n}\) has an SVD \(\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T\) with orthogonal \(\mathbf{U},\mathbf{V}\) and non-negative diagonal \(\boldsymbol{\Sigma}\). No conditions required.
Geometry: The three matrices have clear geometric meaning — \(\mathbf{V}^T\) rotates in domain space, \(\boldsymbol{\Sigma}\) scales along coordinate axes, \(\mathbf{U}\) rotates in range space. Every linear map maps the unit sphere to an ellipse; SVD finds that ellipse exactly.
Rank and subspaces: The number of nonzero singular values equals the rank. The first \(r\) columns of \(\mathbf{U}\) span the column space; the last \(m-r\) span the left null space. Vice versa for \(\mathbf{V}\). All four fundamental subspaces fall out simultaneously.
Optimal approximation: The Eckart-Young theorem proves that the truncated SVD \(\hat{\mathbf{A}}_k = \sum_{i=1}^k\sigma_i\mathbf{u}_i\mathbf{v}_i^T\) is the best rank-\(k\) approximation under any unitarily invariant norm. No other rank-\(k\) matrix is closer to \(\mathbf{A}\).
Connection to eigenvalues: \(\sigma_i^2 = \lambda_i(\mathbf{A}^T\mathbf{A}) = \lambda_i(\mathbf{A}\mathbf{A}^T)\). SVD generalizes eigendecomposition to rectangular matrices while guaranteeing orthogonal decomposition matrices.
Pseudoinverse: \(\mathbf{A}^+ = \mathbf{V}\boldsymbol{\Sigma}^+\mathbf{U}^T\) gives the minimum-norm least-squares solution to any linear system — well-defined for any matrix, with or without full rank.
PCA: SVD of the centered data matrix gives PCA directly — right singular vectors are the principal components, and \(\sigma_i^2/(m-1)\) are the explained variances. Always compute PCA via SVD.
Applications span all of ML: LSA discovers semantic structure in text. Matrix factorization powers recommendation engines. LoRA uses low-rank SVD structure to fine-tune LLMs with 1000× fewer parameters. Spectral normalization stabilizes GAN training. Image compression uses rank-\(k\) SVD. Every major ML system uses SVD somewhere.

★

Why SVD Is the Most Important Matrix Factorization

Eigendecomposition requires a square matrix. Cholesky requires positive definiteness. LU requires no zero pivots. SVD requires nothing. It works for every matrix, reveals all its intrinsic properties, provides the best possible low-rank approximation, generalizes both the inverse and the projection, and connects to PCA, least squares, optimization, and signal processing in one unified factorization. If you can only know one matrix decomposition — this is the one.

SVD Singular Value Decomposition