PCA | Shahid Ul Islam

I

The Curse of Dimensionality

// Hughes phenomenon · sparse data · geometry of high dimensions

Before understanding what PCA solves, we must understand the problem it addresses. High-dimensional data is not just computationally expensive — it is geometrically pathological in ways that break nearly every ML assumption.

Volume Concentration

Consider a unit hypercube \([0,1]^d\). The fraction of its volume within distance \(\epsilon\) of the boundary is \(1 - (1-2\epsilon)^d\). For \(d = 100\) and \(\epsilon = 0.01\):

Volume Near Boundary

\[1 - (1-0.02)^{100} = 1 - 0.98^{100} \approx 1 - 0.133 = 0.867\]

87% of the hypercube's volume lies within 1% of the boundary. Nearly all data points are "near the edge" — the notion of a "typical interior point" vanishes.

Distance Concentration

For \(m\) points drawn uniformly from \([0,1]^d\), the ratio of maximum to minimum pairwise distance converges to 1 as \(d \to \infty\):

Distance Concentration

\[\lim_{d\to\infty} \frac{\max\text{dist} - \min\text{dist}}{\min\text{dist}} \to 0\]

All points become equidistant. k-NN becomes meaningless. Clustering loses discriminability. Any distance-based algorithm degrades catastrophically.

Data Sparsity

To cover a \(d\)-dimensional unit hypercube with a coverage fraction \(r\) per axis, you need \(m = r^{-d}\) samples. For \(r = 0.1\) and \(d = 100\): \(m = 10^{100}\) — more than atoms in the observable universe.

!

Why High Dimensions Are Mostly Empty

Real datasets with thousands of features have far fewer samples than dimensions. A 28×28 MNIST image lives in \(\mathbb{R}^{784}\) — but natural handwritten digits occupy a manifold of much lower intrinsic dimension. PCA finds this low-dimensional subspace. The intrinsic dimensionality is typically orders of magnitude smaller than the ambient dimension.

II

What PCA Does — Intuition First

// linear projection · maximum variance · orthogonal basis

PCA finds a new coordinate system for your data — one that is aligned with the directions of maximum variance. The first axis points in the direction where the data varies most. The second axis is orthogonal to the first and captures the next most variance. And so on.

Fig 1. Left: Data in original correlated basis — both axes needed to describe the cloud. Right: PCA basis — PC₁ (gold) aligns with maximum variance, PC₂ (lavender) captures residual variance. The description is now compact and uncorrelated.

→

The Core Promise of PCA

PCA finds a rotation of the coordinate system such that (1) variance is maximized along each new axis in order, (2) the new axes are mutually orthogonal (uncorrelated), and (3) we can keep only the first \(k\) axes and lose the minimum possible information. It is the optimal linear dimensionality reduction under mean squared error.

III

Data Centering & The Covariance Matrix

// mean subtraction · scatter matrix · statistical setup

PCA operates on centered data. The mean captures the location of the data cloud; the covariance matrix captures its shape. We must remove the mean first so that PCA describes shape, not position.

Step 1 — Center the Data

Data Centering

\[\boldsymbol{\mu} = \frac{1}{m}\sum_{i=1}^{m}\mathbf{x}^{(i)} \in \mathbb{R}^d\] \[\tilde{\mathbf{x}}^{(i)} = \mathbf{x}^{(i)} - \boldsymbol{\mu} \quad\Rightarrow\quad \frac{1}{m}\sum_{i=1}^{m}\tilde{\mathbf{x}}^{(i)} = \mathbf{0}\]

Let X̃ ∈ ℝ^{m×d} be the centered data matrix (each row is one centered sample). Everything from here uses X̃; we drop the tilde for brevity.

Step 2 — The Sample Covariance Matrix

Sample Covariance Matrix

\[\mathbf{S} = \frac{1}{m-1}\sum_{i=1}^{m}\tilde{\mathbf{x}}^{(i)}(\tilde{\mathbf{x}}^{(i)})^T = \frac{1}{m-1}\tilde{\mathbf{X}}^T\tilde{\mathbf{X}} \in \mathbb{R}^{d\times d}\]

The (j,k) entry S_{jk} = Cov(x_j, x_k) = correlation structure between features j and k. The diagonal S_{jj} = Var(x_j). The 1/(m-1) gives an unbiased estimate of the population covariance; for PCA, 1/m gives the same eigenvectors.

The covariance matrix is the central object of PCA. Its properties determine everything:

Symmetric: \(\mathbf{S} = \mathbf{S}^T\) — because \(\text{Cov}(x_j, x_k) = \text{Cov}(x_k, x_j)\).
Positive semi-definite (PSD): \(\mathbf{v}^T\mathbf{S}\mathbf{v} \geq 0\) for all \(\mathbf{v}\) — because \(\mathbf{v}^T\mathbf{S}\mathbf{v} = \frac{1}{m-1}\|\tilde{\mathbf{X}}\mathbf{v}\|^2 \geq 0\). This guarantees all eigenvalues \(\geq 0\).
Real eigenvalues: By the Spectral Theorem for symmetric matrices — guaranteed real, non-negative eigenvalues and orthogonal eigenvectors.
Rank: \(\text{rank}(\mathbf{S}) \leq \min(m-1, d)\) — if \(m < d\), the covariance matrix is singular and has at most \(m-1\) nonzero eigenvalues.

IV

Covariance Matrix — Deep Dive

// geometric meaning · ellipsoid of concentration · spectral decomposition

The Covariance Ellipsoid

The covariance matrix \(\mathbf{S}\) defines an ellipsoid: \(\{\mathbf{v} : \mathbf{v}^T\mathbf{S}^{-1}\mathbf{v} = 1\}\). This is the set of points at Mahalanobis distance 1 from the origin. Its axes are aligned with the eigenvectors of \(\mathbf{S}\), and the length of each axis is proportional to \(\sqrt{\lambda_i}\) (the square root of the corresponding eigenvalue).

Fig 2. The covariance ellipsoid. Left: isotropic (no preferred direction). Centre: elongated along x₁ — PC₁ aligns with x₁. Right: rotated — eigenvectors are oblique to original axes, and PCA finds them exactly.

Spectral Decomposition of S

By the Spectral Theorem for real symmetric matrices:

Eigendecomposition of Covariance Matrix

\[\mathbf{S} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^T = \sum_{j=1}^{d}\lambda_j\mathbf{v}_j\mathbf{v}_j^T\] \[\boldsymbol{\Lambda} = \text{diag}(\lambda_1,\ldots,\lambda_d), \quad \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0\] \[\mathbf{V} = [\mathbf{v}_1 | \mathbf{v}_2 | \cdots | \mathbf{v}_d], \quad \mathbf{V}^T\mathbf{V} = \mathbf{I}_d\]

V is orthogonal (V^T = V^{-1}). Each outer product λ_j v_j v_j^T is a rank-1 matrix representing the contribution of the j-th principal component to the total covariance structure.

V

The Variance Maximization Problem

// objective function · unit norm constraint · Rayleigh quotient

PCA's first principal component is the unit vector \(\mathbf{v}_1\) such that projecting the data onto \(\mathbf{v}_1\) gives maximum variance. Let's formalize this.

Variance of the Projection

The projection of sample \(\mathbf{x}^{(i)}\) onto unit vector \(\mathbf{v}\) is the scalar \(z^{(i)} = \mathbf{v}^T\mathbf{x}^{(i)}\). The variance of these projected values across all samples:

Projection Variance

\[\text{Var}\left(\mathbf{v}^T\mathbf{x}\right) = \frac{1}{m-1}\sum_{i=1}^{m}\left(\mathbf{v}^T\mathbf{x}^{(i)}\right)^2 = \frac{1}{m-1}\mathbf{v}^T\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}\mathbf{v} = \mathbf{v}^T\mathbf{S}\mathbf{v}\]

Since data is centered: the mean of v^T x̃^{(i)} is 0. The variance is the quadratic form v^T S v. PCA maximizes this over all unit vectors v.

The Rayleigh Quotient

The Rayleigh quotient of a symmetric matrix \(\mathbf{S}\) with respect to vector \(\mathbf{v}\) is:

Rayleigh Quotient

\[R(\mathbf{v}) = \frac{\mathbf{v}^T\mathbf{S}\mathbf{v}}{\mathbf{v}^T\mathbf{v}}\]

The Rayleigh quotient is bounded: \(\lambda_{\min} \leq R(\mathbf{v}) \leq \lambda_{\max}\). The maximum is achieved when \(\mathbf{v} = \mathbf{v}_1\) (the eigenvector corresponding to the largest eigenvalue). The constrained problem (‖v‖=1) is equivalent to maximizing the unconstrained Rayleigh quotient.

Formal Optimization Problem

PCA First Principal Component

\[\mathbf{v}_1 = \underset{\mathbf{v} \in \mathbb{R}^d}{\arg\max}\; \mathbf{v}^T\mathbf{S}\mathbf{v} \quad\text{subject to}\quad \|\mathbf{v}\|_2 = 1\]

Subsequent components are found by successively deflating and re-solving:

k-th Principal Component (Deflation)

\[\mathbf{v}_k = \underset{\mathbf{v}}{\arg\max}\; \mathbf{v}^T\mathbf{S}\mathbf{v} \quad\text{subject to}\quad \|\mathbf{v}\|=1,\; \mathbf{v}\perp\mathbf{v}_j\;\forall j < k\]

VI

Lagrangian Derivation — Full Proof

// method of Lagrange multipliers · stationarity · complete derivation

We now apply the Lagrange multiplier method (from the Lagrange Multipliers masterclass) to solve the PCA optimization problem exactly. This derivation shows why eigenvectors are the answer — it falls out from first principles.

Setting Up the Lagrangian

Objective: maximize \(\mathbf{v}^T\mathbf{S}\mathbf{v}\). Constraint: \(\|\mathbf{v}\|^2 = 1\), i.e., \(g(\mathbf{v}) = \mathbf{v}^T\mathbf{v} - 1 = 0\).

Lagrangian

\[\mathcal{L}(\mathbf{v}, \lambda) = \mathbf{v}^T\mathbf{S}\mathbf{v} - \lambda(\mathbf{v}^T\mathbf{v} - 1)\]

Full Derivation

// Lagrangian → Eigenvalue Problem — Complete Proof

1

Stationarity condition: set \(\nabla_{\mathbf{v}}\mathcal{L} = \mathbf{0}\). Using matrix calculus identities: \(\nabla_{\mathbf{v}}(\mathbf{v}^T\mathbf{S}\mathbf{v}) = 2\mathbf{S}\mathbf{v}\) (since \(\mathbf{S}\) is symmetric) and \(\nabla_{\mathbf{v}}(\mathbf{v}^T\mathbf{v}) = 2\mathbf{v}\): \[\nabla_{\mathbf{v}}\mathcal{L} = 2\mathbf{S}\mathbf{v} - 2\lambda\mathbf{v} = \mathbf{0}\]

2

Simplify: \[\mathbf{S}\mathbf{v} = \lambda\mathbf{v}\]

This is the eigenvalue equation! The stationary points of the Lagrangian are exactly the eigenvectors of S.

3

Evaluate the objective at a stationary point. Multiply both sides of \(\mathbf{S}\mathbf{v} = \lambda\mathbf{v}\) by \(\mathbf{v}^T\) from the left: \[\mathbf{v}^T\mathbf{S}\mathbf{v} = \lambda\mathbf{v}^T\mathbf{v} = \lambda\cdot 1 = \lambda\]

The variance of the projection equals the Lagrange multiplier λ! To maximize variance, we want the largest λ.

4

Constraint stationarity: \(\partial\mathcal{L}/\partial\lambda = -(\mathbf{v}^T\mathbf{v}-1) = 0\) recovers \(\|\mathbf{v}\|=1\). ∎
Conclusion: \(\mathbf{v}_1 = \arg\max_{\|\mathbf{v}\|=1}\mathbf{v}^T\mathbf{S}\mathbf{v}\) is the eigenvector of \(\mathbf{S}\) corresponding to eigenvalue \(\lambda_1 = \max_j\lambda_j\). The maximum variance equals \(\lambda_1\).

★ Theorem — PCA as Eigenvalue Problem

The \(k\)-th principal component \(\mathbf{v}_k\) is the \(k\)-th eigenvector of the sample covariance matrix \(\mathbf{S}\) (ordered by decreasing eigenvalue). The variance explained by the \(k\)-th component equals the \(k\)-th eigenvalue \(\lambda_k\). This follows directly from the Lagrangian stationarity condition.

VII

The Eigenvalue Problem

// characteristic polynomial · spectral theorem · orthogonality proof

The Characteristic Equation

Eigenvalue Equation

\[\mathbf{S}\mathbf{v} = \lambda\mathbf{v} \quad\Leftrightarrow\quad (\mathbf{S} - \lambda\mathbf{I})\mathbf{v} = \mathbf{0}\] \[\text{Non-trivial solution} \Leftrightarrow \det(\mathbf{S} - \lambda\mathbf{I}) = 0 \quad\text{(characteristic polynomial)}\]

The characteristic polynomial is degree d in λ, giving d eigenvalues (counting multiplicity, possibly complex — but since S is real symmetric, all eigenvalues are real and ≥ 0).

Why Eigenvectors of S Are Orthogonal — Proof

// Proving v_i ⊥ v_j for distinct eigenvalues λ_i ≠ λ_j

1

Two eigenvectors: \(\mathbf{S}\mathbf{v}_i = \lambda_i\mathbf{v}_i\) and \(\mathbf{S}\mathbf{v}_j = \lambda_j\mathbf{v}_j\).

2

Multiply first equation by \(\mathbf{v}_j^T\): \(\mathbf{v}_j^T\mathbf{S}\mathbf{v}_i = \lambda_i\mathbf{v}_j^T\mathbf{v}_i\). Multiply second equation by \(\mathbf{v}_i^T\): \(\mathbf{v}_i^T\mathbf{S}\mathbf{v}_j = \lambda_j\mathbf{v}_i^T\mathbf{v}_j\).

3

Since \(\mathbf{S}\) is symmetric: \(\mathbf{v}_j^T\mathbf{S}\mathbf{v}_i = (\mathbf{S}\mathbf{v}_j)^T\mathbf{v}_i = \mathbf{v}_i^T\mathbf{S}\mathbf{v}_j\). So both sides are equal: \[\lambda_i(\mathbf{v}_j^T\mathbf{v}_i) = \lambda_j(\mathbf{v}_i^T\mathbf{v}_j)\]

4

Rearrange: \((\lambda_i - \lambda_j)(\mathbf{v}_i^T\mathbf{v}_j) = 0\). Since \(\lambda_i \neq \lambda_j\), we get \(\mathbf{v}_i^T\mathbf{v}_j = 0\). ∎

Eigenvectors corresponding to distinct eigenvalues of a symmetric matrix are automatically orthogonal. No "Gram-Schmidt" needed.

VIII

Principal Components — Complete Picture

// loadings · scores · projection · full decomposition

Loadings, Scores, and Projection

PCA Projection to k Dimensions

\[\mathbf{V}_k = [\mathbf{v}_1 | \mathbf{v}_2 | \cdots | \mathbf{v}_k] \in \mathbb{R}^{d\times k} \quad\text{(loadings matrix)}\] \[\mathbf{Z} = \tilde{\mathbf{X}}\mathbf{V}_k \in \mathbb{R}^{m\times k} \quad\text{(scores / projected data)}\] \[Z_{ij} = \tilde{\mathbf{x}}^{(i)}\cdot\mathbf{v}_j \quad\text{(score of sample }i\text{ on component }j\text{)}\]

V_k is the first k columns of the eigenvector matrix V. The scores Z are the coordinates in the new PCA subspace. Each column of Z has zero mean (since X̃ is centered) and variance λ_j.

Properties of the Scores

Zero mean: \(\mathbb{E}[\mathbf{Z}] = \mathbf{0}\) — inherited from centering.
Diagonal covariance: \(\text{Cov}(\mathbf{Z}) = \frac{1}{m-1}\mathbf{Z}^T\mathbf{Z} = \boldsymbol{\Lambda}_k = \text{diag}(\lambda_1,\ldots,\lambda_k)\) — the principal components are uncorrelated!
Variance: The \(j\)-th column of \(\mathbf{Z}\) has variance \(\lambda_j\) — the eigenvalue tells you exactly how much variance that component captures.

// Proving Cov(Z) = Λ_k

1

\(\text{Cov}(\mathbf{Z}) = \frac{1}{m-1}(\tilde{\mathbf{X}}\mathbf{V}_k)^T(\tilde{\mathbf{X}}\mathbf{V}_k) = \frac{1}{m-1}\mathbf{V}_k^T\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}\mathbf{V}_k\)

2

\(= \mathbf{V}_k^T\mathbf{S}\mathbf{V}_k\)

3

Since \(\mathbf{S}\mathbf{v}_j = \lambda_j\mathbf{v}_j\) and \(\mathbf{v}_i^T\mathbf{v}_j = \delta_{ij}\): \[(\mathbf{V}_k^T\mathbf{S}\mathbf{V}_k)_{ij} = \mathbf{v}_i^T\mathbf{S}\mathbf{v}_j = \lambda_j\mathbf{v}_i^T\mathbf{v}_j = \lambda_j\delta_{ij}\] \[\Rightarrow \mathbf{V}_k^T\mathbf{S}\mathbf{V}_k = \boldsymbol{\Lambda}_k \quad\blacksquare\]

The projected data has a diagonal covariance — the components are decorrelated. This is the second key property of PCA (the first being variance maximization).

IX

Explained Variance & Choosing k

// variance ratio · scree plot · cumulative explained variance · elbow method

Explained Variance Ratio

Explained Variance

\[\text{EVR}(k) = \frac{\sum_{j=1}^{k}\lambda_j}{\sum_{j=1}^{d}\lambda_j} = \frac{\sum_{j=1}^{k}\lambda_j}{\text{tr}(\mathbf{S})}\]

tr(S) = total variance = sum of all variances. EVR(k) ∈ [0,1] — the fraction of total information retained by keeping k components. Common thresholds: 90%, 95%, 99%.

Methods for Choosing k

Method	Rule	Best For
Cumulative EVR threshold	Keep smallest \(k\) with EVR ≥ 0.95 (or 0.99)	General dimensionality reduction with known target
Scree plot / Elbow	Plot \(\lambda_j\) vs \(j\); keep components before the "elbow" where eigenvalues level off	Exploratory analysis; finding natural breaks
Kaiser criterion	Keep components with \(\lambda_j > 1\) (with standardized data)	Factor analysis; features on same scale
Reconstruction error	Keep \(k\) such that \(\\|\mathbf{X} - \hat{\mathbf{X}}_k\\|_F^2 / \\|\mathbf{X}\\|_F^2 \leq \epsilon\)	Compression with quality constraint
Cross-validation	Choose \(k\) that minimizes downstream task error	When PCA is a preprocessing step for classification
Optimal Hard Threshold	Gavish-Donoho formula (depends on noise level)	Denoising with known noise variance

X

Geometric Interpretation

// rotation · projection · new orthonormal basis · change of basis

PCA is a rigid rotation of the coordinate system — no stretching, no shearing — followed by a projection onto a subspace.

PCA as Change of Basis

Geometric Decomposition

\[\underbrace{\tilde{\mathbf{x}}}_{\text{original}} \xrightarrow{\mathbf{V}^T} \underbrace{\mathbf{z}}_{\text{all PCs}} \xrightarrow{\text{keep first }k} \underbrace{\mathbf{z}_k}_{\text{low-dim rep}}\] \[\mathbf{z} = \mathbf{V}^T\tilde{\mathbf{x}} \in \mathbb{R}^d \quad\text{(rotation — invertible, no info loss)}\] \[\mathbf{z}_k = \mathbf{V}_k^T\tilde{\mathbf{x}} \in \mathbb{R}^k \quad\text{(projection — information loss = }1-\text{EVR}(k)\text{)}\]

Rotation step (\(\mathbf{V}^T\)): This is an orthogonal transformation — it preserves lengths, angles, and distances. The data cloud shape is unchanged; only the coordinate labels change. No information is lost.
Projection step (keep k): Discard dimensions \(k+1, \ldots, d\). This projects the data onto the \(k\)-dimensional principal subspace. Information is lost — but the minimum possible amount (for a linear projection).
Optimality: PCA is the optimal linear compressor: no other linear projection onto a \(k\)-dimensional subspace retains more variance or has lower reconstruction error.

∠

The Data Stays in the Same Place

PCA does not move the data points — it moves the coordinate axes. The ellipsoidal cloud stays fixed; we rotate the rulers we use to measure it until they align with the cloud's natural axes. This is why PCA is purely a linear algebraic transformation: \(\mathbf{V}^T\mathbf{V} = \mathbf{I}\), det(\(\mathbf{V}\)) = ±1.

XI

Reconstruction & Reconstruction Error

// inverse transform · Frobenius norm · optimality theorem

Reconstruction from PCA Scores

Reconstruction

\[\hat{\tilde{\mathbf{x}}}^{(i)} = \mathbf{V}_k\mathbf{z}_k^{(i)} = \mathbf{V}_k\mathbf{V}_k^T\tilde{\mathbf{x}}^{(i)} \in \mathbb{R}^d\] \[\mathbf{P}_k = \mathbf{V}_k\mathbf{V}_k^T \in \mathbb{R}^{d\times d} \quad\text{(projection matrix onto principal subspace)}\]

P_k is an orthogonal projector: P_k² = P_k, P_k^T = P_k. The final reconstruction in original space: x̂^{(i)} = V_k z_k^{(i)} + μ (add the mean back).

Reconstruction Error

Total Reconstruction Error (Frobenius Norm)

\[\mathcal{E}_k = \|\tilde{\mathbf{X}} - \hat{\tilde{\mathbf{X}}}_k\|_F^2 = \|\tilde{\mathbf{X}}(\mathbf{I} - \mathbf{V}_k\mathbf{V}_k^T)\|_F^2\] \[= (m-1)\sum_{j=k+1}^{d}\lambda_j = (m-1)\left(\text{tr}(\mathbf{S}) - \sum_{j=1}^{k}\lambda_j\right)\]

The reconstruction error equals the sum of discarded eigenvalues (times m-1). Minimizing reconstruction error is exactly equivalent to maximizing retained variance — the two formulations of PCA are identical.

★ Eckart-Young-Mirsky Theorem

Among all rank-\(k\) approximations \(\hat{\mathbf{X}}\) of \(\tilde{\mathbf{X}}\), the one minimizing the Frobenius norm \(\|\tilde{\mathbf{X}} - \hat{\mathbf{X}}\|_F\) is given by the PCA truncation: \(\hat{\tilde{\mathbf{X}}}_k = \mathbf{Z}_k\mathbf{V}_k^T\). This is the foundation of optimal data compression via PCA.

XII

Connection to SVD

// singular value decomposition · thin SVD · equivalence proof · numerical stability

PCA is almost always computed via SVD rather than direct eigendecomposition of \(\mathbf{S}\). SVD is numerically more stable (it avoids squaring the condition number) and works even when \(d > m\).

SVD of the Data Matrix

Thin SVD of Centered Data Matrix

\[\tilde{\mathbf{X}} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T\] \[\mathbf{U} \in \mathbb{R}^{m\times r}, \quad \boldsymbol{\Sigma} = \text{diag}(\sigma_1,\ldots,\sigma_r), \quad \mathbf{V} \in \mathbb{R}^{d\times r}\] \[\mathbf{U}^T\mathbf{U} = \mathbf{I}_r, \quad \mathbf{V}^T\mathbf{V} = \mathbf{I}_r, \quad r = \text{rank}(\tilde{\mathbf{X}})\]

σ₁ ≥ σ₂ ≥ ··· ≥ σ_r > 0 are the singular values. U = left singular vectors (m×r). V = right singular vectors (d×r) = PCA loadings. r = min(m,d) in the full case.

Proof: SVD Right Singular Vectors = PCA Eigenvectors

// Deriving the equivalence SVD ↔ PCA

1

Compute \(\frac{1}{m-1}\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}\) using the SVD: \[\frac{1}{m-1}\tilde{\mathbf{X}}^T\tilde{\mathbf{X}} = \frac{1}{m-1}(\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T)^T(\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T) = \frac{1}{m-1}\mathbf{V}\boldsymbol{\Sigma}^T\mathbf{U}^T\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T\]

2

Since \(\mathbf{U}^T\mathbf{U} = \mathbf{I}\): \[\mathbf{S} = \frac{1}{m-1}\mathbf{V}\boldsymbol{\Sigma}^2\mathbf{V}^T\]

3

This is the eigendecomposition of \(\mathbf{S}\) with: \[\lambda_j = \frac{\sigma_j^2}{m-1}, \quad \mathbf{v}_j = \text{right singular vector}_j\]

The PCA eigenvalues are the squared singular values divided by (m-1). The PCA loading vectors are exactly the right singular vectors V. ∎

SVD → PCA Correspondence

\[\lambda_j = \frac{\sigma_j^2}{m-1}, \quad \mathbf{v}_j = \text{right singular vector of }\tilde{\mathbf{X}}\] \[\mathbf{Z} = \tilde{\mathbf{X}}\mathbf{V}_k = \mathbf{U}_k\boldsymbol{\Sigma}_k \quad\text{(scores = left singular vectors × singular values)}\]

Why SVD is Preferred Numerically

Condition number: \(\kappa(\mathbf{S}) = \lambda_1/\lambda_d\). Since \(\lambda_j = \sigma_j^2/(m-1)\), we have \(\kappa(\mathbf{S}) = \kappa(\tilde{\mathbf{X}})^2\). Eigendecomposing \(\mathbf{S}\) squares the condition number — amplifying numerical errors. SVD works directly on \(\tilde{\mathbf{X}}\).
Memory when d > m: Computing \(\mathbf{S} \in \mathbb{R}^{d\times d}\) when \(d = 50000\) features and \(m = 500\) samples would require 20 GB. SVD via thin decomposition operates on the \(m \times m\) matrix \(\tilde{\mathbf{X}}\tilde{\mathbf{X}}^T\) instead.
Never explicitly form S: In practice, call np.linalg.svd(X_centered, full_matrices=False) — not np.linalg.eig(X.T @ X).

XIII

Whitening & Standardization

// ZCA · sphering · when to standardize · scale invariance

When to Standardize Before PCA

PCA is not scale-invariant. Features with larger variance dominate the covariance matrix regardless of their actual importance. If features are on different scales (e.g., income in dollars vs age in years), you must standardize first.

Standardization Before PCA

\[x_{ij}' = \frac{x_{ij} - \mu_j}{\sigma_j} \quad\Rightarrow\quad \mathbf{S} \text{ becomes the correlation matrix}\]

Standardized PCA decomposes the correlation matrix. Each feature contributes equally to the total variance (each has variance 1). Use when features are in different units. Skip when features share units (e.g., pixel intensities in images).

PCA Whitening (Sphering)

Whitening transforms the data so that the covariance matrix is the identity: \(\text{Cov}(\mathbf{z}_{\text{white}}) = \mathbf{I}\). Every direction has unit variance.

PCA Whitening

\[\mathbf{z}_{\text{white}} = \boldsymbol{\Lambda}_k^{-1/2}\mathbf{V}_k^T\tilde{\mathbf{x}} = \text{diag}(1/\sqrt{\lambda_1},\ldots,1/\sqrt{\lambda_k})\mathbf{V}_k^T\tilde{\mathbf{x}}\]

First rotate (V_k^T), then scale each dimension by 1/√λⱼ. This decorrelates AND equalizes variances. ZCA whitening additionally rotates back: z_ZCA = V Λ^{-1/2} V^T x̃ — whitened data stays in original feature space. Used as preprocessing for CNNs (makes optimization landscape more spherical).

XIV

Kernel PCA — Nonlinear Extension

// feature map · kernel trick · dual covariance · Gram matrix

Standard PCA can only find linear subspaces. Kernel PCA implicitly maps data to a high-dimensional feature space and performs PCA there — enabling nonlinear dimensionality reduction without ever explicitly computing the feature map.

The Kernel Trick for PCA

Replace \(\mathbf{x}\) with \(\phi(\mathbf{x})\) where \(\phi: \mathbb{R}^d \to \mathcal{H}\) (a potentially infinite-dimensional Hilbert space). The covariance in feature space is:

Covariance in Feature Space

\[\mathbf{C}_\phi = \frac{1}{m}\sum_{i=1}^m\phi(\mathbf{x}^{(i)})\phi(\mathbf{x}^{(i)})^T\]

The eigenvectors of C_φ lie in span{φ(x^(1)),...,φ(x^(m))}. So we can write v = Σ αᵢ φ(x^(i)). The eigenvalue equation reduces to solving an m×m Gram matrix eigenvalue problem.

Kernel PCA — Gram Matrix Eigenvalue Problem

\[\mathbf{K}_{ij} = k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)})\] \[\tilde{\mathbf{K}} = \mathbf{K} - \mathbf{1}_m\mathbf{K} - \mathbf{K}\mathbf{1}_m + \mathbf{1}_m\mathbf{K}\mathbf{1}_m \quad\text{(centered Gram matrix)}\] \[\tilde{\mathbf{K}}\boldsymbol{\alpha}_j = \mu_j\boldsymbol{\alpha}_j \quad\Rightarrow\quad \text{scores: } z_{ij} = \frac{1}{\sqrt{\mu_j}}\sum_{l=1}^m\alpha_{jl} k(\mathbf{x}^{(l)}, \mathbf{x}^{(i)})\]

Common kernels: RBF k(x,x')=exp(−‖x−x'‖²/2σ²), polynomial k(x,x')=(x^Tx'+c)^p, sigmoid. Kernel PCA is O(m²) memory and O(m³) time — expensive for large datasets.

XV

Algorithms — Power Iteration & Randomized PCA

// power method · deflation · randomized SVD · computational complexity

Power Iteration — The Classical Algorithm

Power Iteration for Largest Eigenvector

            # Find the dominant eigenvector of S = X^T X / (m-1)

            initialize: v = random unit vector in ℝ^d

            repeat until convergence:

              z = S @ v
                      # matrix-vector product
              O(d²)

              v = z / ‖z‖
                   # renormalize

              λ₁ = v^T @ S @ v
               # Rayleigh quotient estimate

            # Convergence rate: (λ₂/λ₁)^t — fast when spectral gap is large

            # Deflation for subsequent eigenvectors:

            S_deflated = S − λ₁ * outer(v,
            v)

            # Repeat on S_deflated for v₂, v₃, ...

Randomized SVD — The Modern Standard

For large \(m, d\), computing the full SVD is \(O(\min(md^2, m^2d))\). Randomized PCA (Halko-Martinsson-Tropp 2011) computes a near-optimal rank-\(k\) approximation in \(O(mdk)\) — linear in \(k\).

Randomized PCA (sklearn default for large data)

            # Step 1: Random projection to sketch the range of X

            Ω = randn(d, k+p)           # random Gaussian matrix, p=10 oversampling

            Y = X_centered @ Ω
                    # O(mdk) — range
              approximation

            # (Optional) Power iteration for accuracy:

            for i in range(n_iter): Y =
            X_centered @ (X_centered.T @ Y)

            # Step 2: QR decomposition of Y

            Q, _ = qr(Y)
                         #
              orthonormal basis for range(X)

            # Step 3: Project and SVD in small space

            B = Q.T @ X_centered
                  # (k+p) × d — small!

            U_hat, Σ, Vt = svd(B, full=False)

            U = Q @ U_hat
                       # full left singular
              vectors

Complexity Comparison

Algorithm	Time Complexity	Memory	Best For
Full eigendecomposition of S	\(O(d^3)\)	\(O(d^2)\)	Small \(d\), numerical stability unimportant
Full SVD of \(\tilde{\mathbf{X}}\)	\(O(\min(md^2, m^2d))\)	\(O(md)\)	General, numerically stable
Power iteration (1 component)	\(O(mdt)\)	\(O(d)\)	Streaming data; very large d
Randomized SVD (rank k)	\(O(mdk)\)	\(O((m+d)k)\)	Large \(m,d\); extracting top-k components
Incremental PCA	\(O(d^2 \cdot \text{batches})\)	\(O(d^2)\)	Data too large to fit in memory

XVI

Applications in Machine Learning

// dimensionality reduction · denoising · visualization · face recognition · preprocessing

Preprocessing for ML

X_reduced = pca.fit_transform(X)

Reduce features before training a classifier or regressor. Removes multicollinearity (the covariance of PCA scores is diagonal). Speeds up training by \(d/k\). Regularizes against overfitting — fewer degrees of freedom. Most useful when \(d \gg m\).

Data Visualization

2D: Z = X @ V[:, :2]

Project to 2D or 3D for exploratory visualization. Reveals cluster structure, outliers, and data geometry that is invisible in high dimensions. First step in any EDA for high-dim tabular data or embeddings. Your breast cancer and Titanic datasets benefit directly.

Image Compression

X̂ = Z_k @ V_k.T + μ

Represent each image as k PCA coefficients instead of H×W pixels. Compression ratio = (H×W)/k. Quality controlled by k (or cumulative EVR threshold). Eigenfaces (PCA on face images) was state-of-the-art face recognition pre-CNN.

Denoising

X_denoised = pca.inverse_transform(Z_k)

Project to PCA subspace (removing noise-dominated components), then reconstruct. Works when signal variance >> noise variance — the first k components capture signal; the rest is noise. Used in fMRI, spectroscopy, financial data cleaning.

Anomaly Detection

score = ‖x - x̂‖² (reconstruction error)

Inliers have low reconstruction error (they lie near the principal subspace). Anomalies have high reconstruction error (they deviate from the normal pattern). Simple, interpretable, and computationally cheap compared to deep autoencoders.

Feature Engineering

latent = pca.transform(X)[:, :k]

PCA features are uncorrelated — avoiding multicollinearity issues in logistic regression and linear models. Whitened PCA features have equal variance — improving gradient descent convergence. Standard preprocessing for linear classifiers on text/image data.

XVII

PCA vs Other Dimensionality Reduction Methods

// t-SNE · UMAP · LDA · ICA · autoencoders

Method	Type	Preserves	Weaknesses	When to Use
PCA	Linear	Global variance (linear structure)	Misses nonlinear structure; sensitive to outliers	Default — fast, interpretable, reversible
t-SNE	Nonlinear	Local neighborhood structure	Non-deterministic; no out-of-sample; slow; can distort global structure	Visualization only (2D/3D); discovering clusters
UMAP	Nonlinear	Local + approximate global	Hyperparameter sensitive; still difficult to interpret	Visualization + faster alternative to t-SNE
LDA	Linear (supervised)	Class separability	Requires labels; at most C-1 components (C = classes); Gaussian assumption	Classification preprocessing; maximize between-class distance
ICA	Linear	Statistical independence	Requires non-Gaussian sources; order not defined	Signal separation (EEG, audio mixing)
Autoencoders	Nonlinear	Task-relevant features	Requires training; not interpretable; overparameterized	When PCA reconstruction error is unacceptably high
Random Projections	Linear (random)	Pairwise distances (JL lemma)	Suboptimal compression; not interpretable	Very large \(d\) where PCA is too slow

→

Practical Decision Rule

Start with PCA. Always. It's fast, reproducible, interpretable, and reversible. If PCA gives unsatisfactory cluster separation for visualization, try UMAP. If features are not Gaussian and independence matters, consider ICA. If class labels are available and you need maximum discrimination, use LDA. Only reach for autoencoders when linear methods provably fail and nonlinear structure is essential.

XVIII

The Complete Mental Model

// everything connected · one derivation thread · all the links

Fig 3. PCA complete derivation path — from the curse of dimensionality through covariance, Lagrangian optimization, eigenvalue problem, and out to all applications.

The complete story as a single coherent thread:

Motivation: High-dimensional data is geometrically pathological — distances concentrate, data is sparse, algorithms break. But real data often lives on a low-dimensional manifold embedded in high-dimensional space.
Setup: Center the data (\(\tilde{\mathbf{x}} = \mathbf{x} - \boldsymbol{\mu}\)). Compute the sample covariance matrix \(\mathbf{S} = \frac{1}{m-1}\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}\). This is symmetric and PSD — all eigenvalues are real and non-negative by the Spectral Theorem.
Optimization: The first PC is \(\mathbf{v}_1 = \arg\max_{\|\mathbf{v}\|=1}\mathbf{v}^T\mathbf{S}\mathbf{v}\) — the direction of maximum variance. By the Rayleigh quotient, this is bounded by the largest eigenvalue of \(\mathbf{S}\).
Lagrangian derivation: Form \(\mathcal{L} = \mathbf{v}^T\mathbf{S}\mathbf{v} - \lambda(\mathbf{v}^T\mathbf{v}-1)\). Stationarity gives \(\mathbf{S}\mathbf{v} = \lambda\mathbf{v}\). The objective value at the optimum equals \(\lambda\) — so we want the largest eigenvalue. The \(k\)-th PC is the \(k\)-th eigenvector.
Eigenvectors are orthogonal — proven directly from the symmetry of \(\mathbf{S}\). PCA scores are therefore decorrelated: \(\text{Cov}(\mathbf{Z}) = \boldsymbol{\Lambda}_k\).
SVD connection: \(\lambda_j = \sigma_j^2/(m-1)\) and the right singular vectors of \(\tilde{\mathbf{X}}\) are exactly the PC loadings. Always use SVD numerically — it avoids squaring the condition number.
Reconstruction: The truncated reconstruction \(\hat{\mathbf{X}}_k = \mathbf{Z}_k\mathbf{V}_k^T + \boldsymbol{\mu}\) is the optimal rank-\(k\) approximation (Eckart-Young). The reconstruction error equals the sum of discarded eigenvalues.
Whitening additionally scales each PC to unit variance — making the optimization landscape spherical and improving gradient descent convergence for downstream models.
Kernel PCA extends to nonlinear subspaces via the kernel trick — solve an \(m \times m\) Gram matrix eigenvalue problem instead of the \(d \times d\) covariance problem.
Applications: Dimensionality reduction, visualization, denoising, anomaly detection, compression, feature engineering. PCA is the first tool to reach for — it's fast, interpretable, and reversible.

★ The Unifying Thread

PCA is the nexus where linear algebra (eigendecomposition, SVD, change of basis), statistics (variance, covariance, MLE, sample estimation), and optimization (Lagrange multipliers, Rayleigh quotient, constrained maximization) converge on a single clean answer: rotate the coordinate system so axes align with the data's natural variance structure. Every concept from the Linear Regression, Lagrange Multipliers, and Bias-Variance masterclasses appears here — PCA is where they all meet.

PCA Principal Component Analysis