Before understanding what PCA solves, we must understand the problem it addresses. High-dimensional data is not just computationally expensive — it is geometrically pathological in ways that break nearly every ML assumption.
Consider a unit hypercube \([0,1]^d\). The fraction of its volume within distance \(\epsilon\) of the boundary is \(1 - (1-2\epsilon)^d\). For \(d = 100\) and \(\epsilon = 0.01\):
For \(m\) points drawn uniformly from \([0,1]^d\), the ratio of maximum to minimum pairwise distance converges to 1 as \(d \to \infty\):
To cover a \(d\)-dimensional unit hypercube with a coverage fraction \(r\) per axis, you need \(m = r^{-d}\) samples. For \(r = 0.1\) and \(d = 100\): \(m = 10^{100}\) — more than atoms in the observable universe.
Real datasets with thousands of features have far fewer samples than dimensions. A 28×28 MNIST image lives in \(\mathbb{R}^{784}\) — but natural handwritten digits occupy a manifold of much lower intrinsic dimension. PCA finds this low-dimensional subspace. The intrinsic dimensionality is typically orders of magnitude smaller than the ambient dimension.
PCA finds a new coordinate system for your data — one that is aligned with the directions of maximum variance. The first axis points in the direction where the data varies most. The second axis is orthogonal to the first and captures the next most variance. And so on.
PCA finds a rotation of the coordinate system such that (1) variance is maximized along each new axis in order, (2) the new axes are mutually orthogonal (uncorrelated), and (3) we can keep only the first \(k\) axes and lose the minimum possible information. It is the optimal linear dimensionality reduction under mean squared error.
PCA operates on centered data. The mean captures the location of the data cloud; the covariance matrix captures its shape. We must remove the mean first so that PCA describes shape, not position.
The covariance matrix is the central object of PCA. Its properties determine everything:
The covariance matrix \(\mathbf{S}\) defines an ellipsoid: \(\{\mathbf{v} : \mathbf{v}^T\mathbf{S}^{-1}\mathbf{v} = 1\}\). This is the set of points at Mahalanobis distance 1 from the origin. Its axes are aligned with the eigenvectors of \(\mathbf{S}\), and the length of each axis is proportional to \(\sqrt{\lambda_i}\) (the square root of the corresponding eigenvalue).
By the Spectral Theorem for real symmetric matrices:
PCA's first principal component is the unit vector \(\mathbf{v}_1\) such that projecting the data onto \(\mathbf{v}_1\) gives maximum variance. Let's formalize this.
The projection of sample \(\mathbf{x}^{(i)}\) onto unit vector \(\mathbf{v}\) is the scalar \(z^{(i)} = \mathbf{v}^T\mathbf{x}^{(i)}\). The variance of these projected values across all samples:
The Rayleigh quotient of a symmetric matrix \(\mathbf{S}\) with respect to vector \(\mathbf{v}\) is:
The Rayleigh quotient is bounded: \(\lambda_{\min} \leq R(\mathbf{v}) \leq \lambda_{\max}\). The maximum is achieved when \(\mathbf{v} = \mathbf{v}_1\) (the eigenvector corresponding to the largest eigenvalue). The constrained problem (‖v‖=1) is equivalent to maximizing the unconstrained Rayleigh quotient.
Subsequent components are found by successively deflating and re-solving:
We now apply the Lagrange multiplier method (from the Lagrange Multipliers masterclass) to solve the PCA optimization problem exactly. This derivation shows why eigenvectors are the answer — it falls out from first principles.
Objective: maximize \(\mathbf{v}^T\mathbf{S}\mathbf{v}\). Constraint: \(\|\mathbf{v}\|^2 = 1\), i.e., \(g(\mathbf{v}) = \mathbf{v}^T\mathbf{v} - 1 = 0\).
The \(k\)-th principal component \(\mathbf{v}_k\) is the \(k\)-th eigenvector of the sample covariance matrix \(\mathbf{S}\) (ordered by decreasing eigenvalue). The variance explained by the \(k\)-th component equals the \(k\)-th eigenvalue \(\lambda_k\). This follows directly from the Lagrangian stationarity condition.
| Method | Rule | Best For |
|---|---|---|
| Cumulative EVR threshold | Keep smallest \(k\) with EVR ≥ 0.95 (or 0.99) | General dimensionality reduction with known target |
| Scree plot / Elbow | Plot \(\lambda_j\) vs \(j\); keep components before the "elbow" where eigenvalues level off | Exploratory analysis; finding natural breaks |
| Kaiser criterion | Keep components with \(\lambda_j > 1\) (with standardized data) | Factor analysis; features on same scale |
| Reconstruction error | Keep \(k\) such that \(\|\mathbf{X} - \hat{\mathbf{X}}_k\|_F^2 / \|\mathbf{X}\|_F^2 \leq \epsilon\) | Compression with quality constraint |
| Cross-validation | Choose \(k\) that minimizes downstream task error | When PCA is a preprocessing step for classification |
| Optimal Hard Threshold | Gavish-Donoho formula (depends on noise level) | Denoising with known noise variance |
PCA is a rigid rotation of the coordinate system — no stretching, no shearing — followed by a projection onto a subspace.
PCA does not move the data points — it moves the coordinate axes. The ellipsoidal cloud stays fixed; we rotate the rulers we use to measure it until they align with the cloud's natural axes. This is why PCA is purely a linear algebraic transformation: \(\mathbf{V}^T\mathbf{V} = \mathbf{I}\), det(\(\mathbf{V}\)) = ±1.
Among all rank-\(k\) approximations \(\hat{\mathbf{X}}\) of \(\tilde{\mathbf{X}}\), the one minimizing the Frobenius norm \(\|\tilde{\mathbf{X}} - \hat{\mathbf{X}}\|_F\) is given by the PCA truncation: \(\hat{\tilde{\mathbf{X}}}_k = \mathbf{Z}_k\mathbf{V}_k^T\). This is the foundation of optimal data compression via PCA.
PCA is almost always computed via SVD rather than direct eigendecomposition of \(\mathbf{S}\). SVD is numerically more stable (it avoids squaring the condition number) and works even when \(d > m\).
np.linalg.svd(X_centered, full_matrices=False) — not np.linalg.eig(X.T @ X).
PCA is not scale-invariant. Features with larger variance dominate the covariance matrix regardless of their actual importance. If features are on different scales (e.g., income in dollars vs age in years), you must standardize first.
Whitening transforms the data so that the covariance matrix is the identity: \(\text{Cov}(\mathbf{z}_{\text{white}}) = \mathbf{I}\). Every direction has unit variance.
Standard PCA can only find linear subspaces. Kernel PCA implicitly maps data to a high-dimensional feature space and performs PCA there — enabling nonlinear dimensionality reduction without ever explicitly computing the feature map.
Replace \(\mathbf{x}\) with \(\phi(\mathbf{x})\) where \(\phi: \mathbb{R}^d \to \mathcal{H}\) (a potentially infinite-dimensional Hilbert space). The covariance in feature space is:
For large \(m, d\), computing the full SVD is \(O(\min(md^2, m^2d))\). Randomized PCA (Halko-Martinsson-Tropp 2011) computes a near-optimal rank-\(k\) approximation in \(O(mdk)\) — linear in \(k\).
| Algorithm | Time Complexity | Memory | Best For |
|---|---|---|---|
| Full eigendecomposition of S | \(O(d^3)\) | \(O(d^2)\) | Small \(d\), numerical stability unimportant |
| Full SVD of \(\tilde{\mathbf{X}}\) | \(O(\min(md^2, m^2d))\) | \(O(md)\) | General, numerically stable |
| Power iteration (1 component) | \(O(mdt)\) | \(O(d)\) | Streaming data; very large d |
| Randomized SVD (rank k) | \(O(mdk)\) | \(O((m+d)k)\) | Large \(m,d\); extracting top-k components |
| Incremental PCA | \(O(d^2 \cdot \text{batches})\) | \(O(d^2)\) | Data too large to fit in memory |
Reduce features before training a classifier or regressor. Removes multicollinearity (the covariance of PCA scores is diagonal). Speeds up training by \(d/k\). Regularizes against overfitting — fewer degrees of freedom. Most useful when \(d \gg m\).
Project to 2D or 3D for exploratory visualization. Reveals cluster structure, outliers, and data geometry that is invisible in high dimensions. First step in any EDA for high-dim tabular data or embeddings. Your breast cancer and Titanic datasets benefit directly.
Represent each image as k PCA coefficients instead of H×W pixels. Compression ratio = (H×W)/k. Quality controlled by k (or cumulative EVR threshold). Eigenfaces (PCA on face images) was state-of-the-art face recognition pre-CNN.
Project to PCA subspace (removing noise-dominated components), then reconstruct. Works when signal variance >> noise variance — the first k components capture signal; the rest is noise. Used in fMRI, spectroscopy, financial data cleaning.
Inliers have low reconstruction error (they lie near the principal subspace). Anomalies have high reconstruction error (they deviate from the normal pattern). Simple, interpretable, and computationally cheap compared to deep autoencoders.
PCA features are uncorrelated — avoiding multicollinearity issues in logistic regression and linear models. Whitened PCA features have equal variance — improving gradient descent convergence. Standard preprocessing for linear classifiers on text/image data.
| Method | Type | Preserves | Weaknesses | When to Use |
|---|---|---|---|---|
| PCA | Linear | Global variance (linear structure) | Misses nonlinear structure; sensitive to outliers | Default — fast, interpretable, reversible |
| t-SNE | Nonlinear | Local neighborhood structure | Non-deterministic; no out-of-sample; slow; can distort global structure | Visualization only (2D/3D); discovering clusters |
| UMAP | Nonlinear | Local + approximate global | Hyperparameter sensitive; still difficult to interpret | Visualization + faster alternative to t-SNE |
| LDA | Linear (supervised) | Class separability | Requires labels; at most C-1 components (C = classes); Gaussian assumption | Classification preprocessing; maximize between-class distance |
| ICA | Linear | Statistical independence | Requires non-Gaussian sources; order not defined | Signal separation (EEG, audio mixing) |
| Autoencoders | Nonlinear | Task-relevant features | Requires training; not interpretable; overparameterized | When PCA reconstruction error is unacceptably high |
| Random Projections | Linear (random) | Pairwise distances (JL lemma) | Suboptimal compression; not interpretable | Very large \(d\) where PCA is too slow |
Start with PCA. Always. It's fast, reproducible, interpretable, and reversible. If PCA gives unsatisfactory cluster separation for visualization, try UMAP. If features are not Gaussian and independence matters, consider ICA. If class labels are available and you need maximum discrimination, use LDA. Only reach for autoencoders when linear methods provably fail and nonlinear structure is essential.
The complete story as a single coherent thread:
PCA is the nexus where linear algebra (eigendecomposition, SVD, change of basis), statistics (variance, covariance, MLE, sample estimation), and optimization (Lagrange multipliers, Rayleigh quotient, constrained maximization) converge on a single clean answer: rotate the coordinate system so axes align with the data's natural variance structure. Every concept from the Linear Regression, Lagrange Multipliers, and Bias-Variance masterclasses appears here — PCA is where they all meet.