// Machine Learning — Statistical Foundations

Linear
Regression

The bedrock of supervised learning — from first principles through the Normal Equation, MLE derivation, Gauss-Markov theorem, regularization, and the full geometric picture. Everything, once and for all.

Least Squares Normal Equation MLE / Gaussian Noise Gauss-Markov Gradient Descent Bias-Variance Ridge / Lasso Polynomial Regression R² and Metrics
// Table of Contents
  1. 01What is Regression? The Problem Setup
  2. 02The Linear Model and Hypothesis
  3. 03The Loss Function — MSE and Why
  4. 04The Normal Equation — Closed-Form Solution
  5. 05Geometric Interpretation — Projection
  6. 06Statistical Foundation — MLE Derivation
  7. 07Gradient Descent for Linear Regression
  8. 08Properties of OLS — Gauss-Markov Theorem
  9. 09Bias-Variance Tradeoff
  10. 10Feature Scaling and Normalization
  11. 11Polynomial Regression — Extending the Model
  12. 12Regularization — Ridge, Lasso, Elastic Net
  13. 13Assumptions and Diagnostics
  14. 14Evaluation Metrics
  15. 15Normal Equation vs Gradient Descent
  16. 16The Full Mental Model
§01

What is Regression? The Problem Setup

Regression is the task of learning a mapping from inputs to a continuous output. Given a training set of \(m\) examples \(\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{m}\) where \(\mathbf{x}^{(i)} \in \mathbb{R}^n\) and \(y^{(i)} \in \mathbb{R}\), we want to find a function \(h: \mathbb{R}^n \to \mathbb{R}\) that predicts \(y\) well for new, unseen \(\mathbf{x}\).

The word "regression" (Francis Galton, 1886) originally described a statistical phenomenon — the heights of children "regress toward the mean" compared to their parents. The mathematical machinery he developed became the foundation of modern supervised learning.

The Three Questions Regression Must Answer

  • What form should \(h\) take? — We choose a hypothesis class (linear functions, polynomials, etc.).
  • What does "predict well" mean? — We choose a loss function to measure prediction error.
  • How do we find the best \(h\)? — We choose an optimization algorithm to minimize the loss.

Linear regression gives specific, mathematically justified answers to all three. Crucially, the answers are not arbitrary choices — they fall out of a deep probabilistic assumption about how data is generated. We'll derive this in §06.

✓ Supervised Learning Framing

In supervised learning, we assume data is drawn i.i.d. from some unknown joint distribution \(P(\mathbf{x}, y)\). We want our hypothesis \(h_{\boldsymbol{\theta}}\) to minimize the expected loss — the population risk. We can only optimize empirical risk (over training data) and hope it generalizes. This tension is the bias-variance tradeoff (§09).

§02

The Linear Model and Hypothesis

Simple Linear Regression (1 Feature)

With a single input feature \(x\), the model is a line:

Simple Linear Hypothesis \[ h_{\theta_0, \theta_1}(x) = \theta_0 + \theta_1 x \]

Here \(\theta_0\) is the intercept (the value of \(y\) when \(x = 0\)) and \(\theta_1\) is the slope (the change in \(y\) per unit change in \(x\)). Together they define a line in the 2D plane — we adjust them to fit the data.

Multiple Linear Regression (n Features)

With \(n\) features \(x_1, x_2, \ldots, x_n\):

Multiple Linear Hypothesis \[ h_{\boldsymbol{\theta}}(\mathbf{x}) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta_0 + \sum_{j=1}^{n}\theta_j x_j \]

To simplify notation, we augment the feature vector with a 1: \(\mathbf{x} = [1, x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^{n+1}\) and \(\boldsymbol{\theta} = [\theta_0, \theta_1, \ldots, \theta_n]^T \in \mathbb{R}^{n+1}\). Then:

Compact Vector Form \[ h_{\boldsymbol{\theta}}(\mathbf{x}) = \boldsymbol{\theta}^T \mathbf{x} \]

Matrix Form — Full Dataset

For all \(m\) training examples simultaneously, stack feature vectors row-by-row into the design matrix \(\mathbf{X} \in \mathbb{R}^{m \times (n+1)}\):

Design Matrix \[ \mathbf{X} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} & \cdots & x_n^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & & & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)} & \cdots & x_n^{(m)} \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \]
\[ \hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\theta} \in \mathbb{R}^m \]
💡 The Linear Assumption

The model assumes \(y\) is linear in the parameters \(\boldsymbol{\theta}\) — not necessarily linear in the raw features. We can include \(x^2, x^3, \log x\) etc. as columns of \(\mathbf{X}\), and the model is still "linear" in the sense that matters for optimization. This is exactly polynomial regression (§11).

Simple Linear Regression Model Diagram
Fig 1. Simple Linear Regression Model. The best-fitting line minimizes the distance between predicted values (on the line) and actual data points.
§03

The Loss Function — MSE and Why

We need a number that measures how wrong our predictions are. The most natural choice: measure the difference (residual) between the prediction \(\hat{y}^{(i)}\) and the truth \(y^{(i)}\).

Residuals

The residual for example \(i\) is:

\[ e^{(i)} = y^{(i)} - \hat{y}^{(i)} = y^{(i)} - \boldsymbol{\theta}^T\mathbf{x}^{(i)} \]

Why not just sum the residuals? Positive and negative residuals cancel. Why not sum absolute values? \(|e|\) is not differentiable at zero — inconvenient for calculus. The standard choice: sum of squared residuals.

The Mean Squared Error (MSE) Loss

MSE Loss \[ J(\boldsymbol{\theta}) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}) - y^{(i)}\right)^2 = \frac{1}{2m}\|\mathbf{X}\boldsymbol{\theta} - \mathbf{y}\|_2^2 \]

The \(\frac{1}{2}\) factor is a computational convenience — when we differentiate, the 2 from the power rule and the \(\frac{1}{2}\) cancel neatly, leaving cleaner gradients. It doesn't affect the location of the minimum.

Why Squared Error Specifically?

Three justifications, each from a different perspective:

  • Statistical: If noise is Gaussian (\(\epsilon \sim \mathcal{N}(0, \sigma^2)\)), minimizing MSE is equivalent to Maximum Likelihood Estimation. We prove this rigorously in §06 — MSE isn't arbitrary, it falls out of the assumption \(y = \boldsymbol{\theta}^T\mathbf{x} + \epsilon\).
  • Mathematical: The square makes \(J\) a smooth, strictly convex quadratic function of \(\boldsymbol{\theta}\). It has a unique global minimum that can be found analytically or reliably by gradient descent. No other common loss shares all three of these properties.
  • Geometric: Minimizing \(\sum e_i^2\) is equivalent to finding the orthogonal projection of \(\mathbf{y}\) onto the column space of \(\mathbf{X}\). This gives the solution its elegant geometric interpretation (§05).
⚠ When MSE Fails

MSE penalizes large errors quadratically — a residual of 10 contributes 100× as much as a residual of 1. This means outliers dominate the loss and can severely distort the fitted line. In that case, consider Mean Absolute Error (MAE) — more robust but non-differentiable at 0 — or Huber loss, which is quadratic for small residuals and linear for large ones.

Alternative Loss Functions

Loss Formula Properties Use When
MSE \(\frac{1}{m}\sum e_i^2\) Smooth, convex, differentiable Gaussian noise assumed; no outliers
MAE \(\frac{1}{m}\sum |e_i|\) Robust, but not differentiable at 0 Outliers present; want median regression
Huber MSE if \(|e|<\delta\), else MAE Best of both worlds Moderate outliers; need differentiability
Log-Cosh \(\frac{1}{m}\sum\log\cosh(e_i)\) Smooth approximation to MAE Outlier-resistant with smooth gradients
§04

The Normal Equation — Closed-Form Solution

Because MSE is a convex quadratic in \(\boldsymbol{\theta}\), we can find the exact minimum analytically by setting the gradient to zero. This gives the Normal Equation — one of the most important results in all of statistics.

Full Derivation

// Deriving the Normal Equation — setting ∇J = 0 and solving
1
Expand the loss: \(\displaystyle J(\boldsymbol{\theta}) = \frac{1}{2m}\|\mathbf{X}\boldsymbol{\theta} - \mathbf{y}\|^2 = \frac{1}{2m}(\mathbf{X}\boldsymbol{\theta}-\mathbf{y})^T(\mathbf{X}\boldsymbol{\theta}-\mathbf{y})\)
2
Expand the product: \(\displaystyle = \frac{1}{2m}\left(\boldsymbol{\theta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\theta} - \boldsymbol{\theta}^T\mathbf{X}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol{\theta} + \mathbf{y}^T\mathbf{y}\right)\)
Note: y^T Xθ = θ^T X^T y (both are scalars, so equal to their own transpose)
3
Simplify: \(\displaystyle J(\boldsymbol{\theta}) = \frac{1}{2m}\left(\boldsymbol{\theta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\theta} - 2\boldsymbol{\theta}^T\mathbf{X}^T\mathbf{y} + \mathbf{y}^T\mathbf{y}\right)\)
4
Take the gradient with respect to \(\boldsymbol{\theta}\) using matrix calculus identities: \[\frac{\partial}{\partial\boldsymbol{\theta}}\left(\boldsymbol{\theta}^T\mathbf{A}\boldsymbol{\theta}\right) = 2\mathbf{A}\boldsymbol{\theta} \quad (\text{for symmetric } \mathbf{A})\] \[\frac{\partial}{\partial\boldsymbol{\theta}}\left(\boldsymbol{\theta}^T\mathbf{b}\right) = \mathbf{b}\]
Here A = X^T X (which is symmetric since (X^T X)^T = X^T X) and b = X^T y
5
Therefore: \[\nabla_{\boldsymbol{\theta}} J = \frac{1}{2m}\left(2\mathbf{X}^T\mathbf{X}\boldsymbol{\theta} - 2\mathbf{X}^T\mathbf{y}\right) = \frac{1}{m}\left(\mathbf{X}^T\mathbf{X}\boldsymbol{\theta} - \mathbf{X}^T\mathbf{y}\right)\]
6
Set \(\nabla_{\boldsymbol{\theta}} J = \mathbf{0}\): \[\mathbf{X}^T\mathbf{X}\boldsymbol{\theta} = \mathbf{X}^T\mathbf{y}\]
These are the "normal equations" — one equation per parameter. The name comes from the geometric fact that the residual vector is normal (perpendicular) to the column space of X.
7
If \(\mathbf{X}^T\mathbf{X}\) is invertible, solve: \[\boxed{\hat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}}\]
The matrix (X^T X)^{-1} X^T is called the Moore-Penrose pseudoinverse of X, written X^†. So θ̂ = X^† y.
The Normal Equation \[ \hat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \]

When is \(\mathbf{X}^T\mathbf{X}\) Invertible?

\(\mathbf{X}^T\mathbf{X}\) is invertible if and only if \(\mathbf{X}\) has full column rank, i.e., the columns of \(\mathbf{X}\) are linearly independent. It fails when:

  • Multicollinearity: Two or more features are perfectly correlated (one is a linear combination of others). E.g., including both "temperature in °C" and "temperature in °F" — they're identical up to scaling.
  • Too many features: If \(n+1 > m\) (more parameters than samples), \(\mathbf{X}^T\mathbf{X}\) is singular — infinitely many solutions minimize the loss exactly to zero.
  • Duplicate features: Same feature column repeated twice.
💡 The Fix — Pseudoinverse and Regularization

When \(\mathbf{X}^T\mathbf{X}\) is singular or near-singular, two remedies exist. The Moore-Penrose pseudoinverse \(\mathbf{X}^{\dagger}\) gives the minimum-norm solution analytically. More practically, Ridge regularization adds \(\lambda\mathbf{I}\) to the matrix: \(\hat{\boldsymbol{\theta}}_{\text{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\), which is always invertible for \(\lambda > 0\). This is not just a numerical trick — it has deep statistical meaning (§12).

§05

Geometric Interpretation — Projection

The Normal Equation has a stunning geometric interpretation that reveals why it has to take exactly this form. This is one of the most beautiful results in linear algebra.

The Column Space Perspective

Think of the \(m\) rows of \(\mathbf{X}\) as giving coordinates. The columns of \(\mathbf{X}\) are \(m\)-dimensional vectors. The set of all possible predictions \(\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\theta}\) as \(\boldsymbol{\theta}\) varies is the column space of \(\mathbf{X}\): a subspace \(\mathcal{C}(\mathbf{X}) \subseteq \mathbb{R}^m\).

Unless \(\mathbf{y}\) happens to lie exactly in \(\mathcal{C}(\mathbf{X})\) (perfect fit), we cannot achieve zero residual. The best we can do: find the point in \(\mathcal{C}(\mathbf{X})\) closest to \(\mathbf{y}\) in Euclidean distance — minimizing \(\|\mathbf{X}\boldsymbol{\theta} - \mathbf{y}\|^2\). This is exactly the orthogonal projection of \(\mathbf{y}\) onto \(\mathcal{C}(\mathbf{X})\).

Col(X) — column space O y ŷ = Xθ̂ e = y − ŷ The residual e is perpendicular to the entire column space — this is why it's called the Normal Equation
Fig 1. OLS as orthogonal projection. ŷ = Xθ̂ is the closest point in Col(X) to y. The residual e = y − ŷ is perpendicular (normal) to Col(X).

Why the Residual Must Be Orthogonal to Col(X)

// Geometric proof: the optimality condition is orthogonality
1
For any other point \(\mathbf{X}\boldsymbol{\theta}\) in \(\mathcal{C}(\mathbf{X})\), write the squared distance to \(\mathbf{y}\): \[\|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|^2 = \|\mathbf{y} - \hat{\mathbf{y}} + \hat{\mathbf{y}} - \mathbf{X}\boldsymbol{\theta}\|^2\]
2
\(= \|\mathbf{e} + (\hat{\mathbf{y}} - \mathbf{X}\boldsymbol{\theta})\|^2 = \|\mathbf{e}\|^2 + 2\mathbf{e}^T(\hat{\mathbf{y}} - \mathbf{X}\boldsymbol{\theta}) + \|\hat{\mathbf{y}} - \mathbf{X}\boldsymbol{\theta}\|^2\)
Using the Pythagorean identity for norms
3
If \(\mathbf{e} \perp \mathcal{C}(\mathbf{X})\), then the cross term vanishes: \(\mathbf{e}^T(\hat{\mathbf{y}} - \mathbf{X}\boldsymbol{\theta}) = 0\).
4
We get: \(\|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|^2 = \|\mathbf{e}\|^2 + \|\hat{\mathbf{y}} - \mathbf{X}\boldsymbol{\theta}\|^2 \geq \|\mathbf{e}\|^2\)
The distance from y to any other point in Col(X) is at least ‖e‖² — so ŷ is the unique minimizer. ∎
✓ The Orthogonality Condition

Optimality requires the residual to be orthogonal to every column of \(\mathbf{X}\):
\(\mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\theta}}) = \mathbf{0}\). Expanding this gives \(\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\theta}} = \mathbf{X}^T\mathbf{y}\) — precisely the Normal Equations. The geometry and the algebra are the same thing.

§06

Statistical Foundation — MLE Derivation

Why does everyone use squared error? The most rigorous answer comes from probability theory. If we make a specific assumption about how the data is generated, squared error is the only natural choice.

The Probabilistic Data-Generating Model

Assume the true relationship between inputs and outputs is linear, plus independent Gaussian noise:

Generative Model \[ y^{(i)} = \boldsymbol{\theta}^T \mathbf{x}^{(i)} + \epsilon^{(i)}, \quad \epsilon^{(i)} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2) \]

This means: the output \(y\) is the linear prediction plus a small random measurement/model error, and these errors are independent, zero-mean, and Gaussian. Under this model:

\[ P(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi}\sigma}\exp\!\left(-\frac{(y^{(i)} - \boldsymbol{\theta}^T\mathbf{x}^{(i)})^2}{2\sigma^2}\right) \]

Maximum Likelihood Estimation

MLE: find \(\boldsymbol{\theta}\) that maximizes the probability of observing the training data. The likelihood is:

// MLE → MSE: the full derivation
1
Likelihood (i.i.d. assumption — product over all samples): \[\mathcal{L}(\boldsymbol{\theta}) = \prod_{i=1}^{m} P(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\sigma}\exp\!\left(-\frac{(y^{(i)}-\boldsymbol{\theta}^T\mathbf{x}^{(i)})^2}{2\sigma^2}\right)\]
2
Log-likelihood (log turns products into sums): \[\ell(\boldsymbol{\theta}) = \log\mathcal{L} = \sum_{i=1}^{m}\left[-\log(\sqrt{2\pi}\sigma) - \frac{(y^{(i)}-\boldsymbol{\theta}^T\mathbf{x}^{(i)})^2}{2\sigma^2}\right]\]
3
Maximizing \(\ell\) over \(\boldsymbol{\theta}\): the term \(-\log(\sqrt{2\pi}\sigma)\) is constant in \(\boldsymbol{\theta}\). So maximizing \(\ell\) is equivalent to minimizing: \[\sum_{i=1}^{m} \frac{(y^{(i)}-\boldsymbol{\theta}^T\mathbf{x}^{(i)})^2}{2\sigma^2}\]
4
Since \(\sigma^2 > 0\) is constant, this is equivalent to minimizing: \[\sum_{i=1}^{m}(y^{(i)}-\boldsymbol{\theta}^T\mathbf{x}^{(i)})^2 = \frac{1}{2m}\|\mathbf{X}\boldsymbol{\theta}-\mathbf{y}\|^2 = J(\boldsymbol{\theta}) \qquad \blacksquare\]
MLE under Gaussian noise ≡ minimizing MSE. The MSE loss is not a heuristic — it's the statistically principled choice.
★ The Profound Connection

This derivation reveals a deep truth: choosing a loss function = choosing a noise model. MSE assumes Gaussian noise. MAE assumes Laplace noise. Huber loss assumes a mixture. Every loss function implicitly encodes a statistical assumption about how residuals are distributed. When you use MSE, you're betting the noise is Gaussian.

§07

Gradient Descent for Linear Regression

While the Normal Equation gives an exact answer, gradient descent is essential when \(n\) is large (matrix inversion becomes too costly). From §02's gradient master class, we have:

Gradient of MSE \[ \nabla_{\boldsymbol{\theta}} J = \frac{1}{m}\mathbf{X}^T(\mathbf{X}\boldsymbol{\theta} - \mathbf{y}) = \frac{1}{m}\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y}) \]

Component-Wise Derivation

Let's derive the gradient for a single weight \(\theta_j\) from scratch, to make sure every piece is understood:

// ∂J/∂θⱼ — partial derivative of MSE w.r.t. single weight
1
\(\displaystyle \frac{\partial J}{\partial \theta_j} = \frac{\partial}{\partial\theta_j}\frac{1}{2m}\sum_{i=1}^m\left(\sum_{k=0}^n \theta_k x_k^{(i)} - y^{(i)}\right)^2\)
2
Apply chain rule: \(\frac{\partial}{\partial\theta_j}(f)^2 = 2f \cdot \frac{\partial f}{\partial\theta_j}\) \[\displaystyle = \frac{1}{m}\sum_{i=1}^m\left(\hat{y}^{(i)} - y^{(i)}\right)\cdot \frac{\partial}{\partial\theta_j}\hat{y}^{(i)}\]
3
Since \(\hat{y}^{(i)} = \boldsymbol{\theta}^T\mathbf{x}^{(i)}\), we have \(\frac{\partial \hat{y}^{(i)}}{\partial\theta_j} = x_j^{(i)}\)
4
\[\boxed{\frac{\partial J}{\partial\theta_j} = \frac{1}{m}\sum_{i=1}^m\left(\hat{y}^{(i)} - y^{(i)}\right)x_j^{(i)}}\]
In vector form: ∇J = (1/m) X^T(ŷ − y). This is the dot product of the j-th feature column with the residual vector.

The Update Rule

GD Update — Vectorized \[ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \frac{\alpha}{m}\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y}) \]
// Gradient Descent for Linear Regression Initialize: θ = zeros(n+1)   # or small random values

for epoch = 1 to num_epochs:
  # Forward pass
  y_hat = X @ θ                   # (m,) predictions
  residuals = y_hat y           # (m,) errors

  # Compute loss (optional, for monitoring)
  J = mean(residuals**2) / 2

  # Compute gradient
  grad = X.T @ residuals / m      # (n+1,)

  # Update parameters
  θ = θ α * grad

  ifgrad‖ < ε: break       # convergence check
§08

Properties of OLS — The Gauss-Markov Theorem

The OLS estimator \(\hat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\) has remarkable statistical properties. The Gauss-Markov theorem tells us it's the best we can possibly do within a class of estimators.

Unbiasedness

The OLS estimator is unbiased — in expectation it equals the true \(\boldsymbol{\theta}^*\).

// Proof that E[θ̂] = θ*
1
Substitute \(\mathbf{y} = \mathbf{X}\boldsymbol{\theta}^* + \boldsymbol{\epsilon}\) into the OLS formula: \[\hat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\boldsymbol{\theta}^* + \boldsymbol{\epsilon})\]
2
\(= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\theta}^* + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon} = \boldsymbol{\theta}^* + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon}\)
3
\(\mathbb{E}[\hat{\boldsymbol{\theta}}] = \boldsymbol{\theta}^* + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\underbrace{\mathbb{E}[\boldsymbol{\epsilon}]}_{=\,\mathbf{0}} = \boldsymbol{\theta}^* \qquad \blacksquare\)
Uses the assumption E[ε] = 0 (zero-mean noise).

Variance of the OLS Estimator

Covariance Matrix of θ̂ \[ \text{Var}(\hat{\boldsymbol{\theta}}) = \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1} \]

The diagonal entries are the variances of individual estimates: \(\text{Var}(\hat{\theta}_j) = \sigma^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{jj}\). Large variance in \(\hat{\theta}_j\) means the estimate of the \(j\)-th coefficient is unreliable — it would change significantly with a different training set.

This formula reveals exactly when estimates are unreliable: when \((\mathbf{X}^T\mathbf{X})^{-1}\) has large entries — i.e., when the matrix is nearly singular due to multicollinearity.

The Gauss-Markov Theorem

★ Gauss-Markov Theorem (BLUE)

Under the OLS assumptions (linearity, exogeneity \(\mathbb{E}[\boldsymbol{\epsilon}|\mathbf{X}]=\mathbf{0}\), homoscedasticity \(\text{Var}(\boldsymbol{\epsilon})=\sigma^2\mathbf{I}\), no multicollinearity), the OLS estimator is the Best Linear Unbiased Estimator (BLUE): among all linear unbiased estimators, it has the minimum variance.

Formally: for any linear unbiased estimator \(\tilde{\boldsymbol{\theta}} = \mathbf{C}\mathbf{y}\) with \(\mathbb{E}[\tilde{\boldsymbol{\theta}}] = \boldsymbol{\theta}^*\), we have:

\[ \text{Var}(\tilde{\boldsymbol{\theta}}) - \text{Var}(\hat{\boldsymbol{\theta}}) = \text{Var}(\tilde{\boldsymbol{\theta}}) - \sigma^2(\mathbf{X}^T\mathbf{X})^{-1} \succeq 0 \]

(positive semidefinite — every diagonal entry of the difference is ≥ 0). OLS is the tightest possible unbiased linear estimate.

💡 "Best" vs "Optimal"

Gauss-Markov says OLS is best among linear unbiased estimators. But a biased estimator with much lower variance can have lower total error (lower MSE = bias² + variance). Ridge regression introduces bias deliberately to reduce variance — often beating OLS in test error. This is the bias-variance tradeoff (§09) in action, and explains why regularization can improve upon the "theoretically optimal" OLS.

§09

Bias-Variance Tradeoff

The expected test error of any estimator decomposes into three fundamental components. Understanding this decomposition is central to all of machine learning.

The Bias-Variance Decomposition

For a test point \(\mathbf{x}_0\) with true value \(y_0 = f(\mathbf{x}_0) + \epsilon\):

Bias-Variance Decomposition \[ \mathbb{E}\left[(y_0 - \hat{y}_0)^2\right] = \underbrace{\left(\mathbb{E}[\hat{y}_0] - f(\mathbf{x}_0)\right)^2}_{\text{Bias}^2} + \underbrace{\text{Var}(\hat{y}_0)}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible noise}} \]

Derivation

// Full proof of the decomposition
1
Let \(\bar{y} = \mathbb{E}[\hat{y}_0]\) (expected prediction) and \(f_0 = f(\mathbf{x}_0)\) (true signal). \[\mathbb{E}[(y_0-\hat{y}_0)^2] = \mathbb{E}[(f_0 + \epsilon - \hat{y}_0)^2]\]
2
Add and subtract \(\bar{y}\): \[= \mathbb{E}[(f_0 - \bar{y})^2 + 2(f_0-\bar{y})(\bar{y}-\hat{y}_0) + (\bar{y}-\hat{y}_0)^2 + 2\epsilon(\ldots) + \epsilon^2]\]
3
Taking expectations and using \(\mathbb{E}[\epsilon]=0\), \(\mathbb{E}[\epsilon \cdot \hat{y}_0]=0\) (independence): \[= (f_0-\bar{y})^2 + \mathbb{E}[(\hat{y}_0 - \bar{y})^2] + \sigma^2\]
4
\(= \text{Bias}^2[\hat{y}_0] + \text{Var}[\hat{y}_0] + \sigma^2 \qquad \blacksquare\)

What Each Term Means

Term Meaning Controlled By
Bias² How far off the average prediction is from the truth. High bias = underfitting — the model is too simple to capture the true pattern. Model complexity, hypothesis class. Linear model fitting a cubic truth has high bias.
Variance How much the prediction changes across different training sets. High variance = overfitting — the model learned training noise. Model complexity, regularization, training size. More parameters → higher variance.
Irreducible Inherent noise in the data — the best any model can achieve. Cannot be reduced by any algorithm. Data generation process. \(\sigma^2\) of the true noise.
complexity Error Bias² Variance Total σ² (irreducible) optimal ← underfitting overfitting →
Fig 2. Bias-variance tradeoff. As model complexity increases, bias decreases but variance increases. The optimal complexity minimizes total test error.
§10

Feature Scaling and Normalization

Feature scaling is not optional — it directly affects the geometry of the loss landscape and the convergence behavior of gradient descent.

Why Scaling Matters for Gradient Descent

If feature \(x_1 \in [0, 1]\) and feature \(x_2 \in [0, 10000]\), the loss surface is a highly elongated ellipsoid — the loss changes much faster in the \(\theta_2\) direction. Gradient descent in this landscape:

  • Takes tiny, inefficient zigzag steps (ravine problem from gradient descent masterclass)
  • Requires very different "ideal" learning rates per parameter
  • Converges extremely slowly

After scaling all features to similar ranges, the loss surface becomes nearly circular — gradient descent takes direct paths toward the minimum.

Method 1 — Min-Max Scaling (Normalization)

Min-Max Normalization \[ x_j' = \frac{x_j - \min(x_j)}{\max(x_j) - \min(x_j)} \implies x_j' \in [0, 1] \]

Use when: you know the bounded range and want values in [0, 1]. Sensitive to outliers (the min/max are pulled by extreme values).

Method 2 — Standardization (Z-Score Normalization)

Standardization \[ x_j' = \frac{x_j - \mu_j}{\sigma_j} \implies \mathbb{E}[x_j'] = 0, \; \text{Std}(x_j') = 1 \]

Use when: features may have outliers or unknown bounds. More robust than min-max. The standard choice for most ML workflows.

⚠ Critical Rule

Always compute \(\mu_j\) and \(\sigma_j\) on the training set only, then apply the same transformation to the validation and test sets. Computing statistics on the full dataset (including test) is data leakage — your model has indirectly seen test data during training.

Does Scaling Affect the Normal Equation?

No — the Normal Equation always produces the exact minimizer regardless of scaling. But in practice, numerically inverting a poorly-conditioned \(\mathbf{X}^T\mathbf{X}\) (due to very different feature scales) introduces floating-point errors. Scaling improves numerical stability. For gradient descent, scaling is essential. For the Normal Equation, it's good practice.

§11

Polynomial Regression — Extending the Model

Linear regression assumes \(y\) is linear in \(\mathbf{x}\). What if the true relationship is curved? Polynomial regression solves this while remaining within the linear regression framework.

Feature Expansion

For a single feature \(x\), add polynomial features up to degree \(d\):

Polynomial Feature Map \[ \phi(x) = [1, x, x^2, x^3, \ldots, x^d]^T \in \mathbb{R}^{d+1} \] \[ h(\mathbf{x}) = \boldsymbol{\theta}^T\phi(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_d x^d \]

This is still linear regression — linear in the parameters \(\boldsymbol{\theta}\). Every theorem, gradient formula, and the Normal Equation applies unchanged. The only thing that changed is the design matrix \(\mathbf{X}\) now contains polynomial columns:

Polynomial Design Matrix \[ \mathbf{X}_{\text{poly}} = \begin{bmatrix} 1 & x^{(1)} & (x^{(1)})^2 & \cdots & (x^{(1)})^d \\ 1 & x^{(2)} & (x^{(2)})^2 & \cdots & (x^{(2)})^d \\ \vdots & & & & \vdots \\ 1 & x^{(m)} & (x^{(m)})^2 & \cdots & (x^{(m)})^d \end{bmatrix} \]

Overfitting with Polynomial Regression

A degree-\(d\) polynomial has \(d+1\) parameters. With \(d = m-1\), the model can pass through every training point perfectly — but will wildly oscillate between them (Runge's phenomenon). This is extreme overfitting: zero training error, catastrophic test error.

⚠ The Degree Selection Problem

Low degree → underfitting (high bias). High degree → overfitting (high variance). The optimal degree is selected via cross-validation — training on a subset, evaluating on the held-out portion. Alternatively, regularization (§12) allows using high-degree polynomials while penalizing large coefficients, preventing wild oscillation.

Interaction Terms and General Feature Maps

With multiple features, the polynomial expansion includes cross-terms: \(x_1 x_2, x_1^2 x_2, x_1 x_2^2\), etc. For \(n\) features of degree \(d\), the number of features grows as \(\binom{n+d}{d} = O(n^d)\) — exponential in degree. This is the curse of dimensionality. Kernel methods (like SVR with RBF kernel) handle this implicitly via the kernel trick, computing the dot product in high-dimensional feature space without explicitly constructing it.

§12

Regularization — Ridge, Lasso, Elastic Net

Regularization adds a penalty to the loss that discourages large weights — deliberately introducing bias to reduce variance, improving generalization.

Ridge Regression (L2)

Ridge Loss \[ J_{\text{Ridge}}(\boldsymbol{\theta}) = \frac{1}{2m}\|\mathbf{X}\boldsymbol{\theta} - \mathbf{y}\|^2 + \frac{\lambda}{2}\|\boldsymbol{\theta}\|_2^2 = \frac{1}{2m}\|\mathbf{X}\boldsymbol{\theta}-\mathbf{y}\|^2 + \frac{\lambda}{2}\sum_{j=1}^{n}\theta_j^2 \]
✓ Ridge has a Closed-Form Solution

Setting \(\nabla J_{\text{Ridge}} = 0\): \(\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\theta}} + \lambda\hat{\boldsymbol{\theta}} = \mathbf{X}^T\mathbf{y}\), giving \(\hat{\boldsymbol{\theta}}_{\text{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\). The added \(\lambda\mathbf{I}\) ensures the matrix is always invertible — Ridge simultaneously solves multicollinearity and regularizes.

The gradient descent update with Ridge regularization:

Ridge GD Update \[ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta}\!\left(1 - \frac{\alpha\lambda}{m}\right) - \frac{\alpha}{m}\mathbf{X}^T(\hat{\mathbf{y}}-\mathbf{y}) \]

The factor \((1-\frac{\alpha\lambda}{m}) < 1\) is "weight decay" — it shrinks \(\boldsymbol{\theta}\) each step before applying the gradient update. Ridge never drives weights exactly to zero.

Bayesian interpretation: Ridge is MAP estimation with a Gaussian prior \(\theta_j \sim \mathcal{N}(0, 1/\lambda)\). The regularization penalty is the negative log of this prior.

Lasso Regression (L1)

Lasso Loss \[ J_{\text{Lasso}}(\boldsymbol{\theta}) = \frac{1}{2m}\|\mathbf{X}\boldsymbol{\theta}-\mathbf{y}\|^2 + \lambda\sum_{j=1}^{n}|\theta_j| \]

Lasso's \(|\theta_j|\) is not differentiable at zero. The gradient uses the subgradient:

\[ \frac{\partial J_{\text{Lasso}}}{\partial\theta_j} = \frac{1}{m}[\mathbf{X}^T(\hat{\mathbf{y}}-\mathbf{y})]_j + \lambda \cdot \text{sign}(\theta_j) \]

Lasso's key property: it drives many weights exactly to zero, performing implicit feature selection. Geometrically, the L1 ball has corners at the coordinate axes — gradient descent paths tend to land exactly at corners, where some \(\theta_j = 0\).

Bayesian interpretation: MAP estimation with a Laplace prior \(\theta_j \sim \text{Laplace}(0, 1/\lambda)\).

Elastic Net (Ridge + Lasso)

Elastic Net Loss \[ J_{\text{EN}}(\boldsymbol{\theta}) = \frac{1}{2m}\|\mathbf{X}\boldsymbol{\theta}-\mathbf{y}\|^2 + \lambda_1\sum_j|\theta_j| + \frac{\lambda_2}{2}\sum_j\theta_j^2 \]

Combines L1 sparsity with L2 stability. When features are correlated, Lasso arbitrarily picks one; Elastic Net tends to group correlated features together.

RIDGE (L2) θ̂ small, non-zero weights LASSO (L1) θ̂ sparse: some θⱼ = 0 exactly θ₁ θ₂ θ₁ θ₂
Fig 3. L2 (circle) vs L1 (diamond) constraint regions. The contours of the MSE loss first touch the constraint region at the optimal θ̂. L1 corners lie on coordinate axes — Lasso naturally produces sparse solutions.
Method Penalty Effect on Weights Feature Selection Closed Form?
OLS None Unconstrained No Yes
Ridge \(\lambda\|\boldsymbol{\theta}\|_2^2\) Shrinks, never zero No Yes
Lasso \(\lambda\|\boldsymbol{\theta}\|_1\) Drives to exactly zero Yes No (subgradient)
Elastic Net \(\lambda_1\|\boldsymbol{\theta}\|_1 + \lambda_2\|\boldsymbol{\theta}\|_2^2\) Sparse + stable Yes (grouped) No
§13

Assumptions and Diagnostics

OLS is only "best" (Gauss-Markov) when its assumptions hold. Violating them doesn't mean the estimates are wrong — but their statistical properties change.

1. Linearity

The true relationship is \(y = \boldsymbol{\theta}^T\mathbf{x} + \epsilon\).

Violation: Residual vs fitted plot shows systematic curve. Fix: add polynomial features.

2. Zero-Mean Noise

\(\mathbb{E}[\epsilon | \mathbf{X}] = 0\). Noise doesn't systematically push in one direction.

Violation: Biased predictions. Caused by omitting a relevant variable (omitted variable bias).

3. Homoscedasticity

Constant variance: \(\text{Var}(\epsilon_i) = \sigma^2\) for all \(i\).

Violation (heteroscedasticity): OLS still unbiased but no longer BLUE. Use WLS or robust SE.

4. Independence

Errors \(\epsilon_i\) and \(\epsilon_j\) are uncorrelated for \(i \neq j\).

Violation (autocorrelation): Common in time series. Use GLS or include lagged features.

5. No Perfect Multicollinearity

\(\mathbf{X}^T\mathbf{X}\) is invertible — no feature is a linear combo of others.

Violation: Coefficients blow up, are unstable. Use Ridge regularization or VIF analysis.

6. Normality of Errors

\(\epsilon \sim \mathcal{N}(0,\sigma^2)\). Required for hypothesis tests and confidence intervals.

Violation: OLS still consistent. But t-tests and F-tests are invalid. Use robust inference.

💡 Residual Plots — Your Primary Diagnostic

Plot residuals \(e_i\) vs fitted values \(\hat{y}_i\). In a well-specified model: residuals should be randomly scattered around zero with no pattern. A U-shape → nonlinearity (add polynomial features). Funnel shape → heteroscedasticity. Systematic structure → omitted variable. QQ-plot of residuals tests normality.

§14

Evaluation Metrics

Metric Formula Range Interpretation
MSE \(\frac{1}{m}\sum(y_i - \hat{y}_i)^2\) \([0, \infty)\) Mean squared error. Same units as \(y^2\). Penalizes outliers heavily.
RMSE \(\sqrt{\text{MSE}}\) \([0, \infty)\) Same units as \(y\) — interpretable as "average error magnitude".
MAE \(\frac{1}{m}\sum|y_i - \hat{y}_i|\) \([0, \infty)\) More robust to outliers. Always \(\leq\) RMSE.
\(1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}}\) \((-\infty, 1]\) Fraction of variance explained. R² = 1: perfect. R² = 0: as good as predicting the mean. Negative: worse than the mean.
Adjusted R² \(1 - (1-R^2)\frac{m-1}{m-n-1}\) \((-\infty, 1]\) Penalizes adding useless features. R² always increases with more features; adjusted R² can decrease.
MAPE \(\frac{100}{m}\sum\frac{|y_i-\hat{y}_i|}{|y_i|}\) \([0, \infty)\%\) Relative error in %. Useful when scale varies across data.

R² — Deep Dive

R² Decomposition \[ \text{SS}_{\text{tot}} = \sum_i(y_i - \bar{y})^2, \quad \text{SS}_{\text{res}} = \sum_i(y_i - \hat{y}_i)^2, \quad R^2 = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}} \]

For OLS: \(\text{SS}_{\text{tot}} = \text{SS}_{\text{reg}} + \text{SS}_{\text{res}}\) (total variance = explained + unexplained). So \(R^2 = \frac{\text{SS}_{\text{reg}}}{\text{SS}_{\text{tot}}}\) — the fraction of total variance that the model explains.

⚠ R² Caveats

R² increases monotonically as you add features, even useless ones — never use it to compare models with different numbers of features. Use adjusted R² or information criteria (AIC/BIC) instead. Also, a high R² does not mean the model is well-specified — it can be high even when assumptions are violated.

§15

Normal Equation vs Gradient Descent

Normal Equation Gradient Descent
Formula \((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\) \(\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \frac{\alpha}{m}\mathbf{X}^T(\hat{\mathbf{y}}-\mathbf{y})\)
Solution quality Exact global optimum Approximate — converges to optimum
Computational cost \(O(n^3)\) for matrix inversion \(O(kn)\) total, \(k\) = iterations
Works for large \(n\)? No: \(n > 10^4\) becomes slow/infeasible Yes: scales to billions of parameters
Learning rate? None needed Must be chosen carefully
Feature scaling? Not needed for correctness Essential for fast convergence
Non-linear models? No — only for linear regression Yes — works for any differentiable loss
Multicollinearity? Fails (singular matrix) Converges slowly, but still works
Regularization Ridge has closed form: \((\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\) Add \(\frac{\lambda}{m}\boldsymbol{\theta}\) to gradient
💡 Practical Rule

Use the Normal Equation when: \(n \leq 10^4\) features and you have a GPU or fast linear algebra library. Use Gradient Descent when: large \(n\), online/streaming data, or for any model beyond linear regression (logistic, neural networks). For your house price prediction project — small \(n\) — the Normal Equation gives the exact answer instantly. For training your ViT chest X-ray classifier — billions of parameters — gradient descent (Adam) is the only option.

§16

The Full Mental Model — Everything Together

LINEAR REGRESSION — COMPLETE MAP THE MODEL ŷ = Xθ    h(x) = θᵀx MLE Justification ε~N(0,σ²) → use MSE Gaussian noise = MSE loss MSE LOSS J = (1/2m)‖Xθ − y‖² Convex quadratic in θ GEOMETRY ŷ = proj of y onto Col(X) e ⊥ Col(X) MINIMIZATION Normal Eq. Gradient Desc. NORMAL EQUATION θ̂ = (XᵀX)⁻¹ Xᵀy exact · O(n³) · small n GRADIENT DESCENT θ ← θ − (α/m)Xᵀ(ŷ−y) iterative · scales · all models PROPERTIES BLUE (Gauss-Markov) E[θ̂] = θ* (unbiased) Var = σ²(XᵀX)⁻¹ REGULARIZATION Ridge: (XᵀX + λI)⁻¹Xᵀy Lasso: sparse weights Elastic Net: L1 + L2 EXTENSIONS Polynomial features φ(x) Bias–Variance tradeoff Foundation of logistic reg.
Fig 4. Complete map of linear regression — from model through MLE justification, geometric interpretation, both solution methods, statistical properties, and extensions.

The unified story, told once more in plain language:

  1. Model: Assume \(y = \boldsymbol{\theta}^T\mathbf{x} + \epsilon\) — the world is approximately linear in the features. Represent all \(m\) examples at once with the design matrix \(\mathbf{X}\).
  2. Loss — MLE tells us: If \(\epsilon \sim \mathcal{N}(0,\sigma^2)\), Maximum Likelihood Estimation is exactly equivalent to minimizing MSE. The loss function is not a heuristic — it encodes a probabilistic assumption about the noise.
  3. Geometry: Minimizing MSE = finding the orthogonal projection of \(\mathbf{y}\) onto the column space of \(\mathbf{X}\). The residual is orthogonal to every predictor — this is the Normal Equation.
  4. Closed-form solution: \(\hat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\) — exact, in one shot. Costs \(O(n^3)\).
  5. Iterative solution: \(\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \frac{\alpha}{m}\mathbf{X}^T(\hat{\mathbf{y}}-\mathbf{y})\) — scales to any \(n\), required for all nonlinear models.
  6. Statistical properties: OLS is BLUE (Gauss-Markov). Unbiased with variance \(\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\). Multicollinearity inflates this variance.
  7. Regularization: Ridge (L2) shrinks weights, improves conditioning, has closed form. Lasso (L1) drives weights to zero — implicit feature selection. Both trade bias for variance — often improving test error over unregularized OLS.
  8. Extensions: Polynomial features make the model nonlinear in \(\mathbf{x}\) while keeping it linear in \(\boldsymbol{\theta}\) — the entire framework applies unchanged. Everything in logistic regression (sigmoid, cross-entropy) builds directly on this linear foundation.
★ The Unifying Thread Across Your ML Journey

Linear regression is not a standalone algorithm — it is the atom from which the rest of ML is built. Logistic regression replaces the identity output with a sigmoid; neural networks stack many linear layers with nonlinear activations; SVMs find the linear separator with maximum margin; PCA finds the linear subspace of maximum variance. Every model you ever trained — contains linear regression at its core. The design matrix, the gradient \(\mathbf{X}^T(\hat{\mathbf{y}}-\mathbf{y})\), the bias-variance tradeoff: these recur everywhere.