// Machine Learning — Generalization Theory

Under & Overfitting
Bias–Variance & Beyond

The deepest question in ML is not how to fit data — it's how to generalize. This is the complete mathematical and intuitive treatment: from first principles through every practical remedy, including the modern double-descent surprise.

Generalization Gap · Bias² · Variance · Irreducible Noise · Learning Curves · Regularization · Dropout · Early Stopping · Cross-Validation · Ensembles · Double Descent
// Table of Contents
  1. What Does "Generalize" Mean?
  2. Underfitting — High Bias
  3. Overfitting — High Variance
  4. Bias–Variance Decomposition — Full Proof
  5. The Tradeoff — Geometry and Intuition
  6. Diagnosing — Learning Curves
  7. Fixing Underfitting
  8. Regularization — L1, L2, Elastic Net
  9. Dropout — Randomized Regularization
  10. Early Stopping — Implicit Regularization
  11. Data Augmentation and More Data
  12. Cross-Validation — Unbiased Error Estimation
  13. Ensemble Methods — Bagging and Boosting
  14. Double Descent — The Modern Twist
  15. Decision Framework — What to Do When
  16. The Full Mental Model
§01

What Does "Generalize" Mean?

A model that memorizes its training data perfectly is useless — the world will always show it examples it hasn't seen before. The fundamental goal of supervised learning is not to fit training data. It is to learn the underlying pattern well enough to make accurate predictions on new data.

The Generalization Problem \[ \underbrace{R[h]}_{\text{true risk}} = \mathbb{E}_{(\mathbf{x}, y) \sim P}\left[\mathcal{L}(h(\mathbf{x}), y)\right] \qquad \underbrace{\hat{R}[h]}_{\text{empirical risk}} = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(h(\mathbf{x}^{(i)}), y^{(i)}) \] \[ \text{Generalization Gap} = R[h] - \hat{R}[h] \]

We optimize \(\hat{R}\) (the training loss) but care about \(R\) (the true risk over the entire distribution). The gap between them is what we must control. A large gap means the model learned something specific to the training set — it will fail on new data.
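A minimal sketch of these two quantities, assuming a toy data-generating process (a noisy sine) and a polynomial least-squares model; both are illustrative choices rather than anything prescribed above. Since the true risk cannot be computed exactly, it is approximated here by the average loss on a large fresh sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(m):
    """Draw m points from an assumed toy distribution: y = sin(3x) + noise."""
    x = rng.uniform(-1.0, 1.0, size=m)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.3, size=m)
    return x, y

def mse(coeffs, x, y):
    """Squared-error loss averaged over the given sample."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_train, y_train = sample(15)
x_fresh, y_fresh = sample(100_000)  # large sample standing in for the distribution P

coeffs = np.polyfit(x_train, y_train, deg=9)    # flexible model, small training set
empirical_risk = mse(coeffs, x_train, y_train)  # R-hat: the quantity we optimize
true_risk_est = mse(coeffs, x_fresh, y_fresh)   # Monte Carlo estimate of R
gap = true_risk_est - empirical_risk
print(f"R-hat={empirical_risk:.3f}  R~{true_risk_est:.3f}  gap={gap:.3f}")
```

Lowering `deg` or raising the training size should shrink the printed gap, which is the behaviour the rest of this treatment tries to explain.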

The Three Sets — Why They All Exist

  • Training set: Used to fit model parameters. Loss here is \(\hat{R}\). Models always perform best here — they were literally optimized on it.
  • Validation set: Used to tune hyperparameters (learning rate, regularization strength, architecture choices). Never used for gradient updates. Gives an estimate of generalization during development; the estimate becomes optimistic as more decisions are tuned against it.
  • Test set: Touched exactly once — after all development is complete. Provides the final, unbiased generalization estimate. Using the test set to make any model decision contaminates it.
⚠ Test Set Contamination

If you ever use the test set to make any decision — choosing between models, deciding to add more data, stopping early — it becomes part of your validation set. You have inadvertently fit hyperparameters to it. Report results on a truly held-out set, or use cross-validation with a separate final test set.

The Two Failure Modes

Underfitting
High Error

On both training AND test data. The model never learned the pattern. Bias is too high.

Good Fit
Low Error

Low on both. The model learned the true pattern without memorizing noise. The goal.

Overfitting
Big Gap

Low on training, high on test. The model memorized training noise. Variance is too high.

§02

Underfitting — High Bias

Underfitting occurs when a model is too simple to capture the underlying pattern in the data. The model makes systematic errors — it consistently predicts wrong in the same direction, regardless of which training samples were used.

What Bias Means Precisely

Bias is the expected deviation of the model's prediction from the true value, averaged over all possible training sets:

Bias Definition \[ \text{Bias}[\hat{f}(\mathbf{x})] = \mathbb{E}_{\mathcal{D}}\left[\hat{f}(\mathbf{x})\right] - f(\mathbf{x}) \]

The expectation is over all possible training sets \(\mathcal{D}\) of size \(m\) drawn from the same distribution. Even with infinite data, a high-bias model will still predict wrong — because the hypothesis class doesn't contain the true function.

🔴 Concrete Example

True relationship: \(y = 0.5x^3 - 2x^2 + x + \epsilon\). You fit a linear model \(\hat{y} = \theta_0 + \theta_1 x\). No matter how much data you give it, the best linear fit will consistently underestimate the peaks and overestimate the troughs — because a line cannot represent a cubic. This systematic error is bias. Adding more data won't fix it.

Symptoms of Underfitting

Fig 1. Left: learning-curve signature of underfitting (loss vs. epoch), where both train and val loss are high and close together. Right: the hypothesis class (a line) is too simple for the true boundary.
  • High training loss and high validation loss — both are bad, and they're similar to each other. The model fails on the data it was trained on.
  • Performance doesn't improve much with more data — adding more samples of the same distribution won't help a model that structurally can't represent the pattern.
  • Large residuals with clear systematic structure — in regression, a residual plot shows obvious curves or trends; the model missed a real pattern.

Causes of Underfitting

Cause | What's Wrong | Fix
Hypothesis class too simple | Linear model for nonlinear data; shallow tree; too few neurons | Increase model capacity
Regularization too strong | \(\lambda\) so large that weights are crushed to near-zero | Decrease regularization
Too few features | Missing variables that explain the variance in \(y\) | Feature engineering; add relevant inputs
Training too short | Stopped before convergence | Train longer
Learning rate too large | Gradient descent diverges/oscillates; never settles | Reduce learning rate
§03

Overfitting — High Variance

Overfitting occurs when a model is so powerful that it learns not just the true pattern, but also the random noise specific to the training set. The model fits every quirk of the training data — including sampling accidents that won't appear in new data.

What Variance Means Precisely

Variance Definition \[ \text{Var}[\hat{f}(\mathbf{x})] = \mathbb{E}_{\mathcal{D}}\left[\left(\hat{f}(\mathbf{x}) - \mathbb{E}_{\mathcal{D}}[\hat{f}(\mathbf{x})]\right)^2\right] \]

Variance measures how much the model's prediction at \(\mathbf{x}\) changes when you retrain on a different sample of the same size from the same distribution. A high-variance model is wildly sensitive to which specific data points ended up in the training set.

🟣 Concrete Example

A degree-15 polynomial fit to 20 data points. If you redraw 20 samples from the same distribution and refit, the polynomial looks completely different — it bends wildly to pass through all 20 new points. The two polynomials agree at the training points but disagree everywhere else. That disagreement is variance. The model learned the noise, not the signal.

Symptoms of Overfitting

Fig 2. Left: overfitting learning curve (loss vs. epoch). Train loss → 0 while val loss turns upward after its best point; the gap between them is the overfit region. Right: a high-degree model passes through every noisy training point but oscillates wildly on unseen data.

Causes of Overfitting

Cause | What's Wrong | Fix
Model too complex | Too many parameters relative to data size | Reduce capacity; regularize
Too little data | Training set too small, so the model memorizes it | Get more data; augment
No regularization | Weights grow unrestricted to fit noise | L1/L2/Dropout
Training too long | Kept training after the validation loss started rising | Early stopping
Noisy / mislabeled data | Model learns the wrong labels | Clean data; label smoothing
Feature leakage | Test-time information smuggled into training features | Strict feature engineering pipeline
§04

Bias–Variance Decomposition — Full Proof

The expected test error at a point \(\mathbf{x}\) can be decomposed into three terms. This decomposition is the mathematical backbone of everything in this masterclass. We prove it from scratch.

Setup

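Assume: \(y = f(\mathbf{x}) + \epsilon\) where \(f\) is the true unknown function and \(\epsilon\) is irreducible noise with \(\mathbb{E}[\epsilon] = 0\) and \(\text{Var}[\epsilon] = \sigma^2\), independent of \(\mathbf{x}\) and of the training set. We learn \(\hat{f}(\mathbf{x})\) from a training set \(\mathcal{D}\). Define: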

\[ \bar{f}(\mathbf{x}) = \mathbb{E}_{\mathcal{D}}[\hat{f}(\mathbf{x})] \quad \text{(average prediction over all possible training sets)} \]

The Decomposition Proof

// Full proof: E[(y − f̂(x))²] = Bias² + Variance + σ²
Step 1. Expand the expected squared error (expectation over both \(\mathcal{D}\) and the noise \(\epsilon\)): \[\mathbb{E}\left[(y - \hat{f})^2\right] = \mathbb{E}\left[(f + \epsilon - \hat{f})^2\right]\]
Step 2. Add and subtract \(\bar{f}\): \[= \mathbb{E}\left[\left((f - \bar{f}) + (\bar{f} - \hat{f}) + \epsilon\right)^2\right]\] Expand the square (three squared terms, three cross-products): \[= \mathbb{E}[(f-\bar{f})^2] + \mathbb{E}[(\bar{f}-\hat{f})^2] + \mathbb{E}[\epsilon^2] + 2\,\mathbb{E}[(f-\bar{f})(\bar{f}-\hat{f})] + 2\,\mathbb{E}[\epsilon(f-\bar{f})] + 2\,\mathbb{E}[\epsilon(\bar{f}-\hat{f})]\]
Step 3. Show that the cross terms vanish:
  • \(2\,\mathbb{E}[(f-\bar{f})(\bar{f}-\hat{f})] = 2(f-\bar{f})\,\mathbb{E}_\mathcal{D}[\bar{f}-\hat{f}] = 2(f-\bar{f})\cdot 0 = 0\) since \(\mathbb{E}_\mathcal{D}[\hat{f}] = \bar{f}\) by definition.
  • \(2\,\mathbb{E}[\epsilon(f-\bar{f})] = 2(f-\bar{f})\,\mathbb{E}[\epsilon] = 0\) since \(f-\bar{f}\) is deterministic at a fixed \(\mathbf{x}\) and \(\mathbb{E}[\epsilon]=0\).
  • \(2\,\mathbb{E}[\epsilon(\bar{f}-\hat{f})] = 2\,\mathbb{E}[\epsilon]\,\mathbb{E}_\mathcal{D}[\bar{f}-\hat{f}] = 0\) because the test-point noise is independent of the training set.
Step 4. What remains, using \(\mathbb{E}[\epsilon^2] = \sigma^2\): \[= \underbrace{(f(\mathbf{x}) - \bar{f}(\mathbf{x}))^2}_{\text{Bias}^2[\hat{f}]} + \underbrace{\mathbb{E}_\mathcal{D}\left[(\hat{f}(\mathbf{x}) - \bar{f}(\mathbf{x}))^2\right]}_{\text{Var}[\hat{f}]} + \underbrace{\sigma^2}_{\text{Irreducible}}\]
Note: (f − f̄)² needs no expectation over 𝒟 since both f(x) and f̄(x) are deterministic at a fixed x. ∎
Bias–Variance Decomposition \[ \underbrace{\mathbb{E}\!\left[(y - \hat{f}(\mathbf{x}))^2\right]}_{\text{Expected Test MSE}} = \underbrace{\left(\mathbb{E}_\mathcal{D}[\hat{f}(\mathbf{x})] - f(\mathbf{x})\right)^{\!2}}_{\text{Bias}^2} + \underbrace{\mathbb{E}_\mathcal{D}\!\left[\left(\hat{f}(\mathbf{x}) - \mathbb{E}_\mathcal{D}[\hat{f}]\right)^{\!2}\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Noise}} \]
★ What Each Term Is

Bias² — how far the average prediction is from the truth. A property of the hypothesis class, not the specific training set. Eliminated only by using a richer model family.  |  Variance — how much the prediction fluctuates across different training sets of the same size. Reduced by regularization, more data, ensembling.  |  σ² — the irreducible noise floor of the data-generating process. No model, no matter how perfect, can do better than this.
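The decomposition can be checked numerically. The sketch below is a Monte Carlo estimate under stated assumptions: the cubic from the §02 example as the true function, Gaussian noise with an assumed standard deviation `sigma`, polynomial least squares as the learner, and a single query point `x0`. It redraws the training set many times and compares Bias² + Variance + σ² with the directly measured squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5   # assumed noise standard deviation
x0 = 0.0      # query point at which the error is decomposed

def f(x):
    """True function: the cubic used in the underfitting example."""
    return 0.5 * x**3 - 2 * x**2 + x

def fit_and_predict(degree, m=30):
    """Draw one training set of size m, fit a polynomial, predict at x0."""
    x = rng.uniform(-2.0, 3.0, size=m)
    y = f(x) + rng.normal(0.0, sigma, size=m)
    return np.polyval(np.polyfit(x, y, degree), x0)

def decompose(degree, trials=2000):
    """Estimate (bias^2, variance, expected squared error) at x0."""
    preds = np.array([fit_and_predict(degree) for _ in range(trials)])
    y0 = f(x0) + rng.normal(0.0, sigma, size=trials)  # fresh noisy targets at x0
    bias2 = (preds.mean() - f(x0)) ** 2
    var = preds.var()
    mse = np.mean((y0 - preds) ** 2)
    return bias2, var, mse

for degree in (1, 3, 8):
    bias2, var, mse = decompose(degree)
    total = bias2 + var + sigma**2
    print(f"deg={degree}: bias2={bias2:.3f} var={var:.3f} sum={total:.3f} mse={mse:.3f}")
```

The `sum` and `mse` columns should agree up to Monte Carlo error, with the linear model dominated by bias and the higher-degree models by variance and noise.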

§05

The Tradeoff — Geometry and Intuition

Bias and variance move in opposite directions as model complexity increases. This creates the classic U-shaped test error curve.

Fig 3. The bias–variance tradeoff (error vs. model complexity). As complexity grows, Bias² (red) falls and Variance (violet) rises. Total test error (gold) is U-shaped with a minimum at the optimal complexity, underfitting to its left and overfitting to its right, and it never drops below the irreducible floor σ². Train error (blue dashed) keeps falling, diverging from test error.

The Fundamental Tension

  • Simple models (low complexity): High bias (can't fit the pattern) but low variance (stable across training sets). Example: linear regression for nonlinear data.
  • Complex models (high complexity): Low bias (can represent complex patterns) but high variance (sensitive to training sample). Example: deep neural networks without regularization.
  • The sweet spot is wherever \(\text{Bias}^2 + \text{Variance}\) is minimized — the optimal complexity for your data size and noise level.
  • More data shifts the curve: As \(m \to \infty\), variance \(\to 0\) for most models (the fitted parameters converge), and bias is all that remains. More data reduces variance but does not reduce the bias of a fixed hypothesis class.

Effect of Training Size on Bias and Variance

Variance Scaling \[ \text{Var}[\hat{f}(\mathbf{x})] \sim \frac{\sigma^2 \cdot \text{complexity}}{m} \]

For many models (e.g. linear models with n features): Var ∝ σ²n/m. More data always reduces variance; more features increases it.

This formula captures a crucial practical insight: you can afford more model complexity if you have more data. Under this scaling, a 1000-parameter model needs far fewer samples to generalize than a 1-billion-parameter model. Modern deep learning uses model complexity so high that this scaling suggests it should fail; massive datasets are one reason it does not.
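For ordinary least squares the σ²n/m scaling holds exactly when averaged over the training inputs, because the fitted values are ŷ = Hy with hat matrix H and trace(H) = n. The sketch below checks this on an assumed random Gaussian design; the function name and the chosen sizes are illustrative.

```python
import numpy as np

def mean_prediction_variance(m, n, sigma=1.0, seed=0):
    """Average variance of OLS fitted values over the m training inputs."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(m, n))             # assumed random design with full column rank
    H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix: y_hat = H @ y
    # Cov[y_hat] = sigma^2 * H, so the mean variance is sigma^2 * trace(H) / m
    return sigma**2 * np.trace(H) / m

for m in (50, 200, 800):
    print(m, mean_prediction_variance(m, n=10), 10 / m)
```

Quadrupling m divides the variance by four, and doubling n would double it.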

§06

Diagnosing — Learning Curves

Before fixing a problem, you must diagnose it correctly. Learning curves — plots of training and validation loss vs. training set size (or epoch count) — are the primary diagnostic tool.

Training Size Learning Curves

Fix the model, vary the number of training samples \(m\). Plot train and validation error as functions of \(m\):

Fig 4. Learning curves vs. training size m. Red = train error, gold dashed = val error. Left (high bias): small gap, both plateau high → add complexity. Centre (good fit): small gap, both converge low. Right (high variance): persistent large gap → regularize or add data.

Reading Learning Curves — Diagnostic Rules

Observation | Diagnosis | Remedy
Train error ≈ Val error, both high | High bias / underfitting | More capacity, less regularization, better features
Train error low, Val error much higher | High variance / overfitting | Regularize, more data, simpler model, dropout
Val error rising during training | Overfitting mid-training | Early stopping
Both errors high and diverging | Learning rate issues | Reduce LR; check gradient flow
Val error bouncing wildly | Batch size too small / LR too high | Increase batch size; use LR schedule
Gap closes as m increases | Variance problem (more data helps) | Collect more training data
Both errors plateau at a high level regardless of m | Bias problem (more data won't help) | Increase model complexity
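A training-size learning curve takes only a few lines to compute. The sketch below assumes a toy noisy-sine dataset and polynomial least squares as the model; `learning_curve` is a hypothetical helper written for this illustration, not a library function.

```python
import numpy as np

def learning_curve(x, y, x_val, y_val, degree, sizes):
    """Train and validation MSE of a polynomial fit for each training size m."""
    rows = []
    for m in sizes:
        coeffs = np.polyfit(x[:m], y[:m], degree)
        train_err = np.mean((np.polyval(coeffs, x[:m]) - y[:m]) ** 2)
        val_err = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
        rows.append((m, float(train_err), float(val_err)))
    return rows

rng = np.random.default_rng(0)
x, x_val = rng.uniform(-1, 1, 400), rng.uniform(-1, 1, 2000)
y = np.sin(3 * x) + rng.normal(0, 0.2, x.size)
y_val = np.sin(3 * x_val) + rng.normal(0, 0.2, x_val.size)

for degree in (1, 5):
    for m, tr, va in learning_curve(x, y, x_val, y_val, degree, [20, 50, 100, 400]):
        print(f"degree={degree} m={m:3d} train={tr:.3f} val={va:.3f}")
```

With the linear model both errors should settle at a similar, high value as m grows (the high-bias signature), while the degree-5 model should converge toward the noise level.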
§07

Fixing Underfitting

Increase Model Capacity

Add more layers or neurons (deep learning). Increase polynomial degree. Use a more expressive algorithm (e.g., tree → neural net). The hypothesis class must contain a good approximation of the true function.

Feature Engineering

Add interaction terms (\(x_1 x_2\)), polynomial features (\(x^2, x^3\)), domain-specific transformations (\(\log x\), \(\sqrt{x}\)), or embeddings. A better representation lets a simpler model capture complex patterns.

Reduce Regularization

If you over-regularized (λ too high, dropout rate too high), the model is prevented from learning. Decrease \(\lambda\). Check that regularization is appropriate for the data size.

Train Longer / Better Optimizer

If training hasn't converged, running more epochs or switching from SGD to Adam can reach a better minimum. Check that the learning rate is appropriate — too large prevents convergence.

⚠ Underfitting is Not Solved by More Data

This is the most important diagnostic distinction. If your model has high training error and high validation error with a small gap between them, adding more training data will not help — the model structurally cannot learn the pattern. You must change the model family or the features. More data only helps variance (overfitting), not bias (underfitting).

§08

Regularization — L1, L2, Elastic Net

Regularization is the primary tool for fighting overfitting in parametric models. It adds a penalty on model complexity directly to the loss — deliberately introducing bias to reduce variance.

L2 Regularization (Ridge / Weight Decay)

L2 Regularized Loss \[ J_{\text{L2}}(\boldsymbol{\theta}) = \underbrace{\frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)})}_{\text{empirical risk}} + \underbrace{\frac{\lambda}{2}\|\boldsymbol{\theta}\|_2^2}_{\text{L2 penalty}} \]

The gradient update becomes \(\boldsymbol{\theta} \leftarrow \boldsymbol{\theta}(1 - \alpha\lambda) - \alpha\nabla\hat{R}\), where \(\alpha\) is the learning rate. The factor \((1-\alpha\lambda) < 1\) is weight decay: weights shrink each step. (If the penalty is instead written as \(\frac{\lambda}{2m}\|\boldsymbol{\theta}\|_2^2\), the factor becomes \(1-\alpha\lambda/m\).) L2 penalizes large weights but never drives them exactly to zero. All features remain in the model, but large coefficients are suppressed.

Effect on bias–variance: Larger \(\lambda\) → smaller effective weights → simpler model → higher bias, lower variance. \(\lambda = 0\) recovers the unregularized fit (OLS for linear regression); \(\lambda \to \infty\) drives all weights to zero.
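A small numerical check, assuming linear regression with squared loss and the L2 objective exactly as written above (penalty \(\frac{\lambda}{2}\|\boldsymbol{\theta}\|_2^2\), not divided by m). For that objective a plain gradient step equals a shrink-then-step update with decay factor 1 − αλ, and the closed-form ridge solution shrinks as λ grows. The data and the values of `alpha` and `lam` are arbitrary demo choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
X = rng.normal(size=(m, n))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=m)
theta = rng.normal(size=n)
alpha, lam = 0.1, 0.5  # step size and L2 strength (demo values)

def grad_risk(theta):
    """Gradient of the empirical risk (1/m) * sum of 0.5 * (x . theta - y)^2."""
    return X.T @ (X @ theta - y) / m

# (a) plain gradient step on J = R_hat + (lam / 2) * ||theta||^2
step_penalised = theta - alpha * (grad_risk(theta) + lam * theta)
# (b) the same step written as weight decay followed by the data gradient
step_decay = theta * (1 - alpha * lam) - alpha * grad_risk(theta)
print(np.allclose(step_penalised, step_decay))

def ridge(lam):
    """Closed-form minimiser of the same objective for a linear model."""
    return np.linalg.solve(X.T @ X / m + lam * np.eye(n), X.T @ y / m)

norms = [float(np.linalg.norm(ridge(l))) for l in (0.0, 0.1, 1.0, 10.0)]
print(norms)  # weight norm falls as lam grows
```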

L1 Regularization (Lasso)

L1 Regularized Loss \[ J_{\text{L1}}(\boldsymbol{\theta}) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \lambda\|\boldsymbol{\theta}\|_1 = \frac{1}{m}\sum_i\mathcal{L} + \lambda\sum_j|\theta_j| \]

L1 drives many weights exactly to zero, performing implicit feature selection. Geometrically, the L1 ball has corners on the coordinate axes, and the loss contours tend to touch it first at a corner, where some coordinates are exactly zero. This is useful when you suspect many features are irrelevant: L1 automatically discards them.
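One way to see where the exact zeros come from is the proximal (soft-thresholding) step used by many Lasso solvers, compared with the multiplicative shrinkage that an L2 penalty produces. This is an illustrative sketch; the weight vector and the threshold `t` are made-up values.

```python
import numpy as np

def soft_threshold(theta, t):
    """Proximal operator of t * ||theta||_1: shrink toward zero and clip at zero."""
    return np.sign(theta) * np.maximum(np.abs(theta) - t, 0.0)

def l2_shrink(theta, t):
    """Proximal operator of (t / 2) * ||theta||_2^2: multiplicative shrinkage."""
    return theta / (1.0 + t)

theta = np.array([3.0, -0.4, 0.05, -2.0, 0.3])
print(soft_threshold(theta, 0.5))  # entries with |theta_j| <= 0.5 become exactly zero
print(l2_shrink(theta, 0.5))       # every entry shrinks, none becomes zero
```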

Elastic Net (Best of Both)

Elastic Net \[ J_{\text{EN}} = \frac{1}{m}\sum_i\mathcal{L} + \lambda_1\|\boldsymbol{\theta}\|_1 + \frac{\lambda_2}{2}\|\boldsymbol{\theta}\|_2^2 \]

Choosing λ — The Regularization Path

There's no universal rule. Cross-validate over a grid of \(\lambda\) values. The optimal \(\lambda\) is the one that minimizes validation error. In practice:

  • Start with a coarse log-scale grid: \(\lambda \in \{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1, 10\}\)
  • Refine around the best coarse value
  • In neural networks, L2 regularization is typically applied only to weight matrices, not biases
Property | L1 (Lasso) | L2 (Ridge) | Elastic Net
Weights | Sparse: many exact zeros | Shrinks, never zero | Sparse + stable
Feature selection | Yes (implicit) | No | Yes (grouped)
Correlated features | Picks one arbitrarily | Shrinks all equally | Groups them
Closed-form? | No (subgradient) | Yes (for linear models) | No
Best for | High-dim, sparse true model | Most deep learning | High-dim, correlated features
§09

Dropout — Randomized Regularization

Dropout (Srivastava et al., 2014) is one of the most widely used regularization techniques for neural networks. During training, each neuron is randomly set to zero with probability \(p\) (the dropout rate) at each forward pass.

Dropout Forward Pass \[ \tilde{h}_j^{(l)} = h_j^{(l)} \cdot r_j^{(l)}, \quad r_j^{(l)} \sim \text{Bernoulli}(1-p) \] \[ \text{At test time: } h_j^{(l)} \leftarrow (1-p) \cdot h_j^{(l)} \quad \text{(scale down to match expected training activation)} \]

Why Dropout Works — Two Perspectives

Ensemble Interpretation

With \(N\) neurons and dropout rate \(p\), each training step samples a different subnetwork from the \(2^N\) possible networks. At test time, using the full network with scaled weights approximates averaging the predictions of all \(2^N\) subnetworks. Ensembles reduce variance (§13) — and dropout implicitly does the same at almost no extra cost.

Co-Adaptation Prevention

Without dropout, neurons can develop complex co-dependencies — "neuron A fires only because neuron B is always active." Dropout forces each neuron to be useful independently, since it can't rely on any specific neighbor being present. This encourages learning more robust, distributed representations.

Inverted Dropout (Standard Implementation)

// Inverted Dropout (PyTorch-style)
import torch

# Training forward pass
def forward_train(h, p_drop):
  # Keep each unit with probability 1 - p_drop
  mask = (torch.rand_like(h) > p_drop).float()
  return (h * mask) / (1 - p_drop)  # scale up during training

# Test forward pass: NO dropout, NO scaling needed
def forward_test(h):
  return h  # use full network as-is

Inverted dropout divides by \((1-p)\) at training time (not test time), keeping expected activation the same. Standard dropout scales at test time — both are equivalent but inverted is preferred in practice since test-time inference doesn't need to know \(p\).
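The claim that inverted dropout keeps the expected activation unchanged can be checked empirically. The sketch below uses NumPy rather than PyTorch so that it runs without a deep-learning framework; the constant activation vector is an assumption chosen to make the mean easy to read.

```python
import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout(h, p_drop, rng):
    """Zero each unit with probability p_drop and rescale survivors by 1 / (1 - p_drop)."""
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

h = np.full(200_000, 2.0)  # constant activations
out = inverted_dropout(h, p_drop=0.3, rng=rng)
print(out.mean())          # close to 2.0: the expected activation is preserved
print((out == 0).mean())   # close to 0.3: the fraction of dropped units
```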

Practical Dropout Guidelines

  • Typical rate: 0.2–0.5 for fully connected layers. 0.1–0.2 for convolutional layers (dropout is less effective there because neighbouring activations are strongly correlated, so dropping single units removes little information).
  • Where to apply: After activation functions in hidden layers. Usually not applied to input layer or output layer.
  • Interaction with BatchNorm: Using both can be counterproductive — BatchNorm normalizes across the batch and effectively does its own regularization. Many modern architectures use BatchNorm without Dropout.
  • Training vs inference: Always disable dropout at inference. PyTorch: model.eval(). Forgetting this is a common bug — your test performance will be noisy and lower than it should be.
§10

Early Stopping — Implicit Regularization

Early stopping monitors validation loss during training and halts when it stops improving. It exploits the observation that validation error initially decreases with training, reaches a minimum, then starts rising as the model begins overfitting.

Fig 5. Early stopping (loss vs. epoch). The best checkpoint is saved at the minimum validation loss (green). Training continues for the patience window after that point, then stops, and the model weights are rolled back to the green marker.

The Algorithm

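// Early Stopping with Patience
import copy

# Assumed to exist: model (a PyTorch nn.Module), optimizer, val_loader,
# train_one_epoch(), evaluate(), and max_epochs, patience, min_delta
best_val_loss = float("inf")
patience_count = 0
best_weights = None

for epoch in range(max_epochs):
  train_one_epoch(model, optimizer)
  val_loss = evaluate(model, val_loader)

  if val_loss < best_val_loss - min_delta:  # improved by at least min_delta
    best_val_loss = val_loss
    best_weights = copy.deepcopy(model.state_dict())  # checkpoint
    patience_count = 0
  else:
    patience_count += 1
    if patience_count >= patience:
      break  # stop training

if best_weights is not None:
  model.load_state_dict(best_weights)  # restore best checkpoint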

Why Early Stopping is Implicit L2 Regularization

For gradient descent with step size \(\alpha\) and \(T\) steps, the effective regularization is approximately \(\lambda \approx \frac{1}{\alpha T}\). Stopping early limits the total movement of parameters from their initial values — effectively constraining them to a ball around the initialization. This was proven formally by Bishop (1995) for quadratic losses and has been generalized to deep networks empirically.
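For a quadratic loss the correspondence can be made concrete one eigen-direction at a time. Along a direction with curvature s, gradient descent from zero reaches a fraction 1 − (1 − αs)^T of the optimum after T steps, while L2 regularization with strength λ reaches the fraction s/(s + λ). The sketch below compares the two with λ = 1/(αT); the values of `alpha`, `T`, and the curvature grid are arbitrary demo choices, and the match is approximate rather than exact.

```python
import numpy as np

alpha, T = 0.01, 200                # step size and number of gradient steps
lam = 1.0 / (alpha * T)             # the matching L2 strength suggested above
curvatures = np.logspace(-2, 1, 7)  # Hessian eigenvalues s of the quadratic loss

# Fraction of the unregularised optimum reached by early-stopped gradient descent
early_stop_factor = 1.0 - (1.0 - alpha * curvatures) ** T
# Fraction retained by the L2-regularised minimiser of the same quadratic
ridge_factor = curvatures / (curvatures + lam)

for s, e, r in zip(curvatures, early_stop_factor, ridge_factor):
    print(f"s={s:7.3f}  early-stop={e:.3f}  ridge={r:.3f}")
```

Both factors suppress low-curvature directions (s much smaller than λ) and leave high-curvature directions almost untouched, which is the sense in which stopping early acts like an L2 penalty.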

§11

Data Augmentation and More Data

The most reliable remedy for overfitting is more data. More training samples reduce variance directly — the model has less ability to memorize specific examples when there are more of them. When real data is limited, data augmentation manufactures additional training examples from existing ones.

Why More Data Reduces Variance

From the bias–variance analysis: \(\text{Var}[\hat{f}] \propto 1/m\) for many model families. Each additional sample constrains the model further, reducing its freedom to fit noise. In the limit \(m \to \infty\): variance \(\to 0\) and performance is limited only by bias and irreducible noise.

Data Augmentation — Category by Domain

Computer Vision

Random crop, horizontal/vertical flip, rotation (±15°), color jitter (brightness, contrast, saturation), Gaussian noise, random erase, mixup, cutout, RandAugment. Augmentations must preserve label semantics: rotating a "9" by 180° turns it into a "6", so the original label would be wrong.

NLP / Text

Synonym replacement, back-translation (translate to French, back to English), random insertion/deletion/swap of words, word2vec neighborhood sampling, EDA (Easy Data Augmentation). More constrained — syntax and semantics must be preserved.

Audio / Speech

Time stretching, pitch shifting, adding background noise, SpecAugment (frequency/time masking on the spectrogram). SpecAugment is standard for speech recognition.

Tabular Data

SMOTE (synthetic minority oversampling for imbalanced classes), Gaussian noise on continuous features, bootstrapping. Less natural than image augmentation — use carefully.

Advanced Augmentation — Mixup and CutMix

Mixup \[ \tilde{\mathbf{x}} = \lambda\mathbf{x}_i + (1-\lambda)\mathbf{x}_j, \quad \tilde{y} = \lambda y_i + (1-\lambda)y_j, \quad \lambda \sim \text{Beta}(\alpha, \alpha) \]

Mixup creates new training examples by linearly interpolating between two random training samples, both inputs and labels. It encourages the model to behave linearly between training points, which regularizes the decision boundary. It has been reported to work well for large CNNs and ViTs, which makes it directly relevant to your chest X-ray classifier.
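A minimal NumPy sketch of the mixup formula. It assumes one-hot labels and, for simplicity, draws a single mixing coefficient per batch and pairs each example with a randomly permuted partner; `mixup_batch` is a hypothetical helper, and per-example coefficients are an equally valid variant.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha, rng):
    """Blend every example (inputs and labels) with a randomly chosen partner."""
    lam = rng.beta(alpha, alpha)    # mixing coefficient drawn from Beta(alpha, alpha)
    perm = rng.permutation(len(x))  # partner index j for each example i
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))                 # 8 toy examples with 4 features
y = np.eye(3)[rng.integers(0, 3, size=8)]   # one-hot labels for 3 classes
x_mix, y_mix, lam = mixup_batch(x, y, alpha=0.4, rng=rng)
print(lam, y_mix[0])
```

The mixed labels are soft targets, so the training loss must accept probability vectors (for example, cross-entropy against `y_mix`).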

💡 Augmentation vs. Bias

Data augmentation reduces variance by effectively increasing \(m\). But poor augmentation introduces bias — if the augmented samples don't reflect the true test distribution. For example, medical images should not be augmented with horizontal flip if lesions are laterality-specific. Always validate that your augmentations are semantically valid for the task.

§12

Cross-Validation — Unbiased Error Estimation

Cross-validation gives a reliable estimate of generalization error while using all available data for both training and validation — essential when data is limited.

k-Fold Cross-Validation

k-Fold CV Error Estimate \[ \text{CV}(k) = \frac{1}{k}\sum_{i=1}^{k} \text{Error}_{\text{val}}^{(i)} \quad \text{where fold } i \text{ is held out and the rest are used for training} \]
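// k-Fold Cross-Validation
import numpy as np

# Assumed to exist: data (a NumPy array of examples), train(), evaluate()
k = 5
rng = np.random.default_rng(0)
indices = rng.permutation(len(data))  # randomize order
folds = np.array_split(indices, k)    # split into k near-equal parts
errors = []

for i in range(k):
  val_idx = folds[i]
  train_idx = np.concatenate(folds[:i] + folds[i+1:])
  model = train(data[train_idx])
  errors.append(evaluate(model, data[val_idx]))

cv_error = np.mean(errors)  # roughly unbiased generalization estimate
cv_std = np.std(errors)     # spread of the per-fold errors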

Choosing k

k | Bias of Estimate | Variance of Estimate | Compute Cost | Best For
k = 5 | Slightly higher | Lower | 5× training | Large datasets; most common default
k = 10 | Lower | Moderate | 10× training | Standard recommendation (ESL)
k = m (LOOCV) | Lowest (nearly unbiased) | Very high | m× training | Very small datasets (<100 samples)
Stratified k-fold | Same as k-fold | Lower for imbalanced | Same as k-fold | Classification with class imbalance

Using CV for Hyperparameter Tuning

⚠ The Right Way to Use Cross-Validation

CV estimates test error for a given hyperparameter setting. When you search over hyperparameters (grid search, random search, Bayesian optimization) and pick the best CV score, the selected model's CV error is optimistically biased — you've implicitly overfit the hyperparameters to the CV folds. Always maintain a completely separate test set to get the final unbiased estimate after all hyperparameter decisions are made. Use CV only for comparing options, not for final reporting.

§13

Ensemble Methods — Bagging and Boosting

Ensembles combine multiple models to produce a prediction better than any individual model. They work by exploiting the bias–variance decomposition in different ways.

Why Ensembles Work — The Math

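For \(B\) identically distributed models, each with variance \(\sigma^2\) (here the variance of a single model's prediction, not the noise level) and pairwise correlation \(\rho\), the variance of their mean is: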

Variance of Ensemble Mean \[ \text{Var}\left[\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(\mathbf{x})\right] = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2 \]

Where \(\rho\) is the pairwise correlation between individual model predictions. As \(B \to \infty\): variance \(\to \rho\sigma^2\). The key: uncorrelated (\(\rho = 0\)) models give \(\frac{\sigma^2}{B} \to 0\). The ensemble reduces variance proportionally to \(1/B\) — as long as the models are diverse. High correlation kills the benefit.
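The formula follows from the covariance matrix of the B predictions, which has σ² on the diagonal and ρσ² everywhere else. The sketch below builds that matrix explicitly and compares the variance of the equal-weight average with the closed form; the function names are illustrative.

```python
import numpy as np

def ensemble_variance(B, sigma2, rho):
    """Variance of the mean of B predictors with equal variance and pairwise correlation rho."""
    cov = sigma2 * (rho * np.ones((B, B)) + (1.0 - rho) * np.eye(B))
    w = np.full(B, 1.0 / B)  # equal-weight average
    return float(w @ cov @ w)

def closed_form(B, sigma2, rho):
    return rho * sigma2 + (1.0 - rho) * sigma2 / B

for rho in (0.0, 0.3, 0.9):
    print(rho, [round(ensemble_variance(B, 1.0, rho), 4) for B in (1, 10, 100)])
```

With ρ = 0.9 the variance barely moves as B grows, which is why decorrelating the base models matters more than adding many of them.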

Bagging — Bootstrap Aggregating

Train \(B\) copies of the same model on different bootstrap samples (random samples with replacement, same size \(m\)). Average (regression) or vote (classification) their predictions.

Bagging Prediction \[ \hat{f}_{\text{bag}}(\mathbf{x}) = \frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(\mathbf{x}), \quad \mathcal{D}_b \sim \text{Bootstrap}(\mathcal{D}) \]
  • Random Forest = Bagging + at each split, only consider a random subset of features. This decorrelates the trees (reduces \(\rho\)) — giving much better variance reduction than simple bagging.
  • What it fixes: Reduces variance without increasing bias significantly. Works best on high-variance, low-bias base learners (deep trees).
  • Out-of-bag error: Each bootstrap sample leaves out ~37% of data. These out-of-bag samples give a free validation estimate — no need for a separate val set.
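The ~37% figure comes from (1 − 1/m)^m → 1/e, the probability that a given example is never drawn in m draws with replacement. A quick simulation, with an arbitrary dataset size `m`:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000
bootstrap_idx = rng.integers(0, m, size=m)  # draw m indices with replacement
oob_fraction = 1.0 - np.unique(bootstrap_idx).size / m
print(oob_fraction, (1 - 1 / m) ** m, np.exp(-1))
```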

Boosting — Sequential Bias Reduction

Boosting trains models sequentially, each one focused on correcting the errors of the previous ones. Instead of reducing variance (like bagging), it reduces bias.

// Gradient Boosting (simplified)
F_0(x) = mean(y)  # initial prediction (the optimal constant for squared loss)

for b = 1 to B:
  # Compute pseudo-residuals (negative gradient of the loss at the current prediction)
  r_i = −∂L(y_i, F_{b-1}(x_i)) / ∂F_{b-1}(x_i)

  # Fit a weak learner (small tree) to the pseudo-residuals
  h_b = fit_tree(x, r, max_depth=3)

  # Update ensemble with step size α (shrinkage)
  F_b(x) = F_{b-1}(x) + α·h_b(x)
Aspect | Bagging | Boosting
Goal | Reduce variance | Reduce bias
Training | Parallel: independent models | Sequential: each corrects prior
Base learner | Low-bias, high-variance (deep trees) | High-bias, low-variance (shallow trees)
Overfitting risk | Low (averaging reduces it) | High (can overfit with too many rounds)
Examples | Random Forest, Extra Trees | XGBoost, LightGBM, AdaBoost, CatBoost
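A runnable sketch of the boosting loop for squared loss, where the negative gradient is simply the residual y − F(x). The one-dimensional stump learner `fit_stump`, the toy sine data, and the values of `rounds` and `shrinkage` are assumptions made for illustration; libraries such as XGBoost use far more elaborate tree learners.

```python
import numpy as np

def fit_stump(x, r):
    """Best single-split regression stump on 1-D inputs x for targets r."""
    best = None
    for t in np.unique(x)[:-1]:  # candidate thresholds (largest value excluded)
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return lambda q: np.where(q <= t, left_value, right_value)

def gradient_boost(x, y, rounds=50, shrinkage=0.1):
    """Gradient boosting for squared loss with stumps as weak learners."""
    f0 = y.mean()  # initial constant prediction
    stumps, pred, history = [], np.full_like(y, f0), []
    for _ in range(rounds):
        residual = y - pred  # pseudo-residuals for squared loss
        stump = fit_stump(x, residual)
        pred = pred + shrinkage * stump(x)
        stumps.append(stump)
        history.append(float(np.mean((y - pred) ** 2)))
    predict = lambda q: f0 + shrinkage * sum(s(q) for s in stumps)
    return predict, history

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(0, 0.1, size=200)
predict, history = gradient_boost(x, y)
print(history[0], history[-1])  # training MSE after the first and last round
```

Training error falls every round by construction, so a validation set (or early stopping on the number of rounds) is what guards against the overfitting risk listed in the table.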
§14

Double Descent — The Modern Twist

Classical theory predicts a single U-shaped test error curve with increasing model complexity. Modern deep learning broke this prediction — error can rise, then fall again, giving a "double descent" curve. Understanding this resolves the apparent paradox that overparameterized models generalize well.

Fig 6. Double descent (test error vs. model size). The classical U-shape occurs below the interpolation threshold, where the model cannot memorize the training data. At the threshold, where train error reaches 0, test error spikes. Beyond it, in the modern overparameterized regime, error decreases again, often below the classical minimum.

The Interpolation Threshold and What Happens There

The interpolation threshold is where the model has just enough parameters to fit the training data perfectly, so training error reaches zero. Near this threshold there is essentially only one parameter setting that interpolates the data, and it must bend to fit every noisy label; the resulting fit is extremely sensitive to the training sample, hence the error spike.

Beyond this threshold, the model is overparameterized and there are infinitely many interpolating solutions. In linear models, gradient descent started from zero converges to the minimum-norm interpolating solution, the one with the smallest weights among all perfect interpolators; gradient descent with small initialization is believed to behave similarly in deep networks. In high dimensions, this minimum-norm solution can generalize surprisingly well.
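For linear least squares the minimum-norm claim can be verified directly. The sketch below assumes a random Gaussian design with more parameters than samples; it compares the pseudoinverse solution, another interpolant obtained by adding a null-space direction, and gradient descent started from zero. The sizes, step size, and iteration count are demo choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 100  # fewer samples than parameters: overparameterised
X = rng.normal(size=(m, n))
y = rng.normal(size=m)

theta_min = np.linalg.pinv(X) @ y  # minimum-norm solution of X @ theta = y

# Any other interpolant differs from theta_min by a vector in the null space of X
null_direction = rng.normal(size=n)
null_direction -= np.linalg.pinv(X) @ (X @ null_direction)  # remove the row-space part
theta_other = theta_min + null_direction

# Gradient descent on 0.5 * ||X @ theta - y||^2, started from zero
theta_gd = np.zeros(n)
step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / (largest singular value)^2
for _ in range(2000):
    theta_gd -= step * X.T @ (X @ theta_gd - y)

print(np.linalg.norm(theta_min), np.linalg.norm(theta_other), np.linalg.norm(theta_gd))
```

Both `theta_min` and `theta_other` fit the training data exactly, but gradient descent lands on the smaller-norm one because its iterates never leave the row space of X.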

Why This Matters for Deep Learning

  • Modern neural networks operate in the far-right of the double descent curve — massively overparameterized, achieving near-zero training loss while still generalizing. Classical theory said this should be catastrophic; double descent explains why it isn't.
  • Implicit regularization: SGD with small initialization has an implicit bias toward flat minima and minimum-norm solutions — which happen to generalize well. This is an active research area.
  • Your ViT model: Vision Transformers with hundreds of millions of parameters trained on limited chest X-ray data are in this regime. Regularization (weight decay, dropout), data augmentation, and pretraining are all essential to move the generalization curve downward.
💡 The Updated Picture

Classical advice: "if your model overfits, use a simpler model or more regularization." Modern deep learning advice: "if your model overfits, consider making it much larger and adding implicit regularization through SGD dynamics, data augmentation, and architectural choices." Both are true — they operate in different regimes of the double descent curve. Your diagnosis (learning curves, bias–variance) tells you which regime you're in.

§15

Decision Framework — What to Do When

✓ The Diagnostic First

Always diagnose before treating. Train the model. Plot learning curves. Is train error too high? → bias problem. Is val error much higher than train? → variance problem. Do both errors stay high as m increases? → bias. Does more data close the gap? → variance. Most wasted ML effort comes from applying variance fixes (regularization) to bias problems and vice versa.

Symptom | Cause | Treatments (in rough priority order)
Train error high + Val error ≈ Train error | High Bias / Underfitting | 1) Larger model · 2) More features / better representation · 3) Reduce λ · 4) Train longer · 5) Better optimizer
Train error low + Val error ≫ Train error | High Variance / Overfitting | 1) More data · 2) L2/L1 regularization · 3) Dropout · 4) Early stopping · 5) Smaller model · 6) Data augmentation · 7) Ensemble
Val error rises during training | Overfitting in time | 1) Early stopping · 2) Reduce the learning rate at the epoch where val error starts rising · 3) Increase λ
High error, both losses diverge | Training instability | 1) Reduce learning rate · 2) Gradient clipping · 3) Better initialization · 4) BatchNorm
Val error doesn't improve with 10× data | High Bias (irreducible in current class) | 1) Fundamentally different model architecture · 2) New features · 3) Transfer learning
Large variance in CV scores | High model sensitivity | 1) Ensemble · 2) More regularization · 3) Larger k in k-fold · 4) More data

The Bias–Variance Priority Rule

When both bias and variance are high, fix bias first. Here's why: applying variance-reduction techniques (regularization, simpler model) to a high-bias model makes it worse — it reduces capacity further. But increasing capacity (to fix bias) on a high-variance model is manageable — you can then apply regularization. Fix the model class first, then regularize.

§16

The Full Mental Model

[Figure: mental map of generalization. The decomposition E[(y − f̂)²] = Bias² + Variance + σ² splits into bias² (underfitting), variance (overfitting), and irreducible noise σ²; these lead to diagnosis with learning curves, fixes for underfitting, fixes for overfitting, and error estimation; a final panel shows the double-descent view (classical regime, interpolation threshold, modern overparameterized regime).]
Fig 7. Complete mental map — from the bias–variance decomposition through diagnosis, remedies, and the modern double-descent picture.

The complete story in one thread:

  1. The Goal is Generalization: We minimize empirical risk \(\hat{R}\) but care about true risk \(R\). The generalization gap \(R - \hat{R}\) is what we control.
  2. Expected test error decomposes exactly: \(\text{Bias}^2 + \text{Variance} + \sigma^2\). For squared loss this is not an approximation; it is a theorem. Irreducible noise \(\sigma^2\) is a floor that no model's expected error can go below.
  3. Underfitting = high bias: The hypothesis class doesn't contain a good approximation of the true function. Even with infinite data, performance is bounded. Fix: richer model, better features. Symptoms: both train and val loss are high, gap is small.
  4. Overfitting = high variance: The model learned the training noise. Sensitive to which specific training points were sampled. Fix: regularization, more data, ensembling, simpler model. Symptoms: low train loss, high val loss, large gap.
  5. Diagnose first with learning curves — the shape tells you which problem you have. Then apply the correct remedy. Applying variance fixes to a bias problem makes things worse.
  6. Regularization introduces bias deliberately to reduce variance — trading error sources for net gain. L2 shrinks all weights. L1 drives many to exactly zero. Dropout ensembles \(2^N\) networks implicitly. Early stopping constrains the parameter path.
  7. More data reduces variance but does not reduce bias. Cross-validation gives a nearly unbiased (slightly pessimistic) generalization estimate when held-out test data is scarce.
  8. Ensembles exploit the Bias–Variance identity: Averaging \(B\) uncorrelated models of variance \(\sigma^2\) gives variance \(\sigma^2/B\). Bagging reduces variance. Boosting reduces bias sequentially.
  9. Double descent: Classical theory predicts one U-shaped curve. Modern overparameterized models (like Vision Transformers) operate beyond the interpolation threshold, in a regime where more parameters actually help, because SGD's implicit bias favors low-norm solutions that generalize.
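Point 8 above can be checked with a short simulation. The sketch below averages \(B\) predictions that each have variance \(\sigma^2\), first independent and then with pairwise correlation \(\rho\). The correlated formula \(\rho\sigma^2 + (1-\rho)\sigma^2/B\) is the standard result for equicorrelated variables; it is included here as context and is not stated in the list above. All numeric values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
B, sigma2, trials = 10, 4.0, 200_000

# Uncorrelated case: B independent predictions, each with variance sigma2.
indep = rng.normal(scale=np.sqrt(sigma2), size=(trials, B))
var_indep = indep.mean(axis=1).var()

# Correlated case: every model shares one common component, which gives
# pairwise correlation rho while keeping each model's variance at sigma2.
rho = 0.5
shared = rng.normal(scale=np.sqrt(rho * sigma2), size=(trials, 1))
own = rng.normal(scale=np.sqrt((1 - rho) * sigma2), size=(trials, B))
var_corr = (shared + own).mean(axis=1).var()

theory_indep = sigma2 / B
theory_corr = rho * sigma2 + (1 - rho) * sigma2 / B

print(f"independent: simulated {var_indep:.3f}, theory {theory_indep:.3f}")
print(f"correlated:  simulated {var_corr:.3f}, theory {theory_corr:.3f}")
```

The correlated case is why bagged models trained on overlapping data fall short of the full \(1/B\) reduction: the shared component \(\rho\sigma^2\) does not average away no matter how many models are added.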
★ The One Fact That Ties Everything Together

Every technique in this document is either reducing bias, reducing variance, or estimating which one needs reducing. L2 regularization reduces variance at the cost of bias. Feature engineering reduces bias. Data augmentation reduces variance. Ensembles reduce variance. Boosting reduces bias. Cross-validation estimates total error so you can identify which term dominates. The bias–variance decomposition is not just a theoretical result — it is the organizing principle of all of machine learning.
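The decomposition itself can be estimated empirically by refitting a model on many resampled training sets. The sketch below is one way to do that under stated assumptions: the true function \(\sin(\pi x)\), the noise level, the fixed training grid, and the polynomial degrees are all illustrative, and `decompose` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
noise_sd = 0.3                        # sigma, so sigma^2 = 0.09
x_train = np.linspace(-1.0, 1.0, 30)  # fixed design: only the noise is resampled
x_test = np.linspace(-1.0, 1.0, 50)


def f(x):
    """Assumed true function."""
    return np.sin(np.pi * x)


def decompose(degree, n_datasets=500):
    """Estimate bias^2, variance, and test MSE of a polynomial fit."""
    preds = np.empty((n_datasets, x_test.size))
    for i in range(n_datasets):
        y = f(x_train) + rng.normal(scale=noise_sd, size=x_train.size)
        coeffs = np.polyfit(x_train, y, degree)
        preds[i] = np.polyval(coeffs, x_test)

    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - f(x_test)) ** 2)   # systematic error
    variance = np.mean(preds.var(axis=0))           # sensitivity to the sample

    # Direct measurement of test error against fresh noisy targets.
    y_test = f(x_test) + rng.normal(scale=noise_sd, size=preds.shape)
    mse = np.mean((preds - y_test) ** 2)
    return bias2, variance, mse


for degree in (1, 3, 9):
    bias2, variance, mse = decompose(degree)
    total = bias2 + variance + noise_sd**2
    print(f"degree {degree}: bias^2={bias2:.4f} variance={variance:.4f} "
          f"sum+noise={total:.4f} measured MSE={mse:.4f}")
```

The measured MSE should track the sum of the three terms up to sampling error. Bias should dominate at low degree and variance should grow with degree, matching the classical regime described in this document.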