What Does "Generalize" Mean?
A model that memorizes its training data perfectly is useless — the world will always show it examples it hasn't seen before. The fundamental goal of supervised learning is not to fit training data. It is to learn the underlying pattern well enough to make accurate predictions on new data.
We optimize \(\hat{R}\) (the training loss) but care about \(R\) (the true risk over the entire distribution). The gap between them is what we must control. A large gap means the model learned something specific to the training set — it will fail on new data.
The Three Sets — Why They All Exist
- Training set: Used to fit model parameters. Loss here is \(\hat{R}\). Models always perform best here — they were literally optimized on it.
- Validation set: Used to tune hyperparameters (learning rate, regularization strength, architecture choices). Never used for gradient updates. Gives an estimate of generalization during development, though that estimate grows optimistic the more you tune against it.
- Test set: Touched exactly once — after all development is complete. Provides the final, unbiased generalization estimate. Using the test set to make any model decision contaminates it.
If you ever use the test set to make any decision — choosing between models, deciding to add more data, stopping early — it becomes part of your validation set. You have inadvertently fit hyperparameters to it. Report results on a truly held-out set, or use cross-validation with a separate final test set.
The Two Failure Modes
- Underfitting: High error on both training AND test data. The model never learned the pattern. Bias is too high.
- Good fit: Low error on both. The model learned the true pattern without memorizing noise. This is the goal.
- Overfitting: Low error on training, high error on test. The model memorized training noise. Variance is too high.
Underfitting — High Bias
Underfitting occurs when a model is too simple to capture the underlying pattern in the data. The model makes systematic errors — it consistently predicts wrong in the same direction, regardless of which training samples were used.
What Bias Means Precisely
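Bias is the expected deviation of the model's prediction from the true value \(f(\mathbf{x})\), averaged over all possible training sets:
\[\text{Bias}[\hat{f}(\mathbf{x})] = \mathbb{E}_\mathcal{D}[\hat{f}(\mathbf{x})] - f(\mathbf{x})\]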
The expectation is over all possible training sets \(\mathcal{D}\) of size \(m\) drawn from the same distribution. Even with infinite data, a high-bias model will still predict wrong — because the hypothesis class doesn't contain the true function.
True relationship: \(y = 0.5x^3 - 2x^2 + x + \epsilon\). You fit a linear model \(\hat{y} = \theta_0 + \theta_1 x\). No matter how much data you give it, the best linear fit will consistently underestimate the peaks and overestimate the troughs — because a line cannot represent a cubic. This systematic error is bias. Adding more data won't fix it.
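A small simulation can make this concrete. The sketch below assumes inputs drawn uniformly from [-1, 4] and Gaussian noise with standard deviation 0.5 (both are illustrative choices, not taken from the text), fits a least-squares line, and reports the training MSE as the sample size grows. The helper names are hypothetical.

```python
import random

def true_f(x):
    # The cubic from the example above, without the noise term
    return 0.5 * x**3 - 2 * x**2 + x

def fit_line(xs, ys):
    """Ordinary least squares for y = theta0 + theta1 * x (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    theta1 = sxy / sxx
    return my - theta1 * mx, theta1

def linear_fit_mse(m, noise_sd=0.5, seed=0):
    """Training MSE of the best line on m noisy samples of the cubic."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 4.0) for _ in range(m)]
    ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    theta0, theta1 = fit_line(xs, ys)
    return sum((y - (theta0 + theta1 * x)) ** 2 for x, y in zip(xs, ys)) / m

# The noise variance is 0.25; anything well above that is systematic error.
for m in (50, 500, 5000):
    print(m, round(linear_fit_mse(m), 3))
```

If the training error stays far above the noise variance no matter how large m gets, the remaining error is bias, which is the behaviour the example describes.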
Symptoms of Underfitting
- High training loss and high validation loss — both are bad, and they're similar to each other. The model fails on the data it was trained on.
- Performance doesn't improve much with more data — adding more samples of the same distribution won't help a model that structurally can't represent the pattern.
- Large residuals with clear systematic structure — in regression, a residual plot shows obvious curves or trends; the model missed a real pattern.
Causes of Underfitting
| Cause | What's Wrong | Fix |
|---|---|---|
| Hypothesis class too simple | Linear model for nonlinear data; shallow tree; too few neurons | Increase model capacity |
| Regularization too strong | \(\lambda\) so large that weights are crushed to near-zero | Decrease regularization |
| Too few features | Missing variables that explain the variance in \(y\) | Feature engineering; add relevant inputs |
| Training too short | Stopped before convergence | Train longer |
| Learning rate too large | Gradient descent diverges/oscillates; never settles | Reduce learning rate |
Overfitting — High Variance
Overfitting occurs when a model is so powerful that it learns not just the true pattern, but also the random noise specific to the training set. The model fits every quirk of the training data — including sampling accidents that won't appear in new data.
What Variance Means Precisely
Variance measures how much the model's prediction at \(\mathbf{x}\) changes when you retrain on a different sample of the same size from the same distribution. A high-variance model is wildly sensitive to which specific data points ended up in the training set.
A degree-15 polynomial fit to 20 data points. If you redraw 20 samples from the same distribution and refit, the polynomial looks completely different: it bends wildly to chase the 20 new points. Each polynomial hugs its own training points, but the two disagree almost everywhere else. That disagreement is variance. The model learned the noise, not the signal.
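Variance can be estimated directly by retraining on many freshly drawn training sets and measuring how much the prediction at one point moves. Fitting a degree-15 polynomial needs a linear-algebra library, so this sketch swaps in two simpler stand-ins: a mean predictor (low capacity) and a 1-nearest-neighbour predictor (high capacity). The ground-truth function, noise level, and query point are assumptions made for the demo.

```python
import random
import statistics

def true_f(x):
    return x * x  # assumed ground truth for the demo

def sample_training_set(rng, m, noise_sd):
    xs = [rng.uniform(-1.0, 1.0) for _ in range(m)]
    return [(x, true_f(x) + rng.gauss(0.0, noise_sd)) for x in xs]

def predict_mean(train, x0):
    """Low-capacity model: ignore x0 and predict the average label."""
    return sum(y for _, y in train) / len(train)

def predict_1nn(train, x0):
    """High-capacity model: copy the label of the nearest training point."""
    return min(train, key=lambda p: abs(p[0] - x0))[1]

def prediction_variance(predict, x0, m=20, trials=2000, noise_sd=0.3, seed=0):
    """Variance of the prediction at x0 across independently drawn training sets."""
    rng = random.Random(seed)
    preds = [predict(sample_training_set(rng, m, noise_sd), x0) for _ in range(trials)]
    return statistics.pvariance(preds)

var_mean = prediction_variance(predict_mean, x0=0.5)
var_1nn = prediction_variance(predict_1nn, x0=0.5)
print(f"mean predictor: {var_mean:.4f}   1-NN predictor: {var_1nn:.4f}")
```

The 1-NN predictor inherits the full noise of a single training label, so its prediction should fluctuate far more across training sets than the averaged one.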
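Symptoms of Overfitting
- Low training loss but much higher validation loss: the gap between them is large.
- Validation loss starts rising during training while training loss keeps falling.
- More data helps: the gap narrows as the training set grows.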
Causes of Overfitting
| Cause | What's Wrong | Fix |
|---|---|---|
| Model too complex | Too many parameters relative to data size | Reduce capacity; regularize |
| Too little data | Training set too small — model memorizes it | Get more data; augment |
| No regularization | Weights grow unrestricted to fit noise | L1/L2/Dropout |
| Training too long | Kept training after the validation loss started rising | Early stopping |
| Noisy / mislabeled data | Model learns the wrong labels | Clean data; label smoothing |
| Feature leakage | Test-time information smuggled into training features | Strict feature engineering pipeline |
Bias–Variance Decomposition — Full Proof
The expected test error at a point \(\mathbf{x}\) can be decomposed into three terms. This decomposition is the mathematical backbone of everything in this masterclass. We prove it from scratch.
Setup
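Assume: \(y = f(\mathbf{x}) + \epsilon\) where \(f\) is the true unknown function and \(\epsilon\) is irreducible noise with \(\mathbb{E}[\epsilon] = 0\) and \(\text{Var}[\epsilon] = \sigma^2\), independent of the training set. We learn \(\hat{f}(\mathbf{x})\) from a training set \(\mathcal{D}\). Define the average prediction over training sets, \(\bar{f}(\mathbf{x}) = \mathbb{E}_\mathcal{D}[\hat{f}(\mathbf{x})]\), so that \(\text{Bias} = \bar{f} - f\) and \(\text{Variance} = \mathbb{E}_\mathcal{D}[(\hat{f} - \bar{f})^2]\).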
The Decomposition Proof
- \(2(f-\bar{f})\mathbb{E}_\mathcal{D}[(\bar{f}-\hat{f})] = 2(f-\bar{f})\cdot 0 = 0\) since \(\mathbb{E}[\hat{f}] = \bar{f}\) by definition.
- \(2\mathbb{E}[\epsilon(f-\bar{f})] = 0\) since \(\epsilon\) is independent of \(\hat{f}\) and \(f\), and \(\mathbb{E}[\epsilon]=0\).
- \(2\mathbb{E}[\epsilon(\bar{f}-\hat{f})] = 0\) by independence of noise and training set.
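The three vanishing cross terms listed above come from expanding a square. As a reconstruction from the setup (taking \(\bar{f} = \mathbb{E}_\mathcal{D}[\hat{f}]\) as the definition of the average prediction, and squared error as the loss), the full chain can be sketched as:

```latex
\begin{align*}
\mathbb{E}\big[(y - \hat{f})^2\big]
  &= \mathbb{E}\big[\big((f - \bar{f}) + (\bar{f} - \hat{f}) + \epsilon\big)^2\big] \\
  &= (f - \bar{f})^2
     + \mathbb{E}_\mathcal{D}\big[(\bar{f} - \hat{f})^2\big]
     + \mathbb{E}[\epsilon^2] \\
  &\quad + 2(f - \bar{f})\,\mathbb{E}_\mathcal{D}[\bar{f} - \hat{f}]
         + 2\,\mathbb{E}[\epsilon (f - \bar{f})]
         + 2\,\mathbb{E}[\epsilon (\bar{f} - \hat{f})] \\
  &= \underbrace{(f - \bar{f})^2}_{\text{Bias}^2}
   + \underbrace{\mathbb{E}_\mathcal{D}\big[(\hat{f} - \bar{f})^2\big]}_{\text{Variance}}
   + \underbrace{\sigma^2}_{\text{noise}}
\end{align*}
```

The first line substitutes \(y = f + \epsilon\) and adds and subtracts \(\bar{f}\); the last line uses \(\mathbb{E}[\epsilon^2] = \sigma^2\), which holds because the noise has zero mean.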
- Bias²: how far the average prediction is from the truth. A property of the hypothesis class, not the specific training set. Eliminated only by using a richer model family.
- Variance: how much the prediction fluctuates across different training sets of the same size. Reduced by regularization, more data, ensembling.
- σ²: the irreducible noise floor of the data-generating process. No model, no matter how perfect, can do better than this.
The Tradeoff — Geometry and Intuition
Bias and variance move in opposite directions as model complexity increases. This creates the classic U-shaped test error curve.
The Fundamental Tension
- Simple models (low complexity): High bias (can't fit the pattern) but low variance (stable across training sets). Example: linear regression for nonlinear data.
- Complex models (high complexity): Low bias (can represent complex patterns) but high variance (sensitive to training sample). Example: deep neural networks without regularization.
- The sweet spot is wherever \(\text{Bias}^2 + \text{Variance}\) is minimized — the optimal complexity for your data size and noise level.
- More data shifts the curve: With \(m \to \infty\) samples, variance \(\to 0\) for most models (by the law of large numbers), leaving only bias and irreducible noise. More data always helps variance; it never helps bias.
Effect of Training Size on Bias and Variance
For many models (e.g. linear models with \(n\) features): \(\text{Var} \propto \sigma^2 n / m\). More data reduces variance; more features increase it.
This formula captures a crucial practical insight: you can afford more model complexity if you have more data. A 1000-parameter model needs far fewer samples to generalize than a 1 billion-parameter model. Modern deep learning works by having model complexity so high it seems it should fail — but the massive datasets compensate.
Diagnosing — Learning Curves
Before fixing a problem, you must diagnose it correctly. Learning curves — plots of training and validation loss vs. training set size (or epoch count) — are the primary diagnostic tool.
Training Size Learning Curves
Fix the model, vary the number of training samples \(m\). Plot train and validation error as functions of \(m\):
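As a sketch of how such a curve can be computed, the code below fixes one model (a least-squares line), draws training sets of increasing size from a synthetic linear process, and averages train and validation error over repeated draws. The data-generating process, the sizes, and the function names are all assumptions for illustration.

```python
import random

def fit_line(xs, ys):
    """Closed-form least squares for y = b + w * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - w * mx, w

def mse(params, xs, ys):
    b, w = params
    return sum((y - (b + w * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def make_data(rng, m, noise_sd=1.0):
    # Assumed process: y = 2x + 1 + Gaussian noise
    xs = [rng.uniform(0.0, 1.0) for _ in range(m)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def learning_curve(sizes, repeats=200, seed=0):
    """Return (m, mean train error, mean validation error) for each m."""
    rng = random.Random(seed)
    val_x, val_y = make_data(rng, 2000)  # one fixed validation set
    curve = []
    for m in sizes:
        train_sum, val_sum = 0.0, 0.0
        for _ in range(repeats):
            xs, ys = make_data(rng, m)
            params = fit_line(xs, ys)
            train_sum += mse(params, xs, ys)
            val_sum += mse(params, val_x, val_y)
        curve.append((m, train_sum / repeats, val_sum / repeats))
    return curve

curve = learning_curve([5, 20, 80, 320])
for m, train_err, val_err in curve:
    print(f"m={m:4d}  train={train_err:.3f}  val={val_err:.3f}")
```

In this setup the model family contains the true function, so the two curves should approach the noise variance from opposite sides as m grows; a high-bias model would instead flatten out at a higher level.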
Reading Learning Curves — Diagnostic Rules
| Observation | Diagnosis | Remedy |
|---|---|---|
| Train error ≈ Val error, both high | High bias / underfitting | More capacity, less regularization, better features |
| Train error low, Val error much higher | High variance / overfitting | Regularize, more data, simpler model, dropout |
| Val error rising during training | Overfitting mid-training | Early stopping |
| Both errors high and diverging | Learning rate issues | Reduce LR; check gradient flow |
| Val error bouncing wildly | Batch size too small / LR too high | Increase batch size; use LR schedule |
| Gap closes as m increases | Variance problem (more data helps) | Collect more training data |
| Both errors plateau at a high level regardless of m | Bias problem (more data won't help) | Increase model complexity |
Fixing Underfitting
- Increase model capacity: Add more layers or neurons (deep learning). Increase polynomial degree. Use a more expressive algorithm (e.g., tree → neural net). The hypothesis class must contain a good approximation of the true function.
- Engineer better features: Add interaction terms (\(x_1 x_2\)), polynomial features (\(x^2, x^3\)), domain-specific transformations (\(\log x\), \(\sqrt{x}\)), or embeddings. A better representation lets a simpler model capture complex patterns.
- Reduce regularization: If you over-regularized (λ too high, dropout rate too high), the model is prevented from learning. Decrease \(\lambda\). Check that regularization is appropriate for the data size.
- Train longer or optimize better: If training hasn't converged, running more epochs or switching from SGD to Adam can reach a better minimum. Check that the learning rate is appropriate: too large a rate prevents convergence.

The most important diagnostic distinction is whether more data can help. If your model has high training error and high validation error with a small gap between them, adding more training data will not help, because the model structurally cannot learn the pattern. You must change the model family or the features. More data only helps variance (overfitting), not bias (underfitting).
Regularization — L1, L2, Elastic Net
Regularization is the primary tool for fighting overfitting in parametric models. It adds a penalty on model complexity directly to the loss — deliberately introducing bias to reduce variance.
L2 Regularization (Ridge / Weight Decay)
L2 adds a squared-norm penalty to the training loss: \(J(\boldsymbol{\theta}) = \hat{R}(\boldsymbol{\theta}) + \frac{\lambda}{2m}\|\boldsymbol{\theta}\|_2^2\). The gradient update becomes \(\boldsymbol{\theta} \leftarrow \boldsymbol{\theta}(1 - \alpha\lambda/m) - \alpha\nabla\hat{R}\). The factor \((1-\alpha\lambda/m) < 1\) is weight decay: weights shrink each step. L2 penalizes large weights but never drives them exactly to zero. All features remain in the model, but large coefficients are suppressed.
Effect on bias–variance: Larger \(\lambda\) → smaller effective weights → simpler model → higher bias, lower variance. \(\lambda = 0\) recovers unregularized OLS; \(\lambda \to \infty\) drives all weights to zero.
L1 Regularization (Lasso)
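L1 penalizes the sum of absolute weights, \(J(\boldsymbol{\theta}) = \hat{R}(\boldsymbol{\theta}) + \frac{\lambda}{m}\|\boldsymbol{\theta}\|_1\), with the penalty scaled by \(\lambda/m\) as in the L2 update. L1 drives many weights exactly to zero, performing implicit feature selection. The L1 ball has corners on the coordinate axes, and the penalized optimum tends to land on one of them, where some coordinates are exactly zero. Useful when you suspect many features are irrelevant: L1 automatically discards them.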
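The difference between shrinking and zeroing is easiest to see in a special case with closed-form answers. The sketch below assumes an orthonormal design (\(X^\top X = I\)) and the objectives \(\tfrac{1}{2}\|\mathbf{y} - X\boldsymbol{\theta}\|^2 + \tfrac{\lambda}{2}\|\boldsymbol{\theta}\|_2^2\) and \(\tfrac{1}{2}\|\mathbf{y} - X\boldsymbol{\theta}\|^2 + \lambda\|\boldsymbol{\theta}\|_1\); under those assumptions ridge rescales each least-squares coefficient and lasso soft-thresholds it. The coefficient values are made up for the demo.

```python
def ridge_shrink(theta_ols, lam):
    """Ridge solution under an orthonormal design: uniform rescaling."""
    return [t / (1.0 + lam) for t in theta_ols]

def soft_threshold(t, lam):
    """Move t toward zero by lam, and clamp to exactly zero inside [-lam, lam]."""
    if t > lam:
        return t - lam
    if t < -lam:
        return t + lam
    return 0.0

def lasso_shrink(theta_ols, lam):
    """Lasso solution under an orthonormal design: soft-thresholding."""
    return [soft_threshold(t, lam) for t in theta_ols]

theta_ols = [3.0, -0.4, 0.2, -1.5]  # hypothetical unregularized coefficients
ridge = ridge_shrink(theta_ols, lam=0.5)
lasso = lasso_shrink(theta_ols, lam=0.5)
print("ridge:", ridge)
print("lasso:", lasso)
```

Every ridge coefficient stays non-zero, while any lasso coefficient whose magnitude is below \(\lambda\) is removed outright. With correlated features there is no such closed form, and the lasso has to be solved iteratively.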
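Elastic Net (Best of Both)
Elastic Net applies both penalties at once, for example \(J(\boldsymbol{\theta}) = \hat{R}(\boldsymbol{\theta}) + \frac{\lambda}{m}\big(\gamma\|\boldsymbol{\theta}\|_1 + \frac{1-\gamma}{2}\|\boldsymbol{\theta}\|_2^2\big)\) with a mixing weight \(\gamma \in [0, 1]\). The L1 part keeps the solution sparse, while the L2 part stabilizes it when features are correlated.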
Choosing λ — The Regularization Path
There's no universal rule. Cross-validate over a grid of \(\lambda\) values. The optimal \(\lambda\) is the one that minimizes validation error. In practice:
- Start with a coarse log-scale grid: \(\lambda \in \{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1, 10\}\)
- Refine around the best coarse value
- In neural networks, L2 regularization is typically applied only to weight matrices, not biases
| | L1 (Lasso) | L2 (Ridge) | Elastic Net |
|---|---|---|---|
| Weights | Sparse: many exact zeros | Shrinks, never zero | Sparse + stable |
| Feature selection | Yes (implicit) | No | Yes (grouped) |
| Correlated features | Picks one arbitrarily | Shrinks all equally | Groups them |
| Closed-form? | No (subgradient) | Yes (for linear models) | No |
| Best for | High-dim, sparse true model | Most deep learning | High-dim, correlated features |
Dropout — Randomized Regularization
Dropout (Srivastava et al., 2014) is one of the most widely used regularization techniques for neural networks. During training, each neuron is randomly set to zero with probability \(p\) (the dropout rate) at each forward pass.
Why Dropout Works — Two Perspectives
Ensemble perspective: With \(N\) neurons and dropout rate \(p\), each training step samples a different subnetwork from the \(2^N\) possible networks. At test time, using the full network with scaled weights approximates averaging the predictions of all \(2^N\) subnetworks. Ensembles reduce variance (§13), and dropout implicitly does the same at almost no extra cost.
Co-adaptation perspective: Without dropout, neurons can develop complex co-dependencies, such as "neuron A fires only because neuron B is always active." Dropout forces each neuron to be useful independently, since it can't rely on any specific neighbor being present. This encourages learning more robust, distributed representations.
Inverted Dropout (Standard Implementation)
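```python
import torch

# Training forward pass: drop units at random, then rescale the survivors
def forward_train(h, p_drop):
    mask = (torch.rand_like(h) > p_drop).float()  # keep each unit with prob. 1 - p_drop
    return (h * mask) / (1 - p_drop)  # scale up during training

# Test forward pass: NO dropout, NO scaling needed
def forward_test(h):
    return h  # use full network as-is
```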
Inverted dropout divides by \((1-p)\) at training time (not test time), keeping expected activation the same. Standard dropout scales at test time — both are equivalent but inverted is preferred in practice since test-time inference doesn't need to know \(p\).
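The claim that inverted dropout keeps the expected activation unchanged can be checked numerically. This is a pure-Python sketch (no PyTorch) with made-up activations and a hypothetical helper name; it averages many dropped-out copies of the same vector and compares the average with the original.

```python
import random

def inverted_dropout(h, p_drop, rng):
    """Zero each activation with probability p_drop; rescale survivors by 1/(1 - p_drop)."""
    keep = 1.0 - p_drop
    return [x / keep if rng.random() < keep else 0.0 for x in h]

rng = random.Random(0)
h = [1.0, 2.0, 3.0, 4.0]   # hypothetical activations
n_trials = 20000
avg = [0.0] * len(h)
for _ in range(n_trials):
    out = inverted_dropout(h, p_drop=0.5, rng=rng)
    avg = [a + o / n_trials for a, o in zip(avg, out)]

print("original:", h)
print("average after dropout:", [round(a, 3) for a in avg])
```

Each surviving unit is doubled when p_drop is 0.5, which offsets the half of the passes in which it is zeroed, so the averages should sit close to the original values.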
Practical Dropout Guidelines
- Typical rate: 0.2–0.5 for fully connected layers. 0.1–0.2 for convolutional layers, where dropout is less effective because the shared, spatially structured weights already act as a regularizer.
- Where to apply: After activation functions in hidden layers. Usually not applied to input layer or output layer.
- Interaction with BatchNorm: Using both can be counterproductive — BatchNorm normalizes across the batch and effectively does its own regularization. Many modern architectures use BatchNorm without Dropout.
- Training vs inference: Always disable dropout at inference. In PyTorch, call model.eval(). Forgetting this is a common bug: your test performance will be noisy and lower than it should be.
Early Stopping — Implicit Regularization
Early stopping monitors validation loss during training and halts when it stops improving. It exploits the observation that validation error initially decreases with training, reaches a minimum, then starts rising as the model begins overfitting.
The Algorithm
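```python
import copy

# max_epochs, patience, min_delta, model, optimizer, val_loader, train_one_epoch and
# evaluate are assumed to be defined elsewhere; model is assumed to be a PyTorch nn.Module.
best_val_loss = float("inf")
patience_count = 0
best_weights = None
for epoch in range(max_epochs):
    train_one_epoch(model, optimizer)
    val_loss = evaluate(model, val_loader)
    if val_loss < best_val_loss - min_delta:
        best_val_loss = val_loss
        best_weights = copy.deepcopy(model.state_dict())  # checkpoint
        patience_count = 0
    else:
        patience_count += 1
        if patience_count >= patience:
            break  # stop training
if best_weights is not None:
    model.load_state_dict(best_weights)  # restore best checkpoint
```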
Why Early Stopping is Implicit L2 Regularization
For gradient descent with step size \(\alpha\) and \(T\) steps, the effective regularization is approximately \(\lambda \approx \frac{1}{\alpha T}\). Stopping early limits the total movement of parameters from their initial values — effectively constraining them to a ball around the initialization. This was proven formally by Bishop (1995) for quadratic losses and has been generalized to deep networks empirically.
Data Augmentation and More Data
The most reliable remedy for overfitting is more data. More training samples reduce variance directly — the model has less ability to memorize specific examples when there are more of them. When real data is limited, data augmentation manufactures additional training examples from existing ones.
Why More Data Reduces Variance
From the bias–variance analysis: \(\text{Var}[\hat{f}] \propto 1/m\) for many model families. Each additional sample constrains the model further, reducing its freedom to fit noise. In the limit \(m \to \infty\): variance \(\to 0\) and performance is limited only by bias and irreducible noise.
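The \(1/m\) scaling can be observed in the simplest possible model: an intercept-only predictor that outputs the sample mean of the labels. The sketch below assumes pure Gaussian noise with unit variance as the labels (an illustrative choice) and estimates the variance of that predictor across many simulated training sets.

```python
import random
import statistics

def sample_mean_variance(m, noise_sd=1.0, trials=4000, seed=0):
    """Variance, across simulated training sets of size m, of the sample-mean predictor."""
    rng = random.Random(seed)
    estimates = [
        sum(rng.gauss(0.0, noise_sd) for _ in range(m)) / m
        for _ in range(trials)
    ]
    return statistics.pvariance(estimates)

variances = {m: sample_mean_variance(m) for m in (10, 40, 160)}
for m, v in variances.items():
    # If variance scales as sigma^2 / m, then m * variance stays near sigma^2 = 1
    print(f"m={m:4d}  variance={v:.5f}  m*variance={m * v:.3f}")
```

Quadrupling m should cut the variance to roughly a quarter. Richer models follow the same trend with a larger constant, in line with the \(\sigma^2 n / m\) rule of thumb.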
Data Augmentation — Category by Domain
- Images: Random crop, horizontal/vertical flip, rotation (±15°), color jitter (brightness, contrast, saturation), Gaussian noise, random erase, mixup, cutout, RandAugment. Must preserve label semantics: rotating a "9" by 180° turns it into a "6", which would be wrong.
- Text: Synonym replacement, back-translation (translate to French, back to English), random insertion/deletion/swap of words, word2vec neighborhood sampling, EDA (Easy Data Augmentation). More constrained, because syntax and semantics must be preserved.
- Audio: Time stretching, pitch shifting, adding background noise, SpecAugment (frequency/time masking on the spectrogram). SpecAugment is standard for speech recognition.
- Tabular: SMOTE (synthetic minority oversampling for imbalanced classes), Gaussian noise on continuous features, bootstrapping. Less natural than image augmentation, so use carefully.
Advanced Augmentation — Mixup and CutMix
Mixup creates new training examples by linearly interpolating between two random training samples, both inputs and labels. It encourages the model to behave linearly between training points, which regularizes the decision boundary. It has been reported to work especially well for large CNNs and ViTs, which makes it directly relevant to your chest X-ray classifier.
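A minimal sketch of the interpolation step, assuming inputs are flat lists of floats, labels are one-hot lists, and the mixing weight is drawn from a Beta(α, α) distribution as in the original mixup formulation. The default α of 0.2 and the function name are illustrative choices. The weight is called `mix` in the code to avoid confusion with the regularization strength \(\lambda\).

```python
import random

def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Blend two (input, one-hot label) pairs with a Beta(alpha, alpha) weight."""
    mix = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x = [mix * a + (1.0 - mix) * b for a, b in zip(x_i, x_j)]
    y = [mix * a + (1.0 - mix) * b for a, b in zip(y_i, y_j)]
    return x, y, mix

rng = random.Random(0)
x_new, y_new, mix = mixup(
    [0.0, 1.0, 2.0], [1.0, 0.0],   # sample i: features, one-hot label (class 0)
    [4.0, 3.0, 2.0], [0.0, 1.0],   # sample j: features, one-hot label (class 1)
    rng=rng,
)
print(mix, x_new, y_new)
```

The blended label is a soft target that still sums to one, so it can be used with a cross-entropy loss that accepts probability targets. In a real pipeline the pairing is usually done per batch by shuffling the batch against itself.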
Data augmentation reduces variance by effectively increasing \(m\). But poor augmentation introduces bias — if the augmented samples don't reflect the true test distribution. For example, medical images should not be augmented with horizontal flip if lesions are laterality-specific. Always validate that your augmentations are semantically valid for the task.
Cross-Validation — Unbiased Error Estimation
Cross-validation gives a reliable estimate of generalization error while using all available data for both training and validation — essential when data is limited.
k-Fold Cross-Validation
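```python
import numpy as np

k = 5
# Shuffle indices and split them into k roughly equal folds.
# data is assumed to be a NumPy array of samples; train and evaluate are placeholders.
folds = np.array_split(np.random.permutation(len(data)), k)
errors = []
for i in range(k):
    val_idx = folds[i]
    train_idx = np.concatenate(folds[:i] + folds[i + 1:])
    model = train(data[train_idx])
    errors.append(evaluate(model, data[val_idx]))
cv_error = np.mean(errors)  # generalization estimate
cv_std = np.std(errors)     # spread across folds: uncertainty in the estimate
```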
Choosing k
| k | Bias of Estimate | Variance of Estimate | Compute Cost | Best For |
|---|---|---|---|---|
| k = 5 | Slightly higher | Lower | 5× training | Large datasets; most common default |
| k = 10 | Lower | Moderate | 10× training | Standard recommendation (ESL) |
| k = m (LOOCV) | Lowest (nearly unbiased) | Very high | m× training | Very small datasets (<100 samples) |
| Stratified k-fold | Same as k-fold | Lower for imbalanced | Same as k-fold | Classification with class imbalance |
Using CV for Hyperparameter Tuning
CV estimates test error for a given hyperparameter setting. When you search over hyperparameters (grid search, random search, Bayesian optimization) and pick the best CV score, the selected model's CV error is optimistically biased — you've implicitly overfit the hyperparameters to the CV folds. Always maintain a completely separate test set to get the final unbiased estimate after all hyperparameter decisions are made. Use CV only for comparing options, not for final reporting.
Ensemble Methods — Bagging and Boosting
Ensembles combine multiple models to produce a prediction better than any individual model. They work by exploiting the bias–variance decomposition in different ways.
Why Ensembles Work — The Math
For \(B\) identically distributed models, each with variance \(\sigma^2\) and pairwise correlation \(\rho\), the variance of their mean is:
\[\text{Var}\Big[\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b\Big] = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2\]
where \(\rho\) is the pairwise correlation between individual model predictions. As \(B \to \infty\): variance \(\to \rho\sigma^2\). The key: uncorrelated (\(\rho = 0\)) models give \(\frac{\sigma^2}{B} \to 0\). The ensemble reduces variance proportionally to \(1/B\), as long as the models are diverse. High correlation kills the benefit.
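To get a feel for the limits described above, the sketch below evaluates the standard expression \(\rho\sigma^2 + (1-\rho)\sigma^2/B\) for the variance of an average of \(B\) equally correlated predictions. The function name and the grid of values are illustrative.

```python
def ensemble_variance(sigma2, rho, B):
    """Variance of the mean of B predictions, each with variance sigma2 and pairwise correlation rho."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / B

for rho in (0.0, 0.5, 0.9):
    row = [round(ensemble_variance(1.0, rho, B), 3) for B in (1, 10, 100, 1000)]
    print(f"rho={rho}: {row}")
```

With \(\rho = 0\) the variance keeps falling as \(1/B\); with any positive correlation it levels off at \(\rho\sigma^2\), which is why decorrelating the members matters more than adding ever more of them.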
Bagging — Bootstrap Aggregating
Train \(B\) copies of the same model on different bootstrap samples (random samples with replacement, same size \(m\)). Average (regression) or vote (classification) their predictions.
- Random Forest = Bagging + at each split, only consider a random subset of features. This decorrelates the trees (reduces \(\rho\)) — giving much better variance reduction than simple bagging.
- What it fixes: Reduces variance without increasing bias significantly. Works best on high-variance, low-bias base learners (deep trees).
- Out-of-bag error: Each bootstrap sample leaves out ~37% of data. These out-of-bag samples give a free validation estimate — no need for a separate val set.
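The roughly 37% out-of-bag figure follows from \((1 - 1/m)^m \to e^{-1}\): the probability that a given sample is never picked in \(m\) draws with replacement. A quick simulation, with a hypothetical helper name and an arbitrary dataset size, agrees with it.

```python
import math
import random

def oob_fraction(m, seed=0):
    """Fraction of m samples never drawn in one bootstrap sample of size m."""
    rng = random.Random(seed)
    drawn = {rng.randrange(m) for _ in range(m)}  # indices that made it into the bootstrap
    return 1.0 - len(drawn) / m

frac = oob_fraction(10_000)
print(f"simulated: {frac:.4f}   1/e: {math.exp(-1):.4f}")
```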
Boosting — Sequential Bias Reduction
Boosting trains models sequentially, each one focused on correcting the errors of the previous ones. Instead of reducing variance (like bagging), it reduces bias.
```
F_0(x) = argmin_c Σ_i L(y_i, c)          # start from the best constant prediction
for b = 1 to B:
    # Compute residuals (negative gradient of loss), evaluated at F = F_{b-1}
    r_i = −∂L/∂F(x_i)                    # pseudo-residuals
    # Fit a weak learner (small tree) to residuals
    h_b = fit_tree(r, max_depth=3)
    # Update ensemble with step size α (shrinkage)
    F_b(x) = F_{b-1}(x) + α·h_b(x)
```
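For squared-error loss the pseudo-residuals are just the ordinary residuals \(y_i - F(x_i)\), which makes the loop easy to write out. The sketch below is a deliberately small instance under stated assumptions: one-dimensional inputs, depth-1 trees (stumps) instead of depth-3 trees, and a noiseless sine curve as toy data. All function names are hypothetical.

```python
import math

def fit_stump(xs, residuals):
    """Best single-split regression stump on 1-D inputs (minimizes squared error)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for k in range(1, len(order)):
        left = [residuals[i] for i in order[:k]]
        right = [residuals[i] for i in order[k:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            threshold = (xs[order[k - 1]] + xs[order[k]]) / 2
            best = (sse, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda x: lm if x <= threshold else rm

def gradient_boost(xs, ys, rounds=50, shrinkage=0.1):
    """Gradient boosting for squared loss: each stump is fit to the current residuals."""
    f0 = sum(ys) / len(ys)            # initial constant prediction
    preds = [f0] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient of 0.5 * (y - F)^2
        h = fit_stump(xs, residuals)
        stumps.append(h)
        preds = [p + shrinkage * h(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + shrinkage * sum(h(x) for h in stumps)

def train_mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 20 for i in range(40)]
ys = [math.sin(3 * x) for x in xs]
baseline = train_mse(lambda x: sum(ys) / len(ys), xs, ys)
short_run = train_mse(gradient_boost(xs, ys, rounds=5), xs, ys)
long_run = train_mse(gradient_boost(xs, ys, rounds=50), xs, ys)
print(baseline, short_run, long_run)
```

Training error keeps falling as rounds are added, which is the bias reduction described above. It is also why boosting needs a validation-based stopping rule: nothing in the loop itself prevents it from fitting noise.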
| Bagging | Boosting | |
|---|---|---|
| Goal | Reduce variance | Reduce bias |
| Training | Parallel — independent models | Sequential — each corrects prior |
| Base learner | Low-bias, high-variance (deep trees) | High-bias, low-variance (shallow trees) |
| Overfitting risk | Low (averaging reduces it) | High (can overfit with too many rounds) |
| Examples | Random Forest, Extra Trees | XGBoost, LightGBM, AdaBoost, CatBoost |
Double Descent — The Modern Twist
Classical theory predicts a single U-shaped test error curve with increasing model complexity. Modern deep learning broke this prediction — error can rise, then fall again, giving a "double descent" curve. Understanding this resolves the apparent paradox that overparameterized models generalize well.
The Interpolation Threshold and What Happens There
The interpolation threshold is where the model has exactly enough parameters to fit the training data perfectly, the point where training error reaches zero. Near this threshold the model can interpolate in essentially only one way, and that single solution must contort itself to pass through every noisy point, which typically generalizes poorly. Hence the error spike.
Beyond this threshold, the model is overparameterized — there are infinitely many interpolating solutions. Gradient descent with small initialization implicitly finds the minimum-norm interpolating solution — the one with the smallest weights among all perfect interpolators. In high dimensions, this minimum-norm solution generalizes surprisingly well.
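For linear least squares the minimum-norm claim can be written down explicitly. As a sketch, assume a design matrix \(X \in \mathbb{R}^{m \times n}\) with more parameters than samples (\(n > m\)) and full row rank; these symbols are introduced here for illustration and are not defined elsewhere in the text.

```latex
\hat{\boldsymbol{\theta}}_{\text{min-norm}}
  = \arg\min_{\boldsymbol{\theta}} \|\boldsymbol{\theta}\|_2
    \quad \text{subject to} \quad X\boldsymbol{\theta} = \mathbf{y},
\qquad\text{which gives}\qquad
\hat{\boldsymbol{\theta}}_{\text{min-norm}} = X^\top (X X^\top)^{-1} \mathbf{y}.
```

Gradient descent on the squared loss started from \(\boldsymbol{\theta} = \mathbf{0}\) only ever moves within the row space of \(X\), so when it converges to an interpolating solution it converges to this one. For deep networks the analogous statement is an empirical observation rather than a theorem.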
Why This Matters for Deep Learning
- Modern neural networks operate in the far-right of the double descent curve — massively overparameterized, achieving near-zero training loss while still generalizing. Classical theory said this should be catastrophic; double descent explains why it isn't.
- Implicit regularization: SGD with small initialization has an implicit bias toward flat minima and minimum-norm solutions — which happen to generalize well. This is an active research area.
- Your ViT model: Vision Transformers with hundreds of millions of parameters trained on limited chest X-ray data are in this regime. Regularization (weight decay, dropout), data augmentation, and pretraining are all essential to move the generalization curve downward.
Classical advice: "if your model overfits, use a simpler model or more regularization." Modern deep learning advice: "if your model overfits, consider making it much larger and adding implicit regularization through SGD dynamics, data augmentation, and architectural choices." Both are true — they operate in different regimes of the double descent curve. Your diagnosis (learning curves, bias–variance) tells you which regime you're in.
Decision Framework — What to Do When
Always diagnose before treating. Train the model. Plot learning curves. Is train error too high? → bias problem. Is val error much higher than train? → variance problem. Do both errors stay high and flat as m increases? → bias. Does more data close the gap? → variance. Most wasted ML effort comes from applying variance fixes (regularization) to bias problems and vice versa.
| Symptom | Cause | Treatments (in rough priority order) |
|---|---|---|
| Train error high + Val error ≈ Train error | High Bias / Underfitting | 1) Larger model · 2) More features / better representation · 3) Reduce λ · 4) Train longer · 5) Better optimizer |
| Train error low + Val error ≫ Train error | High Variance / Overfitting | 1) More data · 2) L2/L1 regularization · 3) Dropout · 4) Early stopping · 5) Smaller model · 6) Data augmentation · 7) Ensemble |
| Val error rises during training | Overfitting in time | 1) Early stopping · 2) Reduce LR at the same epoch · 3) Increase λ |
| High error, both losses diverge | Training instability | 1) Reduce learning rate · 2) Gradient clipping · 3) Better initialization · 4) BatchNorm |
| Validation error stays high and flat with 10× data | High Bias (irreducible in current class) | 1) Fundamentally different model architecture · 2) New features · 3) Transfer learning |
| Large variance in CV scores | High model sensitivity | 1) Ensemble · 2) More regularization · 3) Larger k in k-fold · 4) More data |
The Bias–Variance Priority Rule
When both bias and variance are high, fix bias first. Here's why: applying variance-reduction techniques (regularization, simpler model) to a high-bias model makes it worse — it reduces capacity further. But increasing capacity (to fix bias) on a high-variance model is manageable — you can then apply regularization. Fix the model class first, then regularize.
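The table and the priority rule can be condensed into a small helper. This is a sketch only: `target_err` (the error you would accept, such as a noise-floor or human-level estimate) and the tolerance `tol` are hypothetical inputs that have to be chosen per task.

```python
def diagnose(train_err, val_err, target_err, tol=0.05):
    """Map learning-curve readings to a bias/variance diagnosis.

    tol is an assumed threshold for calling a difference "large".
    """
    high_bias = (train_err - target_err) > tol      # training error far above what is acceptable
    high_variance = (val_err - train_err) > tol     # large train/validation gap
    if high_bias and high_variance:
        return "high bias and high variance: fix bias first, then regularize"
    if high_bias:
        return "high bias: increase capacity, improve features, reduce regularization"
    if high_variance:
        return "high variance: more data, regularization, early stopping"
    return "good fit"

print(diagnose(train_err=0.30, val_err=0.32, target_err=0.05))
print(diagnose(train_err=0.02, val_err=0.25, target_err=0.05))
print(diagnose(train_err=0.30, val_err=0.60, target_err=0.05))
```

A fixed threshold is crude; in practice the same comparison is better made on the learning curves themselves, where the trend with m is visible.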
The Full Mental Model
The complete story in one thread:
- The Goal is Generalization: We minimize empirical risk \(\hat{R}\) but care about true risk \(R\). The generalization gap \(R - \hat{R}\) is what we control.
- Expected squared-error test loss decomposes exactly: \(\text{Bias}^2 + \text{Variance} + \sigma^2\). This is not an approximation; it's a theorem. Irreducible noise \(\sigma^2\) is a floor that no model's error can drop below.
- Underfitting = high bias: The hypothesis class doesn't contain a good approximation of the true function. Even with infinite data, performance is bounded. Fix: richer model, better features. Symptoms: both train and val loss are high, gap is small.
- Overfitting = high variance: The model learned the training noise. Sensitive to which specific training points were sampled. Fix: regularization, more data, ensembling, simpler model. Symptoms: low train loss, high val loss, large gap.
- Diagnose first with learning curves — the shape tells you which problem you have. Then apply the correct remedy. Applying variance fixes to a bias problem makes things worse.
- Regularization introduces bias deliberately to reduce variance — trading error sources for net gain. L2 shrinks all weights. L1 drives many to exactly zero. Dropout ensembles \(2^N\) networks implicitly. Early stopping constrains the parameter path.
- More data always reduces variance, never reduces bias. Cross-validation gives an unbiased generalization estimate when held-out test data is scarce.
- Ensembles exploit the Bias–Variance identity: Averaging \(B\) uncorrelated models of variance \(\sigma^2\) gives variance \(\sigma^2/B\). Bagging reduces variance. Boosting reduces bias sequentially.
- Double descent: Classical theory predicts one U-shaped curve. Modern overparameterized models (like Vision Transformers) operate beyond the interpolation threshold, in a regime where more parameters actually help, because SGD's implicit bias finds minimum-norm solutions that generalize.
Every technique in this document is either reducing bias, reducing variance, or estimating which one needs reducing. L2 regularization reduces variance at the cost of bias. Feature engineering reduces bias. Data augmentation reduces variance. Ensembles reduce variance. Boosting reduces bias. Cross-validation estimates total error so you can identify which term dominates. The bias–variance decomposition is not just a theoretical result — it is the organizing principle of all of machine learning.