What is Optimization? The Landscape
Every machine learning model has a loss function \( J(\boldsymbol{\theta}) \) — a scalar that measures how wrong the model is. Training a model is entirely a problem of minimizing this function over the parameter space \( \boldsymbol{\theta} \in \mathbb{R}^n \).
Think of \( J(\boldsymbol{\theta}) \) as defining a surface in \( (n+1) \)-dimensional space — a "loss landscape." Each point in \( \mathbb{R}^n \) is a configuration of weights; the height at each point is the loss. Training = finding the lowest point on this landscape.
Why Can't We Just Solve It Directly?
For some models (linear regression), there's a closed-form solution: set the gradient to zero and solve algebraically. But for almost everything else:
- Neural networks: \(n\) can be billions of parameters. The normal equation \(\boldsymbol{\theta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\) requires inverting an \(n \times n\) matrix — \(O(n^3)\) cost, completely infeasible.
- Non-convex losses: Setting \(\nabla J = 0\) gives a system of nonlinear equations with no closed form — think sigmoid, ReLU composed many times.
- Online/streaming data: We may not have all data at once, so analytical solutions are structurally impossible.
Instead of jumping to the answer, we walk toward it — taking small steps in the direction that decreases the loss fastest. That direction is the negative gradient. This is gradient descent.
The Gradient — Mathematical Foundation
From Derivative to Gradient
For a scalar function of one variable \(f(x)\), the derivative \(f'(x)\) tells us the slope — the rate of change. When the function takes a vector input \( \boldsymbol{\theta} \in \mathbb{R}^n \), we generalize to partial derivatives.
The gradient \(\nabla J(\boldsymbol{\theta})\) is the vector of all partial derivatives:
\[ \nabla J(\boldsymbol{\theta}) = \left( \frac{\partial J}{\partial \theta_1}, \frac{\partial J}{\partial \theta_2}, \ldots, \frac{\partial J}{\partial \theta_n} \right)^T \]
Each component \(\frac{\partial J}{\partial \theta_j}\) tells you: "how much does the loss change if I nudge \(\theta_j\) slightly, holding all other parameters fixed?"
What the Gradient Tells You — Geometrically
- Direction: \(\nabla J(\boldsymbol{\theta})\) points in the direction of steepest increase of \(J\) at \(\boldsymbol{\theta}\). Therefore, \(-\nabla J\) points toward steepest decrease.
- Magnitude: \(\|\nabla J(\boldsymbol{\theta})\|\) tells you how steep the slope is. A large gradient means the surface is steep; near a minimum it approaches zero.
- Orthogonality: The gradient is always perpendicular (orthogonal) to the level curves (contours) of the function — just as a mountain's steepest path is always perpendicular to elevation contours.
- Local information: The gradient is a local property — it only describes the surface at a single point. This is why gradient descent must iterate.
Gradient of Common Loss Functions
For linear regression with MSE loss, \(\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\theta}\) and
\[ \nabla J = \frac{1}{m}\mathbf{X}^T(\mathbf{X}\boldsymbol{\theta} - \mathbf{y}) = \frac{1}{m}\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y}) \]
For logistic regression with cross-entropy loss, \(\hat{\mathbf{y}} = \sigma(\mathbf{X}\boldsymbol{\theta})\) and
\[ \nabla J = \frac{1}{m}\mathbf{X}^T(\sigma(\mathbf{X}\boldsymbol{\theta}) - \mathbf{y}) = \frac{1}{m}\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y}) \]
Both gradients have the same form \(\frac{1}{m}\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y})\) — the feature matrix transposed, times the residual. This is the "error signal" propagated back through the linear layer. The only difference is how \(\hat{\mathbf{y}}\) is computed.
Directional Derivative & Why Negative Gradient is Optimal
This section gives the mathematical proof that the gradient direction is optimal — not just an intuition.
The Directional Derivative
The rate of change of \(J\) in an arbitrary direction \(\mathbf{u}\) (a unit vector, \(\|\mathbf{u}\| = 1\)) is the directional derivative:
\[ D_{\mathbf{u}} J(\boldsymbol{\theta}) = \nabla J(\boldsymbol{\theta}) \cdot \mathbf{u} = \|\nabla J(\boldsymbol{\theta})\|\,\|\mathbf{u}\|\cos\varphi = \|\nabla J(\boldsymbol{\theta})\|\cos\varphi \]
where \(\varphi\) is the angle between \(\nabla J\) and \(\mathbf{u}\), and \(\|\mathbf{u}\| = 1\).
This is the dot product formula — the component of the gradient in the direction \(\mathbf{u}\).
Proof: The Negative Gradient is Steepest Descent
The negative gradient \(-\nabla J(\boldsymbol{\theta})\) is the unique direction of locally steepest descent. Moving in any other direction decreases the loss more slowly (or increases it). This is not a heuristic — it follows directly from the Cauchy-Schwarz inequality applied to the dot product \(\nabla J \cdot \mathbf{u}\).
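The argument fits in two lines. For any unit vector \(\mathbf{u}\), Cauchy-Schwarz bounds the directional derivative from below:

```latex
D_{\mathbf{u}} J \;=\; \nabla J \cdot \mathbf{u}
  \;\ge\; -\,\lVert \nabla J \rVert \,\lVert \mathbf{u} \rVert
  \;=\; -\,\lVert \nabla J \rVert ,
```

with equality if and only if \(\mathbf{u}\) is anti-parallel to the gradient, i.e. \(\mathbf{u} = -\nabla J / \lVert \nabla J \rVert\). That single direction attains the most negative possible rate of change.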
Gradient Descent — The Algorithm
The Core Update Rule
Starting from an initial parameter vector \(\boldsymbol{\theta}^{(0)}\), we repeatedly move in the direction of steepest descent by a small step controlled by the learning rate \(\alpha\):
\[ \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha\,\nabla J(\boldsymbol{\theta}^{(t)}) \]
For each individual parameter \(\theta_j\):
\[ \theta_j^{(t+1)} = \theta_j^{(t)} - \alpha\,\frac{\partial J}{\partial \theta_j}\bigg|_{\boldsymbol{\theta}^{(t)}} \]
All parameters must be updated simultaneously using the gradients computed at the current parameter values. You compute all gradients first, then update all parameters. If you update \(\theta_1\) first and use the new \(\theta_1\) to compute the gradient for \(\theta_2\), you're doing something different — a form of coordinate descent, not true gradient descent.
Derivation from Taylor Expansion
Why does this update make sense? The first-order Taylor expansion of \(J\) at \(\boldsymbol{\theta}\) is:
\[ J(\boldsymbol{\theta} + \Delta\boldsymbol{\theta}) \approx J(\boldsymbol{\theta}) + \nabla J(\boldsymbol{\theta})^T \Delta\boldsymbol{\theta} \]
Substituting the update step \(\Delta\boldsymbol{\theta} = -\alpha\nabla J(\boldsymbol{\theta})\):
\[ J(\boldsymbol{\theta} - \alpha\nabla J) \approx J(\boldsymbol{\theta}) - \alpha\|\nabla J(\boldsymbol{\theta})\|^2 \]
The Taylor expansion shows that each step decreases \(J\) by approximately \(\alpha\|\nabla J\|^2\). This quantity is always non-negative — so gradient descent is a descent method, meaning the loss never increases (for sufficiently small \(\alpha\) and smooth \(J\)). Near a minimum, \(\|\nabla J\| \to 0\) and updates become tiny.
for t = 1, 2, 3, ... until convergence:
    # Compute gradient over entire dataset
    g = ∇J(θ; X_train, y_train)
    # Update parameters
    θ = θ − α · g
    # Convergence check
    if ‖g‖ < ε: break   # gradient near zero → at minimum
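The loop above can be written as a runnable sketch; the quadratic loss and step size here are illustrative choices, not from the text:

```python
# Batch gradient descent on J(theta) = 0.5 * (theta - 3)^2, whose gradient
# is g = theta - 3 and whose minimum is theta* = 3.
def gradient_descent(grad, theta0, alpha=0.1, eps=1e-8, max_iters=10_000):
    theta = theta0
    for _ in range(max_iters):
        g = grad(theta)                # gradient at the *current* parameters
        if abs(g) < eps:               # convergence check: gradient near zero
            break
        theta = theta - alpha * g      # step in the negative gradient direction
    return theta

theta_star = gradient_descent(lambda t: t - 3.0, theta0=0.0)
```

With \(\alpha = 0.1\) the distance to the minimum shrinks by a factor of 0.9 per step, so the loop terminates well before the iteration cap.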
The Learning Rate — Deep Dive
The learning rate \(\alpha\) is the single most important hyperparameter in gradient descent. It controls how far we step in the negative gradient direction each iteration.
- \(\alpha\) too large: overshoots the minimum. Oscillates or diverges. Loss increases instead of decreasing. Can jump across valleys entirely.
- \(\alpha\) well-chosen: converges smoothly and efficiently. Each step meaningfully reduces loss. Reaches the neighborhood of the minimum in reasonable time.
- \(\alpha\) too small: takes tiny steps. Converges too slowly — may take millions of iterations. Wastes compute, may appear stuck.
Mathematical Condition for Convergence
For a convex function with Lipschitz-continuous gradient (the gradient doesn't change too fast — characterized by a constant \(L\)), gradient descent is guaranteed to converge if:
\[ 0 < \alpha \leq \frac{1}{L} \]
where \(L\) is, for twice-differentiable \(J\), the maximum eigenvalue of the Hessian \(\nabla^2 J\) (the curvature).
Intuitively: \(\frac{1}{L}\) is the step size that exactly "fits" the curvature of the loss landscape. The Hessian's largest eigenvalue tells you how curved the function is — high curvature demands a smaller step.
The Quadratic Bowl — Understanding LR Geometrically
For a simple quadratic \(J(\theta) = \frac{1}{2}a\theta^2\) (a bowl with curvature \(a\)), the gradient is \(J'(\theta) = a\theta\), so the update becomes:
\[ \theta^{(t+1)} = \theta^{(t)} - \alpha a\,\theta^{(t)} = (1 - \alpha a)\,\theta^{(t)} \]
This is a geometric sequence, multiplied by \((1-\alpha a)\) each step:
\[ \theta^{(t)} = (1 - \alpha a)^t\,\theta^{(0)} \]
- If \(|1-\alpha a| < 1\) (i.e., \(\alpha < 2/a\)): sequence converges to 0 → minimum ✓
- If \(|1-\alpha a| = 1\) (i.e., \(\alpha = 2/a\)): stuck oscillating between \(+\theta_0\) and \(-\theta_0\)
- If \(|1-\alpha a| > 1\) (i.e., \(\alpha > 2/a\)): sequence diverges → ✗
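The three regimes can be checked numerically. A sketch with an assumed curvature \(a = 2\) (so the critical rate is \(2/a = 1\)):

```python
# Iterate theta <- (1 - alpha * a) * theta on the bowl J = 0.5 * a * theta^2.
def iterate(alpha, a=2.0, theta0=1.0, steps=50):
    theta = theta0
    for _ in range(steps):
        theta = (1.0 - alpha * a) * theta
    return theta

good    = iterate(alpha=0.4)   # |1 - 0.8| = 0.2 < 1: converges to 0
edge    = iterate(alpha=1.0)   # |1 - 2.0| = 1:  oscillates between +1 and -1
diverge = iterate(alpha=1.5)   # |1 - 3.0| = 2:  grows geometrically
```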
Start with \(\alpha = 0.001\) or \(\alpha = 0.01\). Monitor training loss: if it decreases smoothly → fine. If it oscillates or spikes → halve \(\alpha\). If it barely moves → try 10× larger. Modern practice: use Adam (§11) which adapts \(\alpha\) per-parameter automatically.
Batch, Stochastic & Mini-Batch GD
The three variants differ only in how many samples are used to compute the gradient per update. This creates a fundamental speed-vs-accuracy tradeoff.
Batch Gradient Descent (BGD)
Use all \(m\) training samples to compute one gradient, then do one update (writing \(J_i\) for the loss on sample \(i\)):
\[ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \frac{\alpha}{m}\sum_{i=1}^{m} \nabla J_i(\boldsymbol{\theta}) \]
- Pros: Exact gradient — smooth convergence, guaranteed descent on convex functions, stable loss curve.
- Cons: Expensive per update when \(m\) is large. Cannot be used for online learning. Slow for big datasets.
Stochastic Gradient Descent (SGD)
Use one randomly selected sample \(i\) to compute an approximate gradient, then update immediately (writing \(J_i\) for the loss on sample \(i\)):
\[ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \alpha\,\nabla J_i(\boldsymbol{\theta}) \]
- Pros: \(m\) updates per epoch instead of 1. Fast to start learning. Noise can help escape shallow local minima. Works for online/streaming data.
- Cons: High variance — loss curve is noisy. May never fully converge — oscillates around minimum. Needs careful learning rate tuning.
Mini-Batch Gradient Descent (the Standard)
Use a small batch of \(B\) samples (typically 32, 64, 128, 256) to compute each gradient estimate:
\[ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \frac{\alpha}{B}\sum_{i \in \mathcal{B}} \nabla J_i(\boldsymbol{\theta}) \]
where \(\mathcal{B}\) is a randomly sampled mini-batch of size \(B\) and \(J_i\) is the loss on sample \(i\).
- Why it works: Gradient estimate is a sample mean — by the law of large numbers, it's a good approximation of the true gradient even for small \(B\).
- GPU efficiency: Matrix operations on a batch of size 64 are barely slower than on a single sample — parallelism is exploited optimally.
- Noise benefits: Some stochasticity helps escape local minima; far less noisy than pure SGD.
| Variant | Batch Size | Updates/Epoch | Gradient Accuracy | Noise Level | Memory |
|---|---|---|---|---|---|
| Batch GD | \(m\) (all) | 1 | Exact | Zero | High |
| SGD | 1 | \(m\) | Very noisy | Very high | Minimal |
| Mini-Batch GD | 32–512 | \(m/B\) | Good approximation | Moderate | Low–Med |
One epoch = one complete pass through the entire training dataset. With mini-batch GD and batch size \(B\), one epoch consists of \(\lceil m/B \rceil\) gradient updates. Training typically runs for tens to hundreds of epochs.
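The epoch/mini-batch structure can be sketched with a toy one-parameter model (fitting \(\theta\) to the mean of the data); the data, step size, and function names are illustrative:

```python
import math
import random

def minibatch_gd(data, epochs, B, alpha, theta, grad_batch):
    """One update per mini-batch; ceil(m / B) updates per epoch."""
    for _ in range(epochs):
        random.shuffle(data)                 # fresh sample order each epoch
        for i in range(0, len(data), B):
            batch = data[i:i + B]            # the last batch may be smaller
            theta -= alpha * grad_batch(theta, batch)
    return theta

random.seed(0)
# J = 0.5 * mean((theta - x)^2)  =>  per-batch gradient is mean(theta - x)
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]   # m = 7, B = 2: ceil(7/2) = 4 updates/epoch
grad = lambda th, batch: sum(th - x for x in batch) / len(batch)
theta = minibatch_gd(data, epochs=200, B=2, alpha=0.05, theta=0.0, grad_batch=grad)
# theta hovers near the data mean (4.0), jittering with mini-batch noise
```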
The Loss Landscape — Challenges
Real loss functions (especially for deep networks) are high-dimensional non-convex surfaces full of pathological structures that naive gradient descent struggles with.
Problem 1 — Local Minima
A local minimum is a point where \(\nabla J = 0\) and all directions lead upward — but it's not the global minimum. Gradient descent can get trapped here.
Empirical research suggests that in high-dimensional networks, true local minima are rare. Most critical points are saddle points. Furthermore, many local minima have similar loss values to the global minimum — escaping them is less important than it seems. Batch noise in SGD also helps escape shallow minima.
Problem 2 — Saddle Points
A saddle point is where \(\nabla J = 0\) but it's neither a local min nor max — the Hessian has both positive and negative eigenvalues. The function curves up in some directions, down in others.
Gradient descent slows to a crawl near saddle points because the gradient is near zero in all directions — even though you're not at a minimum. In very high dimensions, saddle points are far more common than local minima (a critical point is a local min only if all eigenvalues are positive — exponentially unlikely in high dimensions).
Saddle points are the primary obstacle in deep learning optimization. Momentum (§08) and noise from mini-batching help escape them.
Problem 3 — Ravines (Ill-Conditioned Curvature)
A ravine is a region where the loss surface is much more curved in one direction than another — like a long valley. The condition number \(\kappa = \lambda_{\max}/\lambda_{\min}\) of the Hessian measures this.
In a ravine, gradient descent oscillates wildly across the narrow dimension while making slow progress along the valley floor. The gradient direction points across the valley (high curvature direction) rather than along it (toward the minimum).
Problem 4 — Plateaus and Vanishing Gradients
A plateau is a region where \(J\) is nearly flat — gradient is tiny everywhere. The optimizer makes negligible progress. This is also the mechanism of the vanishing gradient problem in deep networks: when gradients are multiplied through many layers via the chain rule, they can become exponentially small, stalling learning in early layers.
Momentum — Escaping Ravines
Momentum is inspired by physics: a ball rolling downhill accumulates velocity in the direction it's been traveling. Rather than making each step purely from the current gradient, momentum accumulates a velocity vector from past gradients.
Classical Momentum (Heavy Ball Method)
\[ \mathbf{v}^{(t)} = \beta\,\mathbf{v}^{(t-1)} - \alpha\,\nabla J(\boldsymbol{\theta}^{(t)}) \qquad \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \mathbf{v}^{(t)} \]
\(\beta \in [0,1)\) — momentum coefficient (typically 0.9), \(\mathbf{v}^{(0)} = \mathbf{0}\).
The velocity \(\mathbf{v}^{(t)}\) is an exponentially weighted average of past gradients. Expanding recursively:
\[ \mathbf{v}^{(t)} = -\alpha\sum_{k=0}^{t-1}\beta^{k}\,\mathbf{g}^{(t-k)} \]
where \(\mathbf{g}^{(t)} = \nabla J(\boldsymbol{\theta}^{(t)})\) — recent gradients count fully, older ones are discounted by powers of \(\beta\).
In a ravine: gradients in the across-valley direction alternate signs each step → they cancel in the average. Gradients along the valley floor are consistently in the same direction → they accumulate. The ball rolls along the valley rather than zigzagging across it.
In steady state, if the gradient is constant \(\mathbf{g}\), the velocity converges to \(\mathbf{v}^* = \frac{-\alpha}{1-\beta}\mathbf{g}\). With \(\beta = 0.9\), the effective learning rate is \(\frac{\alpha}{1-0.9} = 10\alpha\). Momentum amplifies the step size in consistent gradient directions by \(\frac{1}{1-\beta}\).
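The steady-state claim is easy to verify numerically — a sketch under an assumed constant gradient \(g = 1\):

```python
# Under a constant gradient g, v converges to v* = -alpha * g / (1 - beta):
# momentum amplifies the effective step by 1 / (1 - beta) = 10x for beta = 0.9.
def steady_state_velocity(g=1.0, alpha=0.01, beta=0.9, steps=500):
    v = 0.0
    for _ in range(steps):
        v = beta * v - alpha * g   # velocity = decayed accumulation of gradients
    return v

v_star = steady_state_velocity()   # converges to -0.01 / 0.1 = -0.1
```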
Nesterov Accelerated Gradient (NAG)
A smarter variant: instead of computing the gradient at the current position, compute it at the lookahead position (where momentum would take you):
\[ \mathbf{v}^{(t)} = \beta\,\mathbf{v}^{(t-1)} - \alpha\,\nabla J\!\left(\boldsymbol{\theta}^{(t)} + \beta\,\mathbf{v}^{(t-1)}\right) \qquad \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \mathbf{v}^{(t)} \]
NAG has a provably faster convergence rate: for convex functions, gradient descent (and classical momentum) converges at \(O(1/T)\), while NAG achieves the optimal \(O(1/T^2)\) rate for first-order methods. Intuition: by computing the gradient at the lookahead point, NAG "corrects" for overshooting before it happens.
AdaGrad — Adaptive Learning Rates
A fundamental problem with a single global \(\alpha\): some parameters may need large updates (rarely seen features), others need small updates (frequently updated parameters). AdaGrad adapts the learning rate per parameter based on its history of gradients.
\[ G_j^{(t)} = G_j^{(t-1)} + \left(g_j^{(t)}\right)^2 \qquad \theta_j^{(t+1)} = \theta_j^{(t)} - \frac{\alpha}{\sqrt{G_j^{(t)}} + \epsilon}\,g_j^{(t)} \]
\(\epsilon \approx 10^{-8}\) prevents division by zero; \(G_j^{(0)} = 0\).
In matrix/vector form:
\[ \mathbf{G}^{(t)} = \mathbf{G}^{(t-1)} + \mathbf{g}^{(t)} \odot \mathbf{g}^{(t)} \qquad \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \frac{\alpha}{\sqrt{\mathbf{G}^{(t)}} + \epsilon} \odot \mathbf{g}^{(t)} \]
\(\odot\) denotes the element-wise product; the division and square root are also element-wise.
Why It Works — and Where It Fails
- Large gradient history → large \(G_j\) → small effective step. Prevents overshooting frequently updated parameters.
- Small gradient history → small \(G_j\) → large effective step. Gives more learning to sparse/rarely updated parameters.
- NLP use: Great for sparse features (word embeddings) — rare words get larger updates.
\(G_j\) is a sum — it only ever increases. Over time, the effective learning rate \(\frac{\alpha}{\sqrt{G_j}}\) shrinks monotonically toward zero. In deep networks with long training, AdaGrad can grind to a halt — the learning rate becomes vanishingly small before convergence. This is what RMSProp fixes.
RMSProp — Fixing AdaGrad
RMSProp (Root Mean Square Propagation, Hinton 2012) replaces the cumulative sum with an exponentially weighted moving average of squared gradients. The "memory" decays — old gradients are forgotten, preventing the monotonic learning rate decay.
\[ \mathbf{s}^{(t)} = \rho\,\mathbf{s}^{(t-1)} + (1-\rho)\,\mathbf{g}^{(t)} \odot \mathbf{g}^{(t)} \qquad \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \frac{\alpha}{\sqrt{\mathbf{s}^{(t)}} + \epsilon} \odot \mathbf{g}^{(t)} \]
\(\rho \in (0,1)\) — decay rate (typically 0.9 or 0.99), \(\mathbf{s}^{(0)} = \mathbf{0}\).
\(\mathbf{s}^{(t)}\) is the exponentially weighted moving average of \(\mathbf{g}^2\) — effectively a recent estimate of the RMS (root mean square) of the gradient. Unlike AdaGrad's ever-growing sum, \(\mathbf{s}\) stabilizes to a recent average.
| Property | AdaGrad | RMSProp |
|---|---|---|
| Gradient history | Cumulative sum (full history) | Exponential moving average (recent) |
| Effective LR over time | Monotonically decreasing → 0 | Stabilizes (adapts to recent gradients) |
| Works for deep networks? | No — LR vanishes | Yes |
| Sparse gradients | Excellent | Good |
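The contrast in the first two table rows can be demonstrated numerically under an assumed constant gradient:

```python
# Effective per-parameter LR alpha / sqrt(accumulator) after many steps of a
# constant gradient: AdaGrad's sum grows forever, RMSProp's average levels off.
def effective_lrs(g=1.0, alpha=0.1, rho=0.9, steps=1000, eps=1e-8):
    G = s = 0.0
    for _ in range(steps):
        G = G + g * g                        # AdaGrad: cumulative sum
        s = rho * s + (1.0 - rho) * g * g    # RMSProp: exponential moving average
    return alpha / (G ** 0.5 + eps), alpha / (s ** 0.5 + eps)

ada_lr, rms_lr = effective_lrs()   # AdaGrad's LR has shrunk ~32x; RMSProp's ~ alpha
```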
Adam — The Gold Standard
Adam (Adaptive Moment Estimation, Kingma & Ba 2015) combines the ideas of momentum (first moment) and RMSProp (second moment) into a single algorithm. It is the most widely used optimizer in deep learning.
The Algorithm — Full Specification
\[ \begin{aligned} \mathbf{m}^{(t)} &= \beta_1\,\mathbf{m}^{(t-1)} + (1-\beta_1)\,\mathbf{g}^{(t)} && \text{(first moment — momentum)} \\ \mathbf{v}^{(t)} &= \beta_2\,\mathbf{v}^{(t-1)} + (1-\beta_2)\,\mathbf{g}^{(t)} \odot \mathbf{g}^{(t)} && \text{(second moment — RMSProp)} \\ \hat{\mathbf{m}}^{(t)} &= \frac{\mathbf{m}^{(t)}}{1-\beta_1^t}, \qquad \hat{\mathbf{v}}^{(t)} = \frac{\mathbf{v}^{(t)}}{1-\beta_2^t} && \text{(bias correction)} \\ \boldsymbol{\theta}^{(t+1)} &= \boldsymbol{\theta}^{(t)} - \frac{\alpha}{\sqrt{\hat{\mathbf{v}}^{(t)}} + \epsilon} \odot \hat{\mathbf{m}}^{(t)} && \text{(update)} \end{aligned} \]
Paper recommendations: \(\alpha = 0.001\), \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), \(\epsilon = 10^{-8}\). These defaults work surprisingly well across a wide variety of problems without tuning.
The Bias Correction — Why It's Necessary
Both \(\mathbf{m}\) and \(\mathbf{v}\) are initialized to zero. In the first few iterations, they're heavily biased toward zero — the exponential average hasn't "warmed up" yet. The bias correction factors \((1-\beta_1^t)\) and \((1-\beta_2^t)\) counteract this.
Note: as \(t \to \infty\), \(\beta_1^t \to 0\) and \(\beta_2^t \to 0\), so the correction factors approach 1 — bias correction only matters in early training.
Understanding Adam's Effective Step Size
Ignoring bias correction, the effective update for parameter \(j\) is:
\[ \Delta\theta_j = -\alpha\,\frac{\hat{m}_j}{\sqrt{\hat{v}_j} + \epsilon} \]
If gradient is consistent (always same sign, similar magnitude): \(\hat{m}_j \approx g_j\) and \(\hat{v}_j \approx g_j^2\), so \(\frac{\hat{m}_j}{\sqrt{\hat{v}_j}} \approx \frac{g_j}{|g_j|} = \text{sign}(g_j)\). The effective step is roughly \(\pm\alpha\) — Adam takes normalized steps with magnitude roughly equal to \(\alpha\).
If gradient is noisy (alternates sign): \(\hat{m}_j \approx 0\) (signals cancel) while \(\hat{v}_j\) remains nonzero → very small update. Adam naturally adapts to gradient uncertainty.
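A single-parameter Adam implementing the four steps above, applied to a toy quadratic loss (the target function and step size here are illustrative, not the paper defaults):

```python
import math

def adam(grad, theta, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g          # first moment (momentum)
        v = beta2 * v + (1 - beta2) * g * g      # second moment (RMSProp)
        m_hat = m / (1 - beta1 ** t)             # bias correction: undo the
        v_hat = v / (1 - beta2 ** t)             #   pull toward the zero init
        theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Minimize J(theta) = (theta - 5)^2, gradient 2 * (theta - 5).
theta = adam(lambda th: 2.0 * (th - 5.0), theta=0.0)
```

Note the normalized-step behavior: early on, \(\hat{m}/\sqrt{\hat{v}} \approx \text{sign}(g)\), so \(\theta\) advances by roughly \(\alpha\) per step regardless of the gradient's magnitude.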
Adam Variants Worth Knowing
AdamW
Adam with decoupled weight decay. Standard Adam applies L2 regularization to the gradient before adaptive scaling — this inadvertently reduces the effective regularization. AdamW applies weight decay directly to the parameters, restoring proper regularization. Default in Transformers.
Nadam
Adam with Nesterov momentum (NAG) replacing classical momentum. Uses the lookahead gradient for the first moment. Slightly better convergence on some tasks due to NAG's superior theoretical properties.
AMSGrad
Fixes a convergence issue in Adam for non-convex settings. Maintains the running maximum of past second moments instead of the moving average — \(\tilde{v}_j^{(t)} = \max\big(\tilde{v}_j^{(t-1)}, \hat{v}_j^{(t)}\big)\) — and uses \(\tilde{v}\) in the update. Ensures the effective learning rate is monotonically non-increasing per parameter.
Lion
A 2023 optimizer (Google Brain) discovered by program search. Uses only the sign of the momentum update, not its magnitude. More memory efficient (no second moment) and claims better generalization on vision/language tasks.
Learning Rate Schedules
A fixed learning rate is rarely optimal: a large \(\alpha\) is good early (fast progress), but can cause oscillation near the minimum. A small \(\alpha\) is precise but slow early. Learning rate schedules vary \(\alpha\) over time.
Step Decay
\[ \alpha_t = \alpha_0\,\gamma^{\lfloor t/k \rfloor} \]
\(\gamma \in (0,1)\), \(k\) = step interval (e.g., drop by 0.5 every 10 epochs: \(\gamma = 0.5\), \(k = 10\)).
Exponential Decay
\[ \alpha_t = \alpha_0\,e^{-\lambda t} \]
\(\lambda\) = decay rate; the LR shrinks smoothly by a constant factor per unit time.
Cosine Annealing (Very Common)
\[ \alpha_t = \alpha_{\min} + \frac{1}{2}\left(\alpha_{\max} - \alpha_{\min}\right)\left(1 + \cos\frac{\pi t}{T}\right) \]
\(T\) = total steps. Starts at \(\alpha_{\max}\), smoothly decays to \(\alpha_{\min}\).
With warm restarts (SGDR), the LR is periodically reset to \(\alpha_{\max}\) — allowing the optimizer to explore different loss landscape regions and find better minima. Standard in modern training.
Linear Warmup (Critical for Transformers)
In early training, Adam's second moment estimate is unreliable (it has seen only a few samples), and a large initial LR causes catastrophically large updates. Warmup ramps the LR linearly from 0 to \(\alpha_{\max}\) over the first \(T_{\text{warmup}}\) steps — \(\alpha_t = \alpha_{\max}\,t/T_{\text{warmup}}\) — giving the second moment estimate time to stabilize before the full LR is used. Ubiquitous in BERT, GPT, ViT training.
Cyclical Learning Rates (CLR)
Instead of monotonically decreasing, cycle \(\alpha\) between bounds \([\alpha_{\min}, \alpha_{\max}]\). Large LR phases help escape local minima and saddle points; small LR phases allow fine convergence. Empirically competitive with complex schedules for much less tuning.
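The warmup-then-cosine shape common in Transformer training can be written as a single schedule function; the step counts and LR bounds below are illustrative:

```python
import math

def warmup_cosine(t, T=1000, warmup=100, a_max=1e-3, a_min=1e-5):
    """Linear warmup to a_max over `warmup` steps, then cosine decay to a_min at T."""
    if t < warmup:
        return a_max * t / warmup                    # ramps 0 -> a_max
    progress = (t - warmup) / (T - warmup)           # in [0, 1] after warmup
    return a_min + 0.5 * (a_max - a_min) * (1.0 + math.cos(math.pi * progress))

peak = warmup_cosine(100)    # warmup just finished: LR = a_max
final = warmup_cosine(1000)  # fully annealed: LR = a_min
```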
Second-Order Methods & Newton's Method
Gradient descent uses only first-order information (the gradient). Second-order methods additionally use the Hessian \(\mathbf{H} = \nabla^2 J\) — the matrix of second derivatives — which encodes curvature information.
Newton's Method
Fit a quadratic approximation to \(J\) at the current point (second-order Taylor expansion) and jump directly to its minimum:
\[ \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \mathbf{H}^{-1}\,\nabla J(\boldsymbol{\theta}^{(t)}) \]
Newton's method has quadratic convergence near the minimum — the number of correct digits roughly doubles each iteration. Gradient descent has only linear convergence. But Newton requires computing and inverting an \(n \times n\) Hessian — \(O(n^2)\) storage and \(O(n^3)\) inversion cost. For neural networks with millions of parameters, this is completely infeasible.
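The digit-doubling behavior is visible even in one dimension. A sketch minimizing \(J(\theta) = \cosh(\theta - 2)\) — an illustrative choice with gradient \(\sinh(\theta-2)\), Hessian \(\cosh(\theta-2)\), and minimum at \(\theta^* = 2\):

```python
import math

def newton(theta, iters):
    for _ in range(iters):
        grad = math.sinh(theta - 2.0)
        hess = math.cosh(theta - 2.0)   # 1x1 "Hessian": the local curvature
        theta -= grad / hess            # n-dim analogue: H^{-1} @ grad
    return theta

# Error after k iterations, starting at theta = 3 (initial error 1.0): the
# number of correct digits grows superlinearly from step to step.
errors = [abs(newton(3.0, k) - 2.0) for k in range(5)]
```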
| Method | Information Used | Convergence | Cost per Step | Practical? |
|---|---|---|---|---|
| Gradient Descent | 1st order (∇J) | Linear \(O(1/T)\) | \(O(n)\) | Yes |
| Newton's Method | 2nd order (∇²J) | Quadratic | \(O(n^3)\) | No (large n) |
| Quasi-Newton (L-BFGS) | Approx 2nd order | Superlinear | \(O(n)\) mem | Yes (small n) |
| Adam | 1st order + diagonal 2nd | Adaptive | \(O(n)\) | Yes |
Why Adam Approximates Newton
Adam's second moment \(\mathbf{v}\) approximates the diagonal of the Hessian — the curvature along each parameter axis independently (ignoring cross terms). Dividing by \(\sqrt{\mathbf{v}}\) approximates the Hessian scaling without ever forming the full matrix. This is why Adam often converges much faster than plain SGD despite being purely first-order.
Convergence Analysis
Convex Case — Guaranteed Convergence
For a convex, \(L\)-smooth function with minimum \(J^*\), with step size \(\alpha \leq \frac{1}{L}\):
\[ J(\boldsymbol{\theta}^{(T)}) - J^* \;\leq\; \frac{\|\boldsymbol{\theta}^{(0)} - \boldsymbol{\theta}^*\|^2}{2\alpha T} \]
This is \(O(1/T)\) convergence — to get \(\epsilon\)-accuracy requires \(O(1/\epsilon)\) iterations.
Strongly Convex Case — Exponential Convergence
If \(J\) is additionally \(\mu\)-strongly convex (the Hessian's minimum eigenvalue is \(\geq \mu > 0\)), gradient descent converges linearly (geometrically fast):
\[ J(\boldsymbol{\theta}^{(t)}) - J^* \;\leq\; \left(1 - \frac{\mu}{L}\right)^{t}\left(J(\boldsymbol{\theta}^{(0)}) - J^*\right) \]
The factor \(\kappa = L/\mu\) is the condition number. When \(\kappa\) is large (ill-conditioned — ravine), the contraction factor \((1-1/\kappa)\) is close to 1 → slow convergence. This is the formal reason ravines are problematic.
Convergence Criteria in Practice
- Gradient norm: \(\|\nabla J(\boldsymbol{\theta}^{(t)})\|_2 < \epsilon\) — the gradient is near zero. Standard and mathematically principled.
- Parameter change: \(\|\boldsymbol{\theta}^{(t+1)} - \boldsymbol{\theta}^{(t)}\|_2 < \epsilon\) — parameters stopped moving significantly.
- Loss change: \(|J^{(t+1)} - J^{(t)}| < \epsilon\) — loss has plateaued.
- Validation loss: Monitor held-out loss — stop when it starts increasing (early stopping). Most common in practice.
GD for Linear & Logistic Regression — Concrete Examples
Linear Regression — Full GD Update
Model: \(\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b\). Loss: MSE \(= \frac{1}{2m}\|\mathbf{X}\mathbf{w} + b - \mathbf{y}\|^2\). Gradients and update:
\[ \nabla_{\mathbf{w}} J = \frac{1}{m}\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y}), \qquad \frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i) \]
\[ \mathbf{w} \leftarrow \mathbf{w} - \alpha\,\nabla_{\mathbf{w}} J, \qquad b \leftarrow b - \alpha\,\frac{\partial J}{\partial b} \]
Logistic Regression — Full GD Update
Model: \(\hat{\mathbf{y}} = \sigma(\mathbf{X}\mathbf{w} + b)\). Loss: cross-entropy. As derived in the logistic regression masterclass, the gradients are identical in form:
\[ \nabla_{\mathbf{w}} J = \frac{1}{m}\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y}), \qquad \frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i) \]
The gradient update rule for logistic and linear regression has the exact same vectorized form — \(\frac{1}{m}\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y})\). The only difference is in how \(\hat{\mathbf{y}}\) is computed (identity vs sigmoid). This is why gradient descent code for both models looks nearly identical — and why the pattern generalizes to neural networks.
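The shared update can be demonstrated with one fit function where only the prediction differs; the toy datasets and hyperparameters are illustrative:

```python
import math

def fit(xs, ys, predict, alpha=0.1, epochs=2000):
    """GD with the shared update (1/m) * X^T (y_hat - y); one feature plus bias."""
    w = b = 0.0
    m = len(xs)
    for _ in range(epochs):
        # Compute all residuals first, then update w and b simultaneously.
        resid = [predict(w * x + b) - y for x, y in zip(xs, ys)]   # y_hat - y
        w -= alpha * sum(r * x for r, x in zip(resid, xs)) / m
        b -= alpha * sum(resid) / m
    return w, b

identity = lambda z: z                              # linear regression
sigmoid  = lambda z: 1.0 / (1.0 + math.exp(-z))     # logistic regression

w_lin, b_lin = fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0], identity)  # y = 2x + 1
w_log, b_log = fit([-2.0, -1.0, 1.0, 2.0], [0.0, 0.0, 1.0, 1.0], sigmoid)
```

Swapping `identity` for `sigmoid` is the entire difference between the two models — the update loop is untouched, which is exactly the point made above.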
Why GD Instead of the Normal Equation?
| | Normal Equation | Gradient Descent |
|---|---|---|
| Solution quality | Exact global optimum | Approximate (converges to optimum) |
| Cost (\(n\) = #features, \(m\) = #samples) | \(O(n^3)\) matrix inversion (plus \(O(mn^2)\) to form \(\mathbf{X}^T\mathbf{X}\)) | \(O(mn)\) per iteration |
| Works when n is large? | No — \(n > 10^4\) becomes infeasible | Yes — scales to billions of params |
| Works for non-convex loss? | No | Yes (no guarantee of global min) |
| Hyperparameter tuning | None | Learning rate must be chosen |
The Full Mental Model — Everything Together
The complete unified narrative:
- The negative gradient \(-\nabla J\) is the unique direction of steepest descent — proven by the Cauchy-Schwarz inequality applied to the directional derivative.
- Gradient descent iterates \(\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \alpha\nabla J\) — valid because the first-order Taylor expansion predicts a decrease of approximately \(\alpha\|\nabla J\|^2\) per step.
- The learning rate \(\alpha\) controls step size — bounded above by \(1/L\) for convergence in the convex case.
- Mini-batch GD trades gradient accuracy for speed — GPU parallelism makes this the default.
- Real loss surfaces have saddle points, ravines, and plateaus — naive GD struggles with all three.
- Momentum accumulates past gradients, suppressing oscillation across ravines and accelerating along consistent directions.
- AdaGrad adapts per-parameter learning rates using accumulated squared gradients. RMSProp fixes AdaGrad's vanishing LR with an exponential moving average.
- Adam = Momentum + RMSProp + bias correction. The effective step is roughly \(\alpha \cdot \text{sign}(\text{mean gradient})\) — normalized, adaptive, momentum-smoothed. Default for deep learning.
- LR schedules (especially warmup + cosine decay) significantly improve final performance — the optimizer and schedule interact critically.
- Adam approximates Newton's method diagonally — second moments estimate the curvature per parameter, scaling steps by inverse curvature without forming the full Hessian.
Every weight update in every model you've ever trained — was computed by some variant of this algorithm. Gradient descent is not a detail of ML training: it is ML training. Backpropagation simply computes \(\nabla J\); gradient descent decides what to do with it.