// Machine Learning — Optimization Theory

Gradient Descent

Every weight update in every neural network you'll ever train flows through this. Theory, geometry, math, all variants, all optimizers — from first principles.

Topics: Directional Derivative · Steepest Descent · Learning Rate · Momentum · AdaGrad · RMSProp · Adam · Convergence · Saddle Points · LR Schedules
// Table of Contents
  1. What is Optimization? The Landscape
  2. The Gradient — Mathematical Foundation
  3. Directional Derivative & Steepest Descent
  4. Gradient Descent — The Algorithm
  5. The Learning Rate — Deep Dive
  6. Batch, Stochastic & Mini-Batch GD
  7. The Loss Landscape — Challenges
  8. Momentum — Escaping Ravines
  9. AdaGrad — Adaptive Learning Rates
  10. RMSProp — Fixing AdaGrad
  11. Adam — The Gold Standard
  12. Learning Rate Schedules
  13. Second-Order Methods & Newton's Method
  14. Convergence Analysis
  15. GD for Linear & Logistic Regression
  16. The Full Mental Model
§01

What is Optimization? The Landscape

Every machine learning model has a loss function \( J(\boldsymbol{\theta}) \) — a scalar that measures how wrong the model is. Training a model is entirely a problem of minimizing this function over the parameter space \( \boldsymbol{\theta} \in \mathbb{R}^n \).

Optimization Objective \[ \boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \; J(\boldsymbol{\theta}) \]

Think of \( J(\boldsymbol{\theta}) \) as defining a surface in \( (n+1) \)-dimensional space — a "loss landscape." Each point in \( \mathbb{R}^n \) is a configuration of weights; the height at each point is the loss. Training = finding the lowest point on this landscape.

Why Can't We Just Solve It Directly?

For some models (linear regression), there's a closed-form solution: set the gradient to zero and solve algebraically. But for almost everything else:

  • Neural networks: \(n\) can be billions of parameters. The normal equation \(\boldsymbol{\theta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\) requires inverting an \(n \times n\) matrix — \(O(n^3)\) cost, completely infeasible.
  • Non-convex losses: Setting \(\nabla J = 0\) gives a system of nonlinear equations with no closed form — think sigmoid, ReLU composed many times.
  • Online/streaming data: We may not have all data at once, so a one-shot analytical solution is impossible — the model must be updated incrementally as data arrives.
✓ The Solution: Iterative Optimization

Instead of jumping to the answer, we walk toward it — taking small steps in the direction that decreases the loss fastest. That direction is the negative gradient. This is gradient descent.
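A minimal sketch of this idea in plain Python — the toy loss \(J(\theta) = (\theta - 3)^2\), starting point, and step count are ours, chosen for illustration:

```python
# Gradient descent on J(θ) = (θ − 3)², whose unique minimum is θ* = 3.
def grad_descent(theta0, alpha, steps):
    theta = theta0
    for _ in range(steps):
        grad = 2.0 * (theta - 3.0)    # dJ/dθ
        theta -= alpha * grad         # step against the gradient
    return theta

theta_star = grad_descent(theta0=0.0, alpha=0.1, steps=100)   # ≈ 3.0
```

Each iteration shrinks the distance to the minimum by a constant factor — the rest of this masterclass is about what "small steps" means precisely and how to take better ones.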

§02

The Gradient — Mathematical Foundation

From Derivative to Gradient

For a scalar function of one variable \(f(x)\), the derivative \(f'(x)\) tells us the slope — the rate of change. When the function takes a vector input \( \boldsymbol{\theta} \in \mathbb{R}^n \), we generalize to partial derivatives.

The gradient \(\nabla J(\boldsymbol{\theta})\) is the vector of all partial derivatives:

Gradient Definition \[ \nabla J(\boldsymbol{\theta}) = \begin{bmatrix} \dfrac{\partial J}{\partial \theta_1} \\ \dfrac{\partial J}{\partial \theta_2} \\ \vdots \\ \dfrac{\partial J}{\partial \theta_n} \end{bmatrix} \in \mathbb{R}^n \]

Each component \(\frac{\partial J}{\partial \theta_j}\) tells you: "how much does the loss change if I nudge \(\theta_j\) slightly, holding all other parameters fixed?"

What the Gradient Tells You — Geometrically

  • Direction: \(\nabla J(\boldsymbol{\theta})\) points in the direction of steepest increase of \(J\) at \(\boldsymbol{\theta}\). Therefore, \(-\nabla J\) points toward steepest decrease.
  • Magnitude: \(\|\nabla J(\boldsymbol{\theta})\|\) tells you how steep the slope is. A large gradient means the surface is steep; near a minimum it approaches zero.
  • Orthogonality: The gradient is always perpendicular (orthogonal) to the level curves (contours) of the function — just as a mountain's steepest path is always perpendicular to elevation contours.
  • Local information: The gradient is a local property — it only describes the surface at a single point. This is why gradient descent must iterate.

Gradient of Common Loss Functions

MSE Loss (Linear Regression) \[ J(\mathbf{w}) = \frac{1}{2m}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2 \quad\Rightarrow\quad \nabla_{\mathbf{w}} J = \frac{1}{m}\mathbf{X}^T(\mathbf{X}\mathbf{w} - \mathbf{y}) \]
Cross-Entropy Loss (Logistic Regression) \[ J(\mathbf{w}) = -\frac{1}{m}\sum_i\left[y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right] \quad\Rightarrow\quad \nabla_{\mathbf{w}} J = \frac{1}{m}\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y}) \]
💡 Notice

Both gradients have the form \(\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y})\) — the features matrix transposed, times the residual. This is the "error signal" propagated back through the linear layer. The difference is only in how \(\hat{\mathbf{y}}\) is computed.
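A sanity check worth internalizing: the analytic MSE gradient \(\frac{1}{m}\mathbf{X}^T(\mathbf{X}\mathbf{w} - \mathbf{y})\) can always be verified against central finite differences. A pure-Python sketch on toy data (all names and numbers are ours):

```python
# Check the MSE gradient formula (1/m)·Xᵀ(Xw − y) against central finite differences.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y = [1.0, 2.0, 3.0]
w = [0.1, -0.2]
m, n = len(X), len(w)

def loss(w):
    return sum((sum(X[i][j] * w[j] for j in range(n)) - y[i]) ** 2
               for i in range(m)) / (2 * m)

# Analytic gradient: (1/m)·Xᵀ·residual
residual = [sum(X[i][j] * w[j] for j in range(n)) - y[i] for i in range(m)]
analytic = [sum(X[i][j] * residual[i] for i in range(m)) / m for j in range(n)]

# Central differences: (J(w + h·e_j) − J(w − h·e_j)) / 2h
h = 1e-6
numeric = []
for j in range(n):
    wp, wm = list(w), list(w)
    wp[j] += h
    wm[j] -= h
    numeric.append((loss(wp) - loss(wm)) / (2 * h))
# analytic and numeric agree to ~6 decimal places
```

The same "gradient check" technique is the standard way to debug hand-written backpropagation.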

§03

Directional Derivative & Why Negative Gradient is Optimal

This section gives the mathematical proof that the gradient direction is optimal — not just an intuition.

The Directional Derivative

The rate of change of \(J\) in an arbitrary direction \(\mathbf{u}\) (a unit vector, \(\|\mathbf{u}\| = 1\)) is the directional derivative:

Directional Derivative \[ D_{\mathbf{u}} J(\boldsymbol{\theta}) = \nabla J(\boldsymbol{\theta}) \cdot \mathbf{u} = \|\nabla J\| \|\mathbf{u}\| \cos\phi = \|\nabla J\| \cos\phi \]

where φ is the angle between ∇J and u, and ‖u‖ = 1

This is the dot product formula — the component of the gradient in the direction \(\mathbf{u}\).

Proof: The Negative Gradient is Steepest Descent

// Why −∇J is the unique direction of maximum decrease
1. We want the unit vector \(\mathbf{u}\) that minimizes \(D_{\mathbf{u}} J = \|\nabla J\|\cos\phi\).
2. \(\|\nabla J\|\) is fixed at the current point, so we minimize over the angle \(\phi\) between \(\mathbf{u}\) and \(\nabla J\).
3. \(\cos\phi\) is minimized when \(\phi = 180°\), i.e., when \(\mathbf{u}\) points exactly opposite to \(\nabla J\). At \(\phi = 180°\): \(\cos\phi = -1\), giving \(D_{\mathbf{u}} J = -\|\nabla J\|\) — the most negative possible rate of change.
4. Therefore \(\mathbf{u}^* = -\dfrac{\nabla J(\boldsymbol{\theta})}{\|\nabla J(\boldsymbol{\theta})\|}\) — the unit vector in the direction of steepest descent is the normalized negative gradient. ∎
✓ The Fundamental Theorem of Gradient Descent

The negative gradient \(-\nabla J(\boldsymbol{\theta})\) is the unique direction of locally steepest descent. Moving in any other direction decreases the loss more slowly (or increases it). This is not heuristic — it follows directly from the Cauchy-Schwarz inequality applied to the dot product.
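The proof above can be checked empirically: sample many random unit directions and confirm none beats the normalized negative gradient. A small sketch with a toy function of our choosing, \(J(\theta_1, \theta_2) = \theta_1^2 + 3\theta_2^2\) at \(\boldsymbol{\theta} = (1, 2)\):

```python
import math, random

# Compare the directional derivative of −∇J/‖∇J‖ against random unit vectors.
grad = (2 * 1.0, 6 * 2.0)                    # ∇J(1,2) = (2, 12)
norm = math.hypot(*grad)
u_star = (-grad[0] / norm, -grad[1] / norm)  # candidate steepest-descent direction
d_star = grad[0] * u_star[0] + grad[1] * u_star[1]   # D_{u*} J = −‖∇J‖

random.seed(0)
for _ in range(10_000):
    phi = random.uniform(0.0, 2.0 * math.pi)
    u = (math.cos(phi), math.sin(phi))       # random unit vector
    d_u = grad[0] * u[0] + grad[1] * u[1]
    assert d_u >= d_star - 1e-12             # nothing descends faster than −∇J/‖∇J‖
```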

Fig 1. Contour plot of loss function. ∇J points outward (steepest ascent), perpendicular to contours. −∇J points inward (steepest descent) toward the minimum.
§04

Gradient Descent — The Algorithm

The Core Update Rule

Starting from an initial parameter vector \(\boldsymbol{\theta}^{(0)}\), we repeatedly move in the direction of steepest descent by a small step controlled by the learning rate \(\alpha\):

Gradient Descent Update \[ \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha \nabla J(\boldsymbol{\theta}^{(t)}) \]

For each individual parameter \(\theta_j\):

Component-wise Update \[ \theta_j \leftarrow \theta_j - \alpha \frac{\partial J}{\partial \theta_j}, \quad \forall j = 1, \ldots, n \]
⚠ Simultaneous Update Rule

All parameters must be updated simultaneously using the gradients computed at the current parameter values. You compute all gradients first, then update all parameters. If you update \(\theta_1\) first and use the new \(\theta_1\) to compute the gradient for \(\theta_2\), you're doing something different — a form of coordinate descent, not true gradient descent.

Derivation from Taylor Expansion

Why does this update make sense? The first-order Taylor expansion of \(J\) at \(\boldsymbol{\theta}\) is:

// Motivating the update rule via Taylor series
1. Taylor expansion: \(\displaystyle J(\boldsymbol{\theta} + \boldsymbol{\delta}) \approx J(\boldsymbol{\theta}) + \nabla J(\boldsymbol{\theta})^T \boldsymbol{\delta}\). For a small perturbation \(\boldsymbol{\delta}\), the change in loss is approximately \(\nabla J \cdot \boldsymbol{\delta}\).
2. We want the \(\boldsymbol{\delta}\) that most decreases \(J\). Choose \(\boldsymbol{\delta} = -\alpha \nabla J(\boldsymbol{\theta})\).
3. Then \(\displaystyle J(\boldsymbol{\theta} - \alpha\nabla J) \approx J(\boldsymbol{\theta}) - \alpha\|\nabla J\|^2\), since \(\nabla J^T(-\alpha\nabla J) = -\alpha\|\nabla J\|^2 < 0\) — the loss is guaranteed to decrease (for small enough \(\alpha\)).
4. \(\Rightarrow \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha\nabla J(\boldsymbol{\theta}^{(t)}) \qquad \blacksquare\) The update is valid as long as \(\alpha\) is small enough for the linear approximation to hold.
💡 The decrease guarantee

The Taylor expansion shows that each step decreases \(J\) by approximately \(\alpha\|\nabla J\|^2\). This is always non-negative — so gradient descent is a descent method, meaning the loss never increases (for sufficiently small \(\alpha\) and smooth \(J\)). Near a minimum \(\|\nabla J\| \to 0\) and updates become tiny.

// Batch Gradient Descent — General Form
initialize: θ ∈ ℝⁿ (e.g., zeros or small random), choose α
for t = 1, 2, 3, ... until convergence:
  # Compute gradient over entire dataset
  g = ∇J(θ; X_train, y_train)

  # Update parameters
  θ = θ − α · g

  # Convergence check
  if ‖g‖ < ε: break   # gradient near zero → at minimum
§05

The Learning Rate — Deep Dive

The learning rate \(\alpha\) is the single most important hyperparameter in gradient descent. It controls how far we step in the negative gradient direction each iteration.

α too large

Overshoots the minimum. Oscillates or diverges. Loss increases instead of decreasing. Can jump across valleys entirely.

α just right

Converges smoothly and efficiently. Each step meaningfully reduces loss. Reaches neighborhood of minimum in reasonable time.

α too small

Takes infinitely small steps. Converges too slowly — may take millions of iterations. Wastes compute, may appear stuck.

Mathematical Condition for Convergence

For a convex function with Lipschitz-continuous gradient (the gradient doesn't change too fast — characterized by a constant \(L\)), gradient descent is guaranteed to converge if:

Convergence Condition (Lipschitz) \[ 0 < \alpha \leq \frac{1}{L} \]

where L = max eigenvalue of the Hessian ∇²J (the curvature)

Intuitively: \(\frac{1}{L}\) is the step size that exactly "fits" the curvature of the loss landscape. The Hessian's largest eigenvalue tells you how curved the function is — high curvature demands a smaller step.

The Quadratic Bowl — Understanding LR Geometrically

For a simple quadratic \(J(\theta) = \frac{1}{2}a\theta^2\) (a bowl with curvature \(a\)):

\[ \theta^{(t+1)} = \theta^{(t)} - \alpha \cdot a\theta^{(t)} = \theta^{(t)}(1 - \alpha a) \]

This is a geometric sequence multiplied by \((1-\alpha a)\) each step:

  • If \(|1-\alpha a| < 1\) (i.e., \(\alpha < 2/a\)): sequence converges to 0 → minimum ✓
  • If \(|1-\alpha a| = 1\) (i.e., \(\alpha = 2/a\)): stuck oscillating between \(+\theta_0\) and \(-\theta_0\)
  • If \(|1-\alpha a| > 1\) (i.e., \(\alpha > 2/a\)): sequence diverges → ✗
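The three regimes can be probed directly by iterating \(\theta \leftarrow \theta(1 - \alpha a)\) — a quick sketch with \(a = 4\) (so the critical step size is \(2/a = 0.5\)):

```python
# θ ← θ(1 − αa) for J(θ) = ½aθ²: probe the three regimes around α = 2/a.
def run(alpha, a=4.0, theta0=1.0, steps=50):
    theta = theta0
    for _ in range(steps):
        theta *= (1.0 - alpha * a)
    return abs(theta)

assert run(alpha=0.4) < 1e-6                  # α < 2/a = 0.5 → converges
assert abs(run(alpha=0.5) - 1.0) < 1e-12      # α = 2/a → oscillates, |θ| stuck at θ₀
assert run(alpha=0.6) > 1e3                   # α > 2/a → diverges
```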
⚡ Practical Rule

Start with \(\alpha = 0.001\) or \(\alpha = 0.01\). Monitor training loss: if it decreases smoothly → fine. If it oscillates or spikes → halve \(\alpha\). If it barely moves → try 10× larger. Modern practice: use Adam (§11) which adapts \(\alpha\) per-parameter automatically.

§06

Batch, Stochastic & Mini-Batch GD

The three variants differ only in how many samples are used to compute the gradient per update. This creates a fundamental speed-vs-accuracy tradeoff.

Batch Gradient Descent (BGD)

Use all \(m\) training samples to compute one gradient, then do one update.

Batch GD Gradient \[ \nabla J(\boldsymbol{\theta}) = \frac{1}{m} \sum_{i=1}^{m} \nabla_{\boldsymbol{\theta}} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) \]
  • Pros: Exact gradient — smooth convergence, guaranteed descent on convex functions, stable loss curve.
  • Cons: Expensive per update when \(m\) is large. Cannot be used for online learning. Slow for big datasets.

Stochastic Gradient Descent (SGD)

Use one randomly selected sample to compute an approximate gradient, then update immediately.

SGD Gradient (one sample) \[ \nabla J(\boldsymbol{\theta}) \approx \nabla_{\boldsymbol{\theta}} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) \quad \text{for random } i \]
  • Pros: \(m\) updates per epoch instead of 1. Fast to start learning. Noise can help escape shallow local minima. Works for online/streaming data.
  • Cons: High variance — loss curve is noisy. May never fully converge — oscillates around minimum. Needs careful learning rate tuning.

Mini-Batch Gradient Descent (the Standard)

Use a small batch of \(B\) samples (typically 32, 64, 128, 256) to compute each gradient estimate.

Mini-Batch Gradient \[ \nabla J(\boldsymbol{\theta}) \approx \frac{1}{B} \sum_{i \in \mathcal{B}} \nabla_{\boldsymbol{\theta}} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) \]

where 𝓑 is a randomly sampled mini-batch of size B

  • Why it works: Gradient estimate is a sample mean — by the law of large numbers, it's a good approximation of the true gradient even for small \(B\).
  • GPU efficiency: Matrix operations on a batch of size 64 are barely slower than on a single sample — parallelism is exploited optimally.
  • Noise benefits: Some stochasticity helps escape local minima; far less noisy than pure SGD.
Variant | Batch Size | Updates/Epoch | Gradient Accuracy | Noise Level | Memory
Batch GD | \(m\) (all) | 1 | Exact | Zero | High
SGD | 1 | \(m\) | Very noisy | Very high | Minimal
Mini-Batch GD | 32–512 | \(m/B\) | Good approximation | Moderate | Low–Med
💡 One Epoch

One epoch = one complete pass through the entire training dataset. With mini-batch GD and batch size \(B\), one epoch consists of \(\lceil m/B \rceil\) gradient updates. Training typically runs for tens to hundreds of epochs.
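The epoch bookkeeping looks like this in practice — shuffle once per epoch, then slice into batches (the sizes here are arbitrary):

```python
import math, random

# One epoch of mini-batch iteration: shuffle, then slice into batches of size B.
m, B = 100, 32
random.seed(0)
indices = list(range(m))
random.shuffle(indices)
batches = [indices[i:i + B] for i in range(0, m, B)]

assert len(batches) == math.ceil(m / B)                          # ⌈100/32⌉ = 4 updates/epoch
assert sorted(i for b in batches for i in b) == list(range(m))   # each sample used exactly once
```

Note the final batch is smaller (4 samples here) — frameworks either keep it or drop it ("drop last").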

§07

The Loss Landscape — Challenges

Real loss functions (especially for deep networks) are high-dimensional non-convex surfaces full of pathological structures that naive gradient descent struggles with.

Problem 1 — Local Minima

A local minimum is a point where \(\nabla J = 0\) and all directions lead upward — but it's not the global minimum. Gradient descent can get trapped here.

Local vs Global Minimum \[ \text{Local minimum: } \nabla J(\boldsymbol{\theta}^*) = 0, \; \nabla^2 J(\boldsymbol{\theta}^*) \succ 0 \; (\text{positive definite Hessian}) \] \[ \text{Global minimum: } J(\boldsymbol{\theta}^*) \leq J(\boldsymbol{\theta}) \; \forall \boldsymbol{\theta} \]
✓ The Good News for Deep Learning

Empirical research suggests that in high-dimensional networks, true local minima are rare. Most critical points are saddle points. Furthermore, many local minima have similar loss values to the global minimum — escaping them is less important than it seems. Batch noise in SGD also helps escape shallow minima.

Problem 2 — Saddle Points

A saddle point is where \(\nabla J = 0\) but it's neither a local min nor max — the Hessian has both positive and negative eigenvalues. The function curves up in some directions, down in others.

Saddle Point Condition \[ \nabla J(\boldsymbol{\theta}^*) = 0, \quad \nabla^2 J(\boldsymbol{\theta}^*) \text{ is indefinite} \quad (\text{has both } + \text{ and } - \text{ eigenvalues}) \]

Gradient descent slows to a crawl near saddle points because the gradient is near zero in all directions — even though you're not at a minimum. In very high dimensions, saddle points are far more common than local minima (a critical point is a local min only if all eigenvalues are positive — exponentially unlikely in high dimensions).

Saddle points are the primary obstacle in deep learning optimization. Momentum (§08) and noise from mini-batching help escape them.

Problem 3 — Ravines (Ill-Conditioned Curvature)

A ravine is a region where the loss surface is much more curved in one direction than another — like a long valley. The condition number \(\kappa = \lambda_{\max}/\lambda_{\min}\) of the Hessian measures this.

In a ravine, gradient descent oscillates wildly across the narrow dimension while making slow progress along the valley floor. The gradient direction points across the valley (high curvature direction) rather than along it (toward the minimum).

Fig 2. Ravine (ill-conditioned landscape). Vanilla GD zigzags across the valley. Momentum smooths the path and travels along the valley floor.

Problem 4 — Plateaus and Vanishing Gradients

A plateau is a region where \(J\) is nearly flat — gradient is tiny everywhere. The optimizer makes negligible progress. This is also the mechanism of the vanishing gradient problem in deep networks: when gradients are multiplied through many layers via the chain rule, they can become exponentially small, stalling learning in early layers.

§08

Momentum — Escaping Ravines

Momentum is inspired by physics: a ball rolling downhill accumulates velocity in the direction it's been traveling. Rather than making each step purely from the current gradient, momentum accumulates a velocity vector from past gradients.

Classical Momentum (Heavy Ball Method)

Momentum Update \[ \mathbf{v}^{(t)} = \beta \mathbf{v}^{(t-1)} - \alpha \nabla J(\boldsymbol{\theta}^{(t)}) \] \[ \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \mathbf{v}^{(t)} \]

β ∈ [0,1) — momentum coefficient (typically 0.9). v⁽⁰⁾ = 0

The velocity \(\mathbf{v}^{(t)}\) is an exponentially weighted average of past gradients. Expanding recursively:

// Expanding v to see what it really computes
1. \(\mathbf{v}^{(t)} = -\alpha\nabla J^{(t)} + \beta(-\alpha\nabla J^{(t-1)} + \beta(\ldots))\)
2. \(\mathbf{v}^{(t)} = -\alpha\sum_{k=0}^{t} \beta^{t-k} \nabla J^{(k)}\) — a weighted sum of all past gradients, with exponentially decaying weights \(\beta^{t-k}\).

In a ravine: gradients in the across-valley direction alternate signs each step → they cancel in the average. Gradients along the valley floor are consistently in the same direction → they accumulate. The ball rolls along the valley rather than zigzagging across it.

💡 Effective Step Size

In steady state, if the gradient is constant \(\mathbf{g}\), the velocity converges to \(\mathbf{v}^* = \frac{-\alpha}{1-\beta}\mathbf{g}\). With \(\beta = 0.9\), the effective learning rate is \(\frac{\alpha}{1-0.9} = 10\alpha\). Momentum amplifies the step size in consistent gradient directions by \(\frac{1}{1-\beta}\).
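The ravine behavior of Fig 2 can be reproduced in a few lines. A sketch on the ill-conditioned quadratic \(J = \frac{1}{2}(x^2 + 100y^2)\) — the starting point, \(\alpha\) (kept just under the vanilla-GD stability limit \(2/L = 0.02\)), and step count are ours:

```python
import math

# Vanilla GD vs classical momentum on a ravine with condition number 100.
def grad(x, y):
    return x, 100.0 * y            # ∇J for J = ½(x² + 100y²)

def vanilla(x, y, alpha, steps):
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - alpha * gx, y - alpha * gy
    return x, y

def momentum(x, y, alpha, beta, steps):
    vx = vy = 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx, vy = beta * vx - alpha * gx, beta * vy - alpha * gy   # velocity update
        x, y = x + vx, y + vy
    return x, y

xv, yv = vanilla(10.0, 1.0, 0.018, 200)
xm, ym = momentum(10.0, 1.0, 0.018, 0.9, 200)
dist_v = math.hypot(xv, yv)        # vanilla: still far along the valley floor
dist_m = math.hypot(xm, ym)        # momentum: much closer to the minimum
```

After 200 steps, momentum's distance to the minimum is orders of magnitude smaller — the accumulated velocity carries it along the flat direction that vanilla GD crawls down.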

Nesterov Accelerated Gradient (NAG)

A smarter variant: instead of computing the gradient at the current position, compute it at the lookahead position (where momentum would take you):

Nesterov Momentum \[ \tilde{\boldsymbol{\theta}} = \boldsymbol{\theta}^{(t)} + \beta \mathbf{v}^{(t-1)} \quad \text{(lookahead position)} \] \[ \mathbf{v}^{(t)} = \beta\mathbf{v}^{(t-1)} - \alpha \nabla J(\tilde{\boldsymbol{\theta}}) \] \[ \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \mathbf{v}^{(t)} \]

NAG has a provably faster convergence rate: for convex functions, momentum achieves \(O(1/T)\) convergence while NAG achieves the optimal \(O(1/T^2)\) rate. Intuition: by computing the gradient at the lookahead point, NAG "corrects" for overshooting before it happens.

§09

AdaGrad — Adaptive Learning Rates

A fundamental problem with a single global \(\alpha\): some parameters may need large updates (rarely seen features), others need small updates (frequently updated parameters). AdaGrad adapts the learning rate per parameter based on its history of gradients.

AdaGrad Update \[ G_j^{(t)} = G_j^{(t-1)} + \left(\frac{\partial J}{\partial \theta_j}\right)^2 \quad \text{(accumulated squared gradients)} \] \[ \theta_j^{(t+1)} = \theta_j^{(t)} - \frac{\alpha}{\sqrt{G_j^{(t)} + \epsilon}} \cdot \frac{\partial J}{\partial \theta_j} \]

ε ≈ 1e-8 prevents division by zero. G_j⁽⁰⁾ = 0

In matrix/vector form:

\[ \mathbf{G}^{(t)} = \mathbf{G}^{(t-1)} + \mathbf{g}^{(t)} \odot \mathbf{g}^{(t)} \] \[ \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \frac{\alpha}{\sqrt{\mathbf{G}^{(t)}} + \epsilon} \odot \mathbf{g}^{(t)} \]

⊙ = element-wise product. Division and sqrt are element-wise.

Why It Works — and Where It Fails

  • Large gradient history → large \(G_j\) → small effective step. Prevents overshooting frequently updated parameters.
  • Small gradient history → small \(G_j\) → large effective step. Gives more learning to sparse/rarely updated parameters.
  • NLP use: Great for sparse features (word embeddings) — rare words get larger updates.
⚠ The Fatal Flaw of AdaGrad

\(G_j\) is a sum — it only ever increases. Over time, the effective learning rate \(\frac{\alpha}{\sqrt{G_j}}\) shrinks monotonically toward zero. In deep networks with long training, AdaGrad can grind to a halt — the learning rate becomes vanishingly small before convergence. This is what RMSProp fixes.
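The decay is easy to see with a one-parameter trace under a steady gradient (a toy setup of our own):

```python
import math

# AdaGrad's accumulator G only grows, so with a steady gradient the effective
# step α/√(G + ε) shrinks on every single iteration.
alpha, eps, G = 0.1, 1e-8, 0.0
eff = []
for t in range(1000):
    g = 1.0                                        # pretend the gradient stays at 1
    G += g * g
    eff.append(alpha / math.sqrt(G + eps))

assert all(b < a for a, b in zip(eff, eff[1:]))    # strictly decreasing
assert eff[-1] < eff[0] / 10                       # collapsed by >10× after 1000 steps
```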

§10

RMSProp — Fixing AdaGrad

RMSProp (Root Mean Square Propagation, Hinton 2012) replaces the cumulative sum with an exponentially weighted moving average of squared gradients. The "memory" decays — old gradients are forgotten, preventing the monotonic learning rate decay.

RMSProp Update \[ \mathbf{s}^{(t)} = \rho \mathbf{s}^{(t-1)} + (1-\rho)\, \mathbf{g}^{(t)} \odot \mathbf{g}^{(t)} \] \[ \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \frac{\alpha}{\sqrt{\mathbf{s}^{(t)}} + \epsilon} \odot \mathbf{g}^{(t)} \]

ρ ∈ (0,1) — decay rate (typically 0.9 or 0.99). s⁽⁰⁾ = 0

\(\mathbf{s}^{(t)}\) is the exponentially weighted moving average of \(\mathbf{g}^2\) — effectively a recent estimate of the RMS (root mean square) of the gradient. Unlike AdaGrad's ever-growing sum, \(\mathbf{s}\) stabilizes to a recent average.
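Contrast this with AdaGrad's trace above: under the same steady gradient, RMSProp's statistic settles instead of growing. A one-parameter sketch (values are ours):

```python
import math

# RMSProp's moving average s converges to g² under a steady gradient, so the
# effective step stabilizes near α/|g| instead of decaying to zero.
alpha, rho, eps = 0.01, 0.9, 1e-8
s = 0.0
for _ in range(1000):
    g = 2.0                                   # constant gradient
    s = rho * s + (1 - rho) * g * g

assert abs(s - 4.0) < 1e-9                    # s → g² = 4
step = alpha / (math.sqrt(s) + eps)
assert abs(step - alpha / 2.0) < 1e-6         # effective step ≈ α/|g|
```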

Property | AdaGrad | RMSProp
Gradient history | Cumulative sum (full history) | Exponential moving average (recent)
Effective LR over time | Monotonically decreasing → 0 | Stabilizes (adapts to recent gradients)
Works for deep networks? | No — LR vanishes | Yes
Sparse gradients | Excellent | Good
§11

Adam — The Gold Standard

Adam (Adaptive Moment Estimation, Kingma & Ba 2015) combines the ideas of momentum (first moment) and RMSProp (second moment) into a single algorithm. It is the most widely used optimizer in deep learning.

The Algorithm — Full Specification

Adam — Complete Update Rules \[ \mathbf{m}^{(t)} = \beta_1 \mathbf{m}^{(t-1)} + (1-\beta_1)\mathbf{g}^{(t)} \quad \text{[1st moment — mean of gradients]} \] \[ \mathbf{v}^{(t)} = \beta_2 \mathbf{v}^{(t-1)} + (1-\beta_2)\mathbf{g}^{(t)} \odot \mathbf{g}^{(t)} \quad \text{[2nd moment — variance of gradients]} \] \[ \hat{\mathbf{m}}^{(t)} = \frac{\mathbf{m}^{(t)}}{1 - \beta_1^t} \qquad \hat{\mathbf{v}}^{(t)} = \frac{\mathbf{v}^{(t)}}{1 - \beta_2^t} \quad \text{[bias correction]} \] \[ \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \frac{\alpha}{\sqrt{\hat{\mathbf{v}}^{(t)}} + \epsilon} \odot \hat{\mathbf{m}}^{(t)} \]
★ Default Hyperparameters

Paper recommendations: \(\alpha = 0.001\), \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), \(\epsilon = 10^{-8}\). These defaults work surprisingly well across a wide variety of problems without tuning.

The Bias Correction — Why It's Necessary

Both \(\mathbf{m}\) and \(\mathbf{v}\) are initialized to zero. In the first few iterations, they're heavily biased toward zero — the exponential average hasn't "warmed up" yet. The bias correction factors \((1-\beta_1^t)\) and \((1-\beta_2^t)\) counteract this.

// Why m̂ = m/(1−β₁ᵗ) is unbiased
1. Expanding the recursion: \(\mathbf{m}^{(t)} = (1-\beta_1)\sum_{k=1}^{t}\beta_1^{t-k}\mathbf{g}^{(k)}\)
2. Taking the expectation (assuming stationary \(\mathbf{g}\)): \(\mathbb{E}[\mathbf{m}^{(t)}] = \mathbb{E}[\mathbf{g}]\,(1-\beta_1^t)\) — the estimate is biased toward zero by exactly the factor \((1-\beta_1^t)\).
3. Dividing by \((1-\beta_1^t)\): \(\mathbb{E}[\hat{\mathbf{m}}^{(t)}] = \mathbb{E}[\mathbf{g}]\) — now unbiased. \(\blacksquare\)

Note: as \(t \to \infty\), \(\beta_1^t \to 0\) and \(\beta_2^t \to 0\), so the correction factors approach 1 — bias correction only matters in early training.
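The complete update fits in a dozen lines. A from-scratch scalar sketch following the rules above, applied to the toy loss \(J(\theta) = \theta^2\) (the loss, starting point, and step count are ours):

```python
import math

# Scalar Adam with bias correction, minimizing J(θ) = θ² (gradient 2θ).
def adam(grad_fn, theta, steps, alpha=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = v = 0.0
    for t in range(1, steps + 1):            # t starts at 1 for bias correction
        g = grad_fn(theta)
        m = b1 * m + (1 - b1) * g            # 1st moment (mean of gradients)
        v = b2 * v + (1 - b2) * g * g        # 2nd moment (uncentered variance)
        m_hat = m / (1 - b1 ** t)            # bias-corrected moments
        v_hat = v / (1 - b2 ** t)
        theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta

theta = adam(lambda th: 2 * th, theta=1.0, steps=5000)   # ends near the minimum 0
```

Note the normalized-step behavior described below: while the gradient is consistent, each step has magnitude close to \(\alpha\), regardless of the gradient's scale.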

Understanding Adam's Effective Step Size

Ignoring bias correction, the effective update for parameter \(j\) is:

\[ \Delta\theta_j = -\alpha \cdot \frac{\hat{m}_j}{\sqrt{\hat{v}_j} + \epsilon} \]

If gradient is consistent (always same sign, similar magnitude): \(\hat{m}_j \approx g_j\) and \(\hat{v}_j \approx g_j^2\), so \(\frac{\hat{m}_j}{\sqrt{\hat{v}_j}} \approx \frac{g_j}{|g_j|} = \text{sign}(g_j)\). The effective step is roughly \(\pm\alpha\) — Adam takes normalized steps with magnitude roughly equal to \(\alpha\).

If gradient is noisy (alternates sign): \(\hat{m}_j \approx 0\) (signals cancel) while \(\hat{v}_j\) remains nonzero → very small update. Adam naturally adapts to gradient uncertainty.

Adam Variants Worth Knowing

AdamW

Adam with decoupled weight decay. Standard Adam applies L2 regularization to the gradient before adaptive scaling — this inadvertently reduces the effective regularization. AdamW applies weight decay directly to the parameters, restoring proper regularization. Default in Transformers.

Nadam

Adam with Nesterov momentum (NAG) replacing classical momentum. Uses the lookahead gradient for the first moment. Slightly better convergence on some tasks due to NAG's superior theoretical properties.

AMSGrad

Fixes a convergence issue in Adam for non-convex settings. Maintains the running maximum of past second moments instead of the moving average — \(\hat{v}^{(t)} = \max(\hat{v}^{(t-1)}, v^{(t)})\) — which ensures the effective learning rate is monotonically non-increasing per parameter.

Lion

A 2023 optimizer (Google Brain) discovered by program search. Uses only the sign of the momentum update, not its magnitude. More memory efficient (no second moment) and claims better generalization on vision/language tasks.

§12

Learning Rate Schedules

A fixed learning rate is rarely optimal: a large \(\alpha\) is good early (fast progress), but can cause oscillation near the minimum. A small \(\alpha\) is precise but slow early. Learning rate schedules vary \(\alpha\) over time.

Step Decay

Step Decay \[ \alpha^{(t)} = \alpha_0 \cdot \gamma^{\lfloor t / k \rfloor} \]

γ ∈ (0,1), k = step interval (e.g., drop by 0.5 every 10 epochs)

Exponential Decay

Exponential Decay \[ \alpha^{(t)} = \alpha_0 \cdot e^{-\lambda t} \]

Cosine Annealing (Very Common)

Cosine Annealing \[ \alpha^{(t)} = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})\left(1 + \cos\frac{\pi t}{T}\right) \]

T = total steps. Starts at α_max, smoothly decays to α_min.

With warm restarts (SGDR), the LR is periodically reset to \(\alpha_{\max}\) — allowing the optimizer to explore different loss landscape regions and find better minima. Standard in modern training.

Linear Warmup (Critical for Transformers)

Warmup + Decay \[ \alpha^{(t)} = \begin{cases} \dfrac{t}{T_{\text{warm}}} \cdot \alpha_{\max} & t \leq T_{\text{warm}} \\ \text{decay schedule} & t > T_{\text{warm}} \end{cases} \]

In early training, Adam's second moment estimate is unreliable (few samples). A large initial LR causes catastrophically large updates. Warmup gives the second moment estimate time to stabilize before the full LR is used. Ubiquitous in BERT, GPT, ViT training.
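Warmup and cosine decay compose into a single schedule function. A sketch (the function name `lr` and the specific constants are ours; \(T\) here counts post-warmup steps):

```python
import math

# Linear warmup to α_max, then cosine decay to α_min — a common Transformer schedule.
def lr(t, T_warm, T_total, a_max, a_min=0.0):
    if t <= T_warm:
        return t / T_warm * a_max                        # linear warmup
    progress = (t - T_warm) / (T_total - T_warm)         # in [0, 1] after warmup
    return a_min + 0.5 * (a_max - a_min) * (1 + math.cos(math.pi * progress))

assert abs(lr(0, 100, 1000, 1e-3)) < 1e-12               # starts at 0
assert abs(lr(100, 100, 1000, 1e-3) - 1e-3) < 1e-12      # peak α_max at end of warmup
assert abs(lr(1000, 100, 1000, 1e-3)) < 1e-12            # decays to α_min = 0
```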

Cyclical Learning Rates (CLR)

Instead of monotonically decreasing, cycle \(\alpha\) between bounds \([\alpha_{\min}, \alpha_{\max}]\). Large LR phases help escape local minima and saddle points; small LR phases allow fine convergence. Empirically competitive with complex schedules for much less tuning.

§13

Second-Order Methods & Newton's Method

Gradient descent uses only first-order information (the gradient). Second-order methods additionally use the Hessian \(\mathbf{H} = \nabla^2 J\) — the matrix of second derivatives — which encodes curvature information.

Newton's Method

Fit a quadratic approximation to \(J\) at the current point (second-order Taylor expansion) and jump directly to its minimum:

// Deriving the Newton step
1. Second-order Taylor: \(\displaystyle J(\boldsymbol{\theta} + \boldsymbol{\delta}) \approx J(\boldsymbol{\theta}) + \nabla J^T\boldsymbol{\delta} + \frac{1}{2}\boldsymbol{\delta}^T \mathbf{H} \boldsymbol{\delta}\)
2. Minimize over \(\boldsymbol{\delta}\): take the gradient w.r.t. \(\boldsymbol{\delta}\) and set it to zero: \(\nabla J + \mathbf{H}\boldsymbol{\delta} = 0\)
3. \(\boldsymbol{\delta}^* = -\mathbf{H}^{-1}\nabla J\)
4. \(\boxed{\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \mathbf{H}(\boldsymbol{\theta}^{(t)})^{-1} \nabla J(\boldsymbol{\theta}^{(t)})}\) — the Hessian inverse rescales the gradient by the curvature: large curvature → small step; small curvature → large step.
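In one dimension \(\mathbf{H}^{-1}\) is just \(1/J''\), which makes the key property easy to demonstrate: on a quadratic, the quadratic approximation is exact, so a single Newton step lands on the minimum. A toy sketch of our own:

```python
# One Newton step: θ ← θ − J′(θ)/J″(θ) (1-D case, so H⁻¹ is just 1/J″).
def newton_step(theta, grad, hess):
    return theta - grad(theta) / hess(theta)

# J(θ) = 2(θ − 5)² → J′ = 4(θ − 5), J″ = 4, minimum at θ* = 5
theta = newton_step(10.0, lambda t: 4.0 * (t - 5.0), lambda t: 4.0)
assert abs(theta - 5.0) < 1e-12    # exact in a single step
```

Gradient descent from \(\theta = 10\) would take many steps to get this close; Newton pays for the speed with the Hessian computation, which is what makes it impractical at scale.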
✓ Newton vs Gradient Descent

Newton's method has quadratic convergence near the minimum — the number of correct digits roughly doubles each iteration. Gradient descent has only linear convergence. But Newton requires computing and inverting an \(n \times n\) Hessian — \(O(n^2)\) storage and \(O(n^3)\) inversion cost. For neural networks with millions of parameters, this is completely infeasible.

Method | Information Used | Convergence | Cost per Step | Practical?
Gradient Descent | 1st order (∇J) | Linear \(O(1/T)\) | \(O(n)\) | Yes
Newton's Method | 2nd order (∇²J) | Quadratic | \(O(n^3)\) | No (large n)
Quasi-Newton (L-BFGS) | Approx. 2nd order | Superlinear | \(O(n)\) memory | Yes (small n)
Adam | 1st order + diagonal 2nd | Adaptive | \(O(n)\) | Yes

Why Adam Approximates Newton

Adam's second moment \(\mathbf{v}\) approximates the diagonal of the Hessian — the curvature along each parameter axis independently (ignoring cross terms). Dividing by \(\sqrt{\mathbf{v}}\) approximates the Hessian scaling without ever forming the full matrix. This is why Adam often converges much faster than plain SGD despite being purely first-order.

§14

Convergence Analysis

Convex Case — Guaranteed Convergence

For a convex, \(L\)-smooth function with minimum \(J^*\), with step size \(\alpha \leq \frac{1}{L}\):

Convergence Rate — Convex Case \[ J(\boldsymbol{\theta}^{(T)}) - J^* \leq \frac{\|\boldsymbol{\theta}^{(0)} - \boldsymbol{\theta}^*\|^2}{2\alpha T} \]

This is \(O(1/T)\) convergence — to get \(\epsilon\)-accuracy requires \(O(1/\epsilon)\) iterations.

Strongly Convex Case — Exponential Convergence

If \(J\) is additionally \(\mu\)-strongly convex (the Hessian's minimum eigenvalue \(\geq \mu > 0\)), gradient descent converges linearly (geometrically fast):

Linear Convergence — Strongly Convex \[ \|\boldsymbol{\theta}^{(T)} - \boldsymbol{\theta}^*\|^2 \leq \left(1 - \frac{\mu}{L}\right)^T \|\boldsymbol{\theta}^{(0)} - \boldsymbol{\theta}^*\|^2 \]

The factor \(\kappa = L/\mu\) is the condition number. When \(\kappa\) is large (ill-conditioned — ravine), the contraction factor \((1-1/\kappa)\) is close to 1 → slow convergence. This is the formal reason ravines are problematic.
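The bound can be checked on a quadratic whose Hessian has exactly the eigenvalues \(\mu\) and \(L\) (our toy choice \(\mu = 1\), \(L = 10\), \(\alpha = 1/L\)):

```python
# GD on J = ½(μx² + Ly²) with α = 1/L: squared distance to θ* = (0,0)
# contracts at least as fast as (1 − μ/L)ᵀ, matching the bound above.
mu, L = 1.0, 10.0
alpha = 1.0 / L
x, y = 1.0, 1.0
d0 = x * x + y * y
for _ in range(100):
    x -= alpha * mu * x        # ∂J/∂x = μx
    y -= alpha * L * y         # ∂J/∂y = Ly
dT = x * x + y * y
assert dT <= (1 - mu / L) ** 100 * d0    # the bound holds (with slack here)
```

The slow direction is the small-eigenvalue one: \(x\) contracts by only \((1 - \mu/L) = 0.9\) per step, while \(y\) is solved in a single step — exactly the ravine effect, quantified.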

Convergence Criteria in Practice

  • Gradient norm: \(\|\nabla J(\boldsymbol{\theta}^{(t)})\|_2 < \epsilon\) — the gradient is near zero. Standard and mathematically principled.
  • Parameter change: \(\|\boldsymbol{\theta}^{(t+1)} - \boldsymbol{\theta}^{(t)}\|_2 < \epsilon\) — parameters stopped moving significantly.
  • Loss change: \(|J^{(t+1)} - J^{(t)}| < \epsilon\) — loss has plateaued.
  • Validation loss: Monitor held-out loss — stop when it starts increasing (early stopping). Most common in practice.
§15

GD for Linear & Logistic Regression — Concrete Examples

Linear Regression — Full GD Update

Model: \(\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b\). Loss: MSE = \(\frac{1}{2m}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2\).

Linear Regression GD \[ \nabla_{\mathbf{w}} J = \frac{1}{m}\mathbf{X}^T(\mathbf{X}\mathbf{w} - \mathbf{y}) = \frac{1}{m}\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y}) \] \[ \nabla_{b} J = \frac{1}{m}\mathbf{1}^T(\hat{\mathbf{y}} - \mathbf{y}) = \frac{1}{m}\sum_i(\hat{y}^{(i)} - y^{(i)}) \] \[ \mathbf{w} \leftarrow \mathbf{w} - \frac{\alpha}{m}\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y}), \qquad b \leftarrow b - \frac{\alpha}{m}\sum_i(\hat{y}^{(i)} - y^{(i)}) \]

Logistic Regression — Full GD Update

Model: \(\hat{\mathbf{y}} = \sigma(\mathbf{X}\mathbf{w} + b)\). Loss: Cross-entropy. As derived in the logistic regression masterclass:

Logistic Regression GD \[ \nabla_{\mathbf{w}} J = \frac{1}{m}\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y}) \] \[ \nabla_{b} J = \frac{1}{m}\sum_i(\hat{y}^{(i)} - y^{(i)}) \] \[ \mathbf{w} \leftarrow \mathbf{w} - \frac{\alpha}{m}\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y}), \qquad b \leftarrow b - \frac{\alpha}{m}\sum_i(\hat{y}^{(i)} - y^{(i)}) \]
✓ Identical Update Forms

The gradient update rule for logistic and linear regression has the exact same vectorized form — \(\frac{1}{m}\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y})\). The only difference is in how \(\hat{\mathbf{y}}\) is computed (identity vs sigmoid). This is why gradient descent code for both models looks nearly identical — and why the pattern generalizes to neural networks.
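One way to make the shared form concrete: write the training loop once, parameterized only by the prediction function. The sketch below (hypothetical `fit` helper and synthetic separable labels, my own construction) fits logistic regression by passing in `sigmoid`; passing the identity instead would give the linear-regression loop above unchanged.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(X, y, predict, alpha=0.5, iters=3000):
    """Shared GD loop: only `predict` distinguishes linear from logistic."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(iters):
        err = predict(X @ w + b) - y       # ŷ − y
        w -= (alpha / m) * (X.T @ err)     # identical update form
        b -= (alpha / m) * err.sum()
    return w, b

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y_cls = (X @ np.array([1.5, -2.0]) > 0).astype(float)  # separable labels

w, b = fit(X, y_cls, predict=sigmoid)
acc = np.mean((sigmoid(X @ w + b) > 0.5) == y_cls)
print(acc)  # near-perfect accuracy on this separable training set
```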

Why GD Instead of the Normal Equation?

|  | Normal Equation | Gradient Descent |
| --- | --- | --- |
| Solution quality | Exact global optimum | Approximate (converges to optimum) |
| Cost (n = #features) | \(O(n^3)\) matrix inversion | \(O(mn)\) per iteration (m = #examples) |
| Works when n is large? | No — \(n > 10^4\) becomes infeasible | Yes — scales to billions of params |
| Works for non-convex loss? | No | Yes (no guarantee of global min) |
| Hyperparameter tuning | None | Learning rate must be chosen |
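At small n the two methods agree, which makes a nice sanity check. A minimal sketch (synthetic data of my own choosing; at large n only the GD column remains feasible):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 100, 3
X = rng.normal(size=(m, n))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.01, size=m)

# Normal equation: solve XᵀXθ = Xᵀy (solve is preferred over explicit inverse)
theta_exact = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the same MSE objective
theta = np.zeros(n)
for _ in range(5000):
    theta -= (0.1 / m) * (X.T @ (X @ theta - y))  # ∇J = (1/m)Xᵀ(Xθ − y)

print(np.max(np.abs(theta - theta_exact)))  # tiny gap between the two
```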
§16

The Full Mental Model — Everything Together

[Figure: taxonomy diagram. Core update θ ← θ − α∇J(θ); challenges (saddle points & local minima, ravines/ill-conditioning, learning rate too large or small, mini-batch gradient noise) map to solutions (batch strategy: BGD/SGD/mini-batch; momentum; adaptive LR: AdaGrad/RMSProp; LR schedules: step/cosine/warmup), combined in Adam; convergence guarantees: convex O(1/T), strongly convex O((1−μ/L)ᵀ), non-convex no global guarantee.]
Fig 3. Complete taxonomy of gradient descent — challenges, solutions, and how Adam synthesizes momentum and adaptive learning rates.

The complete unified narrative:

  1. The negative gradient \(-\nabla J\) is the unique direction of steepest descent — proven by the Cauchy-Schwarz inequality applied to the directional derivative.
  2. Gradient descent iterates \(\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \alpha\nabla J\) — valid because, to first order, the Taylor expansion predicts a decrease of \(\alpha\|\nabla J\|^2\) per step.
  3. The learning rate \(\alpha\) controls step size — bounded above by \(1/L\) for convergence in the convex case.
  4. Mini-batch GD trades gradient accuracy for speed — GPU parallelism makes this the default.
  5. Real loss surfaces have saddle points, ravines, and plateaus — naive GD struggles with all three.
  6. Momentum accumulates past gradients, suppressing oscillation across ravines and accelerating along consistent directions.
  7. AdaGrad adapts per-parameter learning rates using accumulated squared gradients. RMSProp fixes AdaGrad's vanishing LR with an exponential moving average.
  8. Adam = Momentum + RMSProp + bias correction. The effective step is roughly \(\alpha \cdot \text{sign}(\text{mean gradient})\) — normalized, adaptive, momentum-smoothed. Default for deep learning.
  9. LR schedules (especially warmup + cosine decay) significantly improve final performance — the optimizer and schedule interact critically.
  10. Adam approximates Newton's method diagonally — second moments estimate the curvature per parameter, scaling steps by inverse curvature without forming the full Hessian.
★ The Single Most Important Fact

Every weight update in every model you've ever trained was computed by some variant of this algorithm. Gradient descent is not a detail of ML training: it is ML training. Backpropagation simply computes \(\nabla J\); gradient descent decides what to do with it.