Deep Learning / Optimization / Mathematical Foundations
Foundation of all Deep Learning
the algorithm that trains every neural networkBackpropagation
From computational graphs and the chain rule to reverse-mode AD, Jacobians, VJPs, matrix calculus, PyTorch
autograd, and every modern trick. The most complete treatment of backpropagation ever written.
// the core question · naive vs efficient · what training actually needs
Training a neural network requires computing the gradient of the loss function with respect to every learnable
parameter. For a modern network with 175 billion parameters (GPT-3), this means computing 175,000,000,000
partial derivatives — on every single forward pass, thousands of times per second. The question is not whether
we can compute gradients, but how to do it with exactly one backward pass instead of 175 billion forward passes.
Three Approaches to Computing Gradients
Numerical Differentiation
Finite difference approximation: for each parameter \(\theta_j\), run two forward passes:
Cost: \(O(n)\) forward passes for \(n\) parameters. GPT-3: 350 billion evaluations per
gradient. Numerically unstable (truncation vs roundoff error tradeoff). Only used for gradient checking,
not production training.
Symbolic Differentiation
Apply calculus rules symbolically (like Wolfram Alpha). Result: an exact closed-form expression for the
gradient.
Problem: Expression swell — the symbolic gradient can be exponentially larger than the
original expression. Doesn't handle loops or dynamic computation graphs. Only practical for small,
fixed-structure expressions.
The Backpropagation Solution
Backpropagation computes the exact gradient using one forward pass (to cache intermediate values)
and one backward pass (to propagate gradients). Total cost: \(O(n)\) operations for \(n\)
parameters — the same order as a single forward pass. This is the efficiency that makes deep learning
computationally tractable.
The Core Insight
The Goal of BackpropagationObjective
\[\nabla_\theta\mathcal{L} = \left[\frac{\partial\mathcal{L}}{\partial\theta_1},
\frac{\partial\mathcal{L}}{\partial\theta_2}, \ldots,
\frac{\partial\mathcal{L}}{\partial\theta_n}\right]^T\]
\[\text{Backprop achieves this in: } T_{\text{fwd}} + T_{\text{bwd}} \approx 2 \cdot T_{\text{fwd}} \quad
\text{(vs } n \cdot T_{\text{fwd}} \text{ for finite differences)}\]
The backward pass costs roughly the same as the forward pass — not n times more. This
is the fundamental theorem underlying all deep learning training. Without it, modern neural networks with
millions or billions of parameters could not be trained.
02
Graphs
Computational Graphs
// DAG structure · nodes · edges · caching · topological order
A computational graph is a Directed Acyclic Graph (DAG) where each node represents an operation and each
directed edge represents the flow of data (a tensor). The graph makes the structure of the computation explicit
— enabling automatic gradient computation through systematic traversal.
Graph Structure for a Neural Layer
Fig 1. Computational graph for one linear layer: x and W enter a matrix multiply (@),
bias b is added (+), then activation σ produces output a. Backward pass (red dashed) traverses this graph in
reverse, computing gradients via the chain rule at each node.
Forward Pass: Topological Evaluation
The forward pass evaluates all nodes in topological order — each node is evaluated only after all its inputs
are ready. For a neural network with \(L\) layers, this is simply the sequential layer-by-layer computation.
Every intermediate value must be cached because the backward pass needs them.
What Gets Cached During Forward PassMemory
\[\text{For each layer }\ell:\quad \text{Cache}\left\{\mathbf{Z}^{[\ell]},\, \mathbf{A}^{[\ell]}\right\}\]
\[\text{Total cache size: } O\!\left(m \cdot \sum_{\ell=1}^L n^{[\ell]}\right) = O(m \cdot L \cdot d)\]
m = batch size, n[ℓ] = neurons in layer ℓ, L = depth. For GPT-3: 175B parameters, 96
layers, batch size 1024 — the activation cache alone requires ~350GB of GPU memory during training. This
is why activation checkpointing exists (§12).
Static vs Dynamic Graphs
Static Graph (TensorFlow v1)
Graph is defined once, then compiled and executed repeatedly. More efficient: the compiler can optimize
the graph, fuse operations, and preallocate memory. Less flexible: can't use Python control flow
(if/while) inside the graph.
Dynamic Graph (PyTorch)
Graph is built on-the-fly during each forward pass. Python control flow works naturally — the graph
changes with each forward pass. More flexible and debuggable. Slightly less efficient but the default in
modern research.
Backpropagation is the chain rule of calculus applied systematically to a computational graph. Nothing more.
The beauty of backpropagation is not that it invents new mathematics — it's that it applies the chain rule
efficiently using the graph structure.
Univariate Chain Rule
Single-Variable Chain RuleCalculus
I
\[\text{If } f = g(h(x)):\quad \frac{df}{dx} = \frac{dg}{dh} \cdot \frac{dh}{dx}\]
\[\text{For a chain } x \to u \to v \to w \to \mathcal{L}:\quad \frac{d\mathcal{L}}{dx} =
\frac{d\mathcal{L}}{dw}\cdot\frac{dw}{dv}\cdot\frac{dv}{du}\cdot\frac{du}{dx}\]
The chain rule: to compute the derivative of the final output with respect to any
intermediate variable, multiply all the local derivatives along the path from that variable to the output.
In a neural network, the "path" goes from each parameter through all subsequent layers to the loss.
Multivariate Chain Rule
In neural networks, intermediate values are tensors, and one upstream variable may depend on multiple
intermediate nodes. The multivariate chain rule sums over all paths:
where
the sum is over all immediate downstream nodes y_j that depend on x.
This sum-over-paths is exactly what "gradient accumulation" means in practice — when a
parameter is used in multiple places (e.g., tied weights), its gradient is the sum of contributions from
all paths to the loss. PyTorch's .grad accumulates (+=) rather than overwrites, implementing this formula.
Why Backprop Is Efficient
The Key Efficiency
Naive application of the chain rule would recompute all intermediate values for each parameter.
Backpropagation avoids this by computing the chain rule in reverse topological order, reusing each
intermediate gradient \(\partial\mathcal{L}/\partial z\) for all upstream parameters that depend on \(z\).
Each gradient is computed exactly once and shared across all paths. This is why backprop is \(O(n)\) rather
than \(O(n^2)\).
04
AD Theory
Reverse Mode Automatic Differentiation
// forward mode vs reverse mode · the adjoint · cotangent · accumulation order
Automatic differentiation (AD) is the general framework; backpropagation is its instantiation for neural
networks. AD has two modes — and the choice between them determines computational complexity.
Forward Mode vs Reverse Mode
Forward Mode AD
Computes \(\frac{\partial \mathbf{y}}{\partial x_i}\) for one specific input \(x_i\). Propagates a
tangent vector \(\dot{v}\) forward alongside the primal values.
Cost: One forward pass per input dimension. For \(n\) inputs: \(O(n)\) forward passes.
Best when: \(n_{\text{inputs}} \ll n_{\text{outputs}}\). Rare in ML — usually
\(n_{\text{params}} \gg 1\).
Used in: Jacobian-vector products (JVPs), second-order optimization, physics
simulations.
Reverse Mode AD (Backprop)
Computes \(\frac{\partial \mathcal{L}}{\partial x_i}\) for all inputs simultaneously. Propagates an
adjoint vector \(\bar{v}\) backward through the graph.
Cost: One forward + one backward pass. For \(n\) inputs: still \(O(1)\) passes.
Best when: \(n_{\text{outputs}} \ll n_{\text{inputs}}\). ML: one scalar loss, millions
of parameters.
Used in: All neural network training. The reason deep learning is computationally
feasible.
The Adjoint Variable
In reverse mode AD, we define the adjoint (or cotangent) of each node \(v\) as the
derivative of the loss with respect to that node's value:
For each upstream node w_j that uses v as an input: the contribution to v̄ is the
adjoint of w_j times the local Jacobian ∂w_j/∂v. Summing over all downstream nodes gives the total
adjoint. This is the multivariate chain rule in adjoint form. The final adjoints ∂L/∂θ for all parameters
θ are the gradients we use for SGD.
The Reverse Accumulation Algorithm
Reverse Mode AD — Core Algorithm
# Forward pass: evaluate all nodes, cache values
values = topological_sort(graph) for node in values:
node.value = evaluate(node.op, node.inputs)
# Backward pass: propagate adjoints in reverse topo order for node inreversed(values): # node has adjoint bar_v = dL/d(node) for i, inp inenumerate(node.inputs): # local Jacobian: d(node)/d(inp) evaluated at cached
values
local_grad = jacobian(node.op, wrt=i,
at=node.cached_inputs) # Adjoint accumulation: bar_inp += bar_v * local_grad
adjoints[inp] += adjoints[node] * local_grad
# Result: adjoints[param] = dL/d(param) for all parameters
05
Linear Algebra
Jacobians
// the derivative of a vector function · shape · why it's a matrix · computational cost
Neural network layers map vectors to vectors. The derivative of a vector-valued function with respect to a
vector input is a matrix called the Jacobian. Understanding Jacobians is the key to
understanding why backprop works for vector and tensor operations.
The (i,j) entry is ∂yᵢ/∂xⱼ — how much the i-th output changes when the j-th input
changes. The Jacobian fully describes the local linear approximation of f at x: f(x+δ) ≈ f(x) + J_f(x)·δ.
For a scalar function (m=1), the Jacobian reduces to the gradient row vector ∇f^T.
Chain Rule for Vector Functions
Vector Chain Rule via JacobiansMatrix
Multiplication
For the special case where L is a scalar (which is always true in neural network
training): ∂L/∂x is a row vector of shape (1×n), J is (m×n), and ∂L/∂z is a row vector of shape (1×m). The
product (1×m)·(m×n) = (1×n). This is the vector-Jacobian product (VJP).
For a layer with \(n\) inputs and \(m\) outputs, the Jacobian has \(mn\) entries. For a 4096-dimensional
layer: 16 million entries per Jacobian, per layer. Computing and storing full Jacobians is infeasible.
Instead, we only ever need the Jacobian-vector product (JVP) for forward mode or the
vector-Jacobian product (VJP) for reverse mode — both computable in \(O(n)\) without forming
the full Jacobian.
06
Core Primitive
Vector-Jacobian Products (VJPs)
// the fundamental operation · why we don't need the Jacobian · examples · efficiency
The vector-Jacobian product is the fundamental primitive of reverse-mode AD. Every backpropagation
rule reduces to computing a VJP for that specific operation. Defining how to compute the VJP for each operation
is equivalent to defining backpropagation for that operation.
ȳ = ∂L/∂y is the "upstream gradient" — the adjoint from all downstream nodes. VJP(f, ȳ)
computes ∂L/∂x without ever forming J_f. Cost: same order as one forward pass through f. This is why
backpropagation is efficient: each backward operation is as cheap as the corresponding forward operation.
VJPs for Key Operations
// VJP derivations for the most common neural network operations
For Wx: the VJP w.r.t. x is W^T ȳ (transpose the weight matrix). The VJP w.r.t. W is
the outer product ȳ x^T. These are the standard backprop rules for linear layers — now derived from
first principles via VJPs.
For elementwise functions: the Jacobian is diagonal, so the VJP is just the elementwise
product ⊙ (Hadamard product) of the upstream gradient and the elementwise derivative. ReLU: x̄ = ȳ ⊙
1[x>0]. Sigmoid: x̄ = ȳ ⊙ σ(x)(1−σ(x)).
The gradient of a sum with respect to each input is 1 — the upstream scalar is
broadcast to all inputs. This is why in batch normalization: the gradient through the sum broadcasts to
all elements.
The cross-entropy loss involves log — so during backward, we divide by the predicted
probability. This is why near-zero predictions cause numerical instability: 1/ε is very large. log(ε +
1e-8) clipping or LogSumExp are used to stabilize this.
JVP vs VJP — When to Use Which
JVP vs VJP ComparisonMode
Selection
\[\text{JVP}(f, \dot{\mathbf{x}}) = J_f \dot{\mathbf{x}} \quad \text{(tangent vector }
\dot{\mathbf{x}}\text{, forward mode)}\]
\[\text{VJP}(f, \bar{\mathbf{y}}) = J_f^T \bar{\mathbf{y}} \quad \text{(cotangent vector }
\bar{\mathbf{y}}\text{, reverse mode)}\]
\[\text{Reverse mode: } O(1) \text{ VJP calls to get } \nabla_{\mathbf{x}}\mathcal{L} \text{ for scalar
}\mathcal{L}\]
\[\text{Forward mode: } O(n) \text{ JVP calls to get all of } J_f\]
Conclusion: for a scalar loss and many parameters (ML training), use reverse mode. For
vector outputs and few inputs (Jacobian estimation, sensitivity analysis), use forward mode. JAX supports
both natively via jax.vjp() and jax.jvp().
Neural networks involve matrix operations. Matrix calculus provides the rules for differentiating functions
that take matrices as inputs or produce matrices as outputs. Two layout conventions exist — the choice affects
the shape of the result and leads to constant confusion in the literature.
Layout Conventions
Numerator Layout
The gradient \(\partial y/\partial \mathbf{x}\) has the same shape as \(y\) "in the numerator." For
scalar \(y\) and vector \(\mathbf{x} \in \mathbb{R}^n\): the gradient is a row vector \(\in
\mathbb{R}^{1\times n}\).
For vector \(\mathbf{y} \in \mathbb{R}^m\) and vector \(\mathbf{x} \in \mathbb{R}^n\): result is
\(\mathbb{R}^{m\times n}\) Jacobian.
Used by: Bishop's PRML, most numerical analysis texts.
Denominator Layout
The gradient \(\partial y/\partial \mathbf{x}\) has the same shape as the "denominator" \(\mathbf{x}\).
For scalar \(y\) and vector \(\mathbf{x} \in \mathbb{R}^n\): the gradient is a column vector
\(\in \mathbb{R}^{n\times 1}\).
The ML convention: \(\nabla_\mathbf{x}\mathcal{L}\) is a column vector matching \(\mathbf{x}\).
Used by: Most ML textbooks, Goodfellow's DL book, standard practice.
Key Matrix Calculus Identities
Essential Identities for Neural Network BackpropReference
The trace trick: tr(A^T B) = vec(A)^T vec(B) — converts matrix inner products to scalar
inner products. Essential for computing ∂L/∂W when L involves matrix operations. Most backprop derivations
for linear layers use ∂tr(X^T A)/∂X = A.
Deriving Linear Layer Gradients via Matrix Calculus
// Deriving ∂L/∂W and ∂L/∂x for y = Wx using matrix calculus trace trick
Setup
\[\mathbf{y} = \mathbf{W}\mathbf{x},\quad \mathcal{L} = f(\mathbf{y})\]
Use the differential approach: \(d\mathcal{L} =
\text{tr}\!\left(\left(\frac{\partial\mathcal{L}}{\partial\mathbf{y}}\right)^T d\mathbf{y}\right) =
\bar{\mathbf{y}}^T\, d\mathbf{y}\)
The key rule: for y = Wx, backprop through x is W^T ȳ (transpose the weight). This is
the most important single formula in all of deep learning. ∎
08
Full Derivation
Full Network Backpropagation Derivation
// notation · BP1 through BP4 · full vectorized proof · every gradient computed
We now derive the four fundamental equations of backpropagation for a general \(L\)-layer neural network with
arbitrary activation functions. These four equations are the complete specification of backpropagation —
everything else is an application of them.
The transpose W^[ℓ+1]T is not a coincidence — it's the VJP of the linear operation z =
Wa. Multiply by the adjoint from the right: the "backward weight matrix" is the transpose of the forward
weight matrix. ∎
BP3
Gradient of the weights:
\[\frac{\partial\mathcal{L}}{\partial\mathbf{W}^{[\ell]}} =
\boldsymbol{\delta}^{[\ell]}(\mathbf{a}^{[\ell-1]})^T\]
Derivation: \(z_j^{[\ell]} = \sum_k W_{jk}^{[\ell]}a_k^{[\ell-1]}\), so \(\partial
z_j^{[\ell]}/\partial W_{jk}^{[\ell]} = a_k^{[\ell-1]}\) and \(\partial z_{j'}^{[\ell]}/\partial
W_{jk}^{[\ell]} = 0\) for \(j'\neq j\). Therefore:
\[\frac{\partial\mathcal{L}}{\partial W_{jk}^{[\ell]}} = \frac{\partial\mathcal{L}}{\partial
z_j^{[\ell]}}\cdot a_k^{[\ell-1]} = \delta_j^{[\ell]}\cdot a_k^{[\ell-1]}\]
In matrix form: ∂L/∂W[ℓ] = δ[ℓ](a[ℓ-1])^T — the outer product of the error signal and
the previous layer's activation. Shape: n[ℓ] × n[ℓ-1], matching W[ℓ]. ∎
BP4
Gradient of the biases:
\[\frac{\partial\mathcal{L}}{\partial\mathbf{b}^{[\ell]}} = \boldsymbol{\delta}^{[\ell]}\]
Derivation: \(z_j^{[\ell]} = \sum_k W_{jk}a_k + b_j\), so \(\partial z_j^{[\ell]}/\partial
b_j^{[\ell]} = 1\) and zero otherwise. Therefore \(\partial\mathcal{L}/\partial b_j^{[\ell]} =
\delta_j^{[\ell]}\cdot 1 = \delta_j^{[\ell]}\). ∎
The bias gradient is simply the error signal — the bias acts as a shift with derivative
1, so it just passes the error through unchanged.
// the softmax Jacobian · diagonal minus outer product · numerical stability
The softmax function maps a vector to a probability distribution. Its Jacobian is not diagonal (unlike ReLU or
sigmoid) — each output depends on every input through the normalization constant. Computing the softmax Jacobian
correctly is essential for the output layer.
where ȳ_sum = ŷ^T ȳ is a scalar. The VJP reduces to: subtract the scalar dot product
ŷ^T ȳ from each component of ȳ, then multiply elementwise by ŷ. Total cost: O(K), not O(K²). This
efficient form is what PyTorch implements internally.
10
Elegant Result
Cross-Entropy + Softmax — The Cancellation
// the combined gradient · why it simplifies · MLE connection · numerical stability
When softmax is used with categorical cross-entropy loss, the combined gradient
\(\partial\mathcal{L}/\partial\mathbf{z}^{[L]}\) has an extraordinarily clean form. This cancellation is not
coincidence — it's a consequence of the exponential family structure of the categorical distribution.
Using Σₖ yₖ = 1 (one-hot labels sum to 1) and Σₖ≠ⱼ yₖ = 1 − yⱼ. The remarkable
cancellation: the softmax Jacobian exactly "cancels" the division by ŷₖ in the cross-entropy derivative. ∎
Softmax + Cross-Entropy Gradient
\[\boxed{\frac{\partial\mathcal{L}_{\text{CCE}}}{\partial\mathbf{z}^{[L]}} = \hat{\mathbf{y}} - \mathbf{y}}\]
The gradient with respect to the pre-softmax logits is simply predicted probability minus true label.
This is the most important single formula in deep learning training — the error signal for every neural
network classifier.
Numerical Stability
Computing \(\text{softmax}(\mathbf{z})_k = e^{z_k}/\sum_j e^{z_j}\) directly can overflow when \(z_k\) is
large. The numerically stable version subtracts the maximum before exponentiation:
Subtracting the maximum doesn't change the result (it cancels in numerator and
denominator) but prevents overflow. The largest exp becomes exp(0) = 1; all others are smaller. The loss:
-log(softmax(z)_c) = -(z_c − z_max) + log(Σ_j exp(z_j − z_max)) — this is the log-sum-exp computation,
computed stably in PyTorch as F.cross_entropy(), which combines log_softmax + nll_loss numerically.
11
Mini-Batch
Batch Gradients
// single to batch · vectorized form · 1/m averaging · efficient GPU computation
The batch form processes all m samples simultaneously via matrix operations — maximally
parallelizable on GPUs. The 1/m factor averages the gradient over the batch. BP4 requires summing over the
batch dimension — a sum/reduction operation — to get the bias gradient, since the bias is shared across all
samples.
Why Batch Size Matters for Gradients
Large batch: More accurate gradient estimate (lower variance). But the gradient direction
barely improves beyond batch size ~1000 for most problems (see large-batch training literature).
Small batch: Noisier gradients, but the noise acts as regularization (implicit noise in
SGD helps escape sharp minima). Usually reaches better minima than large-batch training at the same number
of epochs.
Gradient variance: \(\text{Var}(\hat{g}) = \sigma^2_g/m\) — variance decreases as
\(1/m\). But the computational cost grows linearly in \(m\). For large \(m\), you pay full compute for
diminishing variance reduction.
In practice, training large neural networks is limited not by compute but by GPU memory. The standard
backpropagation algorithm requires storing all intermediate activations, which can exceed GPU memory for large
models or sequences.
Memory Breakdown During Training
Component
Memory
Notes
Parameters (weights + biases)
\(O(P)\)
Fixed; FP32: 4 bytes/param. GPT-3: 700GB
Gradients
\(O(P)\)
Same shape as parameters
Optimizer states (Adam)
\(O(2P)\)
m_t and v_t for each parameter
Activations (forward cache)
\(O(m \cdot L \cdot d)\)
Grows with batch size × depth × width
Input data
\(O(m \cdot n)\)
Usually small relative to activations
Activation Checkpointing
Instead of caching all intermediate activations, recompute them on demand during the backward pass. Trade
compute for memory:
With checkpointing: save activations only at every √L-th layer (the "checkpoints").
During backward, recompute from the nearest checkpoint. Memory scales as √L instead of L — for a 100-layer
network, memory reduces 10×. PyTorch: torch.utils.checkpoint.checkpoint(). Used in training LLMs with very
long sequences.
Gradient Accumulation
To simulate a large effective batch size with limited GPU memory, accumulate gradients over multiple small
batches before taking an optimizer step:
for i, (x, y) inenumerate(dataloader):
loss = model(x, y) / accumulation_steps
loss.backward() # gradients accumulate in
.grad
if (i + 1) % accumulation_steps == 0:
optimizer.step() # update with
accumulated gradients
optimizer.zero_grad()
# Effectively processes 8x larger batches with 1x memory
Automatic Mixed Precision (AMP)
AMP — Gradient ScalingFP16
Training
\[\text{FP32 params and gradients: 4 bytes/value}\]
\[\text{FP16 forward pass: 2 bytes/value (2× faster on Tensor Cores)}\]
\[\text{Gradient scaling: } \hat{g} = s \cdot g_{\text{FP16}}, \quad g_{\text{FP32}} = \hat{g}/s\]
FP16 has limited range (max ~65504) — small gradients underflow to 0. Solution:
multiply loss by a large scale factor s before backward, then divide the accumulated gradients by s.
PyTorch GradScaler handles this automatically. AMP reduces memory ~2× and increases throughput ~2-3× on
modern GPUs with no accuracy loss.
13
AD Framework
Automatic Differentiation — The Theory
// three flavors of differentiation · operator overloading · tape-based AD
Automatic differentiation is the umbrella framework. It is distinct from both numerical differentiation (finite
differences) and symbolic differentiation (computer algebra). It computes exact derivatives efficiently using
the structure of the program.
Three Flavors of Differentiation
Method
How
Exact?
Cost
When Used
Numerical (finite diff)
\((f(x+ε) - f(x-ε))/2ε\)
No — approx \(O(ε^2)\)
\(O(n)\) passes
Gradient checking only
Symbolic (CAS)
Derive closed-form expression
Yes
Expr swell; static only
Toy problems, paper derivations
Forward AD
Dual numbers, JVPs
Yes
\(O(n)\) passes for full J
Few inputs, many outputs
Reverse AD (Backprop)
Tape + VJPs
Yes
\(O(1)\) pass for scalar loss
All neural network training
Operator Overloading
The most common AD implementation technique: overload mathematical operators (+, *, sin, etc.) on a special
number type that tracks both the primal value and the derivative information:
# Differentiate f(x) = x*x + x at x=3
x = Dual(3.0, 1.0) # x=3, ẋ=1 (seed the tangent)
result = x*x + x print(result.dual) # f'(3) = 2*3 +
1 = 7 ✓
Tape-Based Reverse Mode AD
PyTorch's autograd and TensorFlow use a tape (Wengert list): a record of every
operation performed during the forward pass, including enough information to compute the VJP for each
operation during the backward pass:
Tape-Based AD — Minimal Implementation
classTensor: def__init__(self, data, _children=(), _backward_fn=None):
self.data = data
self.grad = 0.0# accumulates ∂L/∂self
self._children = _children# inputs to this node
self._backward = _backward_fnor (lambda: None)
def__mul__(self, other):
out = Tensor(self.data *
other.data, (self, other)) def _bwd(): # VJP of multiply: ∂(a*b)/∂a = b, ∂(a*b)/∂b =
a
self.grad += out.grad
* other.data# accumulate!
other.grad += out.grad
* self.data
out._backward = _bwd return out
# 1. requires_grad=True enables gradient tracking
x = torch.randn(3, 4,
requires_grad=True)
W = torch.randn(4, 5,
requires_grad=True)
# 2. Forward pass builds the computational graph automatically
z = x @ W # graph:
z.grad_fn = MmBackward
loss = z.sum()
# 3. Backward pass computes all gradients
loss.backward() # default: free the
graph after backward print(W.grad)
# dL/dW — shape (4,5)
# 4. retain_graph=True: keep graph for multiple backward passes
loss.backward(retain_graph=True) #
needed for higher-order grads
# 5. Gradient accumulation: .grad ADDS (doesn't replace)
optimizer.zero_grad() # must zero before next
backward!
# 6. torch.no_grad(): disable tracking for inference with torch.no_grad():
y_pred = model(x_test) # 2x faster, no memory overhead
# 7. detach(): remove tensor from graph
target = output.detach() # treat as constant, stop grad
flow
TensorFlow GradientTape
TensorFlow 2.x GradientTape
import tensorflow as tf
# 1. GradientTape records forward pass operations
x = tf.Variable(tf.random.normal([3, 4]))
W = tf.Variable(tf.random.normal([4, 5]))
with tf.GradientTape() as tape:
z = tf.matmul(x, W) #
recorded on tape
loss = tf.reduce_sum(z) # recorded on
tape
# 2. Compute gradients by replaying the tape backward
grads = tape.gradient(loss, [x, W]) # [dL/dx,
dL/dW]
# 3. persistent=True: call gradient() multiple times with tf.GradientTape(persistent=True) as tape:
loss = compute_loss()
dL_dx = tape.gradient(loss, x)
dL_dW = tape.gradient(loss, W) # tape still alive del tape # must delete manually
Custom Backward Functions in PyTorch
torch.autograd.Function — Custom Backward
classStraightThroughEstimator(torch.autograd.Function): """Example: gradient through a non-differentiable op"""
Global norm clipping scales all gradients uniformly — preserving the direction, only
reducing the magnitude. Value clipping (clip each gradient independently to [−c, c]) is different and less
principled. PyTorch: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0). Typical max_norm:
0.5–5.0 depending on the model.
Higher-Order Gradients
Since the gradient computation is itself a differentiable computation, you can differentiate through the
backward pass to get second (or higher) order derivatives:
# Hessian-Vector Product (efficient second order)
v = torch.ones_like(x)
hvp = torch.autograd.grad(dy_dx, x, grad_outputs=v)[0] # Computes H·v in O(n) without materializing H (n×n)
MAML (Meta-Learning): Differentiates through multiple gradient steps to find
initialization that quickly adapts. Requires second-order gradients (gradient through gradient).
Hessian-Vector Products (HVPs): Used in second-order optimization (Newton's method,
Hessian-free optimization, K-FAC). Computable in \(O(n)\) — same cost as a gradient — without forming the
\(O(n^2)\) Hessian.
Implicit differentiation: Differentiate through the solution of an optimization problem.
Used in meta-learning, hyperparameter optimization, and neural ODEs.
Gradient Checkpointing — Deep Dive
torch.utils.checkpoint — Activation Recomputation
import torch.utils.checkpoint as cp
classCheckpointedModel(nn.Module): defforward(self, x): # Without checkpointing: all activations cached (high
memory) # With checkpointing: recomputes on backward (low
memory)
x = cp.checkpoint(self.layer1, x) #
don't cache layer1 activations
x = cp.checkpoint(self.layer2, x) #
recompute from input on backward
x = self.layer3(x)
# last layer: always cached return x
# Memory: O(√L) instead of O(L) — 10x reduction for 100 layers # Compute: ~33% overhead (recomputes each checkpoint once on bwd) # Enables 2-3x larger models or longer sequences
Gradient Noise and Perturbation
Label smoothing: Replaces one-hot targets with soft targets — the gradient signal is
\(\hat{y} - y_{\text{smooth}}\) where \(y_{\text{smooth}} = (1-\epsilon)y + \epsilon/K\). Prevents
overconfident predictions.
Stochastic depth: Randomly drop entire residual branches during training. The gradient
paths change randomly each step, acting as an ensemble of shallower networks.
Noise injection: Add Gaussian noise to gradients (Langevin dynamics). Helps escape sharp
local minima. Formally equivalent to sampling from the Bayesian posterior.
16
Synthesis
The Complete Mental Model
// everything unified · one diagram · one algorithm · the whole picture
Fig 2. Complete backpropagation mental map — from the computational goal through the four
fundamental equations to the gradient update that trains every neural network.
The complete story in one coherent thread:
The problem: We need \(\partial\mathcal{L}/\partial\theta_j\) for every parameter
\(\theta_j\). Finite differences require \(O(n)\) forward passes. Backpropagation computes all gradients
simultaneously in one forward + one backward pass — \(O(1)\) passes regardless of \(n\).
Computational graphs represent computation as a DAG. The forward pass evaluates nodes in
topological order, caching all intermediate values. The graph makes the chain rule application systematic and
automatic.
Backpropagation is the chain rule applied in reverse topological order, reusing each
computed gradient \(\partial\mathcal{L}/\partial z\) for all upstream parameters that depend on \(z\). This
sharing is the source of the \(O(n)\) efficiency.
Reverse mode AD propagates adjoint variables \(\bar{v} = \partial\mathcal{L}/\partial v\)
backward through the graph. Initialization: \(\bar{\mathcal{L}} = 1\). Each node computes the VJP for its
operation and accumulates the result into its input adjoints.
VJPs are the primitive: each operation need only define its VJP (how to propagate a
cotangent backward), not its full Jacobian. The Jacobian is never formed — only its product with a vector.
Cost: same order as the forward operation.
Matrix calculus gives us the VJP rules for common operations: for \(\mathbf{y} =
\mathbf{Wx}\), the VJP w.r.t. \(\mathbf{W}\) is \(\bar{\mathbf{y}}\mathbf{x}^T\) (outer product), and w.r.t.
\(\mathbf{x}\) is \(\mathbf{W}^T\bar{\mathbf{y}}\) (transpose times cotangent).
The four BP equations follow directly from applying VJPs to the specific structure of a
layered neural network. BP1 gives the output error; BP2 propagates it backward; BP3 and BP4 give weight and
bias gradients.
Softmax + CCE gives the cleanest gradient: \(\partial\mathcal{L}/\partial\mathbf{z}^{[L]} =
\hat{\mathbf{y}} - \mathbf{y}\). This follows from the exponential family structure of the categorical
distribution and is not a coincidence.
Memory is the bottleneck for large models. Activation checkpointing reduces cached memory
from \(O(L)\) to \(O(\sqrt{L})\) by recomputing activations on demand. Gradient accumulation and AMP are the
other two key techniques for fitting large models into GPU memory.
Modern autograd engines (PyTorch, JAX, TensorFlow) implement tape-based reverse mode AD via
operator overloading. Every operation records its VJP. The backward pass replays the tape in reverse,
computing all parameter gradients automatically.
The One Algorithm That Trains Everything
Backpropagation is not a clever trick or an approximation — it is the exact, optimal algorithm for
computing gradients of a scalar loss with respect to all parameters of a computation. It is a direct
consequence of the chain rule of calculus, applied systematically using the structure of the computational
graph. Every neural network ever trained — from the 1986 XOR network to GPT-4 with 1.8 trillion parameters —
was trained by exactly this algorithm. The equations are the same; only the graph grew larger. Understanding
backpropagation is not merely understanding a technique — it is understanding the mathematical mechanism by
which machines learn.