Deep Learning / Optimization / Mathematical Foundations
Foundation of all Deep Learning

the algorithm that trains every neural network Backpropagation

From computational graphs and the chain rule to reverse-mode AD, Jacobians, VJPs, matrix calculus, PyTorch autograd, and every modern trick. The most complete treatment of backpropagation ever written.

Computational Graphs Chain Rule Reverse Mode AD Jacobians & VJPs Matrix Calculus Full Network Derivation Softmax Gradients Cross-Entropy Batch Gradients Autograd PyTorch / TensorFlow Memory Efficiency
1986
Rumelhart et al.
O(n)
vs O(n²) naive
BP4
fundamental eqs
VJP
key primitive
∂L/∂W
the goal
01
The Problem

Why Backpropagation? The Gradient Problem

// the core question · naive vs efficient · what training actually needs

Training a neural network requires computing the gradient of the loss function with respect to every learnable parameter. For a modern network with 175 billion parameters (GPT-3), this means computing 175,000,000,000 partial derivatives — on every single forward pass, thousands of times per second. The question is not whether we can compute gradients, but how to do it with exactly one backward pass instead of 175 billion forward passes.

Three Approaches to Computing Gradients

Numerical Differentiation

Finite difference approximation: for each parameter \(\theta_j\), run two forward passes:

\[\frac{\partial \mathcal{L}}{\partial \theta_j} \approx \frac{\mathcal{L}(\theta_j+\epsilon) - \mathcal{L}(\theta_j-\epsilon)}{2\epsilon}\]

Cost: \(O(n)\) forward passes for \(n\) parameters. GPT-3: 350 billion evaluations per gradient. Numerically unstable (truncation vs roundoff error tradeoff). Only used for gradient checking, not production training.

Symbolic Differentiation

Apply calculus rules symbolically (like Wolfram Alpha). Result: an exact closed-form expression for the gradient.

Problem: Expression swell — the symbolic gradient can be exponentially larger than the original expression. Doesn't handle loops or dynamic computation graphs. Only practical for small, fixed-structure expressions.

The Backpropagation Solution

Backpropagation computes the exact gradient using one forward pass (to cache intermediate values) and one backward pass (to propagate gradients). Total cost: \(O(n)\) operations for \(n\) parameters — the same order as a single forward pass. This is the efficiency that makes deep learning computationally tractable.

The Core Insight
The Goal of BackpropagationObjective
\[\nabla_\theta\mathcal{L} = \left[\frac{\partial\mathcal{L}}{\partial\theta_1}, \frac{\partial\mathcal{L}}{\partial\theta_2}, \ldots, \frac{\partial\mathcal{L}}{\partial\theta_n}\right]^T\] \[\text{Backprop achieves this in: } T_{\text{fwd}} + T_{\text{bwd}} \approx 2 \cdot T_{\text{fwd}} \quad \text{(vs } n \cdot T_{\text{fwd}} \text{ for finite differences)}\]
The backward pass costs roughly the same as the forward pass — not n times more. This is the fundamental theorem underlying all deep learning training. Without it, modern neural networks with millions or billions of parameters could not be trained.
02
Graphs

Computational Graphs

// DAG structure · nodes · edges · caching · topological order

A computational graph is a Directed Acyclic Graph (DAG) where each node represents an operation and each directed edge represents the flow of data (a tensor). The graph makes the structure of the computation explicit — enabling automatic gradient computation through systematic traversal.

Graph Structure for a Neural Layer
x W b inputs @ matmul + add bias z cached σ(z) activation a output ← backward pass (gradients)
Fig 1. Computational graph for one linear layer: x and W enter a matrix multiply (@), bias b is added (+), then activation σ produces output a. Backward pass (red dashed) traverses this graph in reverse, computing gradients via the chain rule at each node.
Forward Pass: Topological Evaluation

The forward pass evaluates all nodes in topological order — each node is evaluated only after all its inputs are ready. For a neural network with \(L\) layers, this is simply the sequential layer-by-layer computation. Every intermediate value must be cached because the backward pass needs them.

What Gets Cached During Forward PassMemory
\[\text{For each layer }\ell:\quad \text{Cache}\left\{\mathbf{Z}^{[\ell]},\, \mathbf{A}^{[\ell]}\right\}\] \[\text{Total cache size: } O\!\left(m \cdot \sum_{\ell=1}^L n^{[\ell]}\right) = O(m \cdot L \cdot d)\]
m = batch size, n[ℓ] = neurons in layer ℓ, L = depth. For GPT-3: 175B parameters, 96 layers, batch size 1024 — the activation cache alone requires ~350GB of GPU memory during training. This is why activation checkpointing exists (§12).
Static vs Dynamic Graphs

Static Graph (TensorFlow v1)

Graph is defined once, then compiled and executed repeatedly. More efficient: the compiler can optimize the graph, fuse operations, and preallocate memory. Less flexible: can't use Python control flow (if/while) inside the graph.

Dynamic Graph (PyTorch)

Graph is built on-the-fly during each forward pass. Python control flow works naturally — the graph changes with each forward pass. More flexible and debuggable. Slightly less efficient but the default in modern research.

03
Mathematics

The Chain Rule — The Foundation

// univariate · multivariate · vector chain rule · systematic application

Backpropagation is the chain rule of calculus applied systematically to a computational graph. Nothing more. The beauty of backpropagation is not that it invents new mathematics — it's that it applies the chain rule efficiently using the graph structure.

Univariate Chain Rule
Single-Variable Chain RuleCalculus I
\[\text{If } f = g(h(x)):\quad \frac{df}{dx} = \frac{dg}{dh} \cdot \frac{dh}{dx}\] \[\text{For a chain } x \to u \to v \to w \to \mathcal{L}:\quad \frac{d\mathcal{L}}{dx} = \frac{d\mathcal{L}}{dw}\cdot\frac{dw}{dv}\cdot\frac{dv}{du}\cdot\frac{du}{dx}\]
The chain rule: to compute the derivative of the final output with respect to any intermediate variable, multiply all the local derivatives along the path from that variable to the output. In a neural network, the "path" goes from each parameter through all subsequent layers to the loss.
Multivariate Chain Rule

In neural networks, intermediate values are tensors, and one upstream variable may depend on multiple intermediate nodes. The multivariate chain rule sums over all paths:

Multivariate Chain RuleGeneral Form
\[\frac{\partial \mathcal{L}}{\partial x} = \sum_j \frac{\partial \mathcal{L}}{\partial y_j}\cdot\frac{\partial y_j}{\partial x}\]

where the sum is over all immediate downstream nodes y_j that depend on x.

This sum-over-paths is exactly what "gradient accumulation" means in practice — when a parameter is used in multiple places (e.g., tied weights), its gradient is the sum of contributions from all paths to the loss. PyTorch's .grad accumulates (+=) rather than overwrites, implementing this formula.
Why Backprop Is Efficient
The Key Efficiency

Naive application of the chain rule would recompute all intermediate values for each parameter. Backpropagation avoids this by computing the chain rule in reverse topological order, reusing each intermediate gradient \(\partial\mathcal{L}/\partial z\) for all upstream parameters that depend on \(z\). Each gradient is computed exactly once and shared across all paths. This is why backprop is \(O(n)\) rather than \(O(n^2)\).

04
AD Theory

Reverse Mode Automatic Differentiation

// forward mode vs reverse mode · the adjoint · cotangent · accumulation order

Automatic differentiation (AD) is the general framework; backpropagation is its instantiation for neural networks. AD has two modes — and the choice between them determines computational complexity.

Forward Mode vs Reverse Mode

Forward Mode AD

Computes \(\frac{\partial \mathbf{y}}{\partial x_i}\) for one specific input \(x_i\). Propagates a tangent vector \(\dot{v}\) forward alongside the primal values.

Cost: One forward pass per input dimension. For \(n\) inputs: \(O(n)\) forward passes.

Best when: \(n_{\text{inputs}} \ll n_{\text{outputs}}\). Rare in ML — usually \(n_{\text{params}} \gg 1\).

Used in: Jacobian-vector products (JVPs), second-order optimization, physics simulations.

Reverse Mode AD (Backprop)

Computes \(\frac{\partial \mathcal{L}}{\partial x_i}\) for all inputs simultaneously. Propagates an adjoint vector \(\bar{v}\) backward through the graph.

Cost: One forward + one backward pass. For \(n\) inputs: still \(O(1)\) passes.

Best when: \(n_{\text{outputs}} \ll n_{\text{inputs}}\). ML: one scalar loss, millions of parameters.

Used in: All neural network training. The reason deep learning is computationally feasible.

The Adjoint Variable

In reverse mode AD, we define the adjoint (or cotangent) of each node \(v\) as the derivative of the loss with respect to that node's value:

Adjoint (Cotangent) DefinitionReverse Mode
\[\bar{v} \triangleq \frac{\partial \mathcal{L}}{\partial v}\] \[\text{Initialization: } \bar{\mathcal{L}} = 1 \quad \text{(the loss's adjoint w.r.t. itself)}\] \[\text{Propagation: } \bar{v} = \sum_{j:\, v \in \text{inputs}(j)} \bar{w}_j \cdot \frac{\partial w_j}{\partial v}\]
For each upstream node w_j that uses v as an input: the contribution to v̄ is the adjoint of w_j times the local Jacobian ∂w_j/∂v. Summing over all downstream nodes gives the total adjoint. This is the multivariate chain rule in adjoint form. The final adjoints ∂L/∂θ for all parameters θ are the gradients we use for SGD.
The Reverse Accumulation Algorithm
Reverse Mode AD — Core Algorithm
# Forward pass: evaluate all nodes, cache values
values = topological_sort(graph)
for node in values:
  node.value = evaluate(node.op, node.inputs)

# Initialize adjoints
adjoints = {node: 0.0 for node in graph}
adjoints[loss_node] = 1.0  # dL/dL = 1

# Backward pass: propagate adjoints in reverse topo order
for node in reversed(values):
  # node has adjoint bar_v = dL/d(node)
  for i, inp in enumerate(node.inputs):
    # local Jacobian: d(node)/d(inp) evaluated at cached values
    local_grad = jacobian(node.op, wrt=i, at=node.cached_inputs)
    # Adjoint accumulation: bar_inp += bar_v * local_grad
    adjoints[inp] += adjoints[node] * local_grad

# Result: adjoints[param] = dL/d(param) for all parameters
05
Linear Algebra

Jacobians

// the derivative of a vector function · shape · why it's a matrix · computational cost

Neural network layers map vectors to vectors. The derivative of a vector-valued function with respect to a vector input is a matrix called the Jacobian. Understanding Jacobians is the key to understanding why backprop works for vector and tensor operations.

The Jacobian MatrixDefinition
\[\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m, \quad \mathbf{y} = \mathbf{f}(\mathbf{x})\] \[J_\mathbf{f}(\mathbf{x}) = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{pmatrix}\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}\end{pmatrix} \in \mathbb{R}^{m \times n}\]
The (i,j) entry is ∂yᵢ/∂xⱼ — how much the i-th output changes when the j-th input changes. The Jacobian fully describes the local linear approximation of f at x: f(x+δ) ≈ f(x) + J_f(x)·δ. For a scalar function (m=1), the Jacobian reduces to the gradient row vector ∇f^T.
Chain Rule for Vector Functions
Vector Chain Rule via JacobiansMatrix Multiplication
\[\text{If } \mathbf{z} = g(\mathbf{f}(\mathbf{x})):\quad J_{\mathbf{z}}(\mathbf{x}) = J_g(\mathbf{f}(\mathbf{x})) \cdot J_\mathbf{f}(\mathbf{x})\] \[\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}} J_{\mathbf{z}}(\mathbf{x}) = \frac{\partial \mathcal{L}}{\partial \mathbf{z}} J_g J_\mathbf{f}\]
For the special case where L is a scalar (which is always true in neural network training): ∂L/∂x is a row vector of shape (1×n), J is (m×n), and ∂L/∂z is a row vector of shape (1×m). The product (1×m)·(m×n) = (1×n). This is the vector-Jacobian product (VJP).
Jacobian Examples for Common Operations
Operation Forward Jacobian Shape
Linear \(\mathbf{y} = \mathbf{Wx}\) \(J = \mathbf{W}\) \(m\times n\)
ReLU \(\mathbf{y} = \max(0,\mathbf{x})\) \(J = \text{diag}(\mathbf{1}[\mathbf{x}>0])\) \(n\times n\) diagonal
Softmax \(\mathbf{y} = \text{softmax}(\mathbf{x})\) \(J = \text{diag}(\mathbf{y}) - \mathbf{y}\mathbf{y}^T\) \(n\times n\)
Sigmoid \(y_i = \sigma(x_i)\) \(J = \text{diag}(\sigma'(x_i))\) \(n\times n\) diagonal
Elementwise \(f\) \(y_i = f(x_i)\) \(J = \text{diag}(f'(\mathbf{x}))\) \(n\times n\) diagonal
Sum \(y = \sum_i x_i\) \(J = \mathbf{1}^T\) \(1\times n\)
L2 norm squared \(y = \|\mathbf{x}\|^2\) \(J = 2\mathbf{x}^T\) \(1\times n\)
The Jacobian Is Usually Never Materialized

For a layer with \(n\) inputs and \(m\) outputs, the Jacobian has \(mn\) entries. For a 4096-dimensional layer: 16 million entries per Jacobian, per layer. Computing and storing full Jacobians is infeasible. Instead, we only ever need the Jacobian-vector product (JVP) for forward mode or the vector-Jacobian product (VJP) for reverse mode — both computable in \(O(n)\) without forming the full Jacobian.

06
Core Primitive

Vector-Jacobian Products (VJPs)

// the fundamental operation · why we don't need the Jacobian · examples · efficiency

The vector-Jacobian product is the fundamental primitive of reverse-mode AD. Every backpropagation rule reduces to computing a VJP for that specific operation. Defining how to compute the VJP for each operation is equivalent to defining backpropagation for that operation.

VJP — DefinitionCore Primitive
\[\text{VJP}(f, \bar{\mathbf{y}}) = \bar{\mathbf{y}}^T J_f = \frac{\partial \mathcal{L}}{\partial \mathbf{y}}^T \cdot \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{x}}^T\]
ȳ = ∂L/∂y is the "upstream gradient" — the adjoint from all downstream nodes. VJP(f, ȳ) computes ∂L/∂x without ever forming J_f. Cost: same order as one forward pass through f. This is why backpropagation is efficient: each backward operation is as cheap as the corresponding forward operation.
VJPs for Key Operations
// VJP derivations for the most common neural network operations
MatMul
\[\mathbf{y} = \mathbf{W}\mathbf{x}, \quad J = \mathbf{W}\] \[\text{VJP w.r.t. } \mathbf{x}: \quad \bar{\mathbf{x}} = \bar{\mathbf{y}}^T\mathbf{W} = \mathbf{W}^T\bar{\mathbf{y}}\] \[\text{VJP w.r.t. } \mathbf{W}: \quad \bar{\mathbf{W}} = \bar{\mathbf{y}}\mathbf{x}^T\]
For Wx: the VJP w.r.t. x is W^T ȳ (transpose the weight matrix). The VJP w.r.t. W is the outer product ȳ x^T. These are the standard backprop rules for linear layers — now derived from first principles via VJPs.
Elementwise
\[\mathbf{y} = f(\mathbf{x}), \quad J = \text{diag}(f'(\mathbf{x}))\] \[\text{VJP}: \quad \bar{\mathbf{x}} = \bar{\mathbf{y}}^T \text{diag}(f'(\mathbf{x})) = \bar{\mathbf{y}} \odot f'(\mathbf{x})\]
For elementwise functions: the Jacobian is diagonal, so the VJP is just the elementwise product ⊙ (Hadamard product) of the upstream gradient and the elementwise derivative. ReLU: x̄ = ȳ ⊙ 1[x>0]. Sigmoid: x̄ = ȳ ⊙ σ(x)(1−σ(x)).
Sum
\[y = \sum_i x_i, \quad J = \mathbf{1}^T\] \[\text{VJP}: \quad \bar{\mathbf{x}} = \bar{y} \cdot \mathbf{1}\]
The gradient of a sum with respect to each input is 1 — the upstream scalar is broadcast to all inputs. This is why in batch normalization: the gradient through the sum broadcasts to all elements.
Log
\[y = \log x, \quad \frac{dy}{dx} = 1/x\] \[\text{VJP}: \quad \bar{x} = \bar{y}/x\]
The cross-entropy loss involves log — so during backward, we divide by the predicted probability. This is why near-zero predictions cause numerical instability: 1/ε is very large. log(ε + 1e-8) clipping or LogSumExp are used to stabilize this.
JVP vs VJP — When to Use Which
JVP vs VJP ComparisonMode Selection
\[\text{JVP}(f, \dot{\mathbf{x}}) = J_f \dot{\mathbf{x}} \quad \text{(tangent vector } \dot{\mathbf{x}}\text{, forward mode)}\] \[\text{VJP}(f, \bar{\mathbf{y}}) = J_f^T \bar{\mathbf{y}} \quad \text{(cotangent vector } \bar{\mathbf{y}}\text{, reverse mode)}\] \[\text{Reverse mode: } O(1) \text{ VJP calls to get } \nabla_{\mathbf{x}}\mathcal{L} \text{ for scalar }\mathcal{L}\] \[\text{Forward mode: } O(n) \text{ JVP calls to get all of } J_f\]
Conclusion: for a scalar loss and many parameters (ML training), use reverse mode. For vector outputs and few inputs (Jacobian estimation, sensitivity analysis), use forward mode. JAX supports both natively via jax.vjp() and jax.jvp().
07
Calculus

Matrix Calculus

// layout conventions · scalar-by-matrix · trace trick · key identities

Neural networks involve matrix operations. Matrix calculus provides the rules for differentiating functions that take matrices as inputs or produce matrices as outputs. Two layout conventions exist — the choice affects the shape of the result and leads to constant confusion in the literature.

Layout Conventions

Numerator Layout

The gradient \(\partial y/\partial \mathbf{x}\) has the same shape as \(y\) "in the numerator." For scalar \(y\) and vector \(\mathbf{x} \in \mathbb{R}^n\): the gradient is a row vector \(\in \mathbb{R}^{1\times n}\).

For vector \(\mathbf{y} \in \mathbb{R}^m\) and vector \(\mathbf{x} \in \mathbb{R}^n\): result is \(\mathbb{R}^{m\times n}\) Jacobian.

Used by: Bishop's PRML, most numerical analysis texts.

Denominator Layout

The gradient \(\partial y/\partial \mathbf{x}\) has the same shape as the "denominator" \(\mathbf{x}\). For scalar \(y\) and vector \(\mathbf{x} \in \mathbb{R}^n\): the gradient is a column vector \(\in \mathbb{R}^{n\times 1}\).

The ML convention: \(\nabla_\mathbf{x}\mathcal{L}\) is a column vector matching \(\mathbf{x}\).

Used by: Most ML textbooks, Goodfellow's DL book, standard practice.

Key Matrix Calculus Identities
Essential Identities for Neural Network BackpropReference
\[\frac{\partial}{\partial \mathbf{x}}(\mathbf{a}^T\mathbf{x}) = \mathbf{a}\] \[\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^T\mathbf{A}\mathbf{x}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{x} \quad\Rightarrow\quad 2\mathbf{A}\mathbf{x} \text{ if }\mathbf{A}=\mathbf{A}^T\] \[\frac{\partial}{\partial \mathbf{X}}\text{tr}(\mathbf{A}\mathbf{X}) = \mathbf{A}^T\] \[\frac{\partial}{\partial \mathbf{X}}\text{tr}(\mathbf{X}^T\mathbf{A}) = \mathbf{A}\] \[\frac{\partial}{\partial \mathbf{X}}\text{tr}(\mathbf{A}\mathbf{X}\mathbf{B}) = \mathbf{A}^T\mathbf{B}^T\] \[\frac{\partial}{\partial \mathbf{X}}\log|\mathbf{X}| = \mathbf{X}^{-T}\]
The trace trick: tr(A^T B) = vec(A)^T vec(B) — converts matrix inner products to scalar inner products. Essential for computing ∂L/∂W when L involves matrix operations. Most backprop derivations for linear layers use ∂tr(X^T A)/∂X = A.
Deriving Linear Layer Gradients via Matrix Calculus
// Deriving ∂L/∂W and ∂L/∂x for y = Wx using matrix calculus trace trick
Setup
\[\mathbf{y} = \mathbf{W}\mathbf{x},\quad \mathcal{L} = f(\mathbf{y})\] Use the differential approach: \(d\mathcal{L} = \text{tr}\!\left(\left(\frac{\partial\mathcal{L}}{\partial\mathbf{y}}\right)^T d\mathbf{y}\right) = \bar{\mathbf{y}}^T\, d\mathbf{y}\)
∂L/∂W
\[d\mathbf{y} = (d\mathbf{W})\mathbf{x} \Rightarrow d\mathcal{L} = \bar{\mathbf{y}}^T(d\mathbf{W})\mathbf{x} = \text{tr}(\bar{\mathbf{y}}^T(d\mathbf{W})\mathbf{x}) = \text{tr}(\mathbf{x}\bar{\mathbf{y}}^T\, d\mathbf{W})\] \[\therefore\quad \frac{\partial\mathcal{L}}{\partial\mathbf{W}} = (\mathbf{x}\bar{\mathbf{y}}^T)^T = \bar{\mathbf{y}}\mathbf{x}^T\]
Using tr(A d B) → ∂/∂B = A^T. The outer product ȳ x^T matches the shape of W exactly.
∂L/∂x
\[d\mathbf{y} = \mathbf{W}(d\mathbf{x}) \Rightarrow d\mathcal{L} = \bar{\mathbf{y}}^T\mathbf{W}(d\mathbf{x}) = (\mathbf{W}^T\bar{\mathbf{y}})^T\, d\mathbf{x}\] \[\therefore\quad \frac{\partial\mathcal{L}}{\partial\mathbf{x}} = \mathbf{W}^T\bar{\mathbf{y}}\]
The key rule: for y = Wx, backprop through x is W^T ȳ (transpose the weight). This is the most important single formula in all of deep learning. ∎
08
Full Derivation

Full Network Backpropagation Derivation

// notation · BP1 through BP4 · full vectorized proof · every gradient computed

We now derive the four fundamental equations of backpropagation for a general \(L\)-layer neural network with arbitrary activation functions. These four equations are the complete specification of backpropagation — everything else is an application of them.

Notation and Setup
Network NotationSetup
\[\mathbf{a}^{[0]} = \mathbf{x},\quad \mathbf{z}^{[\ell]} = \mathbf{W}^{[\ell]}\mathbf{a}^{[\ell-1]} + \mathbf{b}^{[\ell]},\quad \mathbf{a}^{[\ell]} = \sigma^{[\ell]}(\mathbf{z}^{[\ell]})\] \[\boldsymbol{\delta}^{[\ell]} \triangleq \frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[\ell]}} \quad\text{(error signal / adjoint at layer }\ell\text{)}\]
The Four Equations
// Complete derivation of BP1–BP4 from first principles
BP1
Output layer error: \[\boldsymbol{\delta}^{[L]} = \frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[L]}} = \frac{\partial\mathcal{L}}{\partial\mathbf{a}^{[L]}} \odot \sigma'^{[L]}(\mathbf{z}^{[L]})\] Derivation: Apply chain rule: \(\partial\mathcal{L}/\partial z_k^{[L]} = \sum_j (\partial\mathcal{L}/\partial a_j^{[L]})(\partial a_j^{[L]}/\partial z_k^{[L]})\). Since \(a_j^{[L]} = \sigma(z_j^{[L]})\) and activation is applied elementwise: \(\partial a_j/\partial z_k = \sigma'(z_j)\delta_{jk}\). Therefore \(\partial\mathcal{L}/\partial z_k^{[L]} = (\partial\mathcal{L}/\partial a_k^{[L]})\sigma'(z_k^{[L]})\).
For Softmax + CCE: δ[L] = ŷ − y (the elegant cancellation — see §10). For Sigmoid + BCE: also δ[L] = ŷ − y.
BP2
Error propagation backward: \[\boldsymbol{\delta}^{[\ell]} = \left(\mathbf{W}^{[\ell+1]T}\boldsymbol{\delta}^{[\ell+1]}\right) \odot \sigma'^{[\ell]}(\mathbf{z}^{[\ell]})\] Derivation: \(\partial\mathcal{L}/\partial z_k^{[\ell]} = \sum_j (\partial\mathcal{L}/\partial z_j^{[\ell+1]})(\partial z_j^{[\ell+1]}/\partial z_k^{[\ell]})\). Now \(z_j^{[\ell+1]} = \sum_i W_{ji}^{[\ell+1]}a_i^{[\ell]}\) and \(a_i^{[\ell]} = \sigma(z_i^{[\ell]})\). So: \[\frac{\partial z_j^{[\ell+1]}}{\partial z_k^{[\ell]}} = W_{jk}^{[\ell+1]}\sigma'(z_k^{[\ell]})\] \[\therefore\quad \delta_k^{[\ell]} = \sigma'(z_k^{[\ell]})\sum_j W_{jk}^{[\ell+1]}\delta_j^{[\ell+1]} = \sigma'(z_k^{[\ell]})\cdot[\mathbf{W}^{[\ell+1]T}\boldsymbol{\delta}^{[\ell+1]}]_k\]
The transpose W^[ℓ+1]T is not a coincidence — it's the VJP of the linear operation z = Wa. Multiply by the adjoint from the right: the "backward weight matrix" is the transpose of the forward weight matrix. ∎
BP3
Gradient of the weights: \[\frac{\partial\mathcal{L}}{\partial\mathbf{W}^{[\ell]}} = \boldsymbol{\delta}^{[\ell]}(\mathbf{a}^{[\ell-1]})^T\] Derivation: \(z_j^{[\ell]} = \sum_k W_{jk}^{[\ell]}a_k^{[\ell-1]}\), so \(\partial z_j^{[\ell]}/\partial W_{jk}^{[\ell]} = a_k^{[\ell-1]}\) and \(\partial z_{j'}^{[\ell]}/\partial W_{jk}^{[\ell]} = 0\) for \(j'\neq j\). Therefore: \[\frac{\partial\mathcal{L}}{\partial W_{jk}^{[\ell]}} = \frac{\partial\mathcal{L}}{\partial z_j^{[\ell]}}\cdot a_k^{[\ell-1]} = \delta_j^{[\ell]}\cdot a_k^{[\ell-1]}\]
In matrix form: ∂L/∂W[ℓ] = δ[ℓ](a[ℓ-1])^T — the outer product of the error signal and the previous layer's activation. Shape: n[ℓ] × n[ℓ-1], matching W[ℓ]. ∎
BP4
Gradient of the biases: \[\frac{\partial\mathcal{L}}{\partial\mathbf{b}^{[\ell]}} = \boldsymbol{\delta}^{[\ell]}\] Derivation: \(z_j^{[\ell]} = \sum_k W_{jk}a_k + b_j\), so \(\partial z_j^{[\ell]}/\partial b_j^{[\ell]} = 1\) and zero otherwise. Therefore \(\partial\mathcal{L}/\partial b_j^{[\ell]} = \delta_j^{[\ell]}\cdot 1 = \delta_j^{[\ell]}\). ∎
The bias gradient is simply the error signal — the bias acts as a shift with derivative 1, so it just passes the error through unchanged.
The Four Fundamental Equations of Backpropagation

\[\mathbf{BP1:}\quad \boldsymbol{\delta}^{[L]} = \nabla_{\mathbf{a}^{[L]}}\mathcal{L} \odot \sigma'^{[L]}(\mathbf{z}^{[L]})\] \[\mathbf{BP2:}\quad \boldsymbol{\delta}^{[\ell]} = (\mathbf{W}^{[\ell+1]T}\boldsymbol{\delta}^{[\ell+1]}) \odot \sigma'^{[\ell]}(\mathbf{z}^{[\ell]})\] \[\mathbf{BP3:}\quad \nabla_{\mathbf{W}^{[\ell]}}\mathcal{L} = \boldsymbol{\delta}^{[\ell]}(\mathbf{a}^{[\ell-1]})^T\] \[\mathbf{BP4:}\quad \nabla_{\mathbf{b}^{[\ell]}}\mathcal{L} = \boldsymbol{\delta}^{[\ell]}\]

09
Output Layer

Softmax Gradients — The Full Jacobian

// the softmax Jacobian · diagonal minus outer product · numerical stability

The softmax function maps a vector to a probability distribution. Its Jacobian is not diagonal (unlike ReLU or sigmoid) — each output depends on every input through the normalization constant. Computing the softmax Jacobian correctly is essential for the output layer.

Softmax Jacobian DerivationOutput Layer
\[\text{softmax}(\mathbf{z})_i = \hat{y}_i = \frac{e^{z_i}}{\sum_k e^{z_k}}\] \[\frac{\partial \hat{y}_i}{\partial z_j} = \begin{cases}\hat{y}_i(1-\hat{y}_i) & i = j \\ -\hat{y}_i\hat{y}_j & i \neq j\end{cases}\] \[J_{\text{softmax}} = \text{diag}(\hat{\mathbf{y}}) - \hat{\mathbf{y}}\hat{\mathbf{y}}^T \in \mathbb{R}^{K\times K}\]
Derivation for i≠j: ∂(eᶻⁱ/Σₖeᶻᵏ)/∂zⱼ = 0/Σ − eᶻⁱeᶻʲ/Σ² = −ŷᵢŷⱼ. For i=j: ∂ŷᵢ/∂zᵢ = (eᶻⁱΣ − eᶻⁱeᶻⁱ)/Σ² = ŷᵢ − ŷᵢ² = ŷᵢ(1−ŷᵢ). In matrix form: J = diag(ŷ) − ŷŷ^T.
The VJP of Softmax

The VJP \(\bar{\mathbf{z}} = J_{\text{softmax}}^T \bar{\mathbf{y}}\) can be computed without materializing the full \(K\times K\) Jacobian:

Efficient Softmax VJPO(K) not O(K²)
\[\bar{\mathbf{z}} = J_{\text{softmax}}^T\bar{\mathbf{y}} = (\text{diag}(\hat{\mathbf{y}}) - \hat{\mathbf{y}}\hat{\mathbf{y}}^T)\bar{\mathbf{y}}\] \[= \hat{\mathbf{y}} \odot \bar{\mathbf{y}} - \hat{\mathbf{y}}(\hat{\mathbf{y}}^T\bar{\mathbf{y}})\] \[= \hat{\mathbf{y}} \odot \left(\bar{\mathbf{y}} - \mathbf{1}\cdot\hat{\mathbf{y}}^T\bar{\mathbf{y}}\right) = \hat{\mathbf{y}} \odot (\bar{\mathbf{y}} - \bar{y}_{\text{sum}})\]
where ȳ_sum = ŷ^T ȳ is a scalar. The VJP reduces to: subtract the scalar dot product ŷ^T ȳ from each component of ȳ, then multiply elementwise by ŷ. Total cost: O(K), not O(K²). This efficient form is what PyTorch implements internally.
10
Elegant Result

Cross-Entropy + Softmax — The Cancellation

// the combined gradient · why it simplifies · MLE connection · numerical stability

When softmax is used with categorical cross-entropy loss, the combined gradient \(\partial\mathcal{L}/\partial\mathbf{z}^{[L]}\) has an extraordinarily clean form. This cancellation is not coincidence — it's a consequence of the exponential family structure of the categorical distribution.

// Deriving the combined softmax + CCE gradient
Setup
\[\mathcal{L}_{\text{CCE}} = -\sum_{k=1}^K y_k\log\hat{y}_k \quad\text{(one-hot label: } y_c=1, y_{k\neq c}=0\text{)}\] \[= -\log\hat{y}_c\]
∂L/∂ŷ
\[\frac{\partial\mathcal{L}}{\partial\hat{y}_k} = -\frac{y_k}{\hat{y}_k}\]
Chain
Apply chain rule through softmax: \[\frac{\partial\mathcal{L}}{\partial z_j} = \sum_k \frac{\partial\mathcal{L}}{\partial\hat{y}_k}\cdot\frac{\partial\hat{y}_k}{\partial z_j} = \sum_k\left(-\frac{y_k}{\hat{y}_k}\right)\frac{\partial\hat{y}_k}{\partial z_j}\]
Expand
Split into \(k=j\) and \(k\neq j\) terms: \[= -\frac{y_j}{\hat{y}_j}\hat{y}_j(1-\hat{y}_j) + \sum_{k\neq j}\frac{y_k}{\hat{y}_k}\hat{y}_k\hat{y}_j\] \[= -y_j(1-\hat{y}_j) + \hat{y}_j\sum_{k\neq j}y_k = -y_j + y_j\hat{y}_j + \hat{y}_j(1-y_j) = \hat{y}_j - y_j\]
Using Σₖ yₖ = 1 (one-hot labels sum to 1) and Σₖ≠ⱼ yₖ = 1 − yⱼ. The remarkable cancellation: the softmax Jacobian exactly "cancels" the division by ŷₖ in the cross-entropy derivative. ∎
Softmax + Cross-Entropy Gradient

\[\boxed{\frac{\partial\mathcal{L}_{\text{CCE}}}{\partial\mathbf{z}^{[L]}} = \hat{\mathbf{y}} - \mathbf{y}}\] The gradient with respect to the pre-softmax logits is simply predicted probability minus true label. This is the most important single formula in deep learning training — the error signal for every neural network classifier.

Numerical Stability

Computing \(\text{softmax}(\mathbf{z})_k = e^{z_k}/\sum_j e^{z_j}\) directly can overflow when \(z_k\) is large. The numerically stable version subtracts the maximum before exponentiation:

Numerically Stable SoftmaxStability
\[\text{softmax}(\mathbf{z})_k = \frac{e^{z_k - z_{\max}}}{\sum_j e^{z_j - z_{\max}}}\]
Subtracting the maximum doesn't change the result (it cancels in numerator and denominator) but prevents overflow. The largest exp becomes exp(0) = 1; all others are smaller. The loss: -log(softmax(z)_c) = -(z_c − z_max) + log(Σ_j exp(z_j − z_max)) — this is the log-sum-exp computation, computed stably in PyTorch as F.cross_entropy(), which combines log_softmax + nll_loss numerically.
11
Mini-Batch

Batch Gradients

// single to batch · vectorized form · 1/m averaging · efficient GPU computation
Vectorized Mini-Batch BackpropagationBatch Form
\[\mathbf{X} \in \mathbb{R}^{n^{[0]}\times m},\quad \mathbf{Z}^{[\ell]} \in \mathbb{R}^{n^{[\ell]}\times m},\quad \boldsymbol{\Delta}^{[\ell]} \in \mathbb{R}^{n^{[\ell]}\times m}\] \[\mathbf{BP1}:\quad \boldsymbol{\Delta}^{[L]} = \hat{\mathbf{Y}} - \mathbf{Y} \in \mathbb{R}^{K\times m}\] \[\mathbf{BP2}:\quad \boldsymbol{\Delta}^{[\ell]} = (\mathbf{W}^{[\ell+1]T}\boldsymbol{\Delta}^{[\ell+1]}) \odot \sigma'^{[\ell]}(\mathbf{Z}^{[\ell]})\] \[\mathbf{BP3}:\quad \frac{\partial\mathcal{L}}{\partial\mathbf{W}^{[\ell]}} = \frac{1}{m}\boldsymbol{\Delta}^{[\ell]}(\mathbf{A}^{[\ell-1]})^T\] \[\mathbf{BP4}:\quad \frac{\partial\mathcal{L}}{\partial\mathbf{b}^{[\ell]}} = \frac{1}{m}\sum_{i=1}^m\boldsymbol{\Delta}^{[\ell]}_{:,i} = \frac{1}{m}\boldsymbol{\Delta}^{[\ell]}\mathbf{1}\]
The batch form processes all m samples simultaneously via matrix operations — maximally parallelizable on GPUs. The 1/m factor averages the gradient over the batch. BP4 requires summing over the batch dimension — a sum/reduction operation — to get the bias gradient, since the bias is shared across all samples.
Why Batch Size Matters for Gradients
  • Large batch: More accurate gradient estimate (lower variance). But the gradient direction barely improves beyond batch size ~1000 for most problems (see large-batch training literature).
  • Small batch: Noisier gradients, but the noise acts as regularization (implicit noise in SGD helps escape sharp minima). Usually reaches better minima than large-batch training at the same number of epochs.
  • Gradient variance: \(\text{Var}(\hat{g}) = \sigma^2_g/m\) — variance decreases as \(1/m\). But the computational cost grows linearly in \(m\). For large \(m\), you pay full compute for diminishing variance reduction.
12
Engineering

Memory-Efficiency Tradeoffs

// activation checkpointing · gradient accumulation · AMP · memory breakdown

In practice, training large neural networks is limited not by compute but by GPU memory. The standard backpropagation algorithm requires storing all intermediate activations, which can exceed GPU memory for large models or sequences.

Memory Breakdown During Training
Component Memory Notes
Parameters (weights + biases) \(O(P)\) Fixed; FP32: 4 bytes/param. GPT-3: 700GB
Gradients \(O(P)\) Same shape as parameters
Optimizer states (Adam) \(O(2P)\) m_t and v_t for each parameter
Activations (forward cache) \(O(m \cdot L \cdot d)\) Grows with batch size × depth × width
Input data \(O(m \cdot n)\) Usually small relative to activations
Activation Checkpointing

Instead of caching all intermediate activations, recompute them on demand during the backward pass. Trade compute for memory:

Activation Checkpointing — Memory/Compute TradeoffAlgorithm
\[\text{Standard backprop: } O(L\cdot d\cdot m) \text{ memory}, \quad O(2\cdot T_{\text{fwd}}) \text{ compute}\] \[\text{Checkpointing: } O(\sqrt{L}\cdot d\cdot m) \text{ memory}, \quad O(3\cdot T_{\text{fwd}}) \text{ compute}\]
With checkpointing: save activations only at every √L-th layer (the "checkpoints"). During backward, recompute from the nearest checkpoint. Memory scales as √L instead of L — for a 100-layer network, memory reduces 10×. PyTorch: torch.utils.checkpoint.checkpoint(). Used in training LLMs with very long sequences.
Gradient Accumulation

To simulate a large effective batch size with limited GPU memory, accumulate gradients over multiple small batches before taking an optimizer step:

Gradient Accumulation — PyTorch
accumulation_steps = 8  # simulate batch_size * 8
optimizer.zero_grad()

for i, (x, y) in enumerate(dataloader):
  loss = model(x, y) / accumulation_steps
  loss.backward()  # gradients accumulate in .grad

  if (i + 1) % accumulation_steps == 0:
    optimizer.step()   # update with accumulated gradients
    optimizer.zero_grad()

# Effectively processes 8x larger batches with 1x memory
Automatic Mixed Precision (AMP)
AMP — Gradient ScalingFP16 Training
\[\text{FP32 params and gradients: 4 bytes/value}\] \[\text{FP16 forward pass: 2 bytes/value (2× faster on Tensor Cores)}\] \[\text{Gradient scaling: } \hat{g} = s \cdot g_{\text{FP16}}, \quad g_{\text{FP32}} = \hat{g}/s\]
FP16 has limited range (max ~65504) — small gradients underflow to 0. Solution: multiply loss by a large scale factor s before backward, then divide the accumulated gradients by s. PyTorch GradScaler handles this automatically. AMP reduces memory ~2× and increases throughput ~2-3× on modern GPUs with no accuracy loss.
13
AD Framework

Automatic Differentiation — The Theory

// three flavors of differentiation · operator overloading · tape-based AD

Automatic differentiation is the umbrella framework. It is distinct from both numerical differentiation (finite differences) and symbolic differentiation (computer algebra). It computes exact derivatives efficiently using the structure of the program.

Three Flavors of Differentiation
Method How Exact? Cost When Used
Numerical (finite diff) \((f(x+ε) - f(x-ε))/2ε\) No — approx \(O(ε^2)\) \(O(n)\) passes Gradient checking only
Symbolic (CAS) Derive closed-form expression Yes Expr swell; static only Toy problems, paper derivations
Forward AD Dual numbers, JVPs Yes \(O(n)\) passes for full J Few inputs, many outputs
Reverse AD (Backprop) Tape + VJPs Yes \(O(1)\) pass for scalar loss All neural network training
Operator Overloading

The most common AD implementation technique: overload mathematical operators (+, *, sin, etc.) on a special number type that tracks both the primal value and the derivative information:

Dual Numbers — Forward Mode AD (Python)
class Dual:
  def __init__(self, real, dual=0.0):
    self.real = real   # f(x) — primal value
    self.dual = dual   # f'(x)·ẋ — tangent

  def __mul__(self, other):
    # Product rule: (f·g)' = f'·g + f·g'
    return Dual(self.real*other.real,
               self.dual*other.real + self.real*other.dual)

  def __add__(self, other):
    return Dual(self.real+other.real, self.dual+other.dual)

# Differentiate f(x) = x*x + x at x=3
x = Dual(3.0, 1.0)  # x=3, ẋ=1 (seed the tangent)
result = x*x + x
print(result.dual)  # f'(3) = 2*3 + 1 = 7 ✓
Tape-Based Reverse Mode AD

PyTorch's autograd and TensorFlow use a tape (Wengert list): a record of every operation performed during the forward pass, including enough information to compute the VJP for each operation during the backward pass:

Tape-Based AD — Minimal Implementation
class Tensor:
  def __init__(self, data, _children=(), _backward_fn=None):
    self.data = data
    self.grad = 0.0                    # accumulates ∂L/∂self
    self._children = _children        # inputs to this node
    self._backward = _backward_fn or (lambda: None)

  def __mul__(self, other):
    out = Tensor(self.data * other.data, (self, other))
    def _bwd():
      # VJP of multiply: ∂(a*b)/∂a = b, ∂(a*b)/∂b = a
      self.grad += out.grad * other.data  # accumulate!
      other.grad += out.grad * self.data
    out._backward = _bwd
    return out

  def backward(self):
    self.grad = 1.0                      # dL/dL = 1
    for node in topological_sort(self): # reverse topo
      node._backward()
14
Frameworks

PyTorch & TensorFlow Autograd

// dynamic graph · GradientTape · retain_graph · custom backward · hooks
PyTorch Autograd — Dynamic Graph
PyTorch Autograd — Everything You Need to Know
import torch

# 1. requires_grad=True enables gradient tracking
x = torch.randn(3, 4, requires_grad=True)
W = torch.randn(4, 5, requires_grad=True)

# 2. Forward pass builds the computational graph automatically
z = x @ W            # graph: z.grad_fn = MmBackward
loss = z.sum()

# 3. Backward pass computes all gradients
loss.backward()      # default: free the graph after backward
print(W.grad)        # dL/dW — shape (4,5)

# 4. retain_graph=True: keep graph for multiple backward passes
loss.backward(retain_graph=True)  # needed for higher-order grads

# 5. Gradient accumulation: .grad ADDS (doesn't replace)
optimizer.zero_grad()  # must zero before next backward!

# 6. torch.no_grad(): disable tracking for inference
with torch.no_grad():
  y_pred = model(x_test)  # 2x faster, no memory overhead

# 7. detach(): remove tensor from graph
target = output.detach()  # treat as constant, stop grad flow
TensorFlow GradientTape
TensorFlow 2.x GradientTape
import tensorflow as tf

# 1. GradientTape records forward pass operations
x = tf.Variable(tf.random.normal([3, 4]))
W = tf.Variable(tf.random.normal([4, 5]))

with tf.GradientTape() as tape:
  z = tf.matmul(x, W)      # recorded on tape
  loss = tf.reduce_sum(z)  # recorded on tape

# 2. Compute gradients by replaying the tape backward
grads = tape.gradient(loss, [x, W])  # [dL/dx, dL/dW]

# 3. persistent=True: call gradient() multiple times
with tf.GradientTape(persistent=True) as tape:
  loss = compute_loss()
dL_dx = tape.gradient(loss, x)
dL_dW = tape.gradient(loss, W) # tape still alive
del tape # must delete manually
Custom Backward Functions in PyTorch
torch.autograd.Function — Custom Backward
class StraightThroughEstimator(torch.autograd.Function):
  """Example: gradient through a non-differentiable op"""

  @staticmethod
  def forward(ctx, x):
    ctx.save_for_backward(x)  # cache what backward needs
    return (x > 0).float()    # non-differentiable step

  @staticmethod
  def backward(ctx, grad_output):
    x, = ctx.saved_tensors
    # Straight-through: pass gradient as-is (pretend forward = identity)
    return grad_output * (x.abs() <= 1).float()

# Use it like a regular function:
y = StraightThroughEstimator.apply(x)
15
Advanced

Modern Backprop Tricks

// gradient clipping · second-order · higher-order grads · grad checkpointing in depth
Gradient Clipping

Exploding gradients (common in RNNs and deep networks) are handled by clipping the gradient norm:

Global Norm ClippingStability
\[\mathbf{g} \leftarrow \begin{cases}\mathbf{g} & \|\mathbf{g}\|_2 \leq \text{max\_norm} \\ \text{max\_norm}\cdot\frac{\mathbf{g}}{\|\mathbf{g}\|_2} & \text{otherwise}\end{cases}\]
Global norm clipping scales all gradients uniformly — preserving the direction, only reducing the magnitude. Value clipping (clip each gradient independently to [−c, c]) is different and less principled. PyTorch: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0). Typical max_norm: 0.5–5.0 depending on the model.
Higher-Order Gradients

Since the gradient computation is itself a differentiable computation, you can differentiate through the backward pass to get second (or higher) order derivatives:

Higher-Order Gradients — MAML / Hessian-Vector Products
# Second-order gradient (gradient of gradient)
x = torch.randn(3, requires_grad=True)
y = (x ** 2).sum()              # y = ||x||²

# First derivative
dy_dx = torch.autograd.grad(y, x, create_graph=True)[0]
# dy/dx = 2x — create_graph=True keeps the gradient graph

# Second derivative (Hessian diagonal)
d2y_dx2 = torch.autograd.grad(dy_dx.sum(), x)[0]
# d²y/dx² = 2 ✓

# Hessian-Vector Product (efficient second order)
v = torch.ones_like(x)
hvp = torch.autograd.grad(dy_dx, x, grad_outputs=v)[0]
# Computes H·v in O(n) without materializing H (n×n)
  • MAML (Meta-Learning): Differentiates through multiple gradient steps to find initialization that quickly adapts. Requires second-order gradients (gradient through gradient).
  • Hessian-Vector Products (HVPs): Used in second-order optimization (Newton's method, Hessian-free optimization, K-FAC). Computable in \(O(n)\) — same cost as a gradient — without forming the \(O(n^2)\) Hessian.
  • Implicit differentiation: Differentiate through the solution of an optimization problem. Used in meta-learning, hyperparameter optimization, and neural ODEs.
Gradient Checkpointing — Deep Dive
torch.utils.checkpoint — Activation Recomputation
import torch.utils.checkpoint as cp

class CheckpointedModel(nn.Module):
  def forward(self, x):
    # Without checkpointing: all activations cached (high memory)
    # With checkpointing: recomputes on backward (low memory)
    x = cp.checkpoint(self.layer1, x)  # don't cache layer1 activations
    x = cp.checkpoint(self.layer2, x)  # recompute from input on backward
    x = self.layer3(x)                 # last layer: always cached
    return x

# Memory: O(√L) instead of O(L) — 10x reduction for 100 layers
# Compute: ~33% overhead (recomputes each checkpoint once on bwd)
# Enables 2-3x larger models or longer sequences
Gradient Noise and Perturbation
  • Label smoothing: Replaces one-hot targets with soft targets — the gradient signal is \(\hat{y} - y_{\text{smooth}}\) where \(y_{\text{smooth}} = (1-\epsilon)y + \epsilon/K\). Prevents overconfident predictions.
  • Stochastic depth: Randomly drop entire residual branches during training. The gradient paths change randomly each step, acting as an ensemble of shallower networks.
  • Noise injection: Add Gaussian noise to gradients (Langevin dynamics). Helps escape sharp local minima. Formally equivalent to sampling from the Bayesian posterior.
16
Synthesis

The Complete Mental Model

// everything unified · one diagram · one algorithm · the whole picture
BACKPROPAGATION — COMPLETE MENTAL MAP THE GOAL: ∂L/∂θ for all parameters θ one forward + one backward = O(n) total FORWARD PASS compute & cache z[ℓ], a[ℓ] COMP. GRAPH DAG · topological order CHAIN RULE ∂L/∂x = Σ (∂L/∂y)·(∂y/∂x) THE FOUR EQUATIONS — BP1 through BP4 BP1: δ[L]=∇ₐL⊙σ' BP2: δ[ℓ]=W[ℓ+1]ᵀδ[ℓ+1]⊙σ' BP3: ∂L/∂W=δa^T BP4: ∂L/∂b=δ VJPs / JACOBIANS local grad primitives AUTOGRAD ENGINE tape · operator overload MEMORY TRICKS checkpointing · AMP · accum GRADIENT UPDATE ∂L/∂W[L] = δ[L] · a[L-1]^T ∂L/∂W[L-1] = δ[L-1] · a[L-2]^T ∂L/∂W[1] = δ[1] · x^T θ ← θ − α · ∂L/∂θ (optimizer step) All computed in one backward pass · O(n) total · GPU parallel · AMP optional
Fig 2. Complete backpropagation mental map — from the computational goal through the four fundamental equations to the gradient update that trains every neural network.

The complete story in one coherent thread:

  1. The problem: We need \(\partial\mathcal{L}/\partial\theta_j\) for every parameter \(\theta_j\). Finite differences require \(O(n)\) forward passes. Backpropagation computes all gradients simultaneously in one forward + one backward pass — \(O(1)\) passes regardless of \(n\).
  2. Computational graphs represent computation as a DAG. The forward pass evaluates nodes in topological order, caching all intermediate values. The graph makes the chain rule application systematic and automatic.
  3. Backpropagation is the chain rule applied in reverse topological order, reusing each computed gradient \(\partial\mathcal{L}/\partial z\) for all upstream parameters that depend on \(z\). This sharing is the source of the \(O(n)\) efficiency.
  4. Reverse mode AD propagates adjoint variables \(\bar{v} = \partial\mathcal{L}/\partial v\) backward through the graph. Initialization: \(\bar{\mathcal{L}} = 1\). Each node computes the VJP for its operation and accumulates the result into its input adjoints.
  5. VJPs are the primitive: each operation need only define its VJP (how to propagate a cotangent backward), not its full Jacobian. The Jacobian is never formed — only its product with a vector. Cost: same order as the forward operation.
  6. Matrix calculus gives us the VJP rules for common operations: for \(\mathbf{y} = \mathbf{Wx}\), the VJP w.r.t. \(\mathbf{W}\) is \(\bar{\mathbf{y}}\mathbf{x}^T\) (outer product), and w.r.t. \(\mathbf{x}\) is \(\mathbf{W}^T\bar{\mathbf{y}}\) (transpose times cotangent).
  7. The four BP equations follow directly from applying VJPs to the specific structure of a layered neural network. BP1 gives the output error; BP2 propagates it backward; BP3 and BP4 give weight and bias gradients.
  8. Softmax + CCE gives the cleanest gradient: \(\partial\mathcal{L}/\partial\mathbf{z}^{[L]} = \hat{\mathbf{y}} - \mathbf{y}\). This follows from the exponential family structure of the categorical distribution and is not a coincidence.
  9. Memory is the bottleneck for large models. Activation checkpointing reduces cached memory from \(O(L)\) to \(O(\sqrt{L})\) by recomputing activations on demand. Gradient accumulation and AMP are the other two key techniques for fitting large models into GPU memory.
  10. Modern autograd engines (PyTorch, JAX, TensorFlow) implement tape-based reverse mode AD via operator overloading. Every operation records its VJP. The backward pass replays the tape in reverse, computing all parameter gradients automatically.
The One Algorithm That Trains Everything

Backpropagation is not a clever trick or an approximation — it is the exact, optimal algorithm for computing gradients of a scalar loss with respect to all parameters of a computation. It is a direct consequence of the chain rule of calculus, applied systematically using the structure of the computational graph. Every neural network ever trained — from the 1986 XOR network to GPT-4 with 1.8 trillion parameters — was trained by exactly this algorithm. The equations are the same; only the graph grew larger. Understanding backpropagation is not merely understanding a technique — it is understanding the mathematical mechanism by which machines learn.