Backpropagation | Shahid Ul Islam

01

The Problem

Why Backpropagation? The Gradient Problem

// the core question · naive vs efficient · what training actually needs

Training a neural network requires computing the gradient of the loss function with respect to every learnable parameter. For a modern network with 175 billion parameters (GPT-3), this means computing 175,000,000,000 partial derivatives — on every single forward pass, thousands of times per second. The question is not whether we can compute gradients, but how to do it with exactly one backward pass instead of 175 billion forward passes.

Three Approaches to Computing Gradients

Numerical Differentiation

Finite difference approximation: for each parameter \(\theta_j\), run two forward passes:

\[\frac{\partial \mathcal{L}}{\partial \theta_j} \approx \frac{\mathcal{L}(\theta_j+\epsilon) - \mathcal{L}(\theta_j-\epsilon)}{2\epsilon}\]

Cost: \(O(n)\) forward passes for \(n\) parameters. GPT-3: 350 billion evaluations per gradient. Numerically unstable (truncation vs roundoff error tradeoff). Only used for gradient checking, not production training.

Symbolic Differentiation

Apply calculus rules symbolically (like Wolfram Alpha). Result: an exact closed-form expression for the gradient.

Problem: Expression swell — the symbolic gradient can be exponentially larger than the original expression. Doesn't handle loops or dynamic computation graphs. Only practical for small, fixed-structure expressions.

The Backpropagation Solution

Backpropagation computes the exact gradient using one forward pass (to cache intermediate values) and one backward pass (to propagate gradients). Total cost: \(O(n)\) operations for \(n\) parameters — the same order as a single forward pass. This is the efficiency that makes deep learning computationally tractable.

The Core Insight

The Goal of BackpropagationObjective

\[\nabla_\theta\mathcal{L} = \left[\frac{\partial\mathcal{L}}{\partial\theta_1}, \frac{\partial\mathcal{L}}{\partial\theta_2}, \ldots, \frac{\partial\mathcal{L}}{\partial\theta_n}\right]^T\] \[\text{Backprop achieves this in: } T_{\text{fwd}} + T_{\text{bwd}} \approx 2 \cdot T_{\text{fwd}} \quad \text{(vs } n \cdot T_{\text{fwd}} \text{ for finite differences)}\]

The backward pass costs roughly the same as the forward pass — not n times more. This is the fundamental theorem underlying all deep learning training. Without it, modern neural networks with millions or billions of parameters could not be trained.

02

Graphs

Computational Graphs

// DAG structure · nodes · edges · caching · topological order

A computational graph is a Directed Acyclic Graph (DAG) where each node represents an operation and each directed edge represents the flow of data (a tensor). The graph makes the structure of the computation explicit — enabling automatic gradient computation through systematic traversal.

Graph Structure for a Neural Layer

Fig 1. Computational graph for one linear layer: x and W enter a matrix multiply (@), bias b is added (+), then activation σ produces output a. Backward pass (red dashed) traverses this graph in reverse, computing gradients via the chain rule at each node.

Forward Pass: Topological Evaluation

The forward pass evaluates all nodes in topological order — each node is evaluated only after all its inputs are ready. For a neural network with \(L\) layers, this is simply the sequential layer-by-layer computation. Every intermediate value must be cached because the backward pass needs them.

What Gets Cached During Forward PassMemory

\[\text{For each layer }\ell:\quad \text{Cache}\left\{\mathbf{Z}^{[\ell]},\, \mathbf{A}^{[\ell]}\right\}\] \[\text{Total cache size: } O\!\left(m \cdot \sum_{\ell=1}^L n^{[\ell]}\right) = O(m \cdot L \cdot d)\]

m = batch size, n[ℓ] = neurons in layer ℓ, L = depth. For GPT-3: 175B parameters, 96 layers, batch size 1024 — the activation cache alone requires ~350GB of GPU memory during training. This is why activation checkpointing exists (§12).

Static vs Dynamic Graphs

Static Graph (TensorFlow v1)

Graph is defined once, then compiled and executed repeatedly. More efficient: the compiler can optimize the graph, fuse operations, and preallocate memory. Less flexible: can't use Python control flow (if/while) inside the graph.

Dynamic Graph (PyTorch)

Graph is built on-the-fly during each forward pass. Python control flow works naturally — the graph changes with each forward pass. More flexible and debuggable. Slightly less efficient but the default in modern research.

03

Mathematics

The Chain Rule — The Foundation

// univariate · multivariate · vector chain rule · systematic application

Backpropagation is the chain rule of calculus applied systematically to a computational graph. Nothing more. The beauty of backpropagation is not that it invents new mathematics — it's that it applies the chain rule efficiently using the graph structure.

Univariate Chain Rule

Single-Variable Chain RuleCalculus I

\[\text{If } f = g(h(x)):\quad \frac{df}{dx} = \frac{dg}{dh} \cdot \frac{dh}{dx}\] \[\text{For a chain } x \to u \to v \to w \to \mathcal{L}:\quad \frac{d\mathcal{L}}{dx} = \frac{d\mathcal{L}}{dw}\cdot\frac{dw}{dv}\cdot\frac{dv}{du}\cdot\frac{du}{dx}\]

The chain rule: to compute the derivative of the final output with respect to any intermediate variable, multiply all the local derivatives along the path from that variable to the output. In a neural network, the "path" goes from each parameter through all subsequent layers to the loss.

Multivariate Chain Rule

In neural networks, intermediate values are tensors, and one upstream variable may depend on multiple intermediate nodes. The multivariate chain rule sums over all paths:

Multivariate Chain RuleGeneral Form

\[\frac{\partial \mathcal{L}}{\partial x} = \sum_j \frac{\partial \mathcal{L}}{\partial y_j}\cdot\frac{\partial y_j}{\partial x}\]

where the sum is over all immediate downstream nodes y_j that depend on x.

This sum-over-paths is exactly what "gradient accumulation" means in practice — when a parameter is used in multiple places (e.g., tied weights), its gradient is the sum of contributions from all paths to the loss. PyTorch's .grad accumulates (+=) rather than overwrites, implementing this formula.

Why Backprop Is Efficient

The Key Efficiency

Naive application of the chain rule would recompute all intermediate values for each parameter. Backpropagation avoids this by computing the chain rule in reverse topological order, reusing each intermediate gradient \(\partial\mathcal{L}/\partial z\) for all upstream parameters that depend on \(z\). Each gradient is computed exactly once and shared across all paths. This is why backprop is \(O(n)\) rather than \(O(n^2)\).

04

AD Theory

Reverse Mode Automatic Differentiation

// forward mode vs reverse mode · the adjoint · cotangent · accumulation order

Automatic differentiation (AD) is the general framework; backpropagation is its instantiation for neural networks. AD has two modes — and the choice between them determines computational complexity.

Forward Mode vs Reverse Mode

Forward Mode AD

Computes \(\frac{\partial \mathbf{y}}{\partial x_i}\) for one specific input \(x_i\). Propagates a tangent vector \(\dot{v}\) forward alongside the primal values.

Cost: One forward pass per input dimension. For \(n\) inputs: \(O(n)\) forward passes.

Best when: \(n_{\text{inputs}} \ll n_{\text{outputs}}\). Rare in ML — usually \(n_{\text{params}} \gg 1\).

Used in: Jacobian-vector products (JVPs), second-order optimization, physics simulations.

Reverse Mode AD (Backprop)

Computes \(\frac{\partial \mathcal{L}}{\partial x_i}\) for all inputs simultaneously. Propagates an adjoint vector \(\bar{v}\) backward through the graph.

Cost: One forward + one backward pass. For \(n\) inputs: still \(O(1)\) passes.

Best when: \(n_{\text{outputs}} \ll n_{\text{inputs}}\). ML: one scalar loss, millions of parameters.

Used in: All neural network training. The reason deep learning is computationally feasible.

The Adjoint Variable

In reverse mode AD, we define the adjoint (or cotangent) of each node \(v\) as the derivative of the loss with respect to that node's value:

Adjoint (Cotangent) DefinitionReverse Mode

\[\bar{v} \triangleq \frac{\partial \mathcal{L}}{\partial v}\] \[\text{Initialization: } \bar{\mathcal{L}} = 1 \quad \text{(the loss's adjoint w.r.t. itself)}\] \[\text{Propagation: } \bar{v} = \sum_{j:\, v \in \text{inputs}(j)} \bar{w}_j \cdot \frac{\partial w_j}{\partial v}\]

For each upstream node w_j that uses v as an input: the contribution to v̄ is the adjoint of w_j times the local Jacobian ∂w_j/∂v. Summing over all downstream nodes gives the total adjoint. This is the multivariate chain rule in adjoint form. The final adjoints ∂L/∂θ for all parameters θ are the gradients we use for SGD.

The Reverse Accumulation Algorithm

Reverse Mode AD — Core Algorithm

# Forward pass: evaluate all nodes, cache values
values = topological_sort(graph)
for node in values:
  node.value = evaluate(node.op, node.inputs)

# Initialize adjoints
adjoints = {node: 0.0 for node in graph}
adjoints[loss_node] = 1.0 # dL/dL = 1

# Backward pass: propagate adjoints in reverse topo order
for node in reversed(values):
  # node has adjoint bar_v = dL/d(node)
  for i, inp in enumerate(node.inputs):
    # local Jacobian: d(node)/d(inp) evaluated at cached values
    local_grad = jacobian(node.op, wrt=i, at=node.cached_inputs)
    # Adjoint accumulation: bar_inp += bar_v * local_grad
    adjoints[inp] += adjoints[node] * local_grad

# Result: adjoints[param] = dL/d(param) for all parameters

05

Linear Algebra

Jacobians

// the derivative of a vector function · shape · why it's a matrix · computational cost

Neural network layers map vectors to vectors. The derivative of a vector-valued function with respect to a vector input is a matrix called the Jacobian. Understanding Jacobians is the key to understanding why backprop works for vector and tensor operations.

The Jacobian MatrixDefinition

\[\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m, \quad \mathbf{y} = \mathbf{f}(\mathbf{x})\] \[J_\mathbf{f}(\mathbf{x}) = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{pmatrix}\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}\end{pmatrix} \in \mathbb{R}^{m \times n}\]

The (i,j) entry is ∂yᵢ/∂xⱼ — how much the i-th output changes when the j-th input changes. The Jacobian fully describes the local linear approximation of f at x: f(x+δ) ≈ f(x) + J_f(x)·δ. For a scalar function (m=1), the Jacobian reduces to the gradient row vector ∇f^T.

Chain Rule for Vector Functions

Vector Chain Rule via JacobiansMatrix Multiplication

\[\text{If } \mathbf{z} = g(\mathbf{f}(\mathbf{x})):\quad J_{\mathbf{z}}(\mathbf{x}) = J_g(\mathbf{f}(\mathbf{x})) \cdot J_\mathbf{f}(\mathbf{x})\] \[\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}} J_{\mathbf{z}}(\mathbf{x}) = \frac{\partial \mathcal{L}}{\partial \mathbf{z}} J_g J_\mathbf{f}\]

For the special case where L is a scalar (which is always true in neural network training): ∂L/∂x is a row vector of shape (1×n), J is (m×n), and ∂L/∂z is a row vector of shape (1×m). The product (1×m)·(m×n) = (1×n). This is the vector-Jacobian product (VJP).

Jacobian Examples for Common Operations

Operation	Forward	Jacobian	Shape
Linear	\(\mathbf{y} = \mathbf{Wx}\)	\(J = \mathbf{W}\)	\(m\times n\)
ReLU	\(\mathbf{y} = \max(0,\mathbf{x})\)	\(J = \text{diag}(\mathbf{1}[\mathbf{x}>0])\)	\(n\times n\) diagonal
Softmax	\(\mathbf{y} = \text{softmax}(\mathbf{x})\)	\(J = \text{diag}(\mathbf{y}) - \mathbf{y}\mathbf{y}^T\)	\(n\times n\)
Sigmoid	\(y_i = \sigma(x_i)\)	\(J = \text{diag}(\sigma'(x_i))\)	\(n\times n\) diagonal
Elementwise \(f\)	\(y_i = f(x_i)\)	\(J = \text{diag}(f'(\mathbf{x}))\)	\(n\times n\) diagonal
Sum	\(y = \sum_i x_i\)	\(J = \mathbf{1}^T\)	\(1\times n\)
L2 norm squared	\(y = \\|\mathbf{x}\\|^2\)	\(J = 2\mathbf{x}^T\)	\(1\times n\)

The Jacobian Is Usually Never Materialized

For a layer with \(n\) inputs and \(m\) outputs, the Jacobian has \(mn\) entries. For a 4096-dimensional layer: 16 million entries per Jacobian, per layer. Computing and storing full Jacobians is infeasible. Instead, we only ever need the Jacobian-vector product (JVP) for forward mode or the vector-Jacobian product (VJP) for reverse mode — both computable in \(O(n)\) without forming the full Jacobian.

06

Core Primitive

Vector-Jacobian Products (VJPs)

// the fundamental operation · why we don't need the Jacobian · examples · efficiency

The vector-Jacobian product is the fundamental primitive of reverse-mode AD. Every backpropagation rule reduces to computing a VJP for that specific operation. Defining how to compute the VJP for each operation is equivalent to defining backpropagation for that operation.

VJP — DefinitionCore Primitive

\[\text{VJP}(f, \bar{\mathbf{y}}) = \bar{\mathbf{y}}^T J_f = \frac{\partial \mathcal{L}}{\partial \mathbf{y}}^T \cdot \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{x}}^T\]

ȳ = ∂L/∂y is the "upstream gradient" — the adjoint from all downstream nodes. VJP(f, ȳ) computes ∂L/∂x without ever forming J_f. Cost: same order as one forward pass through f. This is why backpropagation is efficient: each backward operation is as cheap as the corresponding forward operation.

VJPs for Key Operations

// VJP derivations for the most common neural network operations

MatMul

\[\mathbf{y} = \mathbf{W}\mathbf{x}, \quad J = \mathbf{W}\] \[\text{VJP w.r.t. } \mathbf{x}: \quad \bar{\mathbf{x}} = \bar{\mathbf{y}}^T\mathbf{W} = \mathbf{W}^T\bar{\mathbf{y}}\] \[\text{VJP w.r.t. } \mathbf{W}: \quad \bar{\mathbf{W}} = \bar{\mathbf{y}}\mathbf{x}^T\]

For Wx: the VJP w.r.t. x is W^T ȳ (transpose the weight matrix). The VJP w.r.t. W is the outer product ȳ x^T. These are the standard backprop rules for linear layers — now derived from first principles via VJPs.

Elementwise

\[\mathbf{y} = f(\mathbf{x}), \quad J = \text{diag}(f'(\mathbf{x}))\] \[\text{VJP}: \quad \bar{\mathbf{x}} = \bar{\mathbf{y}}^T \text{diag}(f'(\mathbf{x})) = \bar{\mathbf{y}} \odot f'(\mathbf{x})\]

For elementwise functions: the Jacobian is diagonal, so the VJP is just the elementwise product ⊙ (Hadamard product) of the upstream gradient and the elementwise derivative. ReLU: x̄ = ȳ ⊙ 1[x>0]. Sigmoid: x̄ = ȳ ⊙ σ(x)(1−σ(x)).

Sum

\[y = \sum_i x_i, \quad J = \mathbf{1}^T\] \[\text{VJP}: \quad \bar{\mathbf{x}} = \bar{y} \cdot \mathbf{1}\]

The gradient of a sum with respect to each input is 1 — the upstream scalar is broadcast to all inputs. This is why in batch normalization: the gradient through the sum broadcasts to all elements.

Log

\[y = \log x, \quad \frac{dy}{dx} = 1/x\] \[\text{VJP}: \quad \bar{x} = \bar{y}/x\]

The cross-entropy loss involves log — so during backward, we divide by the predicted probability. This is why near-zero predictions cause numerical instability: 1/ε is very large. log(ε + 1e-8) clipping or LogSumExp are used to stabilize this.

JVP vs VJP — When to Use Which

JVP vs VJP ComparisonMode Selection

\[\text{JVP}(f, \dot{\mathbf{x}}) = J_f \dot{\mathbf{x}} \quad \text{(tangent vector } \dot{\mathbf{x}}\text{, forward mode)}\] \[\text{VJP}(f, \bar{\mathbf{y}}) = J_f^T \bar{\mathbf{y}} \quad \text{(cotangent vector } \bar{\mathbf{y}}\text{, reverse mode)}\] \[\text{Reverse mode: } O(1) \text{ VJP calls to get } \nabla_{\mathbf{x}}\mathcal{L} \text{ for scalar }\mathcal{L}\] \[\text{Forward mode: } O(n) \text{ JVP calls to get all of } J_f\]

Conclusion: for a scalar loss and many parameters (ML training), use reverse mode. For vector outputs and few inputs (Jacobian estimation, sensitivity analysis), use forward mode. JAX supports both natively via jax.vjp() and jax.jvp().

07

Calculus

Matrix Calculus

// layout conventions · scalar-by-matrix · trace trick · key identities

Neural networks involve matrix operations. Matrix calculus provides the rules for differentiating functions that take matrices as inputs or produce matrices as outputs. Two layout conventions exist — the choice affects the shape of the result and leads to constant confusion in the literature.

Layout Conventions

Numerator Layout

The gradient \(\partial y/\partial \mathbf{x}\) has the same shape as \(y\) "in the numerator." For scalar \(y\) and vector \(\mathbf{x} \in \mathbb{R}^n\): the gradient is a row vector \(\in \mathbb{R}^{1\times n}\).

For vector \(\mathbf{y} \in \mathbb{R}^m\) and vector \(\mathbf{x} \in \mathbb{R}^n\): result is \(\mathbb{R}^{m\times n}\) Jacobian.

Used by: Bishop's PRML, most numerical analysis texts.

Denominator Layout

The gradient \(\partial y/\partial \mathbf{x}\) has the same shape as the "denominator" \(\mathbf{x}\). For scalar \(y\) and vector \(\mathbf{x} \in \mathbb{R}^n\): the gradient is a column vector \(\in \mathbb{R}^{n\times 1}\).

The ML convention: \(\nabla_\mathbf{x}\mathcal{L}\) is a column vector matching \(\mathbf{x}\).

Used by: Most ML textbooks, Goodfellow's DL book, standard practice.

Key Matrix Calculus Identities

Essential Identities for Neural Network BackpropReference

\[\frac{\partial}{\partial \mathbf{x}}(\mathbf{a}^T\mathbf{x}) = \mathbf{a}\] \[\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^T\mathbf{A}\mathbf{x}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{x} \quad\Rightarrow\quad 2\mathbf{A}\mathbf{x} \text{ if }\mathbf{A}=\mathbf{A}^T\] \[\frac{\partial}{\partial \mathbf{X}}\text{tr}(\mathbf{A}\mathbf{X}) = \mathbf{A}^T\] \[\frac{\partial}{\partial \mathbf{X}}\text{tr}(\mathbf{X}^T\mathbf{A}) = \mathbf{A}\] \[\frac{\partial}{\partial \mathbf{X}}\text{tr}(\mathbf{A}\mathbf{X}\mathbf{B}) = \mathbf{A}^T\mathbf{B}^T\] \[\frac{\partial}{\partial \mathbf{X}}\log|\mathbf{X}| = \mathbf{X}^{-T}\]

The trace trick: tr(A^T B) = vec(A)^T vec(B) — converts matrix inner products to scalar inner products. Essential for computing ∂L/∂W when L involves matrix operations. Most backprop derivations for linear layers use ∂tr(X^T A)/∂X = A.

Deriving Linear Layer Gradients via Matrix Calculus

// Deriving ∂L/∂W and ∂L/∂x for y = Wx using matrix calculus trace trick

Setup

\[\mathbf{y} = \mathbf{W}\mathbf{x},\quad \mathcal{L} = f(\mathbf{y})\] Use the differential approach: \(d\mathcal{L} = \text{tr}\!\left(\left(\frac{\partial\mathcal{L}}{\partial\mathbf{y}}\right)^T d\mathbf{y}\right) = \bar{\mathbf{y}}^T\, d\mathbf{y}\)

∂L/∂W

\[d\mathbf{y} = (d\mathbf{W})\mathbf{x} \Rightarrow d\mathcal{L} = \bar{\mathbf{y}}^T(d\mathbf{W})\mathbf{x} = \text{tr}(\bar{\mathbf{y}}^T(d\mathbf{W})\mathbf{x}) = \text{tr}(\mathbf{x}\bar{\mathbf{y}}^T\, d\mathbf{W})\] \[\therefore\quad \frac{\partial\mathcal{L}}{\partial\mathbf{W}} = (\mathbf{x}\bar{\mathbf{y}}^T)^T = \bar{\mathbf{y}}\mathbf{x}^T\]

Using tr(A d B) → ∂/∂B = A^T. The outer product ȳ x^T matches the shape of W exactly.

∂L/∂x

\[d\mathbf{y} = \mathbf{W}(d\mathbf{x}) \Rightarrow d\mathcal{L} = \bar{\mathbf{y}}^T\mathbf{W}(d\mathbf{x}) = (\mathbf{W}^T\bar{\mathbf{y}})^T\, d\mathbf{x}\] \[\therefore\quad \frac{\partial\mathcal{L}}{\partial\mathbf{x}} = \mathbf{W}^T\bar{\mathbf{y}}\]

The key rule: for y = Wx, backprop through x is W^T ȳ (transpose the weight). This is the most important single formula in all of deep learning. ∎

08

Full Derivation

Full Network Backpropagation Derivation

// notation · BP1 through BP4 · full vectorized proof · every gradient computed

We now derive the four fundamental equations of backpropagation for a general \(L\)-layer neural network with arbitrary activation functions. These four equations are the complete specification of backpropagation — everything else is an application of them.

Notation and Setup

Network NotationSetup

\[\mathbf{a}^{[0]} = \mathbf{x},\quad \mathbf{z}^{[\ell]} = \mathbf{W}^{[\ell]}\mathbf{a}^{[\ell-1]} + \mathbf{b}^{[\ell]},\quad \mathbf{a}^{[\ell]} = \sigma^{[\ell]}(\mathbf{z}^{[\ell]})\] \[\boldsymbol{\delta}^{[\ell]} \triangleq \frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[\ell]}} \quad\text{(error signal / adjoint at layer }\ell\text{)}\]

The Four Equations

// Complete derivation of BP1–BP4 from first principles

BP1

Output layer error: \[\boldsymbol{\delta}^{[L]} = \frac{\partial\mathcal{L}}{\partial\mathbf{z}^{[L]}} = \frac{\partial\mathcal{L}}{\partial\mathbf{a}^{[L]}} \odot \sigma'^{[L]}(\mathbf{z}^{[L]})\] Derivation: Apply chain rule: \(\partial\mathcal{L}/\partial z_k^{[L]} = \sum_j (\partial\mathcal{L}/\partial a_j^{[L]})(\partial a_j^{[L]}/\partial z_k^{[L]})\). Since \(a_j^{[L]} = \sigma(z_j^{[L]})\) and activation is applied elementwise: \(\partial a_j/\partial z_k = \sigma'(z_j)\delta_{jk}\). Therefore \(\partial\mathcal{L}/\partial z_k^{[L]} = (\partial\mathcal{L}/\partial a_k^{[L]})\sigma'(z_k^{[L]})\).

For Softmax + CCE: δ[L] = ŷ − y (the elegant cancellation — see §10). For Sigmoid + BCE: also δ[L] = ŷ − y.

BP2

Error propagation backward: \[\boldsymbol{\delta}^{[\ell]} = \left(\mathbf{W}^{[\ell+1]T}\boldsymbol{\delta}^{[\ell+1]}\right) \odot \sigma'^{[\ell]}(\mathbf{z}^{[\ell]})\] Derivation: \(\partial\mathcal{L}/\partial z_k^{[\ell]} = \sum_j (\partial\mathcal{L}/\partial z_j^{[\ell+1]})(\partial z_j^{[\ell+1]}/\partial z_k^{[\ell]})\). Now \(z_j^{[\ell+1]} = \sum_i W_{ji}^{[\ell+1]}a_i^{[\ell]}\) and \(a_i^{[\ell]} = \sigma(z_i^{[\ell]})\). So: \[\frac{\partial z_j^{[\ell+1]}}{\partial z_k^{[\ell]}} = W_{jk}^{[\ell+1]}\sigma'(z_k^{[\ell]})\] \[\therefore\quad \delta_k^{[\ell]} = \sigma'(z_k^{[\ell]})\sum_j W_{jk}^{[\ell+1]}\delta_j^{[\ell+1]} = \sigma'(z_k^{[\ell]})\cdot[\mathbf{W}^{[\ell+1]T}\boldsymbol{\delta}^{[\ell+1]}]_k\]

The transpose W^[ℓ+1]T is not a coincidence — it's the VJP of the linear operation z = Wa. Multiply by the adjoint from the right: the "backward weight matrix" is the transpose of the forward weight matrix. ∎

BP3

Gradient of the weights: \[\frac{\partial\mathcal{L}}{\partial\mathbf{W}^{[\ell]}} = \boldsymbol{\delta}^{[\ell]}(\mathbf{a}^{[\ell-1]})^T\] Derivation: \(z_j^{[\ell]} = \sum_k W_{jk}^{[\ell]}a_k^{[\ell-1]}\), so \(\partial z_j^{[\ell]}/\partial W_{jk}^{[\ell]} = a_k^{[\ell-1]}\) and \(\partial z_{j'}^{[\ell]}/\partial W_{jk}^{[\ell]} = 0\) for \(j'\neq j\). Therefore: \[\frac{\partial\mathcal{L}}{\partial W_{jk}^{[\ell]}} = \frac{\partial\mathcal{L}}{\partial z_j^{[\ell]}}\cdot a_k^{[\ell-1]} = \delta_j^{[\ell]}\cdot a_k^{[\ell-1]}\]

In matrix form: ∂L/∂W[ℓ] = δ[ℓ](a[ℓ-1])^T — the outer product of the error signal and the previous layer's activation. Shape: n[ℓ] × n[ℓ-1], matching W[ℓ]. ∎

BP4

Gradient of the biases: \[\frac{\partial\mathcal{L}}{\partial\mathbf{b}^{[\ell]}} = \boldsymbol{\delta}^{[\ell]}\] Derivation: \(z_j^{[\ell]} = \sum_k W_{jk}a_k + b_j\), so \(\partial z_j^{[\ell]}/\partial b_j^{[\ell]} = 1\) and zero otherwise. Therefore \(\partial\mathcal{L}/\partial b_j^{[\ell]} = \delta_j^{[\ell]}\cdot 1 = \delta_j^{[\ell]}\). ∎

The bias gradient is simply the error signal — the bias acts as a shift with derivative 1, so it just passes the error through unchanged.

The Four Fundamental Equations of Backpropagation

\[\mathbf{BP1:}\quad \boldsymbol{\delta}^{[L]} = \nabla_{\mathbf{a}^{[L]}}\mathcal{L} \odot \sigma'^{[L]}(\mathbf{z}^{[L]})\] \[\mathbf{BP2:}\quad \boldsymbol{\delta}^{[\ell]} = (\mathbf{W}^{[\ell+1]T}\boldsymbol{\delta}^{[\ell+1]}) \odot \sigma'^{[\ell]}(\mathbf{z}^{[\ell]})\] \[\mathbf{BP3:}\quad \nabla_{\mathbf{W}^{[\ell]}}\mathcal{L} = \boldsymbol{\delta}^{[\ell]}(\mathbf{a}^{[\ell-1]})^T\] \[\mathbf{BP4:}\quad \nabla_{\mathbf{b}^{[\ell]}}\mathcal{L} = \boldsymbol{\delta}^{[\ell]}\]

09

Output Layer

Softmax Gradients — The Full Jacobian

// the softmax Jacobian · diagonal minus outer product · numerical stability

The softmax function maps a vector to a probability distribution. Its Jacobian is not diagonal (unlike ReLU or sigmoid) — each output depends on every input through the normalization constant. Computing the softmax Jacobian correctly is essential for the output layer.

Softmax Jacobian DerivationOutput Layer

\[\text{softmax}(\mathbf{z})_i = \hat{y}_i = \frac{e^{z_i}}{\sum_k e^{z_k}}\] \[\frac{\partial \hat{y}_i}{\partial z_j} = \begin{cases}\hat{y}_i(1-\hat{y}_i) & i = j \\ -\hat{y}_i\hat{y}_j & i \neq j\end{cases}\] \[J_{\text{softmax}} = \text{diag}(\hat{\mathbf{y}}) - \hat{\mathbf{y}}\hat{\mathbf{y}}^T \in \mathbb{R}^{K\times K}\]

Derivation for i≠j: ∂(eᶻⁱ/Σₖeᶻᵏ)/∂zⱼ = 0/Σ − eᶻⁱeᶻʲ/Σ² = −ŷᵢŷⱼ. For i=j: ∂ŷᵢ/∂zᵢ = (eᶻⁱΣ − eᶻⁱeᶻⁱ)/Σ² = ŷᵢ − ŷᵢ² = ŷᵢ(1−ŷᵢ). In matrix form: J = diag(ŷ) − ŷŷ^T.

The VJP of Softmax

The VJP \(\bar{\mathbf{z}} = J_{\text{softmax}}^T \bar{\mathbf{y}}\) can be computed without materializing the full \(K\times K\) Jacobian:

Efficient Softmax VJPO(K) not O(K²)

\[\bar{\mathbf{z}} = J_{\text{softmax}}^T\bar{\mathbf{y}} = (\text{diag}(\hat{\mathbf{y}}) - \hat{\mathbf{y}}\hat{\mathbf{y}}^T)\bar{\mathbf{y}}\] \[= \hat{\mathbf{y}} \odot \bar{\mathbf{y}} - \hat{\mathbf{y}}(\hat{\mathbf{y}}^T\bar{\mathbf{y}})\] \[= \hat{\mathbf{y}} \odot \left(\bar{\mathbf{y}} - \mathbf{1}\cdot\hat{\mathbf{y}}^T\bar{\mathbf{y}}\right) = \hat{\mathbf{y}} \odot (\bar{\mathbf{y}} - \bar{y}_{\text{sum}})\]

where ȳ_sum = ŷ^T ȳ is a scalar. The VJP reduces to: subtract the scalar dot product ŷ^T ȳ from each component of ȳ, then multiply elementwise by ŷ. Total cost: O(K), not O(K²). This efficient form is what PyTorch implements internally.

10

Elegant Result

Cross-Entropy + Softmax — The Cancellation

// the combined gradient · why it simplifies · MLE connection · numerical stability

When softmax is used with categorical cross-entropy loss, the combined gradient \(\partial\mathcal{L}/\partial\mathbf{z}^{[L]}\) has an extraordinarily clean form. This cancellation is not coincidence — it's a consequence of the exponential family structure of the categorical distribution.

// Deriving the combined softmax + CCE gradient

Setup

\[\mathcal{L}_{\text{CCE}} = -\sum_{k=1}^K y_k\log\hat{y}_k \quad\text{(one-hot label: } y_c=1, y_{k\neq c}=0\text{)}\] \[= -\log\hat{y}_c\]

∂L/∂ŷ

\[\frac{\partial\mathcal{L}}{\partial\hat{y}_k} = -\frac{y_k}{\hat{y}_k}\]

Chain

Apply chain rule through softmax: \[\frac{\partial\mathcal{L}}{\partial z_j} = \sum_k \frac{\partial\mathcal{L}}{\partial\hat{y}_k}\cdot\frac{\partial\hat{y}_k}{\partial z_j} = \sum_k\left(-\frac{y_k}{\hat{y}_k}\right)\frac{\partial\hat{y}_k}{\partial z_j}\]

Expand

Split into \(k=j\) and \(k\neq j\) terms: \[= -\frac{y_j}{\hat{y}_j}\hat{y}_j(1-\hat{y}_j) + \sum_{k\neq j}\frac{y_k}{\hat{y}_k}\hat{y}_k\hat{y}_j\] \[= -y_j(1-\hat{y}_j) + \hat{y}_j\sum_{k\neq j}y_k = -y_j + y_j\hat{y}_j + \hat{y}_j(1-y_j) = \hat{y}_j - y_j\]

Using Σₖ yₖ = 1 (one-hot labels sum to 1) and Σₖ≠ⱼ yₖ = 1 − yⱼ. The remarkable cancellation: the softmax Jacobian exactly "cancels" the division by ŷₖ in the cross-entropy derivative. ∎

Softmax + Cross-Entropy Gradient

\[\boxed{\frac{\partial\mathcal{L}_{\text{CCE}}}{\partial\mathbf{z}^{[L]}} = \hat{\mathbf{y}} - \mathbf{y}}\] The gradient with respect to the pre-softmax logits is simply predicted probability minus true label. This is the most important single formula in deep learning training — the error signal for every neural network classifier.

Numerical Stability

Computing \(\text{softmax}(\mathbf{z})_k = e^{z_k}/\sum_j e^{z_j}\) directly can overflow when \(z_k\) is large. The numerically stable version subtracts the maximum before exponentiation:

Numerically Stable SoftmaxStability

\[\text{softmax}(\mathbf{z})_k = \frac{e^{z_k - z_{\max}}}{\sum_j e^{z_j - z_{\max}}}\]

Subtracting the maximum doesn't change the result (it cancels in numerator and denominator) but prevents overflow. The largest exp becomes exp(0) = 1; all others are smaller. The loss: -log(softmax(z)_c) = -(z_c − z_max) + log(Σ_j exp(z_j − z_max)) — this is the log-sum-exp computation, computed stably in PyTorch as F.cross_entropy(), which combines log_softmax + nll_loss numerically.

11

Mini-Batch

Batch Gradients

// single to batch · vectorized form · 1/m averaging · efficient GPU computation

Vectorized Mini-Batch BackpropagationBatch Form

\[\mathbf{X} \in \mathbb{R}^{n^{[0]}\times m},\quad \mathbf{Z}^{[\ell]} \in \mathbb{R}^{n^{[\ell]}\times m},\quad \boldsymbol{\Delta}^{[\ell]} \in \mathbb{R}^{n^{[\ell]}\times m}\] \[\mathbf{BP1}:\quad \boldsymbol{\Delta}^{[L]} = \hat{\mathbf{Y}} - \mathbf{Y} \in \mathbb{R}^{K\times m}\] \[\mathbf{BP2}:\quad \boldsymbol{\Delta}^{[\ell]} = (\mathbf{W}^{[\ell+1]T}\boldsymbol{\Delta}^{[\ell+1]}) \odot \sigma'^{[\ell]}(\mathbf{Z}^{[\ell]})\] \[\mathbf{BP3}:\quad \frac{\partial\mathcal{L}}{\partial\mathbf{W}^{[\ell]}} = \frac{1}{m}\boldsymbol{\Delta}^{[\ell]}(\mathbf{A}^{[\ell-1]})^T\] \[\mathbf{BP4}:\quad \frac{\partial\mathcal{L}}{\partial\mathbf{b}^{[\ell]}} = \frac{1}{m}\sum_{i=1}^m\boldsymbol{\Delta}^{[\ell]}_{:,i} = \frac{1}{m}\boldsymbol{\Delta}^{[\ell]}\mathbf{1}\]

The batch form processes all m samples simultaneously via matrix operations — maximally parallelizable on GPUs. The 1/m factor averages the gradient over the batch. BP4 requires summing over the batch dimension — a sum/reduction operation — to get the bias gradient, since the bias is shared across all samples.

Why Batch Size Matters for Gradients

Large batch: More accurate gradient estimate (lower variance). But the gradient direction barely improves beyond batch size ~1000 for most problems (see large-batch training literature).
Small batch: Noisier gradients, but the noise acts as regularization (implicit noise in SGD helps escape sharp minima). Usually reaches better minima than large-batch training at the same number of epochs.
Gradient variance: \(\text{Var}(\hat{g}) = \sigma^2_g/m\) — variance decreases as \(1/m\). But the computational cost grows linearly in \(m\). For large \(m\), you pay full compute for diminishing variance reduction.

12

Engineering

Memory-Efficiency Tradeoffs

// activation checkpointing · gradient accumulation · AMP · memory breakdown

In practice, training large neural networks is limited not by compute but by GPU memory. The standard backpropagation algorithm requires storing all intermediate activations, which can exceed GPU memory for large models or sequences.

Memory Breakdown During Training

Component	Memory	Notes
Parameters (weights + biases)	\(O(P)\)	Fixed; FP32: 4 bytes/param. GPT-3: 700GB
Gradients	\(O(P)\)	Same shape as parameters
Optimizer states (Adam)	\(O(2P)\)	m_t and v_t for each parameter
Activations (forward cache)	\(O(m \cdot L \cdot d)\)	Grows with batch size × depth × width
Input data	\(O(m \cdot n)\)	Usually small relative to activations

Activation Checkpointing

Instead of caching all intermediate activations, recompute them on demand during the backward pass. Trade compute for memory:

Activation Checkpointing — Memory/Compute TradeoffAlgorithm

\[\text{Standard backprop: } O(L\cdot d\cdot m) \text{ memory}, \quad O(2\cdot T_{\text{fwd}}) \text{ compute}\] \[\text{Checkpointing: } O(\sqrt{L}\cdot d\cdot m) \text{ memory}, \quad O(3\cdot T_{\text{fwd}}) \text{ compute}\]

With checkpointing: save activations only at every √L-th layer (the "checkpoints"). During backward, recompute from the nearest checkpoint. Memory scales as √L instead of L — for a 100-layer network, memory reduces 10×. PyTorch: torch.utils.checkpoint.checkpoint(). Used in training LLMs with very long sequences.

Gradient Accumulation

To simulate a large effective batch size with limited GPU memory, accumulate gradients over multiple small batches before taking an optimizer step:

Gradient Accumulation — PyTorch

accumulation_steps = 8 # simulate batch_size * 8
optimizer.zero_grad()

for i, (x, y) in enumerate(dataloader):
  loss = model(x, y) / accumulation_steps
  loss.backward() # gradients accumulate in .grad

  if (i + 1) % accumulation_steps == 0:
    optimizer.step()   # update with accumulated gradients
    optimizer.zero_grad()

# Effectively processes 8x larger batches with 1x memory

Automatic Mixed Precision (AMP)

AMP — Gradient ScalingFP16 Training

\[\text{FP32 params and gradients: 4 bytes/value}\] \[\text{FP16 forward pass: 2 bytes/value (2× faster on Tensor Cores)}\] \[\text{Gradient scaling: } \hat{g} = s \cdot g_{\text{FP16}}, \quad g_{\text{FP32}} = \hat{g}/s\]

FP16 has limited range (max ~65504) — small gradients underflow to 0. Solution: multiply loss by a large scale factor s before backward, then divide the accumulated gradients by s. PyTorch GradScaler handles this automatically. AMP reduces memory ~2× and increases throughput ~2-3× on modern GPUs with no accuracy loss.

13

AD Framework

Automatic Differentiation — The Theory

// three flavors of differentiation · operator overloading · tape-based AD

Automatic differentiation is the umbrella framework. It is distinct from both numerical differentiation (finite differences) and symbolic differentiation (computer algebra). It computes exact derivatives efficiently using the structure of the program.

Three Flavors of Differentiation

Method	How	Exact?	Cost	When Used
Numerical (finite diff)	\((f(x+ε) - f(x-ε))/2ε\)	No — approx \(O(ε^2)\)	\(O(n)\) passes	Gradient checking only
Symbolic (CAS)	Derive closed-form expression	Yes	Expr swell; static only	Toy problems, paper derivations
Forward AD	Dual numbers, JVPs	Yes	\(O(n)\) passes for full J	Few inputs, many outputs
Reverse AD (Backprop)	Tape + VJPs	Yes	\(O(1)\) pass for scalar loss	All neural network training

Operator Overloading

The most common AD implementation technique: overload mathematical operators (+, *, sin, etc.) on a special number type that tracks both the primal value and the derivative information:

Dual Numbers — Forward Mode AD (Python)

class Dual:
  def __init__(self, real, dual=0.0):
    self.real = real   # f(x) — primal value
    self.dual = dual   # f'(x)·ẋ — tangent

  def __mul__(self, other):
    # Product rule: (f·g)' = f'·g + f·g'
    return Dual(self.real*other.real,
               self.dual*other.real + self.real*other.dual)

  def __add__(self, other):
    return Dual(self.real+other.real, self.dual+other.dual)

# Differentiate f(x) = x*x + x at x=3
x = Dual(3.0, 1.0) # x=3, ẋ=1 (seed the tangent)
result = x*x + x
print(result.dual) # f'(3) = 2*3 + 1 = 7 ✓

Tape-Based Reverse Mode AD

PyTorch's autograd and TensorFlow use a tape (Wengert list): a record of every operation performed during the forward pass, including enough information to compute the VJP for each operation during the backward pass:

Tape-Based AD — Minimal Implementation

class Tensor:
  def __init__(self, data, _children=(), _backward_fn=None):
    self.data = data
    self.grad = 0.0                    # accumulates ∂L/∂self
    self._children = _children        # inputs to this node
    self._backward = _backward_fn or (lambda: None)

  def __mul__(self, other):
    out = Tensor(self.data * other.data, (self, other))
    def _bwd():
      # VJP of multiply: ∂(a*b)/∂a = b, ∂(a*b)/∂b = a
      self.grad += out.grad * other.data # accumulate!
      other.grad += out.grad * self.data
    out._backward = _bwd
    return out

  def backward(self):
    self.grad = 1.0                      # dL/dL = 1
    for node in topological_sort(self): # reverse topo
      node._backward()

14

Frameworks

PyTorch & TensorFlow Autograd

// dynamic graph · GradientTape · retain_graph · custom backward · hooks

PyTorch Autograd — Dynamic Graph

PyTorch Autograd — Everything You Need to Know

import torch

# 1. requires_grad=True enables gradient tracking
x = torch.randn(3, 4, requires_grad=True)
W = torch.randn(4, 5, requires_grad=True)

# 2. Forward pass builds the computational graph automatically
z = x @ W            # graph: z.grad_fn = MmBackward
loss = z.sum()

# 3. Backward pass computes all gradients
loss.backward()      # default: free the graph after backward
print(W.grad)        # dL/dW — shape (4,5)

# 4. retain_graph=True: keep graph for multiple backward passes
loss.backward(retain_graph=True) # needed for higher-order grads

# 5. Gradient accumulation: .grad ADDS (doesn't replace)
optimizer.zero_grad() # must zero before next backward!

# 6. torch.no_grad(): disable tracking for inference
with torch.no_grad():
  y_pred = model(x_test) # 2x faster, no memory overhead

# 7. detach(): remove tensor from graph
target = output.detach() # treat as constant, stop grad flow

TensorFlow GradientTape

TensorFlow 2.x GradientTape

import tensorflow as tf

# 1. GradientTape records forward pass operations
x = tf.Variable(tf.random.normal([3, 4]))
W = tf.Variable(tf.random.normal([4, 5]))

with tf.GradientTape() as tape:
  z = tf.matmul(x, W)      # recorded on tape
  loss = tf.reduce_sum(z) # recorded on tape

# 2. Compute gradients by replaying the tape backward
grads = tape.gradient(loss, [x, W]) # [dL/dx, dL/dW]

# 3. persistent=True: call gradient() multiple times
with tf.GradientTape(persistent=True) as tape:
  loss = compute_loss()
dL_dx = tape.gradient(loss, x)
dL_dW = tape.gradient(loss, W) # tape still alive
del tape # must delete manually

Custom Backward Functions in PyTorch

torch.autograd.Function — Custom Backward

class StraightThroughEstimator(torch.autograd.Function):
  """Example: gradient through a non-differentiable op"""

  @staticmethod
  def forward(ctx, x):
    ctx.save_for_backward(x) # cache what backward needs
    return (x > 0).float()    # non-differentiable step

  @staticmethod
  def backward(ctx, grad_output):
    x, = ctx.saved_tensors
    # Straight-through: pass gradient as-is (pretend forward = identity)
    return grad_output * (x.abs() <= 1).float()

# Use it like a regular function:
y = StraightThroughEstimator.apply(x)

15

Advanced

Modern Backprop Tricks

// gradient clipping · second-order · higher-order grads · grad checkpointing in depth

Gradient Clipping

Exploding gradients (common in RNNs and deep networks) are handled by clipping the gradient norm:

Global Norm ClippingStability

\[\mathbf{g} \leftarrow \begin{cases}\mathbf{g} & \|\mathbf{g}\|_2 \leq \text{max\_norm} \\ \text{max\_norm}\cdot\frac{\mathbf{g}}{\|\mathbf{g}\|_2} & \text{otherwise}\end{cases}\]

Global norm clipping scales all gradients uniformly — preserving the direction, only reducing the magnitude. Value clipping (clip each gradient independently to [−c, c]) is different and less principled. PyTorch: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0). Typical max_norm: 0.5–5.0 depending on the model.

Higher-Order Gradients

Since the gradient computation is itself a differentiable computation, you can differentiate through the backward pass to get second (or higher) order derivatives:

Higher-Order Gradients — MAML / Hessian-Vector Products

# Second-order gradient (gradient of gradient)
x = torch.randn(3, requires_grad=True)
y = (x ** 2).sum() # y = ||x||²

# First derivative
dy_dx = torch.autograd.grad(y, x, create_graph=True)[0]
# dy/dx = 2x — create_graph=True keeps the gradient graph

# Second derivative (Hessian diagonal)
d2y_dx2 = torch.autograd.grad(dy_dx.sum(), x)[0]
# d²y/dx² = 2 ✓

# Hessian-Vector Product (efficient second order)
v = torch.ones_like(x)
hvp = torch.autograd.grad(dy_dx, x, grad_outputs=v)[0]
# Computes H·v in O(n) without materializing H (n×n)

MAML (Meta-Learning): Differentiates through multiple gradient steps to find initialization that quickly adapts. Requires second-order gradients (gradient through gradient).
Hessian-Vector Products (HVPs): Used in second-order optimization (Newton's method, Hessian-free optimization, K-FAC). Computable in \(O(n)\) — same cost as a gradient — without forming the \(O(n^2)\) Hessian.
Implicit differentiation: Differentiate through the solution of an optimization problem. Used in meta-learning, hyperparameter optimization, and neural ODEs.

Gradient Checkpointing — Deep Dive

torch.utils.checkpoint — Activation Recomputation

import torch.utils.checkpoint as cp

class CheckpointedModel(nn.Module):
  def forward(self, x):
    # Without checkpointing: all activations cached (high memory)
    # With checkpointing: recomputes on backward (low memory)
    x = cp.checkpoint(self.layer1, x) # don't cache layer1 activations
    x = cp.checkpoint(self.layer2, x) # recompute from input on backward
    x = self.layer3(x)                 # last layer: always cached
    return x

# Memory: O(√L) instead of O(L) — 10x reduction for 100 layers
# Compute: ~33% overhead (recomputes each checkpoint once on bwd)
# Enables 2-3x larger models or longer sequences

Gradient Noise and Perturbation

Label smoothing: Replaces one-hot targets with soft targets — the gradient signal is \(\hat{y} - y_{\text{smooth}}\) where \(y_{\text{smooth}} = (1-\epsilon)y + \epsilon/K\). Prevents overconfident predictions.
Stochastic depth: Randomly drop entire residual branches during training. The gradient paths change randomly each step, acting as an ensemble of shallower networks.
Noise injection: Add Gaussian noise to gradients (Langevin dynamics). Helps escape sharp local minima. Formally equivalent to sampling from the Bayesian posterior.

16

Synthesis

The Complete Mental Model

// everything unified · one diagram · one algorithm · the whole picture

Fig 2. Complete backpropagation mental map — from the computational goal through the four fundamental equations to the gradient update that trains every neural network.

The complete story in one coherent thread:

The problem: We need \(\partial\mathcal{L}/\partial\theta_j\) for every parameter \(\theta_j\). Finite differences require \(O(n)\) forward passes. Backpropagation computes all gradients simultaneously in one forward + one backward pass — \(O(1)\) passes regardless of \(n\).
Computational graphs represent computation as a DAG. The forward pass evaluates nodes in topological order, caching all intermediate values. The graph makes the chain rule application systematic and automatic.
Backpropagation is the chain rule applied in reverse topological order, reusing each computed gradient \(\partial\mathcal{L}/\partial z\) for all upstream parameters that depend on \(z\). This sharing is the source of the \(O(n)\) efficiency.
Reverse mode AD propagates adjoint variables \(\bar{v} = \partial\mathcal{L}/\partial v\) backward through the graph. Initialization: \(\bar{\mathcal{L}} = 1\). Each node computes the VJP for its operation and accumulates the result into its input adjoints.
VJPs are the primitive: each operation need only define its VJP (how to propagate a cotangent backward), not its full Jacobian. The Jacobian is never formed — only its product with a vector. Cost: same order as the forward operation.
Matrix calculus gives us the VJP rules for common operations: for \(\mathbf{y} = \mathbf{Wx}\), the VJP w.r.t. \(\mathbf{W}\) is \(\bar{\mathbf{y}}\mathbf{x}^T\) (outer product), and w.r.t. \(\mathbf{x}\) is \(\mathbf{W}^T\bar{\mathbf{y}}\) (transpose times cotangent).
The four BP equations follow directly from applying VJPs to the specific structure of a layered neural network. BP1 gives the output error; BP2 propagates it backward; BP3 and BP4 give weight and bias gradients.
Softmax + CCE gives the cleanest gradient: \(\partial\mathcal{L}/\partial\mathbf{z}^{[L]} = \hat{\mathbf{y}} - \mathbf{y}\). This follows from the exponential family structure of the categorical distribution and is not a coincidence.
Memory is the bottleneck for large models. Activation checkpointing reduces cached memory from \(O(L)\) to \(O(\sqrt{L})\) by recomputing activations on demand. Gradient accumulation and AMP are the other two key techniques for fitting large models into GPU memory.
Modern autograd engines (PyTorch, JAX, TensorFlow) implement tape-based reverse mode AD via operator overloading. Every operation records its VJP. The backward pass replays the tape in reverse, computing all parameter gradients automatically.

The One Algorithm That Trains Everything

Backpropagation is not a clever trick or an approximation — it is the exact, optimal algorithm for computing gradients of a scalar loss with respect to all parameters of a computation. It is a direct consequence of the chain rule of calculus, applied systematically using the structure of the computational graph. Every neural network ever trained — from the 1986 XOR network to GPT-4 with 1.8 trillion parameters — was trained by exactly this algorithm. The equations are the same; only the graph grew larger. Understanding backpropagation is not merely understanding a technique — it is understanding the mathematical mechanism by which machines learn.

the algorithm that trains every neural network Backpropagation