Scalars to tensors, differentials to Jacobians, trace tricks to full
derivations of linear regression, logistic regression, PCA, and
backprop. The complete reference for every gradient you'll ever need.
Matrix calculus extends ordinary calculus to functions whose inputs and/or outputs are vectors or matrices. The
single biggest source of confusion in the field is not the calculus itself — it's notation. Before any
derivative is computed, fix the shapes and the layout convention.
The shape of \(\partial y/\partial\mathbf{x}\) matches the shape of \(y\) "stacked over" \(\mathbf{x}\):
for scalar \(y\) and \(\mathbf{x}\in\mathbb{R}^n\), the gradient is a row vector
\(\in\mathbb{R}^{1\times n}\).
For \(\mathbf{y}\in\mathbb{R}^m\), \(\mathbf{x}\in\mathbb{R}^n\): the Jacobian is \(m\times n\) — rows
index outputs, columns index inputs. This is mathematically elegant because it preserves the chain rule
as simple matrix multiplication \(J_{g \circ f} = J_g J_f\).
Denominator Layout (ML convention)
The shape of \(\partial y/\partial\mathbf{x}\) matches the shape of \(\mathbf{x}\): for scalar \(y\), the
gradient \(\nabla_\mathbf{x}y\) is a column vector \(\in\mathbb{R}^{n\times1}\), same shape as
\(\mathbf{x}\) itself.
This is the convention used throughout deep learning: w.grad always has the same shape as
w, which makes gradient descent updates \(\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_\mathbf{w} \mathcal{L}\)
dimensionally consistent without transposing. This page uses denominator layout unless explicitly stated.
▣
The Golden Rule
Whatever convention you pick, the gradient of a scalar w.r.t. any tensor has the same shape as
that tensor. If \(\partial \mathcal{L}/\partial \mathbf{W}\) doesn't have the same shape as
\(\mathbf{W}\), something is transposed wrong. This single check catches the majority of matrix calculus
bugs — and is exactly what PyTorch enforces for every .grad attribute.
Notation Used Throughout This Page
Scalars: lowercase italic — \(x, y, \alpha, \lambda\)
Gradient of scalar \(f\) w.r.t. vector \(\mathbf{x}\): \(\nabla_\mathbf{x}f\) or \(\partial
f/\partial\mathbf{x}\) — always same shape as \(\mathbf{x}\)
Hessian of scalar \(f\) w.r.t. vector \(\mathbf{x}\): \(\nabla^2 f \in \mathbb{R}^{n\times n}\), entry
\((i,j) = \partial^2 f/\partial x_i\partial x_j\)
02
Differentials
The Differential Viewpoint
// total differential · why differentials beat index chasing · the workflow
The single most powerful technique in matrix calculus is to stop thinking in terms of \(\partial y/\partial
x_{ij}\) (index chasing) and start thinking in terms of differentials \(df\) — small
linear perturbations. Differentials are the matrix-notation version of “first-order Taylor”: they keep only the
linear (in \(d\mathbf{x}\)) part of the change, so you can manipulate derivatives with the same algebra you
already know. Differentials obey simple algebraic rules and translate directly into gradients via the Identification Theorem.
The Identification Theorem (Magnus & Neudecker) states that if you can massage your
differential into the form \(df = \mathbf{a}^T d\mathbf{x}\), then \(\nabla \mathbf{x} = \mathbf{a}\).
If it's in the form \(df = \text{tr}(\mathbf{A} d\mathbf{X})\), then \(\partial f/\partial \mathbf{X} = \mathbf{A}^T\).
This turns calculus into pure algebraic rearrangement.
Differential Rules (Algebra of d)
Rules for Manipulating DifferentialsAlgebra
\[d(A+B) = dA + dB \qquad d(\alpha A) = \alpha\, dA\]
\[d(AB) = (dA)B + A(dB) \quad\text{(product rule — order matters!)}\]
\[d(\text{tr}(A)) = \text{tr}(dA) \qquad d(A^{-1}) = -A^{-1}(dA)A^{-1}\]
\[d(\log\det A) = \text{tr}(A^{-1}dA)\]
These rules look exactly like scalar calculus rules — because they ARE the scalar rules
applied entrywise, just packaged into matrix notation. The product rule d(AB)=(dA)B+A(dB) preserves the
ORDER of multiplication (matrices don't commute) — this is the only real new thing to remember.
Worked Example: \(f(\mathbf{x}) = \mathbf{x}^T\mathbf{A}\mathbf{x}\)
// The differential workflow, end to end
1
Write the differential using the product rule:
\[df = d(\mathbf{x}^T\mathbf{A}\mathbf{x}) = (d\mathbf{x})^T\mathbf{A}\mathbf{x} +
\mathbf{x}^T\mathbf{A}(d\mathbf{x})\]
2
Both terms are scalars, so each equals its own transpose. Transpose the first term:
\[(d\mathbf{x})^T\mathbf{A}\mathbf{x} = (\mathbf{x}^T\mathbf{A}^T(d\mathbf{x}))^T =
\mathbf{x}^T\mathbf{A}^T d\mathbf{x} \quad\text{(scalar = its transpose)}\]
Read off the gradient directly as the coefficient of dx (transposed to match
denominator layout). If A is symmetric: ∇f = 2Ax. No index sums, no Kronecker deltas — pure algebra. ∎
03
Vector Calc
The Jacobian Matrix
// definition · examples · geometric meaning · special diagonal cases
The Jacobian is the matrix of all first-order partial derivatives of a vector-valued function. It generalizes
the scalar derivative to the multivariate, multi-output case and is the local linear approximation of
\(\mathbf{f}\) at a point. Conceptually, it represents the linear transformation that maps the tangent space
of the input space to the tangent space of the output space.
Row i, column j is ∂fᵢ/∂xⱼ — how output i changes per unit change in input j. The
Jacobian is the best linear approximation to f near x. For m=1 (scalar output), the Jacobian is a 1×n row
vector — the transpose of the gradient in denominator layout.
Jacobians of Building-Block Operations
Function
Jacobian
Structure
\(\mathbf{y}=\mathbf{Ax}\) (linear)
\(J=\mathbf{A}\)
\(m\times n\), constant
\(\mathbf{y}=\mathbf{x}+\mathbf{b}\)
\(J=\mathbf{I}\)
\(n\times n\) identity
\(y_i=\sigma(x_i)\) (elementwise)
\(J=\text{diag}(\sigma'(x_i))\)
diagonal — cheap!
\(\mathbf{y}=\text{softmax}(\mathbf{x})\)
\(J=\text{diag}(\mathbf{y})-\mathbf{yy}^T\)
dense, low-rank correction — reflects the "sum-to-one" constraint.
Elementwise operations (activations, dropout masks, Hadamard products) always have diagonal
Jacobians — full \(n\times n\) matrices that are mostly zero. In practice these are never materialized; the
Jacobian-vector product reduces to a simple elementwise multiply: \(J\mathbf{v} =
f'(\mathbf{x})\odot\mathbf{v}\). This is why elementwise nonlinearities are computationally "free" in
backprop.
The Hessian is the Jacobian of the gradient — the matrix of all second-order partial derivatives of a scalar
function. It describes local curvature and is the basis of Newton's method, convexity analysis, and second-order
optimization. In high-dimensional deep learning landscapes, the Hessian helps identify not just minima and maxima,
but also saddle points, which are the primary obstacles to gradient-based optimization.
By Clairaut's/Schwarz's theorem, if f is C² (continuous second partials), the Hessian is
always SYMMETRIC: ∂²f/∂xᵢ∂xⱼ = ∂²f/∂xⱼ∂xᵢ. This symmetry is exploited everywhere — eigendecomposition,
Cholesky factorization for Newton steps, and the convexity test ∇²f ⪰ 0.
Diagonalize \(\nabla^2f = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T\). Along eigenvector
\(\mathbf{q}_i\), the function locally looks like a 1-D parabola with curvature \(\lambda_i\):
\(f(\mathbf{x}+t\mathbf{q}_i)\approx f(\mathbf{x})+t\nabla f^T\mathbf{q}_i+\frac{1}{2}\lambda_it^2\). All
\(\lambda_i>0\): local min (bowl). All \(\lambda_i<0\): local max. Mixed signs: saddle point. This
is precisely the second-order condition for convexity from the Convex Optimization masterclass:
\(\nabla^2f\succeq0 \iff\) all \(\lambda_i\geq0\) everywhere.
05
Composition
Chain Rule for Vectors
// composing Jacobians · shape bookkeeping · the backprop connection
The chain rule for vector functions composes Jacobians via matrix multiplication. Getting the order and
transposes right is purely a matter of shape bookkeeping — and shape bookkeeping is the entire content of
backpropagation. When a variable influences the loss through multiple paths, the Multivariate Chain Rule
requires summing the contributions from each path: \(\nabla_{\mathbf{x}} \mathcal{L} = \sum_{k} J_{\mathbf{f}_k}(\mathbf{x})^T \nabla_{\mathbf{f}_k} \mathcal{L}\).
The shapes must "telescope": (p×m)·(m×n) = (p×n) — the inner dimension m cancels. If your
matrix dimensions don't telescope, you have an error. This single shape-check is the most useful debugging
tool in matrix calculus.
This is the vector-Jacobian product (VJP) — the fundamental operation of reverse-mode
autodiff (§13). ∇_x L is obtained by left-multiplying the upstream gradient ∇_f L by the TRANSPOSED
Jacobian J_f^T. Every backprop rule is an instance of this formula with a specific f.
Worked Example: Two-Layer Composition
// z = Wx, a = σ(z), L = ℓ(a) — chain rule through two layers
Backward through \(\mathbf{z}=\mathbf{Wx}\): \(J_\mathbf{z}(\mathbf{x})=\mathbf{W}\), so:
\[\frac{\partial\mathcal{L}}{\partial\mathbf{x}} =
\mathbf{W}^T\frac{\partial\mathcal{L}}{\partial\mathbf{z}}, \qquad
\frac{\partial\mathcal{L}}{\partial\mathbf{W}} =
\frac{\partial\mathcal{L}}{\partial\mathbf{z}}\mathbf{x}^T\]
∂L/∂x uses W^T (chain rule through x). ∂L/∂W is the outer product δzᵀ x^T — a different
rule because W is the "parameter," not a chained variable. Both follow from the same Jacobian-transpose
logic applied to different "input" roles. This is BP2/BP3 from the Backpropagation masterclass.
06
Identities
Matrix Derivatives
// derivatives w.r.t. a matrix · trace, determinant, inverse · key building blocks
When the input is itself a matrix \(\mathbf{X}\), the derivative \(\partial f/\partial\mathbf{X}\) is a matrix
of the same shape as \(\mathbf{X}\), with entry \((i,j)\) equal to \(\partial f/\partial X_{ij}\). A handful of
identities cover almost every case that arises in ML.
tr(AX) and tr(X^TA) are the two most common patterns — they cover any "linear in X"
scalar. tr(X^TAX) covers quadratic forms. log|det X| appears in Gaussian log-likelihoods (covariance
determinant terms) — its derivative is the (transposed) inverse, central to the EM algorithm and Gaussian
MLE.
Worked Derivation: \(\partial\,\text{tr}(\mathbf{AX})/\partial\mathbf{X}\)
// Direct index computation — verifying the trace identity
Differentiate w.r.t. a single entry \(X_{kl}\): only the term with \(j=k, i=l\) survives:
\[\frac{\partial\,\text{tr}(\mathbf{AX})}{\partial X_{kl}} = A_{lk}\]
This single index computation justifies the entire trace-trick toolbox in §07 — every
other identity reduces to this one via cyclic permutation and the chain rule.
07
Technique
Trace Tricks
// converting scalars to traces · cyclic property · the universal technique
Any scalar can be written as the trace of a 1×1 matrix — and once in trace form, the cyclic property of the
trace lets you rearrange products freely. This is the single most useful trick for deriving matrix gradients of
scalar objectives. It also bridges the gap between matrix norms and linear algebra, most notably with the
Frobenius norm: \(\|\mathbf{A}\|_F^2 = \text{tr}(\mathbf{A}^T\mathbf{A}) = \sum_{i,j} A_{ij}^2\).
The Trace TrickUniversal Technique
\[\text{Any scalar } a = \text{tr}(a) \qquad \mathbf{u}^T\mathbf{v} = \text{tr}(\mathbf{u}^T\mathbf{v}) =
\text{tr}(\mathbf{v}\mathbf{u}^T)\]
\[\text{Cyclic property: } \text{tr}(\mathbf{ABC}) = \text{tr}(\mathbf{BCA}) = \text{tr}(\mathbf{CAB})\]
\[\text{(NOT } \text{tr}(\mathbf{ABC})=\text{tr}(\mathbf{ACB}) \text{ — order within the cycle matters!)}\]
The cyclic property lets you move the matrix you're differentiating with respect to into
the "tr(AX)" position, where the identity from §06 applies directly. This is how you derive ∂/∂X tr(X^T AX)
= (A+A^T)X: write d(tr(X^TAX)) = tr((dX)^TAX) + tr(X^TA dX) = tr((AX)^T dX) + tr((A^TX)^TdX) — wait,
cleaner: use cyclic + transpose properties together.
Worked Example: Deriving \(\partial/\partial\mathbf{X}\,\text{tr}(\mathbf{X}^T\mathbf{AX})\)
// Combining differentials with the trace trick
1
Take the differential:
\[d(\text{tr}(\mathbf{X}^T\mathbf{AX})) = \text{tr}(d(\mathbf{X}^T\mathbf{AX})) =
\text{tr}((d\mathbf{X})^T\mathbf{AX} + \mathbf{X}^T\mathbf{A}\,d\mathbf{X})\]
2
Use \(\text{tr}((d\mathbf{X})^T\mathbf{AX}) = \text{tr}((\mathbf{AX})^Td\mathbf{X})\) (transpose inside
trace doesn't change value for scalars... more precisely tr(B^T) = tr(B), and (d\mathbf{X})^T\mathbf{AX}
is the transpose of \(\mathbf{X}^T\mathbf{A}^Td\mathbf{X}\)):
\[\text{tr}((d\mathbf{X})^T\mathbf{AX}) = \text{tr}(\mathbf{X}^T\mathbf{A}^Td\mathbf{X})\]
Pattern: tr(X^TBX) where the result is (B+B^T)X. For symmetric A: result is 2AX. This
identity is THE workhorse for deriving gradients of quadratic objectives w.r.t. matrices — used directly
in §11 (PCA).
// x^TAx · gradients and Hessians · symmetric vs general A · connections to optimization
Quadratic forms \(\mathbf{x}^T\mathbf{A}\mathbf{x}\) are everywhere in ML — squared losses, regularizers,
covariance terms, Gaussian exponents, Newton steps. Their derivatives are simple, fixed formulas worth
memorizing completely. Crucially, notice that only the symmetric part of \(\mathbf{A}\) contributes to the value:
\(\mathbf{x}^T\mathbf{A}\mathbf{x} = \mathbf{x}^T \left( \frac{\mathbf{A} + \mathbf{A}^T}{2} \right) \mathbf{x}\),
which is why we almost always assume \(\mathbf{A}\) is symmetric when discussing quadratic forms.
Almost every "matrix calculus derivation" in ML reduces to applying these four formulas
to the appropriate A, a, or combination. The general-A→symmetric simplification matters: in practice, if A
isn't symmetric, you can always replace it with (A+A^T)/2 without changing x^TAx — so WLOG assume A
symmetric.
Substitute u=x−μ (constant shift doesn't affect the gradient w.r.t. x, since ∂u/∂x = I)
and apply the symmetric quadratic form rule with A=Σ^{-1}. This gradient appears directly in the
multivariate Gaussian log-likelihood — and its negative is the "pull toward the mean, scaled by inverse
covariance" direction used in Mahalanobis-distance-based anomaly detection.
Newton's Method as a Quadratic-Form Minimization
▣
Why Quadratic Forms Are THE Local Model
Every twice-differentiable function looks locally like a quadratic form:
\(f(\mathbf{x}+\boldsymbol{\delta})\approx f(\mathbf{x})+\nabla
f^T\boldsymbol{\delta}+\frac{1}{2}\boldsymbol{\delta}^T\nabla^2f\,\boldsymbol{\delta}\). Minimizing this
quadratic model over \(\boldsymbol{\delta}\) (set gradient w.r.t. \(\boldsymbol{\delta}\) to zero, using
\(\nabla f+\nabla^2f\,\boldsymbol{\delta}=0\)) gives the Newton step
\(\boldsymbol{\delta}=-(\nabla^2f)^{-1}\nabla f\). Every second-order optimization method is, at its core,
"solve the quadratic-form gradient formula above for the local model."
09
Application
Deriving Linear Regression
// the normal equations · full matrix calculus derivation · geometric interpretation
Linear regression is the cleanest possible application of matrix calculus: a quadratic objective, a linear
model, and a closed-form solution that follows from setting the gradient to zero.
Apply the quadratic-form rule (§08) to the first term with \(\mathbf{A}=\mathbf{X}^T\mathbf{X}\)
(symmetric), and the linear rule to the second term:
\[\nabla_\mathbf{w}(\mathbf{w}^T\mathbf{X}^T\mathbf{Xw}) = 2\mathbf{X}^T\mathbf{Xw}, \qquad
\nabla_\mathbf{w}(-2\mathbf{y}^T\mathbf{Xw}) = -2\mathbf{X}^T\mathbf{y}\]
Set \(\nabla_\mathbf{w}\mathcal{L}=\mathbf{0}\):
\[\mathbf{X}^T\mathbf{Xw} = \mathbf{X}^T\mathbf{y} \quad\Rightarrow\quad \boxed{\mathbf{w}^* =
(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}}\]
The Normal Equations. Since ∇²L = 2X^TX ⪰ 0 (always PSD, §06 of Convex Optimization), L
is convex — this stationary point is a GLOBAL minimum. If X^TX is singular (rank-deficient), use the
pseudoinverse w* = X^+y. ∎
The Normal Equations literally say: the residual vector is orthogonal to every column
of X — i.e., orthogonal to the entire column space. \(\mathbf{Xw}^*\) is the orthogonal projection of \(\mathbf{y}\) onto \(\text{Col}(\mathbf{X})\).
The matrix \(\mathbf{X}^\dagger = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\) is the Moore-Penrose Pseudoinverse of \(\mathbf{X}\).
The matrix \(\mathbf{H} = \mathbf{X}\mathbf{X}^\dagger\) is known as the Hat Matrix because it
"puts a hat" on \(\mathbf{y}\). It is an idempotent matrix (\(\mathbf{H}^2 = \mathbf{H}\)), a hallmark of projection operators.
10
Application
Deriving Logistic Regression
// sigmoid · cross-entropy · gradient and Hessian via the chain rule
Logistic regression composes a linear map, the sigmoid, and the binary cross-entropy loss. Its gradient and
Hessian follow directly from the chain rule (§05) plus two elementary derivative facts about the sigmoid.
σ'(z) = σ(z)(1−σ(z)) is the single most important fact about the sigmoid for matrix
calculus — proven by direct differentiation: d/dz[1/(1+e^{-z})] = e^{-z}/(1+e^{-z})² = σ(z)·(1−σ(z)).
Chain through \(\sigma\): \(\partial\mathcal{L}/\partial z_i =
(\partial\mathcal{L}/\partial\hat{y}_i)\sigma'(z_i)\). Substitute and simplify (the same cancellation as
softmax+CCE in the Backpropagation masterclass):
\[\frac{\partial\mathcal{L}}{\partial z_i} =
\left(-\frac{y_i}{\hat{y}_i}+\frac{1-y_i}{1-\hat{y}_i}\right)\hat{y}_i(1-\hat{y}_i) = \hat{y}_i-y_i\]
3
Chain through \(z_i=\mathbf{w}^T\mathbf{x}_i\) (so \(\partial z_i/\partial\mathbf{w}=\mathbf{x}_i\)) and
sum over samples:
\[\nabla_\mathbf{w}\mathcal{L} = \sum_i(\hat{y}_i-y_i)\mathbf{x}_i =
\mathbf{X}^T(\hat{\mathbf{y}}-\mathbf{y})\]
Identical "prediction minus label" form as softmax+CCE — both are exponential-family
GLMs, and the gradient of any GLM negative log-likelihood w.r.t. the linear predictor is always
(prediction − target). ∎
D = diag(ŷᵢ(1−ŷᵢ)) has strictly positive diagonal entries (0 < ŷᵢ < 1 always), so
X^TDX is PSD by the same v^T(X^TDX)v = ‖D^{1/2}Xv‖² ≥ 0 argument as MSE. Convex — guaranteed global
optimum. This Hessian is also the Fisher Information Matrix for logistic regression, and is used directly
in Newton's method (Iteratively Reweighted Least Squares, IRLS).
PCA asks: find the unit direction \(\mathbf{v}\) along which the projected data has maximum variance. This is a
constrained optimization problem — and matrix calculus plus a Lagrange multiplier turns it directly into an
eigenvalue problem. Intuitively, maximizing the variance of the projection is equivalent to minimizing the
mean-squared distance (reconstruction error) between the original data points and their projections, a result
of the Pythagorean theorem.
// From variance maximization to the eigenvalue equation
1
Variance of data projected onto direction \(\mathbf{v}\) (data centered, covariance \(\mathbf{S}\)):
\[\text{Var}(\mathbf{Xv}) = \mathbf{v}^T\mathbf{S}\mathbf{v}, \quad
\mathbf{S}=\frac{1}{n}\mathbf{X}^T\mathbf{X}\]
Maximize subject to \(\|\mathbf{v}\|=1\) (otherwise variance is unbounded — scale v up arbitrarily).
2
Form the Lagrangian (bridging to the Lagrange Multipliers masterclass):
\[\mathcal{L}(\mathbf{v},\lambda) = \mathbf{v}^T\mathbf{S}\mathbf{v} - \lambda(\mathbf{v}^T\mathbf{v}-1)\]
3
Apply the symmetric quadratic-form gradient rule (§08) to both terms:
\[\nabla_\mathbf{v}\mathcal{L} = 2\mathbf{S}\mathbf{v} - 2\lambda\mathbf{v} = \mathbf{0}\]
The stationarity condition for variance maximization is EXACTLY the eigenvalue equation.
v must be an eigenvector of S, and λ is the corresponding eigenvalue. Substituting back: Var(Xv) = v^TSv =
v^T(λv) = λ‖v‖² = λ. The variance along v equals its eigenvalue — so the maximum-variance direction is the
eigenvector with the LARGEST eigenvalue. ∎
PCA: Two Lines of Matrix Calculus
The entire derivation of PCA's first principal component is: (1) write the Lagrangian for constrained
variance maximization, (2) differentiate using the symmetric quadratic-form rule from §08. The result — that
principal components are eigenvectors of the covariance matrix, ordered by eigenvalue — falls out as a direct
algebraic consequence, with no additional machinery. This is the same Lagrangian framework as the Lagrange
Multipliers masterclass, applied to a quadratic objective via matrix calculus.
Subsequent Components — Sequential Deflation
▣
Why All Eigenvectors Solve the Same Equation
To find the second principal component, maximize \(\mathbf{v}^T\mathbf{S}\mathbf{v}\) subject to
\(\|\mathbf{v}\|=1\) AND \(\mathbf{v}\perp\mathbf{v}_1\) (orthogonal to the first PC). Adding this
orthogonality constraint with another Lagrange multiplier and differentiating yields the same equation
\(\mathbf{S}\mathbf{v}=\lambda\mathbf{v}\) — but restricted to the subspace orthogonal to
\(\mathbf{v}_1\), which (since \(\mathbf{S}\) is symmetric) is automatically satisfied by the eigenvector
with the second-largest eigenvalue. By induction, all principal components are simply all eigenvectors of
\(\mathbf{S}\), sorted by eigenvalue — the Spectral Theorem guarantees they exist, are real, and are
mutually orthogonal.
12
Application
Deriving Backprop
// the chain rule for matrices, applied to a full layer · recap via matrix calculus
Backpropagation (covered in full depth in its own masterclass) is, at its mathematical core, repeated
application of the vector chain rule from §05 — with the matrix derivative identities from §06–08 supplying the
rules for linear layers.
One Layer, Derived via Matrix CalculusBackprop Recap
Every one of these four lines is a direct instance of an identity already established:
diagonal-Jacobian VJP (§03), the trace trick applied to a differential (§07), the vector chain rule with a
linear Jacobian (§05), and the identity Jacobian. Backpropagation invents NO new mathematics — it is matrix
calculus applied systematically, layer by layer, via the chain rule.
The ∂L/∂W Derivation via Trace Trick — In Full
// Deriving ∂L/∂W = δxᵀ using differentials and the trace trick
1
Write the differential of the loss through \(\mathbf{z}\):
\[d\mathcal{L} = \boldsymbol{\delta}^Td\mathbf{z} = \boldsymbol{\delta}^T(d\mathbf{W})\mathbf{x}\]
2
\(\boldsymbol{\delta}^T(d\mathbf{W})\mathbf{x}\) is a scalar, so it equals its own trace:
\[d\mathcal{L} = \text{tr}(\boldsymbol{\delta}^T(d\mathbf{W})\mathbf{x}) =
\text{tr}(\mathbf{x}\boldsymbol{\delta}^Td\mathbf{W}) \quad\text{(cyclic property, §07)}\]
3
Match against \(d\mathcal{L}=\text{tr}(\mathbf{A}\,d\mathbf{W}) \Rightarrow
\partial\mathcal{L}/\partial\mathbf{W}=\mathbf{A}^T\) (§06):
\[\frac{\partial\mathcal{L}}{\partial\mathbf{W}} = (\mathbf{x}\boldsymbol{\delta}^T)^T =
\boldsymbol{\delta}\mathbf{x}^T \quad\blacksquare\]
This is BP3 from the Backpropagation masterclass, now derived purely from the
differential + trace-trick toolkit built in this page.
13
Computation
Automatic Differentiation
// from formulas to algorithms · forward vs reverse mode · how frameworks implement these
rules
Everything derived so far is a formula. Automatic differentiation (AD) is the algorithm that
applies these formulas mechanically, for arbitrary compositions, without a human re-deriving anything.
Forward vs Reverse Mode — Matrix Calculus ViewAD Modes
Forward mode is equivalent to augmenting every scalar with an infinitesimal "dual number" \(\epsilon\) such that \(\epsilon^2=0\).
Reverse mode is the workhorse of deep learning because it computes the gradient of one output (the loss) with respect to millions
of parameters in a single backward pass, regardless of network depth.
A Minimal Reverse-Mode Engine
Each op stores its own matrix-calculus rule
# Every operation defines forward AND its local VJP rule classMatMul: defforward(self, W, x): return W @ x defbackward(self, grad_z): # §05/§12 identities — VJP rules, not re-derived each
time
grad_W = grad_z @ self.x.T # δ xᵀ
grad_x = self.W.T @ grad_z # Wᵀ δ return grad_W, grad_x
# AD frameworks are libraries of exactly these (forward, backward) pairs — # every pair IS one of the identities derived in §03–§08.
▣
The Punchline
PyTorch, JAX, and TensorFlow contain no "magic." Every .backward() call walks the computation
graph and, at each node, applies a pre-derived matrix calculus identity — exactly the ones catalogued in
this reference. Knowing matrix calculus means being able to read and even write
autograd.Function implementations from first principles.
14
Errors
Common Pitfalls
// the mistakes everyone makes · how to catch them fast
The Six Most Common Errors
Shape mismatches: \(\partial\mathcal{L}/\partial\mathbf{W}\) must have the SAME shape as
\(\mathbf{W}\). If you compute something with the wrong shape, you have a transpose error somewhere. Check
this first, always.
Forgetting the factor of 2: \(\nabla_\mathbf{x}\|\mathbf{x}\|^2=2\mathbf{x}\), not
\(\mathbf{x}\). This single factor of 2 (or its absence) is the most common silent bug — it doesn't change
the direction of gradient descent, only the effective learning rate, so it can go unnoticed for a long time.
Assuming \(\mathbf{A}\) is symmetric when it isn't:
\(\nabla_\mathbf{x}(\mathbf{x}^T\mathbf{A}\mathbf{x})=(\mathbf{A}+\mathbf{A}^T)\mathbf{x}\), which only
simplifies to \(2\mathbf{A}\mathbf{x}\) if \(\mathbf{A}=\mathbf{A}^T\). Many "derivation errors" online
silently assume symmetry without stating it.
Treating non-commuting matrix products as scalars:
\(d(\mathbf{AB})=(d\mathbf{A})\mathbf{B}+\mathbf{A}(d\mathbf{B})\) — NOT
\((d\mathbf{B})\mathbf{A}+\mathbf{B}(d\mathbf{A})\). Order matters, always.
Incorrect trace cyclic permutations:
\(\text{tr}(\mathbf{ABC})=\text{tr}(\mathbf{BCA})=\text{tr}(\mathbf{CAB})\), but
\(\text{tr}(\mathbf{ABC})\neq\text{tr}(\mathbf{ACB})\) in general (only cyclic shifts are valid, not
arbitrary permutations).
Mixing numerator/denominator layout mid-derivation: Pick ONE convention (this page uses
denominator/ML layout for gradients, standard m×n for Jacobians) and stay consistent. Switching layouts
mid-derivation is the #1 cause of "off by a transpose" errors.
Compare your analytical formula against this finite-difference approximation for a few
random entries. If they disagree beyond ~10⁻⁴ relative error, your analytical derivative has a bug —
almost always one of the six pitfalls above. This is the universal "trust but verify" step for any
hand-derived gradient.
15
Reference
Cheat Sheet
// every identity from this page, in one place
Vectors → Scalars
Linear
f = aᵀx → ∇f = a
∇²f = 0
Quadratic (sym A)
f = xᵀAx → ∇f = 2Ax
∇²f = 2A
Quadratic (general A)
f = xᵀAx → ∇f = (A+Aᵀ)x
Norm
f = ‖x‖² → ∇f = 2x
Matrices → Scalars (Trace Forms)
tr(AX)
∂/∂X tr(AX) = Aᵀ
tr(XᵀA)
∂/∂X tr(XᵀA) = A
tr(AXB)
∂/∂X tr(AXB) = AᵀBᵀ
tr(XᵀAX)
∂/∂X = (A+Aᵀ)X
log|det X|
∂/∂X = (Xᵀ)⁻¹
‖X‖_F²
∂/∂X = 2X
Chain Rule & Backprop Rules
Vector chain rule
∇ₓL = Jᵥᵀ ∇ᵧL
z = Wx + b
∂L/∂x = Wᵀδ
∂L/∂W = δxᵀ, ∂L/∂b = δ
Elementwise σ(z)
∂L/∂z = σ'(z) ⊙ ∂L/∂a
Softmax + CCE
∂L/∂z = ŷ − y
Sigmoid + BCE
∂L/∂z = ŷ − y
MSE
∇w = 2Xᵀ(Xw−y)
Differential Algebra
Sum / scale
d(A+B)=dA+dB
d(αA)=α dA
Product
d(AB)=(dA)B+A(dB)
Trace
d(tr A)=tr(dA)
Inverse
d(A⁻¹)=−A⁻¹(dA)A⁻¹
16
Synthesis
The Complete Mental Model
// everything connected · one diagram · the unified workflow
Fig 1. Complete matrix calculus map — from the differential viewpoint, through
Jacobians/Hessians, trace tricks, and the chain rule, into quadratic forms, every major derivation, and
automatic differentiation.
The complete workflow, every time:
Fix shapes and convention. Determine the shapes of input and output. In ML, gradients of
scalars match the shape of the variable (denominator/ML layout).
Write the differential. Express \(df\) (or \(d\mathcal{L}\)) using the algebra of
differentials (§02): sum, product, trace rules. This avoids index chasing entirely.
Apply the trace trick if matrix-valued. Use cyclic permutation (§07) to massage the
expression into the form \(\text{tr}(\mathbf{A}\,d\mathbf{X})\), then read off \(\partial
f/\partial\mathbf{X}=\mathbf{A}^T\) (§06).
For compositions, chain Jacobians.
\(\nabla_\mathbf{x}\mathcal{L}=J_\mathbf{f}(\mathbf{x})^T\nabla_\mathbf{f}\mathcal{L}\) (§05) — transpose the
local Jacobian and multiply by the upstream gradient. This is the VJP, the atom of reverse-mode AD (§13).
Recognize quadratic forms.
\(\mathbf{x}^T\mathbf{A}\mathbf{x}\to(\mathbf{A}+\mathbf{A}^T)\mathbf{x}\) (§08) is the single most common
pattern — MSE, regularizers, Mahalanobis distance, Newton's method, PCA variance.
Verify with finite differences. Before trusting any hand-derived formula, check it
numerically (§14). This catches the factor-of-2, symmetry, and transpose errors that are otherwise invisible.
Apply to real models. Linear regression (§09), logistic regression (§10), PCA (§11), and
every layer of a neural network (§12) are all instances of steps 1–5 applied to specific \(f\).
Matrix Calculus Is Scalar Calculus With Bookkeeping
There is no new calculus here — only new bookkeeping. Every derivative is still a limit of a difference
quotient; every chain rule is still "multiply local derivatives along the path." What matrix calculus adds is
a systematic notation (differentials, traces, Jacobians) for tracking which derivative goes with which shape,
so that the same five-step workflow above scales from a single scalar \(x\) to a billion-parameter neural
network without ever writing an index sum by hand.
The ultimate mental model is to alternate between Vectorization (viewing the system as a whole to see geometric structures like projections and ellipses) and Scalarization (viewing individual entries to verify specific partial derivatives). Master the differential viewpoint and the trace trick, and every gradient in this reference — and every gradient PyTorch computes for you — becomes a two-line derivation.