Matrix Calculus

01

Notation

Scalars, Vectors, Matrices

// notation conventions · dimension bookkeeping · numerator vs denominator layout

Matrix calculus extends ordinary calculus to functions whose inputs and/or outputs are vectors or matrices. The single biggest source of confusion in the field is not the calculus itself — it's notation. Before any derivative is computed, fix the shapes and the layout convention.

The Four Cases

Input	Output	Derivative Object	Shape	Example
scalar \(x\)	scalar \(y\)	\(dy/dx\)	scalar	\(y=x^2 \Rightarrow dy/dx=2x\)
vector \(\mathbf{x}\in\mathbb{R}^n\)	scalar \(y\)	gradient \(\nabla_\mathbf{x}y\)	\(n\times1\)	\(y=\mathbf{a}^T\mathbf{x} \Rightarrow \nabla y=\mathbf{a}\)
vector \(\mathbf{x}\in\mathbb{R}^n\)	vector \(\mathbf{y}\in\mathbb{R}^m\)	Jacobian \(J\)	\(m\times n\)	\(\mathbf{y}=\mathbf{Ax} \Rightarrow J=\mathbf{A}\)
matrix \(\mathbf{X}\in\mathbb{R}^{m\times n}\)	scalar \(y\)	matrix gradient \(\partial y/\partial\mathbf{X}\)	\(m\times n\)	\(y=\text{tr}(\mathbf{AX})\Rightarrow \partial y/\partial\mathbf{X}=\mathbf{A}^T\)

Numerator vs Denominator Layout

Numerator Layout

The shape of \(\partial y/\partial\mathbf{x}\) matches the shape of \(y\) "stacked over" \(\mathbf{x}\): for scalar \(y\) and \(\mathbf{x}\in\mathbb{R}^n\), the gradient is a row vector \(\in\mathbb{R}^{1\times n}\).

For \(\mathbf{y}\in\mathbb{R}^m\), \(\mathbf{x}\in\mathbb{R}^n\): the Jacobian is \(m\times n\) — rows index outputs, columns index inputs. This is mathematically elegant because it preserves the chain rule as simple matrix multiplication \(J_{g \circ f} = J_g J_f\).

Denominator Layout (ML convention)

The shape of \(\partial y/\partial\mathbf{x}\) matches the shape of \(\mathbf{x}\): for scalar \(y\), the gradient \(\nabla_\mathbf{x}y\) is a column vector \(\in\mathbb{R}^{n\times1}\), same shape as \(\mathbf{x}\) itself.

This is the convention used throughout deep learning: w.grad always has the same shape as w, which makes gradient descent updates \(\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_\mathbf{w} \mathcal{L}\) dimensionally consistent without transposing. This page uses denominator layout unless explicitly stated.

▣

The Golden Rule

Whatever convention you pick, the gradient of a scalar w.r.t. any tensor has the same shape as that tensor. If \(\partial \mathcal{L}/\partial \mathbf{W}\) doesn't have the same shape as \(\mathbf{W}\), something is transposed wrong. This single check catches the majority of matrix calculus bugs — and is exactly what PyTorch enforces for every .grad attribute.

Notation Used Throughout This Page

Scalars: lowercase italic — \(x, y, \alpha, \lambda\)
Vectors: lowercase bold — \(\mathbf{x}, \mathbf{w}, \mathbf{a}\) (column vectors by default)
Matrices: uppercase bold — \(\mathbf{X}, \mathbf{W}, \mathbf{A}\)
Gradient of scalar \(f\) w.r.t. vector \(\mathbf{x}\): \(\nabla_\mathbf{x}f\) or \(\partial f/\partial\mathbf{x}\) — always same shape as \(\mathbf{x}\)
Jacobian of vector \(\mathbf{y}\) w.r.t. vector \(\mathbf{x}\): \(J = \partial\mathbf{y}/\partial\mathbf{x} \in \mathbb{R}^{m\times n}\), entry \((i,j) = \partial y_i/\partial x_j\)
Hessian of scalar \(f\) w.r.t. vector \(\mathbf{x}\): \(\nabla^2 f \in \mathbb{R}^{n\times n}\), entry \((i,j) = \partial^2 f/\partial x_i\partial x_j\)

02

Differentials

The Differential Viewpoint

// total differential · why differentials beat index chasing · the workflow

The single most powerful technique in matrix calculus is to stop thinking in terms of \(\partial y/\partial x_{ij}\) (index chasing) and start thinking in terms of differentials \(df\) — small linear perturbations. Differentials are the matrix-notation version of “first-order Taylor”: they keep only the linear (in \(d\mathbf{x}\)) part of the change, so you can manipulate derivatives with the same algebra you already know. Differentials obey simple algebraic rules and translate directly into gradients via the Identification Theorem.

The Total Differential & IdentificationCore Idea

\[df = \sum_{i}\frac{\partial f}{\partial x_i}dx_i = \nabla f(\mathbf{x})^T d\mathbf{x} \quad\text{(vector case)}\] \[df = \text{tr}\left(\left(\frac{\partial f}{\partial \mathbf{X}}\right)^T d\mathbf{X}\right) \quad\text{(matrix case)}\]

The Identification Theorem (Magnus & Neudecker) states that if you can massage your differential into the form \(df = \mathbf{a}^T d\mathbf{x}\), then \(\nabla \mathbf{x} = \mathbf{a}\). If it's in the form \(df = \text{tr}(\mathbf{A} d\mathbf{X})\), then \(\partial f/\partial \mathbf{X} = \mathbf{A}^T\). This turns calculus into pure algebraic rearrangement.

Differential Rules (Algebra of d)

Rules for Manipulating DifferentialsAlgebra

\[d(A+B) = dA + dB \qquad d(\alpha A) = \alpha\, dA\] \[d(AB) = (dA)B + A(dB) \quad\text{(product rule — order matters!)}\] \[d(\text{tr}(A)) = \text{tr}(dA) \qquad d(A^{-1}) = -A^{-1}(dA)A^{-1}\] \[d(\log\det A) = \text{tr}(A^{-1}dA)\]

These rules look exactly like scalar calculus rules — because they ARE the scalar rules applied entrywise, just packaged into matrix notation. The product rule d(AB)=(dA)B+A(dB) preserves the ORDER of multiplication (matrices don't commute) — this is the only real new thing to remember.

Worked Example: \(f(\mathbf{x}) = \mathbf{x}^T\mathbf{A}\mathbf{x}\)

// The differential workflow, end to end

1

Write the differential using the product rule: \[df = d(\mathbf{x}^T\mathbf{A}\mathbf{x}) = (d\mathbf{x})^T\mathbf{A}\mathbf{x} + \mathbf{x}^T\mathbf{A}(d\mathbf{x})\]

2

Both terms are scalars, so each equals its own transpose. Transpose the first term: \[(d\mathbf{x})^T\mathbf{A}\mathbf{x} = (\mathbf{x}^T\mathbf{A}^T(d\mathbf{x}))^T = \mathbf{x}^T\mathbf{A}^T d\mathbf{x} \quad\text{(scalar = its transpose)}\]

3

Combine: \[df = \mathbf{x}^T\mathbf{A}^Td\mathbf{x} + \mathbf{x}^T\mathbf{A}\,d\mathbf{x} = \mathbf{x}^T(\mathbf{A}+\mathbf{A}^T)d\mathbf{x} = \left[(\mathbf{A}+\mathbf{A}^T)\mathbf{x}\right]^Td\mathbf{x}\] \[\therefore\quad \nabla_\mathbf{x}f = (\mathbf{A}+\mathbf{A}^T)\mathbf{x}\]

Read off the gradient directly as the coefficient of dx (transposed to match denominator layout). If A is symmetric: ∇f = 2Ax. No index sums, no Kronecker deltas — pure algebra. ∎

03

Vector Calc

The Jacobian Matrix

// definition · examples · geometric meaning · special diagonal cases

The Jacobian is the matrix of all first-order partial derivatives of a vector-valued function. It generalizes the scalar derivative to the multivariate, multi-output case and is the local linear approximation of \(\mathbf{f}\) at a point. Conceptually, it represents the linear transformation that maps the tangent space of the input space to the tangent space of the output space.

The Jacobian MatrixDefinition

\[\mathbf{f}:\mathbb{R}^n\to\mathbb{R}^m, \quad J_\mathbf{f}(\mathbf{x}) = \frac{\partial \mathbf{f}}{\partial\mathbf{x}} = \begin{pmatrix}\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n}\\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}\end{pmatrix}\in\mathbb{R}^{m\times n}\] \[\mathbf{f}(\mathbf{x}+\boldsymbol{\delta}) \approx \mathbf{f}(\mathbf{x}) + J_\mathbf{f}(\mathbf{x})\boldsymbol{\delta} \quad\text{(local linear approximation)}\]

Row i, column j is ∂fᵢ/∂xⱼ — how output i changes per unit change in input j. The Jacobian is the best linear approximation to f near x. For m=1 (scalar output), the Jacobian is a 1×n row vector — the transpose of the gradient in denominator layout.

Jacobians of Building-Block Operations

Function	Jacobian	Structure
\(\mathbf{y}=\mathbf{Ax}\) (linear)	\(J=\mathbf{A}\)	\(m\times n\), constant
\(\mathbf{y}=\mathbf{x}+\mathbf{b}\)	\(J=\mathbf{I}\)	\(n\times n\) identity
\(y_i=\sigma(x_i)\) (elementwise)	\(J=\text{diag}(\sigma'(x_i))\)	diagonal — cheap!
\(\mathbf{y}=\text{softmax}(\mathbf{x})\)	\(J=\text{diag}(\mathbf{y})-\mathbf{yy}^T\)	dense, low-rank correction — reflects the "sum-to-one" constraint.
\(y=\\|\mathbf{x}\\|^2\)	\(J=2\mathbf{x}^T\)	\(1\times n\)
\(\mathbf{y}=\mathbf{x}\odot\mathbf{z}\) (Hadamard)	\(\partial\mathbf{y}/\partial\mathbf{x}=\text{diag}(\mathbf{z})\)	diagonal

▣

Why Diagonal Jacobians Matter

Elementwise operations (activations, dropout masks, Hadamard products) always have diagonal Jacobians — full \(n\times n\) matrices that are mostly zero. In practice these are never materialized; the Jacobian-vector product reduces to a simple elementwise multiply: \(J\mathbf{v} = f'(\mathbf{x})\odot\mathbf{v}\). This is why elementwise nonlinearities are computationally "free" in backprop.

04

Curvature

The Hessian Matrix

// second derivatives · symmetry · quadratic approximation · eigenvalue interpretation

The Hessian is the Jacobian of the gradient — the matrix of all second-order partial derivatives of a scalar function. It describes local curvature and is the basis of Newton's method, convexity analysis, and second-order optimization. In high-dimensional deep learning landscapes, the Hessian helps identify not just minima and maxima, but also saddle points, which are the primary obstacles to gradient-based optimization.

The Hessian MatrixDefinition

\[f:\mathbb{R}^n\to\mathbb{R}, \quad \nabla^2f(\mathbf{x}) = \begin{pmatrix}\frac{\partial^2f}{\partial x_1^2} & \cdots & \frac{\partial^2f}{\partial x_1\partial x_n}\\ \vdots & \ddots & \vdots \\ \frac{\partial^2f}{\partial x_n\partial x_1} & \cdots & \frac{\partial^2f}{\partial x_n^2}\end{pmatrix}\in\mathbb{R}^{n\times n}\] \[f(\mathbf{x}+\boldsymbol{\delta}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^T\boldsymbol{\delta} + \frac{1}{2}\boldsymbol{\delta}^T\nabla^2f(\mathbf{x})\boldsymbol{\delta} \quad\text{(2nd-order Taylor)}\]

By Clairaut's/Schwarz's theorem, if f is C² (continuous second partials), the Hessian is always SYMMETRIC: ∂²f/∂xᵢ∂xⱼ = ∂²f/∂xⱼ∂xᵢ. This symmetry is exploited everywhere — eigendecomposition, Cholesky factorization for Newton steps, and the convexity test ∇²f ⪰ 0.

Hessian Examples

Function	Gradient	Hessian
\(f=\mathbf{a}^T\mathbf{x}\)	\(\nabla f=\mathbf{a}\)	\(\nabla^2f=\mathbf{0}\)
\(f=\mathbf{x}^T\mathbf{A}\mathbf{x}\) (\(A\) sym.)	\(\nabla f=2\mathbf{Ax}\)	\(\nabla^2f=2\mathbf{A}\)
\(f=\\|\mathbf{Xw}-\mathbf{y}\\|^2\)	\(\nabla f=2\mathbf{X}^T(\mathbf{Xw}-\mathbf{y})\)	\(\nabla^2f=2\mathbf{X}^T\mathbf{X}\)
\(f=-\sum y_i\log\sigma(\mathbf{w}^T\mathbf{x}_i)+\ldots\)	\(\nabla f=\mathbf{X}^T(\hat{\mathbf{y}}-\mathbf{y})\)	\(\nabla^2f=\mathbf{X}^T\mathbf{D}\mathbf{X}\), \(D_{ii}=\hat y_i(1-\hat y_i)\)

Eigenvalues and Curvature

▣

Geometric Reading of the Hessian

Diagonalize \(\nabla^2f = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T\). Along eigenvector \(\mathbf{q}_i\), the function locally looks like a 1-D parabola with curvature \(\lambda_i\): \(f(\mathbf{x}+t\mathbf{q}_i)\approx f(\mathbf{x})+t\nabla f^T\mathbf{q}_i+\frac{1}{2}\lambda_it^2\). All \(\lambda_i>0\): local min (bowl). All \(\lambda_i<0\): local max. Mixed signs: saddle point. This is precisely the second-order condition for convexity from the Convex Optimization masterclass: \(\nabla^2f\succeq0 \iff\) all \(\lambda_i\geq0\) everywhere.

05

Composition

Chain Rule for Vectors

// composing Jacobians · shape bookkeeping · the backprop connection

The chain rule for vector functions composes Jacobians via matrix multiplication. Getting the order and transposes right is purely a matter of shape bookkeeping — and shape bookkeeping is the entire content of backpropagation. When a variable influences the loss through multiple paths, the Multivariate Chain Rule requires summing the contributions from each path: \(\nabla_{\mathbf{x}} \mathcal{L} = \sum_{k} J_{\mathbf{f}_k}(\mathbf{x})^T \nabla_{\mathbf{f}_k} \mathcal{L}\).

Vector Chain RuleJacobian Composition

\[\mathbf{z}=g(\mathbf{f}(\mathbf{x})), \quad \mathbf{x}\in\mathbb{R}^n,\;\mathbf{f}(\mathbf{x})\in\mathbb{R}^m,\;\mathbf{z}\in\mathbb{R}^p\] \[J_\mathbf{z}(\mathbf{x}) = J_g(\mathbf{f}(\mathbf{x}))\cdot J_\mathbf{f}(\mathbf{x}) \in\mathbb{R}^{p\times n} = (p\times m)(m\times n)\]

The shapes must "telescope": (p×m)·(m×n) = (p×n) — the inner dimension m cancels. If your matrix dimensions don't telescope, you have an error. This single shape-check is the most useful debugging tool in matrix calculus.

Special Case: Scalar Loss (the Backprop Case)

Chain Rule for a Scalar LossMost Common Case

\[\mathcal{L}=g(\mathbf{f}(\mathbf{x})), \quad \mathcal{L}\in\mathbb{R} \;\Rightarrow\; p=1\] \[\frac{\partial\mathcal{L}}{\partial\mathbf{x}}^T = \frac{\partial\mathcal{L}}{\partial\mathbf{f}}^TJ_\mathbf{f}(\mathbf{x}) \quad\text{(row vector = row vector × matrix)}\] \[\text{In denominator (column) layout:}\quad \nabla_\mathbf{x}\mathcal{L} = J_\mathbf{f}(\mathbf{x})^T\nabla_\mathbf{f}\mathcal{L}\]

This is the vector-Jacobian product (VJP) — the fundamental operation of reverse-mode autodiff (§13). ∇_x L is obtained by left-multiplying the upstream gradient ∇_f L by the TRANSPOSED Jacobian J_f^T. Every backprop rule is an instance of this formula with a specific f.

Worked Example: Two-Layer Composition

// z = Wx, a = σ(z), L = ℓ(a) — chain rule through two layers

1

Forward: \(\mathbf{z}=\mathbf{Wx}\in\mathbb{R}^m\), \(\mathbf{a}=\sigma(\mathbf{z})\in\mathbb{R}^m\), \(\mathcal{L}=\ell(\mathbf{a})\in\mathbb{R}\).

2

Backward through \(\sigma\) (diagonal Jacobian): \[\frac{\partial\mathcal{L}}{\partial\mathbf{z}} = J_\sigma^T\frac{\partial\mathcal{L}}{\partial\mathbf{a}} = \text{diag}(\sigma'(\mathbf{z}))\frac{\partial\mathcal{L}}{\partial\mathbf{a}} = \sigma'(\mathbf{z})\odot\frac{\partial\mathcal{L}}{\partial\mathbf{a}}\]

3

Backward through \(\mathbf{z}=\mathbf{Wx}\): \(J_\mathbf{z}(\mathbf{x})=\mathbf{W}\), so: \[\frac{\partial\mathcal{L}}{\partial\mathbf{x}} = \mathbf{W}^T\frac{\partial\mathcal{L}}{\partial\mathbf{z}}, \qquad \frac{\partial\mathcal{L}}{\partial\mathbf{W}} = \frac{\partial\mathcal{L}}{\partial\mathbf{z}}\mathbf{x}^T\]

∂L/∂x uses W^T (chain rule through x). ∂L/∂W is the outer product δzᵀ x^T — a different rule because W is the "parameter," not a chained variable. Both follow from the same Jacobian-transpose logic applied to different "input" roles. This is BP2/BP3 from the Backpropagation masterclass.

06

Identities

Matrix Derivatives

// derivatives w.r.t. a matrix · trace, determinant, inverse · key building blocks

When the input is itself a matrix \(\mathbf{X}\), the derivative \(\partial f/\partial\mathbf{X}\) is a matrix of the same shape as \(\mathbf{X}\), with entry \((i,j)\) equal to \(\partial f/\partial X_{ij}\). A handful of identities cover almost every case that arises in ML.

Core Matrix Derivative IdentitiesBuilding Blocks

\[\frac{\partial}{\partial\mathbf{X}}\text{tr}(\mathbf{AX}) = \mathbf{A}^T \qquad \frac{\partial}{\partial\mathbf{X}}\text{tr}(\mathbf{X}^T\mathbf{A}) = \mathbf{A}\] \[\frac{\partial}{\partial\mathbf{X}}\text{tr}(\mathbf{AXB}) = \mathbf{A}^T\mathbf{B}^T \qquad \frac{\partial}{\partial\mathbf{X}}\text{tr}(\mathbf{X}^T\mathbf{AX}) = (\mathbf{A}+\mathbf{A}^T)\mathbf{X}\] \[\frac{\partial}{\partial\mathbf{X}}\log|\det\mathbf{X}| = (\mathbf{X}^{-1})^T = (\mathbf{X}^T)^{-1}\] \[\frac{\partial}{\partial\mathbf{X}}\text{tr}(\mathbf{A}\mathbf{X}^{-1}\mathbf{B}) = -(\mathbf{X}^{-1})^T\mathbf{A}^T\mathbf{B}^T(\mathbf{X}^{-1})^T\]

tr(AX) and tr(X^TA) are the two most common patterns — they cover any "linear in X" scalar. tr(X^TAX) covers quadratic forms. log|det X| appears in Gaussian log-likelihoods (covariance determinant terms) — its derivative is the (transposed) inverse, central to the EM algorithm and Gaussian MLE.

Worked Derivation: \(\partial\,\text{tr}(\mathbf{AX})/\partial\mathbf{X}\)

// Direct index computation — verifying the trace identity

1

\(\text{tr}(\mathbf{AX}) = \sum_i(\mathbf{AX})_{ii} = \sum_i\sum_j A_{ij}X_{ji}\)

2

Differentiate w.r.t. a single entry \(X_{kl}\): only the term with \(j=k, i=l\) survives: \[\frac{\partial\,\text{tr}(\mathbf{AX})}{\partial X_{kl}} = A_{lk}\]

3

Assembling all entries: \(\left[\frac{\partial\,\text{tr}(\mathbf{AX})}{\partial\mathbf{X}}\right]_{kl} = A_{lk} = (\mathbf{A}^T)_{kl}\) \[\therefore\quad \frac{\partial\,\text{tr}(\mathbf{AX})}{\partial\mathbf{X}} = \mathbf{A}^T \quad\blacksquare\]

This single index computation justifies the entire trace-trick toolbox in §07 — every other identity reduces to this one via cyclic permutation and the chain rule.

07

Technique

Trace Tricks

// converting scalars to traces · cyclic property · the universal technique

Any scalar can be written as the trace of a 1×1 matrix — and once in trace form, the cyclic property of the trace lets you rearrange products freely. This is the single most useful trick for deriving matrix gradients of scalar objectives. It also bridges the gap between matrix norms and linear algebra, most notably with the Frobenius norm: \(\|\mathbf{A}\|_F^2 = \text{tr}(\mathbf{A}^T\mathbf{A}) = \sum_{i,j} A_{ij}^2\).

The Trace TrickUniversal Technique

\[\text{Any scalar } a = \text{tr}(a) \qquad \mathbf{u}^T\mathbf{v} = \text{tr}(\mathbf{u}^T\mathbf{v}) = \text{tr}(\mathbf{v}\mathbf{u}^T)\] \[\text{Cyclic property: } \text{tr}(\mathbf{ABC}) = \text{tr}(\mathbf{BCA}) = \text{tr}(\mathbf{CAB})\] \[\text{(NOT } \text{tr}(\mathbf{ABC})=\text{tr}(\mathbf{ACB}) \text{ — order within the cycle matters!)}\]

The cyclic property lets you move the matrix you're differentiating with respect to into the "tr(AX)" position, where the identity from §06 applies directly. This is how you derive ∂/∂X tr(X^T AX) = (A+A^T)X: write d(tr(X^TAX)) = tr((dX)^TAX) + tr(X^TA dX) = tr((AX)^T dX) + tr((A^TX)^TdX) — wait, cleaner: use cyclic + transpose properties together.

Worked Example: Deriving \(\partial/\partial\mathbf{X}\,\text{tr}(\mathbf{X}^T\mathbf{AX})\)

// Combining differentials with the trace trick

1

Take the differential: \[d(\text{tr}(\mathbf{X}^T\mathbf{AX})) = \text{tr}(d(\mathbf{X}^T\mathbf{AX})) = \text{tr}((d\mathbf{X})^T\mathbf{AX} + \mathbf{X}^T\mathbf{A}\,d\mathbf{X})\]

2

Use \(\text{tr}((d\mathbf{X})^T\mathbf{AX}) = \text{tr}((\mathbf{AX})^Td\mathbf{X})\) (transpose inside trace doesn't change value for scalars... more precisely tr(B^T) = tr(B), and (d\mathbf{X})^T\mathbf{AX} is the transpose of \(\mathbf{X}^T\mathbf{A}^Td\mathbf{X}\)): \[\text{tr}((d\mathbf{X})^T\mathbf{AX}) = \text{tr}(\mathbf{X}^T\mathbf{A}^Td\mathbf{X})\]

3

Combine both terms: \[d(\text{tr}(\mathbf{X}^T\mathbf{AX})) = \text{tr}(\mathbf{X}^T\mathbf{A}^Td\mathbf{X}) + \text{tr}(\mathbf{X}^T\mathbf{A}\,d\mathbf{X}) = \text{tr}(\mathbf{X}^T(\mathbf{A}+\mathbf{A}^T)d\mathbf{X})\] \[\therefore\quad \frac{\partial}{\partial\mathbf{X}}\text{tr}(\mathbf{X}^T\mathbf{AX}) = (\mathbf{A}+\mathbf{A}^T)\mathbf{X} \quad\blacksquare\]

Pattern: tr(X^TBX) where the result is (B+B^T)X. For symmetric A: result is 2AX. This identity is THE workhorse for deriving gradients of quadratic objectives w.r.t. matrices — used directly in §11 (PCA).

Quick-Reference Trace Identity Table

Scalar Expression	Trace Form	Gradient w.r.t. X
\(\mathbf{a}^T\mathbf{X}\mathbf{b}\)	\(\text{tr}(\mathbf{b}\mathbf{a}^T\mathbf{X})\)	\(\mathbf{a}\mathbf{b}^T\)
\(\mathbf{a}^T\mathbf{X}^T\mathbf{b}\)	\(\text{tr}(\mathbf{a}\mathbf{b}^T\mathbf{X})\)... wait, \(\text{tr}(\mathbf{b}^T\mathbf{X}^T\mathbf{a})\)	\(\mathbf{b}\mathbf{a}^T\)
\(\\|\mathbf{X}\\|_F^2=\text{tr}(\mathbf{X}^T\mathbf{X})\)	—	\(2\mathbf{X}\)
\(\text{tr}(\mathbf{X}^T\mathbf{AX})\)	—	\((\mathbf{A}+\mathbf{A}^T)\mathbf{X}\)

08

Forms

Quadratic Forms

// x^TAx · gradients and Hessians · symmetric vs general A · connections to optimization

Quadratic forms \(\mathbf{x}^T\mathbf{A}\mathbf{x}\) are everywhere in ML — squared losses, regularizers, covariance terms, Gaussian exponents, Newton steps. Their derivatives are simple, fixed formulas worth memorizing completely. Crucially, notice that only the symmetric part of \(\mathbf{A}\) contributes to the value: \(\mathbf{x}^T\mathbf{A}\mathbf{x} = \mathbf{x}^T \left( \frac{\mathbf{A} + \mathbf{A}^T}{2} \right) \mathbf{x}\), which is why we almost always assume \(\mathbf{A}\) is symmetric when discussing quadratic forms.

Quadratic Form DerivativesMemorize These

\[f(\mathbf{x})=\mathbf{x}^T\mathbf{A}\mathbf{x}\;\Rightarrow\; \nabla f=(\mathbf{A}+\mathbf{A}^T)\mathbf{x}, \quad \nabla^2f=\mathbf{A}+\mathbf{A}^T\] \[\text{If } \mathbf{A}=\mathbf{A}^T\text{ (symmetric):}\quad \nabla f = 2\mathbf{A}\mathbf{x}, \quad \nabla^2f=2\mathbf{A}\] \[g(\mathbf{x})=\mathbf{a}^T\mathbf{x}\;\Rightarrow\;\nabla g=\mathbf{a},\quad \nabla^2g=\mathbf{0}\] \[h(\mathbf{x})=\|\mathbf{x}\|^2=\mathbf{x}^T\mathbf{x}\;\Rightarrow\;\nabla h=2\mathbf{x},\quad\nabla^2h=2\mathbf{I}\]

Almost every "matrix calculus derivation" in ML reduces to applying these four formulas to the appropriate A, a, or combination. The general-A→symmetric simplification matters: in practice, if A isn't symmetric, you can always replace it with (A+A^T)/2 without changing x^TAx — so WLOG assume A symmetric.

Worked Example: Mahalanobis Distance

Gradient of Mahalanobis DistanceApplication

\[d^2(\mathbf{x}) = (\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\] \[\nabla_\mathbf{x}d^2 = 2\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}) \quad\text{(since }\boldsymbol{\Sigma}^{-1}\text{ symmetric)}\]

Substitute u=x−μ (constant shift doesn't affect the gradient w.r.t. x, since ∂u/∂x = I) and apply the symmetric quadratic form rule with A=Σ^{-1}. This gradient appears directly in the multivariate Gaussian log-likelihood — and its negative is the "pull toward the mean, scaled by inverse covariance" direction used in Mahalanobis-distance-based anomaly detection.

Newton's Method as a Quadratic-Form Minimization

▣

Why Quadratic Forms Are THE Local Model

Every twice-differentiable function looks locally like a quadratic form: \(f(\mathbf{x}+\boldsymbol{\delta})\approx f(\mathbf{x})+\nabla f^T\boldsymbol{\delta}+\frac{1}{2}\boldsymbol{\delta}^T\nabla^2f\,\boldsymbol{\delta}\). Minimizing this quadratic model over \(\boldsymbol{\delta}\) (set gradient w.r.t. \(\boldsymbol{\delta}\) to zero, using \(\nabla f+\nabla^2f\,\boldsymbol{\delta}=0\)) gives the Newton step \(\boldsymbol{\delta}=-(\nabla^2f)^{-1}\nabla f\). Every second-order optimization method is, at its core, "solve the quadratic-form gradient formula above for the local model."

09

Application

Deriving Linear Regression

// the normal equations · full matrix calculus derivation · geometric interpretation

Linear regression is the cleanest possible application of matrix calculus: a quadratic objective, a linear model, and a closed-form solution that follows from setting the gradient to zero.

// Deriving the Normal Equations from scratch

1

Objective: \(\mathcal{L}(\mathbf{w})=\|\mathbf{Xw}-\mathbf{y}\|^2=(\mathbf{Xw}-\mathbf{y})^T(\mathbf{Xw}-\mathbf{y})\). Expand: \[\mathcal{L}(\mathbf{w}) = \mathbf{w}^T\mathbf{X}^T\mathbf{Xw} - 2\mathbf{y}^T\mathbf{Xw} + \mathbf{y}^T\mathbf{y}\]

2

Apply the quadratic-form rule (§08) to the first term with \(\mathbf{A}=\mathbf{X}^T\mathbf{X}\) (symmetric), and the linear rule to the second term: \[\nabla_\mathbf{w}(\mathbf{w}^T\mathbf{X}^T\mathbf{Xw}) = 2\mathbf{X}^T\mathbf{Xw}, \qquad \nabla_\mathbf{w}(-2\mathbf{y}^T\mathbf{Xw}) = -2\mathbf{X}^T\mathbf{y}\]

3

Combine: \[\nabla_\mathbf{w}\mathcal{L} = 2\mathbf{X}^T\mathbf{Xw} - 2\mathbf{X}^T\mathbf{y}\]

4

Set \(\nabla_\mathbf{w}\mathcal{L}=\mathbf{0}\): \[\mathbf{X}^T\mathbf{Xw} = \mathbf{X}^T\mathbf{y} \quad\Rightarrow\quad \boxed{\mathbf{w}^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}}\]

The Normal Equations. Since ∇²L = 2X^TX ⪰ 0 (always PSD, §06 of Convex Optimization), L is convex — this stationary point is a GLOBAL minimum. If X^TX is singular (rank-deficient), use the pseudoinverse w* = X^+y. ∎

Geometric Interpretation — Orthogonal Projection

Normal Equations as OrthogonalityGeometry

\[\mathbf{X}^T(\mathbf{Xw}^*-\mathbf{y}) = \mathbf{0} \quad\Leftrightarrow\quad \text{residual } (\mathbf{Xw}^*-\mathbf{y}) \perp \text{Col}(\mathbf{X})\] \[\hat{\mathbf{y}} = \mathbf{Xw}^* = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{H}\mathbf{y}\]

The Normal Equations literally say: the residual vector is orthogonal to every column of X — i.e., orthogonal to the entire column space. \(\mathbf{Xw}^*\) is the orthogonal projection of \(\mathbf{y}\) onto \(\text{Col}(\mathbf{X})\). The matrix \(\mathbf{X}^\dagger = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\) is the Moore-Penrose Pseudoinverse of \(\mathbf{X}\). The matrix \(\mathbf{H} = \mathbf{X}\mathbf{X}^\dagger\) is known as the Hat Matrix because it "puts a hat" on \(\mathbf{y}\). It is an idempotent matrix (\(\mathbf{H}^2 = \mathbf{H}\)), a hallmark of projection operators.

10

Application

Deriving Logistic Regression

// sigmoid · cross-entropy · gradient and Hessian via the chain rule

Logistic regression composes a linear map, the sigmoid, and the binary cross-entropy loss. Its gradient and Hessian follow directly from the chain rule (§05) plus two elementary derivative facts about the sigmoid.

Setup and the Sigmoid Derivative

Logistic Regression — SetupModel

\[\hat{y}_i=\sigma(z_i),\;\; z_i=\mathbf{w}^T\mathbf{x}_i, \quad \sigma(z)=\frac{1}{1+e^{-z}}, \quad \sigma'(z)=\sigma(z)(1-\sigma(z))\] \[\mathcal{L}=-\sum_i\left[y_i\log\hat{y}_i+(1-y_i)\log(1-\hat{y}_i)\right]\]

σ'(z) = σ(z)(1−σ(z)) is the single most important fact about the sigmoid for matrix calculus — proven by direct differentiation: d/dz[1/(1+e^{-z})] = e^{-z}/(1+e^{-z})² = σ(z)·(1−σ(z)).

Deriving the Gradient

// Chain rule: L → ŷᵢ → zᵢ → w

1

\(\partial\mathcal{L}/\partial\hat{y}_i = -y_i/\hat{y}_i + (1-y_i)/(1-\hat{y}_i)\)

2

Chain through \(\sigma\): \(\partial\mathcal{L}/\partial z_i = (\partial\mathcal{L}/\partial\hat{y}_i)\sigma'(z_i)\). Substitute and simplify (the same cancellation as softmax+CCE in the Backpropagation masterclass): \[\frac{\partial\mathcal{L}}{\partial z_i} = \left(-\frac{y_i}{\hat{y}_i}+\frac{1-y_i}{1-\hat{y}_i}\right)\hat{y}_i(1-\hat{y}_i) = \hat{y}_i-y_i\]

3

Chain through \(z_i=\mathbf{w}^T\mathbf{x}_i\) (so \(\partial z_i/\partial\mathbf{w}=\mathbf{x}_i\)) and sum over samples: \[\nabla_\mathbf{w}\mathcal{L} = \sum_i(\hat{y}_i-y_i)\mathbf{x}_i = \mathbf{X}^T(\hat{\mathbf{y}}-\mathbf{y})\]

Identical "prediction minus label" form as softmax+CCE — both are exponential-family GLMs, and the gradient of any GLM negative log-likelihood w.r.t. the linear predictor is always (prediction − target). ∎

Deriving the Hessian

Logistic Regression HessianCurvature

\[\nabla^2_\mathbf{w}\mathcal{L} = \frac{\partial}{\partial\mathbf{w}}\left[\mathbf{X}^T(\hat{\mathbf{y}}-\mathbf{y})\right] = \mathbf{X}^T\frac{\partial\hat{\mathbf{y}}}{\partial\mathbf{w}} = \mathbf{X}^T\text{diag}(\hat{y}_i(1-\hat{y}_i))\mathbf{X} = \mathbf{X}^T\mathbf{D}\mathbf{X}\]

D = diag(ŷᵢ(1−ŷᵢ)) has strictly positive diagonal entries (0 < ŷᵢ < 1 always), so X^TDX is PSD by the same v^T(X^TDX)v = ‖D^{1/2}Xv‖² ≥ 0 argument as MSE. Convex — guaranteed global optimum. This Hessian is also the Fisher Information Matrix for logistic regression, and is used directly in Newton's method (Iteratively Reweighted Least Squares, IRLS).

11

Application

Deriving PCA

// constrained optimization · Lagrangian · matrix calculus → eigenvalue problem

PCA asks: find the unit direction \(\mathbf{v}\) along which the projected data has maximum variance. This is a constrained optimization problem — and matrix calculus plus a Lagrange multiplier turns it directly into an eigenvalue problem. Intuitively, maximizing the variance of the projection is equivalent to minimizing the mean-squared distance (reconstruction error) between the original data points and their projections, a result of the Pythagorean theorem.

// From variance maximization to the eigenvalue equation

1

Variance of data projected onto direction \(\mathbf{v}\) (data centered, covariance \(\mathbf{S}\)): \[\text{Var}(\mathbf{Xv}) = \mathbf{v}^T\mathbf{S}\mathbf{v}, \quad \mathbf{S}=\frac{1}{n}\mathbf{X}^T\mathbf{X}\] Maximize subject to \(\|\mathbf{v}\|=1\) (otherwise variance is unbounded — scale v up arbitrarily).

2

Form the Lagrangian (bridging to the Lagrange Multipliers masterclass): \[\mathcal{L}(\mathbf{v},\lambda) = \mathbf{v}^T\mathbf{S}\mathbf{v} - \lambda(\mathbf{v}^T\mathbf{v}-1)\]

3

Apply the symmetric quadratic-form gradient rule (§08) to both terms: \[\nabla_\mathbf{v}\mathcal{L} = 2\mathbf{S}\mathbf{v} - 2\lambda\mathbf{v} = \mathbf{0}\]

4

Simplify: \[\boxed{\mathbf{S}\mathbf{v}=\lambda\mathbf{v}}\]

The stationarity condition for variance maximization is EXACTLY the eigenvalue equation. v must be an eigenvector of S, and λ is the corresponding eigenvalue. Substituting back: Var(Xv) = v^TSv = v^T(λv) = λ‖v‖² = λ. The variance along v equals its eigenvalue — so the maximum-variance direction is the eigenvector with the LARGEST eigenvalue. ∎

PCA: Two Lines of Matrix Calculus

The entire derivation of PCA's first principal component is: (1) write the Lagrangian for constrained variance maximization, (2) differentiate using the symmetric quadratic-form rule from §08. The result — that principal components are eigenvectors of the covariance matrix, ordered by eigenvalue — falls out as a direct algebraic consequence, with no additional machinery. This is the same Lagrangian framework as the Lagrange Multipliers masterclass, applied to a quadratic objective via matrix calculus.

Subsequent Components — Sequential Deflation

▣

Why All Eigenvectors Solve the Same Equation

To find the second principal component, maximize \(\mathbf{v}^T\mathbf{S}\mathbf{v}\) subject to \(\|\mathbf{v}\|=1\) AND \(\mathbf{v}\perp\mathbf{v}_1\) (orthogonal to the first PC). Adding this orthogonality constraint with another Lagrange multiplier and differentiating yields the same equation \(\mathbf{S}\mathbf{v}=\lambda\mathbf{v}\) — but restricted to the subspace orthogonal to \(\mathbf{v}_1\), which (since \(\mathbf{S}\) is symmetric) is automatically satisfied by the eigenvector with the second-largest eigenvalue. By induction, all principal components are simply all eigenvectors of \(\mathbf{S}\), sorted by eigenvalue — the Spectral Theorem guarantees they exist, are real, and are mutually orthogonal.

12

Application

Deriving Backprop

// the chain rule for matrices, applied to a full layer · recap via matrix calculus

Backpropagation (covered in full depth in its own masterclass) is, at its mathematical core, repeated application of the vector chain rule from §05 — with the matrix derivative identities from §06–08 supplying the rules for linear layers.

One Layer, Derived via Matrix CalculusBackprop Recap

\[\mathbf{z}=\mathbf{Wx}+\mathbf{b}, \quad \mathbf{a}=\sigma(\mathbf{z}), \quad \mathcal{L}=\ell(\mathbf{a})\] \[\boldsymbol{\delta} \triangleq \frac{\partial\mathcal{L}}{\partial\mathbf{z}} = \sigma'(\mathbf{z})\odot\frac{\partial\mathcal{L}}{\partial\mathbf{a}} \quad\text{(diagonal Jacobian VJP, §03)}\] \[\frac{\partial\mathcal{L}}{\partial\mathbf{W}} = \boldsymbol{\delta}\,\mathbf{x}^T \quad\text{(from }d\mathcal{L}=\boldsymbol{\delta}^Td\mathbf{z}=\boldsymbol{\delta}^T(d\mathbf{W})\mathbf{x}=\text{tr}(\mathbf{x}\boldsymbol{\delta}^Td\mathbf{W})\text{, §07 trace trick)}\] \[\frac{\partial\mathcal{L}}{\partial\mathbf{x}} = \mathbf{W}^T\boldsymbol{\delta} \quad\text{(vector chain rule, §05)}\] \[\frac{\partial\mathcal{L}}{\partial\mathbf{b}} = \boldsymbol{\delta} \quad\text{(}\partial\mathbf{z}/\partial\mathbf{b}=\mathbf{I}\text{)}\]

Every one of these four lines is a direct instance of an identity already established: diagonal-Jacobian VJP (§03), the trace trick applied to a differential (§07), the vector chain rule with a linear Jacobian (§05), and the identity Jacobian. Backpropagation invents NO new mathematics — it is matrix calculus applied systematically, layer by layer, via the chain rule.

The ∂L/∂W Derivation via Trace Trick — In Full

// Deriving ∂L/∂W = δxᵀ using differentials and the trace trick

1

Write the differential of the loss through \(\mathbf{z}\): \[d\mathcal{L} = \boldsymbol{\delta}^Td\mathbf{z} = \boldsymbol{\delta}^T(d\mathbf{W})\mathbf{x}\]

2

\(\boldsymbol{\delta}^T(d\mathbf{W})\mathbf{x}\) is a scalar, so it equals its own trace: \[d\mathcal{L} = \text{tr}(\boldsymbol{\delta}^T(d\mathbf{W})\mathbf{x}) = \text{tr}(\mathbf{x}\boldsymbol{\delta}^Td\mathbf{W}) \quad\text{(cyclic property, §07)}\]

3

Match against \(d\mathcal{L}=\text{tr}(\mathbf{A}\,d\mathbf{W}) \Rightarrow \partial\mathcal{L}/\partial\mathbf{W}=\mathbf{A}^T\) (§06): \[\frac{\partial\mathcal{L}}{\partial\mathbf{W}} = (\mathbf{x}\boldsymbol{\delta}^T)^T = \boldsymbol{\delta}\mathbf{x}^T \quad\blacksquare\]

This is BP3 from the Backpropagation masterclass, now derived purely from the differential + trace-trick toolkit built in this page.

13

Computation

Automatic Differentiation

// from formulas to algorithms · forward vs reverse mode · how frameworks implement these rules

Everything derived so far is a formula. Automatic differentiation (AD) is the algorithm that applies these formulas mechanically, for arbitrary compositions, without a human re-deriving anything.

Forward vs Reverse Mode — Matrix Calculus ViewAD Modes

\[\text{Forward mode: propagate } \dot{\mathbf{x}}\to J_1\dot{\mathbf{x}}\to J_2J_1\dot{\mathbf{x}}\to\cdots \quad\text{(left-to-right Jacobian-vector products, JVPs)}\] \[\text{Reverse mode: propagate } \bar{\mathbf{y}}\to J_n^T\bar{\mathbf{y}}\to J_{n-1}^TJ_n^T\bar{\mathbf{y}}\to\cdots \quad\text{(right-to-left vector-Jacobian products, VJPs)}\]

Forward mode is equivalent to augmenting every scalar with an infinitesimal "dual number" \(\epsilon\) such that \(\epsilon^2=0\). Reverse mode is the workhorse of deep learning because it computes the gradient of one output (the loss) with respect to millions of parameters in a single backward pass, regardless of network depth.

A Minimal Reverse-Mode Engine

Each op stores its own matrix-calculus rule

# Every operation defines forward AND its local VJP rule
class MatMul:
  def forward(self, W, x): return W @ x
  def backward(self, grad_z):
    # §05/§12 identities — VJP rules, not re-derived each time
    grad_W = grad_z @ self.x.T # δ xᵀ
    grad_x = self.W.T @ grad_z # Wᵀ δ
    return grad_W, grad_x

class Sigmoid:
  def forward(self, z): self.a = sigmoid(z); return self.a
  def backward(self, grad_a):
    # §03/§10 — diagonal Jacobian → elementwise multiply
    return grad_a * self.a * (1 - self.a)

# AD frameworks are libraries of exactly these (forward, backward) pairs —
# every pair IS one of the identities derived in §03–§08.

▣

The Punchline

PyTorch, JAX, and TensorFlow contain no "magic." Every .backward() call walks the computation graph and, at each node, applies a pre-derived matrix calculus identity — exactly the ones catalogued in this reference. Knowing matrix calculus means being able to read and even write autograd.Function implementations from first principles.

14

Errors

Common Pitfalls

// the mistakes everyone makes · how to catch them fast

The Six Most Common Errors

Shape mismatches: \(\partial\mathcal{L}/\partial\mathbf{W}\) must have the SAME shape as \(\mathbf{W}\). If you compute something with the wrong shape, you have a transpose error somewhere. Check this first, always.
Forgetting the factor of 2: \(\nabla_\mathbf{x}\|\mathbf{x}\|^2=2\mathbf{x}\), not \(\mathbf{x}\). This single factor of 2 (or its absence) is the most common silent bug — it doesn't change the direction of gradient descent, only the effective learning rate, so it can go unnoticed for a long time.
Assuming \(\mathbf{A}\) is symmetric when it isn't: \(\nabla_\mathbf{x}(\mathbf{x}^T\mathbf{A}\mathbf{x})=(\mathbf{A}+\mathbf{A}^T)\mathbf{x}\), which only simplifies to \(2\mathbf{A}\mathbf{x}\) if \(\mathbf{A}=\mathbf{A}^T\). Many "derivation errors" online silently assume symmetry without stating it.
Treating non-commuting matrix products as scalars: \(d(\mathbf{AB})=(d\mathbf{A})\mathbf{B}+\mathbf{A}(d\mathbf{B})\) — NOT \((d\mathbf{B})\mathbf{A}+\mathbf{B}(d\mathbf{A})\). Order matters, always.
Incorrect trace cyclic permutations: \(\text{tr}(\mathbf{ABC})=\text{tr}(\mathbf{BCA})=\text{tr}(\mathbf{CAB})\), but \(\text{tr}(\mathbf{ABC})\neq\text{tr}(\mathbf{ACB})\) in general (only cyclic shifts are valid, not arbitrary permutations).
Mixing numerator/denominator layout mid-derivation: Pick ONE convention (this page uses denominator/ML layout for gradients, standard m×n for Jacobians) and stay consistent. Switching layouts mid-derivation is the #1 cause of "off by a transpose" errors.

Debugging Technique: Numerical Gradient Checking

Gradient CheckingVerification

\[\left[\nabla f(\mathbf{x})\right]_i \approx \frac{f(\mathbf{x}+\epsilon\mathbf{e}_i)-f(\mathbf{x}-\epsilon\mathbf{e}_i)}{2\epsilon}, \quad \epsilon\approx10^{-5}\]

Compare your analytical formula against this finite-difference approximation for a few random entries. If they disagree beyond ~10⁻⁴ relative error, your analytical derivative has a bug — almost always one of the six pitfalls above. This is the universal "trust but verify" step for any hand-derived gradient.

15

Reference

Cheat Sheet

// every identity from this page, in one place

Vectors → Scalars

Linear

f = aᵀx → ∇f = a

∇²f = 0

Quadratic (sym A)

f = xᵀAx → ∇f = 2Ax

∇²f = 2A

Quadratic (general A)

f = xᵀAx → ∇f = (A+Aᵀ)x

Norm

f = ‖x‖² → ∇f = 2x

Matrices → Scalars (Trace Forms)

tr(AX)

∂/∂X tr(AX) = Aᵀ

tr(XᵀA)

∂/∂X tr(XᵀA) = A

tr(AXB)

∂/∂X tr(AXB) = AᵀBᵀ

tr(XᵀAX)

∂/∂X = (A+Aᵀ)X

log|det X|

∂/∂X = (Xᵀ)⁻¹

‖X‖_F²

∂/∂X = 2X

Chain Rule & Backprop Rules

Vector chain rule

∇ₓL = Jᵥᵀ ∇ᵧL

z = Wx + b

∂L/∂x = Wᵀδ

∂L/∂W = δxᵀ, ∂L/∂b = δ

Elementwise σ(z)

∂L/∂z = σ'(z) ⊙ ∂L/∂a

Softmax + CCE

∂L/∂z = ŷ − y

Sigmoid + BCE

∂L/∂z = ŷ − y

MSE

∇w = 2Xᵀ(Xw−y)

Differential Algebra

Sum / scale

d(A+B)=dA+dB

d(αA)=α dA

Product

d(AB)=(dA)B+A(dB)

Trace

d(tr A)=tr(dA)

Inverse

d(A⁻¹)=−A⁻¹(dA)A⁻¹

16

Synthesis

The Complete Mental Model

// everything connected · one diagram · the unified workflow

Fig 1. Complete matrix calculus map — from the differential viewpoint, through Jacobians/Hessians, trace tricks, and the chain rule, into quadratic forms, every major derivation, and automatic differentiation.

The complete workflow, every time:

Fix shapes and convention. Determine the shapes of input and output. In ML, gradients of scalars match the shape of the variable (denominator/ML layout).
Write the differential. Express \(df\) (or \(d\mathcal{L}\)) using the algebra of differentials (§02): sum, product, trace rules. This avoids index chasing entirely.
Apply the trace trick if matrix-valued. Use cyclic permutation (§07) to massage the expression into the form \(\text{tr}(\mathbf{A}\,d\mathbf{X})\), then read off \(\partial f/\partial\mathbf{X}=\mathbf{A}^T\) (§06).
For compositions, chain Jacobians. \(\nabla_\mathbf{x}\mathcal{L}=J_\mathbf{f}(\mathbf{x})^T\nabla_\mathbf{f}\mathcal{L}\) (§05) — transpose the local Jacobian and multiply by the upstream gradient. This is the VJP, the atom of reverse-mode AD (§13).
Recognize quadratic forms. \(\mathbf{x}^T\mathbf{A}\mathbf{x}\to(\mathbf{A}+\mathbf{A}^T)\mathbf{x}\) (§08) is the single most common pattern — MSE, regularizers, Mahalanobis distance, Newton's method, PCA variance.
Verify with finite differences. Before trusting any hand-derived formula, check it numerically (§14). This catches the factor-of-2, symmetry, and transpose errors that are otherwise invisible.
Apply to real models. Linear regression (§09), logistic regression (§10), PCA (§11), and every layer of a neural network (§12) are all instances of steps 1–5 applied to specific \(f\).

Matrix Calculus Is Scalar Calculus With Bookkeeping

There is no new calculus here — only new bookkeeping. Every derivative is still a limit of a difference quotient; every chain rule is still "multiply local derivatives along the path." What matrix calculus adds is a systematic notation (differentials, traces, Jacobians) for tracking which derivative goes with which shape, so that the same five-step workflow above scales from a single scalar \(x\) to a billion-parameter neural network without ever writing an index sum by hand.

The ultimate mental model is to alternate between Vectorization (viewing the system as a whole to see geometric structures like projections and ellipses) and Scalarization (viewing individual entries to verify specific partial derivatives). Master the differential viewpoint and the trace trick, and every gradient in this reference — and every gradient PyTorch computes for you — becomes a two-line derivation.

∂y/∂x for every shape Matrix Calculus

Scalars, Vectors, Matrices

Numerator Layout

Denominator Layout (ML convention)

The Differential Viewpoint

The Jacobian Matrix

The Hessian Matrix

Chain Rule for Vectors

Matrix Derivatives

Trace Tricks

Quadratic Forms

Deriving Linear Regression

Deriving Logistic Regression

Deriving PCA

Deriving Backprop

Automatic Differentiation

Common Pitfalls

Cheat Sheet

The Complete Mental Model