Machine Learning // Deep Learning Foundations

Neural
Networks
Mathematical Foundations

From biological neurons to backpropagation through time —
complete mathematical derivations, geometric intuitions,
and every connection to the models you've already built.

16 Sections
Biological Neuron · McCulloch-Pitts · Perceptron · Activation Functions · Forward Pass · Loss Functions · Computational Graph · Chain Rule · Backpropagation · Vanishing Gradient · Adam Optimizer · Xavier / He Init · Universal Approximation
// Table of Contents
  1. The Biological Neuron
  2. The Mathematical Neuron
  3. Activation Functions
  4. Network Architecture & Layers
  5. Forward Propagation: Full Math
  6. Loss Functions
  7. The Computational Graph
  8. Backpropagation: Chain Rule Derivation
  9. Gradient Flow Through a Full Network
  10. Vanishing & Exploding Gradients
  11. Weight Initialization
  12. Optimization for Neural Networks
  13. Universal Approximation Theorem
  14. Putting It All Together: Full Worked Pass
  15. Connection to Modern Architectures
  16. The Complete Mental Model
01
§ Sec

The Biological Neuron

// inspiration · structure · signal processing · why it matters

Neural networks borrow their structure — and their name — from the biological neurons in the human brain. Understanding the biology isn't just historical context; the parallels reveal why the mathematical structure of artificial networks works.

Anatomy of a Biological Neuron
[Figure: labeled diagram of a neuron. Dendrites (receive signals) → soma (cell body, integrates) → axon with myelin sheaths (transmits signal) → axon terminals (output synapses to the next neurons).]
Fig 1. Biological neuron anatomy. Dendrites receive signals → Soma integrates them → Axon fires if threshold exceeded → Terminals pass signal to downstream neurons.
  • Dendrites — branch-like extensions that receive electrical/chemical signals from upstream neurons. Each connection (synapse) has a variable strength — analogous to weights \(w_i\).
  • Soma (cell body) — integrates all incoming signals from all dendrites. Performs a weighted summation. If the total exceeds a threshold, it fires.
  • Axon — transmits the output signal (action potential) to downstream neurons. A neuron either fires or doesn't — it's a binary threshold event.
  • Synapse — the junction between two neurons. Strength is modifiable through learning (Hebbian plasticity: "neurons that fire together, wire together").
The McCulloch-Pitts Neuron (1943)

Warren McCulloch and Walter Pitts built the first mathematical model of a neuron. Their insight: the neuron is a threshold logic unit.

McCulloch-Pitts Model (1943)
\[ \text{output} = \begin{cases} 1 & \text{if } \displaystyle\sum_{i=1}^n w_i x_i \geq \theta \\ 0 & \text{otherwise} \end{cases} \]

where xᵢ ∈ {0,1} are binary inputs, wᵢ ∈ {-1,+1} are fixed weights, θ is the threshold

Limitations: weights are fixed (no learning), inputs must be binary, threshold must be hand-set. Rosenblatt's Perceptron (1958) fixed the first problem by introducing a learning rule. Modern neural networks fix all three.
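As a minimal sketch of the threshold unit above: the function name `mp_neuron` and the specific weight and threshold settings below are illustrative choices, not taken from the 1943 paper. They show how hand-set parameters realize simple logic gates, and why the lack of a learning rule is limiting (every gate needs its own manually chosen θ).

```python
def mp_neuron(inputs, weights, theta):
    """Threshold logic unit: output 1 if the weighted sum reaches theta, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= theta else 0

# Hand-set parameters (illustrative): +1 is excitatory, -1 is inhibitory.
def AND(x1, x2):
    return mp_neuron([x1, x2], [1, 1], theta=2)

def OR(x1, x2):
    return mp_neuron([x1, x2], [1, 1], theta=1)

def NOT(x):
    return mp_neuron([x], [-1], theta=0)

# Print the truth tables for the two-input gates.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b))
```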

Insight

The brain has ~86 billion neurons, each connected to ~7,000 others: approximately 600 trillion synaptic connections. A GPT-4-scale model is reported (unofficially) to have ~1.8 trillion parameters. The brain still leads by two to three orders of magnitude in connection count, but artificial networks compensate with precise floating-point computation, gradient-based learning, and perfect memory.

02
§ Sec

The Mathematical Neuron

// weighted sum · bias · pre-activation · post-activation

The modern artificial neuron generalizes the McCulloch-Pitts unit with continuous weights, a learnable bias, and a differentiable activation function. This is the atom from which all neural networks are built.

Anatomy of a Single Artificial Neuron
[Figure: inputs x₁, x₂, …, xₙ with weights w₁, w₂, …, wₙ and bias b feed a summation node Σ (z = Σwᵢxᵢ + b, the pre-activation, a linear combination), followed by the activation σ(z), which produces the output a = σ(z) = ŷ (a nonlinear transform).]
Fig 2. A single artificial neuron. Inputs {xᵢ} are weighted (wᵢ), summed with bias b to form pre-activation z. An activation function σ squashes z to output a.
The Two Operations
Single Neuron — Full Computation
\[ z = \mathbf{w}^T\mathbf{x} + b = \sum_{i=1}^n w_i x_i + b \qquad \text{(pre-activation / linear combination)} \] \[ a = \sigma(z) \qquad \text{(post-activation / output)} \]
  • Pre-activation \(z\) — the raw linear combination. Identical to a linear regression output. By itself, it's just linear. This is the "integrate" step.
  • Bias \(b\) — a learnable offset. Allows the decision boundary to shift away from the origin. Without bias, the hyperplane \(\mathbf{w}^T\mathbf{x} = 0\) is constrained to pass through the origin.
  • Activation \(\sigma(z)\) — the crucial nonlinearity. Without it, stacking multiple linear layers is still just a single linear layer (linear functions compose to linear). Activations are what allow neural networks to approximate nonlinear functions.
Key Fact

A neural network with no activation functions is identically a linear model, regardless of how many layers it has. If every layer is \(\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}\), then composing \(L\) layers gives \(\mathbf{y} = \mathbf{W}_L\cdots\mathbf{W}_1\mathbf{x} + \text{const} = \tilde{\mathbf{W}}\mathbf{x} + \tilde{\mathbf{b}}\) — still linear. Activation functions are not optional decorations. They are the entire reason deep networks work.
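The collapse of stacked linear layers can be checked numerically. The sketch below uses small hand-picked matrices (arbitrary values, chosen only for illustration) and plain Python lists; the helper names `matmul`, `matvec`, and `add` are assumptions of this example.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def add(u, v):
    """Elementwise vector sum."""
    return [a + b for a, b in zip(u, v)]

# Two linear layers with no activation in between (illustrative values).
W1, b1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]], [0.5, -0.5, 1.0]   # 2 -> 3
W2, b2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]], [0.0, 2.0]           # 3 -> 2

# Equivalent single layer: W_tilde = W2 W1, b_tilde = W2 b1 + b2.
W_tilde = matmul(W2, W1)
b_tilde = add(matvec(W2, b1), b2)

x = [0.3, -0.7]
two_layer = add(matvec(W2, add(matvec(W1, x), b1)), b2)
one_layer = add(matvec(W_tilde, x), b_tilde)
print(two_layer, one_layer)   # the two outputs agree up to rounding
```

Inserting any nonlinearity between the two layers breaks this equivalence, which is the point of the Key Fact.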

03
§ Sec

Activation Functions

// sigmoid · tanh · ReLU · Leaky ReLU · GELU · Softmax

Activation functions are the nonlinearities that give neural networks their expressive power. Their mathematical properties — differentiability, range, gradient behavior — directly determine training dynamics.

Sigmoid σ
\(\sigma(z) = \frac{1}{1+e^{-z}}\)
\(\sigma'(z) = \sigma(z)(1-\sigma(z))\)

Range (0,1). Probabilistic output. Saturates for large |z|: the gradient → 0, which causes vanishing gradients. Max gradient 0.25 at z=0. In feedforward networks, use it mainly in the output layer for binary classification (it also appears inside LSTM/GRU gates, §10).

Tanh
\(\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\)
\(\tanh'(z) = 1 - \tanh^2(z)\)

Range (−1,1). Zero-centered — better gradient flow than sigmoid. Still saturates. Note: \(\tanh(z) = 2\sigma(2z) - 1\). Preferred over sigmoid for hidden layers in RNNs.

ReLU
\(\text{ReLU}(z) = \max(0, z)\)
\(\text{ReLU}'(z) = \mathbf{1}[z > 0]\)

Range [0,∞). Not differentiable at 0 (use subgradient = 0). Sparse activation. No vanishing gradient for z > 0. Risk: "dying ReLU" — neurons stuck at 0 when gradient is always 0. Default for hidden layers in CNNs/MLPs.

Leaky ReLU
\(\text{LReLU}(z) = \max(\alpha z, z)\)
\(\alpha \approx 0.01\) (small slope for z < 0)

Fixes dying ReLU: negative inputs have small gradient α instead of 0. PReLU learns α as a parameter. ELU uses an exponential for the negative region. Sometimes preferred over ReLU in deeper networks, though the benefit is task-dependent.

GELU
\(\text{GELU}(z) = z \cdot \Phi(z)\)
\(\approx 0.5z(1 + \tanh[\sqrt{2/\pi}(z + 0.044715z^3)])\)

Gaussian Error Linear Unit. Φ(z) = CDF of standard Gaussian. Smooth everywhere; stochastic interpretation (randomly zeroes inputs based on magnitude). Default in BERT, GPT, ViT — including your chest X-ray ViT.

Softmax
\(\text{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}\)
outputs sum to 1

Multi-class probability output. Range (0,1) per component, sums to 1. Generalizes sigmoid to K classes. For numerical stability, subtract max(z) from every logit before exponentiating; the result is unchanged. Used in the output layer for classification, and also inside attention (§15).
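A minimal sketch of the max-subtraction trick, in plain Python (the function name `softmax` and the test logits are illustrative). Shifting every logit by the same constant cancels in the ratio, so the output is unchanged while the exponentials stay in range.

```python
import math

def softmax(z):
    """Softmax with max-subtraction; a constant shift of z leaves the result unchanged."""
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([1.0, 2.0, 3.0]))
# A naive math.exp(1000.0) raises OverflowError; the shifted version does not.
print(softmax([1000.0, 1001.0, 1002.0]))
```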

Derivatives — Why They Matter for Training

The derivative of the activation function appears directly in the backpropagation chain rule. If the derivative saturates (→ 0), gradients vanish and early layers stop learning. This is the fundamental reason ReLU variants replaced sigmoid/tanh in deep networks.

Activation | Derivative Range | Saturates? | Sparse? | Zero-Centered?
Sigmoid | (0, 0.25] | Yes (strongly) | No | No (outputs centered at 0.5)
Tanh | (0, 1] | Yes | No | Yes
ReLU | {0, 1} | Half (gradient 0 for z<0) | Yes (≈50%) | No
Leaky ReLU | {α, 1}, α≈0.01 | No | Weak | No
GELU | ≈(−0.13, 1.13) | Barely | Soft | Nearly
Softmax | Jacobian matrix | Somewhat | No | N/A
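The closed-form derivatives for sigmoid and tanh can be sanity-checked against a finite difference. This is a small sketch in plain Python; `numeric_derivative` and the sample points are assumptions of the example, not part of any library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_prime(z):
    return 1.0 - math.tanh(z) ** 2

def numeric_derivative(f, z, h=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

# Both derivatives peak at z = 0 and shrink quickly as |z| grows (saturation).
for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(z, sigmoid_prime(z), tanh_prime(z))
```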
04
§ Sec

Network Architecture & Layers

// depth · width · dense layers · feature hierarchies

A neural network is a composition of layers, each containing multiple neurons. The depth (number of layers) and width (neurons per layer) are the two primary architectural choices. Notation: \(L\) total layers, layer \(\ell\) has \(n^{[\ell]}\) neurons.

Network Notation
\[ \mathbf{W}^{[\ell]} \in \mathbb{R}^{n^{[\ell]} \times n^{[\ell-1]}}, \quad \mathbf{b}^{[\ell]} \in \mathbb{R}^{n^{[\ell]}} \] \[ \text{Layer } \ell: \quad \mathbf{z}^{[\ell]} = \mathbf{W}^{[\ell]}\mathbf{a}^{[\ell-1]} + \mathbf{b}^{[\ell]}, \quad \mathbf{a}^{[\ell]} = \sigma^{[\ell]}(\mathbf{z}^{[\ell]}) \]

a[0] = x (input). Each row of W[ℓ] is one neuron's weight vector, so a single matrix multiply computes every neuron in the layer at once.

The Three Layer Types
Input Layer

Not a computational layer — it's just the raw feature vector \(\mathbf{x} = \mathbf{a}^{[0]}\). No weights, no activation. Its size \(n^{[0]}\) equals the number of input features. For images: \(n^{[0]}\) = H×W×C pixels.

Hidden Layers

Layers 1 through \(L-1\). Where all nonlinear feature learning happens. Intermediate representations build hierarchically — early layers detect edges, later layers detect objects. Use ReLU/GELU activations.

Output Layer

Layer \(L\). Size matches the task: 1 neuron + sigmoid (binary), K neurons + softmax (K-class), 1 neuron + linear (regression). Activation is determined by the loss function, not architecture preference.

Parameter Count

Total parameters = \(\sum_{\ell=1}^{L}(n^{[\ell-1]} \cdot n^{[\ell]} + n^{[\ell]})\). For a [784→256→128→10] network: (784·256+256) + (256·128+128) + (128·10+10) = 200,960 + 32,896 + 1,290 = 235,146 parameters.
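The parameter-count formula translates directly into a short helper. The name `count_parameters` is hypothetical; it assumes a fully connected network described by its list of layer sizes.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a fully connected network, e.g. [784, 256, 128, 10]."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

print(count_parameters([784, 256, 128, 10]))
print(count_parameters([2, 4, 4, 1]))   # the small network traced in §14
```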

Depth vs Width — The Key Trade-off
  • Width (more neurons per layer) increases the number of features learned at each level. Diminishing returns beyond a point — wider doesn't always mean better.
  • Depth (more layers) enables hierarchical feature composition: each layer builds on abstractions from the previous. For compositional functions, depth can be exponentially more expressive than width.
  • Universal approximation (§13): one hidden layer + enough neurons can approximate any continuous function. But for some function classes, depth achieves the same accuracy with exponentially fewer neurons.
05
§ Sec

Forward Propagation — Full Math

// vectorized pass · matrix operations · mini-batch · complete example

Forward propagation computes the network output from input to prediction, layer by layer. Every computation is a matrix multiply + bias add + elementwise activation. On a GPU, this is massively parallel.

Single Sample Forward Pass — Explicit
Layer-by-Layer Forward Pass (L layers)
\[ \mathbf{a}^{[0]} = \mathbf{x} \in \mathbb{R}^{n^{[0]}} \] \[ \text{For } \ell = 1, 2, \ldots, L: \] \[ \mathbf{z}^{[\ell]} = \mathbf{W}^{[\ell]}\mathbf{a}^{[\ell-1]} + \mathbf{b}^{[\ell]} \quad \in \mathbb{R}^{n^{[\ell]}} \] \[ \mathbf{a}^{[\ell]} = \sigma^{[\ell]}\!\left(\mathbf{z}^{[\ell]}\right) \quad \in \mathbb{R}^{n^{[\ell]}} \] \[ \hat{\mathbf{y}} = \mathbf{a}^{[L]} \quad \text{(final prediction)} \]
Mini-Batch Forward Pass — Vectorized

For a mini-batch of \(m\) samples stacked as columns: \(\mathbf{X} \in \mathbb{R}^{n^{[0]} \times m}\). All operations broadcast across the batch simultaneously:

Vectorized Batch Forward Pass
\[ \mathbf{A}^{[0]} = \mathbf{X} \in \mathbb{R}^{n^{[0]} \times m} \] \[ \mathbf{Z}^{[\ell]} = \mathbf{W}^{[\ell]}\mathbf{A}^{[\ell-1]} + \mathbf{b}^{[\ell]} \in \mathbb{R}^{n^{[\ell]} \times m} \] \[ \mathbf{A}^{[\ell]} = \sigma^{[\ell]}\!\left(\mathbf{Z}^{[\ell]}\right) \in \mathbb{R}^{n^{[\ell]} \times m} \]

b[ℓ] is broadcast across all m columns. σ is applied elementwise. Each column of A[L] is one sample's prediction.

Worked Example — 2-Layer Network

Network: [2 → 3 → 1], input \(\mathbf{x} = [0.5, -0.2]^T\), hidden layer uses ReLU, output uses sigmoid.

// Explicit numerical forward pass through a 2-layer network
1
Layer 1 pre-activation: \(\mathbf{z}^{[1]} = \mathbf{W}^{[1]}\mathbf{x} + \mathbf{b}^{[1]}\) \[\text{e.g. } \mathbf{W}^{[1]} = \begin{bmatrix}0.5 & -0.3 \\ 0.8 & 0.1 \\ -0.4 & 0.7\end{bmatrix}, \mathbf{b}^{[1]} = \begin{bmatrix}0\\0\\0\end{bmatrix} \Rightarrow \mathbf{z}^{[1]} = \begin{bmatrix}0.31 \\ 0.38 \\ -0.34\end{bmatrix}\]
2
Layer 1 activation (ReLU): \(\mathbf{a}^{[1]} = \text{ReLU}(\mathbf{z}^{[1]}) = \begin{bmatrix}0.31 \\ 0.38 \\ 0\end{bmatrix}\)
Third neuron is inactive for this input (z < 0 → output 0). Its outgoing weight (0.9) contributes nothing to this forward pass.
3
Layer 2 pre-activation: \(z^{[2]} = \mathbf{w}^{[2]T}\mathbf{a}^{[1]} + b^{[2]}\) \[\mathbf{w}^{[2]} = \begin{bmatrix}0.6 \\ -0.5 \\ 0.9\end{bmatrix}, b^{[2]} = 0.1 \Rightarrow z^{[2]} = 0.186 - 0.190 + 0 + 0.1 = 0.096\]
4
Output (sigmoid): \(\hat{y} = \sigma(0.096) = \frac{1}{1+e^{-0.096}} \approx 0.524\)
The network predicts ≈52.4% probability for class 1. With small random initial weights, predictions typically start near 0.5.
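The four steps above can be reproduced in a few lines of plain Python. The variable names are illustrative; the weights, biases, and input are the ones from the worked example.

```python
import math

def relu(v):
    return [max(0.0, t) for t in v]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

x = [0.5, -0.2]
W1 = [[0.5, -0.3], [0.8, 0.1], [-0.4, 0.7]]
b1 = [0.0, 0.0, 0.0]
w2 = [0.6, -0.5, 0.9]
b2 = 0.1

# Step 1: layer 1 pre-activation, one row of W1 per hidden neuron.
z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
# Step 2: ReLU zeroes the negative entry.
a1 = relu(z1)
# Step 3: output pre-activation (a single neuron).
z2 = sum(w * a for w, a in zip(w2, a1)) + b2
# Step 4: sigmoid output.
y_hat = sigmoid(z2)

print([round(t, 2) for t in z1], round(z2, 3), round(y_hat, 3))
```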
Cache!

During forward propagation, cache every \(\mathbf{Z}^{[\ell]}\) and \(\mathbf{A}^{[\ell]}\): you'll need them during backpropagation. This is a memory-versus-compute trade-off: gradient checkpointing stores only some activations and recomputes the rest during the backward pass, saving memory at the cost of extra compute.

06
§ Sec

Loss Functions

// MSE · cross-entropy · MLE connection · practical choices

The loss function \(\mathcal{L}(\hat{y}, y)\) measures how wrong the network's prediction is. It must be differentiable with respect to the network parameters so that backpropagation can compute gradients. The choice of loss is not arbitrary — it follows from the statistical model of the output.

Binary Cross-Entropy (BCE)
Binary Cross-Entropy
\[ \mathcal{L}_{\text{BCE}} = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\hat{y}^{(i)} + (1-y^{(i)})\log(1-\hat{y}^{(i)})\right] \]

Use with sigmoid output. Derived from MLE under a Bernoulli assumption. Penalizes confident wrong predictions heavily: the loss −log(p) grows without bound as the probability p assigned to the true class approaches 0.

Categorical Cross-Entropy (CCE)
Categorical Cross-Entropy
\[ \mathcal{L}_{\text{CCE}} = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)}\log\hat{y}_k^{(i)} \]

For one-hot labels: reduces to −log(p̂_true_class). Use with softmax output. Your chest X-ray classifier used this.

Mean Squared Error (MSE)
MSE
\[ \mathcal{L}_{\text{MSE}} = \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2 = \frac{1}{2m}\|\hat{\mathbf{y}} - \mathbf{y}\|^2 \]

Use with linear output for regression. Equivalent to MLE under Gaussian noise. The 1/2 makes the gradient cleaner: ∂L/∂ŷ = (ŷ−y)/m.

Softmax + CCE — The Beautiful Cancellation

The gradient of CCE loss w.r.t. the logits \(\mathbf{z}^{[L]}\) (before softmax) is perhaps the most elegant result in neural network training:

Softmax + CCE Gradient — Elegant Cancellation
\[ \frac{\partial \mathcal{L}_{\text{CCE}}}{\partial z_k^{[L]}} = \hat{y}_k - y_k \quad \text{(prediction minus true label)} \]

The softmax Jacobian and cross-entropy gradient cancel exactly — just like sigmoid + BCE. The error signal is simply the residual. In vector form: ∂L/∂z[L] = (ŷ − y)

This isn't coincidence — it's a consequence of the sigmoid/softmax being the canonical link function for Bernoulli/Categorical distributions (generalized linear models). The MLE gradient always has this residual form.
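The residual form of the softmax + CCE gradient can be verified with a finite-difference check. The sketch below uses plain Python; the logits and the one-hot label are arbitrary illustrative values, and `cce_loss` is a helper defined only for this example.

```python
import math

def softmax(z):
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cce_loss(z, y):
    """Cross-entropy of softmax(z) against a one-hot label vector y."""
    p = softmax(z)
    return -sum(yk * math.log(pk) for yk, pk in zip(y, p))

z = [2.0, -1.0, 0.5]   # logits (illustrative)
y = [0.0, 0.0, 1.0]    # one-hot label: true class is index 2

# Analytic gradient with respect to the logits: prediction minus label.
analytic = [pk - yk for pk, yk in zip(softmax(z), y)]

# Numerical gradient by central differences, one logit at a time.
h = 1e-6
numeric = []
for k in range(len(z)):
    z_plus = list(z)
    z_plus[k] += h
    z_minus = list(z)
    z_minus[k] -= h
    numeric.append((cce_loss(z_plus, y) - cce_loss(z_minus, y)) / (2 * h))

print(analytic)
print(numeric)
```

The components of the gradient sum to zero, because both the prediction and the label sum to 1.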

07
§ Sec

The Computational Graph

// DAG representation · nodes · edges · automatic differentiation

A computational graph is a Directed Acyclic Graph (DAG) where nodes are operations (or values) and edges are data dependencies. It's the data structure that makes automatic differentiation — and therefore backpropagation — mechanical and exact.

[Figure: computational graph for a single neuron. Forward nodes: x, w → × (wx); wx, b → + (z = wx + b) → σ(z) (a = σ(z)) → Loss L(a, y), with the label y as a second input. Backward edges carry ∂L/∂a, ∂L/∂z, ∂L/∂(wx+b), and ∂L/∂w.]
Fig 3. Computational graph for a single neuron. Forward pass (grey arrows) computes values left→right. Backward pass (red dashed) propagates gradients right→left via the chain rule.
Why the Graph Formalism Matters

Every modern deep learning framework (PyTorch, JAX, TensorFlow) represents computation as a graph and uses automatic differentiation (autograd) to backpropagate through it. You rarely need to hand-code gradients. The rules are:

  • Forward pass: Evaluate every node in topological order, caching all intermediate values.
  • Backward pass: Traverse the graph in reverse topological order, multiplying local gradients using the chain rule.
  • Local gradient: Each node knows its own derivative (e.g., + passes the gradient through unchanged; × swaps operands, so each input's gradient is the upstream gradient times the other input; max routes the gradient to the winning branch).
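The three rules above can be sketched as a tiny scalar reverse-mode autodiff. This is an illustrative toy, not how PyTorch, JAX, or TensorFlow are implemented; the class name `Value`, the squared-error loss, and the input numbers are all assumptions of the example.

```python
import math

class Value:
    """A scalar node in a computational graph that records how it was produced."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # nodes this value depends on
        self.local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        # Addition passes the upstream gradient through unchanged.
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # Multiplication swaps operands: d(ab)/da = b, d(ab)/db = a.
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def sigmoid(self):
        s = 1.0 / (1.0 + math.exp(-self.data))
        return Value(s, (self,), (s * (1.0 - s),))

    def backward(self):
        # Build a topological order by depth-first search, then sweep it in reverse.
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for p in node.parents:
                    visit(p)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += node.grad * local   # chain rule, accumulated

x, w, b, y = Value(2.0), Value(-0.5), Value(0.3), Value(1.0)
a = (w * x + b).sigmoid()
diff = a + y * Value(-1.0)
loss = diff * diff * Value(0.5)   # 0.5 * (a - y)^2
loss.backward()
print(loss.data, w.grad, b.grad)
```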
08
§ Sec

Backpropagation — Chain Rule Derivation

// the chain rule · delta notation · full derivation · matrix form

Backpropagation (Rumelhart, Hinton, Williams 1986) is the algorithm for computing \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[\ell]}}\) and \(\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{[\ell]}}\) for every layer \(\ell\) simultaneously. It's pure chain rule applied to the computational graph — nothing more.

The Multivariate Chain Rule
Chain Rule (scalar → vector)
\[ \frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y} \cdot \frac{\partial y}{\partial x} \qquad \text{(scalar case)} \] \[ \frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \left(\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\right)^T \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \qquad \text{(vector case — Jacobian transpose)} \]
Deriving the Backprop Equations — Layer by Layer

Define the error signal (delta) for layer \(\ell\):

Delta (Error Signal) Definition
\[ \boldsymbol{\delta}^{[\ell]} \triangleq \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[\ell]}} \in \mathbb{R}^{n^{[\ell]}} \]

How much the loss changes per unit change in the pre-activations of layer ℓ. This is the central quantity in backpropagation.

// Deriving the four fundamental equations of backpropagation
BP1
Error in output layer (layer \(L\)): \[\boldsymbol{\delta}^{[L]} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[L]}} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{[L]}} \odot \sigma'^{[L]}\!\left(\mathbf{z}^{[L]}\right)\] For softmax + CCE: \(\boldsymbol{\delta}^{[L]} = \hat{\mathbf{y}} - \mathbf{y}\) (the elegant cancellation from §06).
⊙ = elementwise product (Hadamard). ∂L/∂a[L] is the derivative of the loss w.r.t. the output — depends on which loss function you use.
BP2
Error propagation backward (layer \(\ell = L-1, \ldots, 1\)): \[\boldsymbol{\delta}^{[\ell]} = \left(\mathbf{W}^{[\ell+1]T}\boldsymbol{\delta}^{[\ell+1]}\right) \odot \sigma'^{[\ell]}\!\left(\mathbf{z}^{[\ell]}\right)\]
W[ℓ+1]ᵀ δ[ℓ+1]: backpropagate error through the weight matrix (transpose!). Then scale by local derivative σ'(z[ℓ]). This is the chain rule applied to the layered composition.
BP3
Gradient of weights: \[\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[\ell]}} = \boldsymbol{\delta}^{[\ell]}\left(\mathbf{a}^{[\ell-1]}\right)^T\]
Outer product of the error δ[ℓ] (column) and the previous layer activation a[ℓ-1] (row). Shape: n[ℓ] × n[ℓ-1] — matches W[ℓ]. Every weight gradient is error times its input.
BP4
Gradient of biases: \[\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{[\ell]}} = \boldsymbol{\delta}^{[\ell]}\]
The bias gradient equals the error signal directly. Since b has derivative 1 w.r.t. z = Wa + b, the chain rule gives ∂L/∂b = δ · 1 = δ.
The Four Fundamental Equations of Backpropagation

BP1: \(\boldsymbol{\delta}^{[L]} = \nabla_{\mathbf{a}^{[L]}}\mathcal{L} \odot \sigma'^{[L]}(\mathbf{z}^{[L]})\) — output layer error  |  BP2: \(\boldsymbol{\delta}^{[\ell]} = (\mathbf{W}^{[\ell+1]T}\boldsymbol{\delta}^{[\ell+1]}) \odot \sigma'^{[\ell]}(\mathbf{z}^{[\ell]})\) — error backpropagation  |  BP3: \(\partial\mathcal{L}/\partial\mathbf{W}^{[\ell]} = \boldsymbol{\delta}^{[\ell]}(\mathbf{a}^{[\ell-1]})^T\)  |  BP4: \(\partial\mathcal{L}/\partial\mathbf{b}^{[\ell]} = \boldsymbol{\delta}^{[\ell]}\).
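As a sketch of BP1 through BP4 on a concrete case, the code below applies them to the [2 → 3 → 1] network from the worked example in §05 (ReLU hidden layer, sigmoid output with BCE loss) and compares one weight gradient with a finite difference. The label y = 1 is an assumption made for this example, as are the function names.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def forward(W1, b1, w2, b2, x):
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    a1 = [max(0.0, t) for t in z1]                   # ReLU hidden layer
    z2 = sum(w * a for w, a in zip(w2, a1)) + b2
    return z1, a1, sigmoid(z2)                       # sigmoid output

def bce(y_hat, y):
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def backward(W1, b1, w2, b2, x, y):
    z1, a1, y_hat = forward(W1, b1, w2, b2, x)
    delta2 = y_hat - y                               # BP1 (sigmoid + BCE)
    dw2 = [delta2 * a for a in a1]                   # BP3
    db2 = delta2                                     # BP4
    delta1 = [w * delta2 * (1.0 if z > 0 else 0.0)   # BP2, ReLU' = 1[z > 0]
              for w, z in zip(w2, z1)]
    dW1 = [[d * xi for xi in x] for d in delta1]     # BP3 (outer product)
    db1 = delta1                                     # BP4
    return dW1, db1, dw2, db2

x, y = [0.5, -0.2], 1.0                              # y = 1 is an assumed label
W1 = [[0.5, -0.3], [0.8, 0.1], [-0.4, 0.7]]
b1 = [0.0, 0.0, 0.0]
w2, b2 = [0.6, -0.5, 0.9], 0.1

dW1, db1, dw2, db2 = backward(W1, b1, w2, b2, x, y)

# Finite-difference check on a single weight, W1[0][1].
h = 1e-6
W_plus = [row[:] for row in W1]
W_plus[0][1] += h
W_minus = [row[:] for row in W1]
W_minus[0][1] -= h
numeric = (bce(forward(W_plus, b1, w2, b2, x)[2], y)
           - bce(forward(W_minus, b1, w2, b2, x)[2], y)) / (2 * h)
print(dW1[0][1], numeric)
```

The third hidden neuron is inactive for this input, so its row of `dW1` is zero: no gradient reaches weights behind a ReLU that did not fire.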

09
§ Sec

Gradient Flow Through a Full Network

// vectorized backprop · mini-batch · full algorithm · complexity
Mini-Batch Backpropagation — Vectorized

With a batch of \(m\) samples stacked as columns in \(\mathbf{A}^{[\ell]} \in \mathbb{R}^{n^{[\ell]} \times m}\):

Vectorized Backpropagation (Batch)
\[ \boldsymbol{\Delta}^{[L]} = \hat{\mathbf{Y}} - \mathbf{Y} \in \mathbb{R}^{n^{[L]} \times m} \quad \text{(for softmax+CCE)} \] \[ \boldsymbol{\Delta}^{[\ell]} = \left(\mathbf{W}^{[\ell+1]T}\boldsymbol{\Delta}^{[\ell+1]}\right) \odot \sigma'^{[\ell]}\!\left(\mathbf{Z}^{[\ell]}\right) \] \[ \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[\ell]}} = \frac{1}{m}\boldsymbol{\Delta}^{[\ell]}\left(\mathbf{A}^{[\ell-1]}\right)^T \in \mathbb{R}^{n^{[\ell]} \times n^{[\ell-1]}} \] \[ \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{[\ell]}} = \frac{1}{m}\sum_{i=1}^{m}\boldsymbol{\Delta}^{[\ell]}_{:,i} = \frac{1}{m}\boldsymbol{\Delta}^{[\ell]}\mathbf{1} \in \mathbb{R}^{n^{[\ell]}} \]
Full Training Algorithm
// Neural Network Training — Complete Loop
# Assumed to exist: W, b (dicts of per-layer NumPy arrays, b[l] of shape (n, 1)),
# activation, sigma_prime (dicts of per-layer callables), L (number of layers),
# dataloader, cross_entropy, optimizer.
# Initialize weights (Xavier/He: see §11)
initialize_weights(model)

for epoch in range(num_epochs):
  for X_batch, y_batch in dataloader(train_data, batch_size):
    m = X_batch.shape[1]                    # samples are stacked as columns

    # ─── FORWARD PASS ───────────────────────────────
    cache = {0: (None, X_batch)}            # cache[l] = (Z[l], A[l]); A[0] = X
    A = X_batch
    for l in range(1, L+1):
      Z = W[l] @ A + b[l]
      A = activation[l](Z)
      cache[l] = (Z, A)                     # needed again in the backward pass
    y_hat = A

    # ─── COMPUTE LOSS ───────────────────────────────
    loss = cross_entropy(y_hat, y_batch)

    # ─── BACKWARD PASS (BP1 → BP4) ──────────────────
    dW, db = {}, {}
    Delta = y_hat - y_batch                 # BP1: output error (softmax + CCE)
    for l in range(L, 0, -1):
      A_prev = cache[l-1][1]
      dW[l] = (Delta @ A_prev.T) / m        # BP3
      db[l] = Delta.sum(axis=1, keepdims=True) / m   # BP4
      if l > 1:                             # BP2: propagate error to layer l-1
        Z_prev = cache[l-1][0]
        Delta = (W[l].T @ Delta) * sigma_prime[l-1](Z_prev)

    # ─── OPTIMIZER STEP ─────────────────────────────
    optimizer.step(W, b, dW, db)            # Adam / SGD
Computational Complexity
  • Forward pass per layer: \(O(n^{[\ell-1]} \cdot n^{[\ell]} \cdot m)\) — dominated by the matrix multiply \(\mathbf{W}^{[\ell]}\mathbf{A}^{[\ell-1]}\).
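  • Backward pass per layer: Also \(O(n^{[\ell-1]} \cdot n^{[\ell]} \cdot m)\), the same asymptotic cost as forward. It needs two matrix multiplies per layer (one for \(\partial\mathcal{L}/\partial\mathbf{W}^{[\ell]}\), one to propagate \(\boldsymbol{\Delta}\)), so a full training step costs roughly 3× a forward-only (inference) pass.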
  • Memory: Must store all cached activations \(\mathbf{A}^{[\ell]}\) — \(O(m \cdot \sum_\ell n^{[\ell]})\). This is why batch size and model depth are the primary memory constraints.
10
§ Sec

Vanishing & Exploding Gradients

// the pathology · geometric explanation · remedies

The backpropagation recurrence \(\boldsymbol{\delta}^{[\ell]} = (\mathbf{W}^{[\ell+1]T}\boldsymbol{\delta}^{[\ell+1]}) \odot \sigma'(\mathbf{z}^{[\ell]})\) multiplies together many matrices and activation derivatives. In deep networks, this product tends to either shrink toward zero or grow without bound.

The Mathematics of Vanishing Gradients

Unrolling BP2 from the output layer back to layer 1 gives a product of weight matrices and activation derivatives:

Gradient at Layer 1: Product Form
\[ \boldsymbol{\delta}^{[1]} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[1]}} = \left(\prod_{\ell=1}^{L-1} \operatorname{diag}\!\left(\sigma'(\mathbf{z}^{[\ell]})\right)\mathbf{W}^{[\ell+1]T}\right) \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[L]}} \]

The product is ordered with ℓ increasing from left to right. Each factor contains σ'(z[ℓ]) ≤ 0.25 for sigmoid. In the simplified case of scalar weights with |w| ≈ 1, 20 such factors give at most 0.25²⁰ ≈ 10⁻¹², effectively zero.
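A quick numerical illustration of the scalar case, in plain Python. The function `gradient_scale` is a hypothetical helper: it assumes every layer has the same scalar weight w and the same pre-activation z, which is a simplification made only to expose the geometric decay.

```python
import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def gradient_scale(num_layers, w=1.0, z=0.0):
    """Magnitude of the backpropagated factor after num_layers scalar sigmoid layers."""
    return (abs(w) * sigmoid_prime(z)) ** num_layers

# Best case for sigmoid (z = 0, derivative 0.25) still decays geometrically.
for depth in (1, 5, 10, 20):
    print(depth, gradient_scale(depth))

# With |w| * 0.25 = 1 the factors neither shrink nor grow at z = 0.
print("with w = 4:", gradient_scale(20, w=4.0))
```

Larger weights offset the shrinkage only while z stays near 0; they also push the pre-activations toward the saturated region, where σ'(z) is far below 0.25.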

Vanishing Gradients

Cause: \(|\sigma'(z)| \ll 1\) at every layer (sigmoid: max 0.25) or weights \(|\mathbf{W}| \ll 1\).

Effect: Early layers receive near-zero gradient — they barely update. Later layers learn; earlier layers stay near initialization. The network behaves as if it has far fewer effective layers.

Symptom: Training loss stagnates after early epochs. Gradients in early layers orders of magnitude smaller than late layers.

Exploding Gradients

Cause: Weights \(|\mathbf{W}| \gg 1\) compound exponentially. Particularly severe in RNNs where the same weight matrix is multiplied many times.

Effect: Gradients grow to NaN. Parameters update by enormous amounts; training diverges. Loss goes to infinity.

Symptom: NaN in loss after first few iterations. Gradient norm plot shows explosive growth.

Remedies
Problem | Remedy | Mechanism
Vanishing (activations) | Replace sigmoid/tanh with ReLU/GELU | ReLU has gradient 1 for z>0, so it doesn't compress gradient magnitude
Vanishing (architecture) | Residual connections (ResNets, §15) | Skip connections provide a gradient highway: ∂(x+F(x))/∂x = 1 + ∂F/∂x
Vanishing/Exploding | Batch Normalization | Normalizes pre-activations to zero mean and unit variance, stabilizing gradient magnitude
Exploding | Gradient clipping | Cap gradient norm: if ‖g‖ > threshold, g ← g·threshold/‖g‖
Both | Careful initialization (§11) | Xavier/He keeps gradient variance ≈1 at every layer at initialization
Both (RNNs) | LSTM/GRU gates | Learned gates control gradient flow, preventing compounding
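The gradient-clipping rule from the table is short enough to write out. This is a sketch on a flat list of gradient components; the name `clip_by_norm` is hypothetical, and frameworks provide their own equivalents.

```python
import math

def clip_by_norm(grad, threshold):
    """Rescale grad so its Euclidean norm is at most threshold; direction is unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        return [g * threshold / norm for g in grad]
    return list(grad)

print(clip_by_norm([3.0, 4.0], threshold=1.0))   # norm 5 is rescaled to norm 1
print(clip_by_norm([0.3, 0.4], threshold=1.0))   # norm 0.5 is left unchanged
```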
11
§ Sec

Weight Initialization

// why it matters · Xavier / Glorot · He · analysis

Initialization determines the starting point of gradient descent and, more critically, the variance of activations and gradients at the start of training. Poor initialization → vanishing or exploding gradients from iteration one.

The Variance Propagation Analysis

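For a layer with \(n\) inputs, independent zero-mean weights \(w_{ij}\) with variance \(\sigma_w^2\), and independent zero-mean inputs \(a_j\) with variance \(\sigma_a^2\), the variance of the pre-activation \(z_i = \sum_j w_{ij}a_j\) is: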

Forward Variance Propagation
\[ \text{Var}(z) = n \cdot \sigma_w^2 \cdot \sigma_a^2 \]

For activations to stay unit-variance in the forward pass: σ_w² = 1/n_in (here n = n_in, the fan-in). Keeping the backpropagated gradients stable needs σ_w² = 1/n_out instead. Xavier compromises between both.

Xavier / Glorot Initialization (for tanh/sigmoid)
Xavier Initialization
\[ w_{ij} \sim \mathcal{U}\!\left[-\frac{\sqrt{6}}{\sqrt{n^{[\ell-1]} + n^{[\ell]}}},\; \frac{\sqrt{6}}{\sqrt{n^{[\ell-1]} + n^{[\ell]}}}\right] \] \[ \text{or, with the same variance: } w_{ij} \sim \mathcal{N}\!\left(0, \frac{2}{n^{[\ell-1]} + n^{[\ell]}}\right) \]

Compromises between forward (÷n_in) and backward (÷n_out) variance stability. Well suited to tanh: its derivative ≈1 near zero satisfies the linear assumption.

He Initialization (for ReLU)
He Initialization (Kaiming)
\[ w_{ij} \sim \mathcal{N}\!\left(0, \frac{2}{n^{[\ell-1]}}\right) \]

Factor 2 accounts for ReLU killing half the neurons (E[max(0,z)²] = σ²/2 for Gaussian z). Scales by 2/n_in instead of 1/n_in. Default for any ReLU network.
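The effect of the factor 2 can be seen in a small simulation. The sketch below pushes one random input through a stack of ReLU layers using only the standard library; the width, depth, seed, and the helper name `mean_square_after` are illustrative assumptions, and biases are omitted.

```python
import math
import random

def mean_square_after(depth, n, weight_var, seed=0):
    """Push one random input through `depth` ReLU layers of width n and
    return the mean squared activation of the last layer."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    std = math.sqrt(weight_var)
    for _ in range(depth):
        # Each output neuron draws a fresh row of Gaussian weights.
        z = [sum(rng.gauss(0.0, std) * aj for aj in a) for _ in range(n)]
        a = [max(0.0, t) for t in z]
    return sum(t * t for t in a) / n

n, depth = 128, 8
print("He (2/n):", mean_square_after(depth, n, 2.0 / n))
print("1/n     :", mean_square_after(depth, n, 1.0 / n))
```

With variance 2/n the signal magnitude stays on the order of 1, while with 1/n it roughly halves at every ReLU layer, so after 8 layers it is smaller by a factor of about 2⁸.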

Rule

Rule of thumb: sigmoid/tanh → Xavier (Glorot). ReLU/Leaky ReLU → He (Kaiming). GELU → He works well in practice. Never initialize all weights to zero — symmetric weights produce identical gradients and the network fails to break symmetry. Always add small random noise.

12
§ Sec

Optimization for Neural Networks

// SGD · momentum · Adam · learning rate schedules · BatchNorm

Neural network optimization is gradient descent with modifications to handle the non-convex loss landscape of deep networks. The core algorithm is SGD; Adam is the practical standard. Full mathematical treatment is in the Gradient Descent masterclass — here we focus on the NN-specific aspects.

SGD + Momentum — The Baseline
SGD with Momentum
\[ \mathbf{v}^{(t)} = \beta\mathbf{v}^{(t-1)} + (1-\beta)\nabla_{\boldsymbol{\theta}}\mathcal{L}^{(t)} \] \[ \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha \mathbf{v}^{(t)} \]
Adam — The Standard
Adam Optimizer
\[ \mathbf{m}^{(t)} = \beta_1\mathbf{m}^{(t-1)} + (1-\beta_1)\mathbf{g}^{(t)}, \quad \hat{\mathbf{m}} = \frac{\mathbf{m}^{(t)}}{1-\beta_1^t} \] \[ \mathbf{v}^{(t)} = \beta_2\mathbf{v}^{(t-1)} + (1-\beta_2)(\mathbf{g}^{(t)})^2, \quad \hat{\mathbf{v}} = \frac{\mathbf{v}^{(t)}}{1-\beta_2^t} \] \[ \boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \frac{\alpha}{\sqrt{\hat{\mathbf{v}}} + \epsilon}\hat{\mathbf{m}} \]

Defaults: β₁=0.9, β₂=0.999, ε=1e-8, α=1e-3. AdamW (decoupled weight decay) is preferred for transformers.
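A scalar version of the Adam update, written out step by step. This is a sketch for a single parameter; the function name `adam_step` and the toy objective f(θ) = θ² are assumptions of the example, and the defaults are the ones listed above.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad    # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta^2 (gradient 2 * theta), starting from theta = 1.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t)
print(theta)
```

On the very first step the bias-corrected ratio is g/|g|, so the parameter moves by about the learning rate regardless of the gradient's scale. That scale invariance is a large part of why Adam needs less learning-rate tuning than plain SGD.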

Batch Normalization — Stabilizing Training

BatchNorm normalizes the pre-activations within each mini-batch to zero mean and unit variance, then applies a learnable scale and shift:

Batch Normalization
\[ \hat{z}^{(i)} = \frac{z^{(i)} - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}}, \quad \tilde{z}^{(i)} = \gamma\hat{z}^{(i)} + \beta \]

γ and β are learnable per-feature. μ_𝓑 and σ_𝓑 computed per mini-batch during training; running stats used at inference. Allows higher LR, reduces sensitivity to initialization, provides mild regularization.
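The training-mode BatchNorm computation for a single feature can be sketched as follows. The function name `batchnorm_forward` and the value eps = 1e-5 are assumptions of this example; running statistics for inference are left out.

```python
import math

def batchnorm_forward(z_batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift (training mode)."""
    m = len(z_batch)
    mu = sum(z_batch) / m
    var = sum((z - mu) ** 2 for z in z_batch) / m    # biased (population) variance
    z_hat = [(z - mu) / math.sqrt(var + eps) for z in z_batch]
    return [gamma * zh + beta for zh in z_hat]

out = batchnorm_forward([2.0, 4.0, 6.0, 8.0])
print(out)   # zero mean, variance just under 1 because of eps
print(batchnorm_forward([2.0, 4.0, 6.0, 8.0], gamma=2.0, beta=3.0))
```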

13
§ Sec

Universal Approximation Theorem

// statement · what it guarantees · what it doesn't · depth advantage
Universal Approximation Theorem (Cybenko 1989, Hornik 1991)

A feedforward neural network with one hidden layer of a sufficient number of neurons using a non-polynomial activation function can approximate any continuous function \(f: [0,1]^n \to \mathbb{R}\) to arbitrary precision \(\epsilon > 0\):

\[\forall\epsilon > 0,\; \exists N: \quad \sup_{\mathbf{x}\in[0,1]^n}\left|f(\mathbf{x}) - \sum_{j=1}^{N}v_j\sigma(\mathbf{w}_j^T\mathbf{x}+b_j)\right| < \epsilon\]
What the Theorem Does NOT Guarantee
  • It doesn't say how many neurons \(N\) are needed — it may be exponential in the input dimension. A single wide layer is theoretically sufficient but practically infeasible.
  • It doesn't say gradient descent will find the parameters — the optimization landscape may prevent learning the right function even if it exists.
  • It doesn't say the network will generalize — approximating training data ≠ learning the true function. A depth-1 network that memorizes is a valid approximator but useless.
Why Depth is Exponentially More Efficient

Telgarsky (2016) and others showed that there exist functions that require exponentially many neurons in a shallow network but can be represented with polynomially many neurons by adding depth. Deep networks exploit hierarchical compositionality: each layer reuses computations from the previous, combinatorially building complex representations from simpler ones.
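A small sketch in the spirit of that depth-separation argument: a "tent" function built from two ReLU units is composed with itself, and the number of oscillations doubles with every added layer. The function names, the grid size, and the crossing-count measure are choices made for this illustration, not a reproduction of the paper's proof.

```python
def relu(t):
    return max(0.0, t)

def tent(x):
    """One layer built from two ReLU units: rises 0 -> 1 on [0, 0.5], falls 1 -> 0 on [0.5, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_sawtooth(x, depth):
    """Compose the tent layer `depth` times."""
    for _ in range(depth):
        x = tent(x)
    return x

def count_crossings(depth, grid=997, level=0.5):
    """Count how often the depth-fold composition crosses `level` on [0, 1]."""
    values = [deep_sawtooth(i / grid, depth) - level for i in range(grid + 1)]
    return sum(1 for u, v in zip(values, values[1:]) if u * v < 0)

for depth in (1, 2, 3, 4):
    print(depth, "layers ->", count_crossings(depth), "crossings")
```

Each layer adds only two ReLU units, yet the crossing count doubles. A one-hidden-layer ReLU network on a scalar input is piecewise linear with at most one more piece than it has hidden units, so matching k composed layers would take on the order of 2ᵏ units.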

Intuition

To detect a face: pixel intensities → edges → corners → eyes/nose/mouth → face. A single hidden layer must detect "face" from raw pixels directly — no intermediate concepts. A deep network builds intermediate representations that are reused across many faces. This is why depth, not just width, is what makes deep learning work.

14
§ Sec

Putting It All Together — Full Worked Pass

// 3-layer network · forward · loss · backward · update

A complete end-to-end example: network [2 → 4 → 4 → 1] for binary classification. We trace one full iteration — forward, loss, backward, update — with full matrix notation.

Architecture Setup
Network: [2 → 4 → 4 → 1]
\[ \mathbf{W}^{[1]} \in \mathbb{R}^{4\times 2}, \; \mathbf{b}^{[1]} \in \mathbb{R}^4 \qquad \text{(Layer 1: ReLU)} \] \[ \mathbf{W}^{[2]} \in \mathbb{R}^{4\times 4}, \; \mathbf{b}^{[2]} \in \mathbb{R}^4 \qquad \text{(Layer 2: ReLU)} \] \[ \mathbf{w}^{[3]} \in \mathbb{R}^{1\times 4}, \; b^{[3]} \in \mathbb{R} \qquad \text{(Layer 3: Sigmoid)} \] \[ \text{Parameters: }4\times 2+4 + 4\times 4+4 + 1\times 4+1 = 8+4+16+4+4+1 = 37 \]
Complete Forward → Loss → Backward → Update Flow
// Full training step for one mini-batch
F1
Layer 1 forward: \(\mathbf{Z}^{[1]} = \mathbf{W}^{[1]}\mathbf{X} + \mathbf{b}^{[1]}\), \(\mathbf{A}^{[1]} = \text{ReLU}(\mathbf{Z}^{[1]})\)  → cache \((\mathbf{Z}^{[1]}, \mathbf{A}^{[1]})\)
F2
Layer 2 forward: \(\mathbf{Z}^{[2]} = \mathbf{W}^{[2]}\mathbf{A}^{[1]} + \mathbf{b}^{[2]}\), \(\mathbf{A}^{[2]} = \text{ReLU}(\mathbf{Z}^{[2]})\)  → cache \((\mathbf{Z}^{[2]}, \mathbf{A}^{[2]})\)
F3
Layer 3 forward: \(Z^{[3]} = \mathbf{w}^{[3]}\mathbf{A}^{[2]} + b^{[3]}\), \(\hat{Y} = \sigma(Z^{[3]})\)  → cache \((Z^{[3]}, \hat{Y})\)
Loss
\(\mathcal{L} = -\frac{1}{m}\sum_i\left[y^{(i)}\log\hat{y}^{(i)} + (1-y^{(i)})\log(1-\hat{y}^{(i)})\right]\)
B3
Backprop Layer 3 (BP1 + BP3 + BP4): \[\boldsymbol{\Delta}^{[3]} = \hat{Y} - Y \quad \text{(sigmoid+BCE cancellation)}\] \[d\mathbf{w}^{[3]} = \frac{1}{m}\boldsymbol{\Delta}^{[3]}(\mathbf{A}^{[2]})^T, \quad db^{[3]} = \frac{1}{m}\sum\boldsymbol{\Delta}^{[3]}\]
B2
Backprop Layer 2 (BP2 + BP3 + BP4): \[\boldsymbol{\Delta}^{[2]} = (\mathbf{w}^{[3]T}\boldsymbol{\Delta}^{[3]}) \odot \mathbf{1}[\mathbf{Z}^{[2]} > 0]\] \[d\mathbf{W}^{[2]} = \frac{1}{m}\boldsymbol{\Delta}^{[2]}(\mathbf{A}^{[1]})^T, \quad d\mathbf{b}^{[2]} = \frac{1}{m}\boldsymbol{\Delta}^{[2]}\mathbf{1}\]
The ReLU derivative is 1[Z>0] elementwise: 1 where the unit is active (Z > 0), 0 where it is inactive.
B1
Backprop Layer 1 (BP2 + BP3 + BP4): \[\boldsymbol{\Delta}^{[1]} = (\mathbf{W}^{[2]T}\boldsymbol{\Delta}^{[2]}) \odot \mathbf{1}[\mathbf{Z}^{[1]} > 0]\] \[d\mathbf{W}^{[1]} = \frac{1}{m}\boldsymbol{\Delta}^{[1]}\mathbf{X}^T, \quad d\mathbf{b}^{[1]} = \frac{1}{m}\boldsymbol{\Delta}^{[1]}\mathbf{1}\]
Upd
Update all parameters (using Adam or SGD): \[\mathbf{W}^{[\ell]} \leftarrow \mathbf{W}^{[\ell]} - \alpha \cdot \text{optimizer}(d\mathbf{W}^{[\ell]}), \quad \mathbf{b}^{[\ell]} \leftarrow \mathbf{b}^{[\ell]} - \alpha \cdot \text{optimizer}(d\mathbf{b}^{[\ell]})\]
Repeat for every mini-batch in every epoch. In practice, training is stopped when gradient norms become small or validation loss plateaus.
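The steps F1 through Upd can be sketched end to end in dependency-free Python. Treat this as an illustration under stated assumptions rather than a reference implementation. The toy batch is made up, the update is plain SGD (not Adam), the He-style initialization is assumed, and nested lists stand in for the matrices. In practice you would use NumPy or a deep learning framework.

```python
import math
import random


def matmul(A, B):
    """Matrix product of two row-major nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def affine(W, b, A_prev):
    """Z = W A_prev + b, with bias b[i] broadcast across the m columns."""
    return [[z + b_i for z in row] for row, b_i in zip(matmul(W, A_prev), b)]


def relu(Z):
    return [[max(0.0, z) for z in row] for row in Z]


def sigmoid(Z):
    return [[1.0 / (1.0 + math.exp(-z)) for z in row] for row in Z]


def init_params(sizes, seed=0):
    """He-style Gaussian weights and zero biases (an assumed init scheme)."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        std = math.sqrt(2.0 / n_in)
        W = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
        params.append((W, [0.0] * n_out))
    return params


def forward(params, X):
    """Steps F1-F3: return Y_hat and the cache needed for backprop."""
    (W1, b1), (W2, b2), (W3, b3) = params
    Z1 = affine(W1, b1, X)
    A1 = relu(Z1)
    Z2 = affine(W2, b2, A1)
    A2 = relu(Z2)
    Y_hat = sigmoid(affine(W3, b3, A2))
    return Y_hat, (Z1, A1, Z2, A2)


def bce(Y_hat, Y):
    """Binary cross-entropy averaged over the m examples."""
    m = len(Y[0])
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(Y_hat[0], Y[0])) / m


def relu_delta(W_next, Delta_next, Z):
    """BP2 with ReLU: (W_next^T Delta_next) masked by 1[Z > 0]."""
    back = matmul(transpose(W_next), Delta_next)
    return [[g if z > 0 else 0.0 for g, z in zip(g_row, z_row)]
            for g_row, z_row in zip(back, Z)]


def layer_grads(Delta, A_prev):
    """BP3 and BP4, averaged over the batch."""
    m = len(Delta[0])
    dW = [[g / m for g in row] for row in matmul(Delta, transpose(A_prev))]
    db = [sum(row) / m for row in Delta]
    return dW, db


def train_step(params, X, Y, lr):
    """One forward/backward/update pass; returns (new_params, loss before update)."""
    Y_hat, (Z1, A1, Z2, A2) = forward(params, X)
    D3 = [[p - y for p, y in zip(Y_hat[0], Y[0])]]   # B3: sigmoid + BCE cancellation
    D2 = relu_delta(params[2][0], D3, Z2)            # B2
    D1 = relu_delta(params[1][0], D2, Z1)            # B1
    grads = [layer_grads(D1, X), layer_grads(D2, A1), layer_grads(D3, A2)]
    new_params = []
    for (W, b), (dW, db) in zip(params, grads):      # plain SGD update
        W_new = [[w - lr * g for w, g in zip(w_row, g_row)]
                 for w_row, g_row in zip(W, dW)]
        b_new = [b_i - lr * g for b_i, g in zip(b, db)]
        new_params.append((W_new, b_new))
    return new_params, bce(Y_hat, Y)


# Made-up toy batch: 2 features, m = 4 examples stored as columns.
X = [[0.5, -1.0, 1.5, -0.5],
     [1.0, 0.5, -1.0, -1.5]]
Y = [[1, 0, 1, 0]]

params = init_params([2, 4, 4, 1])
for step in range(100):
    params, loss = train_step(params, X, Y, lr=0.1)
    if step in (0, 99):
        print(f"step {step}: loss = {loss:.4f}")
```

Storing examples as columns keeps the code aligned with the notation above: `layer_grads` is exactly \(\frac{1}{m}\boldsymbol{\Delta}(\mathbf{A}^{[\ell-1]})^T\) and \(\frac{1}{m}\boldsymbol{\Delta}\mathbf{1}\). A quick sanity check is to compare one analytic gradient entry against a finite-difference estimate of the loss.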
15
§ Sec

Connection to Modern Architectures

// CNNs · Transformers · ResNets · your XAI project

Most modern architectures can be read as an MLP with structural modifications that exploit domain knowledge to reduce parameters, improve gradient flow, and capture spatial/sequential structure.

CNN
z[ℓ] = W[ℓ] ★ a[ℓ-1] + b[ℓ]

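Replace the matrix multiply with a convolution (★). Weights are shared spatially: a single filter is applied at every location, so the layer needs dramatically fewer parameters than a dense layer. Backprop through the layer is the same kind of sliding operation: the gradient with respect to the input is obtained by sliding the spatially flipped filter over the zero-padded upstream gradient.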
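The flipped-filter statement can be checked in one dimension. The sketch below assumes stride 1, no padding in the forward pass, and the deep-learning convention in which the forward "convolution" is a cross-correlation. The function names are made up for illustration.

```python
def cross_correlate(x, w):
    """'Valid' cross-correlation: y[i] = sum_k w[k] * x[i + k]."""
    n_out = len(x) - len(w) + 1
    return [sum(w[k] * x[i + k] for k in range(len(w))) for i in range(n_out)]


def input_gradient(dy, w, n_in):
    """dL/dx by the chain rule: each dy[i] * w[k] flows back to x[i + k]."""
    dx = [0.0] * n_in
    for i, g in enumerate(dy):
        for k, w_k in enumerate(w):
            dx[i + k] += g * w_k
    return dx


def flipped_filter_gradient(dy, w):
    """Same gradient: zero-pad dy, then slide the flipped filter over it."""
    pad = [0.0] * (len(w) - 1)
    return cross_correlate(pad + list(dy) + pad, w[::-1])


x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [1.0, 0.0, -1.0]          # 3 shared weights, regardless of len(x)
dy = [1.0, 2.0, 3.0]          # made-up upstream gradient, one entry per output

print(cross_correlate(x, w))
print(input_gradient(dy, w, len(x)) == flipped_filter_gradient(dy, w))
```

The filter has three parameters no matter how long the input is, whereas a dense layer mapping 5 inputs to 3 outputs would need 15 weights.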

ResNet
a[ℓ+1] = σ(F(a[ℓ]) + a[ℓ])

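A residual (skip) connection adds the input directly to the output. Gradient flows back through the identity path: ∂/∂a[ℓ] = 1 + ∂F/∂a[ℓ] (up to the outer σ′ factor, which is 1 for active ReLU units). This largely mitigates vanishing gradients, which is what makes networks with hundreds or even 1000+ layers trainable. Note that the VGG16 used in your transfer learning has no skip connections; it is a plain convolutional stack.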
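A scalar toy model makes the effect of the identity path visible. This is a sketch under simplifying assumptions: each layer is \(F(a) = w\tanh(a)\) with the same made-up weight, and the residual form is \(a_{\ell+1} = a_\ell + F(a_\ell)\) with no outer activation.

```python
import math


def chain_gradient(depth, w, a0, residual):
    """Return d a_L / d a_0 for a scalar stack of `depth` layers."""
    a, grad = a0, 1.0
    for _ in range(depth):
        local = w * (1.0 - math.tanh(a) ** 2)   # dF/da for F(a) = w * tanh(a)
        if residual:
            grad *= 1.0 + local                 # identity path contributes the 1
            a = a + w * math.tanh(a)
        else:
            grad *= local                       # plain stack: only dF/da survives
            a = w * math.tanh(a)
    return grad


plain = chain_gradient(depth=30, w=0.5, a0=1.0, residual=False)
skip = chain_gradient(depth=30, w=0.5, a0=1.0, residual=True)
print(f"plain stack: {plain:.3e}   residual stack: {skip:.3e}")
```

In the plain stack every factor is at most 0.5, so the product shrinks geometrically with depth. In the residual stack every factor is at least 1, so the gradient cannot vanish, although nothing in this toy model prevents it from growing.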

Transformer / ViT
Attn(Q,K,V) = softmax(QKᵀ/√d)V

Self-attention replaces spatial inductive bias with learned pairwise relationships. Your ViT-B/16 for chest X-ray classification: image divided into 16×16 patches → projected to embeddings → attention layers → MLP head → softmax. All governed by BP1–BP4.
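The attention formula above fits in a few lines of plain Python. This is a minimal single-head sketch with made-up inputs; it leaves out the learned projections that produce Q, K, and V, as well as masking and multi-head splitting.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for row-major nested lists."""
    d = len(K[0])
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d)
               for k_row in K] for q_row in Q]
    weights = [softmax(row) for row in scores]
    out = [[sum(w * v_row[j] for w, v_row in zip(w_row, V))
            for j in range(len(V[0]))] for w_row in weights]
    return out, weights


# Made-up example: 2 queries attending over 3 key/value pairs, d = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out, weights = attention(Q, K, V)
print(weights)
print(out)
```

Each row of `weights` is a probability distribution over the keys, so each output row is a convex combination of the rows of V.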

XAI / GradCAM
L_c^{GradCAM} = ReLU(Σ_k α_k^c A^k)

Your XAI work used GradCAM: gradients of the class score w.r.t. the final conv feature maps identify discriminative regions. α_k^c = (1/Z) Σ_i Σ_j ∂y^c/∂A^k_ij, where Z is the number of spatial positions in a feature map. This is backprop stopped at the last conv layer: same BP equations, different stopping point.
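Given the feature maps and their gradients, the GradCAM map is a short computation. The sketch below takes both as plain nested lists with made-up values; in a real pipeline the gradients would come from a framework's autodiff, which is not shown here.

```python
def grad_cam(feature_maps, gradients):
    """ReLU(sum_k alpha_k * A^k), with alpha_k the spatial mean of dy^c/dA^k.

    feature_maps, gradients: lists of K maps, each an H x W nested list.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for A_k, G_k in zip(feature_maps, gradients):
        alpha_k = sum(sum(row) for row in G_k) / (height * width)  # global average pool
        for i in range(height):
            for j in range(width):
                cam[i][j] += alpha_k * A_k[i][j]
    return [[max(0.0, v) for v in row] for row in cam]


# Two made-up 2 x 2 feature maps and their (made-up) gradients.
A = [[[1.0, 2.0], [3.0, 4.0]],
     [[4.0, 3.0], [2.0, 1.0]]]
G = [[[0.5, 0.5], [0.5, 0.5]],      # alpha = 0.5: evidence for the class
     [[-1.0, -1.0], [-1.0, -1.0]]]  # alpha = -1.0: evidence against it

print(grad_cam(A, G))  # [[0.0, 0.0], [0.0, 1.0]]
```

The final ReLU keeps only locations whose weighted evidence is positive for the class, which is why the heatmap highlights supporting regions only.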

My Work

My respiratory disease chest X-ray classification with custom CNN, VGG16, and ViT-B/16: all three used the same BP1–BP4 equations during training. VGG16 transfer learning froze early layers (no gradient update) and fine-tuned late layers, demonstrating that gradient updates can be enabled or disabled layer by layer. The XAI explanations were literally gradient flows visualized spatially.

16
§ Sec

The Complete Mental Model

// everything connected · one diagram · one thread
NEURAL NETWORK: COMPLETE MENTAL MAP
  • THE NEURON: a = σ(Wx + b) = σ(z)
  • LAYERS: Z[ℓ] = W[ℓ]A[ℓ-1] + b[ℓ]
  • ACTIVATION: ReLU / GELU / Softmax
  • INITIALIZATION: Xavier / He, stable σ
  • FORWARD PASS: ŷ = σ(W[L]···σ(W[1]x))
  • LOSS FUNCTION: L(ŷ, y), MLE derived
  • COMP. GRAPH: DAG of operations
  • BACKPROPAGATION: BP1–BP4 · δ[ℓ] equations
  • OPTIMIZER: Adam · SGD · momentum
  • Training loop
THE ONE UNIFIED PRINCIPLE
  1. Model: compose neurons → layers → network. Each layer: Z = WA+b, A = σ(Z).
  2. Loss: MLE under output distribution → BCE (binary), CCE (multiclass), MSE (regression).
  3. Backprop: BP1 δ[L]=ŷ-y, BP2 δ[ℓ]=W[ℓ+1]ᵀδ[ℓ+1]⊙σ'(z[ℓ]), BP3 dW=δAᵀ, BP4 db=δ.
  4. Optimize: Adam (momentum + adaptive LR per param). Init: He for ReLU, Xavier for tanh.
  5. Architecture shapes the graph. Convolution, attention, residuals all use the same BP equations.
Fig 4. Complete neural network mental map — from biological inspiration through the four backprop equations to the training loop that closes everything together.

The complete story in one coherent thread:

  1. Biological origin: A neuron integrates weighted inputs, adds a bias, and fires if the total crosses a threshold. The mathematical neuron generalizes this: \(z = \mathbf{w}^T\mathbf{x} + b\), \(a = \sigma(z)\), with continuous weights and a differentiable activation.
  2. Why nonlinearity is mandatory: Linear layers compose to a single linear map, so without activation functions extra depth adds no expressive power. The nonlinearity \(\sigma\) is what lets deep networks represent some functions with exponentially fewer units than a shallow network would need.
  3. Forward propagation: Layer by layer, \(\mathbf{Z}^{[\ell]} = \mathbf{W}^{[\ell]}\mathbf{A}^{[\ell-1]} + \mathbf{b}^{[\ell]}\), \(\mathbf{A}^{[\ell]} = \sigma(\mathbf{Z}^{[\ell]})\). Cache everything — you'll need it for backprop.
  4. Loss functions are not arbitrary: MLE under Gaussian noise → MSE. MLE under Bernoulli → BCE. MLE under Categorical → CCE. The output activation and loss are a matched pair from probability theory.
  5. Backpropagation is four equations (BP1–BP4): Output error \(\boldsymbol{\delta}^{[L]} = \hat{\mathbf{y}}-\mathbf{y}\) (valid for a matched output activation and loss, such as sigmoid + BCE), then propagate backward as \(\boldsymbol{\delta}^{[\ell]} = (\mathbf{W}^{[\ell+1]T}\boldsymbol{\delta}^{[\ell+1]}) \odot \sigma'(\mathbf{z}^{[\ell]})\), then \(d\mathbf{W}^{[\ell]} = \boldsymbol{\delta}^{[\ell]}(\mathbf{A}^{[\ell-1]})^T\) and \(d\mathbf{b}^{[\ell]} = \boldsymbol{\delta}^{[\ell]}\) per example; over a mini-batch of size \(m\), sum across examples and scale by \(\frac{1}{m}\). That's it.
  6. Vanishing gradients arise because \(\sigma'(z)\) compounds across layers — use ReLU/GELU, residual connections, and He initialization to keep gradient magnitude ≈1 throughout the network.
  7. Initialization matters critically at the start: He initialization keeps activation variance ≈1 for ReLU; Xavier for sigmoid/tanh. Poor init → vanishing/exploding gradients before training even begins.
  8. All modern architectures — CNNs, Transformers, ViTs, ResNets — run the same BP1–BP4 equations. Convolution shares weights spatially. Residual connections add gradient highways. Attention learns pairwise relationships. The equations are unchanged; only the graph topology differs.
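The matched-pair claim in items 4 and 5 can be verified numerically: for a sigmoid output with BCE loss, the derivative of the loss with respect to the pre-activation \(z\) should equal \(\hat{y} - y\). The sketch below compares that expression against a central finite difference at a few made-up points.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_logit(z, y):
    """Binary cross-entropy of sigmoid(z) against label y."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def numeric_grad(z, y, eps=1e-6):
    """Central finite-difference estimate of dL/dz."""
    return (bce_from_logit(z + eps, y) - bce_from_logit(z - eps, y)) / (2 * eps)


checks = []
for z, y in [(-2.0, 1), (0.3, 0), (1.5, 1)]:
    analytic = sigmoid(z) - y            # BP1 for sigmoid + BCE
    checks.append((analytic, numeric_grad(z, y)))
    print(f"z={z:+.1f} y={y}: analytic={analytic:+.6f} numeric={checks[-1][1]:+.6f}")
```

The same finite-difference idea extends to whole networks and is a common way to catch sign or transpose mistakes in a hand-written backward pass.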