The Biological Neuron
Neural networks borrow their structure — and their name — from the biological neurons in the human brain. Understanding the biology isn't just historical context; the parallels reveal why the mathematical structure of artificial networks works.
- Dendrites — branch-like extensions that receive electrical/chemical signals from upstream neurons. Each connection (synapse) has a variable strength — analogous to weights \(w_i\).
- Soma (cell body) — integrates all incoming signals from all dendrites. Performs a weighted summation. If the total exceeds a threshold, it fires.
- Axon — transmits the output signal (action potential) to downstream neurons. A neuron either fires or doesn't — it's a binary threshold event.
- Synapse — the junction between two neurons. Strength is modifiable through learning (Hebbian plasticity: "neurons that fire together, wire together").
Warren McCulloch and Walter Pitts built the first mathematical model of a neuron. Their insight: the neuron is a threshold logic unit.
where xᵢ ∈ {0,1} are binary inputs, wᵢ ∈ {-1,+1} are fixed weights, θ is the threshold
Limitations: weights are fixed (no learning), inputs must be binary, threshold must be hand-set. Rosenblatt's Perceptron (1958) fixed the first problem by introducing a learning rule. Modern neural networks fix all three.
The brain has ~86 billion neurons, each connected to ~7,000 others — approximately 600 trillion synaptic connections. A GPT-4-scale model has ~1.8 trillion parameters. The brain still wins by several orders of magnitude in connection density, but artificial networks compensate with precise floating-point computation, gradient-based learning, and perfect memory.
The Mathematical Neuron
The modern artificial neuron generalizes the McCulloch-Pitts unit with continuous weights, a learnable bias, and a differentiable activation function. This is the atom from which all neural networks are built.
- Pre-activation \(z\) — the raw linear combination. Identical to a linear regression output. By itself, it's just linear. This is the "integrate" step.
- Bias \(b\) — a learnable offset. Allows the decision boundary to shift away from the origin. Without bias, the hyperplane \(\mathbf{w}^T\mathbf{x} = 0\) is constrained to pass through the origin.
- Activation \(\sigma(z)\) — the crucial nonlinearity. Without it, stacking multiple linear layers is still just a single linear layer (linear functions compose to linear). Activations are what allow neural networks to approximate nonlinear functions.
A neural network with no activation functions is identically a linear model, regardless of how many layers it has. If every layer is \(\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}\), then composing \(L\) layers gives \(\mathbf{y} = \mathbf{W}_L\cdots\mathbf{W}_1\mathbf{x} + \text{const} = \tilde{\mathbf{W}}\mathbf{x} + \tilde{\mathbf{b}}\) — still linear. Activation functions are not optional decorations. They are the entire reason deep networks work.
Activation Functions
Activation functions are the nonlinearities that give neural networks their expressive power. Their mathematical properties — differentiability, range, gradient behavior — directly determine training dynamics.
Range (0,1). Probabilistic output. Saturates for large |z| — gradient → 0 causes vanishing gradients. Max gradient 0.25 at z=0. Use only in output layer for binary classification.
Range (−1,1). Zero-centered — better gradient flow than sigmoid. Still saturates. Note: \(\tanh(z) = 2\sigma(2z) - 1\). Preferred over sigmoid for hidden layers in RNNs.
Range [0,∞). Not differentiable at 0 (use subgradient = 0). Sparse activation. No vanishing gradient for z > 0. Risk: "dying ReLU" — neurons stuck at 0 when gradient is always 0. Default for hidden layers in CNNs/MLPs.
Fixes dying ReLU: negative inputs have small gradient α instead of 0. PReLU learns α as a parameter. ELU uses exp for negative region. Generally preferred over ReLU in deeper networks.
Gaussian Error Linear Unit. Φ(z) = CDF of standard Gaussian. Smooth everywhere; stochastic interpretation (randomly zeroes inputs based on magnitude). Default in BERT, GPT, ViT — including your chest X-ray ViT.
Multi-class probability output. Range (0,1) per component, sums to 1. Generalizes sigmoid to K classes. Numerically stable: subtract max(z) before exponentiating. Used only in output layer for classification.
The derivative of the activation function appears directly in the backpropagation chain rule. If the derivative saturates (→ 0), gradients vanish and early layers stop learning. This is the fundamental reason ReLU variants replaced sigmoid/tanh in deep networks.
| Activation | Derivative Range | Saturates? | Sparse? | Zero-Centered? |
|---|---|---|---|---|
| Sigmoid | (0, 0.25] | Yes — strongly | No | No (outputs 0.5 bias) |
| Tanh | (0, 1] | Yes | No | Yes |
| ReLU | {0, 1} | Half-saturates (z<0) | Yes ≈50% | No |
| Leaky ReLU | {α, 1} α≈0.01 | No | Weak | No |
| GELU | ~(−0.17, 1.13) | Barely | Soft | Nearly |
| Softmax | Jacobian matrix | Somewhat | No | N/A |
Network Architecture & Layers
A neural network is a composition of layers, each containing multiple neurons. The depth (number of layers) and width (neurons per layer) are the two primary architectural choices. Notation: \(L\) total layers, layer \(\ell\) has \(n^{[\ell]}\) neurons.
a⁽⁰⁾ = x (input). Each W[ℓ] row is one neuron's weight vector. Matrix multiply broadcasts across all neurons in the layer simultaneously.
Not a computational layer — it's just the raw feature vector \(\mathbf{x} = \mathbf{a}^{[0]}\). No weights, no activation. Its size \(n^{[0]}\) equals the number of input features. For images: \(n^{[0]}\) = H×W×C pixels.
Layers 1 through \(L-1\). Where all nonlinear feature learning happens. Intermediate representations build hierarchically — early layers detect edges, later layers detect objects. Use ReLU/GELU activations.
Layer \(L\). Size matches the task: 1 neuron + sigmoid (binary), K neurons + softmax (K-class), 1 neuron + linear (regression). Activation is determined by the loss function, not architecture preference.
Total parameters = \(\sum_{\ell=1}^{L}(n^{[\ell-1]} \cdot n^{[\ell]} + n^{[\ell]})\). For a [784→256→128→10] network: (784·256+256) + (256·128+128) + (128·10+10) = 234,378 parameters.
- Width (more neurons per layer) increases the number of features learned at each level. Diminishing returns beyond a point — wider doesn't always mean better.
- Depth (more layers) enables hierarchical feature composition — each layer builds on abstractions from the previous. Exponentially more expressive than width for representing compositional functions.
- Universal approximation (§13): one hidden layer + enough neurons can approximate any continuous function. But depth achieves the same accuracy with exponentially fewer neurons.
Forward Propagation — Full Math
Forward propagation computes the network output from input to prediction, layer by layer. Every computation is a matrix multiply + bias add + elementwise activation. On a GPU, this is massively parallel.
For a mini-batch of \(m\) samples stacked as columns: \(\mathbf{X} \in \mathbb{R}^{n^{[0]} \times m}\). All operations broadcast across the batch simultaneously:
b[ℓ] is broadcast across all m columns. σ is applied elementwise. Each column of A[L] is one sample's prediction.
Network: [2 → 3 → 1], input \(\mathbf{x} = [0.5, -0.2]^T\), hidden layer uses ReLU, output uses sigmoid.
During forward propagation, cache every \(\mathbf{Z}^{[\ell]}\) and \(\mathbf{A}^{[\ell]}\) — you'll need them during backpropagation. This is memory vs compute trade-off: gradient checkpointing recomputes some activations to save memory at the cost of compute.
Loss Functions
The loss function \(\mathcal{L}(\hat{y}, y)\) measures how wrong the network's prediction is. It must be differentiable with respect to the network parameters so that backpropagation can compute gradients. The choice of loss is not arbitrary — it follows from the statistical model of the output.
Use with sigmoid output. Derived from MLE under Bernoulli assumption. Penalizes confident wrong predictions exponentially.
For one-hot labels: reduces to −log(p̂_true_class). Use with softmax output. Your chest X-ray classifier used this.
Use with linear output for regression. Equivalent to MLE under Gaussian noise. The 1/2 makes the gradient cleaner: ∂L/∂ŷ = (ŷ−y)/m.
The gradient of CCE loss w.r.t. the logits \(\mathbf{z}^{[L]}\) (before softmax) is perhaps the most elegant result in neural network training:
The softmax Jacobian and cross-entropy gradient cancel exactly — just like sigmoid + BCE. The error signal is simply the residual. In vector form: ∂L/∂z[L] = (ŷ − y)
This isn't coincidence — it's a consequence of the sigmoid/softmax being the canonical link function for Bernoulli/Categorical distributions (generalized linear models). The MLE gradient always has this residual form.
The Computational Graph
A computational graph is a Directed Acyclic Graph (DAG) where nodes are operations (or values) and edges are data dependencies. It's the data structure that makes automatic differentiation — and therefore backpropagation — mechanical and exact.
Every modern deep learning framework (PyTorch, JAX, TensorFlow) represents computation as a graph and uses automatic differentiation (autograd) to backpropagate through it. You never hand-code gradients. The rules are:
- Forward pass: Evaluate every node in topological order, caching all intermediate values.
- Backward pass: Traverse the graph in reverse topological order, multiplying local gradients using the chain rule.
- Local gradient: Each node knows its own derivative (e.g.,
+passes gradient through unchanged;×flips operands as gradients;maxroutes gradient to the winning branch).
Backpropagation — Chain Rule Derivation
Backpropagation (Rumelhart, Hinton, Williams 1986) is the algorithm for computing \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[\ell]}}\) and \(\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{[\ell]}}\) for every layer \(\ell\) simultaneously. It's pure chain rule applied to the computational graph — nothing more.
Define the error signal (delta) for layer \(\ell\):
How much the loss changes per unit change in the pre-activations of layer ℓ. This is the central quantity in backpropagation.
BP1: \(\boldsymbol{\delta}^{[L]} = \nabla_{\mathbf{a}^{[L]}}\mathcal{L} \odot \sigma'^{[L]}(\mathbf{z}^{[L]})\) — output layer error | BP2: \(\boldsymbol{\delta}^{[\ell]} = (\mathbf{W}^{[\ell+1]T}\boldsymbol{\delta}^{[\ell+1]}) \odot \sigma'^{[\ell]}(\mathbf{z}^{[\ell]})\) — error backpropagation | BP3: \(\partial\mathcal{L}/\partial\mathbf{W}^{[\ell]} = \boldsymbol{\delta}^{[\ell]}(\mathbf{a}^{[\ell-1]})^T\) | BP4: \(\partial\mathcal{L}/\partial\mathbf{b}^{[\ell]} = \boldsymbol{\delta}^{[\ell]}\).
Gradient Flow Through a Full Network
With a batch of \(m\) samples stacked as columns in \(\mathbf{A}^{[\ell]} \in \mathbb{R}^{n^{[\ell]} \times m}\):
initialize_weights(model)
for epoch in range(num_epochs):
for X_batch, y_batch in dataloader(train_data, batch_size):
# ─── FORWARD PASS ───────────────────────────────
cache = {} # store Z[ℓ], A[ℓ] for backprop
A = X_batch
for l in range(1, L+1):
Z = W[l] @ A + b[l]
A = activation[l](Z)
cache[l] = (Z, A) # ← critical!
y_hat = A
# ─── COMPUTE LOSS ───────────────────────────────
loss = cross_entropy(y_hat, y_batch)
# ─── BACKWARD PASS (BP1 → BP4) ──────────────────
Delta = y_hat − y_batch # BP1: output error
for l in range(L, 0, -1):
Z_l, A_prev = cache[l][0], cache[l-1][1]
dW[l] = (Delta @ A_prev.T) / m # BP3
db[l] = Delta.sum(axis=1) / m # BP4
if l > 1: # BP2: propagate error
Delta = (W[l].T @ Delta) * sigma_prime(Z_l)
# ─── OPTIMIZER STEP ─────────────────────────────
optimizer.step(W, b, dW, db) # Adam / SGD
- Forward pass per layer: \(O(n^{[\ell-1]} \cdot n^{[\ell]} \cdot m)\) — dominated by the matrix multiply \(\mathbf{W}^{[\ell]}\mathbf{A}^{[\ell-1]}\).
- Backward pass per layer: Also \(O(n^{[\ell-1]} \cdot n^{[\ell]} \cdot m)\) — same cost as forward. So total training is ≈2× inference cost.
- Memory: Must store all cached activations \(\mathbf{A}^{[\ell]}\) — \(O(m \cdot \sum_\ell n^{[\ell]})\). This is why batch size and model depth are the primary memory constraints.
Vanishing & Exploding Gradients
The backpropagation recurrence \(\boldsymbol{\delta}^{[\ell]} = (\mathbf{W}^{[\ell+1]T}\boldsymbol{\delta}^{[\ell+1]}) \odot \sigma'(\mathbf{z}^{[\ell]})\) multiplies together many matrices and activation derivatives. In deep networks, this product either collapses to zero or explodes to infinity.
For a simplified network with scalar weights \(w\) and sigmoid activations:
Each factor contains σ'(z[ℓ]) ≤ 0.25 for sigmoid. L=20 layers: 0.25²⁰ ≈ 10⁻¹² — effectively zero.
Cause: \(|\sigma'(z)| \ll 1\) at every layer (sigmoid: max 0.25) or weights \(|\mathbf{W}| \ll 1\).
Effect: Early layers receive near-zero gradient — they barely update. Later layers learn; earlier layers stay near initialization. The network behaves as if it has far fewer effective layers.
Symptom: Training loss stagnates after early epochs. Gradients in early layers orders of magnitude smaller than late layers.
Cause: Weights \(|\mathbf{W}| \gg 1\) compound exponentially. Particularly severe in RNNs where the same weight matrix is multiplied many times.
Effect: Gradients grow to NaN. Parameters update by enormous amounts; training diverges. Loss goes to infinity.
Symptom: NaN in loss after first few iterations. Gradient norm plot shows explosive growth.
| Problem | Remedy | Mechanism |
|---|---|---|
| Vanishing (activations) | Replace sigmoid/tanh with ReLU/GELU | ReLU has gradient 1 for z>0 — doesn't compress gradient magnitude |
| Vanishing (architecture) | Residual connections (ResNets, §15) | Skip connections provide gradient highway: ∂(x+F(x))/∂x = 1 + ∂F/∂x |
| Vanishing/Exploding | Batch Normalization | Normalizes pre-activations to unit Gaussian, stabilizing gradient magnitude |
| Exploding | Gradient clipping | Cap gradient norm: if ‖g‖ > threshold, g ← g·threshold/‖g‖ |
| Both | Careful initialization (§11) | Xavier/He keeps gradient variance ≈1 at every layer at initialization |
| Both (RNNs) | LSTM/GRU gates | Learned gates control gradient flow, preventing compounding |
Weight Initialization
Initialization determines the starting point of gradient descent and, more critically, the variance of activations and gradients at the start of training. Poor initialization → vanishing or exploding gradients from iteration one.
For a layer with \(n\) inputs and weight \(w_{ij} \sim (0, \sigma_w^2)\), input \(a_j \sim (0, \sigma_a^2)\), the output variance is:
For activations to stay unit-variance: σ_w² = 1/n. But we also need backward stability: σ_w² = 1/n_out. Xavier compromises between both.
Compromises between forward (÷n_in) and backward (÷n_out) variance stability. Optimal for tanh — its derivative ≈1 near zero satisfies the linear assumption.
Factor 2 accounts for ReLU killing half the neurons (E[max(0,z)²] = σ²/2 for Gaussian z). Scales by 2/n_in instead of 1/n_in. Default for any ReLU network.
Rule of thumb: sigmoid/tanh → Xavier (Glorot). ReLU/Leaky ReLU → He (Kaiming). GELU → He works well in practice. Never initialize all weights to zero — symmetric weights produce identical gradients and the network fails to break symmetry. Always add small random noise.
Optimization for Neural Networks
Neural network optimization is gradient descent with modifications to handle the non-convex loss landscape of deep networks. The core algorithm is SGD; Adam is the practical standard. Full mathematical treatment is in the Gradient Descent masterclass — here we focus on the NN-specific aspects.
Defaults: β₁=0.9, β₂=0.999, ε=1e-8, α=1e-3. AdamW (decoupled weight decay) is preferred for transformers.
BatchNorm normalizes the pre-activations within each mini-batch to zero mean and unit variance, then applies a learnable scale and shift:
γ and β are learnable per-feature. μ_𝓑 and σ_𝓑 computed per mini-batch during training; running stats used at inference. Allows higher LR, reduces sensitivity to initialization, provides mild regularization.
Universal Approximation Theorem
A feedforward neural network with one hidden layer of a sufficient number of neurons using a non-polynomial activation function can approximate any continuous function \(f: [0,1]^n \to \mathbb{R}\) to arbitrary precision \(\epsilon > 0\):
- It doesn't say how many neurons \(N\) are needed — it may be exponential in the input dimension. A single wide layer is theoretically sufficient but practically infeasible.
- It doesn't say gradient descent will find the parameters — the optimization landscape may prevent learning the right function even if it exists.
- It doesn't say the network will generalize — approximating training data ≠ learning the true function. A depth-1 network that memorizes is a valid approximator but useless.
Telgarsky (2016) and others showed: functions that require exponentially many neurons in a shallow network can be represented with polynomially many neurons by adding depth. Deep networks exploit hierarchical compositionality — each layer reuses computations from the previous, combinatorially building complex representations from simpler ones.
To detect a face: pixel intensities → edges → corners → eyes/nose/mouth → face. A single hidden layer must detect "face" from raw pixels directly — no intermediate concepts. A deep network builds intermediate representations that are reused across many faces. This is why depth, not just width, is what makes deep learning work.
Putting It All Together — Full Worked Pass
A complete end-to-end example: network [2 → 4 → 4 → 1] for binary classification. We trace one full iteration — forward, loss, backward, update — with full matrix notation.
Connection to Modern Architectures
Every modern architecture is an MLP with structural modifications that exploit domain knowledge to reduce parameters, improve gradient flow, and capture spatial/sequential structure.
Replace matrix multiply with convolution (★). Weights are shared spatially — a single filter applied at every location. Dramatically fewer parameters than dense. Backprop through convolution = convolution with flipped filter (cross-correlation).
Residual (skip) connection adds input directly to output. Gradient flows back through the identity path: ∂/∂a[ℓ] = 1 + ∂F/∂a[ℓ]. Eliminates vanishing gradient even for 1000+ layers. Your VGG16 transfer learning used this principle.
Self-attention replaces spatial inductive bias with learned pairwise relationships. Your ViT-B/16 for chest X-ray classification: image divided into 16×16 patches → projected to embeddings → attention layers → MLP head → softmax. All governed by BP1–BP4.
Your XAI work used GradCAM: gradients of the class score w.r.t. final conv feature maps identify discriminative regions. α_k^c = (1/Z)∑∑∂y^c/∂A^k. This is backprop stopped at the last conv layer — same BP equations, different stopping point.
My respiratory disease chest X-ray classification with custom CNN, VGG16, and ViT-B/16: all three used the same BP1–BP4 equations during training. VGG16 transfer learning froze early layers (no gradient update) and fine-tuned late layers — demonstrating that the gradient highway through residuals/convolutions can be selectively enabled. The XAI explanations were literally gradient flows visualized spatially.
The Complete Mental Model
The complete story in one coherent thread:
- Biological origin: A neuron integrates weighted inputs, adds a bias, and fires if the total crosses a threshold. The mathematical neuron generalizes this: \(z = \mathbf{w}^T\mathbf{x} + b\), \(a = \sigma(z)\), with continuous weights and a differentiable activation.
- Why nonlinearity is mandatory: Linear layers compose to a single linear map. Activation functions \(\sigma\) introduce the nonlinearity that makes deep networks exponentially more expressive than their shallow counterparts.
- Forward propagation: Layer by layer, \(\mathbf{Z}^{[\ell]} = \mathbf{W}^{[\ell]}\mathbf{A}^{[\ell-1]} + \mathbf{b}^{[\ell]}\), \(\mathbf{A}^{[\ell]} = \sigma(\mathbf{Z}^{[\ell]})\). Cache everything — you'll need it for backprop.
- Loss functions are not arbitrary: MLE under Gaussian noise → MSE. MLE under Bernoulli → BCE. MLE under Categorical → CCE. The output activation and loss are a matched pair from probability theory.
- Backpropagation is four equations (BP1–BP4): Output error \(\boldsymbol{\delta}^{[L]} = \hat{\mathbf{y}}-\mathbf{y}\), then propagate backward as \(\boldsymbol{\delta}^{[\ell]} = (\mathbf{W}^{[\ell+1]T}\boldsymbol{\delta}^{[\ell+1]}) \odot \sigma'(\mathbf{z}^{[\ell]})\), then \(d\mathbf{W}^{[\ell]} = \boldsymbol{\delta}^{[\ell]}(\mathbf{A}^{[\ell-1]})^T\) and \(d\mathbf{b}^{[\ell]} = \boldsymbol{\delta}^{[\ell]}\). That's it.
- Vanishing gradients arise because \(\sigma'(z)\) compounds across layers — use ReLU/GELU, residual connections, and He initialization to keep gradient magnitude ≈1 throughout the network.
- Initialization matters critically at the start: He initialization keeps activation variance ≈1 for ReLU; Xavier for sigmoid/tanh. Poor init → vanishing/exploding gradients before training even begins.
- All modern architectures — CNNs, Transformers, ViTs, ResNets — run the same BP1–BP4 equations. Convolution shares weights spatially. Residual connections add gradient highways. Attention learns pairwise relationships. The equations are unchanged; only the graph topology differs.