// surprise · log probability · bits vs nats · the fundamental unit
Information theory was invented by Claude Shannon in 1948 to answer a practical question: how many bits are
needed to transmit a message over a noisy channel? But the answer revealed something profound about the nature
of knowledge itself.
The Surprise Principle
Information content is measured by surprise. An event you were certain would happen carries no
information. An impossible event carrying infinite information. The function must satisfy three axioms:
Monotone decreasing in probability: More probable events carry less information. If \(p_1
> p_2\), then \(I(p_1) < I(p_2)\).
Non-negative: \(I(p) \geq 0\) for all \(p \in [0,1]\). We gain, never lose, information.
Additive for independent events: \(I(p_1 \cdot p_2) = I(p_1) + I(p_2)\). Learning two
independent facts gives the sum of their information contents.
These three axioms uniquely determine the form of the information function:
Base b=2 → units of bits. Base b=e → nats. Base b=10 → hartleys. ML uses nats (natural
log) for mathematical convenience. The three axioms above uniquely determine this logarithmic form — no
other function works.
Self-information measures the surprise of a single outcome. Entropy generalizes this to
an entire probability distribution — it is the expected self-information, or the average surprise of a
random variable.
Convention: 0 log 0 = 0 (by L'Hôpital's rule: lim_{p→0} p log p = 0). For continuous
distributions: H(X) = −∫p(x)log p(x)dx (differential entropy — can be negative).
Entropy answers: "If I sample from this distribution repeatedly, how many bits on average do I need to encode
each outcome?" It is the minimum average code length achievable (Shannon's source coding theorem).
The Binary Entropy Function
For a Bernoulli(\(p\)) variable (coin with bias \(p\)):
Binary H
\[H(p) = -p\log_2 p - (1-p)\log_2(1-p)\]
H(0) = H(1) = 0 (certain outcome, no uncertainty). H(0.5) = 1 bit (maximum — fully
uncertain). H is concave: mixing distributions increases entropy.
Fig 1. Binary entropy H(p). Maximum entropy (1 bit) at p=0.5 — maximum uncertainty. Zero
entropy at p=0 and p=1 — complete certainty.
Properties of Entropy
Non-negative: \(H(X) \geq 0\), with equality iff \(X\) is deterministic (one outcome has
probability 1).
Maximum at uniform: For \(|\mathcal{X}| = n\), \(H(X) \leq \log n\), with equality iff
\(P(x) = 1/n\) for all \(x\). The uniform distribution is maximally uncertain.
If X and Y are independent: H(X,Y) = H(X) + H(Y). If Y is a deterministic function of
X: H(X,Y) = H(X). In general: max(H(X),H(Y)) ≤ H(X,Y) ≤ H(X) + H(Y).
Conditional Entropy H(Y|X)
How much uncertainty remains about \(Y\) after observing \(X\)?
H(Y|X) ≥ 0. H(Y|X) = 0 iff Y is fully determined by X. H(Y|X) = H(Y) iff X and Y are
independent — knowing X tells you nothing about Y. Conditioning never increases entropy: H(Y|X) ≤ H(Y).
Fig 2. Information Venn diagram. H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y). The overlap I(X;Y)
is mutual information — the shared information between X and Y.
Proof: expand log P(x₁,…,xₙ) = Σ log P(xᵢ|x₁,…,xᵢ₋₁) using the chain rule of
probability. Take expectation. Each term is a conditional entropy.
Interpretation
The total uncertainty about \(n\) variables equals the uncertainty about the first,
plus the remaining uncertainty about the second given the first, plus the remaining uncertainty about the
third given the first two, and so on. Each piece of knowledge reduces entropy.
05
Section 05
Mutual Information
// shared information · independence test · feature selection · channel capacity
Mutual information \(I(X;Y)\) quantifies how much knowing \(X\) tells you about \(Y\) — and vice versa. It is
the single most important quantity for understanding statistical dependence.
The last form is the KL divergence between the joint distribution and the product of
marginals. I(X;Y) = 0 iff X and Y are independent (P(x,y) = P(x)P(y) everywhere). Always I(X;Y) ≥ 0.
Properties
Non-negative: \(I(X;Y) \geq 0\) — observing \(X\) never increases our uncertainty about
\(Y\) on average.
Symmetric: \(I(X;Y) = I(Y;X)\) — information is shared equally both ways.
Bounded: \(I(X;Y) \leq \min(H(X), H(Y))\) — can't learn more about \(X\) from \(Y\) than
\(X\) itself contains.
Zero iff independent: \(I(X;Y) = 0 \Leftrightarrow X \perp Y\) — MI is a general
dependence measure, not just linear correlation.
MI in Machine Learning — Feature Selection
Mutual information is a principled criterion for selecting features: \(I(X_j; Y)\) measures how informative
feature \(j\) is about the label \(Y\), regardless of the relationship form. Unlike correlation, it captures
nonlinear dependence.
mRMR
Minimum Redundancy Maximum Relevance (mRMR)
\[\max_{X_j}\;\underbrace{I(X_j;\, Y)}_{\text{relevance}} - \frac{1}{k}\sum_{X_i \in S}\underbrace{I(X_j;\,
X_i)}_{\text{redundancy with selected}}\]
Select features that are maximally informative about Y, while minimally redundant with
already-selected features. Pure MI filter methods just rank by I(X_j; Y).
Cross-entropy measures the cost of encoding events drawn from distribution \(P\) using a code optimized for
distribution \(Q\). When our model \(Q\) matches reality \(P\) exactly, the cost is minimized at \(H(P)\).
This is the expected log-loss when the true distribution is P but we predict Q. The
extra cost of using the wrong model is exactly KL divergence: H(P,Q) = H(P) + D_KL(P‖Q).
Decomp
Cross-Entropy = Entropy + KL Divergence
\[H(P, Q) = H(P) + D_{\text{KL}}(P \,\|\, Q)\]
Since D_KL ≥ 0: H(P,Q) ≥ H(P), with equality iff P = Q. Cross-entropy is minimized when
Q perfectly matches P. This is the fundamental insight behind all ML loss functions.
Properties
Not symmetric: \(H(P,Q) \neq H(Q,P)\) in general — encoding \(P\)-drawn events with a
\(Q\)-code is different from encoding \(Q\)-events with a \(P\)-code.
Not a distance: But it defines a loss — minimizing \(H(P,Q)\) over \(Q\) is equivalent to
minimizing \(D_{\text{KL}}(P\|Q)\) (since \(H(P)\) is constant w.r.t. \(Q\)).
Connection to MLE: Minimizing cross-entropy \(H(\hat{P}_{\text{data}}, Q_\theta)\) over
model parameters \(\theta\) is exactly maximum likelihood estimation (§12).
07
Section 07
KL Divergence — Relative Entropy
// Kullback-Leibler · non-symmetry · information gain · forward vs reverse
KL divergence is the extra information cost of using an approximate model \(Q\) instead of the true
distribution \(P\). It is the single most important quantity in Bayesian statistics, variational inference, and
information theory.
When Q(x) = 0 but P(x) > 0: D_KL = +∞ (our model assigns zero probability to
something that can happen — catastrophic). When P(x) = 0: the term contributes 0 (by convention). This
asymmetry is crucial.
Forward vs Reverse KL — The Mode-Seeking vs Mean-Seeking Distinction
Forward KL: D_KL(P‖Q)
Used in: Maximum Likelihood, cross-entropy training.
Behavior: Where \(P > 0\), forces \(Q > 0\). "Zero-avoiding" — penalizes \(Q\) for
not covering all modes of \(P\). If \(P\) is bimodal, \(Q\) spreads to cover both modes.
"Mean-seeking": \(Q\) tends to cover the average of \(P\).
Reverse KL: D_KL(Q‖P)
Used in: Variational inference (ELBO), VAEs.
Behavior: Where \(Q > 0\), forces \(P > 0\). "Zero-forcing" — if \(P\) is bimodal,
\(Q\) concentrates on one mode and ignores the other.
"Mode-seeking": \(Q\) commits to one mode of \(P\).
Why This Matters for VAEs
In the VAE ELBO, the KL term is \(D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\,
p(\mathbf{z}))\) — reverse KL. This makes the approximate posterior \(q\) mode-seeking: it concentrates on
the regions of the prior \(p\) that best explain the data, rather than covering all of it. This is
why VAE posteriors are well-behaved Gaussians.
Properties of KL Divergence
Non-negative: \(D_{\text{KL}}(P\|Q) \geq 0\), with equality iff \(P = Q\) almost
everywhere.
Asymmetric: \(D_{\text{KL}}(P\|Q) \neq D_{\text{KL}}(Q\|P)\) in general.
Not a metric: No triangle inequality, not symmetric. The Jensen-Shannon divergence \(JSD
= \frac{1}{2}[D_{\text{KL}}(P\|M) + D_{\text{KL}}(Q\|M)]\) where \(M = \frac{P+Q}{2}\) is symmetric, bounded
in [0,1], and its square root is a metric.
Invariant to bijective reparametrization: Unlike variance, KL divergence is preserved
under invertible transformations.
\[f\!\left(\mathbb{E}[X]\right) \leq \mathbb{E}[f(X)] \quad \text{if } f \text{ is convex}\]
\[f\!\left(\mathbb{E}[X]\right) \geq \mathbb{E}[f(X)] \quad \text{if } f \text{ is concave}\]
Proof sketch: a convex function lies above its tangent at any point. Since the tangent
at E[X] satisfies f(x) ≥ f(E[X]) + f'(E[X])(x − E[X]), take expectations. The entropy function is concave
(−p log p is concave), which is why H achieves its maximum at the uniform distribution.
A generalization of Jensen's. Used to prove subadditivity of entropy: H(X,Y) ≤ H(X) +
H(Y), and the data processing inequality.
09
Section 09
The Maximum Entropy Principle
// maximum entropy distribution · Lagrangian derivation · exponential family · softmax
Given only partial knowledge about a distribution (e.g., its mean), which distribution should we choose? The
answer: the one with maximum entropy consistent with the known constraints. We should be maximally
uncertain about what we don't know.
where Z = exp(1+λ₀) is the partition function, determined by the normalization
constraint.
4
For classification with feature \(f_k(x) = \mathbf{1}[x=k]\) (indicator function per class):
\[P(y=k) = \frac{e^{\lambda_k}}{\sum_{j=1}^K e^{\lambda_j}} \quad\leftarrow\quad \text{the Softmax!}\]
Softmax is the maximum entropy distribution subject to constraints on expected class
membership — not an arbitrary choice, but a principled probabilistic inference.
★ Theorem — MaxEnt = Exponential Family
The maximum entropy distribution subject to \(n\) moment constraints \(\mathbb{E}[f_j(X)] = \mu_j\) is an
exponential family distribution: \(P_{\boldsymbol{\lambda}}(x) =
\frac{1}{Z(\boldsymbol{\lambda})}\exp(\boldsymbol{\lambda}^T\mathbf{f}(x))\). Every common distribution —
Gaussian, Bernoulli, Poisson, Categorical — is a MaxEnt distribution under the appropriate constraints. This
is why they appear everywhere.
MaxEnt Implies Common Distributions
Constraints
Support
MaxEnt Distribution
Normalization only
Finite set \(\{1,\ldots,K\}\)
Uniform: \(P(k)=1/K\)
Known mean \(\mu\)
\([0,\infty)\)
Exponential: \(\text{Exp}(1/\mu)\)
Known mean and variance
\((-\infty,\infty)\)
Gaussian: \(\mathcal{N}(\mu, \sigma^2)\)
Known mean, binary outcomes
\(\{0,1\}\)
Bernoulli: \(\text{Bern}(p)\)
Feature expectations per class
\(\{1,\ldots,K\}\)
Softmax / Categorical
10
Section 10
Fisher Information
// curvature of log-likelihood · Cramér-Rao bound · natural gradient
Fisher information measures how much information data carries about a model parameter \(\theta\). It is the
variance of the score function — the gradient of the log-likelihood.
Geometrically: Fisher information is the curvature of the log-likelihood. High Fisher
information → sharply peaked likelihood → parameter is precisely identifiable from data. Low Fisher
information → flat likelihood → data is uninformative about θ.
No unbiased estimator can have variance smaller than the inverse Fisher information.
Fisher information is the fundamental limit on statistical estimation precision. The MLE achieves this
bound asymptotically (it is asymptotically efficient).
Natural Gradient Descent
Standard gradient descent uses the Euclidean metric in parameter space. The natural
gradient uses the Fisher information metric — the geometry of the distribution space:
The natural gradient is invariant to reparametrization. It moves in the direction of
steepest ascent in distribution space (KL geometry) rather than parameter space. Used in TRPO, PPO, and
K-FAC for neural networks. Avoids the instability of parameter-space gradient descent when parameters are
ill-conditioned.
11
Section 11
Data Processing Inequality
// information can only be lost · Markov chains · no free lunch in processing
One of the most important theorems in information theory: processing data cannot create new information. It can
only preserve or destroy the information already present.
★ Data Processing Inequality (DPI)
If \(X \to Y \to Z\) form a Markov chain (Z depends on X only through Y), then: \[I(X; Z) \leq I(X; Y)\] Any
processing of Y to produce Z cannot increase the mutual information with X. Equivalently: \(I(X;Z) \leq I(X;Y)
\leq H(X)\).
Implications for Machine Learning
Feature extraction: Any feature map \(\phi: \mathcal{X} \to \mathcal{Z}\) satisfies
\(I(\phi(\mathbf{x}); Y) \leq I(\mathbf{x}; Y)\). A good feature extractor preserves label-relevant
information; a bad one discards it. The DPI bounds how good any feature can be.
Information bottleneck: Neural networks can be viewed as successive Markov chains
\(\mathbf{x} \to \mathbf{h}^{[1]} \to \cdots \to \mathbf{h}^{[L]} \to \hat{y}\). DPI implies
\(I(\mathbf{h}^{[L]}; y) \leq \cdots \leq I(\mathbf{h}^{[1]}; y) \leq I(\mathbf{x}; y)\).
Compression: You can't recover information lost during compression. If an encoder throws
away label-relevant bits, the decoder cannot recover them — no matter how powerful.
Sufficient statistics: A statistic \(T(\mathbf{x})\) is sufficient iff
\(I(T(\mathbf{x}); \theta) = I(\mathbf{x}; \theta)\) — it preserves all information about the parameter.
12
Section 12
Why Neural Networks Use Cross-Entropy
// MLE = minimizing cross-entropy · empirical distribution · the direct proof
This is not a design choice — it is a theorem. Cross-entropy loss is the unique correct loss for probabilistic
classification under the maximum likelihood principle.
The MLE Connection — Full Derivation
Given training data \(\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^m\), define the empirical data
distribution:
The log-likelihood of the data under model \(p_\theta(y|\mathbf{x})\):
\[\ell(\theta) = \sum_{i=1}^m\log p_\theta(y^{(i)} \mid \mathbf{x}^{(i)})\]
2
Rewrite as an expectation over the empirical distribution:
\[\ell(\theta) = m\,\mathbb{E}_{(\mathbf{x},y)\sim\hat{P}}\!\left[\log p_\theta(y \mid
\mathbf{x})\right]\]
3
Maximizing \(\ell(\theta)\) is equivalent to minimizing \(-\mathbb{E}_{\hat{P}}\!\left[\log
p_\theta(y|\mathbf{x})\right]\) = cross-entropy \(H(\hat{P},p_\theta)\). Since \(H(\hat{P})\) is fixed:
\[\arg\max_\theta\ell(\theta) = \arg\min_\theta H(\hat{P},p_\theta) = \arg\min_\theta
D_{\text{KL}}(\hat{P}\,\|\,p_\theta)\]
4
In practice, with softmax output: \(p_\theta(y=k|\mathbf{x}) = \text{softmax}(\mathbf{z})_k\), so:
\[-\mathbb{E}_{\hat{P}}[\log p_\theta] = -\frac{1}{m}\sum_{i=1}^m\sum_{k=1}^K y_k^{(i)}\log\hat{y}_k^{(i)}
= \text{Categorical Cross-Entropy} \quad\blacksquare\]
The Core Insight
Training a neural network with cross-entropy loss is exactly equivalent to
fitting the model distribution to the empirical data distribution by minimizing KL divergence.
There is no heuristic here — it is the unique optimal loss function for probabilistic models under maximum
likelihood. Any other loss (MSE, MAE) implicitly assumes a different noise model and is suboptimal for
classification.
Why Not MSE for Classification?
MSE assumes the residuals are Gaussian-distributed (MLE under Gaussian noise). For classification outputs in
\([0,1]\), the noise is Bernoulli. Using the wrong noise model gives the wrong gradient signal — particularly
near the decision boundary where sigmoid saturates, giving vanishingly small gradients with MSE but
informative gradients with cross-entropy.
MDL principle: the best model is the one that provides the shortest description of the data. A model that
fits the data in \(L_{\text{data}}\) bits but requires \(L_{\text{model}}\) bits to describe, has total cost
\(L = L_{\text{model}} + L_{\text{data}|\text{model}}\).
MDL
\[L_{\text{data}|\text{model}} = -\log p_\theta(\mathbf{x}) = H(\hat{P}, p_\theta) \quad\text{(coding cost
of data given model)}\]
\[\text{MDL selects: }\hat{\theta} = \arg\min_\theta\; [-\log p_\theta(\text{data}) + \text{description
length of }\theta]\]
MDL = MAP estimation in Bayesian terms: log P(data|θ) + log P(θ). The prior P(θ)
encodes the model description length. More complex models have shorter data description but longer model
description — MDL finds the optimal tradeoff.
The Information Bottleneck
Tishby et al. (2000) proposed viewing deep learning as finding a compressed representation \(T(\mathbf{x})\)
that maximally preserves information about labels \(Y\) while minimally retaining information about inputs
\(\mathbf{x}\):
IB
Information Bottleneck Objective
\[\min_{p(t|x)}\; I(X; T) - \beta\, I(T; Y)\]
β controls the tradeoff between compression (minimize I(X;T)) and predictive power
(maximize I(T;Y)). For β → ∞: T retains all label information. For β → 0: T is maximally compressed. By
DPI: I(T;Y) ≤ I(X;Y) — no representation can do better than the original data.
The ELBO — Variational Inference
ELBO
Evidence Lower Bound (VAE)
\[\log p(\mathbf{x}) \geq \underbrace{\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log
p_\theta(\mathbf{x}|\mathbf{z})]}_{\text{reconstruction: } -H(P_{\text{data}}, p_\theta)} -
\underbrace{D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))}_{\text{regularization: keeps
posterior close to prior}}\]
The ELBO is maximized when the KL term is zero (posterior = prior) and reconstruction
is perfect. The tension between these two is the fundamental tradeoff in VAEs — and it is entirely an
information-theoretic statement about the encoder-decoder channel.
With probability ≥ 1−δ over the training set. Q is the posterior over hypotheses, P is
the prior. The generalization gap is bounded by the KL divergence between what we learned (Q) and what we
assumed beforehand (P) — divided by sqrt(m). More data → tighter bound. KL(Q‖P) is a complexity measure:
the more the data "surprises" our prior, the less we generalize.
f(x, x⁺) is a similarity score between anchor and positive pair. Minimizing InfoNCE
maximizes a lower bound on I(X; X⁺) — the mutual information between two views of the same data. SimCLR,
MoCo, CLIP all use this principle: learn representations that maximize MI between augmented views while
contrasting against negatives.
GANs — JS Divergence
The original GAN (Goodfellow 2014) minimizes the Jensen-Shannon divergence between the real data distribution
\(p_{\text{data}}\) and the generator distribution \(p_g\):
Wasserstein GAN replaces JSD with the Wasserstein distance (Earth Mover's distance),
which provides meaningful gradients even when distributions have disjoint support — fixing the mode
collapse problem in original GANs.
Total correlation measures the total dependence between all latent dimensions. β-VAE
(Higgins 2017) adds β>1 weight to the KL term, forcing higher compression and lower TC, encouraging
disentangled representations where each latent captures an independent factor of variation.
Label Smoothing — Entropy Regularization
Label smoothing replaces one-hot targets \(y_k \in \{0,1\}\) with soft targets \(y_k' = (1-\epsilon)y_k +
\epsilon/K\). This is equivalent to minimizing cross-entropy with a target that has higher entropy, preventing
the model from becoming overconfident. The network cannot drive its predictions to 0 or 1 without large
negative cross-entropy from the smoothed labels.
Where u is the uniform distribution (1/K per class). Label smoothing improves
calibration, reduces overfitting, and has an information-theoretic interpretation as penalizing
low-entropy (overconfident) predictions.
15
Section 15
The Complete Mental Model
// every quantity connected · one unified picture
Fig 3. Complete Information Theory mental map for ML — from self-information through
entropy, MI, KL divergence, and their ML applications.
The unified story, in one coherent thread:
H
Entropy
The expected surprise. How many bits needed to encode outcomes. Maximum at uniform distribution. Zero when
deterministic. The foundation of everything.
I(X;Y)
Mutual Information
Shared information between variables. Zero iff independent. The objective in contrastive learning, feature
selection, and the information bottleneck.
H(P,Q)
Cross-Entropy
Cost of encoding P-events with Q-code. H(P,Q) = H(P) + KL(P‖Q). Minimizing over Q equals MLE. This IS the
neural network loss function.
KL
KL Divergence
Extra bits from using Q instead of P. Always ≥ 0 (Jensen's). Not symmetric — forward and reverse KL have
different mode-seeking behaviors.
MaxEnt
Maximum Entropy
Given constraints, be maximally uncertain. Implies exponential family distributions. Softmax IS the MaxEnt
categorical distribution.
DPI
Data Processing
Processing cannot create information. I(X;Z) ≤ I(X;Y) for X→Y→Z. Every layer of a neural network can only
preserve or lose information about the labels.
★ The Unifying Principle
Every major concept in machine learning can be read as an information-theoretic statement. Training a neural
network minimizes \(D_{\text{KL}}(\hat{P}_{\text{data}} \,\|\, p_\theta)\) — the divergence between empirical
reality and our model. Regularization minimizes \(D_{\text{KL}}(q_\theta \,\|\, p_{\text{prior}})\) — how far
we've moved from our initial beliefs. Generalization is bounded by how much the data surprised our prior.
Feature learning maximizes \(I(\text{representation}; \text{labels})\). Compression minimizes
\(I(\text{representation}; \text{inputs})\). Every objective, every bound, every architectural choice traces
back to a single question Shannon asked in 1948: how much do we not know?