Information Theory Machine Learning Shannon · Kullback · Leibler
Information the mathematics of uncertainty & learning

Every loss function, every neural network, every
generalization bound — all grounded in a single question:
how do we measure what we don't know?

H(X)
01
Section 01

What Is Information? — Self-Information

// surprise · log probability · bits vs nats · the fundamental unit

Information theory was invented by Claude Shannon in 1948 to answer a practical question: how many bits are needed to transmit a message over a noisy channel? But the answer revealed something profound about the nature of knowledge itself.

The Surprise Principle

Information content is measured by surprise. An event you were certain would happen carries no information. An impossible event carrying infinite information. The function must satisfy three axioms:

  • Monotone decreasing in probability: More probable events carry less information. If \(p_1 > p_2\), then \(I(p_1) < I(p_2)\).
  • Non-negative: \(I(p) \geq 0\) for all \(p \in [0,1]\). We gain, never lose, information.
  • Additive for independent events: \(I(p_1 \cdot p_2) = I(p_1) + I(p_2)\). Learning two independent facts gives the sum of their information contents.

These three axioms uniquely determine the form of the information function:

Self-Info
Self-Information (Surprisal)
\[I(x) = -\log_b P(X=x) = \log_b \frac{1}{P(X=x)}\]
Base b=2 → units of bits. Base b=e → nats. Base b=10 → hartleys. ML uses nats (natural log) for mathematical convenience. The three axioms above uniquely determine this logarithmic form — no other function works.
P = 0
Impossible event. Infinite surprise if it occurs.
1 bit
P = ½
Fair coin flip. One binary question answered.
3.32 bits
P = 1/10
One of ten equally likely outcomes.
0 bits
P = 1
Certain event. No surprise, no information.
02
Section 02

Entropy — Measuring Average Uncertainty

// Shannon entropy · binary entropy · axiomatic derivation · properties

Self-information measures the surprise of a single outcome. Entropy generalizes this to an entire probability distribution — it is the expected self-information, or the average surprise of a random variable.

Shannon Entropy — Definition
Entropy
Shannon Entropy H(X)
\[H(X) = \mathbb{E}_{x \sim P}\left[I(x)\right] = -\sum_{x \in \mathcal{X}} P(x)\log P(x) = \mathbb{E}\left[-\log P(X)\right]\]
Convention: 0 log 0 = 0 (by L'Hôpital's rule: lim_{p→0} p log p = 0). For continuous distributions: H(X) = −∫p(x)log p(x)dx (differential entropy — can be negative).

Entropy answers: "If I sample from this distribution repeatedly, how many bits on average do I need to encode each outcome?" It is the minimum average code length achievable (Shannon's source coding theorem).

The Binary Entropy Function

For a Bernoulli(\(p\)) variable (coin with bias \(p\)):

Binary H
\[H(p) = -p\log_2 p - (1-p)\log_2(1-p)\]
H(0) = H(1) = 0 (certain outcome, no uncertainty). H(0.5) = 1 bit (maximum — fully uncertain). H is concave: mixing distributions increases entropy.
1 .5 p H(p) 0 0.5 — max 1 H=1 bit
Fig 1. Binary entropy H(p). Maximum entropy (1 bit) at p=0.5 — maximum uncertainty. Zero entropy at p=0 and p=1 — complete certainty.
Properties of Entropy
  • Non-negative: \(H(X) \geq 0\), with equality iff \(X\) is deterministic (one outcome has probability 1).
  • Maximum at uniform: For \(|\mathcal{X}| = n\), \(H(X) \leq \log n\), with equality iff \(P(x) = 1/n\) for all \(x\). The uniform distribution is maximally uncertain.
  • Concavity: Mixing distributions always increases entropy.
    \[H(\lambda P + (1-\lambda)Q) \geq \lambda H(P) + (1-\lambda)H(Q)\]
  • Invariant to relabeling: Renaming outcomes doesn't change entropy — it depends only on probabilities, not their identities.
  • Continuous and symmetric in the probabilities.
03
Section 03

Joint & Conditional Entropy

// two variables · what remains unknown · the information Venn diagram
Joint Entropy H(X, Y)

The entropy of two variables together — how many bits to describe both simultaneously:

Joint
Joint Entropy
\[H(X, Y) = -\sum_{x,y} P(x,y)\log P(x,y) = \mathbb{E}_{(x,y)\sim P}[-\log P(X,Y)]\]
If X and Y are independent: H(X,Y) = H(X) + H(Y). If Y is a deterministic function of X: H(X,Y) = H(X). In general: max(H(X),H(Y)) ≤ H(X,Y) ≤ H(X) + H(Y).
Conditional Entropy H(Y|X)

How much uncertainty remains about \(Y\) after observing \(X\)?

Conditional
Conditional Entropy
\[H(Y \mid X) = \sum_{x} P(x)\,H(Y \mid X=x) = -\sum_{x,y}P(x,y)\log P(y \mid x)\] \[= \mathbb{E}_{x\sim P_X}\left[H(Y \mid X=x)\right] = H(X,Y) - H(X)\]
H(Y|X) ≥ 0. H(Y|X) = 0 iff Y is fully determined by X. H(Y|X) = H(Y) iff X and Y are independent — knowing X tells you nothing about Y. Conditioning never increases entropy: H(Y|X) ≤ H(Y).
H(X|Y) I(X;Y) mutual information H(Y|X) ◄──────── H(X,Y) ────────► ◄── H(X) ──► ◄── H(Y) ──►
Fig 2. Information Venn diagram. H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y). The overlap I(X;Y) is mutual information — the shared information between X and Y.
04
Section 04

The Chain Rules of Information Theory

// entropy chain rule · mutual information chain rule · modularity
Chain Rule for Entropy
Chain Rule
Chain Rule — Entropy
\[H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n H(X_i \mid X_1, \ldots, X_{i-1})\] \[\text{Special case: } H(X,Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)\]
Proof: expand log P(x₁,…,xₙ) = Σ log P(xᵢ|x₁,…,xᵢ₋₁) using the chain rule of probability. Take expectation. Each term is a conditional entropy.
Interpretation
The total uncertainty about \(n\) variables equals the uncertainty about the first, plus the remaining uncertainty about the second given the first, plus the remaining uncertainty about the third given the first two, and so on. Each piece of knowledge reduces entropy.
05
Section 05

Mutual Information

// shared information · independence test · feature selection · channel capacity

Mutual information \(I(X;Y)\) quantifies how much knowing \(X\) tells you about \(Y\) — and vice versa. It is the single most important quantity for understanding statistical dependence.

Definition and Equivalent Forms
MI
Mutual Information — Four Equivalent Definitions
\[I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)\] \[\phantom{I(X;Y)} = H(X) + H(Y) - H(X,Y)\] \[\phantom{I(X;Y)} = \sum_{x,y}P(x,y)\log\frac{P(x,y)}{P(x)P(y)} = D_{\text{KL}}\!\left(P_{XY} \,\|\, P_X \otimes P_Y\right)\]
The last form is the KL divergence between the joint distribution and the product of marginals. I(X;Y) = 0 iff X and Y are independent (P(x,y) = P(x)P(y) everywhere). Always I(X;Y) ≥ 0.
Properties
  • Non-negative: \(I(X;Y) \geq 0\) — observing \(X\) never increases our uncertainty about \(Y\) on average.
  • Symmetric: \(I(X;Y) = I(Y;X)\) — information is shared equally both ways.
  • Bounded: \(I(X;Y) \leq \min(H(X), H(Y))\) — can't learn more about \(X\) from \(Y\) than \(X\) itself contains.
  • Zero iff independent: \(I(X;Y) = 0 \Leftrightarrow X \perp Y\) — MI is a general dependence measure, not just linear correlation.
MI in Machine Learning — Feature Selection

Mutual information is a principled criterion for selecting features: \(I(X_j; Y)\) measures how informative feature \(j\) is about the label \(Y\), regardless of the relationship form. Unlike correlation, it captures nonlinear dependence.

mRMR
Minimum Redundancy Maximum Relevance (mRMR)
\[\max_{X_j}\;\underbrace{I(X_j;\, Y)}_{\text{relevance}} - \frac{1}{k}\sum_{X_i \in S}\underbrace{I(X_j;\, X_i)}_{\text{redundancy with selected}}\]
Select features that are maximally informative about Y, while minimally redundant with already-selected features. Pure MI filter methods just rank by I(X_j; Y).
06
Section 06

Cross-Entropy

// comparing distributions · encoding cost · H(P,Q) ≥ H(P)

Cross-entropy measures the cost of encoding events drawn from distribution \(P\) using a code optimized for distribution \(Q\). When our model \(Q\) matches reality \(P\) exactly, the cost is minimized at \(H(P)\).

Definition
Cross-H
Cross-Entropy H(P, Q)
\[H(P, Q) = -\sum_x P(x)\log Q(x) = \mathbb{E}_{x\sim P}[-\log Q(x)]\]
This is the expected log-loss when the true distribution is P but we predict Q. The extra cost of using the wrong model is exactly KL divergence: H(P,Q) = H(P) + D_KL(P‖Q).
Decomp
Cross-Entropy = Entropy + KL Divergence
\[H(P, Q) = H(P) + D_{\text{KL}}(P \,\|\, Q)\]
Since D_KL ≥ 0: H(P,Q) ≥ H(P), with equality iff P = Q. Cross-entropy is minimized when Q perfectly matches P. This is the fundamental insight behind all ML loss functions.
Properties
  • Not symmetric: \(H(P,Q) \neq H(Q,P)\) in general — encoding \(P\)-drawn events with a \(Q\)-code is different from encoding \(Q\)-events with a \(P\)-code.
  • Not a distance: But it defines a loss — minimizing \(H(P,Q)\) over \(Q\) is equivalent to minimizing \(D_{\text{KL}}(P\|Q)\) (since \(H(P)\) is constant w.r.t. \(Q\)).
  • Connection to MLE: Minimizing cross-entropy \(H(\hat{P}_{\text{data}}, Q_\theta)\) over model parameters \(\theta\) is exactly maximum likelihood estimation (§12).
07
Section 07

KL Divergence — Relative Entropy

// Kullback-Leibler · non-symmetry · information gain · forward vs reverse

KL divergence is the extra information cost of using an approximate model \(Q\) instead of the true distribution \(P\). It is the single most important quantity in Bayesian statistics, variational inference, and information theory.

Definition
KL Div
KL Divergence (Relative Entropy)
\[D_{\text{KL}}(P \,\|\, Q) = \sum_x P(x)\log\frac{P(x)}{Q(x)} = \mathbb{E}_{x\sim P}\left[\log\frac{P(x)}{Q(x)}\right]\] \[= H(P,Q) - H(P) \geq 0\]
When Q(x) = 0 but P(x) > 0: D_KL = +∞ (our model assigns zero probability to something that can happen — catastrophic). When P(x) = 0: the term contributes 0 (by convention). This asymmetry is crucial.
Forward vs Reverse KL — The Mode-Seeking vs Mean-Seeking Distinction

Forward KL: D_KL(P‖Q)

Used in: Maximum Likelihood, cross-entropy training.

Behavior: Where \(P > 0\), forces \(Q > 0\). "Zero-avoiding" — penalizes \(Q\) for not covering all modes of \(P\). If \(P\) is bimodal, \(Q\) spreads to cover both modes.

"Mean-seeking": \(Q\) tends to cover the average of \(P\).

Reverse KL: D_KL(Q‖P)

Used in: Variational inference (ELBO), VAEs.

Behavior: Where \(Q > 0\), forces \(P > 0\). "Zero-forcing" — if \(P\) is bimodal, \(Q\) concentrates on one mode and ignores the other.

"Mode-seeking": \(Q\) commits to one mode of \(P\).

Why This Matters for VAEs
In the VAE ELBO, the KL term is \(D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))\) — reverse KL. This makes the approximate posterior \(q\) mode-seeking: it concentrates on the regions of the prior \(p\) that best explain the data, rather than covering all of it. This is why VAE posteriors are well-behaved Gaussians.
Properties of KL Divergence
  • Non-negative: \(D_{\text{KL}}(P\|Q) \geq 0\), with equality iff \(P = Q\) almost everywhere.
  • Asymmetric: \(D_{\text{KL}}(P\|Q) \neq D_{\text{KL}}(Q\|P)\) in general.
  • Not a metric: No triangle inequality, not symmetric. The Jensen-Shannon divergence \(JSD = \frac{1}{2}[D_{\text{KL}}(P\|M) + D_{\text{KL}}(Q\|M)]\) where \(M = \frac{P+Q}{2}\) is symmetric, bounded in [0,1], and its square root is a metric.
  • Invariant to bijective reparametrization: Unlike variance, KL divergence is preserved under invertible transformations.
08
Section 08

Jensen's Inequality & Why KL ≥ 0

// convex functions · Jensen · Gibbs inequality · information inequality
Jensen's Inequality
Jensen
Jensen's Inequality (for convex f)
\[f\!\left(\mathbb{E}[X]\right) \leq \mathbb{E}[f(X)] \quad \text{if } f \text{ is convex}\] \[f\!\left(\mathbb{E}[X]\right) \geq \mathbb{E}[f(X)] \quad \text{if } f \text{ is concave}\]
Proof sketch: a convex function lies above its tangent at any point. Since the tangent at E[X] satisfies f(x) ≥ f(E[X]) + f'(E[X])(x − E[X]), take expectations. The entropy function is concave (−p log p is concave), which is why H achieves its maximum at the uniform distribution.
Proof that KL Divergence ≥ 0 (Gibbs Inequality)
// Proof: D_KL(P‖Q) ≥ 0 using Jensen's inequality
1
\[D_{\text{KL}}(P\|Q) = \mathbb{E}_{x\sim P}\left[\log\frac{P(x)}{Q(x)}\right] = -\mathbb{E}_{x\sim P}\left[\log\frac{Q(x)}{P(x)}\right]\]
2
Apply Jensen's inequality with \(f(t) = -\log t\) (convex since \(f''(t) = 1/t^2 > 0\)): \[\mathbb{E}\left[-\log\frac{Q(x)}{P(x)}\right] \geq -\log\mathbb{E}\left[\frac{Q(x)}{P(x)}\right]\]
3
Evaluate the expectation: \[\mathbb{E}_{x\sim P}\left[\frac{Q(x)}{P(x)}\right] = \sum_x P(x)\frac{Q(x)}{P(x)} = \sum_x Q(x) = 1\]
4
Therefore: \(D_{\text{KL}}(P\|Q) \geq -\log(1) = 0\). Equality holds iff \(Q(x)/P(x)\) is constant — i.e., \(P = Q\). \(\blacksquare\)
The Log-Sum Inequality
Log-Sum
Log-Sum Inequality
\[\sum_{i=1}^n a_i \log\frac{a_i}{b_i} \geq \left(\sum_i a_i\right)\log\frac{\sum_i a_i}{\sum_i b_i}\]
A generalization of Jensen's. Used to prove subadditivity of entropy: H(X,Y) ≤ H(X) + H(Y), and the data processing inequality.
09
Section 09

The Maximum Entropy Principle

// maximum entropy distribution · Lagrangian derivation · exponential family · softmax

Given only partial knowledge about a distribution (e.g., its mean), which distribution should we choose? The answer: the one with maximum entropy consistent with the known constraints. We should be maximally uncertain about what we don't know.

Derivation via Lagrange Multipliers

Maximize \(H(P) = -\sum_x P(x)\log P(x)\) subject to \(\sum_x P(x) = 1\) (normalization) and \(\sum_x P(x)f_j(x) = \mu_j\) (known expected features).

// MaxEnt derivation — connects to exponential family and softmax
1
Lagrangian: \(\mathcal{L} = -\sum_x P(x)\log P(x) - \lambda_0\!\left(\sum_x P(x)-1\right) - \sum_j\lambda_j\!\left(\sum_x P(x)f_j(x)-\mu_j\right)\)
2
Stationarity w.r.t. \(P(x)\): \(\frac{\partial\mathcal{L}}{\partial P(x)} = -\log P(x) - 1 - \lambda_0 - \sum_j\lambda_j f_j(x) = 0\)
3
Solve: \(P(x) = \exp\!\left(-1 - \lambda_0 - \sum_j\lambda_j f_j(x)\right) = \frac{1}{Z}\exp\!\left(-\sum_j\lambda_j f_j(x)\right)\)
where Z = exp(1+λ₀) is the partition function, determined by the normalization constraint.
4
For classification with feature \(f_k(x) = \mathbf{1}[x=k]\) (indicator function per class): \[P(y=k) = \frac{e^{\lambda_k}}{\sum_{j=1}^K e^{\lambda_j}} \quad\leftarrow\quad \text{the Softmax!}\]
Softmax is the maximum entropy distribution subject to constraints on expected class membership — not an arbitrary choice, but a principled probabilistic inference.
★ Theorem — MaxEnt = Exponential Family

The maximum entropy distribution subject to \(n\) moment constraints \(\mathbb{E}[f_j(X)] = \mu_j\) is an exponential family distribution: \(P_{\boldsymbol{\lambda}}(x) = \frac{1}{Z(\boldsymbol{\lambda})}\exp(\boldsymbol{\lambda}^T\mathbf{f}(x))\). Every common distribution — Gaussian, Bernoulli, Poisson, Categorical — is a MaxEnt distribution under the appropriate constraints. This is why they appear everywhere.

MaxEnt Implies Common Distributions
Constraints Support MaxEnt Distribution
Normalization only Finite set \(\{1,\ldots,K\}\) Uniform: \(P(k)=1/K\)
Known mean \(\mu\) \([0,\infty)\) Exponential: \(\text{Exp}(1/\mu)\)
Known mean and variance \((-\infty,\infty)\) Gaussian: \(\mathcal{N}(\mu, \sigma^2)\)
Known mean, binary outcomes \(\{0,1\}\) Bernoulli: \(\text{Bern}(p)\)
Feature expectations per class \(\{1,\ldots,K\}\) Softmax / Categorical
10
Section 10

Fisher Information

// curvature of log-likelihood · Cramér-Rao bound · natural gradient

Fisher information measures how much information data carries about a model parameter \(\theta\). It is the variance of the score function — the gradient of the log-likelihood.

Definition and Geometric Meaning
Fisher
Fisher Information
\[\mathcal{I}(\theta) = \mathbb{E}_{x\sim p_\theta}\!\left[\left(\frac{\partial\log p(x;\theta)}{\partial\theta}\right)^2\right] = -\mathbb{E}_{x\sim p_\theta}\!\left[\frac{\partial^2\log p(x;\theta)}{\partial\theta^2}\right]\] \[\text{For vector } \boldsymbol{\theta}: \quad \mathcal{I}(\boldsymbol{\theta})_{jk} = \mathbb{E}\!\left[\frac{\partial\log p}{\partial\theta_j}\frac{\partial\log p}{\partial\theta_k}\right]\]
Geometrically: Fisher information is the curvature of the log-likelihood. High Fisher information → sharply peaked likelihood → parameter is precisely identifiable from data. Low Fisher information → flat likelihood → data is uninformative about θ.
Cramér-Rao Lower Bound
Cramér-Rao
Cramér-Rao Lower Bound (CRLB)
\[\text{Var}(\hat{\theta}) \geq \frac{1}{\mathcal{I}(\theta)}\]
No unbiased estimator can have variance smaller than the inverse Fisher information. Fisher information is the fundamental limit on statistical estimation precision. The MLE achieves this bound asymptotically (it is asymptotically efficient).
Natural Gradient Descent

Standard gradient descent uses the Euclidean metric in parameter space. The natural gradient uses the Fisher information metric — the geometry of the distribution space:

Nat Grad
Natural Gradient
\[\tilde{\nabla}_\theta \mathcal{L} = \mathcal{I}(\theta)^{-1}\nabla_\theta \mathcal{L}\]
The natural gradient is invariant to reparametrization. It moves in the direction of steepest ascent in distribution space (KL geometry) rather than parameter space. Used in TRPO, PPO, and K-FAC for neural networks. Avoids the instability of parameter-space gradient descent when parameters are ill-conditioned.
11
Section 11

Data Processing Inequality

// information can only be lost · Markov chains · no free lunch in processing

One of the most important theorems in information theory: processing data cannot create new information. It can only preserve or destroy the information already present.

★ Data Processing Inequality (DPI)

If \(X \to Y \to Z\) form a Markov chain (Z depends on X only through Y), then: \[I(X; Z) \leq I(X; Y)\] Any processing of Y to produce Z cannot increase the mutual information with X. Equivalently: \(I(X;Z) \leq I(X;Y) \leq H(X)\).

Implications for Machine Learning
  • Feature extraction: Any feature map \(\phi: \mathcal{X} \to \mathcal{Z}\) satisfies \(I(\phi(\mathbf{x}); Y) \leq I(\mathbf{x}; Y)\). A good feature extractor preserves label-relevant information; a bad one discards it. The DPI bounds how good any feature can be.
  • Information bottleneck: Neural networks can be viewed as successive Markov chains \(\mathbf{x} \to \mathbf{h}^{[1]} \to \cdots \to \mathbf{h}^{[L]} \to \hat{y}\). DPI implies \(I(\mathbf{h}^{[L]}; y) \leq \cdots \leq I(\mathbf{h}^{[1]}; y) \leq I(\mathbf{x}; y)\).
  • Compression: You can't recover information lost during compression. If an encoder throws away label-relevant bits, the decoder cannot recover them — no matter how powerful.
  • Sufficient statistics: A statistic \(T(\mathbf{x})\) is sufficient iff \(I(T(\mathbf{x}); \theta) = I(\mathbf{x}; \theta)\) — it preserves all information about the parameter.
12
Section 12

Why Neural Networks Use Cross-Entropy

// MLE = minimizing cross-entropy · empirical distribution · the direct proof

This is not a design choice — it is a theorem. Cross-entropy loss is the unique correct loss for probabilistic classification under the maximum likelihood principle.

The MLE Connection — Full Derivation

Given training data \(\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^m\), define the empirical data distribution:

Empirical
\[\hat{P}_{\text{data}}(y \mid \mathbf{x}) = \frac{1}{m}\sum_{i=1}^m \mathbf{1}[(\mathbf{x}^{(i)}, y^{(i)}) = (\mathbf{x}, y)]\]
// Proof: MLE ≡ Cross-Entropy Minimization
1
The log-likelihood of the data under model \(p_\theta(y|\mathbf{x})\): \[\ell(\theta) = \sum_{i=1}^m\log p_\theta(y^{(i)} \mid \mathbf{x}^{(i)})\]
2
Rewrite as an expectation over the empirical distribution: \[\ell(\theta) = m\,\mathbb{E}_{(\mathbf{x},y)\sim\hat{P}}\!\left[\log p_\theta(y \mid \mathbf{x})\right]\]
3
Maximizing \(\ell(\theta)\) is equivalent to minimizing \(-\mathbb{E}_{\hat{P}}\!\left[\log p_\theta(y|\mathbf{x})\right]\) = cross-entropy \(H(\hat{P},p_\theta)\). Since \(H(\hat{P})\) is fixed: \[\arg\max_\theta\ell(\theta) = \arg\min_\theta H(\hat{P},p_\theta) = \arg\min_\theta D_{\text{KL}}(\hat{P}\,\|\,p_\theta)\]
4
In practice, with softmax output: \(p_\theta(y=k|\mathbf{x}) = \text{softmax}(\mathbf{z})_k\), so: \[-\mathbb{E}_{\hat{P}}[\log p_\theta] = -\frac{1}{m}\sum_{i=1}^m\sum_{k=1}^K y_k^{(i)}\log\hat{y}_k^{(i)} = \text{Categorical Cross-Entropy} \quad\blacksquare\]
The Core Insight
Training a neural network with cross-entropy loss is exactly equivalent to fitting the model distribution to the empirical data distribution by minimizing KL divergence. There is no heuristic here — it is the unique optimal loss function for probabilistic models under maximum likelihood. Any other loss (MSE, MAE) implicitly assumes a different noise model and is suboptimal for classification.
Why Not MSE for Classification?

MSE assumes the residuals are Gaussian-distributed (MLE under Gaussian noise). For classification outputs in \([0,1]\), the noise is Bernoulli. Using the wrong noise model gives the wrong gradient signal — particularly near the decision boundary where sigmoid saturates, giving vanishingly small gradients with MSE but informative gradients with cross-entropy.

13
Section 13

Information-Theoretic View of Learning

// generalization bounds · MDL · PAC learning · information bottleneck · ELBO
Minimum Description Length (MDL)

MDL principle: the best model is the one that provides the shortest description of the data. A model that fits the data in \(L_{\text{data}}\) bits but requires \(L_{\text{model}}\) bits to describe, has total cost \(L = L_{\text{model}} + L_{\text{data}|\text{model}}\).

MDL
\[L_{\text{data}|\text{model}} = -\log p_\theta(\mathbf{x}) = H(\hat{P}, p_\theta) \quad\text{(coding cost of data given model)}\] \[\text{MDL selects: }\hat{\theta} = \arg\min_\theta\; [-\log p_\theta(\text{data}) + \text{description length of }\theta]\]
MDL = MAP estimation in Bayesian terms: log P(data|θ) + log P(θ). The prior P(θ) encodes the model description length. More complex models have shorter data description but longer model description — MDL finds the optimal tradeoff.
The Information Bottleneck

Tishby et al. (2000) proposed viewing deep learning as finding a compressed representation \(T(\mathbf{x})\) that maximally preserves information about labels \(Y\) while minimally retaining information about inputs \(\mathbf{x}\):

IB
Information Bottleneck Objective
\[\min_{p(t|x)}\; I(X; T) - \beta\, I(T; Y)\]
β controls the tradeoff between compression (minimize I(X;T)) and predictive power (maximize I(T;Y)). For β → ∞: T retains all label information. For β → 0: T is maximally compressed. By DPI: I(T;Y) ≤ I(X;Y) — no representation can do better than the original data.
The ELBO — Variational Inference
ELBO
Evidence Lower Bound (VAE)
\[\log p(\mathbf{x}) \geq \underbrace{\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})]}_{\text{reconstruction: } -H(P_{\text{data}}, p_\theta)} - \underbrace{D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))}_{\text{regularization: keeps posterior close to prior}}\]
The ELBO is maximized when the KL term is zero (posterior = prior) and reconstruction is perfect. The tension between these two is the fundamental tradeoff in VAEs — and it is entirely an information-theoretic statement about the encoder-decoder channel.
PAC Learning and Generalization Bounds
PAC-Bayes
PAC-Bayes Generalization Bound
\[R[h] \leq \hat{R}[h] + \sqrt{\frac{D_{\text{KL}}(Q \,\|\, P) + \log(m/\delta)}{2m}}\]
With probability ≥ 1−δ over the training set. Q is the posterior over hypotheses, P is the prior. The generalization gap is bounded by the KL divergence between what we learned (Q) and what we assumed beforehand (P) — divided by sqrt(m). More data → tighter bound. KL(Q‖P) is a complexity measure: the more the data "surprises" our prior, the less we generalize.
14
Section 14

Information Theory in Modern ML

// contrastive learning · GANs · ICA · channel capacity · total correlation
Contrastive Learning — InfoNCE
InfoNCE
InfoNCE Loss (Oord et al. 2018)
\[\mathcal{L}_{\text{InfoNCE}} = -\mathbb{E}\left[\log\frac{e^{f(\mathbf{x},\mathbf{x}^+)/\tau}}{\sum_{j=0}^{N}e^{f(\mathbf{x},\mathbf{x}_j)/\tau}}\right]\]
f(x, x⁺) is a similarity score between anchor and positive pair. Minimizing InfoNCE maximizes a lower bound on I(X; X⁺) — the mutual information between two views of the same data. SimCLR, MoCo, CLIP all use this principle: learn representations that maximize MI between augmented views while contrasting against negatives.
GANs — JS Divergence

The original GAN (Goodfellow 2014) minimizes the Jensen-Shannon divergence between the real data distribution \(p_{\text{data}}\) and the generator distribution \(p_g\):

GAN
\[\min_G\max_D\; V(G,D) = \mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z}\sim p_z}[\log(1-D(G(\mathbf{z})))]\] \[\equiv -2\log 2 + 2\cdot JSD(p_{\text{data}} \,\|\, p_g)\]
Wasserstein GAN replaces JSD with the Wasserstein distance (Earth Mover's distance), which provides meaningful gradients even when distributions have disjoint support — fixing the mode collapse problem in original GANs.
Disentanglement — Total Correlation
Total Corr
\[TC(\mathbf{z}) = D_{\text{KL}}\!\left(p(\mathbf{z}) \,\Big\|\, \prod_{j=1}^d p(z_j)\right) = \sum_{j=1}^d H(z_j) - H(\mathbf{z})\]
Total correlation measures the total dependence between all latent dimensions. β-VAE (Higgins 2017) adds β>1 weight to the KL term, forcing higher compression and lower TC, encouraging disentangled representations where each latent captures an independent factor of variation.
Label Smoothing — Entropy Regularization

Label smoothing replaces one-hot targets \(y_k \in \{0,1\}\) with soft targets \(y_k' = (1-\epsilon)y_k + \epsilon/K\). This is equivalent to minimizing cross-entropy with a target that has higher entropy, preventing the model from becoming overconfident. The network cannot drive its predictions to 0 or 1 without large negative cross-entropy from the smoothed labels.

Smooth
\[\mathcal{L}_{\text{smooth}} = (1-\epsilon)\,H(\mathbf{y},\hat{\mathbf{y}}) + \epsilon\,H(\mathbf{u},\hat{\mathbf{y}})\]
Where u is the uniform distribution (1/K per class). Label smoothing improves calibration, reduces overfitting, and has an information-theoretic interpretation as penalizing low-entropy (overconfident) predictions.
15
Section 15

The Complete Mental Model

// every quantity connected · one unified picture
SELF-INFORMATION I(x) = −log P(x) ENTROPY H(X) = E[I(X)] JOINT / COND H(X,Y) H(Y|X) MUTUAL INFO I(X;Y) = H(X)+H(Y)−H(X,Y) CROSS-ENTROPY H(P,Q) = −E_P[log Q] KL DIVERGENCE D_KL(P‖Q) ≥ 0 MAXENT / FISHER ExpFamily · CRLB H+KL ML APPLICATIONS CE Loss = MLE KL → VAE ELBO MI → InfoNCE MaxEnt → Softmax DPI → IB
Fig 3. Complete Information Theory mental map for ML — from self-information through entropy, MI, KL divergence, and their ML applications.

The unified story, in one coherent thread:

H
Entropy

The expected surprise. How many bits needed to encode outcomes. Maximum at uniform distribution. Zero when deterministic. The foundation of everything.

I(X;Y)
Mutual Information

Shared information between variables. Zero iff independent. The objective in contrastive learning, feature selection, and the information bottleneck.

H(P,Q)
Cross-Entropy

Cost of encoding P-events with Q-code. H(P,Q) = H(P) + KL(P‖Q). Minimizing over Q equals MLE. This IS the neural network loss function.

KL
KL Divergence

Extra bits from using Q instead of P. Always ≥ 0 (Jensen's). Not symmetric — forward and reverse KL have different mode-seeking behaviors.

MaxEnt
Maximum Entropy

Given constraints, be maximally uncertain. Implies exponential family distributions. Softmax IS the MaxEnt categorical distribution.

DPI
Data Processing

Processing cannot create information. I(X;Z) ≤ I(X;Y) for X→Y→Z. Every layer of a neural network can only preserve or lose information about the labels.

★ The Unifying Principle

Every major concept in machine learning can be read as an information-theoretic statement. Training a neural network minimizes \(D_{\text{KL}}(\hat{P}_{\text{data}} \,\|\, p_\theta)\) — the divergence between empirical reality and our model. Regularization minimizes \(D_{\text{KL}}(q_\theta \,\|\, p_{\text{prior}})\) — how far we've moved from our initial beliefs. Generalization is bounded by how much the data surprised our prior. Feature learning maximizes \(I(\text{representation}; \text{labels})\). Compression minimizes \(I(\text{representation}; \text{inputs})\). Every objective, every bound, every architectural choice traces back to a single question Shannon asked in 1948: how much do we not know?