Information Theory | Shahid Ul Islam

01

Section 01

What Is Information? — Self-Information

// surprise · log probability · bits vs nats · the fundamental unit

Information theory was invented by Claude Shannon in 1948 to answer a practical question: how many bits are needed to transmit a message over a noisy channel? But the answer revealed something profound about the nature of knowledge itself.

The Surprise Principle

Information content is measured by surprise. An event you were certain would happen carries no information. An impossible event carrying infinite information. The function must satisfy three axioms:

Monotone decreasing in probability: More probable events carry less information. If \(p_1 > p_2\), then \(I(p_1) < I(p_2)\).
Non-negative: \(I(p) \geq 0\) for all \(p \in [0,1]\). We gain, never lose, information.
Additive for independent events: \(I(p_1 \cdot p_2) = I(p_1) + I(p_2)\). Learning two independent facts gives the sum of their information contents.

These three axioms uniquely determine the form of the information function:

Self-Info

Self-Information (Surprisal)

\[I(x) = -\log_b P(X=x) = \log_b \frac{1}{P(X=x)}\]

Base b=2 → units of bits. Base b=e → nats. Base b=10 → hartleys. ML uses nats (natural log) for mathematical convenience. The three axioms above uniquely determine this logarithmic form — no other function works.

∞

P = 0

Impossible event. Infinite surprise if it occurs.

1 bit

P = ½

Fair coin flip. One binary question answered.

3.32 bits

P = 1/10

One of ten equally likely outcomes.

0 bits

P = 1

Certain event. No surprise, no information.

02

Section 02

Entropy — Measuring Average Uncertainty

// Shannon entropy · binary entropy · axiomatic derivation · properties

Self-information measures the surprise of a single outcome. Entropy generalizes this to an entire probability distribution — it is the expected self-information, or the average surprise of a random variable.

Shannon Entropy — Definition

Entropy

Shannon Entropy H(X)

\[H(X) = \mathbb{E}_{x \sim P}\left[I(x)\right] = -\sum_{x \in \mathcal{X}} P(x)\log P(x) = \mathbb{E}\left[-\log P(X)\right]\]

Convention: 0 log 0 = 0 (by L'Hôpital's rule: lim_{p→0} p log p = 0). For continuous distributions: H(X) = −∫p(x)log p(x)dx (differential entropy — can be negative).

Entropy answers: "If I sample from this distribution repeatedly, how many bits on average do I need to encode each outcome?" It is the minimum average code length achievable (Shannon's source coding theorem).

The Binary Entropy Function

For a Bernoulli(\(p\)) variable (coin with bias \(p\)):

Binary H

\[H(p) = -p\log_2 p - (1-p)\log_2(1-p)\]

H(0) = H(1) = 0 (certain outcome, no uncertainty). H(0.5) = 1 bit (maximum — fully uncertain). H is concave: mixing distributions increases entropy.

Fig 1. Binary entropy H(p). Maximum entropy (1 bit) at p=0.5 — maximum uncertainty. Zero entropy at p=0 and p=1 — complete certainty.

Properties of Entropy

Non-negative: \(H(X) \geq 0\), with equality iff \(X\) is deterministic (one outcome has probability 1).
Maximum at uniform: For \(|\mathcal{X}| = n\), \(H(X) \leq \log n\), with equality iff \(P(x) = 1/n\) for all \(x\). The uniform distribution is maximally uncertain.
Concavity: Mixing distributions always increases entropy.
\[H(\lambda P + (1-\lambda)Q) \geq \lambda H(P) + (1-\lambda)H(Q)\]
Invariant to relabeling: Renaming outcomes doesn't change entropy — it depends only on probabilities, not their identities.
Continuous and symmetric in the probabilities.

03

Section 03

Joint & Conditional Entropy

// two variables · what remains unknown · the information Venn diagram

Joint Entropy H(X, Y)

The entropy of two variables together — how many bits to describe both simultaneously:

Joint

Joint Entropy

\[H(X, Y) = -\sum_{x,y} P(x,y)\log P(x,y) = \mathbb{E}_{(x,y)\sim P}[-\log P(X,Y)]\]

If X and Y are independent: H(X,Y) = H(X) + H(Y). If Y is a deterministic function of X: H(X,Y) = H(X). In general: max(H(X),H(Y)) ≤ H(X,Y) ≤ H(X) + H(Y).

Conditional Entropy H(Y|X)

How much uncertainty remains about \(Y\) after observing \(X\)?

Conditional

Conditional Entropy

\[H(Y \mid X) = \sum_{x} P(x)\,H(Y \mid X=x) = -\sum_{x,y}P(x,y)\log P(y \mid x)\] \[= \mathbb{E}_{x\sim P_X}\left[H(Y \mid X=x)\right] = H(X,Y) - H(X)\]

H(Y|X) ≥ 0. H(Y|X) = 0 iff Y is fully determined by X. H(Y|X) = H(Y) iff X and Y are independent — knowing X tells you nothing about Y. Conditioning never increases entropy: H(Y|X) ≤ H(Y).

Fig 2. Information Venn diagram. H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y). The overlap I(X;Y) is mutual information — the shared information between X and Y.

04

Section 04

The Chain Rules of Information Theory

// entropy chain rule · mutual information chain rule · modularity

Chain Rule for Entropy

Chain Rule

Chain Rule — Entropy

\[H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n H(X_i \mid X_1, \ldots, X_{i-1})\] \[\text{Special case: } H(X,Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)\]

Proof: expand log P(x₁,…,xₙ) = Σ log P(xᵢ|x₁,…,xᵢ₋₁) using the chain rule of probability. Take expectation. Each term is a conditional entropy.

Interpretation

The total uncertainty about \(n\) variables equals the uncertainty about the first, plus the remaining uncertainty about the second given the first, plus the remaining uncertainty about the third given the first two, and so on. Each piece of knowledge reduces entropy.

05

Section 05

Mutual Information

// shared information · independence test · feature selection · channel capacity

Mutual information \(I(X;Y)\) quantifies how much knowing \(X\) tells you about \(Y\) — and vice versa. It is the single most important quantity for understanding statistical dependence.

Definition and Equivalent Forms

MI

Mutual Information — Four Equivalent Definitions

\[I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)\] \[\phantom{I(X;Y)} = H(X) + H(Y) - H(X,Y)\] \[\phantom{I(X;Y)} = \sum_{x,y}P(x,y)\log\frac{P(x,y)}{P(x)P(y)} = D_{\text{KL}}\!\left(P_{XY} \,\|\, P_X \otimes P_Y\right)\]

The last form is the KL divergence between the joint distribution and the product of marginals. I(X;Y) = 0 iff X and Y are independent (P(x,y) = P(x)P(y) everywhere). Always I(X;Y) ≥ 0.

Properties

Non-negative: \(I(X;Y) \geq 0\) — observing \(X\) never increases our uncertainty about \(Y\) on average.
Symmetric: \(I(X;Y) = I(Y;X)\) — information is shared equally both ways.
Bounded: \(I(X;Y) \leq \min(H(X), H(Y))\) — can't learn more about \(X\) from \(Y\) than \(X\) itself contains.
Zero iff independent: \(I(X;Y) = 0 \Leftrightarrow X \perp Y\) — MI is a general dependence measure, not just linear correlation.

MI in Machine Learning — Feature Selection

Mutual information is a principled criterion for selecting features: \(I(X_j; Y)\) measures how informative feature \(j\) is about the label \(Y\), regardless of the relationship form. Unlike correlation, it captures nonlinear dependence.

mRMR

Minimum Redundancy Maximum Relevance (mRMR)

\[\max_{X_j}\;\underbrace{I(X_j;\, Y)}_{\text{relevance}} - \frac{1}{k}\sum_{X_i \in S}\underbrace{I(X_j;\, X_i)}_{\text{redundancy with selected}}\]

Select features that are maximally informative about Y, while minimally redundant with already-selected features. Pure MI filter methods just rank by I(X_j; Y).

06

Section 06

Cross-Entropy

// comparing distributions · encoding cost · H(P,Q) ≥ H(P)

Cross-entropy measures the cost of encoding events drawn from distribution \(P\) using a code optimized for distribution \(Q\). When our model \(Q\) matches reality \(P\) exactly, the cost is minimized at \(H(P)\).

Definition

Cross-H

Cross-Entropy H(P, Q)

\[H(P, Q) = -\sum_x P(x)\log Q(x) = \mathbb{E}_{x\sim P}[-\log Q(x)]\]

This is the expected log-loss when the true distribution is P but we predict Q. The extra cost of using the wrong model is exactly KL divergence: H(P,Q) = H(P) + D_KL(P‖Q).

Decomp

Cross-Entropy = Entropy + KL Divergence

\[H(P, Q) = H(P) + D_{\text{KL}}(P \,\|\, Q)\]

Since D_KL ≥ 0: H(P,Q) ≥ H(P), with equality iff P = Q. Cross-entropy is minimized when Q perfectly matches P. This is the fundamental insight behind all ML loss functions.

Properties

Not symmetric: \(H(P,Q) \neq H(Q,P)\) in general — encoding \(P\)-drawn events with a \(Q\)-code is different from encoding \(Q\)-events with a \(P\)-code.
Not a distance: But it defines a loss — minimizing \(H(P,Q)\) over \(Q\) is equivalent to minimizing \(D_{\text{KL}}(P\|Q)\) (since \(H(P)\) is constant w.r.t. \(Q\)).
Connection to MLE: Minimizing cross-entropy \(H(\hat{P}_{\text{data}}, Q_\theta)\) over model parameters \(\theta\) is exactly maximum likelihood estimation (§12).

07

Section 07

KL Divergence — Relative Entropy

// Kullback-Leibler · non-symmetry · information gain · forward vs reverse

KL divergence is the extra information cost of using an approximate model \(Q\) instead of the true distribution \(P\). It is the single most important quantity in Bayesian statistics, variational inference, and information theory.

Definition

KL Div

KL Divergence (Relative Entropy)

\[D_{\text{KL}}(P \,\|\, Q) = \sum_x P(x)\log\frac{P(x)}{Q(x)} = \mathbb{E}_{x\sim P}\left[\log\frac{P(x)}{Q(x)}\right]\] \[= H(P,Q) - H(P) \geq 0\]

When Q(x) = 0 but P(x) > 0: D_KL = +∞ (our model assigns zero probability to something that can happen — catastrophic). When P(x) = 0: the term contributes 0 (by convention). This asymmetry is crucial.

Forward vs Reverse KL — The Mode-Seeking vs Mean-Seeking Distinction

Forward KL: D_KL(P‖Q)

Used in: Maximum Likelihood, cross-entropy training.

Behavior: Where \(P > 0\), forces \(Q > 0\). "Zero-avoiding" — penalizes \(Q\) for not covering all modes of \(P\). If \(P\) is bimodal, \(Q\) spreads to cover both modes.

"Mean-seeking": \(Q\) tends to cover the average of \(P\).

Reverse KL: D_KL(Q‖P)

Used in: Variational inference (ELBO), VAEs.

Behavior: Where \(Q > 0\), forces \(P > 0\). "Zero-forcing" — if \(P\) is bimodal, \(Q\) concentrates on one mode and ignores the other.

"Mode-seeking": \(Q\) commits to one mode of \(P\).

Why This Matters for VAEs

In the VAE ELBO, the KL term is \(D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))\) — reverse KL. This makes the approximate posterior \(q\) mode-seeking: it concentrates on the regions of the prior \(p\) that best explain the data, rather than covering all of it. This is why VAE posteriors are well-behaved Gaussians.

Properties of KL Divergence

Non-negative: \(D_{\text{KL}}(P\|Q) \geq 0\), with equality iff \(P = Q\) almost everywhere.
Asymmetric: \(D_{\text{KL}}(P\|Q) \neq D_{\text{KL}}(Q\|P)\) in general.
Not a metric: No triangle inequality, not symmetric. The Jensen-Shannon divergence \(JSD = \frac{1}{2}[D_{\text{KL}}(P\|M) + D_{\text{KL}}(Q\|M)]\) where \(M = \frac{P+Q}{2}\) is symmetric, bounded in [0,1], and its square root is a metric.
Invariant to bijective reparametrization: Unlike variance, KL divergence is preserved under invertible transformations.

08

Section 08

Jensen's Inequality & Why KL ≥ 0

// convex functions · Jensen · Gibbs inequality · information inequality

Jensen's Inequality

Jensen

Jensen's Inequality (for convex f)

\[f\!\left(\mathbb{E}[X]\right) \leq \mathbb{E}[f(X)] \quad \text{if } f \text{ is convex}\] \[f\!\left(\mathbb{E}[X]\right) \geq \mathbb{E}[f(X)] \quad \text{if } f \text{ is concave}\]

Proof sketch: a convex function lies above its tangent at any point. Since the tangent at E[X] satisfies f(x) ≥ f(E[X]) + f'(E[X])(x − E[X]), take expectations. The entropy function is concave (−p log p is concave), which is why H achieves its maximum at the uniform distribution.

Proof that KL Divergence ≥ 0 (Gibbs Inequality)

// Proof: D_KL(P‖Q) ≥ 0 using Jensen's inequality

1

\[D_{\text{KL}}(P\|Q) = \mathbb{E}_{x\sim P}\left[\log\frac{P(x)}{Q(x)}\right] = -\mathbb{E}_{x\sim P}\left[\log\frac{Q(x)}{P(x)}\right]\]

2

Apply Jensen's inequality with \(f(t) = -\log t\) (convex since \(f''(t) = 1/t^2 > 0\)): \[\mathbb{E}\left[-\log\frac{Q(x)}{P(x)}\right] \geq -\log\mathbb{E}\left[\frac{Q(x)}{P(x)}\right]\]

3

Evaluate the expectation: \[\mathbb{E}_{x\sim P}\left[\frac{Q(x)}{P(x)}\right] = \sum_x P(x)\frac{Q(x)}{P(x)} = \sum_x Q(x) = 1\]

4

Therefore: \(D_{\text{KL}}(P\|Q) \geq -\log(1) = 0\). Equality holds iff \(Q(x)/P(x)\) is constant — i.e., \(P = Q\). \(\blacksquare\)

The Log-Sum Inequality

Log-Sum

Log-Sum Inequality

\[\sum_{i=1}^n a_i \log\frac{a_i}{b_i} \geq \left(\sum_i a_i\right)\log\frac{\sum_i a_i}{\sum_i b_i}\]

A generalization of Jensen's. Used to prove subadditivity of entropy: H(X,Y) ≤ H(X) + H(Y), and the data processing inequality.

09

Section 09

The Maximum Entropy Principle

// maximum entropy distribution · Lagrangian derivation · exponential family · softmax

Given only partial knowledge about a distribution (e.g., its mean), which distribution should we choose? The answer: the one with maximum entropy consistent with the known constraints. We should be maximally uncertain about what we don't know.

Derivation via Lagrange Multipliers

Maximize \(H(P) = -\sum_x P(x)\log P(x)\) subject to \(\sum_x P(x) = 1\) (normalization) and \(\sum_x P(x)f_j(x) = \mu_j\) (known expected features).

// MaxEnt derivation — connects to exponential family and softmax

1

Lagrangian: \(\mathcal{L} = -\sum_x P(x)\log P(x) - \lambda_0\!\left(\sum_x P(x)-1\right) - \sum_j\lambda_j\!\left(\sum_x P(x)f_j(x)-\mu_j\right)\)

2

Stationarity w.r.t. \(P(x)\): \(\frac{\partial\mathcal{L}}{\partial P(x)} = -\log P(x) - 1 - \lambda_0 - \sum_j\lambda_j f_j(x) = 0\)

3

Solve: \(P(x) = \exp\!\left(-1 - \lambda_0 - \sum_j\lambda_j f_j(x)\right) = \frac{1}{Z}\exp\!\left(-\sum_j\lambda_j f_j(x)\right)\)

where Z = exp(1+λ₀) is the partition function, determined by the normalization constraint.

4

For classification with feature \(f_k(x) = \mathbf{1}[x=k]\) (indicator function per class): \[P(y=k) = \frac{e^{\lambda_k}}{\sum_{j=1}^K e^{\lambda_j}} \quad\leftarrow\quad \text{the Softmax!}\]

Softmax is the maximum entropy distribution subject to constraints on expected class membership — not an arbitrary choice, but a principled probabilistic inference.

★ Theorem — MaxEnt = Exponential Family

The maximum entropy distribution subject to \(n\) moment constraints \(\mathbb{E}[f_j(X)] = \mu_j\) is an exponential family distribution: \(P_{\boldsymbol{\lambda}}(x) = \frac{1}{Z(\boldsymbol{\lambda})}\exp(\boldsymbol{\lambda}^T\mathbf{f}(x))\). Every common distribution — Gaussian, Bernoulli, Poisson, Categorical — is a MaxEnt distribution under the appropriate constraints. This is why they appear everywhere.

MaxEnt Implies Common Distributions

Constraints	Support	MaxEnt Distribution
Normalization only	Finite set \(\{1,\ldots,K\}\)	Uniform: \(P(k)=1/K\)
Known mean \(\mu\)	\([0,\infty)\)	Exponential: \(\text{Exp}(1/\mu)\)
Known mean and variance	\((-\infty,\infty)\)	Gaussian: \(\mathcal{N}(\mu, \sigma^2)\)
Known mean, binary outcomes	\(\{0,1\}\)	Bernoulli: \(\text{Bern}(p)\)
Feature expectations per class	\(\{1,\ldots,K\}\)	Softmax / Categorical

10

Section 10

Fisher Information

// curvature of log-likelihood · Cramér-Rao bound · natural gradient

Fisher information measures how much information data carries about a model parameter \(\theta\). It is the variance of the score function — the gradient of the log-likelihood.

Definition and Geometric Meaning

Fisher

Fisher Information

\[\mathcal{I}(\theta) = \mathbb{E}_{x\sim p_\theta}\!\left[\left(\frac{\partial\log p(x;\theta)}{\partial\theta}\right)^2\right] = -\mathbb{E}_{x\sim p_\theta}\!\left[\frac{\partial^2\log p(x;\theta)}{\partial\theta^2}\right]\] \[\text{For vector } \boldsymbol{\theta}: \quad \mathcal{I}(\boldsymbol{\theta})_{jk} = \mathbb{E}\!\left[\frac{\partial\log p}{\partial\theta_j}\frac{\partial\log p}{\partial\theta_k}\right]\]

Geometrically: Fisher information is the curvature of the log-likelihood. High Fisher information → sharply peaked likelihood → parameter is precisely identifiable from data. Low Fisher information → flat likelihood → data is uninformative about θ.

Cramér-Rao Lower Bound

Cramér-Rao

Cramér-Rao Lower Bound (CRLB)

\[\text{Var}(\hat{\theta}) \geq \frac{1}{\mathcal{I}(\theta)}\]

No unbiased estimator can have variance smaller than the inverse Fisher information. Fisher information is the fundamental limit on statistical estimation precision. The MLE achieves this bound asymptotically (it is asymptotically efficient).

Natural Gradient Descent

Standard gradient descent uses the Euclidean metric in parameter space. The natural gradient uses the Fisher information metric — the geometry of the distribution space:

Nat Grad

Natural Gradient

\[\tilde{\nabla}_\theta \mathcal{L} = \mathcal{I}(\theta)^{-1}\nabla_\theta \mathcal{L}\]

The natural gradient is invariant to reparametrization. It moves in the direction of steepest ascent in distribution space (KL geometry) rather than parameter space. Used in TRPO, PPO, and K-FAC for neural networks. Avoids the instability of parameter-space gradient descent when parameters are ill-conditioned.

11

Section 11

Data Processing Inequality

// information can only be lost · Markov chains · no free lunch in processing

One of the most important theorems in information theory: processing data cannot create new information. It can only preserve or destroy the information already present.

★ Data Processing Inequality (DPI)

If \(X \to Y \to Z\) form a Markov chain (Z depends on X only through Y), then: \[I(X; Z) \leq I(X; Y)\] Any processing of Y to produce Z cannot increase the mutual information with X. Equivalently: \(I(X;Z) \leq I(X;Y) \leq H(X)\).

Implications for Machine Learning

Feature extraction: Any feature map \(\phi: \mathcal{X} \to \mathcal{Z}\) satisfies \(I(\phi(\mathbf{x}); Y) \leq I(\mathbf{x}; Y)\). A good feature extractor preserves label-relevant information; a bad one discards it. The DPI bounds how good any feature can be.
Information bottleneck: Neural networks can be viewed as successive Markov chains \(\mathbf{x} \to \mathbf{h}^{[1]} \to \cdots \to \mathbf{h}^{[L]} \to \hat{y}\). DPI implies \(I(\mathbf{h}^{[L]}; y) \leq \cdots \leq I(\mathbf{h}^{[1]}; y) \leq I(\mathbf{x}; y)\).
Compression: You can't recover information lost during compression. If an encoder throws away label-relevant bits, the decoder cannot recover them — no matter how powerful.
Sufficient statistics: A statistic \(T(\mathbf{x})\) is sufficient iff \(I(T(\mathbf{x}); \theta) = I(\mathbf{x}; \theta)\) — it preserves all information about the parameter.

12

Section 12

Why Neural Networks Use Cross-Entropy

// MLE = minimizing cross-entropy · empirical distribution · the direct proof

This is not a design choice — it is a theorem. Cross-entropy loss is the unique correct loss for probabilistic classification under the maximum likelihood principle.

The MLE Connection — Full Derivation

Given training data \(\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^m\), define the empirical data distribution:

Empirical

\[\hat{P}_{\text{data}}(y \mid \mathbf{x}) = \frac{1}{m}\sum_{i=1}^m \mathbf{1}[(\mathbf{x}^{(i)}, y^{(i)}) = (\mathbf{x}, y)]\]

// Proof: MLE ≡ Cross-Entropy Minimization

1

The log-likelihood of the data under model \(p_\theta(y|\mathbf{x})\): \[\ell(\theta) = \sum_{i=1}^m\log p_\theta(y^{(i)} \mid \mathbf{x}^{(i)})\]

2

Rewrite as an expectation over the empirical distribution: \[\ell(\theta) = m\,\mathbb{E}_{(\mathbf{x},y)\sim\hat{P}}\!\left[\log p_\theta(y \mid \mathbf{x})\right]\]

3

Maximizing \(\ell(\theta)\) is equivalent to minimizing \(-\mathbb{E}_{\hat{P}}\!\left[\log p_\theta(y|\mathbf{x})\right]\) = cross-entropy \(H(\hat{P},p_\theta)\). Since \(H(\hat{P})\) is fixed: \[\arg\max_\theta\ell(\theta) = \arg\min_\theta H(\hat{P},p_\theta) = \arg\min_\theta D_{\text{KL}}(\hat{P}\,\|\,p_\theta)\]

4

In practice, with softmax output: \(p_\theta(y=k|\mathbf{x}) = \text{softmax}(\mathbf{z})_k\), so: \[-\mathbb{E}_{\hat{P}}[\log p_\theta] = -\frac{1}{m}\sum_{i=1}^m\sum_{k=1}^K y_k^{(i)}\log\hat{y}_k^{(i)} = \text{Categorical Cross-Entropy} \quad\blacksquare\]

The Core Insight

Training a neural network with cross-entropy loss is exactly equivalent to fitting the model distribution to the empirical data distribution by minimizing KL divergence. There is no heuristic here — it is the unique optimal loss function for probabilistic models under maximum likelihood. Any other loss (MSE, MAE) implicitly assumes a different noise model and is suboptimal for classification.

Why Not MSE for Classification?

MSE assumes the residuals are Gaussian-distributed (MLE under Gaussian noise). For classification outputs in \([0,1]\), the noise is Bernoulli. Using the wrong noise model gives the wrong gradient signal — particularly near the decision boundary where sigmoid saturates, giving vanishingly small gradients with MSE but informative gradients with cross-entropy.

13

Section 13

Information-Theoretic View of Learning

// generalization bounds · MDL · PAC learning · information bottleneck · ELBO

Minimum Description Length (MDL)

MDL principle: the best model is the one that provides the shortest description of the data. A model that fits the data in \(L_{\text{data}}\) bits but requires \(L_{\text{model}}\) bits to describe, has total cost \(L = L_{\text{model}} + L_{\text{data}|\text{model}}\).

MDL

\[L_{\text{data}|\text{model}} = -\log p_\theta(\mathbf{x}) = H(\hat{P}, p_\theta) \quad\text{(coding cost of data given model)}\] \[\text{MDL selects: }\hat{\theta} = \arg\min_\theta\; [-\log p_\theta(\text{data}) + \text{description length of }\theta]\]

MDL = MAP estimation in Bayesian terms: log P(data|θ) + log P(θ). The prior P(θ) encodes the model description length. More complex models have shorter data description but longer model description — MDL finds the optimal tradeoff.

The Information Bottleneck

Tishby et al. (2000) proposed viewing deep learning as finding a compressed representation \(T(\mathbf{x})\) that maximally preserves information about labels \(Y\) while minimally retaining information about inputs \(\mathbf{x}\):

IB

Information Bottleneck Objective

\[\min_{p(t|x)}\; I(X; T) - \beta\, I(T; Y)\]

β controls the tradeoff between compression (minimize I(X;T)) and predictive power (maximize I(T;Y)). For β → ∞: T retains all label information. For β → 0: T is maximally compressed. By DPI: I(T;Y) ≤ I(X;Y) — no representation can do better than the original data.

The ELBO — Variational Inference

ELBO

Evidence Lower Bound (VAE)

\[\log p(\mathbf{x}) \geq \underbrace{\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})]}_{\text{reconstruction: } -H(P_{\text{data}}, p_\theta)} - \underbrace{D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))}_{\text{regularization: keeps posterior close to prior}}\]

The ELBO is maximized when the KL term is zero (posterior = prior) and reconstruction is perfect. The tension between these two is the fundamental tradeoff in VAEs — and it is entirely an information-theoretic statement about the encoder-decoder channel.

PAC Learning and Generalization Bounds

PAC-Bayes

PAC-Bayes Generalization Bound

\[R[h] \leq \hat{R}[h] + \sqrt{\frac{D_{\text{KL}}(Q \,\|\, P) + \log(m/\delta)}{2m}}\]

With probability ≥ 1−δ over the training set. Q is the posterior over hypotheses, P is the prior. The generalization gap is bounded by the KL divergence between what we learned (Q) and what we assumed beforehand (P) — divided by sqrt(m). More data → tighter bound. KL(Q‖P) is a complexity measure: the more the data "surprises" our prior, the less we generalize.

14

Section 14

Information Theory in Modern ML

// contrastive learning · GANs · ICA · channel capacity · total correlation

Contrastive Learning — InfoNCE

InfoNCE

InfoNCE Loss (Oord et al. 2018)

\[\mathcal{L}_{\text{InfoNCE}} = -\mathbb{E}\left[\log\frac{e^{f(\mathbf{x},\mathbf{x}^+)/\tau}}{\sum_{j=0}^{N}e^{f(\mathbf{x},\mathbf{x}_j)/\tau}}\right]\]

f(x, x⁺) is a similarity score between anchor and positive pair. Minimizing InfoNCE maximizes a lower bound on I(X; X⁺) — the mutual information between two views of the same data. SimCLR, MoCo, CLIP all use this principle: learn representations that maximize MI between augmented views while contrasting against negatives.

GANs — JS Divergence

The original GAN (Goodfellow 2014) minimizes the Jensen-Shannon divergence between the real data distribution \(p_{\text{data}}\) and the generator distribution \(p_g\):

GAN

\[\min_G\max_D\; V(G,D) = \mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z}\sim p_z}[\log(1-D(G(\mathbf{z})))]\] \[\equiv -2\log 2 + 2\cdot JSD(p_{\text{data}} \,\|\, p_g)\]

Wasserstein GAN replaces JSD with the Wasserstein distance (Earth Mover's distance), which provides meaningful gradients even when distributions have disjoint support — fixing the mode collapse problem in original GANs.

Disentanglement — Total Correlation

Total Corr

\[TC(\mathbf{z}) = D_{\text{KL}}\!\left(p(\mathbf{z}) \,\Big\|\, \prod_{j=1}^d p(z_j)\right) = \sum_{j=1}^d H(z_j) - H(\mathbf{z})\]

Total correlation measures the total dependence between all latent dimensions. β-VAE (Higgins 2017) adds β>1 weight to the KL term, forcing higher compression and lower TC, encouraging disentangled representations where each latent captures an independent factor of variation.

Label Smoothing — Entropy Regularization

Label smoothing replaces one-hot targets \(y_k \in \{0,1\}\) with soft targets \(y_k' = (1-\epsilon)y_k + \epsilon/K\). This is equivalent to minimizing cross-entropy with a target that has higher entropy, preventing the model from becoming overconfident. The network cannot drive its predictions to 0 or 1 without large negative cross-entropy from the smoothed labels.

Smooth

\[\mathcal{L}_{\text{smooth}} = (1-\epsilon)\,H(\mathbf{y},\hat{\mathbf{y}}) + \epsilon\,H(\mathbf{u},\hat{\mathbf{y}})\]

Where u is the uniform distribution (1/K per class). Label smoothing improves calibration, reduces overfitting, and has an information-theoretic interpretation as penalizing low-entropy (overconfident) predictions.

15

Section 15

The Complete Mental Model

// every quantity connected · one unified picture

Fig 3. Complete Information Theory mental map for ML — from self-information through entropy, MI, KL divergence, and their ML applications.

The unified story, in one coherent thread:

H

Entropy

The expected surprise. How many bits needed to encode outcomes. Maximum at uniform distribution. Zero when deterministic. The foundation of everything.

I(X;Y)

Mutual Information

Shared information between variables. Zero iff independent. The objective in contrastive learning, feature selection, and the information bottleneck.

H(P,Q)

Cross-Entropy

Cost of encoding P-events with Q-code. H(P,Q) = H(P) + KL(P‖Q). Minimizing over Q equals MLE. This IS the neural network loss function.

KL

KL Divergence

Extra bits from using Q instead of P. Always ≥ 0 (Jensen's). Not symmetric — forward and reverse KL have different mode-seeking behaviors.

MaxEnt

Maximum Entropy

Given constraints, be maximally uncertain. Implies exponential family distributions. Softmax IS the MaxEnt categorical distribution.

DPI

Data Processing

Processing cannot create information. I(X;Z) ≤ I(X;Y) for X→Y→Z. Every layer of a neural network can only preserve or lose information about the labels.

★ The Unifying Principle

Every major concept in machine learning can be read as an information-theoretic statement. Training a neural network minimizes \(D_{\text{KL}}(\hat{P}_{\text{data}} \,\|\, p_\theta)\) — the divergence between empirical reality and our model. Regularization minimizes \(D_{\text{KL}}(q_\theta \,\|\, p_{\text{prior}})\) — how far we've moved from our initial beliefs. Generalization is bounded by how much the data surprised our prior. Feature learning maximizes \(I(\text{representation}; \text{labels})\). Compression minimizes \(I(\text{representation}; \text{inputs})\). Every objective, every bound, every architectural choice traces back to a single question Shannon asked in 1948: how much do we not know?