Probablity Theory | Shahid Ul Islam

01

Foundations

Probability Spaces

// sample space · σ-algebra · probability measure · Kolmogorov axioms

A probability space is the mathematical bedrock on which everything in statistics and ML rests. It gives a rigorous answer to the question: what is a random experiment?

The Formal Definition

Probability Space (Ω, ℱ, ℙ)Definition

\[(\Omega,\,\mathcal{F},\,\mathbb{P}) \quad\text{where:}\] \[\Omega \text{ — sample space: the set of all possible outcomes}\] \[\mathcal{F} \subseteq 2^\Omega \text{ — σ-algebra: a collection of measurable events}\] \[\mathbb{P}:\mathcal{F}\to[0,1] \text{ — probability measure: assigns probabilities to events}\]

The σ-algebra ℱ must satisfy: (1) Ω ∈ ℱ, (2) if A ∈ ℱ then Aᶜ ∈ ℱ (closed under complement), (3) if A₁,A₂,... ∈ ℱ then ∪ᵢAᵢ ∈ ℱ (closed under countable unions). This ensures every set we want to assign a probability to is "measurable."

Kolmogorov's Three AxiomsAxioms

\[\textbf{(K1) Non-negativity:}\quad \mathbb{P}(A) \geq 0 \quad\forall A\in\mathcal{F}\] \[\textbf{(K2) Normalization:}\quad \mathbb{P}(\Omega) = 1\] \[\textbf{(K3) Countable Additivity:}\quad A_i \cap A_j = \emptyset \Rightarrow \mathbb{P}\!\left(\bigcup_{i=1}^\infty A_i\right) = \sum_{i=1}^\infty \mathbb{P}(A_i)\]

All probability rules you've ever used — complement rule P(Aᶜ) = 1−P(A), inclusion-exclusion, Boole's inequality — are theorems derived from these three axioms alone. They were proposed by Andrei Kolmogorov in 1933, unifying all previous ad hoc probability rules into a single coherent framework.

Key Consequences

Complement rule: \(\mathbb{P}(A^c) = 1 - \mathbb{P}(A)\). Proof: \(\Omega = A \cup A^c\) (disjoint), so \(1 = \mathbb{P}(A) + \mathbb{P}(A^c)\).
Monotonicity: If \(A \subseteq B\) then \(\mathbb{P}(A) \leq \mathbb{P}(B)\).
Inclusion-Exclusion: \(\mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B) - \mathbb{P}(A \cap B)\).
Union bound (Boole's inequality): \(\mathbb{P}(\bigcup_i A_i) \leq \sum_i \mathbb{P}(A_i)\) — used constantly in ML generalization bounds.

02

Variables

Random Variables

// measurable function · distribution · discrete vs continuous · induced measure

A random variable is not actually a variable in the usual sense — it is a function from the sample space to the real numbers. This formalism connects the abstract probability space to measurable quantities.

Formal Definition

Random Variable — DefinitionFormal

\[X: \Omega \to \mathbb{R} \quad \text{such that } \{X \leq x\} = \{\omega\in\Omega: X(\omega) \leq x\} \in \mathcal{F} \quad\forall x\in\mathbb{R}\]

The measurability condition ensures that P(X ≤ x) makes sense (the set we're measuring is in ℱ). In practice: for discrete probability spaces, every function is a random variable. For continuous spaces (e.g., Ω = ℝ), we need measurability to rule out pathological functions. Most functions you'll ever encounter are measurable.

Discrete Random Variables

Takes values in a countable set \(\{x_1, x_2, \ldots\}\). The distribution is fully described by the probability mass function (PMF).

Examples: number of heads in \(n\) flips, number of words in a sentence, number of training epochs until convergence.

Continuous Random Variables

Takes values in an uncountable set (e.g., \(\mathbb{R}\)). The distribution is described by a probability density function (PDF). \(P(X = x) = 0\) for any specific \(x\).

Examples: Gaussian noise, uniform random initialization, model output probabilities.

03

Distributions

PMFs, PDFs, and CDFs

// probability mass · density · cumulative · LOTUS · transformations

In Machine Learning, probability distributions form the backbone of our loss functions and generative models. When a neural network predicts a class, it outputs a Probability Mass Function (PMF) via a softmax layer. When a variational autoencoder learns the latent space of images, it models a continuous Probability Density Function (PDF).

Probability Mass Function (Discrete)

PMF — Discrete DistributionDiscrete

\[p_X(x) = \mathbb{P}(X = x) \geq 0, \quad \sum_{x\in\mathcal{X}} p_X(x) = 1\] \[\mathbb{P}(X \in A) = \sum_{x\in A}p_X(x)\]

The PMF must be non-negative and sum to 1. Every valid probability distribution must satisfy these two constraints — they follow directly from Kolmogorov's axioms.

Probability Density Function (Continuous)

PDF — Continuous DistributionContinuous

\[f_X(x) \geq 0, \quad \int_{-\infty}^\infty f_X(x)\,dx = 1\] \[\mathbb{P}(a \leq X \leq b) = \int_a^b f_X(x)\,dx\]

f_X(x) is NOT a probability — it's a density. P(X = x) = 0 for any specific x (measure-zero set). However, f_X(x) dx ≈ P(x ≤ X ≤ x+dx) for infinitesimally small dx. The density can exceed 1 (as long as the integral equals 1).

Cumulative Distribution Function (CDF)

CDF — Definition and PropertiesUniversal

\[F_X(x) = \mathbb{P}(X \leq x) = \begin{cases}\sum_{t \leq x}p_X(t) & \text{discrete} \\ \int_{-\infty}^x f_X(t)\,dt & \text{continuous}\end{cases}\]

CDF properties: (1) Non-decreasing: x₁ ≤ x₂ → F(x₁) ≤ F(x₂). (2) Right-continuous. (3) F(−∞) = 0, F(+∞) = 1. The CDF exists for every random variable (even when PMF/PDF don't). P(a < X ≤ b) = F(b) − F(a). For continuous RVs: f(x) = dF/dx.

04

Expectation

// Lebesgue integral · LOTUS · linearity · Jensen's inequality · moments

Machine Learning is fundamentally the science of optimizing expectations. The loss function we minimize during training (Empirical Risk Minimization) is simply a sample-based approximation of the expected loss over the true data distribution. Furthermore, when our models make predictions under uncertainty, they frequently output the expected value of the target.

Expectation — All CasesCore Definition

\[\mathbb{E}[X] = \begin{cases}\sum_{x\in\mathcal{X}} x\,p(x) & \text{discrete} \\ \int_{-\infty}^\infty x\,f(x)\,dx & \text{continuous}\end{cases}\] \[\mathbb{E}[g(X)] = \begin{cases}\sum_x g(x)\,p(x) & \text{discrete (LOTUS)} \\ \int g(x)\,f(x)\,dx & \text{continuous (LOTUS)}\end{cases}\]

LOTUS = Law of the Unconscious Statistician. Lets you compute E[g(X)] without first finding the distribution of g(X) — just integrate g(x) against the original density f(x). Essential for computing moments, MGFs, and loss function expectations.

Linearity of Expectation

Linearity — No Independence RequiredKey Property

\[\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]\] \[\mathbb{E}\!\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n \mathbb{E}[X_i]\]

Linearity holds regardless of whether X and Y are independent. This is the single most powerful trick in probability — it lets you compute the expected value of complicated sums by breaking them into simple pieces. The expected number of heads in n flips = n · E[one flip] = n/2, with no need to compute the Binomial PMF.

Jensen's Inequality

Jensen's InequalityConvexity

\[f\text{ convex} \Rightarrow f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]\] \[f\text{ concave} \Rightarrow f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]\]

Applications: KL divergence ≥ 0 (proven via Jensen with f = −log). Variance ≥ 0 (Jensen with f(x) = x²). E[1/X] ≥ 1/E[X] for X > 0. E[X²] ≥ (E[X])² (hence Var(X) ≥ 0). In ML: the ELBO lower bound in VAEs uses Jensen. Softmax and the log-sum-exp trick use Jensen.

05

Spread

Variance & Covariance

// second moment · spread · correlation · covariance matrix · Cauchy-Schwarz

While expectation tells us our model's average prediction, variance measures its uncertainty and instability—central to understanding the famous bias-variance tradeoff. Covariance extends this to multiple dimensions, forming the foundation of feature correlation, Principal Component Analysis (PCA), and the geometry of multivariate data.

Variance and Standard DeviationSecond Moment

\[\text{Var}(X) = \mathbb{E}\!\left[(X-\mathbb{E}[X])^2\right] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2\] \[\sigma_X = \text{SD}(X) = \sqrt{\text{Var}(X)}\] \[\text{Var}(aX+b) = a^2\,\text{Var}(X)\]

The computational formula E[X²] − (E[X])² is usually easier than the definition. Variance is not linear (note the a² factor). Var(X+Y) = Var(X) + Var(Y) only when X and Y are uncorrelated (not necessarily independent).

Covariance and Correlation

Covariance — Definition and PropertiesJoint Spread

\[\text{Cov}(X,Y) = \mathbb{E}[(X-\mu_X)(Y-\mu_Y)] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]\] \[\rho_{XY} = \text{Corr}(X,Y) = \frac{\text{Cov}(X,Y)}{\sigma_X\sigma_Y} \in [-1,1]\] \[\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X,Y)\] \[\text{Var}\!\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n\text{Var}(X_i) + 2\sum_{i<j}\text{Cov}(X_i,X_j)\]

|Corr(X,Y)| ≤ 1 by Cauchy-Schwarz: |Cov(X,Y)|² ≤ Var(X)·Var(Y). Cov(X,Y) = 0 means X and Y are uncorrelated, NOT independent (independence → uncorrelated, but not vice versa). The covariance matrix Σᵢⱼ = Cov(Xᵢ,Xⱼ) is the central object in multivariate statistics and PCA.

The Covariance Matrix

Covariance Matrix — Definition & PropertiesMultivariate

\[\boldsymbol{\Sigma} = \text{Cov}(\mathbf{X}) = \mathbb{E}[(\mathbf{X}-\boldsymbol{\mu})(\mathbf{X}-\boldsymbol{\mu})^T] \in \mathbb{R}^{d\times d}\] \[\boldsymbol{\Sigma}_{ij} = \text{Cov}(X_i, X_j), \quad \boldsymbol{\Sigma}_{ii} = \text{Var}(X_i)\]

Σ is always symmetric (Σᵢⱼ = Σⱼᵢ) and positive semi-definite (v^T Σ v ≥ 0 for all v). PSD follows from Var(v^T X) = v^T Σ v ≥ 0. The covariance matrix encodes the shape, scale, and correlation structure of a multivariate distribution. PCA diagonalizes it.

06

Conditioning

Conditional Probability

// definition · updating on evidence · chain rule · independence

Conditional probability is the mathematical formulation of supervised learning. When we train a discriminative model, we are not trying to learn the distribution of everything; we are specifically trying to model the conditional probability \(P(Y|X)\)—the probability of a target \(Y\) given our observed features \(X\).

Conditional ProbabilityDefinition

\[\mathbb{P}(A \mid B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}, \quad \mathbb{P}(B) > 0\] \[\text{(Product rule)} \quad \mathbb{P}(A \cap B) = \mathbb{P}(A \mid B)\,\mathbb{P}(B) = \mathbb{P}(B \mid A)\,\mathbb{P}(A)\] \[\text{(Chain rule)} \quad \mathbb{P}(A_1 \cap \cdots \cap A_n) = \prod_{k=1}^n \mathbb{P}(A_k \mid A_1,\ldots,A_{k-1})\]

Conditioning restricts the sample space to B and renormalizes. P(A|B) is the probability of A given that B has occurred. The product rule is Bayes' theorem waiting to happen. The chain rule factorizes any joint probability into a product of conditionals — used in autoregressive language models like GPT: P(w₁,...,wₙ) = P(w₁)P(w₂|w₁)P(w₃|w₁,w₂)···

Conditional Expectation

Conditional Expectation E[Y|X]Key Object

\[\mathbb{E}[Y \mid X = x] = \int y\,f_{Y|X}(y|x)\,dy\] \[\text{Tower Property (Law of Total Expectation):}\quad \mathbb{E}[Y] = \mathbb{E}[\mathbb{E}[Y \mid X]]\]

E[Y|X] is itself a random variable — a function of X. The tower property says: to compute E[Y], first compute E[Y|X] for each value of X, then average over X. Used in ANOVA, Bayesian updating, and the derivation of the Kalman filter.

07

Total Probability

Law of Total Probability

// partition of sample space · marginalizing · the bridge between prior and evidence

Law of Total ProbabilityMarginalization

\[\text{Let }\{B_1, B_2, \ldots, B_n\}\text{ partition }\Omega\text{ (mutually exclusive and exhaustive).}\] \[\mathbb{P}(A) = \sum_{i=1}^n \mathbb{P}(A \mid B_i)\,\mathbb{P}(B_i)\] \[\text{Continuous version:} \quad p(x) = \int p(x \mid z)\,p(z)\,dz\]

The continuous version is the marginalisation integral — the denominator in Bayes' theorem. Computing this integral is often the hardest part of Bayesian inference. Variational inference (ELBO) and MCMC exist precisely to avoid computing this integral directly. It is also the "evidence" P(D) = ∫P(D|θ)P(θ)dθ in Bayesian ML.

∑

Why This Is Fundamental to ML

The law of total probability is the mathematical operation of marginalization — averaging over unobserved variables. In probabilistic ML, latent variables \(z\) represent hidden structure. The marginal \(p(x) = \int p(x|z)p(z)\,dz\) is the likelihood you actually want to maximize. The entire field of latent variable models (VAEs, GMMs, HMMs) centers on computing or approximating this integral.

08

Bayes

Bayes' Rule — Derivation

// from product rule · diagnostic reasoning · prior · likelihood · posterior

Bayes' Rule is the engine of Bayesian ML. It provides a principled mathematical framework for how an AI agent should update its beliefs (its model parameters or weights) as it observes new training data. Instead of finding a single "best" parameter set, Bayesian inference finds an entire distribution of plausible parameters.

Bayes' Rule — Full FormCore Theorem

\[\mathbb{P}(B \mid A) = \frac{\mathbb{P}(A \mid B)\,\mathbb{P}(B)}{\mathbb{P}(A)} = \frac{\mathbb{P}(A \mid B)\,\mathbb{P}(B)}{\sum_j \mathbb{P}(A \mid B_j)\,\mathbb{P}(B_j)}\] \[\underbrace{p(\theta \mid \mathcal{D})}_{\text{posterior}} = \frac{\overbrace{p(\mathcal{D}\mid\theta)}^{\text{likelihood}}\,\overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(\mathcal{D})}_{\text{evidence}}} \propto p(\mathcal{D}\mid\theta)\,p(\theta)\]

Derivation: product rule applied two ways: P(A∩B) = P(A|B)P(B) = P(B|A)P(A). Divide both sides by P(A). Bayes' rule is not a deep theorem — it is one line of algebra from the definition of conditional probability. Its profundity comes from its interpretation as belief updating.

09

Joint

Joint Distributions

// joint PMF/PDF · marginals · conditional distributions · multivariate Gaussian

A joint distribution models the entire universe of our data variables simultaneously, \(P(X_1, X_2, \ldots, Y)\). While discriminative models focus solely on the conditional, generative AI—such as Variational Autoencoders, GANs, or Diffusion Models—attempts the much harder task of learning this complete joint distribution, allowing us to sample entirely new, synthetic data points.

Joint, Marginal, and ConditionalMultivariate

\[\text{Joint:}\quad p(x,y) \text{ — the full distribution over both variables}\] \[\text{Marginal:}\quad p(x) = \int p(x,y)\,dy \quad\text{(marginalize out } y\text{)}\] \[\text{Conditional:}\quad p(y \mid x) = \frac{p(x,y)}{p(x)}\] \[\text{Factorization:}\quad p(x,y) = p(y \mid x)\,p(x) = p(x \mid y)\,p(y)\]

These three forms are related by simple algebra. The Bayesian network literature generalizes this: any joint distribution over n variables can be factored as a product of conditionals along a DAG: p(x₁,...,xₙ) = ∏ᵢ p(xᵢ|parents(xᵢ)). This is the foundation of probabilistic graphical models.

Multivariate Gaussian

Multivariate Gaussian DistributionMVN

\[\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\] \[f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}}\exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)\]

The MVN is the multivariate maximum-entropy distribution subject to fixed mean and covariance (by MaxEnt principle). Its contours are ellipses aligned with eigenvectors of Σ. The exponent (x−μ)^T Σ^{-1} (x−μ) is the squared Mahalanobis distance. Conditionals of a MVN are Gaussian; marginals are Gaussian — closure under all linear operations makes it the workhorse of probabilistic ML.

10

Independence

// pairwise vs mutual · conditional independence · independence in ML

Independence is the most common (and often violated) assumption in data science. The fundamental "i.i.d." assumption—that training examples are independent and identically distributed—allows us to multiply probabilities and sum log-likelihoods. Furthermore, conditional independence is what makes complex graphical models (like Naive Bayes or Hidden Markov Models) computationally tractable.

Independence — Formal DefinitionDefinition

\[X \perp Y \iff p(x,y) = p(x)\,p(y) \quad\forall x,y\] \[\iff p(x \mid y) = p(x) \quad\forall x,y \text{ s.t. } p(y) > 0\] \[\text{Mutual independence: } p(x_1,\ldots,x_n) = \prod_{i=1}^n p(x_i)\] \[\text{Conditional independence: } X \perp Y \mid Z \iff p(x,y|z) = p(x|z)\,p(y|z)\]

Mutual independence is strictly stronger than pairwise independence. Pairwise independence: every pair is independent. Mutual independence: every subset is jointly independent. Counterexample: X, Y ~ Bernoulli(0.5) iid, Z = X ⊕ Y (XOR). X⊥Y, X⊥Z, Y⊥Z but not X⊥Y⊥Z (knowing X and Z determines Y). Conditional independence is the core of Bayesian networks and naive Bayes classifiers.

Independence vs Uncorrelated

Independent ⟹ Uncorrelated

If \(X \perp Y\) then \(\text{Cov}(X,Y) = 0\). Proof: \(\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]\) (by independence), so \(\text{Cov} = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] = 0\).

Uncorrelated ⟹̸ Independent

Counterexample: \(X \sim \mathcal{N}(0,1)\), \(Y = X^2\). Then \(\text{Cov}(X,Y) = \mathbb{E}[X^3] - \mathbb{E}[X]\mathbb{E}[X^2] = 0\), but \(Y = X^2\) — completely determined by \(X\). Independence requires checking all moments, not just the second moment.

11

Distributions

Common Distributions

// Bernoulli · Gaussian · Binomial · Poisson · Beta · Dirichlet · Exponential · Gamma

These distributions are our algorithmic building blocks. The choice of a target distribution directly dictates the loss function we use: assuming Gaussian noise leads to Mean Squared Error (MSE), assuming a Bernoulli distribution leads to Binary Cross-Entropy, and a Categorical distribution gives us Softmax Cross-Entropy.

Bernoulli

p(x) = pˣ(1−p)^{1−x}, x∈{0,1}

Single binary trial with success probability p. E[X]=p, Var(X)=p(1−p). Basis of logistic regression output.

Binomial

p(k) = C(n,k)pᵏ(1−p)^{n−k}

Number of successes in n independent Bernoulli(p) trials. E[X]=np, Var=np(1−p). CLT: Bin(n,p) ≈ N(np, np(1−p)) for large n.

Gaussian / Normal

f(x) = (1/√(2πσ²))exp(−(x−μ)²/2σ²)

MaxEnt distribution for known mean and variance. E[X]=μ, Var=σ². Basis of regression likelihood, weight initialization, noise models. Central to CLT.

Poisson

p(k) = λᵏe^{−λ}/k!, k=0,1,2,...

Count of rare events in a fixed interval. E[X]=Var(X)=λ. Approximates Binomial(n,p) when n large, p small, λ=np. Used in NLP token count models.

Beta

f(x) ∝ x^{α−1}(1−x)^{β−1}, x∈[0,1]

Distribution over probabilities. Conjugate prior for Bernoulli/Binomial. E[X]=α/(α+β). Uniform=Beta(1,1). MaxEnt on [0,1] with fixed E[log X] and E[log(1−X)].

Dirichlet

f(p) ∝ ∏ pₖ^{αₖ−1}, Σpₖ=1

Distribution over probability simplices. Conjugate prior for Categorical/Multinomial. Generalizes Beta. E[pₖ] = αₖ/Σαⱼ. Foundation of LDA topic models.

Exponential

f(x) = λe^{−λx}, x≥0

Time until first event in Poisson process. E[X]=1/λ, Var=1/λ². Memoryless: P(X>s+t|X>s)=P(X>t). MaxEnt on [0,∞) with fixed mean.

Gamma

f(x) ∝ x^{α−1}e^{−x/β}, x>0

Sum of α independent Exp(1/β) RVs. E[X]=αβ, Var=αβ². Conjugate prior for Gaussian precision and Poisson rate. Generalizes exponential and chi-squared.

The Exponential Family — Unifying Structure

Exponential Family DistributionGeneral Form

\[p(x;\eta) = h(x)\exp\!\left(\eta^T T(x) - A(\eta)\right)\]

η = natural parameters, T(x) = sufficient statistics, A(η) = log partition function (ensures normalisation). All common distributions are exponential family. Properties: MLE of η has closed form (set mean of T(x) equal to E[T(x)]). Conjugate priors always exist. A(η) is convex, and ∇A(η) = E[T(X)], ∇²A(η) = Cov(T(X)). MaxEnt under moment constraints always gives exponential family.

12

Convergence

Law of Large Numbers

// weak LLN · strong LLN · convergence in probability vs almost surely · iid requirement

The Law of Large Numbers formalizes the intuition that averages stabilize with more data. It is the mathematical foundation of Monte Carlo methods, empirical risk minimization, and every statistical estimator.

Weak Law of Large Numbers

Weak LLN (Khinchin)Convergence in Probability

\[\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{p} \mu = \mathbb{E}[X] \quad\text{as } n\to\infty\] \[\text{i.e., } \forall\epsilon > 0:\quad \mathbb{P}(|\bar{X}_n - \mu| > \epsilon) \to 0\]

Requires X₁, X₂, ... to be iid with finite mean. Proof via Chebyshev's inequality: P(|X̄_n − μ| > ε) ≤ Var(X̄_n)/ε² = σ²/(nε²) → 0. The Weak LLN justifies empirical risk minimization: the empirical average of losses converges to the expected loss as n → ∞.

Strong Law of Large Numbers

Strong LLN (Kolmogorov)Almost Sure Convergence

\[\bar{X}_n \xrightarrow{a.s.} \mu \quad\text{i.e.}\quad \mathbb{P}\!\left(\lim_{n\to\infty}\bar{X}_n = \mu\right) = 1\]

Stronger than Weak LLN: almost every path of sample averages converges to μ. Weak LLN: for any fixed ε, the probability of being outside ε-ball goes to 0. Strong LLN: with probability 1, you'll eventually stay inside every ε-ball forever. Requires finite E[|X|] (not just finite mean). The difference matters for ergodic theory and stochastic processes.

n

LLN in Machine Learning

The LLN justifies training on mini-batches: the average gradient over a batch converges to the true gradient as batch size grows. It justifies Monte Carlo integration: \(\hat{I} = \frac{1}{n}\sum_i f(x_i) \to \mathbb{E}[f(X)]\). It justifies empirical risk minimization: minimizing \(\frac{1}{n}\sum \ell(f(x_i),y_i)\) approximates minimizing the true risk \(\mathbb{E}[\ell(f(X),Y)]\). Everything in statistical learning theory uses the LLN as its foundation.

13

CLT

Central Limit Theorem

// normal limit · rate √n · Berry-Esseen · CLT in ML · characteristic functions

The Central Limit Theorem (CLT) is the reason the Gaussian distribution appears everywhere in nature and algorithms. In Deep Learning, it justifies why aggregated gradients in a mini-batch behave predictably, why weight initialization schemes assume normality, and why ensemble methods like Random Forests successfully reduce model variance.

★

Central Limit Theorem (Lindeberg-Lévy)

Let \(X_1, X_2, \ldots\) be i.i.d. with \(\mathbb{E}[X_i] = \mu\) and \(\text{Var}(X_i) = \sigma^2 < \infty\). Then: \[\sqrt{n}\!\left(\bar{X}_n - \mu\right) \xrightarrow{d} \mathcal{N}(0, \sigma^2)\] Equivalently: \(\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1)\) as \(n \to \infty\).

Proof Sketch via Characteristic Functions

// CLT proof sketch using characteristic functions (moment generating functions)

1

Let \(Z_i = (X_i - \mu)/\sigma\). The standardized sum is \(S_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n Z_i\). Characteristic function: \(\varphi_{Z_i}(t) = \mathbb{E}[e^{itZ_i}]\).

2

By Taylor expansion near \(t=0\): \(\varphi_{Z_i}(t) = 1 + it\mathbb{E}[Z_i] - \frac{t^2}{2}\mathbb{E}[Z_i^2] + O(t^3) = 1 - \frac{t^2}{2} + O(t^3)\).

Since E[Z_i] = 0 (standardized) and E[Z_i²] = 1 (unit variance).

3

Since \(Z_i\) are independent: \(\varphi_{S_n}(t) = \left[\varphi_{Z_i}(t/\sqrt{n})\right]^n = \left[1 - \frac{t^2}{2n} + O(n^{-3/2})\right]^n\).

4

Take \(n \to \infty\): \(\left(1 - \frac{t^2}{2n}\right)^n \to e^{-t^2/2}\). This is the characteristic function of \(\mathcal{N}(0,1)\). By Lévy's continuity theorem, \(S_n \xrightarrow{d} \mathcal{N}(0,1)\). ∎

The Berry-Esseen Theorem — Rate of Convergence

Berry-Esseen BoundQuantitative CLT

\[\sup_x \left|\mathbb{P}\!\left(\frac{\bar{X}_n-\mu}{\sigma/\sqrt{n}} \leq x\right) - \Phi(x)\right| \leq \frac{C\rho}{\sigma^3\sqrt{n}}\] where \(\rho = \mathbb{E}[|X-\mu|^3]\) (third absolute moment), \(C \leq 0.4748\).

The CLT convergence rate is O(1/√n). For n=100 and σ²=1, ρ=1: error ≤ 0.047. This quantitative bound justifies using the normal approximation in practice — and tells you exactly how large n must be for the approximation to be valid.

CLT in Machine Learning

Confidence intervals for model metrics: The difference between two models' accuracies over \(n\) test examples is approximately \(\mathcal{N}(0, \sigma^2/n)\) — enabling statistical significance tests.
Mini-batch gradient estimates: The mini-batch gradient is an average of \(B\) per-sample gradients. By CLT, its distribution is approximately Gaussian around the true gradient, with variance \(\sigma^2/B\).
Justification for Gaussian likelihood: Many real-world measurements are sums of many small independent noise sources. By CLT, such sums are approximately Gaussian — justifying the Gaussian noise assumption in regression.
Ensemble methods: The average prediction of many weak classifiers converges to a distribution with lower variance (rate \(1/\sqrt{n}\)) around the mean prediction.

14

Bounds

Concentration Inequalities

// Markov · Chebyshev · Hoeffding · sub-Gaussian · generalization bounds

Concentration inequalities bound the probability that a random variable deviates far from its mean. They are the mathematical core of statistical learning theory — generalization bounds, PAC learning, and bandit algorithms all rely on them.

Markov's Inequality — The Weakest Bound

Markov's InequalityNon-Negative RV

\[\text{For } X \geq 0:\quad \mathbb{P}(X \geq t) \leq \frac{\mathbb{E}[X]}{t}\]

Proof: E[X] = E[X·1_{X≥t}] + E[X·1_{X<t}] ≥ E[X·1_{X≥t}] ≥ t·P(X≥t). Requires only X ≥ 0 and finite mean. Basis for all other concentration inequalities (Chebyshev applies Markov to (X−μ)²). Tight: Bernoulli(μ/t) achieves equality.

Chebyshev's Inequality

Chebyshev's InequalityTwo-Sided

\[\mathbb{P}(|X - \mu| \geq t) \leq \frac{\sigma^2}{t^2} = \frac{\text{Var}(X)}{t^2}\]

Apply Markov to Y = (X−μ)². P(|X−μ| ≥ t) = P((X−μ)² ≥ t²) ≤ E[(X−μ)²]/t² = σ²/t². Requires only finite variance. The "k-sigma rule": P(|X−μ| ≥ kσ) ≤ 1/k². So at least 75% of data within 2σ, 89% within 3σ (for ANY distribution — compare to Gaussian's 95% and 99.7%). Chebyshev gives distribution-free bounds.

Hoeffding's Inequality

Hoeffding's InequalityBounded RVs

\[\text{For } a_i \leq X_i \leq b_i \text{ independent, } \bar{X} = \frac{1}{n}\sum X_i:\] \[\mathbb{P}\!\left(\bar{X} - \mathbb{E}[\bar{X}] \geq \epsilon\right) \leq \exp\!\left(-\frac{2n^2\epsilon^2}{\sum_{i=1}^n(b_i-a_i)^2}\right)\] \[\text{For i.i.d. } a \leq X_i \leq b:\quad \mathbb{P}(|\bar{X} - \mu| \geq \epsilon) \leq 2\exp\!\left(-\frac{2n\epsilon^2}{(b-a)^2}\right)\]

Exponentially tighter than Chebyshev for bounded random variables. Requires no assumption about the variance — only that X_i ∈ [a_i, b_i]. The probability decays exponentially in n and ε². Setting the bound to δ: with probability ≥ 1−δ, |X̄ − μ| ≤ (b−a)√(log(2/δ)/(2n)). This is a PAC bound — it says how many samples you need for accuracy ε with confidence 1−δ.

Sub-Gaussian Random Variables

Sub-Gaussian DistributionHeavy Tails

\[X \text{ is sub-Gaussian with parameter } \sigma \text{ if: } \mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2/2} \;\;\forall t\in\mathbb{R}\] \[\Rightarrow \mathbb{P}(X \geq \epsilon) \leq e^{-\epsilon^2/(2\sigma^2)} \quad\text{(Gaussian-like tail)}\]

Gaussian(0,σ²) is sub-Gaussian with parameter σ. Any bounded [−B,B] RV is sub-Gaussian with σ = B. Sub-Gaussian is the right class for concentration: tails decay at least as fast as Gaussian. Heavy-tailed distributions (Cauchy, Pareto) are NOT sub-Gaussian. Most loss functions in ML are bounded → sub-Gaussian → Hoeffding applies.

Inequality	Assumption	Bound	Decay Rate	Use Case
Markov	X ≥ 0, finite E[X]	E[X]/t	O(1/t)	Weakest bound; proves others
Chebyshev	Finite variance	σ²/t²	O(1/t²)	Distribution-free; LLN proof
Hoeffding	Bounded, independent	2exp(−2nε²/range²)	Exponential in n	PAC learning, bandit algorithms
Sub-Gaussian	Sub-Gaussian tail	exp(−ε²/2σ²)	Exponential in ε²	Generalization bounds
McDiarmid	Bounded differences	2exp(−2ε²/Σcᵢ²)	Exponential in n	Functions of independent RVs

15

ML View

Probabilistic View of Machine Learning

// MLE · MAP · generative vs discriminative · probabilistic prediction

Every ML algorithm has a probabilistic interpretation. Recognizing this reveals what assumptions each method implicitly makes, how to improve it, and how to quantify uncertainty in its predictions.

Maximum Likelihood Estimation

MLE — The Probabilistic Foundation of TrainingEstimation

\[\hat{\theta}_{\text{MLE}} = \arg\max_\theta\,\prod_{i=1}^n p(x_i;\theta) = \arg\max_\theta\,\sum_{i=1}^n\log p(x_i;\theta)\]

The switch to log-likelihood is valid because log is monotone. The log converts the product to a sum — numerically stable and analytically convenient. MLE minimizes KL(P̂_data ‖ P_θ) — it finds the model closest to the empirical data distribution in KL divergence.

Every Loss Function is a Log-Likelihood

Loss Function	Noise Model	Probabilistic Interpretation
MSE: \(\frac{1}{n}\sum(y_i-\hat{y}_i)^2\)	\(y\|\mathbf{x},\theta \sim \mathcal{N}(\hat{y},\sigma^2)\)	MLE under Gaussian noise
Binary Cross-Entropy	\(y\|\mathbf{x},\theta \sim \text{Bern}(\sigma(\hat{y}))\)	MLE under Bernoulli model
Categorical Cross-Entropy	\(y\|\mathbf{x},\theta \sim \text{Cat}(\text{softmax}(\hat{\mathbf{y}}))\)	MLE under Categorical model
MAE: \(\frac{1}{n}\sum\|y_i-\hat{y}_i\|\)	\(y\|\mathbf{x},\theta \sim \text{Laplace}(\hat{y}, b)\)	MLE under Laplace noise (robust)
L2 Regularization \(+\lambda\\|\theta\\|^2\)	\(\theta \sim \mathcal{N}(0, 1/(2\lambda))\)	MAP with Gaussian prior
L1 Regularization \(+\lambda\\|\theta\\|_1\)	\(\theta_j \sim \text{Laplace}(0, 1/\lambda)\)	MAP with Laplace prior (sparsity)

Generative vs Discriminative Models

Generative Models

Model the joint distribution \(p(\mathbf{x}, y) = p(\mathbf{x}|y)\,p(y)\). Can generate new samples, handle missing data, and do density estimation. Predict via Bayes: \(p(y|\mathbf{x}) = p(\mathbf{x}|y)p(y)/p(\mathbf{x})\).

Examples: Naive Bayes, Gaussian Mixture Models, Hidden Markov Models, VAEs, GANs, Normalizing Flows.

Discriminative Models

Model the conditional distribution \(p(y|\mathbf{x})\) directly. Don't model \(p(\mathbf{x})\) — focus all capacity on the decision boundary. Usually more accurate when the task is classification/regression.

Examples: Logistic Regression, SVMs, Neural Networks, CNNs, Transformers, Random Forests.

16

Mental Model

The Complete Mental Model

// everything connected · probability as a language · the full picture

Fig 1. Complete probability theory mental map — from Kolmogorov's axioms through random variables, conditioning, Bayes' rule, limit theorems, and out to every ML application.

The complete story as one coherent thread:

A probability space \((\Omega, \mathcal{F}, \mathbb{P})\) formalizes randomness: outcomes, measurable events, and a measure satisfying Kolmogorov's three axioms. Every probability rule you've ever used is a theorem derived from these axioms.
Random variables are measurable functions \(X: \Omega \to \mathbb{R}\). They carry distributions — described by PMFs (discrete) or PDFs (continuous) — and their CDFs always exist and completely characterize the distribution.
Expectation is the Lebesgue integral of \(X\) against \(\mathbb{P}\). It is linear (always), monotone, and satisfies LOTUS (no need to find the distribution of \(g(X)\) to compute \(\mathbb{E}[g(X)]\)). Jensen's inequality connects convexity to expectation inequalities.
Variance and covariance measure spread and co-movement. The covariance matrix is symmetric PSD. PCA diagonalizes it. Every loss landscape in ML has a Hessian which is a covariance matrix in disguise.
Conditional probability and Bayes' rule follow from the product rule. The chain rule of probability factorizes any joint distribution into conditionals — exactly what autoregressive LMs (GPT) compute when generating text token by token.
The Law of Large Numbers guarantees empirical averages converge to expectations. This justifies SGD, mini-batch training, empirical risk minimization, and Monte Carlo integration — the entire computational machinery of ML.
The Central Limit Theorem tells us that sums of independent random variables are approximately Gaussian, regardless of the original distribution. This justifies Gaussian likelihood assumptions and enables statistical inference on model metrics.
Concentration inequalities — Markov, Chebyshev, Hoeffding, sub-Gaussian — give finite-sample bounds. They are the core of PAC learning theory: "with \(n\) samples and probability \(1-\delta\), the generalization gap is at most \(O(\sqrt{\log(1/\delta)/n})\)."
Every ML model is a probabilistic model: the loss function is the negative log-likelihood, regularization is a log-prior, training is MLE or MAP, and the full Bayesian treatment gives a posterior over parameters and an honest predictive distribution.

★

Probability is the Language of Uncertainty

Every algorithm in machine learning is a statement about probability: what distribution generated the data, what distribution we place over parameters, how distributions transform under operations. MSE loss assumes Gaussian noise. Cross-entropy assumes Bernoulli/Categorical output. L2 regularization assumes Gaussian prior. Dropout is approximate Bayesian inference. Batch normalization manipulates the distribution of activations. Attention is a soft probability distribution over positions. Understanding probability theory doesn't add a mathematical layer on top of ML — it reveals that probability is the substance of which every ML algorithm is made.

from first principles Probability Theory for ML