The rigorous mathematical foundation beneath every
algorithm you've ever run. Measure theory to CLT,
concentration inequalities to the probabilistic soul of ML.
Probability SpacesRandom VariablesPMF · PDF · CDFExpectation · VarianceBayes' RuleJoint DistributionsCommon DistributionsLLN & CLTConcentration Ineq.Prob. View of ML
// sample space · σ-algebra · probability measure · Kolmogorov axioms
A probability space is the mathematical bedrock on which everything in statistics and ML rests. It gives a rigorous answer to the question: what is a random experiment?
The Formal Definition
Probability Space (Ω, ℱ, ℙ)Definition
\[(\Omega,\,\mathcal{F},\,\mathbb{P}) \quad\text{where:}\]
\[\Omega \text{ — sample space: the set of all possible outcomes}\]
\[\mathcal{F} \subseteq 2^\Omega \text{ — σ-algebra: a collection of measurable events}\]
\[\mathbb{P}:\mathcal{F}\to[0,1] \text{ — probability measure: assigns probabilities to events}\]
The σ-algebra ℱ must satisfy: (1) Ω ∈ ℱ, (2) if A ∈ ℱ then Aᶜ ∈ ℱ (closed under complement), (3) if A₁,A₂,... ∈ ℱ then ∪ᵢAᵢ ∈ ℱ (closed under countable unions). This ensures every set we want to assign a probability to is "measurable."
All probability rules you've ever used — complement rule P(Aᶜ) = 1−P(A), inclusion-exclusion, Boole's inequality — are theorems derived from these three axioms alone. They were proposed by Andrei Kolmogorov in 1933, unifying all previous ad hoc probability rules into a single coherent framework.
Key Consequences
Complement rule: \(\mathbb{P}(A^c) = 1 - \mathbb{P}(A)\). Proof: \(\Omega = A \cup A^c\) (disjoint), so \(1 = \mathbb{P}(A) + \mathbb{P}(A^c)\).
Monotonicity: If \(A \subseteq B\) then \(\mathbb{P}(A) \leq \mathbb{P}(B)\).
Union bound (Boole's inequality): \(\mathbb{P}(\bigcup_i A_i) \leq \sum_i \mathbb{P}(A_i)\) — used constantly in ML generalization bounds.
02
Variables
Random Variables
// measurable function · distribution · discrete vs continuous · induced measure
A random variable is not actually a variable in the usual sense — it is a function from the sample space to the real numbers. This formalism connects the abstract probability space to measurable quantities.
The measurability condition ensures that P(X ≤ x) makes sense (the set we're measuring is in ℱ). In practice: for discrete probability spaces, every function is a random variable. For continuous spaces (e.g., Ω = ℝ), we need measurability to rule out pathological functions. Most functions you'll ever encounter are measurable.
Discrete Random Variables
Takes values in a countable set \(\{x_1, x_2, \ldots\}\). The distribution is fully described by the probability mass function (PMF).
Examples: number of heads in \(n\) flips, number of words in a sentence, number of training epochs until convergence.
Continuous Random Variables
Takes values in an uncountable set (e.g., \(\mathbb{R}\)). The distribution is described by a probability density function (PDF). \(P(X = x) = 0\) for any specific \(x\).
Examples: Gaussian noise, uniform random initialization, model output probabilities.
03
Distributions
PMFs, PDFs, and CDFs
// probability mass · density · cumulative · LOTUS · transformations
In Machine Learning, probability distributions form the backbone of our loss functions and generative models. When a neural network predicts a class, it outputs a Probability Mass Function (PMF) via a softmax layer. When a variational autoencoder learns the latent space of images, it models a continuous Probability Density Function (PDF).
The PMF must be non-negative and sum to 1. Every valid probability distribution must satisfy these two constraints — they follow directly from Kolmogorov's axioms.
Probability Density Function (Continuous)
PDF — Continuous DistributionContinuous
\[f_X(x) \geq 0, \quad \int_{-\infty}^\infty f_X(x)\,dx = 1\]
\[\mathbb{P}(a \leq X \leq b) = \int_a^b f_X(x)\,dx\]
f_X(x) is NOT a probability — it's a density. P(X = x) = 0 for any specific x (measure-zero set). However, f_X(x) dx ≈ P(x ≤ X ≤ x+dx) for infinitesimally small dx. The density can exceed 1 (as long as the integral equals 1).
Machine Learning is fundamentally the science of optimizing expectations. The loss function we minimize during training (Empirical Risk Minimization) is simply a sample-based approximation of the expected loss over the true data distribution. Furthermore, when our models make predictions under uncertainty, they frequently output the expected value of the target.
LOTUS = Law of the Unconscious Statistician. Lets you compute E[g(X)] without first finding the distribution of g(X) — just integrate g(x) against the original density f(x). Essential for computing moments, MGFs, and loss function expectations.
Linearity holds regardless of whether X and Y are independent. This is the single most powerful trick in probability — it lets you compute the expected value of complicated sums by breaking them into simple pieces. The expected number of heads in n flips = n · E[one flip] = n/2, with no need to compute the Binomial PMF.
Applications: KL divergence ≥ 0 (proven via Jensen with f = −log). Variance ≥ 0 (Jensen with f(x) = x²). E[1/X] ≥ 1/E[X] for X > 0. E[X²] ≥ (E[X])² (hence Var(X) ≥ 0). In ML: the ELBO lower bound in VAEs uses Jensen. Softmax and the log-sum-exp trick use Jensen.
05
Spread
Variance & Covariance
// second moment · spread · correlation · covariance matrix · Cauchy-Schwarz
While expectation tells us our model's average prediction, variance measures its uncertainty and instability—central to understanding the famous bias-variance tradeoff. Covariance extends this to multiple dimensions, forming the foundation of feature correlation, Principal Component Analysis (PCA), and the geometry of multivariate data.
The computational formula E[X²] − (E[X])² is usually easier than the definition. Variance is not linear (note the a² factor). Var(X+Y) = Var(X) + Var(Y) only when X and Y are uncorrelated (not necessarily independent).
Covariance and Correlation
Covariance — Definition and PropertiesJoint Spread
|Corr(X,Y)| ≤ 1 by Cauchy-Schwarz: |Cov(X,Y)|² ≤ Var(X)·Var(Y). Cov(X,Y) = 0 means X and Y are uncorrelated, NOT independent (independence → uncorrelated, but not vice versa). The covariance matrix Σᵢⱼ = Cov(Xᵢ,Xⱼ) is the central object in multivariate statistics and PCA.
Σ is always symmetric (Σᵢⱼ = Σⱼᵢ) and positive semi-definite (v^T Σ v ≥ 0 for all v). PSD follows from Var(v^T X) = v^T Σ v ≥ 0. The covariance matrix encodes the shape, scale, and correlation structure of a multivariate distribution. PCA diagonalizes it.
Conditional probability is the mathematical formulation of supervised learning. When we train a discriminative model, we are not trying to learn the distribution of everything; we are specifically trying to model the conditional probability \(P(Y|X)\)—the probability of a target \(Y\) given our observed features \(X\).
Conditioning restricts the sample space to B and renormalizes. P(A|B) is the probability of A given that B has occurred. The product rule is Bayes' theorem waiting to happen. The chain rule factorizes any joint probability into a product of conditionals — used in autoregressive language models like GPT: P(w₁,...,wₙ) = P(w₁)P(w₂|w₁)P(w₃|w₁,w₂)···
Conditional Expectation
Conditional Expectation E[Y|X]Key Object
\[\mathbb{E}[Y \mid X = x] = \int y\,f_{Y|X}(y|x)\,dy\]
\[\text{Tower Property (Law of Total Expectation):}\quad \mathbb{E}[Y] = \mathbb{E}[\mathbb{E}[Y \mid X]]\]
E[Y|X] is itself a random variable — a function of X. The tower property says: to compute E[Y], first compute E[Y|X] for each value of X, then average over X. Used in ANOVA, Bayesian updating, and the derivation of the Kalman filter.
07
Total Probability
Law of Total Probability
// partition of sample space · marginalizing · the bridge between prior and evidence
The continuous version is the marginalisation integral — the denominator in Bayes' theorem. Computing this integral is often the hardest part of Bayesian inference. Variational inference (ELBO) and MCMC exist precisely to avoid computing this integral directly. It is also the "evidence" P(D) = ∫P(D|θ)P(θ)dθ in Bayesian ML.
∑
Why This Is Fundamental to ML
The law of total probability is the mathematical operation of marginalization — averaging over unobserved variables. In probabilistic ML, latent variables \(z\) represent hidden structure. The marginal \(p(x) = \int p(x|z)p(z)\,dz\) is the likelihood you actually want to maximize. The entire field of latent variable models (VAEs, GMMs, HMMs) centers on computing or approximating this integral.
Bayes' Rule is the engine of Bayesian ML. It provides a principled mathematical framework for how an AI agent should update its beliefs (its model parameters or weights) as it observes new training data. Instead of finding a single "best" parameter set, Bayesian inference finds an entire distribution of plausible parameters.
Derivation: product rule applied two ways: P(A∩B) = P(A|B)P(B) = P(B|A)P(A). Divide both sides by P(A). Bayes' rule is not a deep theorem — it is one line of algebra from the definition of conditional probability. Its profundity comes from its interpretation as belief updating.
A joint distribution models the entire universe of our data variables simultaneously, \(P(X_1, X_2, \ldots, Y)\). While discriminative models focus solely on the conditional, generative AI—such as Variational Autoencoders, GANs, or Diffusion Models—attempts the much harder task of learning this complete joint distribution, allowing us to sample entirely new, synthetic data points.
Joint, Marginal, and ConditionalMultivariate
\[\text{Joint:}\quad p(x,y) \text{ — the full distribution over both variables}\]
\[\text{Marginal:}\quad p(x) = \int p(x,y)\,dy \quad\text{(marginalize out } y\text{)}\]
\[\text{Conditional:}\quad p(y \mid x) = \frac{p(x,y)}{p(x)}\]
\[\text{Factorization:}\quad p(x,y) = p(y \mid x)\,p(x) = p(x \mid y)\,p(y)\]
These three forms are related by simple algebra. The Bayesian network literature generalizes this: any joint distribution over n variables can be factored as a product of conditionals along a DAG: p(x₁,...,xₙ) = ∏ᵢ p(xᵢ|parents(xᵢ)). This is the foundation of probabilistic graphical models.
The MVN is the multivariate maximum-entropy distribution subject to fixed mean and covariance (by MaxEnt principle). Its contours are ellipses aligned with eigenvectors of Σ. The exponent (x−μ)^T Σ^{-1} (x−μ) is the squared Mahalanobis distance. Conditionals of a MVN are Gaussian; marginals are Gaussian — closure under all linear operations makes it the workhorse of probabilistic ML.
10
Independence
Independence
// pairwise vs mutual · conditional independence · independence in ML
Independence is the most common (and often violated) assumption in data science. The fundamental "i.i.d." assumption—that training examples are independent and identically distributed—allows us to multiply probabilities and sum log-likelihoods. Furthermore, conditional independence is what makes complex graphical models (like Naive Bayes or Hidden Markov Models) computationally tractable.
Mutual independence is strictly stronger than pairwise independence. Pairwise independence: every pair is independent. Mutual independence: every subset is jointly independent. Counterexample: X, Y ~ Bernoulli(0.5) iid, Z = X ⊕ Y (XOR). X⊥Y, X⊥Z, Y⊥Z but not X⊥Y⊥Z (knowing X and Z determines Y). Conditional independence is the core of Bayesian networks and naive Bayes classifiers.
Independence vs Uncorrelated
Independent ⟹ Uncorrelated
If \(X \perp Y\) then \(\text{Cov}(X,Y) = 0\). Proof: \(\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]\) (by independence), so \(\text{Cov} = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] = 0\).
Uncorrelated ⟹̸ Independent
Counterexample: \(X \sim \mathcal{N}(0,1)\), \(Y = X^2\). Then \(\text{Cov}(X,Y) = \mathbb{E}[X^3] - \mathbb{E}[X]\mathbb{E}[X^2] = 0\), but \(Y = X^2\) — completely determined by \(X\). Independence requires checking all moments, not just the second moment.
These distributions are our algorithmic building blocks. The choice of a target distribution directly dictates the loss function we use: assuming Gaussian noise leads to Mean Squared Error (MSE), assuming a Bernoulli distribution leads to Binary Cross-Entropy, and a Categorical distribution gives us Softmax Cross-Entropy.
Bernoulli
p(x) = pˣ(1−p)^{1−x}, x∈{0,1}
Single binary trial with success probability p. E[X]=p, Var(X)=p(1−p). Basis of logistic regression output.
Binomial
p(k) = C(n,k)pᵏ(1−p)^{n−k}
Number of successes in n independent Bernoulli(p) trials. E[X]=np, Var=np(1−p). CLT: Bin(n,p) ≈ N(np, np(1−p)) for large n.
Gaussian / Normal
f(x) = (1/√(2πσ²))exp(−(x−μ)²/2σ²)
MaxEnt distribution for known mean and variance. E[X]=μ, Var=σ². Basis of regression likelihood, weight initialization, noise models. Central to CLT.
Poisson
p(k) = λᵏe^{−λ}/k!, k=0,1,2,...
Count of rare events in a fixed interval. E[X]=Var(X)=λ. Approximates Binomial(n,p) when n large, p small, λ=np. Used in NLP token count models.
Beta
f(x) ∝ x^{α−1}(1−x)^{β−1}, x∈[0,1]
Distribution over probabilities. Conjugate prior for Bernoulli/Binomial. E[X]=α/(α+β). Uniform=Beta(1,1). MaxEnt on [0,1] with fixed E[log X] and E[log(1−X)].
Dirichlet
f(p) ∝ ∏ pₖ^{αₖ−1}, Σpₖ=1
Distribution over probability simplices. Conjugate prior for Categorical/Multinomial. Generalizes Beta. E[pₖ] = αₖ/Σαⱼ. Foundation of LDA topic models.
Exponential
f(x) = λe^{−λx}, x≥0
Time until first event in Poisson process. E[X]=1/λ, Var=1/λ². Memoryless: P(X>s+t|X>s)=P(X>t). MaxEnt on [0,∞) with fixed mean.
Gamma
f(x) ∝ x^{α−1}e^{−x/β}, x>0
Sum of α independent Exp(1/β) RVs. E[X]=αβ, Var=αβ². Conjugate prior for Gaussian precision and Poisson rate. Generalizes exponential and chi-squared.
η = natural parameters, T(x) = sufficient statistics, A(η) = log partition function (ensures normalisation). All common distributions are exponential family. Properties: MLE of η has closed form (set mean of T(x) equal to E[T(x)]). Conjugate priors always exist. A(η) is convex, and ∇A(η) = E[T(X)], ∇²A(η) = Cov(T(X)). MaxEnt under moment constraints always gives exponential family.
12
Convergence
Law of Large Numbers
// weak LLN · strong LLN · convergence in probability vs almost surely · iid requirement
The Law of Large Numbers formalizes the intuition that averages stabilize with more data. It is the mathematical foundation of Monte Carlo methods, empirical risk minimization, and every statistical estimator.
Requires X₁, X₂, ... to be iid with finite mean. Proof via Chebyshev's inequality: P(|X̄_n − μ| > ε) ≤ Var(X̄_n)/ε² = σ²/(nε²) → 0. The Weak LLN justifies empirical risk minimization: the empirical average of losses converges to the expected loss as n → ∞.
Stronger than Weak LLN: almost every path of sample averages converges to μ. Weak LLN: for any fixed ε, the probability of being outside ε-ball goes to 0. Strong LLN: with probability 1, you'll eventually stay inside every ε-ball forever. Requires finite E[|X|] (not just finite mean). The difference matters for ergodic theory and stochastic processes.
n
LLN in Machine Learning
The LLN justifies training on mini-batches: the average gradient over a batch converges to the true gradient as batch size grows. It justifies Monte Carlo integration: \(\hat{I} = \frac{1}{n}\sum_i f(x_i) \to \mathbb{E}[f(X)]\). It justifies empirical risk minimization: minimizing \(\frac{1}{n}\sum \ell(f(x_i),y_i)\) approximates minimizing the true risk \(\mathbb{E}[\ell(f(X),Y)]\). Everything in statistical learning theory uses the LLN as its foundation.
13
CLT
Central Limit Theorem
// normal limit · rate √n · Berry-Esseen · CLT in ML · characteristic functions
The Central Limit Theorem (CLT) is the reason the Gaussian distribution appears everywhere in nature and algorithms. In Deep Learning, it justifies why aggregated gradients in a mini-batch behave predictably, why weight initialization schemes assume normality, and why ensemble methods like Random Forests successfully reduce model variance.
★
Central Limit Theorem (Lindeberg-Lévy)
Let \(X_1, X_2, \ldots\) be i.i.d. with \(\mathbb{E}[X_i] = \mu\) and \(\text{Var}(X_i) = \sigma^2 < \infty\). Then:
\[\sqrt{n}\!\left(\bar{X}_n - \mu\right) \xrightarrow{d} \mathcal{N}(0, \sigma^2)\]
Equivalently: \(\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1)\) as \(n \to \infty\).
Proof Sketch via Characteristic Functions
// CLT proof sketch using characteristic functions (moment generating functions)
1
Let \(Z_i = (X_i - \mu)/\sigma\). The standardized sum is \(S_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n Z_i\). Characteristic function: \(\varphi_{Z_i}(t) = \mathbb{E}[e^{itZ_i}]\).
2
By Taylor expansion near \(t=0\): \(\varphi_{Z_i}(t) = 1 + it\mathbb{E}[Z_i] - \frac{t^2}{2}\mathbb{E}[Z_i^2] + O(t^3) = 1 - \frac{t^2}{2} + O(t^3)\).
Since E[Z_i] = 0 (standardized) and E[Z_i²] = 1 (unit variance).
3
Since \(Z_i\) are independent: \(\varphi_{S_n}(t) = \left[\varphi_{Z_i}(t/\sqrt{n})\right]^n = \left[1 - \frac{t^2}{2n} + O(n^{-3/2})\right]^n\).
4
Take \(n \to \infty\): \(\left(1 - \frac{t^2}{2n}\right)^n \to e^{-t^2/2}\). This is the characteristic function of \(\mathcal{N}(0,1)\). By Lévy's continuity theorem, \(S_n \xrightarrow{d} \mathcal{N}(0,1)\). ∎
The CLT convergence rate is O(1/√n). For n=100 and σ²=1, ρ=1: error ≤ 0.047. This quantitative bound justifies using the normal approximation in practice — and tells you exactly how large n must be for the approximation to be valid.
CLT in Machine Learning
Confidence intervals for model metrics: The difference between two models' accuracies over \(n\) test examples is approximately \(\mathcal{N}(0, \sigma^2/n)\) — enabling statistical significance tests.
Mini-batch gradient estimates: The mini-batch gradient is an average of \(B\) per-sample gradients. By CLT, its distribution is approximately Gaussian around the true gradient, with variance \(\sigma^2/B\).
Justification for Gaussian likelihood: Many real-world measurements are sums of many small independent noise sources. By CLT, such sums are approximately Gaussian — justifying the Gaussian noise assumption in regression.
Ensemble methods: The average prediction of many weak classifiers converges to a distribution with lower variance (rate \(1/\sqrt{n}\)) around the mean prediction.
Concentration inequalities bound the probability that a random variable deviates far from its mean. They are the mathematical core of statistical learning theory — generalization bounds, PAC learning, and bandit algorithms all rely on them.
Markov's Inequality — The Weakest Bound
Markov's InequalityNon-Negative RV
\[\text{For } X \geq 0:\quad \mathbb{P}(X \geq t) \leq \frac{\mathbb{E}[X]}{t}\]
Proof: E[X] = E[X·1_{X≥t}] + E[X·1_{X<t}] ≥ E[X·1_{X≥t}] ≥ t·P(X≥t). Requires only X ≥ 0 and finite mean. Basis for all other concentration inequalities (Chebyshev applies Markov to (X−μ)²). Tight: Bernoulli(μ/t) achieves equality.
Apply Markov to Y = (X−μ)². P(|X−μ| ≥ t) = P((X−μ)² ≥ t²) ≤ E[(X−μ)²]/t² = σ²/t². Requires only finite variance. The "k-sigma rule": P(|X−μ| ≥ kσ) ≤ 1/k². So at least 75% of data within 2σ, 89% within 3σ (for ANY distribution — compare to Gaussian's 95% and 99.7%). Chebyshev gives distribution-free bounds.
Exponentially tighter than Chebyshev for bounded random variables. Requires no assumption about the variance — only that X_i ∈ [a_i, b_i]. The probability decays exponentially in n and ε². Setting the bound to δ: with probability ≥ 1−δ, |X̄ − μ| ≤ (b−a)√(log(2/δ)/(2n)). This is a PAC bound — it says how many samples you need for accuracy ε with confidence 1−δ.
Gaussian(0,σ²) is sub-Gaussian with parameter σ. Any bounded [−B,B] RV is sub-Gaussian with σ = B. Sub-Gaussian is the right class for concentration: tails decay at least as fast as Gaussian. Heavy-tailed distributions (Cauchy, Pareto) are NOT sub-Gaussian. Most loss functions in ML are bounded → sub-Gaussian → Hoeffding applies.
Every ML algorithm has a probabilistic interpretation. Recognizing this reveals what assumptions each method implicitly makes, how to improve it, and how to quantify uncertainty in its predictions.
Maximum Likelihood Estimation
MLE — The Probabilistic Foundation of TrainingEstimation
The switch to log-likelihood is valid because log is monotone. The log converts the product to a sum — numerically stable and analytically convenient. MLE minimizes KL(P̂_data ‖ P_θ) — it finds the model closest to the empirical data distribution in KL divergence.
Model the joint distribution \(p(\mathbf{x}, y) = p(\mathbf{x}|y)\,p(y)\). Can generate new samples, handle missing data, and do density estimation. Predict via Bayes: \(p(y|\mathbf{x}) = p(\mathbf{x}|y)p(y)/p(\mathbf{x})\).
Model the conditional distribution \(p(y|\mathbf{x})\) directly. Don't model \(p(\mathbf{x})\) — focus all capacity on the decision boundary. Usually more accurate when the task is classification/regression.
Examples: Logistic Regression, SVMs, Neural Networks, CNNs, Transformers, Random Forests.
16
Mental Model
The Complete Mental Model
// everything connected · probability as a language · the full picture
Fig 1. Complete probability theory mental map — from Kolmogorov's axioms through random variables, conditioning, Bayes' rule, limit theorems, and out to every ML application.
The complete story as one coherent thread:
A probability space \((\Omega, \mathcal{F}, \mathbb{P})\) formalizes randomness: outcomes, measurable events, and a measure satisfying Kolmogorov's three axioms. Every probability rule you've ever used is a theorem derived from these axioms.
Random variables are measurable functions \(X: \Omega \to \mathbb{R}\). They carry distributions — described by PMFs (discrete) or PDFs (continuous) — and their CDFs always exist and completely characterize the distribution.
Expectation is the Lebesgue integral of \(X\) against \(\mathbb{P}\). It is linear (always), monotone, and satisfies LOTUS (no need to find the distribution of \(g(X)\) to compute \(\mathbb{E}[g(X)]\)). Jensen's inequality connects convexity to expectation inequalities.
Variance and covariance measure spread and co-movement. The covariance matrix is symmetric PSD. PCA diagonalizes it. Every loss landscape in ML has a Hessian which is a covariance matrix in disguise.
Conditional probability and Bayes' rule follow from the product rule. The chain rule of probability factorizes any joint distribution into conditionals — exactly what autoregressive LMs (GPT) compute when generating text token by token.
The Law of Large Numbers guarantees empirical averages converge to expectations. This justifies SGD, mini-batch training, empirical risk minimization, and Monte Carlo integration — the entire computational machinery of ML.
The Central Limit Theorem tells us that sums of independent random variables are approximately Gaussian, regardless of the original distribution. This justifies Gaussian likelihood assumptions and enables statistical inference on model metrics.
Concentration inequalities — Markov, Chebyshev, Hoeffding, sub-Gaussian — give finite-sample bounds. They are the core of PAC learning theory: "with \(n\) samples and probability \(1-\delta\), the generalization gap is at most \(O(\sqrt{\log(1/\delta)/n})\)."
Every ML model is a probabilistic model: the loss function is the negative log-likelihood, regularization is a log-prior, training is MLE or MAP, and the full Bayesian treatment gives a posterior over parameters and an honest predictive distribution.
★
Probability is the Language of Uncertainty
Every algorithm in machine learning is a statement about probability: what distribution generated the data, what distribution we place over parameters, how distributions transform under operations. MSE loss assumes Gaussian noise. Cross-entropy assumes Bernoulli/Categorical output. L2 regularization assumes Gaussian prior. Dropout is approximate Bayesian inference. Batch normalization manipulates the distribution of activations. Attention is a soft probability distribution over positions. Understanding probability theory doesn't add a mathematical layer on top of ML — it reveals that probability is the substance of which every ML algorithm is made.