From the philosophical divide between frequentist and Bayesian thinking,
through Bayes' theorem, priors, posteriors, conjugate families,
and into every modern probabilistic ML technique.
// two schools · probability as frequency vs degree of belief · what parameters mean
The deepest divide in statistics is not methodological — it is philosophical. It concerns what probability is, what parameters are, and what it means to make an inference.
The Two Philosophies
Frequentist View
Probability is the long-run frequency of an event in infinitely repeated identical experiments. \(P(A) = \lim_{n\to\infty}\frac{\text{occurrences of }A}{n}\).
Parameters are fixed, unknown constants. They have a true value; they are not random. It is meaningless to write \(P(\theta)\).
Inference: estimate the fixed parameter from data. Report confidence intervals: \(100(1-\alpha)\)% of such intervals will contain the true value over repeated experiments.
Example: "The coin has a true fixed bias \(p\). After 100 flips, my estimate is \(\hat{p} = 0.52\)."
Bayesian View
Probability is a degree of belief — a subjective state of knowledge about an event. Any uncertain quantity can have a probability, whether or not it can be repeated.
Parameters are uncertain quantities. They have a probability distribution \(P(\theta)\) reflecting our state of knowledge. They are random variables.
Inference: update beliefs using Bayes' theorem. The posterior \(P(\theta|X)\) is our updated belief after seeing data.
Example: "Before flipping, I believe the bias is near 0.5. After 100 flips, I update: posterior has mean 0.52, concentrated near 0.5."
Prediction? Frequentist: plug in \(\hat{\theta}\): \(p(y_*|x_*, \hat{\theta})\). Bayesian: integrate over uncertainty: \(\int p(y_*|x_*,\theta)p(\theta|\text{data})\,d\theta\).
Model comparison? Frequentist: hypothesis tests, AIC/BIC (approximate). Bayesian: Bayes factors, the exact ratio of model evidences.
The Bayesian Advantage in ML
Most ML methods are implicitly Bayesian: L2 regularization is a Gaussian prior; L1 regularization is a Laplace prior; early stopping is an implicit prior; dropout is approximate variational inference. Understanding the Bayesian framework reveals what these methods are really doing and gives a principled basis for designing new ones.
02
Probability
Probability Foundations
// conditional probability · product rule · sum rule · marginalization
Bayesian inference is built on three rules of probability. Everything else is algebra.
These three rules are all you need. The rest of Bayesian inference is applying them to increasingly complex models. Marginalization (the sum rule) is often the computationally intractable step that drives approximate inference methods.
p(θ|D) — posterior. p(D|θ) — likelihood. p(θ) — prior. p(D) — evidence/marginal likelihood. The denominator is the "normalizing constant" that ensures the posterior integrates to 1. It's often written as the proportionality: p(θ|D) ∝ p(D|θ)·p(θ).
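For reference, the rules this section relies on can be written out explicitly. This is a sketch in the notation used elsewhere in the document, assuming the three rules meant are the product rule, the sum rule, and Bayes' theorem (which follows from the first two):

\[p(\theta, \mathcal{D}) = p(\mathcal{D}|\theta)\,p(\theta) \quad \text{(product rule)}\]
\[p(\mathcal{D}) = \int p(\mathcal{D}|\theta)\,p(\theta)\,d\theta \quad \text{(sum rule / marginalization)}\]
\[p(\theta|\mathcal{D}) = \frac{p(\mathcal{D}|\theta)\,p(\theta)}{p(\mathcal{D})} \quad \text{(Bayes' theorem)}\]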
The ∝ Shortcut
Since \(p(\mathcal{D})\) doesn't depend on \(\theta\), we almost always write \(p(\theta|\mathcal{D}) \propto p(\mathcal{D}|\theta)\,p(\theta)\). We only need to compute \(p(\mathcal{D})\) if we need a proper normalized distribution or want to compare models. For finding the posterior mode (MAP) or its shape, the proportionality is sufficient.
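The proportionality can be seen at work in a small grid approximation. The sketch below assumes a coin-flip model with a flat prior and made-up counts (`n`, `k` are illustrative, not from the text). It finds the posterior mode from the unnormalized curve and normalizes only at the end.

```python
import numpy as np

# Hypothetical data: k heads in n flips (illustrative numbers only).
n, k = 20, 14

theta = np.linspace(0.001, 0.999, 999)           # grid over the coin bias
prior = np.ones_like(theta)                      # flat prior, unnormalized
likelihood = theta**k * (1 - theta)**(n - k)     # p(D | theta) up to a constant
unnorm_post = likelihood * prior                 # p(theta | D) up to a constant

# The MAP estimate needs only the unnormalized curve.
theta_map = theta[np.argmax(unnorm_post)]

# Normalization is needed only when actual probabilities are required.
dtheta = theta[1] - theta[0]
norm_const = unnorm_post.sum() * dtheta          # Riemann-sum normalizer
posterior = unnorm_post / norm_const
```

With a flat prior the mode lands at the grid point closest to k/n. Swapping in a different `prior` array changes the shape without changing any other line.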
03
Core Theorem
Bayes' Theorem — Derivation & Interpretation
// from product rule · the four terms · belief update · prior × likelihood = posterior
Derivation from First Principles
// Deriving Bayes' theorem from the product rule alone
1. Product rule applied to \(P(\theta, \mathcal{D})\):
\[P(\theta, \mathcal{D}) = P(\mathcal{D}|\theta)\,P(\theta)\]
2. Product rule applied the other way:
\[P(\theta, \mathcal{D}) = P(\theta|\mathcal{D})\,P(\mathcal{D})\]
3. Equate (both equal the joint):
\[P(\theta|\mathcal{D})\,P(\mathcal{D}) = P(\mathcal{D}|\theta)\,P(\theta)\]
4. Divide both sides by \(P(\mathcal{D}) > 0\):
\[\boxed{P(\theta|\mathcal{D}) = \frac{P(\mathcal{D}|\theta)\,P(\theta)}{P(\mathcal{D})}}\]
That's it. Two applications of the product rule. The entire Bayesian framework follows from this. ∎
The Belief Update Visual
Prior, P(θ): What we believed before seeing any data. Can be informative (domain knowledge) or diffuse (ignorance).
× Likelihood, P(D|θ): How well each θ explains the observed data. It is largest at the points in parameter space where the observed data is most probable.
÷ Evidence, P(D): The average likelihood under the prior. Ensures the posterior is a proper distribution.
= Posterior, P(θ|D): Updated beliefs after seeing data. Combines prior knowledge with evidence from data.
Fig 1. Bayesian update. Broad prior (uncertain) × peaked likelihood (data informative) = narrower posterior (updated belief). The posterior is a compromise: pulled toward the likelihood, regularized by the prior.
The prior \(p(\theta)\) encodes everything we believe about \(\theta\) before seeing data. Choosing a prior is both the most criticized and most valuable part of Bayesian inference.
Types of Priors
Informative Prior
θ ~ N(μ₀, σ₀²) with small σ₀
Encodes genuine domain knowledge. A medical expert's belief about drug efficacy. Strong priors require justification but can dramatically improve inference with small data.
Weakly Informative
θ ~ N(0, 10²) or Cauchy(0,2.5)
Encodes soft constraints: "θ is probably in a reasonable range." Recommended default in modern Bayesian practice (Gelman et al.). Prevents wild extrapolations without being dogmatic.
Diffuse / Uninformative
θ ~ Uniform(−∞, ∞)
Attempts to express "no information." Often improper (doesn't integrate to 1). Can cause issues: uniform prior on θ ≠ uniform prior on θ². The concept of "no information" is not well-defined.
Jeffreys Prior
p(θ) ∝ √det(I(θ))
Invariant under reparametrization — the "objective" prior. Based on the Fisher information matrix \(\mathcal{I}(\theta)\). Gives the same inference regardless of how you parametrize the model.
Hierarchical Prior
θᵢ ~ p(θ|φ), φ ~ p(φ)
Prior parameters (hyperparameters) are themselves given priors. Allows the data to determine how strong the regularization should be. Foundation of Bayesian hierarchical models.
Empirical Bayes
φ̂ = argmax P(D|φ)
Estimate hyperparameters from the data itself by maximizing the marginal likelihood. Not fully Bayesian (uses data twice) but computationally tractable. Used, for example, to fit kernel hyperparameters in Gaussian process models (type-II maximum likelihood).
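Many regularized ML models can be read as MAP estimates under a particular prior. With unit noise variance, the regularization strength λ is 1/σ² for a Gaussian prior with variance σ², or 1/b for a Laplace prior with scale b; other noise levels and loss normalizations rescale λ by a constant. The Bayesian framework reveals the probabilistic assumptions implicit in regularization choices.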
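The link between a Gaussian prior and an L2 penalty can be made explicit with a short derivation. It is a sketch for linear regression, assuming Gaussian noise with variance \(\sigma_\varepsilon^2\) (a symbol introduced here for illustration) and a prior \(\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma^2 I)\):

\[-\log p(\mathbf{w}|\mathcal{D}) = \frac{1}{2\sigma_\varepsilon^2}\lVert \mathbf{y} - X\mathbf{w} \rVert^2 + \frac{1}{2\sigma^2}\lVert \mathbf{w} \rVert^2 + \text{const}\]
\[\hat{\mathbf{w}}_{\text{MAP}} = \arg\min_{\mathbf{w}} \; \lVert \mathbf{y} - X\mathbf{w} \rVert^2 + \lambda \lVert \mathbf{w} \rVert^2, \qquad \lambda = \frac{\sigma_\varepsilon^2}{\sigma^2}\]

With unit noise variance this reduces to \(\lambda = 1/\sigma^2\). A tighter prior (smaller \(\sigma^2\)) means stronger regularization.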
05
Likelihood
The Likelihood Function
// P(data|θ) · as a function of θ · likelihood principle · sufficient statistics
The likelihood \(p(\mathcal{D}|\theta)\) is the same mathematical expression as the data distribution — but with a crucial conceptual inversion: we fix \(\mathcal{D}\) (observed) and vary \(\theta\). It is not a probability distribution over \(\theta\).
The i.i.d. assumption converts the joint probability of all data into a product of individual probabilities. Log-likelihood converts this to a sum — numerically stable and analytically convenient. MLE maximizes ℓ(θ). Bayesian inference multiplies the likelihood by the prior.
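The numerical point about products versus sums is easy to check. The sketch below uses a Bernoulli model on synthetic flips (the data and the value 0.6 are illustrative assumptions). The raw product of probabilities underflows, while the log-likelihood stays finite.

```python
import numpy as np

def bernoulli_log_likelihood(theta, x):
    """Log-likelihood of i.i.d. 0/1 observations x under bias theta."""
    x = np.asarray(x)
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.6, size=5000)      # synthetic flips, illustrative only

# The raw product of 5000 probabilities underflows to 0.0 in float64 ...
raw_product = np.prod(np.where(x == 1, 0.6, 0.4))
# ... while the sum of logs stays finite.
loglik = bernoulli_log_likelihood(0.6, x)

# For the Bernoulli model the log-likelihood is maximized at the sample mean.
theta_mle = x.mean()
```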
The Likelihood Principle
All evidence about \(\theta\) from data \(\mathcal{D}\) is contained in the likelihood function \(\mathcal{L}(\theta) = p(\mathcal{D}|\theta)\). Two experiments with proportional likelihoods provide the same evidence about \(\theta\), even if the experimental designs were different. Bayesian inference automatically satisfies this principle; frequentist methods generally do not.
// the complete answer · what it contains · point estimates vs full posterior
The posterior \(p(\theta|\mathcal{D})\) is the Bayesian answer to every question. It is not a single number — it is a complete probability distribution over parameters, encoding all we know after observing data.
The 95% Bayesian credible interval [a,b] directly means "given the observed data, θ lies in [a,b] with probability 95%." This is what most people intuitively want from a confidence interval — but frequentist CIs don't actually mean this.
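A credible interval can be read straight off posterior samples. The sketch below assumes a Beta(53, 49) posterior, which is what a Beta(1, 1) prior gives after a hypothetical 52 heads in 100 flips, and computes an equal-tailed 95% interval from quantiles.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior for a coin bias: Beta(53, 49), i.e. a Beta(1, 1)
# prior updated with 52 heads in 100 flips (illustrative numbers).
a, b = 53, 49
draws = rng.beta(a, b, size=200_000)

# Equal-tailed 95% credible interval from posterior quantiles.
lo, hi = np.quantile(draws, [0.025, 0.975])

# Posterior probability that theta lies in [lo, hi]: about 0.95 by construction.
mass_inside = np.mean((draws >= lo) & (draws <= hi))
```

The statement "θ is in [lo, hi] with probability 0.95" is then a direct property of the posterior draws, conditional on the model and the prior.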
The Intractability Problem
The denominator \(p(\mathcal{D}) = \int p(\mathcal{D}|\theta)p(\theta)\,d\theta\) is a high-dimensional integral — analytically intractable for most models. This is the central computational challenge in Bayesian inference. Solutions: conjugate priors (exact, §08), Laplace approximation (Gaussian approximation), variational inference (§13), or MCMC sampling (§14).
07
Estimation
MAP vs MLE — The Complete Comparison
// maximum a posteriori · regularization equivalence · when MAP = MLE · asymptotic behavior
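The MAP objective is the log-likelihood plus the log-prior: \(\hat{\theta}_{\text{MAP}} = \arg\max_\theta \left[\log p(\mathcal{D}|\theta) + \log p(\theta)\right]\). When the prior is uniform, MAP = MLE. MAP uses the mode of the posterior, a single point estimate that ignores the rest of the distribution. Full Bayesian inference keeps the entire posterior rather than collapsing it to a point.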
Asymptotic Equivalence
As \(n \to \infty\), both MLE and MAP converge to the true parameter (under regularity conditions). The prior is "washed out" by the likelihood:
For large enough data, the posterior concentrates around the MLE regardless of the prior (as long as the prior assigns positive probability to a neighbourhood of θ*). The log-prior contributes O(1) to the log-posterior while the log-likelihood grows as O(n). Prior choice matters most with small data.
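The washing-out effect can be illustrated with a coin model. The sketch below compares posterior means under two deliberately different Beta priors (both hypothetical choices) as the sample size grows while the observed proportion of heads stays at 30%.

```python
def beta_posterior_mean(alpha, beta, k, n):
    """Posterior mean of a coin bias under a Beta(alpha, beta) prior."""
    return (alpha + k) / (alpha + beta + n)

# Two deliberately different priors (hypothetical choices).
priors = {"flat": (1, 1), "opinionated": (20, 5)}

gaps = []
for n in (10, 1000, 100_000):
    k = 3 * n // 10                       # 30% heads at every sample size
    means = {name: beta_posterior_mean(a, b, k, n) for name, (a, b) in priors.items()}
    gaps.append(abs(means["flat"] - means["opinionated"]))
    print(f"n={n:>6}  flat={means['flat']:.4f}  opinionated={means['opinionated']:.4f}")
```

The gap between the two posterior means shrinks roughly like 1/n, which is the O(1) versus O(n) scaling in concrete form.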
| Property | MLE | MAP | Full Bayes |
|---|---|---|---|
| Output | Point estimate \(\hat{\theta}\) | Point estimate \(\hat{\theta}_{\text{MAP}}\) | Full distribution \(p(\theta \mid \mathcal{D})\) |
| Prior used? | No (uniform implicit) | Yes (as regularizer) | Yes (fully) |
| Uncertainty quantified? | Via asymptotic SE | Partially (Laplace approx) | Fully |
| Regularization? | No | Yes (= log-prior) | Natural (averaging) |
| Overfitting risk? | High (large models) | Lower | Lowest (averaging) |
| Computational cost | Low | Low | High |
| Typical use | Deep learning, large data | Regularized ML models | Probabilistic ML, small data |
08
Conjugacy
Conjugate Priors — Full Treatment
// closed-form posteriors · exponential family · Beta-Binomial · Gaussian-Gaussian · full derivation
A prior \(p(\theta)\) is conjugate to a likelihood \(p(\mathcal{D}|\theta)\) if the posterior \(p(\theta|\mathcal{D})\) has the same distributional form as the prior. Conjugacy gives exact, closed-form posteriors — no approximation needed.
Beta-Binomial: The Canonical Example
Model: \(n\) coin flips with \(k\) heads, unknown bias \(p\in[0,1]\).
The posterior is a Beta with updated counts. The prior parameters α, β act as "pseudo-counts" of prior observations. This is exact — no approximation. Posterior mean = (α+k)/(α+β+n).
Interpreting the Beta Parameters
The Beta posterior \(\text{Beta}(\alpha+k, \beta+n-k)\) has a beautiful interpretation: \(\alpha+k\) is the total number of heads (prior pseudo-counts + observed heads); \(\beta+n-k\) is the total number of tails. The posterior mean \(\hat{p} = (\alpha+k)/(\alpha+\beta+n)\) interpolates between the prior mean \(\alpha/(\alpha+\beta)\) and the MLE \(k/n\), weighted by how many observations each is based on.
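The pseudo-count reading translates directly into code. The sketch below uses a hypothetical Beta(2, 2) prior and 7 heads in 10 flips, and checks that the posterior mean equals the weighted average of the prior mean and the MLE.

```python
def beta_binomial_update(alpha, beta, k, n):
    """Return the Beta posterior parameters after k heads in n flips."""
    return alpha + k, beta + (n - k)

# Hypothetical numbers: a Beta(2, 2) prior and 7 heads in 10 flips.
alpha, beta, k, n = 2, 2, 7, 10
alpha_n, beta_n = beta_binomial_update(alpha, beta, k, n)

post_mean = alpha_n / (alpha_n + beta_n)

# The same mean written as a weighted average of the prior mean and the MLE.
w = (alpha + beta) / (alpha + beta + n)          # weight on the prior
interpolated = w * alpha / (alpha + beta) + (1 - w) * k / n
```

Here the prior carries the weight of 4 pseudo-observations against 10 real ones, so the posterior mean 9/14 sits closer to the MLE 0.7 than to the prior mean 0.5.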
Gaussian-Gaussian: Known Variance
Observe \(x_1,\ldots,x_n \overset{iid}{\sim} \mathcal{N}(\mu, \sigma^2)\) with known \(\sigma^2\), unknown \(\mu\).
The posterior precision (1/τₙ²) = prior precision + data precision. The posterior mean is a precision-weighted average of prior mean μ₀ and sample mean x̄. As n→∞: μₙ → x̄ (data dominates). As τ₀²→∞ (diffuse prior): μₙ → x̄ and τₙ² → σ²/n (same as frequentist).
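The precision-weighted update can be sketched in a few lines. The function name and the synthetic data below are illustrative assumptions; the formulas are the ones described above, with prior \(\mathcal{N}(\mu_0, \tau_0^2)\) and known noise variance \(\sigma^2\).

```python
import numpy as np

def gaussian_known_variance_update(x, sigma2, mu0, tau0_2):
    """Posterior N(mu_n, tau_n^2) for the mean, given prior N(mu0, tau0_2)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    prior_precision = 1.0 / tau0_2
    data_precision = n / sigma2
    tau_n2 = 1.0 / (prior_precision + data_precision)      # precisions add
    mu_n = tau_n2 * (prior_precision * mu0 + data_precision * x.mean())
    return mu_n, tau_n2

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)       # synthetic data, illustrative

mu_n, tau_n2 = gaussian_known_variance_update(x, sigma2=1.0, mu0=0.0, tau0_2=4.0)

# A very diffuse prior recovers the frequentist answer (sample mean, sigma2 / n).
mu_diffuse, tau_diffuse2 = gaussian_known_variance_update(x, 1.0, 0.0, 1e12)
```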
The Exponential Family and Conjugacy
Exponential Family: Universal Conjugate Structure (General Theory)
Every exponential family distribution has a conjugate prior — and the posterior update is simply adding the sufficient statistics Σᵢ T(xᵢ) to the prior hyperparameter χ, and incrementing ν by n. This is why conjugacy and the exponential family are inseparably linked.
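The update rule can be written out in the standard natural-parameter form. This is a sketch: \(T\), \(\chi\), and \(\nu\) are the symbols used above, while \(\eta\) (natural parameter), \(h\) (base measure), and \(A\) (log-partition function) are conventional names assumed here.

\[p(x|\eta) = h(x)\exp\!\big(\eta^\top T(x) - A(\eta)\big)\]
\[p(\eta|\chi,\nu) \propto \exp\!\big(\eta^\top \chi - \nu A(\eta)\big)\]
\[p(\eta|x_{1:n},\chi,\nu) \propto \exp\!\Big(\eta^\top\big(\chi + \textstyle\sum_{i=1}^{n} T(x_i)\big) - (\nu + n)\,A(\eta)\Big)\]

The posterior has the same functional form as the prior, with \(\chi \to \chi + \sum_i T(x_i)\) and \(\nu \to \nu + n\).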
// online learning · today's posterior is tomorrow's prior · order invariance
Bayesian inference is naturally sequential. Today's posterior becomes tomorrow's prior when new data arrives. This is the Bayesian version of online learning.
Sequential Bayesian Update (Online Learning)
\[p(\theta|\mathcal{D}_{1:t}) \propto p(x_t|\theta)\cdot p(\theta|\mathcal{D}_{1:t-1})\]
\[\text{i.e.} \quad \underbrace{p(\theta|\mathcal{D}_{1:t})}_{\text{new posterior}} \propto \underbrace{p(x_t|\theta)}_{\text{new data}} \times \underbrace{p(\theta|\mathcal{D}_{1:t-1})}_{\text{previous posterior as new prior}}\]
The result is identical whether you process data all at once (batch) or one point at a time (sequential). Bayesian inference is order-invariant: p(θ|x₁,x₂) = p(θ|x₂,x₁) for i.i.d. data. This is the mathematical basis for online/streaming Bayesian learning.
Beta-Binomial Sequential Update
Start: \(\text{Beta}(1,1)\) (uniform prior). Observe H, T, H, H, T. Updates: \(\text{Beta}(2,1) \to \text{Beta}(2,2) \to \text{Beta}(3,2) \to \text{Beta}(4,2) \to \text{Beta}(4,3)\). Final posterior: \(\text{Beta}(4,3)\), same as if we'd processed all 5 flips at once with \(k=3, n=5\). The order doesn't matter; only the counts do.
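The five-flip example can be replayed in code. The sketch below applies one update per flip and compares the result with a single batch update; the helper name `update` is illustrative.

```python
def update(alpha, beta, flip):
    """One-step Beta update for a single flip, 'H' or 'T'."""
    return (alpha + 1, beta) if flip == "H" else (alpha, beta + 1)

flips = ["H", "T", "H", "H", "T"]

state = (1, 1)                      # Beta(1, 1) uniform prior
trajectory = [state]
for flip in flips:
    state = update(*state, flip)    # previous posterior becomes the new prior
    trajectory.append(state)

# Batch update with k heads out of n flips, for comparison.
k, n = flips.count("H"), len(flips)
batch = (1 + k, 1 + n - k)

print(trajectory)   # [(1, 1), (2, 1), (2, 2), (3, 2), (4, 2), (4, 3)]
print(batch)        # (4, 3)
```

Shuffling `flips` changes the intermediate states but not the final one, since only the counts enter the posterior.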
10
Prediction
Predictive Distributions
// posterior predictive · marginalizing over parameters · uncertainty propagation
A key Bayesian advantage: instead of predicting with a single "best" parameter, we average over all parameter values weighted by their posterior probability. This naturally propagates parameter uncertainty into predictions.
The posterior predictive marginalizes over parameter uncertainty. It is typically wider (more uncertain) than the plug-in prediction p(x_new|θ̂), because it adds the spread of the posterior over θ to the noise in the data. The plug-in prediction can be overconfident, especially for small data or complex models.
The prior predictive, \(p(x) = \int p(x|\theta)\,p(\theta)\,d\theta\), is how the model thinks data looks before seeing any observations. It is useful for prior predictive checking: simulate data from the prior predictive and compare to domain knowledge. If the prior predictive generates physically impossible values (e.g., negative ages, probabilities > 1), your prior is wrong.
Beta-Binomial Predictive — Closed Form
Beta-Binomial Predictive Distribution (Exact)
\[p(x_{\text{new}} = k | n_{\text{new}}, \mathcal{D}) = \binom{n_{\text{new}}}{k}\frac{B(\alpha+k,\beta+n_{\text{new}}-k)}{B(\alpha,\beta)}\]
where B(·,·) is the Beta function and (α,β) are the posterior parameters after updating on D. This is the Beta-Binomial distribution. Its variance is larger than a pure Binomial with p̂ because it accounts for uncertainty in p itself — it is "over-dispersed."
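The over-dispersion claim can be checked numerically. The sketch below evaluates the formula above using only the standard library, taking the Beta(4, 3) posterior from the five-flip example and an assumed horizon of 10 future flips.

```python
from math import comb, exp, lgamma

def log_beta_fn(a, b):
    """log B(a, b) via log-gamma for numerical stability."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(k, n_new, alpha, beta):
    """P(k heads in n_new future flips) under a Beta(alpha, beta) posterior."""
    log_ratio = log_beta_fn(alpha + k, beta + n_new - k) - log_beta_fn(alpha, beta)
    return comb(n_new, k) * exp(log_ratio)

# Posterior Beta(4, 3); predict 10 future flips (horizon chosen for illustration).
alpha, beta, n_new = 4, 3, 10
pmf = [beta_binomial_pmf(k, n_new, alpha, beta) for k in range(n_new + 1)]

mean = sum(k * p for k, p in enumerate(pmf))
var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))

# Plug-in Binomial variance at the posterior mean, for comparison.
p_hat = alpha / (alpha + beta)
plug_in_var = n_new * p_hat * (1 - p_hat)
```

Both distributions share the same mean, but the Beta-Binomial variance is roughly twice the plug-in Binomial variance here, because uncertainty about p is still substantial after five flips.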
11
Regression
Bayesian Linear Regression
// Gaussian prior on weights · posterior over functions · predictive intervals · evidence
Bayesian linear regression replaces a single weight vector with a distribution over weight vectors. The result is a distribution over functions — and a predictive distribution that honestly quantifies uncertainty.
Note: μ_w = (X^T X + (σ²/τ²) I)^{-1} X^T y — this is exactly the Ridge regression solution with λ = σ²/τ²! Bayesian linear regression with Gaussian prior is Ridge regression in disguise. The posterior mean is the MAP; the full posterior quantifies the uncertainty.
The predictive variance has two components: σ² (irreducible observation noise) and x_*^T Σ_w x_* (parameter uncertainty). The second term grows where the training data is sparse, giving wider uncertainty estimates in extrapolation regions. This is the key advantage over simply predicting with the MLE ŵ.
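Both observations, the Ridge equivalence and the growth of predictive variance away from the data, can be checked in a small NumPy sketch. It assumes a prior \(\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \tau^2 I)\), noise variance \(\sigma^2\), and synthetic 1-D data; the function names are illustrative.

```python
import numpy as np

def blr_posterior(X, y, sigma2, tau2):
    """Posterior N(mu_w, Sigma_w) for weights; prior N(0, tau2 I), noise sigma2."""
    d = X.shape[1]
    precision = X.T @ X / sigma2 + np.eye(d) / tau2
    Sigma_w = np.linalg.inv(precision)
    mu_w = Sigma_w @ X.T @ y / sigma2
    return mu_w, Sigma_w

def blr_predict(x_star, mu_w, Sigma_w, sigma2):
    """Predictive mean and variance at a single input x_star."""
    mean = x_star @ mu_w
    var = sigma2 + x_star @ Sigma_w @ x_star     # noise + parameter uncertainty
    return mean, var

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)                   # synthetic inputs in [-1, 1]
X = np.column_stack([np.ones_like(x), x])         # bias and slope features
y = 0.5 + 2.0 * x + rng.normal(0, 0.3, size=30)   # hypothetical ground truth

sigma2, tau2 = 0.3**2, 1.0
mu_w, Sigma_w = blr_posterior(X, y, sigma2, tau2)

# Ridge solution with lambda = sigma2 / tau2 matches the posterior mean.
ridge = np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(2), X.T @ y)

_, var_inside = blr_predict(np.array([1.0, 0.0]), mu_w, Sigma_w, sigma2)
_, var_outside = blr_predict(np.array([1.0, 5.0]), mu_w, Sigma_w, sigma2)
```

The predictive variance at x = 5, far outside the training range, is larger than at x = 0, while the noise floor σ² is the same at both points.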
The marginal likelihood \(p(\mathcal{D}|M)\) — also called the evidence — measures how well a model predicts the data on average over all parameter values. It automatically penalizes overly complex models without a separate regularization procedure.
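The Bayes factor is the ratio of evidences for two models, \(\text{BF} = p(\mathcal{D}|M_1)/p(\mathcal{D}|M_2)\), where M₂ denotes the competing model. On the conventional scale, BF > 10 is strong evidence for M₁, BF > 100 is decisive, and BF = 1 is no evidence either way. Unlike AIC/BIC, Bayes factors are exact (given the model and prior) and don't require asymptotic approximations. The evidence automatically embodies Occam's razor: simpler models that fit almost as well have higher evidence.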
Automatic Occam's Razor
A complex model spreads its probability over a large hypothesis space — most of which the data doesn't look like. A simple model concentrates its probability on a smaller region. If the data falls in the simple model's region, the simple model wins on evidence. This is the Bayesian version of Occam's razor: given equal fit, prefer the simpler model. It falls out automatically from the mathematics, with no need to specify a penalty term.
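A coin example makes the effect concrete. The sketch below compares a simple model (coin exactly fair, no free parameters) with a more flexible one (unknown bias with a uniform prior). Both models and the flip counts are illustrative assumptions; both evidences have closed forms.

```python
from math import comb

def evidence_fair(k, n):
    """p(D | M0): the coin is exactly fair, no free parameters."""
    return comb(n, k) * 0.5**n

def evidence_unknown_bias(k, n):
    """p(D | M1): bias p ~ Uniform(0, 1), integrated out analytically.

    The integral of C(n, k) p^k (1 - p)^(n - k) over [0, 1] equals 1 / (n + 1),
    so the flexible model spreads its mass evenly over all n + 1 outcomes.
    """
    return 1.0 / (n + 1)

def bayes_factor_10(k, n):
    """Bayes factor p(D | M1) / p(D | M0)."""
    return evidence_unknown_bias(k, n) / evidence_fair(k, n)

# Hypothetical outcomes from 100 flips.
bf_balanced = bayes_factor_10(52, 100)   # data look fair
bf_skewed = bayes_factor_10(75, 100)     # data look biased
```

With 52 heads the Bayes factor is below 1, so the simpler fair-coin model is preferred even though the flexible model can fit the data at least as well. With 75 heads the flexible model wins decisively.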
13
Approx. Inference
Variational Inference & the ELBO
// intractable posterior · approximation family · KL minimization · mean-field VI
When the posterior is intractable, variational inference (VI) approximates it with a tractable distribution \(q_\phi(\theta)\) from a chosen family, by minimizing the KL divergence \(\mathrm{KL}\big(q_\phi(\theta)\,\|\,p(\theta|\mathcal{D})\big)\) over \(\phi\).
Since KL ≥ 0: ELBO ≤ log p(D). The ELBO is a lower bound on the log evidence. Maximizing the ELBO simultaneously (1) makes q fit the posterior well, and (2) gives a lower bound on log p(D) useful for model comparison. ELBO = E_q[log likelihood] − KL(q ‖ prior). Used in VAEs, topic models, and scalable Bayesian deep learning.
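The bound follows from a single identity, written here in the document's notation as a reference sketch:

\[\log p(\mathcal{D}) = \underbrace{\mathbb{E}_{q_\phi}\!\big[\log p(\mathcal{D},\theta) - \log q_\phi(\theta)\big]}_{\text{ELBO}(\phi)} + \mathrm{KL}\big(q_\phi(\theta)\,\|\,p(\theta|\mathcal{D})\big)\]
\[\text{ELBO}(\phi) = \mathbb{E}_{q_\phi}\!\big[\log p(\mathcal{D}|\theta)\big] - \mathrm{KL}\big(q_\phi(\theta)\,\|\,p(\theta)\big)\]

Because \(\log p(\mathcal{D})\) does not depend on \(\phi\), raising the ELBO lowers the KL divergence to the true posterior by exactly the same amount.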
Mean-Field Variational Inference
The most common VI approximation: assume the posterior factorizes fully:
\(q(\boldsymbol{\theta}) = \prod_j q_j(\theta_j)\). This ignores all posterior correlations but enables closed-form updates for exponential family models:
Coordinate Ascent VI (CAVI) updates each factor q_j in turn, holding others fixed. Guaranteed to converge to a local ELBO maximum. For conjugate exponential family models, the optimal q_j* is in the same family as the prior — giving closed-form updates.
MCMC constructs a Markov chain whose stationary distribution is the target posterior \(p(\theta|\mathcal{D})\). By running the chain long enough, we obtain samples that (approximately) represent the posterior — without computing the intractable normalizing constant.
Metropolis-Hastings Algorithm
Metropolis-Hastings Sampler
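import numpy as np

def metropolis_hastings(log_p, q_sample, log_q, theta_init, n_samples, n_burnin, rng):
    """Metropolis-Hastings sampler.

    Target: p(θ|D) ∝ p(D|θ)·p(θ), supplied as an unnormalized log density log_p.
    q_sample(θ, rng) draws a proposal; log_q(a, b) returns log q(a | b).
    """
    theta_curr = theta_init                 # start somewhere in parameter space
    samples = []
    for t in range(n_samples + n_burnin):
        # Propose a new state, e.g. theta_curr + N(0, σ²)
        theta_prop = q_sample(theta_curr, rng)
        # Log acceptance ratio (the log_q terms cancel for symmetric proposals)
        log_ratio = (log_p(theta_prop) + log_q(theta_curr, theta_prop)
                     - log_p(theta_curr) - log_q(theta_prop, theta_curr))
        # Accept with probability min(1, exp(log_ratio)), compared in log space
        if np.log(rng.uniform()) < min(0.0, log_ratio):
            theta_curr = theta_prop         # accept: move to proposed state
        # else: reject and stay at theta_curr
        if t >= n_burnin:
            samples.append(theta_curr)
    return np.array(samples)

# Use samples to estimate E[f(θ)|D] ≈ np.mean([f(s) for s in samples])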
Hamiltonian Monte Carlo (HMC)
HMC exploits gradient information to make large, accepted proposals. It introduces auxiliary "momentum" variables and simulates Hamiltonian dynamics to traverse the posterior efficiently — avoiding the random walk behavior of MH.
θ is the position (parameter), ρ is the momentum (auxiliary). The joint distribution p(θ,ρ) ∝ exp(−H(θ,ρ)) has p(θ|D) as its marginal. Simulating Hamiltonian dynamics (leapfrog integrator) proposes new states that preserve H — giving very high acceptance rates and long-distance moves. The NUTS (No-U-Turn Sampler) automatically tunes the step size and trajectory length. Used in Stan and PyMC.
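A single HMC transition is short enough to sketch in NumPy. The version below assumes a unit mass matrix, a fixed step size and trajectory length (no NUTS-style adaptation), and a toy 2-D standard Gaussian target. It is an illustration of the leapfrog mechanics, not how Stan or PyMC implement it.

```python
import numpy as np

def hmc_step(theta, log_p, grad_log_p, step_size, n_leapfrog, rng):
    """One HMC transition with a leapfrog integrator and unit mass matrix."""
    rho = rng.normal(size=theta.shape)                 # resample momentum
    theta_new, rho_new = theta.copy(), rho.copy()

    # Leapfrog: half step in momentum, alternating full steps, final half step.
    rho_new += 0.5 * step_size * grad_log_p(theta_new)
    for i in range(n_leapfrog):
        theta_new += step_size * rho_new
        if i < n_leapfrog - 1:
            rho_new += step_size * grad_log_p(theta_new)
    rho_new += 0.5 * step_size * grad_log_p(theta_new)

    # H(theta, rho) = -log p(theta) + 0.5 * rho^T rho
    h_old = -log_p(theta) + 0.5 * rho @ rho
    h_new = -log_p(theta_new) + 0.5 * rho_new @ rho_new

    # Metropolis correction for the integrator's discretization error.
    if np.log(rng.uniform()) < h_old - h_new:
        return theta_new, True
    return theta, False

# Toy target (an assumption for illustration): standard 2-D Gaussian.
log_p = lambda th: -0.5 * th @ th
grad_log_p = lambda th: -th

rng = np.random.default_rng(0)
theta = np.zeros(2)
samples, accepted = [], 0
for _ in range(2000):
    theta, ok = hmc_step(theta, log_p, grad_log_p, step_size=0.2, n_leapfrog=10, rng=rng)
    samples.append(theta)
    accepted += ok
samples = np.array(samples)
accept_rate = accepted / len(samples)
```

Because leapfrog nearly conserves H, almost every proposal is accepted even though each one moves a distance comparable to the width of the target.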
15
Deep Learning
Bayesian Neural Networks
// weight distributions · uncertainty · MC Dropout · deep ensembles · Laplace approx
Bayesian Neural Networks (BNNs) place priors over the weights \(\mathbf{W}\) of a neural network and compute (or approximate) the posterior \(p(\mathbf{W}|\mathcal{D})\). This gives uncertainty-aware predictions — crucial for safety-critical applications like medical diagnosis and autonomous driving.
This integral is intractable for neural networks — too many weights, non-conjugate. Approximations: (1) Laplace approximation: Gaussian centered at MAP estimate. (2) Mean-field VI (Bayes by Backprop). (3) MC Dropout: use dropout at test time as approximate posterior sampling. (4) Deep Ensembles: train K networks with different random seeds.
MC Dropout — Practical BNNs
Gal & Ghahramani (2016) showed that a neural network with dropout trained by minimizing cross-entropy is mathematically equivalent to approximate Bayesian inference in a deep Gaussian process. The approximation: apply dropout at test time and compute \(T\) stochastic forward passes:
Epistemic uncertainty (model uncertainty) is high where training data is sparse. Aleatoric uncertainty (data noise) is irreducible. MC Dropout addresses the first: the mean of the T predictions is the estimate, and their variance captures epistemic uncertainty. For example, a chest X-ray classifier could use MC Dropout to produce uncertainty maps alongside Grad-CAM explanations.
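The mechanics of the T stochastic passes can be sketched without a deep learning framework. The network below is a stand-in with random weights, not a trained model, so the numbers carry no meaning. The sketch only shows how dropout is kept active at test time and how the mean and variance are formed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained network: a tiny MLP with random weights (hypothetical;
# in practice these would come from training with dropout).
W1, b1 = rng.normal(size=(1, 64)), np.zeros(64)
W2, b2 = rng.normal(size=(64, 1)) / 8.0, np.zeros(1)

def forward_with_dropout(x, drop_prob, rng):
    """One stochastic forward pass with dropout kept ON at test time."""
    h = np.maximum(0.0, x @ W1 + b1)                   # ReLU hidden layer
    mask = rng.uniform(size=h.shape) >= drop_prob      # Bernoulli keep mask
    h = h * mask / (1.0 - drop_prob)                   # inverted dropout scaling
    return h @ W2 + b2

def mc_dropout_predict(x, T=200, drop_prob=0.5, rng=rng):
    """Predictive mean and variance over T stochastic passes."""
    preds = np.stack([forward_with_dropout(x, drop_prob, rng) for _ in range(T)])
    return preds.mean(axis=0), preds.var(axis=0)

x = np.array([[0.5], [3.0]])          # two query inputs
mean, var = mc_dropout_predict(x)
```

In a real framework the same idea amounts to leaving the dropout layers in training mode during inference and repeating the forward pass T times.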
16
Mental Model
The Complete Mental Model
// everything connected · one framework · the probabilistic view of ML
Fig 2. Complete Bayesian ML mental map — from Bayes' theorem through inference methods to every ML connection.
The complete unified thread:
Probability is belief. Parameters are uncertain quantities with distributions, not fixed unknowns. Every uncertain quantity — parameters, predictions, model structure — gets a probability.
Bayes' theorem is the update rule: \(p(\theta|\mathcal{D}) \propto p(\mathcal{D}|\theta)\,p(\theta)\). Prior belief times likelihood equals posterior belief (up to normalization). This is the only learning rule consistent with probability theory.
The prior encodes knowledge. Gaussian prior = L2 regularization. Laplace prior = L1 regularization. Horseshoe prior = sparsity-inducing. The Bayesian framework reveals what regularization really is: a prior on the parameters.
Conjugate priors give exact closed-form posteriors. Beta-Binomial. Gaussian-Gaussian. Dirichlet-Categorical. These arise from the exponential family structure and make sequential updating trivially simple.
MAP estimation is the mode of the posterior — equivalent to regularized maximum likelihood. Full Bayesian inference keeps the entire posterior, propagating uncertainty into every downstream quantity.
Predictive distributions integrate out parameter uncertainty: \(p(y_*|\mathcal{D}) = \int p(y_*|\theta)p(\theta|\mathcal{D})\,d\theta\). This gives honest uncertainty — wider where data is sparse.
The evidence \(p(\mathcal{D}|M)\) measures how well a model predicts data on average. Maximizing it for model selection automatically implements Occam's razor without a penalty term.
When exact inference fails: Variational inference (minimize KL, maximize ELBO), MCMC (Metropolis-Hastings, HMC/NUTS), Laplace approximation, or MC Dropout. Many deep learning tricks have a Bayesian interpretation.
The Probabilistic View of Machine Learning
Every ML model is implicitly a probabilistic model — of the data-generating process, of the noise, of the parameters. Ridge regression is MAP inference with a Gaussian prior. L1 regularization uses a Laplace prior. Dropout is approximate variational inference. Deep ensembles approximate the posterior predictive. The VAE ELBO is the Bayesian evidence lower bound. Understanding the Bayesian framework doesn't add a separate layer of complexity to ML — it reveals the probabilistic assumptions already embedded in every algorithm you've used, and gives you the tools to reason about them explicitly, improve them, and quantify the uncertainty they produce.