Bayesian ML | Shahid Ul Islam

01

Philosophy

Frequentist vs Bayesian Thinking

// two schools · probability as frequency vs degree of belief · what parameters mean

The deepest divide in statistics is not methodological — it is philosophical. It concerns what probability is, what parameters are, and what it means to make an inference.

The Two Philosophies

Frequentist View

Probability is the long-run frequency of an event in infinitely repeated identical experiments. \(P(A) = \lim_{n\to\infty}\frac{\text{occurrences of }A}{n}\).

Parameters are fixed, unknown constants. They have a true value; they are not random. It is meaningless to write \(P(\theta)\).

Inference: estimate the fixed parameter from data. Report confidence intervals: over repeated experiments, \(100(1-\alpha)\)% of such intervals will contain the true value.

Example: "The coin has a true fixed bias \(p\). After 100 flips, my estimate is \(\hat{p} = 0.52\)."

Bayesian View

Probability is a degree of belief — a subjective state of knowledge about an event. Any uncertain quantity can have a probability, whether or not it can be repeated.

Parameters are uncertain quantities. They have a probability distribution \(P(\theta)\) reflecting our state of knowledge. They are random variables.

Inference: update beliefs using Bayes' theorem. The posterior \(P(\theta|X)\) is our updated belief after seeing data.

Example: "Before flipping, I believe the bias is near 0.5. After 100 flips, I update: posterior has mean 0.52, concentrated near 0.5."

Concrete Consequences

Question	Frequentist Answer	Bayesian Answer
"What is the probability that θ > 0.5?"	Undefined. θ is fixed, not random.	\(P(\theta > 0.5 \mid \text{data}) = \int_{0.5}^1 p(\theta \mid \text{data})\,d\theta\); directly answerable.
How to incorporate domain knowledge?	Through the choice of estimator (ad hoc).	Through the prior \(P(\theta)\) — principled.
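Small sample inference?	Asymptotic approximations can be unreliable.	Prior regularizes; the posterior is well-defined at any sample size (given a proper prior), though it is more prior-sensitive when data are scarce.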
Prediction for new \(x_*\)?	Plug in \(\hat{\theta}\): \(p(y_* \mid x_*, \hat{\theta})\)	Integrate over uncertainty: \(\int p(y_* \mid x_*,\theta)\,p(\theta \mid \text{data})\,d\theta\)
Model comparison?	Hypothesis tests, AIC/BIC (approximate)	Bayes factors: exact ratio of model evidences
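As a rough illustration of the first row of the table, the sketch below approximates \(P(\theta > 0.5 \mid \text{data})\) by midpoint integration. It assumes a uniform Beta(1,1) prior and 52 heads in 100 flips (illustrative numbers chosen to match the coin example above); the helper names `log_beta_pdf` and `posterior_prob_greater` are hypothetical.

```python
import math

def log_beta_pdf(theta, a, b):
    # Log density of Beta(a, b) at theta in (0, 1)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)

def posterior_prob_greater(threshold, a, b, steps=20000):
    # Midpoint-rule approximation of the Beta(a, b) mass on (threshold, 1)
    width = (1.0 - threshold) / steps
    total = 0.0
    for i in range(steps):
        theta = threshold + (i + 0.5) * width
        total += math.exp(log_beta_pdf(theta, a, b)) * width
    return total

# Assumed data: 52 heads in 100 flips, uniform Beta(1, 1) prior -> Beta(53, 49) posterior
prob = posterior_prob_greater(0.5, 1 + 52, 1 + 48)
print(f"P(theta > 0.5 | data) ~= {prob:.3f}")
```

The same question has no answer in the frequentist framing, because there θ is not a random quantity.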

The Bayesian Advantage in ML

Many ML methods have a Bayesian reading: L2 regularization corresponds to a Gaussian prior; L1 regularization corresponds to a Laplace prior; early stopping acts like an implicit prior; dropout can be interpreted as approximate variational inference. Understanding the Bayesian framework reveals what these methods are really doing and gives a principled basis for designing new ones.

02

Probability

Probability Foundations

// conditional probability · product rule · sum rule · marginalization

Bayesian inference is built on three rules of probability. Everything else is algebra.

The Three Fundamental Rules

The Sum Rule and Product Rule · Foundation

\[\text{Product Rule:}\quad P(A,B) = P(A|B)\,P(B) = P(B|A)\,P(A)\] \[\text{Sum Rule:}\quad P(A) = \sum_B P(A,B) = \sum_B P(A|B)\,P(B)\quad\text{(marginalization)}\] \[\text{Bayes' Theorem:}\quad P(A|B) = \frac{P(B|A)\,P(A)}{P(B)} \quad\text{(product rule applied twice)}\]

These three rules are all you need. The rest of Bayesian inference is applying them to increasingly complex models. Marginalization (the sum rule) is often the computationally intractable step that drives approximate inference methods.
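The three rules can be traced through a small discrete example. The numbers below (1% prevalence, 95% sensitivity, 5% false-positive rate) are assumptions chosen only for illustration.

```python
# Assumed illustrative numbers: 1% prevalence, 95% sensitivity, 5% false-positive rate
p_disease = 0.01
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.05

# Product rule: joint probabilities P(test+, state) = P(test+ | state) P(state)
joint_pos_disease = p_pos_given_disease * p_disease
joint_pos_healthy = p_pos_given_healthy * (1 - p_disease)

# Sum rule: marginalize the state out to get P(test+)
p_pos = joint_pos_disease + joint_pos_healthy

# Bayes' theorem: P(disease | test+) = P(test+ | disease) P(disease) / P(test+)
p_disease_given_pos = joint_pos_disease / p_pos
print(f"P(test+) = {p_pos:.4f}, P(disease | test+) = {p_disease_given_pos:.4f}")
```

Even with an accurate test, the posterior stays well below one half here because the prior probability of disease is small.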

Continuous Distributions

Continuous Bayes' Theorem · Key Form

\[p(\theta | \mathcal{D}) = \frac{p(\mathcal{D}|\theta)\,p(\theta)}{p(\mathcal{D})} = \frac{p(\mathcal{D}|\theta)\,p(\theta)}{\int p(\mathcal{D}|\theta')\,p(\theta')\,d\theta'}\]

p(θ|D) — posterior. p(D|θ) — likelihood. p(θ) — prior. p(D) — evidence/marginal likelihood. The denominator is the "normalizing constant" that ensures the posterior integrates to 1. It's often written as the proportionality: p(θ|D) ∝ p(D|θ)·p(θ).

The ∝ Shortcut

Since \(p(\mathcal{D})\) doesn't depend on \(\theta\), we almost always write \(p(\theta|\mathcal{D}) \propto p(\mathcal{D}|\theta)\,p(\theta)\). We only need to compute \(p(\mathcal{D})\) if we need a proper normalized distribution or want to compare models. For finding the posterior mode (MAP) or its shape, the proportionality is sufficient.
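A minimal grid sketch of the shortcut, assuming a flat prior and 7 heads in 10 flips (both illustrative): the mode is read directly from the unnormalized values, while the evidence is needed only to turn them into a density.

```python
def unnormalized_posterior(theta, k, n):
    # Bernoulli likelihood theta^k (1 - theta)^(n - k) times a flat prior (assumed)
    return theta**k * (1 - theta) ** (n - k)

k, n = 7, 10                                      # assumed data: 7 heads in 10 flips
grid = [(i + 0.5) / 1000 for i in range(1000)]    # midpoints of 1000 cells on (0, 1)
weights = [unnormalized_posterior(t, k, n) for t in grid]

# The mode needs only the unnormalized values
theta_map = grid[max(range(len(grid)), key=weights.__getitem__)]

# Normalizing requires the evidence p(D), here a one-dimensional Riemann sum
evidence = sum(weights) / len(grid)
density = [w / evidence for w in weights]
print(theta_map, evidence)
```

In one dimension the evidence is cheap to compute; the difficulty discussed later arises because this sum becomes a high-dimensional integral.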

03

Core Theorem

Bayes' Theorem — Derivation & Interpretation

// from product rule · the four terms · belief update · prior × likelihood = posterior

Derivation from First Principles

// Deriving Bayes' theorem from the product rule alone

1. Product rule applied to \(P(\theta, \mathcal{D})\): \[P(\theta, \mathcal{D}) = P(\mathcal{D}|\theta)\,P(\theta)\]

2. Product rule applied the other way: \[P(\theta, \mathcal{D}) = P(\theta|\mathcal{D})\,P(\mathcal{D})\]

3. Equate (both equal the joint): \[P(\theta|\mathcal{D})\,P(\mathcal{D}) = P(\mathcal{D}|\theta)\,P(\theta)\]

4. Divide both sides by \(P(\mathcal{D}) > 0\): \[\boxed{P(\theta|\mathcal{D}) = \frac{P(\mathcal{D}|\theta)\,P(\theta)}{P(\mathcal{D})}}\]

That's it. Two applications of the product rule. The entire Bayesian framework follows from this. ∎

The Belief Update Visual

Prior · P(θ): What we believed before seeing any data. Can be informative (domain knowledge) or diffuse (ignorance).

× Likelihood · P(D|θ): How well each θ explains the observed data. It is largest at the points in parameter space under which the observed data are most probable.

÷ Evidence · P(D): The average likelihood under the prior. Ensures the posterior is a proper distribution.

= Posterior · P(θ|D): Updated beliefs after seeing data. Combines prior knowledge with evidence from data.

Fig 1. Bayesian update. Broad prior (uncertain) × peaked likelihood (data informative) = narrower posterior (updated belief). The posterior is a compromise: pulled toward the likelihood, regularized by the prior.

04

Prior

The Prior Distribution

// informative vs diffuse · improper priors · Jeffreys prior · prior elicitation

The prior \(p(\theta)\) encodes everything we believe about \(\theta\) before seeing data. Choosing a prior is both the most criticized and most valuable part of Bayesian inference.

Types of Priors

Informative Prior

θ ~ N(μ₀, σ₀²) with small σ₀

Encodes genuine domain knowledge. A medical expert's belief about drug efficacy. Strong priors require justification but can dramatically improve inference with small data.

Weakly Informative

θ ~ N(0, 10²) or Cauchy(0,2.5)

Encodes soft constraints: "θ is probably in a reasonable range." Recommended default in modern Bayesian practice (Gelman et al.). Prevents wild extrapolations without being dogmatic.

Diffuse / Uninformative

θ ~ Uniform(−∞, ∞)

Attempts to express "no information." Often improper (doesn't integrate to 1). Can cause issues: uniform prior on θ ≠ uniform prior on θ². The concept of "no information" is not well-defined.

Jeffreys Prior

p(θ) ∝ √det(I(θ))

Invariant under reparametrization — the "objective" prior. Based on the Fisher information matrix \(\mathcal{I}(\theta)\). Gives the same inference regardless of how you parametrize the model.

Hierarchical Prior

θᵢ ~ p(θ|φ), φ ~ p(φ)

Prior parameters (hyperparameters) are themselves given priors. Allows the data to determine how strong the regularization should be. Foundation of Bayesian hierarchical models.

Empirical Bayes

φ̂ = argmax P(D|φ)

Estimate hyperparameters from the data itself. Not fully Bayesian (uses data twice) but computationally tractable. Commonly used to fit Gaussian process hyperparameters by maximizing the marginal likelihood (type-II maximum likelihood).

Priors as Regularization

Prior ↔ Regularizer Correspondence · ML Connection

\[\text{MAP objective:} \quad \log p(\theta|\mathcal{D}) = \underbrace{\log p(\mathcal{D}|\theta)}_{\text{log-likelihood}} + \underbrace{\log p(\theta)}_{\text{log-prior}} + \text{const}\] \[\theta \sim \mathcal{N}(0, \sigma^2\mathbf{I}) \quad\Rightarrow\quad \log p(\theta) = -\frac{1}{2\sigma^2}\|\theta\|^2 + \text{const} \quad\Leftrightarrow\quad \text{L2 regularization}\] \[\theta_j \sim \text{Laplace}(0, b) \quad\Rightarrow\quad \log p(\theta) = -\frac{1}{b}\|\theta\|_1 + \text{const} \quad\Leftrightarrow\quad \text{L1 regularization}\]

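Many regularized ML objectives can be read as MAP estimation under some prior. Written as a penalty added to the negative log-likelihood, the Gaussian prior contributes \(\frac{1}{2\sigma^2}\|\theta\|^2\) and the Laplace prior contributes \(\frac{1}{b}\|\theta\|_1\), so the regularization strength λ is proportional to 1/σ² (Gaussian) or 1/b (Laplace); the exact constant depends on how the loss is scaled. The Bayesian framework reveals the probabilistic assumptions implicit in regularization choices.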
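The correspondence can be checked numerically on a toy problem. The sketch below assumes a scalar regression through the origin with Gaussian noise of variance `sigma2` and a Gaussian prior of variance `tau2` (all data and values are made up); it compares a gradient-descent MAP estimate with the closed-form ridge solution at λ = sigma2 / tau2.

```python
def neg_log_posterior_grad(theta, xs, ys, sigma2, tau2):
    # Gradient of sum_i (y_i - theta x_i)^2 / (2 sigma2) + theta^2 / (2 tau2)
    data_term = sum(-(y - theta * x) * x for x, y in zip(xs, ys)) / sigma2
    prior_term = theta / tau2
    return data_term + prior_term

def map_by_gradient_descent(xs, ys, sigma2, tau2, lr=0.01, steps=5000):
    # Plain gradient descent on the negative log-posterior
    theta = 0.0
    for _ in range(steps):
        theta -= lr * neg_log_posterior_grad(theta, xs, ys, sigma2, tau2)
    return theta

def ridge_closed_form(xs, ys, lam):
    # Minimizer of sum_i (y_i - theta x_i)^2 + lam * theta^2
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [0.5, 1.0, 1.5, 2.0]          # assumed toy data
ys = [1.1, 1.9, 3.2, 3.9]
sigma2, tau2 = 0.25, 1.0           # assumed noise and prior variances

theta_map = map_by_gradient_descent(xs, ys, sigma2, tau2)
theta_ridge = ridge_closed_form(xs, ys, lam=sigma2 / tau2)
print(theta_map, theta_ridge)
```

Note that the matching λ involves the noise variance as well as the prior variance, because the squared-error loss here is not divided by 2σ².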

05

Likelihood

The Likelihood Function

// P(data|θ) · as a function of θ · likelihood principle · sufficient statistics

The likelihood \(p(\mathcal{D}|\theta)\) is the same mathematical expression as the data distribution — but with a crucial conceptual inversion: we fix \(\mathcal{D}\) (observed) and vary \(\theta\). It is not a probability distribution over \(\theta\).

Likelihood for i.i.d. Data · Standard Form

\[\mathcal{L}(\theta) = p(\mathcal{D}|\theta) = \prod_{i=1}^{n}p(x_i|\theta)\] \[\ell(\theta) = \log\mathcal{L}(\theta) = \sum_{i=1}^{n}\log p(x_i|\theta)\]

The i.i.d. assumption converts the joint probability of all data into a product of individual probabilities. Log-likelihood converts this to a sum — numerically stable and analytically convenient. MLE maximizes ℓ(θ). Bayesian inference multiplies the likelihood by the prior.
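The numerical-stability point is easy to see on a long Bernoulli sequence. The data below (2000 flips, half heads) are an assumption for illustration.

```python
import math

def likelihood_product(flips, p):
    # Direct product of Bernoulli probabilities; underflows for long sequences
    result = 1.0
    for x in flips:
        result *= p if x == 1 else (1 - p)
    return result

def log_likelihood(flips, p):
    # Sum of log-probabilities; stays finite for any sample size
    return sum(math.log(p) if x == 1 else math.log(1 - p) for x in flips)

flips = [1, 0] * 1000                    # assumed data: 2000 flips, half heads
print(likelihood_product(flips, 0.5))    # underflows to 0.0 in double precision
print(log_likelihood(flips, 0.5))        # 2000 * log(0.5), about -1386.29
```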

The Likelihood Principle

All evidence about \(\theta\) from data \(\mathcal{D}\) is contained in the likelihood function \(\mathcal{L}(\theta) = p(\mathcal{D}|\theta)\). Two experiments with proportional likelihoods provide the same evidence about \(\theta\), even if the experimental designs were different. Bayesian inference automatically satisfies this principle; frequentist methods generally do not.

Common Likelihood Functions in ML

Data Type	Model	Likelihood \(p(y \mid \theta, x)\)	MLE Estimate
Continuous target	\(y = f_\theta(x) + \epsilon, \epsilon\sim\mathcal{N}(0,\sigma^2)\)	\(\mathcal{N}(y; f_\theta(x), \sigma^2)\)	Minimize MSE
Binary classification	Logistic: \(p = \sigma(f_\theta(x))\)	\(\text{Bern}(y; \sigma(f_\theta(x)))\)	Minimize BCE
Multi-class	Softmax: \(\mathbf{p} = \text{softmax}(f_\theta(x))\)	\(\text{Cat}(y; \text{softmax}(f_\theta))\)	Minimize CCE
Count data	Poisson regression	\(\text{Poi}(y; \exp(f_\theta(x)))\)	Minimize neg log-Poisson
Coin flips	Bernoulli(\(p\))	\(p^k(1-p)^{n-k}\)	\(\hat{p} = k/n\)

06

Posterior

The Posterior Distribution

// the complete answer · what it contains · point estimates vs full posterior

The posterior \(p(\theta|\mathcal{D})\) is the Bayesian answer to every question. It is not a single number — it is a complete probability distribution over parameters, encoding all we know after observing data.

The Posterior Contains Everything · Complete Answer

\[p(\theta|\mathcal{D}) = \frac{p(\mathcal{D}|\theta)\,p(\theta)}{p(\mathcal{D})}, \quad p(\mathcal{D}) = \int p(\mathcal{D}|\theta)\,p(\theta)\,d\theta\] \[\text{Posterior mean: } \mathbb{E}[\theta|\mathcal{D}] = \int\theta\,p(\theta|\mathcal{D})\,d\theta\] \[\text{Posterior mode (MAP): } \hat{\theta}_{\text{MAP}} = \arg\max_\theta p(\theta|\mathcal{D})\] \[\text{Posterior variance: } \text{Var}[\theta|\mathcal{D}] = \mathbb{E}[\theta^2|\mathcal{D}] - (\mathbb{E}[\theta|\mathcal{D}])^2\] \[\text{Credible interval: } P(\theta \in [a,b]|\mathcal{D}) = \int_a^b p(\theta|\mathcal{D})\,d\theta = 1-\alpha\]

The 95% Bayesian credible interval [a,b] directly means "given the observed data, θ lies in [a,b] with probability 95%." This is what most people intuitively want from a confidence interval — but frequentist CIs don't actually mean this.
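When posterior draws are available, these summaries reduce to simple sample statistics. The sketch below assumes a Beta(53, 49) posterior (52 heads in 100 flips under a uniform prior, illustrative only) and draws from it with the standard library.

```python
import random
import statistics

random.seed(0)
# Assumed posterior: Beta(53, 49), i.e. 52 heads in 100 flips under a uniform prior
draws = sorted(random.betavariate(53, 49) for _ in range(20000))

post_mean = statistics.fmean(draws)
post_var = statistics.fmean((d - post_mean) ** 2 for d in draws)

# Equal-tailed 95% credible interval from the 2.5% and 97.5% sample quantiles
lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws))]
print(f"mean={post_mean:.3f}, var={post_var:.5f}, 95% CI=[{lower:.3f}, {upper:.3f}]")
```

The interval is equal-tailed; a highest-density interval is another common choice and can differ for skewed posteriors.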

The Intractability Problem

The denominator \(p(\mathcal{D}) = \int p(\mathcal{D}|\theta)p(\theta)\,d\theta\) is a high-dimensional integral — analytically intractable for most models. This is the central computational challenge in Bayesian inference. Solutions: conjugate priors (exact, §08), Laplace approximation (Gaussian approximation), variational inference (§13), or MCMC sampling (§14).

07

Estimation

MAP vs MLE — The Complete Comparison

// maximum a posteriori · regularization equivalence · when MAP = MLE · asymptotic behavior

MLE vs MAP: Side by Side · Comparison

\[\hat{\theta}_{\text{MLE}} = \arg\max_\theta\,p(\mathcal{D}|\theta) = \arg\max_\theta\,\ell(\theta)\] \[\hat{\theta}_{\text{MAP}} = \arg\max_\theta\,p(\theta|\mathcal{D}) = \arg\max_\theta\,[\ell(\theta) + \log p(\theta)]\]

The MAP objective is the MLE objective plus the log-prior. When the prior is uniform, MAP = MLE. MAP uses the mode of the posterior: a single point estimate that ignores the rest of the distribution. Full Bayesian inference keeps the entire posterior rather than collapsing it to a point.

Asymptotic Equivalence

As \(n \to \infty\), both MLE and MAP converge to the true parameter (under regularity conditions). The prior is "washed out" by the likelihood:

Asymptotic Behavior · Large n

\[p(\theta|\mathcal{D}) \xrightarrow{n\to\infty} \mathcal{N}(\hat{\theta}_{\text{MLE}},\, \mathcal{I}_n(\hat{\theta})^{-1}) \quad\text{(Bernstein-von Mises theorem)}\]

For large enough data, the posterior concentrates around the MLE regardless of the prior (as long as the prior assigns positive probability to a neighbourhood of the true parameter θ*). The log-prior contributes O(1) to the log-posterior while the log-likelihood grows as O(n). Prior choice matters most with small data.

Property	MLE	MAP	Full Bayes
Output	Point estimate \(\hat{\theta}\)	Point estimate \(\hat{\theta}_{\text{MAP}}\)	Full distribution \(p(\theta\|\mathcal{D})\)
Prior used?	No (uniform implicit)	Yes (as regularizer)	Yes (fully)
Uncertainty quantified?	Via asymptotic SE	Partially (Laplace approx)	Fully
Regularization?	No	Yes (= log-prior)	Natural (averaging)
Overfitting risk?	High (large models)	Lower	Lowest (averaging)
Computational cost	Low	Low	High
Typical use	Deep learning, large data	Regularized ML models	Probabilistic ML, small data
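The washing-out of the prior can be seen with the coin model, where both estimates have closed forms. The sketch assumes a Beta(10, 10) prior and data with 70% heads at every sample size (illustrative values).

```python
def mle(k, n):
    # Maximum likelihood estimate of the coin bias
    return k / n

def map_estimate(k, n, alpha, beta):
    # Mode of the Beta(alpha + k, beta + n - k) posterior (valid when both parameters exceed 1)
    return (alpha + k - 1) / (alpha + beta + n - 2)

alpha, beta = 10, 10               # assumed strong prior centered at 0.5
gaps = []
for n in (10, 100, 10000):
    k = 7 * n // 10                # assumed data: 70% heads at every sample size
    gaps.append(abs(map_estimate(k, n, alpha, beta) - mle(k, n)))
print(gaps)                        # the MAP-MLE gap shrinks as n grows
```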

08

Conjugacy

Conjugate Priors — Full Treatment

// closed-form posteriors · exponential family · Beta-Binomial · Gaussian-Gaussian · full derivation

A prior \(p(\theta)\) is conjugate to a likelihood \(p(\mathcal{D}|\theta)\) if the posterior \(p(\theta|\mathcal{D})\) has the same distributional form as the prior. Conjugacy gives exact, closed-form posteriors — no approximation needed.

Beta-Binomial: The Canonical Example

Model: \(n\) coin flips with \(k\) heads, unknown bias \(p\in[0,1]\).

// Full Beta-Binomial conjugate derivation

Prior

\[p \sim \text{Beta}(\alpha, \beta) \propto p^{\alpha-1}(1-p)^{\beta-1}\]

α = prior number of "successes", β = prior number of "failures". α=β=1 is the uniform prior. α=β=10 is a strong prior centered at 0.5.

Likelihood

\[p(\mathcal{D}|p) = \binom{n}{k}p^k(1-p)^{n-k} \propto p^k(1-p)^{n-k}\]

Posterior

\[p(p|\mathcal{D}) \propto p^{\alpha-1}(1-p)^{\beta-1} \cdot p^k(1-p)^{n-k} = p^{(\alpha+k)-1}(1-p)^{(\beta+n-k)-1}\]

Result

\[\boxed{p|\mathcal{D} \sim \text{Beta}(\alpha+k,\; \beta+n-k)}\]

The posterior is a Beta with updated counts. The prior parameters α, β act as "pseudo-counts" of prior observations. This is exact — no approximation. Posterior mean = (α+k)/(α+β+n).

Interpreting the Beta Parameters

The Beta posterior \(\text{Beta}(\alpha+k, \beta+n-k)\) has a beautiful interpretation: \(\alpha+k\) is the total number of heads (prior pseudo-counts + observed heads); \(\beta+n-k\) is the total number of tails. The posterior mean \(\hat{p} = (\alpha+k)/(\alpha+\beta+n)\) interpolates between the prior mean \(\alpha/(\alpha+\beta)\) and the MLE \(k/n\), weighted by how many observations each is based on.
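The interpolation can be verified directly. The prior pseudo-counts and data below are assumptions for illustration, and `beta_binomial_update` is a hypothetical helper name.

```python
def beta_binomial_update(alpha, beta, k, n):
    # Conjugate update: add observed heads to alpha and observed tails to beta
    return alpha + k, beta + (n - k)

alpha, beta = 10, 10               # assumed prior pseudo-counts
k, n = 7, 10                       # assumed data: 7 heads in 10 flips
a_post, b_post = beta_binomial_update(alpha, beta, k, n)

posterior_mean = a_post / (a_post + b_post)
prior_mean = alpha / (alpha + beta)
weight = (alpha + beta) / (alpha + beta + n)   # share of "observations" held by the prior
interpolated = weight * prior_mean + (1 - weight) * (k / n)
print(posterior_mean, interpolated)
```

With 20 pseudo-counts against 10 real flips, the prior carries two thirds of the weight, so the posterior mean sits closer to 0.5 than to 0.7.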

Gaussian-Gaussian: Known Variance

Observe \(x_1,\ldots,x_n \overset{iid}{\sim} \mathcal{N}(\mu, \sigma^2)\) with known \(\sigma^2\), unknown \(\mu\).

Gaussian Conjugate Prior: Posterior · Key Result

\[\text{Prior: } \mu \sim \mathcal{N}(\mu_0, \tau_0^2)\] \[\text{Posterior: } \mu|\mathbf{x} \sim \mathcal{N}(\mu_n, \tau_n^2)\] \[\tau_n^2 = \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}\right)^{-1} \qquad \mu_n = \tau_n^2\left(\frac{\mu_0}{\tau_0^2} + \frac{n\bar{x}}{\sigma^2}\right)\]

The posterior precision (1/τₙ²) = prior precision + data precision. The posterior mean is a precision-weighted average of prior mean μ₀ and sample mean x̄. As n→∞: μₙ → x̄ (data dominates). As τ₀²→∞ (diffuse prior): μₙ → x̄ and τₙ² → σ²/n (same as frequentist).
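A direct transcription of the update formulas, as a sketch. The observations and hyperparameters are made up, and `gaussian_posterior` is a hypothetical helper name.

```python
import statistics

def gaussian_posterior(xs, sigma2, mu0, tau0_sq):
    # Posterior N(mu_n, tau_n^2) for the mean of N(mu, sigma2) data with known sigma2
    n = len(xs)
    xbar = statistics.fmean(xs)
    precision = 1.0 / tau0_sq + n / sigma2          # prior precision + data precision
    tau_n_sq = 1.0 / precision
    mu_n = tau_n_sq * (mu0 / tau0_sq + n * xbar / sigma2)
    return mu_n, tau_n_sq

xs = [4.8, 5.1, 5.3, 4.9, 5.4]     # assumed observations
mu_n, tau_n_sq = gaussian_posterior(xs, sigma2=1.0, mu0=0.0, tau0_sq=4.0)

# A very diffuse prior recovers the frequentist answer: mean xbar, variance sigma2 / n
mu_flat, tau_flat_sq = gaussian_posterior(xs, sigma2=1.0, mu0=0.0, tau0_sq=1e12)
print(mu_n, tau_n_sq, mu_flat, tau_flat_sq)
```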

The Exponential Family and Conjugacy

Exponential Family: Universal Conjugate Structure · General Theory

\[p(x|\eta) = h(x)\exp(\eta^T T(x) - A(\eta)) \quad\text{(exponential family)}\] \[\text{Conjugate prior: } p(\eta|\chi,\nu) \propto \exp(\eta^T\chi - \nu A(\eta))\] \[\text{Posterior: } p(\eta|\mathcal{D}) = p(\eta|\chi + \textstyle\sum_i T(x_i),\; \nu+n)\]

Every exponential family distribution has a conjugate prior — and the posterior update is simply adding the sufficient statistics Σᵢ T(xᵢ) to the prior hyperparameter χ, and incrementing ν by n. This is why conjugacy and the exponential family are inseparably linked.

Complete Conjugate Prior Table

Likelihood	Parameter	Conjugate Prior	Posterior Update
Binomial\((n,p)\)	\(p\)	Beta\((\alpha,\beta)\)	Beta\((\alpha+k,\beta+n-k)\)
Poisson\((\lambda)\)	\(\lambda\)	Gamma\((\alpha,\beta)\)	Gamma\((\alpha+\sum x_i,\beta+n)\)
Normal\((\mu,\sigma^2)\) known \(\sigma^2\)	\(\mu\)	\(\mathcal{N}(\mu_0,\tau_0^2)\)	\(\mathcal{N}(\mu_n,\tau_n^2)\) as above
Normal\((\mu,\sigma^2)\) known \(\mu\)	\(\sigma^2\)	Inverse-Gamma\((\alpha,\beta)\)	IG\((\alpha+n/2, \beta+\frac{1}{2}\sum(x_i-\mu)^2)\)
Categorical\((\mathbf{p})\)	\(\mathbf{p}\)	Dirichlet\((\boldsymbol{\alpha})\)	Dirichlet\((\boldsymbol{\alpha}+\mathbf{n})\)
Multinomial\((n,\mathbf{p})\)	\(\mathbf{p}\)	Dirichlet\((\boldsymbol{\alpha})\)	Dirichlet\((\boldsymbol{\alpha}+\mathbf{n})\)
Normal (both unknown)	\((\mu,\sigma^2)\)	Normal-Inverse-Gamma	NIG with updated hyperparams

09

Sequential

Bayesian Updating — Sequential Inference

// online learning · today's posterior is tomorrow's prior · order invariance

Bayesian inference is naturally sequential. Today's posterior becomes tomorrow's prior when new data arrives. This is the Bayesian version of online learning.

Sequential Bayesian Update · Online Learning

\[p(\theta|\mathcal{D}_{1:t}) \propto p(x_t|\theta)\cdot p(\theta|\mathcal{D}_{1:t-1})\] \[\text{i.e.} \quad \underbrace{p(\theta|\mathcal{D}_{1:t})}_{\text{new posterior}} \propto \underbrace{p(x_t|\theta)}_{\text{new data}} \times \underbrace{p(\theta|\mathcal{D}_{1:t-1})}_{\text{previous posterior as new prior}}\]

The result is identical whether you process data all at once (batch) or one point at a time (sequential). Bayesian inference is order-invariant: p(θ|x₁,x₂) = p(θ|x₂,x₁) for i.i.d. data. This is the mathematical basis for online/streaming Bayesian learning.

Beta-Binomial Sequential Update

Start: \(\text{Beta}(1,1)\) (uniform prior). Observe H, T, H, H, T. Updates: \(\text{Beta}(2,1) \to \text{Beta}(2,2) \to \text{Beta}(3,2) \to \text{Beta}(4,2) \to \text{Beta}(4,3)\). Final posterior: \(\text{Beta}(4,3)\), same as if we'd processed all 5 flips at once with \(k=3, n=5\). The order doesn't matter; only the counts do.
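The update sequence above can be reproduced in a few lines; `update` is a hypothetical helper name.

```python
def update(alpha, beta, flip):
    # One-step conjugate update: heads increments alpha, tails increments beta
    return (alpha + 1, beta) if flip == "H" else (alpha, beta + 1)

flips = ["H", "T", "H", "H", "T"]
state = (1, 1)                     # Beta(1, 1) uniform prior
history = []
for flip in flips:
    state = update(*state, flip)   # the previous posterior serves as the new prior
    history.append(state)

k, n = flips.count("H"), len(flips)
batch = (1 + k, 1 + n - k)         # single batch update with the same counts
print(history, batch)
```

Shuffling `flips` changes the intermediate states in `history` but not the final one.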

10

Prediction

Predictive Distributions

// posterior predictive · marginalizing over parameters · uncertainty propagation

A key Bayesian advantage: instead of predicting with a single "best" parameter, we average over all parameter values weighted by their posterior probability. This naturally propagates parameter uncertainty into predictions.

Posterior Predictive Distribution · Full Prediction

\[p(x_{\text{new}}|\mathcal{D}) = \int p(x_{\text{new}}|\theta)\,p(\theta|\mathcal{D})\,d\theta = \mathbb{E}_{\theta|\mathcal{D}}\!\left[p(x_{\text{new}}|\theta)\right]\]

The posterior predictive marginalizes over parameter uncertainty. It is typically "wider" (more uncertain) than the plug-in prediction p(x_new|θ̂), which can be overconfident, especially for small data or complex models.

Prior Predictive Distribution

Prior Predictive · Before Data

\[p(x_{\text{new}}) = \int p(x_{\text{new}}|\theta)\,p(\theta)\,d\theta\]

This is how the model thinks data looks before seeing any observations. Useful for prior predictive checking: simulate data from the prior predictive and compare to domain knowledge. If the prior predictive generates physically impossible values (e.g., negative ages, probabilities > 1), your prior is wrong.

Beta-Binomial Predictive — Closed Form

Beta-Binomial Predictive Distribution · Exact

\[p(x_{\text{new}} = k | n_{\text{new}}, \mathcal{D}) = \binom{n_{\text{new}}}{k}\frac{B(\alpha+k,\beta+n_{\text{new}}-k)}{B(\alpha,\beta)}\]

where B(·,·) is the Beta function and (α,β) are the posterior parameters after updating on D. This is the Beta-Binomial distribution. Its variance is larger than a pure Binomial with p̂ because it accounts for uncertainty in p itself — it is "over-dispersed."
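The over-dispersion can be checked numerically. The sketch assumes a Beta(4, 3) posterior and 10 future flips (illustrative values) and evaluates the predictive in log space with `math.lgamma`.

```python
import math

def log_beta_fn(a, b):
    # log B(a, b) via log-gamma for numerical stability
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(k, n_new, alpha, beta):
    # Predictive probability of k heads in n_new future flips, given posterior Beta(alpha, beta)
    log_pmf = (math.log(math.comb(n_new, k))
               + log_beta_fn(alpha + k, beta + n_new - k)
               - log_beta_fn(alpha, beta))
    return math.exp(log_pmf)

alpha, beta, n_new = 4, 3, 10      # assumed posterior Beta(4, 3) and 10 future flips
pmf = [beta_binomial_pmf(k, n_new, alpha, beta) for k in range(n_new + 1)]
mean = sum(k * p for k, p in enumerate(pmf))
variance = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))

p_hat = alpha / (alpha + beta)     # plug-in estimate: the posterior mean
binomial_variance = n_new * p_hat * (1 - p_hat)
print(variance, binomial_variance)
```

Both distributions share the same mean; only the spread differs.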

11

Regression

Bayesian Linear Regression

// Gaussian prior on weights · posterior over functions · predictive intervals · evidence

Bayesian linear regression replaces a single weight vector with a distribution over weight vectors. The result is a distribution over functions — and a predictive distribution that honestly quantifies uncertainty.

Bayesian Linear Regression: Setup · Model

\[\mathbf{y} = \mathbf{X}\mathbf{w} + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\sigma^2\mathbf{I})\] \[\text{Prior: }\mathbf{w}\sim\mathcal{N}(\mathbf{0}, \tau^2\mathbf{I})\] \[\text{Posterior: }\mathbf{w}|\mathbf{y},\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}_w, \boldsymbol{\Sigma}_w)\] \[\boldsymbol{\Sigma}_w = \left(\frac{1}{\sigma^2}\mathbf{X}^T\mathbf{X} + \frac{1}{\tau^2}\mathbf{I}\right)^{-1} \qquad \boldsymbol{\mu}_w = \frac{1}{\sigma^2}\boldsymbol{\Sigma}_w\mathbf{X}^T\mathbf{y}\]

Note: μ_w = (X^T X + (σ²/τ²) I)^{-1} X^T y — this is exactly the Ridge regression solution with λ = σ²/τ²! Bayesian linear regression with Gaussian prior is Ridge regression in disguise. The posterior mean is the MAP; the full posterior quantifies the uncertainty.

Predictive Distribution for New Points

Bayesian Predictive Distribution · Uncertainty-Aware

\[p(y_*|\mathbf{x}_*, \mathbf{X}, \mathbf{y}) = \mathcal{N}(y_*;\; \boldsymbol{\mu}_w^T\mathbf{x}_*,\; \sigma^2 + \mathbf{x}_*^T\boldsymbol{\Sigma}_w\mathbf{x}_*)\]

The predictive variance has two components: σ² (irreducible observation noise) and x_*^T Σ_w x_* (parameter uncertainty). The second term grows where the training data is sparse, giving wider uncertainty estimates in extrapolation regions. This is the key advantage over simply predicting with the MLE ŵ.
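A compact NumPy sketch of the posterior and predictive formulas on synthetic data. The data-generating weights, noise level, and prior variance are all assumptions; `blr_posterior` and `blr_predict` are hypothetical helper names.

```python
import numpy as np

def blr_posterior(X, y, sigma2, tau2):
    # Posterior N(mu_w, Sigma_w) over weights under the prior N(0, tau2 * I)
    d = X.shape[1]
    Sigma_w = np.linalg.inv(X.T @ X / sigma2 + np.eye(d) / tau2)
    mu_w = Sigma_w @ X.T @ y / sigma2
    return mu_w, Sigma_w

def blr_predict(x_star, mu_w, Sigma_w, sigma2):
    # Predictive mean and variance: observation noise plus parameter uncertainty
    return mu_w @ x_star, sigma2 + x_star @ Sigma_w @ x_star

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=20)                 # assumed inputs, all inside [-1, 1]
X = np.column_stack([np.ones_like(x), x])           # bias column plus one feature
sigma2, tau2 = 0.1, 1.0                             # assumed noise and prior variances
y = X @ np.array([0.5, 2.0]) + rng.normal(0.0, np.sqrt(sigma2), size=20)

mu_w, Sigma_w = blr_posterior(X, y, sigma2, tau2)
ridge = np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(2), X.T @ y)

_, var_inside = blr_predict(np.array([1.0, 0.0]), mu_w, Sigma_w, sigma2)
_, var_outside = blr_predict(np.array([1.0, 5.0]), mu_w, Sigma_w, sigma2)
print(mu_w, ridge, var_inside, var_outside)
```

The posterior mean matches the ridge solution, and the predictive variance at x = 5 (far outside the training range) exceeds the variance at x = 0.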

12

Evidence

The Evidence & Bayesian Model Selection

// marginal likelihood · Bayes factors · Occam's razor · automatic complexity penalty

The marginal likelihood \(p(\mathcal{D}|M)\) — also called the evidence — measures how well a model predicts the data on average over all parameter values. It automatically penalizes overly complex models without a separate regularization procedure.

Model Evidence and Bayes Factors · Model Selection

\[p(\mathcal{D}|M_k) = \int p(\mathcal{D}|\theta, M_k)\,p(\theta|M_k)\,d\theta\] \[\text{Posterior over models: } P(M_k|\mathcal{D}) = \frac{p(\mathcal{D}|M_k)\,P(M_k)}{\sum_j p(\mathcal{D}|M_j)\,P(M_j)}\] \[\text{Bayes Factor: } BF_{12} = \frac{p(\mathcal{D}|M_1)}{p(\mathcal{D}|M_2)}\]

By a common rule of thumb, BF₁₂ > 10 is strong evidence for M₁ and BF₁₂ > 100 is decisive; BF₁₂ = 1 means no evidence either way. Unlike AIC/BIC, Bayes factors are exact (given the model and prior) and don't require asymptotic approximations, though they can be sensitive to the choice of prior. The evidence automatically embodies Occam's razor: simpler models that fit almost as well have higher evidence.

Automatic Occam's Razor

A complex model spreads its probability over a large hypothesis space — most of which the data doesn't look like. A simple model concentrates its probability on a smaller region. If the data falls in the simple model's region, the simple model wins on evidence. This is the Bayesian version of Occam's razor: given equal fit, prefer the simpler model. It falls out automatically from the mathematics, with no need to specify a penalty term.
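A small coin example makes the effect concrete. The two models here are assumptions chosen for illustration: M₁ fixes the bias at 0.5 (no free parameters), while M₂ places a uniform Beta(1, 1) prior on the bias.

```python
import math

def log_evidence_fair(k, n):
    # M1: p fixed at 0.5, so the evidence is just the binomial probability
    return math.log(math.comb(n, k)) + n * math.log(0.5)

def log_evidence_uniform_prior(k, n):
    # M2: p ~ Beta(1, 1); integrating p out gives C(n, k) * B(k + 1, n - k + 1) = 1 / (n + 1)
    log_beta = math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)
    return math.log(math.comb(n, k)) + log_beta

def bayes_factor_12(k, n):
    # Ratio of evidences p(D | M1) / p(D | M2), computed in log space
    return math.exp(log_evidence_fair(k, n) - log_evidence_uniform_prior(k, n))

print(bayes_factor_12(52, 100))   # data near 0.5: the simpler model M1 is favored (BF > 1)
print(bayes_factor_12(80, 100))   # data far from 0.5: the flexible model M2 is favored (BF < 1)
```

M₂ spreads its evidence evenly over every possible head count, so it loses to M₁ whenever the data land where M₁ concentrates its probability.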

13

Approx. Inference

Variational Inference & the ELBO

// intractable posterior · approximation family · KL minimization · mean-field VI

When the posterior is intractable, variational inference (VI) approximates it with a tractable distribution \(q_\phi(\theta)\) from a chosen family, by minimizing KL divergence.

Variational Inference: ELBO · VI Objective

\[\text{Goal: find } q_\phi(\theta) \approx p(\theta|\mathcal{D})\] \[\text{Minimize: } D_{\text{KL}}(q_\phi(\theta)\,\|\,p(\theta|\mathcal{D}))\] \[\log p(\mathcal{D}) = \underbrace{\mathbb{E}_q[\log p(\mathcal{D},\theta) - \log q_\phi(\theta)]}_{\text{ELBO}(\phi)} + D_{\text{KL}}(q_\phi\,\|\,p(\theta|\mathcal{D}))\] \[\therefore\quad \text{Maximizing ELBO} \equiv \text{Minimizing KL}(q_\phi \,\|\, p_{\text{posterior}})\]

Since KL ≥ 0: ELBO ≤ log p(D). The ELBO is a lower bound on the log evidence. Maximizing the ELBO simultaneously (1) makes q fit the posterior well, and (2) gives a lower bound on log p(D) useful for model comparison. ELBO = E_q[log likelihood] − KL(q ‖ prior). Used in VAEs, topic models, and scalable Bayesian deep learning.
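The decomposition log p(D) = ELBO + KL can be verified exactly when θ takes only a few values. The discrete parameter grid, data, and approximating distribution `q` below are all assumptions for illustration.

```python
import math

thetas = [0.2, 0.5, 0.8]           # assumed discrete parameter values
prior = [1 / 3, 1 / 3, 1 / 3]
k, n = 7, 10                       # assumed data: 7 heads in 10 flips

# Joint p(D, theta) = p(D | theta) p(theta), with a Bernoulli-sequence likelihood
joint = [t**k * (1 - t) ** (n - k) * p for t, p in zip(thetas, prior)]
evidence = sum(joint)
posterior = [j / evidence for j in joint]

q = [0.1, 0.5, 0.4]                # an arbitrary (assumed) approximating distribution

# ELBO = E_q[log p(D, theta) - log q(theta)];  KL = E_q[log q(theta) - log p(theta | D)]
elbo = sum(qi * (math.log(ji) - math.log(qi)) for qi, ji in zip(q, joint))
kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, posterior))
print(elbo, kl, math.log(evidence))
```

Setting `q` equal to `posterior` drives the KL term to zero, at which point the ELBO equals the log evidence.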

Mean-Field Variational Inference

The most common VI approximation: assume the posterior factorizes fully: \(q(\boldsymbol{\theta}) = \prod_j q_j(\theta_j)\). This ignores all posterior correlations but enables closed-form updates for exponential family models:

Mean-Field Update Equations · CAVI

\[\log q_j^*(\theta_j) = \mathbb{E}_{-j}\!\left[\log p(\boldsymbol{\theta}, \mathcal{D})\right] + \text{const}\]

Coordinate Ascent VI (CAVI) updates each factor q_j in turn, holding others fixed. Guaranteed to converge to a local ELBO maximum. For conjugate exponential family models, the optimal q_j* is in the same family as the prior — giving closed-form updates.

14

Sampling

MCMC — Markov Chain Monte Carlo

// sampling from posteriors · Metropolis-Hastings · HMC · No-U-Turn Sampler

MCMC constructs a Markov chain whose stationary distribution is the target posterior \(p(\theta|\mathcal{D})\). By running the chain long enough, we obtain samples that (approximately) represent the posterior — without computing the intractable normalizing constant.

Metropolis-Hastings Algorithm

Metropolis-Hastings Sampler

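# Target: p(θ|D) ∝ p(D|θ)·p(θ), known only up to a constant. Proposal: q(θ'|θ_curr)
import numpy as np

rng = np.random.default_rng(0)

def log_p(θ):
    # Unnormalized log-posterior. Placeholder target (an assumption): N(0, 1)
    return -0.5 * θ**2

def q_sample(θ, σ=1.0):
    # Symmetric Gaussian random-walk proposal: θ' = θ + N(0, σ²)
    return θ + σ * rng.normal()

def log_q(θ_to, θ_from, σ=1.0):
    # log q(θ_to | θ_from) up to a constant; cancels for symmetric proposals
    return -0.5 * ((θ_to - θ_from) / σ) ** 2

N_samples, N_burnin = 5000, 1000
θ_curr = 0.0   # start somewhere in parameter space
samples = []

for t in range(N_samples + N_burnin):
    # Propose a new state
    θ_prop = q_sample(θ_curr)

    # Log acceptance ratio, including the Hastings correction for the proposal
    log_ratio = (log_p(θ_prop) + log_q(θ_curr, θ_prop)
                 - log_p(θ_curr) - log_q(θ_prop, θ_curr))

    # Accept with probability min(1, exp(log_ratio)); comparing logs avoids overflow
    if np.log(rng.uniform()) < min(0.0, log_ratio):
        θ_curr = θ_prop        # accept: move to proposed
    # else: stay at θ_curr (reject)

    if t >= N_burnin:
        samples.append(θ_curr)

# Use samples to estimate E[f(θ)|D] ≈ mean([f(s) for s in samples])
posterior_mean = np.mean(samples)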

Hamiltonian Monte Carlo (HMC)

HMC exploits gradient information to make large, accepted proposals. It introduces auxiliary "momentum" variables and simulates Hamiltonian dynamics to traverse the posterior efficiently — avoiding the random walk behavior of MH.

HMC: Hamiltonian · Gradient-Based MCMC

\[H(\theta, \rho) = \underbrace{-\log p(\theta|\mathcal{D})}_{\text{potential energy } U(\theta)} + \underbrace{\frac{1}{2}\rho^T M^{-1}\rho}_{\text{kinetic energy}}\]

θ is the position (parameter), ρ is the momentum (auxiliary). The joint distribution p(θ,ρ) ∝ exp(−H(θ,ρ)) has p(θ|D) as its marginal. Simulating Hamiltonian dynamics with a leapfrog integrator proposes new states that approximately preserve H, giving high acceptance rates and long-distance moves. The No-U-Turn Sampler (NUTS) chooses the trajectory length automatically, and the step size is typically adapted during warm-up. Used in Stan and PyMC.
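A minimal leapfrog sketch, assuming a one-dimensional standard normal target (so U(θ) = θ²/2) and unit mass. It shows only the trajectory simulation, not the accept/reject step or any tuning; the function names are hypothetical.

```python
def grad_U(theta):
    # Gradient of the potential U(theta) = theta^2 / 2, a standard normal target (assumed)
    return theta

def hamiltonian(theta, rho):
    # Potential plus kinetic energy with unit mass (M = 1)
    return 0.5 * theta**2 + 0.5 * rho**2

def leapfrog(theta, rho, step_size, n_steps):
    # Half step for momentum, alternating full steps, then a final half step for momentum
    rho -= 0.5 * step_size * grad_U(theta)
    for i in range(n_steps):
        theta += step_size * rho
        if i < n_steps - 1:
            rho -= step_size * grad_U(theta)
    rho -= 0.5 * step_size * grad_U(theta)
    return theta, rho

theta0, rho0 = 1.0, 0.5            # assumed starting position and momentum
theta1, rho1 = leapfrog(theta0, rho0, step_size=0.1, n_steps=20)
energy_error = hamiltonian(theta1, rho1) - hamiltonian(theta0, rho0)
print(theta1, rho1, energy_error)
```

The position moves a long way while the energy changes only slightly, which is why a Metropolis correction based on the change in H accepts most such proposals.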

15

Deep Learning

Bayesian Neural Networks

// weight distributions · uncertainty · MC Dropout · deep ensembles · Laplace approx

Bayesian Neural Networks (BNNs) place priors over the weights \(\mathbf{W}\) of a neural network and compute (or approximate) the posterior \(p(\mathbf{W}|\mathcal{D})\). This gives uncertainty-aware predictions — crucial for safety-critical applications like medical diagnosis and autonomous driving.

Bayesian Neural Network Prediction · Full Uncertainty

\[p(y_*|\mathbf{x}_*, \mathcal{D}) = \int p(y_*|\mathbf{x}_*, \mathbf{W})\,p(\mathbf{W}|\mathcal{D})\,d\mathbf{W}\]

This integral is intractable for neural networks — too many weights, non-conjugate. Approximations: (1) Laplace approximation: Gaussian centered at MAP estimate. (2) Mean-field VI (Bayes by Backprop). (3) MC Dropout: use dropout at test time as approximate posterior sampling. (4) Deep Ensembles: train K networks with different random seeds.

MC Dropout — Practical BNNs

Gal & Ghahramani (2016) showed that training a neural network with dropout can be interpreted as approximate Bayesian (variational) inference in a deep Gaussian process. In practice: keep dropout active at test time and compute \(T\) stochastic forward passes:

MC Dropout Predictive · Practical

\[\hat{y}_* \approx \frac{1}{T}\sum_{t=1}^{T} f_{\hat{\mathbf{W}}_t}(\mathbf{x}_*), \quad \hat{\mathbf{W}}_t \sim q_{\text{dropout}}\] \[\text{Epistemic uncertainty} \approx \text{Var}\!\left[\{f_{\hat{\mathbf{W}}_t}(\mathbf{x}_*)\}_{t=1}^T\right]\]

Epistemic uncertainty (model uncertainty) is high where training data is sparse. Aleatoric uncertainty (data noise) is irreducible. MC Dropout targets the former: the mean of the T predictions is the estimate; their variance captures epistemic uncertainty. In explainable medical imaging, for example chest X-ray classification, MC Dropout could produce uncertainty maps alongside Grad-CAM.
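The mechanics of the T-pass estimate can be sketched without a deep learning framework. The tiny network below uses hypothetical random weights standing in for a trained model, so it shows only how the mean and variance are formed, not realistic uncertainty values.

```python
import math
import random
import statistics

random.seed(0)
HIDDEN = 16
# Hypothetical fixed weights standing in for a trained one-hidden-layer network
w_in = [random.gauss(0, 1) for _ in range(HIDDEN)]
w_out = [random.gauss(0, 1) / HIDDEN for _ in range(HIDDEN)]

def forward_with_dropout(x, drop_prob=0.5):
    # One stochastic forward pass: each hidden unit is dropped independently,
    # and survivors are rescaled by 1 / (1 - drop_prob) (inverted dropout)
    total = 0.0
    for wi, wo in zip(w_in, w_out):
        if random.random() >= drop_prob:
            total += wo * math.tanh(wi * x) / (1 - drop_prob)
    return total

def mc_dropout_predict(x, T=200):
    # Mean of T passes is the prediction; their variance estimates epistemic uncertainty
    outputs = [forward_with_dropout(x) for _ in range(T)]
    return statistics.fmean(outputs), statistics.pvariance(outputs)

mean_pred, epistemic_var = mc_dropout_predict(1.5)
print(mean_pred, epistemic_var)
```

In a real framework the same idea amounts to leaving the dropout layers in training mode during inference and repeating the forward pass.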

16

Mental Model

The Complete Mental Model

// everything connected · one framework · the probabilistic view of ML

Fig 2. Complete Bayesian ML mental map — from Bayes' theorem through inference methods to every ML connection.

The complete unified thread:

Probability is belief. Parameters are uncertain quantities with distributions, not fixed unknowns. Every uncertain quantity — parameters, predictions, model structure — gets a probability.
Bayes' theorem is the update rule: \(p(\theta|\mathcal{D}) \propto p(\mathcal{D}|\theta)\,p(\theta)\). Prior belief times likelihood equals posterior belief (up to normalization). This is the only learning rule consistent with probability theory.
The prior encodes knowledge. Gaussian prior = L2 regularization. Laplace prior = L1 regularization. Horseshoe prior = sparsity-inducing. The Bayesian framework reveals what regularization really is: a prior on the parameters.
Conjugate priors give exact closed-form posteriors. Beta-Binomial. Gaussian-Gaussian. Dirichlet-Categorical. These arise from the exponential family structure and make sequential updating trivially simple.
MAP estimation is the mode of the posterior — equivalent to regularized maximum likelihood. Full Bayesian inference keeps the entire posterior, propagating uncertainty into every downstream quantity.
Predictive distributions integrate out parameter uncertainty: \(p(y_*|\mathcal{D}) = \int p(y_*|\theta)p(\theta|\mathcal{D})\,d\theta\). This gives honest uncertainty — wider where data is sparse.
The evidence \(p(\mathcal{D}|M)\) measures how well a model predicts data on average. Maximizing it for model selection automatically implements Occam's razor without a penalty term.
When exact inference fails: Variational inference (minimize KL, maximize ELBO), MCMC (Metropolis-Hastings, HMC/NUTS), Laplace approximation, or MC Dropout. Many deep learning tricks have a Bayesian interpretation.

The Probabilistic View of Machine Learning

Every ML model is implicitly a probabilistic model — of the data-generating process, of the noise, of the parameters. Ridge regression is MAP inference with a Gaussian prior. L1 regularization uses a Laplace prior. Dropout is approximate variational inference. Deep ensembles approximate the posterior predictive. The VAE ELBO is the Bayesian evidence lower bound. Understanding the Bayesian framework doesn't add a separate layer of complexity to ML — it reveals the probabilistic assumptions already embedded in every algorithm you've used, and gives you the tools to reason about them explicitly, improve them, and quantify the uncertainty they produce.