// Machine Learning — Mathematical Foundations

Logistic Regression

A complete theoretical and mathematical treatment — from first principles to gradient derivation, MLE, regularization, and multiclass extensions.

Topics: Sigmoid Function · Maximum Likelihood · Cross-Entropy · Gradient Descent · Decision Boundary · Regularization · Softmax / Multiclass
// Table of Contents
  01. Why Not Linear Regression? — The Motivation
  02. Probability, Odds, and the Logit Function
  03. The Sigmoid (Logistic) Function — Deep Dive
  04. The Hypothesis and Decision Boundary
  05. Why Not MSE? — The Loss Function Problem
  06. Maximum Likelihood Estimation — Full Derivation
  07. Binary Cross-Entropy Loss
  08. Gradient Computation — The Full Chain Rule
  09. Gradient Descent Update Rules
  10. Regularization — L1 and L2
  11. Multiclass: One-vs-Rest and Softmax
  12. Geometric Interpretation
  13. Assumptions of Logistic Regression
  14. Logistic Regression as a Neural Network
  15. Evaluation Metrics
  16. The Full Mental Model — Everything Together
§01

Why Not Linear Regression?

You already know linear regression models a continuous target \( y \in \mathbb{R} \). For classification, the target is categorical — binary for now: \( y \in \{0, 1\} \). The question is: can we repurpose linear regression?

Let's try. Fit a line \( \hat{y} = \mathbf{w}^T \mathbf{x} + b \) and threshold at 0.5. Three fundamental problems arise:

⚠ Problem 1 — Unbounded Output

Linear regression outputs \( \hat{y} \in (-\infty, +\infty) \). We need a probability \( P \in [0, 1] \). A predicted value of 3.7 or −0.2 is mathematically meaningless as a probability.

⚠ Problem 2 — Sensitivity to Outliers Shifts the Boundary

Adding an extreme outlier far from the boundary can drag the regression line and drastically shift where it crosses the threshold — even though the boundary should be unaffected. The linear model is trying to minimize distance to all points, not find a boundary.

⚠ Problem 3 — Violated Assumptions, Wrong Loss

Linear regression assumes residuals are Gaussian-distributed around a continuous mean. Binary outputs violate this entirely. MSE loss is also not a convex function when composed with a sigmoid — but log-loss is. We'll formalize this in §05.

The solution: we still use a linear combination \( z = \mathbf{w}^T\mathbf{x} + b \) as the raw score, but we squash it through a function that maps \( \mathbb{R} \to (0,1) \). That function is the sigmoid. But to understand why the sigmoid specifically, we must first understand odds and the logit.

§02

Probability, Odds, and the Logit

This is the conceptual heart of logistic regression. Most textbooks jump straight to the sigmoid formula — we'll derive why it has to be the sigmoid from probabilistic first principles.

From Probability to Odds

Let \( p = P(y=1 \mid \mathbf{x}) \) — the probability the sample belongs to class 1.

The odds of an event is the ratio of the probability it happens to the probability it doesn't:

Definition \[ \text{Odds} = \frac{p}{1-p} \]

When \( p = 0.5 \): odds = 1 ("even odds"). When \( p = 0.75 \): odds = 3 ("3 to 1"). As \( p \to 1 \): odds \( \to \infty \). As \( p \to 0 \): odds \( \to 0 \). So odds live in \( (0, +\infty) \).

From Odds to Log-Odds (Logit)

The odds are still asymmetric (0 to ∞). Take the natural log:

Logit Function \[ \text{logit}(p) = \log\left(\frac{p}{1-p}\right) \]

Now the range is \( (-\infty, +\infty) \)! The logit is symmetric around 0: \( \text{logit}(0.5) = 0 \), and logit is an increasing function of \( p \).

The Central Assumption of Logistic Regression

Logistic regression assumes the log-odds are linear in the features:

Core Assumption \[ \log\left(\frac{p}{1-p}\right) = \mathbf{w}^T\mathbf{x} + b = z \]

This is elegant: the log-odds are unbounded (matching the range of a linear function), and the linear model now lives on the correct scale. Now we just invert the logit to get back to probability \( p \).

Inverting the Logit → The Sigmoid

// Algebraic derivation of the sigmoid from the logit assumption
  1. Start from the logit equation: \(\displaystyle \log\frac{p}{1-p} = z\)
  2. Exponentiate both sides: \(\displaystyle \frac{p}{1-p} = e^z\)
  3. Solve for \(p\): \(\displaystyle p = e^z(1-p) = e^z - pe^z \;\Rightarrow\; p(1 + e^z) = e^z\)
  4. Divide numerator and denominator by \(e^z\): \(\displaystyle p = \frac{e^z}{1+e^z} = \frac{1}{1+e^{-z}}\)

The Sigmoid \[ \sigma(z) = \frac{1}{1+e^{-z}} \]
✓ Key Insight

The sigmoid isn't an arbitrary choice. It is the exact inverse of the logit function. Choosing a sigmoid output is mathematically equivalent to assuming the log-odds are linear in the features.
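The inversion above can be checked numerically — a minimal sketch (the function names `sigmoid` and `logit` are ours, just for the check):

```python
import math

def sigmoid(z):
    # σ(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    # logit(p) = log(p / (1 - p))
    return math.log(p / (1.0 - p))

# sigmoid is the exact inverse of logit: sigmoid(logit(p)) recovers p
for p in (0.1, 0.5, 0.75, 0.99):
    assert abs(sigmoid(logit(p)) - p) < 1e-12
```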

§03

The Sigmoid — Deep Dive

Fig 1. The sigmoid function σ(z) = 1/(1+e⁻ᶻ) — maps ℝ → (0,1), inflection at z = 0 where σ(0) = 0.5, with horizontal asymptotes at y = 0 and y = 1.

Properties You Must Know

  • Range: \( \sigma(z) \in (0,1) \) — strictly, never hits 0 or 1. This models probability.
  • Symmetry: \( \sigma(-z) = 1 - \sigma(z) \) — symmetric about the point \((0, 0.5)\).
  • Inflection point: At \(z = 0\), \(\sigma(0) = 0.5\), and the curve bends from concave-up to concave-down.
  • Saturation: For large \(|z|\), the gradient approaches 0 — this causes the vanishing gradient problem in deep networks.
  • Monotonically increasing: Always increasing, so \(\sigma'(z) > 0\) for all \(z\).

The Derivative of the Sigmoid (Critical for Backprop)

// Deriving σ'(z) — used in every gradient computation
  1. \(\sigma(z) = (1 + e^{-z})^{-1}\)
  2. Chain rule: \(\sigma'(z) = -(1+e^{-z})^{-2} \cdot (-e^{-z}) = \dfrac{e^{-z}}{(1+e^{-z})^2}\)
  3. Factor: \(\displaystyle \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}} = \frac{1}{1+e^{-z}} \left(1 - \frac{1}{1+e^{-z}}\right)\)
  4. \(\boxed{\sigma'(z) = \sigma(z)(1-\sigma(z))}\)

This elegant result means the derivative is expressible entirely in terms of the output itself — very cheap to compute.
💡 Why This Matters

Because \(\sigma'(z) = \sigma(z)(1-\sigma(z))\), during backpropagation you never recompute the sigmoid — you reuse the forward pass output. Also note: maximum gradient is at \(z=0\) where \(\sigma' = 0.25\), meaning the gradient is already weak — explaining why sigmoid networks saturate.
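The identity \(\sigma'(z) = \sigma(z)(1-\sigma(z))\) can be verified against a central finite difference — a small sketch:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

h = 1e-6
for z in (-3.0, 0.0, 2.5):
    # numerical derivative via central differences
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
    # analytic derivative from the identity σ'(z) = σ(z)(1 − σ(z))
    analytic = sigmoid(z) * (1 - sigmoid(z))
    assert abs(numeric - analytic) < 1e-8

# maximum slope is at z = 0, where σ'(0) = 0.25
assert abs(sigmoid(0) * (1 - sigmoid(0)) - 0.25) < 1e-12
```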

§04

The Hypothesis and Decision Boundary

Full Hypothesis

For a feature vector \(\mathbf{x} \in \mathbb{R}^n\), with weight vector \(\mathbf{w} \in \mathbb{R}^n\) and bias \(b\):

Logistic Regression Hypothesis \[ \hat{y} = h_{\mathbf{w},b}(\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^T\mathbf{x}+b)}} \] \[ \hat{y} \approx P(y=1 \mid \mathbf{x}; \mathbf{w}, b) \]

Note \(\hat{y}\) is interpreted as the probability that sample \(\mathbf{x}\) belongs to class 1. The probability for class 0 is simply \(1 - \hat{y}\).

The Decision Boundary

We classify as class 1 when \(\hat{y} \geq 0.5\), i.e., when \(\sigma(z) \geq 0.5\). Since \(\sigma\) is monotone and \(\sigma(0) = 0.5\):

Decision Rule \[ \hat{y} \geq 0.5 \iff z \geq 0 \iff \mathbf{w}^T\mathbf{x} + b \geq 0 \]

The decision boundary is the set of points where \(\mathbf{w}^T\mathbf{x} + b = 0\). In 2D, this is a line. In \(n\) dimensions, it's a hyperplane.

💡 Geometric Insight

The weight vector \(\mathbf{w}\) is the normal to the decision hyperplane. The dot product \(\mathbf{w}^T\mathbf{x}\) is the signed projection of \(\mathbf{x}\) onto \(\mathbf{w}\) — points on one side have positive projection, points on the other side have negative. This is pure linear algebra from your vectors background.

Since the boundary is linear in \(\mathbf{x}\), logistic regression is a linear classifier. It can only separate linearly separable classes — unless you engineer polynomial features (just like polynomial regression).
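The decision rule \(\hat{y} \geq 0.5 \iff \mathbf{w}^T\mathbf{x} + b \geq 0\) means classification never needs the sigmoid at all — a sketch with illustrative (not fitted) weights:

```python
import numpy as np

def predict(w, b, X):
    # class 1 iff w·x + b >= 0, equivalently sigmoid(z) >= 0.5
    z = X @ w + b
    return (z >= 0).astype(int)

w = np.array([1.0, -1.0])   # assumed weights, for illustration only
b = 0.0
X = np.array([[2.0, 1.0],   # z =  1 -> class 1
              [1.0, 2.0]])  # z = -1 -> class 0
print(predict(w, b, X))     # [1 0]
```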

Fig 2. Linear decision boundary — the hyperplane w·x + b = 0 separates class 0 (w·x + b < 0) from class 1 (w·x + b > 0); w is the normal vector.
§05

Why Not MSE? The Loss Function Problem

You know from linear regression that we minimize MSE. Why not do the same here with the sigmoid output?

MSE with Sigmoid \[ J(\mathbf{w}) = \frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - \sigma(\mathbf{w}^T\mathbf{x}^{(i)}+b)\right)^2 \]

The problem: this loss is non-convex when composed with the sigmoid. It has many local minima, making gradient descent unreliable. Here's why:

Fig 3. MSE composed with the sigmoid is non-convex (multiple local minima → bad). Cross-entropy is convex (one global minimum → good).
✓ Formal Statement

For logistic regression, the cross-entropy loss is a convex function of the weights \(\mathbf{w}\) (strictly convex when the feature matrix has full column rank). This guarantees that gradient descent converges to a global minimum (given an appropriate learning rate). MSE does not have this guarantee.

We derive the correct loss from Maximum Likelihood Estimation — which also gives it a deep statistical justification.

§06

Maximum Likelihood Estimation — Full Derivation

MLE is the principled way to derive the loss. The idea: find parameters \(\mathbf{w}, b\) that make the observed training data most probable.

The Probabilistic Model

Each label \(y^{(i)} \in \{0,1\}\) is a Bernoulli random variable. Our model predicts:

\[ P(y=1 \mid \mathbf{x}) = \hat{y} = \sigma(z), \qquad P(y=0 \mid \mathbf{x}) = 1 - \hat{y} \]

These two cases can be written compactly in a single equation using the Bernoulli PMF:

Bernoulli PMF \[ P(y \mid \mathbf{x}; \mathbf{w}) = \hat{y}^{\,y} \cdot (1-\hat{y})^{1-y} \]

Check: when \(y=1\), this gives \(\hat{y}^1 \cdot (1-\hat{y})^0 = \hat{y}\). When \(y=0\), this gives \(\hat{y}^0 \cdot (1-\hat{y})^1 = 1-\hat{y}\). ✓

The Likelihood Function

Assuming i.i.d. (independent, identically distributed) samples, the likelihood of the entire dataset is the product over all \(m\) examples:

Likelihood \[ \mathcal{L}(\mathbf{w},b) = \prod_{i=1}^{m} P(y^{(i)} \mid \mathbf{x}^{(i)}; \mathbf{w},b) = \prod_{i=1}^{m} \hat{y}_i^{\,y^{(i)}} (1-\hat{y}_i)^{1-y^{(i)}} \]

Log-Likelihood (Turning Products into Sums)

Maximizing \(\mathcal{L}\) is equivalent to maximizing \(\log\mathcal{L}\) (since log is monotone). Taking the log converts products to sums — numerically more stable and mathematically easier to differentiate:

Log-Likelihood \[ \ell(\mathbf{w},b) = \log\mathcal{L} = \sum_{i=1}^{m} \left[ y^{(i)}\log\hat{y}_i + (1-y^{(i)})\log(1-\hat{y}_i) \right] \]

From Maximization to Minimization

We want to maximize the log-likelihood, but gradient descent minimizes. We negate and normalize:

Loss = Negative Mean Log-Likelihood \[ J(\mathbf{w},b) = -\frac{1}{m}\ell(\mathbf{w},b) = -\frac{1}{m}\sum_{i=1}^{m} \left[ y^{(i)}\log\hat{y}_i + (1-y^{(i)})\log(1-\hat{y}_i) \right] \]
✓ Connection to Information Theory

This is precisely the Binary Cross-Entropy — the average cross-entropy between the true label distribution and the predicted distribution. MLE and cross-entropy minimization are the same thing for Bernoulli outcomes.
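The equivalence is easy to see numerically — a minimal sketch with made-up labels and predictions:

```python
import numpy as np

y     = np.array([1, 0, 1, 1])
y_hat = np.array([0.9, 0.2, 0.7, 0.6])  # illustrative predicted probabilities

# Negative mean log-likelihood of the Bernoulli model ...
nll = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# ... equals binary cross-entropy computed case by case
bce = np.mean([-np.log(p) if t == 1 else -np.log(1 - p)
               for t, p in zip(y, y_hat)])

assert abs(nll - bce) < 1e-12
```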

§07

Binary Cross-Entropy Loss — Intuition

The loss for a single sample is:

Single-Sample Loss \[ \mathcal{L}(\hat{y}, y) = -\left[y\log(\hat{y}) + (1-y)\log(1-\hat{y})\right] \]

Let's understand what this penalizes:

| True y | Predicted ŷ | Loss | Interpretation |
|---|---|---|---|
| 1 | 0.99 | \(-\log(0.99) \approx 0.01\) | Correct, confident → tiny loss |
| 1 | 0.50 | \(-\log(0.5) \approx 0.69\) | Correct, uncertain → moderate loss |
| 1 | 0.01 | \(-\log(0.01) \approx 4.6\) | Wrong, confident → huge loss |
| 0 | 0.01 | \(-\log(0.99) \approx 0.01\) | Correct, confident → tiny loss |
| 0 | 0.99 | \(-\log(0.01) \approx 4.6\) | Wrong, confident → huge loss |
💡 Asymmetric Penalty

Cross-entropy punishes confident wrong predictions extremely heavily — the loss approaches infinity as \(\hat{y} \to 0\) when \(y=1\). This is exactly the right behavior: being very wrong with confidence is the worst possible outcome in a probabilistic model.

The two cases of the loss function also have a clean form. When the sigmoid output \(\hat{y} = \sigma(z)\):

Loss in terms of z (useful for derivation) \[ \text{if } y=1: \quad \mathcal{L} = -\log\sigma(z) = \log(1+e^{-z}) \] \[ \text{if } y=0: \quad \mathcal{L} = -\log(1-\sigma(z)) = \log(1+e^{z}) \]
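Both identities follow from \(-\log\sigma(z) = \log(1+e^{-z})\) and \(1-\sigma(z) = \sigma(-z)\), and can be checked directly — a small sketch:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for z in (-4.0, -0.5, 0.0, 2.0):
    # y = 1 case: -log σ(z) = log(1 + e^{-z})
    assert abs(-math.log(sigmoid(z)) - math.log1p(math.exp(-z))) < 1e-10
    # y = 0 case: -log(1 - σ(z)) = log(1 + e^{z})
    assert abs(-math.log(1 - sigmoid(z)) - math.log1p(math.exp(z))) < 1e-10
```

In practice the right-hand forms (via `log1p`) are the numerically stable ones — they avoid taking the log of a probability that has rounded to 0.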
§08

Gradient Computation — The Full Chain Rule

This is the most mathematically rich section. We derive \(\frac{\partial J}{\partial w_j}\) from scratch using the chain rule. You've done partial derivatives, so let's go deep.

The Computational Graph

w, x, b → z = wᵀx + b → ŷ = σ(z) → J(ŷ, y)   (linear → sigmoid → cross-entropy)
Fig 4. Computational graph of logistic regression — backprop flows right to left.

Step 1 — Derivative of Loss w.r.t. ŷ

∂J/∂ŷ for one sample \[ J = -\left[y\log\hat{y} + (1-y)\log(1-\hat{y})\right] \] \[ \frac{\partial J}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} \]

Step 2 — Derivative of ŷ w.r.t. z (Sigmoid Derivative)

∂ŷ/∂z \[ \frac{\partial \hat{y}}{\partial z} = \sigma'(z) = \hat{y}(1-\hat{y}) \]

Step 3 — Chain Rule: ∂J/∂z

// Combining steps 1 and 2 via chain rule
  1. \(\displaystyle \frac{\partial J}{\partial z} = \frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}\)
  2. \(\displaystyle = \left(-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\right) \cdot \hat{y}(1-\hat{y})\)
  3. Distribute: \(-\frac{y}{\hat{y}} \cdot \hat{y}(1-\hat{y}) = -y(1-\hat{y})\) and \(\frac{1-y}{1-\hat{y}} \cdot \hat{y}(1-\hat{y}) = (1-y)\hat{y}\), so \(\displaystyle = -y(1-\hat{y}) + (1-y)\hat{y}\)
  4. Expand: \(\displaystyle = -y + y\hat{y} + \hat{y} - y\hat{y} = \hat{y} - y\)
  5. \(\displaystyle \boxed{\frac{\partial J}{\partial z} = \hat{y} - y}\)

The sigmoid and log cancel beautifully — the gradient at the linear layer is simply the prediction error!
✓ The Beautiful Cancellation

The derivative of the log-loss with respect to the pre-activation \(z\) is simply \(\hat{y} - y\) — the residual (prediction minus truth). This is the same form as in linear regression! The sigmoid's derivative and the log's derivative cancel each other perfectly. This is not a coincidence — it's a consequence of the sigmoid being the canonical link function for Bernoulli outcomes.

Step 4 — Gradient w.r.t. Weights and Bias

Since \(z = \mathbf{w}^T\mathbf{x} + b\):

Final Gradients \[ \frac{\partial J}{\partial w_j} = \frac{\partial J}{\partial z} \cdot \frac{\partial z}{\partial w_j} = (\hat{y} - y) \cdot x_j \] \[ \frac{\partial J}{\partial b} = \frac{\partial J}{\partial z} \cdot \frac{\partial z}{\partial b} = (\hat{y} - y) \cdot 1 = \hat{y} - y \]

For the full dataset with \(m\) samples (averaging over all):

Vectorized Gradients (m samples) \[ \frac{\partial J}{\partial \mathbf{w}} = \frac{1}{m} \mathbf{X}^T (\hat{\mathbf{y}} - \mathbf{y}) \] \[ \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)}) \]

where \(\mathbf{X} \in \mathbb{R}^{m \times n}\), \(\hat{\mathbf{y}} \in \mathbb{R}^m\), \(\mathbf{y} \in \mathbb{R}^m\)
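The vectorized gradient can be validated against finite differences of the loss — a sketch on random synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 3
X = rng.normal(size=(m, n))
y = rng.integers(0, 2, size=m).astype(float)
w, b = rng.normal(size=n), 0.1

def loss(w, b):
    # binary cross-entropy, averaged over the m samples
    y_hat = 1 / (1 + np.exp(-(X @ w + b)))
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# analytic gradient from the derivation: (1/m) Xᵀ(ŷ − y)
y_hat = 1 / (1 + np.exp(-(X @ w + b)))
dw = X.T @ (y_hat - y) / m

# central finite-difference check, coordinate by coordinate
h = 1e-6
for j in range(n):
    e = np.zeros(n); e[j] = h
    numeric = (loss(w + e, b) - loss(w - e, b)) / (2 * h)
    assert abs(numeric - dw[j]) < 1e-6
```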

§09

Gradient Descent Update Rules

Parameter Updates \[ \mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \frac{\partial J}{\partial \mathbf{w}} = \mathbf{w} - \frac{\alpha}{m} \mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y}) \] \[ b \leftarrow b - \alpha \cdot \frac{\partial J}{\partial b} = b - \frac{\alpha}{m}\sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)}) \]
// Batch Gradient Descent for Logistic Regression
// (NumPy; X of shape (m, n), y of shape (m,), alpha, and num_epochs assumed defined)
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.zeros(n)
b = 0.0

for epoch in range(num_epochs):
    # Forward pass
    z = X @ w + b                # (m,)
    y_hat = sigmoid(z)           # (m,) predictions

    # Loss (binary cross-entropy)
    J = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    # Backward pass (gradients)
    error = y_hat - y            # (m,) — the residuals
    dw = (X.T @ error) / m       # (n,)
    db = np.mean(error)          # scalar

    # Update
    w -= alpha * dw
    b -= alpha * db

The three variants (from your gradient descent background):

| Variant | Batch Size | Update Frequency | Noise / Stability |
|---|---|---|---|
| Batch GD | All \(m\) | Once per epoch | Stable, but slow for large data |
| Stochastic GD | 1 sample | \(m\) times per epoch | Very noisy; can escape local minima |
| Mini-Batch GD | 32–512 | \(m/\text{batch}\) times per epoch | Balance of speed and stability (standard) |
§10

Regularization — L1 and L2

From your bias-variance tradeoff knowledge: when a model overfits, we add a penalty on the weights to constrain their magnitude.

L2 Regularization (Ridge / Weight Decay)

L2 Regularized Loss \[ J_{\text{L2}}(\mathbf{w}) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\hat{y}_i + (1-y^{(i)})\log(1-\hat{y}_i)\right] + \frac{\lambda}{2m}\|\mathbf{w}\|_2^2 \]

The \(\frac{\lambda}{2m}\) factor is conventional (the 2 cancels the square's derivative neatly). The new gradient for weights:

L2 Gradient \[ \frac{\partial J_{\text{L2}}}{\partial \mathbf{w}} = \frac{1}{m}\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y}) + \frac{\lambda}{m}\mathbf{w} \]

The update rule becomes: \(\mathbf{w} \leftarrow \mathbf{w}(1 - \frac{\alpha\lambda}{m}) - \frac{\alpha}{m}\mathbf{X}^T(\hat{\mathbf{y}}-\mathbf{y})\) — the factor \((1 - \frac{\alpha\lambda}{m}) < 1\) shrinks the weights every step, hence the name "weight decay."
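The algebraic equivalence of "add the L2 gradient term" and "decay, then step" is worth one sanity check — a sketch with illustrative stand-in values (`alpha`, `lam`, `m`, and `grad_data` are assumed, not from a real fit):

```python
import numpy as np

alpha, lam, m = 0.1, 0.5, 100
w = np.array([2.0, -1.0])
grad_data = np.array([0.3, -0.2])  # stand-in for (1/m) Xᵀ(ŷ − y)

# One L2-regularized update, written two equivalent ways:
# (a) gradient of the penalized loss
step_a = w - alpha * (grad_data + (lam / m) * w)
# (b) "weight decay": shrink w, then take the data-gradient step
step_b = w * (1 - alpha * lam / m) - alpha * grad_data

assert np.allclose(step_a, step_b)
```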

L2 penalizes large weights but never forces them exactly to zero. It promotes small, spread-out weights. Probabilistically, it corresponds to a Gaussian prior on the weights (MAP estimation with \(\mathcal{N}(0, 1/\lambda)\)).

L1 Regularization (Lasso)

L1 Regularized Loss \[ J_{\text{L1}}(\mathbf{w}) = -\frac{1}{m}\sum_{i=1}^{m}\left[\ldots\right] + \frac{\lambda}{m}\|\mathbf{w}\|_1 = -\frac{1}{m}\sum\left[\ldots\right] + \frac{\lambda}{m}\sum_{j}|w_j| \]
L1 Gradient \[ \frac{\partial J_{\text{L1}}}{\partial w_j} = \frac{1}{m}[\mathbf{X}^T(\hat{\mathbf{y}}-\mathbf{y})]_j + \frac{\lambda}{m}\text{sign}(w_j) \]

L1 can push weights exactly to zero, performing implicit feature selection. This is because the L1 ball has corners on the coordinate axes in weight space — the regularized optimum tends to sit exactly at zero for some coordinates. Probabilistically, it corresponds to a Laplace prior on the weights.

| Property | L2 (Ridge) | L1 (Lasso) |
|---|---|---|
| Effect on weights | Shrinks, never exactly zero | Can drive exactly to zero |
| Feature selection | No | Yes (sparse solution) |
| Differentiable? | Yes, everywhere | No, not at \(w_j = 0\) |
| Prior equivalent | Gaussian prior | Laplace prior |
| Geometry | L2 ball (circle) | L1 ball (diamond) |
§11

Multiclass: One-vs-Rest and Softmax

Logistic regression extends to \(K > 2\) classes in two ways.

Strategy 1: One-vs-Rest (OvR)

Train \(K\) binary logistic regressors. Classifier \(k\) learns: "is this class \(k\) vs. all other classes?" At test time, pick the class with the highest sigmoid score.

Drawback: The K sigmoid outputs don't sum to 1 — they don't form a proper probability distribution over classes.

Strategy 2: Softmax Regression (Multinomial Logistic Regression)

The natural generalization. For \(K\) classes, maintain weight vectors \(\mathbf{w}_1, \ldots, \mathbf{w}_K\). Compute \(K\) linear scores:

\[ z_k = \mathbf{w}_k^T \mathbf{x} + b_k, \quad k = 1, \ldots, K \]

Apply the softmax function to convert to probabilities:

Softmax Function \[ P(y=k \mid \mathbf{x}) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} \]

Properties of softmax: all outputs in (0,1), and they sum to 1. When \(K=2\), softmax reduces exactly to the sigmoid — binary logistic regression is a special case.
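The \(K=2\) reduction is a one-line identity — with scores \((z, 0)\), the class-1 softmax probability is \(e^z/(e^z + 1) = 1/(1+e^{-z}) = \sigma(z)\). A small sketch:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift by max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for z in (-3.0, 0.0, 1.7):
    p = softmax(np.array([z, 0.0]))
    # softmax over (z, 0) gives exactly (sigmoid(z), 1 - sigmoid(z))
    assert abs(p[0] - sigmoid(z)) < 1e-12
    assert abs(p.sum() - 1.0) < 1e-12
```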

Categorical Cross-Entropy Loss \[ J = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} \mathbf{1}[y^{(i)}=k] \log P(y^{(i)}=k \mid \mathbf{x}^{(i)}) \]

Where \(\mathbf{1}[y^{(i)}=k]\) is 1 if sample \(i\) belongs to class \(k\), else 0. This is equivalent to: for each sample, take the negative log of the predicted probability assigned to the true class.

💡 Connection to Your DL Work

The final layer of any classification neural network is exactly softmax regression. When you built your chest X-ray CNN, the last fully-connected layer + softmax was doing softmax regression — the rest of the network was learning a feature representation. Logistic regression = neural network with no hidden layers.

§12

Geometric Interpretation

Let's consolidate the geometry — leveraging your vector and linear algebra background.

§13

Assumptions of Logistic Regression

Unlike linear regression with its stringent Gauss-Markov assumptions, logistic regression has fewer but still important assumptions:

| Assumption | What it Means | What Happens if Violated |
|---|---|---|
| Binary (or categorical) outcome | \(y \in \{0,1\}\) | Model is inappropriate; use other methods |
| Log-odds are linear in features | \(\text{logit}(p) = \mathbf{w}^T\mathbf{x}+b\) | Underfitting; use feature engineering or nonlinear models |
| No multicollinearity | Features are not highly correlated | Unstable, large, noisy coefficients |
| Large sample size | MLE is asymptotically consistent | High-variance estimates with small data; use regularization |
| No perfect separation | No feature perfectly predicts \(y\) | MLE diverges (\(\|\mathbf{w}\| \to \infty\)); regularization is required |
| Independence of observations | i.i.d. samples | Underestimated standard errors; use clustered SEs |
⚠ Perfect Separation (and the Hauck-Donner Effect)

For perfectly separable classes (e.g., all samples with \(x_1 > 5\) are class 1), the MLE pushes \(w_1 \to \infty\) — the sigmoid approaches a step function, the loss can always be reduced by scaling the weights up, and training never converges. (The related Hauck-Donner effect: as the estimates diverge, Wald standard errors grow even faster, making significance tests unreliable.) Adding L2 regularization bounds the weights and fixes this numerically.
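The divergence is easy to reproduce on a toy 1-D separable dataset — a sketch (the data, learning rate, and epoch count are illustrative):

```python
import numpy as np

# Perfectly separable 1-D data: x > 0 ⇒ class 1
X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

def fit(lam, epochs=20000, alpha=0.5):
    # gradient descent on the (optionally L2-regularized) log-loss
    w, m = 0.0, len(X)
    for _ in range(epochs):
        y_hat = 1 / (1 + np.exp(-w * X))
        w -= alpha * ((X @ (y_hat - y)) / m + lam * w / m)
    return w

w_unreg = fit(lam=0.0)  # keeps growing — would diverge with more epochs
w_l2 = fit(lam=1.0)     # settles at a bounded value
print(w_unreg, w_l2)
```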

§14

Logistic Regression as a Neural Network

This connection is essential and directly bridges your ML and DL knowledge.

Fig 5. Logistic regression as a neural network with no hidden layers — inputs x₁ … xₙ connect through weights w₁ … wₙ directly to a single sigmoid unit computing z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b and ŷ = σ(z).

Formally: logistic regression is a single-layer neural network with:

  • no hidden layers — inputs connect directly to the output,
  • a single output unit with sigmoid activation,
  • binary cross-entropy as the loss function.

Adding hidden layers yields a proper neural network. The same gradient derivation generalizes via backpropagation — the chain rule we worked through in §08 applies repeatedly through each layer.

§15

Evaluation Metrics

Confusion Matrix and Core Metrics

| Metric | Formula | Measures |
|---|---|---|
| Accuracy | \(\frac{TP+TN}{TP+TN+FP+FN}\) | Overall correctness (misleading for imbalanced data) |
| Precision | \(\frac{TP}{TP+FP}\) | "Of predicted positives, how many were actually positive?" |
| Recall (Sensitivity) | \(\frac{TP}{TP+FN}\) | "Of actual positives, how many were caught?" |
| F1 Score | \(\frac{2 \cdot P \cdot R}{P+R}\) | Harmonic mean of Precision and Recall |
| Specificity | \(\frac{TN}{TN+FP}\) | True negative rate (important in medical diagnosis) |

AUC-ROC

The Receiver Operating Characteristic plots TPR (Recall) vs FPR at varying thresholds. The AUC (Area Under Curve) measures the probability that the model ranks a random positive sample higher than a random negative one. AUC = 0.5 is random; AUC = 1.0 is perfect.
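That probabilistic reading of AUC can be computed directly from its definition, by averaging over all positive/negative pairs — a sketch with illustrative scores (ties count as ½):

```python
import numpy as np

y      = np.array([0, 0, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9])  # illustrative model scores

# AUC = P(score of a random positive > score of a random negative)
pos = scores[y == 1]
neg = scores[y == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
auc = np.mean(pairs)
print(auc)  # 5/6 ≈ 0.833 — one of six pos/neg pairs is mis-ranked
```

This O(m²) pairwise form is only for intuition; real implementations sort once and integrate the ROC curve.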

💡 Threshold Selection

The default threshold of 0.5 is not always optimal. In your chest X-ray work, a false negative (missed disease) is far worse than a false positive — so you'd lower the threshold (e.g., 0.3) to increase recall at the cost of precision. The ROC curve helps select the optimal operating threshold for your specific cost structure.

Log-Loss (as a metric)

\[ \text{Log-Loss} = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\hat{y}^{(i)} + (1-y^{(i)})\log(1-\hat{y}^{(i)})\right] \]

Log-loss evaluates the calibration of predicted probabilities — not just whether the rank order is right, but whether the probability values are accurate. A perfectly calibrated model has log-loss approaching the true entropy of the data.

§16

The Full Mental Model — Everything Together

Fig 6. Complete computational flow of logistic regression — features x ∈ ℝⁿ → linear score z = wᵀx + b → sigmoid ŷ = σ(z) → cross-entropy J(ŷ, y); the gradients ∂J/∂z = ŷ − y, ∂J/∂w = (1/m)Xᵀ(ŷ − y), ∂J/∂b = (1/m)Σ(ŷᵢ − yᵢ) backpropagate to the updates w ← w − α·∂J/∂w, b ← b − α·∂J/∂b. Converges to the global minimum (convex loss); linear decision boundary; probabilistic output.

Here is the complete picture as a single unified narrative:

  1. Assume log-odds are linear in features → sigmoid hypothesis falls out naturally.
  2. Model each label as Bernoulli with \(p = \sigma(\mathbf{w}^T\mathbf{x}+b)\).
  3. Maximize the likelihood of the training data → derive cross-entropy loss via MLE.
  4. Compute gradients via chain rule: the beautiful result \(\partial J/\partial z = \hat{y}-y\).
  5. Update parameters via gradient descent: \(\mathbf{w} \leftarrow \mathbf{w} - \frac{\alpha}{m}\mathbf{X}^T(\hat{\mathbf{y}}-\mathbf{y})\).
  6. Regularize with L1 or L2 to prevent overfitting and handle near-separation.
  7. Extend to multiclass via softmax regression.
  8. Evaluate with AUC-ROC, F1, log-loss — not just accuracy.
  9. Recognize it as a 0-hidden-layer neural network — foundation of all deep learning classifiers.