Why Not Linear Regression?
You already know linear regression models a continuous target \( y \in \mathbb{R} \). For classification, the target is categorical — binary for now: \( y \in \{0, 1\} \). The question is: can we repurpose linear regression?
Let's try. Fit a line \( \hat{y} = \mathbf{w}^T \mathbf{x} + b \) and threshold at 0.5. Three fundamental problems arise:
1. Unbounded outputs. Linear regression outputs \( \hat{y} \in (-\infty, +\infty) \). We need a probability \( P \in [0, 1] \). A predicted value of 3.7 or −0.2 is mathematically meaningless as a probability.
2. Outlier sensitivity. Adding an extreme outlier far from the boundary can drag the regression line and drastically shift where it crosses the threshold — even though the boundary should be unaffected. The linear model is trying to minimize distance to all points, not find a boundary.
3. Wrong noise model. Linear regression assumes residuals are Gaussian-distributed around a continuous mean. Binary outputs violate this entirely. MSE loss is also not a convex function when composed with a sigmoid — but log-loss is. We'll formalize this in §05.
The solution: we still use a linear combination \( z = \mathbf{w}^T\mathbf{x} + b \) as the raw score, but we squash it through a function that maps \( \mathbb{R} \to (0,1) \). That function is the sigmoid. But to understand why the sigmoid specifically, we must first understand odds and the logit.
Probability, Odds, and the Logit
This is the conceptual heart of logistic regression. Most textbooks jump straight to the sigmoid formula — we'll derive why it has to be the sigmoid from probabilistic first principles.
From Probability to Odds
Let \( p = P(y=1 \mid \mathbf{x}) \) — the probability the sample belongs to class 1.
The odds of an event are the ratio of the probability that it happens to the probability that it doesn't:

\[ \text{odds}(p) = \frac{p}{1-p} \]
When \( p = 0.5 \): odds = 1 ("even odds"). When \( p = 0.75 \): odds = 3 ("3 to 1"). As \( p \to 1 \): odds \( \to \infty \). As \( p \to 0 \): odds \( \to 0 \). So odds live in \( (0, +\infty) \).
From Odds to Log-Odds (Logit)
The odds are still asymmetric (0 to ∞). Take the natural log:

\[ \text{logit}(p) = \ln\frac{p}{1-p} \]
Now the range is \( (-\infty, +\infty) \)! The logit is symmetric around 0: \( \text{logit}(0.5) = 0 \), and logit is an increasing function of \( p \).
The Central Assumption of Logistic Regression
Logistic regression assumes the log-odds are linear in the features:

\[ \ln\frac{p}{1-p} = \mathbf{w}^T\mathbf{x} + b \]
This is elegant: the log-odds are unbounded (matching the range of a linear function), and the linear model now lives on the correct scale. Now we just invert the logit to get back to probability \( p \).
Inverting the Logit → The Sigmoid
The sigmoid isn't an arbitrary choice. Solving \( \ln\frac{p}{1-p} = z \) for \( p \):

\[ \frac{p}{1-p} = e^z \quad\Rightarrow\quad p = \frac{e^z}{1+e^z} = \frac{1}{1+e^{-z}} = \sigma(z) \]

The sigmoid is the exact inverse of the logit function. Choosing a sigmoid output is mathematically equivalent to assuming the log-odds are linear in the features.
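A quick numerical round-trip confirms that the sigmoid undoes the logit (a minimal sketch; `logit` and `sigmoid` are defined here for illustration, not imported from a library):

```python
import math

def logit(p):
    """Log-odds of probability p."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit: maps log-odds back to a probability."""
    return 1 / (1 + math.exp(-z))

# Round trip: sigmoid(logit(p)) recovers p for any p in (0, 1)
for p in (0.1, 0.5, 0.75, 0.99):
    assert abs(sigmoid(logit(p)) - p) < 1e-12

# logit(0.5) = 0 and sigmoid(0) = 0.5, matching the symmetry point
assert logit(0.5) == 0.0
assert sigmoid(0.0) == 0.5
```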
The Sigmoid — Deep Dive
Properties You Must Know
- Range: \( \sigma(z) \in (0,1) \) — strictly, never hits 0 or 1. This models probability.
- Symmetry: \( \sigma(-z) = 1 - \sigma(z) \) — symmetric about the point \((0, 0.5)\).
- Inflection point: At \(z = 0\), \(\sigma(0) = 0.5\), and the curve bends from concave-up to concave-down.
- Saturation: For large \(|z|\), the gradient approaches 0 — this causes the vanishing gradient problem in deep networks.
- Monotonically increasing: Always increasing, so \(\sigma'(z) > 0\) for all \(z\).
The Derivative of the Sigmoid (Critical for Backprop)
Because \(\sigma'(z) = \sigma(z)(1-\sigma(z))\), during backpropagation you never recompute the sigmoid — you reuse the forward pass output. Also note: maximum gradient is at \(z=0\) where \(\sigma' = 0.25\), meaning the gradient is already weak — explaining why sigmoid networks saturate.
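The identity \(\sigma'(z) = \sigma(z)(1-\sigma(z))\) is easy to verify against a central finite-difference approximation (a small sketch; the function names are illustrative):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def sigmoid_grad(z):
    # Reuses the forward-pass value, exactly as done in backprop
    s = sigmoid(z)
    return s * (1 - s)

# Compare the closed form against a central finite difference
h = 1e-6
for z in (-3.0, 0.0, 2.5):
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
    assert abs(sigmoid_grad(z) - numeric) < 1e-8

# The maximum gradient, 0.25, occurs at z = 0
assert sigmoid_grad(0.0) == 0.25
```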
The Hypothesis and Decision Boundary
Full Hypothesis
For a feature vector \(\mathbf{x} \in \mathbb{R}^n\), with weight vector \(\mathbf{w} \in \mathbb{R}^n\) and bias \(b\):

\[ \hat{y} = \sigma(\mathbf{w}^T\mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^T\mathbf{x} + b)}} \]
Note \(\hat{y}\) is interpreted as the probability that sample \(\mathbf{x}\) belongs to class 1. The probability for class 0 is simply \(1 - \hat{y}\).
The Decision Boundary
We classify as class 1 when \(\hat{y} \geq 0.5\), i.e., when \(\sigma(z) \geq 0.5\). Since \(\sigma\) is monotone and \(\sigma(0) = 0.5\):

\[ \hat{y} \geq 0.5 \iff z = \mathbf{w}^T\mathbf{x} + b \geq 0 \]
The decision boundary is the set of points where \(\mathbf{w}^T\mathbf{x} + b = 0\). In 2D, this is a line. In \(n\) dimensions, it's a hyperplane.
The weight vector \(\mathbf{w}\) is the normal to the decision hyperplane. The dot product \(\mathbf{w}^T\mathbf{x}\) is the signed projection of \(\mathbf{x}\) onto \(\mathbf{w}\) — points on one side have positive projection, points on the other side have negative. This is pure linear algebra from your vectors background.
Since the boundary is linear in \(\mathbf{x}\), logistic regression is a linear classifier. It can only separate linearly separable classes — unless you engineer polynomial features (just like polynomial regression).
Why Not MSE? The Loss Function Problem
You know from linear regression that we minimize MSE. Why not do the same here with the sigmoid output?

\[ J_{\text{MSE}}(\mathbf{w}, b) = \frac{1}{m}\sum_{i=1}^{m}\left(\sigma(\mathbf{w}^T\mathbf{x}^{(i)} + b) - y^{(i)}\right)^2 \]
The problem: composed with the sigmoid, this squared-error loss is non-convex in the weights. The squared error is convex in \(\hat{y}\), but the sigmoid's S-shape makes the composition's curvature change sign as \(z\) varies, producing flat plateaus and local minima that make gradient descent unreliable.
For logistic regression, the cross-entropy loss is a convex function of the weights \(\mathbf{w}\) (strictly convex when the features are linearly independent and the classes are not perfectly separable). This guarantees that gradient descent, with an appropriate learning rate, converges to the global minimum. MSE composed with the sigmoid does not have this guarantee.
We derive the correct loss from Maximum Likelihood Estimation — which also gives it a deep statistical justification.
Maximum Likelihood Estimation — Full Derivation
MLE is the principled way to derive the loss. The idea: find parameters \(\mathbf{w}, b\) that make the observed training data most probable.
The Probabilistic Model
Each label \(y^{(i)} \in \{0,1\}\) is a Bernoulli random variable. Our model predicts:

\[ P(y=1 \mid \mathbf{x}) = \hat{y}, \qquad P(y=0 \mid \mathbf{x}) = 1 - \hat{y} \]

These two cases can be written compactly in a single equation using the Bernoulli PMF:

\[ P(y \mid \mathbf{x}) = \hat{y}^{\,y}\,(1-\hat{y})^{1-y} \]
Check: when \(y=1\), this gives \(\hat{y}^1 \cdot (1-\hat{y})^0 = \hat{y}\). When \(y=0\), this gives \(\hat{y}^0 \cdot (1-\hat{y})^1 = 1-\hat{y}\). ✓
The Likelihood Function
Assuming i.i.d. (independent, identically distributed) samples, the likelihood of the entire dataset is the product over all \(m\) examples:

\[ \mathcal{L}(\mathbf{w}, b) = \prod_{i=1}^{m} \left(\hat{y}^{(i)}\right)^{y^{(i)}} \left(1-\hat{y}^{(i)}\right)^{1-y^{(i)}} \]
Log-Likelihood (Turning Products into Sums)
Maximizing \(\mathcal{L}\) is equivalent to maximizing \(\log\mathcal{L}\) (since log is monotone). Taking the log converts products to sums — numerically more stable and mathematically easier to differentiate:

\[ \log\mathcal{L} = \sum_{i=1}^{m} \left[ y^{(i)}\log\hat{y}^{(i)} + \left(1-y^{(i)}\right)\log\left(1-\hat{y}^{(i)}\right) \right] \]
From Maximization to Minimization
We want to maximize the log-likelihood, but gradient descent minimizes. We negate and normalize:

\[ J(\mathbf{w}, b) = -\frac{1}{m}\sum_{i=1}^{m} \left[ y^{(i)}\log\hat{y}^{(i)} + \left(1-y^{(i)}\right)\log\left(1-\hat{y}^{(i)}\right) \right] \]
This is precisely the Binary Cross-Entropy — the average cross-entropy between the true label distribution and the predicted distribution. MLE and cross-entropy minimization are the same thing for Bernoulli outcomes.
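The equivalence is concrete, not just notational. Here is a small sketch (labels and predicted probabilities are made-up values) computing the negative average log-likelihood and the two-term binary cross-entropy and confirming they match:

```python
import math

# Made-up labels and predicted probabilities for five samples
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.7, 0.6, 0.1]
m = len(y_true)

# Negative average Bernoulli log-likelihood: -log P(true class)
nll = -sum(math.log(p if y == 1 else 1 - p)
           for y, p in zip(y_true, y_prob)) / m

# Binary cross-entropy in its usual two-term form
bce = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
           for y, p in zip(y_true, y_prob)) / m

assert abs(nll - bce) < 1e-12  # identical by construction
```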
Binary Cross-Entropy Loss — Intuition
The loss for a single sample is:

\[ \ell(y, \hat{y}) = -\left[ y\log\hat{y} + (1-y)\log(1-\hat{y}) \right] \]
Let's understand what this penalizes:
| True y | Predicted ŷ | Loss | Interpretation |
|---|---|---|---|
| 1 | 0.99 | \(-\log(0.99) \approx 0.01\) | Correct, confident → tiny loss |
| 1 | 0.50 | \(-\log(0.5) \approx 0.69\) | Correct, uncertain → moderate loss |
| 1 | 0.01 | \(-\log(0.01) \approx 4.6\) | Wrong, confident → huge loss |
| 0 | 0.01 | \(-\log(0.99) \approx 0.01\) | Correct, confident → tiny loss |
| 0 | 0.99 | \(-\log(0.01) \approx 4.6\) | Wrong, confident → huge loss |
Cross-entropy punishes confident wrong predictions extremely heavily — the loss approaches infinity as \(\hat{y} \to 0\) when \(y=1\). This is exactly the right behavior: being very wrong with confidence is the worst possible outcome in a probabilistic model.
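The table's values can be reproduced directly (a small sketch; `bce` is a helper defined here for illustration, not a library function):

```python
import math

def bce(y, y_hat):
    """Binary cross-entropy for a single sample."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

assert abs(bce(1, 0.99) - 0.01) < 0.001   # correct and confident: tiny loss
assert abs(bce(1, 0.50) - 0.693) < 0.001  # uncertain: moderate loss
assert abs(bce(1, 0.01) - 4.605) < 0.001  # confidently wrong: huge loss
assert abs(bce(0, 0.01) - bce(1, 0.99)) < 1e-12  # symmetric cases match
```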
The two cases of the loss function also have a clean form. When the sigmoid output \(\hat{y} = \sigma(z)\):

\[ \ell = \begin{cases} -\log\hat{y} = \log(1+e^{-z}) & \text{if } y = 1 \\ -\log(1-\hat{y}) = \log(1+e^{z}) & \text{if } y = 0 \end{cases} \]
Gradient Computation — The Full Chain Rule
This is the most mathematically rich section. We derive \(\frac{\partial J}{\partial w_j}\) from scratch using the chain rule. You've done partial derivatives, so let's go deep.
The Computational Graph

The forward computation chains three maps:

\[ \mathbf{x} \;\longrightarrow\; z = \mathbf{w}^T\mathbf{x} + b \;\longrightarrow\; \hat{y} = \sigma(z) \;\longrightarrow\; \ell(y, \hat{y}) \]

We differentiate backward through this chain, one link at a time.
Step 1 — Derivative of Loss w.r.t. ŷ

\[ \frac{\partial \ell}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} \]

Step 2 — Derivative of ŷ w.r.t. z (Sigmoid Derivative)

\[ \frac{\partial \hat{y}}{\partial z} = \hat{y}(1-\hat{y}) \]

Step 3 — Chain Rule: ∂J/∂z

\[ \frac{\partial \ell}{\partial z} = \frac{\partial \ell}{\partial \hat{y}}\,\frac{\partial \hat{y}}{\partial z} = \left(-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\right)\hat{y}(1-\hat{y}) = \hat{y} - y \]
The derivative of the log-loss with respect to the pre-activation \(z\) is simply \(\hat{y} - y\) — the residual (prediction minus truth). This is the same form as in linear regression! The sigmoid's derivative and the log's derivative cancel each other perfectly. This is not a coincidence — it's a consequence of the sigmoid being the canonical link function for Bernoulli outcomes.
Step 4 — Gradient w.r.t. Weights and Bias
Since \(z = \mathbf{w}^T\mathbf{x} + b\), we have \(\frac{\partial z}{\partial w_j} = x_j\) and \(\frac{\partial z}{\partial b} = 1\), so:

\[ \frac{\partial \ell}{\partial w_j} = (\hat{y} - y)\,x_j, \qquad \frac{\partial \ell}{\partial b} = \hat{y} - y \]
For the full dataset with \(m\) samples (averaging over all):

\[ \nabla_{\mathbf{w}} J = \frac{1}{m}\,\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y}), \qquad \frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right) \]

where \(\mathbf{X} \in \mathbb{R}^{m \times n}\) stacks the samples as rows, and \(\hat{\mathbf{y}}, \mathbf{y} \in \mathbb{R}^m\).
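The vectorized gradients can be sanity-checked against central finite differences on random data (a sketch; the toy dimensions and random seed are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3                      # toy dimensions
X = rng.normal(size=(m, n))
y = rng.integers(0, 2, size=m).astype(float)
w, b = rng.normal(size=n), 0.1

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, b):
    y_hat = sigmoid(X @ w + b)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Analytic gradients from the derivation above
y_hat = sigmoid(X @ w + b)
dw = X.T @ (y_hat - y) / m
db = np.mean(y_hat - y)

# Central finite differences agree to high precision
h = 1e-6
for j in range(n):
    e = np.zeros(n); e[j] = h
    assert abs(dw[j] - (loss(w + e, b) - loss(w - e, b)) / (2 * h)) < 1e-6
assert abs(db - (loss(w, b + h) - loss(w, b - h)) / (2 * h)) < 1e-6
```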
Gradient Descent Update Rules
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, num_epochs=1000):
    m, n = X.shape
    w, b = np.zeros(n), 0.0          # initialize
    for epoch in range(num_epochs):
        # Forward pass
        z = X @ w + b                # (m,)
        y_hat = sigmoid(z)           # (m,) predictions
        # Loss (binary cross-entropy)
        J = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
        # Backward pass (gradients)
        error = y_hat - y            # (m,) — the residuals
        dw = X.T @ error / m         # (n,)
        db = error.mean()            # scalar
        # Update
        w -= alpha * dw
        b -= alpha * db
    return w, b
```
The three variants (from your gradient descent background):
| Variant | Batch Size | Update Frequency | Noise / Stability |
|---|---|---|---|
| Batch GD | All \(m\) | Once per epoch | Stable, but slow for large data |
| Stochastic GD | 1 sample | \(m\) times per epoch | Very noisy, can escape local minima |
| Mini-Batch GD | 32–512 | \(m/\text{batch}\) times | Balance of speed and stability ← standard |
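The mini-batch variant can be sketched as follows (a minimal illustration on made-up separable data; `minibatch_gd` and all hyperparameters are assumptions, not a reference implementation):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def minibatch_gd(X, y, lr=0.1, epochs=100, batch_size=32, seed=0):
    """Mini-batch gradient descent for logistic regression."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        order = rng.permutation(m)                # reshuffle every epoch
        for start in range(0, m, batch_size):
            batch = order[start:start + batch_size]
            error = sigmoid(X[batch] @ w + b) - y[batch]
            w -= lr * X[batch].T @ error / len(batch)
            b -= lr * error.mean()
    return w, b

# Made-up separable data: class 1 exactly when the first feature is positive
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)
w, b = minibatch_gd(X, y)
train_acc = np.mean((sigmoid(X @ w + b) >= 0.5) == y)
assert train_acc > 0.9
```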
Regularization — L1 and L2
From your bias-variance tradeoff knowledge: when a model overfits, we add a penalty on the weights to constrain their magnitude.
L2 Regularization (Ridge / Weight Decay)
We add a penalty proportional to the squared norm of the weights (the bias is conventionally not penalized):

\[ J_{\text{reg}}(\mathbf{w}, b) = J(\mathbf{w}, b) + \frac{\lambda}{2m}\|\mathbf{w}\|_2^2 \]

The \(\frac{\lambda}{2m}\) factor is conventional (the 2 cancels the square's derivative neatly). The new gradient for weights:

\[ \nabla_{\mathbf{w}} J_{\text{reg}} = \frac{1}{m}\,\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y}) + \frac{\lambda}{m}\,\mathbf{w} \]
The update rule becomes: \(\mathbf{w} \leftarrow \mathbf{w}(1 - \frac{\alpha\lambda}{m}) - \frac{\alpha}{m}\mathbf{X}^T(\hat{\mathbf{y}}-\mathbf{y})\) — the factor \((1 - \frac{\alpha\lambda}{m}) < 1\) shrinks the weights every step, hence the name "weight decay."
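The two ways of writing the L2 update (folding \(\frac{\lambda}{m}\mathbf{w}\) into the gradient, versus decaying the weights first) are algebraically identical; a short sketch with arbitrary numbers confirms it:

```python
import numpy as np

# Arbitrary illustrative values for m, λ, α, the weights, and the data gradient
m, lam, alpha = 100, 1.0, 0.5
w = np.array([2.0, -1.5])
grad_data = np.array([0.3, -0.1])   # stands in for (1/m) X^T(ŷ - y)

# Form 1: fold the penalty into the gradient, then take the step
w1 = w - alpha * (grad_data + (lam / m) * w)

# Form 2: decay the weights first, then step with the data gradient alone
w2 = w * (1 - alpha * lam / m) - alpha * grad_data

assert np.allclose(w1, w2)
```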
L2 penalizes large weights but never forces them exactly to zero. It promotes small, spread-out weights. Probabilistically, it corresponds to a Gaussian prior on the weights (MAP estimation with \(\mathcal{N}(0, 1/\lambda)\)).
L1 Regularization (Lasso)

\[ J_{\text{reg}}(\mathbf{w}, b) = J(\mathbf{w}, b) + \frac{\lambda}{m}\sum_{j=1}^{n}|w_j| \]

L1 can push weights exactly to zero, performing implicit feature selection. This is because the L1 ball has corners on the axes in weight space — the optimum tends to land exactly on an axis, zeroing some coordinates. Probabilistically, it corresponds to a Laplace prior on the weights.
| Property | L2 (Ridge) | L1 (Lasso) |
|---|---|---|
| Effect on weights | Shrinks, never zero | Can drive exactly to zero |
| Feature selection | No | Yes (sparse solution) |
| Differentiable? | Yes everywhere | No at \(w_j = 0\) |
| Prior equivalent | Gaussian prior | Laplace prior |
| Geometry | L2 ball (circle) | L1 ball (diamond) |
Multiclass: One-vs-Rest and Softmax
Logistic regression extends to \(K > 2\) classes in two ways.
Strategy 1: One-vs-Rest (OvR)
Train \(K\) binary logistic regressors. Classifier \(k\) learns: "is this class \(k\) vs. all other classes?" At test time, pick the class with the highest sigmoid score.
Drawback: The K sigmoid outputs don't sum to 1 — they don't form a proper probability distribution over classes.
Strategy 2: Softmax Regression (Multinomial Logistic Regression)
The natural generalization. For \(K\) classes, maintain weight vectors \(\mathbf{w}_1, \ldots, \mathbf{w}_K\). Compute \(K\) linear scores:

\[ z_k = \mathbf{w}_k^T\mathbf{x} + b_k, \qquad k = 1, \ldots, K \]
Apply the softmax function to convert to probabilities:

\[ P(y = k \mid \mathbf{x}) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} \]
Properties of softmax: all outputs in (0,1), and they sum to 1. When \(K=2\), softmax reduces exactly to the sigmoid — binary logistic regression is a special case.
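Both properties, plus the \(K=2\) reduction to the sigmoid, are easy to verify numerically (a sketch; the max-shift inside `softmax` is the standard numerical-stability trick and does not change the output):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

scores = np.array([2.0, -1.0, 0.5])
p = softmax(scores)
assert np.all((p > 0) & (p < 1))       # all outputs in (0, 1)
assert abs(p.sum() - 1.0) < 1e-12      # and they sum to 1

# K = 2 reduces to the sigmoid: softmax([z, 0])[0] == sigmoid(z)
z = 1.7
assert abs(softmax(np.array([z, 0.0]))[0] - sigmoid(z)) < 1e-12
```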
The loss generalizes to categorical cross-entropy:

\[ J = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} \mathbf{1}[y^{(i)} = k]\,\log P(y^{(i)} = k \mid \mathbf{x}^{(i)}) \]

where \(\mathbf{1}[y^{(i)}=k]\) is 1 if sample \(i\) belongs to class \(k\), else 0. This is equivalent to: for each sample, take the negative log of the predicted probability assigned to the true class.
The final layer of any classification neural network is exactly softmax regression. When you built your chest X-ray CNN, the last fully-connected layer + softmax was doing softmax regression — the rest of the network was learning a feature representation. Logistic regression = neural network with no hidden layers.
Geometric Interpretation
Let's consolidate the geometry — leveraging your vector and linear algebra background.
- The weight vector \(\mathbf{w}\): Points in the direction of maximum increase of the log-odds. Orthogonal to the decision boundary.
- The bias \(b\): Translates the boundary away from the origin. Without bias, the boundary must pass through the origin.
- Distance from a point to the boundary: \(d = \frac{|\mathbf{w}^T\mathbf{x}+b|}{\|\mathbf{w}\|}\); dropping the absolute value gives the signed distance. The further a point is from the boundary, the more extreme the sigmoid output (closer to 0 or 1).
- Confidence: \(|\mathbf{w}^T\mathbf{x}+b|\) measures how "confidently" a point is classified. A value near 0 means uncertain (ŷ ≈ 0.5); large magnitude means confident.
- Effect of \(\|\mathbf{w}\|\): Scaling up \(\mathbf{w}\) without changing direction sharpens the sigmoid — the model becomes more "peaked." At \(\|\mathbf{w}\| \to \infty\), the sigmoid approaches a hard step function.
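These geometric facts can be checked on a concrete 2-D example (all vectors below are made-up illustrations):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.array([3.0, 4.0])            # normal to the boundary; ||w|| = 5
b = -5.0

def signed_distance(x):
    """Signed distance from x to the hyperplane w^T x + b = 0."""
    return (w @ x + b) / np.linalg.norm(w)

# A point satisfying w^T x + b = 0 lies on the boundary: 3*3 + 4*(-1) - 5 = 0
assert abs(signed_distance(np.array([3.0, -1.0]))) < 1e-12

# Scaling w and b together leaves the boundary fixed but sharpens the sigmoid
x = np.array([2.0, 0.5])            # w^T x + b = 6 + 2 - 5 = 3
p_moderate = sigmoid(w @ x + b)
p_sharp = sigmoid(10 * (w @ x + b)) # same side of the boundary, more extreme
assert p_sharp > p_moderate > 0.5
```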
Assumptions of Logistic Regression
Unlike linear regression with its stringent Gauss-Markov assumptions, logistic regression has fewer but still important assumptions:
| Assumption | What it Means | What Happens if Violated |
|---|---|---|
| Binary (or categorical) outcome | \(y \in \{0,1\}\) | Model is inappropriate; use other methods |
| Log-odds are linear in features | \(\text{logit}(p) = \mathbf{w}^T\mathbf{x}+b\) | Underfitting; use feature engineering or nonlinear models |
| No multicollinearity | Features are not highly correlated | Unstable, large, noisy coefficients |
| Large sample size | MLE is asymptotically consistent | High variance estimates with small data; use regularization |
| No perfect separation | No feature perfectly predicts \(y\) | MLE diverges (\(\mathbf{w} \to \infty\)); regularization is required |
| Independence of observations | i.i.d. samples | Underestimated standard errors; use clustered SE |
For perfectly separable classes (e.g., all samples with \(x_1 > 5\) are class 1), the MLE pushes \(w_1 \to \infty\) — the sigmoid becomes a step function and the gradient approaches zero before convergence. Adding L2 regularization bounds the weights and fixes this numerically.
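A small sketch illustrates the failure mode on made-up 1-D separable data: without regularization the weight keeps growing, while an L2 penalty pins it at a finite value (hyperparameters and thresholds are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up, perfectly separable 1-D data: class 1 exactly when x > 0
X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

def train(lam, steps=20000, lr=0.5):
    """Plain gradient descent on the (optionally L2-regularized) loss."""
    w = 0.0
    for _ in range(steps):
        error = sigmoid(X * w) - y
        w -= lr * (np.mean(error * X) + lam * w)
    return w

w_unreg = train(lam=0.0)   # grows without bound: the MLE diverges
w_l2 = train(lam=0.1)      # the penalty pins the weight at a finite value

assert w_unreg > 5.0
assert 0 < w_l2 < 3.0
```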
Logistic Regression as a Neural Network
This connection is essential and directly bridges your ML and DL knowledge.
Formally: logistic regression is a single-layer neural network with:
- No hidden layers
- One output neuron (binary) or \(K\) output neurons (multiclass softmax)
- Sigmoid / softmax activation
- Cross-entropy loss
- Gradient descent training (backpropagation with only one "layer" of weights)
Adding hidden layers = a proper neural network. The same gradient derivation generalizes via backpropagation — the chain rule we worked through in §08 applies repeatedly through each layer.
Evaluation Metrics
Confusion Matrix and Core Metrics
| Metric | Formula | Measures |
|---|---|---|
| Accuracy | \(\frac{TP+TN}{TP+TN+FP+FN}\) | Overall correctness (misleading for imbalanced data) |
| Precision | \(\frac{TP}{TP+FP}\) | "Of predicted positives, how many were actually positive?" |
| Recall (Sensitivity) | \(\frac{TP}{TP+FN}\) | "Of actual positives, how many were caught?" |
| F1 Score | \(\frac{2 \cdot P \cdot R}{P+R}\) | Harmonic mean of Precision and Recall |
| Specificity | \(\frac{TN}{TN+FP}\) | True negative rate (important in medical diagnosis) |
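Each metric is a one-liner from confusion-matrix counts (the counts below are hypothetical):

```python
# Hypothetical confusion-matrix counts
TP, TN, FP, FN = 80, 90, 10, 20

accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)          # sensitivity
f1          = 2 * precision * recall / (precision + recall)
specificity = TN / (TN + FP)          # true negative rate

assert accuracy == 0.85
assert abs(precision - 8 / 9) < 1e-12
assert recall == 0.8
assert abs(f1 - 0.8421) < 0.001
assert specificity == 0.9
```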
AUC-ROC
The Receiver Operating Characteristic plots TPR (Recall) vs FPR at varying thresholds. The AUC (Area Under Curve) measures the probability that the model ranks a random positive sample higher than a random negative one. AUC = 0.5 is random; AUC = 1.0 is perfect.
The default threshold of 0.5 is not always optimal. In your chest X-ray work, a false negative (missed disease) is far worse than a false positive — so you'd lower the threshold (e.g., 0.3) to increase recall at the cost of precision. The ROC curve helps select the optimal operating threshold for your specific cost structure.
Log-Loss (as a metric)
Log-loss evaluates the calibration of predicted probabilities — not just whether the rank order is right, but whether the probability values are accurate. A perfectly calibrated model has log-loss approaching the true entropy of the data.
The Full Mental Model — Everything Together
Here is the complete picture as a single unified narrative:
- Assume log-odds are linear in features → sigmoid hypothesis falls out naturally.
- Model each label as Bernoulli with \(p = \sigma(\mathbf{w}^T\mathbf{x}+b)\).
- Maximize the likelihood of the training data → derive cross-entropy loss via MLE.
- Compute gradients via chain rule: the beautiful result \(\partial J/\partial z = \hat{y}-y\).
- Update parameters via gradient descent: \(\mathbf{w} \leftarrow \mathbf{w} - \frac{\alpha}{m}\mathbf{X}^T(\hat{\mathbf{y}}-\mathbf{y})\).
- Regularize with L1 or L2 to prevent overfitting and handle near-separation.
- Extend to multiclass via softmax regression.
- Evaluate with AUC-ROC, F1, log-loss — not just accuracy.
- Recognize it as a 0-hidden-layer neural network — foundation of all deep learning classifiers.