Why Not Linear Regression?
You already know linear regression models a continuous target \( y \in \mathbb{R} \). For classification, the target is categorical — binary for now: \( y \in \{0, 1\} \). The question is: can we repurpose linear regression?
Let's try. Fit a line \( \hat{y} = \mathbf{w}^T \mathbf{x} + b \) and threshold at 0.5. Three fundamental problems arise:
1. Unbounded outputs. Linear regression outputs \( \hat{y} \in (-\infty, +\infty) \). We need a probability \( P \in [0, 1] \). A predicted value of 3.7 or −0.2 is mathematically meaningless as a probability.
2. Outlier sensitivity. Adding an extreme outlier far from the boundary can drag the regression line and drastically shift where it crosses the threshold — even though the boundary should be unaffected. The linear model is trying to minimize distance to all points, not find a boundary.
3. Wrong noise model. Linear regression assumes residuals are Gaussian-distributed around a continuous mean. Binary outputs violate this entirely. MSE loss is also not a convex function when composed with a sigmoid — but log-loss is. We'll formalize this in §05.
The solution: we still use a linear combination \( z = \mathbf{w}^T\mathbf{x} + b \) as the raw score, but we squash it through a function that maps \( \mathbb{R} \to (0,1) \). That function is the sigmoid. But to understand why the sigmoid specifically, we must first understand odds and the logit.
Probability, Odds, and the Logit
This is the conceptual heart of logistic regression. Most textbooks jump straight to the sigmoid formula — we'll derive why it has to be the sigmoid from probabilistic first principles.
From Probability to Odds
Let \( p = P(y=1 \mid \mathbf{x}) \) — the probability the sample belongs to class 1.
The odds of an event are the ratio of the probability that it happens to the probability that it doesn't:

\[ \text{odds}(p) = \frac{p}{1-p} \]
When \( p = 0.5 \): odds = 1 ("even odds"). When \( p = 0.75 \): odds = 3 ("3 to 1"). As \( p \to 1 \): odds \( \to \infty \). As \( p \to 0 \): odds \( \to 0 \). So odds live in \( (0, +\infty) \).
From Odds to Log-Odds (Logit)
The odds are still asymmetric (0 to ∞). Take the natural log:

\[ \text{logit}(p) = \ln\frac{p}{1-p} \]
Now the range is \( (-\infty, +\infty) \)! The logit is symmetric around 0: \( \text{logit}(0.5) = 0 \), and logit is an increasing function of \( p \).
The Central Assumption of Logistic Regression
Logistic regression assumes the log-odds are linear in the features:

\[ \ln\frac{p}{1-p} = \mathbf{w}^T\mathbf{x} + b \]
This is elegant: the log-odds are unbounded (matching the range of a linear function), and the linear model now lives on the correct scale. Now we just invert the logit to get back to probability \( p \).
Inverting the Logit → The Sigmoid
The sigmoid isn't an arbitrary choice. Solving \( \ln\frac{p}{1-p} = z \) for \( p \):

\[ \frac{p}{1-p} = e^z \quad\Rightarrow\quad p = \frac{e^z}{1+e^z} = \frac{1}{1+e^{-z}} = \sigma(z) \]

The sigmoid is the exact inverse of the logit function. Choosing a sigmoid output is mathematically equivalent to assuming the log-odds are linear in the features.
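A quick numerical round-trip confirms that the sigmoid undoes the logit (a minimal sketch; `logit` and `sigmoid` are defined here for illustration, not imported from a library):

```python
import math

def logit(p):
    """Log-odds of probability p."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit: maps log-odds back to a probability."""
    return 1 / (1 + math.exp(-z))

# Round trip: sigmoid(logit(p)) recovers p for any p in (0, 1)
for p in (0.1, 0.5, 0.75, 0.99):
    assert abs(sigmoid(logit(p)) - p) < 1e-12

# logit(0.5) = 0 and sigmoid(0) = 0.5, matching the symmetry point
assert logit(0.5) == 0.0
assert sigmoid(0.0) == 0.5
```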
The Sigmoid — Deep Dive
Properties You Must Know
- Range: \( \sigma(z) \in (0,1) \) — strictly, never hits 0 or 1. This models probability.
- Symmetry: \( \sigma(-z) = 1 - \sigma(z) \) — symmetric about the point \((0, 0.5)\).
- Inflection point: At \(z = 0\), \(\sigma(0) = 0.5\), and the curve bends from concave-up to concave-down.
- Saturation: For large \(|z|\), the gradient approaches 0 — this causes the vanishing gradient problem in deep networks.
- Monotonically increasing: Always increasing, so \(\sigma'(z) > 0\) for all \(z\).
The Derivative of the Sigmoid (Critical for Backprop)
Because \(\sigma'(z) = \sigma(z)(1-\sigma(z))\), during backpropagation you never recompute the sigmoid — you reuse the forward pass output. Also note: maximum gradient is at \(z=0\) where \(\sigma' = 0.25\), meaning the gradient is already weak — explaining why sigmoid networks saturate.
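The identity \(\sigma'(z) = \sigma(z)(1-\sigma(z))\) is easy to verify against a central finite-difference approximation (a small sketch; the function names are illustrative):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def sigmoid_grad(z):
    # Reuses the forward-pass value, exactly as done in backprop
    s = sigmoid(z)
    return s * (1 - s)

# Compare the closed form against a central finite difference
h = 1e-6
for z in (-3.0, 0.0, 2.5):
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
    assert abs(sigmoid_grad(z) - numeric) < 1e-8

# The maximum gradient, 0.25, occurs at z = 0
assert sigmoid_grad(0.0) == 0.25
```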
The Hypothesis and Decision Boundary
Full Hypothesis
For a feature vector \(\mathbf{x} \in \mathbb{R}^n\), with weight vector \(\mathbf{w} \in \mathbb{R}^n\) and bias \(b\):

\[ \hat{y} = \sigma(\mathbf{w}^T\mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^T\mathbf{x} + b)}} \]
Note \(\hat{y}\) is interpreted as the probability that sample \(\mathbf{x}\) belongs to class 1. The probability for class 0 is simply \(1 - \hat{y}\).
The Decision Boundary
We classify as class 1 when \(\hat{y} \geq 0.5\), i.e., when \(\sigma(z) \geq 0.5\). Since \(\sigma\) is monotone and \(\sigma(0) = 0.5\):

\[ \hat{y} \geq 0.5 \iff z = \mathbf{w}^T\mathbf{x} + b \geq 0 \]
The decision boundary is the set of points where \(\mathbf{w}^T\mathbf{x} + b = 0\). In 2D, this is a line. In \(n\) dimensions, it's a hyperplane.
The weight vector \(\mathbf{w}\) is the normal to the decision hyperplane. The dot product \(\mathbf{w}^T\mathbf{x}\) is the signed projection of \(\mathbf{x}\) onto \(\mathbf{w}\) — points on one side have positive projection, points on the other side have negative. This is pure linear algebra from your vectors background.
Since the boundary is linear in \(\mathbf{x}\), logistic regression is a linear classifier. It can only separate linearly separable classes — unless you engineer polynomial features (just like polynomial regression).
Why Not MSE? The Loss Function Problem
You know from linear regression that we minimize MSE. Why not do the same here with the sigmoid output?

\[ J_{\text{MSE}}(\mathbf{w}, b) = \frac{1}{m}\sum_{i=1}^{m}\left(\sigma(\mathbf{w}^T\mathbf{x}^{(i)} + b) - y^{(i)}\right)^2 \]
The problem: composed with the sigmoid, this squared-error loss is non-convex in the weights. The squared error is convex in \(\hat{y}\), but the sigmoid's S-shape makes the composition's curvature change sign as \(z\) varies, producing flat plateaus and local minima that make gradient descent unreliable.
For logistic regression, the cross-entropy loss is a convex function of the weights \(\mathbf{w}\) (strictly convex when the features are linearly independent and the classes are not perfectly separable). This guarantees that gradient descent, with an appropriate learning rate, converges to the global minimum. MSE composed with the sigmoid does not have this guarantee.
We derive the correct loss from Maximum Likelihood Estimation — which also gives it a deep statistical justification.
Maximum Likelihood Estimation — Full Derivation
MLE is the principled way to derive the loss. The idea: find parameters \(\mathbf{w}, b\) that make the observed training data most probable.
The Probabilistic Model
Each label \(y^{(i)} \in \{0,1\}\) is a Bernoulli random variable. Our model predicts:

\[ P(y=1 \mid \mathbf{x}) = \hat{y}, \qquad P(y=0 \mid \mathbf{x}) = 1 - \hat{y} \]

These two cases can be written compactly in a single equation using the Bernoulli PMF:

\[ P(y \mid \mathbf{x}) = \hat{y}^{\,y}\,(1-\hat{y})^{1-y} \]
Check: when \(y=1\), this gives \(\hat{y}^1 \cdot (1-\hat{y})^0 = \hat{y}\). When \(y=0\), this gives \(\hat{y}^0 \cdot (1-\hat{y})^1 = 1-\hat{y}\). ✓
The Likelihood Function
Assuming i.i.d. (independent, identically distributed) samples, the likelihood of the entire dataset is the product over all \(m\) examples:

\[ \mathcal{L}(\mathbf{w}, b) = \prod_{i=1}^{m} \left(\hat{y}^{(i)}\right)^{y^{(i)}} \left(1-\hat{y}^{(i)}\right)^{1-y^{(i)}} \]
Log-Likelihood (Turning Products into Sums)
Maximizing \(\mathcal{L}\) is equivalent to maximizing \(\log\mathcal{L}\) (since log is monotone). Taking the log converts products to sums — numerically more stable and mathematically easier to differentiate:

\[ \log\mathcal{L} = \sum_{i=1}^{m} \left[ y^{(i)}\log\hat{y}^{(i)} + \left(1-y^{(i)}\right)\log\left(1-\hat{y}^{(i)}\right) \right] \]
From Maximization to Minimization
We want to maximize the log-likelihood, but gradient descent minimizes. We negate and normalize:

\[ J(\mathbf{w}, b) = -\frac{1}{m}\sum_{i=1}^{m} \left[ y^{(i)}\log\hat{y}^{(i)} + \left(1-y^{(i)}\right)\log\left(1-\hat{y}^{(i)}\right) \right] \]
This is precisely the Binary Cross-Entropy — the average cross-entropy between the true label distribution and the predicted distribution. MLE and cross-entropy minimization are the same thing for Bernoulli outcomes.
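The equivalence is concrete, not just notational. Here is a small sketch (labels and predicted probabilities are made-up values) computing the negative average log-likelihood and the two-term binary cross-entropy and confirming they match:

```python
import math

# Made-up labels and predicted probabilities for five samples
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.7, 0.6, 0.1]
m = len(y_true)

# Negative average Bernoulli log-likelihood: -log P(true class)
nll = -sum(math.log(p if y == 1 else 1 - p)
           for y, p in zip(y_true, y_prob)) / m

# Binary cross-entropy in its usual two-term form
bce = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
           for y, p in zip(y_true, y_prob)) / m

assert abs(nll - bce) < 1e-12  # identical by construction
```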
Binary Cross-Entropy Loss — Intuition
The loss for a single sample is:

\[ \ell(y, \hat{y}) = -\left[ y\log\hat{y} + (1-y)\log(1-\hat{y}) \right] \]
Let's understand what this penalizes:
| True y | Predicted ŷ | Loss | Interpretation |
|---|---|---|---|
| 1 | 0.99 | \(-\log(0.99) \approx 0.01\) | Correct, confident → tiny loss |
| 1 | 0.50 | \(-\log(0.5) \approx 0.69\) | Correct, uncertain → moderate loss |
| 1 | 0.01 | \(-\log(0.01) \approx 4.6\) | Wrong, confident → huge loss |
| 0 | 0.01 | \(-\log(0.99) \approx 0.01\) | Correct, confident → tiny loss |
| 0 | 0.99 | \(-\log(0.01) \approx 4.6\) | Wrong, confident → huge loss |
Cross-entropy punishes confident wrong predictions extremely heavily — the loss approaches infinity as \(\hat{y} \to 0\) when \(y=1\). This is exactly the right behavior: being very wrong with confidence is the worst possible outcome in a probabilistic model.
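The table's values can be reproduced directly (a small sketch; `bce` is a helper defined here for illustration, not a library function):

```python
import math

def bce(y, y_hat):
    """Binary cross-entropy for a single sample."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

assert abs(bce(1, 0.99) - 0.01) < 0.001   # correct and confident: tiny loss
assert abs(bce(1, 0.50) - 0.693) < 0.001  # uncertain: moderate loss
assert abs(bce(1, 0.01) - 4.605) < 0.001  # confidently wrong: huge loss
assert abs(bce(0, 0.01) - bce(1, 0.99)) < 1e-12  # symmetric cases match
```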
The two cases of the loss function also have a clean form. When the sigmoid output \(\hat{y} = \sigma(z)\):

\[ \ell = \begin{cases} -\log\hat{y} = \log(1+e^{-z}) & \text{if } y = 1 \\ -\log(1-\hat{y}) = \log(1+e^{z}) & \text{if } y = 0 \end{cases} \]
Gradient Computation — The Full Chain Rule
This is the most mathematically rich section. We derive \(\frac{\partial J}{\partial w_j}\) from scratch using the chain rule. You've done partial derivatives, so let's go deep.
The Computational Graph

The forward computation chains three maps:

\[ \mathbf{x} \;\longrightarrow\; z = \mathbf{w}^T\mathbf{x} + b \;\longrightarrow\; \hat{y} = \sigma(z) \;\longrightarrow\; \ell(y, \hat{y}) \]

We differentiate backward through this chain, one link at a time.
Step 1 — Derivative of Loss w.r.t. ŷ

\[ \frac{\partial \ell}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} \]

Step 2 — Derivative of ŷ w.r.t. z (Sigmoid Derivative)

\[ \frac{\partial \hat{y}}{\partial z} = \hat{y}(1-\hat{y}) \]

Step 3 — Chain Rule: ∂J/∂z

\[ \frac{\partial \ell}{\partial z} = \frac{\partial \ell}{\partial \hat{y}}\,\frac{\partial \hat{y}}{\partial z} = \left(-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\right)\hat{y}(1-\hat{y}) = \hat{y} - y \]
The derivative of the log-loss with respect to the pre-activation \(z\) is simply \(\hat{y} - y\) — the residual (prediction minus truth). This is the same form as in linear regression! The sigmoid's derivative and the log's derivative cancel each other perfectly. This is not a coincidence — it's a consequence of the sigmoid being the canonical link function for Bernoulli outcomes.
Step 4 — Gradient w.r.t. Weights and Bias
Since \(z = \mathbf{w}^T\mathbf{x} + b\), we have \(\frac{\partial z}{\partial w_j} = x_j\) and \(\frac{\partial z}{\partial b} = 1\), so:

\[ \frac{\partial \ell}{\partial w_j} = (\hat{y} - y)\,x_j, \qquad \frac{\partial \ell}{\partial b} = \hat{y} - y \]
For the full dataset with \(m\) samples (averaging over all):

\[ \nabla_{\mathbf{w}} J = \frac{1}{m}\,\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y}), \qquad \frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right) \]

where \(\mathbf{X} \in \mathbb{R}^{m \times n}\) stacks the samples as rows, and \(\hat{\mathbf{y}}, \mathbf{y} \in \mathbb{R}^m\).
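The vectorized gradients can be sanity-checked against central finite differences on random data (a sketch; the toy dimensions and random seed are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3                      # toy dimensions
X = rng.normal(size=(m, n))
y = rng.integers(0, 2, size=m).astype(float)
w, b = rng.normal(size=n), 0.1

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, b):
    y_hat = sigmoid(X @ w + b)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Analytic gradients from the derivation above
y_hat = sigmoid(X @ w + b)
dw = X.T @ (y_hat - y) / m
db = np.mean(y_hat - y)

# Central finite differences agree to high precision
h = 1e-6
for j in range(n):
    e = np.zeros(n); e[j] = h
    assert abs(dw[j] - (loss(w + e, b) - loss(w - e, b)) / (2 * h)) < 1e-6
assert abs(db - (loss(w, b + h) - loss(w, b - h)) / (2 * h)) < 1e-6
```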
Gradient Descent Update Rules
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, num_epochs=1000):
    m, n = X.shape
    w, b = np.zeros(n), 0.0          # initialize
    for epoch in range(num_epochs):
        # Forward pass
        z = X @ w + b                # (m,)
        y_hat = sigmoid(z)           # (m,) predictions
        # Loss (binary cross-entropy)
        J = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
        # Backward pass (gradients)
        error = y_hat - y            # (m,) — the residuals
        dw = X.T @ error / m         # (n,)
        db = error.mean()            # scalar
        # Update
        w -= alpha * dw
        b -= alpha * db
    return w, b
```
The three variants (from your gradient descent background):
| Variant | Batch Size | Update Frequency | Noise / Stability |
|---|---|---|---|
| Batch GD | All \(m\) | Once per epoch | Stable, but slow for large data |
| Stochastic GD | 1 sample | \(m\) times per epoch | Very noisy, can escape local minima |
| Mini-Batch GD | 32–512 | \(m/\text{batch}\) times | Balance of speed and stability ← standard |
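The mini-batch variant can be sketched as follows (a minimal illustration on made-up separable data; `minibatch_gd` and all hyperparameters are assumptions, not a reference implementation):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def minibatch_gd(X, y, lr=0.1, epochs=100, batch_size=32, seed=0):
    """Mini-batch gradient descent for logistic regression."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        order = rng.permutation(m)                # reshuffle every epoch
        for start in range(0, m, batch_size):
            batch = order[start:start + batch_size]
            error = sigmoid(X[batch] @ w + b) - y[batch]
            w -= lr * X[batch].T @ error / len(batch)
            b -= lr * error.mean()
    return w, b

# Made-up separable data: class 1 exactly when the first feature is positive
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)
w, b = minibatch_gd(X, y)
train_acc = np.mean((sigmoid(X @ w + b) >= 0.5) == y)
assert train_acc > 0.9
```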
Regularization — L1 and L2
From your bias-variance tradeoff knowledge: when a model overfits, we add a penalty on the weights to constrain their magnitude.
L2 Regularization (Ridge / Weight Decay)
We add a penalty proportional to the squared norm of the weights (the bias is conventionally not penalized):

\[ J_{\text{reg}}(\mathbf{w}, b) = J(\mathbf{w}, b) + \frac{\lambda}{2m}\|\mathbf{w}\|_2^2 \]

The \(\frac{\lambda}{2m}\) factor is conventional (the 2 cancels the square's derivative neatly). The new gradient for weights:

\[ \nabla_{\mathbf{w}} J_{\text{reg}} = \frac{1}{m}\,\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y}) + \frac{\lambda}{m}\,\mathbf{w} \]
The update rule becomes: \(\mathbf{w} \leftarrow \mathbf{w}(1 - \frac{\alpha\lambda}{m}) - \frac{\alpha}{m}\mathbf{X}^T(\hat{\mathbf{y}}-\mathbf{y})\) — the factor \((1 - \frac{\alpha\lambda}{m}) < 1\) shrinks the weights every step, hence the name "weight decay."
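The two ways of writing the L2 update (folding \(\frac{\lambda}{m}\mathbf{w}\) into the gradient, versus decaying the weights first) are algebraically identical; a short sketch with arbitrary numbers confirms it:

```python
import numpy as np

# Arbitrary illustrative values for m, λ, α, the weights, and the data gradient
m, lam, alpha = 100, 1.0, 0.5
w = np.array([2.0, -1.5])
grad_data = np.array([0.3, -0.1])   # stands in for (1/m) X^T(ŷ - y)

# Form 1: fold the penalty into the gradient, then take the step
w1 = w - alpha * (grad_data + (lam / m) * w)

# Form 2: decay the weights first, then step with the data gradient alone
w2 = w * (1 - alpha * lam / m) - alpha * grad_data

assert np.allclose(w1, w2)
```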
L2 penalizes large weights but never forces them exactly to zero. It promotes small, spread-out weights. Probabilistically, it corresponds to a Gaussian prior on the weights (MAP estimation with \(\mathcal{N}(0, 1/\lambda)\)).
L1 Regularization (Lasso)

\[ J_{\text{reg}}(\mathbf{w}, b) = J(\mathbf{w}, b) + \frac{\lambda}{m}\sum_{j=1}^{n}|w_j| \]

L1 can push weights exactly to zero, performing implicit feature selection. This is because the L1 ball has corners on the axes in weight space — the optimum tends to land exactly on an axis, zeroing some coordinates. Probabilistically, it corresponds to a Laplace prior on the weights.
| Property | L2 (Ridge) | L1 (Lasso) |
|---|---|---|
| Effect on weights | Shrinks, never zero | Can drive exactly to zero |
| Feature selection | No | Yes (sparse solution) |
| Differentiable? | Yes everywhere | No at \(w_j = 0\) |
| Prior equivalent | Gaussian prior | Laplace prior |
| Geometry | L2 ball (circle) | L1 ball (diamond) |
Multiclass: One-vs-Rest and Softmax
Logistic regression extends to \(K > 2\) classes in two ways.
Strategy 1: One-vs-Rest (OvR)
Train \(K\) binary logistic regressors. Classifier \(k\) learns: "is this class \(k\) vs. all other classes?" At test time, pick the class with the highest sigmoid score.
Drawback: The K sigmoid outputs don't sum to 1 — they don't form a proper probability distribution over classes.
Strategy 2: Softmax Regression (Multinomial Logistic Regression)
The natural generalization. For \(K\) classes, maintain weight vectors \(\mathbf{w}_1, \ldots, \mathbf{w}_K\). Compute \(K\) linear scores:

\[ z_k = \mathbf{w}_k^T\mathbf{x} + b_k, \qquad k = 1, \ldots, K \]
Apply the softmax function to convert to probabilities:

\[ P(y = k \mid \mathbf{x}) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} \]
Properties of softmax: all outputs in (0,1), and they sum to 1. When \(K=2\), softmax reduces exactly to the sigmoid — binary logistic regression is a special case.
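Both properties, plus the \(K=2\) reduction to the sigmoid, are easy to verify numerically (a sketch; the max-shift inside `softmax` is the standard numerical-stability trick and does not change the output):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

scores = np.array([2.0, -1.0, 0.5])
p = softmax(scores)
assert np.all((p > 0) & (p < 1))       # all outputs in (0, 1)
assert abs(p.sum() - 1.0) < 1e-12      # and they sum to 1

# K = 2 reduces to the sigmoid: softmax([z, 0])[0] == sigmoid(z)
z = 1.7
assert abs(softmax(np.array([z, 0.0]))[0] - sigmoid(z)) < 1e-12
```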
The loss generalizes to categorical cross-entropy:

\[ J = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} \mathbf{1}[y^{(i)} = k]\,\log P(y^{(i)} = k \mid \mathbf{x}^{(i)}) \]

where \(\mathbf{1}[y^{(i)}=k]\) is 1 if sample \(i\) belongs to class \(k\), else 0. This is equivalent to: for each sample, take the negative log of the predicted probability assigned to the true class.
The final layer of any classification neural network is exactly softmax regression. When you built your chest X-ray CNN, the last fully-connected layer + softmax was doing softmax regression — the rest of the network was learning a feature representation. Logistic regression = neural network with no hidden layers.
Geometric Interpretation
Let's consolidate the geometry — leveraging your vector and linear algebra background.
- The weight vector \(\mathbf{w}\): Points in the direction of maximum increase of the log-odds. Orthogonal to the decision boundary.
- The bias \(b\): Translates the boundary away from the origin. Without bias, the boundary must pass through the origin.
- Distance from a point to the boundary: \(d = \frac{|\mathbf{w}^T\mathbf{x}+b|}{\|\mathbf{w}\|}\); dropping the absolute value gives the signed distance. The further a point is from the boundary, the more extreme the sigmoid output (closer to 0 or 1).
- Confidence: \(|\mathbf{w}^T\mathbf{x}+b|\) measures how "confidently" a point is classified. A value near 0 means uncertain (ŷ ≈ 0.5); large magnitude means confident.
- Effect of \(\|\mathbf{w}\|\): Scaling up \(\mathbf{w}\) without changing direction sharpens the sigmoid — the model becomes more "peaked." At \(\|\mathbf{w}\| \to \infty\), the sigmoid approaches a hard step function.
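These geometric facts can be checked on a concrete 2-D example (all vectors below are made-up illustrations):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.array([3.0, 4.0])            # normal to the boundary; ||w|| = 5
b = -5.0

def signed_distance(x):
    """Signed distance from x to the hyperplane w^T x + b = 0."""
    return (w @ x + b) / np.linalg.norm(w)

# A point satisfying w^T x + b = 0 lies on the boundary: 3*3 + 4*(-1) - 5 = 0
assert abs(signed_distance(np.array([3.0, -1.0]))) < 1e-12

# Scaling w and b together leaves the boundary fixed but sharpens the sigmoid
x = np.array([2.0, 0.5])            # w^T x + b = 6 + 2 - 5 = 3
p_moderate = sigmoid(w @ x + b)
p_sharp = sigmoid(10 * (w @ x + b)) # same side of the boundary, more extreme
assert p_sharp > p_moderate > 0.5
```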
Assumptions of Logistic Regression
Unlike linear regression with its stringent Gauss-Markov assumptions, logistic regression has fewer but still important assumptions:
| Assumption | What it Means | What Happens if Violated |
|---|---|---|
| Binary (or categorical) outcome | \(y \in \{0,1\}\) | Model is inappropriate; use other methods |
| Log-odds are linear in features | \(\text{logit}(p) = \mathbf{w}^T\mathbf{x}+b\) | Underfitting; use feature engineering or nonlinear models |
| No multicollinearity | Features are not highly correlated | Unstable, large, noisy coefficients |
| Large sample size | MLE is asymptotically consistent | High variance estimates with small data; use regularization |
| No perfect separation | No feature perfectly predicts \(y\) | MLE diverges (\(\mathbf{w} \to \infty\)); regularization is required |
| Independence of observations | i.i.d. samples | Underestimated standard errors; use clustered SE |
For perfectly separable classes (e.g., all samples with \(x_1 > 5\) are class 1), the MLE pushes \(w_1 \to \infty\) — the sigmoid becomes a step function and the gradient approaches zero before convergence. Adding L2 regularization bounds the weights and fixes this numerically.
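A small sketch illustrates the failure mode on made-up 1-D separable data: without regularization the weight keeps growing, while an L2 penalty pins it at a finite value (hyperparameters and thresholds are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up, perfectly separable 1-D data: class 1 exactly when x > 0
X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

def train(lam, steps=20000, lr=0.5):
    """Plain gradient descent on the (optionally L2-regularized) loss."""
    w = 0.0
    for _ in range(steps):
        error = sigmoid(X * w) - y
        w -= lr * (np.mean(error * X) + lam * w)
    return w

w_unreg = train(lam=0.0)   # grows without bound: the MLE diverges
w_l2 = train(lam=0.1)      # the penalty pins the weight at a finite value

assert w_unreg > 5.0
assert 0 < w_l2 < 3.0
```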
Logistic Regression as a Neural Network
This connection is essential and directly bridges your ML and DL knowledge.
Formally: logistic regression is a single-layer neural network with:
- No hidden layers
- One output neuron (binary) or \(K\) output neurons (multiclass softmax)
- Sigmoid / softmax activation
- Cross-entropy loss
- Gradient descent training (backpropagation with only one "layer" of weights)
Adding hidden layers = a proper neural network. The same gradient derivation generalizes via backpropagation — the chain rule we worked through in §08 applies repeatedly through each layer.
Evaluation Metrics
Confusion Matrix and Core Metrics
| Metric | Formula | Measures |
|---|---|---|
| Accuracy | \(\frac{TP+TN}{TP+TN+FP+FN}\) | Overall correctness (misleading for imbalanced data) |
| Precision | \(\frac{TP}{TP+FP}\) | "Of predicted positives, how many were actually positive?" |
| Recall (Sensitivity) | \(\frac{TP}{TP+FN}\) | "Of actual positives, how many were caught?" |
| F1 Score | \(\frac{2 \cdot P \cdot R}{P+R}\) | Harmonic mean of Precision and Recall |
| Specificity | \(\frac{TN}{TN+FP}\) | True negative rate (important in medical diagnosis) |
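Each metric is a one-liner from confusion-matrix counts (the counts below are hypothetical):

```python
# Hypothetical confusion-matrix counts
TP, TN, FP, FN = 80, 90, 10, 20

accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)          # sensitivity
f1          = 2 * precision * recall / (precision + recall)
specificity = TN / (TN + FP)          # true negative rate

assert accuracy == 0.85
assert abs(precision - 8 / 9) < 1e-12
assert recall == 0.8
assert abs(f1 - 0.8421) < 0.001
assert specificity == 0.9
```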
AUC-ROC
The Receiver Operating Characteristic plots TPR (Recall) vs FPR at varying thresholds. The AUC (Area Under Curve) measures the probability that the model ranks a random positive sample higher than a random negative one. AUC = 0.5 is random; AUC = 1.0 is perfect.
The default threshold of 0.5 is not always optimal. In your chest X-ray work, a false negative (missed disease) is far worse than a false positive — so you'd lower the threshold (e.g., 0.3) to increase recall at the cost of precision. The ROC curve helps select the optimal operating threshold for your specific cost structure.
Log-Loss (as a metric)
Log-loss evaluates the calibration of predicted probabilities — not just whether the rank order is right, but whether the probability values are accurate. A perfectly calibrated model has log-loss approaching the true entropy of the data.
The Full Mental Model — Everything Together
Here is the complete picture as a single unified narrative:
- Assume log-odds are linear in features → sigmoid hypothesis falls out naturally.
- Model each label as Bernoulli with \(p = \sigma(\mathbf{w}^T\mathbf{x}+b)\).
- Maximize the likelihood of the training data → derive cross-entropy loss via MLE.
- Compute gradients via chain rule: the beautiful result \(\partial J/\partial z = \hat{y}-y\).
- Update parameters via gradient descent: \(\mathbf{w} \leftarrow \mathbf{w} - \frac{\alpha}{m}\mathbf{X}^T(\hat{\mathbf{y}}-\mathbf{y})\).
- Regularize with L1 or L2 to prevent overfitting and handle near-separation.
- Extend to multiclass via softmax regression.
- Evaluate with AUC-ROC, F1, log-loss — not just accuracy.
- Recognize it as a 0-hidden-layer neural network — foundation of all deep learning classifiers.