Mathematics · Optimization Theory · The Bridge

the geometry of the optimum Convex Optimization

⊳ Builds directly on Gradient Descent (first-order local methods) and Lagrange Multipliers (constrained stationarity) — convexity is the property that makes both of these methods provably correct, not just locally optimal but globally optimal.
Convex Sets & Functions Jensen's Inequality 1st & 2nd Order Conditions Strong Convexity Lipschitz Continuity Duality & KKT Subgradients Proximal Methods Interior Point Convergence Rates
x y λx+(1−λ)y supporting hyperplane C C
Convex Sets
∇²f≥0
2nd Order
KKT
Optimality
O(1/k)
Convergence
prox
Operators
01
Geometry

Convex Sets

// definition · examples · operations preserving convexity · why shape matters

Convexity is fundamentally a statement about shape. A convex set has no "dents" — if you pick any two points in the set, the entire straight line segment between them stays inside the set. This simple geometric property has enormous consequences for optimization.

Definition
Convex SetDefinition
\[C \subseteq \mathbb{R}^n \text{ is convex} \iff \forall \mathbf{x},\mathbf{y}\in C,\;\forall\lambda\in[0,1]:\quad \lambda\mathbf{x} + (1-\lambda)\mathbf{y} \in C\]
The point λx + (1−λ)y is a convex combination of x and y — it traces the line segment from y (λ=0) to x (λ=1) as λ varies. The definition says: this entire segment must lie in C. Equivalently, C contains every convex combination of its points.
Canonical Examples
SetDefinitionConvex?
Hyperplane\(\{x: \mathbf{a}^T\mathbf{x}=b\}\)Yes — affine
Halfspace\(\{x: \mathbf{a}^T\mathbf{x}\leq b\}\)Yes
Ball\(\{x: \|x-x_0\|\leq r\}\)Yes
Polyhedron\(\{x: A\mathbf{x}\leq b\}\) (intersection of halfspaces)Yes
Positive semidefinite cone\(\{X\in\mathbb{S}^n: X\succeq 0\}\)Yes
Annulus\(\{x: 1\leq\|x\|\leq2\}\)No — has a hole
Union of two disjoint balls\(B_1 \cup B_2\)No — line segment exits
Operations That Preserve Convexity
  • Intersection: If \(C_1, C_2\) are convex, so is \(C_1 \cap C_2\). This is why a polyhedron (intersection of halfspaces) is always convex — and why feasible regions in linear programs (intersections of linear constraints) are convex.
  • Affine images: If \(C\) is convex and \(f(\mathbf{x}) = A\mathbf{x}+\mathbf{b}\), then \(f(C) = \{f(\mathbf{x}):\mathbf{x}\in C\}\) is convex. Scaling, rotation, translation, and projection all preserve convexity.
  • Minkowski sum: \(C_1 + C_2 = \{x_1+x_2: x_1\in C_1, x_2\in C_2\}\) is convex if \(C_1, C_2\) are convex.
  • Cartesian product: \(C_1 \times C_2\) is convex if \(C_1, C_2\) are convex — used to combine independent constraint sets.
Note

Union of convex sets is generally not convex. This is why feasible regions defined by OR conditions (e.g., "x ≤ 0 OR x ≥ 1") are typically non-convex, and why such constraints make optimization problems combinatorially hard (mixed-integer programming).

Convex Hull and Extreme Points
Convex HullConstruction
\[\text{conv}(S) = \left\{\sum_{i=1}^k\lambda_i\mathbf{x}_i : \mathbf{x}_i\in S,\;\lambda_i\geq0,\;\sum_i\lambda_i=1\right\}\]
The convex hull of S is the smallest convex set containing S — the set of all convex combinations of points in S. Geometrically: stretch a rubber band around the points. Extreme points (vertices) of a polyhedron are points that cannot be written as a convex combination of other points in the set — these are the candidate optimal points for linear programs.
02
Functions

Convex Functions

// epigraph · definition · examples · convexity-preserving operations

A function is convex if its graph "curves upward" — every chord connecting two points on the graph lies above (or on) the graph. This single property guarantees that any local minimum is a global minimum.

Definition via Chords
Convex Function — DefinitionCore Definition
\[f: \mathbb{R}^n \to \mathbb{R} \text{ is convex if } \text{dom}(f) \text{ is convex and}\] \[f(\lambda\mathbf{x}+(1-\lambda)\mathbf{y}) \leq \lambda f(\mathbf{x}) + (1-\lambda)f(\mathbf{y}) \quad \forall \mathbf{x},\mathbf{y}\in\text{dom}(f),\;\lambda\in[0,1]\] \[\text{Strictly convex if strict inequality holds for } \mathbf{x}\neq\mathbf{y},\;\lambda\in(0,1)\] \[\text{Concave: } -f \text{ is convex} \quad\text{(reverse inequality)}\]
The value of f at the convex combination is at most the convex combination of the values — the function value at any point on the chord is below the chord itself. Strict convexity rules out flat segments; it guarantees a unique minimizer if one exists.
The Epigraph — Functions as Sets
EpigraphGeometric View
\[\text{epi}(f) = \{(\mathbf{x},t)\in\mathbb{R}^{n+1} : t\geq f(\mathbf{x})\}\] \[f \text{ is convex} \iff \text{epi}(f) \text{ is a convex set}\]
The epigraph is the region "above" the function's graph. This identity translates every fact about convex sets into a fact about convex functions, and vice versa — it's the bridge between §01 and §02. A function is convex exactly when the region above its graph is a convex set (no dents from above).
CONVEX f — epigraph shaded f(x) f(y) chord lies above graph NON-CONVEX — counterexample f(z) > chord! graph pokes above the chord
Fig 1. Left: a convex function — every chord lies above (or on) the graph; the epigraph (shaded) is a convex set. Right: a non-convex function — at point z, the function exceeds the chord, violating the definition.
Convexity-Preserving Operations
OperationRuleWhy It Matters
Non-negative weighted sum\(f = \sum w_if_i,\;w_i\geq0,\;f_i\) convexSum of losses + regularizers stays convex
Composition with affine map\(g(\mathbf{x}) = f(A\mathbf{x}+b)\) convex if \(f\) convexLinear models preserve convexity of the loss
Pointwise maximum\(f(\mathbf{x}) = \max_i f_i(\mathbf{x})\)Hinge loss, max-margin formulations
Partial minimization\(g(\mathbf{x}) = \min_\mathbf{y} f(\mathbf{x},\mathbf{y})\), \(f\) jointly convexUsed in proximal operators (§11)
Composition \(h(g(\mathbf{x}))\)\(h\) convex non-decreasing, \(g\) convex (or \(h\) convex non-increasing, \(g\) concave)\(e^{f(\mathbf{x})}\) convex if \(f\) convex
03
Inequality

Jensen's Inequality

// generalization to expectations · proof · applications across ML

Jensen's inequality generalizes the definition of convexity from two-point convex combinations to arbitrary probability distributions. It is one of the most widely-used inequalities across all of mathematics and machine learning.

Jensen's InequalityGeneralization
\[f \text{ convex} \;\Rightarrow\; f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]\] \[f \text{ concave} \;\Rightarrow\; f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]\]
The two-point definition is the special case where X takes value x with probability λ and y with probability 1−λ: f(λx+(1−λ)y) ≤ λf(x)+(1−λ)f(y) becomes f(E[X]) ≤ E[f(X)]. Jensen's inequality extends this to any random variable and any convex function.
Proof via Supporting Hyperplane
// Proof of Jensen's inequality using the supporting hyperplane (tangent line)
1
Since \(f\) is convex and differentiable, at the point \(\mu = \mathbb{E}[X]\), the tangent line is a global underestimator: \[f(x) \geq f(\mu) + \nabla f(\mu)^T(x-\mu) \quad \forall x\]
This is the first-order condition for convexity (§04) — the tangent line at any point lies entirely below the graph.
2
Substitute \(x = X\) (random variable) and take expectations of both sides: \[\mathbb{E}[f(X)] \geq \mathbb{E}[f(\mu) + \nabla f(\mu)^T(X-\mu)] = f(\mu) + \nabla f(\mu)^T(\mathbb{E}[X]-\mu)\]
3
Since \(\mu = \mathbb{E}[X]\): \(\mathbb{E}[X] - \mu = 0\), so the second term vanishes: \[\mathbb{E}[f(X)] \geq f(\mu) = f(\mathbb{E}[X]) \quad\blacksquare\]
Applications Throughout ML
  • KL divergence ≥ 0: Applying Jensen with \(f = -\log\) (convex) to \(\mathbb{E}_{p}[q(x)/p(x)]\) proves \(D_{\text{KL}}(p\|q) \geq 0\) — the Gibbs inequality.
  • ELBO (Evidence Lower Bound): \(\log p(x) = \log\mathbb{E}_q[p(x,z)/q(z)] \geq \mathbb{E}_q[\log(p(x,z)/q(z))]\) — Jensen applied to \(\log\) (concave) gives the variational lower bound used in VAEs.
  • Variance non-negativity: \(\mathbb{E}[X^2] \geq (\mathbb{E}[X])^2\) follows from Jensen with \(f(x)=x^2\) (convex) — hence \(\text{Var}(X) = \mathbb{E}[X^2]-(\mathbb{E}[X])^2 \geq 0\).
  • AM-GM inequality: Applying Jensen to \(f = -\log\) and uniform weights gives the arithmetic mean ≥ geometric mean inequality — fundamental to many ML proofs.
04
Calculus

First-Order Conditions

// global underestimator · gradient inequality · optimality for unconstrained problems

If \(f\) is differentiable, convexity has an equivalent characterization in terms of the gradient: the tangent hyperplane at any point lies entirely below (or on) the graph of \(f\). This is the property used in the proof of Jensen's inequality above.

First-Order Convexity ConditionEquivalent Definition
\[f \text{ (differentiable) is convex} \iff f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^T(\mathbf{y}-\mathbf{x}) \quad \forall \mathbf{x},\mathbf{y}\in\text{dom}(f)\]
Geometric meaning: the first-order Taylor approximation at any point x is a global underestimator of f. No matter where you linearize, the linear approximation never overshoots the true function value anywhere else.
Proof of Equivalence with the Chord Definition
// Showing chord definition ⟹ gradient inequality
1
From the chord definition with \(\mathbf{z} = \mathbf{x}+t(\mathbf{y}-\mathbf{x})\) for \(t\in(0,1)\): \[f(\mathbf{x}+t(\mathbf{y}-\mathbf{x})) \leq (1-t)f(\mathbf{x}) + tf(\mathbf{y})\]
2
Rearrange: \[\frac{f(\mathbf{x}+t(\mathbf{y}-\mathbf{x})) - f(\mathbf{x})}{t} \leq f(\mathbf{y}) - f(\mathbf{x})\]
3
Take \(t \to 0^+\): the left side becomes the directional derivative \(\nabla f(\mathbf{x})^T(\mathbf{y}-\mathbf{x})\): \[\nabla f(\mathbf{x})^T(\mathbf{y}-\mathbf{x}) \leq f(\mathbf{y}) - f(\mathbf{x})\] \[\therefore\quad f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^T(\mathbf{y}-\mathbf{x}) \quad\blacksquare\]
The First-Order Optimality Condition
First-Order Optimality — Unconstrained Convex Problems

For a convex differentiable \(f\), \(\mathbf{x}^*\) is a global minimizer if and only if \(\nabla f(\mathbf{x}^*) = \mathbf{0}\). Proof: by the gradient inequality, \(f(\mathbf{y}) \geq f(\mathbf{x}^*) + \nabla f(\mathbf{x}^*)^T(\mathbf{y}-\mathbf{x}^*) = f(\mathbf{x}^*)\) for all \(\mathbf{y}\), since the gradient term vanishes. This is the foundation of gradient descent: \(\nabla f = \mathbf{0}\) is both necessary AND sufficient for convex problems — unlike non-convex problems where it only identifies stationary points (which may be saddle points or local minima).

For Constrained Problems
First-Order Optimality — Constrained CaseGeneralization
\[\mathbf{x}^* = \arg\min_{\mathbf{x}\in C} f(\mathbf{x}) \iff \nabla f(\mathbf{x}^*)^T(\mathbf{y}-\mathbf{x}^*) \geq 0 \quad \forall \mathbf{y}\in C\]
At the optimum, the gradient cannot point into the feasible set — every feasible direction (y − x*) makes a non-negative angle with the gradient. If x* is in the interior of C, this reduces to ∇f(x*)=0. If x* is on the boundary, the gradient may be non-zero but must point "outward" — this is the geometric origin of KKT conditions (§09).
05
Curvature

Second-Order Conditions

// Hessian · positive semidefiniteness · curvature interpretation · examples

The second-order condition characterizes convexity through curvature: a twice-differentiable function is convex if and only if it curves "upward" in every direction, at every point — formalized via the Hessian matrix.

Second-Order Convexity ConditionHessian Test
\[f \text{ (twice differentiable) is convex} \iff \nabla^2 f(\mathbf{x}) \succeq 0 \quad \forall \mathbf{x}\in\text{dom}(f)\] \[\text{i.e.}\quad \mathbf{v}^T\nabla^2f(\mathbf{x})\mathbf{v} \geq 0 \quad \forall \mathbf{v}\in\mathbb{R}^n\]
∇²f(x) ⪰ 0 means the Hessian is positive semidefinite — all its eigenvalues are ≥ 0. Strictly convex (sufficient, not necessary): ∇²f(x) ≻ 0 (positive definite, eigenvalues > 0). The Hessian eigenvalues measure curvature in each eigenvector direction — convexity requires non-negative curvature everywhere.
Verifying Convexity for Common ML Losses
// Hessian computations for standard loss functions
MSE
\(f(\mathbf{w}) = \|\mathbf{Xw}-\mathbf{y}\|^2\). Gradient: \(\nabla f = 2\mathbf{X}^T(\mathbf{Xw}-\mathbf{y})\). Hessian: \[\nabla^2 f(\mathbf{w}) = 2\mathbf{X}^T\mathbf{X}\]
X^T X is always PSD: v^T(X^TX)v = ‖Xv‖² ≥ 0 for all v. Therefore MSE is always convex — regardless of the data. Strictly convex iff X has full column rank (no collinear features).
Logistic Regression
\(f(\mathbf{w}) = -\sum_i[y_i\log\sigma(\mathbf{w}^T\mathbf{x}_i) + (1-y_i)\log(1-\sigma(\mathbf{w}^T\mathbf{x}_i))]\). Hessian: \[\nabla^2 f(\mathbf{w}) = \sum_i \sigma_i(1-\sigma_i)\mathbf{x}_i\mathbf{x}_i^T = \mathbf{X}^T\mathbf{D}\mathbf{X}, \quad D_{ii}=\sigma_i(1-\sigma_i)>0\]
D has positive diagonal entries (σ(1-σ) > 0 always), so X^T D X is PSD by the same argument as MSE. Logistic regression's cross-entropy loss is always convex — this is precisely why it has no local minima other than the global one.
Neural Network (counter-example)
\(f(\mathbf{w}) = \frac{1}{2}(\sigma(w_2\sigma(w_1x))-y)^2\) for a 2-layer network. The composition of nonlinear activations with weight products makes the Hessian indefinite (mixed positive and negative eigenvalues) almost everywhere.
This is THE fundamental reason deep learning is non-convex: products of weights w₁w₂ create saddle points and multiple local minima. Convexity theory still informs us (Lipschitz smoothness, local convergence rates) but global optimality guarantees are lost.
Curvature Interpretation
Insight

The Hessian eigenvalues directly determine the shape of level sets near a minimum: large eigenvalue → steep curvature → narrow valley in that direction. Small eigenvalue → shallow curvature → wide, flat valley. The condition number \(\kappa = \lambda_{\max}/\lambda_{\min}\) of the Hessian determines how elongated the level sets are — and directly controls the convergence rate of gradient descent (covered in the Gradient Descent masterclass).

06
Strength

Strong Convexity

// quantified curvature · unique minimizer · linear convergence rates

Plain convexity guarantees a global minimum exists (if the problem is bounded) but says nothing about how fast we can find it, or whether it's unique. Strong convexity adds a quantitative lower bound on curvature — and this single number controls convergence rates.

Strong Convexity — DefinitionQuantified Curvature
\[f \text{ is } \mu\text{-strongly convex} \iff f(\mathbf{x}) - \frac{\mu}{2}\|\mathbf{x}\|^2 \text{ is convex}, \quad \mu > 0\] \[\text{Equivalent gradient inequality:}\quad f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^T(\mathbf{y}-\mathbf{x}) + \frac{\mu}{2}\|\mathbf{y}-\mathbf{x}\|^2\] \[\text{Equivalent Hessian condition:}\quad \nabla^2 f(\mathbf{x}) \succeq \mu \mathbf{I} \quad \forall \mathbf{x}\]
Strong convexity means: the function lies above a quadratic lower bound everywhere (not just a linear one as in plain convexity). μ measures the minimum curvature — every eigenvalue of the Hessian is at least μ. Plain convexity is the limiting case μ=0.
Consequences of Strong Convexity
  • Unique minimizer: Strong convexity implies strict convexity, which implies (if a minimizer exists) it is unique. A strongly convex function cannot have flat regions or multiple minima.
  • Quadratic growth: \(f(\mathbf{x}) \geq f(\mathbf{x}^*) + \frac{\mu}{2}\|\mathbf{x}-\mathbf{x}^*\|^2\) — the suboptimality gap grows at least quadratically with distance from the optimum. This directly bounds how far from optimal you are given the function value.
  • Linear convergence: Gradient descent on a \(\mu\)-strongly convex, \(L\)-smooth function converges linearly: \(f(\mathbf{x}_k) - f(\mathbf{x}^*) \leq (1-\mu/L)^k(f(\mathbf{x}_0)-f(\mathbf{x}^*))\). Without strong convexity, only sublinear rates \(O(1/k)\) are guaranteed.
  • Polyak-Łojasiewicz (PL) condition: A weaker condition than strong convexity that still guarantees linear convergence: \(\frac{1}{2}\|\nabla f(\mathbf{x})\|^2 \geq \mu(f(\mathbf{x})-f^*)\). Many non-convex deep learning loss landscapes satisfy PL locally — explaining why SGD converges quickly despite non-convexity.
Practical Example: Ridge Regression
Ridge Regression — Guaranteed Strong ConvexityPractical
\[f(\mathbf{w}) = \|\mathbf{Xw}-\mathbf{y}\|^2 + \lambda\|\mathbf{w}\|^2\] \[\nabla^2 f(\mathbf{w}) = 2\mathbf{X}^T\mathbf{X} + 2\lambda\mathbf{I} \succeq 2\lambda\mathbf{I} \quad\Rightarrow\quad \mu = 2\lambda\]
Even if X^TX is singular (e.g., d > n, more features than samples), adding λ‖w‖² guarantees μ = 2λ > 0 strong convexity. This is one of the deepest reasons L2 regularization helps: it converts a merely convex (possibly flat or unbounded) problem into a strongly convex one with a unique solution and linear convergence guarantees.
07
Smoothness

Lipschitz Continuity

// bounded rate of change · L-smoothness · the descent lemma · step size selection

Lipschitz continuity bounds how fast a function (or its gradient) can change. Combined with strong convexity, it forms the pair of constants (\(\mu, L\)) that completely characterizes the convergence behavior of first-order optimization methods.

Lipschitz Continuity of the Function
Lipschitz Continuous FunctionBounded Change
\[|f(\mathbf{x}) - f(\mathbf{y})| \leq G\|\mathbf{x}-\mathbf{y}\| \quad \forall \mathbf{x},\mathbf{y}\]
G bounds the maximum rate of change of f. Equivalent to: ‖∇f(x)‖ ≤ G everywhere (if differentiable). Used in subgradient methods (§10) where the convergence rate depends on G.
L-Smoothness (Lipschitz Gradient)
L-Smooth FunctionLipschitz Gradient
\[\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| \leq L\|\mathbf{x}-\mathbf{y}\| \quad \forall \mathbf{x},\mathbf{y}\] \[\text{Equivalent (if twice differentiable): }\quad \nabla^2 f(\mathbf{x}) \preceq L\mathbf{I} \quad \forall \mathbf{x}\]
L bounds the maximum curvature — the largest eigenvalue of the Hessian is at most L. Combined with §06 (μ ≤ ∇²f ≤ L), the function's Hessian eigenvalues all lie in [μ, L]. The ratio κ = L/μ is the condition number that governs gradient descent's convergence rate.
The Descent Lemma
// Deriving the descent lemma — the key tool for step size selection
1
By the fundamental theorem of calculus and L-smoothness, for any \(\mathbf{x},\mathbf{y}\): \[f(\mathbf{y}) = f(\mathbf{x}) + \int_0^1\nabla f(\mathbf{x}+t(\mathbf{y}-\mathbf{x}))^T(\mathbf{y}-\mathbf{x})\,dt\]
2
Add and subtract \(\nabla f(\mathbf{x})^T(\mathbf{y}-\mathbf{x})\), then bound the difference using L-smoothness and Cauchy-Schwarz: \[f(\mathbf{y}) \leq f(\mathbf{x}) + \nabla f(\mathbf{x})^T(\mathbf{y}-\mathbf{x}) + \frac{L}{2}\|\mathbf{y}-\mathbf{x}\|^2\]
3
Apply to gradient descent: set \(\mathbf{y}=\mathbf{x}-\alpha\nabla f(\mathbf{x})\): \[f(\mathbf{x}-\alpha\nabla f(\mathbf{x})) \leq f(\mathbf{x}) - \alpha\left(1-\frac{L\alpha}{2}\right)\|\nabla f(\mathbf{x})\|^2\]
For the right-hand side to guarantee decrease, we need (1 − Lα/2) > 0, i.e. α < 2/L. The optimal fixed step size is α = 1/L — this directly proves why the famous "step size ≤ 1/L" rule for gradient descent guarantees monotonic decrease. ∎
The Descent Lemma — Foundation of Step Size Selection

\[f(\mathbf{x}-\tfrac{1}{L}\nabla f(\mathbf{x})) \leq f(\mathbf{x}) - \frac{1}{2L}\|\nabla f(\mathbf{x})\|^2\] Every step of gradient descent with \(\alpha = 1/L\) decreases the function value by at least \(\frac{1}{2L}\|\nabla f\|^2\). This is the inequality that drives all convergence rate proofs for gradient descent — \(O(1/k)\) for convex, linear for strongly convex \((\mu>0)\).

08
Duality

Duality Theory

// the Lagrangian dual · weak duality · strong duality · Slater's condition · geometric interpretation

Every constrained optimization problem (the "primal") has an associated "dual" problem. The dual provides a lower bound on the primal's optimal value — and under convexity, this bound becomes tight, transforming the original problem into an equivalent, often easier one.

The Lagrangian and Dual Function
Primal Problem and LagrangianSetup
\[\text{Primal: } \quad p^* = \min_\mathbf{x}\; f(\mathbf{x}) \quad \text{s.t.}\quad g_i(\mathbf{x})\leq0\;(i=1..m),\;\; h_j(\mathbf{x})=0\;(j=1..p)\] \[\mathcal{L}(\mathbf{x},\boldsymbol{\lambda},\boldsymbol{\nu}) = f(\mathbf{x}) + \sum_{i=1}^m\lambda_ig_i(\mathbf{x}) + \sum_{j=1}^p\nu_jh_j(\mathbf{x}), \quad \lambda_i\geq0\] \[\text{Dual function: }\quad d(\boldsymbol{\lambda},\boldsymbol{\nu}) = \inf_\mathbf{x}\mathcal{L}(\mathbf{x},\boldsymbol{\lambda},\boldsymbol{\nu})\] \[\text{Dual problem: }\quad d^* = \max_{\boldsymbol{\lambda}\geq0,\boldsymbol{\nu}}\; d(\boldsymbol{\lambda},\boldsymbol{\nu})\]
The dual function d(λ,ν) is always concave (it's an infimum of affine functions of (λ,ν)) — even if the primal is non-convex! This means the dual problem is always a concave maximization, often much easier to solve.
Weak Duality — Always True
Weak DualityUniversal
\[d^* \leq p^* \quad\text{(always — even for non-convex problems)}\]
Proof sketch: for feasible x and any λ≥0, ν: L(x,λ,ν) = f(x) + Σλᵢgᵢ(x) + Σνⱼhⱼ(x) ≤ f(x) since gᵢ(x)≤0, λᵢ≥0, hⱼ(x)=0. So d(λ,ν) = inf_x L(x,λ,ν) ≤ L(x*,λ,ν) ≤ f(x*) = p*. The gap p* − d* ≥ 0 is called the duality gap.
Strong Duality — Under Convexity + Slater
Strong Duality (Slater's Condition)Convex Problems
\[\text{If } f,g_i \text{ convex}, h_j \text{ affine, and } \exists\,\mathbf{x} \text{ strictly feasible } (g_i(\mathbf{x})<0\;\forall i):\] \[d^* = p^* \quad\text{(zero duality gap)}\]
Slater's condition: there exists a strictly feasible point (inequality constraints satisfied with strict inequality). This is a constraint qualification — it rules out degenerate geometries where the constraint boundary "kisses" the objective in a pathological way. Under Slater's condition, solving the dual problem is EQUIVALENT to solving the primal — and the dual is often much easier (e.g., SVM dual is a quadratic program in fewer variables).
Geometric Interpretation
g(x) — constraint value f(x) — objective g(x)=0 feasible (g≤0) 𝒢 = {(g(x),f(x)): x} p* (primal optimum) slope = −λ* d* (dual optimum)
Fig 2. Geometric duality: the set 𝒢 of achievable (constraint, objective) pairs. p* is the minimum f over the feasible half g≤0. The dual finds the best supporting hyperplane (slope −λ*) — under strong duality, this hyperplane touches exactly at p*, so d*=p*.
09
Optimality

KKT Conditions — Revisited

// from Lagrange multipliers to KKT · complementary slackness · sufficiency under convexity

The Lagrange Multipliers masterclass introduced KKT as necessary conditions for constrained optimality. Here we close the loop: under convexity, KKT conditions become both necessary and sufficient — solving the KKT system solves the original problem completely.

The KKT Conditions — Complete StatementOptimality System
\[\text{(1) Primal feasibility:}\quad g_i(\mathbf{x}^*)\leq0,\quad h_j(\mathbf{x}^*)=0\] \[\text{(2) Dual feasibility:}\quad \lambda_i^*\geq0\] \[\text{(3) Complementary slackness:}\quad \lambda_i^*g_i(\mathbf{x}^*)=0 \quad\forall i\] \[\text{(4) Stationarity:}\quad \nabla f(\mathbf{x}^*) + \sum_i\lambda_i^*\nabla g_i(\mathbf{x}^*) + \sum_j\nu_j^*\nabla h_j(\mathbf{x}^*) = \mathbf{0}\]
Complementary slackness is the bridge to economic/geometric interpretation: either a constraint is "active" (gᵢ(x*)=0, the boundary binds, λᵢ* can be positive) OR it's "inactive" (gᵢ(x*)<0, doesn't bind, λᵢ*=0 necessarily). You can never have both a strictly-satisfied constraint AND a positive multiplier.
KKT as Necessary AND Sufficient — The Convex Case
KKT Sufficiency Theorem

If \(f, g_i\) are convex and \(h_j\) affine, then any point \((\mathbf{x}^*,\boldsymbol{\lambda}^*,\boldsymbol{\nu}^*)\) satisfying the KKT conditions is a global optimum of the primal AND dual — with zero duality gap. This is the precise statement of "the bridge": for general (non-convex) problems, KKT gives candidate stationary points that might be optima — checking second-order conditions and comparing values is still needed. For convex problems, KKT is the complete answer — find a KKT point and you're done, no further checking required.

Worked Example — SVM Dual Derivation
// SVM: from primal QP to dual via KKT — the canonical example
Primal
\[\min_{\mathbf{w},b}\;\frac{1}{2}\|\mathbf{w}\|^2 \quad\text{s.t.}\quad y_i(\mathbf{w}^T\mathbf{x}_i+b)\geq1\;\;\forall i\]
Lagrangian
\[\mathcal{L}(\mathbf{w},b,\boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_i\alpha_i[y_i(\mathbf{w}^T\mathbf{x}_i+b)-1], \quad \alpha_i\geq0\]
Stationarity
\[\nabla_\mathbf{w}\mathcal{L}=\mathbf{w}-\sum_i\alpha_iy_i\mathbf{x}_i=0 \Rightarrow \mathbf{w}=\sum_i\alpha_iy_i\mathbf{x}_i\] \[\nabla_b\mathcal{L}=-\sum_i\alpha_iy_i=0 \Rightarrow \sum_i\alpha_iy_i=0\]
Dual
Substitute back into \(\mathcal{L}\): \[\max_{\boldsymbol{\alpha}\geq0}\; \sum_i\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_jy_iy_j\mathbf{x}_i^T\mathbf{x}_j \quad\text{s.t.}\quad \sum_i\alpha_iy_i=0\]
Complementary slackness: αᵢ[yᵢ(w^Tx_i+b)−1]=0 — αᵢ>0 only for points exactly on the margin (support vectors!). This is why most αᵢ are zero — the solution is "sparse" in the dual variables, and the kernel trick (x_i^T x_j → k(x_i,x_j)) is only possible because the dual depends on the data only through inner products.
10
Non-Smooth

Subgradients

// generalizing the gradient · subgradient set · non-differentiable optimization · subgradient descent

Many important functions in ML — \(\ell_1\) norm, hinge loss, ReLU at zero — are convex but not differentiable everywhere. Subgradients generalize the gradient to handle these "kinks," preserving the gradient inequality from §04.

Subgradient — DefinitionGeneralized Derivative
\[\mathbf{g} \text{ is a subgradient of } f \text{ at } \mathbf{x} \iff f(\mathbf{y}) \geq f(\mathbf{x}) + \mathbf{g}^T(\mathbf{y}-\mathbf{x}) \quad \forall\mathbf{y}\] \[\partial f(\mathbf{x}) = \{\mathbf{g} : \mathbf{g} \text{ is a subgradient of } f \text{ at } \mathbf{x}\} \quad\text{(subdifferential — a SET)}\]
At points where f is differentiable: ∂f(x) = {∇f(x)} — a single element. At "kinks" (non-differentiable points), ∂f(x) is a whole set of valid slopes — any line through (x,f(x)) that stays below the graph everywhere. Optimality condition generalizes elegantly: x* is a global min iff 0 ∈ ∂f(x*).
Subdifferentials of Common Functions
FunctionSubdifferentialNotes
\(f(x)=|x|\)\(\partial f(0) = [-1,1]\)At x=0: any slope between −1 and 1 stays below |x|
\(f(x)=\max(0,x)\) (ReLU)\(\partial f(0)=[0,1]\)Convention: PyTorch uses 0 (subgradient choice)
\(f(\mathbf{x})=\|\mathbf{x}\|_1\)\(\partial f(\mathbf{x})_i = \text{sign}(x_i)\) if \(x_i\neq0\), else \([-1,1]\)Used in LASSO optimality conditions
\(f(\mathbf{x})=\max(f_1(\mathbf{x}),f_2(\mathbf{x}))\)\(\partial f = \text{conv}(\partial f_1 \cup \partial f_2)\) at tie pointsHinge loss, max-margin formulations
Subgradient Descent
Subgradient MethodUpdate Rule
\[\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k\mathbf{g}_k, \quad \mathbf{g}_k \in \partial f(\mathbf{x}_k)\] \[\text{Convergence: } O(1/\sqrt{k}) \text{ with } \alpha_k = O(1/\sqrt{k}) \quad\text{(much slower than smooth case!)}\]
Unlike gradient descent on smooth functions (O(1/k) or linear), subgradient descent is NOT a descent method — f(x_{k+1}) can be larger than f(x_k) at any single step. Only the BEST iterate over the trajectory is guaranteed to converge, and the rate O(1/√k) is fundamentally slower. This motivates proximal methods (§11) which handle non-smoothness more gracefully.
11
Splitting

Proximal Methods

// proximal operator · proximal gradient · ISTA/FISTA · LASSO soft-thresholding

Many ML problems have an objective \(f(\mathbf{x}) = g(\mathbf{x}) + h(\mathbf{x})\) where \(g\) is smooth (e.g., squared loss) but \(h\) is non-smooth (e.g., \(\ell_1\) penalty for sparsity). Proximal methods exploit this structure to handle the non-smooth part exactly while gradient-stepping on the smooth part — achieving the same convergence rates as gradient descent on purely smooth problems.

The Proximal OperatorCore Definition
\[\text{prox}_{\alpha h}(\mathbf{v}) = \arg\min_{\mathbf{x}}\left\{h(\mathbf{x}) + \frac{1}{2\alpha}\|\mathbf{x}-\mathbf{v}\|^2\right\}\]
The proximal operator finds a point close to \(\mathbf{v}\) that also has small \(h\). It's a generalized projection: if \(h\) is the indicator function of a convex set \(C\) (0 inside, ∞ outside), \(\text{prox}_h(\mathbf{v})\) is the orthogonal projection of \(\mathbf{v}\) onto \(C\). For \(h=0\), \(\text{prox}(\mathbf{v})=\mathbf{v}\) (identity).
Proximal Gradient Method (ISTA)

To understand proximal gradient, think in terms of majorization-minimization (MM): we construct a quadratic upper bound (majorizer) of \(f\) at \(\mathbf{x}_k\) and minimize that instead.

// Majorization-Minimization for Proximal Gradient
1
Since \(g\) is \(L\)-smooth, the descent lemma gives: \[g(\mathbf{y}) \leq g(\mathbf{x}_k) + \nabla g(\mathbf{x}_k)^T(\mathbf{y}-\mathbf{x}_k) + \frac{L}{2}\|\mathbf{y}-\mathbf{x}_k\|^2\]
This is a quadratic upper bound on \(g\).
2
Adding \(h(\mathbf{y})\) to both sides gives a majorizer of \(f = g + h\): \[f(\mathbf{y}) \leq \underbrace{g(\mathbf{x}_k) + \nabla g(\mathbf{x}_k)^T(\mathbf{y}-\mathbf{x}_k) + \frac{L}{2}\|\mathbf{y}-\mathbf{x}_k\|^2 + h(\mathbf{y})}_{Q(\mathbf{y}, \mathbf{x}_k)}\]
3
Dropping the constant term \(g(\mathbf{x}_k)\) and completing the square: \[Q(\mathbf{y}, \mathbf{x}_k) \propto h(\mathbf{y}) + \frac{L}{2}\left\|\mathbf{y} - \left(\mathbf{x}_k - \frac{1}{L}\nabla g(\mathbf{x}_k)\right)\right\|^2\]
4
Minimizing \(Q\) is exactly evaluating the proximal operator with \(\alpha=1/L\): \[\mathbf{x}_{k+1} = \text{prox}_{(1/L) h}\left(\mathbf{x}_k - \frac{1}{L}\nabla g(\mathbf{x}_k)\right)\]
Proximal Gradient Algorithm (ISTA)Splitting Method
\[\mathbf{x}_{k+1} = \text{prox}_{\alpha h}\!\left(\mathbf{x}_k - \alpha\nabla g(\mathbf{x}_k)\right)\]
Two steps: (1) gradient step on the smooth part \(g\), (2) proximal step on the non-smooth part \(h\). Crucially, if \(g\) is \(L\)-smooth, ISTA converges at the same rate as gradient descent on smooth functions: \(O(1/k)\). For strongly convex \(g\), ISTA converges linearly.
ISTA

ISTA = Iterative Shrinkage-Thresholding Algorithm. The "shrinkage" refers to the soft-thresholding step for \(\ell_1\) regularization.

Accelerated Proximal Gradient (FISTA)

Just as Nesterov acceleration improves gradient descent from \(O(1/k)\) to \(O(1/k^2)\), we can accelerate ISTA to get FISTA — the optimal first-order method for composite convex optimization.

FISTA — Accelerated Proximal GradientOptimal Rate
\[ \begin{cases} \mathbf{y}_k = \mathbf{x}_k + \frac{t_{k-1} - 1}{t_k}(\mathbf{x}_k - \mathbf{x}_{k-1}) \\ \mathbf{x}_{k+1} = \text{prox}_{\alpha h}\!\left(\mathbf{y}_k - \alpha\nabla g(\mathbf{y}_k)\right) \\ t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2} \end{cases} \]
FISTA uses a momentum-like extrapolation step with the sequence \(t_k\). It achieves the optimal \(O(1/k^2)\) convergence rate — a factor of \(k\) faster than ISTA — with no increase in per-iteration cost.
Soft-Thresholding — Deriving the \(\ell_1\) Proximal Operator

Let's prove that the proximal operator for \(h(\mathbf{x}) = \lambda\|\mathbf{x}\|_1\) is indeed the soft-thresholding operator. Since \(\ell_1\) is separable, we can solve for each coordinate independently.

// Deriving Soft-Thresholding
1
The 1D proximal problem for coordinate \(i\): \[\min_{z} \left\{\lambda |z| + \frac{1}{2\alpha}(z - v_i)^2\right\}\]
2
Case 1: \(z > 0\). Then \(|z| = z\), and the derivative is \(\lambda + \frac{1}{\alpha}(z - v_i) = 0\). Solving: \[z = v_i - \alpha\lambda\]
Valid only if \(v_i - \alpha\lambda > 0 \implies v_i > \alpha\lambda\).
3
Case 2: \(z < 0\). Then \(|z| = -z\), and the derivative is \(-\lambda + \frac{1}{\alpha}(z - v_i) = 0\). Solving: \[z = v_i + \alpha\lambda\]
Valid only if \(v_i + \alpha\lambda < 0 \implies v_i < -\alpha\lambda\).
4
Case 3: \(z = 0\). The subgradient condition for optimality is \(0 \in \partial\left(\lambda|z| + \frac{1}{2\alpha}(z-v_i)^2\right)\). At \(z=0\): \[0 \in [-\alpha\lambda, \alpha\lambda] - v_i \implies v_i \in [-\alpha\lambda, \alpha\lambda]\]
Soft-Thresholding OperatorL1 Prox
\[\text{prox}_{\lambda \|\cdot\|_1}(\mathbf{v})_i = S_\lambda(v_i) = \text{sign}(v_i)\max(0, |v_i|-\lambda)\]
This is the most famous proximal operator. It pulls each coordinate toward zero by \(\lambda\), setting it to zero if its magnitude is less than \(\lambda\). This is the exact reason \(\ell_1\) regularization induces sparsity — the proximal step literally zeros out small components.
Concrete Example: LASSO via ISTA/FISTA

The LASSO problem is: \[\min_{\mathbf{w}} \frac{1}{2}\|\mathbf{Xw} - \mathbf{y}\|^2 + \lambda\|\mathbf{w}\|_1\] Here, \(g(\mathbf{w}) = \frac{1}{2}\|\mathbf{Xw} - \mathbf{y}\|^2\) (smooth, \(L = \sigma_{\text{max}}(\mathbf{X}^T\mathbf{X})\)) and \(h(\mathbf{w}) = \lambda\|\mathbf{w}\|_1\) (non-smooth).

PSEUDOCODE
# ISTA for LASSO
function ISTA_LASSO(X, y, λ, max_iter):
  L = spectral_norm(X@X.T) # Lipschitz constant of ∇g
  α = 1 / L
  w = zeros(p) # p = number of features
  for k in 1 to max_iter:
    gradient = X.T @ (X @ w - y)
    w = soft_threshold(w - α * gradient, α * λ)
  return w

function soft_threshold(v, λ):
  return sign(v) * max(0, abs(v) - λ)
12
Coordinate

Coordinate Descent

// alternating optimization · Gauss-Seidel updates · efficiency for big data

Coordinate descent (CD) simplifies high-dimensional optimization by updating one variable (or block) at a time while keeping others fixed. While not always faster than gradient methods, it is exceptionally efficient for problems with specific structural separability — especially when the 1D subproblems have closed-form solutions.

The Update Rule — Gauss-Seidel vs Randomized
Coordinate UpdateAlgorithm
\[x_i^{(k+1)} = \arg\min_{z} f(x_1^{(k+1)}, \dots, x_{i-1}^{(k+1)}, z, x_{i+1}^{(k)}, \dots, x_n^{(k)})\]
Gauss-Seidel CD: Cycle through coordinates in a fixed order, using the most recent values of previously updated coordinates.
Randomized CD: Select coordinates uniformly at random (or with priority) — often faster in practice and easier to analyze theoretically.
Concrete Example: Coordinate Descent for LASSO

Let's derive the coordinate update for LASSO, which is the workhorse of the `glmnet` package. The LASSO problem is: \[\min_{\mathbf{w}} \frac{1}{2}\|\mathbf{y} - \mathbf{Xw}\|^2 + \lambda\|\mathbf{w}\|_1\] Let \(x_j\) be the j-th column of \(\mathbf{X}\), and let \(r = \mathbf{y} - \sum_{k \neq j} \mathbf{X}_k w_k\) be the residual excluding the j-th coordinate.

// LASSO Coordinate Update Derivation
1
The objective as a function of \(w_j\) alone is: \[\min_{w_j} \frac{1}{2}\|r - x_j w_j\|^2 + \lambda|w_j|\]
2
Let \(s_j = x_j^T r\) (the partial residual) and \(d_j = x_j^T x_j\) (the squared norm of the j-th feature). Then: \[\min_{w_j} \frac{1}{2}d_j w_j^2 - s_j w_j + \lambda|w_j|\]
3
This is exactly the 1D proximal problem for \(\ell_1\)! The solution is the soft-thresholding operator: \[w_j^{(k+1)} = S_{\lambda/d_j}\left(\frac{s_j}{d_j}\right) = \text{sign}\left(\frac{s_j}{d_j}\right)\max\left(0, \left|\frac{s_j}{d_j}\right| - \frac{\lambda}{d_j}\right)\]
PSEUDOCODE
# Coordinate Descent for LASSO (Gauss-Seidel)
function LASSO_CD(X, y, λ, max_iter):
  n, p = X.shape
  w = zeros(p)
  d = sum(X**2, axis=0) # precompute squared norms
  for iter in 1 to max_iter:
    for j in 1 to p:
      # partial residual without j-th feature
      r = y - X @ w + X[:, j] * w[j]
      s_j = X[:, j].T @ r
      # soft-thresholding update
      w[j] = soft_threshold(s_j / d[j], λ / d[j])
  return w
Sparsity

Coordinate descent for LASSO naturally maintains sparsity: once a coordinate is set to zero, it often stays zero for many iterations. This makes it extremely fast for high-dimensional problems where most features are irrelevant.

Why and When it Works
  • Separability: Works best when the objective is "smooth + separable": \(f(\mathbf{x}) = g(\mathbf{x}) + \sum h_i(x_i)\).
  • Non-smooth catch: CD can get stuck at "corners" for general non-smooth functions. It requires the non-smooth part to be coordinate-wise separable (like \(\ell_1\)).
  • Big Data: For very high-dimensional sparse problems (like text classification with millions of features), CD is often the state-of-the-art solver (used in `glmnet` and `liblinear`).
  • Line search free: Unlike gradient descent, coordinate descent often requires no step-size tuning — the 1D optimizations are exact.
13
Barriers

Interior Point Methods

// logarithmic barrier · central path · Newton's method for constrained problems

Interior point methods (IPMs) solve constrained problems by approximating them with a sequence of unconstrained problems. They "smooth out" the constraints using a barrier function that goes to infinity as you approach the boundary — ensuring we always stay strictly feasible.

The Log-Barrier and Central Path
Logarithmic Barrier FunctionBarrier
\[\phi(\mathbf{x}) = -\sum_{i=1}^m \log(-g_i(\mathbf{x}))\] \[\min_\mathbf{x}\; t f(\mathbf{x}) + \phi(\mathbf{x})\]
As \(g_i(\mathbf{x}) \to 0^-\), \(-\log(-g_i(\mathbf{x})) \to \infty\). The parameter \(t > 0\) controls the barrier strength: small \(t\) emphasizes the barrier (staying away from boundaries), large \(t\) emphasizes the original objective.

For each \(t\), let \(\mathbf{x}^*(t)\) denote the minimizer of the barrier problem. The set \(\{\mathbf{x}^*(t) \mid t > 0\}\) is called the central path. As \(t \to \infty\), \(\mathbf{x}^*(t) \to \mathbf{x}^*\) — the true solution of the original constrained problem.

The Interior Point Algorithm

The standard path-following IPM alternates between two steps:

  1. Newton step: Solve the current barrier problem with a Newton step.
  2. Update t: Increase \(t\) by a factor (e.g., \(t \leftarrow 1.5t\)).

PSEUDOCODE
# Primal Interior Point Method (sketch)
function IPM(f, g, x0):
  t = 1
  x = x0 # must be strictly feasible: g_i(x) < 0
  for k in 1 to max_iter:
    # 1. Newton step on t*f + phi
    grad = t * ∇f(x) + ∇φ(x)
    hess = t * ∇²f(x) + ∇²φ(x)
    dx = solve(hess, -grad)
    # 2. Line search to stay feasible
    x = x + alpha * dx
    # 3. Increase barrier parameter
    t = t * 1.2
  return x
Why use IPMs?
  • Polynomial Complexity: The path-following IPM is theoretically polynomial-time (O(m³√log(1/ε))) for linear programming — a landmark result in optimization.
  • High Precision: Unlike first-order methods (gradient descent), IPMs use second-order info (Newton steps) and can reach machine precision in very few iterations (usually 10-50).
  • Small-Medium Scale: Because they require computing and inverting the Hessian, they are best for problems with a few thousand variables, not millions.
  • Reliability: IPMs are the engine behind commercial solvers like Gurobi and Mosek for LPs, QPs, and SOCPs.
14
Guarantees

Convergence Guarantees

// complexity classes · O(1/k) vs O(1/k²) vs linear · optimal rates

Convergence analysis tells us how many iterations \(k\) are needed to reach an \(\epsilon\)-optimal solution: \(f(\mathbf{x}_k) - f(\mathbf{x}^*) \leq \epsilon\). This hierarchy of rates is the "speed limit" of optimization — and convexity is what lets us prove these guarantees.

Convergence Rate Hierarchy
AssumptionMethodRateIterations for \(\epsilon\)
Convex, G-LipschitzSubgradient\(O(1/\sqrt{k})\)\(O(1/\epsilon^2)\)
Convex, L-smoothGradient Descent\(O(1/k)\)\(O(1/\epsilon)\)
Convex, L-smoothAccelerated (Nesterov)\(O(1/k^2)\)\(O(1/\sqrt{\epsilon})\)
Strongly Convex, L-smoothGradient Descent\(O(e^{-k/\kappa})\)\(O(\kappa \log(1/\epsilon))\)
Convergence Proof Sketch — Smooth Convex Case

Let's see why gradient descent converges at \(O(1/k)\) for L-smooth convex functions. The key tool is the descent lemma from Section 7.

// Gradient Descent Convergence (Smooth Convex)
1
From the descent lemma with \(\alpha = 1/L\): \[f(\mathbf{x}_{k+1}) \leq f(\mathbf{x}_k) - \frac{1}{2L}\|\nabla f(\mathbf{x}_k)\|^2\]
2
From convexity (gradient inequality): \[f(\mathbf{x}^*) \geq f(\mathbf{x}_k) + \nabla f(\mathbf{x}_k)^T(\mathbf{x}^* - \mathbf{x}_k)\]
3
Let \(r_k = \|\mathbf{x}_k - \mathbf{x}^*\|\). With some algebra (see textbooks), we can show: \[f(\mathbf{x}_k) - f(\mathbf{x}^*) \leq \frac{2L r_0^2}{k}\]
Key Concepts
  • Sublinear Convergence: Rates like \(O(1/k)\) or \(O(1/k^2)\) — the error decreases, but the number of iterations needed for each new decimal of precision grows.
  • Linear Convergence: Error decreases geometrically (\(O(e^{-k/\kappa})\)) — each iteration roughly halves the error (for good condition numbers). This is much faster!
  • Condition Number (\(\kappa = L/\mu\)): The ratio of smoothness to strong convexity. Large \(\kappa\) means slow linear convergence (ill-conditioned problems, like narrow valleys).
  • Optimal Rates: Nesterov's accelerated gradient achieves the best possible rate for first-order methods on smooth convex problems — you can't do better than \(O(1/k^2)\) without using second-order information.
Linear

"Linear convergence" (counter-intuitively) means the error decreases geometrically, like \(0.9^k\). On a log-scale plot, this looks like a straight line. It is much faster than "sublinear" rates like \(1/k\). Strong convexity is the magic ingredient that turns sublinear convergence into linear convergence.

15
ML Context

Optimization in ML

// empirical risk minimization · regularizers · why convexity is the baseline

In Machine Learning, optimization is the engine of learning. The central problem is Empirical Risk Minimization (ERM): \[\min_\mathbf{w}\; \underbrace{\frac{1}{n}\sum_{i=1}^n L(y_i, f(\mathbf{w}, \mathbf{x}_i))}_{\text{Empirical Risk}} + \underbrace{\lambda \Omega(\mathbf{w})}_{\text{Regularizer}}\]

Convexity in Classical ML Models

All your favorite classical ML models are convex — and that's why they're reliable!

Linear Regression

Least squares is a quadratic program with a unique global minimum (if \(X^T X\) is positive definite). Even with L2 regularization (Ridge), it's still convex and strongly convex.

Logistic Regression

The log-loss is convex (negative log-likelihood for a Bernoulli model). The gradient is Lipschitz continuous, so gradient descent converges at \(O(1/k)\).

SVM (Hinge Loss)

The hinge loss is convex, and the dual problem is a QP. The KKT conditions tell us that only support vectors matter — a beautiful result of convex duality.

LASSO

L1 regularization makes the objective non-differentiable but still convex. Proximal methods or coordinate descent work perfectly here, and the solution is sparse!

Why Convexity Matters in ML
  • No Local Minima: Every local minimum is a global minimum — no need to worry about getting stuck!
  • Worst-Case Guarantees: We know exactly how fast our algorithms will converge.
  • Interpretability: Duality gives insight into constraints and support vectors (SVMs, LASSO).
  • Hyperparameter Robustness: Convex solvers are less sensitive to step sizes and initializations (though still important!).
Convex vs Non-Convex in ML

The Convex Era (Classical ML)

Models like Linear Regression, SVMs, Logistic Regression, and Lasso are convex. We have global guarantees. We can trust our solvers to find the absolute best parameters. This is why classical ML is so reliable!

The Non-Convex Era (Deep Learning)

Neural networks are non-convex (compositions of ReLUs/sigmoids, weight products). We lose global guarantees, but convexity theory still provides the local analysis (smoothness, gradients, Lipschitz constants) that makes SGD/Adam work. Many techniques (batch norm, warmup) are inspired by convex optimization!

16
Model

Complete Mental Model

// synthesis · the big picture · when to use what
CONVEX OPTIMIZATION — COMPLETE TAXONOMY CONVEXITY Sets · Functions · Optimality Unconstrained min f(x) Equality Cons. Ax = b Inequality Cons. g_i(x) ≤ 0 Composite f + non-smooth h First-Order GD · Nesterov Subgradient · Proximal ISTA / FISTA Second-Order Newton's Method Quasi-Newton (L-BFGS) Dual & Interior KKT · Dual QP Barrier · Central Path Coordinate Descent Gauss-Seidel Updates Efficient for Sparse Data CONVERGENCE HIERARCHY Linear: O(e^-k) — Strongly Convex Accelerated: O(1/k²) — Nesterov / FISTA Sublinear: O(1/k) — Smooth Convex GD Slowest: O(1/√k) — Subgradient Descent MACHINE LEARNING CONTEXT Linear & Logistic Regression · SVM · LASSO Foundation for local analysis in Deep Learning
Fig 4. Complete taxonomy of convex optimization — from theoretical foundations to problem types, specialized solvers, convergence guarantees, and their deployment in classical and modern machine learning.

The complete unified narrative:

  1. Start with Convex Sets: A set is convex if every line segment between two points in the set stays in the set. This is the foundation — without convex sets, we can't define convex functions.
  2. Convex Functions: A function is convex if its epigraph is a convex set (or equivalently, if the chord lies above the graph). This guarantees that any local minimum is a global minimum!
  3. First-Order Condition: For differentiable functions, convexity is equivalent to the gradient inequality: the tangent line is a global underestimator.
  4. Second-Order Condition: For twice differentiable functions, convexity is equivalent to the Hessian being positive semidefinite (PSD) everywhere — all curvature is non-negative.
  5. Jensen's Inequality: Convexity extends to expectations: \(f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]\). This is the engine behind AM-GM, ELBO, and many ML proofs.
  6. Strong Convexity & Smoothness: These two conditions control convergence speed. Strong convexity (\(\mu I \leq \nabla^2 f\)) gives a quadratic lower bound. Smoothness (\(\nabla^2 f \leq L I\)) gives a quadratic upper bound. Their ratio is the condition number \(\kappa = L/\mu\).
  7. Constraints & Duality: For constrained problems, we form the Lagrangian, derive the KKT conditions, and use duality to find a dual problem that's often easier to solve (like SVM dual).
  8. Non-Smooth Problems: Use subgradients (generalized gradients) or better, proximal methods — split into smooth + non-smooth parts and use proximal gradient (ISTA/FISTA).
  9. Choose Your Solver: If small: IPM/Newton. If smooth: Nesterov. If composite: ISTA/FISTA. If high-dimensional sparse: coordinate descent.
  10. ML is Convex (Mostly): Classical ML (linear regression, SVM, logistic, LASSO) is all convex. Deep learning is non-convex, but we still use convex principles (SGD, momentum, smoothness) locally!
The Single Most Important Takeaway

Convex optimization is not just a set of algorithms — it is a mathematical guarantee. When a problem is convex, you know exactly what you're getting: a global minimizer, convergence rates, and no surprises. Even when you leave convexity (deep learning), convexity theory is still the best guide you have.