Convex sets and functions, Jensen's inequality, first- and second-order
conditions, strong convexity, duality, KKT, subgradients, proximal
and interior-point methods — the rigorous bridge between gradient
descent and Lagrange multipliers.
⊳ Builds directly on Gradient Descent (first-order local methods) and Lagrange Multipliers (constrained stationarity) — convexity is the property that makes both of these methods provably correct, not just locally optimal but globally optimal.
Convexity is fundamentally a statement about shape. A convex set has no "dents" — if you pick any two points in the set, the entire straight line segment between them stays inside the set. This simple geometric property has enormous consequences for optimization.
The point λx + (1−λ)y is a convex combination of x and y — it traces the line segment from y (λ=0) to x (λ=1) as λ varies. The definition says: this entire segment must lie in C. Equivalently, C contains every convex combination of its points.
Canonical Examples
Set
Definition
Convex?
Hyperplane
\(\{x: \mathbf{a}^T\mathbf{x}=b\}\)
Yes — affine
Halfspace
\(\{x: \mathbf{a}^T\mathbf{x}\leq b\}\)
Yes
Ball
\(\{x: \|x-x_0\|\leq r\}\)
Yes
Polyhedron
\(\{x: A\mathbf{x}\leq b\}\) (intersection of halfspaces)
Yes
Positive semidefinite cone
\(\{X\in\mathbb{S}^n: X\succeq 0\}\)
Yes
Annulus
\(\{x: 1\leq\|x\|\leq2\}\)
No — has a hole
Union of two disjoint balls
\(B_1 \cup B_2\)
No — line segment exits
Operations That Preserve Convexity
Intersection: If \(C_1, C_2\) are convex, so is \(C_1 \cap C_2\). This is why a polyhedron (intersection of halfspaces) is always convex — and why feasible regions in linear programs (intersections of linear constraints) are convex.
Affine images: If \(C\) is convex and \(f(\mathbf{x}) = A\mathbf{x}+\mathbf{b}\), then \(f(C) = \{f(\mathbf{x}):\mathbf{x}\in C\}\) is convex. Scaling, rotation, translation, and projection all preserve convexity.
Minkowski sum: \(C_1 + C_2 = \{x_1+x_2: x_1\in C_1, x_2\in C_2\}\) is convex if \(C_1, C_2\) are convex.
Cartesian product: \(C_1 \times C_2\) is convex if \(C_1, C_2\) are convex — used to combine independent constraint sets.
Note
Union of convex sets is generally not convex. This is why feasible regions defined by OR conditions (e.g., "x ≤ 0 OR x ≥ 1") are typically non-convex, and why such constraints make optimization problems combinatorially hard (mixed-integer programming).
The convex hull of S is the smallest convex set containing S — the set of all convex combinations of points in S. Geometrically: stretch a rubber band around the points. Extreme points (vertices) of a polyhedron are points that cannot be written as a convex combination of other points in the set — these are the candidate optimal points for linear programs.
A function is convex if its graph "curves upward" — every chord connecting two points on the graph lies above (or on) the graph. This single property guarantees that any local minimum is a global minimum.
Definition via Chords
Convex Function — DefinitionCore Definition
\[f: \mathbb{R}^n \to \mathbb{R} \text{ is convex if } \text{dom}(f) \text{ is convex and}\]
\[f(\lambda\mathbf{x}+(1-\lambda)\mathbf{y}) \leq \lambda f(\mathbf{x}) + (1-\lambda)f(\mathbf{y}) \quad \forall \mathbf{x},\mathbf{y}\in\text{dom}(f),\;\lambda\in[0,1]\]
\[\text{Strictly convex if strict inequality holds for } \mathbf{x}\neq\mathbf{y},\;\lambda\in(0,1)\]
\[\text{Concave: } -f \text{ is convex} \quad\text{(reverse inequality)}\]
The value of f at the convex combination is at most the convex combination of the values — the function value at any point on the chord is below the chord itself. Strict convexity rules out flat segments; it guarantees a unique minimizer if one exists.
The Epigraph — Functions as Sets
EpigraphGeometric View
\[\text{epi}(f) = \{(\mathbf{x},t)\in\mathbb{R}^{n+1} : t\geq f(\mathbf{x})\}\]
\[f \text{ is convex} \iff \text{epi}(f) \text{ is a convex set}\]
The epigraph is the region "above" the function's graph. This identity translates every fact about convex sets into a fact about convex functions, and vice versa — it's the bridge between §01 and §02. A function is convex exactly when the region above its graph is a convex set (no dents from above).
Fig 1. Left: a convex function — every chord lies above (or on) the graph; the epigraph (shaded) is a convex set. Right: a non-convex function — at point z, the function exceeds the chord, violating the definition.
Convexity-Preserving Operations
Operation
Rule
Why It Matters
Non-negative weighted sum
\(f = \sum w_if_i,\;w_i\geq0,\;f_i\) convex
Sum of losses + regularizers stays convex
Composition with affine map
\(g(\mathbf{x}) = f(A\mathbf{x}+b)\) convex if \(f\) convex
// generalization to expectations · proof · applications across ML
Jensen's inequality generalizes the definition of convexity from two-point convex combinations to arbitrary probability distributions. It is one of the most widely-used inequalities across all of mathematics and machine learning.
The two-point definition is the special case where X takes value x with probability λ and y with probability 1−λ: f(λx+(1−λ)y) ≤ λf(x)+(1−λ)f(y) becomes f(E[X]) ≤ E[f(X)]. Jensen's inequality extends this to any random variable and any convex function.
Proof via Supporting Hyperplane
// Proof of Jensen's inequality using the supporting hyperplane (tangent line)
1
Since \(f\) is convex and differentiable, at the point \(\mu = \mathbb{E}[X]\), the tangent line is a global underestimator:
\[f(x) \geq f(\mu) + \nabla f(\mu)^T(x-\mu) \quad \forall x\]
This is the first-order condition for convexity (§04) — the tangent line at any point lies entirely below the graph.
2
Substitute \(x = X\) (random variable) and take expectations of both sides:
\[\mathbb{E}[f(X)] \geq \mathbb{E}[f(\mu) + \nabla f(\mu)^T(X-\mu)] = f(\mu) + \nabla f(\mu)^T(\mathbb{E}[X]-\mu)\]
3
Since \(\mu = \mathbb{E}[X]\): \(\mathbb{E}[X] - \mu = 0\), so the second term vanishes:
\[\mathbb{E}[f(X)] \geq f(\mu) = f(\mathbb{E}[X]) \quad\blacksquare\]
Applications Throughout ML
KL divergence ≥ 0: Applying Jensen with \(f = -\log\) (convex) to \(\mathbb{E}_{p}[q(x)/p(x)]\) proves \(D_{\text{KL}}(p\|q) \geq 0\) — the Gibbs inequality.
ELBO (Evidence Lower Bound): \(\log p(x) = \log\mathbb{E}_q[p(x,z)/q(z)] \geq \mathbb{E}_q[\log(p(x,z)/q(z))]\) — Jensen applied to \(\log\) (concave) gives the variational lower bound used in VAEs.
Variance non-negativity: \(\mathbb{E}[X^2] \geq (\mathbb{E}[X])^2\) follows from Jensen with \(f(x)=x^2\) (convex) — hence \(\text{Var}(X) = \mathbb{E}[X^2]-(\mathbb{E}[X])^2 \geq 0\).
AM-GM inequality: Applying Jensen to \(f = -\log\) and uniform weights gives the arithmetic mean ≥ geometric mean inequality — fundamental to many ML proofs.
04
Calculus
First-Order Conditions
// global underestimator · gradient inequality · optimality for unconstrained problems
If \(f\) is differentiable, convexity has an equivalent characterization in terms of the gradient: the tangent hyperplane at any point lies entirely below (or on) the graph of \(f\). This is the property used in the proof of Jensen's inequality above.
Geometric meaning: the first-order Taylor approximation at any point x is a global underestimator of f. No matter where you linearize, the linear approximation never overshoots the true function value anywhere else.
Proof of Equivalence with the Chord Definition
// Showing chord definition ⟹ gradient inequality
1
From the chord definition with \(\mathbf{z} = \mathbf{x}+t(\mathbf{y}-\mathbf{x})\) for \(t\in(0,1)\):
\[f(\mathbf{x}+t(\mathbf{y}-\mathbf{x})) \leq (1-t)f(\mathbf{x}) + tf(\mathbf{y})\]
For a convex differentiable \(f\), \(\mathbf{x}^*\) is a global minimizer if and only if \(\nabla f(\mathbf{x}^*) = \mathbf{0}\). Proof: by the gradient inequality, \(f(\mathbf{y}) \geq f(\mathbf{x}^*) + \nabla f(\mathbf{x}^*)^T(\mathbf{y}-\mathbf{x}^*) = f(\mathbf{x}^*)\) for all \(\mathbf{y}\), since the gradient term vanishes. This is the foundation of gradient descent: \(\nabla f = \mathbf{0}\) is both necessary AND sufficient for convex problems — unlike non-convex problems where it only identifies stationary points (which may be saddle points or local minima).
At the optimum, the gradient cannot point into the feasible set — every feasible direction (y − x*) makes a non-negative angle with the gradient. If x* is in the interior of C, this reduces to ∇f(x*)=0. If x* is on the boundary, the gradient may be non-zero but must point "outward" — this is the geometric origin of KKT conditions (§09).
The second-order condition characterizes convexity through curvature: a twice-differentiable function is convex if and only if it curves "upward" in every direction, at every point — formalized via the Hessian matrix.
∇²f(x) ⪰ 0 means the Hessian is positive semidefinite — all its eigenvalues are ≥ 0. Strictly convex (sufficient, not necessary): ∇²f(x) ≻ 0 (positive definite, eigenvalues > 0). The Hessian eigenvalues measure curvature in each eigenvector direction — convexity requires non-negative curvature everywhere.
Verifying Convexity for Common ML Losses
// Hessian computations for standard loss functions
X^T X is always PSD: v^T(X^TX)v = ‖Xv‖² ≥ 0 for all v. Therefore MSE is always convex — regardless of the data. Strictly convex iff X has full column rank (no collinear features).
D has positive diagonal entries (σ(1-σ) > 0 always), so X^T D X is PSD by the same argument as MSE. Logistic regression's cross-entropy loss is always convex — this is precisely why it has no local minima other than the global one.
Neural Network (counter-example)
\(f(\mathbf{w}) = \frac{1}{2}(\sigma(w_2\sigma(w_1x))-y)^2\) for a 2-layer network. The composition of nonlinear activations with weight products makes the Hessian indefinite (mixed positive and negative eigenvalues) almost everywhere.
This is THE fundamental reason deep learning is non-convex: products of weights w₁w₂ create saddle points and multiple local minima. Convexity theory still informs us (Lipschitz smoothness, local convergence rates) but global optimality guarantees are lost.
Curvature Interpretation
Insight
The Hessian eigenvalues directly determine the shape of level sets near a minimum: large eigenvalue → steep curvature → narrow valley in that direction. Small eigenvalue → shallow curvature → wide, flat valley. The condition number \(\kappa = \lambda_{\max}/\lambda_{\min}\) of the Hessian determines how elongated the level sets are — and directly controls the convergence rate of gradient descent (covered in the Gradient Descent masterclass).
06
Strength
Strong Convexity
// quantified curvature · unique minimizer · linear convergence rates
Plain convexity guarantees a global minimum exists (if the problem is bounded) but says nothing about how fast we can find it, or whether it's unique. Strong convexity adds a quantitative lower bound on curvature — and this single number controls convergence rates.
Strong convexity means: the function lies above a quadratic lower bound everywhere (not just a linear one as in plain convexity). μ measures the minimum curvature — every eigenvalue of the Hessian is at least μ. Plain convexity is the limiting case μ=0.
Consequences of Strong Convexity
Unique minimizer: Strong convexity implies strict convexity, which implies (if a minimizer exists) it is unique. A strongly convex function cannot have flat regions or multiple minima.
Quadratic growth: \(f(\mathbf{x}) \geq f(\mathbf{x}^*) + \frac{\mu}{2}\|\mathbf{x}-\mathbf{x}^*\|^2\) — the suboptimality gap grows at least quadratically with distance from the optimum. This directly bounds how far from optimal you are given the function value.
Linear convergence: Gradient descent on a \(\mu\)-strongly convex, \(L\)-smooth function converges linearly: \(f(\mathbf{x}_k) - f(\mathbf{x}^*) \leq (1-\mu/L)^k(f(\mathbf{x}_0)-f(\mathbf{x}^*))\). Without strong convexity, only sublinear rates \(O(1/k)\) are guaranteed.
Polyak-Łojasiewicz (PL) condition: A weaker condition than strong convexity that still guarantees linear convergence: \(\frac{1}{2}\|\nabla f(\mathbf{x})\|^2 \geq \mu(f(\mathbf{x})-f^*)\). Many non-convex deep learning loss landscapes satisfy PL locally — explaining why SGD converges quickly despite non-convexity.
Even if X^TX is singular (e.g., d > n, more features than samples), adding λ‖w‖² guarantees μ = 2λ > 0 strong convexity. This is one of the deepest reasons L2 regularization helps: it converts a merely convex (possibly flat or unbounded) problem into a strongly convex one with a unique solution and linear convergence guarantees.
07
Smoothness
Lipschitz Continuity
// bounded rate of change · L-smoothness · the descent lemma · step size selection
Lipschitz continuity bounds how fast a function (or its gradient) can change. Combined with strong convexity, it forms the pair of constants (\(\mu, L\)) that completely characterizes the convergence behavior of first-order optimization methods.
G bounds the maximum rate of change of f. Equivalent to: ‖∇f(x)‖ ≤ G everywhere (if differentiable). Used in subgradient methods (§10) where the convergence rate depends on G.
L bounds the maximum curvature — the largest eigenvalue of the Hessian is at most L. Combined with §06 (μ ≤ ∇²f ≤ L), the function's Hessian eigenvalues all lie in [μ, L]. The ratio κ = L/μ is the condition number that governs gradient descent's convergence rate.
The Descent Lemma
// Deriving the descent lemma — the key tool for step size selection
1
By the fundamental theorem of calculus and L-smoothness, for any \(\mathbf{x},\mathbf{y}\):
\[f(\mathbf{y}) = f(\mathbf{x}) + \int_0^1\nabla f(\mathbf{x}+t(\mathbf{y}-\mathbf{x}))^T(\mathbf{y}-\mathbf{x})\,dt\]
2
Add and subtract \(\nabla f(\mathbf{x})^T(\mathbf{y}-\mathbf{x})\), then bound the difference using L-smoothness and Cauchy-Schwarz:
\[f(\mathbf{y}) \leq f(\mathbf{x}) + \nabla f(\mathbf{x})^T(\mathbf{y}-\mathbf{x}) + \frac{L}{2}\|\mathbf{y}-\mathbf{x}\|^2\]
3
Apply to gradient descent: set \(\mathbf{y}=\mathbf{x}-\alpha\nabla f(\mathbf{x})\):
\[f(\mathbf{x}-\alpha\nabla f(\mathbf{x})) \leq f(\mathbf{x}) - \alpha\left(1-\frac{L\alpha}{2}\right)\|\nabla f(\mathbf{x})\|^2\]
For the right-hand side to guarantee decrease, we need (1 − Lα/2) > 0, i.e. α < 2/L. The optimal fixed step size is α = 1/L — this directly proves why the famous "step size ≤ 1/L" rule for gradient descent guarantees monotonic decrease. ∎
The Descent Lemma — Foundation of Step Size Selection
\[f(\mathbf{x}-\tfrac{1}{L}\nabla f(\mathbf{x})) \leq f(\mathbf{x}) - \frac{1}{2L}\|\nabla f(\mathbf{x})\|^2\]
Every step of gradient descent with \(\alpha = 1/L\) decreases the function value by at least \(\frac{1}{2L}\|\nabla f\|^2\). This is the inequality that drives all convergence rate proofs for gradient descent — \(O(1/k)\) for convex, linear for strongly convex \((\mu>0)\).
Every constrained optimization problem (the "primal") has an associated "dual" problem. The dual provides a lower bound on the primal's optimal value — and under convexity, this bound becomes tight, transforming the original problem into an equivalent, often easier one.
The dual function d(λ,ν) is always concave (it's an infimum of affine functions of (λ,ν)) — even if the primal is non-convex! This means the dual problem is always a concave maximization, often much easier to solve.
Weak Duality — Always True
Weak DualityUniversal
\[d^* \leq p^* \quad\text{(always — even for non-convex problems)}\]
Proof sketch: for feasible x and any λ≥0, ν: L(x,λ,ν) = f(x) + Σλᵢgᵢ(x) + Σνⱼhⱼ(x) ≤ f(x) since gᵢ(x)≤0, λᵢ≥0, hⱼ(x)=0. So d(λ,ν) = inf_x L(x,λ,ν) ≤ L(x*,λ,ν) ≤ f(x*) = p*. The gap p* − d* ≥ 0 is called the duality gap.
Slater's condition: there exists a strictly feasible point (inequality constraints satisfied with strict inequality). This is a constraint qualification — it rules out degenerate geometries where the constraint boundary "kisses" the objective in a pathological way. Under Slater's condition, solving the dual problem is EQUIVALENT to solving the primal — and the dual is often much easier (e.g., SVM dual is a quadratic program in fewer variables).
Geometric Interpretation
Fig 2. Geometric duality: the set 𝒢 of achievable (constraint, objective) pairs. p* is the minimum f over the feasible half g≤0. The dual finds the best supporting hyperplane (slope −λ*) — under strong duality, this hyperplane touches exactly at p*, so d*=p*.
09
Optimality
KKT Conditions — Revisited
// from Lagrange multipliers to KKT · complementary slackness · sufficiency under convexity
The Lagrange Multipliers masterclass introduced KKT as necessary conditions for constrained optimality. Here we close the loop: under convexity, KKT conditions become both necessary and sufficient — solving the KKT system solves the original problem completely.
The KKT Conditions — Complete StatementOptimality System
Complementary slackness is the bridge to economic/geometric interpretation: either a constraint is "active" (gᵢ(x*)=0, the boundary binds, λᵢ* can be positive) OR it's "inactive" (gᵢ(x*)<0, doesn't bind, λᵢ*=0 necessarily). You can never have both a strictly-satisfied constraint AND a positive multiplier.
KKT as Necessary AND Sufficient — The Convex Case
KKT Sufficiency Theorem
If \(f, g_i\) are convex and \(h_j\) affine, then any point \((\mathbf{x}^*,\boldsymbol{\lambda}^*,\boldsymbol{\nu}^*)\) satisfying the KKT conditions is a global optimum of the primal AND dual — with zero duality gap. This is the precise statement of "the bridge": for general (non-convex) problems, KKT gives candidate stationary points that might be optima — checking second-order conditions and comparing values is still needed. For convex problems, KKT is the complete answer — find a KKT point and you're done, no further checking required.
Worked Example — SVM Dual Derivation
// SVM: from primal QP to dual via KKT — the canonical example
Substitute back into \(\mathcal{L}\):
\[\max_{\boldsymbol{\alpha}\geq0}\; \sum_i\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_jy_iy_j\mathbf{x}_i^T\mathbf{x}_j \quad\text{s.t.}\quad \sum_i\alpha_iy_i=0\]
Complementary slackness: αᵢ[yᵢ(w^Tx_i+b)−1]=0 — αᵢ>0 only for points exactly on the margin (support vectors!). This is why most αᵢ are zero — the solution is "sparse" in the dual variables, and the kernel trick (x_i^T x_j → k(x_i,x_j)) is only possible because the dual depends on the data only through inner products.
10
Non-Smooth
Subgradients
// generalizing the gradient · subgradient set · non-differentiable optimization · subgradient descent
Many important functions in ML — \(\ell_1\) norm, hinge loss, ReLU at zero — are convex but not differentiable everywhere. Subgradients generalize the gradient to handle these "kinks," preserving the gradient inequality from §04.
Subgradient — DefinitionGeneralized Derivative
\[\mathbf{g} \text{ is a subgradient of } f \text{ at } \mathbf{x} \iff f(\mathbf{y}) \geq f(\mathbf{x}) + \mathbf{g}^T(\mathbf{y}-\mathbf{x}) \quad \forall\mathbf{y}\]
\[\partial f(\mathbf{x}) = \{\mathbf{g} : \mathbf{g} \text{ is a subgradient of } f \text{ at } \mathbf{x}\} \quad\text{(subdifferential — a SET)}\]
At points where f is differentiable: ∂f(x) = {∇f(x)} — a single element. At "kinks" (non-differentiable points), ∂f(x) is a whole set of valid slopes — any line through (x,f(x)) that stays below the graph everywhere. Optimality condition generalizes elegantly: x* is a global min iff 0 ∈ ∂f(x*).
Subdifferentials of Common Functions
Function
Subdifferential
Notes
\(f(x)=|x|\)
\(\partial f(0) = [-1,1]\)
At x=0: any slope between −1 and 1 stays below |x|
\(f(x)=\max(0,x)\) (ReLU)
\(\partial f(0)=[0,1]\)
Convention: PyTorch uses 0 (subgradient choice)
\(f(\mathbf{x})=\|\mathbf{x}\|_1\)
\(\partial f(\mathbf{x})_i = \text{sign}(x_i)\) if \(x_i\neq0\), else \([-1,1]\)
Unlike gradient descent on smooth functions (O(1/k) or linear), subgradient descent is NOT a descent method — f(x_{k+1}) can be larger than f(x_k) at any single step. Only the BEST iterate over the trajectory is guaranteed to converge, and the rate O(1/√k) is fundamentally slower. This motivates proximal methods (§11) which handle non-smoothness more gracefully.
Many ML problems have an objective \(f(\mathbf{x}) = g(\mathbf{x}) + h(\mathbf{x})\) where \(g\) is smooth (e.g., squared loss) but \(h\) is non-smooth (e.g., \(\ell_1\) penalty for sparsity). Proximal methods exploit this structure to handle the non-smooth part exactly while gradient-stepping on the smooth part — achieving the same convergence rates as gradient descent on purely smooth problems.
The proximal operator finds a point close to \(\mathbf{v}\) that also has small \(h\). It's a generalized projection: if \(h\) is the indicator function of a convex set \(C\) (0 inside, ∞ outside), \(\text{prox}_h(\mathbf{v})\) is the orthogonal projection of \(\mathbf{v}\) onto \(C\). For \(h=0\), \(\text{prox}(\mathbf{v})=\mathbf{v}\) (identity).
Proximal Gradient Method (ISTA)
To understand proximal gradient, think in terms of majorization-minimization (MM): we construct a quadratic upper bound (majorizer) of \(f\) at \(\mathbf{x}_k\) and minimize that instead.
// Majorization-Minimization for Proximal Gradient
1
Since \(g\) is \(L\)-smooth, the descent lemma gives:
\[g(\mathbf{y}) \leq g(\mathbf{x}_k) + \nabla g(\mathbf{x}_k)^T(\mathbf{y}-\mathbf{x}_k) + \frac{L}{2}\|\mathbf{y}-\mathbf{x}_k\|^2\]
This is a quadratic upper bound on \(g\).
2
Adding \(h(\mathbf{y})\) to both sides gives a majorizer of \(f = g + h\):
\[f(\mathbf{y}) \leq \underbrace{g(\mathbf{x}_k) + \nabla g(\mathbf{x}_k)^T(\mathbf{y}-\mathbf{x}_k) + \frac{L}{2}\|\mathbf{y}-\mathbf{x}_k\|^2 + h(\mathbf{y})}_{Q(\mathbf{y}, \mathbf{x}_k)}\]
3
Dropping the constant term \(g(\mathbf{x}_k)\) and completing the square:
\[Q(\mathbf{y}, \mathbf{x}_k) \propto h(\mathbf{y}) + \frac{L}{2}\left\|\mathbf{y} - \left(\mathbf{x}_k - \frac{1}{L}\nabla g(\mathbf{x}_k)\right)\right\|^2\]
4
Minimizing \(Q\) is exactly evaluating the proximal operator with \(\alpha=1/L\):
\[\mathbf{x}_{k+1} = \text{prox}_{(1/L) h}\left(\mathbf{x}_k - \frac{1}{L}\nabla g(\mathbf{x}_k)\right)\]
Two steps: (1) gradient step on the smooth part \(g\), (2) proximal step on the non-smooth part \(h\). Crucially, if \(g\) is \(L\)-smooth, ISTA converges at the same rate as gradient descent on smooth functions: \(O(1/k)\). For strongly convex \(g\), ISTA converges linearly.
ISTA
ISTA = Iterative Shrinkage-Thresholding Algorithm. The "shrinkage" refers to the soft-thresholding step for \(\ell_1\) regularization.
Accelerated Proximal Gradient (FISTA)
Just as Nesterov acceleration improves gradient descent from \(O(1/k)\) to \(O(1/k^2)\), we can accelerate ISTA to get FISTA — the optimal first-order method for composite convex optimization.
FISTA uses a momentum-like extrapolation step with the sequence \(t_k\). It achieves the optimal \(O(1/k^2)\) convergence rate — a factor of \(k\) faster than ISTA — with no increase in per-iteration cost.
Soft-Thresholding — Deriving the \(\ell_1\) Proximal Operator
Let's prove that the proximal operator for \(h(\mathbf{x}) = \lambda\|\mathbf{x}\|_1\) is indeed the soft-thresholding operator. Since \(\ell_1\) is separable, we can solve for each coordinate independently.
// Deriving Soft-Thresholding
1
The 1D proximal problem for coordinate \(i\):
\[\min_{z} \left\{\lambda |z| + \frac{1}{2\alpha}(z - v_i)^2\right\}\]
2
Case 1: \(z > 0\). Then \(|z| = z\), and the derivative is \(\lambda + \frac{1}{\alpha}(z - v_i) = 0\). Solving:
\[z = v_i - \alpha\lambda\]
Valid only if \(v_i - \alpha\lambda > 0 \implies v_i > \alpha\lambda\).
3
Case 2: \(z < 0\). Then \(|z| = -z\), and the derivative is \(-\lambda + \frac{1}{\alpha}(z - v_i) = 0\). Solving:
\[z = v_i + \alpha\lambda\]
Valid only if \(v_i + \alpha\lambda < 0 \implies v_i < -\alpha\lambda\).
4
Case 3: \(z = 0\). The subgradient condition for optimality is \(0 \in \partial\left(\lambda|z| + \frac{1}{2\alpha}(z-v_i)^2\right)\). At \(z=0\):
\[0 \in [-\alpha\lambda, \alpha\lambda] - v_i \implies v_i \in [-\alpha\lambda, \alpha\lambda]\]
This is the most famous proximal operator. It pulls each coordinate toward zero by \(\lambda\), setting it to zero if its magnitude is less than \(\lambda\). This is the exact reason \(\ell_1\) regularization induces sparsity — the proximal step literally zeros out small components.
Concrete Example: LASSO via ISTA/FISTA
The LASSO problem is:
\[\min_{\mathbf{w}} \frac{1}{2}\|\mathbf{Xw} - \mathbf{y}\|^2 + \lambda\|\mathbf{w}\|_1\]
Here, \(g(\mathbf{w}) = \frac{1}{2}\|\mathbf{Xw} - \mathbf{y}\|^2\) (smooth, \(L = \sigma_{\text{max}}(\mathbf{X}^T\mathbf{X})\)) and \(h(\mathbf{w}) = \lambda\|\mathbf{w}\|_1\) (non-smooth).
PSEUDOCODE
# ISTA for LASSO functionISTA_LASSO(X, y, λ, max_iter): L = spectral_norm(X@X.T) # Lipschitz constant of ∇g α = 1 / L w = zeros(p) # p = number of features forkin1tomax_iter: gradient = X.T@ (X@w-y) w = soft_threshold(w-α*gradient, α*λ) returnw
// alternating optimization · Gauss-Seidel updates · efficiency for big data
Coordinate descent (CD) simplifies high-dimensional optimization by updating one variable (or block) at a time while keeping others fixed. While not always faster than gradient methods, it is exceptionally efficient for problems with specific structural separability — especially when the 1D subproblems have closed-form solutions.
Gauss-Seidel CD: Cycle through coordinates in a fixed order, using the most recent values of previously updated coordinates. Randomized CD: Select coordinates uniformly at random (or with priority) — often faster in practice and easier to analyze theoretically.
Concrete Example: Coordinate Descent for LASSO
Let's derive the coordinate update for LASSO, which is the workhorse of the `glmnet` package. The LASSO problem is:
\[\min_{\mathbf{w}} \frac{1}{2}\|\mathbf{y} - \mathbf{Xw}\|^2 + \lambda\|\mathbf{w}\|_1\]
Let \(x_j\) be the j-th column of \(\mathbf{X}\), and let \(r = \mathbf{y} - \sum_{k \neq j} \mathbf{X}_k w_k\) be the residual excluding the j-th coordinate.
// LASSO Coordinate Update Derivation
1
The objective as a function of \(w_j\) alone is:
\[\min_{w_j} \frac{1}{2}\|r - x_j w_j\|^2 + \lambda|w_j|\]
2
Let \(s_j = x_j^T r\) (the partial residual) and \(d_j = x_j^T x_j\) (the squared norm of the j-th feature). Then:
\[\min_{w_j} \frac{1}{2}d_j w_j^2 - s_j w_j + \lambda|w_j|\]
3
This is exactly the 1D proximal problem for \(\ell_1\)! The solution is the soft-thresholding operator:
\[w_j^{(k+1)} = S_{\lambda/d_j}\left(\frac{s_j}{d_j}\right) = \text{sign}\left(\frac{s_j}{d_j}\right)\max\left(0, \left|\frac{s_j}{d_j}\right| - \frac{\lambda}{d_j}\right)\]
PSEUDOCODE
# Coordinate Descent for LASSO (Gauss-Seidel) functionLASSO_CD(X, y, λ, max_iter): n, p = X.shape w = zeros(p) d = sum(X**2, axis=0) # precompute squared norms foriterin1tomax_iter: forjin1top: # partial residual without j-th feature r = y-X@w+X[:, j] *w[j] s_j = X[:, j].T@r # soft-thresholding update w[j] = soft_threshold(s_j / d[j], λ / d[j]) returnw
Sparsity
Coordinate descent for LASSO naturally maintains sparsity: once a coordinate is set to zero, it often stays zero for many iterations. This makes it extremely fast for high-dimensional problems where most features are irrelevant.
Why and When it Works
Separability: Works best when the objective is "smooth + separable": \(f(\mathbf{x}) = g(\mathbf{x}) + \sum h_i(x_i)\).
Non-smooth catch: CD can get stuck at "corners" for general non-smooth functions. It requires the non-smooth part to be coordinate-wise separable (like \(\ell_1\)).
Big Data: For very high-dimensional sparse problems (like text classification with millions of features), CD is often the state-of-the-art solver (used in `glmnet` and `liblinear`).
Line search free: Unlike gradient descent, coordinate descent often requires no step-size tuning — the 1D optimizations are exact.
13
Barriers
Interior Point Methods
// logarithmic barrier · central path · Newton's method for constrained problems
Interior point methods (IPMs) solve constrained problems by approximating them with a sequence of unconstrained problems. They "smooth out" the constraints using a barrier function that goes to infinity as you approach the boundary — ensuring we always stay strictly feasible.
The Log-Barrier and Central Path
Logarithmic Barrier FunctionBarrier
\[\phi(\mathbf{x}) = -\sum_{i=1}^m \log(-g_i(\mathbf{x}))\]
\[\min_\mathbf{x}\; t f(\mathbf{x}) + \phi(\mathbf{x})\]
As \(g_i(\mathbf{x}) \to 0^-\), \(-\log(-g_i(\mathbf{x})) \to \infty\). The parameter \(t > 0\) controls the barrier strength: small \(t\) emphasizes the barrier (staying away from boundaries), large \(t\) emphasizes the original objective.
For each \(t\), let \(\mathbf{x}^*(t)\) denote the minimizer of the barrier problem. The set \(\{\mathbf{x}^*(t) \mid t > 0\}\) is called the central path. As \(t \to \infty\), \(\mathbf{x}^*(t) \to \mathbf{x}^*\) — the true solution of the original constrained problem.
The Interior Point Algorithm
The standard path-following IPM alternates between two steps:
Newton step: Solve the current barrier problem with a Newton step.
Update t: Increase \(t\) by a factor (e.g., \(t \leftarrow 1.5t\)).
PSEUDOCODE
# Primal Interior Point Method (sketch) functionIPM(f, g, x0): t = 1 x = x0# must be strictly feasible: g_i(x) < 0 forkin1tomax_iter: # 1. Newton step on t*f + phi grad = t * ∇f(x) + ∇φ(x) hess = t * ∇²f(x) + ∇²φ(x) dx = solve(hess, -grad) # 2. Line search to stay feasible x = x + alpha * dx # 3. Increase barrier parameter t = t * 1.2 returnx
Why use IPMs?
Polynomial Complexity: The path-following IPM is theoretically polynomial-time (O(m³√log(1/ε))) for linear programming — a landmark result in optimization.
High Precision: Unlike first-order methods (gradient descent), IPMs use second-order info (Newton steps) and can reach machine precision in very few iterations (usually 10-50).
Small-Medium Scale: Because they require computing and inverting the Hessian, they are best for problems with a few thousand variables, not millions.
Reliability: IPMs are the engine behind commercial solvers like Gurobi and Mosek for LPs, QPs, and SOCPs.
14
Guarantees
Convergence Guarantees
// complexity classes · O(1/k) vs O(1/k²) vs linear · optimal rates
Convergence analysis tells us how many iterations \(k\) are needed to reach an \(\epsilon\)-optimal solution: \(f(\mathbf{x}_k) - f(\mathbf{x}^*) \leq \epsilon\). This hierarchy of rates is the "speed limit" of optimization — and convexity is what lets us prove these guarantees.
Convergence Rate Hierarchy
Assumption
Method
Rate
Iterations for \(\epsilon\)
Convex, G-Lipschitz
Subgradient
\(O(1/\sqrt{k})\)
\(O(1/\epsilon^2)\)
Convex, L-smooth
Gradient Descent
\(O(1/k)\)
\(O(1/\epsilon)\)
Convex, L-smooth
Accelerated (Nesterov)
\(O(1/k^2)\)
\(O(1/\sqrt{\epsilon})\)
Strongly Convex, L-smooth
Gradient Descent
\(O(e^{-k/\kappa})\)
\(O(\kappa \log(1/\epsilon))\)
Convergence Proof Sketch — Smooth Convex Case
Let's see why gradient descent converges at \(O(1/k)\) for L-smooth convex functions. The key tool is the descent lemma from Section 7.
// Gradient Descent Convergence (Smooth Convex)
1
From the descent lemma with \(\alpha = 1/L\):
\[f(\mathbf{x}_{k+1}) \leq f(\mathbf{x}_k) - \frac{1}{2L}\|\nabla f(\mathbf{x}_k)\|^2\]
Let \(r_k = \|\mathbf{x}_k - \mathbf{x}^*\|\). With some algebra (see textbooks), we can show:
\[f(\mathbf{x}_k) - f(\mathbf{x}^*) \leq \frac{2L r_0^2}{k}\]
Key Concepts
Sublinear Convergence: Rates like \(O(1/k)\) or \(O(1/k^2)\) — the error decreases, but the number of iterations needed for each new decimal of precision grows.
Linear Convergence: Error decreases geometrically (\(O(e^{-k/\kappa})\)) — each iteration roughly halves the error (for good condition numbers). This is much faster!
Condition Number (\(\kappa = L/\mu\)): The ratio of smoothness to strong convexity. Large \(\kappa\) means slow linear convergence (ill-conditioned problems, like narrow valleys).
Optimal Rates: Nesterov's accelerated gradient achieves the best possible rate for first-order methods on smooth convex problems — you can't do better than \(O(1/k^2)\) without using second-order information.
Linear
"Linear convergence" (counter-intuitively) means the error decreases geometrically, like \(0.9^k\). On a log-scale plot, this looks like a straight line. It is much faster than "sublinear" rates like \(1/k\). Strong convexity is the magic ingredient that turns sublinear convergence into linear convergence.
15
ML Context
Optimization in ML
// empirical risk minimization · regularizers · why convexity is the baseline
In Machine Learning, optimization is the engine of learning. The central problem is Empirical Risk Minimization (ERM):
\[\min_\mathbf{w}\; \underbrace{\frac{1}{n}\sum_{i=1}^n L(y_i, f(\mathbf{w}, \mathbf{x}_i))}_{\text{Empirical Risk}} + \underbrace{\lambda \Omega(\mathbf{w})}_{\text{Regularizer}}\]
Convexity in Classical ML Models
All your favorite classical ML models are convex — and that's why they're reliable!
Linear Regression
Least squares is a quadratic program with a unique global minimum (if \(X^T X\) is positive definite). Even with L2 regularization (Ridge), it's still convex and strongly convex.
Logistic Regression
The log-loss is convex (negative log-likelihood for a Bernoulli model). The gradient is Lipschitz continuous, so gradient descent converges at \(O(1/k)\).
SVM (Hinge Loss)
The hinge loss is convex, and the dual problem is a QP. The KKT conditions tell us that only support vectors matter — a beautiful result of convex duality.
LASSO
L1 regularization makes the objective non-differentiable but still convex. Proximal methods or coordinate descent work perfectly here, and the solution is sparse!
Why Convexity Matters in ML
No Local Minima: Every local minimum is a global minimum — no need to worry about getting stuck!
Worst-Case Guarantees: We know exactly how fast our algorithms will converge.
Interpretability: Duality gives insight into constraints and support vectors (SVMs, LASSO).
Hyperparameter Robustness: Convex solvers are less sensitive to step sizes and initializations (though still important!).
Convex vs Non-Convex in ML
The Convex Era (Classical ML)
Models like Linear Regression, SVMs, Logistic Regression, and Lasso are convex. We have global guarantees. We can trust our solvers to find the absolute best parameters. This is why classical ML is so reliable!
The Non-Convex Era (Deep Learning)
Neural networks are non-convex (compositions of ReLUs/sigmoids, weight products). We lose global guarantees, but convexity theory still provides the local analysis (smoothness, gradients, Lipschitz constants) that makes SGD/Adam work. Many techniques (batch norm, warmup) are inspired by convex optimization!
16
Model
Complete Mental Model
// synthesis · the big picture · when to use what
Fig 4. Complete taxonomy of convex optimization — from theoretical foundations to problem types, specialized solvers, convergence guarantees, and their deployment in classical and modern machine learning.
The complete unified narrative:
Start with Convex Sets: A set is convex if every line segment between two points in the set stays in the set. This is the foundation — without convex sets, we can't define convex functions.
Convex Functions: A function is convex if its epigraph is a convex set (or equivalently, if the chord lies above the graph). This guarantees that any local minimum is a global minimum!
First-Order Condition: For differentiable functions, convexity is equivalent to the gradient inequality: the tangent line is a global underestimator.
Second-Order Condition: For twice differentiable functions, convexity is equivalent to the Hessian being positive semidefinite (PSD) everywhere — all curvature is non-negative.
Jensen's Inequality: Convexity extends to expectations: \(f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]\). This is the engine behind AM-GM, ELBO, and many ML proofs.
Strong Convexity & Smoothness: These two conditions control convergence speed. Strong convexity (\(\mu I \leq \nabla^2 f\)) gives a quadratic lower bound. Smoothness (\(\nabla^2 f \leq L I\)) gives a quadratic upper bound. Their ratio is the condition number \(\kappa = L/\mu\).
Constraints & Duality: For constrained problems, we form the Lagrangian, derive the KKT conditions, and use duality to find a dual problem that's often easier to solve (like SVM dual).
Non-Smooth Problems: Use subgradients (generalized gradients) or better, proximal methods — split into smooth + non-smooth parts and use proximal gradient (ISTA/FISTA).
Choose Your Solver: If small: IPM/Newton. If smooth: Nesterov. If composite: ISTA/FISTA. If high-dimensional sparse: coordinate descent.
ML is Convex (Mostly): Classical ML (linear regression, SVM, logistic, LASSO) is all convex. Deep learning is non-convex, but we still use convex principles (SGD, momentum, smoothness) locally!
The Single Most Important Takeaway
Convex optimization is not just a set of algorithms — it is a mathematical guarantee. When a problem is convex, you know exactly what you're getting: a global minimizer, convergence rates, and no surprises. Even when you leave convexity (deep learning), convexity theory is still the best guide you have.