// Mathematics — Constrained Optimization

Lagrange Multipliers

From equality constraints to the full KKT system, Lagrangian duality, and the dual problem — with the connections to SVM, Ridge, and maximum entropy. Theory that unlocks half of modern ML.

Constrained Optimization Gradient Parallelism KKT Conditions Complementary Slackness Lagrangian Duality Dual Problem SVM Derivation Ridge as Constrained Opt Strong vs Weak Duality
// Table of Contents
  1. Constrained Optimization — The Setup
  2. Geometric Intuition — Why Gradients Must Be Parallel
  3. The Lagrangian — Formal Definition
  4. Solving Equality Constraints — Full Methodology
  5. Multiple Equality Constraints
  6. Inequality Constraints — Extending the Framework
  7. KKT Conditions — Complete Treatment
  8. Complementary Slackness — Deep Dive
  9. Lagrangian Duality and the Dual Problem
  10. Strong Duality and Constraint Qualification
  11. Application — Hard-Margin SVM Derivation
  12. Application — Soft-Margin SVM and Dual
  13. Application — Ridge as Constrained Optimization
  14. Application — Maximum Entropy and Softmax
  15. Numerical Methods — ALM and Penalty Methods
  16. The Full Mental Model
§01

Constrained Optimization — The Setup

In unconstrained optimization, we minimize \(f(\mathbf{x})\) over all \(\mathbf{x} \in \mathbb{R}^n\). The solution is wherever \(\nabla f = \mathbf{0}\). Constrained optimization forces the solution to lie on a specific set — a surface, curve, or region — defined by constraints. This is far closer to how problems actually arise in practice.

General Constrained Problem \[ \min_{\mathbf{x} \in \mathbb{R}^n} \; f(\mathbf{x}) \quad \text{subject to} \quad \begin{cases} g_i(\mathbf{x}) = 0, & i = 1, \ldots, m \quad \text{(equality constraints)} \\ h_j(\mathbf{x}) \leq 0, & j = 1, \ldots, p \quad \text{(inequality constraints)} \end{cases} \]

The set of all \(\mathbf{x}\) satisfying all constraints is the feasible set \(\mathcal{F}\). We want the minimum of \(f\) restricted to \(\mathcal{F}\).

Why Can't We Just Ignore Constraints?

The unconstrained minimum \(\mathbf{x}^* = \arg\min f\) might not be in \(\mathcal{F}\). Three scenarios:

  • Unconstrained min is feasible: \(\mathbf{x}^* \in \mathcal{F}\) → constraints are inactive, same answer. Lagrange multipliers will give \(\lambda^* = 0\).
  • Unconstrained min is infeasible: The optimal feasible point lies on the constraint boundary — we are "pushed" there by the constraint. This is the interesting case.
  • No feasible point exists: Problem is infeasible → no solution.

Examples from Your Background

Problem Objective \(f\) Constraint
SVM (hard margin) Minimize \(\frac{1}{2}\|\mathbf{w}\|^2\) \(y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1\) for all \(i\)
Ridge regression Minimize \(\|\mathbf{X}\boldsymbol{\theta}-\mathbf{y}\|^2\) \(\|\boldsymbol{\theta}\|^2 \leq t\)
Lasso regression Minimize \(\|\mathbf{X}\boldsymbol{\theta}-\mathbf{y}\|^2\) \(\|\boldsymbol{\theta}\|_1 \leq t\)
Maximum entropy Maximize \(-\sum_k p_k \log p_k\) \(\sum_k p_k = 1\), \(p_k \geq 0\)
Portfolio optimization Minimize portfolio variance Budget sums to 1, return \(\geq r_{\min}\)
✓ The Core Idea

Lagrange multipliers convert a constrained optimization problem into an unconstrained one by absorbing the constraints into the objective via multiplier terms. The price: we now optimize over both the original variables and the multipliers. But the resulting system — the stationarity conditions — can be solved systematically.

§02

Geometric Intuition — Why Gradients Must Be Parallel

Before any formulas: why does the method work? The entire theory rests on a single geometric insight.

The Setup

Consider minimizing \(f(\mathbf{x})\) subject to \(g(\mathbf{x}) = 0\). The constraint defines a level surface (a curve in 2D, a surface in 3D). We must move along this surface and find where \(f\) is smallest.

Fig 1. Geometric interpretation. At non-optimal points, ∇f has a component along the constraint curve — we can move along g=0 to reduce f. At the optimum x*, the contour of f is tangent to g=0, meaning ∇f ∥ ∇g.

The Optimality Argument

On the constraint surface \(g(\mathbf{x}) = 0\), we can only move in directions tangent to this surface. If the projection of \(\nabla f\) onto the tangent plane is nonzero, we can move along the constraint and decrease \(f\) further — we're not at the optimum.

At the constrained minimum \(\mathbf{x}^*\): the projection of \(\nabla f(\mathbf{x}^*)\) onto the constraint tangent plane must be zero. This means \(\nabla f(\mathbf{x}^*)\) is normal to the constraint surface. But \(\nabla g(\mathbf{x}^*)\) is also normal to the constraint surface (the gradient of any function is perpendicular to its level sets). Therefore:

Optimality Condition \[ \nabla f(\mathbf{x}^*) = \lambda^* \nabla g(\mathbf{x}^*) \]

for some scalar λ* ∈ ℝ — the Lagrange multiplier

💡 The Scalar λ*

The multiplier \(\lambda^*\) is the proportionality constant between the two gradient vectors. It can be positive or negative — we don't constrain its sign for equality constraints. Its magnitude tells you how sensitive the optimal value \(f(\mathbf{x}^*)\) is to small changes in the constraint — the "shadow price" of the constraint. If \(\lambda^* = 0\), the constraint is inactive (the unconstrained min happens to be feasible).

§03

The Lagrangian — Formal Definition

The geometric condition \(\nabla f = \lambda \nabla g\) plus the constraint \(g(\mathbf{x}) = 0\) gives us \(n+1\) equations in \(n+1\) unknowns (\(\mathbf{x} \in \mathbb{R}^n\) and \(\lambda \in \mathbb{R}\)). The Lagrangian packages this system elegantly.

The Lagrangian Function \[ \mathcal{L}(\mathbf{x}, \lambda) = f(\mathbf{x}) - \lambda \cdot g(\mathbf{x}) \]

Some texts use + instead of −; for equality constraints the choice is immaterial, because \(\lambda\) is unrestricted in sign. For ≤ constraints the sign convention does matter — see §06 and §07.

Why This Works — The Stationarity Conditions

Taking the gradient of \(\mathcal{L}\) with respect to all variables and setting to zero:

Stationarity Conditions \[ \nabla_{\mathbf{x}} \mathcal{L} = \mathbf{0} \implies \nabla f(\mathbf{x}) - \lambda \nabla g(\mathbf{x}) = \mathbf{0} \implies \nabla f = \lambda \nabla g \] \[ \frac{\partial \mathcal{L}}{\partial \lambda} = 0 \implies -g(\mathbf{x}) = 0 \implies g(\mathbf{x}) = 0 \]

The gradient with respect to \(\mathbf{x}\) recovers the parallelism condition. The derivative with respect to \(\lambda\) recovers the constraint itself. Setting all partials of \(\mathcal{L}\) to zero is equivalent to the full optimality system — the Lagrangian encodes both the objective and the constraint in one function.

★ The Key Insight

Finding the constrained optimum of \(f\) subject to \(g=0\) is equivalent to finding the unconstrained stationary point of \(\mathcal{L}(\mathbf{x}, \lambda)\). We've traded a constrained optimization over \(n\) variables for an unconstrained one over \(n+1\) variables. The multiplier \(\lambda\) is a new variable that we solve for alongside \(\mathbf{x}\).

Sensitivity Interpretation of λ

Consider parameterizing the constraint as \(g(\mathbf{x}) = c\) (a family of constraints). The optimal value \(f^*(c)\) changes as \(c\) changes. The Lagrange multiplier gives the rate of change:

Shadow Price / Sensitivity \[ \lambda^* = \frac{df^*(c)}{dc}\bigg|_{c=0} \]

If the constraint is \(g(\mathbf{x}) = \|\boldsymbol{\theta}\|^2 - t = 0\) (Ridge), then \(\lambda^*\) tells you how much the optimal loss changes per unit increase in the budget \(t\). In economics this is called the shadow price — the value of relaxing the constraint by one unit.
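The shadow-price interpretation is easy to check numerically. A minimal sketch in pure Python, using a toy problem chosen for illustration: minimize \(f(x) = (x-3)^2\) subject to \(g(x) = x - c = 0\), so \(f^*(c) = (c-3)^2\) and the multiplier from \(\nabla f = \lambda \nabla g\) is \(\lambda^* = 2(c-3)\) — which should match the finite-difference slope of \(f^*\):

```python
# Toy problem: minimize f(x) = (x - 3)^2 subject to g(x) = x - c = 0.
# The feasible set is the single point x = c, so f*(c) = (c - 3)^2.

def f_star(c):
    """Optimal value as a function of the constraint level c."""
    return (c - 3.0) ** 2

def multiplier(c):
    """lambda* from grad f = lambda * grad g:  2(x - 3) = lambda * 1 at x = c."""
    return 2.0 * (c - 3.0)

c, eps = 1.0, 1e-6
# Shadow price: lambda* should equal df*/dc (central finite difference).
fd_slope = (f_star(c + eps) - f_star(c - eps)) / (2 * eps)
print(multiplier(c), fd_slope)   # both ≈ -4.0
```

The negative multiplier says that tightening the constraint toward the unconstrained minimum at \(x = 3\) decreases the optimal value.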

§04

Solving Equality Constraints — Full Methodology

The 4-Step Method

// Lagrange Multiplier Method — Equality Constraints
Step 1: Write the Lagrangian
    L(x, λ) = f(x) − λ · g(x)

Step 2: Set all partial derivatives to zero
    ∂L/∂xᵢ = 0  for i=1,...,n   # n equations
    ∂L/∂λ = 0                  # 1 equation (recovers g(x)=0)

Step 3: Solve the resulting system of n+1 equations
    # Find all (x*, λ*) pairs that satisfy the system

Step 4: Evaluate f at candidates, pick the minimum
    # The Lagrangian conditions are necessary, not always sufficient
    # For convex f and affine g → conditions are sufficient

Worked Example — Minimum Distance on an Ellipse

Find the point on the ellipse \(\frac{x^2}{4} + y^2 = 1\) closest to the point \((3, 0)\).

Objective: minimize \(f(x,y) = (x-3)^2 + y^2\) (squared distance). Constraint: \(g(x,y) = \frac{x^2}{4} + y^2 - 1 = 0\).

// Full solution — closest point on ellipse to (3,0)
1
Lagrangian: \(\mathcal{L} = (x-3)^2 + y^2 - \lambda\left(\frac{x^2}{4} + y^2 - 1\right)\)
2
Stationarity conditions: \[\frac{\partial \mathcal{L}}{\partial x} = 2(x-3) - \frac{\lambda x}{2} = 0 \quad \Rightarrow \quad 2(x-3) = \frac{\lambda x}{2} \tag{i}\] \[\frac{\partial \mathcal{L}}{\partial y} = 2y - 2\lambda y = 0 \quad \Rightarrow \quad 2y(1-\lambda) = 0 \tag{ii}\] \[\frac{\partial \mathcal{L}}{\partial \lambda} = -\left(\frac{x^2}{4}+y^2-1\right) = 0 \tag{iii}\]
3
From (ii): either \(y = 0\) or \(\lambda = 1\).
Case A: y = 0. From (iii): x² /4 = 1 → x = ±2. Points: (2,0) and (−2,0). Squared distances: 1 and 25.
Case B: λ = 1. From (i): 2(x−3) = x/2 → 4x − 12 = x → x = 4. But (iii): 16/4 + y² = 1 → y² = −3 < 0. No real solution.
4
The constrained minimum is at \(\mathbf{x}^* = (2, 0)\) with distance \(\sqrt{1} = 1\). \(\lambda^* = \frac{2(2-3)}{2/2} = -2\).
Note: (−2, 0) is the constrained maximum (farthest point on the ellipse from (3,0)).
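The worked solution can be sanity-checked by brute force: parametrize the ellipse as \((2\cos t, \sin t)\) and scan the squared distance to \((3,0)\). A minimal pure-Python sketch:

```python
import math

def sq_dist(t):
    """Squared distance from the ellipse point (2 cos t, sin t) to (3, 0)."""
    x, y = 2 * math.cos(t), math.sin(t)
    return (x - 3) ** 2 + y ** 2

# Dense scan over the parameter t in [0, 2*pi).
ts = [2 * math.pi * k / 20000 for k in range(20000)]
t_best = min(ts, key=sq_dist)
x_best, y_best = 2 * math.cos(t_best), math.sin(t_best)
print(round(x_best, 4), round(y_best, 4), round(sq_dist(t_best), 4))
# → 2.0 0.0 1.0  (the KKT candidate (2, 0), squared distance 1)
```

The scan lands on the same point the stationarity system produced — a cheap way to catch algebra mistakes in Step 3.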

Common Pitfalls

  • Lagrangian conditions are necessary, not sufficient. Every constrained min satisfies them, but not every solution is a min. Always compare function values at all critical points.
  • The Lagrangian has saddle points. \(\mathcal{L}\) is minimized over \(\mathbf{x}\) and maximized over \(\lambda\) at the solution — not a minimum in all variables simultaneously.
  • Degenerate constraints (LICQ failure). If \(\nabla g(\mathbf{x}^*) = \mathbf{0}\), the method may fail — the multiplier may not exist. Check regularity conditions.
  • Multiplying through without checking signs. For equality constraints \(\lambda\) is unconstrained. For inequality constraints, sign matters — covered in §07.
§05

Multiple Equality Constraints

With \(m\) equality constraints \(g_1(\mathbf{x}) = 0, \ldots, g_m(\mathbf{x}) = 0\), introduce one multiplier per constraint:

Multi-Constraint Lagrangian \[ \mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) - \sum_{i=1}^{m} \lambda_i g_i(\mathbf{x}) \] \[ \nabla_{\mathbf{x}} \mathcal{L} = \nabla f(\mathbf{x}) - \sum_{i=1}^{m}\lambda_i \nabla g_i(\mathbf{x}) = \mathbf{0} \]

The optimality condition says: at the constrained minimum, \(\nabla f\) must lie in the span of the constraint gradients \(\{\nabla g_1, \ldots, \nabla g_m\}\). Each multiplier \(\lambda_i\) is the coefficient for the \(i\)-th constraint gradient.

Geometric Picture

Each constraint \(g_i(\mathbf{x}) = 0\) defines a surface. The intersection of \(m\) surfaces is a submanifold of dimension \(n - m\) (assuming the constraints are independent). We minimize \(f\) on this lower-dimensional object.

The tangent space of the feasible submanifold at \(\mathbf{x}^*\) is orthogonal to all \(\nabla g_i\). Optimality requires \(\nabla f\) to have no component in this tangent space — i.e., \(\nabla f\) is in the span of \(\{\nabla g_i\}\).

Linear Independence Constraint Qualification (LICQ)

The Lagrange multiplier method requires that the active constraint gradients are linearly independent at \(\mathbf{x}^*\):

LICQ \[ \{\nabla g_1(\mathbf{x}^*), \ldots, \nabla g_m(\mathbf{x}^*)\} \text{ are linearly independent} \]

If LICQ fails (e.g., two constraints become tangent to each other at \(\mathbf{x}^*\)), the multipliers \(\boldsymbol{\lambda}\) may not exist or may not be unique. LICQ is the standard constraint qualification (CQ) that ensures the method works.
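LICQ is also easy to probe numerically: stack the active constraint gradients and test for linear independence. A minimal 2-D sketch in pure Python, with toy constraints chosen for illustration — two circles that cross transversally in one configuration and touch tangentially in another, where LICQ fails:

```python
def licq_ok_2d(grad_g1, grad_g2, tol=1e-9):
    """Two gradients in R^2 are independent iff their 2x2 determinant is nonzero."""
    (a, b), (c, d) = grad_g1, grad_g2
    return abs(a * d - b * c) > tol

# g1(x, y) = x^2 + y^2 - 1 (unit circle), gradient (2x, 2y).
# Transversal case: g2 = y - x, gradient (-1, 1); at (1/sqrt2, 1/sqrt2) they cross.
p = (0.5 ** 0.5, 0.5 ** 0.5)
print(licq_ok_2d((2 * p[0], 2 * p[1]), (-1.0, 1.0)))   # True — LICQ holds

# Tangency case: g2(x, y) = (x - 2)^2 + y^2 - 1; the circles touch at (1, 0),
# where grad g1 = (2, 0) and grad g2 = (-2, 0) are parallel -> LICQ fails.
print(licq_ok_2d((2.0, 0.0), (-2.0, 0.0)))             # False
```

In higher dimensions the same check becomes a rank test on the Jacobian of the active constraints.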

§06

Inequality Constraints — Extending the Framework

Equality constraints force \(\mathbf{x}\) to lie exactly on a surface. Inequality constraints \(h(\mathbf{x}) \leq 0\) allow \(\mathbf{x}\) to lie anywhere in a half-space — in the interior or on the boundary.

Two Cases at the Optimum

For a single inequality \(h(\mathbf{x}) \leq 0\):

Fig 2. Two cases for inequality constraints. Left: unconstrained min is feasible — constraint inactive, λ*=0. Right: unconstrained min violates constraint — optimal is pushed to the boundary, λ*>0.

The Sign of λ for Inequality Constraints

For equality constraints, \(\lambda\) can be any real number — \(\nabla f = \lambda\nabla g\) and \(\lambda\) could be positive or negative depending on whether the constraint pushes toward or away from the unconstrained min.

For inequality constraints \(h(\mathbf{x}) \leq 0\), when the constraint is active (\(h(\mathbf{x}^*) = 0\)), the constraint is holding the optimum on the boundary. The descent direction \(-\nabla f\) must point out of the feasible region (otherwise we could decrease \(f\) by moving into the interior), so \(\nabla f\) points into the feasible region, while \(\nabla h\) points out of it by definition. Hence \(\nabla f = -\mu \nabla h\) with \(\mu \geq 0\).

Sign Constraint for Inequality Multipliers \[ h(\mathbf{x}) \leq 0 \quad \Rightarrow \quad \mu \geq 0 \quad \text{(non-negative multiplier)} \]

This is the crucial distinction between equality and inequality multipliers. Violating this sign condition is a common error.

§07

KKT Conditions — Complete Treatment

The Karush-Kuhn-Tucker (KKT) conditions generalize Lagrange multipliers to problems with both equality and inequality constraints. They are the fundamental first-order necessary conditions for constrained optimization.

General Problem \[ \min_{\mathbf{x}} f(\mathbf{x}) \quad \text{s.t.} \quad g_i(\mathbf{x}) = 0 \; (i=1,\ldots,m), \quad h_j(\mathbf{x}) \leq 0 \; (j=1,\ldots,p) \]
KKT Lagrangian \[ \mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}) = f(\mathbf{x}) - \sum_{i=1}^{m}\lambda_i g_i(\mathbf{x}) + \sum_{j=1}^{p}\mu_j h_j(\mathbf{x}) \] With \(h_j \leq 0\) and \(\mu_j \geq 0\), the inequality terms can only decrease \(\mathcal{L}\) — this sign choice is what makes the Lagrangian a lower bound on \(f\) over the feasible set.

The KKT Conditions

Condition 1 — Stationarity
\[\nabla_{\mathbf{x}} \mathcal{L} = \nabla f - \sum_i\lambda_i\nabla g_i + \sum_j\mu_j\nabla h_j = \mathbf{0}\]

At the optimum, the gradient of f is in the span of constraint gradients. The "force" from the objective is balanced by constraint "reaction forces."

Condition 2 — Primal Feasibility
\[g_i(\mathbf{x}^*) = 0 \quad \forall i\] \[h_j(\mathbf{x}^*) \leq 0 \quad \forall j\]

The solution must satisfy all original constraints — both equality and inequality. Non-negotiable.

Condition 3 — Dual Feasibility
\[\mu_j \geq 0 \quad \forall j\]

Inequality multipliers must be non-negative. Equality multipliers λᵢ are unconstrained in sign. This encodes the directionality of inequality constraints.

Condition 4 — Complementary Slackness
\[\mu_j \cdot h_j(\mathbf{x}^*) = 0 \quad \forall j\]

For each inequality: either the constraint is active (h=0, and μ ≥ 0 may be positive) OR the multiplier is zero (μ=0, constraint is slack). What can never happen is a slack constraint (h < 0) carrying a positive multiplier. This is the heart of KKT.

★ KKT as Necessary Conditions

Any local minimum of \(f\) subject to the constraints must satisfy the KKT conditions (under a constraint qualification like LICQ). The KKT conditions are necessary. For convex \(f\), convex \(h_j\), and affine \(g_i\), they are also sufficient — every KKT point is the global minimum. This sufficiency for convex problems is what makes KKT so powerful in ML.

Summary Table

KKT Condition Equations Count Encodes
Stationarity \(\nabla f = \sum\lambda_i\nabla g_i - \sum\mu_j\nabla h_j\) \(n\) Objective-constraint gradient balance
Primal feasibility \(g_i = 0\), \(h_j \leq 0\) \(m + p\) All constraints satisfied
Dual feasibility \(\mu_j \geq 0\) \(p\) Inequality multiplier signs
Complementary slackness \(\mu_j h_j = 0\) \(p\) Active vs inactive inequality constraints

Total: \(n + m + p\) equations (stationarity, equality feasibility, complementary slackness) plus \(2p\) inequalities (\(h_j \leq 0\), \(\mu_j \geq 0\)) for \(n + m + p\) unknowns (\(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}\)). The equation system is generically well-determined.
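For small problems the KKT system can be solved by enumerating which inequality constraints are active. A minimal sketch in pure Python, with a toy problem chosen for illustration: minimize \((x-2)^2\) subject to \(x \leq 1\), i.e. \(h(x) = x - 1 \leq 0\), using the Lagrangian \(\mathcal{L} = (x-2)^2 + \mu(x-1)\):

```python
def solve_kkt():
    """Enumerate complementary-slackness cases for min (x-2)^2 s.t. x - 1 <= 0."""
    candidates = []

    # Case 1: constraint inactive (mu = 0). Stationarity: 2(x - 2) = 0 -> x = 2.
    x, mu = 2.0, 0.0
    if x - 1 <= 0:                      # primal feasibility (fails here: 2 - 1 > 0)
        candidates.append((x, mu))

    # Case 2: constraint active (h = 0 -> x = 1).
    # Stationarity: 2(x - 2) + mu = 0 -> mu = -2(1 - 2) = 2.
    x = 1.0
    mu = -2.0 * (x - 2.0)
    if mu >= 0:                         # dual feasibility (holds: mu = 2)
        candidates.append((x, mu))

    return candidates

print(solve_kkt())   # [(1.0, 2.0)] — the boundary point, with shadow price 2
```

With \(p\) inequality constraints this enumeration has \(2^p\) cases — fine by hand for tiny problems, which is exactly why general solvers use the numerical methods of §15 instead.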

§08

Complementary Slackness — Deep Dive

Complementary slackness is the most subtle and powerful of the KKT conditions. The product \(\mu_j h_j(\mathbf{x}^*) = 0\) seems simple — but its implications are far-reaching.

What It Really Means

The condition \(\mu_j \cdot h_j(\mathbf{x}^*) = 0\) says exactly one of two things must be true for each inequality constraint:

  • Case A — inactive constraint: \(h_j(\mathbf{x}^*) < 0 \;\Rightarrow\; \mu_j = 0\). The constraint is slack — it doesn't influence the optimal solution.
  • Case B — active constraint: \(h_j(\mathbf{x}^*) = 0 \;\Rightarrow\; \mu_j \geq 0\). The constraint is binding — it actively shapes the optimal solution.
⚠ The Third Possibility is Impossible

What we can never have: \(\mu_j > 0\) AND \(h_j(\mathbf{x}^*) < 0\) simultaneously. If the constraint is not active, its multiplier must be zero — the constraint is not influencing the solution, so it exerts no "force." Having a nonzero multiplier on a slack constraint would contradict optimality.

Connection to Support Vectors

In SVM (§11), the inequality constraints are \(h_i(\mathbf{x}) = 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b) \leq 0\). Complementary slackness says: \(\mu_i(1 - y_i(\mathbf{w}^T\mathbf{x}_i+b)) = 0\).

  • For most training points: \(y_i(\mathbf{w}^T\mathbf{x}_i+b) > 1\) (well inside the margin) → \(\mu_i = 0\). These points don't influence the solution at all.
  • For support vectors: \(y_i(\mathbf{w}^T\mathbf{x}_i+b) = 1\) (exactly on the margin) → \(\mu_i \geq 0\). These points are what define the solution.

This is why SVM is a sparse classifier: the optimal boundary is determined entirely by a small subset of training points — the support vectors. All other points are irrelevant. Complementary slackness makes this mathematically precise.
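Complementary slackness can be seen concretely on a hand-solvable 1-D dataset. A minimal sketch in pure Python — the three data points and the known max-margin solution \(w = 1, b = 0\) are illustrative assumptions, not output of a solver:

```python
# Toy 1-D dataset: the max-margin separator is w = 1, b = 0.
X = [-1.0, 1.0, 3.0]
Y = [-1, 1, 1]
w, b = 1.0, 0.0

margins = [y * (w * x + b) for x, y in zip(X, Y)]   # [1.0, 1.0, 3.0]

# Complementary slackness: alpha_i * (margin_i - 1) = 0, so the point with
# margin 3 > 1 must have alpha_3 = 0. Solving the remaining KKT conditions
#   sum_i alpha_i y_i = 0     and     w = sum_i alpha_i y_i x_i = 1
# for the two margin points gives alpha_1 = alpha_2 = 0.5.
alphas = [0.5, 0.5, 0.0]

assert all(a * (m - 1) == 0 for a, m in zip(alphas, margins))       # slackness
assert sum(a * y for a, y in zip(alphas, Y)) == 0                   # sum a_i y_i = 0
assert sum(a * y * x for a, y, x in zip(alphas, Y, X)) == w         # recovers w
print("support vectors:", [x for x, a in zip(X, alphas) if a > 0])  # [-1.0, 1.0]
```

The point at \(x = 3\) could be moved anywhere further from the boundary without changing the solution — its multiplier is pinned to zero.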

§09

Lagrangian Duality and the Dual Problem

Duality transforms the original (primal) optimization problem into a different (dual) problem that is always convex — regardless of the primal's convexity. The dual provides a lower bound on the primal, and for convex problems, the bounds are tight.

The Dual Function

Given the Lagrangian \(\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu})\), the dual function is formed by minimizing over \(\mathbf{x}\):

Dual Function \[ d(\boldsymbol{\lambda}, \boldsymbol{\mu}) = \inf_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}) = \inf_{\mathbf{x}}\left[f(\mathbf{x}) - \sum_i\lambda_i g_i(\mathbf{x}) + \sum_j\mu_j h_j(\mathbf{x})\right] \]

The dual function is always concave in \((\boldsymbol{\lambda}, \boldsymbol{\mu})\) — regardless of whether \(f\), \(g_i\), or \(h_j\) are convex. This is because the pointwise infimum of affine functions (which is what \(d\) is, as a function of the multipliers) is always concave.

Weak Duality — The Lower Bound Property

For any feasible \(\mathbf{x}\) and any dual-feasible \((\boldsymbol{\lambda}, \boldsymbol{\mu})\) (with \(\mu_j \geq 0\)):

Weak Duality \[ d(\boldsymbol{\lambda}, \boldsymbol{\mu}) \leq f(\mathbf{x}^*) \quad \text{(dual value ≤ primal optimum)} \]
// Proof of weak duality
1
For any feasible \(\mathbf{x}\): \(g_i(\mathbf{x}) = 0\) and \(h_j(\mathbf{x}) \leq 0\). With \(\mu_j \geq 0\): \[\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}) = f(\mathbf{x}) - \underbrace{\sum_i\lambda_i g_i(\mathbf{x})}_{=\,0} + \underbrace{\sum_j\mu_j h_j(\mathbf{x})}_{\leq\,0} \leq f(\mathbf{x})\]
The equality-constraint terms vanish; each inequality term \(\mu_j h_j\) is ≤ 0, so adding them can only decrease the value.
2
Therefore: \(d(\boldsymbol{\lambda}, \boldsymbol{\mu}) = \inf_{\mathbf{x}}\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}) \leq \mathcal{L}(\mathbf{x}^*, \boldsymbol{\lambda}, \boldsymbol{\mu}) \leq f(\mathbf{x}^*) \qquad \blacksquare\)

The Dual Problem

To get the tightest lower bound, maximize \(d\) over dual-feasible multipliers:

The Dual Problem \[ \max_{\boldsymbol{\lambda}, \boldsymbol{\mu}} \; d(\boldsymbol{\lambda}, \boldsymbol{\mu}) \quad \text{subject to} \quad \mu_j \geq 0 \; \forall j \]

The dual problem is always convex (maximizing a concave function), even when the primal is non-convex. This is a remarkable property — it allows efficient solving of the dual even when the primal is hard.

Let \(p^*\) = primal optimum, \(d^*\) = dual optimum. Weak duality guarantees \(d^* \leq p^*\). The difference \(p^* - d^*\) is the duality gap.
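Weak duality can be watched directly on a one-variable problem. A minimal sketch in pure Python, with a toy problem chosen for illustration: minimize \(x^2\) subject to \(x \geq 1\) (so \(h = 1 - x \leq 0\)); the primal optimum is \(p^* = 1\), and the dual function \(d(\mu) = \inf_x [x^2 + \mu(1-x)] = \mu - \mu^2/4\) stays below it, touching it at \(\mu^* = 2\):

```python
def dual(mu):
    """d(mu) = inf_x [x^2 + mu(1 - x)]; the inner minimum is at x = mu/2."""
    x = mu / 2.0
    return x * x + mu * (1.0 - x)      # simplifies to mu - mu^2/4

p_star = 1.0                           # primal optimum, attained at x = 1

mus = [0.01 * k for k in range(0, 801)]                # scan mu in [0, 8]
assert all(dual(mu) <= p_star + 1e-9 for mu in mus)    # weak duality holds

d_star = max(dual(mu) for mu in mus)
print(round(d_star, 6))                # 1.0 — zero duality gap here
```

The problem is convex with a strictly feasible point (e.g. \(x = 2\)), so Slater's condition from §10 predicts exactly this zero gap.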

§10

Strong Duality and Constraint Qualification

Strong Duality \[ d^* = p^* \quad \text{(zero duality gap)} \]

Strong duality does not always hold. It requires a constraint qualification — a regularity condition on the problem structure.

Slater's Condition (Most Common CQ)

For a convex problem (convex \(f\) and \(h_j\), affine \(g_i\)), strong duality holds if there exists a strictly feasible point — a point in the interior of the feasible set:

Slater's Condition \[ \exists\, \tilde{\mathbf{x}} \text{ such that } g_i(\tilde{\mathbf{x}}) = 0 \; \forall i \text{ and } h_j(\tilde{\mathbf{x}}) < 0 \; \forall j \]
★ Why Strong Duality Matters for ML

The SVM optimization problem is convex and satisfies Slater's condition (there always exists a margin-satisfying hyperplane if the data is separable). Therefore, strong duality holds — we can solve the dual SVM problem instead of the primal. The dual has a beautiful structure that enables the kernel trick, non-linear classification, and sparse solutions via complementary slackness.

Conditions for Strong Duality Summary

Problem Type Condition for Strong Duality Result
Linear programming (LP) Primal and dual both feasible \(d^* = p^*\) always
Convex QP (e.g., SVM) Slater's condition \(d^* = p^*\)
General convex Slater's or other CQ \(d^* = p^*\)
Non-convex (e.g., deep networks) Rarely satisfied \(d^* \leq p^*\) (gap possible)
§11

Application — Hard-Margin SVM Derivation

SVM is the cleanest and most beautiful application of Lagrange multipliers in ML. We derive the full primal and dual.

The Primal SVM Problem

Given linearly separable training data \(\{(\mathbf{x}_i, y_i)\}_{i=1}^{m}\) with \(y_i \in \{-1, +1\}\), find the hyperplane \(\mathbf{w}^T\mathbf{x} + b = 0\) with maximum margin.

The margin width is \(\frac{2}{\|\mathbf{w}\|}\). Maximizing it is equivalent to minimizing \(\|\mathbf{w}\|^2\). The constraints ensure all points are classified correctly with margin \(\geq 1\):

Primal SVM \[ \min_{\mathbf{w}, b} \; \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1, \quad i = 1, \ldots, m \]

Rewrite constraints as \(h_i(\mathbf{w},b) = 1 - y_i(\mathbf{w}^T\mathbf{x}_i+b) \leq 0\).

The Lagrangian

SVM Lagrangian \[ \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{m}\alpha_i\left[y_i(\mathbf{w}^T\mathbf{x}_i+b) - 1\right], \quad \alpha_i \geq 0 \]

Deriving the Dual — Full Steps

// Eliminate w and b to get the dual
1
Stationarity w.r.t. \(\mathbf{w}\): \[\frac{\partial\mathcal{L}}{\partial\mathbf{w}} = \mathbf{w} - \sum_i\alpha_i y_i\mathbf{x}_i = \mathbf{0} \quad\Rightarrow\quad \boxed{\mathbf{w}^* = \sum_{i=1}^{m}\alpha_i y_i\mathbf{x}_i}\]
The optimal weight vector is a linear combination of training data — entirely determined by the αᵢ!
2
Stationarity w.r.t. \(b\): \[\frac{\partial\mathcal{L}}{\partial b} = -\sum_i\alpha_i y_i = 0 \quad\Rightarrow\quad \boxed{\sum_{i=1}^{m}\alpha_i y_i = 0}\]
3
Substitute \(\mathbf{w}^* = \sum_i\alpha_i y_i\mathbf{x}_i\) into \(\mathcal{L}\): \[d(\boldsymbol{\alpha}) = \inf_{\mathbf{w},b}\mathcal{L} = \frac{1}{2}\|\mathbf{w}^*\|^2 - \sum_i\alpha_i[y_i({\mathbf{w}^*}^T\mathbf{x}_i+b)-1]\]
4
Expand \(\|\mathbf{w}^*\|^2 = \sum_{i,j}\alpha_i\alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j\) and use \(\sum_i\alpha_i y_i = 0\) to eliminate \(b\): \[d(\boldsymbol{\alpha}) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j\]
This is the dual objective — notice it depends on data only through dot products xᵢᵀxⱼ!

By complementary slackness: \(\alpha_i(y_i(\mathbf{w}^T\mathbf{x}_i+b)-1) = 0\). Points with \(\alpha_i > 0\) are the support vectors — they lie exactly on the margin. All other points have \(\alpha_i = 0\) and don't contribute to \(\mathbf{w}^* = \sum\alpha_i y_i\mathbf{x}_i\).

SVM Dual Problem \[ \max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j \quad \text{s.t.} \quad \alpha_i \geq 0, \quad \sum_i\alpha_i y_i = 0 \]
★ Why the Dual is Revolutionary

The dual depends on data only through inner products \(\mathbf{x}_i^T\mathbf{x}_j\). Replace this with a kernel \(K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)\) — and you implicitly operate in a feature space of arbitrarily high (even infinite) dimension without ever computing \(\phi(\mathbf{x})\). This is the kernel trick, only possible because of the dual formulation. Lagrange multipliers enable nonlinear SVMs.
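The kernel trick is visible directly in code: the dual objective touches the data only through a pluggable kernel function. A minimal sketch in pure Python — the 1-D data and the hand-solved \(\boldsymbol{\alpha} = (0.5, 0.5)\) are illustrative assumptions carried over from a two-point toy problem:

```python
import math

def dual_objective(alpha, X, Y, kernel):
    """SVM dual: sum_i a_i - 0.5 * sum_ij a_i a_j y_i y_j K(x_i, x_j)."""
    m = len(X)
    quad = sum(alpha[i] * alpha[j] * Y[i] * Y[j] * kernel(X[i], X[j])
               for i in range(m) for j in range(m))
    return sum(alpha) - 0.5 * quad

linear = lambda a, b: a * b                    # plain dot product (1-D inputs)
rbf = lambda a, b: math.exp(-((a - b) ** 2))   # swap in a kernel: no phi(x) needed

X, Y = [-1.0, 1.0], [-1, 1]
alpha = [0.5, 0.5]                             # hand-solved optimum for the linear case
print(dual_objective(alpha, X, Y, linear))     # 0.5 (= 0.5*||w||^2 with w = 1)
print(dual_objective(alpha, X, Y, rbf))        # same data, nonlinear feature space
```

Changing `linear` to `rbf` re-poses the whole problem in an infinite-dimensional feature space while the code, and its cost, stay identical.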


§12

Application — Soft-Margin SVM and Dual

Hard-margin SVM requires linear separability. In practice, data has overlap and noise. Soft-margin SVM allows constraint violations via slack variables \(\xi_i \geq 0\).

Soft-Margin Primal \[ \min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{m}\xi_i \] \[ \text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i+b) \geq 1-\xi_i, \quad \xi_i \geq 0, \quad \forall i \]

The parameter \(C > 0\) trades off margin width against constraint violations. Large \(C\) → violations are heavily penalized, approaching hard-margin behavior. Small \(C\) → more violations tolerated, giving a wider, more flexible margin.

Introducing multipliers \(\alpha_i \geq 0\) for the margin constraints and \(\beta_i \geq 0\) for \(\xi_i \geq 0\), the dual becomes:

Soft-Margin Dual \[ \max_{\boldsymbol{\alpha}} \; \sum_i\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\mathbf{x}_i^T\mathbf{x}_j \quad \text{s.t.} \quad 0 \leq \alpha_i \leq C, \quad \sum_i\alpha_i y_i = 0 \]

The only change from the hard-margin dual: \(\alpha_i \in [0, C]\) instead of \(\alpha_i \geq 0\). The upper bound \(C\) comes from the slack variable stationarity condition \(\frac{\partial\mathcal{L}}{\partial\xi_i} = C - \alpha_i - \beta_i = 0\) with \(\beta_i \geq 0\), giving \(\alpha_i \leq C\).

💡 Three Types of Points

By complementary slackness: (1) \(\alpha_i = 0\) → point outside the margin, ignored. (2) \(0 < \alpha_i < C\) → point exactly on the margin, a support vector with \(\xi_i=0\). (3) \(\alpha_i=C\) → point inside or on the wrong side of the margin, with \(\xi_i> 0\). The structure of the solution is entirely determined by KKT and complementary slackness.

§13

Application — Ridge as Constrained Optimization

Ridge regression is usually written as a penalized problem: minimize \(\|\mathbf{X}\boldsymbol{\theta}-\mathbf{y}\|^2 + \lambda\|\boldsymbol{\theta}\|^2\). But it has an equivalent constrained formulation — and the Lagrangian connects them.

Primal Constrained Form

Ridge as Constrained Opt \[ \min_{\boldsymbol{\theta}} \; \|\mathbf{X}\boldsymbol{\theta}-\mathbf{y}\|^2 \quad \text{subject to} \quad \|\boldsymbol{\theta}\|^2 \leq t \]

The Lagrangian: \(\mathcal{L} = \|\mathbf{X}\boldsymbol{\theta}-\mathbf{y}\|^2 + \mu(\|\boldsymbol{\theta}\|^2 - t)\) with \(\mu \geq 0\). Setting \(\frac{\partial\mathcal{L}}{\partial\boldsymbol{\theta}} = 0\):

// Ridge closed-form from KKT
1
\(\frac{\partial\mathcal{L}}{\partial\boldsymbol{\theta}} = 2\mathbf{X}^T(\mathbf{X}\boldsymbol{\theta}-\mathbf{y}) + 2\mu\boldsymbol{\theta} = \mathbf{0}\)
2
\(\mathbf{X}^T\mathbf{X}\boldsymbol{\theta} + \mu\boldsymbol{\theta} = \mathbf{X}^T\mathbf{y}\) \[(\mathbf{X}^T\mathbf{X} + \mu\mathbf{I})\boldsymbol{\theta} = \mathbf{X}^T\mathbf{y}\]
3
\[\hat{\boldsymbol{\theta}}_{\text{ridge}} = (\mathbf{X}^T\mathbf{X} + \mu\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\]
The Lagrange multiplier μ plays exactly the role of the regularization parameter λ in the penalized form. They are equivalent by Lagrangian duality — every value of μ ≥ 0 corresponds to some budget t, and vice versa.
★ Equivalence of Penalized and Constrained Forms

The penalized Ridge problem (\(\min\|\mathbf{X}\boldsymbol{\theta}-\mathbf{y}\|^2 + \lambda\|\boldsymbol{\theta}\|^2\)) and the constrained Ridge problem (\(\min\|\mathbf{X}\boldsymbol{\theta}-\mathbf{y}\|^2\) s.t. \(\|\boldsymbol{\theta}\|^2 \leq t\)) give identical solutions. The Lagrange multiplier \(\mu^*\) from the constrained problem equals \(\lambda\) in the penalized problem. This equivalence holds in general: any Lagrange penalty \(\lambda g(\mathbf{x})\) corresponds to a budget constraint \(g(\mathbf{x}) \leq t(\lambda)\). This is why regularization and constrained optimization are two faces of the same coin.
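The KKT-derived closed form is a short computation. A minimal sketch in pure Python with a hand-sized 2×2 system (the design matrix, targets, and \(\mu\) are illustrative): solve \((\mathbf{X}^T\mathbf{X} + \mu\mathbf{I})\boldsymbol{\theta} = \mathbf{X}^T\mathbf{y}\) and read off the budget \(t = \|\boldsymbol{\theta}\|^2\) that this \(\mu\) corresponds to in the constrained form:

```python
# Tiny design matrix (3 samples, 2 features) and targets — illustrative data.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 2.0]
mu = 0.5

# Normal-equation pieces: A = X^T X + mu*I (2x2), b = X^T y.
A = [[sum(r[i] * r[j] for r in X) + (mu if i == j else 0.0) for j in range(2)]
     for i in range(2)]
b = [sum(r[i] * t for r, t in zip(X, y)) for i in range(2)]

# Solve the 2x2 system by Cramer's rule.
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
theta = [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
         (A[0][0] * b[1] - b[0] * A[1][0]) / det]

# The constrained-form budget this mu corresponds to: t = ||theta||^2.
t_budget = sum(th * th for th in theta)
print(theta, t_budget)
```

Sweeping \(\mu\) from 0 upward shrinks `t_budget` monotonically — tracing out exactly the \(\mu \leftrightarrow t\) correspondence described above.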

Lasso — The L1 Ball Constraint

Lasso: minimize \(\|\mathbf{X}\boldsymbol{\theta}-\mathbf{y}\|^2\) subject to \(\|\boldsymbol{\theta}\|_1 \leq t\). The L1 ball has corners at coordinate axes — see the geometric figure in the linear regression masterclass. The KKT conditions now involve the subdifferential of \(\|\boldsymbol{\theta}\|_1\):

\[ \frac{\partial\mathcal{L}}{\partial\theta_j} = 2[\mathbf{X}^T(\mathbf{X}\boldsymbol{\theta}-\mathbf{y})]_j + \mu\,\partial|\theta_j| \ni 0 \] \[ \text{where } \partial|\theta_j| = \begin{cases} \{+1\} & \theta_j > 0 \\ [-1,+1] & \theta_j = 0 \\ \{-1\} & \theta_j < 0 \end{cases} \] This is an inclusion rather than an equation because \(|\theta_j|\) is non-differentiable at 0.

For the zero case: if \(|2[\mathbf{X}^T(\mathbf{X}\boldsymbol{\theta}-\mathbf{y})]_j| \leq \mu\), then \(\theta_j^* = 0\) is a solution — there exists a subgradient that makes the KKT condition hold. This is the formal mechanism by which Lasso produces exact zeros.
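For an orthonormal design (\(\mathbf{X}^T\mathbf{X} = \mathbf{I}\), an assumption made here so a closed form exists), the subgradient condition can be solved coordinate-wise, giving the classic soft-thresholding rule. A minimal sketch in pure Python:

```python
def soft_threshold(b_j, mu):
    """Minimize (t - b_j)^2 + mu*|t| over t: the lasso coordinate update."""
    if b_j > mu / 2:
        return b_j - mu / 2
    if b_j < -mu / 2:
        return b_j + mu / 2
    return 0.0   # a subgradient in [-1, 1] satisfies KKT: |2 b_j| <= mu

def kkt_residual(t, b_j, mu):
    """Check 2(t - b_j) + mu*s = 0 for some subgradient s of |t| at t."""
    if t != 0.0:
        return 2 * (t - b_j) + mu * (1.0 if t > 0 else -1.0)
    # At t = 0 the condition is an inclusion: it needs |2 b_j| <= mu.
    return 0.0 if abs(2 * b_j) <= mu else abs(2 * b_j) - mu

mu = 1.0
for b in [-2.0, -0.4, 0.0, 0.3, 1.5]:
    t = soft_threshold(b, mu)
    assert abs(kkt_residual(t, b, mu)) < 1e-12
print([soft_threshold(b, mu) for b in [-2.0, -0.4, 0.0, 0.3, 1.5]])
# → [-1.5, 0.0, 0.0, 0.0, 1.0]  — small coefficients snap exactly to zero
```

The middle three inputs land in the dead zone \(|2b_j| \leq \mu\), which is precisely the zero-case condition stated above.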

§14

Application — Maximum Entropy and Softmax

The softmax function — used in every multiclass classifier and the last layer of your chest X-ray ViT — can be derived from maximum entropy with Lagrange multipliers.

The Maximum Entropy Problem

Among all probability distributions over \(K\) classes, find the one that maximizes entropy subject to known feature expectations. The simplest version: maximize entropy subject only to the normalization constraint.

Max Entropy Problem \[ \max_{\mathbf{p}} \; H(\mathbf{p}) = -\sum_{k=1}^{K}p_k\log p_k \quad \text{subject to} \quad \sum_{k=1}^{K}p_k = 1, \quad p_k \geq 0 \]

Deriving Softmax via Lagrange Multipliers

Now add linear constraints on feature expectations: \(\sum_k p_k f_j(k) = \mu_j\) for \(j = 1, \ldots, n\). With \(f_j(k) = z_{jk}\) (the logits), the Lagrangian is:

// Max entropy with constraints → softmax derivation
1
\(\mathcal{L} = -\sum_k p_k\log p_k - \lambda_0\left(\sum_k p_k - 1\right) - \sum_j\lambda_j\left(\sum_k p_k z_{jk} - \mu_j\right)\)
2
Stationarity w.r.t. \(p_k\): \[\frac{\partial\mathcal{L}}{\partial p_k} = -\log p_k - 1 - \lambda_0 - \sum_j\lambda_j z_{jk} = 0\]
3
\(\log p_k = -1 - \lambda_0 - \sum_j\lambda_j z_{jk}\) \[\Rightarrow p_k = e^{-1-\lambda_0} \cdot e^{-\sum_j\lambda_j z_{jk}} = C \cdot e^{w_k}\]
where w_k = −∑_j λ_j z_{jk} absorbs the feature-constraint multipliers into a score for class k.
4
Applying \(\sum_k p_k = 1\): \(C = \frac{1}{\sum_{k'}e^{w_{k'}}}\). Therefore: \[\boxed{p_k^* = \frac{e^{w_k}}{\sum_{k'}e^{w_{k'}}} = \text{softmax}(\mathbf{w})_k}\]
The softmax is not an arbitrary design choice — it is the maximum entropy distribution subject to linear constraints on the expected features. This is its deep probabilistic justification.
★ Softmax = Maximum Entropy Classifier

Every time you use a softmax output layer, you're implicitly assuming the model should be the maximum entropy distribution consistent with the learned linear constraints. The Lagrange multipliers are the softmax weights \(\mathbf{w}\). This gives softmax its information-theoretic justification — it makes the fewest additional assumptions beyond what the data specifies. Lagrange multipliers are hidden inside every neural network classifier you've ever trained.
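The derivation above can be sanity-checked numerically. A small pure-Python sketch (arbitrary toy scores \(\mathbf{w}\)): any perturbation of \(\text{softmax}(\mathbf{w})\) that preserves both the normalization and the expected-score constraint strictly decreases the entropy, confirming that softmax is the maximum-entropy point on the constraint set.

```python
import math

def softmax(w):
    m = max(w)                                  # subtract max for stability
    e = [math.exp(wk - m) for wk in w]
    s = sum(e)
    return [ek / s for ek in e]

def entropy(p):
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

w = [1.0, 0.2, -0.5]                            # arbitrary scores
p_star = softmax(w)

# Feasible perturbation direction: sum(d) = 0 and sum(d_k * w_k) = 0,
# so p_star + eps*d still satisfies both linear constraints.
# For K = 3 the orthogonal complement of span{1, w} is one-dimensional:
d = [w[1] - w[2], w[2] - w[0], w[0] - w[1]]     # orthogonal to both 1 and w

for eps in (0.01, -0.01, 0.05):
    p = [pk + eps * dk for pk, dk in zip(p_star, d)]
    assert all(pk > 0 for pk in p)              # still a valid distribution
    assert entropy(p) < entropy(p_star)         # strictly lower entropy
print("softmax(w) maximizes entropy on the constraint set")
```

Strict concavity of \(H\) guarantees the inequality is strict for every nonzero feasible perturbation, which is what the assertions verify.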

§15

Numerical Methods — ALM and Penalty Methods

Analytical KKT solutions exist only for structured problems. For general constrained optimization in ML, we need numerical methods that can handle arbitrary constraints.

Penalty Method — Constraint as Soft Penalty

Approximate the constraint by adding a penalty term that blows up when the constraint is violated:

Penalty Method \[ \min_{\mathbf{x}} \; f(\mathbf{x}) + \frac{\rho}{2}\sum_i g_i(\mathbf{x})^2 + \frac{\rho}{2}\sum_j\max(0, h_j(\mathbf{x}))^2 \]

For large \(\rho\): constraint violations are heavily penalized → solution approaches the feasible region. But very large \(\rho\) makes the problem ill-conditioned and hard to optimize numerically.
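The trade-off is easy to see on a one-variable toy problem (\(\min (x-2)^2\) s.t. \(x = 1\), chosen here purely for illustration). The penalized minimizer has a closed form: setting the derivative \(2(x-2) + \rho(x-1) = 0\) gives \(x = (4+\rho)/(2+\rho)\), which approaches the feasible point only as \(\rho \to \infty\), while the curvature \(2 + \rho\) grows without bound — the ill-conditioning in miniature.

```python
# Toy problem: min (x-2)^2  s.t.  g(x) = x - 1 = 0   (true solution x* = 1).
# Penalized objective: (x-2)^2 + (rho/2)(x-1)^2.
# Setting 2(x-2) + rho*(x-1) = 0 gives the closed-form minimizer:

def penalty_minimizer(rho):
    return (4 + rho) / (2 + rho)

for rho in (1, 10, 100, 10000):
    x = penalty_minimizer(rho)
    print(f"rho={rho:>6}: x={x:.6f}, violation={abs(x - 1):.2e}")
# The violation shrinks like O(1/rho): feasibility requires rho → ∞,
# while the curvature 2 + rho of the penalized objective blows up with it.
```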

Augmented Lagrangian Method (ALM / ADMM)

Combines the Lagrangian (via dual variables) with the quadratic penalty. The augmented Lagrangian:

Augmented Lagrangian \[ \mathcal{L}_{\rho}(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) - \sum_i\lambda_i g_i(\mathbf{x}) + \frac{\rho}{2}\sum_i g_i(\mathbf{x})^2 \]

ALM iterates between minimizing \(\mathcal{L}_\rho\) over \(\mathbf{x}\) (primal update) and updating \(\lambda_i \leftarrow \lambda_i - \rho g_i(\mathbf{x})\) (dual ascent). The dual update drives the Lagrange multipliers toward their true values, while the penalty term maintains feasibility. ALM converges for fixed \(\rho\) — unlike the penalty method which requires \(\rho \to \infty\).
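On the same toy problem (\(\min (x-2)^2\) s.t. \(x - 1 = 0\), with true solution \(x^* = 1\) and, from \(\nabla f = \lambda\nabla g\), multiplier \(\lambda^* = -2\)), a minimal sketch shows ALM converging at fixed \(\rho\):

```python
# Augmented Lagrangian for the toy problem:
#   L_rho(x, lam) = (x-2)^2 - lam*(x-1) + (rho/2)(x-1)^2
# Primal update has a closed form: 2(x-2) - lam + rho*(x-1) = 0.

rho, lam = 10.0, 0.0
for it in range(20):
    x = (4 + lam + rho) / (2 + rho)   # primal update: argmin_x L_rho(x, lam)
    lam = lam - rho * (x - 1)         # dual update: lam <- lam - rho * g(x)

print(f"x = {x:.6f}, lambda = {lam:.6f}")
# Converges to x* = 1 and lambda* = -2 at FIXED rho = 10 —
# no rho → ∞ schedule needed, unlike the pure penalty method.
```

Each dual step shrinks the error \(\lambda - \lambda^*\) by the factor \(2/(2+\rho)\) here, so convergence is linear at any fixed \(\rho > 0\).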

ADMM (Alternating Direction Method of Multipliers)

ADMM splits the variable \(\mathbf{x}\) into blocks and minimizes alternately — popular for distributed optimization of large-scale ML problems. The update rules:

// ADMM for min f(x) + g(z)  s.t.  Ax + Bz = c
Initialize: x, z, λ (dual variable)

repeat until convergence:
  # x-update (minimize augmented Lagrangian over x)
  x ← argmin_x [f(x) + (ρ/2)‖Ax + Bz − c + λ/ρ‖²]

  # z-update (minimize over z)
  z ← argmin_z [g(z) + (ρ/2)‖Ax + Bz − c + λ/ρ‖²]

  # Dual variable update (gradient ascent on dual)
  λ ← λ + ρ·(Ax + Bz − c)

ADMM is used to solve Lasso, SVM, and distributed training of deep networks across multiple machines — each machine holds a local copy of the parameters, and the consensus constraint between copies is enforced via the dual variable \(\boldsymbol{\lambda}\).
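The update rules above can be sketched on a tiny consensus problem (toy objective chosen for illustration; with \(A = 1\), \(B = -1\), \(c = 0\), both subproblems have closed-form minimizers):

```python
# Minimal ADMM sketch on a consensus problem:
#   min (x-1)^2 + (z-3)^2   s.t.  x - z = 0    (A=1, B=-1, c=0)
# Optimum: x = z = 2 (midpoint of the two unconstrained minimizers).

rho = 1.0
x, z, lam = 0.0, 0.0, 0.0
for it in range(100):
    u = lam / rho                          # scaled dual variable
    x = (2 + rho * (z - u)) / (2 + rho)    # argmin_x (x-1)^2 + (rho/2)(x - z + u)^2
    z = (6 + rho * (x + u)) / (2 + rho)    # argmin_z (z-3)^2 + (rho/2)(x - z + u)^2
    lam = lam + rho * (x - z)              # dual ascent on the consensus constraint

print(f"x = {x:.4f}, z = {z:.4f}")         # both approach the consensus value 2
```

The two blocks never need to be minimized jointly — each sees the other only through the residual \(x - z\) and the dual variable, which is exactly what makes ADMM distributable.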

§16

The Full Mental Model

[Fig. 3 diagram — panels: Constrained Problem (min f(x) s.t. g = 0, h ≤ 0) · Geometry (∇f = λ∇g at the optimum) · Lagrangian (L = f − λg − μh, μ ≥ 0 for inequalities) · Sensitivity (λ* = ∂f*/∂c, the shadow price) · KKT Conditions (stationarity, primal feasibility, dual feasibility, complementary slackness) · Duality (d(λ,μ) = inf_x L; weak d* ≤ p*, strong d* = p* under Slater) · ML Applications (SVM dual and kernel trick, Ridge/Lasso as constrained problems, softmax from max entropy) · Unified View]
Fig 3. Complete map of Lagrange multipliers — from geometric insight through KKT conditions, duality theory, and ML applications.

The complete story in one coherent thread:

  1. Geometric origin: At the constrained optimum, \(\nabla f\) and \(\nabla g\) must be parallel — otherwise we could move along the constraint to decrease \(f\). The Lagrange multiplier \(\lambda\) is the proportionality constant: \(\nabla f = \lambda\nabla g\).
  2. The Lagrangian: \(\mathcal{L} = f - \sum\lambda_i g_i - \sum\mu_j h_j\) encodes all conditions. Setting \(\nabla\mathcal{L} = \mathbf{0}\) recovers both the parallelism condition (stationarity w.r.t. \(\mathbf{x}\)) and the constraints (stationarity w.r.t. multipliers).
  3. Inequality constraints: Multipliers \(\mu_j \geq 0\) (non-negative). The sign constraint encodes the direction in which the inequality pushes the solution.
  4. Complementary slackness: \(\mu_j h_j = 0\). Either the constraint is active (\(h_j = 0\)) or the multiplier is zero. Inactive constraints are invisible to the solution. This is why SVMs have sparse support vectors.
  5. KKT conditions: Stationarity + primal feasibility + dual feasibility + complementary slackness. Necessary for any local minimum under LICQ. Sufficient for convex problems.
  6. Duality: The dual function \(d(\boldsymbol{\lambda},\boldsymbol{\mu}) = \inf_{\mathbf{x}}\mathcal{L}\) is always concave. Weak duality: \(d^* \leq p^*\). Strong duality (\(d^* = p^*\)) holds under Slater's condition for convex problems — enabling us to solve the dual instead of the primal.
  7. SVM: Primal \(\min\frac{1}{2}\|\mathbf{w}\|^2\) subject to the margin constraints. Dual depends on data only through \(\mathbf{x}_i^T\mathbf{x}_j\) → kernel trick → nonlinear classification in infinite-dimensional spaces. Support vectors = points with \(\alpha_i > 0\) by complementary slackness.
  8. Regularization: Ridge/Lasso as penalized problems are equivalent to constrained problems via the Lagrangian. The penalty parameter \(\lambda\) is the Lagrange multiplier for the budget constraint.
  9. Softmax: The maximum entropy distribution subject to linear constraints is exactly softmax — Lagrange multipliers are the softmax weights. Every classification network output is a constrained entropy maximizer.
★ The One Unifying Idea

Lagrange multipliers are the language of constrained trade-offs. Every multiplier measures the cost of tightening a constraint — the shadow price. Every KKT condition is an equilibrium between the objective's "pull" and the constraint's "resistance." This language appears everywhere: the regularization parameter in Ridge is a Lagrange multiplier. The support vectors in SVM are defined by complementary slackness. The softmax weights are Lagrange multipliers for entropy maximization. Half of all optimization problems in ML are constrained problems in disguise — and Lagrange multipliers are how you solve them.