Constrained Optimization — The Setup
In unconstrained optimization, we minimize \(f(\mathbf{x})\) over all \(\mathbf{x} \in \mathbb{R}^n\). The solution is wherever \(\nabla f = \mathbf{0}\). Constrained optimization forces the solution to lie on a specific set — a surface, curve, or region — defined by constraints, which is how most real problems arise.
The set of all \(\mathbf{x}\) satisfying all constraints is the feasible set \(\mathcal{F}\). We want the minimum of \(f\) restricted to \(\mathcal{F}\).
Why Can't We Just Ignore Constraints?
The unconstrained minimum \(\mathbf{x}^* = \arg\min f\) might not be in \(\mathcal{F}\). Three scenarios:
- Unconstrained min is feasible: \(\mathbf{x}^* \in \mathcal{F}\) → constraints are inactive, same answer. Lagrange multipliers will give \(\lambda^* = 0\).
- Unconstrained min is infeasible: The optimal feasible point lies on the constraint boundary — we are "pushed" there by the constraint. This is the interesting case.
- No feasible point exists: Problem is infeasible → no solution.
Examples from Your Background
| Problem | Objective \(f\) | Constraint |
|---|---|---|
| SVM (hard margin) | Minimize \(\frac{1}{2}\|\mathbf{w}\|^2\) | \(y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1\) for all \(i\) |
| Ridge regression | Minimize \(\|\mathbf{X}\boldsymbol{\theta}-\mathbf{y}\|^2\) | \(\|\boldsymbol{\theta}\|^2 \leq t\) |
| Lasso regression | Minimize \(\|\mathbf{X}\boldsymbol{\theta}-\mathbf{y}\|^2\) | \(\|\boldsymbol{\theta}\|_1 \leq t\) |
| Maximum entropy | Maximize \(-\sum_k p_k \log p_k\) | \(\sum_k p_k = 1\), \(p_k \geq 0\) |
| Portfolio optimization | Minimize portfolio variance | Budget sums to 1, return \(\geq r_{\min}\) |
Lagrange multipliers convert a constrained optimization problem into an unconstrained one by absorbing the constraints into the objective via multiplier terms. The price: we now optimize over both the original variables and the multipliers. But the resulting system — the stationarity conditions — can be solved systematically.
Geometric Intuition — Why Gradients Must Be Parallel
Before any formulas: why does the method work? The entire theory rests on a single geometric insight.
The Setup
Consider minimizing \(f(\mathbf{x})\) subject to \(g(\mathbf{x}) = 0\). The constraint defines a level surface (a curve in 2D, a surface in 3D). We must move along this surface and find where \(f\) is smallest.
The Optimality Argument
On the constraint surface \(g(\mathbf{x}) = 0\), we can only move in directions tangent to this surface. If the projection of \(\nabla f\) onto the tangent plane is nonzero, we can move along the constraint and decrease \(f\) further — we're not at the optimum.
At the constrained minimum \(\mathbf{x}^*\): the projection of \(\nabla f(\mathbf{x}^*)\) onto the constraint tangent plane must be zero. This means \(\nabla f(\mathbf{x}^*)\) is normal to the constraint surface. But \(\nabla g(\mathbf{x}^*)\) is also normal to the constraint surface (the gradient of any function is perpendicular to its level sets). Therefore:
\[\nabla f(\mathbf{x}^*) = \lambda^* \nabla g(\mathbf{x}^*)\]
for some scalar \(\lambda^* \in \mathbb{R}\) — the Lagrange multiplier.
The multiplier \(\lambda^*\) is the proportionality constant between the two gradient vectors. It can be positive or negative — we don't constrain its sign for equality constraints. Its magnitude tells you how sensitive the optimal value \(f(\mathbf{x}^*)\) is to small changes in the constraint — the "shadow price" of the constraint. If \(\lambda^* = 0\), the constraint is inactive (the unconstrained min happens to be feasible).
The Lagrangian — Formal Definition
The geometric condition \(\nabla f = \lambda \nabla g\) plus the constraint \(g(\mathbf{x}) = 0\) gives us \(n+1\) equations in \(n+1\) unknowns (\(\mathbf{x} \in \mathbb{R}^n\) and \(\lambda \in \mathbb{R}\)). The Lagrangian packages this system elegantly.
\[\mathcal{L}(\mathbf{x}, \lambda) = f(\mathbf{x}) - \lambda\, g(\mathbf{x})\]
Some texts use + instead of −. For an equality constraint the sign of \(\lambda\) is unconstrained, so the choice is pure convention. For inequality constraints \(h(\mathbf{x}) \leq 0\), we follow the standard ML literature and use \(\mathcal{L} = f + \mu h\) with \(\mu \geq 0\).
Why This Works — The Stationarity Conditions
Taking the gradient of \(\mathcal{L}\) with respect to all variables and setting it to zero:
\[\nabla_{\mathbf{x}} \mathcal{L} = \nabla f(\mathbf{x}) - \lambda \nabla g(\mathbf{x}) = \mathbf{0}, \qquad \frac{\partial \mathcal{L}}{\partial \lambda} = -g(\mathbf{x}) = 0\]
The gradient with respect to \(\mathbf{x}\) recovers the parallelism condition. The derivative with respect to \(\lambda\) recovers the constraint itself. Setting all partials of \(\mathcal{L}\) to zero is equivalent to the full optimality system — the Lagrangian encodes both the objective and the constraint in one function.
Finding the constrained optimum of \(f\) subject to \(g=0\) is equivalent to finding the unconstrained stationary point of \(\mathcal{L}(\mathbf{x}, \lambda)\). We've traded a constrained optimization over \(n\) variables for an unconstrained one over \(n+1\) variables. The multiplier \(\lambda\) is a new variable that we solve for alongside \(\mathbf{x}\).
Sensitivity Interpretation of λ
Consider parameterizing the constraint as \(g(\mathbf{x}) = c\) (a family of constraints). The optimal value \(f^*(c)\) changes as \(c\) changes, and the Lagrange multiplier gives the rate of change:
\[\frac{d f^*}{d c} = \lambda^*\]
If the constraint is \(g(\mathbf{x}) = \|\boldsymbol{\theta}\|^2 - t = 0\) (Ridge), then \(\lambda^*\) tells you how much the optimal loss changes per unit increase in the budget \(t\). In economics this is called the shadow price — the value of relaxing the constraint by one unit.
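This sensitivity claim is easy to check numerically. Below is a minimal sketch on a toy problem of our own choosing (minimize \(x^2 + y^2\) subject to \(x + y = c\), whose closed-form solution is \(x = y = c/2\) with \(\lambda^* = c\)): the finite-difference slope of \(f^*(c)\) matches the multiplier.

```python
# Toy problem (our own example): minimize x^2 + y^2 subject to x + y = c.
# Solving the Lagrangian system by hand gives x = y = c/2, lambda* = c,
# and optimal value f*(c) = c^2 / 2.

def constrained_min(c):
    x = y = c / 2.0          # stationarity: 2x = lambda, 2y = lambda, x + y = c
    lam = 2 * x              # the multiplier at the optimum
    return x**2 + y**2, lam

c, h = 3.0, 1e-5
f_c, lam_star = constrained_min(c)
f_ch, _ = constrained_min(c + h)

# Finite-difference slope of f*(c) should match the shadow price lambda*
slope = (f_ch - f_c) / h
print(lam_star, slope)   # both approximately 3.0
```

The agreement of the two printed numbers is exactly the statement \(df^*/dc = \lambda^*\).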
Solving Equality Constraints — Full Methodology
The 4-Step Method
Step 1: Form the Lagrangian
L(x, λ) = f(x) − λ · g(x)
Step 2: Set all partial derivatives to zero
∂L/∂xᵢ = 0 for i=1,...,n # n equations
∂L/∂λ = 0 # 1 equation (recovers g(x)=0)
Step 3: Solve the resulting system of n+1 equations
# Find all (x*, λ*) pairs that satisfy the system
Step 4: Evaluate f at candidates, pick the minimum
# The Lagrangian conditions are necessary, not always sufficient
# For convex f and affine g → conditions are sufficient
Worked Example — Minimum Distance on an Ellipse
Find the point on the ellipse \(\frac{x^2}{4} + y^2 = 1\) closest to the point \((3, 0)\).
Objective: minimize \(f(x,y) = (x-3)^2 + y^2\) (squared distance). Constraint: \(g(x,y) = \frac{x^2}{4} + y^2 - 1 = 0\).
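Steps 2–4 can be carried out numerically. The sketch below applies Newton's method to the stationarity system (the plain-Newton iteration and the starting guess are our own choices, not part of the original problem):

```python
import numpy as np

# KKT system for: minimize (x-3)^2 + y^2  subject to  x^2/4 + y^2 - 1 = 0,
# using the Lagrangian L = f - lam * g. Three equations, three unknowns.
def F(v):
    x, y, lam = v
    return np.array([
        2 * (x - 3) - lam * x / 2,   # dL/dx = 0
        2 * y - 2 * lam * y,         # dL/dy = 0
        x**2 / 4 + y**2 - 1,         # the constraint g = 0
    ])

def J(v):  # Jacobian of F, for Newton's method
    x, y, lam = v
    return np.array([
        [2 - lam / 2, 0.0,         -x / 2],
        [0.0,         2 - 2 * lam, -2 * y],
        [x / 2,       2 * y,        0.0],
    ])

v = np.array([2.2, 0.05, -1.5])      # starting guess near the feasible point (2, 0)
for _ in range(50):
    v = v - np.linalg.solve(J(v), F(v))

print(v)   # converges to x = 2, y = 0, lam = -2: candidate (2, 0) with f = 1
```

Starting near \((-2, 0)\) instead should converge to the other real candidate, \((-2, 0)\) with \(f = 25\); comparing the objective values (Step 4) confirms \((2, 0)\) is the closest point, at distance 1.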
Common Pitfalls
- Lagrangian conditions are necessary, not sufficient. Every constrained min satisfies them, but not every solution is a min. Always compare function values at all critical points.
- The Lagrangian has saddle points. \(\mathcal{L}\) is minimized over \(\mathbf{x}\) and maximized over \(\lambda\) at the solution — not a minimum in all variables simultaneously.
- Degenerate constraints (LICQ failure). If \(\nabla g(\mathbf{x}^*) = \mathbf{0}\), the method may fail — the multiplier may not exist. Check regularity conditions.
- Multiplying through without checking signs. For equality constraints \(\lambda\) is unconstrained. For inequality constraints, sign matters — covered in §07.
Multiple Equality Constraints
With \(m\) equality constraints \(g_1(\mathbf{x}) = 0, \ldots, g_m(\mathbf{x}) = 0\), introduce one multiplier per constraint:
\[\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) - \sum_{i=1}^{m} \lambda_i g_i(\mathbf{x})\]
The optimality condition says: at the constrained minimum, \(\nabla f\) must lie in the span of the constraint gradients \(\{\nabla g_1, \ldots, \nabla g_m\}\). Each multiplier \(\lambda_i\) is the coefficient for the \(i\)-th constraint gradient.
Geometric Picture
Each constraint \(g_i(\mathbf{x}) = 0\) defines a surface. The intersection of \(m\) surfaces is a submanifold of dimension \(n - m\) (assuming the constraints are independent). We minimize \(f\) on this lower-dimensional object.
The tangent space of the feasible submanifold at \(\mathbf{x}^*\) is orthogonal to all \(\nabla g_i\). Optimality requires \(\nabla f\) to have no component in this tangent space — i.e., \(\nabla f\) is in the span of \(\{\nabla g_i\}\).
Linear Independence Constraint Qualification (LICQ)
The Lagrange multiplier method requires that the active constraint gradients are linearly independent at \(\mathbf{x}^*\): the set \(\{\nabla g_1(\mathbf{x}^*), \ldots, \nabla g_m(\mathbf{x}^*)\}\) must be linearly independent.
If LICQ fails (e.g., two constraints become tangent to each other at \(\mathbf{x}^*\)), the multipliers \(\boldsymbol{\lambda}\) may not exist or may not be unique. LICQ is the standard constraint qualification (CQ) that ensures the method works.
Inequality Constraints — Extending the Framework
Equality constraints force \(\mathbf{x}\) to lie exactly on a surface. An inequality constraint \(h(\mathbf{x}) \leq 0\) allows \(\mathbf{x}\) to lie anywhere in a region: in the interior (\(h < 0\)) or on the boundary (\(h = 0\)).
Two Cases at the Optimum
For a single inequality \(h(\mathbf{x}) \leq 0\), there are two cases at the optimum \(\mathbf{x}^*\):
- Constraint inactive: \(h(\mathbf{x}^*) < 0\). The optimum lies in the interior, the constraint plays no role, and the condition is simply \(\nabla f(\mathbf{x}^*) = \mathbf{0}\) with multiplier \(\mu^* = 0\).
- Constraint active: \(h(\mathbf{x}^*) = 0\). The optimum lies on the boundary, and the constraint behaves like an equality constraint, but with a sign condition on the multiplier.
The Sign of λ for Inequality Constraints
For equality constraints, \(\lambda\) can be any real number — \(\nabla f = \lambda\nabla g\) and \(\lambda\) could be positive or negative depending on whether the constraint pushes toward or away from the unconstrained min.
For inequality constraints \(h(\mathbf{x}) \leq 0\), suppose the constraint is active (\(h(\mathbf{x}^*) = 0\)): it is blocking further descent. By definition \(\nabla h\) points out of the feasible region (toward \(h > 0\)). If \(\nabla f\) also had an outward component, the descent direction \(-\nabla f\) would point into the feasible region, and we could decrease \(f\) while staying feasible, so we would not be at the optimum. At the optimum, therefore, \(\nabla f\) points into the feasible region: \(\nabla f = -\mu \nabla h\) with \(\mu \geq 0\), equivalently \(\nabla f + \mu\nabla h = \mathbf{0}\).
This is the crucial distinction between equality and inequality multipliers. Violating this sign condition is a common error.
KKT Conditions — Complete Treatment
The Karush-Kuhn-Tucker (KKT) conditions generalize Lagrange multipliers to problems with both equality and inequality constraints. They are the fundamental first-order necessary conditions for constrained optimization.
The Five KKT Conditions
1. Stationarity: \(\nabla f(\mathbf{x}^*) = \sum_i \lambda_i^* \nabla g_i(\mathbf{x}^*) - \sum_j \mu_j^* \nabla h_j(\mathbf{x}^*)\). At the optimum, the gradient of f is in the span of constraint gradients. The "force" from the objective is balanced by constraint "reaction forces."
2. Primal feasibility (equality): \(g_i(\mathbf{x}^*) = 0\) for all \(i\).
3. Primal feasibility (inequality): \(h_j(\mathbf{x}^*) \leq 0\) for all \(j\). The solution must satisfy all original constraints. Non-negotiable.
4. Dual feasibility: \(\mu_j^* \geq 0\). Inequality multipliers must be non-negative; equality multipliers \(\lambda_i\) are unconstrained in sign. This encodes the directionality of inequality constraints.
5. Complementary slackness: \(\mu_j^* h_j(\mathbf{x}^*) = 0\). For each inequality: either the constraint is active (\(h_j = 0\), \(\mu_j\) may be positive) or the multiplier is zero (\(\mu_j = 0\), the constraint is slack). What is ruled out is \(\mu_j > 0\) with \(h_j < 0\). This is the heart of KKT.
Any local minimum of \(f\) subject to the constraints must satisfy the KKT conditions (under a constraint qualification like LICQ). The KKT conditions are necessary. For convex \(f\), convex \(h_j\), and affine \(g_i\), they are also sufficient — every KKT point is the global minimum. This sufficiency for convex problems is what makes KKT so powerful in ML.
Summary Table
| KKT Condition | Equations | Count | Encodes |
|---|---|---|---|
| Stationarity | \(\nabla f = \sum\lambda_i\nabla g_i - \sum\mu_j\nabla h_j\) | \(n\) | Objective-constraint gradient balance |
| Primal feasibility | \(g_i = 0\), \(h_j \leq 0\) | \(m + p\) | All constraints satisfied |
| Dual feasibility | \(\mu_j \geq 0\) | \(p\) | Inequality multiplier signs |
| Complementary slackness | \(\mu_j h_j = 0\) | \(p\) | Active vs inactive inequality constraints |
Total: \(n + m + p\) equations (stationarity, equality feasibility, complementary slackness) for \(n + m + p\) unknowns (\(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}\)), plus \(2p\) inequality conditions (\(h_j \leq 0\), \(\mu_j \geq 0\)). The equation system is generically well-determined.
Complementary Slackness — Deep Dive
Complementary slackness is the most subtle and powerful of the KKT conditions. The product \(\mu_j h_j(\mathbf{x}^*) = 0\) seems simple — but its implications are far-reaching.
What It Really Means
The condition \(\mu_j \cdot h_j(\mathbf{x}^*) = 0\) says that, for each inequality constraint, at least one of two things must be true:
- The constraint is active: \(h_j(\mathbf{x}^*) = 0\), and the multiplier \(\mu_j\) may be positive.
- The multiplier is zero: \(\mu_j = 0\), and the constraint may be slack (\(h_j(\mathbf{x}^*) < 0\)).
What we can never have: \(\mu_j > 0\) AND \(h_j(\mathbf{x}^*) < 0\) simultaneously. If the constraint is not active, its multiplier must be zero — the constraint is not influencing the solution, so it exerts no "force." A nonzero multiplier on a slack constraint would contradict optimality.
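A minimal numeric illustration (the 1-D problem is our own): minimize \((x-2)^2\) subject to \(x \leq c\). For \(c = 3\) the constraint is slack and \(\mu = 0\); for \(c = 1\) it is active with \(\mu = 2(2-c) > 0\). In both cases the product \(\mu \cdot h\) is zero:

```python
# Complementary slackness on a 1-D example (our own): minimize (x-2)^2
# subject to h(x) = x - c <= 0.  KKT: 2(x-2) + mu = 0, mu >= 0, mu*(x-c) = 0.

def solve_kkt(c):
    # Case 1: constraint inactive -> mu = 0, unconstrained min x = 2 (valid iff 2 <= c)
    if 2 <= c:
        return 2.0, 0.0
    # Case 2: constraint active -> x = c, and mu = 2(2 - c) > 0
    return c, 2 * (2 - c)

for c in (3.0, 1.0):
    x, mu = solve_kkt(c)
    assert mu >= 0                       # dual feasibility
    assert abs(mu * (x - c)) < 1e-12     # complementary slackness
    print(c, x, mu)
# c=3: x=2, mu=0   (slack constraint, zero multiplier)
# c=1: x=1, mu=2   (active constraint, positive multiplier = shadow price)
```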
Connection to Support Vectors
In SVM (§11), the inequality constraints are \(h_i(\mathbf{x}) = 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b) \leq 0\). Complementary slackness says: \(\mu_i(1 - y_i(\mathbf{w}^T\mathbf{x}_i+b)) = 0\).
- For most training points: \(y_i(\mathbf{w}^T\mathbf{x}_i+b) > 1\) (strictly beyond the margin boundary) → \(\mu_i = 0\). These points don't influence the solution at all.
- For support vectors: \(y_i(\mathbf{w}^T\mathbf{x}_i+b) = 1\) (exactly on the margin) → \(\mu_i\) may be positive. These points are what define the solution.
This is why SVM is a sparse classifier: the optimal boundary is determined entirely by a small subset of training points — the support vectors. All other points are irrelevant. Complementary slackness makes this mathematically precise.
Lagrangian Duality and the Dual Problem
Duality transforms the original (primal) optimization problem into a different (dual) problem that is always convex — regardless of the primal's convexity. The dual provides a lower bound on the primal, and for convex problems, the bounds are tight.
The Dual Function
Given the Lagrangian \(\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu})\), the dual function is formed by minimizing over \(\mathbf{x}\):
\[d(\boldsymbol{\lambda}, \boldsymbol{\mu}) = \inf_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu})\]
The dual function is always concave in \((\boldsymbol{\lambda}, \boldsymbol{\mu})\) — regardless of whether \(f\), \(g_i\), or \(h_j\) are convex. For each fixed \(\mathbf{x}\), \(\mathcal{L}\) is affine in the multipliers, so \(d\) is a pointwise infimum of affine functions, and such an infimum is always concave.
Weak Duality — The Lower Bound Property
For any feasible \(\mathbf{x}\) and any dual-feasible \((\boldsymbol{\lambda}, \boldsymbol{\mu})\) (with \(\mu_j \geq 0\)):
\[d(\boldsymbol{\lambda}, \boldsymbol{\mu}) \leq f(\mathbf{x})\]
In particular, every value of the dual function is a lower bound on the primal optimum.
The Dual Problem
To get the tightest lower bound, maximize \(d\) over dual-feasible multipliers:
\[\max_{\boldsymbol{\lambda}, \boldsymbol{\mu}} \; d(\boldsymbol{\lambda}, \boldsymbol{\mu}) \quad \text{subject to} \quad \boldsymbol{\mu} \geq \mathbf{0}\]
The dual problem is always convex (maximizing a concave function), even when the primal is non-convex. This is a remarkable property — it allows efficient solving of the dual even when the primal is hard.
Let \(p^*\) = primal optimum, \(d^*\) = dual optimum. Weak duality guarantees \(d^* \leq p^*\). The difference \(p^* - d^*\) is the duality gap.
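Weak duality can be seen directly on a toy problem (our own choice): minimize \(x^2\) subject to \(1 - x \leq 0\), so \(p^* = 1\). Evaluating \(d(\mu) = \inf_x [x^2 + \mu(1-x)]\) on a grid shows every dual value sits below \(p^*\), with the maximum \(d^* = 1\) attained at \(\mu^* = 2\): zero duality gap for this convex problem.

```python
import numpy as np

# Toy problem (our own): minimize f(x) = x^2 subject to h(x) = 1 - x <= 0.
# Primal optimum: x* = 1, p* = 1.  Lagrangian: L(x, mu) = x^2 + mu*(1 - x).
xs = np.linspace(-5, 5, 20001)
mus = np.linspace(0, 4, 401)

# Dual function d(mu) = inf_x L(x, mu), approximated by a fine grid over x
d = np.array([np.min(xs**2 + mu * (1 - xs)) for mu in mus])

p_star = 1.0
print(np.all(d <= p_star + 1e-9))        # weak duality: every d(mu) <= p*
print(mus[np.argmax(d)], d.max())        # d* attained at mu = 2, d* = 1 = p*
```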
Strong Duality and Constraint Qualification
Strong duality does not always hold. It requires a constraint qualification — a regularity condition on the problem structure.
Slater's Condition (Most Common CQ)
For a convex problem (convex \(f\) and \(h_j\), affine \(g_i\)), strong duality holds if there exists a strictly feasible point — a point in the interior of the feasible set:
\[\exists\, \mathbf{x}: \quad g_i(\mathbf{x}) = 0 \;\;\forall i, \qquad h_j(\mathbf{x}) < 0 \;\;\forall j\]
The SVM optimization problem is convex and satisfies Slater's condition (there always exists a margin-satisfying hyperplane if the data is separable). Therefore, strong duality holds — we can solve the dual SVM problem instead of the primal. The dual has a beautiful structure that enables the kernel trick, non-linear classification, and sparse solutions via complementary slackness.
Conditions for Strong Duality Summary
| Problem Type | Condition for Strong Duality | Result |
|---|---|---|
| Linear programming (LP) | Primal and dual both feasible | \(d^* = p^*\) always |
| Convex QP (e.g., SVM) | Slater's condition | \(d^* = p^*\) |
| General convex | Slater's or other CQ | \(d^* = p^*\) |
| Non-convex (e.g., deep networks) | Rarely satisfied | \(d^* \leq p^*\) (gap possible) |
Application — Hard-Margin SVM Derivation
SVM is the cleanest and most beautiful application of Lagrange multipliers in ML. We derive the full primal and dual.
The Primal SVM Problem
Given linearly separable training data \(\{(\mathbf{x}_i, y_i)\}_{i=1}^{m}\) with \(y_i \in \{-1, +1\}\), find the hyperplane \(\mathbf{w}^T\mathbf{x} + b = 0\) with maximum margin.
The margin width is \(\frac{2}{\|\mathbf{w}\|}\). Maximizing it is equivalent to minimizing \(\frac{1}{2}\|\mathbf{w}\|^2\). The constraints ensure all points are classified correctly with margin \(\geq 1\):
\[\min_{\mathbf{w}, b} \;\frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1, \quad i = 1, \ldots, m\]
Rewrite constraints as \(h_i(\mathbf{w},b) = 1 - y_i(\mathbf{w}^T\mathbf{x}_i+b) \leq 0\).
The Lagrangian
With multipliers \(\alpha_i \geq 0\) for the inequality constraints:
\[\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{m}\alpha_i\left[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1\right]\]
Deriving the Dual — Full Steps
Setting the gradients with respect to the primal variables to zero:
\[\frac{\partial\mathcal{L}}{\partial\mathbf{w}} = \mathbf{w} - \sum_i\alpha_i y_i\mathbf{x}_i = \mathbf{0} \;\Rightarrow\; \mathbf{w} = \sum_i\alpha_i y_i\mathbf{x}_i, \qquad \frac{\partial\mathcal{L}}{\partial b} = -\sum_i\alpha_i y_i = 0\]
Substituting back into \(\mathcal{L}\) eliminates \(\mathbf{w}\) and \(b\), leaving the dual:
\[\max_{\boldsymbol{\alpha}} \;\sum_i\alpha_i - \frac{1}{2}\sum_i\sum_j\alpha_i\alpha_j y_i y_j\,\mathbf{x}_i^T\mathbf{x}_j \quad \text{subject to} \quad \alpha_i \geq 0, \;\; \sum_i\alpha_i y_i = 0\]
By complementary slackness: \(\alpha_i(y_i(\mathbf{w}^T\mathbf{x}_i+b)-1) = 0\). Points with \(\alpha_i > 0\) are the support vectors — they lie exactly on the margin. All other points have \(\alpha_i = 0\) and don't contribute to \(\mathbf{w}^* = \sum\alpha_i y_i\mathbf{x}_i\).
The dual depends on data only through inner products \(\mathbf{x}_i^T\mathbf{x}_j\). Replace this with a kernel \(K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)\) — and you implicitly operate in a feature space of arbitrarily high (even infinite) dimension without ever computing \(\phi(\mathbf{x})\). This is the kernel trick, only possible because of the dual formulation. Lagrange multipliers enable nonlinear SVMs.
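The dual is a small quadratic program, so it can be solved directly. Below is a sketch on a six-point toy dataset (the data and the use of SciPy's SLSQP solver are our own choices, not a production SVM):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny hard-margin SVM solved through its dual (toy 2-D data, our own)
X = np.array([[3., 3.], [4., 3.], [3., 5.], [1., 1.], [0., 2.], [1., 0.]])
y = np.array([1., 1., 1., -1., -1., -1.])
m = len(y)
Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y_i y_j x_i . x_j

# Dual: max sum(a) - 0.5 a^T Q a   s.t.  a >= 0, sum(a_i y_i) = 0
obj = lambda a: 0.5 * a @ Q @ a - a.sum()
res = minimize(obj, np.zeros(m), jac=lambda a: Q @ a - 1,
               bounds=[(0, None)] * m,
               constraints={'type': 'eq', 'fun': lambda a: a @ y})
alpha = res.x

# Recover primal variables from stationarity: w = sum_i alpha_i y_i x_i
w = (alpha * y) @ X
sv = np.argmax(alpha)                        # any point with alpha_i > 0 is on the margin
b = y[sv] - w @ X[sv]

margins = y * (X @ w + b)
print(np.round(alpha, 3))   # sparse: most alpha_i are ~0 (non-support vectors)
print(w, b, margins)        # all margins >= 1; support vectors sit at exactly 1
```

Most \(\alpha_i\) come out numerically zero; only the support vectors carry weight, and the recovered \((\mathbf{w}, b)\) satisfies every margin constraint, with equality exactly on those points — complementary slackness in action.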
Application — Soft-Margin SVM and Dual
Hard-margin SVM requires linear separability. In practice, data has overlap and noise. Soft-margin SVM allows constraint violations via slack variables \(\xi_i \geq 0\):
\[\min_{\mathbf{w}, b, \boldsymbol{\xi}} \;\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{m}\xi_i \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0\]
The parameter \(C > 0\) trades off margin width vs. constraint violations. Large \(C\) → small violations allowed (hard margin behavior). Small \(C\) → more violations tolerated (wider, more flexible margin).
Introducing multipliers \(\alpha_i \geq 0\) for the margin constraints and \(\beta_i \geq 0\) for \(\xi_i \geq 0\), the dual becomes:
\[\max_{\boldsymbol{\alpha}} \;\sum_i\alpha_i - \frac{1}{2}\sum_i\sum_j\alpha_i\alpha_j y_i y_j\,\mathbf{x}_i^T\mathbf{x}_j \quad \text{subject to} \quad 0 \leq \alpha_i \leq C, \;\; \sum_i\alpha_i y_i = 0\]
The only change from the hard-margin dual: \(\alpha_i \in [0, C]\) instead of \(\alpha_i \geq 0\). The upper bound \(C\) comes from the slack variable stationarity condition \(\frac{\partial\mathcal{L}}{\partial\xi_i} = C - \alpha_i - \beta_i = 0\) with \(\beta_i \geq 0\), giving \(\alpha_i \leq C\).
By complementary slackness: (1) \(\alpha_i = 0\) → point strictly beyond the margin, ignored. (2) \(0 < \alpha_i < C\) → point exactly on the margin, a support vector with \(\xi_i = 0\). (3) \(\alpha_i = C\) → point inside the margin or misclassified, with \(\xi_i > 0\). The structure of the solution is entirely determined by KKT and complementary slackness.
Application — Ridge as Constrained Optimization
Ridge regression is usually written as a penalized problem: minimize \(\|\mathbf{X}\boldsymbol{\theta}-\mathbf{y}\|^2 + \lambda\|\boldsymbol{\theta}\|^2\). But it has an equivalent constrained formulation — and the Lagrangian connects them.
Primal Constrained Form
\[\min_{\boldsymbol{\theta}} \;\|\mathbf{X}\boldsymbol{\theta}-\mathbf{y}\|^2 \quad \text{subject to} \quad \|\boldsymbol{\theta}\|^2 \leq t\]
With multiplier \(\mu \geq 0\) for the inequality constraint, the Lagrangian is \(\mathcal{L} = \|\mathbf{X}\boldsymbol{\theta}-\mathbf{y}\|^2 + \mu(\|\boldsymbol{\theta}\|^2 - t)\). Setting \(\frac{\partial\mathcal{L}}{\partial\boldsymbol{\theta}} = \mathbf{0}\):
\[2\mathbf{X}^T(\mathbf{X}\boldsymbol{\theta}-\mathbf{y}) + 2\mu\boldsymbol{\theta} = \mathbf{0} \;\Rightarrow\; \boldsymbol{\theta}^* = (\mathbf{X}^T\mathbf{X} + \mu\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\]
The penalized Ridge problem (\(\min\|\mathbf{X}\boldsymbol{\theta}-\mathbf{y}\|^2 + \lambda\|\boldsymbol{\theta}\|^2\)) and the constrained Ridge problem (\(\min\|\mathbf{X}\boldsymbol{\theta}-\mathbf{y}\|^2\) s.t. \(\|\boldsymbol{\theta}\|^2 \leq t\)) give identical solutions. The Lagrange multiplier \(\mu^*\) from the constrained problem equals \(\lambda\) in the penalized problem. This equivalence holds for convex problems in general: any Lagrange penalty \(\lambda g(\mathbf{x})\) corresponds to a budget constraint \(g(\mathbf{x}) \leq t(\lambda)\). This is why regularization and constrained optimization are two faces of the same coin.
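The equivalence is easy to verify numerically: take the penalized ridge solution \(\boldsymbol{\theta}(\lambda)\), set the budget to \(t = \|\boldsymbol{\theta}(\lambda)\|^2\), and check that \(\boldsymbol{\theta}(\lambda)\) satisfies the constrained problem's KKT stationarity with \(\mu = \lambda\) (the random data below is our own):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
lam = 2.0

# Penalized ridge solution: theta = (X^T X + lam I)^{-1} X^T y
theta = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

# Constrained-form KKT check, with budget t = ||theta||^2 and multiplier mu = lam
t = theta @ theta
mu = lam
stationarity = 2 * X.T @ (X @ theta - y) + 2 * mu * theta   # should be ~0
print(np.abs(stationarity).max())      # ~1e-14: theta solves the constrained problem
print(theta @ theta <= t + 1e-12)      # primal feasibility (constraint active)
```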
Lasso — The L1 Ball Constraint
Lasso: minimize \(\|\mathbf{X}\boldsymbol{\theta}-\mathbf{y}\|^2\) subject to \(\|\boldsymbol{\theta}\|_1 \leq t\). The L1 ball has corners at the coordinate axes — see the geometric figure in the linear regression masterclass. Because \(\|\boldsymbol{\theta}\|_1\) is not differentiable at zero, the KKT conditions involve its subdifferential:
\[\mathbf{0} \in 2\mathbf{X}^T(\mathbf{X}\boldsymbol{\theta}-\mathbf{y}) + \mu\,\partial\|\boldsymbol{\theta}\|_1, \qquad \partial|\theta_j| = \begin{cases}\{\operatorname{sign}(\theta_j)\} & \theta_j \neq 0 \\ [-1, 1] & \theta_j = 0\end{cases}\]
For the zero case: if \(|2[\mathbf{X}^T(\mathbf{X}\boldsymbol{\theta}-\mathbf{y})]_j| \leq \mu\), then \(\theta_j^* = 0\) is a solution — there exists a subgradient that makes the KKT condition hold. This is the formal mechanism by which Lasso produces exact zeros.
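One standard way to solve the penalized form and observe this mechanism is proximal gradient descent (ISTA), sketched below on synthetic data of our own choosing; the soft-thresholding step is what produces exact zeros, and the subgradient condition can be checked at the solution:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 6))
theta_true = np.array([3., 0., 0., -2., 0., 0.])
y = X @ theta_true + 0.01 * rng.normal(size=40)
mu = 5.0                                    # L1 penalty weight (our choice)

# ISTA / proximal gradient for:  min ||X theta - y||^2 + mu * ||theta||_1
step = 1.0 / (2 * np.linalg.norm(X, 2)**2)  # 1 / Lipschitz constant of the gradient
theta = np.zeros(6)
for _ in range(20000):
    grad = 2 * X.T @ (X @ theta - y)
    z = theta - step * grad
    theta = np.sign(z) * np.maximum(np.abs(z) - step * mu, 0)  # soft threshold

# Subgradient KKT check: |2 X^T (X theta - y)|_j <= mu on every coordinate,
# with equality (value -mu * sign(theta_j)) on the nonzero coordinates.
grad = 2 * X.T @ (X @ theta - y)
print(theta)                       # inactive coordinates are (typically) exactly 0
print(np.abs(grad) <= mu + 1e-4)   # holds for all j
```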
Application — Maximum Entropy and Softmax
The softmax function — used in every multiclass classifier and the last layer of your chest X-ray ViT — can be derived from maximum entropy with Lagrange multipliers.
The Maximum Entropy Problem
Among all probability distributions over \(K\) classes, find the one that maximizes entropy subject to known feature expectations. The simplest version: maximize entropy subject only to the normalization constraint. Forming \(\mathcal{L} = -\sum_k p_k\log p_k + \lambda\left(\sum_k p_k - 1\right)\) and setting \(\frac{\partial\mathcal{L}}{\partial p_k} = -\log p_k - 1 + \lambda = 0\) gives \(p_k = e^{\lambda-1}\), the same for every \(k\): the uniform distribution \(p_k = 1/K\).
Deriving Softmax via Lagrange Multipliers
Now add linear constraints on feature expectations: \(\sum_k p_k f_j(k) = \mu_j\) for \(j = 1, \ldots, n\). The Lagrangian gains one multiplier \(w_j\) per constraint:
\[\mathcal{L} = -\sum_k p_k\log p_k + \lambda\left(\sum_k p_k - 1\right) + \sum_j w_j\left(\sum_k p_k f_j(k) - \mu_j\right)\]
Setting \(\frac{\partial\mathcal{L}}{\partial p_k} = 0\) gives \(p_k \propto \exp\left(\sum_j w_j f_j(k)\right)\); normalizing yields the softmax form. With the features chosen so that \(\sum_j w_j f_j(k) = z_k\) (the logits), this is exactly \(p_k = \frac{e^{z_k}}{\sum_l e^{z_l}}\).
Every time you use a softmax output layer, you're implicitly assuming the model should be the maximum entropy distribution consistent with the learned linear constraints. The Lagrange multipliers are the softmax weights \(\mathbf{w}\). This gives softmax its information-theoretic justification — it makes the fewest additional assumptions beyond what the data specifies. Lagrange multipliers are hidden inside every neural network classifier you've ever trained.
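The derivation can be checked numerically: maximize entropy subject to normalization plus a matched expected-logit constraint, and the solver should return softmax of the logits (the toy logits and the use of SciPy's SLSQP solver are our own choices):

```python
import numpy as np
from scipy.optimize import minimize

z = np.array([2.0, 0.5, -1.0])                  # toy logits (our own values)
softmax = np.exp(z) / np.exp(z).sum()

# Maximize entropy subject to normalization and a mean-logit constraint.
# If the target expectation sum_k p_k z_k equals softmax's own expectation,
# the max-entropy solution should be exactly softmax(z).
target = softmax @ z
neg_entropy = lambda p: np.sum(p * np.log(p))
res = minimize(neg_entropy, np.ones(3) / 3,
               bounds=[(1e-9, 1)] * 3,
               constraints=[{'type': 'eq', 'fun': lambda p: p.sum() - 1},
                            {'type': 'eq', 'fun': lambda p: p @ z - target}])
print(res.x)      # matches softmax(z) to solver tolerance
print(softmax)
```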
Numerical Methods — ALM and Penalty Methods
Analytical KKT solutions exist only for structured problems. For general constrained optimization in ML, we need numerical methods that can handle arbitrary constraints.
Penalty Method — Constraint as Soft Penalty
Approximate the constraint by adding a penalty term that blows up when the constraint is violated:
\[\min_{\mathbf{x}} \; f(\mathbf{x}) + \frac{\rho}{2}\sum_i g_i(\mathbf{x})^2\]
For large \(\rho\): constraint violations are heavily penalized → solution approaches the feasible region. But very large \(\rho\) makes the problem ill-conditioned and hard to optimize numerically.
Augmented Lagrangian Method (ALM)
Combines the Lagrangian (via dual variables) with the quadratic penalty. The augmented Lagrangian:
\[\mathcal{L}_\rho(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) - \sum_i\lambda_i g_i(\mathbf{x}) + \frac{\rho}{2}\sum_i g_i(\mathbf{x})^2\]
ALM iterates between minimizing \(\mathcal{L}_\rho\) over \(\mathbf{x}\) (primal update) and updating \(\lambda_i \leftarrow \lambda_i - \rho g_i(\mathbf{x})\) (dual ascent). The dual update drives the Lagrange multipliers toward their true values, while the penalty term maintains feasibility. ALM converges for fixed \(\rho\) — unlike the penalty method which requires \(\rho \to \infty\).
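A minimal ALM sketch on a toy problem of our own (minimize \(x^2 + y^2\) subject to \(x + y = 1\), with known answer \(x = y = 1/2\), \(\lambda^* = 1\)); note that \(\rho\) stays fixed throughout:

```python
import numpy as np
from scipy.optimize import minimize

# ALM on a toy problem (our own): min x^2 + y^2  s.t.  g(x, y) = x + y - 1 = 0.
# Analytic answer: x = y = 0.5, lambda* = 1 (from grad f = lambda * grad g).
f = lambda v: v[0]**2 + v[1]**2
g = lambda v: v[0] + v[1] - 1

rho, lam = 1.0, 0.0
v = np.zeros(2)
for _ in range(50):
    # Primal update: minimize the augmented Lagrangian over (x, y)
    L_rho = lambda v: f(v) - lam * g(v) + 0.5 * rho * g(v)**2
    v = minimize(L_rho, v).x
    # Dual update: lambda <- lambda - rho * g(x)  (drives g toward 0)
    lam = lam - rho * g(v)

print(v, lam)   # converges to [0.5, 0.5] and lambda = 1, with rho held fixed
```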
ADMM (Alternating Direction Method of Multipliers)
ADMM splits the problem into blocks — \(\min_{\mathbf{x},\mathbf{z}} f(\mathbf{x}) + g(\mathbf{z})\) subject to \(\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{z} = \mathbf{c}\) — and minimizes over each block alternately; it is popular for distributed optimization of large-scale ML problems. The update rules (scaled form, with dual variable \(\lambda\)):
repeat until convergence:
    # x-update (minimize the augmented Lagrangian over x)
    x ← argmin_x [ f(x) + (ρ/2)·‖Ax + Bz − c + λ/ρ‖² ]
    # z-update (minimize over z)
    z ← argmin_z [ g(z) + (ρ/2)·‖Ax + Bz − c + λ/ρ‖² ]
    # dual variable update (gradient ascent on the dual)
    λ ← λ + ρ·(Ax + Bz − c)
ADMM is used to solve Lasso, SVM, and distributed deep learning training across multiple machines — each with a local copy of parameters, with the consensus constraint enforced via the dual variable \(\boldsymbol{\lambda}\).
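The updates can be instantiated on the smallest possible consensus problem (our own example): split \(\min_x (x-a)^2 + (x-b)^2\) into \(f(x) = (x-a)^2\) and \(g(z) = (z-b)^2\) with the constraint \(x = z\), so \(A = 1\), \(B = -1\), \(c = 0\). Both block updates have closed forms here:

```python
# ADMM on a tiny consensus problem (our own): min (x-a)^2 + (z-b)^2  s.t. x = z.
# A = 1, B = -1, c = 0; the solution is x = z = (a + b) / 2.
a, b, rho = 0.0, 4.0, 1.0
x = z = u = 0.0          # u = lambda / rho (scaled dual variable)

for _ in range(100):
    # x-update: argmin_x (x - a)^2 + (rho/2)(x - z + u)^2   (closed form)
    x = (2 * a + rho * (z - u)) / (2 + rho)
    # z-update: argmin_z (z - b)^2 + (rho/2)(x - z + u)^2   (closed form)
    z = (2 * b + rho * (x + u)) / (2 + rho)
    # dual update on the residual x - z
    u = u + (x - z)

print(x, z, rho * u)     # x, z converge to 2.0; the multiplier rho*u converges to -4
```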
The Full Mental Model
The complete story in one coherent thread:
- Geometric origin: At the constrained optimum, \(\nabla f\) and \(\nabla g\) must be parallel — otherwise we could move along the constraint to decrease \(f\). The Lagrange multiplier \(\lambda\) is the proportionality constant: \(\nabla f = \lambda\nabla g\).
- The Lagrangian: \(\mathcal{L} = f - \sum\lambda_i g_i + \sum\mu_j h_j\) encodes all conditions. Setting \(\nabla_{\mathbf{x}}\mathcal{L} = \mathbf{0}\) recovers the parallelism/stationarity condition, and stationarity with respect to each \(\lambda_i\) recovers the equality constraints.
- Inequality constraints: Multipliers \(\mu_j \geq 0\) (non-negative). The sign constraint encodes the direction in which the inequality pushes the solution.
- Complementary slackness: \(\mu_j h_j = 0\). Either the constraint is active (\(h_j = 0\)) or the multiplier is zero. Inactive constraints are invisible to the solution. This is why SVMs have sparse support vectors.
- KKT conditions: Stationarity + primal feasibility + dual feasibility + complementary slackness. Necessary for any local minimum under LICQ. Sufficient for convex problems.
- Duality: The dual function \(d(\boldsymbol{\lambda},\boldsymbol{\mu}) = \inf_{\mathbf{x}}\mathcal{L}\) is always concave. Weak duality: \(d^* \leq p^*\). Strong duality (\(d^* = p^*\)) holds under Slater's condition for convex problems — enabling us to solve the dual instead of the primal.
- SVM: Primal \(\min\frac{1}{2}\|\mathbf{w}\|^2\). Dual depends on data only through \(\mathbf{x}_i^T\mathbf{x}_j\) → kernel trick → nonlinear classification in infinite-dimensional spaces. Support vectors = points with \(\alpha_i > 0\) by complementary slackness.
- Regularization: Ridge/Lasso as penalized problems are equivalent to constrained problems via the Lagrangian. The penalty parameter \(\lambda\) is the Lagrange multiplier for the budget constraint.
- Softmax: The maximum entropy distribution subject to linear constraints is exactly softmax — Lagrange multipliers are the softmax weights. Every classification network output is a constrained entropy maximizer.
Lagrange multipliers are the language of constrained trade-offs. Every multiplier measures the cost of tightening a constraint — the shadow price. Every KKT condition is an equilibrium between the objective's "pull" and the constraint's "resistance." This language appears everywhere: the regularization parameter in Ridge is a Lagrange multiplier. The support vectors in SVM are defined by complementary slackness. The softmax weights are Lagrange multipliers for entropy maximization. Half of all optimization problems in ML are constrained problems in disguise — and Lagrange multipliers are how you solve them.