Review:
- Line search
- Equality and inequality constraints
- KKT conditions
Lecture 5 Optimization Pt.3
Overview
- Algorithms for constrained minimization
- Augumented lagrangian method
- Quadratic programming
- More on regularization and line search
1. Inequality-constrained minimization
min x f ( x ) s.t. c ( x ) ≤ 0 \begin{aligned} &\min_{\mathbf{x}} \quad f(\mathbf{x}) \\ &\text{s.t.} \quad \mathbf{c}(\mathbf{x}) \leq \mathbf{0} \end{aligned} xminf(x)s.t.c(x)≤0
KKT conditions:
∇ f ( x ) + ( ∂ c ∂ x ) ⊤ λ = 0 ( s t a t i o n a r i t y ) c ( x ) ≤ 0 ( p r i m a l f e a s i b i l i t y ) λ ≥ 0 ( d u a l f e a s i b i l i t y ) λ ⊙ c ( x ) = λ ⊤ c ( x ) = 0 ( c o m p l e m e n t a r y s l a c k n e s s ) \begin{align*} \nabla f\left(\mathbf{x}\right) + \left(\frac{\partial \mathbf{c}}{\partial \mathbf{x}}\right)^\top \lambda = \mathbf{0}& \quad (\mathrm{stationarity}) \\ \mathbf{c}\left(\mathbf{x}\right) \leq \mathbf{0}& \quad (\mathrm{primal\; feasibility}) \\ \lambda \geq \mathbf{0}& \quad (\mathrm{dual\; feasibility}) \\ \lambda\odot \mathbf{c}\left(\mathbf{x}\right) = \lambda^\top \mathbf{c}\left(\mathbf{x}\right) = \mathbf{0}& \quad (\mathrm{complementary\; slackness}) \end{align*} ∇f(x)+(∂x∂c)⊤λ=0c(x)≤0λ≥0λ⊙c(x)=λ⊤c(x)=0(stationarity)(primalfeasibility)(dualfeasibility)(complementaryslackness)
- ( ∂ c ∂ x ) ⊤ λ \left(\frac{\partial \mathbf{c}}{\partial \mathbf{x}}\right)^\top \lambda (∂x∂c)⊤λ acts like a penalty term. Make sure its sign is consistent (makes cost worse for c ( x ) \mathbf{c}\left(\mathbf{x}\right) c(x) infeasible)
- The notation ⊙ \odot ⊙ is the Hadamard product (element-wise product)
2. Optimization algorithms
(1) Active-set methods
- Guess active/inactive constraints
- Solve equality constrained problem
- Used when you have a good heuristic for the active set
(2) Barrier/Interior-point methods
Replace inequalities with barrier function in objective
min
x
f
(
x
)
s.t.
c
(
x
)
≤
0
\begin{aligned} &\min_{\mathbf{x}} \quad f(\mathbf{x}) \\ &\text{s.t.} \quad \mathbf{c}(\mathbf{x}) \leq \mathbf{0} \end{aligned}
xminf(x)s.t.c(x)≤0
⇒ min x f ( x ) − ∑ i = 1 m 1 ρ log ( − c i ( x ) ) \Rightarrow \min_{\mathbf{x}} \quad f(\mathbf{x}) - \sum_{i=1}^m \frac{1}{\rho}\log(-c_i(\mathbf{x})) ⇒xminf(x)−i=1∑mρ1log(−ci(x))
- Gold standard for convex problems. (e.g. MPC)
Example:
min
x
f
(
x
)
=
x
1
2
+
0.3
x
2
2
s.t.
c
1
(
x
)
=
x
1
2
+
x
2
2
−
2
≤
0
c
2
(
x
)
=
x
1
+
x
2
−
1
≤
0
\begin{aligned} &\min_{\mathbf{x}} \quad f(\mathbf{x}) = x_1^2 + 0.3x_2^2 \\ &\text{s.t.} \quad c_1(\mathbf{x}) = x_1^2 + x_2^2 - 2 \leq 0 \\ &\quad \quad \quad c_2(\mathbf{x}) = x_1 + x_2 - 1 \leq 0 \end{aligned}
xminf(x)=x12+0.3x22s.t.c1(x)=x12+x22−2≤0c2(x)=x1+x2−1≤0
Plot the objective and constraints on 2D plane:
Plot the objective and constraints in 3D:
(3) Penalty
Replace constraints with penalty term that penalizes violations
min
x
f
(
x
)
s.t.
c
(
x
)
≤
0
\begin{aligned} &\min_{\mathbf{x}} \quad f(\mathbf{x}) \\ &\text{s.t.} \quad \mathbf{c}(\mathbf{x}) \leq \mathbf{0} \end{aligned}
xminf(x)s.t.c(x)≤0
⇒
min
x
f
(
x
)
+
ρ
2
[
max
(
0
,
c
(
x
)
)
]
2
\Rightarrow \min_{\mathbf{x}} \quad f(\mathbf{x}) + \frac{\rho}{2} \left[\max(0, c(\mathbf{x}))\right]^2
⇒xminf(x)+2ρ[max(0,c(x))]2
- easy to implement
- has issues with numerical ill-conditioning
- cannot achieve high accuracy
- rarely used in practice
(4) Augmented Lagrangian
Add Lagrange multiplier estimate to penalty method:
min
x
L
ρ
(
x
,
λ
~
)
=
f
(
x
)
+
λ
~
⊤
c
(
x
)
+
ρ
2
[
max
(
0
,
c
(
x
)
)
]
2
\min_{\mathbf{x}} \quad \mathcal{L}_\rho(\mathbf{x}, \tilde{\lambda}) = f(\mathbf{x}) + \tilde{\lambda}^\top \mathbf{c}(\mathbf{x}) + \frac{\rho}{2} \left[\max(0, c(\mathbf{x}))\right]^2 \\
xminLρ(x,λ~)=f(x)+λ~⊤c(x)+2ρ[max(0,c(x))]2
L ρ ( x , λ ~ ) \mathcal{L}_\rho(\mathbf{x}, \tilde{\lambda}) Lρ(x,λ~) is called the augmented Lagrangian. λ ~ \tilde{\lambda} λ~ is the estimate of the Lagrange multiplier.
i. Update
λ
~
\tilde{\lambda}
λ~ by offloading the penalty term into
λ
~
\tilde{\lambda}
λ~ at each iteration:
λ
~
←
λ
~
+
ρ
c
(
x
)
\tilde{\lambda}\leftarrow \tilde{\lambda} + \rho \mathbf{c}(\mathbf{x})
λ~←λ~+ρc(x)(for active constraints)
Insight:
∂ f ∂ x + λ ~ ⊤ ∂ c ∂ x + + ρ c ⊤ ( x ) ∂ c ∂ x = ∂ f ∂ x + [ λ ~ + ρ c ( x ) ] ⊤ ∂ c ∂ x = 0 \begin{aligned} &\frac{\partial f}{\partial \mathbf{x}} + \tilde{\lambda}^\top \frac{\partial \mathbf{c}}{\partial \mathbf{x}} + + \rho \mathbf{c}^\top(\mathbf{x}) \frac{\partial \mathbf{c}}{\partial \mathbf{x}} \\ = &\frac{\partial f}{\partial \mathbf{x}} + \left[\tilde{\lambda} + \rho \mathbf{c}(\mathbf{x})\right]^\top \frac{\partial \mathbf{c}}{\partial \mathbf{x}} \\ = & \mathbf{0} \end{aligned} ==∂x∂f+λ~⊤∂x∂c++ρc⊤(x)∂x∂c∂x∂f+[λ~+ρc(x)]⊤∂x∂c0 The term λ ~ + ρ c ( x ) \tilde{\lambda} + \rho c(\mathbf{x}) λ~+ρc(x) looks like a Lagrange multiplier. It pushes at the same direction as the Lagrange multiplier, and pushes up against the constraint.
ii. Repeat until convergence:
(1)
min
x
L
ρ
(
x
,
λ
~
)
\min_{\mathbf{x}} \quad \mathcal{L}_\rho(\mathbf{x}, \tilde{\lambda})
minxLρ(x,λ~)
(2)
λ
~
←
max
(
0
,
λ
~
+
ρ
c
(
x
)
)
\tilde{\lambda}\leftarrow \max(0, \tilde{\lambda} + \rho \mathbf{c}(\mathbf{x}))
λ~←max(0,λ~+ρc(x)) (clamping to guarantee
λ
~
≥
0
\tilde{\lambda} \geq 0
λ~≥0)
(3)
ρ
←
α
ρ
\rho \leftarrow \alpha \rho
ρ←αρ (
α
\alpha
α typically 10)
- Fixes ill-conditioning of penalty method
- Converges with finite ρ \rho ρ
- Works well on non-convex problems
Example: Quadratic program
min x 1 2 x ⊤ Q x + q ⊤ x ( Q ≻ 0 ) s.t. A x ≤ b , C x = d \begin{aligned} &\min_{\mathbf{x}} \quad \frac{1}{2} \mathbf{x}^\top \mathbf{Q} \mathbf{x} + \mathbf{q}^\top \mathbf{x} \quad (\mathbf{Q} \succ 0) \\ &\text{s.t.} \quad \mathbf{A}\mathbf{x} \leq \mathbf{b}, \mathbf{C}\mathbf{x} = \mathbf{d} \end{aligned} xmin21x⊤Qx+q⊤x(Q≻0)s.t.Ax≤b,Cx=d
- Super useful in control
- Can be solved very fast (~kHz)
Parameters: Q = [ 0.5 0 0 1 ] \mathbf{Q} = \begin{bmatrix} 0.5 & 0 \\ 0 & 1 \end{bmatrix} Q=[0.5001], q = 0 \mathbf{q} = \mathbf{0} q=0, A = [ 1 1 ] \mathbf{A} = \begin{bmatrix} 1 & 1 \end{bmatrix} A=[11], b = − 1 \mathbf{b} = -1 b=−1, C = 0 \mathbf{C} = \mathbf{0} C=0, d = 0 \mathbf{d} = \mathbf{0} d=0.
Plot the objective and constraints on 2D plane:
After 3 iterations:
Without
ρ
\rho
ρ update (slow convergence):
Example:
- Try with penalty method, full Augmented Lagrangian, and just λ \lambda λ update
3. Regularization & Duality
(1) Equality constraints
Given
min
x
f
(
x
)
s.t.
c
(
x
)
=
0
\begin{aligned} &\min_{\mathbf{x}} \quad f(\mathbf{x}) \\ &\text{s.t.} \quad \mathbf{c}(\mathbf{x}) = \mathbf{0} \end{aligned}
xminf(x)s.t.c(x)=0
We might like to turn this into:
min
x
f
(
x
)
+
p
∞
(
c
(
x
)
)
,
p
∞
(
x
)
=
{
0
x
=
0
+
∞
otherwise
\min_{\mathbf{x}} \quad f(\mathbf{x}) + p_\infty(\mathbf{c}(\mathbf{x})),\quad p_\infty(\mathbf{x}) = \left\{ \begin{array}{ll} 0 & \mathbf{x} = \mathbf{0} \\ +\infty & \text{otherwise} \end{array} \right.
xminf(x)+p∞(c(x)),p∞(x)={0+∞x=0otherwise
Practically it’s terrible, but we can get the same effect solving:
min
x
max
λ
f
(
x
)
+
λ
⊤
c
(
x
)
\min_{\mathbf{x}} \max_{\lambda} \quad f(\mathbf{x}) + \lambda^\top \mathbf{c}(\mathbf{x})
xminλmaxf(x)+λ⊤c(x)
If the constraints cannot be satisfied ( c ( x ) ≠ 0 \mathbf{c}(\mathbf{x}) \neq \mathbf{0} c(x)=0), then the cost is infinite.
The max \max max acts like an infinite penalty.
(2) Inequality constraints
The above trick is also similar for inequality constraints:
min
x
f
(
x
)
s.t.
c
(
x
)
≤
0
\begin{aligned} &\min_{\mathbf{x}} \quad f(\mathbf{x}) \\ &\text{s.t.} \quad \mathbf{c}(\mathbf{x}) \leq \mathbf{0} \end{aligned}
xminf(x)s.t.c(x)≤0
⇒ min x f ( x ) + p ∞ ( c ( x ) ) , p ∞ ( x ) = { 0 x ≤ 0 + ∞ otherwise \Rightarrow \min_{\mathbf{x}} \quad f(\mathbf{x}) + p_\infty(\mathbf{c}(\mathbf{x})),\quad p_\infty(\mathbf{x}) = \left\{ \begin{array}{ll} 0 & \mathbf{x} \leq \mathbf{0} \\ +\infty & \text{otherwise} \end{array} \right. ⇒xminf(x)+p∞(c(x)),p∞(x)={0+∞x≤0otherwise
Apply the “min-max” trick:
⇒
min
x
max
λ
≥
0
f
(
x
)
+
λ
⊤
c
(
x
)
\Rightarrow \min_{\mathbf{x}} \max_{\lambda \geq \mathbf{0}} \quad f(\mathbf{x}) + \lambda^\top \mathbf{c}(\mathbf{x})
⇒xminλ≥0maxf(x)+λ⊤c(x)
We denote
L
(
x
,
λ
)
=
f
(
x
)
+
λ
⊤
c
(
x
)
\mathcal{L}(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda^\top \mathbf{c}(\mathbf{x})
L(x,λ)=f(x)+λ⊤c(x)
The solution is a saddle point in ( x , λ ) \left(\mathbf{x}, \lambda\right) (x,λ) space.
Consider the eigenvalues of the KKT matrix: The eigenvalues related to x \mathbf{x} x are all positive, and the eigenvalues related to λ \lambda λ should be all negative.
(3) Interpretation
- KKT conditions define a saddle point in ( x , λ ) \left(\mathbf{x}, \lambda\right) (x,λ) space.
- KKT system should have dim ( x ) \dim(\mathbf{x}) dim(x) positive eigenvalues and dim ( λ ) \dim(\lambda) dim(λ) negative eigenvalues at an optimum. It’s called a quasi-definite system.
- Regularize the KKT matrix as follows:
[ H + β I c ⊤ c − β I ] [ Δ x Δ λ ] = [ − ∇ x L − c ( x ) ] , β > 0 \begin{bmatrix} H+\beta I & c^\top \\ c & -\beta I \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta \lambda \end{bmatrix} = \begin{bmatrix} -\nabla_\mathbf{x} \mathcal{L} \\ -\mathbf{c}\left(\mathbf{x}\right) \end{bmatrix}, \quad \beta > 0 [H+βIcc⊤−βI][ΔxΔλ]=[−∇xL−c(x)],β>0
(4) Duality
Question: Can we swap the
min
\min
min and
max
\max
max?
Answer: If the problem is convex, yes! (Strong duality)
- For convex problems, the optimal value of the primal problem is equal to the optimal value of the dual problem.
- For non-convex problems, the optimal value of the primal problem is less than or equal to the optimal value of the dual problem. The difference is called the duality gap.
(5) Example
Without regularization, the solution will not converge to the optimal solution:
Check the eigenvalues of the KKT matrix:
H = ∇2f(xguess[:,end]) + ForwardDiff.jacobian(xn -> ∂c(xn)'*λguess[end], xguess[:,end])
C = ∂c(xguess[:,end])
K = [H C'; C 0]
eigvals(K)
We get:
3-element Vector{Float64}:
-2.7504046059655027
-0.5914642540616377
1.6230587398907093
where two eigenvalues are negative and one is positive.
To fix this, we add regularization to the KKT matrix:
function regularized_newton_step(x,λ)
β = 1.0
H = ∇2f(x) + ForwardDiff.jacobian(xn -> ∂c(xn)'*λ, x)
C = ∂c(x)
K = [H C'; C 0]
e = eigvals(K)
while !(sum(e .> 0) == length(x) && sum(e .< 0) == length(λ)) # check if the KKT matrix is quasi-definite
K = K + Diagonal([β*ones(length(x)); -β*ones(length(λ))]) # add regularization
e = eigvals(K)
end
Δz = K$$-∇f(x)-C'*λ; -c(x)]
Δx = Δz[1:2]
Δλ = Δz[3]
return Δx, Δλ
end
The solution converges to the optimal solution in 6 iterations:
- still overshoot (the 4th iteration) ⇒ \Rightarrow ⇒ use line search
4. Merit function
(1) Question: How do we do a line search on a root-finding problem?
⇒ \Rightarrow ⇒ We find x ∗ \mathbf{x}^* x∗ such that c ( x ∗ ) = 0 \mathbf{c}(\mathbf{x}^*) = \mathbf{0} c(x∗)=0.
We define a scalar merit function p ( x ) p\left(\mathbf{x}\right) p(x) that measures the distance to solution.
Here are some standard choices:
- L 2 L_2 L2 norm: p ( x ) = 1 2 c ⊤ ( x ) c ( x ) = 1 2 ∥ c ( x ) ∥ 2 2 p\left(\mathbf{x}\right) = \frac{1}{2}\mathbf{c}^\top(\mathbf{x})\mathbf{c}(\mathbf{x}) = \frac{1}{2}\|\mathbf{c}(\mathbf{x})\|_2^2 p(x)=21c⊤(x)c(x)=21∥c(x)∥22
- L 1 L_1 L1 norm: p ( x ) = ∥ c ( x ) ∥ 1 p\left(\mathbf{x}\right) = \|\mathbf{c}(\mathbf{x})\|_1 p(x)=∥c(x)∥1 (any norm works)
Now just do Armijo line search on
p
(
x
)
p\left(\mathbf{x}\right)
p(x):
α
=
1
\alpha = 1
α=1 (step length)
while
p
(
x
+
α
Δ
x
)
>
p
(
x
)
+
b
α
∇
p
(
x
)
⊤
Δ
x
p\left(\mathbf{x} + \alpha \Delta \mathbf{x}\right) > p\left(\mathbf{x}\right) + b \alpha \nabla p\left(\mathbf{x}\right)^\top \Delta \mathbf{x}
p(x+αΔx)>p(x)+bα∇p(x)⊤Δx
\quad
α
←
θ
α
\alpha \leftarrow \theta \alpha
α←θα (
θ
∈
(
0
,
1
)
\theta \in (0,1)
θ∈(0,1))
end
x
←
x
+
α
Δ
x
\mathbf{x} \leftarrow \mathbf{x} + \alpha \Delta \mathbf{x}
x←x+αΔx
(2) Question: How about constrained minimization?
min x f ( x ) s.t. c ( x ) ≤ 0 , d ( x ) = 0 \begin{aligned} &\min_{\mathbf{x}} \quad f(\mathbf{x}) \\ &\text{s.t.} \quad \mathbf{c}(\mathbf{x}) \leq \mathbf{0}, \mathbf{d}(\mathbf{x}) = \mathbf{0} \end{aligned} xminf(x)s.t.c(x)≤0,d(x)=0
⇒ min x L ( x , λ , μ ) = f ( x ) + λ ⊤ c ( x ) + μ ⊤ d ( x ) \Rightarrow \min_{\mathbf{x}} \quad \mathcal{L}(\mathbf{x}, \lambda, \mu) = f(\mathbf{x}) + \lambda^\top \mathbf{c}(\mathbf{x}) + \mu^\top \mathbf{d}(\mathbf{x}) ⇒xminL(x,λ,μ)=f(x)+λ⊤c(x)+μ⊤d(x)
Pick a merit function:
p
(
x
,
λ
,
μ
)
=
1
2
∥
r
(
x
,
λ
,
μ
)
∥
2
2
,
r
(
x
,
λ
,
μ
)
=
[
∇
x
L
(
x
,
λ
,
μ
)
min
(
0
,
c
(
x
)
)
d
(
x
)
]
p(\mathbf{x}, \lambda, \mu) = \frac{1}{2}\|\mathbf{r}(\mathbf{x}, \lambda, \mu)\|_2^2, \quad \mathbf{r}(\mathbf{x}, \lambda, \mu) = \begin{bmatrix} \nabla_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \lambda, \mu) \\ \min(0, \mathbf{c}(\mathbf{x})) \\ \mathbf{d}(\mathbf{x}) \end{bmatrix}
p(x,λ,μ)=21∥r(x,λ,μ)∥22,r(x,λ,μ)=
∇xL(x,λ,μ)min(0,c(x))d(x)
(
r
(
x
,
λ
,
μ
)
\mathbf{r}(\mathbf{x}, \lambda, \mu)
r(x,λ,μ) is called the KKT residual)
or
p
(
x
,
λ
,
μ
)
=
f
(
x
)
+
ρ
∣
∣
[
min
(
0
,
c
(
x
)
)
d
(
x
)
]
∣
∣
1
p(\mathbf{x}, \lambda, \mu) = f(\mathbf{x}) + \rho \left|\left| \begin{bmatrix} \min(0, \mathbf{c}(\mathbf{x})) \\ \mathbf{d}(\mathbf{x}) \end{bmatrix} \right|\right|_1
p(x,λ,μ)=f(x)+ρ
[min(0,c(x))d(x)]
1
or the Augmented Lagrangian:
p
(
x
,
λ
,
μ
)
=
L
ρ
(
x
,
λ
,
μ
)
p(\mathbf{x}, \lambda, \mu) = \mathcal{L}_{\rho}(\mathbf{x}, \lambda, \mu)
p(x,λ,μ)=Lρ(x,λ,μ)
(3) Example
See the jupyter notebook.