Review:
- Root finding
- Newton’s method
- Minimization
- Regularization/Damped Newton’s method
Lecture 4 Optimization Pt.2
Overview
- Line search (addresses the “overshoot” problem; trust-region methods can also address it)
- Constrained minimization
1. Line Search
Motivation:
- The $\Delta \mathbf{x}$ step from Newton’s method may overshoot the minimum.
- To fix this, check $f\left(\mathbf{x}+\Delta \mathbf{x}\right)$ and “backtrack” until we get a “good” reduction in $f$.
(1) Armijo Rule
There are many strategies for this, but we will focus on the Armijo rule, which is simple and effective.
$\alpha = 1$ (step length)
while $f\left(\mathbf{x}+\alpha \Delta \mathbf{x}\right) > f\left(\mathbf{x}\right) + b \alpha \nabla f\left(\mathbf{x}\right)^T \Delta \mathbf{x}$
$\quad \alpha \leftarrow c \alpha$ ($c \in \left(0,1\right)$)
end
$b \in \left(0,1\right)$ is a tolerance.
$b \alpha \nabla f\left(\mathbf{x}\right)^T \Delta \mathbf{x}$ is the reduction expected from the gradient (the linearization).
(2) Intuition
- Make sure the step agrees with the linearization to within some tolerance $b$.
- Typical values: $b = 10^{-4}$ to $10^{-1}$, $c = 1/2$.
(3) Example
using LinearAlgebra   # for the identity I used in regularization

# Assumes f, ∇f, ∇2f are defined elsewhere.
function backtracking_regularized_newton_step(x0)
    b = 0.1           # Armijo tolerance
    c = 0.5           # backtracking factor
    β = 1.0           # regularization increment
    # Regularize the Hessian until it is positive definite
    H = ∇2f(x0)
    while !isposdef(H)
        H = H + β*I
    end
    Δx = -H\∇f(x0)    # (regularized) Newton step
    # Backtracking line search (Armijo rule)
    α = 1.0
    while f(x0 + α*Δx) > f(x0) + b*α*∇f(x0)'*Δx
        α = c*α
    end
    return x0 + α*Δx
end
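As a quick usage sketch, the step can be iterated to a local minimum. The names f, ∇f, ∇2f match the assumptions of the function above; the Rosenbrock-style objective is an illustrative assumption, not the lecture's example:

    using ForwardDiff, LinearAlgebra

    # Illustrative objective (assumption): Rosenbrock function
    f(x) = (1.0 - x[1])^2 + 100.0*(x[2] - x[1]^2)^2
    ∇f(x)  = ForwardDiff.gradient(f, x)
    ∇2f(x) = ForwardDiff.hessian(f, x)

    x = [-1.0, 1.0]
    for k = 1:50
        global x = backtracking_regularized_newton_step(x)   # `global` needed in script scope
    end
    @show x    # should approach the minimizer [1.0, 1.0]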
(4) Takeaway message
- Newton’s method with simple and cheap modifications (a globalization strategy) is extremely effective at finding local minima.
2. Equality Constraints
$f\left(\mathbf{x}\right): \mathbb{R}^n \rightarrow \mathbb{R}$, $\mathbf{c}\left(\mathbf{x}\right): \mathbb{R}^n \rightarrow \mathbb{R}^m$.
$$\min_{\mathbf{x}} f\left(\mathbf{x}\right) \quad \text{s.t.} \quad \mathbf{c}\left(\mathbf{x}\right) = \mathbf{0}$$
(1) First-order necessary conditions
i. Need $\nabla f\left(\mathbf{x}\right) = \mathbf{0}$ in free directions.
ii. Need $\mathbf{c}\left(\mathbf{x}\right) = \mathbf{0}$.
Another statement: any non-zero component of $\nabla f\left(\mathbf{x}\right)$ must be normal to the constraint surface/manifold.
Explanation:
If $\nabla f\left(\mathbf{x}\right)$ has a component that is not normal to the constraint surface, then we can move along the constraint surface to reduce the value of $f\left(\mathbf{x}\right)$.
(i) Lagrange multiplier
$$\Rightarrow \nabla f\left(\mathbf{x}\right) + \lambda \nabla \mathbf{c}\left(\mathbf{x}\right) = \mathbf{0}$$
for some $\lambda \in \mathbb{R}$ (the Lagrange multiplier / dual variable).
In other words, $\nabla f\left(\mathbf{x}\right)$ and $\nabla \mathbf{c}\left(\mathbf{x}\right)$ are parallel.
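As a small worked example (illustrative, not from the lecture): minimize $f\left(\mathbf{x}\right) = x_1^2 + x_2^2$ subject to $c\left(\mathbf{x}\right) = x_1 + x_2 - 1 = 0$. The condition $\nabla f + \lambda \nabla c = \mathbf{0}$ gives
$$\begin{bmatrix} 2x_1 \\ 2x_2 \end{bmatrix} + \lambda \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \mathbf{0} \;\Rightarrow\; x_1 = x_2 = -\frac{\lambda}{2},$$
and combining with $c\left(\mathbf{x}\right) = 0$ yields $x_1 = x_2 = \tfrac{1}{2}$, $\lambda = -1$. At the solution, $\nabla f = \left[1,\,1\right]^T$ is indeed parallel to $\nabla c = \left[1,\,1\right]^T$.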
(ii) More general case
In general (in vector form):
$$\frac{\partial f}{\partial \mathbf{x}} + \lambda^\top \frac{\partial \mathbf{c}}{\partial \mathbf{x}} = \mathbf{0},$$
where $\lambda \in \mathbb{R}^m$.
(iii) Lagrangian
Based on this gradient condition, we define the Lagrangian:
$$\mathcal{L}\left(\mathbf{x}, \lambda\right) = f\left(\mathbf{x}\right) + \lambda^\top \mathbf{c}\left(\mathbf{x}\right).$$
It turns the constrained minimization problem into an unconstrained one:
$$\min_{\mathbf{x}} \mathcal{L}\left(\mathbf{x}, \lambda\right).$$
Its gradients are (also called KKT conditions):
$$\begin{align*} \nabla_\mathbf{x} \mathcal{L}\left(\mathbf{x}, \lambda\right) &= \nabla f\left(\mathbf{x}\right) + \left(\frac{\partial \mathbf{c}}{\partial \mathbf{x}}\right)^\top \lambda = \mathbf{0} \\ \nabla_\lambda \mathcal{L}\left(\mathbf{x}, \lambda\right) &= \mathbf{c}\left(\mathbf{x}\right) = \mathbf{0}. \end{align*}$$
So we can solve this with Newton’s method (it is a root-finding problem).
$$\begin{align*} \nabla_\mathbf{x} \mathcal{L}\left(\mathbf{x}+\Delta \mathbf{x}, \lambda + \Delta \lambda\right) &\approx \nabla_\mathbf{x} \mathcal{L}\left(\mathbf{x}, \lambda\right) + \frac{\partial^2 \mathcal{L}}{\partial \mathbf{x}^2} \Delta \mathbf{x} + \frac{\partial^2 \mathcal{L}}{\partial \mathbf{x} \partial \lambda} \Delta \lambda = \mathbf{0} \\ &= \nabla_\mathbf{x} \mathcal{L}\left(\mathbf{x}, \lambda\right) + \frac{\partial^2 \mathcal{L}}{\partial \mathbf{x}^2} \Delta \mathbf{x} + \left(\frac{\partial \mathbf{c}}{\partial \mathbf{x}}\right)^\top \Delta \lambda = \mathbf{0} \end{align*}$$
$$\nabla_\lambda \mathcal{L}\left(\mathbf{x}+\Delta \mathbf{x}, \lambda + \Delta \lambda\right) \approx \mathbf{c}\left(\mathbf{x}\right) + \frac{\partial \mathbf{c}}{\partial \mathbf{x}} \Delta \mathbf{x} = \mathbf{0}$$
The Newton step is:
$$\begin{bmatrix} \frac{\partial^2 \mathcal{L}}{\partial \mathbf{x}^2} & \left(\frac{\partial \mathbf{c}}{\partial \mathbf{x}}\right)^\top \\ \frac{\partial \mathbf{c}}{\partial \mathbf{x}} & \mathbf{0} \end{bmatrix} \begin{bmatrix} \Delta \mathbf{x} \\ \Delta \lambda \end{bmatrix} = \begin{bmatrix} -\nabla_\mathbf{x} \mathcal{L}\left(\mathbf{x}, \lambda\right) \\ -\mathbf{c}\left(\mathbf{x}\right) \end{bmatrix}$$
This equation is called the KKT system; its coefficient matrix is symmetric but indefinite (a symmetric indefinite system).
3. Gauss-Newton Method
(1) Basic idea
$$\begin{align*} \frac{\partial^2 \mathcal{L}}{\partial \mathbf{x}^2} &= \nabla^2 f\left(\mathbf{x}\right) + \frac{\partial}{\partial \mathbf{x}}\left[\left(\frac{\partial \mathbf{c}}{\partial \mathbf{x}}\right)^\top \lambda\right] \\ &\approx \nabla^2 f\left(\mathbf{x}\right) \end{align*}$$
- The second term requires the second derivatives of $\mathbf{c}$ (a third-order tensor) and is expensive to compute, so we often drop this constraint-curvature term. The result is called the Gauss-Newton method.
- This method converges somewhat more slowly than Newton’s method (more iterations), but each iteration is much cheaper, so it often wins in wall-clock time.
(2) Example: comparison of Newton and Gauss-Newton
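The step functions below assume that f, ∇f, ∇2f, c, and ∂c are already defined. Here is a minimal sketch of such definitions; the particular objective and constraint chosen are illustrative assumptions, not necessarily those used in the lecture:

    using ForwardDiff, LinearAlgebra

    # Illustrative problem (assumption): quadratic objective with one circle constraint.
    f(x) = x[1]^2 + 2.0*x[2]^2
    c(x) = [x[1]^2 + x[2]^2 - 1.0]        # m = 1 constraint, returned as a vector

    ∇f(x)  = ForwardDiff.gradient(f, x)
    ∇2f(x) = ForwardDiff.hessian(f, x)
    ∂c(x)  = ForwardDiff.jacobian(c, x)   # 1×2 constraint Jacobian

With these in place, the step functions can be iterated from an initial guess, e.g. x0 = [-1.0, -1.0] and λ0 = 0.0.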
(i) Newton’s method
function newton_step(x0, λ0)
    # Full Hessian of the Lagrangian (includes the constraint-curvature term)
    H = ∇2f(x0) + ForwardDiff.jacobian(x -> ∂c(x)'*λ0, x0)
    C = ∂c(x0)
    # Solve the KKT system for the step
    Δz = [H C'; C 0] \ [-∇f(x0) - C'*λ0; -c(x0)]
    Δx = Δz[1:2]
    Δλ = Δz[3]
    return x0 + Δx, λ0 + Δλ
end
- Starting from $\left(-1,-1\right)$ (see plot of the iterates).
- Starting from $\left(-3,2\right)$: it does not converge to the minimum (see plot of the iterates).
Check the Hessian of the Lagrangian at the final iterate:
H = ∇2f(xguess[:,end]) + ForwardDiff.jacobian(x -> ∂c(x)'*λguess[end], xguess[:,end])
result:
2×2 Matrix{Float64}:
-1.75818 0.0
0.0 1.0
It has a negative eigenvalue, which produces a bad (non-descent) step direction. This comes from the second (constraint-curvature) term, so we would need to regularize during Newton’s method. The Gauss-Newton method does not have this problem.
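A minimal sketch of one way to handle this, reusing the shift-by-identity loop from the line-search example above (the increment β = 1.0 is an illustrative choice):

    using LinearAlgebra

    # Shift the Lagrangian Hessian block until it is positive definite,
    # then use it in the KKT system as before.
    function regularize(H; β = 1.0)
        while !isposdef(H)
            H = H + β*I
        end
        return H
    end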
(ii) Gauss-Newton method
function gauss_newton_step(x0, λ0)
    H = ∇2f(x0)      # drop the 2nd (constraint-curvature) term
    C = ∂c(x0)
    # Solve the (Gauss-Newton) KKT system for the step
    Δz = [H C'; C 0] \ [-∇f(x0) - C'*λ0; -c(x0)]
    Δx = Δz[1:2]
    Δλ = Δz[3]
    return x0 + Δx, λ0 + Δλ
end
- Starting from $\left(-1,-1\right)$ (see plot of the iterates).
- Starting from $\left(-3,2\right)$: it converges to the minimum in 8 iterations (see plot of the iterates).
(3) Takeaway message
- May still need to regularize the Hessian $\frac{\partial^2 \mathcal{L}}{\partial \mathbf{x}^2}$ even if $\nabla^2 f\left(\mathbf{x}\right) \succ 0$.
- The Gauss-Newton method is often used in practice.
4. Inequality Constraints
$$\min_{\mathbf{x}} f\left(\mathbf{x}\right) \quad \text{s.t.} \quad \mathbf{c}\left(\mathbf{x}\right) \leq \mathbf{0}$$
- We’ll look at just inequality constraints for now.
- Just combine with previous methods to handle both kinds of constraints.
(1) First-order necessary conditions
i. Need $\nabla f\left(\mathbf{x}\right) = \mathbf{0}$ in free directions.
ii. Need $\mathbf{c}\left(\mathbf{x}\right) \leq \mathbf{0}$.
(same as equality constraints)
(i) KKT conditions
$$\begin{align*} \nabla f\left(\mathbf{x}\right) + \left(\frac{\partial \mathbf{c}}{\partial \mathbf{x}}\right)^\top \lambda = \mathbf{0} & \quad (\text{stationarity}) \\ \mathbf{c}\left(\mathbf{x}\right) \leq \mathbf{0} & \quad (\text{primal feasibility}) \\ \lambda \geq \mathbf{0} & \quad (\text{dual feasibility}) \\ \lambda \odot \mathbf{c}\left(\mathbf{x}\right) = \lambda^\top \mathbf{c}\left(\mathbf{x}\right) = \mathbf{0} & \quad (\text{complementary slackness}) \end{align*}$$
for some $\lambda \in \mathbb{R}^m$.
(ii) Intuition of KKT conditions
- If the constraint is active ($\mathbf{c}\left(\mathbf{x}\right) = \mathbf{0}$) $\Rightarrow$ $\lambda > 0$.
- If the constraint is inactive ($\mathbf{c}\left(\mathbf{x}\right) < \mathbf{0}$) $\Rightarrow$ $\lambda = 0$.
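A small worked example (illustrative, not from the lecture): minimize $f\left(x\right) = x^2$ subject to $c\left(x\right) = 1 - x \leq 0$ (i.e. $x \geq 1$). The KKT conditions are
$$2x - \lambda = 0, \quad 1 - x \leq 0, \quad \lambda \geq 0, \quad \lambda \left(1 - x\right) = 0.$$
The unconstrained minimizer $x = 0$ is infeasible, so the constraint is active: $x = 1$, $\lambda = 2 > 0$ (the switch is on). If instead the constraint were $c\left(x\right) = x - 2 \leq 0$, the unconstrained minimizer $x = 0$ would already satisfy it, and the conditions hold with $\lambda = 0$ (the switch is off).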
(iii) Takeaway message
- The complementary slackness condition acts like a switch: if the constraint is active, the switch is on ($\lambda > 0$); if the constraint is inactive, the switch is off ($\lambda = 0$).
- There is also an edge case when the minimum of the objective happens to lie exactly on the constraint boundary, in which case $\mathbf{c}\left(\mathbf{x}\right) = \mathbf{0}$ and $\lambda = 0$ simultaneously.