【Optimal Control (CMU 16-745)】Lecture 4 Optimization Part 2

Review:

  • Root finding
  • Newton’s method
  • Minimization
  • Regularization/Damped Newton’s method

Lecture 4 Optimization Pt.2

Overview

  • Line search (fixes the Newton “over-shoot” problem; trust-region methods are an alternative fix)
  • Constrained minimization

1. Line Search

Motivation:

  • The step $\Delta \mathbf{x}$ from Newton’s method may overshoot the minimum.
  • To fix this, check $f(\mathbf{x} + \Delta \mathbf{x})$ and “backtrack” until we get a “good” reduction in $f$.
(1) Armijo Rule

There are many strategies for this, but we will focus on the Armijo rule which is simple and effective.

$\alpha = 1$ (step length)
while $f(\mathbf{x} + \alpha \Delta \mathbf{x}) > f(\mathbf{x}) + b\,\alpha\, \nabla f(\mathbf{x})^\top \Delta \mathbf{x}$
$\qquad \alpha \leftarrow c\,\alpha$, with $c \in (0, 1)$
end

$b \in (0, 1)$ is a tolerance.
$b\,\alpha\, \nabla f(\mathbf{x})^\top \Delta \mathbf{x}$ is the expected reduction predicted by the gradient (the linearization of $f$).

(2) Intuition
  • Make sure the step agrees with the linearization of $f$ to within some tolerance $b$.
  • Typical values: $b = 10^{-4}$ to $10^{-1}$, $c = 1/2$.
(3) Example
using LinearAlgebra  # for I and isposdef

function backtracking_regularized_newton_step(x0)
    b = 0.1    # Armijo tolerance
    c = 0.5    # backtracking factor
    β = 1.0    # regularization increment
    H = ∇2f(x0)
    # Regularize (damp) the Hessian until it is positive definite
    while !isposdef(H)
        H = H + β*I
    end
    Δx = -H\∇f(x0)

    # Armijo backtracking line search
    α = 1.0
    while f(x0 + α*Δx) > f(x0) + b*α*∇f(x0)'*Δx
        α = c*α
    end

    return x0 + α*Δx
end
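The function above assumes $f$, $\nabla f$, and $\nabla^2 f$ are already defined. A minimal way to try it out, using a made-up test objective and ForwardDiff for the derivatives (assumptions of mine, not the lecture’s setup):

using ForwardDiff

# Hypothetical double-well objective: its Hessian is indefinite near x₁ = 0,
# so both the regularization and the backtracking branches get exercised.
f(x)   = x[1]^4 - 2x[1]^2 + x[2]^2
∇f(x)  = ForwardDiff.gradient(f, x)
∇2f(x) = ForwardDiff.hessian(f, x)

# Run a few damped Newton iterations from a point near the saddle
function run_newton(x; iters = 10)
    for _ in 1:iters
        x = backtracking_regularized_newton_step(x)
    end
    return x
end

run_newton([0.1, 1.0])   # heads toward one of the local minima at (±1, 0)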


(4) Takeaway message
  • Newton’s method with these simple, cheap modifications (a globalization strategy) is extremely effective at finding local minima.

2. Equality Constraints

Given $f(\mathbf{x}): \mathbb{R}^n \rightarrow \mathbb{R}$ and $\mathbf{c}(\mathbf{x}): \mathbb{R}^n \rightarrow \mathbb{R}^m$, solve
$$\min_{\mathbf{x}} f(\mathbf{x}) \quad \text{s.t.} \quad \mathbf{c}(\mathbf{x}) = \mathbf{0}$$

(1) First-order necessary conditions

i. Need $\nabla f(\mathbf{x}) = \mathbf{0}$ in free directions.
ii. Need $\mathbf{c}(\mathbf{x}) = \mathbf{0}$.

Equivalent statement: any nonzero component of $\nabla f(\mathbf{x})$ must be normal to the constraint surface/manifold.

Explanation:
If $\nabla f(\mathbf{x})$ has a component that is tangent (not normal) to the constraint surface, then we can move along the surface in that direction to reduce $f(\mathbf{x})$, so the point cannot be a minimum.

(i) Lagrange multiplier

$$\Rightarrow \nabla f(\mathbf{x}) + \lambda \nabla c(\mathbf{x}) = \mathbf{0} \quad \text{for some } \lambda \in \mathbb{R}$$
($\lambda$ is the Lagrange multiplier, also called the dual variable.)
In other words, $\nabla f(\mathbf{x})$ and $\nabla c(\mathbf{x})$ are parallel.
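A quick worked example (mine, not from the lecture): minimize $f(\mathbf{x}) = x_1 + x_2$ subject to $c(\mathbf{x}) = x_1^2 + x_2^2 - 1 = 0$.
$$\nabla f = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad \nabla c = 2\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad \nabla f + \lambda \nabla c = \mathbf{0} \;\Rightarrow\; \mathbf{x} = -\frac{1}{2\lambda}\begin{bmatrix} 1 \\ 1 \end{bmatrix}.$$
Substituting into the constraint gives $\frac{1}{2\lambda^2} = 1$, so $\lambda = \frac{1}{\sqrt{2}}$ (the sign corresponding to the minimum) and $\mathbf{x}^\star = -\frac{1}{\sqrt{2}}(1, 1)$, where $\nabla f$ is indeed parallel to $\nabla c$.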

(ii) More general case

In general (in vector form):
$$\frac{\partial f}{\partial \mathbf{x}} + \lambda^\top \frac{\partial \mathbf{c}}{\partial \mathbf{x}} = \mathbf{0}, \quad \text{where } \lambda \in \mathbb{R}^m.$$

(iii) Lagrangian

Based on this gradient condition, we define the Lagrangian:
$$\mathcal{L}(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda^\top \mathbf{c}(\mathbf{x}).$$

It turns the constrained minimization problem into the unconstrained saddle-point problem
$$\min_{\mathbf{x}} \max_{\lambda} \mathcal{L}(\mathbf{x}, \lambda),$$
whose stationary points we seek.

Setting its gradients to zero gives the (first-order) KKT conditions:
$$\begin{aligned} \nabla_\mathbf{x} \mathcal{L}(\mathbf{x}, \lambda) &= \nabla f(\mathbf{x}) + \left(\frac{\partial \mathbf{c}}{\partial \mathbf{x}}\right)^\top \lambda = \mathbf{0} \\ \nabla_\lambda \mathcal{L}(\mathbf{x}, \lambda) &= \mathbf{c}(\mathbf{x}) = \mathbf{0}. \end{aligned}$$

So we can solve this as a root-finding problem with Newton’s method. Linearizing the KKT conditions:
$$\begin{aligned} \nabla_\mathbf{x} \mathcal{L}(\mathbf{x} + \Delta\mathbf{x}, \lambda + \Delta\lambda) &\approx \nabla_\mathbf{x} \mathcal{L}(\mathbf{x}, \lambda) + \frac{\partial^2 \mathcal{L}}{\partial \mathbf{x}^2} \Delta\mathbf{x} + \frac{\partial^2 \mathcal{L}}{\partial \mathbf{x}\, \partial \lambda} \Delta\lambda \\ &= \nabla_\mathbf{x} \mathcal{L}(\mathbf{x}, \lambda) + \frac{\partial^2 \mathcal{L}}{\partial \mathbf{x}^2} \Delta\mathbf{x} + \left(\frac{\partial \mathbf{c}}{\partial \mathbf{x}}\right)^\top \Delta\lambda = \mathbf{0} \end{aligned}$$

$$\nabla_\lambda \mathcal{L}(\mathbf{x} + \Delta\mathbf{x}, \lambda + \Delta\lambda) \approx \mathbf{c}(\mathbf{x}) + \frac{\partial \mathbf{c}}{\partial \mathbf{x}} \Delta\mathbf{x} = \mathbf{0}$$

The Newton step is:
$$\begin{bmatrix} \frac{\partial^2 \mathcal{L}}{\partial \mathbf{x}^2} & \left(\frac{\partial \mathbf{c}}{\partial \mathbf{x}}\right)^\top \\ \frac{\partial \mathbf{c}}{\partial \mathbf{x}} & \mathbf{0} \end{bmatrix} \begin{bmatrix} \Delta\mathbf{x} \\ \Delta\lambda \end{bmatrix} = \begin{bmatrix} -\nabla_\mathbf{x} \mathcal{L}(\mathbf{x}, \lambda) \\ -\mathbf{c}(\mathbf{x}) \end{bmatrix}$$

This linear system is called the KKT system. The matrix on the left is symmetric but indefinite (a symmetric indefinite system).
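Because the KKT matrix is symmetric indefinite, a plain Cholesky factorization does not apply, but the symmetry can still be exploited. A tiny illustration with made-up numbers (not the lecture’s problem), using Julia’s Bunch-Kaufman factorization:

using LinearAlgebra

Q = [2.0 0.0; 0.0 2.0]      # Hessian block (positive definite)
A = [1.0 1.0]               # constraint Jacobian ∂c/∂x
K = [Q A'; A 0.0]           # KKT matrix: symmetric but indefinite
rhs = [-1.0, -1.0, 0.5]

eigvals(Symmetric(K))                   # mixed signs ⇒ indefinite
Δz = bunchkaufman(Symmetric(K)) \ rhs   # symmetric-indefinite (LDLᵀ-type) solve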

3. Gauss-Newton Method

(1) Basic idea

$$\begin{aligned} \frac{\partial^2 \mathcal{L}}{\partial \mathbf{x}^2} &= \nabla^2 f(\mathbf{x}) + \frac{\partial}{\partial \mathbf{x}}\left[\left(\frac{\partial \mathbf{c}}{\partial \mathbf{x}}\right)^\top \lambda\right] \\ &\approx \nabla^2 f(\mathbf{x}) \end{aligned}$$

  • The second term involves the second derivatives of $\mathbf{c}$ (a third-order tensor contracted with $\lambda$) and is expensive to compute, so we often drop this constraint-curvature term. The result is the Gauss-Newton method.
  • Gauss-Newton has slightly slower convergence than Newton’s method (more iterations), but each iteration is much cheaper, so it often wins in wall-clock time.
(2) Example: comparison of Newton and Gauss-Newton
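The step functions below assume that the objective $f$, the constraint $\mathbf{c}$, and their derivatives ($\nabla f$, $\nabla^2 f$, $\partial \mathbf{c}/\partial \mathbf{x}$) are already defined. A minimal setup for experimenting, with a made-up quadratic objective and unit-circle constraint (the lecture’s actual test problem may differ):

using LinearAlgebra, ForwardDiff

# Hypothetical test problem: quadratic bowl centered outside the unit circle,
# minimized subject to a single equality constraint (the unit circle).
f(x)   = 0.5*(x[1] - 2.0)^2 + 0.5*x[2]^2
c(x)   = [x[1]^2 + x[2]^2 - 1.0]        # c(x) = 0  ⇔  unit circle
∇f(x)  = ForwardDiff.gradient(f, x)
∇2f(x) = ForwardDiff.hessian(f, x)
∂c(x)  = ForwardDiff.jacobian(c, x)     # 1×2 constraint Jacobian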
(i) Newton’s method
function newton_step(x0, λ0)
    # Full Newton: keep the constraint-curvature term in the Hessian of the Lagrangian
    H = ∇2f(x0) + ForwardDiff.jacobian(x -> ∂c(x)'*λ0, x0)
    C = ∂c(x0)
    # Solve the KKT system for the primal and dual steps
    Δz = [H C'; C 0]\[-∇f(x0)-C'*λ0; -c(x0)]
    Δx = Δz[1:2]
    Δλ = Δz[3]
    return x0+Δx, λ0+Δλ
end
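A small driver to iterate the step function, a sketch assuming the setup above:

function solve_kkt(step, x, λ; iters = 10)
    for _ in 1:iters
        x, λ = step(x, λ)   # works for newton_step and, later, gauss_newton_step
    end
    return x, λ
end

xN, λN = solve_kkt(newton_step, [-1.0, -1.0], 0.0)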

Starting from $(-1, -1)$:

Starting from $(-3, 2)$:

It does not converge to the minimum. Check the Hessian of the Lagrangian at the last iterate:

H = ∇2f(xguess[:,end]) + ForwardDiff.jacobian(x -> ∂c(x)'*λguess[end], xguess[:,end])

result:

2×2 Matrix{Float64}:
 -1.75818  0.0
  0.0      1.0

It has a negative eigenvalue, which produces a non-descent (wrong) search direction. The negative eigenvalue comes from the second (constraint-curvature) term, so with full Newton we must add some regularization. The Gauss-Newton method does not have this problem.

(ii) Gauss-Newton method
function gauss_newton_step(x0, λ0)
    H = ∇2f(x0)   # Gauss-Newton: drop the 2nd (constraint-curvature) tensor term
    C = ∂c(x0)
    # Solve the approximate KKT system
    Δz = [H C'; C 0]\[-∇f(x0)-C'*λ0; -c(x0)]
    Δx = Δz[1:2]
    Δλ = Δz[3]
    return x0+Δx, λ0+Δλ
end

Starting from $(-1, -1)$:

Starting from $(-3, 2)$:

It converges to the minimum in 8 iterations.

(3) Takeaway message
  • We may still need to regularize the Hessian $\frac{\partial^2 \mathcal{L}}{\partial \mathbf{x}^2}$ even if $\nabla^2 f(\mathbf{x}) \succ 0$.
  • The Gauss-Newton method is often used in practice.
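A hedged sketch of the first takeaway (my own, not code from the lecture): apply the same $\beta I$ bump used in the damped-Newton example to the Hessian block before forming the KKT system. More careful schemes check the inertia of the full KKT matrix instead.

function regularized_constrained_newton_step(x0, λ0; β = 1.0)
    # Full-Newton Hessian of the Lagrangian (includes constraint curvature)
    H = ∇2f(x0) + ForwardDiff.jacobian(x -> ∂c(x)'*λ0, x0)
    # Bump the Hessian block until it is positive definite
    while !isposdef(Symmetric(H))
        H = H + β*I
    end
    C = ∂c(x0)
    Δz = [H C'; C 0]\[-∇f(x0)-C'*λ0; -c(x0)]
    return x0 + Δz[1:2], λ0 + Δz[3]
end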

4. Inequality Constraints

$$\min_{\mathbf{x}} f(\mathbf{x}) \quad \text{s.t.} \quad \mathbf{c}(\mathbf{x}) \leq \mathbf{0}$$

  • We’ll look at just inequality constraints for now.
  • To handle both kinds of constraints, simply combine this with the previous (equality-constrained) methods.
(1) First-order necessary conditions

i. Need $\nabla f(\mathbf{x}) = \mathbf{0}$ in free directions.
ii. Need $\mathbf{c}(\mathbf{x}) \leq \mathbf{0}$.
(same as equality constraints)

(i) KKT conditions

$$\begin{aligned} \nabla f(\mathbf{x}) + \left(\frac{\partial \mathbf{c}}{\partial \mathbf{x}}\right)^\top \lambda &= \mathbf{0} && \text{(stationarity)} \\ \mathbf{c}(\mathbf{x}) &\leq \mathbf{0} && \text{(primal feasibility)} \\ \lambda &\geq \mathbf{0} && \text{(dual feasibility)} \\ \lambda \odot \mathbf{c}(\mathbf{x}) = \lambda^\top \mathbf{c}(\mathbf{x}) &= \mathbf{0} && \text{(complementary slackness)} \end{aligned}$$
for some $\lambda \in \mathbb{R}^m$.

(ii) Intuition of KKT conditions
  • If the constraint is active ($\mathbf{c}(\mathbf{x}) = \mathbf{0}$) $\Rightarrow$ $\lambda > 0$.
  • If the constraint is inactive ($\mathbf{c}(\mathbf{x}) < \mathbf{0}$) $\Rightarrow$ $\lambda = 0$.
(iii) Takeaway message
  • The complementary slackness condition acts like a switch: if the constraint is active, the switch is on ($\lambda > 0$); if the constraint is inactive, the switch is off ($\lambda = 0$).
  • There is also an edge case where the unconstrained minimum of the objective happens to lie exactly on the constraint surface, in which case $\mathbf{c}(\mathbf{x}) = \lambda = 0$.
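A tiny worked example (mine, not from the lecture) to see the switch: minimize $f(x) = \tfrac{1}{2}x^2$ subject to $c(x) = 1 - x \leq 0$ (i.e. $x \geq 1$).
$$x - \lambda = 0 \;\text{(stationarity)}, \quad 1 - x \leq 0, \quad \lambda \geq 0, \quad \lambda\,(1 - x) = 0.$$
If the constraint were inactive ($x > 1$), complementary slackness would force $\lambda = 0$, and stationarity would then give $x = 0$, a contradiction. So the constraint is active: $x^\star = 1$, $\lambda^\star = 1 > 0$, and the multiplier measures how hard the constraint “pushes back” against the objective.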