【Optimal Control (CMU 16-745)】Lecture 3 Optimization Part 1

Review:

  • Discrete-time dynamics/simulations
  • Stability of discrete-time systems
  • Forward/backward Euler methods
  • RK4 method
  • Zero/first order hold on controls

Lecture 3: Optimization Part 1

Overview

  • Notation
  • Root finding
  • Minimization

1. Notation

(1) Scalar function $f(x): \mathbb{R}^n \rightarrow \mathbb{R}$

$\frac{\partial f}{\partial x} = \left[\frac{\partial f}{\partial x_1} \cdots \frac{\partial f}{\partial x_n}\right] \in \mathbb{R}^{1\times n}$ is a row vector.

This is because $\frac{\partial f}{\partial x}$ is the linear operator mapping $\Delta x$ into $\Delta f$:
$$f(x + \Delta x) \approx f(x) + \frac{\partial f}{\partial x} \Delta x$$

(2) Vector function $g(y): \mathbb{R}^m \rightarrow \mathbb{R}^n$

Similarly, we have
$$\frac{\partial g}{\partial y} \in \mathbb{R}^{n\times m},$$ because
$$g(y + \Delta y) \approx g(y) + \frac{\partial g}{\partial y} \Delta y.$$

(3) Chain rule

The above conventions make the chain rule work:
$$f\left(g(y+\Delta y)\right) \approx f\left(g(y)\right) + \left.\frac{\partial f}{\partial x}\right|_{g(y)} \cdot \left.\frac{\partial g}{\partial y}\right|_{y} \cdot \Delta y$$

For convenience, we will define the gradient as
$$\nabla f(x) = \left(\frac{\partial f}{\partial x}\right)^T \in \mathbb{R}^{n\times 1}$$
(a column vector) and the Hessian as
$$\nabla^2 f(x) = \frac{\partial}{\partial x}\left(\nabla f(x)\right) = \frac{\partial^2 f}{\partial x^2} \in \mathbb{R}^{n\times n}$$
(a symmetric matrix).
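As a quick numerical sanity check of these conventions, here is a minimal sketch assuming the ForwardDiff.jl package; the functions `f` and `g` are arbitrary examples, not from the lecture:

```julia
using ForwardDiff, LinearAlgebra

# Arbitrary example functions for illustration
f(x) = sin(x[1]) + x[2]^2          # f: R^2 -> R
g(y) = [y[1] * y[2], y[1] + y[2]]  # g: R^2 -> R^2

y  = [1.0, 2.0]
Δy = 1e-5 * randn(2)

# ∂f/∂x is a row vector (1×n); ForwardDiff.gradient returns its transpose, so we take the adjoint
dfdx(x) = ForwardDiff.gradient(f, x)'   # 1×2 row vector
dgdy(y) = ForwardDiff.jacobian(g, y)    # 2×2 Jacobian

# Chain rule: d(f∘g)/dy = ∂f/∂x|_{g(y)} · ∂g/∂y|_y
lhs = f(g(y + Δy)) - f(g(y))
rhs = dfdx(g(y)) * dgdy(y) * Δy
@show lhs rhs   # the two agree to first order in Δy
```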

2. Root finding

(1)

Given $f(x)$, find $x^*$ such that $f(x^*) = 0$.

Example:

an equilibrium point of continuous-time dynamics $\dot{x} = f(x)$, where $f(x^*) = 0$

(2)

Given $f(x)$, find $x^*$ such that $f(x^*) = x^*$.

Example:

an equilibrium point (fixed point) of discrete-time dynamics $x_{k+1} = f(x_k)$, where $f(x^*) = x^*$

(3) Method 1: Fixed-point iteration
  • Simplest solution method
  • If the fixed point is stable, just iterate the dynamics until convergence (a minimal sketch follows this list)
  • Only works for stable fixed points, and only finds the equilibrium whose basin of attraction contains the initial guess
  • Convergence is slow (linear)
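A minimal sketch of fixed-point iteration; the contractive linear map used as an example is arbitrary, not from the lecture:

```julia
using LinearAlgebra

# Fixed-point iteration: repeatedly apply the discrete dynamics x <- f(x)
# and stop once the iterates stop moving.
function fixed_point_iteration(f, x0; tol = 1e-10, maxiter = 1_000)
    x = x0
    for _ in 1:maxiter
        xnew = f(x)
        norm(xnew - x) < tol && return xnew
        x = xnew
    end
    return x   # did not converge within maxiter
end

# Example: a contractive linear map with a stable fixed point at the origin
A = [0.9 0.1; 0.0 0.8]
xstar = fixed_point_iteration(x -> A * x, [1.0, 2.0])
```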
(4) Method 2: Newton’s method

I. Fit a linear approximation to $f(x)$:
$$f(x + \Delta x) \approx f(x) + \left.\frac{\partial f}{\partial x}\right|_{x} \Delta x$$

II. Set the approximation to zero and solve for $\Delta x$:
$$f(x) + \left.\frac{\partial f}{\partial x}\right|_{x} \Delta x = 0 \Rightarrow \Delta x = -\left(\left.\frac{\partial f}{\partial x}\right|_{x}\right)^{-1} f(x)$$

III. Apply the correction:
$$x \leftarrow x + \Delta x$$

IV. Repeat until convergence
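A minimal sketch of these four steps, assuming ForwardDiff.jl for the Jacobian; the function name `newton_root` and the test system are illustrative:

```julia
using ForwardDiff, LinearAlgebra

# Newton's method for solving f(x) = 0
function newton_root(f, x0; tol = 1e-12, maxiter = 50)
    x = x0
    for _ in 1:maxiter
        r = f(x)
        norm(r) < tol && break
        J = ForwardDiff.jacobian(f, x)   # step I: ∂f/∂x at the current iterate
        Δx = -(J \ r)                    # step II: solve the linearized system
        x = x + Δx                       # step III: apply the correction
    end
    return x
end

# Example: x₁² + x₂² = 2 and x₁ = x₂ (roots at [1, 1] and [-1, -1])
f(x) = [x[1]^2 + x[2]^2 - 2, x[1] - x[2]]
newton_root(f, [2.0, 0.5])
```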

Example: Backward Euler

Backward Euler is implicit: each step requires solving $x_{k+1} = x_k + h\, f(x_{k+1})$ for $x_{k+1}$, which is exactly a root-finding problem that can be handled by either fixed-point iteration or Newton's method.

[Figure: simulated motion using backward Euler, with the implicit step solved by fixed-point iteration vs. Newton's method]
Both methods produce the damped motion, but Newton's method is much faster.
Plot the error during the iteration:
[Figure: solver error vs. iteration for fixed-point iteration and Newton's method]
Very fast convergence with Newton’s method.
The convergence rate of fixed-point iteration is linear, while Newton’s method is quadratic.

The error at each Newton iteration, reaching machine precision within three iterations:

3-element Vector{Float64}:
0.09793658173053843
3.7830087232931797e-6
5.2874553670659e-15
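A minimal sketch of a single backward Euler step solved with Newton's method; the pendulum model, step size, and function names are illustrative, not the lecture's exact code:

```julia
using ForwardDiff, LinearAlgebra

# Continuous-time pendulum dynamics, state x = [θ, θ̇]
pendulum(x) = [x[2], -9.81 * sin(x[1])]

# One backward Euler step: solve r(x_next) = x_next - x - h*f(x_next) = 0 with Newton
function backward_euler_step(f, x, h; tol = 1e-12, maxiter = 20)
    r(z) = z - x - h * f(z)                # implicit-step residual
    xn = x + h * f(x)                      # initial guess: one forward Euler step
    for _ in 1:maxiter
        norm(r(xn)) < tol && break
        J = ForwardDiff.jacobian(r, xn)
        xn = xn - J \ r(xn)                # Newton update
    end
    return xn
end

x = [0.5, 0.0]                             # initial angle 0.5 rad, at rest
xnext = backward_euler_step(pendulum, x, 0.1)
```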

(5) Takeaway message
  • Quadratic convergence (Newton's method)
  • Can get machine precision
  • Most expensive part: solving the linear system (computing the Jacobian is not the dominant cost; factorizing/inverting it is, at $\mathcal{O}(n^3)$ for an $n \times n$ matrix)
  • Can improve complexity by taking advantage of problem structure, e.g. sparse Jacobians in many cases (more later); see the sketch below
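As an illustration of the last point, a minimal sketch comparing a dense and a sparse solve for the Newton step; the tridiagonal "Jacobian" here is a made-up stand-in:

```julia
using LinearAlgebra, SparseArrays

n = 1000
# A tridiagonal stand-in for a sparse Jacobian, and its dense copy
Jsparse = spdiagm(-1 => fill(-1.0, n - 1), 0 => fill(2.0, n), 1 => fill(-1.0, n - 1))
Jdense  = Matrix(Jsparse)
r = randn(n)

# Both give the same Newton step, but the sparse factorization exploits
# the structure instead of paying the dense O(n^3) factorization cost.
Δx_dense  = -(Jdense \ r)
Δx_sparse = -(Jsparse \ r)
@assert Δx_dense ≈ Δx_sparse
```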

3. Minimization

$$\min_{x} f(x), \quad f(x): \mathbb{R}^n \rightarrow \mathbb{R}$$

If $f$ is smooth, then $\left.\frac{\partial f}{\partial x}\right|_{x^*} = 0$ at a local minimum.
Thus, we can transform the minimization problem into the root-finding problem $\nabla f(x) = 0$.

$\Rightarrow$ Apply Newton's method:
$$\begin{aligned} \nabla f(x + \Delta x) &\approx \nabla f(x) + \frac{\partial}{\partial x}\left(\nabla f(x)\right) \Delta x \\ &= \nabla f(x) + \nabla^2 f(x)\, \Delta x \\ &= 0 \end{aligned}$$

$$\Rightarrow \Delta x = -\left(\nabla^2 f(x)\right)^{-1} \nabla f(x)$$

Update $x \leftarrow x + \Delta x$, then repeat until convergence.
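A minimal sketch of this update rule, assuming ForwardDiff.jl for the gradient and Hessian; `newton_minimize` and the test function are illustrative:

```julia
using ForwardDiff, LinearAlgebra

# (Undamped) Newton's method for minimizing a smooth f: R^n -> R
function newton_minimize(f, x0; tol = 1e-10, maxiter = 100)
    x = x0
    for _ in 1:maxiter
        g = ForwardDiff.gradient(f, x)     # ∇f(x)
        norm(g) < tol && break
        H = ForwardDiff.hessian(f, x)      # ∇²f(x)
        x = x - H \ g                      # Δx = -(∇²f)⁻¹ ∇f
    end
    return x
end

# Example: a convex quadratic with minimizer [1, -2]
f(x) = (x[1] - 1)^2 + 2 * (x[2] + 2)^2
newton_minimize(f, [5.0, 5.0])   # converges in one step for a quadratic
```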

(1) Intuition
  • Fit a quadratic approximation to $f(x)$
  • Exactly minimize quadratic approximation
(2) Example:

$$\min_{x} f(x) = x^4 + x^3 - x^2 - x$$

start at: $x = 1.0$ $\Rightarrow$ converges to the global minimum
[Figure: Newton iterates starting from x = 1.0]
start at: $x = -1.5$ $\Rightarrow$ converges to a local minimum
[Figure: Newton iterates starting from x = -1.5]
start at: $x = 0.0$ $\Rightarrow$ converges to a local maximum
![Newton iterates starting from x = 0.0](https://img-blog.csdnimg.cn/a03161e649ac4383ae90a07f12001444.png#pic_center)
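A minimal scalar sketch reproducing these three cases, using the analytic derivatives $f'(x) = 4x^3 + 3x^2 - 2x - 1$ and $f''(x) = 12x^2 + 6x - 2$; the helper name is illustrative:

```julia
# f(x) = x^4 + x^3 - x^2 - x and its derivatives
df(x)  = 4x^3 + 3x^2 - 2x - 1     # f'(x)
d2f(x) = 12x^2 + 6x - 2           # f''(x)

# Plain Newton on f'(x) = 0: converges to whichever critical point is nearby
function newton_1d(x; iters = 20)
    for _ in 1:iters
        x -= df(x) / d2f(x)
    end
    return x
end

newton_1d(1.0)    # ≈  0.64  (global minimum)
newton_1d(-1.5)   # ≈ -1.0   (local minimum)
newton_1d(0.0)    # ≈ -0.39  (local maximum)
```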

(3) Takeaway message
  • Newton's method is a local root-finding method: it converges to a critical point near the initial guess, which may be a minimum, maximum, or saddle point.

4. Sufficient conditions

  • First-order necessary condition: $\nabla f(x) = 0$ (not sufficient for a minimum)

  • Scalar case:
    $\Delta x = -\left(\nabla^2 f(x)\right)^{-1} \nabla f(x)$. The minus sign gives descent, and $\nabla f$ is the gradient.
    The inverse Hessian $\left(\nabla^2 f(x)\right)^{-1}$ plays the role of the learning rate / step size, so the Hessian should be positive:
    i. $\nabla^2 f(x) > 0$ $\Rightarrow$ descent (minimization)
    ii. $\nabla^2 f(x) < 0$ $\Rightarrow$ ascent (maximization)

  • Vector case ($\mathbb{R}^n$):
    $\nabla^2 f(x) \succ 0$ (i.e. $\nabla^2 f(x) \in \mathbb{S}^n_{++}$, positive definite)
    $\Rightarrow$ descent (minimization); a quick numerical check follows below.
    If $\nabla^2 f(x) \succ 0$ for all $x$, then $f(x)$ is strongly convex, and Newton's method converges globally to the global minimum.
    However, this is usually not the case for hard/nonlinear problems.
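As a quick numerical check of this condition, a minimal sketch; the two Hessians below are made-up examples, not from the lecture:

```julia
using LinearAlgebra

H1 = [2.0 0.3; 0.3 1.0]    # positive definite: descent direction guaranteed
H2 = [2.0 0.0; 0.0 -1.0]   # indefinite (saddle): Newton step may not descend

isposdef(H1)    # true
isposdef(H2)    # false
eigvals(H2)     # one negative eigenvalue: [-1.0, 2.0]
```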

5. Regularization

Regularization aims to ensure that the update direction is always a descent direction.

It is a practical solution to make sure we are always minimizing: check whether the Hessian is positive definite.

$H \leftarrow \nabla^2 f(x)$
while $H \nsucc 0$:
$\quad H \leftarrow H + \beta I$ ($\beta > 0$ is a hyperparameter)
end
$\Delta x = -H^{-1} \nabla f(x)$
$x \leftarrow x + \Delta x$

using LinearAlgebra

# Initialization
H = calculate_hessian(x)   # ∇²f(x), user-supplied
β = 1.0                    # regularization hyperparameter

# Optimization loop: bump H until it is positive definite
while !isposdef(H)
    H = H + β * I          # I is the identity (UniformScaling) from LinearAlgebra
end

# Update
Δx = -(H \ grad_f(x))      # solve H Δx = -∇f(x) instead of inverting H
x = x + Δx
This is also called the damped Newton's method. It guarantees that the update direction is always a descent direction, and it makes the step smaller (adding $\beta I$ is equivalent to adding a quadratic penalty on $\Delta x$).
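To see the equivalence, minimize the local quadratic model of $f$ with an extra penalty $\frac{\beta}{2}\|\Delta x\|^2$ on the step:

$$\Delta x = \arg\min_{\Delta x}\; \nabla f(x)^T \Delta x + \tfrac{1}{2}\,\Delta x^T \nabla^2 f(x)\, \Delta x + \tfrac{\beta}{2}\,\Delta x^T \Delta x$$

Setting the gradient of this model with respect to $\Delta x$ to zero gives $\nabla f(x) + \left(\nabla^2 f(x) + \beta I\right)\Delta x = 0$, which is exactly the regularized step $\Delta x = -\left(\nabla^2 f(x) + \beta I\right)^{-1}\nabla f(x)$; a larger $\beta$ means a stronger penalty and a shorter step.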

start at: $x = 0.0$ $\Rightarrow$ with regularization, the update is now a descent step instead of converging to the local maximum
[Figure: regularized Newton iterates on $f(x) = x^4 + x^3 - x^2 - x$ starting from x = 0.0]
