Review:
- Discrete-time dynamics/simulations
- Stability of discrete-time systems
- Forward/backward Euler methods
- RK4 method
- Zero/first order hold on controls
Lecture 3 Optimization Pt.1
Overview
- Notation
- Root finding
- Minimization
1. Notation
(1) Scalar function $f(x): \mathbb{R}^n \rightarrow \mathbb{R}$
$\frac{\partial f}{\partial x} = \left[\frac{\partial f}{\partial x_1} \cdots \frac{\partial f}{\partial x_n}\right] \in \mathbb{R}^{1\times n}$ is a row vector.
This is because $\frac{\partial f}{\partial x}$ is the linear operator mapping $\Delta x$ into $\Delta f$:
$$f\left(x + \Delta x\right) \approx f(x) + \frac{\partial f}{\partial x} \Delta x$$
(2) Vector function $g(y): \mathbb{R}^m \rightarrow \mathbb{R}^n$
Similarly, we have $\frac{\partial g}{\partial y}\in \mathbb{R}^{n\times m}$, because
$$g\left(y + \Delta y\right) \approx g(y) + \frac{\partial g}{\partial y} \Delta y.$$
(3) Chain rule
The above conventions make the chain rule work:
$$f\left(g(y+\Delta y)\right) \approx f\left(g(y)\right) + \left.\frac{\partial f}{\partial x} \right|_{g(y)} \cdot \left.\frac{\partial g}{\partial y} \right|_{y} \cdot \Delta y$$
For convenience, we define the gradient as
$$\nabla f \left(x\right) = \left(\frac{\partial f}{\partial x}\right)^T \in \mathbb{R}^{n\times 1}$$
(a column vector) and the Hessian as
$$\nabla ^2 f \left(x\right) = \frac{\partial}{\partial x} \left(\nabla f \left(x\right)\right) = \frac{\partial^2 f}{\partial x^2} \in \mathbb{R}^{n\times n}$$
(a symmetric matrix).
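As a quick numerical sanity check on these conventions (the functions `f`, `g`, `dfdx`, and `dgdy` below are toy examples of my own, not from the lecture), the product of the two Jacobians should predict the first-order change of the composition:

```python
import numpy as np

def f(x):                       # f: R^2 -> R
    return x[0] ** 2 + np.sin(x[1])

def g(y):                       # g: R^3 -> R^2
    return np.array([y[0] * y[1], y[1] + y[2] ** 2])

def dfdx(x):                    # ∂f/∂x: 1 x 2 row vector
    return np.array([[2.0 * x[0], np.cos(x[1])]])

def dgdy(y):                    # ∂g/∂y: 2 x 3 Jacobian
    return np.array([[y[1], y[0], 0.0],
                     [0.0, 1.0, 2.0 * y[2]]])

y = np.array([1.0, 2.0, 3.0])
dy = 1e-4 * np.array([1.0, -2.0, 0.5])

exact = f(g(y + dy)) - f(g(y))
linear = (dfdx(g(y)) @ dgdy(y) @ dy).item()
print(exact, linear)            # the two agree to first order in Δy
```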
2. Root finding
(1) Given $f\left(\mathrm{x}\right)$, find $\mathrm{x}^*$ such that $f\left(\mathrm{x}^*\right) = 0$.
Example: an equilibrium point of continuous-time dynamics.
(2) Given $f\left(\mathrm{x}\right)$, find $\mathrm{x}^*$ such that $f\left(\mathrm{x}^*\right) = \mathrm{x}^*$.
Example: an equilibrium point of discrete-time dynamics.
(3) Method 1: Fixed-point iteration
- Simplest solution method
- If the fixed point is stable, just iterate the dynamics until convergence
- Only works for systems with a single equilibrium
- Only works for stable fixed points, and convergence is slow (see the sketch below)
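A minimal sketch of fixed-point iteration on an assumed toy problem (the fixed point of $\cos$, not an example from the lecture):

```python
import math

# Iterate x <- cos(x); the fixed point x* = cos(x*) ≈ 0.739 is stable,
# so repeated application converges, but only at a linear rate.
x = 1.0
for _ in range(50):
    x = math.cos(x)
print(x)   # ≈ 0.739085
```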
(4) Method 2: Newton’s method
I. Fit a linear approximation to $f\left(\mathrm{x}\right)$:
$$f\left(\mathrm{x}+\Delta \mathrm{x}\right) \approx f\left(\mathrm{x}\right) + \left.\frac{\partial f}{\partial x}\right|_{\mathrm{x}} \Delta \mathrm{x}$$
II. Set the approximation to zero and solve for $\Delta \mathrm{x}$:
$$f\left(\mathrm{x}\right) + \left.\frac{\partial f}{\partial x}\right|_{\mathrm{x}} \Delta \mathrm{x} = 0 \Rightarrow \Delta \mathrm{x} = - \left.\frac{\partial f}{\partial x}\right|_{\mathrm{x}}^{-1} f\left(\mathrm{x}\right)$$
III. Apply the correction: $\mathrm{x} \leftarrow \mathrm{x} + \Delta \mathrm{x}$
IV. Repeat until convergence
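A minimal sketch of these four steps (the function name `newton_root` and the $x^2 - 2 = 0$ test problem are my own, not from the lecture):

```python
import numpy as np

def newton_root(f, dfdx, x0, tol=1e-12, max_iters=50):
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    for _ in range(max_iters):
        fx = np.atleast_1d(f(x))
        if np.linalg.norm(fx) < tol:        # IV. repeat until convergence
            break
        J = np.atleast_2d(dfdx(x))          # I.  linearize: Jacobian ∂f/∂x at x
        x = x + np.linalg.solve(J, -fx)     # II./III. Δx = -J⁻¹ f(x), apply correction
    return x

root = newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)  # converges to sqrt(2)
```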
Example: Backward Euler
Both methods recover the damped motion, but Newton's method is much faster.
Plotting the error over the iterations shows very fast convergence with Newton's method:
the convergence rate of fixed-point iteration is linear, while Newton's method is quadratic.
3-element Vector{Float64}:
0.09793658173053843
3.7830087232931797e-6
5.2874553670659e-15
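As a complementary sketch (a toy damped-pendulum model chosen for illustration, not the lecture's Julia notebook), the backward Euler update $x_{k+1} = x_k + h\, f(x_{k+1})$ can be solved at every step with the Newton iteration above:

```python
import numpy as np

# Assumed damped-pendulum dynamics (illustrative parameters)
g, l, c = 9.81, 1.0, 0.1

def f(x):                                  # continuous-time dynamics x_dot = f(x)
    theta, omega = x
    return np.array([omega, -(g / l) * np.sin(theta) - c * omega])

def dfdx(x):                               # Jacobian of the dynamics
    theta, _ = x
    return np.array([[0.0, 1.0],
                     [-(g / l) * np.cos(theta), -c]])

def backward_euler_step(xk, h, tol=1e-12, max_iters=20):
    """Solve r(y) = y - xk - h*f(y) = 0 for x_{k+1} with Newton's method."""
    y = xk.copy()                          # initial guess: previous state
    for _ in range(max_iters):
        r = y - xk - h * f(y)              # residual of the implicit equation
        if np.linalg.norm(r) < tol:
            break
        J = np.eye(2) - h * dfdx(y)        # ∂r/∂y
        y = y + np.linalg.solve(J, -r)     # Newton correction Δy = -J⁻¹ r
    return y

# Simulate the damped motion: 200 steps of h = 0.05 s from θ = 90°
x = np.array([np.pi / 2, 0.0])
for _ in range(200):
    x = backward_euler_step(x, 0.05)
```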
(5) Takeaway message
- Quadratic convergence (Newton's method)
- Can achieve machine precision
- Most expensive part: solving the linear system (computing the Jacobian is usually not the bottleneck; factorizing/inverting it is, at $\mathcal{O}\left(n^3\right)$ for an $n\times n$ matrix)
- Can improve complexity by exploiting problem structure, e.g. a sparse Jacobian in many cases (more later; see the sketch below)
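For instance, here is a small sketch (a toy tridiagonal Jacobian, not a specific problem from the lecture) of how a sparse solve exploits structure that a dense factorization ignores:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Toy banded Jacobian: tridiagonal structure, as arises in chained systems.
n = 1000
J = sp.diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csc")
f_x = np.ones(n)

dx_dense = np.linalg.solve(J.toarray(), -f_x)   # dense factorization: O(n^3)
dx_sparse = spla.spsolve(J, -f_x)               # sparse solve exploits the band structure
assert np.allclose(dx_dense, dx_sparse)
```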
3. Minimization
$$\min_{\mathrm{x}} f\left(\mathrm{x}\right), \quad f\left(\mathrm{x}\right): \mathbb{R}^n \rightarrow \mathbb{R}$$
If $f$ is smooth, then $\left.\frac{\partial f}{\partial x}\right|_{\mathrm{x}^*} = 0$ at a local minimum.
Thus, we can transform the minimization problem into a root-finding problem: $\nabla f\left(\mathrm{x}\right) = 0$.
$\Rightarrow$ Apply Newton's method:
$$\begin{aligned} \nabla f\left(\mathrm{x}+\Delta x\right) &\approx \nabla f\left(\mathrm{x}\right) + \frac{\partial}{\partial x} \left(\nabla f\left(\mathrm{x}\right)\right) \Delta x \\ & = \nabla f\left(\mathrm{x}\right) + \nabla ^2 f\left(\mathrm{x}\right) \Delta x \\ & = 0 \end{aligned}$$
$$\Rightarrow \Delta x = - \left(\nabla ^2 f\left(\mathrm{x}\right)\right)^{-1} \nabla f\left(\mathrm{x}\right)$$
Update $\mathrm{x} \leftarrow \mathrm{x} + \Delta \mathrm{x}$, then repeat until convergence.
(1) Intuition
- Fit a quadratic approximation to $f\left(\mathrm{x}\right)$
- Exactly minimize quadratic approximation
(2) Example:
$$\min_{x} f\left(x\right) = x^4 + x^3 - x^2 - x$$
Start at $x = 1.0$ $\Rightarrow$ converges to the global minimum.
Start at $x = -1.5$ $\Rightarrow$ converges to a local minimum.
Start at $x = 0.0$ $\Rightarrow$ converges to a local maximum.
![Newton's method on f(x) = x^4 + x^3 - x^2 - x from different starting points](https://img-blog.csdnimg.cn/a03161e649ac4383ae90a07f12001444.png#pic_center)
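The following is a minimal sketch (my own, using the plain Newton update derived above; `grad`, `hess`, and `newton_min` are illustrative names) that reproduces this behavior numerically:

```python
# Pure Newton iteration on f(x) = x^4 + x^3 - x^2 - x.
# Newton converges to the nearest stationary point, which can be a
# minimum or a maximum depending on the starting guess.
grad = lambda x: 4 * x**3 + 3 * x**2 - 2 * x - 1   # ∇f(x)
hess = lambda x: 12 * x**2 + 6 * x - 2             # ∇²f(x)

def newton_min(x, iters=20):
    for _ in range(iters):
        x = x - grad(x) / hess(x)                  # Δx = -(∇²f)⁻¹ ∇f
    return x

for x0 in (1.0, -1.5, 0.0):
    print(x0, "->", newton_min(x0))
# 1.0  -> ≈ 0.64   (global minimum)
# -1.5 -> ≈ -1.0   (local minimum)
# 0.0  -> ≈ -0.39  (local maximum)
```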
(3) Takeaway message
- Newton's method is a local root-finding method: it converges to the stationary point nearest the initial guess, which may be a minimum, a maximum, or a saddle point.
4. Sufficient conditions
- First-order necessary condition: $\nabla f\left(\mathrm{x}\right) = 0$ (not sufficient for a minimum)
- Scalar case:
  $\Delta x = - \left(\nabla ^2 f\left(\mathrm{x}\right)\right)^{-1} \nabla f\left(\mathrm{x}\right)$: the minus sign means we step against the gradient $\nabla f$, and the inverse Hessian acts like a step size / learning rate, so it should be positive.
  i. $\nabla ^2 f\left(\mathrm{x}\right) > 0$ $\Rightarrow$ descent (minimization)
  ii. $\nabla ^2 f\left(\mathrm{x}\right) < 0$ $\Rightarrow$ ascent (maximization)
- Vector case ($\mathbb{R}^n$):
  $\nabla ^2 f\left(\mathrm{x}\right) \succ 0$ (i.e. $\nabla ^2 f\left(\mathrm{x}\right) \in \mathbb{S}^n_{++}$) $\Rightarrow$ descent (minimization).
  If $\nabla ^2 f\left(\mathrm{x}\right) \succ 0$ for all $\mathrm{x}$, then $f\left(\mathrm{x}\right)$ is strictly convex and Newton's method converges to the global minimum.
But this is often not the case for hard/nonlinear problems.
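As a quick worked check of these conditions on the earlier example $f(x) = x^4 + x^3 - x^2 - x$: the gradient factors as $\nabla f(x) = 4x^3 + 3x^2 - 2x - 1 = (x+1)(4x^2 - x - 1)$, giving stationary points $x = -1$ and $x = (1 \pm \sqrt{17})/8 \approx 0.64,\ -0.39$. The Hessian $\nabla^2 f(x) = 12x^2 + 6x - 2$ evaluates to $4 > 0$ at $x = -1$ (local minimum), $\approx 6.8 > 0$ at $x \approx 0.64$ (the global minimum), and $\approx -2.5 < 0$ at $x \approx -0.39$ (the local maximum), consistent with the three Newton runs above.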
5. Regularization
Regularization aims to ensure that the update direction is always a descent direction. It is a practical fix for making sure we are always minimizing: check whether the Hessian is positive definite, and if not, shift it.
$$\begin{aligned}
& H \leftarrow \nabla ^2 f\left(\mathrm{x}\right) \\
& \textbf{while } H \nsucc 0: \\
& \qquad H \leftarrow H + \beta I \quad (\beta > 0 \text{ is a hyperparameter}) \\
& \textbf{end} \\
& \Delta x = - H^{-1} \nabla f\left(\mathrm{x}\right) \\
& \mathrm{x} \leftarrow \mathrm{x} + \Delta \mathrm{x}
\end{aligned}$$
```python
import numpy as np

def regularized_newton_step(x, grad_f, hess_f, beta=1.0):
    # Regularization: shift the Hessian until it is positive definite
    H = hess_f(x)
    while np.any(np.linalg.eigvalsh(H) <= 0):   # H not positive definite
        H = H + beta * np.eye(H.shape[0])       # beta > 0 is a hyperparameter
    # Damped Newton update
    dx = -np.linalg.solve(H, grad_f(x))
    return x + dx
```
This is also called the damped Newton's method. It guarantees that the update direction is always a descent direction and it shrinks the step (adding $\beta I$ is equivalent to adding a quadratic penalty on $\Delta x$).
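To spell out that equivalence (a short derivation consistent with the update above, not quoted from the lecture): once $H = \nabla^2 f(\mathrm{x}) + \beta I \succ 0$, the regularized step is the exact minimizer of the penalized quadratic model
$$\Delta x = \arg\min_{\Delta x}\ f(\mathrm{x}) + \nabla f(\mathrm{x})^T \Delta x + \tfrac{1}{2}\Delta x^T \nabla^2 f(\mathrm{x}) \Delta x + \tfrac{\beta}{2}\|\Delta x\|^2,$$
since setting the gradient of the model to zero gives $\nabla f(\mathrm{x}) + \left(\nabla^2 f(\mathrm{x}) + \beta I\right)\Delta x = 0$, i.e. $\Delta x = -H^{-1}\nabla f(\mathrm{x})$. A larger $\beta$ penalizes large steps more, pushing the update toward (scaled) gradient descent.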
Start at $x = 0.0$: