高级优化理论与方法(六)
Quasi-Newton Methods
Rank-One Algorithm
核心公式
d
k
=
−
H
k
g
k
d^k=-H^kg^k
dk=−Hkgk
α
k
=
a
r
g
m
i
n
f
(
x
k
+
α
d
k
)
\alpha^k=argmin f(x^k+\alpha d^k)
αk=argminf(xk+αdk)
x
k
+
1
=
x
k
+
α
k
d
k
x^{k+1}=x^k+\alpha^kd^k
xk+1=xk+αkdk
H
k
+
1
=
H
k
+
(
Δ
x
k
−
H
k
Δ
g
k
)
(
Δ
x
k
−
H
k
Δ
g
k
)
T
Δ
g
k
T
(
Δ
x
k
−
H
k
Δ
g
k
)
H^{k+1}=H^k+\frac{(\Delta x^k-H^k\Delta g^k)(\Delta x^k-H^k\Delta g^k)^T}{\Delta{g^k}^T(\Delta x^k-H^k\Delta g^k)}
Hk+1=Hk+ΔgkT(Δxk−HkΔgk)(Δxk−HkΔgk)(Δxk−HkΔgk)T
Theorem
Quadratic: H k + 1 Δ g i = Δ x i , ∀ i ≤ k H^{k+1}\Delta g^i=\Delta x^i, \forall i\leq k Hk+1Δgi=Δxi,∀i≤k
Example
f ( x ) = x 1 2 + 1 2 x 2 2 + 3 = 1 2 x T [ 2 0 0 1 ] x + 3 f(x)=x_1^2+\frac{1}{2}x_2^2+3=\frac{1}{2}x^T\begin{bmatrix} 2&0 \\ 0&1 \end{bmatrix}x+3 f(x)=x12+21x22+3=21xT[2001]x+3
g k = [ 2 0 0 1 ] x k g^k=\begin{bmatrix} 2&0 \\ 0&1 \end{bmatrix}x^k gk=[2001]xk
x 0 = [ 1 2 ] x^0=\begin{bmatrix} 1 \\ 2 \end{bmatrix} x0=[12]
H 0 = [ 1 0 0 1 ] H^0=\begin{bmatrix} 1&0 \\ 0&1 \end{bmatrix} H0=[1001]
d 0 = − H 0 g 0 = [ − 2 − 2 ] d^0=-H^0g^0=\begin{bmatrix} -2 \\ -2 \end{bmatrix} d0=−H0g0=[−2−2]
α 0 = g 0 T d 0 d 0 T Q d 0 = 2 3 \alpha^0=\frac{{g^0}^Td^0}{{d^0}^TQd^0}=\frac{2}{3} α0=d0TQd0g0Td0=32
x 1 = x 0 + α 0 d 0 = [ − 1 3 2 3 ] x^1=x^0+\alpha^0d^0=\begin{bmatrix} -\frac{1}{3} \\ \frac{2}{3} \end{bmatrix} x1=x0+α0d0=[−3132]
H 1 = [ 1 2 0 0 1 ] H^1=\begin{bmatrix} \frac{1}{2}&0 \\ 0&1 \end{bmatrix} H1=[21001]
Problems
H k H^k Hk not positive definite ⇒ \Rightarrow ⇒ not descent.
Δ g k T ( Δ x k − H k Δ g k ) \Delta{g^k}^T(\Delta x^k-H^k\Delta g^k) ΔgkT(Δxk−HkΔgk) too small, close to zero.
DFP-Algorithm
算法步骤
IN: x 0 , H 0 x^0, H^0 x0,H0(positive definite, symmetric)
- k : = 0 k:=0 k:=0
- If g k = 0 g^k=0 gk=0, then stop; else d k = − H k g k d^k=-H^kg^k dk=−Hkgk
- compute α k = a r g m i n f ( x k + α d k ) , x k + 1 = x k + α k d k \alpha^k=argmin f(x^k+\alpha d^k), x^{k+1}=x^k+\alpha^kd^k αk=argminf(xk+αdk),xk+1=xk+αkdk
- compute Δ x k = α k d k , Δ g k = g k + 1 − g k , H k + 1 = H k + Δ x k Δ x k T Δ x k T Δ g k − ( H k Δ g k ) ( H k Δ g k ) T Δ g k T H k Δ g k \Delta x^k=\alpha^kd^k, \Delta g^k=g^{k+1}-g^k,H^{k+1}=H^k+\frac{\Delta x^k{\Delta x^k}^T}{{\Delta x^k}^T\Delta g^k}-\frac{(H^k\Delta g^k)(H^k\Delta g^k)^T}{{\Delta g^k}^TH^k\Delta g^k} Δxk=αkdk,Δgk=gk+1−gk,Hk+1=Hk+ΔxkTΔgkΔxkΔxkT−ΔgkTHkΔgk(HkΔgk)(HkΔgk)T
- k k k++,goto 2
Theorem
Applying DFP to quadratic functions: H k + 1 Δ g i = Δ x i , ∀ i ≤ k H^{k+1}\Delta g^i=\Delta x^i, \forall i\leq k Hk+1Δgi=Δxi,∀i≤k
Example
f ( x ) = 1 2 x T [ 4 2 2 2 ] x − x T [ − 1 1 ] f(x)=\frac{1}{2}x^T\begin{bmatrix} 4&2 \\ 2&2 \end{bmatrix}x-x^T\begin{bmatrix} -1 \\ 1 \end{bmatrix} f(x)=21xT[4222]x−xT[−11]
x 0 = [ 0 0 ] x^0=\begin{bmatrix} 0 \\ 0 \end{bmatrix} x0=[00]
H 0 = [ 1 0 0 1 ] H^0=\begin{bmatrix} 1&0 \\ 0&1 \end{bmatrix} H0=[1001]
g k = [ 4 2 2 2 ] x − [ − 1 1 ] g^k=\begin{bmatrix} 4&2 \\ 2&2 \end{bmatrix}x-\begin{bmatrix} -1 \\ 1 \end{bmatrix} gk=[4222]x−[−11]
g 0 = [ 1 − 1 ] g^0=\begin{bmatrix} 1 \\ -1 \end{bmatrix} g0=[1−1]
d 0 = − H 0 g 0 = [ − 1 1 ] d^0=-H^0g^0=\begin{bmatrix} -1 \\ 1 \end{bmatrix} d0=−H0g0=[−11]
α 0 = 1 \alpha^0=1 α0=1
x 1 = x 0 + α 0 d 0 = [ − 1 1 ] x^1=x^0+\alpha^0d^0=\begin{bmatrix} -1 \\ 1 \end{bmatrix} x1=x0+α0d0=[−11]
g 1 = [ − 1 1 ] g^1=\begin{bmatrix} -1 \\ 1 \end{bmatrix} g1=[−11]
Δ x 0 = [ − 1 1 ] \Delta x^0=\begin{bmatrix} -1\\ 1 \end{bmatrix} Δx0=[−11]
Δ g 0 = [ − 2 0 ] \Delta g^0=\begin{bmatrix} -2\\ 0 \end{bmatrix} Δg0=[−20]
Δ x 0 Δ x 0 T = [ − 1 1 ] [ − 1 , 1 ] = [ 1 − 1 − 1 1 ] \Delta x^0{\Delta x^0}^T=\begin{bmatrix} -1\\ 1 \end{bmatrix}[-1,1]=\begin{bmatrix} 1&-1 \\ -1&1 \end{bmatrix} Δx0Δx0T=[−11][−1,1]=[1−1−11]
Δ x 0 T Δ g 0 = 2 {\Delta x^0}^T\Delta g^0=2 Δx0TΔg0=2
H 0 Δ g 0 = [ − 2 0 ] H^0\Delta g^0=\begin{bmatrix} -2\\ 0 \end{bmatrix} H0Δg0=[−20]
( H 0 Δ g 0 ) ( H 0 Δ g 0 ) T = [ 4 0 0 0 ] (H^0\Delta g^0)(H^0\Delta g^0)^T=\begin{bmatrix} 4&0 \\ 0&0 \end{bmatrix} (H0Δg0)(H0Δg0)T=[4000]
Δ g 0 T H 0 Δ g 0 = 4 {\Delta g^0}^T H^0 \Delta g^0=4 Δg0TH0Δg0=4
H 1 = H 0 + Δ x 0 Δ x 0 T Δ x 0 T Δ g 0 − ( H 0 Δ g 0 ) ( H 0 Δ g 0 ) T Δ g 0 T H 0 Δ g 0 = [ 1 0 0 1 ] + [ 1 − 1 − 1 1 ] 2 − [ 4 0 0 0 ] 4 = [ 1 2 − 1 2 − 1 2 3 2 ] H^1=H^0+\frac{\Delta x^0{\Delta x^0}^T}{{\Delta x^0}^T\Delta g^0}-\frac{(H^0\Delta g^0)(H^0\Delta g^0)^T}{{\Delta g^0}^TH^0\Delta g^0}=\begin{bmatrix} 1&0 \\ 0&1 \end{bmatrix}+\frac{\begin{bmatrix} 1&-1 \\ -1&1 \end{bmatrix}}{2}-\frac{\begin{bmatrix} 4&0 \\ 0&0 \end{bmatrix}}{4}=\begin{bmatrix} \frac{1}{2}&-\frac{1}{2} \\ -\frac{1}{2}&\frac{3}{2} \end{bmatrix} H1=H0+Δx0TΔg0Δx0Δx0T−Δg0TH0Δg0(H0Δg0)(H0Δg0)T=[1001]+2[1−1−11]−4[4000]=[21−21−2123]
d 1 = − H 1 g 1 = [ 0 1 ] d^1=-H^1g^1=\begin{bmatrix} 0 \\ 1 \end{bmatrix} d1=−H1g1=[01]
α 1 = 1 2 \alpha^1=\frac{1}{2} α1=21
x 2 = x 1 + α 1 d 1 = [ − 1 3 2 ] = x ∗ x^2=x^1+\alpha^1 d^1=\begin{bmatrix} -1 \\ \frac{3}{2} \end{bmatrix}=x^* x2=x1+α1d1=[−123]=x∗
Theorem
If g k ≠ 0 , H k g^k\neq 0, H^k gk=0,Hk being positive definite implies H k + 1 H^{k+1} Hk+1 being positive definite.
BFGS
前两个算法的核心是用
H
k
H^k
Hk来表示
F
(
x
k
)
−
1
F(x^k)^{-1}
F(xk)−1,即二阶导的逆。两种算法的不同之处仅仅在于用不同的公式来计算
H
k
H^k
Hk。BFGS是用
G
k
+
1
G^{k+1}
Gk+1来表示
F
(
x
k
)
F(x^k)
F(xk),从而推导出
H
k
H^k
Hk。BFGS的收敛速度在某些情况下较其它两种方法快,但是在非二次函数的情形下难以证明。这里只给出
H
k
H^k
Hk的公式,不做过多展开。
H
k
+
1
=
H
k
+
(
1
+
Δ
g
k
T
H
k
Δ
g
k
Δ
g
k
T
Δ
x
k
)
Δ
x
k
Δ
x
k
T
Δ
x
k
Δ
g
k
−
H
k
Δ
g
k
Δ
x
k
T
+
(
H
k
Δ
g
k
Δ
x
k
T
)
T
Δ
g
k
T
Δ
x
k
H^{k+1}=H^k+(1+\frac{{\Delta g^k}^TH^k\Delta g^k}{{\Delta g^k}^T\Delta x^k})\frac{\Delta x^k{\Delta x^k}^T}{\Delta x^k\Delta g^k}-\frac{H^k\Delta g^k{\Delta x^k}^T+(H^k\Delta g^k{\Delta x^k}^T)^T}{{\Delta g^k}^T\Delta x^k}
Hk+1=Hk+(1+ΔgkTΔxkΔgkTHkΔgk)ΔxkΔgkΔxkΔxkT−ΔgkTΔxkHkΔgkΔxkT+(HkΔgkΔxkT)T
Solving Linear Equations
A x = b , A ∈ R m × n , x ∈ R n , b ∈ R m Ax=b, A\in \mathbb{R}^{m\times n}, x\in \mathbb{R}^n, b\in \mathbb{R}^m Ax=b,A∈Rm×n,x∈Rn,b∈Rm
{ a 11 x 1 + a 12 x 2 + ⋯ + a 1 n x n = b 1 ⋯ a m 1 x 1 + a m 2 x 2 + ⋯ + a m n x n = b n \begin{cases} a_{11}x_1+a_{12}x_2+\cdots+a_{1n}x_n=b_1\\ \cdots\\ a_{m1}x_1+a_{m2}x_2+\cdots+a_{mn}x_n=b_n \end{cases} ⎩ ⎨ ⎧a11x1+a12x2+⋯+a1nxn=b1⋯am1x1+am2x2+⋯+amnxn=bn
n n n代表变量个数, m m m代表方程个数。默认这 m m m个方程线性无关。
Case 1
m ≥ n , r a n k A = n m\geq n, rankA=n m≥n,rankA=n
可以把问题看成用 m m m个数据点来拟合有 n n n个变量的函数,此时可能不存在严格的解。于是我们不要求严格地解这个方程,而是把解方程问题转化为数据拟合问题(Data fitting)。
在数据拟合问题中,我们一般采取的是Least Square fitting,即求 m i n ∣ ∣ A x − b ∣ ∣ 2 min ||Ax-b||^2 min∣∣Ax−b∣∣2
Lemma
r a n k A = n ⇔ r a n k A T A = n rankA=n\Leftrightarrow rank A^TA=n rankA=n⇔rankATA=n
Theorem
The unique vector x ∗ x^* x∗ minimizing ∣ ∣ A x − b ∣ ∣ 2 ||Ax-b||^2 ∣∣Ax−b∣∣2 is given by x ∗ = ( A T A ) − 1 A T b x^*=(A^TA)^{-1}A^Tb x∗=(ATA)−1ATb
Proposition
A = [ a 1 , a 2 , ⋯ , a n ] A=[a_1,a_2,\cdots,a_n] A=[a1,a2,⋯,an]
s p a n [ a 1 , a 2 , ⋯ , a n ] span[a_1,a_2,\cdots,a_n] span[a1,a2,⋯,an]表示由这些向量张成的空间。
设 b b b在该空间上的投影为 v v v,则 v = A x ∗ = A ( A T A ) − 1 A T b v=Ax^*=A(A^TA)^{-1}A^Tb v=Ax∗=A(ATA)−1ATb
注:该命题是通过几何方法给出了 x ∗ = ( A T A ) − 1 A T b x^*=(A^TA)^{-1}A^Tb x∗=(ATA)−1ATb的一个证明。
Example
i i i | 1 | 2 | 3 |
---|---|---|---|
t i t_i ti | 2 | 3 | 4 |
y i y_i yi | 3 | 4 | 15 |
y = m t + c y=mt+c y=mt+c
A = [ 2 1 3 1 4 1 ] A=\begin{bmatrix} 2&1 \\ 3&1\\ 4&1 \end{bmatrix} A= 234111
b = [ 3 4 15 ] b=\begin{bmatrix} 3 \\ 4\\ 15 \end{bmatrix} b= 3415
x ∗ = [ 6 − 32 3 ] x^*=\begin{bmatrix} 6 \\ -\frac{32}{3} \end{bmatrix} x∗=[6−332]
注:此时, x = [ m c ] x=\begin{bmatrix} m \\ c \end{bmatrix} x=[mc]
Recursive Least-Square Algorithm
根据原始数据,已经计算出了
x
∗
x^*
x∗的值,那么此时如果给出一些新的数据,如何在借助已知值的基础上快速算出新的
x
∗
x^*
x∗,就是这个算法要解决的问题。
m
i
n
∣
∣
A
0
x
−
b
0
∣
∣
2
⇐
x
0
=
G
0
−
1
A
0
T
b
0
min||A_0x-b_0||^2\Leftarrow x_0=G_0^{-1}A_0^Tb_0
min∣∣A0x−b0∣∣2⇐x0=G0−1A0Tb0
G
0
=
A
0
−
1
A
0
G_0=A_0^{-1}A_0
G0=A0−1A0
m i n ∣ ∣ [ A 0 A 1 ] x − [ b 0 b 1 ] ∣ ∣ 2 min\left|\left|\begin{bmatrix} A_0 \\ A_1 \end{bmatrix}x-\begin{bmatrix} b_0 \\ b_1 \end{bmatrix}\right|\right| ^2 min [A0A1]x−[b0b1] 2
x 1 = G 1 − 1 [ A 0 A 1 ] T [ b 0 b 1 ] x_1=G_1^{-1}\begin{bmatrix} A_0 \\ A_1 \end{bmatrix}^T\begin{bmatrix} b_0 \\ b_1 \end{bmatrix} x1=G1−1[A0A1]T[b0b1]
G 1 = [ A 0 A 1 ] T [ A 0 A 1 ] = A 0 T A 0 + A 1 T A 1 G_1=\begin{bmatrix} A_0 \\ A_1 \end{bmatrix}^T\begin{bmatrix} A_0 \\ A_1 \end{bmatrix}=A_0^TA_0+A_1^TA_1 G1=[A0A1]T[A0A1]=A0TA0+A1TA1
Goal: express x 1 x_1 x1 as a function of x 0 , A 1 , b 1 , G 0 x_0,A_1,b_1,G_0 x0,A1,b1,G0
x 1 = x 0 + G 1 − 1 A 1 T ( b 1 − A 1 x 0 ) x_1=x_0+G_1^{-1}A_1^T(b_1-A_1x_0) x1=x0+G1−1A1T(b1−A1x0)
k → k + 1 : { G k + 1 = G k + A k + 1 T A k + 1 x k + 1 = x k + G k + 1 − 1 A k + 1 T ( b k + 1 − A k + 1 x k ) k\to k+1: \begin{cases} G_{k+1}=G_k+A_{k+1}^TA_{k+1}\\ x_{k+1}=x_k+G_{k+1}^{-1}A_{k+1}^T(b_{k+1}-A_{k+1}x_k) \end{cases} k→k+1:{Gk+1=Gk+Ak+1TAk+1xk+1=xk+Gk+1−1Ak+1T(bk+1−Ak+1xk)
Let G k + 1 − 1 = P k + 1 , G k − 1 = P k G_{k+1}^{-1}=P_{k+1}, G_k^{-1}=P_k Gk+1−1=Pk+1,Gk−1=Pk
P k + 1 = P k − P k A k + 1 T ( I + A k + 1 P k A k + 1 T ) − 1 A k + 1 P k P_{k+1}=P_k-P_kA_{k+1}^T(I+A_{k+1}P_kA_{k+1}^T)^{-1}A_{k+1}P_k Pk+1=Pk−PkAk+1T(I+Ak+1PkAk+1T)−1Ak+1Pk
x k + 1 = x k + P k A k + 1 T ( b k + 1 − A k + 1 x k ) x_{k+1}=x_k+P_kA_{k+1}^T(b_{k+1}-A_{k+1}x_k) xk+1=xk+PkAk+1T(bk+1−Ak+1xk)
Lemma
X X X invertible matrix, U U U and V V V satisfy I + V X − 1 U I+VX^{-1}U I+VX−1U being invertible, then ( X + U V ) − 1 = X − 1 − ( X − 1 U ) ( I + V X − 1 U ) − 1 ( V X − 1 ) (X+UV)^{-1}=X^{-1}-(X^{-1}U)(I+VX^{-1}U)^{-1}(VX^{-1}) (X+UV)−1=X−1−(X−1U)(I+VX−1U)−1(VX−1)
总结
本节课先就上节课没讲完的拟牛顿法接着讲,回顾了秩为1的修正方法。由于该方法存在一些问题,为了解决这些问题,又介绍了DFP算法。接着又简要介绍了收敛速度较快的BFGS算法。接着开始了一个新的主题——解线性方程组。由于问题的复杂性,我们讲问题分成两种情况来分类讨论,即 m ≥ n m\geq n m≥n和 m < n m<n m<n两种情况。目前只讲了第一种情况。第一种情况本质上是一个数据拟合问题,根据数据,求得 m i n ∣ ∣ A x − b ∣ ∣ 2 min ||Ax-b||^2 min∣∣Ax−b∣∣2。然后,介绍了直接的求解公式。最后,对于如何在充分利用已知数据的基础上,对新数据进行拟合,又提出了递归最小二乘法。