牛顿法
首先介绍一下牛顿法。给定一个优化问题
m
i
n
x
f
(
x
)
min_x f(x)
minxf(x)
其中
f
(
x
)
f(x)
f(x)具有连续的二阶偏导。
由Taylor展开,当
x
→
x
k
x\rightarrow x_k
x→xk,
f
(
x
∗
)
=
f
(
x
k
)
+
∇
f
(
x
k
)
(
x
∗
−
x
k
)
+
1
2
(
x
∗
−
x
k
)
T
∇
2
f
(
x
k
)
(
x
∗
−
x
k
)
+
O
(
x
3
)
f(x^*) = f(x_k)+\nabla f(x_k)(x^*-x_k)+\frac{1}{2}(x^*-x_k)^T\nabla^2f(x_k)(x^*-x_k)+ O(x^3)
f(x∗)=f(xk)+∇f(xk)(x∗−xk)+21(x∗−xk)T∇2f(xk)(x∗−xk)+O(x3)
在极值点有
∇
f
(
x
∗
)
=
0
\nabla f(x^*)=0
∇f(x∗)=0,因此
∇
f
(
x
∗
)
=
f
(
x
∗
)
−
f
(
x
k
)
x
∗
−
x
k
=
∇
f
(
x
k
)
+
∇
2
f
(
x
k
)
(
x
∗
−
x
k
)
=
0
\nabla f(x^*)=\frac{f(x^*)-f(x_k)}{x^*-x_k} = \nabla f(x_k) + \nabla^2f(x_k)(x^*-x_k)=0
∇f(x∗)=x∗−xkf(x∗)−f(xk)=∇f(xk)+∇2f(xk)(x∗−xk)=0
⟹
x
∗
=
x
k
−
H
k
−
1
g
k
\Longrightarrow \quad x^*=x_k-H_k^{-1}g_k
⟹x∗=xk−Hk−1gk
其中
H
k
=
∇
2
f
(
x
k
)
,
g
k
=
∇
f
(
x
k
)
H_k=\nabla^2f(x_k), \quad g_k=\nabla f(x_k)
Hk=∇2f(xk),gk=∇f(xk)(Note:因为此处是求极小值,而当
H
k
H_k
Hk正定时,
f
(
x
)
f(x)
f(x)的极值为最小值,所以此处不妨假设
H
k
H_k
Hk是正定的,即 我们假设
f
(
x
)
f(x)
f(x) 是凸函数)
由此我们得到了牛顿法的下降方向 − H k − 1 g k -H_k^{-1}g_k −Hk−1gk,下面证明总 ∃ λ > 0 \exists \lambda>0 ∃λ>0,使得 p k = − λ H k − 1 g k p_k=-\lambda H_k^{-1}g_k pk=−λHk−1gk满足: f ( x k + p k ) < f ( x k ) f(x_k+p_k)<f(x_k) f(xk+pk)<f(xk)
由Taylor展开, f ( x k + 1 ) = f ( x k ) − λ g k H k − 1 g k + 1 2 λ 2 ( H k − 1 g k ) T H k ( H k − 1 g k ) f(x_{k+1})=f(x_k)-\lambda g_k H_k^{-1}g_k+ \frac{1}{2}\lambda^2(H_k^{-1}g_k)^T H_k (H_k^{-1}g_k) f(xk+1)=f(xk)−λgkHk−1gk+21λ2(Hk−1gk)THk(Hk−1gk),因为 H k H_k Hk正定,所以 ∀ a ∈ R n × 1 , a T H a > 0 \forall a\in \mathbb{R}^{n\times 1}, a^THa>0 ∀a∈Rn×1,aTHa>0,因此总存在一个充分小的 λ k \lambda_k λk,使得 f ( x k + 1 ) < f ( x k ) f(x_{k+1})<f(x_k) f(xk+1)<f(xk)
Note: 牛顿法由于使用了二阶信息,可以证明它是具有二阶收敛性的,因此它的收敛速度会比作为一阶算法的梯度下降法快很多,但缺点是 H k − 1 H_k^{-1} Hk−1的计算量很大,而且由于存在求逆的步骤,如果 H k H_k Hk的条件数很大的话,容易在数值计算中出现NAN
拟牛顿法
针对上述问题,实际应用中,我们通常会找一个正定矩阵 G k G_k Gk来拟合 H k − 1 H_k^{-1} Hk−1,或者 B k B_k Bk来拟合 H k H_k Hk,下面我们来看牛顿法中的 H k H_k Hk应该满足什么条件。
首先根据上述牛顿法,我们可以得到
g
k
=
g
k
+
1
+
H
k
+
1
(
x
k
−
x
k
+
1
)
g_{k}=g_{k+1}+H_{k+1}(x_k-x_{k+1})
gk=gk+1+Hk+1(xk−xk+1),记
y
k
=
g
k
+
1
−
g
k
,
s
k
=
x
k
+
1
−
x
k
⟹
H
k
+
1
s
k
=
y
k
y_k=g_{k+1}-g_k ,s_k=x_{k+1}-x_k \Longrightarrow H_{k+1} s_k=y_k
yk=gk+1−gk,sk=xk+1−xk⟹Hk+1sk=yk
由此我们得到了拟牛顿条件:
H
k
+
1
s
k
=
y
k
H_{k+1} s_k=y_k
Hk+1sk=yk。并且要求
H
k
+
1
H_{k+1}
Hk+1是一个对称正定矩阵。
BFGS,LBFGS
BFGS
BFGS的思路是构造
B
k
B_k
Bk来拟合
H
k
H_k
Hk,由于
H
k
H_k
Hk是正定矩阵,因此在线性空间中
H
k
H_k
Hk是对称矩阵,所以我们首先将
B
k
+
1
B_{k+1}
Bk+1构造为对称阵:
B
k
+
1
=
B
k
+
α
u
u
T
+
β
v
v
T
B_{k+1}=B_k+\alpha u u^T+\beta v v^T
Bk+1=Bk+αuuT+βvvT,那么根据拟牛顿条件,有:
B
k
+
1
s
k
=
B
k
s
k
+
α
u
u
T
s
k
+
β
v
v
T
s
k
=
B
k
s
k
+
α
u
T
s
k
u
+
β
v
T
s
k
v
\begin{aligned} B_{k+1}s_k &= B_ks_k+\alpha u u^Ts_k+\beta v v^Ts_k\\ &= B_ks_k+\alpha u^Ts_k u+\beta v^Ts_k v \end{aligned}
Bk+1sk=Bksk+αuuTsk+βvvTsk=Bksk+αuTsku+βvTskv
不妨令:
α
u
T
s
k
=
1
,
β
v
T
s
k
=
−
1
\alpha u^Ts_k=1,\beta v^Ts_k=-1
αuTsk=1,βvTsk=−1,则
α
=
1
u
T
s
k
,
β
=
−
1
v
T
s
k
\alpha=\frac{1}{u^Ts_k}, \beta= -\frac{1}{v^Ts_k}
α=uTsk1,β=−vTsk1。此时:
B
k
s
k
+
u
−
v
=
y
k
B_ks_k+u-v=y_k
Bksk+u−v=yk
再令
u
=
y
k
,
v
=
B
k
s
k
u=y_k, v=B_k s_k
u=yk,v=Bksk,
⟹
α
=
1
y
k
T
s
k
,
β
=
−
1
s
k
T
B
k
s
k
\Longrightarrow \alpha=\frac{1}{y_k^T s_k}, \beta= -\frac{1}{s_k^TB_k s_k}
⟹α=ykTsk1,β=−skTBksk1
⟹
B
k
+
1
=
B
k
+
y
k
y
k
T
y
k
T
s
k
−
B
k
s
k
s
k
T
B
k
s
k
T
B
k
s
k
\Longrightarrow\quad B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T s_k}-\frac{B_k s_k s_k^TB_k}{s_k^T B_k s_k}
⟹Bk+1=Bk+ykTskykykT−skTBkskBkskskTBk
但注意到这个式子里依然需要求
B
k
−
1
B_k^{-1}
Bk−1,因此还需要利用Sherman-Morrison公式:
(
A
+
u
v
T
)
−
1
=
A
−
1
−
A
−
1
u
v
T
A
−
1
1
+
v
T
A
−
1
u
(A+uv^T)^{-1} = A^{-1}-\frac{A^{-1}uv^TA^{-1}}{1+v^TA^{-1}u}
(A+uvT)−1=A−1−1+vTA−1uA−1uvTA−1
其中
A
A
A是n阶可逆矩阵,
A
+
u
v
T
A+uv^T
A+uvT也是可逆矩阵。
引用两次Sherman-Morrison公式可得:(具体推导参考BFGS公式推导)
G
k
+
1
=
(
1
−
s
k
y
k
T
s
k
T
y
k
)
G
k
(
1
−
s
k
y
k
T
s
k
T
y
k
)
T
+
s
k
s
k
T
s
k
T
y
k
G_{k+1} = (1-\frac{s_k y_k^T}{s_k^T y_k})G_k(1-\frac{s_k y_k^T}{s_k^T y_k})^T+\frac{s_k s_k^T}{s_k^T y_k}
Gk+1=(1−skTykskykT)Gk(1−skTykskykT)T+skTykskskT
其中
G
k
=
1
/
B
k
G_k = 1/B_k
Gk=1/Bk。
LBFGS
下面用 H k H_k Hk来代替上式的 G k G_k Gk。当维度比较大的时候,矩阵 H H H的储存开销很大,因此实际应用中,通常使用的是LBFGS(Limited-Memory BFGS):其基本思路是,计算过程中不储存 H k H_k Hk,而是储存向量序列 { y k } \{y_k\} {yk}和 { s k } \{s_k\} {sk}。并且向量序列 { y k } , { s k } \{y_k\},\{s_k\} {yk},{sk}也不是全部储存,而是固定存最新的 m m m个,需要 H k H_k Hk时,再用最新的m个 y k y_k yk和 s k s_k sk计算。这样一来,空间复杂度从 O ( N 2 ) O(N^2) O(N2)降到了 O ( m N ) O(mN) O(mN)。
记
ρ
k
=
1
y
k
T
s
k
,
V
k
=
I
−
ρ
k
y
k
s
k
T
\rho_k = \frac{1}{y_k^T s_k},V_k=I-\rho_k y_k s_k^T
ρk=ykTsk1,Vk=I−ρkykskT,那么上式可以写成:
H
k
+
1
=
V
k
T
H
k
V
k
+
ρ
k
s
k
s
k
T
H_{k+1} = V_k^TH_kV_k + \rho_k s_k s_k^T
Hk+1=VkTHkVk+ρkskskT
H
0
H_0
H0初始化为
I
I
I,那么:
H
1
=
V
0
T
H
0
V
0
+
ρ
0
s
0
s
0
T
H
2
=
V
1
T
H
1
V
1
+
ρ
1
s
1
s
1
T
=
V
1
T
(
V
0
T
H
0
V
0
+
ρ
0
s
0
s
0
T
)
V
1
+
ρ
1
s
1
s
1
T
=
V
1
T
V
0
T
H
0
V
0
V
1
+
V
1
T
ρ
0
s
0
s
0
T
V
1
+
ρ
1
s
1
s
1
T
⋯
H
k
+
1
=
(
V
k
T
V
k
−
1
T
⋯
V
1
T
V
0
T
)
H
0
(
V
0
V
1
⋯
V
k
−
1
V
k
)
+
(
V
k
T
V
k
−
1
T
⋯
V
1
T
)
ρ
0
s
0
s
0
T
(
V
1
⋯
V
k
−
1
V
k
)
⋯
+
V
k
T
V
k
−
1
T
ρ
k
−
2
s
k
−
2
s
k
−
2
T
V
k
−
1
V
k
+
V
k
T
ρ
k
−
1
s
k
−
1
s
k
−
1
T
V
k
+
ρ
k
s
k
s
k
T
(LBFGS[1])
\begin{aligned} H_1 = & V_0^TH_0V_0 + \rho_0 s_0 s_0^T\\ \quad\\ H_2 = & V_1^TH_1V_1 + \rho_1 s_1 s_1^T\\ = & V_1^T(V_0^TH_0V_0 + \rho_0 s_0 s_0^T)V_1 + \rho_1 s_1 s_1^T\\ = & V_1^TV_0^TH_0V_0V_1 + V_1^T\rho_0 s_0 s_0^TV_1 + \rho_1 s_1 s_1^T\\ \cdots\\ H_{k+1} = &(V_k^TV_{k-1}^T\cdots V_1^T V_0^T)H_0(V_0V_1\cdots V_{k-1}V_k) \\ & + (V_k^TV_{k-1}^T\cdots V_1^T)\rho_0 s_0 s_0^T(V_1\cdots V_{k-1}V_k) \\ & \cdots \\ & + V_k^TV_{k-1}^T\rho_{k-2} s_{k-2} s_{k-2}^T V_{k-1}V_k \\ & + V_k^T \rho_{k-1} s_{k-1} s_{k-1}^TV_k \\ & + \rho_k s_k s_k^T \tag{LBFGS[1]} \end{aligned}
H1=H2===⋯Hk+1=V0TH0V0+ρ0s0s0TV1TH1V1+ρ1s1s1TV1T(V0TH0V0+ρ0s0s0T)V1+ρ1s1s1TV1TV0TH0V0V1+V1Tρ0s0s0TV1+ρ1s1s1T(VkTVk−1T⋯V1TV0T)H0(V0V1⋯Vk−1Vk)+(VkTVk−1T⋯V1T)ρ0s0s0T(V1⋯Vk−1Vk)⋯+VkTVk−1Tρk−2sk−2sk−2TVk−1Vk+VkTρk−1sk−1sk−1TVk+ρkskskT(LBFGS[1])
只保留最近的m步:
H
k
+
1
=
(
V
k
T
V
k
−
1
T
⋯
V
k
−
m
+
1
T
)
H
0
(
V
k
−
m
+
1
⋯
V
k
−
1
V
k
)
+
(
V
k
T
V
k
−
1
T
⋯
V
k
−
m
+
2
T
)
ρ
k
−
m
+
1
s
k
−
m
+
1
s
k
−
m
+
1
T
(
V
k
−
m
+
2
⋯
V
k
−
1
V
k
)
⋯
+
V
k
T
ρ
k
−
1
s
k
−
1
s
k
−
1
T
V
k
+
ρ
k
s
k
s
k
T
(LBFGS[2])
\begin{aligned} H_{k+1} = &(V_k^TV_{k-1}^T\cdots V_{k-m+1}^T)H_0(V_{k-m+1}\cdots V_{k-1}V_k) \\ & + (V_k^TV_{k-1}^T\cdots V_{k-m+2}^T)\rho_{k-m+1} s_{k-m+1} s_{k-m+1}^T(V_{k-m+2}\cdots V_{k-1}V_k) \\ & \cdots \\ & + V_k^T \rho_{k-1} s_{k-1} s_{k-1}^TV_k \\ & + \rho_k s_k s_k^T \tag{LBFGS[2]} \end{aligned}
Hk+1=(VkTVk−1T⋯Vk−m+1T)H0(Vk−m+1⋯Vk−1Vk)+(VkTVk−1T⋯Vk−m+2T)ρk−m+1sk−m+1sk−m+1T(Vk−m+2⋯Vk−1Vk)⋯+VkTρk−1sk−1sk−1TVk+ρkskskT(LBFGS[2])
(LBFGS 双圈循环算法)
if k ≤ m k \leq m k≤m
i n c r = 0 , b o u n d = k \quad incr = 0, bound = k incr=0,bound=k
else
i n c r = k − m , b o u n d = m \quad incr = k-m, bound = m incr=k−m,bound=m
\quad
q b o u n d = ∇ f k q_{bound} = \nabla f_k qbound=∇fk
for i = b o u n d − 1 , b o u n d − 2 , ⋯ , 0 i = bound-1, bound-2, \cdots, 0 i=bound−1,bound−2,⋯,0
j = i + i n c r \quad j = i+incr j=i+incr
α i = ρ j s j T q i + 1 \quad \alpha_i = \rho_j s_j^Tq_{i+1} αi=ρjsjTqi+1
q i = q i + 1 − α i y j \quad q_i = q_{i+1} - \alpha_i y_j qi=qi+1−αiyj
\quad
r 0 = H 0 q 0 r_0 = H_0 q_0 r0=H0q0
for i = 0 , 1 , ⋯ , b o u n d − 1 i = 0,1, \cdots, bound-1 i=0,1,⋯,bound−1
j = i + i n c r \quad j = i + incr j=i+incr
β i = ρ j y j T r i \quad \beta_i = \rho_j y_j^T r_i βi=ρjyjTri
r i + 1 = r i + s j ( α i − β i ) \quad r_{i+1} = r_i + s_j(\alpha_i - \beta_i) ri+1=ri+sj(αi−βi)
\quad
H k ∇ f k = r H_k \nabla f_k =r Hk∇fk=r是要求的下降方向
下面证明(LBFGS 双圈循环算法)可以正确推导出上面的
H
k
+
1
H_{k+1}
Hk+1表达式:
k
≤
m
:
L
o
o
p
1
:
q
k
=
∇
f
k
q
k
−
i
=
q
k
−
i
+
1
−
α
k
−
i
y
k
−
i
=
q
k
−
i
+
1
−
p
k
−
i
s
k
−
i
T
q
k
−
i
+
1
y
k
−
i
=
(
I
−
p
k
−
i
y
k
−
i
s
k
−
i
T
)
q
k
−
i
+
1
=
V
k
−
i
q
k
−
i
+
1
=
V
k
−
i
V
k
−
i
+
1
⋯
V
k
−
1
q
k
α
k
−
i
=
ρ
k
−
i
s
k
−
i
T
q
k
−
i
+
1
=
ρ
k
−
i
s
k
−
i
T
V
k
−
i
+
1
⋯
V
k
−
1
q
k
T
e
r
m
i
n
a
t
i
o
n
:
q
0
=
V
0
V
1
⋯
V
k
−
1
∇
f
k
α
0
=
ρ
0
s
0
T
V
1
⋯
V
k
−
1
∇
f
k
L
o
o
p
2
:
r
0
=
H
0
q
0
=
H
0
V
0
V
1
⋯
V
k
−
1
∇
f
k
r
i
+
1
=
r
i
+
s
i
(
α
i
−
β
i
)
=
r
i
+
s
i
α
i
−
ρ
i
s
i
y
i
T
r
i
=
(
I
−
ρ
i
s
i
y
i
T
)
r
i
+
s
i
α
i
=
V
i
T
r
i
+
s
i
α
i
r
k
=
V
k
−
1
T
r
k
−
1
+
s
k
−
1
α
k
−
1
=
V
k
−
1
T
(
V
k
−
2
T
r
k
−
2
+
s
k
−
1
α
k
−
1
)
+
ρ
k
−
1
s
k
−
1
s
k
−
1
T
∇
f
k
=
V
k
−
1
T
V
k
−
2
T
r
k
−
2
+
V
k
−
1
T
ρ
k
−
2
s
k
−
2
s
k
−
2
T
V
k
−
1
∇
f
k
+
ρ
k
−
1
s
k
−
1
s
k
−
1
T
∇
f
k
⋯
=
V
k
−
1
T
V
k
−
2
T
⋯
V
1
T
r
0
+
V
k
−
1
T
V
k
−
2
T
⋯
V
1
T
ρ
0
s
0
s
0
T
V
1
⋯
V
k
−
1
∇
f
k
+
⋯
+
V
k
−
1
T
ρ
k
−
2
s
k
−
2
s
k
−
2
T
V
k
−
1
∇
f
k
+
ρ
k
−
1
s
k
−
1
s
k
−
1
T
∇
f
k
=
V
k
−
1
T
V
k
−
2
T
⋯
V
1
T
H
0
V
0
V
1
⋯
V
k
−
1
∇
f
k
+
V
k
−
1
T
V
k
−
2
T
⋯
V
1
T
ρ
0
s
0
s
0
T
V
1
⋯
V
k
−
1
∇
f
k
+
⋯
+
V
k
−
1
T
ρ
k
−
2
s
k
−
2
s
k
−
2
T
V
k
−
1
∇
f
k
+
ρ
k
−
1
s
k
−
1
s
k
−
1
T
∇
f
k
\begin{aligned} k \leq m: \qquad \qquad \qquad \quad&\\ Loop 1: \quad q_k = & \nabla f_k\\ q_{k-i} = & q_{k-i+1} - \alpha_{k-i}y_{k-i}\\ = & q_{k-i+1} - p_{k-i}s^T_{k-i}q_{k-i+1}y_{k-i}\\ = & (I-p_{k-i}y_{k-i}s^T_{k-i})q_{k-i+1}\\ = & V_{k-i}q_{k-i+1}\\ = & V_{k-i}V_{k-i+1}\cdots V_{k-1}q_k\\ \alpha_{k-i} = & \rho_{k-i}s_{k-i}^Tq_{k-i+1} \\ = & \rho_{k-i}s_{k-i}^TV_{k-i+1}\cdots V_{k-1}q_k\\ \quad\\ Termination:\quad q_0 = & V_0V_1\cdots V_{k-1}\nabla f_k\\ \alpha_0 = & \rho_0s_0^TV_1\cdots V_{k-1}\nabla f_k\\ \quad\\ Loop 2: \quad r_0 = & H_0q_0 = H_0V_0V_1\cdots V_{k-1}\nabla f_k\\ r_{i+1} = & r_i + s_i(\alpha_i - \beta_i)& \\ = & r_i + s_i\alpha_i - \rho_is_iy_i^Tr_i\\ = & (I-\rho_is_iy_i^T) r_i + s_i\alpha_i\\ = & V_i^Tr_i + s_i\alpha_i\\ r_k = & V_{k-1}^Tr_{k-1}+ s_{k-1}\alpha_{k-1}\\ = &V_{k-1}^T(V_{k-2}^Tr_{k-2} + s_{k-1}\alpha_{k-1}) + \rho_{k-1}s_{k-1}s_{k-1}^T\nabla f_k\\ = & V_{k-1}^TV_{k-2}^Tr_{k-2} + V_{k-1}^T\rho_{k-2}s_{k-2}s_{k-2}^TV_{k-1}\nabla f_k+ \rho_{k-1}s_{k-1}s_{k-1}^T\nabla f_k\\ \cdots\\ = & V_{k-1}^TV_{k-2}^T\cdots V_1^Tr_0 + V_{k-1}^TV_{k-2}^T\cdots V_1^T\rho_0s_0s_0^TV_1\cdots V_{k-1}\nabla f_k\\ & + \cdots + V_{k-1}^T\rho_{k-2}s_{k-2}s_{k-2}^TV_{k-1}\nabla f_k + \rho_{k-1}s_{k-1}s_{k-1}^T\nabla f_k\\ = & V_{k-1}^TV_{k-2}^T\cdots V_1^TH_0V_0V_1\cdots V_{k-1}\nabla f_k\\ & + V_{k-1}^TV_{k-2}^T\cdots V_1^T\rho_0s_0s_0^TV_1\cdots V_{k-1}\nabla f_k\\ & + \cdots \\ & + V_{k-1}^T\rho_{k-2}s_{k-2}s_{k-2}^TV_{k-1}\nabla f_k\\ & + \rho_{k-1}s_{k-1}s_{k-1}^T\nabla f_k \end{aligned}
k≤m:Loop1:qk=qk−i=====αk−i==Termination:q0=α0=Loop2:r0=ri+1====rk===⋯==∇fkqk−i+1−αk−iyk−iqk−i+1−pk−isk−iTqk−i+1yk−i(I−pk−iyk−isk−iT)qk−i+1Vk−iqk−i+1Vk−iVk−i+1⋯Vk−1qkρk−isk−iTqk−i+1ρk−isk−iTVk−i+1⋯Vk−1qkV0V1⋯Vk−1∇fkρ0s0TV1⋯Vk−1∇fkH0q0=H0V0V1⋯Vk−1∇fkri+si(αi−βi)ri+siαi−ρisiyiTri(I−ρisiyiT)ri+siαiViTri+siαiVk−1Trk−1+sk−1αk−1Vk−1T(Vk−2Trk−2+sk−1αk−1)+ρk−1sk−1sk−1T∇fkVk−1TVk−2Trk−2+Vk−1Tρk−2sk−2sk−2TVk−1∇fk+ρk−1sk−1sk−1T∇fkVk−1TVk−2T⋯V1Tr0+Vk−1TVk−2T⋯V1Tρ0s0s0TV1⋯Vk−1∇fk+⋯+Vk−1Tρk−2sk−2sk−2TVk−1∇fk+ρk−1sk−1sk−1T∇fkVk−1TVk−2T⋯V1TH0V0V1⋯Vk−1∇fk+Vk−1TVk−2T⋯V1Tρ0s0s0TV1⋯Vk−1∇fk+⋯+Vk−1Tρk−2sk−2sk−2TVk−1∇fk+ρk−1sk−1sk−1T∇fk
刚好等于式(LBFGS[1])计算出的
H
k
⋅
∇
f
k
H_k \cdot \nabla f_k
Hk⋅∇fk,也就是LBFGS的下降方向
k
>
m
:
L
o
o
p
1
:
q
m
=
∇
f
k
q
m
−
i
=
q
m
−
i
+
1
−
α
m
−
i
y
k
−
i
=
q
m
−
i
+
1
−
p
k
−
i
s
k
−
i
T
q
m
−
i
+
1
y
k
−
i
=
(
I
−
p
k
−
i
y
k
−
i
s
k
−
i
T
)
q
m
−
i
+
1
=
V
k
−
i
q
m
−
i
+
1
=
V
k
−
i
V
k
−
i
+
1
⋯
V
k
−
1
q
m
α
m
−
i
=
ρ
k
−
i
s
k
−
i
T
q
m
−
i
+
1
=
ρ
k
−
i
s
k
−
i
T
V
k
−
i
+
1
⋯
V
k
−
1
q
m
T
e
r
m
i
n
a
t
i
o
n
:
q
0
=
V
k
−
m
V
k
−
m
+
1
⋯
V
k
−
1
∇
f
k
α
0
=
ρ
k
−
m
s
k
−
m
T
V
k
−
m
+
1
⋯
V
k
−
1
∇
f
k
L
o
o
p
2
:
r
0
=
H
0
q
0
=
H
0
V
k
−
m
V
k
−
m
+
1
⋯
V
k
−
1
∇
f
k
r
i
+
1
=
r
i
+
s
k
−
m
+
i
(
α
i
−
β
i
)
=
r
i
+
s
k
−
m
+
i
α
i
−
ρ
k
−
m
+
i
s
k
−
m
+
i
y
k
−
m
+
i
T
r
i
=
(
I
−
ρ
k
−
m
+
i
s
k
−
m
+
i
y
k
−
m
+
i
T
)
r
i
+
s
k
−
m
+
i
α
i
=
V
k
−
m
+
i
T
r
i
+
s
k
−
m
+
i
α
i
r
m
=
V
k
−
1
T
r
m
−
1
+
s
k
−
1
α
m
−
1
=
V
k
−
1
T
(
V
k
−
2
T
r
m
−
2
+
s
k
−
1
α
m
−
1
)
+
ρ
k
−
1
s
k
−
1
s
k
−
1
T
∇
f
k
=
V
k
−
1
T
V
k
−
2
T
r
m
−
2
+
V
k
−
1
T
ρ
k
−
2
s
k
−
2
s
k
−
2
T
V
k
−
1
∇
f
k
+
ρ
k
−
1
s
k
−
1
s
k
−
1
T
∇
f
k
⋯
=
V
k
−
1
T
V
k
−
2
T
⋯
V
k
−
m
T
r
0
+
V
k
−
1
T
V
k
−
2
T
⋯
V
k
−
m
+
1
T
ρ
k
−
m
s
k
−
m
s
k
−
m
T
V
k
−
m
+
1
⋯
V
k
−
1
∇
f
k
+
⋯
+
V
k
−
1
T
ρ
k
−
2
s
k
−
2
s
k
−
2
T
V
k
−
1
∇
f
k
+
ρ
k
−
1
s
k
−
1
s
k
−
1
T
∇
f
k
=
V
k
−
1
T
V
k
−
2
T
⋯
V
k
−
m
T
H
0
V
k
−
m
V
k
−
m
+
1
⋯
V
k
−
1
∇
f
k
+
V
k
−
1
T
V
k
−
2
T
⋯
V
k
−
m
+
1
T
ρ
k
−
m
s
k
−
m
s
k
−
m
T
V
k
−
m
+
1
⋯
V
k
−
1
∇
f
k
+
⋯
+
V
k
−
1
T
ρ
k
−
2
s
k
−
2
s
k
−
2
T
V
k
−
1
∇
f
k
+
ρ
k
−
1
s
k
−
1
s
k
−
1
T
∇
f
k
\begin{aligned} k > m: \qquad \qquad \qquad \quad&\\ Loop 1: \quad q_m = & \nabla f_k\\ q_{m-i} = & q_{m-i+1} - \alpha_{m-i}y_{k-i}\\ = & q_{m-i+1} - p_{k-i}s^T_{k-i}q_{m-i+1}y_{k-i}\\ = & (I-p_{k-i}y_{k-i}s^T_{k-i})q_{m-i+1}\\ = & V_{k-i}q_{m-i+1}\\ = & V_{k-i}V_{k-i+1}\cdots V_{k-1}q_m\\ \alpha_{m-i} = & \rho_{k-i}s_{k-i}^Tq_{m-i+1} \\ = & \rho_{k-i}s_{k-i}^TV_{k-i+1}\cdots V_{k-1}q_m\\ \quad\\ Termination:\quad q_0 = & V_{k-m}V_{k-m+1}\cdots V_{k-1}\nabla f_k\\ \alpha_0 = & \rho_{k-m}s_{k-m}^TV_{k-m+1}\cdots V_{k-1}\nabla f_k\\ \quad\\ Loop 2: \quad r_0 = & H_0q_0 = H_0V_{k-m}V_{k-m+1}\cdots V_{k-1}\nabla f_k\\ r_{i+1} = & r_i + s_{k-m+i}(\alpha_i - \beta_i)& \\ = & r_i + s_{k-m+i}\alpha_i - \rho_{k-m+i}s_{k-m+i}y_{k-m+i}^Tr_i\\ = & (I-\rho_{k-m+i}s_{k-m+i}y_{k-m+i}^T) r_i + s_{k-m+i}\alpha_i\\ = & V_{k-m+i}^Tr_i + s_{k-m+i}\alpha_i\\ r_m = & V_{k-1}^Tr_{m-1}+ s_{k-1}\alpha_{m-1}\\ = &V_{k-1}^T(V_{k-2}^Tr_{m-2} + s_{k-1}\alpha_{m-1}) + \rho_{k-1}s_{k-1}s_{k-1}^T\nabla f_k\\ = & V_{k-1}^TV_{k-2}^Tr_{m-2} + V_{k-1}^T\rho_{k-2}s_{k-2}s_{k-2}^TV_{k-1}\nabla f_k+ \rho_{k-1}s_{k-1}s_{k-1}^T\nabla f_k\\ \cdots\\ = & V_{k-1}^TV_{k-2}^T\cdots V_{k-m}^Tr_0 + V_{k-1}^TV_{k-2}^T\cdots V_{k-m+1}^T\rho_{k-m}s_{k-m}s_{k-m}^TV_{k-m+1}\cdots V_{k-1}\nabla f_k\\ & + \cdots + V_{k-1}^T\rho_{k-2}s_{k-2}s_{k-2}^TV_{k-1}\nabla f_k + \rho_{k-1}s_{k-1}s_{k-1}^T\nabla f_k\\ = & V_{k-1}^TV_{k-2}^T\cdots V_{k-m}^TH_0V_{k-m}V_{k-m+1}\cdots V_{k-1}\nabla f_k\\ & + V_{k-1}^TV_{k-2}^T\cdots V_{k-m+1}^T\rho_{k-m}s_{k-m}s_{k-m}^TV_{k-m+1}\cdots V_{k-1}\nabla f_k\\ & + \cdots \\ & + V_{k-1}^T\rho_{k-2}s_{k-2}s_{k-2}^TV_{k-1}\nabla f_k\\ & + \rho_{k-1}s_{k-1}s_{k-1}^T\nabla f_k \end{aligned}
k>m:Loop1:qm=qm−i=====αm−i==Termination:q0=α0=Loop2:r0=ri+1====rm===⋯==∇fkqm−i+1−αm−iyk−iqm−i+1−pk−isk−iTqm−i+1yk−i(I−pk−iyk−isk−iT)qm−i+1Vk−iqm−i+1Vk−iVk−i+1⋯Vk−1qmρk−isk−iTqm−i+1ρk−isk−iTVk−i+1⋯Vk−1qmVk−mVk−m+1⋯Vk−1∇fkρk−msk−mTVk−m+1⋯Vk−1∇fkH0q0=H0Vk−mVk−m+1⋯Vk−1∇fkri+sk−m+i(αi−βi)ri+sk−m+iαi−ρk−m+isk−m+iyk−m+iTri(I−ρk−m+isk−m+iyk−m+iT)ri+sk−m+iαiVk−m+iTri+sk−m+iαiVk−1Trm−1+sk−1αm−1Vk−1T(Vk−2Trm−2+sk−1αm−1)+ρk−1sk−1sk−1T∇fkVk−1TVk−2Trm−2+Vk−1Tρk−2sk−2sk−2TVk−1∇fk+ρk−1sk−1sk−1T∇fkVk−1TVk−2T⋯Vk−mTr0+Vk−1TVk−2T⋯Vk−m+1Tρk−msk−msk−mTVk−m+1⋯Vk−1∇fk+⋯+Vk−1Tρk−2sk−2sk−2TVk−1∇fk+ρk−1sk−1sk−1T∇fkVk−1TVk−2T⋯Vk−mTH0Vk−mVk−m+1⋯Vk−1∇fk+Vk−1TVk−2T⋯Vk−m+1Tρk−msk−msk−mTVk−m+1⋯Vk−1∇fk+⋯+Vk−1Tρk−2sk−2sk−2TVk−1∇fk+ρk−1sk−1sk−1T∇fk
刚好等于式(LBFGS[2])计算出的
H
k
⋅
∇
f
k
H_k \cdot \nabla f_k
Hk⋅∇fk,也就只保存m个
s
i
,
y
i
s_i,y_i
si,yi时计算的LBFGS下降方向,得证。
注:每次进入Loop2都需要一个 H 0 H_0 H0,实践当中被证明比较好的初始化方法为: H 0 = γ k I , γ k = s k − 1 T y k − 1 y k − 1 T y k − 1 H_0 = \gamma_k I,\gamma_k = \frac{s_{k-1}^Ty_{k-1}}{y_{k-1}^Ty_{k-1}} H0=γkI,γk=yk−1Tyk−1sk−1Tyk−1
Reference: