1. Newton's Method
Newton's method is also an iterative solver. Compared with gradient descent, its search direction uses not only first-order gradient information but also second-order (curvature) information, so it typically converges faster.
Suppose we have a function $f(x),\ x\in\mathbb{R}$, and let $x_k$ be the current estimate of the minimizer. Take the second-order Taylor expansion of $f$ at $x_k$:
$$f(x)=f(x_k)+f^\prime(x_k)(x-x_k)+\frac{1}{2}f^{\prime\prime}(x_k)(x-x_k)^2$$
Differentiating both sides with respect to $x$:
$$f^\prime(x)=f^\prime(x_k)+f^{\prime\prime}(x_k)(x-x_k)$$
At an extremum of $f$, $f^\prime(x)=0$, so:
$$f^\prime(x_k)+f^{\prime\prime}(x_k)(x-x_k)=0$$
Solving for $x$ gives:
$$x=x_k-\frac{f^\prime(x_k)}{f^{\prime\prime}(x_k)}$$
Therefore, given an initial point $x_0$, we can construct the following iteration:
$$x_{k+1}=x_k-\frac{f^\prime(x_k)}{f^{\prime\prime}(x_k)},\qquad k=0,1,2,\dots$$
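As a concrete illustration, here is a minimal Python sketch of this one-dimensional iteration (the test function, tolerance, and iteration cap are assumptions, not part of the derivation):

```python
def newton_1d(f_prime, f_second, x0, tol=1e-8, max_iter=100):
    """Minimize a 1-D function given its first and second derivatives."""
    x = x0
    for _ in range(max_iter):
        step = f_prime(x) / f_second(x)  # f'(x_k) / f''(x_k)
        x -= step                        # x_{k+1} = x_k - f'(x_k)/f''(x_k)
        if abs(step) < tol:
            break
    return x

# Example: minimize f(x) = (x - 3)^2 + 1, whose minimizer is x = 3.
x_star = newton_1d(lambda x: 2 * (x - 3), lambda x: 2.0, x0=10.0)
print(x_star)  # ~3.0 (exact after one step because f is quadratic)
```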
Generalizing to the multivariate case, suppose ${\bf x}\in\mathbb{R}^N$. The second-order Taylor expansion at ${\bf x_k}$ is:
$$f({\bf x})=f({\bf x_k})+\nabla f({\bf x_k})({\bf x-x_k})+\frac{1}{2}({\bf x-x_k})^{\bf T}\nabla^2 f({\bf x_k})({\bf x-x_k})$$
where $\nabla f$ is the gradient vector of $f({\bf x})$ and $\nabla^2 f$ is its Hessian matrix:
$$\nabla f=\left[\begin{matrix}\frac{\partial f}{\partial x_1}\\ \frac{\partial f}{\partial x_2}\\ \vdots\\ \frac{\partial f}{\partial x_N}\end{matrix}\right],\qquad \nabla^2 f=\left[\begin{matrix}\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1\partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1\partial x_N}\\ \frac{\partial^2 f}{\partial x_2\partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2\partial x_N}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial^2 f}{\partial x_N\partial x_1} & \frac{\partial^2 f}{\partial x_N\partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_N^2}\end{matrix}\right]$$
Similarly, taking the gradient of both sides with respect to $\bf x$ and setting $\nabla f({\bf x})=0$, we obtain:
$$\nabla f({\bf x_k})+\nabla^2 f({\bf x_k})({\bf x-x_k})=0$$
which yields:
$${\bf x}={\bf x_k}-(\nabla^2 f({\bf x_k}))^{-1}\nabla f({\bf x_k})$$
Writing $\nabla f({\bf x_k})={\bf g_k}$ and $\nabla^2 f({\bf x_k})={\bf H_k}$, we obtain the iteration:
$${\bf x_{k+1}}={\bf x_k}-{\bf H_k^{-1}\cdot g_k}$$
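A minimal NumPy sketch of this basic iteration (the quadratic test problem and stopping rule are illustrative assumptions):

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-8, max_iter=50):
    """Basic Newton iteration: x_{k+1} = x_k - H_k^{-1} g_k."""
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve H_k d = -g_k instead of explicitly inverting H_k.
        d = np.linalg.solve(hess(x), -g)
        x = x + d
    return x

# Example: f(x) = 0.5 x^T A x - b^T x with A positive definite.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_min = newton(lambda x: A @ x - b, lambda x: A, np.zeros(2))
print(x_min)  # equals the solution of A x = b
```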
This is the original (pure) Newton's method, in which ${\bf d_k}=-{\bf H_k^{-1}\cdot g_k}$ is called the Newton direction. The pure Newton's method has no step-size factor, so a stable decrease of the objective cannot be guaranteed. This leads to the damped Newton method: the search direction is still ${\bf d_k}$, but each iteration performs a line search along that direction for the best step size. That is, before computing ${\bf x_{k+1}}$, we first obtain the step size $\lambda_k$:
$$\lambda_k=\arg\min_{\lambda}f({\bf x_k}+\lambda{\bf d_k}),\qquad \lambda\in\mathbb{R}$$
and only then update:
$${\bf x_{k+1}}={\bf x_k}+\lambda_k{\bf d_k}$$
that is:
$${\bf x_{k+1}}={\bf x_k}-\lambda_k{\bf H_k^{-1}\cdot g_k}$$
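A sketch of one damped Newton update, using a scalar minimization over $\lambda$ as a stand-in for the exact line search (the helper name and the use of scipy's `minimize_scalar` are assumptions for illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def damped_newton_step(f, grad, hess, x):
    """One damped Newton update: line search along the Newton direction d_k."""
    g = grad(x)
    d = np.linalg.solve(hess(x), -g)           # Newton direction d_k = -H_k^{-1} g_k
    # lambda_k = argmin_lambda f(x_k + lambda * d_k), found numerically here.
    lam = minimize_scalar(lambda t: f(x + t * d)).x
    return x + lam * d                         # x_{k+1} = x_k + lambda_k d_k
```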
2. Quasi-Newton Methods
Newton's method converges quickly, but it has two drawbacks:
1. It requires the second-order partial derivatives of the objective, which is computationally expensive.
2. The Hessian is not guaranteed to be positive definite (it may not even be invertible), in which case the Newton step is unusable and the method breaks down.
This motivates quasi-Newton methods: instead of computing second derivatives, we construct a positive-definite symmetric matrix that approximates the Hessian and use it to optimize the objective. Different ways of constructing this approximation yield different quasi-Newton methods.
Suppose the iteration has reached ${\bf x_{k+1}}$. Take the second-order Taylor expansion of the objective at ${\bf x_{k+1}}$:
$$f({\bf x})\approx f({\bf x_{k+1}})+\nabla f({\bf x_{k+1}})({\bf x-x_{k+1}})+\frac{1}{2}({\bf x-x_{k+1}})^{\bf T}\nabla^2 f({\bf x_{k+1}})({\bf x-x_{k+1}})$$
Taking the gradient of both sides:
$$\nabla f({\bf x})\approx\nabla f({\bf x_{k+1}})+\nabla^2 f({\bf x_{k+1}})({\bf x-x_{k+1}})$$
Setting ${\bf x=x_k}$ gives:
$$\nabla f({\bf x_k})\approx\nabla f({\bf x_{k+1}})+\nabla^2 f({\bf x_{k+1}})({\bf x_k-x_{k+1}})$$
that is:
$${\bf g_{k+1}-g_k}\approx{\bf H_{k+1}(x_{k+1}-x_k)}$$
Introducing the notation ${\bf y_k=g_{k+1}-g_k}$ and ${\bf s_k=x_{k+1}-x_k}$, we get:
$${\bf y_k\approx H_{k+1}s_k}\qquad\text{or}\qquad{\bf s_k\approx H_{k+1}^{-1}y_k}$$
We use ${\bf B_{k+1}}$ to approximate the Hessian ${\bf H_{k+1}}$, and ${\bf D_{k+1}}$ to approximate its inverse ${\bf H_{k+1}^{-1}}$. Then:
$${\bf y_k=B_{k+1}s_k}\qquad\text{or}\qquad{\bf s_k=D_{k+1}y_k}$$
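This is the secant (quasi-Newton) condition that the approximations must satisfy. A tiny NumPy check on a quadratic objective, where the relation holds exactly (the test function and the two iterates are assumptions chosen only for illustration):

```python
import numpy as np

# Quadratic objective f(x) = 0.5 x^T A x - b^T x, whose Hessian is the constant A.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
grad = lambda x: A @ x - b

x_k  = np.array([0.0, 0.0])
x_k1 = np.array([0.5, -0.2])          # some next iterate
s_k = x_k1 - x_k                      # s_k = x_{k+1} - x_k
y_k = grad(x_k1) - grad(x_k)          # y_k = g_{k+1} - g_k

print(np.allclose(y_k, A @ s_k))      # True: y_k = H s_k for a quadratic
```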
The DFP Algorithm
The core idea of the DFP algorithm is to approximate $D_{k+1}$ (i.e. $H_{k+1}^{-1}$) iteratively, as follows:
$$D_{k+1}=D_k+\Delta D_k$$
$D_0$ is usually taken to be the identity matrix, so the key question is how to construct $\Delta D_k$. Here we construct it as:
$$\Delta D_k=\alpha{\bf uu^T}+\beta{\bf vv^T}$$
which guarantees that $\Delta D_k$ is symmetric. Starting from ${\bf s_k}=D_{k+1}{\bf y_k}$, we can derive:
$${\bf s_k}=(D_k+\Delta D_k){\bf y_k}=D_k{\bf y_k}+\alpha{\bf uu^Ty_k}+\beta{\bf vv^Ty_k}$$
$${\bf s_k}=D_k{\bf y_k}+{\bf u}(\alpha{\bf u^Ty_k})+{\bf v}(\beta{\bf v^Ty_k})$$
Here $\alpha{\bf u^Ty_k}$ and $\beta{\bf v^Ty_k}$ are scalars, so we are free to set:
$$\alpha{\bf u^Ty_k}=1,\qquad\beta{\bf v^Ty_k}=-1$$
which gives:
$$\alpha=\frac{1}{\bf u^Ty_k},\qquad\beta=-\frac{1}{\bf v^Ty_k}$$
Next we need to decide how to construct $\bf u$ and $\bf v$. Returning to the expression for ${\bf s_k}$ above and substituting $\alpha{\bf u^Ty_k}=1,\ \beta{\bf v^Ty_k}=-1$, we get:
$${\bf s_k}=D_k{\bf y_k}+{\bf u}-{\bf v}$$
$${\bf u-v}={\bf s_k}-D_k{\bf y_k}$$
To make this hold, we can simply set ${\bf u=s_k}$ and ${\bf v}=D_k{\bf y_k}$, which gives:
$$\alpha=\frac{1}{\bf s_k^Ty_k},\qquad\beta=-\frac{1}{{\bf y_k^T}D_k^T{\bf y_k}}$$
Finally we obtain the update formula for $\Delta D_k$:
$$\Delta D_k=\frac{\bf s_ks_k^T}{\bf s_k^Ty_k}-\frac{D_k{\bf y_ky_k^T}D_k^T}{{\bf y_k^T}D_k^T{\bf y_k}}$$
Thus we can initialize $D_0=I$; in each iteration we take the search direction $-D_k{\bf g_k}$, determine the best step size as in the damped Newton method, and then update ${\bf x_{k+1}}$. Before the next iteration, $D_{k+1}$ is obtained by the same update rule.
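A compact sketch of this DFP loop (the scipy-based exact line search and the stopping rule are assumptions for illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def dfp(f, grad, x0, tol=1e-8, max_iter=200):
    """DFP quasi-Newton method: D_k approximates the inverse Hessian."""
    x = x0.astype(float)
    D = np.eye(len(x))                                # D_0 = I
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        d = -D @ g                                    # search direction -D_k g_k
        lam = minimize_scalar(lambda t: f(x + t * d)).x
        s = lam * d                                   # s_k = x_{k+1} - x_k
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g                                 # y_k = g_{k+1} - g_k
        # DFP update: Delta D_k = s s^T / (s^T y) - D y y^T D^T / (y^T D^T y)
        Dy = D @ y
        D = D + np.outer(s, s) / (s @ y) - np.outer(Dy, Dy) / (y @ Dy)
        x, g = x_new, g_new
    return x
```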
The BFGS Algorithm
Compared with DFP, the BFGS algorithm performs better in practice and is one of the most widely used methods for unconstrained nonlinear optimization.
Unlike DFP, BFGS constructs $B_{k+1}$ to directly approximate the Hessian $H_{k+1}$. Similarly:
$$B_{k+1}=B_k+\Delta B_k$$
Following the same derivation as for DFP (omitted here), we arrive at the update formula for the correction matrix $\Delta B_k$:
$$\Delta B_k=\frac{\bf y_ky_k^T}{\bf y_k^Ts_k}-\frac{B_k{\bf s_ks_k^T}B_k^T}{{\bf s_k^T}B_k^T{\bf s_k}}$$
Notice that this is exactly the DFP update for $\Delta D_k$ with ${\bf s_k}$ and ${\bf y_k}$ swapped.
In each iteration, BFGS takes the search direction ${\bf d_k}=-B_k^{-1}{\bf g_k}$. To avoid a matrix inversion, the more common approach is to use the Sherman-Morrison formula to express $B_{k+1}^{-1}$ directly in terms of $B_k^{-1}$ at each iteration:
$$B_{k+1}^{-1}=\left(I-\frac{\bf s_ky_k^T}{\bf y_k^Ts_k}\right)B_k^{-1}\left(I-\frac{\bf y_ks_k^T}{\bf y_k^Ts_k}\right)+\frac{\bf s_ks_k^T}{\bf y_k^Ts_k}$$
For how this expression is derived from the Sherman-Morrison formula, see https://blog.csdn.net/langb2014/article/details/48915425, whose appendix gives a detailed derivation.
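A sketch of BFGS using this inverse update directly, so no matrix is ever inverted (the line search and stopping rule are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def bfgs(f, grad, x0, tol=1e-8, max_iter=200):
    """BFGS with the Sherman-Morrison form of the inverse-Hessian update."""
    x = x0.astype(float)
    n = len(x)
    H_inv = np.eye(n)                                  # approximation of B_k^{-1}
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        d = -H_inv @ g                                 # d_k = -B_k^{-1} g_k
        lam = minimize_scalar(lambda t: f(x + t * d)).x
        s = lam * d
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g
        rho = 1.0 / (y @ s)
        V = np.eye(n) - rho * np.outer(s, y)           # I - s y^T / (y^T s)
        H_inv = V @ H_inv @ V.T + rho * np.outer(s, s) # inverse update above
        x, g = x_new, g_new
    return x
```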
L-BFGS
The BFGS algorithm needs to store the $N\times N$ matrix $D_k$. When $N$ is large this requires a great deal of memory, which motivates L-BFGS (Limited-memory BFGS). Its basic idea is not to store $D_k$ itself, but to store the sequence of vectors $\{{\bf s_i}\}$ and $\{{\bf y_i}\}$ used in the computation; whenever $D_k$ is needed, it is reconstructed from $\{{\bf s_i}\}$ and $\{{\bf y_i}\}$. Of course, only a fixed number $m$ of these vector pairs is kept, so the memory cost drops from $O(N^2)$ to $O(mN)$. Let us derive this.
Recall the BFGS update:
$$B_{k+1}^{-1}=\left(I-\frac{\bf s_ky_k^T}{\bf y_k^Ts_k}\right)B_k^{-1}\left(I-\frac{\bf y_ks_k^T}{\bf y_k^Ts_k}\right)+\frac{\bf s_ks_k^T}{\bf y_k^Ts_k}$$
Since $D_k=B_k^{-1}$, we have:
$$D_{k+1}=\left(I-\frac{\bf s_ky_k^T}{\bf y_k^Ts_k}\right)D_k\left(I-\frac{\bf y_ks_k^T}{\bf y_k^Ts_k}\right)+\frac{\bf s_ks_k^T}{\bf y_k^Ts_k}$$
Let $\rho_k=\frac{1}{\bf y_k^Ts_k}$ and $V_k=I-\rho_k{\bf y_ks_k^T}$. Then:
$$D_{k+1}=V_k^TD_kV_k+\rho_k{\bf s_ks_k^T}$$
Given an initial $D_0$, we have:
$$D_1=V_0^TD_0V_0+\rho_0{\bf s_0s_0^T}$$
$$\begin{aligned}
D_2&=V_1^TD_1V_1+\rho_1{\bf s_1s_1^T}\\
&=V_1^TV_0^TD_0V_0V_1+V_1^T(\rho_0{\bf s_0s_0^T})V_1+\rho_1{\bf s_1s_1^T}\\
D_3&=V_2^TD_2V_2+\rho_2{\bf s_2s_2^T}\\
&=V_2^T(V_1^TV_0^TD_0V_0V_1+V_1^T(\rho_0{\bf s_0s_0^T})V_1+\rho_1{\bf s_1s_1^T})V_2+\rho_2{\bf s_2s_2^T}\\
&=V_2^TV_1^TV_0^TD_0V_0V_1V_2+V_2^TV_1^T(\rho_0{\bf s_0s_0^T})V_1V_2+V_2^T(\rho_1{\bf s_1s_1^T})V_2+\rho_2{\bf s_2s_2^T}\\
&\cdots\\
D_{k+1}&=(V_k^TV_{k-1}^T\cdots V_1^TV_0^T)D_0(V_0V_1\cdots V_{k-1}V_k)\\
&+(V_k^TV_{k-1}^T\cdots V_1^T)(\rho_0{\bf s_0s_0^T})(V_1\cdots V_{k-1}V_k)\\
&+(V_k^TV_{k-1}^T\cdots V_2^T)(\rho_1{\bf s_1s_1^T})(V_2\cdots V_{k-1}V_k)\\
&+\cdots\\
&+V_k^T(\rho_{k-1}{\bf s_{k-1}s_{k-1}^T})V_k\\
&+\rho_k{\bf s_ks_k^T}
\end{aligned}$$
Therefore, to compute $D_{k+1}$ we only need to store the vectors ${\bf s_i}$ and ${\bf y_i}$ from the previous iterations. However, we do not know in advance how many iterations convergence will take; if the iteration count is very large, we would still end up storing a very large number of vectors. To reduce memory consumption we must discard some of them, and naturally we discard the earliest ones first, starting from ${\bf s_0}$ and ${\bf y_0}$ and moving forward. Thus, if we decide to store only $m$ vector pairs, $D_{k+1}$ can be approximated by:
$$\begin{aligned}
D_{k+1}&=(V_k^TV_{k-1}^T\cdots V_{k-m+2}^TV_{k-m+1}^T)D_0(V_{k-m+1}V_{k-m+2}\cdots V_{k-1}V_k)\\
&+(V_k^TV_{k-1}^T\cdots V_{k-m+2}^T)(\rho_{k-m+1}{\bf s_{k-m+1}s_{k-m+1}^T})(V_{k-m+2}\cdots V_{k-1}V_k)\\
&+(V_k^TV_{k-1}^T\cdots V_{k-m+3}^T)(\rho_{k-m+2}{\bf s_{k-m+2}s_{k-m+2}^T})(V_{k-m+3}\cdots V_{k-1}V_k)\\
&+\cdots\\
&+(V_k^TV_{k-1}^T)(\rho_{k-2}{\bf s_{k-2}s_{k-2}^T})V_{k-1}V_k\\
&+V_k^T(\rho_{k-1}{\bf s_{k-1}s_{k-1}^T})V_k\\
&+\rho_k{\bf s_ks_k^T}
\end{aligned}$$
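In practice this product is never formed explicitly; the standard L-BFGS two-loop recursion (not derived in this article) evaluates the search direction $-D_k{\bf g_k}$ directly from the $m$ most recent $({\bf s_i},{\bf y_i})$ pairs. A minimal sketch, assuming the pairs are kept in lists ordered from oldest to newest and $D_0$ is taken as the usual scaled identity $\gamma I$ (function name and scaling choice are illustrative assumptions):

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list):
    """Two-loop recursion: returns -D_k g using only the stored (s_i, y_i) pairs."""
    q = g.copy()
    alphas = []
    # First loop: newest pair to oldest.
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        alphas.append((rho, alpha))
    # Initial D_0 = gamma * I, with the common scaling gamma = s^T y / y^T y.
    if s_list:
        gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    else:
        gamma = 1.0
    r = gamma * q
    # Second loop: oldest pair to newest.
    for (s, y), (rho, alpha) in zip(zip(s_list, y_list), reversed(alphas)):
        beta = rho * (y @ r)
        r += (alpha - beta) * s
    return -r  # search direction -D_k g_k
```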