牛顿法,拟牛顿法,BFGS,LBFGS

本文深入探讨了牛顿法及其在高维优化中的局限性,进而引出了拟牛顿法的概念。详细介绍了BFGS和LBFGS两种经典的拟牛顿法优化算法,阐述了它们的基本思想、计算流程及优势,并通过数学推导验证了算法的正确性。BFGS利用近似Hessian矩阵加速优化,而LBFGS则通过有限存储方式降低了内存需求,适合大规模问题求解。
摘要由CSDN通过智能技术生成

牛顿法

首先介绍一下牛顿法。给定一个优化问题
m i n x f ( x ) min_x f(x) minxf(x)
其中 f ( x ) f(x) f(x)具有连续的二阶偏导。

由Taylor展开,当 x → x k x\rightarrow x_k xxk
f ( x ∗ ) = f ( x k ) + ∇ f ( x k ) ( x ∗ − x k ) + 1 2 ( x ∗ − x k ) T ∇ 2 f ( x k ) ( x ∗ − x k ) + O ( x 3 ) f(x^*) = f(x_k)+\nabla f(x_k)(x^*-x_k)+\frac{1}{2}(x^*-x_k)^T\nabla^2f(x_k)(x^*-x_k)+ O(x^3) f(x)=f(xk)+f(xk)(xxk)+21(xxk)T2f(xk)(xxk)+O(x3)
在极值点有 ∇ f ( x ∗ ) = 0 \nabla f(x^*)=0 f(x)=0,因此 ∇ f ( x ∗ ) = f ( x ∗ ) − f ( x k ) x ∗ − x k = ∇ f ( x k ) + ∇ 2 f ( x k ) ( x ∗ − x k ) = 0 \nabla f(x^*)=\frac{f(x^*)-f(x_k)}{x^*-x_k} = \nabla f(x_k) + \nabla^2f(x_k)(x^*-x_k)=0 f(x)=xxkf(x)f(xk)=f(xk)+2f(xk)(xxk)=0
⟹ x ∗ = x k − H k − 1 g k \Longrightarrow \quad x^*=x_k-H_k^{-1}g_k x=xkHk1gk
其中 H k = ∇ 2 f ( x k ) , g k = ∇ f ( x k ) H_k=\nabla^2f(x_k), \quad g_k=\nabla f(x_k) Hk=2f(xk),gk=f(xk)(Note:因为此处是求极小值,而当 H k H_k Hk正定时, f ( x ) f(x) f(x)的极值为最小值,所以此处不妨假设 H k H_k Hk是正定的,即 我们假设 f ( x ) f(x) f(x) 是凸函数

由此我们得到了牛顿法的下降方向 − H k − 1 g k -H_k^{-1}g_k Hk1gk,下面证明总 ∃ λ > 0 \exists \lambda>0 λ>0,使得 p k = − λ H k − 1 g k p_k=-\lambda H_k^{-1}g_k pk=λHk1gk满足: f ( x k + p k ) < f ( x k ) f(x_k+p_k)<f(x_k) f(xk+pk)<f(xk)

由Taylor展开, f ( x k + 1 ) = f ( x k ) − λ g k H k − 1 g k + 1 2 λ 2 ( H k − 1 g k ) T H k ( H k − 1 g k ) f(x_{k+1})=f(x_k)-\lambda g_k H_k^{-1}g_k+ \frac{1}{2}\lambda^2(H_k^{-1}g_k)^T H_k (H_k^{-1}g_k) f(xk+1)=f(xk)λgkHk1gk+21λ2(Hk1gk)THk(Hk1gk),因为 H k H_k Hk正定,所以 ∀ a ∈ R n × 1 , a T H a > 0 \forall a\in \mathbb{R}^{n\times 1}, a^THa>0 aRn×1,aTHa>0,因此总存在一个充分小的 λ k \lambda_k λk,使得 f ( x k + 1 ) < f ( x k ) f(x_{k+1})<f(x_k) f(xk+1)<f(xk)

Note: 牛顿法由于使用了二阶信息,可以证明它是具有二阶收敛性的,因此它的收敛速度会比作为一阶算法的梯度下降法快很多,但缺点是 H k − 1 H_k^{-1} Hk1的计算量很大,而且由于存在求逆的步骤,如果 H k H_k Hk的条件数很大的话,容易在数值计算中出现NAN


拟牛顿法

针对上述问题,实际应用中,我们通常会找一个正定矩阵 G k G_k Gk来拟合 H k − 1 H_k^{-1} Hk1,或者 B k B_k Bk来拟合 H k H_k Hk,下面我们来看牛顿法中的 H k H_k Hk应该满足什么条件。

首先根据上述牛顿法,我们可以得到 g k = g k + 1 + H k + 1 ( x k − x k + 1 ) g_{k}=g_{k+1}+H_{k+1}(x_k-x_{k+1}) gk=gk+1+Hk+1(xkxk+1),记 y k = g k + 1 − g k , s k = x k + 1 − x k ⟹ H k + 1 s k = y k y_k=g_{k+1}-g_k ,s_k=x_{k+1}-x_k \Longrightarrow H_{k+1} s_k=y_k yk=gk+1gksk=xk+1xkHk+1sk=yk
由此我们得到了拟牛顿条件 H k + 1 s k = y k H_{k+1} s_k=y_k Hk+1sk=yk。并且要求 H k + 1 H_{k+1} Hk+1是一个对称正定矩阵。

BFGS,LBFGS

BFGS

BFGS的思路是构造 B k B_k Bk来拟合 H k H_k Hk,由于 H k H_k Hk是正定矩阵,因此在线性空间中 H k H_k Hk是对称矩阵,所以我们首先将 B k + 1 B_{k+1} Bk+1构造为对称阵: B k + 1 = B k + α u u T + β v v T B_{k+1}=B_k+\alpha u u^T+\beta v v^T Bk+1=Bk+αuuT+βvvT,那么根据拟牛顿条件,有:
B k + 1 s k = B k s k + α u u T s k + β v v T s k = B k s k + α u T s k u + β v T s k v \begin{aligned} B_{k+1}s_k &= B_ks_k+\alpha u u^Ts_k+\beta v v^Ts_k\\ &= B_ks_k+\alpha u^Ts_k u+\beta v^Ts_k v \end{aligned} Bk+1sk=Bksk+αuuTsk+βvvTsk=Bksk+αuTsku+βvTskv
不妨令: α u T s k = 1 , β v T s k = − 1 \alpha u^Ts_k=1,\beta v^Ts_k=-1 αuTsk=1,βvTsk=1,则 α = 1 u T s k , β = − 1 v T s k \alpha=\frac{1}{u^Ts_k}, \beta= -\frac{1}{v^Ts_k} α=uTsk1,β=vTsk1。此时:
B k s k + u − v = y k B_ks_k+u-v=y_k Bksk+uv=yk
再令 u = y k , v = B k s k u=y_k, v=B_k s_k u=yk,v=Bksk ⟹ α = 1 y k T s k , β = − 1 s k T B k s k \Longrightarrow \alpha=\frac{1}{y_k^T s_k}, \beta= -\frac{1}{s_k^TB_k s_k} α=ykTsk1,β=skTBksk1
⟹ B k + 1 = B k + y k y k T y k T s k − B k s k s k T B k s k T B k s k \Longrightarrow\quad B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T s_k}-\frac{B_k s_k s_k^TB_k}{s_k^T B_k s_k} Bk+1=Bk+ykTskykykTskTBkskBkskskTBk

但注意到这个式子里依然需要求 B k − 1 B_k^{-1} Bk1,因此还需要利用Sherman-Morrison公式:
( A + u v T ) − 1 = A − 1 − A − 1 u v T A − 1 1 + v T A − 1 u (A+uv^T)^{-1} = A^{-1}-\frac{A^{-1}uv^TA^{-1}}{1+v^TA^{-1}u} (A+uvT)1=A11+vTA1uA1uvTA1
其中 A A A是n阶可逆矩阵, A + u v T A+uv^T A+uvT也是可逆矩阵。

引用两次Sherman-Morrison公式可得:(具体推导参考BFGS公式推导)
G k + 1 = ( 1 − s k y k T s k T y k ) G k ( 1 − s k y k T s k T y k ) T + s k s k T s k T y k G_{k+1} = (1-\frac{s_k y_k^T}{s_k^T y_k})G_k(1-\frac{s_k y_k^T}{s_k^T y_k})^T+\frac{s_k s_k^T}{s_k^T y_k} Gk+1=(1skTykskykT)Gk(1skTykskykT)T+skTykskskT
其中 G k = 1 / B k G_k = 1/B_k Gk=1/Bk

LBFGS

下面用 H k H_k Hk来代替上式的 G k G_k Gk。当维度比较大的时候,矩阵 H H H的储存开销很大,因此实际应用中,通常使用的是LBFGS(Limited-Memory BFGS):其基本思路是,计算过程中不储存 H k H_k Hk,而是储存向量序列 { y k } \{y_k\} {yk} { s k } \{s_k\} {sk}。并且向量序列 { y k } , { s k } \{y_k\},\{s_k\} {yk}{sk}也不是全部储存,而是固定存最新的 m m m个,需要 H k H_k Hk时,再用最新的m个 y k y_k yk s k s_k sk计算。这样一来,空间复杂度从 O ( N 2 ) O(N^2) O(N2)降到了 O ( m N ) O(mN) O(mN)

ρ k = 1 y k T s k , V k = I − ρ k y k s k T \rho_k = \frac{1}{y_k^T s_k},V_k=I-\rho_k y_k s_k^T ρk=ykTsk1Vk=IρkykskT,那么上式可以写成:
H k + 1 = V k T H k V k + ρ k s k s k T H_{k+1} = V_k^TH_kV_k + \rho_k s_k s_k^T Hk+1=VkTHkVk+ρkskskT

H 0 H_0 H0初始化为 I I I,那么:
H 1 = V 0 T H 0 V 0 + ρ 0 s 0 s 0 T H 2 = V 1 T H 1 V 1 + ρ 1 s 1 s 1 T = V 1 T ( V 0 T H 0 V 0 + ρ 0 s 0 s 0 T ) V 1 + ρ 1 s 1 s 1 T = V 1 T V 0 T H 0 V 0 V 1 + V 1 T ρ 0 s 0 s 0 T V 1 + ρ 1 s 1 s 1 T ⋯ H k + 1 = ( V k T V k − 1 T ⋯ V 1 T V 0 T ) H 0 ( V 0 V 1 ⋯ V k − 1 V k ) + ( V k T V k − 1 T ⋯ V 1 T ) ρ 0 s 0 s 0 T ( V 1 ⋯ V k − 1 V k ) ⋯ + V k T V k − 1 T ρ k − 2 s k − 2 s k − 2 T V k − 1 V k + V k T ρ k − 1 s k − 1 s k − 1 T V k + ρ k s k s k T (LBFGS[1]) \begin{aligned} H_1 = & V_0^TH_0V_0 + \rho_0 s_0 s_0^T\\ \quad\\ H_2 = & V_1^TH_1V_1 + \rho_1 s_1 s_1^T\\ = & V_1^T(V_0^TH_0V_0 + \rho_0 s_0 s_0^T)V_1 + \rho_1 s_1 s_1^T\\ = & V_1^TV_0^TH_0V_0V_1 + V_1^T\rho_0 s_0 s_0^TV_1 + \rho_1 s_1 s_1^T\\ \cdots\\ H_{k+1} = &(V_k^TV_{k-1}^T\cdots V_1^T V_0^T)H_0(V_0V_1\cdots V_{k-1}V_k) \\ & + (V_k^TV_{k-1}^T\cdots V_1^T)\rho_0 s_0 s_0^T(V_1\cdots V_{k-1}V_k) \\ & \cdots \\ & + V_k^TV_{k-1}^T\rho_{k-2} s_{k-2} s_{k-2}^T V_{k-1}V_k \\ & + V_k^T \rho_{k-1} s_{k-1} s_{k-1}^TV_k \\ & + \rho_k s_k s_k^T \tag{LBFGS[1]} \end{aligned} H1=H2===Hk+1=V0TH0V0+ρ0s0s0TV1TH1V1+ρ1s1s1TV1T(V0TH0V0+ρ0s0s0T)V1+ρ1s1s1TV1TV0TH0V0V1+V1Tρ0s0s0TV1+ρ1s1s1T(VkTVk1TV1TV0T)H0(V0V1Vk1Vk)+(VkTVk1TV1T)ρ0s0s0T(V1Vk1Vk)+VkTVk1Tρk2sk2sk2TVk1Vk+VkTρk1sk1sk1TVk+ρkskskT(LBFGS[1])
只保留最近的m步:
H k + 1 = ( V k T V k − 1 T ⋯ V k − m + 1 T ) H 0 ( V k − m + 1 ⋯ V k − 1 V k ) + ( V k T V k − 1 T ⋯ V k − m + 2 T ) ρ k − m + 1 s k − m + 1 s k − m + 1 T ( V k − m + 2 ⋯ V k − 1 V k ) ⋯ + V k T ρ k − 1 s k − 1 s k − 1 T V k + ρ k s k s k T (LBFGS[2]) \begin{aligned} H_{k+1} = &(V_k^TV_{k-1}^T\cdots V_{k-m+1}^T)H_0(V_{k-m+1}\cdots V_{k-1}V_k) \\ & + (V_k^TV_{k-1}^T\cdots V_{k-m+2}^T)\rho_{k-m+1} s_{k-m+1} s_{k-m+1}^T(V_{k-m+2}\cdots V_{k-1}V_k) \\ & \cdots \\ & + V_k^T \rho_{k-1} s_{k-1} s_{k-1}^TV_k \\ & + \rho_k s_k s_k^T \tag{LBFGS[2]} \end{aligned} Hk+1=(VkTVk1TVkm+1T)H0(Vkm+1Vk1Vk)+(VkTVk1TVkm+2T)ρkm+1skm+1skm+1T(Vkm+2Vk1Vk)+VkTρk1sk1sk1TVk+ρkskskT(LBFGS[2])

(LBFGS 双圈循环算法)
if k ≤ m k \leq m km
i n c r = 0 , b o u n d = k \quad incr = 0, bound = k incr=0,bound=k
else
i n c r = k − m , b o u n d = m \quad incr = k-m, bound = m incr=km,bound=m
\quad
q b o u n d = ∇ f k q_{bound} = \nabla f_k qbound=fk
for i = b o u n d − 1 , b o u n d − 2 , ⋯   , 0 i = bound-1, bound-2, \cdots, 0 i=bound1,bound2,,0
j = i + i n c r \quad j = i+incr j=i+incr
α i = ρ j s j T q i + 1 \quad \alpha_i = \rho_j s_j^Tq_{i+1} αi=ρjsjTqi+1
q i = q i + 1 − α i y j \quad q_i = q_{i+1} - \alpha_i y_j qi=qi+1αiyj
\quad
r 0 = H 0 q 0 r_0 = H_0 q_0 r0=H0q0
for i = 0 , 1 , ⋯   , b o u n d − 1 i = 0,1, \cdots, bound-1 i=0,1,,bound1
j = i + i n c r \quad j = i + incr j=i+incr
β i = ρ j y j T r i \quad \beta_i = \rho_j y_j^T r_i βi=ρjyjTri
r i + 1 = r i + s j ( α i − β i ) \quad r_{i+1} = r_i + s_j(\alpha_i - \beta_i) ri+1=ri+sj(αiβi)
\quad
H k ∇ f k = r H_k \nabla f_k =r Hkfk=r是要求的下降方向

下面证明(LBFGS 双圈循环算法)可以正确推导出上面的 H k + 1 H_{k+1} Hk+1表达式:
k ≤ m : L o o p 1 : q k = ∇ f k q k − i = q k − i + 1 − α k − i y k − i = q k − i + 1 − p k − i s k − i T q k − i + 1 y k − i = ( I − p k − i y k − i s k − i T ) q k − i + 1 = V k − i q k − i + 1 = V k − i V k − i + 1 ⋯ V k − 1 q k α k − i = ρ k − i s k − i T q k − i + 1 = ρ k − i s k − i T V k − i + 1 ⋯ V k − 1 q k T e r m i n a t i o n : q 0 = V 0 V 1 ⋯ V k − 1 ∇ f k α 0 = ρ 0 s 0 T V 1 ⋯ V k − 1 ∇ f k L o o p 2 : r 0 = H 0 q 0 = H 0 V 0 V 1 ⋯ V k − 1 ∇ f k r i + 1 = r i + s i ( α i − β i ) = r i + s i α i − ρ i s i y i T r i = ( I − ρ i s i y i T ) r i + s i α i = V i T r i + s i α i r k = V k − 1 T r k − 1 + s k − 1 α k − 1 = V k − 1 T ( V k − 2 T r k − 2 + s k − 1 α k − 1 ) + ρ k − 1 s k − 1 s k − 1 T ∇ f k = V k − 1 T V k − 2 T r k − 2 + V k − 1 T ρ k − 2 s k − 2 s k − 2 T V k − 1 ∇ f k + ρ k − 1 s k − 1 s k − 1 T ∇ f k ⋯ = V k − 1 T V k − 2 T ⋯ V 1 T r 0 + V k − 1 T V k − 2 T ⋯ V 1 T ρ 0 s 0 s 0 T V 1 ⋯ V k − 1 ∇ f k + ⋯ + V k − 1 T ρ k − 2 s k − 2 s k − 2 T V k − 1 ∇ f k + ρ k − 1 s k − 1 s k − 1 T ∇ f k = V k − 1 T V k − 2 T ⋯ V 1 T H 0 V 0 V 1 ⋯ V k − 1 ∇ f k + V k − 1 T V k − 2 T ⋯ V 1 T ρ 0 s 0 s 0 T V 1 ⋯ V k − 1 ∇ f k + ⋯ + V k − 1 T ρ k − 2 s k − 2 s k − 2 T V k − 1 ∇ f k + ρ k − 1 s k − 1 s k − 1 T ∇ f k \begin{aligned} k \leq m: \qquad \qquad \qquad \quad&\\ Loop 1: \quad q_k = & \nabla f_k\\ q_{k-i} = & q_{k-i+1} - \alpha_{k-i}y_{k-i}\\ = & q_{k-i+1} - p_{k-i}s^T_{k-i}q_{k-i+1}y_{k-i}\\ = & (I-p_{k-i}y_{k-i}s^T_{k-i})q_{k-i+1}\\ = & V_{k-i}q_{k-i+1}\\ = & V_{k-i}V_{k-i+1}\cdots V_{k-1}q_k\\ \alpha_{k-i} = & \rho_{k-i}s_{k-i}^Tq_{k-i+1} \\ = & \rho_{k-i}s_{k-i}^TV_{k-i+1}\cdots V_{k-1}q_k\\ \quad\\ Termination:\quad q_0 = & V_0V_1\cdots V_{k-1}\nabla f_k\\ \alpha_0 = & \rho_0s_0^TV_1\cdots V_{k-1}\nabla f_k\\ \quad\\ Loop 2: \quad r_0 = & H_0q_0 = H_0V_0V_1\cdots V_{k-1}\nabla f_k\\ r_{i+1} = & r_i + s_i(\alpha_i - \beta_i)& \\ = & r_i + s_i\alpha_i - \rho_is_iy_i^Tr_i\\ = & (I-\rho_is_iy_i^T) r_i + s_i\alpha_i\\ = & V_i^Tr_i + s_i\alpha_i\\ r_k = & V_{k-1}^Tr_{k-1}+ s_{k-1}\alpha_{k-1}\\ = &V_{k-1}^T(V_{k-2}^Tr_{k-2} + s_{k-1}\alpha_{k-1}) + \rho_{k-1}s_{k-1}s_{k-1}^T\nabla f_k\\ = & V_{k-1}^TV_{k-2}^Tr_{k-2} + V_{k-1}^T\rho_{k-2}s_{k-2}s_{k-2}^TV_{k-1}\nabla f_k+ \rho_{k-1}s_{k-1}s_{k-1}^T\nabla f_k\\ \cdots\\ = & V_{k-1}^TV_{k-2}^T\cdots V_1^Tr_0 + V_{k-1}^TV_{k-2}^T\cdots V_1^T\rho_0s_0s_0^TV_1\cdots V_{k-1}\nabla f_k\\ & + \cdots + V_{k-1}^T\rho_{k-2}s_{k-2}s_{k-2}^TV_{k-1}\nabla f_k + \rho_{k-1}s_{k-1}s_{k-1}^T\nabla f_k\\ = & V_{k-1}^TV_{k-2}^T\cdots V_1^TH_0V_0V_1\cdots V_{k-1}\nabla f_k\\ & + V_{k-1}^TV_{k-2}^T\cdots V_1^T\rho_0s_0s_0^TV_1\cdots V_{k-1}\nabla f_k\\ & + \cdots \\ & + V_{k-1}^T\rho_{k-2}s_{k-2}s_{k-2}^TV_{k-1}\nabla f_k\\ & + \rho_{k-1}s_{k-1}s_{k-1}^T\nabla f_k \end{aligned} km:Loop1:qk=qki=====αki==Termination:q0=α0=Loop2:r0=ri+1====rk=====fkqki+1αkiykiqki+1pkiskiTqki+1yki(IpkiykiskiT)qki+1Vkiqki+1VkiVki+1Vk1qkρkiskiTqki+1ρkiskiTVki+1Vk1qkV0V1Vk1fkρ0s0TV1Vk1fkH0q0=H0V0V1Vk1fkri+si(αiβi)ri+siαiρisiyiTri(IρisiyiT)ri+siαiViTri+siαiVk1Trk1+sk1αk1Vk1T(Vk2Trk2+sk1αk1)+ρk1sk1sk1TfkVk1TVk2Trk2+Vk1Tρk2sk2sk2TVk1fk+ρk1sk1sk1TfkVk1TVk2TV1Tr0+Vk1TVk2TV1Tρ0s0s0TV1Vk1fk++Vk1Tρk2sk2sk2TVk1fk+ρk1sk1sk1TfkVk1TVk2TV1TH0V0V1Vk1fk+Vk1TVk2TV1Tρ0s0s0TV1Vk1fk++Vk1Tρk2sk2sk2TVk1fk+ρk1sk1sk1Tfk
刚好等于式(LBFGS[1])计算出的 H k ⋅ ∇ f k H_k \cdot \nabla f_k Hkfk,也就是LBFGS的下降方向

k > m : L o o p 1 : q m = ∇ f k q m − i = q m − i + 1 − α m − i y k − i = q m − i + 1 − p k − i s k − i T q m − i + 1 y k − i = ( I − p k − i y k − i s k − i T ) q m − i + 1 = V k − i q m − i + 1 = V k − i V k − i + 1 ⋯ V k − 1 q m α m − i = ρ k − i s k − i T q m − i + 1 = ρ k − i s k − i T V k − i + 1 ⋯ V k − 1 q m T e r m i n a t i o n : q 0 = V k − m V k − m + 1 ⋯ V k − 1 ∇ f k α 0 = ρ k − m s k − m T V k − m + 1 ⋯ V k − 1 ∇ f k L o o p 2 : r 0 = H 0 q 0 = H 0 V k − m V k − m + 1 ⋯ V k − 1 ∇ f k r i + 1 = r i + s k − m + i ( α i − β i ) = r i + s k − m + i α i − ρ k − m + i s k − m + i y k − m + i T r i = ( I − ρ k − m + i s k − m + i y k − m + i T ) r i + s k − m + i α i = V k − m + i T r i + s k − m + i α i r m = V k − 1 T r m − 1 + s k − 1 α m − 1 = V k − 1 T ( V k − 2 T r m − 2 + s k − 1 α m − 1 ) + ρ k − 1 s k − 1 s k − 1 T ∇ f k = V k − 1 T V k − 2 T r m − 2 + V k − 1 T ρ k − 2 s k − 2 s k − 2 T V k − 1 ∇ f k + ρ k − 1 s k − 1 s k − 1 T ∇ f k ⋯ = V k − 1 T V k − 2 T ⋯ V k − m T r 0 + V k − 1 T V k − 2 T ⋯ V k − m + 1 T ρ k − m s k − m s k − m T V k − m + 1 ⋯ V k − 1 ∇ f k + ⋯ + V k − 1 T ρ k − 2 s k − 2 s k − 2 T V k − 1 ∇ f k + ρ k − 1 s k − 1 s k − 1 T ∇ f k = V k − 1 T V k − 2 T ⋯ V k − m T H 0 V k − m V k − m + 1 ⋯ V k − 1 ∇ f k + V k − 1 T V k − 2 T ⋯ V k − m + 1 T ρ k − m s k − m s k − m T V k − m + 1 ⋯ V k − 1 ∇ f k + ⋯ + V k − 1 T ρ k − 2 s k − 2 s k − 2 T V k − 1 ∇ f k + ρ k − 1 s k − 1 s k − 1 T ∇ f k \begin{aligned} k > m: \qquad \qquad \qquad \quad&\\ Loop 1: \quad q_m = & \nabla f_k\\ q_{m-i} = & q_{m-i+1} - \alpha_{m-i}y_{k-i}\\ = & q_{m-i+1} - p_{k-i}s^T_{k-i}q_{m-i+1}y_{k-i}\\ = & (I-p_{k-i}y_{k-i}s^T_{k-i})q_{m-i+1}\\ = & V_{k-i}q_{m-i+1}\\ = & V_{k-i}V_{k-i+1}\cdots V_{k-1}q_m\\ \alpha_{m-i} = & \rho_{k-i}s_{k-i}^Tq_{m-i+1} \\ = & \rho_{k-i}s_{k-i}^TV_{k-i+1}\cdots V_{k-1}q_m\\ \quad\\ Termination:\quad q_0 = & V_{k-m}V_{k-m+1}\cdots V_{k-1}\nabla f_k\\ \alpha_0 = & \rho_{k-m}s_{k-m}^TV_{k-m+1}\cdots V_{k-1}\nabla f_k\\ \quad\\ Loop 2: \quad r_0 = & H_0q_0 = H_0V_{k-m}V_{k-m+1}\cdots V_{k-1}\nabla f_k\\ r_{i+1} = & r_i + s_{k-m+i}(\alpha_i - \beta_i)& \\ = & r_i + s_{k-m+i}\alpha_i - \rho_{k-m+i}s_{k-m+i}y_{k-m+i}^Tr_i\\ = & (I-\rho_{k-m+i}s_{k-m+i}y_{k-m+i}^T) r_i + s_{k-m+i}\alpha_i\\ = & V_{k-m+i}^Tr_i + s_{k-m+i}\alpha_i\\ r_m = & V_{k-1}^Tr_{m-1}+ s_{k-1}\alpha_{m-1}\\ = &V_{k-1}^T(V_{k-2}^Tr_{m-2} + s_{k-1}\alpha_{m-1}) + \rho_{k-1}s_{k-1}s_{k-1}^T\nabla f_k\\ = & V_{k-1}^TV_{k-2}^Tr_{m-2} + V_{k-1}^T\rho_{k-2}s_{k-2}s_{k-2}^TV_{k-1}\nabla f_k+ \rho_{k-1}s_{k-1}s_{k-1}^T\nabla f_k\\ \cdots\\ = & V_{k-1}^TV_{k-2}^T\cdots V_{k-m}^Tr_0 + V_{k-1}^TV_{k-2}^T\cdots V_{k-m+1}^T\rho_{k-m}s_{k-m}s_{k-m}^TV_{k-m+1}\cdots V_{k-1}\nabla f_k\\ & + \cdots + V_{k-1}^T\rho_{k-2}s_{k-2}s_{k-2}^TV_{k-1}\nabla f_k + \rho_{k-1}s_{k-1}s_{k-1}^T\nabla f_k\\ = & V_{k-1}^TV_{k-2}^T\cdots V_{k-m}^TH_0V_{k-m}V_{k-m+1}\cdots V_{k-1}\nabla f_k\\ & + V_{k-1}^TV_{k-2}^T\cdots V_{k-m+1}^T\rho_{k-m}s_{k-m}s_{k-m}^TV_{k-m+1}\cdots V_{k-1}\nabla f_k\\ & + \cdots \\ & + V_{k-1}^T\rho_{k-2}s_{k-2}s_{k-2}^TV_{k-1}\nabla f_k\\ & + \rho_{k-1}s_{k-1}s_{k-1}^T\nabla f_k \end{aligned} k>m:Loop1:qm=qmi=====αmi==Termination:q0=α0=Loop2:r0=ri+1====rm=====fkqmi+1αmiykiqmi+1pkiskiTqmi+1yki(IpkiykiskiT)qmi+1Vkiqmi+1VkiVki+1Vk1qmρkiskiTqmi+1ρkiskiTVki+1Vk1qmVkmVkm+1Vk1fkρkmskmTVkm+1Vk1fkH0q0=H0VkmVkm+1Vk1fkri+skm+i(αiβi)ri+skm+iαiρkm+iskm+iykm+iTri(Iρkm+iskm+iykm+iT)ri+skm+iαiVkm+iTri+skm+iαiVk1Trm1+sk1αm1Vk1T(Vk2Trm2+sk1αm1)+ρk1sk1sk1TfkVk1TVk2Trm2+Vk1Tρk2sk2sk2TVk1fk+ρk1sk1sk1TfkVk1TVk2TVkmTr0+Vk1TVk2TVkm+1TρkmskmskmTVkm+1Vk1fk++Vk1Tρk2sk2sk2TVk1fk+ρk1sk1sk1TfkVk1TVk2TVkmTH0VkmVkm+1Vk1fk+Vk1TVk2TVkm+1TρkmskmskmTVkm+1Vk1fk++Vk1Tρk2sk2sk2TVk1fk+ρk1sk1sk1Tfk
刚好等于式(LBFGS[2])计算出的 H k ⋅ ∇ f k H_k \cdot \nabla f_k Hkfk,也就只保存m个 s i , y i s_i,y_i si,yi时计算的LBFGS下降方向,得证。

注:每次进入Loop2都需要一个 H 0 H_0 H0,实践当中被证明比较好的初始化方法为: H 0 = γ k I , γ k = s k − 1 T y k − 1 y k − 1 T y k − 1 H_0 = \gamma_k I,\gamma_k = \frac{s_{k-1}^Ty_{k-1}}{y_{k-1}^Ty_{k-1}} H0=γkIγk=yk1Tyk1sk1Tyk1


Reference:

  1. BFGS推导
  2. LBFGS算法推导
  3. Numerical Optimization, Jorge Nocedal, Stephen J. Wright, 2006
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值