L-BFGS
BFGS
Let $\mathbf{x} = (x_1, x_2, \cdots, x_n)$, and let $f(\mathbf{x}) = f(x_1, x_2, \cdots, x_n)$ be a scalar function $\mathbb{R}^n \to \mathbb{R}$.
Construct a quadratic model of the objective function at the iterate $x_k$:
$$m_k(p) = f_k + \nabla f_k^{\mathsf{T}} p + \frac{1}{2} p^{\mathsf{T}} B_k p$$
where $f_k$ is the value of the function at $x_k$ and $\nabla f_k$ is its gradient at $x_k$.
$B_k$ is an $n \times n$ symmetric positive definite matrix that is updated at every iteration.
The gradient of this model is then:
$$
\begin{aligned}
\nabla m_k(p) &= \dfrac{\partial f_k}{\partial p} + \dfrac{\partial (\nabla f_k^{\mathsf{T}} p)}{\partial p} + \dfrac{1}{2} \dfrac{\partial (p^{\mathsf{T}} B_k p)}{\partial p} \\[6pt]
&= \left[ \dfrac{\partial (\nabla f_k^{\mathsf{T}} p)}{\partial p_1}, \cdots, \dfrac{\partial (\nabla f_k^{\mathsf{T}} p)}{\partial p_n} \right]^{\mathsf{T}} + \dfrac{1}{2} \left[ \dfrac{\partial}{\partial p_1} \sum_{i,j=1}^n b_{ij} p_i p_j, \cdots, \dfrac{\partial}{\partial p_n} \sum_{i,j=1}^n b_{ij} p_i p_j \right]^{\mathsf{T}} \\[6pt]
&= \left[ \dfrac{\partial f_k}{\partial x_1}, \cdots, \dfrac{\partial f_k}{\partial x_n} \right]^{\mathsf{T}} + \dfrac{1}{2} \begin{bmatrix} b_{11} p_1 + b_{12} p_2 + \cdots + b_{1n} p_n \\ \vdots \\ b_{n1} p_1 + b_{n2} p_2 + \cdots + b_{nn} p_n \end{bmatrix} + \dfrac{1}{2} \begin{bmatrix} b_{11} p_1 + b_{21} p_2 + \cdots + b_{n1} p_n \\ \vdots \\ b_{1n} p_1 + b_{2n} p_2 + \cdots + b_{nn} p_n \end{bmatrix} \\[6pt]
&= \nabla f_k + \dfrac{1}{2} B_k p + \dfrac{1}{2} B_k^{\mathsf{T}} p \\[6pt]
&= \nabla f_k + B_k p
\end{aligned}
$$
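The closed form $\nabla m_k(p) = \nabla f_k + B_k p$ can be sanity-checked numerically. The following is a minimal sketch in plain Python, not part of the original derivation: the values of `f_k`, `g` (standing in for $\nabla f_k$), `B` and `p` are made-up illustrative data, and the helper names are hypothetical. It compares the closed form against central finite differences of the model.

```python
def quad_model(f_k, g, B, p):
    """Evaluate m(p) = f_k + g^T p + 0.5 * p^T B p on plain Python lists."""
    n = len(p)
    linear = sum(g[i] * p[i] for i in range(n))
    quad = sum(B[i][j] * p[i] * p[j] for i in range(n) for j in range(n))
    return f_k + linear + 0.5 * quad


def model_grad(g, B, p):
    """Closed form from the derivation: grad m(p) = g + B p (B symmetric)."""
    n = len(p)
    return [g[i] + sum(B[i][j] * p[j] for j in range(n)) for i in range(n)]


def fd_grad(f_k, g, B, p, h=1e-6):
    """Central finite-difference approximation of the gradient of quad_model."""
    grad = []
    for i in range(len(p)):
        p_plus = list(p)
        p_minus = list(p)
        p_plus[i] += h
        p_minus[i] -= h
        diff = quad_model(f_k, g, B, p_plus) - quad_model(f_k, g, B, p_minus)
        grad.append(diff / (2 * h))
    return grad


# Illustrative data (assumed, not from the text): a symmetric positive definite B.
f_k = 1.5
g = [1.0, -2.0]
B = [[4.0, 1.0], [1.0, 3.0]]
p = [0.3, -0.7]

analytic = model_grad(g, B, p)
numeric = fd_grad(f_k, g, B, p)
max_err = max(abs(a - b) for a, b in zip(analytic, numeric))
print("analytic:", analytic)
print("numeric: ", numeric)
print("max abs difference:", max_err)
```

The two gradients should agree up to floating-point rounding, since central differences are exact for a quadratic apart from round-off. If `B` were not symmetric, the closed form would instead be $\nabla f_k + \frac{1}{2}(B_k + B_k^{\mathsf{T}}) p$, which is the second-to-last line of the derivation.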
Setting $\nabla m_k(p) = 0$ gives the minimizer $p_k = -B_k^{-1} \nabla f_k$, which serves as the search direction. The next iterate $x_{k+1}$ is therefore:
$$x_{k+1} = x_k + \alpha_k p_k$$
where the step length $\alpha_k$ must be chosen to satisfy the Wolfe conditions.
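To make the step concrete, here is a hedged sketch of one iteration: it computes $p_k = -B_k^{-1} \nabla f_k$ and then searches for a step length satisfying the (weak) Wolfe conditions, $f(x_k + \alpha p_k) \le f_k + c_1 \alpha \nabla f_k^{\mathsf{T}} p_k$ and $\nabla f(x_k + \alpha p_k)^{\mathsf{T}} p_k \ge c_2 \nabla f_k^{\mathsf{T}} p_k$ with $0 < c_1 < c_2 < 1$. The text does not say how $\alpha_k$ is found, so the bisection-style search below is an assumption, as are the test objective, the choice of $B_k$, the constants `c1` and `c2`, and all function names.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve_2x2(B, rhs):
    """Solve B z = rhs for a 2x2 matrix B using Cramer's rule."""
    det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
    return [
        (rhs[0] * B[1][1] - B[0][1] * rhs[1]) / det,
        (B[0][0] * rhs[1] - B[1][0] * rhs[0]) / det,
    ]


def wolfe_step(f, grad, x, p, c1=1e-4, c2=0.9, max_iter=50):
    """Bisection-style search for a step satisfying the weak Wolfe conditions."""
    f0 = f(x)
    slope0 = dot(grad(x), p)  # directional derivative along p at x
    if slope0 >= 0:
        raise ValueError("p is not a descent direction")
    lo, hi, alpha = 0.0, math.inf, 1.0
    for _ in range(max_iter):
        x_new = [xi + alpha * pi for xi, pi in zip(x, p)]
        if f(x_new) > f0 + c1 * alpha * slope0:
            # Sufficient decrease fails: the step is too long, shrink it.
            hi = alpha
            alpha = 0.5 * (lo + hi)
        elif dot(grad(x_new), p) < c2 * slope0:
            # Curvature condition fails: the step is too short, grow it.
            lo = alpha
            alpha = 2.0 * lo if math.isinf(hi) else 0.5 * (lo + hi)
        else:
            return alpha
    raise RuntimeError("no Wolfe step found within max_iter iterations")


# Hypothetical test objective f(x) = x1^4 + x2^2 and its gradient.
def f(x):
    return x[0] ** 4 + x[1] ** 2


def grad(x):
    return [4 * x[0] ** 3, 2 * x[1]]


x_k = [1.5, 1.0]
B_k = [[2.0, 0.0], [0.0, 2.0]]  # assumed symmetric positive definite B_k
g_k = grad(x_k)

p_k = solve_2x2(B_k, [-gi for gi in g_k])  # p_k = -B_k^{-1} grad f_k
alpha_k = wolfe_step(f, grad, x_k, p_k)
x_next = [xi + alpha_k * pi for xi, pi in zip(x_k, p_k)]
print("p_k =", p_k, "alpha_k =", alpha_k, "x_next =", x_next)
print("f(x_k) =", f(x_k), "f(x_next) =", f(x_next))
```

Because $B_k$ is positive definite, $\nabla f_k^{\mathsf{T}} p_k = -\nabla f_k^{\mathsf{T}} B_k^{-1} \nabla f_k < 0$ whenever $\nabla f_k \ne 0$, so $p_k$ is a descent direction and the `slope0 >= 0` guard should not trigger. A practical implementation would solve the linear system with a factorization, or maintain an approximation of $B_k^{-1}$ directly, rather than use Cramer's rule.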
Once the new iterate $x_{k+1}$ has been obtained, we want to construct a new quadratic model:
$$m_{k+1}(p) = f_{k+1} + \nabla f_{k+1}^{\mathsf{T}} p + \frac{1}{2} p^{\mathsf{T}} B_{k+1} p$$
The gradient of $m_{k+1}$ should then match the gradient of the objective function $f$ at the two most recent iterates, $x_k$ and $x_{k+1}$.
That is, it must satisfy:
∇ m k