Newton's method and quasi-Newton methods are common approaches to unconstrained optimization, prized for their fast convergence. Newton's method is an iterative algorithm in which each step requires the inverse of the objective's Hessian matrix, which is expensive to compute. Quasi-Newton methods simplify this by approximating the inverse Hessian with a positive definite matrix.
Newton's Method
For the unconstrained optimization problem
$$\min_{x\in R^n} f(x)$$
let $x^*$ denote a minimum point of the objective.
Assume $f(x)$ has continuous second-order partial derivatives, and let $x^{(k)}$ be the iterate at step $k$. Expanding $f(x)$ to second order around $x^{(k)}$:
$$\begin{aligned} f(x) &= f(x^{(k)}) + f'(x^{(k)})\Delta x + \frac{1}{2} f''(x^{(k)})\Delta x^2\\ &= f(x^{(k)}) + f'(x^{(k)})(x - x^{(k)}) + \frac{1}{2} f''(x^{(k)})(x - x^{(k)})^2 \end{aligned}$$
When $x$ is high-dimensional, this becomes
$$f(x) = f(x^{(k)}) + g_k^T(x-x^{(k)}) + \frac{1}{2}(x-x^{(k)})^T H(x^{(k)})(x-x^{(k)})$$
where $g_k = g(x^{(k)}) = \nabla f(x^{(k)})$ is the gradient of $f(x)$ evaluated at $x^{(k)}$, and $H(x^{(k)})$ is the Hessian matrix of $f(x)$ evaluated at $x^{(k)}$.
The Hessian is the matrix of second-order partial derivatives:
$$H(x) = \Big[\frac{\partial^2 f}{\partial x_i\partial x_j}\Big]_{n\times n}$$
A necessary condition for $f(x)$ to have an extremum is that its first derivative vanishes at the extremum point, i.e., the gradient vector is zero. When $H(x^{(k)})$ is positive definite, the extremum of $f(x)$ is a minimum.
Suppose $\nabla f(x^{(k+1)}) = 0$. Differentiating the second-order expansion gives
$$\nabla f(x) = f'(x^{(k)}) + f''(x^{(k)})(x - x^{(k)})$$
or, in vector form,
$$\nabla f(x) = g_k + H_k(x-x^{(k)})\tag{1.1}$$
Setting the gradient at $x^{(k+1)}$ to zero:
$$g_k + H_k(x^{(k+1)}-x^{(k)}) = 0$$
$$x^{(k+1)} = x^{(k)} - H_k^{-1}g_k$$
Equivalently, define $p_k$ by
$$H_kp_k = -g_k$$
so that
$$x^{(k+1)} = x^{(k)} + p_k$$
Algorithm
Input: objective function $f(x)$, gradient $g(x) = \nabla f(x)$, Hessian $H(x)$, tolerance $\epsilon$.
Output: a minimum point $x^*$ of $f(x)$.
1. Choose an initial point $x^{(0)}$ and set $k=0$.
2. Compute $g_k$.
3. If $||g_k||<\epsilon$, stop and return the approximate solution $x^* = x^{(k)}$.
4. Compute $H_k$ and solve $H_kp_k = -g_k$ for $p_k$.
5. Set $x^{(k+1)} = x^{(k)} + p_k$.
6. Set $k = k+1$ and go to step 2.
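The steps above can be sketched in Python. This is a minimal illustration, assuming numpy; the quadratic test function and the names `newton`, `grad`, `hess` are my own choices, not from the source:

```python
import numpy as np

def newton(f, grad, hess, x0, eps=1e-8, max_iter=100):
    """Newton's method for unconstrained minimization."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < eps:       # step 3: stop when ||g_k|| < eps
            break
        H = hess(x)
        p = np.linalg.solve(H, -g)        # step 4: solve H_k p_k = -g_k
        x = x + p                         # step 5: x^{(k+1)} = x^{(k)} + p_k
    return x

# Illustrative quadratic: f(x) = x1^2 + 2 x2^2 - 4 x1 - 4 x2, minimized at (2, 1).
f = lambda x: x[0]**2 + 2*x[1]**2 - 4*x[0] - 4*x[1]
grad = lambda x: np.array([2*x[0] - 4, 4*x[1] - 4])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 4.0]])
x_star = newton(f, grad, hess, [0.0, 0.0])   # converges in one step for a quadratic
```

Solving the linear system $H_kp_k = -g_k$ avoids forming $H_k^{-1}$ explicitly, which is both cheaper and numerically safer than an explicit inverse.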
Quasi-Newton Methods
Basic idea:
The inverse of the Hessian is hard to compute, so we approximate $H_k^{-1}$ with an $n\times n$ matrix $G_k = G(x^{(k)})$.
Substituting $x = x^{(k+1)}$ into $(1.1)$ yields:
$$\nabla f(x^{(k+1)}) = g_k + H_k(x^{(k+1)}-x^{(k)})$$
$$g_{k+1} - g_k = H_k(x^{(k+1)} - x^{(k)})$$
Writing $y_k = g_{k+1} - g_k$ and $\delta_k = x^{(k+1)} - x^{(k)}$, we have
$$y_k = H_k\delta_k \tag{2.1}$$
$$H_k^{-1}y_k = \delta_k$$
Equation $(2.1)$ is called the quasi-Newton condition.
If $H_k$ is positive definite, the Newton search direction $p_k$ is guaranteed to be a descent direction:
since the search direction is $p_k = -H_k^{-1}g_k$, the iterate is
$$x = x^{(k)} + \lambda p_k = x^{(k)} - \lambda H_k^{-1}g_k$$
The Taylor expansion of $f(x)$ around $x^{(k)}$ can therefore be approximated as:
$$f(x) = f(x^{(k)}) - \lambda g_k^TH_k^{-1}g_k$$
Since $H_k^{-1}$ is positive definite, $g_k^TH_k^{-1}g_k>0$. Hence for a sufficiently small positive $\lambda$ we always have $f(x)<f(x^{(k)})$; that is, $p_k$ is a descent direction.
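This descent property is easy to check numerically. A small sketch with illustrative values (the matrix and vectors below are my own choices, not from the source):

```python
import numpy as np

# A positive definite Hessian H_k and an arbitrary nonzero gradient g_k.
H = np.array([[3.0, 1.0], [1.0, 2.0]])
g = np.array([1.0, -2.0])

# Positive definiteness of H (hence of H^{-1}) makes g^T H^{-1} g > 0.
quad = g @ np.linalg.solve(H, g)
assert quad > 0

# For the model f(x) = 1/2 x^T H x + g^T x (so that grad f(0) = g), a small
# step along p_k = -H^{-1} g from x^{(k)} = 0 strictly decreases f.
f = lambda x: 0.5 * x @ H @ x + g @ x
p = -np.linalg.solve(H, g)
lam = 1e-3
assert f(lam * p) < f(np.zeros(2))
```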
Quasi-Newton methods take $G_k$ as an approximation to $H_k^{-1}$. The matrix $G_k$ must satisfy the same requirements: it must be positive definite at every iteration, and it must satisfy the quasi-Newton condition:
$$G_{k+1}y_k = \delta_k$$
An algorithm that, following the quasi-Newton condition, chooses $G_k$ as an approximation to $H_k^{-1}$, or $B_k$ as an approximation to $H_k$, is called a quasi-Newton method.
Under the quasi-Newton condition, the matrix can be updated at every iteration:
$$G_{k+1} = G_k +\Delta G_k$$
There are several concrete realizations.
The DFP (Davidon-Fletcher-Powell) Algorithm
DFP chooses $G_{k+1}$ as follows: assume each iterate $G_{k+1}$ is formed from $G_k$ plus two correction terms, which we call $P_k$ and $Q_k$. Then
$$G_{k+1} = G_k + P_k + Q_k$$
$$G_{k+1}y_k = G_ky_k + P_ky_k + Q_ky_k$$
For $G_{k+1}$ to satisfy the quasi-Newton condition, it suffices that
$$P_ky_k = \delta_k,\qquad Q_ky_k = -G_ky_k$$
which is achieved by taking
$$P_k = \frac{\delta_k\delta_k^T}{\delta_k^Ty_k},\qquad Q_k = -\frac{G_ky_ky_k^TG_k}{y_k^TG_ky_k}$$
This gives the DFP update rule for $G_{k+1}$:
$$G_{k+1} = G_k - \frac{G_ky_ky_k^TG_k}{y_k^TG_ky_k} + \frac{\delta_k\delta_k^T}{\delta_k^Ty_k}\tag{2.2}$$
Algorithm:
Input: $f(x)$, gradient $g(x) = \nabla f(x)$, tolerance $\epsilon$.
Output: a minimum point $x^*$ of $f(x)$.
1. Choose an initial point $x^{(0)}$, take $G_0$ to be a symmetric positive definite matrix, and set $k=0$.
2. Compute $g_k$. If $||g_k||<\epsilon$, stop and return the approximate solution $x^* = x^{(k)}$.
3. Set $p_k = -G_kg_k$.
4. One-dimensional search: find $\lambda_k$ such that $f(x^{(k)} + \lambda_kp_k) = \min_{\lambda\geq 0}f(x^{(k)}+\lambda p_k)$.
5. Set $x^{(k+1)} = x^{(k)} + \lambda_k p_k$.
6. Compute $g_{k+1}$. If $||g_{k+1}||<\epsilon$, stop and return the approximate solution $x^* = x^{(k+1)}$; otherwise compute $G_{k+1}$ by $(2.2)$.
7. Set $k = k+1$ and go to step 3.
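A minimal sketch of the DFP iteration in Python, assuming numpy. The bisection-based line search and the quadratic test function are my own illustrative choices, not prescribed by the source:

```python
import numpy as np

def exact_line_search(grad, x, p, hi=1.0, iters=100):
    """Approximate step 4: find lambda >= 0 minimizing f(x + lambda*p)
    by bisecting on the directional derivative grad(x + lambda*p) @ p."""
    lo = 0.0
    while grad(x + hi * p) @ p < 0:     # grow the bracket until the slope flips
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if grad(x + mid * p) @ p < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def dfp(f, grad, x0, eps=1e-6, max_iter=100):
    x = np.asarray(x0, dtype=float)
    G = np.eye(x.size)                       # G_0: symmetric positive definite
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < eps:
            break
        p = -G @ g                           # step 3: p_k = -G_k g_k
        lam = exact_line_search(grad, x, p)  # step 4
        delta = lam * p                      # delta_k = x^{(k+1)} - x^{(k)}
        x = x + delta                        # step 5
        g_new = grad(x)
        y = g_new - g                        # y_k = g_{k+1} - g_k
        # step 6: DFP update, equation (2.2)
        G = G - np.outer(G @ y, G @ y) / (y @ G @ y) \
              + np.outer(delta, delta) / (delta @ y)
        g = g_new
    return x

# Illustrative quadratic, minimized at (2, 1).
f = lambda x: x[0]**2 + 2*x[1]**2 - 4*x[0] - 4*x[1]
grad = lambda x: np.array([2*x[0] - 4, 4*x[1] - 4])
x_star = dfp(f, grad, [0.0, 0.0])
```

With a near-exact line search, DFP minimizes an $n$-dimensional quadratic in $n$ iterations; note that `np.outer(G @ y, G @ y)` equals $G_ky_ky_k^TG_k$ because $G_k$ stays symmetric.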
The BFGS (Broyden-Fletcher-Goldfarb-Shanno) Algorithm
BFGS is the most popular quasi-Newton algorithm.
Basic idea:
Approximate the Hessian $H$ with a matrix $B_k$. The quasi-Newton condition then becomes:
$$B_{k+1}\delta_k = y_k$$
Let
$$B_{k+1} = B_k + P_k + Q_k$$
$$B_{k+1}\delta_k = B_k\delta_k + P_k\delta_k + Q_k\delta_k$$
and require $P_k$ and $Q_k$ to satisfy
$$P_k\delta_k = y_k,\qquad Q_k\delta_k = -B_k\delta_k$$
which is achieved by taking
$$P_k = \frac{y_ky_k^T}{y_k^T\delta_k},\qquad Q_k = -\frac{B_k\delta_k\delta_k^TB_k}{\delta_k^TB_k\delta_k}$$
This yields the BFGS update rule for $B_{k+1}$:
$$B_{k+1} = B_k - \frac{B_k\delta_k\delta_k^TB_k}{\delta_k^TB_k\delta_k} + \frac{y_ky_k^T}{y_k^T\delta_k}\tag{2.3}$$
It can be shown that if the initial matrix $B_0$ is positive definite, then every matrix $B_k$ produced by the iteration is also positive definite.
Algorithm:
Input: $f(x)$, gradient $g(x) = \nabla f(x)$, tolerance $\epsilon$.
Output: a minimum point $x^*$ of $f(x)$.
1. Choose an initial point $x^{(0)}$, take $B_0$ to be a symmetric positive definite matrix, and set $k=0$.
2. Compute $g_k$. If $||g_k||<\epsilon$, stop and return the approximate solution $x^* = x^{(k)}$.
3. Solve $B_kp_k = -g_k$ for $p_k$.
4. One-dimensional search: find $\lambda_k$ such that $f(x^{(k)} + \lambda_kp_k) = \min_{\lambda\geq 0}f(x^{(k)}+\lambda p_k)$.
5. Set $x^{(k+1)} = x^{(k)} + \lambda_k p_k$.
6. Compute $g_{k+1}$. If $||g_{k+1}||<\epsilon$, stop and return the approximate solution $x^* = x^{(k+1)}$; otherwise compute $B_{k+1}$ by $(2.3)$.
7. Set $k = k+1$ and go to step 3.
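The BFGS loop differs from DFP mainly in which matrix is updated and in solving $B_kp_k = -g_k$ for the direction. A minimal sketch, assuming numpy; the bisection line search and test function are illustrative assumptions of mine:

```python
import numpy as np

def exact_line_search(grad, x, p, hi=1.0, iters=100):
    """Bisection on the directional derivative, approximating step 4."""
    lo = 0.0
    while grad(x + hi * p) @ p < 0:     # grow the bracket until the slope flips
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if grad(x + mid * p) @ p < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def bfgs(f, grad, x0, eps=1e-6, max_iter=100):
    x = np.asarray(x0, dtype=float)
    B = np.eye(x.size)                       # B_0: symmetric positive definite
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < eps:
            break
        p = np.linalg.solve(B, -g)           # step 3: solve B_k p_k = -g_k
        lam = exact_line_search(grad, x, p)  # step 4
        delta = lam * p                      # delta_k
        x = x + delta                        # step 5
        g_new = grad(x)
        y = g_new - g                        # y_k
        # step 6: BFGS update, equation (2.3)
        B = B - np.outer(B @ delta, B @ delta) / (delta @ B @ delta) \
              + np.outer(y, y) / (y @ delta)
        g = g_new
    return x

# Illustrative quadratic, minimized at (2, 1).
f = lambda x: x[0]**2 + 2*x[1]**2 - 4*x[0] - 4*x[1]
grad = lambda x: np.array([2*x[0] - 4, 4*x[1] - 4])
x_star = bfgs(f, grad, [0.0, 0.0])
```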
Broyden's Class of Algorithms
From the BFGS matrix $B_k$ we can derive an update formula for $G_k$. Writing $G_k = B_k^{-1}$ and $G_{k+1} = B_{k+1}^{-1}$, applying the Sherman-Morrison formula twice to $(2.3)$ gives
$$G_{k+1} = \Big(I - \frac{\delta_ky_k^T}{\delta^T_ky_k}\Big)G_k\Big(I - \frac{\delta_ky_k^T}{\delta^T_ky_k}\Big)^T + \frac{\delta_k\delta_k^T}{\delta^T_ky_k}$$
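The equivalence between the $B$-update $(2.3)$ and this inverse update can be verified numerically. A quick sketch with randomly generated data (all values below are illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
A = rng.normal(size=(n, n))
B = A @ A.T + n * np.eye(n)          # a symmetric positive definite B_k
G = np.linalg.inv(B)                 # G_k = B_k^{-1}
delta = rng.normal(size=n)
y = rng.normal(size=n)
if delta @ y < 0:                    # keep the curvature condition delta^T y > 0
    y = -y

rho = delta @ y
# B_{k+1} from the BFGS update (2.3)
B_next = B - np.outer(B @ delta, B @ delta) / (delta @ B @ delta) \
           + np.outer(y, y) / rho
# G_{k+1} from applying the Sherman-Morrison formula twice
V = np.eye(n) - np.outer(delta, y) / rho
G_next = V @ G @ V.T + np.outer(delta, delta) / rho

# The two updates agree: G_{k+1} = B_{k+1}^{-1}
assert np.allclose(np.linalg.inv(B_next), G_next)
```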
References
Li Hang, Statistical Learning Methods (统计学习方法).