高级优化理论与方法(六)

Quasi-Newton Methods

Rank-One Algorithm

核心公式

d k = − H k g k d^k=-H^kg^k dk=Hkgk
α k = a r g m i n f ( x k + α d k ) \alpha^k=argmin f(x^k+\alpha d^k) αk=argminf(xk+αdk)
x k + 1 = x k + α k d k x^{k+1}=x^k+\alpha^kd^k xk+1=xk+αkdk
H k + 1 = H k + ( Δ x k − H k Δ g k ) ( Δ x k − H k Δ g k ) T Δ g k T ( Δ x k − H k Δ g k ) H^{k+1}=H^k+\frac{(\Delta x^k-H^k\Delta g^k)(\Delta x^k-H^k\Delta g^k)^T}{\Delta{g^k}^T(\Delta x^k-H^k\Delta g^k)} Hk+1=Hk+ΔgkT(ΔxkHkΔgk)(ΔxkHkΔgk)(ΔxkHkΔgk)T

Theorem

Quadratic: H k + 1 Δ g i = Δ x i , ∀ i ≤ k H^{k+1}\Delta g^i=\Delta x^i, \forall i\leq k Hk+1Δgi=Δxi,ik

Example

f ( x ) = x 1 2 + 1 2 x 2 2 + 3 = 1 2 x T [ 2 0 0 1 ] x + 3 f(x)=x_1^2+\frac{1}{2}x_2^2+3=\frac{1}{2}x^T\begin{bmatrix} 2&0 \\ 0&1 \end{bmatrix}x+3 f(x)=x12+21x22+3=21xT[2001]x+3

g k = [ 2 0 0 1 ] x k g^k=\begin{bmatrix} 2&0 \\ 0&1 \end{bmatrix}x^k gk=[2001]xk

x 0 = [ 1 2 ] x^0=\begin{bmatrix} 1 \\ 2 \end{bmatrix} x0=[12]

H 0 = [ 1 0 0 1 ] H^0=\begin{bmatrix} 1&0 \\ 0&1 \end{bmatrix} H0=[1001]

d 0 = − H 0 g 0 = [ − 2 − 2 ] d^0=-H^0g^0=\begin{bmatrix} -2 \\ -2 \end{bmatrix} d0=H0g0=[22]

α 0 = g 0 T d 0 d 0 T Q d 0 = 2 3 \alpha^0=\frac{{g^0}^Td^0}{{d^0}^TQd^0}=\frac{2}{3} α0=d0TQd0g0Td0=32

x 1 = x 0 + α 0 d 0 = [ − 1 3 2 3 ] x^1=x^0+\alpha^0d^0=\begin{bmatrix} -\frac{1}{3} \\ \frac{2}{3} \end{bmatrix} x1=x0+α0d0=[3132]

H 1 = [ 1 2 0 0 1 ] H^1=\begin{bmatrix} \frac{1}{2}&0 \\ 0&1 \end{bmatrix} H1=[21001]

Problems

H k H^k Hk not positive definite ⇒ \Rightarrow not descent.

Δ g k T ( Δ x k − H k Δ g k ) \Delta{g^k}^T(\Delta x^k-H^k\Delta g^k) ΔgkT(ΔxkHkΔgk) too small, close to zero.

DFP-Algorithm

算法步骤

IN: x 0 , H 0 x^0, H^0 x0,H0(positive definite, symmetric)

  1. k : = 0 k:=0 k:=0
  2. If g k = 0 g^k=0 gk=0, then stop; else d k = − H k g k d^k=-H^kg^k dk=Hkgk
  3. compute α k = a r g m i n f ( x k + α d k ) , x k + 1 = x k + α k d k \alpha^k=argmin f(x^k+\alpha d^k), x^{k+1}=x^k+\alpha^kd^k αk=argminf(xk+αdk),xk+1=xk+αkdk
  4. compute Δ x k = α k d k , Δ g k = g k + 1 − g k , H k + 1 = H k + Δ x k Δ x k T Δ x k T Δ g k − ( H k Δ g k ) ( H k Δ g k ) T Δ g k T H k Δ g k \Delta x^k=\alpha^kd^k, \Delta g^k=g^{k+1}-g^k,H^{k+1}=H^k+\frac{\Delta x^k{\Delta x^k}^T}{{\Delta x^k}^T\Delta g^k}-\frac{(H^k\Delta g^k)(H^k\Delta g^k)^T}{{\Delta g^k}^TH^k\Delta g^k} Δxk=αkdk,Δgk=gk+1gk,Hk+1=Hk+ΔxkTΔgkΔxkΔxkTΔgkTHkΔgk(HkΔgk)(HkΔgk)T
  5. k k k++,goto 2

Theorem

Applying DFP to quadratic functions: H k + 1 Δ g i = Δ x i , ∀ i ≤ k H^{k+1}\Delta g^i=\Delta x^i, \forall i\leq k Hk+1Δgi=Δxi,ik

Example

f ( x ) = 1 2 x T [ 4 2 2 2 ] x − x T [ − 1 1 ] f(x)=\frac{1}{2}x^T\begin{bmatrix} 4&2 \\ 2&2 \end{bmatrix}x-x^T\begin{bmatrix} -1 \\ 1 \end{bmatrix} f(x)=21xT[4222]xxT[11]

x 0 = [ 0 0 ] x^0=\begin{bmatrix} 0 \\ 0 \end{bmatrix} x0=[00]

H 0 = [ 1 0 0 1 ] H^0=\begin{bmatrix} 1&0 \\ 0&1 \end{bmatrix} H0=[1001]

g k = [ 4 2 2 2 ] x − [ − 1 1 ] g^k=\begin{bmatrix} 4&2 \\ 2&2 \end{bmatrix}x-\begin{bmatrix} -1 \\ 1 \end{bmatrix} gk=[4222]x[11]

g 0 = [ 1 − 1 ] g^0=\begin{bmatrix} 1 \\ -1 \end{bmatrix} g0=[11]

d 0 = − H 0 g 0 = [ − 1 1 ] d^0=-H^0g^0=\begin{bmatrix} -1 \\ 1 \end{bmatrix} d0=H0g0=[11]

α 0 = 1 \alpha^0=1 α0=1

x 1 = x 0 + α 0 d 0 = [ − 1 1 ] x^1=x^0+\alpha^0d^0=\begin{bmatrix} -1 \\ 1 \end{bmatrix} x1=x0+α0d0=[11]

g 1 = [ − 1 1 ] g^1=\begin{bmatrix} -1 \\ 1 \end{bmatrix} g1=[11]

Δ x 0 = [ − 1 1 ] \Delta x^0=\begin{bmatrix} -1\\ 1 \end{bmatrix} Δx0=[11]

Δ g 0 = [ − 2 0 ] \Delta g^0=\begin{bmatrix} -2\\ 0 \end{bmatrix} Δg0=[20]

Δ x 0 Δ x 0 T = [ − 1 1 ] [ − 1 , 1 ] = [ 1 − 1 − 1 1 ] \Delta x^0{\Delta x^0}^T=\begin{bmatrix} -1\\ 1 \end{bmatrix}[-1,1]=\begin{bmatrix} 1&-1 \\ -1&1 \end{bmatrix} Δx0Δx0T=[11][1,1]=[1111]

Δ x 0 T Δ g 0 = 2 {\Delta x^0}^T\Delta g^0=2 Δx0TΔg0=2

H 0 Δ g 0 = [ − 2 0 ] H^0\Delta g^0=\begin{bmatrix} -2\\ 0 \end{bmatrix} H0Δg0=[20]

( H 0 Δ g 0 ) ( H 0 Δ g 0 ) T = [ 4 0 0 0 ] (H^0\Delta g^0)(H^0\Delta g^0)^T=\begin{bmatrix} 4&0 \\ 0&0 \end{bmatrix} (H0Δg0)(H0Δg0)T=[4000]

Δ g 0 T H 0 Δ g 0 = 4 {\Delta g^0}^T H^0 \Delta g^0=4 Δg0TH0Δg0=4

H 1 = H 0 + Δ x 0 Δ x 0 T Δ x 0 T Δ g 0 − ( H 0 Δ g 0 ) ( H 0 Δ g 0 ) T Δ g 0 T H 0 Δ g 0 = [ 1 0 0 1 ] + [ 1 − 1 − 1 1 ] 2 − [ 4 0 0 0 ] 4 = [ 1 2 − 1 2 − 1 2 3 2 ] H^1=H^0+\frac{\Delta x^0{\Delta x^0}^T}{{\Delta x^0}^T\Delta g^0}-\frac{(H^0\Delta g^0)(H^0\Delta g^0)^T}{{\Delta g^0}^TH^0\Delta g^0}=\begin{bmatrix} 1&0 \\ 0&1 \end{bmatrix}+\frac{\begin{bmatrix} 1&-1 \\ -1&1 \end{bmatrix}}{2}-\frac{\begin{bmatrix} 4&0 \\ 0&0 \end{bmatrix}}{4}=\begin{bmatrix} \frac{1}{2}&-\frac{1}{2} \\ -\frac{1}{2}&\frac{3}{2} \end{bmatrix} H1=H0+Δx0TΔg0Δx0Δx0TΔg0TH0Δg0(H0Δg0)(H0Δg0)T=[1001]+2[1111]4[4000]=[21212123]

d 1 = − H 1 g 1 = [ 0 1 ] d^1=-H^1g^1=\begin{bmatrix} 0 \\ 1 \end{bmatrix} d1=H1g1=[01]

α 1 = 1 2 \alpha^1=\frac{1}{2} α1=21

x 2 = x 1 + α 1 d 1 = [ − 1 3 2 ] = x ∗ x^2=x^1+\alpha^1 d^1=\begin{bmatrix} -1 \\ \frac{3}{2} \end{bmatrix}=x^* x2=x1+α1d1=[123]=x

Theorem

If g k ≠ 0 , H k g^k\neq 0, H^k gk=0,Hk being positive definite implies H k + 1 H^{k+1} Hk+1 being positive definite.

BFGS

前两个算法的核心是用 H k H^k Hk来表示 F ( x k ) − 1 F(x^k)^{-1} F(xk)1,即二阶导的逆。两种算法的不同之处仅仅在于用不同的公式来计算 H k H^k Hk。BFGS是用 G k + 1 G^{k+1} Gk+1来表示 F ( x k ) F(x^k) F(xk),从而推导出 H k H^k Hk。BFGS的收敛速度在某些情况下较其它两种方法快,但是在非二次函数的情形下难以证明。这里只给出 H k H^k Hk的公式,不做过多展开。
H k + 1 = H k + ( 1 + Δ g k T H k Δ g k Δ g k T Δ x k ) Δ x k Δ x k T Δ x k Δ g k − H k Δ g k Δ x k T + ( H k Δ g k Δ x k T ) T Δ g k T Δ x k H^{k+1}=H^k+(1+\frac{{\Delta g^k}^TH^k\Delta g^k}{{\Delta g^k}^T\Delta x^k})\frac{\Delta x^k{\Delta x^k}^T}{\Delta x^k\Delta g^k}-\frac{H^k\Delta g^k{\Delta x^k}^T+(H^k\Delta g^k{\Delta x^k}^T)^T}{{\Delta g^k}^T\Delta x^k} Hk+1=Hk+(1+ΔgkTΔxkΔgkTHkΔgk)ΔxkΔgkΔxkΔxkTΔgkTΔxkHkΔgkΔxkT+(HkΔgkΔxkT)T

Solving Linear Equations

A x = b , A ∈ R m × n , x ∈ R n , b ∈ R m Ax=b, A\in \mathbb{R}^{m\times n}, x\in \mathbb{R}^n, b\in \mathbb{R}^m Ax=b,ARm×n,xRn,bRm

{ a 11 x 1 + a 12 x 2 + ⋯ + a 1 n x n = b 1 ⋯ a m 1 x 1 + a m 2 x 2 + ⋯ + a m n x n = b n \begin{cases} a_{11}x_1+a_{12}x_2+\cdots+a_{1n}x_n=b_1\\ \cdots\\ a_{m1}x_1+a_{m2}x_2+\cdots+a_{mn}x_n=b_n \end{cases} a11x1+a12x2++a1nxn=b1am1x1+am2x2++amnxn=bn

n n n代表变量个数, m m m代表方程个数。默认这 m m m个方程线性无关。

Case 1

m ≥ n , r a n k A = n m\geq n, rankA=n mn,rankA=n

可以把问题看成用 m m m个数据点来拟合有 n n n个变量的函数,此时可能不存在严格的解。于是我们不要求严格地解这个方程,而是把解方程问题转化为数据拟合问题(Data fitting)。

在数据拟合问题中,我们一般采取的是Least Square fitting,即求 m i n ∣ ∣ A x − b ∣ ∣ 2 min ||Ax-b||^2 min∣∣Axb2

Lemma

r a n k A = n ⇔ r a n k A T A = n rankA=n\Leftrightarrow rank A^TA=n rankA=nrankATA=n

Theorem

The unique vector x ∗ x^* x minimizing ∣ ∣ A x − b ∣ ∣ 2 ||Ax-b||^2 ∣∣Axb2 is given by x ∗ = ( A T A ) − 1 A T b x^*=(A^TA)^{-1}A^Tb x=(ATA)1ATb

Proposition

A = [ a 1 , a 2 , ⋯   , a n ] A=[a_1,a_2,\cdots,a_n] A=[a1,a2,,an]

s p a n [ a 1 , a 2 , ⋯   , a n ] span[a_1,a_2,\cdots,a_n] span[a1,a2,,an]表示由这些向量张成的空间。

b b b在该空间上的投影为 v v v,则 v = A x ∗ = A ( A T A ) − 1 A T b v=Ax^*=A(A^TA)^{-1}A^Tb v=Ax=A(ATA)1ATb

注:该命题是通过几何方法给出了 x ∗ = ( A T A ) − 1 A T b x^*=(A^TA)^{-1}A^Tb x=(ATA)1ATb的一个证明。

Example

i i i123
t i t_i ti234
y i y_i yi3415

y = m t + c y=mt+c y=mt+c

A = [ 2 1 3 1 4 1 ] A=\begin{bmatrix} 2&1 \\ 3&1\\ 4&1 \end{bmatrix} A= 234111

b = [ 3 4 15 ] b=\begin{bmatrix} 3 \\ 4\\ 15 \end{bmatrix} b= 3415

x ∗ = [ 6 − 32 3 ] x^*=\begin{bmatrix} 6 \\ -\frac{32}{3} \end{bmatrix} x=[6332]

注:此时, x = [ m c ] x=\begin{bmatrix} m \\ c \end{bmatrix} x=[mc]

Recursive Least-Square Algorithm

根据原始数据,已经计算出了 x ∗ x^* x的值,那么此时如果给出一些新的数据,如何在借助已知值的基础上快速算出新的 x ∗ x^* x,就是这个算法要解决的问题。
m i n ∣ ∣ A 0 x − b 0 ∣ ∣ 2 ⇐ x 0 = G 0 − 1 A 0 T b 0 min||A_0x-b_0||^2\Leftarrow x_0=G_0^{-1}A_0^Tb_0 min∣∣A0xb02x0=G01A0Tb0
G 0 = A 0 − 1 A 0 G_0=A_0^{-1}A_0 G0=A01A0

m i n ∣ ∣ [ A 0 A 1 ] x − [ b 0 b 1 ] ∣ ∣ 2 min\left|\left|\begin{bmatrix} A_0 \\ A_1 \end{bmatrix}x-\begin{bmatrix} b_0 \\ b_1 \end{bmatrix}\right|\right| ^2 min [A0A1]x[b0b1] 2

x 1 = G 1 − 1 [ A 0 A 1 ] T [ b 0 b 1 ] x_1=G_1^{-1}\begin{bmatrix} A_0 \\ A_1 \end{bmatrix}^T\begin{bmatrix} b_0 \\ b_1 \end{bmatrix} x1=G11[A0A1]T[b0b1]

G 1 = [ A 0 A 1 ] T [ A 0 A 1 ] = A 0 T A 0 + A 1 T A 1 G_1=\begin{bmatrix} A_0 \\ A_1 \end{bmatrix}^T\begin{bmatrix} A_0 \\ A_1 \end{bmatrix}=A_0^TA_0+A_1^TA_1 G1=[A0A1]T[A0A1]=A0TA0+A1TA1

Goal: express x 1 x_1 x1 as a function of x 0 , A 1 , b 1 , G 0 x_0,A_1,b_1,G_0 x0,A1,b1,G0

x 1 = x 0 + G 1 − 1 A 1 T ( b 1 − A 1 x 0 ) x_1=x_0+G_1^{-1}A_1^T(b_1-A_1x_0) x1=x0+G11A1T(b1A1x0)

k → k + 1 : { G k + 1 = G k + A k + 1 T A k + 1 x k + 1 = x k + G k + 1 − 1 A k + 1 T ( b k + 1 − A k + 1 x k ) k\to k+1: \begin{cases} G_{k+1}=G_k+A_{k+1}^TA_{k+1}\\ x_{k+1}=x_k+G_{k+1}^{-1}A_{k+1}^T(b_{k+1}-A_{k+1}x_k) \end{cases} kk+1:{Gk+1=Gk+Ak+1TAk+1xk+1=xk+Gk+11Ak+1T(bk+1Ak+1xk)

Let G k + 1 − 1 = P k + 1 , G k − 1 = P k G_{k+1}^{-1}=P_{k+1}, G_k^{-1}=P_k Gk+11=Pk+1,Gk1=Pk

P k + 1 = P k − P k A k + 1 T ( I + A k + 1 P k A k + 1 T ) − 1 A k + 1 P k P_{k+1}=P_k-P_kA_{k+1}^T(I+A_{k+1}P_kA_{k+1}^T)^{-1}A_{k+1}P_k Pk+1=PkPkAk+1T(I+Ak+1PkAk+1T)1Ak+1Pk

x k + 1 = x k + P k A k + 1 T ( b k + 1 − A k + 1 x k ) x_{k+1}=x_k+P_kA_{k+1}^T(b_{k+1}-A_{k+1}x_k) xk+1=xk+PkAk+1T(bk+1Ak+1xk)

Lemma

X X X invertible matrix, U U U and V V V satisfy I + V X − 1 U I+VX^{-1}U I+VX1U being invertible, then ( X + U V ) − 1 = X − 1 − ( X − 1 U ) ( I + V X − 1 U ) − 1 ( V X − 1 ) (X+UV)^{-1}=X^{-1}-(X^{-1}U)(I+VX^{-1}U)^{-1}(VX^{-1}) (X+UV)1=X1(X1U)(I+VX1U)1(VX1)

总结

本节课先就上节课没讲完的拟牛顿法接着讲,回顾了秩为1的修正方法。由于该方法存在一些问题,为了解决这些问题,又介绍了DFP算法。接着又简要介绍了收敛速度较快的BFGS算法。接着开始了一个新的主题——解线性方程组。由于问题的复杂性,我们讲问题分成两种情况来分类讨论,即 m ≥ n m\geq n mn m < n m<n m<n两种情况。目前只讲了第一种情况。第一种情况本质上是一个数据拟合问题,根据数据,求得 m i n ∣ ∣ A x − b ∣ ∣ 2 min ||Ax-b||^2 min∣∣Axb2。然后,介绍了直接的求解公式。最后,对于如何在充分利用已知数据的基础上,对新数据进行拟合,又提出了递归最小二乘法。

  • 30
    点赞
  • 12
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值