Newton's Method for Optimization

I. Classification

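Depending on whether the step size $t$ is fixed at 1, Newton's method comes in two variants: pure ($t=1$) and damped ($t$ not necessarily equal to 1).
This article also introduces the inexact Newton method.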

1) Objective of the optimization problem:

$$\min f(x)$$

2) Estimate of $f(x_k+p_k)$ (second-order Taylor expansion, i.e. a quadratic approximation):

$$f(x_k+p_k) \approx f(x_k)+ \nabla f(x_k)^Tp_k+\frac{1}{2} p_k^T \nabla ^2f(x_k)p_k$$

3) Differentiate the Taylor estimate with respect to $p_k$ and set the derivative to zero. When the Hessian is positive definite, the resulting $p_k$ minimizes the estimate:

$$\nabla f(x_k) + \nabla ^2 f(x_k)p_k=0, \quad \nabla ^2f(x_k) \succ 0 \;\Rightarrow\; p_k=-\nabla ^2f(x_k)^{-1}\nabla f(x_k)$$
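As a minimal sketch of step 3, the Newton direction can be computed by solving the linear system $\nabla^2 f(x_k)p_k=-\nabla f(x_k)$ instead of forming the inverse explicitly. The quadratic test problem and the helper name `newton_direction` are assumptions made for illustration; they are not part of the original article.

```python
import numpy as np

def newton_direction(grad, hess):
    """Solve hess @ p = -grad for the Newton direction p."""
    # Solving the system is cheaper and more stable than inverting hess.
    return np.linalg.solve(hess, -grad)

# Hypothetical test problem: f(x) = 0.5 * x^T A x - b^T x with A positive definite.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_k = np.array([2.0, -1.0])

grad = A @ x_k - b   # gradient of f at x_k
hess = A             # Hessian of f (constant for a quadratic)
p_k = newton_direction(grad, hess)

# For a quadratic objective the Taylor estimate is exact,
# so one full step x_k + p_k lands on the minimizer A^{-1} b.
print(x_k + p_k)
```

Because the quadratic approximation is exact for a quadratic objective, a single step reaches the minimizer here; for general $f$ the step is only a local estimate.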

II. Pure Newton method
1) Update rule: $x_{k+1}=x_k+t_kp_k$

where $t_k=1$ and $p_k=-\nabla ^2f(x_k)^{-1}\nabla f(x_k)$, so $x_{k+1}=x_k-\nabla ^2f(x_k)^{-1}\nabla f(x_k)$

2) Algorithm (the loop runs until one of the stopping conditions holds):

$$\begin{align*}
&\text{set initial value } x_0,\ k=0\\
&\text{repeat}\\
&\quad \quad \text{compute } \nabla ^2f(x_k)\\
&\quad \quad \text{compute } \nabla f(x_k)\\
&\quad \quad x_{k+1}=x_k-\nabla ^2f(x_k)^{-1}\nabla f(x_k) \\
&\quad \quad k=k+1\\
&\text{until} \quad |f(x_{k-1})-f(x_{k})| < \epsilon \\
&\quad \quad \text{or} \quad \|x_{k-1}-x_{k}\| < \epsilon \\
&\quad \quad \text{or} \quad k \ge \text{max\_iteration\_times}
\end{align*}$$
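The loop above can be sketched in Python as follows. The function name `pure_newton`, the tolerance defaults, and the smooth convex test objective are assumptions chosen for illustration, not taken from the original article.

```python
import numpy as np

def pure_newton(f, grad, hess, x0, eps=1e-10, max_iter=50):
    """Pure Newton iteration (t_k = 1); returns (x, iterations used)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        # Newton direction: solve hess(x) p = -grad(x).
        p = np.linalg.solve(hess(x), -grad(x))
        x_new = x + p
        # Stop when the objective or the iterate barely changes.
        if abs(f(x) - f(x_new)) < eps or np.linalg.norm(x - x_new) < eps:
            return x_new, k + 1
        x = x_new
    return x, max_iter

# Hypothetical test objective: f(x) = exp(x0 + x1) + x0^2 + x1^2 (strictly convex).
def f(x):
    return np.exp(x[0] + x[1]) + x[0] ** 2 + x[1] ** 2

def grad(x):
    e = np.exp(x[0] + x[1])
    return np.array([e + 2.0 * x[0], e + 2.0 * x[1]])

def hess(x):
    e = np.exp(x[0] + x[1])
    return np.array([[e + 2.0, e], [e, e + 2.0]])

x_min, n_iter = pure_newton(f, grad, hess, [1.0, 1.0])
print(x_min, n_iter, np.linalg.norm(grad(x_min)))
```

The gradient norm at the returned point gives a quick check that the iteration ended at a stationary point rather than at the iteration cap.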

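3) Strengths and weaknesses:

The pure Newton method converges quickly (quadratically near a minimizer), but it is sensitive to the choice of the initial point, and it requires $f(x)$ to be twice differentiable with an invertible (ideally positive definite) Hessian near the minimizer.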
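The sensitivity to the initial point can be seen on a small one-dimensional example. The objective $f(x)=\sqrt{1+x^2}$ is an assumption chosen for illustration: its pure Newton update simplifies to $x_{k+1}=-x_k^3$, so the iteration converges when $|x_0|<1$ and diverges when $|x_0|>1$.

```python
import math

def pure_newton_1d(x0, steps):
    """Pure Newton on f(x) = sqrt(1 + x^2), whose minimizer is x* = 0."""
    x = x0
    for _ in range(steps):
        d1 = x / math.sqrt(1.0 + x * x)   # f'(x)
        d2 = (1.0 + x * x) ** -1.5        # f''(x), always positive
        x = x - d1 / d2                   # algebraically equal to -x**3
    return x

print(pure_newton_1d(0.5, 5))   # shrinks rapidly toward 0
print(pure_newton_1d(1.5, 4))   # grows in magnitude at every step
```

The Hessian is positive everywhere in this example, so the failure for $|x_0|>1$ comes purely from taking the full step $t_k=1$.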

III. Damped Newton method
1) Update rule: $x_{k+1}=x_k+t_kp_k$

where $t_k \in (0,1]$ is chosen by a line search instead of being fixed at 1, and $p_k=-\nabla ^2f(x_k)^{-1}\nabla f(x_k)$, so $x_{k+1}=x_k-t_k \nabla ^2f(x_k)^{-1}\nabla f(x_k)$

2) Newton decrement (stopping criterion):

The Newton decrement measures how far the current iterate is from the optimum. It is commonly used in convergence analysis and as a stopping criterion.
Derivation of the Newton decrement:
$$\begin{align*}
\text{quadratic approximation: } f(x_{k}+p_k)&\approx f(x_k)+\nabla f(x_k)^Tp_k+\frac{1}{2} p_k^T\nabla ^2f(x_k)p_k\\
\text{minimizing over } p_k \Rightarrow p_k&=- \nabla ^2f(x_k)^{-1} \nabla f(x_k)\\
\Rightarrow \min_{p_k} \Big[f(x_k)+\nabla f(x_k)^Tp_k+\frac{1}{2} p_k^T\nabla ^2f(x_k)p_k\Big]&=f(x_k)- \nabla f(x_k)^T\nabla ^2f(x_k)^{-1} \nabla f(x_k)\\
&\quad+\frac{1}{2} \nabla f(x_k)^T\nabla ^2f(x_k)^{-T}\nabla ^2f(x_k) \nabla ^2f(x_k)^{-1} \nabla f(x_k)\\
& = f(x_k)- \frac{1}{2} \nabla f(x_k)^T\nabla ^2f(x_k)^{-1} \nabla f(x_k)\\
\Rightarrow f(x_k)-\min_{p_k} \Big[\cdot\Big]&=\frac{1}{2} \nabla f(x_k)^T\nabla ^2f(x_k)^{-1} \nabla f(x_k)=\frac{1}{2} \lambda(x_k)^2\\
\Rightarrow \lambda(x_k) &= (\nabla f(x_k)^T\nabla ^2f(x_k)^{-1} \nabla f(x_k))^{1/2}=(d(x_k)^T\nabla ^2f(x_k)d(x_k))^{1/2}=(\nabla f(x_k)^Td(x_k))^{1/2}\\
d(x_k) &= \nabla ^2f(x_k)^{-1} \nabla f(x_k)
\end{align*}$$
Therefore the Newton decrement $\lambda(x_k) =(\nabla f(x_k)^T\nabla ^2f(x_k)^{-1} \nabla f(x_k))^{1/2}$ can serve as a stopping criterion: $\frac{1}{2}\lambda(x_k)^2$ is the decrease predicted by the quadratic approximation.
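A short numerical check of the derivation, under the assumption of a quadratic objective (for which the quadratic approximation is exact, so $f(x_k)-f(x^*)=\frac{1}{2}\lambda(x_k)^2$ holds exactly). The helper name `newton_decrement` and the matrices are illustrative, not from the original article.

```python
import numpy as np

def newton_decrement(grad, hess):
    """lambda = sqrt(grad^T hess^{-1} grad), computed via a linear solve."""
    d = np.linalg.solve(hess, grad)   # d = hess^{-1} grad
    return np.sqrt(grad @ d)

# Hypothetical quadratic: f(x) = 0.5 * x^T A x - b^T x, A positive definite.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

def f(x):
    return 0.5 * x @ A @ x - b @ x

x_k = np.array([2.0, -1.0])
x_star = np.linalg.solve(A, b)        # exact minimizer of the quadratic
lam = newton_decrement(A @ x_k - b, A)

# Both numbers agree for a quadratic objective.
print(f(x_k) - f(x_star), 0.5 * lam ** 2)
```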

3) Armijo rule with backtracking line search (choosing $t_k$):

$$\alpha \in (0,1/2), \quad f(x_k+t_kp_k) \leq f(x_k)+\alpha t_k \nabla f(x_k)^Tp_k \;\Leftrightarrow\; f(x_k)-f(x_k+t_kp_k) \ge -\alpha t_k \nabla f(x_k)^Tp_k$$
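A minimal sketch of the backtracking search that enforces the Armijo condition: start from $t=1$ and multiply by $\beta$ until the condition holds. The function name `backtracking`, the defaults $\alpha=0.25$, $\beta=0.5$, and the test objective $f(x)=\sqrt{1+x^2}$ are assumptions for illustration.

```python
import numpy as np

def backtracking(f, x, p, g, alpha=0.25, beta=0.5):
    """Shrink t from 1 until f(x + t p) <= f(x) + alpha * t * g^T p."""
    t = 1.0
    fx = f(x)
    slope = g @ p   # directional derivative; negative for a descent direction
    while f(x + t * p) > fx + alpha * t * slope:
        t *= beta
    return t

# Hypothetical objective f(x) = sum(sqrt(1 + x_i^2)), here with one variable.
def f(x):
    return np.sum(np.sqrt(1.0 + x ** 2))

x = np.array([1.5])
g = x / np.sqrt(1.0 + x ** 2)   # gradient at x
p = -x * (1.0 + x ** 2)         # Newton direction -f''(x)^{-1} f'(x)

t = backtracking(f, x, p, g)
print(t, x + t * p)
```

From $x=1.5$ the full Newton step overshoots and increases $f$, so the search rejects $t=1$ and $t=0.5$ before accepting a shorter step.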

4) Algorithm:

$$\begin{align*}
&\text{set initial value } x_0,\ \alpha \in (0,1/2),\ \beta \in (0,1) \\
&\text{while} \quad \frac{1}{2}\lambda(x_k) ^2=\frac{1}{2}\nabla f(x_k)^T\nabla ^2f(x_k)^{-1} \nabla f(x_k) > \epsilon \quad \text{do} \\
&\quad \quad p_k=-\nabla ^2f(x_k)^{-1} \nabla f(x_k) \\
&\quad \quad t_k=1\\
&\quad \quad \text{while} \quad f(x_k)-f(x_k+t_kp_k) < -\alpha t_k \nabla f(x_k)^Tp_k \quad \text{do}\\
&\quad \quad \quad \quad t_k=\beta t_k\\
&\quad \quad \text{end while}\\
&\quad \quad x_{k+1}=x_k+t_kp_k \\
&\text{end while}
\end{align*}$$
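The damped algorithm can be sketched as below: iterate while $\frac{1}{2}\lambda^2>\epsilon$, and pick $t_k$ by backtracking. The function name `damped_newton`, the parameter defaults, and the test objective are assumptions for illustration. The starting point is chosen so that the pure method ($t_k=1$) would diverge on this objective.

```python
import numpy as np

def damped_newton(f, grad, hess, x0, eps=1e-10, alpha=0.25, beta=0.5, max_iter=100):
    """Damped Newton with Armijo backtracking; returns (x, iterations used)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g, H = grad(x), hess(x)
        p = np.linalg.solve(H, -g)    # Newton direction
        lam_sq = -(g @ p)             # lambda^2 = g^T H^{-1} g
        if lam_sq / 2.0 <= eps:       # stopping criterion on the decrement
            return x, k
        t, fx = 1.0, f(x)
        # Backtrack until the Armijo condition holds.
        while f(x + t * p) > fx + alpha * t * (g @ p):
            t *= beta
        x = x + t * p
    return x, max_iter

# Hypothetical objective: f(x) = sum(sqrt(1 + x_i^2)), minimized at x = 0.
def f(x):
    return np.sum(np.sqrt(1.0 + x ** 2))

def grad(x):
    return x / np.sqrt(1.0 + x ** 2)

def hess(x):
    return np.diag((1.0 + x ** 2) ** -1.5)

x_min, n_iter = damped_newton(f, grad, hess, [2.0, -3.0])
print(x_min, n_iter)
```

Once the iterates are close enough to the minimizer, the search accepts $t_k=1$ and the method behaves like the pure Newton method.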

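5) Practical considerations: the damped Newton method needs the gradient and the Hessian of the objective, which can be expensive to compute in high-dimensional problems. Under suitable conditions, such as strong convexity and a Lipschitz-continuous Hessian, it is guaranteed to converge quickly.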
6) Convergence analysis:

(1) Assumptions on $f(x), \nabla f(x),\nabla ^2f(x),x \in \mathbb{R}^n$:
$$\begin{align*}
&f(x) \text{ is twice continuously differentiable}\\
&\text{Lipschitz-continuous gradient: }\nabla ^2f(x) \preceq MI \Leftrightarrow \|\nabla f(y)-\nabla f(x)\| \leq M\|y-x\|\\
&\text{Lipschitz-continuous Hessian: }\|\nabla ^2f(y)-\nabla ^2f(x)\| \leq L\|y-x\|\\
&\text{strong convexity: } \nabla ^2f(x) \succeq mI \Rightarrow \|\nabla ^2f(x)^{-1} \| \leq \frac{1}{m}
\end{align*}$$
(2) Upper bound on the number of iterations under these assumptions:
The number of iterations needed to reach the stopping condition $f(x_k)-f(x^*) \leq \epsilon$ is at most $\frac{M^2L^2/m^5}{\alpha \beta \min (1,9(1-2\alpha)^2)}(f(x_0)-f(x^*))+\log_2\log_2\frac{2m^3/L^2}{\epsilon}$

(3) Under these assumptions, a full Newton step ($t_k=1$) satisfies $\|x_{k+1}-x^*\| \leq \frac{L}{2m}\|x_k-x^*\|^2$
Proof:
$$\begin{align*}
-\nabla f(x_k)&=0-\nabla f(x_k)=\nabla f(x^*)-\nabla f(x_k)\\
\nabla f(x^*)-\nabla f(x_k)&=\int ^1_0\nabla ^2f(x_k+t(x^*-x_k))(x^*-x_k)dt\\
x_{k+1}-x^*&=x_k-\nabla ^2f(x_k)^{-1}\nabla f(x_k)-x^*\\
&=x_k-x^*+\nabla ^2f(x_k)^{-1}(\nabla f(x^*)-\nabla f(x_k))\\
&=x_k-x^*+\nabla ^2f(x_k)^{-1}\int ^1_0\nabla ^2f(x_k+t(x^*-x_k))(x^*-x_k)dt\\
&=\nabla ^2f(x_k)^{-1}\int ^1_0[\nabla ^2f(x_k+t(x^*-x_k))-\nabla ^2f(x_k)](x^*-x_k)dt\\
\|x_{k+1}-x^*\| &=\left\|\nabla ^2f(x_k)^{-1}\int ^1_0[\nabla ^2f(x_k+t(x^*-x_k))-\nabla ^2f(x_k)](x^*-x_k)dt\right\|\\
& \leq \|\nabla ^2f(x_k)^{-1}\| \left\|\int ^1_0[\nabla ^2f(x_k+t(x^*-x_k))-\nabla ^2f(x_k)](x^*-x_k)dt\right\|\\
& \leq \frac{1}{m}\left\|\int ^1_0[\nabla ^2f(x_k+t(x^*-x_k))-\nabla ^2f(x_k)](x^*-x_k)dt\right\|\\
& \leq \frac{1}{m}\int ^1_0\|\nabla ^2f(x_k+t(x^*-x_k))-\nabla ^2f(x_k)\| \|x^*-x_k\|dt\\
& \leq \frac{1}{m}\int ^1_0 Lt \|x^*-x_k\|^2dt\\
&= \frac{L}{2m}\|x_k-x^*\|^2
\end{align*}$$
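The quadratic bound can be observed numerically. The sketch below assumes the one-dimensional objective $f(x)=e^x-x$ (minimizer $x^*=0$), chosen for illustration because $f''(0)=1$ and $f'''(0)=1$, so locally $L/(2m)\approx 1/2$. It prints the ratio $|x_{k+1}-x^*|/|x_k-x^*|^2$ for pure Newton steps.

```python
import math

def newton_errors(x0, steps):
    """Pure Newton on f(x) = exp(x) - x; returns |x_k - x*| with x* = 0."""
    errors = [abs(x0)]
    x = x0
    for _ in range(steps):
        d1 = math.expm1(x)   # f'(x)  = exp(x) - 1
        d2 = math.exp(x)     # f''(x) = exp(x)
        x = x - d1 / d2      # full Newton step
        errors.append(abs(x))
    return errors

errors = newton_errors(1.0, 5)
ratios = [errors[k + 1] / errors[k] ** 2 for k in range(5)]
for e, r in zip(errors[1:], ratios):
    print(f"error = {e:.3e}, error / previous_error^2 = {r:.4f}")
```

The error roughly squares at every step while the ratio stays bounded, which is the behavior the inequality describes.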
(4) Under these assumptions, $\frac{m}{2}\|x-x^*\|^2_2 \leq f(x)-f(x^*) \leq \frac{1}{2m}\|\nabla f(x)\|^2_2$
Proof:
$$\begin{align*}
\text{Left inequality: since } & \nabla ^2f(z) \succeq mI \text{ and } \nabla f(x^*)=0, \text{ for some } z \text{ between } x \text{ and } x^*,\\
f(x) &=f(x^*)+\nabla f(x^*)^T(x-x^*)+\frac{1}{2}(x-x^*)^T\nabla ^2f(z)(x-x^*)\\
&\ge f(x^*)+\frac{m}{2}\|x-x^*\|_2^2\\
\text{Right inequality: since } &\left\|\frac{\sqrt{m}}{\sqrt{2}}(x^*-x)+\frac{1}{\sqrt{2m}}\nabla f(x)\right\|_2^2=\frac{m}{2}\|x^*-x\|_2^2+\frac{1}{2m}\|\nabla f(x)\|_2^2+\nabla f(x)^T(x^*-x) \ge 0,\\
& \text{for some } z' \text{ between } x \text{ and } x^*,\\
f(x^*) &=f(x)+\nabla f(x)^T(x^*-x)+\frac{1}{2}(x^*-x)^T\nabla ^2f(z')(x^*-x)\\
& \ge f(x)+\nabla f(x)^T(x^*-x)+\frac{m}{2}\|x^*-x\|^2_2\\
&\ge f(x)-\frac{1}{2m}\|\nabla f(x)\|_2^2
\end{align*}$$
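As a sanity check, the two-sided bound can be evaluated at a few points of a strongly convex quadratic, where $m$ is the smallest eigenvalue of the Hessian. The matrix, the sampled points, and the helper name `sandwich` are assumptions for illustration.

```python
import numpy as np

# Hypothetical quadratic: f(x) = 0.5 * x^T A x - b^T x, A positive definite.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(A, b)    # minimizer
m = np.linalg.eigvalsh(A).min()   # strong convexity constant

def f(x):
    return 0.5 * x @ A @ x - b @ x

def grad(x):
    return A @ x - b

def sandwich(x):
    """Return (lower bound, f(x) - f(x*), upper bound)."""
    lower = 0.5 * m * np.sum((x - x_star) ** 2)
    gap = f(x) - f(x_star)
    upper = np.sum(grad(x) ** 2) / (2.0 * m)
    return lower, gap, upper

rng = np.random.default_rng(0)
samples = [sandwich(rng.normal(size=2)) for _ in range(5)]
for lower, gap, upper in samples:
    print(f"{lower:.4f} <= {gap:.4f} <= {upper:.4f}")
```

The right-hand bound is the useful one in practice: it turns a small gradient norm into a guarantee on the suboptimality $f(x)-f(x^*)$.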

IV. Inexact Newton method (a family that covers several specific solvers)

The pure and damped methods obtain $p_k$ by solving $\nabla ^2f(x_k)p_k+\nabla f(x_k)=0$ directly, i.e. $p_k=-\nabla ^2f(x_k)^{-1}\nabla f(x_k)$. The inexact method instead solves this system iteratively and only approximately, so that the residual $r_k=\nabla ^2f(x_k)p_k+\nabla f(x_k)$ is driven toward zero rather than made exactly zero.
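One common member of this family solves the Newton system with the conjugate gradient method and stops as soon as the residual is small relative to the gradient. The sketch below assumes that choice; the function name `cg_inexact`, the tolerance rule $\|r_k\|\le\eta\|\nabla f(x_k)\|$, and the test matrix are illustrative and not taken from the original article.

```python
import numpy as np

def cg_inexact(H, g, eta=0.1, max_iter=None):
    """Approximately solve H p = -g by conjugate gradients (H positive definite).

    Stops once the residual r = H p + g satisfies ||r|| <= eta * ||g||.
    """
    n = g.size
    max_iter = n if max_iter is None else max_iter
    p = np.zeros(n)
    r = g.copy()                 # residual H p + g at the start p = 0
    d = -r                       # first search direction
    tol = eta * np.linalg.norm(g)
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol:
            break
        Hd = H @ d
        step = (r @ r) / (d @ Hd)
        p = p + step * d
        r_new = r + step * Hd
        d = -r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return p

# Hypothetical Hessian and gradient at some iterate x_k.
H = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
g = np.array([1.0, -2.0, 3.0])

p_inexact = cg_inexact(H, g, eta=0.1)
p_exact = np.linalg.solve(H, -g)
print(np.linalg.norm(H @ p_inexact + g), np.linalg.norm(p_inexact - p_exact))
```

This version only needs Hessian-vector products `H @ d`, which is why inexact methods are attractive when forming or factoring the full Hessian is too expensive.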
