Explaining Optimizers with a Simple Example

We will use the following example to illustrate each optimizer:

Dataset:

$$(x_1,y_1) = (1,3),\quad (x_2,y_2) = (2,7),\quad (x_3,y_3) = (3,8),\quad (x_4,y_4) = (4,10),\quad (x_5,y_5) = (5,14)$$

Loss function:

$$L(x,y;w) = \frac12(y-wx)^2$$

Gradient of the loss with respect to the variable $w$:

$$\frac{\partial L(x,y;w)}{\partial w} = -x(y-wx)$$

Learning rate: $\eta=0.01$

Initial value: $w_0=1$
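As a sanity check, here is a minimal Python sketch (the names `data` and `grad` are just illustrative) that encodes this dataset and gradient and reproduces the per-sample gradients used in the sections below:

```python
# Dataset (x_i, y_i) and the gradient of L(x, y; w) = 0.5 * (y - w*x)^2
data = [(1, 3), (2, 7), (3, 8), (4, 10), (5, 14)]

def grad(w, x, y):
    """dL/dw = -x * (y - w*x) for a single sample."""
    return -x * (y - w * x)

w0 = 1.0
print([grad(w0, x, y) for x, y in data])  # [-2.0, -10.0, -15.0, -24.0, -45.0]
```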

Basic Optimizers

GD (Gradient Descent)

The gradient is computed using all of the data.

Update rule:

$$w_{t} = w_{t-1} - \eta G_{t-1} = w_{t-1} - \eta \frac1M \sum_{i=1}^M \frac{\partial L(x_i,y_i;w_{t-1})}{\partial w}$$

Compute the gradient for each sample:

$$\begin{aligned}
\frac{\partial L(x_1,y_1;w_{0})}{\partial w} &= -x_1(y_1-w_0 x_1) = -1 \times (3-1\times1) = -2 \\
\frac{\partial L(x_2,y_2;w_{0})}{\partial w} &= -x_2(y_2-w_0 x_2) = -2 \times (7-1\times2) = -10 \\
\frac{\partial L(x_3,y_3;w_{0})}{\partial w} &= -x_3(y_3-w_0 x_3) = -3 \times (8-1\times3) = -15 \\
\frac{\partial L(x_4,y_4;w_{0})}{\partial w} &= -x_4(y_4-w_0 x_4) = -4 \times (10-1\times4) = -24 \\
\frac{\partial L(x_5,y_5;w_{0})}{\partial w} &= -x_5(y_5-w_0 x_5) = -5 \times (14-1\times5) = -45
\end{aligned}$$

Compute the average gradient:

$$G_0 = \frac15 \sum_{i=1}^5 \frac{\partial L(x_i,y_i;w_{0})}{\partial w} = -\frac{2+10+15+24+45}{5} = -19.2$$

Update:

$$w_{1} = w_{0} - \eta G_0 = 1 - 0.01 \times (-19.2) = 1.192$$
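A minimal sketch of one full-batch gradient-descent step on this dataset (plain Python, no framework assumed); it reproduces $w_1 = 1.192$:

```python
data = [(1, 3), (2, 7), (3, 8), (4, 10), (5, 14)]
eta, w = 0.01, 1.0

# Average the per-sample gradients over the whole dataset, then take one GD step.
G = sum(-x * (y - w * x) for x, y in data) / len(data)  # -19.2
w = w - eta * G
print(w)  # 1.192
```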

SGD (Stochastic Gradient Descent)

The gradient is computed using batch_size samples.

The update rule is the same as for Gradient Descent.

Set batch_size = 2 and randomly draw two samples:

$$(x_2,y_2) = (2,7),\quad (x_3,y_3) = (3,8)$$

Compute the gradients:

$$\frac{\partial L(x_2,y_2;w_{0})}{\partial w} = -10, \quad \frac{\partial L(x_3,y_3;w_{0})}{\partial w} = -15$$

Compute the average gradient:

$$G_0 = \frac12 \left[\frac{\partial L(x_2,y_2;w_{0})}{\partial w} + \frac{\partial L(x_3,y_3;w_{0})}{\partial w}\right] = -\frac{10+15}{2} = -12.5$$

Update:

$$w_{1} = w_{0} - \eta G_{0} = 1 - 0.01 \times (-12.5) = 1.125$$
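A sketch of one mini-batch step. The batch is hard-coded to $(x_2,y_2)$ and $(x_3,y_3)$ so the output matches the hand calculation; in practice the batch would be drawn randomly, e.g. with `random.sample`:

```python
data = [(1, 3), (2, 7), (3, 8), (4, 10), (5, 14)]
eta, w, batch_size = 0.01, 1.0, 2

# In practice: batch = random.sample(data, batch_size); fixed here to match the text.
batch = [data[1], data[2]]                                # (2, 7) and (3, 8)
G = sum(-x * (y - w * x) for x, y in batch) / batch_size  # -12.5
w = w - eta * G
print(w)  # 1.125
```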

Strategies for Improving Optimizers

There are two main directions for improving an optimizer:

  • Improve the gradient
  • Improve the learning rate

Gradient Improvement Strategies

Momentum

Each update adds in a fraction of the previous update, so earlier gradients keep contributing to the current step.

Update rule:

$$\begin{aligned}
m_{t} &= \mu m_{t-1} + \eta G_{t-1} = \mu m_{t-1} + \eta \frac1M \sum_{i=1}^M \frac{\partial L(x_i,y_i;w_{t-1})}{\partial w} \\
w_{t} &= w_{t-1} - m_{t}
\end{aligned}$$

Average gradient (already computed in the Gradient Descent section):

$$G_0 = \frac15 \sum_{i=1}^5 \frac{\partial L(x_i,y_i;w_{0})}{\partial w} = -\frac{2+10+15+24+45}{5} = -19.2$$

Initial values: $m_0=0$, $\mu=0.3$

First iteration:

$$\begin{aligned}
m_{1} &= \mu m_{0} + \eta G_0 = 0.01 \times (-19.2) = -0.192 \\
w_{1} &= w_{0} - m_{1} = 1 - (-0.192) = 1.192
\end{aligned}$$

Compute the gradient for each sample:

$$\begin{aligned}
\frac{\partial L(x_1,y_1;w_{1})}{\partial w} &= -x_1(y_1-w_1 x_1) = -1 \times (3-1.192\times1) = -1.808 \\
\frac{\partial L(x_2,y_2;w_{1})}{\partial w} &= -x_2(y_2-w_1 x_2) = -2 \times (7-1.192\times2) = -9.232 \\
\frac{\partial L(x_3,y_3;w_{1})}{\partial w} &= -x_3(y_3-w_1 x_3) = -3 \times (8-1.192\times3) = -13.272 \\
\frac{\partial L(x_4,y_4;w_{1})}{\partial w} &= -x_4(y_4-w_1 x_4) = -4 \times (10-1.192\times4) = -20.928 \\
\frac{\partial L(x_5,y_5;w_{1})}{\partial w} &= -x_5(y_5-w_1 x_5) = -5 \times (14-1.192\times5) = -40.2
\end{aligned}$$

Compute the average gradient:

$$G_1 = \frac15 \sum_{i=1}^5 \frac{\partial L(x_i,y_i;w_{1})}{\partial w} = -\frac{1.808+9.232+13.272+20.928+40.2}{5} = -17.088$$

Second iteration:

$$\begin{aligned}
m_{2} &= \mu m_{1} + \eta G_1 = 0.3 \times (-0.192) + 0.01 \times (-17.088) = -0.22848 \\
w_{2} &= w_{1} - m_{2} = 1.192 - (-0.22848) = 1.42048
\end{aligned}$$
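A minimal sketch of two full-batch Momentum iterations; it reproduces $w_1 = 1.192$ and $w_2 = 1.42048$:

```python
data = [(1, 3), (2, 7), (3, 8), (4, 10), (5, 14)]
eta, mu = 0.01, 0.3
w, m = 1.0, 0.0

for t in range(1, 3):
    G = sum(-x * (y - w * x) for x, y in data) / len(data)
    m = mu * m + eta * G        # accumulate the velocity term
    w = w - m
    print(t, round(w, 6))       # 1 -> 1.192, 2 -> 1.42048
```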

Nesterov Momentum

Nesterov Momentum evaluates the gradient at the look-ahead point $w_{t-1}-\mu m_{t-1}$ rather than at $w_{t-1}$.

Update rule:

$$\begin{aligned}
m_{t} &= \mu m_{t-1} + \eta G_{t-1} = \mu m_{t-1} + \eta \frac1M \sum_{i=1}^M \frac{\partial L(x_i,y_i;w_{t-1}-\mu m_{t-1})}{\partial w} \\
w_{t} &= w_{t-1} - m_{t}
\end{aligned}$$

Initial values: $m_0=0$, $\mu=0.3$

Then: $w_{0}-\mu m_{0}=1$

Compute the gradient for each sample:

$$\begin{aligned}
\frac{\partial L(x_1,y_1;w_{0}-\mu m_{0})}{\partial w} &= -x_1[y_1 - (w_{0}-\mu m_{0}) x_1] = -1 \times (3-1\times1) = -2 \\
\frac{\partial L(x_2,y_2;w_{0}-\mu m_{0})}{\partial w} &= -x_2[y_2 - (w_{0}-\mu m_{0}) x_2] = -2 \times (7-1\times2) = -10 \\
\frac{\partial L(x_3,y_3;w_{0}-\mu m_{0})}{\partial w} &= -x_3[y_3 - (w_{0}-\mu m_{0}) x_3] = -3 \times (8-1\times3) = -15 \\
\frac{\partial L(x_4,y_4;w_{0}-\mu m_{0})}{\partial w} &= -x_4[y_4 - (w_{0}-\mu m_{0}) x_4] = -4 \times (10-1\times4) = -24 \\
\frac{\partial L(x_5,y_5;w_{0}-\mu m_{0})}{\partial w} &= -x_5[y_5 - (w_{0}-\mu m_{0}) x_5] = -5 \times (14-1\times5) = -45
\end{aligned}$$

Average gradient:

$$G_0 = \frac15 \sum_{i=1}^5 \frac{\partial L(x_i,y_i;w_{0}-\mu m_{0})}{\partial w} = -\frac{2+10+15+24+45}{5} = -19.2$$

First iteration:

$$\begin{aligned}
m_{1} &= \mu m_{0} + \eta G_0 = 0.01 \times (-19.2) = -0.192 \\
w_{1} &= w_{0} - m_{1} = 1 - (-0.192) = 1.192
\end{aligned}$$

Then: $w_{1}-\mu m_{1}=1.192-0.3\times(-0.192)=1.2496$

Compute the gradient for each sample:

$$\begin{aligned}
\frac{\partial L(x_1,y_1;w_{1}-\mu m_{1})}{\partial w} &= -x_1[y_1 - (w_{1}-\mu m_{1}) x_1] = -1 \times (3 - 1.2496 \times 1) = -1.7504 \\
\frac{\partial L(x_2,y_2;w_{1}-\mu m_{1})}{\partial w} &= -x_2[y_2 - (w_{1}-\mu m_{1}) x_2] = -2 \times (7 - 1.2496 \times 2) = -9.0016 \\
\frac{\partial L(x_3,y_3;w_{1}-\mu m_{1})}{\partial w} &= -x_3[y_3 - (w_{1}-\mu m_{1}) x_3] = -3 \times (8 - 1.2496 \times 3) = -12.7536 \\
\frac{\partial L(x_4,y_4;w_{1}-\mu m_{1})}{\partial w} &= -x_4[y_4 - (w_{1}-\mu m_{1}) x_4] = -4 \times (10 - 1.2496 \times 4) = -20.0064 \\
\frac{\partial L(x_5,y_5;w_{1}-\mu m_{1})}{\partial w} &= -x_5[y_5 - (w_{1}-\mu m_{1}) x_5] = -5 \times (14 - 1.2496 \times 5) = -38.76
\end{aligned}$$

Compute the average gradient:

$$G_1 = \frac15 \sum_{i=1}^5 \frac{\partial L(x_i,y_i;w_{1}-\mu m_{1})}{\partial w} = -\frac{1.7504+9.0016+12.7536+20.0064+38.76}{5} = -16.4544$$

Second iteration:

$$\begin{aligned}
m_{2} &= \mu m_{1} + \eta G_1 = 0.3 \times (-0.192) + 0.01 \times (-16.4544) = -0.222144 \\
w_{2} &= w_{1} - m_{2} = 1.192 - (-0.222144) = 1.414144
\end{aligned}$$
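A sketch of two Nesterov Momentum iterations, computing the gradient at the look-ahead point; it reproduces $w_1 = 1.192$ and $w_2 = 1.414144$:

```python
data = [(1, 3), (2, 7), (3, 8), (4, 10), (5, 14)]
eta, mu = 0.01, 0.3
w, m = 1.0, 0.0

for t in range(1, 3):
    w_look = w - mu * m         # look-ahead point w_{t-1} - mu * m_{t-1}
    G = sum(-x * (y - w_look * x) for x, y in data) / len(data)
    m = mu * m + eta * G
    w = w - m
    print(t, round(w, 6))       # 1 -> 1.192, 2 -> 1.414144
```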

Learning-Rate Improvement Strategies

AdaGrad

Unlike Momentum, AdaGrad does not modify the gradient. Instead it focuses on shrinking the learning rate, using a decay factor that accumulates the squares of all previously used gradients.

Update rule:

$$\begin{aligned}
g_t &= g_{t-1} + G_{t-1}^2 = g_{t-1} + \left[\frac1M \sum_{i=1}^M \frac{\partial L(x_i,y_i;w_{t-1})}{\partial w}\right]^2 \\
w_{t} &= w_{t-1} - \frac{\eta}{\sqrt{g_t+\varepsilon}} G_{t-1}
\end{aligned}$$

$\varepsilon$ is a very small number (e.g. $10^{-8}$) whose purpose is to prevent division by zero.

Initial values: $g_0=0$, $\eta = 0.1$

Average gradient (already computed in the Gradient Descent section):

$$G_0 = \frac15 \sum_{i=1}^5 \frac{\partial L(x_i,y_i;w_{0})}{\partial w} = -\frac{2+10+15+24+45}{5} = -19.2$$

First iteration:

$$\begin{aligned}
g_1 &= g_{0} + G_0^2 = 0 + (-19.2)^2 = 368.64 \\
w_{1} &= w_{0} - \frac{\eta}{\sqrt{g_1+\varepsilon}} G_0 = 1 - \frac{0.1}{\sqrt{368.64}} \times (-19.2) = 1.1
\end{aligned}$$

Compute the gradient for each sample:

$$\begin{aligned}
\frac{\partial L(x_1,y_1;w_{1})}{\partial w} &= -x_1(y_1-w_1 x_1) = -1 \times (3-1.1\times1) = -1.9 \\
\frac{\partial L(x_2,y_2;w_{1})}{\partial w} &= -x_2(y_2-w_1 x_2) = -2 \times (7-1.1\times2) = -9.6 \\
\frac{\partial L(x_3,y_3;w_{1})}{\partial w} &= -x_3(y_3-w_1 x_3) = -3 \times (8-1.1\times3) = -14.1 \\
\frac{\partial L(x_4,y_4;w_{1})}{\partial w} &= -x_4(y_4-w_1 x_4) = -4 \times (10-1.1\times4) = -22.4 \\
\frac{\partial L(x_5,y_5;w_{1})}{\partial w} &= -x_5(y_5-w_1 x_5) = -5 \times (14-1.1\times5) = -42.5
\end{aligned}$$

Compute the average gradient:

$$G_1 = \frac15 \sum_{i=1}^5 \frac{\partial L(x_i,y_i;w_{1})}{\partial w} = -\frac{1.9+9.6+14.1+22.4+42.5}{5} = -18.1$$

Second iteration:

$$\begin{aligned}
g_2 &= g_{1} + G_1^2 = 368.64 + (-18.1)^2 = 696.25 \\
w_{2} &= w_{1} - \frac{\eta}{\sqrt{g_2+\varepsilon}} G_1 = 1.1 - \frac{0.1}{\sqrt{696.25}} \times (-18.1) \approx 1.17
\end{aligned}$$
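A sketch of two AdaGrad iterations; it reproduces $w_1 = 1.1$ and $w_2 \approx 1.17$:

```python
data = [(1, 3), (2, 7), (3, 8), (4, 10), (5, 14)]
eta, eps = 0.1, 1e-8
w, g = 1.0, 0.0

for t in range(1, 3):
    G = sum(-x * (y - w * x) for x, y in data) / len(data)
    g = g + G ** 2                        # accumulate squared gradients forever
    w = w - eta / (g + eps) ** 0.5 * G
    print(t, round(w, 4))                 # 1 -> 1.1, 2 -> 1.1686
```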
Drawback of AdaGrad:
The decay factor accumulates the gradients from every update step, so by the later stages of training the effective learning rate has decayed to a very small value.

RMSProp

To overcome AdaGrad's drawback, only the gradients from the most recent steps are considered when computing the decay factor.

Update rule:

$$\begin{aligned}
g_t &= decay\_rate \cdot g_{t-1} + (1 - decay\_rate) \cdot G_{t-1}^2 \\
&= decay\_rate \cdot g_{t-1} + (1 - decay\_rate) \cdot \left[\frac1M \sum_{i=1}^M \frac{\partial L(x_i,y_i;w_{t-1})}{\partial w}\right]^2 \\
w_{t} &= w_{t-1} - \frac{\eta}{\sqrt{g_t+\varepsilon}} G_{t-1}
\end{aligned}$$

Why do we say that RMSProp considers only the most recent gradients when computing the decay factor?

Consider the following expansion:

$$\begin{aligned}
g_1 &= decay\_rate \cdot g_{0} + (1 - decay\_rate) \cdot G_0^2 \\
g_2 &= decay\_rate \cdot g_{1} + (1 - decay\_rate) \cdot G_1^2 \\
&= decay\_rate^2 \cdot g_{0} + decay\_rate \cdot (1 - decay\_rate) \cdot G_0^2 + (1 - decay\_rate) \cdot G_1^2 \\
g_3 &= decay\_rate \cdot g_{2} + (1 - decay\_rate) \cdot G_2^2 \\
&= decay\_rate^3 \cdot g_{0} + decay\_rate^2 \cdot (1 - decay\_rate) \cdot G_0^2 + decay\_rate \cdot (1 - decay\_rate) \cdot G_1^2 + (1 - decay\_rate) \cdot G_2^2 \\
&\;\;\vdots \\
g_t &= decay\_rate \cdot g_{t-1} + (1 - decay\_rate) \cdot G_{t-1}^2 \\
&= decay\_rate^{t} \cdot g_{0} + decay\_rate^{t-1} \cdot (1 - decay\_rate) \cdot G_{0}^2 + decay\_rate^{t-2} \cdot (1 - decay\_rate) \cdot G_{1}^2 + \cdots + decay\_rate \cdot (1 - decay\_rate) \cdot G_{t-2}^2 + (1 - decay\_rate) \cdot G_{t-1}^2
\end{aligned}$$

After $t$ iterations, the earlier a squared gradient was computed ($G_{0}^2$, $G_{1}^2$, and so on), the smaller its coefficient.

Example: suppose $decay\_rate=0.9$ and $t=100$. Then the coefficient of $G_{0}^2$ is

$$decay\_rate^{t-1} \cdot (1 - decay\_rate) = 0.9^{99} \times 0.1 \approx 2.95 \times 10^{-6},$$

which is negligible.

Initial values: $g_0=0$, $\eta = 0.1$, $decay\_rate = 0.9$

Average gradient (already computed in the Gradient Descent section):

$$G_0 = \frac15 \sum_{i=1}^5 \frac{\partial L(x_i,y_i;w_{0})}{\partial w} = -\frac{2+10+15+24+45}{5} = -19.2$$

First iteration:

$$\begin{aligned}
g_1 &= decay\_rate \cdot g_{0} + (1 - decay\_rate) \cdot G_{0}^2 = 0.9 \times 0 + 0.1 \times (-19.2)^2 = 36.864 \\
w_{1} &= w_{0} - \frac{\eta}{\sqrt{g_1+\varepsilon}} G_{0} = 1 - \frac{0.1}{\sqrt{36.864}} \times (-19.2) \approx 1.3162
\end{aligned}$$

Compute the gradient for each sample:

$$\begin{aligned}
\frac{\partial L(x_1,y_1;w_{1})}{\partial w} &= -x_1(y_1-w_1 x_1) = -1 \times (3-1.3162\times1) = -1.6838 \\
\frac{\partial L(x_2,y_2;w_{1})}{\partial w} &= -x_2(y_2-w_1 x_2) = -2 \times (7-1.3162\times2) = -8.7352 \\
\frac{\partial L(x_3,y_3;w_{1})}{\partial w} &= -x_3(y_3-w_1 x_3) = -3 \times (8-1.3162\times3) = -12.1542 \\
\frac{\partial L(x_4,y_4;w_{1})}{\partial w} &= -x_4(y_4-w_1 x_4) = -4 \times (10-1.3162\times4) = -18.9408 \\
\frac{\partial L(x_5,y_5;w_{1})}{\partial w} &= -x_5(y_5-w_1 x_5) = -5 \times (14-1.3162\times5) = -37.095
\end{aligned}$$

Compute the average gradient:

$$G_1 = \frac15 \sum_{i=1}^5 \frac{\partial L(x_i,y_i;w_{1})}{\partial w} = -\frac{1.6838+8.7352+12.1542+18.9408+37.095}{5} = -15.7218$$

Second iteration:

$$\begin{aligned}
g_2 &= decay\_rate \cdot g_{1} + (1-decay\_rate) \cdot G_1^2 = 0.9 \times 36.864 + 0.1 \times (-15.7218)^2 \approx 57.8951 \\
w_{2} &= w_{1} - \frac{\eta}{\sqrt{g_2+\varepsilon}} G_1 = 1.3162 - \frac{0.1}{\sqrt{57.8951}} \times (-15.7218) \approx 1.5228
\end{aligned}$$
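A sketch of two RMSProp iterations; it reproduces $w_1 \approx 1.3162$ and $w_2 \approx 1.5228$:

```python
data = [(1, 3), (2, 7), (3, 8), (4, 10), (5, 14)]
eta, eps, decay_rate = 0.1, 1e-8, 0.9
w, g = 1.0, 0.0

for t in range(1, 3):
    G = sum(-x * (y - w * x) for x, y in data) / len(data)
    g = decay_rate * g + (1 - decay_rate) * G ** 2  # exponential moving average
    w = w - eta / (g + eps) ** 0.5 * G
    print(t, round(w, 4))                           # 1 -> 1.3162, 2 -> 1.5228
```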

Putting It All Together

Adam

Adam combines Momentum's gradient-update strategy with RMSProp's learning-rate-decay strategy.

Update rule:

$$\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1) G_{t-1} = \beta_1 m_{t-1} + (1-\beta_1) \frac1M \sum_{i=1}^M \frac{\partial L(x_i,y_i;w_{t-1})}{\partial w} \\
g_t &= \beta_2 g_{t-1} + (1-\beta_2) G_{t-1}^2 = \beta_2 g_{t-1} + (1-\beta_2)\left[\frac1M \sum_{i=1}^M \frac{\partial L(x_i,y_i;w_{t-1})}{\partial w}\right]^2 \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^t}, \quad \hat{g}_t = \frac{g_t}{1-\beta_2^t} \\
w_t &= w_{t-1} - \frac{\eta}{\sqrt{\hat{g}_t + \varepsilon}} \cdot \hat{m}_t
\end{aligned}$$

$\beta_1$ and $\beta_2$ are usually set to 0.9 and 0.999, respectively.

The reason $m_t$ and $g_t$ are divided by $1-\beta_1^t$ and $1-\beta_2^t$ (where $t$ is an exponent) is:

Because $m_0$ and $g_0$ are initialized to zero, the raw estimates are biased toward zero during the first few steps; dividing by $1-\beta_1^t$ and $1-\beta_2^t$ corrects this bias, so the gradient and learning-rate adjustments can change fairly sharply early in training. This increases the variability of the initial descent path, which may uncover a better path than the previous optimizers would find.
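A minimal sketch of two Adam iterations on the same dataset. The learning rate $\eta=0.1$ is an assumption chosen to match the AdaGrad/RMSProp examples, since the text gives no numeric walkthrough for Adam; note that the first step lands at $w_1 = 1.1$ because the bias correction cancels the $(1-\beta)$ factors:

```python
data = [(1, 3), (2, 7), (3, 8), (4, 10), (5, 14)]
eta, eps = 0.1, 1e-8
beta1, beta2 = 0.9, 0.999
w, m, g = 1.0, 0.0, 0.0

for t in range(1, 3):
    G = sum(-x * (y - w * x) for x, y in data) / len(data)
    m = beta1 * m + (1 - beta1) * G        # first-moment (Momentum-style) estimate
    g = beta2 * g + (1 - beta2) * G ** 2   # second-moment (RMSProp-style) estimate
    m_hat = m / (1 - beta1 ** t)           # bias correction
    g_hat = g / (1 - beta2 ** t)
    w = w - eta / (g_hat + eps) ** 0.5 * m_hat
    print(t, round(w, 4))                  # 1 -> 1.1, 2 -> ~1.1998
```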
