Illustrating Optimizers with a Simple Example
We will use the following example to illustrate each optimizer:
Dataset:
$$
(x_1,y_1) = (1,3) \\
(x_2,y_2) = (2,7) \\
(x_3,y_3) = (3,8) \\
(x_4,y_4) = (4,10) \\
(x_5,y_5) = (5,14)
$$
Loss function:
$$
L(x,y;w) = \frac12 (y - wx)^2
$$
Gradient of the loss with respect to the variable $w$:
$$
\frac{\partial L(x,y;w)}{\partial w} = -x(y - wx)
$$
Learning rate: $\eta = 0.01$
Initial value: $w_0 = 1$
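All of the hand calculations below can be reproduced with a few lines of Python. Here is a minimal setup sketch; the helper names `data`, `grad`, and `avg_grad` are our own, introduced only for illustration, and are reused by the later optimizer sketches:

```python
# Dataset, loss gradient, and hyperparameters from the definitions above.
data = [(1, 3), (2, 7), (3, 8), (4, 10), (5, 14)]

def grad(w, x, y):
    """dL/dw for L(x, y; w) = 0.5 * (y - w*x)^2, i.e. -x * (y - w*x)."""
    return -x * (y - w * x)

def avg_grad(w, batch):
    """Average gradient G over a batch of (x, y) pairs."""
    return sum(grad(w, x, y) for x, y in batch) / len(batch)

eta = 0.01  # learning rate
w0 = 1.0    # initial weight
```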
Simple Optimizers
GD (Gradient Descent)
Computes the gradient using the full dataset.
Update rule:
$$
w_t = w_{t-1} - \eta G_{t-1} = w_{t-1} - \eta \frac{1}{M} \sum_{i=1}^M \frac{\partial L(x_i,y_i;w_{t-1})}{\partial w}
$$
Compute the gradient for each data point:
$$
\frac{\partial L(x_1,y_1;w_0)}{\partial w} = -x_1(y_1 - w_0 x_1) = -1 \times (3 - 1 \times 1) = -2 \\
\frac{\partial L(x_2,y_2;w_0)}{\partial w} = -x_2(y_2 - w_0 x_2) = -2 \times (7 - 1 \times 2) = -10 \\
\frac{\partial L(x_3,y_3;w_0)}{\partial w} = -x_3(y_3 - w_0 x_3) = -3 \times (8 - 1 \times 3) = -15 \\
\frac{\partial L(x_4,y_4;w_0)}{\partial w} = -x_4(y_4 - w_0 x_4) = -4 \times (10 - 1 \times 4) = -24 \\
\frac{\partial L(x_5,y_5;w_0)}{\partial w} = -x_5(y_5 - w_0 x_5) = -5 \times (14 - 1 \times 5) = -45
$$
Compute the average gradient:
$$
G_0 = \frac{1}{5} \sum_{i=1}^5 \frac{\partial L(x_i,y_i;w_0)}{\partial w} = -\frac{2+10+15+24+45}{5} = -19.2
$$
Iterate:
$$
w_1 = w_0 - \eta G_0 = 1 - 0.01 \times (-19.2) = 1.192
$$
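As a check, here is a sketch of this single GD step, reusing the setup defined at the top:

```python
# One full-batch gradient descent step.
G0 = avg_grad(w0, data)   # -19.2
w1 = w0 - eta * G0        # 1 - 0.01 * (-19.2) = 1.192
print(G0, w1)             # -19.2 1.192
```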
SGD (Stochastic Gradient Descent)
Computes the gradient using batch_size samples.
The update rule is the same as for Gradient Descent.
Set batch_size = 2 and randomly draw two data points:
$$
(x_2,y_2) = (2,7) \\
(x_3,y_3) = (3,8)
$$
Compute the gradients:
$$
\frac{\partial L(x_2,y_2;w_0)}{\partial w} = -10 \\
\frac{\partial L(x_3,y_3;w_0)}{\partial w} = -15
$$
Compute the average gradient:
$$
G_0 = \frac12 \left[ \frac{\partial L(x_2,y_2;w_0)}{\partial w} + \frac{\partial L(x_3,y_3;w_0)}{\partial w} \right] = -\frac{10+15}{2} = -12.5
$$
Iterate:
$$
w_1 = w_0 - \eta G_0 = 1 - 0.01 \times (-12.5) = 1.125
$$
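A sketch of the SGD step, reusing the setup at the top; the batch is fixed to the two points drawn in the text:

```python
# One minibatch SGD step with batch_size = 2. In practice the batch
# would be drawn at random, e.g. random.sample(data, 2); here we fix
# it to the two points used in the text.
batch = [(2, 7), (3, 8)]
G0 = avg_grad(w0, batch)   # -12.5
w1 = w0 - eta * G0         # 1 - 0.01 * (-12.5) = 1.125
print(G0, w1)
```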
Strategies for Improving Optimizers
There are two directions for improving an optimizer:
- improve the gradient
- improve the learning rate
Gradient Improvement Strategies
Momentum
Adds in the momentum accumulated from previous update steps.
Update rule:
$$
m_t = \mu m_{t-1} + \eta G_{t-1} = \mu m_{t-1} + \eta \frac{1}{M} \sum_{i=1}^M \frac{\partial L(x_i,y_i;w_{t-1})}{\partial w} \\
w_t = w_{t-1} - m_t
$$
Average gradient (already computed in the Gradient Descent section):
$$
G_0 = \frac{1}{5} \sum_{i=1}^5 \frac{\partial L(x_i,y_i;w_0)}{\partial w} = -\frac{2+10+15+24+45}{5} = -19.2
$$
Initial values: $m_0 = 0$, $\mu = 0.3$
First iteration:
$$
m_1 = \mu m_0 + \eta G_0 = 0.01 \times (-19.2) = -0.192 \\
w_1 = w_0 - m_1 = 1 - (-0.192) = 1.192
$$
Compute the gradient for each data point:
$$
\frac{\partial L(x_1,y_1;w_1)}{\partial w} = -x_1(y_1 - w_1 x_1) = -1 \times (3 - 1.192 \times 1) = -1.808 \\
\frac{\partial L(x_2,y_2;w_1)}{\partial w} = -x_2(y_2 - w_1 x_2) = -2 \times (7 - 1.192 \times 2) = -9.232 \\
\frac{\partial L(x_3,y_3;w_1)}{\partial w} = -x_3(y_3 - w_1 x_3) = -3 \times (8 - 1.192 \times 3) = -13.272 \\
\frac{\partial L(x_4,y_4;w_1)}{\partial w} = -x_4(y_4 - w_1 x_4) = -4 \times (10 - 1.192 \times 4) = -20.928 \\
\frac{\partial L(x_5,y_5;w_1)}{\partial w} = -x_5(y_5 - w_1 x_5) = -5 \times (14 - 1.192 \times 5) = -40.2
$$
Compute the average gradient:
$$
G_1 = \frac{1}{5} \sum_{i=1}^5 \frac{\partial L(x_i,y_i;w_1)}{\partial w} = -\frac{1.808+9.232+13.272+20.928+40.2}{5} = -17.088
$$
Second iteration:
$$
m_2 = \mu m_1 + \eta G_1 = 0.3 \times (-0.192) + 0.01 \times (-17.088) = -0.22848 \\
w_2 = w_1 - m_2 = 1.192 - (-0.22848) = 1.42048
$$
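Both iterations can be verified with a short sketch, reusing the setup defined at the top:

```python
# Two momentum steps, reproducing the numbers above.
mu, m, w = 0.3, 0.0, w0
for step in (1, 2):
    G = avg_grad(w, data)
    m = mu * m + eta * G   # accumulated velocity
    w = w - m
    print(step, m, w)      # (1, -0.192, 1.192), (2, -0.22848, 1.42048)
```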
Nesterov Momentum
Update rule:
$$
m_t = \mu m_{t-1} + \eta G_{t-1} = \mu m_{t-1} + \eta \frac{1}{M} \sum_{i=1}^M \frac{\partial L(x_i,y_i;w_{t-1}-\mu m_{t-1})}{\partial w} \\
w_t = w_{t-1} - m_t
$$
Initial values: $m_0 = 0$, $\mu = 0.3$
Then: $w_0 - \mu m_0 = 1$
Compute the gradient for each data point:
$$
\frac{\partial L(x_1,y_1;w_0-\mu m_0)}{\partial w} = -x_1[y_1 - (w_0-\mu m_0) x_1] = -1 \times (3 - 1 \times 1) = -2 \\
\frac{\partial L(x_2,y_2;w_0-\mu m_0)}{\partial w} = -x_2[y_2 - (w_0-\mu m_0) x_2] = -2 \times (7 - 1 \times 2) = -10 \\
\frac{\partial L(x_3,y_3;w_0-\mu m_0)}{\partial w} = -x_3[y_3 - (w_0-\mu m_0) x_3] = -3 \times (8 - 1 \times 3) = -15 \\
\frac{\partial L(x_4,y_4;w_0-\mu m_0)}{\partial w} = -x_4[y_4 - (w_0-\mu m_0) x_4] = -4 \times (10 - 1 \times 4) = -24 \\
\frac{\partial L(x_5,y_5;w_0-\mu m_0)}{\partial w} = -x_5[y_5 - (w_0-\mu m_0) x_5] = -5 \times (14 - 1 \times 5) = -45
$$
Average gradient:
$$
G_0 = \frac{1}{5} \sum_{i=1}^5 \frac{\partial L(x_i,y_i;w_0-\mu m_0)}{\partial w} = -\frac{2+10+15+24+45}{5} = -19.2
$$
First iteration:
$$
m_1 = \mu m_0 + \eta G_0 = 0.01 \times (-19.2) = -0.192 \\
w_1 = w_0 - m_1 = 1 - (-0.192) = 1.192
$$
Then:
$$
w_1 - \mu m_1 = 1.192 - 0.3 \times (-0.192) = 1.2496
$$
Compute the gradient for each data point:
$$
\frac{\partial L(x_1,y_1;w_1-\mu m_1)}{\partial w} = -x_1[y_1 - (w_1-\mu m_1) x_1] = -1 \times (3 - 1.2496 \times 1) = -1.7504 \\
\frac{\partial L(x_2,y_2;w_1-\mu m_1)}{\partial w} = -x_2[y_2 - (w_1-\mu m_1) x_2] = -2 \times (7 - 1.2496 \times 2) = -9.0016 \\
\frac{\partial L(x_3,y_3;w_1-\mu m_1)}{\partial w} = -x_3[y_3 - (w_1-\mu m_1) x_3] = -3 \times (8 - 1.2496 \times 3) = -12.7536 \\
\frac{\partial L(x_4,y_4;w_1-\mu m_1)}{\partial w} = -x_4[y_4 - (w_1-\mu m_1) x_4] = -4 \times (10 - 1.2496 \times 4) = -20.0064 \\
\frac{\partial L(x_5,y_5;w_1-\mu m_1)}{\partial w} = -x_5[y_5 - (w_1-\mu m_1) x_5] = -5 \times (14 - 1.2496 \times 5) = -38.76
$$
Compute the average gradient:
$$
G_1 = \frac{1}{5} \sum_{i=1}^5 \frac{\partial L(x_i,y_i;w_1-\mu m_1)}{\partial w} = -\frac{1.7504+9.0016+12.7536+20.0064+38.76}{5} = -16.4544
$$
Second iteration:
$$
m_2 = \mu m_1 + \eta G_1 = 0.3 \times (-0.192) + 0.01 \times (-16.4544) = -0.222144 \\
w_2 = w_1 - m_2 = 1.192 - (-0.222144) = 1.414144
$$
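A sketch of both Nesterov iterations, reusing the setup at the top; the only change from plain momentum is where the gradient is evaluated:

```python
# Two Nesterov momentum steps: the gradient is evaluated at the
# look-ahead point w - mu * m rather than at w itself.
mu, m, w = 0.3, 0.0, w0
for step in (1, 2):
    G = avg_grad(w - mu * m, data)   # look-ahead gradient
    m = mu * m + eta * G
    w = w - m
    print(step, m, w)   # (1, -0.192, 1.192), (2, -0.222144, 1.414144)
```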
Learning Rate Improvement Strategies
AdaGrad
Unlike Momentum, AdaGrad leaves the gradient unchanged and instead focuses on decaying the learning rate; the decay factor is the accumulated sum of the squared gradients of all previous updates.
Update rule:
$$
g_t = g_{t-1} + G_{t-1}^2 = g_{t-1} + \left[ \frac{1}{M} \sum_{i=1}^M \frac{\partial L(x_i,y_i;w_{t-1})}{\partial w} \right]^2 \\
w_t = w_{t-1} - \frac{\eta}{\sqrt{g_t+\varepsilon}} G_{t-1}
$$
$\varepsilon$ is a very small number (e.g. $10^{-8}$) that guards against division by zero.
Initial values: $g_0 = 0$, $\eta = 0.1$
Average gradient (already computed in the Gradient Descent section):
$$
G_0 = \frac{1}{5} \sum_{i=1}^5 \frac{\partial L(x_i,y_i;w_0)}{\partial w} = -\frac{2+10+15+24+45}{5} = -19.2
$$
First iteration:
$$
g_1 = g_0 + G_0^2 = 0 + (-19.2)^2 = 368.64 \\
w_1 = w_0 - \frac{\eta}{\sqrt{g_1+\varepsilon}} G_0 = 1 - \frac{0.1}{\sqrt{368.64}} \times (-19.2) = 1.1
$$
Compute the gradient for each data point:
$$
\frac{\partial L(x_1,y_1;w_1)}{\partial w} = -x_1(y_1 - w_1 x_1) = -1 \times (3 - 1.1 \times 1) = -1.9 \\
\frac{\partial L(x_2,y_2;w_1)}{\partial w} = -x_2(y_2 - w_1 x_2) = -2 \times (7 - 1.1 \times 2) = -9.6 \\
\frac{\partial L(x_3,y_3;w_1)}{\partial w} = -x_3(y_3 - w_1 x_3) = -3 \times (8 - 1.1 \times 3) = -14.1 \\
\frac{\partial L(x_4,y_4;w_1)}{\partial w} = -x_4(y_4 - w_1 x_4) = -4 \times (10 - 1.1 \times 4) = -22.4 \\
\frac{\partial L(x_5,y_5;w_1)}{\partial w} = -x_5(y_5 - w_1 x_5) = -5 \times (14 - 1.1 \times 5) = -42.5
$$
Compute the average gradient:
$$
G_1 = \frac{1}{5} \sum_{i=1}^5 \frac{\partial L(x_i,y_i;w_1)}{\partial w} = -\frac{1.9+9.6+14.1+22.4+42.5}{5} = -18.1
$$
Second iteration:
$$
g_2 = g_1 + G_1^2 = 368.64 + (-18.1)^2 = 696.25 \\
w_2 = w_1 - \frac{\eta}{\sqrt{g_2+\varepsilon}} G_1 = 1.1 - \frac{0.1}{\sqrt{696.25}} \times (-18.1) \approx 1.17
$$
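A sketch of both AdaGrad iterations, reusing the setup at the top (note the larger learning rate $\eta = 0.1$ used in this section):

```python
import math

# Two AdaGrad steps with eta = 0.1 as in the text.
eta_ada, eps, g, w = 0.1, 1e-8, 0.0, w0
for step in (1, 2):
    G = avg_grad(w, data)
    g += G * G                              # accumulate squared gradients
    w -= eta_ada / math.sqrt(g + eps) * G
    print(step, g, w)   # (1, 368.64, 1.1), (2, 696.25, ~1.1686)
```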
AdaGrad's drawback:
The decay factor accumulates the gradients of every update step, so by the late stages of training the decay is very large and the effective learning rate becomes tiny.
RMSProp
To overcome AdaGrad's drawback, RMSProp determines the decay factor from only the gradients of the most recent steps.
Update rule:
$$
\begin{aligned}
g_t &= decay\_rate \cdot g_{t-1} + (1 - decay\_rate) \cdot G_{t-1}^2 \\
    &= decay\_rate \cdot g_{t-1} + (1 - decay\_rate) \cdot \left[ \frac{1}{M} \sum_{i=1}^M \frac{\partial L(x_i,y_i;w_{t-1})}{\partial w} \right]^2
\end{aligned} \\
w_t = w_{t-1} - \frac{\eta}{\sqrt{g_t+\varepsilon}} G_{t-1}
$$
Why do we say RMSProp looks only at the most recent gradients to determine the decay factor? First expand the recursion:
$$
\begin{aligned}
g_1 &= decay\_rate \cdot g_0 + (1 - decay\_rate) \cdot G_0^2 \\
g_2 &= decay\_rate \cdot g_1 + (1 - decay\_rate) \cdot G_1^2 \\
    &= decay\_rate^2 \cdot g_0 + decay\_rate \cdot (1 - decay\_rate) \cdot G_0^2 + (1 - decay\_rate) \cdot G_1^2 \\
g_3 &= decay\_rate \cdot g_2 + (1 - decay\_rate) \cdot G_2^2 \\
    &= decay\_rate^3 \cdot g_0 + decay\_rate^2 \cdot (1 - decay\_rate) \cdot G_0^2 \\
    &\quad + decay\_rate \cdot (1 - decay\_rate) \cdot G_1^2 + (1 - decay\_rate) \cdot G_2^2 \\
    &\vdots \\
g_t &= decay\_rate \cdot g_{t-1} + (1 - decay\_rate) \cdot G_{t-1}^2 \\
    &= decay\_rate^t \cdot g_0 + decay\_rate^{t-1} \cdot (1 - decay\_rate) \cdot G_0^2 \\
    &\quad + decay\_rate^{t-2} \cdot (1 - decay\_rate) \cdot G_1^2 + \cdots \\
    &\quad + decay\_rate \cdot (1 - decay\_rate) \cdot G_{t-2}^2 + (1 - decay\_rate) \cdot G_{t-1}^2
\end{aligned}
$$
After $t$ iterations, the earlier a squared gradient was computed ($G_0^2$, $G_1^2$, etc.), the smaller its coefficient. For example, with $decay\_rate = 0.9$ and $t = 100$, the coefficient of $G_0^2$ is
$$
decay\_rate^{t-1} \cdot (1 - decay\_rate) = 0.9^{99} \times 0.1 \approx 2.95 \times 10^{-6},
$$
which is negligible.
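A one-line numerical check of that coefficient:

```python
# Coefficient of G_0^2 after t = 100 RMSProp steps with decay_rate = 0.9.
decay_rate = 0.9
print(decay_rate ** 99 * (1 - decay_rate))   # ~2.95e-06
```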
Initial values: $g_0 = 0$, $\eta = 0.1$, $decay\_rate = 0.9$
Average gradient (already computed in the Gradient Descent section):
$$
G_0 = \frac{1}{5} \sum_{i=1}^5 \frac{\partial L(x_i,y_i;w_0)}{\partial w} = -\frac{2+10+15+24+45}{5} = -19.2
$$
First iteration:
$$
g_1 = decay\_rate \cdot g_0 + (1 - decay\_rate) \cdot G_0^2 = 0.9 \times 0 + 0.1 \times (-19.2)^2 = 36.864 \\
w_1 = w_0 - \frac{\eta}{\sqrt{g_1+\varepsilon}} G_0 = 1 - \frac{0.1}{\sqrt{36.864}} \times (-19.2) \approx 1.3162
$$
Compute the gradient for each data point:
$$
\frac{\partial L(x_1,y_1;w_1)}{\partial w} = -x_1(y_1 - w_1 x_1) = -1 \times (3 - 1.3162 \times 1) = -1.6838 \\
\frac{\partial L(x_2,y_2;w_1)}{\partial w} = -x_2(y_2 - w_1 x_2) = -2 \times (7 - 1.3162 \times 2) = -8.7352 \\
\frac{\partial L(x_3,y_3;w_1)}{\partial w} = -x_3(y_3 - w_1 x_3) = -3 \times (8 - 1.3162 \times 3) = -12.1542 \\
\frac{\partial L(x_4,y_4;w_1)}{\partial w} = -x_4(y_4 - w_1 x_4) = -4 \times (10 - 1.3162 \times 4) = -18.9408 \\
\frac{\partial L(x_5,y_5;w_1)}{\partial w} = -x_5(y_5 - w_1 x_5) = -5 \times (14 - 1.3162 \times 5) = -37.095
$$
Compute the average gradient:
$$
G_1 = \frac{1}{5} \sum_{i=1}^5 \frac{\partial L(x_i,y_i;w_1)}{\partial w} = -\frac{1.6838+8.7352+12.1542+18.9408+37.095}{5} = -15.7218
$$
Second iteration:
$$
g_2 = decay\_rate \cdot g_1 + (1 - decay\_rate) \cdot G_1^2 = 0.9 \times 36.864 + 0.1 \times (-15.7218)^2 \approx 57.8951 \\
w_2 = w_1 - \frac{\eta}{\sqrt{g_2+\varepsilon}} G_1 = 1.3162 - \frac{0.1}{\sqrt{57.8951}} \times (-15.7218) \approx 1.5228
$$
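A sketch of both RMSProp iterations, reusing the setup at the top:

```python
import math

# Two RMSProp steps with eta = 0.1 and decay_rate = 0.9.
eta_rms, decay, eps, g, w = 0.1, 0.9, 1e-8, 0.0, w0
for step in (1, 2):
    G = avg_grad(w, data)
    g = decay * g + (1 - decay) * G * G   # leaky average of squared gradients
    w -= eta_rms / math.sqrt(g + eps) * G
    print(step, g, w)   # (1, 36.864, ~1.3162), (2, ~57.894, ~1.5228)
```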
Putting It All Together
Adam
Combines Momentum's gradient update strategy with RMSProp's learning rate decay strategy.
Update rule:
$$
m_t = \beta_1 m_{t-1} + (1-\beta_1) G_{t-1} = \beta_1 m_{t-1} + (1-\beta_1) \left( \frac{1}{M} \sum_{i=1}^M \frac{\partial L(x_i,y_i;w_{t-1})}{\partial w} \right) \\
g_t = \beta_2 g_{t-1} + (1-\beta_2) G_{t-1}^2 = \beta_2 g_{t-1} + (1-\beta_2) \left[ \frac{1}{M} \sum_{i=1}^M \frac{\partial L(x_i,y_i;w_{t-1})}{\partial w} \right]^2 \\
\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{g}_t = \frac{g_t}{1-\beta_2^t} \\
w_t = w_{t-1} - \frac{\eta}{\sqrt{\hat{g}_t + \varepsilon}} \cdot \hat{m}_t
$$
$\beta_1$ and $\beta_2$ are typically set to 0.9 and 0.999, respectively.
The reason $m_t$ and $g_t$ are divided by $1-\beta_1^t$ and $1-\beta_2^t$ (where $t$ is an exponent) is bias correction: since $m_0 = g_0 = 0$, both moving averages are biased toward zero early in training, and the correction enlarges the early estimates. The gradient and the learning rate decay can therefore change relatively sharply in the early stage, which increases the randomness of the initial descent path and may discover better paths that the previous optimizers would not find.
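Finally, a sketch of Adam under the same setup. Note that this section does not fix a step size, so the $\eta = 0.1$ below is our assumption, carried over from the AdaGrad/RMSProp examples:

```python
import math

# Two Adam steps. eta = 0.1 is an assumption (the text does not fix it).
beta1, beta2, eta_adam, eps = 0.9, 0.999, 0.1, 1e-8
m, g, w = 0.0, 0.0, w0
for t in (1, 2):
    G = avg_grad(w, data)
    m = beta1 * m + (1 - beta1) * G        # first moment (momentum)
    g = beta2 * g + (1 - beta2) * G * G    # second moment (decayed squared grad)
    m_hat = m / (1 - beta1 ** t)           # bias correction
    g_hat = g / (1 - beta2 ** t)
    w -= eta_adam / math.sqrt(g_hat + eps) * m_hat
    print(t, w)   # step 1 gives w = 1.1, the same as AdaGrad's first step
```

With $m_0 = g_0 = 0$, the first bias-corrected step reduces to $w_1 = w_0 - \eta G_0 / \sqrt{G_0^2} $, which is why it matches AdaGrad's first step here.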