Gradient Descent
Stochastic Gradient Descent (SGD): estimates the gradient from a single sample per update.
Batch Gradient Descent: uses the full training set per update.
g_k = f_k'(x_k)
x_{k+1} = x_k - lr * g_k
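The update rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming a 1-D quadratic f(x) = x^2 (so f'(x) = 2x); the function and hyperparameters are illustrative choices, not part of the notes:

```python
# Plain gradient descent: follow the negative gradient with a fixed step.
def gradient_descent(grad, x0, lr=0.1, steps=100):
    x = x0
    for _ in range(steps):
        g = grad(x)      # g_k = f'(x_k)
        x = x - lr * g   # x_{k+1} = x_k - lr * g_k
    return x

# Minimize f(x) = x^2, whose gradient is 2x; the iterate approaches 0.
x_min = gradient_descent(lambda x: 2 * x, x0=5.0)
```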
Mini-batch Gradient Descent: uses a small random batch per update, the common middle ground.
Optimization Algorithms
Momentum
g_k = f_k'(x_k)
v_k = \alpha v_{k-1} - lr * g_k
x_{k+1} = x_k + v_k = x_k + \alpha v_{k-1} - lr * g_k
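The same toy setup (a 1-D quadratic f(x) = x^2 with assumed hyperparameters) extended with the velocity term, as a sketch:

```python
# Gradient descent with momentum: the velocity v accumulates an
# exponentially decaying sum of past gradients.
def momentum_descent(grad, x0, lr=0.1, alpha=0.9, steps=200):
    x, v = x0, 0.0
    for _ in range(steps):
        g = grad(x)              # g_k = f'(x_k)
        v = alpha * v - lr * g   # v_k = alpha * v_{k-1} - lr * g_k
        x = x + v                # x_{k+1} = x_k + v_k
    return x

x_min = momentum_descent(lambda x: 2 * x, x0=5.0)
```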
Nesterov Momentum
g_k = f_k'(x_k + \alpha v_{k-1})
v_k = \alpha v_{k-1} - lr * g_k
x_{k+1} = x_k + v_k = x_k + \alpha v_{k-1} - lr * g_k
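The only change from plain momentum is where the gradient is evaluated. A sketch under the same toy assumptions (1-D quadratic, illustrative hyperparameters):

```python
# Nesterov momentum: evaluate the gradient at the look-ahead point
# x + alpha * v instead of at the current iterate x.
def nesterov_descent(grad, x0, lr=0.1, alpha=0.9, steps=200):
    x, v = x0, 0.0
    for _ in range(steps):
        g = grad(x + alpha * v)  # g_k = f'(x_k + alpha * v_{k-1})
        v = alpha * v - lr * g   # v_k = alpha * v_{k-1} - lr * g_k
        x = x + v                # x_{k+1} = x_k + v_k
    return x

x_min = nesterov_descent(lambda x: 2 * x, x0=5.0)
```

The look-ahead gradient partially corrects the velocity before the step is taken, which tends to damp the oscillations plain momentum exhibits.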
AdaGrad
g_k = f_k'(x_k)
r_k = r_{k-1} + g_k * g_k
x_{k+1} = x_k - \frac{lr}{\delta + \sqrt{r_k}} * g_k
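As a sketch of the accumulator idea (again on an assumed 1-D quadratic f(x) = x^2; the learning rate is an illustrative choice, larger than typical because AdaGrad shrinks it over time):

```python
# AdaGrad: divide the step by the square root of the running sum of
# squared gradients, so the effective learning rate decays over time.
def adagrad(grad, x0, lr=0.5, delta=1e-7, steps=500):
    x, r = x0, 0.0
    for _ in range(steps):
        g = grad(x)
        r = r + g * g                        # r_k = r_{k-1} + g_k * g_k
        x = x - lr / (delta + r ** 0.5) * g  # shrinking effective step
    return x

x_min = adagrad(lambda x: 2 * x, x0=5.0)
```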
RMSProp
g_k = f_k'(x_k)
r_k = \rho r_{k-1} + (1-\rho) g_k * g_k
x_{k+1} = x_k - \frac{lr}{\delta + \sqrt{r_k}} * g_k
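Only the accumulator line differs from AdaGrad. A sketch under the same assumed toy problem and illustrative hyperparameters:

```python
# RMSProp: like AdaGrad, but the squared-gradient accumulator is an
# exponential moving average, so old gradients are gradually forgotten
# and the effective step size does not decay to zero.
def rmsprop(grad, x0, lr=0.01, rho=0.9, delta=1e-7, steps=1000):
    x, r = x0, 0.0
    for _ in range(steps):
        g = grad(x)
        r = rho * r + (1 - rho) * g * g      # EMA of squared gradients
        x = x - lr / (delta + r ** 0.5) * g
    return x

x_min = rmsprop(lambda x: 2 * x, x0=5.0)
```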
Adam
g_k = f_k'(x_k)
s_k = \rho_1 s_{k-1} + (1-\rho_1) g_k
r_k = \rho_2 r_{k-1} + (1-\rho_2) g_k * g_k
\hat s_k = \frac{s_k}{1-\rho_1^k}
\hat r_k = \frac{r_k}{1-\rho_2^k}
x_{k+1} = x_k - \frac{lr \cdot \hat s_k}{\delta + \sqrt{\hat r_k}}
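All five update lines can be sketched together on the same assumed toy problem (1-D quadratic f(x) = x^2; hyperparameters are illustrative, not prescriptive):

```python
# Adam: combines a momentum-style EMA of gradients (s) with an
# RMSProp-style EMA of squared gradients (r), plus bias correction
# to compensate for both EMAs starting at zero.
def adam(grad, x0, lr=0.01, rho1=0.9, rho2=0.999, delta=1e-8, steps=2000):
    x, s, r = x0, 0.0, 0.0
    for k in range(1, steps + 1):            # k starts at 1 for bias correction
        g = grad(x)
        s = rho1 * s + (1 - rho1) * g        # first moment (mean of gradients)
        r = rho2 * r + (1 - rho2) * g * g    # second moment (mean of g^2)
        s_hat = s / (1 - rho1 ** k)          # bias-corrected first moment
        r_hat = r / (1 - rho2 ** k)          # bias-corrected second moment
        x = x - lr * s_hat / (delta + r_hat ** 0.5)
    return x

x_min = adam(lambda x: 2 * x, x0=5.0)
```

With these assumed values the iterate lands near 0; in practice lr, rho1, and rho2 are tuned per problem, with rho1 = 0.9 and rho2 = 0.999 the usual defaults.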