Adagrad: Root Mean Square
$$
\theta_i^1 = \theta_i^0 - \tfrac{\eta}{\sigma_i^0} g_i^0, \qquad
\sigma_i^0 = \sqrt{(g_i^0)^2} = |g_i^0|
$$
$$
\theta_i^2 = \theta_i^1 - \tfrac{\eta}{\sigma_i^1} g_i^1, \qquad
\sigma_i^1 = \sqrt{\tfrac{1}{2}\left[(g_i^0)^2 + (g_i^1)^2\right]}
$$
That is, at step $t+1$:

$$
\theta_i^{t+1} = \theta_i^t - \tfrac{\eta}{\sigma_i^t} g_i^t, \qquad
\sigma_i^t = \sqrt{\tfrac{1}{t+1}\sum_{n=0}^{t}(g_i^n)^2}
$$
When the gradients are large, the denominator $\sigma_i^t$ grows and the overall step shrinks; conversely, when the gradients are small, the step grows. The effective learning rate thus adapts per parameter.
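A minimal NumPy sketch of this update rule (the names `theta`, `sum_sq`, `eps`, and the toy objective are illustrative assumptions, not from the original):

```python
import numpy as np

def adagrad_step(theta, grad, sum_sq, t, eta=0.1, eps=1e-8):
    """One update following the formula above:
    sigma_t = sqrt( (1/(t+1)) * sum_{n=0..t} (g^n)^2 )."""
    sum_sq = sum_sq + grad ** 2            # running sum of squared gradients
    sigma = np.sqrt(sum_sq / (t + 1))      # root mean square over steps 0..t
    theta = theta - eta / (sigma + eps) * grad
    return theta, sum_sq

# Toy usage: minimize f(theta) = theta^2, whose gradient is 2*theta.
theta, sum_sq = np.array([1.0]), np.zeros(1)
for t in range(100):
    grad = 2 * theta
    theta, sum_sq = adagrad_step(theta, grad, sum_sq, t)
```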
RMSProp
$$
\theta_i^1 = \theta_i^0 - \tfrac{\eta}{\sigma_i^0} g_i^0, \qquad
\sigma_i^0 = \sqrt{(g_i^0)^2} = |g_i^0|
$$
$$
\theta_i^2 = \theta_i^1 - \tfrac{\eta}{\sigma_i^1} g_i^1, \qquad
\sigma_i^1 = \sqrt{\alpha(\sigma_i^0)^2 + (1-\alpha)(g_i^1)^2}
$$
$$
\theta_i^3 = \theta_i^2 - \tfrac{\eta}{\sigma_i^2} g_i^2, \qquad
\sigma_i^2 = \sqrt{\alpha(\sigma_i^1)^2 + (1-\alpha)(g_i^2)^2}
$$
That is, at step $t+1$:

$$
\theta_i^{t+1} = \theta_i^t - \tfrac{\eta}{\sigma_i^t} g_i^t, \qquad
\sigma_i^t = \sqrt{\alpha(\sigma_i^{t-1})^2 + (1-\alpha)(g_i^t)^2}
$$

Unlike Adagrad, which averages over all past gradients equally, the decay factor $\alpha$ makes old squared gradients fade exponentially, so $\sigma_i^t$ tracks the recent gradient magnitude.
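A minimal sketch of this RMSProp update (the name `ms` for the running mean square and the `eps` guard are assumptions for illustration):

```python
import numpy as np

def rmsprop_step(theta, grad, ms, eta=0.01, alpha=0.9, eps=1e-8):
    """One RMSProp update: (sigma^t)^2 is an exponential moving average
    of squared gradients, decaying old ones at rate alpha."""
    ms = alpha * ms + (1 - alpha) * grad ** 2     # (sigma_i^t)^2
    theta = theta - eta / (np.sqrt(ms) + eps) * grad
    return theta, ms
```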
Adam
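Adam combines a momentum-style moving average of the gradient (first moment) with an RMSProp-style moving average of the squared gradient (second moment), bias-correcting both. A minimal sketch, assuming the conventional defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$ (all variable names are illustrative, not from this document):

```python
import numpy as np

def adam_step(theta, grad, m, v, t,
              eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum term m plus RMSProp-style term v,
    both bias-corrected (t is the 0-based step index)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (RMSProp term)
    m_hat = m / (1 - beta1 ** (t + 1))          # bias correction
    v_hat = v / (1 - beta2 ** (t + 1))
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```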