This post is especially detailed:
Deep Learning Optimization Algorithms Explained (Momentum, RMSProp, Adam)
Adam(Adaptive Moment Estimation)
Initialization: V_dW = 0, S_dW = 0, V_db = 0, S_db = 0

On iteration t:
  compute dW, db on the current mini-batch
  V_dW = β1·V_dW + (1 − β1)·dW,   V_db = β1·V_db + (1 − β1)·db
  S_dW = β2·S_dW + (1 − β2)·dW²,  S_db = β2·S_db + (1 − β2)·db²
  V_dW_corrected = V_dW / (1 − β1^t),  V_db_corrected = V_db / (1 − β1^t)
  S_dW_corrected = S_dW / (1 − β2^t),  S_db_corrected = S_db / (1 − β2^t)
  W := W − α·V_dW_corrected / (√(S_dW_corrected) + ε)
  b := b − α·V_db_corrected / (√(S_db_corrected) + ε)
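The per-iteration update can be sketched in NumPy as a single function. This is a minimal sketch of the standard Adam step, not code from the original post; the function and variable names (`adam_step`, `V`, `S`, `t`) are chosen here for illustration:

```python
import numpy as np

def adam_step(W, dW, V, S, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a parameter array W with gradient dW.

    V and S are the running first- and second-moment estimates
    (initialized to zeros of the same shape as W); t is the 1-based
    iteration counter used for bias correction.
    """
    V = beta1 * V + (1 - beta1) * dW        # first moment: EWMA of gradients
    S = beta2 * S + (1 - beta2) * dW ** 2   # second moment: EWMA of squared gradients
    V_hat = V / (1 - beta1 ** t)            # bias correction (matters early,
    S_hat = S / (1 - beta2 ** t)            # when V and S are still near zero)
    W = W - alpha * V_hat / (np.sqrt(S_hat) + eps)
    return W, V, S
```

The same function would be called once per parameter tensor (W and b each keep their own V, S state), which mirrors the paired W/b updates written out above.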
When implementing Adam, what people usually do is just use the default values for β1, β2, and ε (the commonly recommended defaults are β1 = 0.9, β2 = 0.999, and ε = 10⁻⁸). I don't think anyone ever really tunes ε. And then, try a range of values of α to see what works best.
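The advice to sweep only α can be sketched as a small experiment. Everything here is a hypothetical illustration: the toy loss f(w) = w², the helper name `final_loss`, and the candidate α values are not from the original text; β1, β2, and ε are fixed at their usual defaults.

```python
def final_loss(alpha, steps=200):
    """Run Adam on the toy loss f(w) = w**2 starting from w = 5.0
    and return the final loss value (illustrative helper)."""
    beta1, beta2, eps = 0.9, 0.999, 1e-8   # defaults, left untuned
    w, v, s = 5.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = 2 * w                           # gradient of w**2
        v = beta1 * v + (1 - beta1) * g
        s = beta2 * s + (1 - beta2) * g * g
        v_hat = v / (1 - beta1 ** t)
        s_hat = s / (1 - beta2 ** t)
        w -= alpha * v_hat / (s_hat ** 0.5 + eps)
    return w * w

# Try a range of values of alpha and keep whichever reaches the lowest loss.
best_alpha = min([0.001, 0.01, 0.1, 1.0], key=final_loss)
```

In a real setting the sweep would compare validation loss after training, but the shape of the search is the same: hold β1, β2, ε fixed and scan α on a log scale.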
So, where does the term ‘Adam’ come from?
Adam stands for Adaptive Moment Estimation
. So β1 computes the mean of the derivatives; this is called the first moment. And β2 computes an exponentially weighted average of the squared derivatives; that's called the second moment. So that gives rise to the name Adaptive Moment Estimation.