An overview of gradient descent optimization algorithms
Original article: http://sebastianruder.com/optimizing-gradient-descent/
This note also draws on what others have written, for example this blog post: http://blog.csdn.net/luo123n/article/details/48239963
The goal is to summarize the topic systematically and cleanly.
This is my first time using Marxico (马克飞象) together with Evernote, and it works quite well.
Aim: to provide the reader with intuitions about the behaviour of the different algorithms for optimizing gradient descent, which will help her put them to use.
- Hypothesis: $h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n = \theta^T x$
- Model's parameters: $\theta: \theta_0, \theta_1, \dots, \theta_n$
- Objective (cost) function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
- Learning rate: $\eta$
- Gradient of the objective function: $\nabla_\theta J(\theta)$
Gradient descent variants
1.Batch gradient descent
Vanilla gradient descent computes the gradient of the cost function w.r.t. the parameters for the entire training dataset.
- Batch gradient descent: $\theta := \theta - \eta \, \nabla_\theta J(\theta)$
If the training set size m is large, two problems can arise: 1) convergence can be very slow; 2) if the error surface has multiple local minima, there is no guarantee that the procedure will find the global minimum.
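As a concrete illustration, here is a minimal NumPy sketch of the batch update above, assuming the linear-regression cost $J(\theta)$ from the notation section; the function name, toy data, and the values of `eta`/`epochs` are illustrative choices, not part of the original post.

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.1, epochs=500):
    """Vanilla (batch) gradient descent: theta := theta - eta * grad J(theta),
    where the gradient is computed over the entire training set."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        grad = X.T @ (X @ theta - y) / m  # gradient of J(theta) = 1/(2m) * sum (h(x_i) - y_i)^2
        theta -= eta * grad
    return theta

# Toy usage: y = 2*x with a bias column of ones; theta should approach [0, 2].
X = np.c_[np.ones(5), np.arange(5.0)]
y = 2 * np.arange(5.0)
print(batch_gradient_descent(X, y))
```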
2.Stochastic gradient descent (SGD)
Stochastic gradient descent (SGD), in contrast, performs a parameter update for each training example $x^{(i)}$ and label $y^{(i)}$.
- Stochastic gradient descent: $\theta := \theta - \eta \, \nabla_\theta J(\theta; x^{(i)}; y^{(i)})$
One problem that comes with SGD is that its updates are noisier than those of batch gradient descent, so individual iterations do not always move toward the overall optimum.
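Under the same assumptions (linear-regression loss, illustrative names and hyperparameters), a sketch of the per-example SGD loop might look like this; the per-epoch shuffling is an addition that the formula above leaves implicit:

```python
import numpy as np

def sgd(X, y, eta=0.01, epochs=50):
    """Stochastic gradient descent: one parameter update per training example (x_i, y_i)."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(m):             # visit the examples in random order
            grad = (X[i] @ theta - y[i]) * X[i]  # gradient of the loss on a single example
            theta -= eta * grad
    return theta
```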
3.Mini-batch gradient descent
Mini-batch gradient descent performs an update for every mini-batch of n training examples.
With a full training set of m examples, each update uses one mini-batch of n examples rather than the whole training set, so a full pass over the data is split into m/n mini-batches.
- Mini-batch gradient descent: $\theta := \theta - \eta \, \nabla_\theta J(\theta; x^{(i:i+n)}; y^{(i:i+n)})$
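A corresponding mini-batch sketch under the same assumptions (the mini-batch size, here called `batch_size`, and the other names are placeholders):

```python
import numpy as np

def minibatch_gradient_descent(X, y, eta=0.01, epochs=50, batch_size=2):
    """Mini-batch gradient descent: each update uses a slice of `batch_size` examples."""
    m, n_features = X.shape
    theta = np.zeros(n_features)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        idx = rng.permutation(m)                 # shuffle once per epoch
        for start in range(0, m, batch_size):
            b = idx[start:start + batch_size]    # indices of the current mini-batch
            grad = X[b].T @ (X[b] @ theta - y[b]) / len(b)
            theta -= eta * grad
    return theta
```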
Gradient descent optimization algorithms
1.Momentum
Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations, as can be seen in Figure 2b of the original article. It does this by adding a fraction $\gamma$ of the update vector of the past time step to the current update vector.
Momentum term: $\gamma \approx 0.9$
$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta)$
$\theta = \theta - v_t$
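A minimal sketch of the two momentum equations above, written against a generic `grad_fn(theta)` that is assumed to return $\nabla_\theta J(\theta)$; the function name and default hyperparameters are illustrative:

```python
import numpy as np

def momentum_sgd(grad_fn, theta0, eta=0.01, gamma=0.9, steps=500):
    """Gradient descent with momentum:
        v_t   = gamma * v_{t-1} + eta * grad J(theta)
        theta = theta - v_t"""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = gamma * v + eta * grad_fn(theta)  # accumulate a fraction of the past update
        theta = theta - v
    return theta

# Toy usage: minimize 0.5 * ||theta||^2, whose gradient is theta itself.
print(momentum_sgd(lambda th: th, theta0=[5.0, -3.0]))
```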
2.Nesterov accelerated gradient
This is an improvement on the classical momentum method, proposed by Ilya Sutskever (2012, unpublished) and inspired by Nesterov's work. The basic idea is illustrated in a figure in the original post:
First, take a step along the previous update direction (brown line), then compute the gradient at that position (red line), and use that gradient to correct the final update direction (green line). The figure shows two such updates; the blue lines are the standard momentum update path.
Nesterov accelerated gradient:
$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta - \gamma v_{t-1})$
$\theta = \theta - v_t$
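The same sketch with the Nesterov look-ahead; the only change relative to the momentum code above is where the gradient is evaluated (again `grad_fn` is an assumed placeholder for $\nabla_\theta J$):

```python
import numpy as np

def nesterov_sgd(grad_fn, theta0, eta=0.01, gamma=0.9, steps=500):
    """Nesterov accelerated gradient:
        v_t   = gamma * v_{t-1} + eta * grad J(theta - gamma * v_{t-1})
        theta = theta - v_t"""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        lookahead = theta - gamma * v              # step along the previous direction first
        v = gamma * v + eta * grad_fn(lookahead)   # then correct with the gradient taken there
        theta = theta - v
    return theta
```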
3.Adagrad
Adagrad adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters.
For brevity, we set $g_{t,i}$ to be the gradient of the objective function w.r.t. the parameter $\theta_i$ at time step $t$:
$g_{t,i} = \nabla_\theta J(\theta_i)$
The SGD update for every parameter $\theta_i$ at each time step $t$ then becomes:
$\theta_{t+1,i} = \theta_{t,i} - \eta \cdot g_{t,i}$
In its update rule, Adagrad modifies the general learning rate $\eta$ at each time step $t$ for every parameter $\theta_i$ based on the past gradients that have been computed for $\theta_i$:
$\theta_{t+1,i} = \theta_{t,i} - \dfrac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \cdot g_{t,i}$
$G_t$ here is a diagonal matrix where each diagonal element $i,i$ is the sum of the squares of the gradients w.r.t. $\theta_i$ up to time step $t$, while $\epsilon$ is a smoothing term that avoids division by zero (usually on the order of 1e-8). Interestingly, without the square root operation, the algorithm performs much worse.
Here $G_t$ accumulates the squared gradients, and both the summation and the square root are element-wise operations. $\eta$ is the initial (global) learning rate; because the effective learning rate is adapted automatically afterwards, its initial value matters less than in the previous algorithms. $\epsilon$ is a small constant that keeps the denominator non-zero. The intuition is that, for each parameter, the larger the accumulated magnitude of its past gradients, the smaller its learning rate becomes.
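A per-parameter sketch of the Adagrad rule above, keeping the accumulated squared gradients in a vector `G` (element-wise operations; `grad_fn` and the default hyperparameters are again illustrative):

```python
import numpy as np

def adagrad(grad_fn, theta0, eta=0.01, eps=1e-8, steps=500):
    """Adagrad: scale each parameter's learning rate by the square root of its
    accumulated squared gradients:
        G     += g_t ** 2                      (element-wise)
        theta -= eta / sqrt(G + eps) * g_t     (element-wise)"""
    theta = np.asarray(theta0, dtype=float)
    G = np.zeros_like(theta)                   # running sum of squared gradients per parameter
    for _ in range(steps):
        g = grad_fn(theta)
        G += g ** 2
        theta -= eta / np.sqrt(G + eps) * g
    return theta
```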
4.Adadelta
The Adagrad algorithm has three problems:
- Its learning rate is monotonically decreasing, so it becomes very small late in training.
- It still requires a manually chosen global initial learning rate.
- When updating the parameters $x_t$ (i.e. $\theta$ in the notation above), the units of the two sides of the update rule do not match.
5.RMSprop
6.Adam
Parallelizing and distributing SGD
1.Hogwild!
2.Downpour SGD
3.Delay-tolerant Algorithms for SGD
4.TensorFlow
5.Elastic Averaging SGD
To be continued...