An overview of gradient descent optimization algorithms
Original article: http://sebastianruder.com/optimizing-gradient-descent/
This note also draws on what others have written, for example this blog post: http://blog.csdn.net/luo123n/article/details/48239963
The goal is to summarize the topic systematically and cleanly.
This is my first time using Marxico (马克飞象) together with Evernote, and it works quite well.
Aim: to provide the reader with intuitions about the behaviour of the different algorithms for optimizing gradient descent, which will help her put them to use.
- Hypothesis: $h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n = \theta^T x$
- Model's parameters: $\theta: \theta_0, \theta_1, \dots, \theta_n$
- Objective (cost) function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
- Learning rate: $\eta$
- Gradient of the objective function: $\nabla_\theta J(\theta)$
Gradient descent variants
1.Batch gradient descent
Vanilla gradient descent computes the gradient of the cost function w.r.t. the parameters for the entire training dataset.
- Batch gradient descent: $\theta := \theta - \eta \, \nabla_\theta J(\theta)$
If the training set size m is large, two problems can arise: 1) convergence can be very slow; 2) if the error surface has multiple local minima, there is no guarantee that the procedure will find the global minimum.
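As a concrete illustration, here is a minimal NumPy sketch of the batch update above, assuming the linear-regression cost $J(\theta)$ from the notation section; the function name, toy data, and the values of `eta`/`epochs` are illustrative choices, not part of the original post.

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.1, epochs=500):
    """Vanilla (batch) gradient descent: theta := theta - eta * grad J(theta),
    where the gradient is computed over the entire training set."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        grad = X.T @ (X @ theta - y) / m  # gradient of J(theta) = 1/(2m) * sum (h(x_i) - y_i)^2
        theta -= eta * grad
    return theta

# Toy usage: y = 2*x with a bias column of ones; theta should approach [0, 2].
X = np.c_[np.ones(5), np.arange(5.0)]
y = 2 * np.arange(5.0)
print(batch_gradient_descent(X, y))
```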
2.Stochastic gradient descent (SGD)
Stochastic gradient descent (SGD), in contrast, performs a parameter update for each training example $x^{(i)}$ and label $y^{(i)}$.
- Stochastic gradient descent: $\theta := \theta - \eta \, \nabla_\theta J(\theta; x^{(i)}; y^{(i)})$
One problem that comes with SGD is that its updates are noisier than those of batch gradient descent, so individual iterations do not always move toward the overall optimum.
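Under the same assumptions (linear-regression loss, illustrative names and hyperparameters), a sketch of the per-example SGD loop might look like this; the per-epoch shuffling is an addition that the formula above leaves implicit:

```python
import numpy as np

def sgd(X, y, eta=0.01, epochs=50):
    """Stochastic gradient descent: one parameter update per training example (x_i, y_i)."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(m):             # visit the examples in random order
            grad = (X[i] @ theta - y[i]) * X[i]  # gradient of the loss on a single example
            theta -= eta * grad
    return theta
```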
3.Mini-batch gradient descent
Mini-batch gradient descent performs an update for every mini-batch of n training examples.
With a full training set of m examples, each update uses one mini-batch of n examples rather than the whole training set, so a full pass over the data is split into m/n mini-batches.
- Mini-batch gradient descent: $\theta := \theta - \eta \, \nabla_\theta J(\theta; x^{(i:i+n)}; y^{(i:i+n)})$
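A corresponding mini-batch sketch under the same assumptions (the mini-batch size, here called `batch_size`, and the other names are placeholders):

```python
import numpy as np

def minibatch_gradient_descent(X, y, eta=0.01, epochs=50, batch_size=2):
    """Mini-batch gradient descent: each update uses a slice of `batch_size` examples."""
    m, n_features = X.shape
    theta = np.zeros(n_features)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        idx = rng.permutation(m)                 # shuffle once per epoch
        for start in range(0, m, batch_size):
            b = idx[start:start + batch_size]    # indices of the current mini-batch
            grad = X[b].T @ (X[b] @ theta - y[b]) / len(b)
            theta -= eta * grad
    return theta
```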
Gradient descent optimization algorithms
1.Momentum
Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations, as can be seen in Figure 2b of the original article. It does this by adding a fraction $\gamma$ of the update vector of the past time step to the current update vector.
Momentum term: $\gamma \approx 0.9$
$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta)$
$\theta = \theta - v_t$
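A minimal sketch of the two momentum equations above, written against a generic `grad_fn(theta)` that is assumed to return $\nabla_\theta J(\theta)$; the function name and default hyperparameters are illustrative:

```python
import numpy as np

def momentum_sgd(grad_fn, theta0, eta=0.01, gamma=0.9, steps=500):
    """Gradient descent with momentum:
        v_t   = gamma * v_{t-1} + eta * grad J(theta)
        theta = theta - v_t"""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = gamma * v + eta * grad_fn(theta)  # accumulate a fraction of the past update
        theta = theta - v
    return theta

# Toy usage: minimize 0.5 * ||theta||^2, whose gradient is theta itself.
print(momentum_sgd(lambda th: th, theta0=[5.0, -3.0]))
```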
2.Nesterov accelerated gradient
This is an improvement on the classical momentum method, proposed by Ilya Sutskever (2012, unpublished) and inspired by Nesterov's work. The basic idea is illustrated in a figure in the original post:
First, take a step along the previous update direction (brown line), then compute the gradient at that position (red line), and use that gradient to correct the final update direction (green line). The figure shows two such updates; the blue lines are the standard momentum update path.
Nesterov accelerated gradient:
$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta - \gamma v_{t-1})$
$\theta = \theta - v_t$
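The same sketch with the Nesterov look-ahead; the only change relative to the momentum code above is where the gradient is evaluated (again `grad_fn` is an assumed placeholder for $\nabla_\theta J$):

```python
import numpy as np

def nesterov_sgd(grad_fn, theta0, eta=0.01, gamma=0.9, steps=500):
    """Nesterov accelerated gradient:
        v_t   = gamma * v_{t-1} + eta * grad J(theta - gamma * v_{t-1})
        theta = theta - v_t"""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        lookahead = theta - gamma * v              # step along the previous direction first
        v = gamma * v + eta * grad_fn(lookahead)   # then correct with the gradient taken there
        theta = theta - v
    return theta
```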
3.Adagrad
Adagrad adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters.
For brevity, we set $g_{t,i}$ to be the gradient of the objective function w.r.t. the parameter $\theta_i$ at time step $t$:
$g_{t,i} = \nabla_\theta J(\theta_i)$
The SGD update for every parameter $\theta_i$ at each time step $t$ then becomes:
$\theta_{t+1,i} = \theta_{t,i} - \eta \cdot g_{t,i}$
In its update rule, Adagrad modifies the general learning rate $\eta$ at each time step $t$ for every parameter $\theta_i$ based on the past gradients that have been computed for $\theta_i$:
$\theta_{t+1,i} = \theta_{t,i} - \dfrac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \cdot g_{t,i}$
$G_t$ here is a diagonal matrix where each diagonal element $i,i$ is the sum of the squares of the gradients w.r.t. $\theta_i$ up to time step $t$, while $\epsilon$ is a smoothing term that avoids division by zero (usually on the order of 1e-8). Interestingly, without the square root operation, the algorithm performs much worse.
Here $G_t$ accumulates the squared gradients, and both the summation and the square root are element-wise operations. $\eta$ is the initial (global) learning rate; because the effective learning rate is adapted automatically afterwards, its initial value matters less than in the previous algorithms. $\epsilon$ is a small constant that keeps the denominator non-zero. The intuition is that, for each parameter, the larger the accumulated magnitude of its past gradients, the smaller its learning rate becomes.
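A per-parameter sketch of the Adagrad rule above, keeping the accumulated squared gradients in a vector `G` (element-wise operations; `grad_fn` and the default hyperparameters are again illustrative):

```python
import numpy as np

def adagrad(grad_fn, theta0, eta=0.01, eps=1e-8, steps=500):
    """Adagrad: scale each parameter's learning rate by the square root of its
    accumulated squared gradients:
        G     += g_t ** 2                      (element-wise)
        theta -= eta / sqrt(G + eps) * g_t     (element-wise)"""
    theta = np.asarray(theta0, dtype=float)
    G = np.zeros_like(theta)                   # running sum of squared gradients per parameter
    for _ in range(steps):
        g = grad_fn(theta)
        G += g ** 2
        theta -= eta / np.sqrt(G + eps) * g
    return theta
```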
4.Adadelta
The Adagrad algorithm has three problems:
- Its learning rate is monotonically decreasing, so it becomes very small late in training.
- It still requires a manually chosen global initial learning rate.
- When updating the parameters $x_t$ (i.e. $\theta$ in the notation above), the units of the two sides of the update rule do not match.
5.RMSprop
6.Adam
Parallelizing and distributing SGD
1.Hogwild!
2.Downpour SGD
3.Delay-tolerant Algorithms for SGD
4.TensorFlow
5.Elastic Averaging SGD
To be continued...