First, credit to an excellent article:
"A Comprehensive Summary and Comparison of Deep Learning Optimization Methods (SGD, Adagrad, Adadelta, Adam, Adamax, Nadam)"
The optim.py file from the assignment2 FullyConnectedNets assignment implements the following update rules (cs231n_2018_lecture07):
- SGD
- SGD + Momentum
- RMSprop
- Adam
1. SGD
Formula: $w_{t+1} = w_t - \eta \, \nabla_w L(w_t)$
Drawbacks: (1) it oscillates back and forth along steep directions of the loss surface; (2) it can get stuck at local minima or saddle points (saddle points are very common in high-dimensional spaces).
Code:
```python
def sgd(w, dw, config=None):
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)

    w -= config['learning_rate'] * dw
    return w, config
```
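The zigzag behavior shows up already on a simple ill-conditioned quadratic. A minimal sketch, assuming a toy loss $f(w) = \tfrac{1}{2}(a w_0^2 + b w_1^2)$ of my own choosing (the `sgd` function above is repeated so the snippet is self-contained):

```python
import numpy as np

def sgd(w, dw, config=None):
    # Same vanilla SGD update as above.
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)
    w -= config['learning_rate'] * dw
    return w, config

# f(w) = 0.5 * (a * w[0]**2 + b * w[1]**2) with a >> b: the gradient is
# ~90x larger along dimension 0, so a fixed step size that is stable on
# dimension 0 zigzags there while dimension 1 barely moves.
a, b = 90.0, 1.0
w = np.array([1.0, 1.0])
config = {'learning_rate': 2.0 / (a + b)}  # large enough to oscillate on axis 0
signs = []
for _ in range(10):
    dw = np.array([a * w[0], b * w[1]])  # gradient of the toy quadratic
    w, config = sgd(w, dw, config)
    signs.append(np.sign(w[0]))
# w[0] flips sign on every step, while w[1] creeps toward 0 slowly.
```

The same step size that makes the steep axis bounce back and forth is far too small for the flat axis, which is exactly the oscillation drawback described above.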
2. SGD + Momentum
Formula: $v_{t+1} = \rho \, v_t - \eta \, \nabla_w L(w_t), \qquad w_{t+1} = w_t + v_{t+1}$
Pros and cons: the accumulated velocity helps the iterate move through shallow regions and converge faster, but the added inertia can overshoot and skip past some minima.
Code:
```python
import numpy as np

def sgd_momentum(w, dw, config=None):
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('momentum', 0.9)
    v = config.get('velocity', np.zeros_like(w))

    # Accumulate a decaying running mean of gradients, then step along it.
    v = config['momentum'] * v - config['learning_rate'] * dw
    next_w = w + v

    config['velocity'] = v
    return next_w, config
```
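The speed-up from the velocity term is easiest to see on a constant slope, where the velocity builds up to roughly $\eta / (1 - \rho)$ per step. A minimal sketch, assuming a toy loss $f(w) = w$ of my own choosing (both update functions are repeated so the snippet is self-contained):

```python
import numpy as np

def sgd(w, dw, config=None):
    # Vanilla SGD, as above.
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)
    w -= config['learning_rate'] * dw
    return w, config

def sgd_momentum(w, dw, config=None):
    # SGD with momentum, as above.
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('momentum', 0.9)
    v = config.get('velocity', np.zeros_like(w))
    v = config['momentum'] * v - config['learning_rate'] * dw
    next_w = w + v
    config['velocity'] = v
    return next_w, config

# Toy loss f(w) = w, so dw = 1 everywhere (a constant downhill slope).
grad = np.array([1.0])

w_plain, cfg = np.array([0.0]), {'learning_rate': 1e-2}
for _ in range(100):
    w_plain, cfg = sgd(w_plain, grad, cfg)
# Plain SGD covers exactly 100 * 0.01 = 1.0.

w_mom, cfg_m = np.array([0.0]), {'learning_rate': 1e-2}
for _ in range(100):
    w_mom, cfg_m = sgd_momentum(w_mom, grad, cfg_m)
# The velocity ramps up toward -lr / (1 - momentum) = -0.1 per step,
# so momentum covers roughly 9x the distance in the same 100 steps.
```

With momentum 0.9 the terminal velocity is ten times the plain SGD step, which is the "helps converge faster" part; the same stored velocity is what can carry the iterate past a narrow minimum.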
3. RMSprop
Formula: $c_{t+1} = \gamma \, c_t + (1-\gamma)\,(\nabla_w L)^2, \qquad w_{t+1} = w_t - \eta \, \dfrac{\nabla_w L}{\sqrt{c_{t+1}} + \epsilon}$
Pros: the step size adapts per parameter. Where dw has been small, cache stays small, so the effective learning rate $\eta / (\sqrt{cache} + \epsilon)$ is large, which speeds up progress along flat directions.
Cons: progress can be slow at the very start, because the early gradients (and hence the early updates) are small.
Code:
```python
import numpy as np

def rmsprop(w, dw, config=None):
    """
    Uses the RMSProp update rule, which uses a moving average of squared
    gradient values to set adaptive per-parameter learning rates.

    config format:
    - learning_rate: Scalar learning rate.
    - decay_rate: Scalar between 0 and 1 giving the decay rate for the squared
      gradient cache.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - cache: Moving average of second moments of gradients.
    """
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('decay_rate', 0.99)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('cache', np.zeros_like(w))

    # Decaying average of squared gradients; divide each step by its RMS.
    cache = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * dw ** 2
    next_w = w - config['learning_rate'] * dw / (np.sqrt(cache) + config['epsilon'])
    config['cache'] = cache
    return next_w, config
```
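The per-parameter adaptation can be seen with a single update on a made-up gradient whose components differ by 100x (the `rmsprop` function above is repeated so the snippet is self-contained):

```python
import numpy as np

def rmsprop(w, dw, config=None):
    # Same RMSProp update as above.
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('decay_rate', 0.99)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('cache', np.zeros_like(w))
    cache = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * dw ** 2
    next_w = w - config['learning_rate'] * dw / (np.sqrt(cache) + config['epsilon'])
    config['cache'] = cache
    return next_w, config

w = np.zeros(2)
dw = np.array([1.0, 0.01])  # gradient components differing by 100x
next_w, config = rmsprop(w, dw)
step = np.abs(next_w - w)
# Both parameters move by almost the same amount (~0.1), because each step
# is normalized by the running RMS of that parameter's own gradient history.
# (The first step is larger than learning_rate because cache starts at zero
# and has only been warmed up by a factor of 1 - decay_rate.)
```

Dividing by $\sqrt{cache}$ roughly cancels the gradient magnitude per parameter, so steep and flat directions advance at comparable speeds.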
4. Adam (most commonly used)
Formula:
$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,\nabla_w L, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,(\nabla_w L)^2$
$\hat{m}_t = \dfrac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \dfrac{v_t}{1-\beta_2^t}, \qquad w_{t+1} = w_t - \eta \, \dfrac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$
Pros and cons: it combines momentum (a moving average of the gradient) with RMSprop-style per-parameter scaling. As the number of iterations grows, the moving averages stabilize and the effective step size shrinks, helping the iterate settle near an optimum.
Code:
"""
Uses the Adam update rule, which incorporates moving averages of both the
gradient and its square and a bias correction term.
config format:
- learning_rate: Scalar learning rate.
- beta1: Decay rate for moving average of first moment of gradient.
- beta2: Decay rate for moving average of second moment of gradient.
- epsilon: Small scalar used for smoothing to avoid dividing by zero.
- m: Moving average of gradient.
- v: Moving average of squared gradient.
- t: Iteration number.
"""
if config is None: config = {}
config.setdefault('learning_rate', 1e-3)
config.setdefault('beta1', 0.9)
config.setdefault('beta2', 0.999)
config.setdefault('epsilon', 1e-8)
config.setdefault('m', np.zeros_like(w))
config.setdefault('v', np.zeros_like(w))
config.setdefault('t', 0)
next_w = None
config['t'] += 1
m = config['beta1'] * config['m'] + (1 - config['beta1']) * dw
mt = m / (1 - config['beta1'] ** config['t'])
v = config['beta2'] * config['v'] + (1 - config['beta2']) * dw ** 2
vt = v / (1 - config['beta2'] ** config['t'])
next_w = w - config['learning_rate'] * mt / (np.sqrt(vt) + config['epsilon'])
config['m'] = m
config['v'] = v
return next_w, config
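The effect of the bias correction is easy to check: at $t = 1$ it rescales m and v exactly back to dw and dw², so the first update has magnitude close to the learning rate regardless of how large or small the gradient is. A small sketch with made-up gradient scales (the `adam` function above is repeated so the snippet is self-contained):

```python
import numpy as np

def adam(w, dw, config=None):
    # Same Adam update as above.
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-3)
    config.setdefault('beta1', 0.9)
    config.setdefault('beta2', 0.999)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('m', np.zeros_like(w))
    config.setdefault('v', np.zeros_like(w))
    config.setdefault('t', 0)
    config['t'] += 1
    m = config['beta1'] * config['m'] + (1 - config['beta1']) * dw
    mt = m / (1 - config['beta1'] ** config['t'])
    v = config['beta2'] * config['v'] + (1 - config['beta2']) * dw ** 2
    vt = v / (1 - config['beta2'] ** config['t'])
    next_w = w - config['learning_rate'] * mt / (np.sqrt(vt) + config['epsilon'])
    config['m'] = m
    config['v'] = v
    return next_w, config

steps = []
for scale in [1e-4, 1.0, 1e4]:
    w = np.zeros(1)
    next_w, _ = adam(w, np.array([scale]))  # fresh config each time, so t = 1
    steps.append(abs(next_w[0]))
# All three first steps are ~learning_rate (1e-3): at t = 1, mt = dw and
# vt = dw**2, so the update is lr * dw / (|dw| + eps) ≈ lr * sign(dw).
```

Without the correction, m and v start at zero and would make the first updates far too small; with it, Adam takes well-scaled steps from the very first iteration.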