Optimization Algorithms in Deep Learning

Latest recommended article published on 2023-10-27 09:41:08

WYXHAHAHA123

Article link: https://blog.csdn.net/WYXHAHAHA123/article/details/96013301

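Adam is one of the most widely used adaptive learning-rate algorithms. For any optimizer, the learning rate is one of the most important hyperparameters.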

https://zhuanlan.zhihu.com/p/32626442

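Summary: whether an optimizer belongs to the SGD family or the Adam family, the foundation is the same. Backpropagation computes the gradient of the loss with respect to the weights of every layer, and the weights are then updated using the gradient computed on the current mini-batch of training samples. Only the update strategy differs. Below, g denotes the gradient vector and θ the weights before the update. g, θ, and the accumulated quantities are vectors; the learning rate lr and the decay coefficients are scalars.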

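Stochastic gradient descent (SGD). (I was asked in an interview today where the "stochastic" in SGD comes from. My answer was that SGD, as usually stated, uses batch size = 1. I now think there is a second source of randomness: the mini-batch used at each step is sampled at random from the whole training set.)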

θ=θ-lr*g
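The random mini-batch sampling described above can be sketched as follows. This is a minimal illustration, not the author's code: the helper names `minibatch_indices` and `sgd_step` are hypothetical, and the dataset is represented only by sample indices.

```python
import numpy as np


def minibatch_indices(num_samples, batch_size, rng):
    """Yield index arrays that cover one shuffled epoch (hypothetical helper)."""
    # The random permutation is where the "stochastic" part comes from:
    # every epoch visits the samples in a different random order.
    order = rng.permutation(num_samples)
    for start in range(0, num_samples, batch_size):
        yield order[start:start + batch_size]


def sgd_step(theta, g, lr):
    """Plain SGD update: theta = theta - lr * g (returns a new array)."""
    return theta - lr * g


rng = np.random.default_rng(0)
batches = list(minibatch_indices(num_samples=10, batch_size=4, rng=rng))

# One update with an assumed gradient, just to show the rule itself.
theta_next = sgd_step(np.array([1.0, 2.0]), np.array([0.5, -0.5]), lr=0.1)
```

Each sample index appears exactly once per epoch, and the last batch may be smaller than `batch_size`.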

SGD with momentum:

v=momentum*v-lr*g, θ=θ+v (first-order momentum; v is the velocity, initialized to 0)
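A small numeric sketch of the momentum rule may help. It assumes the velocity form v = momentum*v - lr*g followed by θ = θ + v; the function name `momentum_step` and the gradient values are made up for illustration.

```python
import numpy as np


def momentum_step(theta, v, g, lr=1e-2, momentum=0.9):
    """One momentum-SGD update (hypothetical helper).

    The velocity v accumulates past gradient steps with exponential decay,
    and the weights move along the velocity.
    """
    v = momentum * v - lr * g
    return theta + v, v


theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)
g = np.array([0.5, 0.5])  # assumed constant gradient, for illustration only

theta1, v1 = momentum_step(theta, v, g, lr=0.1, momentum=0.9)
theta2, v2 = momentum_step(theta1, v1, g, lr=0.1, momentum=0.9)

# With momentum = 0 the velocity is just -lr * g, i.e. plain SGD.
theta_sgd, _ = momentum_step(theta, v, g, lr=0.1, momentum=0.0)
```

With a constant gradient the second step is larger than the first (0.095 versus 0.05 per component), because the velocity keeps part of the previous step.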

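Adaptive learning-rate algorithms ("adaptive" means not only that the learning rate is adjusted automatically during training, but also that each component of the model parameters gets its own learning rate)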

Adagrad

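Accumulated sum of squared gradients r, initialized to zero: r=0

r=r+g⊙g (⊙ is the element-wise product). Once r is updated, it is used to scale the learning rate dynamically.

θ=θ-(lr/(sqrt(r)+delta))*g (all operations element-wise; delta is a small constant that avoids division by zero)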
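The post's code has no Adagrad implementation, so here is a minimal sketch in the same `(w, dw, config)` style as its other update rules. The function name `adagrad`, the config keys, and the default values are assumptions, and delta (here `epsilon`) is added outside the square root.

```python
import numpy as np


def adagrad(w, dw, config=None):
    """Adagrad update (a sketch; config keys are assumed, not from a library).

    config format:
    - learning_rate: Scalar learning rate.
    - epsilon: Small scalar to avoid dividing by zero.
    - cache: Running sum of squared gradients, same shape as w.
    """
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('cache', np.zeros_like(w))

    # r = r + g * g (element-wise); the sum never decreases.
    config['cache'] = config['cache'] + dw * dw
    # Each parameter's step is scaled by its own accumulated gradient size.
    next_w = w - config['learning_rate'] * dw / (np.sqrt(config['cache']) + config['epsilon'])
    return next_w, config


w0 = np.array([1.0, 1.0])
dw = np.array([0.1, 10.0])  # gradients that differ by a factor of 100
w1, cfg = adagrad(w0, dw)
w2, cfg = adagrad(w1, dw, cfg)
```

On the first step both parameters move by about `learning_rate`, even though their gradients differ by a factor of 100. That is the per-parameter scaling at work.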

RMSprop

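Adagrad's r accumulates squared gradients from the start of training up to the current step, so it only grows. RMSprop instead updates r from the previous r and the current gradient only, as an exponentially decaying average (second-order momentum).

Accumulated squared-gradient average r, initialized to zero: r=0

r=momentum*r+(1-momentum)*g⊙g (momentum here is the decay rate of the average). Once r is updated, it is used to scale the learning rate dynamically.

θ=θ-(lr/(sqrt(r)+delta))*g (all operations element-wise)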
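To see the difference between the two accumulators, the sketch below tracks the effective learning rate lr/(sqrt(r)+delta) for a single parameter under an assumed constant gradient g = 1. The function name and the decay value 0.9 are illustrative choices.

```python
import numpy as np


def effective_lr_history(num_steps, lr=1e-2, decay=0.9, delta=1e-8):
    """Effective learning rate of Adagrad and RMSprop under a constant gradient."""
    g = 1.0  # assumed constant gradient, for illustration only
    r_ada, r_rms = 0.0, 0.0
    ada, rms = [], []
    for _ in range(num_steps):
        r_ada = r_ada + g * g                        # running sum: grows without bound
        r_rms = decay * r_rms + (1 - decay) * g * g  # decaying average: levels off
        ada.append(lr / (np.sqrt(r_ada) + delta))
        rms.append(lr / (np.sqrt(r_rms) + delta))
    return np.array(ada), np.array(rms)


ada, rms = effective_lr_history(100)
```

After 100 steps Adagrad's effective rate has shrunk to about lr/10 and keeps falling as 1/sqrt(t), while RMSprop's has settled near lr.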

Adam

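Adam combines the first-order momentum of momentum SGD (momentum applied to the gradient itself) with the second-order momentum of RMSprop (momentum applied to the squared gradient, the term that sets the adaptive learning rate).

Adam's global learning rate is usually set to lr=1e-3.

Typical values: beta1=0.9, beta2=0.999, delta=1e-8. s is the accumulated first-order momentum and r the accumulated second-order momentum, both initialized to 0.

Update the first-order momentum: s=beta1*s+(1-beta1)*g

Update the second-order momentum: r=beta2*r+(1-beta2)*(g⊙g)

Bias-correct the accumulated momenta: s_hat=s/(1-beta1^t), r_hat=r/(1-beta2^t). Here beta1^t and beta2^t are t-th powers, and t is the number of weight updates performed so far, incremented by 1 on every iteration (t+=1). The corrected values are used only for the current update; s and r themselves carry over uncorrected.

Compute the update: v=-(lr/(sqrt(r_hat)+delta))*s_hat

Update the model parameters: θ=θ+v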
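The effect of the bias correction is easiest to see on the very first update. The sketch below assumes s and r start at zero, t starts at 0, and delta is added outside the square root; `adam_step` is a hypothetical helper, not the implementation further down.

```python
import numpy as np


def adam_step(theta, g, s, r, t, lr=1e-3, beta1=0.9, beta2=0.999, delta=1e-8):
    """One Adam update; returns the new theta, s, r, and t (hypothetical helper)."""
    t = t + 1
    s = beta1 * s + (1 - beta1) * g       # first-moment estimate
    r = beta2 * r + (1 - beta2) * g * g   # second-moment estimate
    # Bias-corrected copies; s and r themselves are stored uncorrected.
    s_hat = s / (1 - beta1 ** t)
    r_hat = r / (1 - beta2 ** t)
    theta = theta - lr * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r, t


theta = np.array([1.0, 1.0])
g = np.array([0.1, -10.0])  # assumed gradient, for illustration only
s = np.zeros_like(theta)
r = np.zeros_like(theta)

theta1, s1, r1, t1 = adam_step(theta, g, s, r, t=0)
```

At t = 1 the raw s is only (1-beta1)*g, one tenth of the gradient. Dividing by 1-beta1^1 restores it to g, so the first step has magnitude close to lr in every component regardless of the gradient's scale.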

import numpy as np

"""
This file implements various first-order update rules that are commonly used
for training neural networks. Each update rule accepts current weights and the
gradient of the loss with respect to those weights and produces the next set of
weights. Each update rule has the same interface:

def update(w, dw, config=None):

Inputs:
  - w: A numpy array giving the current weights.
  - dw: A numpy array of the same shape as w giving the gradient of the
    loss with respect to w.
  - config: A dictionary containing hyperparameter values such as learning
    rate, momentum, etc. If the update rule requires caching values over many
    iterations, then config will also hold these cached values.

Returns:
  - next_w: The next point after the update.
  - config: The config dictionary to be passed to the next iteration of the
    update rule.

NOTE: For most update rules, the default learning rate will probably not
perform well; however the default values of the other hyperparameters should
work well for a variety of different problems.

For efficiency, update rules may perform in-place updates, mutating w and
setting next_w equal to w.
"""


def sgd(w, dw, config=None):
    """
    Performs vanilla stochastic gradient descent.

    config format:
    - learning_rate: Scalar learning rate.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)

    w -= config['learning_rate'] * dw

    """
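    Stochastic gradient descent:
    1. Run the forward and backward pass on the current mini-batch (sampled
    at random from the whole training set) to obtain the gradient averaged
    over the batch. This is the gradient with respect to the weights and has
    the same shape as w; call it dw.
    2. The gradient points in the direction of steepest increase of the loss,
    so the negative gradient is the direction of steepest descent. The step
    size is the learning rate:
    v = -learning_rate * dw   (v is the actual change applied to the weights)
    w = w + v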
    """
    return w, config

def sgd_momentum(w, dw, config=None):
    """
    Performs stochastic gradient descent with momentum.

    config format:
    - learning_rate: Scalar learning rate.
    - momentum: Scalar between 0 and 1 giving the momentum value.
      Setting momentum = 0 reduces to sgd.
    - velocity: A numpy array of the same shape as w and dw used to store a
      moving average of the gradients.

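    Momentum SGD updates the weights using a velocity term and usually
    optimizes better than plain SGD.

    If the weights are updated at every iteration directly from the current
    mini-batch gradient and the learning rate, the updates can be very noisy.
    Momentum addresses this: each update also takes the previous velocity
    into account and adjusts it with the current gradient, which keeps the
    direction of the weight updates more continuous and stable over training.

    Two hyperparameters are involved: momentum (the momentum coefficient)
    and learning_rate. Given the gradient dw computed on the current randomly
    sampled mini-batch:

      v = momentum * v - learning_rate * dw
      w = w + v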

    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('momentum', 0.9)
    v = config.get('velocity', np.zeros_like(w))

    # Momentum update: the velocity is a decaying accumulation of past
    # gradient steps, and the weights move along the velocity.
    v = config['momentum'] * v - config['learning_rate'] * dw
    next_w = w + v

    # Store the velocity for the next iteration.
    config['velocity'] = v

    return next_w, config

'''
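Adagrad: gradient descent with adaptive learning rates, using a different
effective learning rate for each weight. (Described for reference; this file
does not implement it.)
Initialize the accumulated squared gradient: cache = 0
Given the gradient dw from the current mini-batch:
1. Update the accumulator: cache = cache + dw * dw  (element-wise product)
2. Scale the step per parameter: v = -[learning_rate / (epsilon + sqrt(cache))] * dw
3. w = w + v
Weights whose gradients have been large get a smaller effective learning rate,
and weights whose gradients have been small get a larger one, so progress
through parameter space is more even across directions.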
'''
def rmsprop(x, dx, config=None):
    """
    Uses the RMSProp update rule, which uses a moving average of squared
    gradient values to set adaptive per-parameter learning rates.

    config format:
    - learning_rate: Scalar learning rate.
    - decay_rate: Scalar between 0 and 1 giving the decay rate for the squared
      gradient cache.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - cache: Moving average of second moments of gradients.

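    RMSprop can be seen as a small modification of Adagrad. Adagrad scales
    each weight's learning rate by the accumulated sum of squared gradients,
    but that sum only grows, so after enough iterations the effective learning
    rate of every weight drops sharply. RMSprop replaces the running sum with
    an exponentially decaying average.

    With decay rate p:
      cache = p * cache + (1 - p) * dw * dw   (element-wise product)
      v = -[learning_rate / (epsilon + sqrt(cache))] * dw
      w = w + v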
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('decay_rate', 0.99)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('cache', np.zeros_like(x))

    # Exponentially decaying average of the squared gradients.
    config['cache'] = (config['decay_rate'] * config['cache']
                       + (1 - config['decay_rate']) * np.square(dx))

    # Per-parameter step; epsilon sits outside the square root, as in the
    # docstring formula, to avoid division by zero.
    next_x = x - config['learning_rate'] * dx / (np.sqrt(config['cache']) + config['epsilon'])

    return next_x, config

'''
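Adam
Both the first-moment estimate s (of the gradient) and the second-moment
estimate r (of the squared gradient, which sets the per-weight learning rate)
are exponentially decaying averages, with decay rates beta1 and beta2.
1. Initialize r = 0, s = 0; t is the update count (t = 1 on the first update)
2. Update both estimates with the gradient dw of the current mini-batch:
   s = beta1 * s + (1 - beta1) * dw
   r = beta2 * r + (1 - beta2) * dw * dw   (element-wise product)
3. Bias correction (s and r themselves are stored uncorrected):
   s_hat = s / (1 - beta1 ** t)
   r_hat = r / (1 - beta2 ** t)
4. Compute the weight update from the corrected estimates:
   v = -[learning_rate / (epsilon + sqrt(r_hat))] * s_hat
   w = w + v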
'''

def adam(x, dx, config=None):
    """
    Uses the Adam update rule, which incorporates moving averages of both the
    gradient and its square and a bias correction term.

    config format:
    - learning_rate: Scalar learning rate.
    - beta1: Decay rate for moving average of first moment of gradient.
    - beta2: Decay rate for moving average of second moment of gradient.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - m: Moving average of gradient.
    - v: Moving average of squared gradient.
    - t: Iteration number.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-3)
    config.setdefault('beta1', 0.9)
    config.setdefault('beta2', 0.999)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('m', np.zeros_like(x))
    config.setdefault('v', np.zeros_like(x))
    # t starts at 0 so that the first update uses t = 1 in the bias correction.
    config.setdefault('t', 0)

    # m is the first-moment (momentum) estimate; v is the second-moment estimate.
    config['t'] += 1
    config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dx
    config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * np.square(dx)

    # Bias-corrected estimates; the stored m and v are left uncorrected.
    m_hat = config['m'] / (1 - config['beta1'] ** config['t'])
    v_hat = config['v'] / (1 - config['beta2'] ** config['t'])

    # epsilon sits outside the square root to avoid division by zero.
    next_x = x - config['learning_rate'] * m_hat / (np.sqrt(v_hat) + config['epsilon'])
    return next_x, config
