Learning rate decay tricks in TensorFlow

Latest recommended article published 2024-04-16 13:23:39

zaf赵


Article link: https://blog.csdn.net/zaf0516/article/details/90720759


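Everyone in deep learning knows the parameter update methods (SGD, Adam and so on), and their relative merits are discussed at length. But have you ever paid close attention to the learning rate decay strategy?

When training a neural network, the learning rate controls how fast the parameters are updated. A small learning rate slows the updates down considerably; a large one makes the search oscillate, so the parameters hover around the optimum without settling. Learning rate decay addresses this by gradually reducing the learning rate as training proceeds.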
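To make the small-versus-large trade-off concrete, here is a minimal sketch on a toy problem of my own choosing (minimizing f(x) = x**2 with plain gradient descent). The function name `gradient_descent` and the three learning rates are illustrative assumptions, not part of TensorFlow.

```python
def gradient_descent(lr, steps=20, x0=1.0):
    """Minimize f(x) = x**2 with plain gradient descent; return all visited points."""
    xs = [x0]
    x = x0
    for _ in range(steps):
        grad = 2.0 * x        # derivative of x**2
        x = x - lr * grad     # each step multiplies x by (1 - 2 * lr)
        xs.append(x)
    return xs


slow = gradient_descent(lr=0.01)     # factor 0.98 per step: very slow progress
fast = gradient_descent(lr=0.4)      # factor 0.2 per step: converges quickly
bouncing = gradient_descent(lr=1.0)  # factor -1 per step: flips sign and never settles

print(abs(slow[-1]), abs(fast[-1]), abs(bouncing[-1]))
```

A decay schedule tries to get the best of both: start near the large end to move quickly, then shrink the rate so the parameters can settle.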

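The decay strategies live in tensorflow/tensorflow/python/training/learning_rate_decay.py and are called as, for example, tf.train.exponential_decay. The examples in this article use the TensorFlow 1.x API (tf.Session, tf.train.*).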

Contents

            1. Exponential-type decay
                1.1. exponential_decay
                1.2. piecewise_constant
                1.3. polynomial_decay
                1.4. natural_exp_decay
                1.5. inverse_time_decay
            2. Cosine-based decay
                2.1. cosine_decay
                2.2. cosine_decay_restarts
                2.3. linear_cosine_decay
                2.4. noisy_linear_cosine_decay
            3. Custom strategies
                3.1. auto_learning_rate_decay
            4. Summary

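Below I go through the learning rate decay strategies one by one in IPython.

1. Exponential-type decay

The implementations in this section are all exponential-type decays. As I see it, their weakness is that the learning rate drops quickly right from the start, so on hard problems training may converge quickly to a local minimum without properly exploring the surrounding parameter space.

1.1. Exponential decay (exponential_decay)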

exponential_decay(learning_rate, global_step, decay_steps, decay_rate,
                  staircase=False, name=None)

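Exponential decay is the most common decay method and is used in a large number of models.
Parameters:

learning_rate: the initial learning rate.
global_step: the global step used in the decay computation; must be non-negative. It drives the decay exponent.
decay_steps: the number of decay steps; must be positive. It sets the decay period.
decay_rate: the decay rate.
staircase: if True, the learning rate decays at discrete intervals, i.e. in a staircase (it stays constant for a stretch of steps, for example within one epoch); if False, it follows the standard continuous exponential curve.
name: optional name of the operation, ExponentialDecay by default.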

The decayed learning rate is computed as:

decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)

Advantages: simple and direct, with fast convergence.
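The formula can be checked without TensorFlow. Below is a minimal pure-Python sketch of it; the function mirrors the parameter names of tf.train.exponential_decay but is my own re-implementation, and I am assuming that staircase=True simply floors global_step / decay_steps.

```python
import math


def exponential_decay(learning_rate, global_step, decay_steps, decay_rate,
                      staircase=False):
    """Pure-Python sketch of learning_rate * decay_rate ** (global_step / decay_steps)."""
    exponent = global_step / decay_steps
    if staircase:
        # Assumption: staircase mode uses integer division, so the rate
        # only changes once every decay_steps steps.
        exponent = math.floor(exponent)
    return learning_rate * decay_rate ** exponent


for step in (0, 5, 10, 19, 20):
    print(step,
          exponential_decay(0.5, step, 10, 0.9, staircase=True),
          exponential_decay(0.5, step, 10, 0.9, staircase=False))
```

With decay_steps=10 the staircase version holds one value for steps 10 to 19, while the continuous version keeps shrinking at every step.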

Example comparing staircase and continuous exponential decay:

# coding:utf-8
# exponential_decay: staircase vs. continuous decay
import matplotlib.pyplot as plt
import tensorflow as tf

y = []  # staircase=True
z = []  # staircase=False
N = 200
with tf.Session() as sess:
    for global_step in range(N):
        # Staircase decay
        learning_rate1 = tf.train.exponential_decay(
            learning_rate=0.5, global_step=global_step, decay_steps=10, decay_rate=0.9, staircase=True)
        # Continuous exponential decay
        learning_rate2 = tf.train.exponential_decay(
            learning_rate=0.5, global_step=global_step, decay_steps=10, decay_rate=0.9, staircase=False)
        lr1, lr2 = sess.run([learning_rate1, learning_rate2])
        y.append(lr1)
        z.append(lr2)

x = range(N)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_ylim([0, 0.55])
plt.plot(x, y, 'r-', linewidth=2, label='staircase=True')
plt.plot(x, z, 'g-', linewidth=2, label='staircase=False')
plt.title('exponential_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.legend()
plt.show()

Figure 1. exponential_decay example: the red line is staircase=True (the staircase curve) and the green line is staircase=False (the continuous exponential curve)


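1.2. Piecewise constant decay (piecewise_constant)

Piecewise constant decay assigns a different constant to each of a set of predefined intervals; these constants serve as the initial learning rate and the values it decays to later.

Function prototype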

piecewise_constant(x, boundaries, values, name=None)

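Parameters:

x: a 0-D scalar Tensor.
boundaries: the interval boundaries, a tensor or list.
values: the value to use on each interval.
name: name of the operation, PiecewiseConstant by default.

Piecewise constant decay resembles the staircase mode of exponential_decay, except that you set the value of each stage yourself.

Here x is the global step, boundaries=[step_1, step_2, …, step_n] gives the steps at which the learning rate changes, and values=[val_0, val_1, val_2, …, val_n] gives the initial learning rate and the value after each change. Note that values must have one more element than boundaries.

Characteristics
This method lets you tune finely for each task: the learning rate can drop by any amount after any number of steps.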
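The lookup itself is just an interval search. Here is a minimal pure-Python sketch; it is my own re-implementation rather than the library code, and it assumes the boundary convention I recall from the TensorFlow documentation, namely that a step equal to a boundary still uses the earlier value (x <= boundaries[0] gives values[0]).

```python
from bisect import bisect_left


def piecewise_constant(x, boundaries, values):
    """Return the value of the interval that step x falls into."""
    if len(values) != len(boundaries) + 1:
        raise ValueError("values must have exactly one more element than boundaries")
    # bisect_left counts the boundaries strictly below x, so a step that
    # equals a boundary still belongs to the earlier interval (assumption).
    return values[bisect_left(boundaries, x)]


boundaries = [10, 20, 30]
values = [0.1, 0.07, 0.025, 0.0125]
for step in (0, 10, 11, 20, 21, 30, 31):
    print(step, piecewise_constant(step, boundaries, values))
```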

Code example:

# piecewise_constant: step-wise decay with hand-picked values
import matplotlib.pyplot as plt
import tensorflow as tf

boundaries = [10, 20, 30]
learning_rates = [0.1, 0.07, 0.025, 0.0125]  # one more entry than boundaries
y = []
N = 40
with tf.Session() as sess:
    for global_step in range(N):
        learning_rate = tf.train.piecewise_constant(global_step, boundaries=boundaries, values=learning_rates)
        y.append(sess.run(learning_rate))

x = range(N)
plt.plot(x, y, 'r-', linewidth=2)
plt.title('piecewise_constant')
plt.show()

Figure 2. piecewise_constant example


1.3. Polynomial decay (polynomial_decay)

polynomial_decay(learning_rate, global_step, decay_steps,
                 end_learning_rate=0.0001, power=1.0,
                 cycle=False, name=None)

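Parameters:

learning_rate: the initial learning rate.
global_step: the global step used in the decay computation; must be non-negative.
decay_steps: the number of decay steps; must be positive.
end_learning_rate: the final, minimum learning rate.
power: the power of the polynomial, 1.0 (linear) by default.
cycle: whether the learning rate rises again once it has decayed past decay_steps.
name: name of the operation, PolynomialDecay by default.

The function applies a polynomial decay that takes the initial learning rate (learning_rate) down to the given end_learning_rate over decay_steps steps.

The decayed learning rate is computed as:

# If cycle=False
global_step = min(global_step, decay_steps)
decayed_learning_rate = (learning_rate - end_learning_rate) *
                          (1 - global_step / decay_steps) ^ (power) +
                          end_learning_rate
# If cycle=True
decay_steps = decay_steps * ceil(global_step / decay_steps)
decayed_learning_rate = (learning_rate - end_learning_rate) *
                          (1 - global_step / decay_steps) ^ (power) +
                          end_learning_rate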
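Both branches of the formula fit in a few lines of plain Python. This is a sketch of my own, not the library source. One detail is an assumption on my part: at global_step=0 the cycle formula would give ceil(0) = 0 and divide by zero, so the sketch uses a multiplier of 1 there.

```python
import math


def polynomial_decay(learning_rate, global_step, decay_steps,
                     end_learning_rate=0.0001, power=1.0, cycle=False):
    """Pure-Python sketch of the polynomial decay formula above."""
    if cycle:
        # Stretch decay_steps to the next multiple that covers global_step.
        # Assumption: use a multiplier of 1 at step 0 to avoid dividing by zero.
        multiplier = 1 if global_step == 0 else math.ceil(global_step / decay_steps)
        decay_steps = decay_steps * multiplier
    else:
        # Without cycling, the rate stays at end_learning_rate after decay_steps.
        global_step = min(global_step, decay_steps)
    fraction = 1 - global_step / decay_steps
    return (learning_rate - end_learning_rate) * fraction ** power + end_learning_rate


for step in (0, 25, 50, 51, 100):
    print(step,
          polynomial_decay(0.1, step, 50, 0.01, power=0.5, cycle=False),
          polynomial_decay(0.1, step, 50, 0.01, power=0.5, cycle=True))
```

At step 51 the cycling version jumps back up because decay_steps has been stretched from 50 to 100, so only 51% of the new period has elapsed.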

Code example:

# polynomial_decay: with and without cycling
# (cycle controls whether the rate rises again after it has decayed)
import matplotlib.pyplot as plt
import tensorflow as tf

y = []  # cycle=False
z = []  # cycle=True
N = 200

with tf.Session() as sess:
    for global_step in range(N):
        # cycle=False
        learning_rate1 = tf.train.polynomial_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=50,
            end_learning_rate=0.01, power=0.5, cycle=False)
        # cycle=True
        learning_rate2 = tf.train.polynomial_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=50,
            end_learning_rate=0.01, power=0.5, cycle=True)
        lr1, lr2 = sess.run([learning_rate1, learning_rate2])
        y.append(lr1)
        z.append(lr2)

x = range(N)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, z, 'g-', linewidth=2, label='cycle=True')
plt.plot(x, y, 'r--', linewidth=2, label='cycle=False')
plt.title('polynomial_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.legend()
plt.show()

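Figure 3. polynomial_decay example: the green line is cycle=True and the red dashed line is cycle=False

The learning rate reaches its minimum after decay_steps=50 iterations. With cycle=False it then stays at that preset minimum; with cycle=True it jumps back up at once and decays again.

The point of letting the learning rate rise and fall repeatedly is to guard against the late phase of training, when a learning rate that has become too small can leave the parameters stuck in a local optimum; raising it again gives them a chance to escape.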

1.4. Natural exponential decay (natural_exp_decay)

natural_exp_decay(learning_rate, global_step, decay_steps, decay_rate,
                  staircase=False, name=None)

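Parameters:

learning_rate: the initial learning rate.
global_step: the global step used in the decay computation; must be non-negative.
decay_steps: the number of decay steps.
decay_rate: the decay rate.
staircase: if True, the decay is a discrete staircase (the rate stays constant for a stretch of steps, for example within one epoch); if False, it is continuous.
name: name of the operation, ExponentialTimeDecay by default.

natural_exp_decay has much the same form as exponential_decay, except that its base is $1/e$.

exponential_decay:
decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)

natural_exp_decay:
decayed_learning_rate = learning_rate * exp(-decay_rate * global_step / decay_steps)

# With staircase=True the learning rate takes discrete values, updated once every decay_steps iterations.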
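Putting the two formulas side by side explains the speed difference: natural_exp_decay with rate k behaves like exponential_decay with decay_rate = exp(-k). The sketch below is my own pure-Python version of the formula, again assuming that staircase floors global_step / decay_steps.

```python
import math


def natural_exp_decay(learning_rate, global_step, decay_steps, decay_rate,
                      staircase=False):
    """Pure-Python sketch of learning_rate * exp(-decay_rate * global_step / decay_steps)."""
    p = global_step / decay_steps
    if staircase:
        p = math.floor(p)  # assumption: staircase mode floors the exponent
    return learning_rate * math.exp(-decay_rate * p)


# After one decay period with decay_rate=0.9:
natural = natural_exp_decay(0.5, 10, 10, 0.9)   # multiplied by exp(-0.9), about 0.41
plain = 0.5 * 0.9 ** (10 / 10)                  # multiplied by 0.9
print(natural, plain)
```

So with the same nominal decay_rate of 0.9, one period shrinks the rate to roughly 41% under natural_exp_decay but only to 90% under exponential_decay.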

Code example:

# natural_exp_decay (staircase and continuous) compared with exponential_decay
import matplotlib.pyplot as plt
import tensorflow as tf

y = []  # natural_exp_decay, staircase=True
z = []  # natural_exp_decay, staircase=False
w = []  # exponential_decay, staircase=False
N = 200
with tf.Session() as sess:
    for global_step in range(N):
        # Natural exponential decay, staircase
        learning_rate1 = tf.train.natural_exp_decay(
            learning_rate=0.5, global_step=global_step, decay_steps=10, decay_rate=0.9, staircase=True)
        # Natural exponential decay, continuous
        learning_rate2 = tf.train.natural_exp_decay(
            learning_rate=0.5, global_step=global_step, decay_steps=10, decay_rate=0.9, staircase=False)
        # Plain exponential decay for comparison
        learning_rate3 = tf.train.exponential_decay(
            learning_rate=0.5, global_step=global_step, decay_steps=10, decay_rate=0.9, staircase=False)
        lr1, lr2, lr3 = sess.run([learning_rate1, learning_rate2, learning_rate3])
        y.append(lr1)
        z.append(lr2)
        w.append(lr3)

x = range(N)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_ylim([0, 0.55])
plt.plot(x, y, 'r-', linewidth=2, label='natural_exp_decay, staircase=True')
plt.plot(x, z, 'g-', linewidth=2, label='natural_exp_decay, staircase=False')
plt.plot(x, w, 'b-', linewidth=2, label='exponential_decay')
plt.title('natural_exp_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.legend()
plt.show()

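Figure 4. natural_exp_decay compared with exponential_decay: the red line is the staircase curve of natural_exp_decay, the green line is continuous natural_exp_decay, and the blue line is exponential_decay

As the figure shows, with the same decay_rate natural exponential decay falls much faster than exponential_decay, which suits networks that converge quickly and are easy to train.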

1.5. Inverse time decay (inverse_time_decay)

inverse_time_decay(learning_rate, global_step, decay_steps, decay_rate,
                   staircase=False, name=None)

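Parameters:

learning_rate: the initial learning rate.
global_step: the global step used in the decay computation.
decay_steps: the number of decay steps.
decay_rate: the decay rate.
staircase: whether to apply discrete staircase decay (continuous otherwise).
name: name of the operation, InverseTimeDecay by default.

inverse_time_decay decays the learning rate in inverse proportion to the step count, using the formula:

decayed_learning_rate = learning_rate / (1 + decay_rate * global_step / decay_steps)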
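As a quick check of the formula, here is a minimal pure-Python sketch (my own re-implementation; the staircase branch assumes global_step / decay_steps is floored, as in the other schedules).

```python
import math


def inverse_time_decay(learning_rate, global_step, decay_steps, decay_rate,
                       staircase=False):
    """Pure-Python sketch of learning_rate / (1 + decay_rate * global_step / decay_steps)."""
    p = global_step / decay_steps
    if staircase:
        p = math.floor(p)  # assumption: staircase mode floors the ratio
    return learning_rate / (1 + decay_rate * p)


for step in (0, 20, 39, 40, 200):
    print(step,
          inverse_time_decay(0.1, step, 20, 0.2, staircase=True),
          inverse_time_decay(0.1, step, 20, 0.2, staircase=False))
```

Unlike the exponential schedules, the denominator grows only linearly, so the rate falls off more and more slowly: after 200 steps with these settings it is still a third of the initial value.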

Code example:

# inverse_time_decay: staircase vs. continuous decay
import matplotlib.pyplot as plt
import tensorflow as tf

y = []  # staircase=True
z = []  # staircase=False
N = 200

with tf.Session() as sess:
    for global_step in range(N):
        # Staircase decay
        learning_rate1 = tf.train.inverse_time_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=20,
            decay_rate=0.2, staircase=True)
        # Continuous decay
        learning_rate2 = tf.train.inverse_time_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=20,
            decay_rate=0.2, staircase=False)
        lr1, lr2 = sess.run([learning_rate1, learning_rate2])
        y.append(lr1)
        z.append(lr2)

x = range(N)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, z, 'r-', linewidth=2, label='staircase=False')
plt.plot(x, y, 'g-', linewidth=2, label='staircase=True')
plt.title('inverse_time_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.legend()
plt.show()

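Figure 5. inverse_time_decay example

The decay schemes above differ little from one another and are mostly exponential-type. As I see it, their weakness is that the learning rate drops quickly right from the start, so on hard problems training may converge quickly to a local minimum without properly exploring the surrounding parameter space.

2. Cosine-based decay

The implementations below are all based on the cosine function.

2.1. Cosine decay (cosine_decay)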

cosine_decay(learning_rate, global_step, decay_steps, alpha=0.0,
                 name=None)

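Parameters:

learning_rate: the initial learning rate.
global_step: the global step used in the decay computation.
decay_steps: the number of decay steps.
alpha: the minimum learning rate, expressed as a fraction of learning_rate.
name: name of the operation, CosineDecay by default.

cosine_decay is a relatively recent decay strategy whose curve follows a cosine. The implementation is based on the paper SGDR: Stochastic Gradient Descent with Warm Restarts.

It is computed as follows: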

global_step = min(global_step, decay_steps)
cosine_decay = 0.5 * (1 + cos(pi * global_step / decay_steps))
decayed = (1 - alpha) * cosine_decay + alpha
decayed_learning_rate = learning_rate * decayed
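The four steps translate directly into plain Python. The sketch below is my own re-implementation of the steps listed above, using the same parameter names as tf.train.cosine_decay.

```python
import math


def cosine_decay(learning_rate, global_step, decay_steps, alpha=0.0):
    """Pure-Python sketch of the cosine decay steps."""
    global_step = min(global_step, decay_steps)      # hold the final value afterwards
    cosine = 0.5 * (1 + math.cos(math.pi * global_step / decay_steps))
    decayed = (1 - alpha) * cosine + alpha           # alpha sets the floor
    return learning_rate * decayed


for step in (0, 75, 150, 200):
    print(step,
          cosine_decay(0.1, step, 150, alpha=0.0),
          cosine_decay(0.1, step, 150, alpha=0.3))
```

With alpha=0.3 the rate never goes below 0.3 * learning_rate = 0.03, and past decay_steps=150 both curves stay flat.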

Code example:

# cosine_decay: effect of alpha
import matplotlib.pyplot as plt
import tensorflow as tf

y = []  # alpha=0.0
z = []  # alpha=0.3
N = 200

with tf.Session() as sess:
    for global_step in range(N):
        # alpha=0.0: decays all the way to zero
        learning_rate1 = tf.train.cosine_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=150,
            alpha=0.0)
        # alpha=0.3: never drops below 0.3 * learning_rate
        learning_rate2 = tf.train.cosine_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=150,
            alpha=0.3)
        lr1, lr2 = sess.run([learning_rate1, learning_rate2])
        y.append(lr1)
        z.append(lr2)

x = range(N)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, z, 'r-', linewidth=2, label='alpha=0.3')
plt.plot(x, y, 'g-', linewidth=2, label='alpha=0.0')
plt.title('cosine_decay')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.legend()
plt.show()

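alpha acts as a baseline that keeps the learning rate from dropping below a certain value. The effect of different alpha values is shown below:

Figure 6. cosine_decay example: the red line uses alpha=0.3 and the green line alpha=0.0

2.2. Cosine decay with restarts (cosine_decay_restarts)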

cosine_decay_restarts(learning_rate, global_step, first_decay_steps,
                          t_mul=2.0, m_mul=1.0, alpha=0.0, name=None)

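Parameters:

learning_rate: a scalar float32 or float64 Tensor or a Python number. The initial learning rate.
global_step: a scalar int32 or int64 Tensor or a Python number. The global step used in the decay computation.
first_decay_steps: a scalar int32 or int64 Tensor or a Python number. The number of steps in the first decay period.
t_mul: a scalar float32 or float64 Tensor or a Python number. Used to derive the number of iterations in the i-th period.
m_mul: a scalar float32 or float64 Tensor or a Python number. Used to derive the initial learning rate of the i-th period.
alpha: a scalar float32 or float64 Tensor or a Python number. The minimum learning rate, as a fraction of learning_rate.
name: a string. Optional name of the operation, 'SGDRDecay' by default.

cosine_decay_restarts is the cyclic version of cosine_decay. first_decay_steps is the number of steps in the first full descent, t_mul multiplies the length of each successive cycle, and m_mul multiplies the initial learning rate of each successive cycle relative to the previous one.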
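To see how first_decay_steps, t_mul and m_mul interact, here is a small pure-Python sketch. It is my own loop-based re-implementation, not the library's closed-form code, and it assumes t_mul >= 1 so that the cycles never shrink.

```python
import math


def cosine_decay_restarts(learning_rate, global_step, first_decay_steps,
                          t_mul=2.0, m_mul=1.0, alpha=0.0):
    """Loop-based sketch of cosine decay with warm restarts (assumes t_mul >= 1)."""
    period = float(first_decay_steps)  # length of the current cycle
    peak = 1.0                         # m_mul ** (number of restarts so far)
    step = float(global_step)
    while step >= period:              # walk forward through the completed cycles
        step -= period
        period *= t_mul                # each cycle is t_mul times longer
        peak *= m_mul                  # each cycle starts m_mul times lower
    fraction = step / period           # progress within the current cycle
    cosine = 0.5 * peak * (1 + math.cos(math.pi * fraction))
    return learning_rate * ((1 - alpha) * cosine + alpha)


for step in (0, 20, 39, 40, 80, 120):
    print(step,
          cosine_decay_restarts(0.1, step, 40),
          cosine_decay_restarts(0.1, step, 40, m_mul=0.5))
```

With first_decay_steps=40 and t_mul=2.0 the restarts fall at steps 40, 120, 280 and so on; with m_mul=0.5 each restart begins at half the previous peak.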

Code example:

# coding:utf-8
# cosine_decay_restarts: effect of first_decay_steps
import matplotlib.pyplot as plt
import tensorflow as tf

y = []  # first_decay_steps=40
z = []  # first_decay_steps=60
N = 100

with tf.Session() as sess:
    for global_step in range(N):
        # Cosine decay with restarts (t_mul and m_mul left at their defaults)
        learning_rate1 = tf.train.cosine_decay_restarts(
            learning_rate=0.1, global_step=global_step, first_decay_steps=40)
        learning_rate2 = tf.train.cosine_decay_restarts(
            learning_rate=0.1, global_step=global_step, first_decay_steps=60)
        lr1, lr2 = sess.run([learning_rate1, learning_rate2])
        y.append(lr1)
        z.append(lr2)

x = range(N)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, y, 'r-', linewidth=2, label='first_decay_steps=40')
plt.plot(x, z, 'b-', linewidth=2, label='first_decay_steps=60')
plt.title('cosine_decay_restarts')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.legend(loc='upper right')
plt.show()

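Figure 7. cosine_decay_restarts example: the red line uses first_decay_steps=40 and the blue line first_decay_steps=60, both with the defaults t_mul=2.0 and m_mul=1.0

The cosine-shaped descent mimics a process in which a large learning rate locates a promising region and a small one then converges quickly; combined with the cyclic effect of the restarts, it may improve results by 1 to 2 points.

2.3. Linear cosine decay (linear_cosine_decay)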

linear_cosine_decay(learning_rate, global_step, decay_steps,
                        num_periods=0.5, alpha=0.0, beta=0.001,
                        name=None)

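Parameters:

learning_rate: the initial learning rate.
global_step: the global step used in the decay computation.
decay_steps: the number of decay steps.
num_periods: the number of periods in the cosine part of the decay.
alpha: an offset added to the linear decay term.
beta: a constant added to the final decay factor.
name: name of the operation, LinearCosineDecay by default.

The reference for linear_cosine_decay is Neural Optimizer Search with RL. Its main application area is reinforcement learning, and I have not tried it myself. As the name suggests, it is also a cosine-based decay strategy.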
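The article does not spell out the computation, so the sketch below follows the formula as I recall it from the TensorFlow documentation for tf.train.linear_cosine_decay; treat it as an assumption to verify against your TensorFlow version rather than as the library source.

```python
import math


def linear_cosine_decay(learning_rate, global_step, decay_steps,
                        num_periods=0.5, alpha=0.0, beta=0.001):
    """Pure-Python sketch: a linear ramp multiplied by a cosine, plus a small offset."""
    global_step = min(global_step, decay_steps)
    linear = (decay_steps - global_step) / decay_steps          # goes from 1 down to 0
    cosine = 0.5 * (1 + math.cos(
        math.pi * 2 * num_periods * global_step / decay_steps))
    decayed = (alpha + linear) * cosine + beta
    return learning_rate * decayed


for step in (0, 25, 50, 100):
    print(step, linear_cosine_decay(0.1, step, 50))
```

With the default num_periods=0.5 the cosine completes half a period over decay_steps, so both factors reach zero together and only the beta offset remains at the end.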

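Figure 9. linear_cosine_decay example

2.4. Noisy linear cosine decay (noisy_linear_cosine_decay)

noisy_linear_cosine_decay(learning_rate, global_step, decay_steps,
                              initial_variance=1.0, variance_decay=0.55,
                              num_periods=0.5, alpha=0.0, beta=0.001,
                              name=None)

Applies noisy linear cosine decay to the learning rate.
Computation:
the same as linear_cosine_decay, with noise added to the linear term.

Parameters:

learning_rate: the initial learning rate.
global_step: the global step used in the decay computation.
decay_steps: the number of decay steps.
initial_variance: the initial variance of the noise.
variance_decay: the rate at which the noise variance decays.
num_periods: the number of periods in the cosine part of the decay.
alpha: an offset added to the linear decay term.
beta: a constant added to the final decay factor.
name: name of the operation, NoisyLinearCosineDecay by default.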
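For illustration, here is a pure-Python sketch of the noisy variant. It follows the computation as I recall it from the TensorFlow documentation (zero-mean Gaussian noise with variance initial_variance / (1 + global_step) ** variance_decay, added to the linear term), so treat the details as assumptions. The `rng` argument is a hypothetical addition of mine to make the noise reproducible; the TensorFlow function has no such parameter.

```python
import math
import random


def noisy_linear_cosine_decay(learning_rate, global_step, decay_steps,
                              initial_variance=1.0, variance_decay=0.55,
                              num_periods=0.5, alpha=0.0, beta=0.001,
                              rng=random):
    """Sketch of linear cosine decay with decaying Gaussian noise on the linear term."""
    global_step = min(global_step, decay_steps)
    # Noise variance shrinks as training progresses (assumed formula).
    variance = initial_variance / (1 + global_step) ** variance_decay
    eps = rng.gauss(0.0, math.sqrt(variance))
    linear = (decay_steps - global_step) / decay_steps
    cosine = 0.5 * (1 + math.cos(
        math.pi * 2 * num_periods * global_step / decay_steps))
    decayed = (alpha + linear + eps) * cosine + beta
    return learning_rate * decayed


rng = random.Random(0)  # fixed seed so repeated runs give the same curve
for step in (0, 10, 25, 50):
    print(step, noisy_linear_cosine_decay(0.1, step, 50,
                                          initial_variance=0.01, rng=rng))
```

Because the noise is multiplied by the cosine factor, it fades out together with the schedule: at step 50 with the default num_periods the result is exactly learning_rate * beta whatever the noise draw.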

Code example, comparing cosine_decay, linear_cosine_decay and noisy_linear_cosine_decay:

#!/usr/bin/python
# coding:utf-8
import matplotlib.pyplot as plt
import tensorflow as tf

y = []  # cosine_decay
z = []  # linear_cosine_decay
w = []  # noisy_linear_cosine_decay
N = 200

with tf.Session() as sess:
    for global_step in range(N):
        # Cosine decay
        learning_rate1 = tf.train.cosine_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=50,
            alpha=0.5)
        # Linear cosine decay
        learning_rate2 = tf.train.linear_cosine_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=50,
            num_periods=0.2, alpha=0.5, beta=0.2)
        # Noisy linear cosine decay
        learning_rate3 = tf.train.noisy_linear_cosine_decay(
            learning_rate=0.1, global_step=global_step, decay_steps=50,
            initial_variance=0.01, variance_decay=0.1, num_periods=0.2, alpha=0.5, beta=0.2)
        lr1, lr2, lr3 = sess.run([learning_rate1, learning_rate2, learning_rate3])
        y.append(lr1)
        z.append(lr2)
        w.append(lr3)

x = range(N)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, z, 'b-', linewidth=2, label='linear_cosine_decay')
plt.plot(x, y, 'r-', linewidth=2, label='cosine_decay')
plt.plot(x, w, 'g-', linewidth=2, label='noisy_linear_cosine_decay')
plt.title('cosine decay variants')
ax.set_xlabel('step')
ax.set_ylabel('learning rate')
plt.legend()
plt.show()

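3. Custom strategies
3.1. auto_learning_rate_decay

You can of course define your own decay strategy. For example, set up a monitor on the validation loss or accuracy: if over a given window the loss keeps falling meaningfully (or the accuracy keeps rising), keep the learning rate, otherwise lower it; or make the learning rate drop faster the more sharply the loss rises or the accuracy falls, and other adaptive schemes along those lines.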
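One way such a monitor could look is sketched below. The class `PlateauDecay` and its parameters (factor, patience, min_delta, min_lr) are hypothetical names of my own, not a TensorFlow API; the sketch covers only the simplest variant, in which the rate is cut by a fixed factor after the validation loss has failed to improve for `patience` consecutive checks.

```python
class PlateauDecay:
    """Hypothetical helper: cut the learning rate when validation loss stops improving."""

    def __init__(self, learning_rate, factor=0.5, patience=3,
                 min_delta=1e-4, min_lr=1e-6):
        self.learning_rate = learning_rate
        self.factor = factor        # multiply the rate by this on each cut
        self.patience = patience    # checks without improvement before cutting
        self.min_delta = min_delta  # smallest drop that counts as an improvement
        self.min_lr = min_lr        # never go below this rate
        self.best = float("inf")
        self.bad_checks = 0

    def update(self, val_loss):
        """Record one validation loss and return the learning rate to use next."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
            if self.bad_checks >= self.patience:
                self.learning_rate = max(self.learning_rate * self.factor, self.min_lr)
                self.bad_checks = 0
        return self.learning_rate


schedule = PlateauDecay(learning_rate=0.1, factor=0.5, patience=3)
losses = [1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.69]
history = [schedule.update(loss) for loss in losses]
print(history)
```

In a TensorFlow 1.x training loop the returned value could be fed into a learning rate placeholder at each step; making the cut size depend on how much the loss worsened would be a natural extension of the same idea.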

4. Summary

In my own work the one I use most is exponential_decay, but cosine_decay_restarts is worth a try; it may well surprise you.

