Static vs. Dynamic Graphs + The Difference Between run and .eval in TensorFlow + An Introduction to Learning Rate Decay in TensorFlow

小伟db

Original link: https://www.cnblogs.com/dynmi/p/13645162.html

See also: https://www.cnblogs.com/dynmi/p/13645162.html

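Neural network frameworks currently fall into two groups: static-graph frameworks and dynamic-graph frameworks. The biggest difference between PyTorch and frameworks such as TensorFlow and Caffe is how they represent the computation graph. TensorFlow uses a static graph, which means we define the computation graph once and then run it repeatedly, whereas PyTorch builds a new computation graph on every run. The differences between the two are as follows: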

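With a dynamic graph, computation and graph construction happen at the same time: you can compute the values of earlier nodes first and then build the rest of the graph based on those values. The advantages are flexibility and ease of tuning and debugging. Much PyTorch code is written exactly the way you would use any other Python library, so there is very little extra learning cost.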

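With a static graph, you build the graph first and then feed in data to run it. The advantage is efficiency: because static computation follows a define-then-run pattern, later runs do not need to rebuild the computation graph, so it is usually faster than a dynamic graph. The drawback is inflexibility. In TensorFlow the graph is the same on every run and cannot be changed, so a Python while loop cannot be used directly for control flow inside the graph; it has to be written in TensorFlow's own form with the helper function tf.while_loop.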

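Dynamic-graph computation means the program executes in the order in which we wrote the commands. This makes debugging easier and also makes it easier to turn ideas into working code. Static computation means the program first generates the structure of the neural network when it is compiled and only then performs the corresponding operations. Because the graph is defined once and then reused, later runs skip graph construction, which is where the speed advantage over dynamic graphs comes from. In theory, this mechanism allows the compiler to optimize more aggressively, but it also widens the gap between the program you wrote and what the compiler actually executes. As a result, errors in the code are harder to find (for example, if the structure of the computation graph is wrong, you may only discover it when execution reaches the affected operation).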
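To make the define-then-run idea concrete, here is a minimal toy sketch in plain Python. It is not how TensorFlow or PyTorch are implemented; the names Node, placeholder, add, mul and run are hypothetical helpers invented for this illustration. The "static" half builds a small graph once and evaluates it with different inputs, while the "dynamic" half simply computes values as each line executes.

```python
class Node:
    """A node in a toy define-then-run graph (illustrative only)."""

    def __init__(self, op, inputs=(), name=None):
        self.op = op          # "placeholder", "add" or "mul"
        self.inputs = inputs  # upstream nodes
        self.name = name      # only used by placeholders


def placeholder(name):
    return Node("placeholder", name=name)


def add(a, b):
    return Node("add", (a, b))


def mul(a, b):
    return Node("mul", (a, b))


def run(node, feed):
    """Evaluate a prebuilt graph by walking it with concrete input values."""
    if node.op == "placeholder":
        return feed[node.name]
    left, right = (run(n, feed) for n in node.inputs)
    return left + right if node.op == "add" else left * right


# Static style: the graph is described once; no arithmetic happens here.
x = placeholder("x")
y = placeholder("y")
z = add(mul(x, y), x)

# The same graph is then run repeatedly with different data.
static_results = [run(z, {"x": 2, "y": 3}), run(z, {"x": 4, "y": 5})]


# Dynamic style: values are computed immediately, line by line.
def eager(x_val, y_val):
    return x_val * y_val + x_val


dynamic_results = [eager(2, 3), eager(4, 5)]
print(static_results, dynamic_results)
```

In this toy, a mistake in how z is wired up would only surface when run is called, which mirrors the debugging difficulty described above.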

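Academia generally chooses PyTorch for the following reasons:

It is easier to debug
Dynamic computation is better suited to natural language processing
It follows a traditional object-oriented programming style (which feels more natural to us)
Unusual mechanisms in TensorFlow such as scope and sessions tend to be confusing and take more time to learn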

Of course, with the release of TensorFlow 2.0, TensorFlow has started to support dynamic graphs, so it remains to be seen how things develop.

Source: https://segmentfault.com/a/1190000015287066?utm_source=tag-newest

If you have a Tensor t, calling t.eval() is equivalent to calling tf.get_default_session().run(t).
For example:

import tensorflow as tf  # TensorFlow 1.x API (tf.Session)

t = tf.constant(42.0)
sess = tf.Session()
with sess.as_default():   # or `with sess:` to close on exit
    assert sess is tf.get_default_session()
    assert t.eval() == sess.run(t)

The main difference is that sess.run() lets you fetch the values of several tensors in the same step,
for example:

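import tensorflow as tf  # TensorFlow 1.x API (tf.Session)

t = tf.constant(42.0)
u = tf.constant(37.0)
tu = tf.multiply(t, u)  # tf.mul was renamed to tf.multiply in TensorFlow 1.0
ut = tf.multiply(u, t)
sess = tf.Session()
with sess.as_default():
    tu.eval()  # runs one step
    ut.eval()  # runs one step
    sess.run([tu, ut])  # evaluates both tensors in a single step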

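Note that each call to eval and run executes the whole graph from scratch. To cache the result of a computation, assign it to a tf.Variable and read it back from there.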

Source: https://www.sohu.com/a/217389557_717210

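Everyone is presumably familiar with the parameter update methods used in deep learning, such as SGD and Adam, and their relative merits have been widely discussed. Beyond the optimizer, the learning rate decay strategy is another key part of hyperparameter tuning. Exponential decay is the usual choice, but TensorFlow actually provides many other decay schemes worth trying: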

The learning rate decay strategies are defined in tensorflow/tensorflow/python/training/learning_rate_decay.py (http://t.cn/RQJ78Lg ), and they are called in the form tf.train.exponential_decay.

Below, I go through each lr decay strategy in turn in ipython.

exponential_decay

exponential_decay(learning_rate, global_step, decay_steps, decay_rate, staircase=False, name=None)

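Exponential lr decay is the most commonly used decay method and appears in a large number of models.

learning_rate is the initial lr value, global_step is used to compute the decay exponent step by step, decay_steps sets the decay period, and decay_rate is the factor applied per decay period. With staircase=False the decay is a standard smooth exponential; with staircase=True it is a staircase decay, whose purpose is to keep the learning rate constant for a stretch of time (often within the same epoch).

Figure 1. exponential_decay example. The red line is staircase=False, the smooth exponential curve; the blue line is staircase=True, the staircase curve.

The advantages of this decay method are fast convergence and simplicity.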
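The schedule can be sketched in plain Python from the docstring formula, decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps). This is only an illustration of the arithmetic, not the TensorFlow op itself, and the sample values (0.1, 100, 0.5) are arbitrary assumptions chosen for readability.

```python
def exponential_decay(learning_rate, global_step, decay_steps, decay_rate,
                      staircase=False):
    """Plain-Python sketch of the exponential decay formula (not the TF op)."""
    if staircase:
        # Integer division keeps the rate constant within each decay period.
        exponent = global_step // decay_steps
    else:
        exponent = global_step / decay_steps
    return learning_rate * decay_rate ** exponent


steps = (0, 50, 100, 200)
smooth = [exponential_decay(0.1, s, 100, 0.5) for s in steps]
stepped = [exponential_decay(0.1, s, 100, 0.5, staircase=True) for s in steps]
print(smooth)
print(stepped)
```

With staircase=True the value at step 50 is still the initial 0.1, while the smooth curve has already dropped; both agree at exact multiples of decay_steps.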

piecewise_constant

piecewise_constant(x, boundaries, values, name=None)

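Piecewise constant decay resembles the staircase variant of exponential_decay, except that you set the value of each stage yourself.

Here x is the global step, boundaries=[step_1, step_2, ..., step_n] defines the steps at which the lr is decayed, and values=[val_0, val_1, val_2, ..., val_n] defines the initial lr and the value used after each decay. Note that values must contain one more element than boundaries.

Figure 2. piecewise_constant example

This method helps users tune precisely for different tasks, lowering the learning rate by any amount after any number of steps.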
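A small plain-Python sketch of the lookup may help. It assumes the boundary convention that a step exactly equal to a boundary still uses the earlier value (values[0] while x <= boundaries[0]), which is how I read the TensorFlow docstring; treat that detail, and the sample boundaries and values, as assumptions.

```python
from bisect import bisect_left


def piecewise_constant(x, boundaries, values):
    """Plain-Python sketch of a piecewise constant schedule (not the TF op)."""
    if len(values) != len(boundaries) + 1:
        raise ValueError("values must have one more element than boundaries")
    # bisect_left counts the boundaries strictly below x, so a step equal to
    # a boundary still selects the value of the earlier stage.
    return values[bisect_left(boundaries, x)]


boundaries = [1000, 2000]
values = [0.1, 0.01, 0.001]
schedule = [piecewise_constant(s, boundaries, values)
            for s in (0, 1000, 1001, 2000, 2500)]
print(schedule)
```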

polynomial_decay

polynomial_decay(learning_rate, global_step, decay_steps, end_learning_rate=0.0001, power=1.0, cycle=False, name=None)

polynomial_decay decays the learning rate according to a polynomial.

It is commonly observed that a monotonically decreasing learning rate, whose degree of change is carefully chosen, results in a better performing model.

This function applies a polynomial decay function to a provided initial `learning_rate` to reach an `end_learning_rate` in the given `decay_steps`.

The decay formula is also given in the function's docstring:

global_step = min(global_step, decay_steps)

decayed_learning_rate = (learning_rate - end_learning_rate) * (1 - global_step / decay_steps) ^ (power) + end_learning_rate

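Figure 3. polynomial_decay example with cycle=False. The red line is power=1, a linear decay; the blue line is power=0.5, a square-root decay; the green line is power=2, a quadratic decay.

The cycle parameter determines whether the lr rises again after it has decayed. Its purpose is to keep the network from oscillating around a local minimum late in training because the lr has become very small: suddenly raising the lr lets training escape a region where no further improvement is possible and explore other regions.

Figure 4. polynomial_decay example with cycle=True, same colors as above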
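The formula and the effect of cycle can be sketched in plain Python. For cycle=True this sketch assumes that decay_steps is stretched to the next multiple that covers global_step, i.e. decay_steps * ceil(global_step / decay_steps), which is my reading of the TensorFlow docstring; the sample numbers are arbitrary.

```python
import math


def polynomial_decay(learning_rate, global_step, decay_steps,
                     end_learning_rate=0.0001, power=1.0, cycle=False):
    """Plain-Python sketch of polynomial decay (not the TF op)."""
    if cycle:
        # Stretch decay_steps to the next multiple that covers global_step,
        # so the rate jumps back up once a decay period has finished.
        multiplier = max(1, math.ceil(global_step / decay_steps))
        decay_steps = decay_steps * multiplier
    else:
        # Without cycling, the rate stays at end_learning_rate afterwards.
        global_step = min(global_step, decay_steps)
    fraction = 1 - global_step / decay_steps
    return ((learning_rate - end_learning_rate) * fraction ** power
            + end_learning_rate)


halfway = polynomial_decay(0.1, 50, 100, end_learning_rate=0.01)
finished = polynomial_decay(0.1, 150, 100, end_learning_rate=0.01)
cycled = polynomial_decay(0.1, 150, 100, end_learning_rate=0.01, cycle=True)
print(halfway, finished, cycled)
```

At step 150 the non-cycling schedule has settled at end_learning_rate, while the cycling one has risen again and is partway through a second, longer decay.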

natural_exp_decay

natural_exp_decay(learning_rate, global_step, decay_steps, decay_rate, staircase=False, name=None)

natural_exp_decay has much the same form as exponential_decay, except that natural exponential decay uses e as the base, with -decay_rate in the exponent.

exponential_decay:

decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)

natural_exp_decay:

decayed_learning_rate = learning_rate * exp(-decay_rate * global_step / decay_steps)

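Figure 5. natural_exp_decay compared with exponential_decay. The red line is natural_exp_decay, the blue line is the staircase curve of natural_exp_decay, and the green line is exponential_decay.

As the figure shows, natural exponential decay falls much faster than exponential_decay when both use the same decay_rate close to 1 (for instance, with decay_rate=0.96 each decay period multiplies the lr by e^-0.96 ≈ 0.38 instead of 0.96), which makes it suitable for networks that converge quickly and are easy to train.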
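The two formulas can be compared directly in plain Python. This is a sketch of the arithmetic only, not the TensorFlow ops, and the sample values are assumptions. It also shows that which schedule falls faster depends on decay_rate: per decay period the factors are e^-decay_rate versus decay_rate.

```python
import math


def exponential_decay(learning_rate, global_step, decay_steps, decay_rate):
    """learning_rate * decay_rate ^ (global_step / decay_steps)."""
    return learning_rate * decay_rate ** (global_step / decay_steps)


def natural_exp_decay(learning_rate, global_step, decay_steps, decay_rate,
                      staircase=False):
    """learning_rate * exp(-decay_rate * global_step / decay_steps)."""
    if staircase:
        progress = global_step // decay_steps
    else:
        progress = global_step / decay_steps
    return learning_rate * math.exp(-decay_rate * progress)


# After one full decay period with a decay_rate close to 1,
# the natural exponential schedule is far lower.
nat_high = natural_exp_decay(0.1, 100, 100, 0.96)
exp_high = exponential_decay(0.1, 100, 100, 0.96)

# With a small decay_rate the ordering flips, because e^-0.5 > 0.5.
nat_low = natural_exp_decay(0.1, 100, 100, 0.5)
exp_low = exponential_decay(0.1, 100, 100, 0.5)
print(nat_high, exp_high, nat_low, exp_low)
```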

inverse_time_decay

inverse_time_decay(learning_rate, global_step, decay_steps, decay_rate, staircase=False, name=None)

inverse_time_decay is inverse-time (reciprocal) decay. Its formula is:

decayed_learning_rate = learning_rate / (1 + decay_rate * global_step / decay_steps)

Figure 6. inverse_time_decay example
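As a quick illustration, the inverse-time formula can be written in plain Python. This is a sketch of the arithmetic rather than the TensorFlow op, with arbitrary sample values.

```python
def inverse_time_decay(learning_rate, global_step, decay_steps, decay_rate,
                       staircase=False):
    """Plain-Python sketch of inverse-time decay (not the TF op)."""
    if staircase:
        # Hold the rate constant within each decay period.
        progress = global_step // decay_steps
    else:
        progress = global_step / decay_steps
    return learning_rate / (1 + decay_rate * progress)


inverse_schedule = [inverse_time_decay(0.1, s, 100, 0.5) for s in (0, 100, 200)]
print(inverse_schedule)
```

Unlike the exponential schedules, the denominator grows only linearly, so the rate keeps shrinking more and more slowly in later periods.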

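The decay schemes above do not differ much; they are mostly based on exponential-style decay. In my view, their weakness is that the lr drops quickly right from the start, which on complex problems may cause training to converge quickly to a local minimum without adequately exploring the surrounding parameter space.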

cosine_decay

cosine_decay(learning_rate, global_step, decay_steps, alpha=0.0, name=None)

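cosine_decay is an lr decay strategy that had been proposed only within the year before this was written; its basic shape is a cosine function. It is implemented based on the paper SGDR: Stochastic Gradient Descent with Warm Restarts (https://arxiv.org/abs/1608.03983 ).

The computation proceeds as follows: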

global_step = min(global_step, decay_steps)

cosine_decay = 0.5 * (1 + cos(pi * global_step / decay_steps))

decayed = (1 - alpha) * cosine_decay + alpha

decayed_learning_rate = learning_rate * decayed

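alpha can be seen as a baseline that guarantees the lr never falls below a certain value, namely alpha * learning_rate. The effect of different alpha values is shown below:

Figure 7. cosine_decay example. The red line uses alpha=0.3 and the blue line uses alpha=0.0.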
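The four computation steps translate almost line for line into plain Python. This is a sketch of the arithmetic, not the TensorFlow op, and the sample values are assumptions; it mainly shows that the schedule bottoms out at alpha * learning_rate.

```python
import math


def cosine_decay(learning_rate, global_step, decay_steps, alpha=0.0):
    """Plain-Python sketch of cosine decay (not the TF op)."""
    global_step = min(global_step, decay_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * global_step / decay_steps))
    decayed = (1 - alpha) * cosine + alpha
    return learning_rate * decayed


with_floor = [cosine_decay(0.1, s, 100, alpha=0.3) for s in (0, 50, 100, 500)]
no_floor = [cosine_decay(0.1, s, 100) for s in (0, 50, 100, 500)]
print(with_floor)
print(no_floor)
```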

cosine_decay_restarts

cosine_decay_restarts(learning_rate, global_step, first_decay_steps, t_mul=2.0, m_mul=1.0, alpha=0.0, name=None)

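cosine_decay_restarts is the cyclic version of cosine_decay. first_decay_steps is the number of steps in the first full decay, t_mul is the factor by which the number of steps in each cycle is multiplied relative to the previous one, and m_mul is the factor by which the initial lr at each restart is multiplied relative to the initial lr of the previous cycle.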
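The restart behaviour can be sketched in plain Python with a simple loop that walks through the cycles. This is an illustrative reimplementation under the parameter meanings described above, not the TensorFlow code (which computes the cycle index in closed form), and the sample values are assumptions.

```python
import math


def cosine_decay_restarts(learning_rate, global_step, first_decay_steps,
                          t_mul=2.0, m_mul=1.0, alpha=0.0):
    """Plain-Python sketch of cosine decay with warm restarts (not the TF op)."""
    period = first_decay_steps
    step_in_cycle = global_step
    restarts = 0
    # Skip over every cycle that has already been completed.
    while step_in_cycle >= period:
        step_in_cycle -= period
        period *= t_mul        # each cycle is t_mul times longer
        restarts += 1
    fraction = step_in_cycle / period
    # Each restart scales the starting rate of the cycle by m_mul.
    cosine = 0.5 * (m_mul ** restarts) * (1 + math.cos(math.pi * fraction))
    decayed = (1 - alpha) * cosine + alpha
    return learning_rate * decayed


restart_schedule = [
    cosine_decay_restarts(1.0, s, 10, t_mul=2.0, m_mul=0.5)
    for s in (0, 5, 10, 20, 30)
]
print(restart_schedule)
```

With first_decay_steps=10 and t_mul=2.0, the cycles span steps 0-9, 10-29 and 30-69, and with m_mul=0.5 each restart begins at half the previous cycle's starting rate.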