Gradients and Automatic Differentiation in TensorFlow: Understanding tf.GradientTape

November丶Chopin

Last modified on 2022-10-29 20:55:14

First published on 2022-10-26 20:22:26

Article link: https://blog.csdn.net/u012762410/article/details/127525359

1. Preface
2. Introduction to automatic differentiation
3. tf.GradientTape

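1. Preface

I used to write static computation graphs with TF v1 and rarely dealt with automatic differentiation (AD) directly, since almost everything could be done with the low-level API. In a recent project I used Keras to implement Triplet Loss, but the loss functions Keras provides take exactly two arguments, y_pred and y_true, which did not fit what I needed. In eager (dynamic graph) mode I had to implement part of the gradient computation myself in a subclass of Model, so I am recording my notes in this post.

The code in this post was run with tf.__version__ == 2.6.2.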

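2. Introduction to automatic differentiation

Reference: Wikipedia, "Automatic differentiation".
Automatic differentiation is a set of techniques for computing the derivative of a function specified by a computer program. It is distinct from symbolic differentiation and numerical differentiation.

Automatic differentiation exploits one fact:

Every program, no matter how complicated, executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, etc.) and elementary functions (exp, log, sin, cos, etc.).

By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed automatically, accurate to working precision, using at most a small constant factor more arithmetic operations than the original program.

Automatic differentiation differs from both symbolic and numerical differentiation. Symbolic differentiation faces the difficulty of converting a computer program into a single mathematical expression and can lead to inefficient code. Numerical differentiation (the method of finite differences) can introduce round-off errors in the discretization process and through cancellation. Both classical methods have trouble computing higher-order derivatives, where complexity and errors increase. Finally, both are slow at computing the partial derivatives of a function with respect to many inputs, which is exactly what gradient-based optimization algorithms need. Automatic differentiation solves all of these problems.

The basic principle of automatic differentiation is to decompose the derivative with the chain rule.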
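To make the chain-rule idea concrete, here is a minimal sketch of forward-mode automatic differentiation with dual numbers in plain Python. The names `Dual`, `dual_sin` and `derivative` are hypothetical and exist only for this illustration. This is not how TensorFlow works internally (tf.GradientTape uses reverse mode); it only shows that derivatives can be carried exactly through elementary operations.

```python
import math


class Dual:
    """A value paired with its derivative with respect to one chosen input."""

    def __init__(self, val, der=0.0):
        self.val = val
        self.der = der

    @staticmethod
    def _lift(other):
        # Treat plain numbers as constants (derivative 0)
        return other if isinstance(other, Dual) else Dual(float(other))

    def __add__(self, other):
        other = Dual._lift(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual._lift(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

    __rmul__ = __mul__


def dual_sin(x):
    # Chain rule for an elementary function: sin(u)' = cos(u) * u'
    return Dual(math.sin(x.val), math.cos(x.val) * x.der)


def derivative(func, x0):
    # Seed the input with derivative 1 and read the derivative of the output
    return func(Dual(x0, 1.0)).der


print(derivative(lambda x: x * x, 3.0))                # d(x^2)/dx at x = 3
print(derivative(lambda x: x * x + dual_sin(x), 0.0))  # 2x + cos(x) at x = 0
```

Unlike a finite-difference estimate, no step size is involved here, so the result is exact up to ordinary floating-point rounding.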

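3. tf.GradientTape

3.1 Basic usage of GradientTape

3.1.1 Overview of gradient computation with GradientTape

To differentiate automatically, TF needs to remember which operations happen, and in what order, during the forward pass. During the backward pass, TF then traverses this list of operations in reverse order to compute the gradients.

TF provides the tf.GradientTape API for automatic differentiation, that is, for computing the gradient of a computation with respect to some inputs (usually tf.Variables). TF records the relevant operations executed inside the tf.GradientTape context onto a "tape", and then uses that tape to compute the gradients of the recorded computation with reverse-mode differentiation.

See also: "Automatic differentiation: forward mode and reverse mode" (Zhihu).

Once the operations are recorded, use GradientTape.gradient(target, sources) to compute the gradient of the target with respect to the sources (for example, the gradient of a loss with respect to the model's variables).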
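The record-then-replay idea can be sketched in a few lines of plain Python. The `MiniTape` and `Node` classes and the `gradient` function below are hypothetical toy code written for this post, not TensorFlow's implementation. They only illustrate how a list of recorded operations can be walked in reverse to accumulate chain-rule products.

```python
class MiniTape:
    """Records each operation in the order it runs (hypothetical toy class)."""

    def __init__(self):
        # List of (output_node, [(input_node, local_gradient), ...])
        self.records = []


class Node:
    """A scalar value whose operations are recorded on a MiniTape."""

    def __init__(self, value, tape):
        self.value = value
        self.tape = tape

    def _record(self, value, inputs):
        out = Node(value, self.tape)
        self.tape.records.append((out, inputs))
        return out

    def __add__(self, other):
        # d(a + b)/da = 1, d(a + b)/db = 1
        return self._record(self.value + other.value,
                            [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        # d(a * b)/da = b, d(a * b)/db = a
        return self._record(self.value * other.value,
                            [(self, other.value), (other, self.value)])


def gradient(tape, target, sources):
    # Backward pass: walk the records in reverse, accumulating chain-rule products
    grads = {id(target): 1.0}
    for out, inputs in reversed(tape.records):
        upstream = grads.get(id(out))
        if upstream is None:
            continue  # this operation does not influence the target
        for node, local in inputs:
            grads[id(node)] = grads.get(id(node), 0.0) + upstream * local
    # Sources that never contributed to the target get None
    return [grads.get(id(s)) for s in sources]


tape = MiniTape()
x = Node(3.0, tape)
unused = Node(5.0, tape)
y = x * x
z = y * y
print(gradient(tape, z, [x, y, unused]))  # dz/dx, dz/dy, and None for `unused`
```

In this toy version a source that never takes part in the computation comes back as None, which mirrors the behaviour tf.GradientTape shows for unconnected sources.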

Gradient computation works on scalars, tensors, and models.

3.1.2 Applying it to scalars

Compute the derivative of $y=x^2$ at $x = 3$.

import numpy as np
import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x**2
# Compute dy/dx after leaving the tape context
dy_dx = tape.gradient(y, x)
print(dy_dx.numpy())  # 6.0

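3.1.3 Applying it to tensors

To get the gradient of the loss with respect to several variables, pass all of them as sources to the gradient method. The tape is flexible about how sources are passed: it accepts any nested combination of lists and dicts and returns the gradients in the same structure.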

w = tf.Variable(np.arange(6).reshape(3,2).astype('f'), name='w')
b = tf.Variable(np.arange(2).astype('f'), name='b')
x = tf.Variable(np.arange(1,4).reshape(1,3).astype('f'), name='x')

with tf.GradientTape(persistent=True) as tape:
    y = x @ w + b
    loss = tf.reduce_mean(y)

[dl_dw, dl_db, dl_dx] = tape.gradient(loss, [w, b, x])
"""
The sources can also be passed as a dict:
my_vars = {'w': w, 'b': b}
tape.gradient(loss, my_vars)
"""

$\triangleright$ Explanation:
For $\mathbf{W}=\begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ w_{31} & w_{32} \end{bmatrix}$, $\mathbf{x}=\begin{bmatrix} x_{1},\,x_{2},\,x_{3} \end{bmatrix}$, $\mathbf{b}=\begin{bmatrix} b_{1},\,b_{2} \end{bmatrix}$,
we have:
$y=xW+b=\begin{bmatrix} w_{11}x_1+w_{21}x_2+w_{31}x_3+b_1,\, w_{12}x_1+w_{22}x_2+w_{32}x_3+b_2\end{bmatrix}$
so the loss $L$, the mean of the two entries of $y$, is:
$L(w,b)={\frac 1 2}(w_{11}x_1+w_{21}x_2+w_{31}x_3+b_1+w_{12}x_1+w_{22}x_2+w_{32}x_3+b_2)$
Hence, with $\mathbf{x}=\begin{bmatrix} 1,\,2,\,3 \end{bmatrix}$ as in the code, $\text{dl\_dw}$ and $\text{dl\_db}$ are:
$\text{dl\_dw}=\begin{bmatrix} {\frac 1 2}x_1 & {\frac 1 2}x_1 \\ {\frac 1 2}x_2 & {\frac 1 2}x_2 \\ {\frac 1 2}x_3 & {\frac 1 2}x_3 \end{bmatrix}=\begin{bmatrix} 0.5 & 0.5 \\ 1.0 & 1.0 \\ 1.5 & 1.5 \end{bmatrix}$
$\text{dl\_db}=\begin{bmatrix}0.5,\, 0.5\end{bmatrix}$
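As a sanity check on the gradients in this example, the sketch below estimates them with central finite differences in plain Python, with no TensorFlow involved. The helper names (`loss`, `central_difference`, `shifted`, `shifted_w`) are hypothetical, and the values of w, b and x are the ones the code above builds with np.arange. Because the loss is linear in each argument, the finite-difference estimate should agree with the analytic gradient up to rounding.

```python
def loss(w, b, x):
    # y = x @ w + b for a length-3 row vector x and a 3x2 matrix w
    y = [sum(x[i] * w[i][j] for i in range(3)) + b[j] for j in range(2)]
    # The loss is the mean of the two entries of y
    return sum(y) / len(y)


def central_difference(f, h=1e-4):
    # f(delta) returns the loss with one entry shifted by delta
    return (f(h) - f(-h)) / (2 * h)


w = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]  # np.arange(6).reshape(3, 2)
b = [0.0, 1.0]                            # np.arange(2)
x = [1.0, 2.0, 3.0]                       # np.arange(1, 4)


def shifted(values, index, delta):
    # Copy a flat list with one entry shifted by delta
    out = list(values)
    out[index] += delta
    return out


def shifted_w(i, j, delta):
    # Copy the matrix w with entry (i, j) shifted by delta
    out = [row[:] for row in w]
    out[i][j] += delta
    return out


dl_dw = [[central_difference(lambda d: loss(shifted_w(i, j, d), b, x))
          for j in range(2)] for i in range(3)]
dl_db = [central_difference(lambda d: loss(w, shifted(b, j, d), x))
         for j in range(2)]
dl_dx = [central_difference(lambda d: loss(w, b, shifted(x, i, d)))
         for i in range(3)]

print([[round(v, 6) for v in row] for row in dl_dw])
print([round(v, 6) for v in dl_db])
print([round(v, 6) for v in dl_dx])
```

The same check also covers dl_dx, which the code computes but the derivation does not spell out: each entry is half the sum of the corresponding row of w.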

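3.1.4 Applying it to a model

It is common to collect tf.Variables into a tf.Module or one of its subclasses (layers.Layer, keras.Model) for checkpointing and exporting. (This is the wording of the official documentation; more on it later.)

In most cases you want the gradients with respect to a model's trainable variables. All subclasses of tf.Module aggregate their variables in the Module.trainable_variables property, and tf.keras.layers.Dense is a subclass of tf.Module, so that property can be used here.
Below is the MRO from the tf.keras.layers.Dense documentation, which confirms that it is a subclass of tf.Module: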

 |  Method resolution order:
 |      Dense
 |      keras.engine.base_layer.Layer
 |      tensorflow.python.module.module.Module
 |      tensorflow.python.training.tracking.tracking.AutoTrackable
 |      tensorflow.python.training.tracking.base.Trackable
 |      keras.utils.version_utils.LayerVersionSelector
 |      builtins.object

layer = tf.keras.layers.Dense(2, activation=lambda x:x)
x = tf.constant([[1., 2., 3.]])

with tf.GradientTape() as tape:
    # Forward pass
    y = layer(x)
    loss = tf.reduce_mean(y)
grad = tape.gradient(loss, layer.trainable_variables)
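Once the gradients are available, a natural next step is to use them to update the variables. The following is a minimal sketch of one plain gradient-descent step on the same Dense layer. The `learning_rate` value and the `compute_loss` helper are assumptions made for illustration; in practice an optimizer from tf.keras.optimizers would usually do the update.

```python
import tensorflow as tf

layer = tf.keras.layers.Dense(2)  # linear activation by default
x = tf.constant([[1., 2., 3.]])
learning_rate = 0.1  # assumed value, chosen only for illustration


def compute_loss():
    # Hypothetical helper: forward pass followed by the mean of the outputs
    return tf.reduce_mean(layer(x))


with tf.GradientTape() as tape:
    loss_before = compute_loss()

grads = tape.gradient(loss_before, layer.trainable_variables)

# Gradients come back in the same order and with the same shapes as the variables
for var, g in zip(layer.trainable_variables, grads):
    print(var.name, g.shape)
    var.assign_sub(learning_rate * g)  # one plain gradient-descent update

loss_after = compute_loss()
print(float(loss_before), float(loss_after))
```

Since the loss here is linear in the kernel and bias, a step against the gradient should lower it.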

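3.2 Controlling what the tape watches

The default behavior is to record all operations after a trainable tf.Variable is accessed. The reasons are:

The tape needs to know which operations to record in the forward pass in order to compute the gradients in the backward pass;
The tape holds references to intermediate outputs, so recording unnecessary operations should be avoided;
The most common use case is computing the gradient of a loss with respect to all of a model's trainable variables.

In TF, values are mostly created with tf.Variable and tf.constant, but the two return different types: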

tf.Variable--------------<tf.Variable 'x0:0' shape=() dtype=float32, numpy=3.0>
tf.constant--------------<tf.Tensor: shape=(), dtype=float32, numpy=3.0>

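The tape does not watch a tf.Tensor by default, so its gradient comes back as None. A tf.Variable is watched automatically only if it was created with trainable=True (the default). The cases are as follows: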

# Trainable tf.Variable: watched, gradient available
x0 = tf.Variable(3.0, name='x0')
# tf.Variable with trainable=False: not watched
x1 = tf.Variable(3.0, name='x1', trainable=False)
# tf.Variable + constant returns a Tensor: not watched
x2 = tf.Variable(2.0, name='x2') + 1.0
# tf.constant creates a Tensor: not watched
x3 = tf.constant(3.0, name='x3')

with tf.GradientTape() as tape:
    y = (x0**2) + (x1**2) + (x2**2) + (x3**2)

grad = tape.gradient(y, [x0, x1, x2, x3])

for g in grad:
    print(g)
"""
Output:
tf.Tensor(6.0, shape=(), dtype=float32)
None
None
None
"""

3.2.1 Using the watch method

To make the tape watch a particular Tensor, call tape.watch on it. Continuing the example above, this time watching x2:

# Trainable tf.Variable: watched, gradient available
x0 = tf.Variable(3.0, name='x0')
# tf.Variable with trainable=False: not watched
x1 = tf.Variable(3.0, name='x1', trainable=False)
# tf.Variable + constant returns a Tensor: not watched unless tape.watch is called
x2 = tf.Variable(2.0, name='x2') + 1.0
# tf.constant creates a Tensor: not watched
x3 = tf.constant(3.0, name='x3')

with tf.GradientTape() as tape:
    tape.watch(x2)
    y = (x0**2) + (x1**2) + (x2**2) + (x3**2)

grad = tape.gradient(y, [x0, x1, x2, x3])

for g in grad:
    print(g)
"""
Output:
tf.Tensor(6.0, shape=(), dtype=float32)
None
tf.Tensor(6.0, shape=(), dtype=float32)
None
"""

As you can see, the gradient of x2 is no longer None.

tape.watched_variables() lists the variables the tape is watching. Note that it only lists tf.Variable objects; watched tf.Tensor objects are not shown.
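A short sketch of that difference: below, x3 is watched explicitly and does receive a gradient, yet only the trainable variable shows up in tape.watched_variables(). The variable names reuse the ones from the example above, and `watched_names` is just a local name for this illustration.

```python
import tensorflow as tf

x0 = tf.Variable(3.0, name='x0')                   # trainable: watched automatically
x1 = tf.Variable(3.0, name='x1', trainable=False)  # not trainable: not watched
x3 = tf.constant(3.0)                              # plain tensor

with tf.GradientTape() as tape:
    tape.watch(x3)  # watched, but it is not a tf.Variable
    y = x0**2 + x1**2 + x3**2

# Only tf.Variable objects are listed here
watched_names = [v.name for v in tape.watched_variables()]
print(watched_names)

grads = tape.gradient(y, [x0, x1, x3])
print(grads)
```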

3.2.2 Using the watch_accessed_variables argument

With watch_accessed_variables=False, the tape does not watch any variable by default:

x0 = tf.Variable(3.0, name='x0')
x1 = tf.Variable(3.0, name='x1')
x2 = tf.Variable(3.0, name='x2')

with tf.GradientTape(watch_accessed_variables=False) as tape:
    y = (x0**2) + (x1**2) + (x2**2)

grad = tape.gradient(y, [x0, x1, x2])
for g in grad:
    print(g)
"""
Output:
None
None
None
"""

Here tape.watched_variables() reports that no variables are being watched. Use tape.watch to pick the variables you want gradients for.
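For instance, a minimal sketch that disables automatic watching and then opts in for a single variable might look like this. The values 3.0 and 10.0 are arbitrary choices for illustration.

```python
import tensorflow as tf

x0 = tf.Variable(3.0, name='x0')
x1 = tf.Variable(10.0, name='x1')

with tf.GradientTape(watch_accessed_variables=False) as tape:
    tape.watch(x1)  # only x1 is recorded
    y = x0**2 + x1**2

# Sources passed as a dict come back as a dict with the same keys
grads = tape.gradient(y, {'x0': x0, 'x1': x1})
print(grads['x0'])          # x0 was not watched
print(grads['x1'].numpy())  # 2 * x1
```

This pattern is handy when only a few variables in a larger computation need gradients, since the tape then keeps fewer intermediate results alive.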

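3.2.3 Gradients with respect to intermediate results

You can also ask for the gradient of the output with respect to an intermediate value computed inside the tape context. Simply pass that intermediate tensor as a source: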

x = tf.constant(3.0)

with tf.GradientTape() as tape:
    tape.watch(x)
    y = x * x
    z = y * y

# dz/dx = 4 * x**3 = 108.0, dz/dy = 2 * y = 18.0
print(tape.gradient(z, [x, y]))

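3.2.4 Gradients of non-scalar targets

A gradient is fundamentally an operation on a scalar. When the target is not a scalar, or several targets are requested at once, the result for each source is:

The gradient of the sum of the targets, or equivalently
The sum of the gradients of each target.

For example, the following code takes the gradients of y0 and y1 separately (persistent=True allows gradient to be called more than once on the same tape):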

x = tf.Variable(2.0)
with tf.GradientTape(persistent=True) as tape:
  y0 = x**2
  y1 = 1 / x

print(tape.gradient(y0, x).numpy())
print(tape.gradient(y1, x).numpy())

"""
Output:
4.0
-0.25
"""

If y0 and y1 are passed together as a single target, the result is the sum of the two gradients:

x = tf.Variable(2.0)
with tf.GradientTape() as tape:
  y0 = x**2
  y1 = 1 / x

print(tape.gradient({'y0': y0, 'y1': y1}, x).numpy())

"""
Output:
3.75
"""

Similarly, if the target is a non-scalar Tensor, the gradient of the sum of its elements is computed:

x = tf.Variable(2.)

with tf.GradientTape() as tape:
    y = x * [3., 4.]

print(tape.gradient(y, x).numpy())
"""
Output:
7.0
"""

x = tf.Variable(2.)
with tf.GradientTape() as tape:
    y = x * [[1.,2.],[3., 4.]]
print(tape.gradient(y, x).numpy())
"""
Output:
10.0
"""

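If you need one derivative per element of the target rather than the summed gradient, GradientTape also offers a jacobian method. The sketch below contrasts the two on the first example. It uses a persistent tape only because both gradient and jacobian are called on it.

```python
import tensorflow as tf

x = tf.Variable(2.0)

with tf.GradientTape(persistent=True) as tape:
    y = x * [3., 4.]

summed = tape.gradient(y, x)       # gradient of y[0] + y[1]
per_element = tape.jacobian(y, x)  # one derivative per element of y
del tape  # release the resources held by the persistent tape

print(summed.numpy())
print(per_element.numpy())
```

The Jacobian has the shape of the target followed by the shape of the source, so here it is a vector with two entries.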
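If the source is a non-scalar Variable, the gradient has the same shape as the source. Because sigmoid is applied element-wise, each entry is the derivative of one output element with respect to its own input element: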

x = tf.Variable([1.,2.,3., 4.])
with tf.GradientTape() as tape:
    y = tf.nn.sigmoid(x)
print(tape.gradient(y, x).numpy())
"""
Output:
[0.19661193 0.10499357 0.04517666 0.01766273]
"""

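The values in the sigmoid example can be cross-checked against the closed form $\sigma'(v)=\sigma(v)(1-\sigma(v))$. The plain-Python sketch below does that. The function names `sigmoid` and `sigmoid_grad` are hypothetical helpers, not TensorFlow APIs.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def sigmoid_grad(v):
    # d sigmoid(v) / dv = sigmoid(v) * (1 - sigmoid(v))
    s = sigmoid(v)
    return s * (1.0 - s)


closed_form = [sigmoid_grad(v) for v in [1., 2., 3., 4.]]
print([round(g, 6) for g in closed_form])
```

These are double-precision values, so they should agree with the float32 output of the tape to roughly six or seven significant digits.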
3.4 Cases where gradient returns None

3.4.1 The target is not connected to the source

x = tf.Variable(2.)
y = tf.Variable(3.)

with tf.GradientTape() as tape:
  z = y * y
print(tape.gradient(z, x))
"""
Output:
None
"""

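However, with z = y * y + 0 * x, the gradient with respect to x is 0 rather than None, because x now takes part in the computation.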
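A quick sketch of that variation, reusing the variables from the example above (`grad_x` is just a local name for the result):

```python
import tensorflow as tf

x = tf.Variable(2.)
y = tf.Variable(3.)

with tf.GradientTape() as tape:
    z = y * y + 0 * x  # x is now connected to z, with a zero coefficient

grad_x = tape.gradient(z, x)
print(grad_x)
```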

3.4.2 The tape does not watch Tensors automatically

The tape watches a tf.Variable automatically, but not a tf.Tensor. For example:

x = tf.Variable(2.0)
print(type(x).__name__)
with tf.GradientTape() as tape:
    y = x+1
print(tape.gradient(y, x))
"""
Output:
ResourceVariable
tf.Tensor(1.0, shape=(), dtype=float32)
"""

x = tf.Variable(2.0)
x = x + 1
print(type(x).__name__)
with tf.GradientTape() as tape:
    y = x+1
print(tape.gradient(y, x))
"""
Output:
EagerTensor
None
"""

Calling tape.watch(x) inside the tape context makes the tape watch the Tensor:

x = tf.Variable(2.0)
x = x + 1
print(type(x).__name__)
with tf.GradientTape() as tape:
    tape.watch(x)
    y = x+1
print(tape.gradient(y, x))
"""
Output:
EagerTensor
tf.Tensor(1.0, shape=(), dtype=float32)
"""

3.4.3 The computation was done outside TF

If the computation leaves TensorFlow, the tape cannot record the gradient path:

x = tf.Variable([[1.0, 2.0],
                 [3.0, 4.0]], dtype=tf.float32)
with tf.GradientTape() as tape:
    x2 = x**2
    # NumPy call: the result is a NumPy array, so the tape loses track here
    y = np.mean(x2, axis=0)
    y = tf.reduce_mean(y, axis=0)
print(tape.gradient(y, x))
"""
Output:
None
"""

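One way to avoid this, sketched below, is to keep the whole computation in TensorFlow ops. Here the NumPy call is swapped for tf.reduce_mean, which computes the same mean but is recorded by the tape.

```python
import tensorflow as tf

x = tf.Variable([[1.0, 2.0],
                 [3.0, 4.0]], dtype=tf.float32)

with tf.GradientTape() as tape:
    x2 = x**2
    y = tf.reduce_mean(x2, axis=0)  # stays inside TensorFlow, so it is recorded
    y = tf.reduce_mean(y, axis=0)

grad = tape.gradient(y, x)
print(grad.numpy())
```

The final y is the mean of the four squared entries, so each gradient entry works out to x / 2.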
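3.4.4 Integers and strings are not differentiable

If the computation path uses integer or string dtypes, no gradient is produced. TF does not cast between types automatically, so check the dtypes yourself before running.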

x = tf.constant(10)
with tf.GradientTape() as g:
    g.watch(x)
    y = x * x
print(g.gradient(y, x))
"""
Output:
WARNING:tensorflow:The dtype of the watched tensor must be floating (e.g. tf.float32), got tf.int32
WARNING:tensorflow:The dtype of the target tensor must be floating (e.g. tf.float32) when calling GradientTape.gradient, got tf.int32
WARNING:tensorflow:The dtype of the source tensor must be floating (e.g. tf.float32) when calling GradientTape.gradient, got tf.int32
None
"""

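A minimal sketch of the fix is to cast the value to a floating dtype before it enters the tape (writing tf.constant(10.0) directly would work just as well):

```python
import tensorflow as tf

x = tf.cast(tf.constant(10), tf.float32)  # explicit cast to a floating dtype

with tf.GradientTape() as g:
    g.watch(x)
    y = x * x

grad = g.gradient(y, x)
print(grad.numpy())  # 2 * x
```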
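3.5 Replacing None gradients with zeros

The unconnected_gradients argument of gradient controls what is returned for a source that is not connected to the target. With tf.UnconnectedGradients.ZERO, zeros with the shape of the source are returned instead of None: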

x = tf.Variable([2., 2.])
y = tf.Variable(3.)

with tf.GradientTape() as tape:
    z = y**2
print(tape.gradient(z, x, unconnected_gradients=tf.UnconnectedGradients.ZERO))
"""
Output:
tf.Tensor([0. 0.], shape=(2,), dtype=float32)
"""