Where exactly in the code is TensorFlow's automatic differentiation implemented?

Latest recommended article published on 2022-07-25 13:21:32

weixin_39996750

Article link: https://blog.csdn.net/weixin_39996750/article/details/111552066

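This article takes a close look at how TensorFlow implements automatic differentiation. Starting from the Optimizer.minimize method used to optimize a loss function, it analyzes how gradients are computed in static graph mode through compute_gradients and the gradients_impl module, and explains in detail what the _GradientsHelper function does. It also describes how the GradientTape class records operations and computes gradients in dynamic graph mode. Using operators such as Log and Sub as examples, it shows how to define custom gradient functions.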

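I recently wrote a blog post on this topic, "Automatic Differentiation", which ends with an overview of how TF does automatic differentiation. The relevant part is pasted below.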

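To understand how automatic differentiation is implemented in TensorFlow, we first need to find where gradients are computed. Since the most common use of gradients is minimizing a loss function, a natural starting point is how the loss gets optimized, that is, the minimize method of the Optimizer class. This method calls compute_gradients to obtain the gradients of the parameters (it then calls apply_gradients to update the parameters with those gradients, which is outside the scope of this article and is skipped here). TensorFlow has two ways of computing gradients: the classic static graph mode, and the newer dynamic graph mode (officially called eager execution). compute_gradients therefore uses different logic for each mode.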

Static graph mode

TensorFlow's classic mode builds a static graph first and then executes it inside a session. In this mode, compute_gradients in turn calls the gradients function in tensorflow.python.ops.gradients_impl:

grads = gradients.gradients(
    loss, var_refs, grad_ys=grad_loss,
    gate_gradients=(gate_gradients == Optimizer.GATE_OP),
    aggregation_method=aggregation_method,
    colocate_gradients_with_ops=colocate_gradients_with_ops)

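Here loss is the tensor that computes the loss value and var_refs is the list of variables. grad_ys holds the initial gradients flowing into each of the ys (the upstream gradients), not the gradients being computed. gate_gradients is a boolean that indicates whether all gradients must be computed before any of them is used; setting it to True avoids race conditions. The gradients function is more general than this call suggests, though. Put simply, it computes the gradients of a set of output tensors ys = [y0, y1, ...] with respect to the input tensors xs = [x0, x1, ...], so that for each xi, grad_i = sum[dy_j/dx_i for y_j in ys]. By default grad_loss is None, in which case grad_ys is initialized to a tensor of ones with the same shape as each y.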
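
The summation rule grad_i = sum[dy_j/dx_i for y_j in ys] can be checked on a tiny example. The following is a minimal pure-Python sketch, not TensorFlow code: the function names outputs, summed_gradients and numeric_gradients are made up for illustration, and the two outputs y0 = x0 * x1 and y1 = x0 + x1 are an arbitrary choice.

```python
def outputs(x0, x1):
    # Two outputs that both depend on both inputs: y0 = x0 * x1, y1 = x0 + x1.
    return [x0 * x1, x0 + x1]


def summed_gradients(x0, x1):
    # Analytic partial derivatives of each output with respect to (x0, x1).
    dy0 = (x1, x0)      # d(x0 * x1)/dx0 = x1, d(x0 * x1)/dx1 = x0
    dy1 = (1.0, 1.0)    # d(x0 + x1)/dx0 = 1,  d(x0 + x1)/dx1 = 1
    # grad_i is the sum over all outputs y_j of dy_j/dx_i.
    return [dy0[0] + dy1[0], dy0[1] + dy1[1]]


def numeric_gradients(x0, x1, eps=1e-6):
    # Central differences on sum(outputs): with grad_ys equal to ones,
    # the summed gradient is the gradient of y0 + y1.
    def total(a, b):
        return sum(outputs(a, b))

    g0 = (total(x0 + eps, x1) - total(x0 - eps, x1)) / (2 * eps)
    g1 = (total(x0, x1 + eps) - total(x0, x1 - eps)) / (2 * eps)
    return [g0, g1]


print(summed_gradients(2.0, 3.0))   # [4.0, 3.0]
```

With grad_ys left at its default of ones, the result is the same as differentiating the sum of all outputs, which is what the numeric check does.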

gradients itself simply delegates to the internal function _GradientsHelper:

@tf_export("gradients")
def gradients(ys,
              xs,
              grad_ys=None,
              name="gradients",
              colocate_gradients_with_ops=False,
              gate_gradients=False,
              aggregation_method=None,
              stop_gradients=None):
    # Creating the gradient graph for control flow mutates Operations.
    # _mutation_lock ensures a Session.run call cannot occur between creating and
    # mutating new ops.
    with ops.get_default_graph()._mutation_lock():  # pylint: disable=protected-access
        return _GradientsHelper(ys, xs, grad_ys, name, colocate_gradients_with_ops,
                                gate_gradients, aggregation_method, stop_gradients)

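This helper maintains two important variables: a queue, queue, which holds every operator in the computation graph whose out-degree is 0, and a dictionary, grads, whose keys are the operators themselves and whose values are the lists of gradients received at each output of that operator.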

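During backpropagation, each time an operator is popped from the queue, the gradients arriving at each of its outputs are summed (this corresponds to the total derivative rule) to give out_grads, and the matching gradient function grad_fn is looked up. The operator op and out_grads are then passed to grad_fn as arguments to compute the gradients with respect to the inputs: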

if grad_fn:
    # If grad_fn was found, do not use SymbolicGradient even for
    # functions.
    in_grads = _MaybeCompile(grad_scope, op, func_call,
                             lambda: grad_fn(op, *out_grads))
else:
    # For function call ops, we add a 'SymbolicGradient'
    # node to the graph to compute gradients.
    in_grads = _MaybeCompile(grad_scope, op, func_call,
                             lambda: _SymGrad(op, out_grads, xs))

(This does seem to suggest, though, that TensorFlow mixes automatic differentiation with symbolic differentiation.)

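Once an operator has been processed, the out-degrees of all operators not yet processed are updated, along with the queue (this is in effect a topological sort). When the queue becomes empty, the whole computation graph has been processed and the gradient of every parameter is available.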
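
The queue-and-dictionary procedure described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions (scalar values, a single output per node); the Node class and the add, mul and backprop helpers are hypothetical names and are not part of TensorFlow.

```python
from collections import deque


class Node:
    """One scalar operation in a toy graph (hypothetical, not a TF class)."""

    def __init__(self, value, inputs=(), grad_fn=None):
        self.value = value
        self.inputs = list(inputs)
        self.grad_fn = grad_fn  # maps (node, out_grad) -> one gradient per input


def add(a, b):
    return Node(a.value + b.value, (a, b), lambda n, g: (g, g))


def mul(a, b):
    return Node(a.value * b.value, (a, b),
                lambda n, g: (g * n.inputs[1].value, g * n.inputs[0].value))


def backprop(y, xs):
    # Count, for every node reachable from y, how many consumer edges still
    # have to deliver a gradient to it (its remaining out-degree).
    pending = {}
    stack, seen = [y], {id(y)}
    while stack:
        node = stack.pop()
        for inp in node.inputs:
            pending[id(inp)] = pending.get(id(inp), 0) + 1
            if id(inp) not in seen:
                seen.add(id(inp))
                stack.append(inp)

    grads = {id(y): [1.0]}   # gradients received so far, keyed by node
    queue = deque([y])       # nodes whose remaining out-degree is zero
    while queue:
        node = queue.popleft()
        out_grad = sum(grads[id(node)])      # aggregate incoming gradients
        grads[id(node)] = [out_grad]
        if node.grad_fn is None:             # leaf node: nothing to propagate
            continue
        for inp, g in zip(node.inputs, node.grad_fn(node, out_grad)):
            grads.setdefault(id(inp), []).append(g)
            pending[id(inp)] -= 1
            if pending[id(inp)] == 0:        # all consumers done: ready to process
                queue.append(inp)
    return [sum(grads.get(id(x), [0.0])) for x in xs]


x = Node(3.0)
w = Node(2.0)
y = add(mul(x, w), mul(x, x))   # y = x*w + x*x
print(backprop(y, [x, w]))      # [8.0, 3.0]
```

A node only enters the queue after every one of its consumers has contributed a gradient, which is why x (used three times here) is processed last and receives the full sum w + 2x.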

The call chain for gradient computation in static graph mode looks roughly like this:

Optimizer.minimize
|---Optimizer.compute_gradients
    |---gradients (gradients_impl.py)
        |---_GradientsHelper (gradients_impl.py)

Gradient functions

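As mentioned above, _GradientsHelper calls a grad_fn function that computes the gradient for a given operator. In TensorFlow every computation graph can be broken down to the operator (op) level, and each operator defines a corresponding gradient function. For example, the gradient of the Log function is defined in python/ops/math_grad.py: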

@ops.RegisterGradient("Log")
def _LogGrad(op, grad):
    """Returns grad * (1/x)."""
    x = op.inputs[0]
    with ops.control_dependencies([grad]):
        x = math_ops.conj(x)
        return grad * math_ops.reciprocal(x)

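What it returns is the product of the incoming gradient and the reciprocal of x, which corresponds to

$$\frac{\partial \log x}{\partial x} = \frac{1}{x}$$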

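Note that every such function is wrapped with the RegisterGradient decorator. For an operator with m inputs and n outputs, the gradient function takes two kinds of arguments: the operator itself, and n tensor objects representing the gradient with respect to each output. It returns m tensor objects representing the gradient with respect to each input.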

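Most operators already have gradient functions provided by the framework, but you can also define your own operations and their gradient functions. Suppose we want to define a Sub operation that takes two inputs x and y and outputs x - y. The function is

$$f(x, y) = x - y$$

Clearly

$$\frac{\partial f}{\partial x} = 1, \qquad \frac{\partial f}{\partial y} = -1$$

so the corresponding code is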

@tf.RegisterGradient("Sub")
def _sub_grad(unused_op, grad):
    return grad, tf.negative(grad)
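
As a sanity check, the (grad, -grad) pair can be compared against finite differences. The snippet below is a plain-Python sketch on scalars rather than tensors; sub, sub_grad and numeric_sub_grad are illustrative names, not TensorFlow functions.

```python
def sub(x, y):
    return x - y


def sub_grad(grad):
    # Mirrors the (grad, -grad) pair returned by the gradient function above.
    return grad, -grad


def numeric_sub_grad(x, y, grad, eps=1e-6):
    # Central differences for df/dx and df/dy, scaled by the upstream gradient.
    dx = (sub(x + eps, y) - sub(x - eps, y)) / (2 * eps)
    dy = (sub(x, y + eps) - sub(x, y - eps)) / (2 * eps)
    return grad * dx, grad * dy


analytic = sub_grad(2.0)
numeric = numeric_sub_grad(5.0, 3.0, 2.0)
print(analytic)   # (2.0, -2.0)
print(all(abs(a - n) < 1e-6 for a, n in zip(analytic, numeric)))
```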

Dynamic graph mode

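In dynamic graph mode TensorFlow does not need a complete computation graph to be defined up front, and every operation can return a concrete value, which makes debugging easier. Below is an example that fits a linear regression in dynamic graph mode (adapted from the official sample code):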

import tensorflow as tf

tf.enable_eager_execution()

NUM_EXAMPLES = 1000
training_inputs = tf.random_normal([NUM_EXAMPLES])
noise = tf.random_normal([NUM_EXAMPLES])
training_outputs = training_inputs * 3 + 2 + noise

def prediction(x, w, b):
    return x * w + b

# A loss function using mean-squared error
def loss(weights, biases):
    error = prediction(training_inputs, weights, biases) - training_outputs
    return tf.reduce_mean(tf.square(error))

train_steps = 200
learning_rate = 0.1
# Start with arbitrary values for W and B on the same batch of data
weight = tf.Variable(5.)
bias = tf.Variable(10.)

optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)

# Report the loss once before training rather than on every iteration
print("Initial loss: {:.3f}".format(loss(weight, bias)))
for i in range(20):
    optimizer.minimize(lambda: loss(weight, bias))
print("Final loss: {:.3f}".format(loss(weight, bias)))
print("W = {}, B = {}".format(weight.numpy(), bias.numpy()))

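Again taking the minimize method of the Optimizer class as the entry point and following it into compute_gradients, we can see that the relevant code for dynamic graph mode is fairly short: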

if callable(loss):
    with backprop.GradientTape() as tape:
        if var_list is not None:
            tape.watch(var_list)
        loss_value = loss()

        # Scale loss if using a "mean" loss reduction and multiple towers.
        # Have to be careful to call distribute_lib.get_loss_reduction()
        # *after* loss() is evaluated, so we know what loss reduction it uses.
        # TODO(josh11b): Test that we handle weight decay in a reasonable way.
        if (distribute_lib.get_loss_reduction() ==
                variable_scope.VariableAggregation.MEAN):
            num_towers = distribution_strategy_context.get_distribution_strategy(
            ).num_towers
            if num_towers > 1:
                loss_value *= (1. / num_towers)

    if var_list is None:
        var_list = tape.watched_variables()
    grads = tape.gradient(loss_value, var_list, grad_loss)
    return list(zip(grads, var_list))

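I once came across an analogy: automatic differentiation works like recording a tape. While the forward pass runs all of its operations, it is in effect recording what is being done. When the recording ends, rewinding and playing it back yields the gradients. TensorFlow follows this analogy too, so the heart of automatic differentiation in dynamic graph mode is an object of the GradientTape ("tape") class, which records the data and computes the gradients.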
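
The tape analogy can be illustrated with a toy recorder in plain Python. This is a sketch of the general idea only, assuming scalar values and just two operations; the Tape and Value classes are hypothetical and do not reflect how TensorFlow's GradientTape is actually implemented.

```python
class Tape:
    """Toy tape: records operations in forward order, replays them in reverse."""

    def __init__(self):
        self.records = []  # (output, inputs, backward_fn), in forward order

    def record(self, out, inputs, backward_fn):
        self.records.append((out, inputs, backward_fn))

    def gradient(self, target, sources):
        grads = {target: 1.0}
        # "Rewind": walk the recording backwards, pushing gradients to inputs.
        for out, inputs, backward_fn in reversed(self.records):
            if out not in grads:
                continue  # this operation did not contribute to the target
            for inp, g in zip(inputs, backward_fn(grads[out])):
                grads[inp] = grads.get(inp, 0.0) + g
        return [grads.get(s, 0.0) for s in sources]


class Value:
    """Scalar that writes every operation applied to it onto a tape."""

    def __init__(self, data, tape):
        self.data = data
        self.tape = tape

    def __mul__(self, other):
        out = Value(self.data * other.data, self.tape)
        self.tape.record(out, (self, other),
                         lambda g: (g * other.data, g * self.data))
        return out

    def __add__(self, other):
        out = Value(self.data + other.data, self.tape)
        self.tape.record(out, (self, other), lambda g: (g, g))
        return out


tape = Tape()
x, w, b = Value(3.0, tape), Value(2.0, tape), Value(1.0, tape)
y = x * w + b        # forward pass: two operations are recorded
loss = y * y         # a third operation is recorded
print(tape.gradient(loss, [w, b]))   # [42.0, 14.0]
```

Because operations are recorded in the order they ran, the reversed recording is already a valid reverse topological order, so no separate sort is needed in this sketch.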

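In the first step of this method, the GradientTape object tape "watches", within its own context, every object that needs to be recorded. By default, objects created with tf.Variable or tf.get_variable() are trainable and are therefore watched (they are placed in watched_variables automatically). The gradient method is then called to compute the gradients for all watched objects. Its core code is: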

flat_grad = imperative_grad.imperative_grad(
    _default_vspace, self._tape, nest.flatten(target), flat_sources,
    output_gradients=output_gradients)

This function eventually calls a ComputeGradient function implemented in C++, whose pseudocode is roughly as follows:

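// Follows the traditional C convention: returns a status code and stores the
// result in the result variable.
// The core idea is still a topological sort of the directed graph: find nodes
// with out-degree 0, aggregate the upstream gradients, compute the downstream ones.
template <typename Gradient, typename BackwardFunction>
Status GradientTape<Gradient, BackwardFunction>::ComputeGradient(
    const VSpace<Gradient, BackwardFunction>& vspace,
    gtl::ArraySlice<int64> target_tensor_ids,
    gtl::ArraySlice<int64> source_tensor_ids,
    gtl::ArraySlice<Gradient*> output_gradients,
    std::vector<Gradient*>* result) {
  // Build the set of source (input) tensors
  gtl::FlatSet<int64> sources_set(source_tensor_ids.begin(),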