[Translation] Autograd mechanics

Reference link: Autograd mechanics


Original text and translation:

Autograd mechanics

This note will present an overview of how autograd works and records the operations. It's not strictly necessary to understand all this, but we recommend getting familiar with it, as it will help you write more efficient, cleaner programs, and can aid you in debugging.

Excluding subgraphs from backward
Every Tensor has a flag, requires_grad, that allows for fine-grained exclusion of subgraphs from gradient computation and can increase efficiency.

requires_grad
If there's a single input to an operation that requires gradient, its output will also require gradient. Conversely, only if all inputs don't require gradient will the output not require it either. Backward computation is never performed in subgraphs where none of the Tensors required gradients.

# Example:
>>> import torch
>>> x = torch.randn(5, 5)  # requires_grad=False by default
>>> y = torch.randn(5, 5)  # requires_grad=False by default
>>> z = torch.randn((5, 5), requires_grad=True)
>>> a = x + y
>>> a.requires_grad
False
>>> b = a + z
>>> b.requires_grad
True


This is especially useful when you want to freeze part of your model, or you know in advance that you're not going to use gradients w.r.t. some parameters. For example, if you want to finetune a pretrained CNN, it's enough to switch the requires_grad flags in the frozen base; no intermediate buffers will be saved until the computation gets to the last layer, where the affine transform will use weights that require gradient, so the output of the network will also require them.

# Example:
import torch.nn as nn
import torch.optim as optim
import torchvision

model = torchvision.models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

# Replace the last fully-connected layer.
# Parameters of newly constructed modules have requires_grad=True by default.
model.fc = nn.Linear(512, 100)

# Optimize only the classifier
optimizer = optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)




How autograd encodes the history

Autograd is a reverse automatic differentiation system. Conceptually, autograd records a graph of all the operations that created the data as you execute them, giving you a directed acyclic graph whose leaves are the input tensors and roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.

Internally, autograd represents this graph as a graph of Function objects (really expressions), which can be apply()'d to compute the result of evaluating the graph. When computing the forward pass, autograd simultaneously performs the requested computations and builds up a graph representing the function that computes the gradient (the .grad_fn attribute of each torch.Tensor is an entry point into this graph). When the forward pass is completed, we evaluate this graph in the backward pass to compute the gradients.
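
As a small illustration of that entry point (a sketch, not part of the original note): every non-leaf Tensor produced while gradients are being tracked carries a .grad_fn node, and its .next_functions attribute links to the nodes for the operations that produced its inputs.

import torch

x = torch.randn(2, 2, requires_grad=True)
y = (x * 2).sum()

print(y.grad_fn)                 # e.g. <SumBackward0 object at 0x...>
print(y.grad_fn.next_functions)  # edges into the graph, here the node for the multiplication

y.backward()                     # evaluate the recorded graph from the root...
print(x.grad)                    # ...down to the leaf: every entry is 2.0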

An important thing to note is that the graph is recreated from scratch at every iteration, and this is exactly what allows for using arbitrary Python control flow statements, which can change the overall shape and size of the graph at every iteration. You don't have to encode all possible paths before you launch the training - what you run is what you differentiate.
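
To make this concrete, here is a minimal sketch (not from the original note) in which ordinary Python control flow decides, per iteration, how many times a layer is applied, so the recorded graph has a different depth every time:

import torch
import torch.nn as nn

layer = nn.Linear(4, 4)

for step in range(3):
    x = torch.randn(1, 4)
    h = x
    n_repeats = int(torch.randint(1, 4, (1,)))  # data-dependent choice made at run time
    for _ in range(n_repeats):                  # the graph grows by one layer per pass
        h = torch.tanh(layer(h))
    loss = h.sum()
    loss.backward()      # differentiates exactly the path that was executed this iteration
    layer.zero_grad()    # clear gradients before the next, differently-shaped graph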



In-place operations with autograd

Supporting in-place operations in autograd is a hard matter, and we discourage their use in most cases. Autograd's aggressive buffer freeing and reuse makes it very efficient, and there are very few occasions when in-place operations actually lower memory usage by any significant amount. Unless you're operating under heavy memory pressure, you might never need to use them.
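
For readers unfamiliar with the terminology, the difference is simply whether a new tensor is allocated; a quick sketch (not from the original note):

import torch

x = torch.ones(3)
y = x.add(1)    # out-of-place: allocates a new tensor, x is left untouched
x.add_(1)       # in-place: mutates x's storage directly (trailing-underscore convention)
print(x, y)     # both print tensor([2., 2., 2.]), but only add_ reused x's memory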

There are two main reasons that limit the applicability of in-place operations:

1. In-place operations can potentially overwrite values required to compute gradients (see the sketch after this list).

2. Every in-place operation actually requires the implementation to rewrite the computational graph. Out-of-place versions simply allocate new objects and keep references to the old graph, while in-place operations require changing the creator of all inputs to the Function representing this operation. This can be tricky, especially if there are many Tensors that reference the same storage (e.g. created by indexing or transposing), and in-place functions will actually raise an error if the storage of modified inputs is referenced by any other Tensor.
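
The following sketch (an assumed illustration, not from the original note) shows the first problem in practice: exp() saves its output for the backward pass, because d/dx exp(x) = exp(x), so modifying that output in place invalidates the saved value and autograd refuses to continue:

import torch

x = torch.randn(3, requires_grad=True)
y = x.exp()      # exp() saves its output y for the backward pass
y.add_(1)        # overwrites the value that was saved

try:
    y.sum().backward()
except RuntimeError as e:
    # autograd detects that a value it needs was modified by an in-place operation
    print(e)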

In-place correctness checks

Every tensor keeps a version counter, which is incremented every time the tensor is marked dirty in any operation. When a Function saves any tensors for backward, the version counter of their containing Tensor is saved as well. Once you access self.saved_tensors it is checked, and if the current counter is greater than the saved value an error is raised. This ensures that if you're using in-place functions and not seeing any errors, you can be sure that the computed gradients are correct.
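
A small sketch of the check described above; note that _version is an internal attribute and is used here only for illustration:

import torch

a = torch.randn(3, requires_grad=True)
b = a * 2            # non-leaf tensor, so in-place ops on it are allowed
c = b.sin()          # sin() saves its input b (its derivative is cos(b))

print(b._version)    # 0 -- the counter before any in-place modification
b.mul_(2)            # marks b dirty and increments its version counter
print(b._version)    # 1

try:
    c.sum().backward()
except RuntimeError as e:
    # the version saved by sin() no longer matches b's current version
    print(e)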