PyTorch-Tutorials [Official PyTorch Tutorial Explained in Detail] - 6 Autograd

This article explains the principles and use of torch.autograd in PyTorch, covering backpropagation, computational graphs, gradient computation, parameter optimization, enabling and disabling gradient tracking, and the notion of Jacobian products, showing how autograd handles the key gradient-computation steps in neural network training and optimization.

The previous article, PyTorch-Tutorials [Official PyTorch Tutorial Explained in Detail] - 5 Build Model, showed how to build a model by subclassing nn.Module to define our own neural network. Next we look at torch.autograd, the differentiation engine behind backpropagation, which computes gradients for us automatically.

Official link: Automatic Differentiation with torch.autograd — PyTorch Tutorials 1.10.1+cu102 documentation

AUTOMATIC DIFFERENTIATION WITH TORCH.AUTOGRAD

When training neural networks, the most frequently used algorithm is back propagation. In this algorithm, parameters (model weights) are adjusted according to the gradient of the loss function with respect to the given parameter.
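
For example, with plain gradient descent (one common choice of update rule), each parameter w is adjusted using its gradient and a learning rate η:

w ← w − η · ∂loss/∂w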

To compute those gradients, PyTorch has a built-in differentiation engine called torch.autograd. It supports automatic computation of gradient for any computational graph.

Consider the simplest one-layer neural network, with input x, parameters w and b, and some loss function. It can be defined in PyTorch in the following manner:

import torch

x = torch.ones(5)  # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)  # weight matrix; autograd will track it
b = torch.randn(3, requires_grad=True)     # bias vector; autograd will track it
z = torch.matmul(x, w)+b                   # logits
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

1 Tensors, Functions and Computational graph

This code defines a computational graph in which x and w feed a matmul operation, b is added to produce z, and z is compared with the expected output y by the binary cross-entropy-with-logits function to produce loss.

In this network, w and b are parameters, which we need to optimize. Thus, we need to be able to compute the gradients of loss function with respect to those variables. In order to do that, we set the requires_grad property of those tensors.

NOTE

  • You can set the value of requires_grad when creating a tensor, or later by using the x.requires_grad_(True) method; both are shown in the short example below.
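
As a small illustration of this note, both ways of enabling gradient tracking look like this (w1 and b1 are throwaway names with arbitrary shapes):

# set requires_grad when the tensor is created
w1 = torch.randn(5, 3, requires_grad=True)

# or turn it on later, in place, for an existing tensor
b1 = torch.randn(3)
b1.requires_grad_(True)

print(w1.requires_grad, b1.requires_grad)  # True True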

A function that we apply to tensors to construct the computational graph is in fact an object of class Function. This object knows how to compute the function in the forward direction, and also how to compute its derivative during the backward propagation step. A reference to the backward propagation function is stored in the grad_fn property of a tensor. You can find more information about Function in the documentation.

print('Gradient function for z =', z.grad_fn)
print('Gradient function for loss =', loss.grad_fn)

Output:

Gradient function for z = <AddBackward0 object at 0x7fdb096facf8>
Gradient function for loss = <BinaryCrossEntropyWithLogitsBackward object at 0x7fdb096fac50>

2 Computing Gradients

To optimize the weights of parameters in the neural network, we need to compute the derivatives of our loss function with respect to the parameters, namely, we need ∂loss/∂w and ∂loss/∂b under some fixed values of x and y. To compute those derivatives, we call loss.backward(), and then retrieve the values from w.grad and b.grad:

loss.backward()
print(w.grad)
print(b.grad)

Output:

tensor([[0.3305, 0.0949, 0.2263],
        [0.3305, 0.0949, 0.2263],
        [0.3305, 0.0949, 0.2263],
        [0.3305, 0.0949, 0.2263],
        [0.3305, 0.0949, 0.2263]])
tensor([0.3305, 0.0949, 0.2263])

NOTE

  • We can only obtain the grad properties for the leaf nodes of the computational graph, which have requires_grad property set to True. For all other nodes in our graph, gradients will not be available.
  • We can only perform gradient calculations using backward once on a given graph, for performance reasons. If we need to do several backward calls on the same graph, we need to pass retain_graph=True to the backward call, as in the example after this note.
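
For example, a minimal sketch of the second point, using a throwaway tensor a built just for this purpose:

a = torch.tensor(3.0, requires_grad=True)
out = a * a
out.backward(retain_graph=True)  # keep the graph so it can be traversed again
out.backward()                   # a second call is now allowed
print(a.grad)                    # tensor(12.) -- 6 from each call, accumulated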

3 Disabling Gradient Tracking

By default, all tensors with requires_grad=True are tracking their computational history and support gradient computation. However, there are some cases when we do not need to do that, for example, when we have trained the model and just want to apply it to some input data, i.e. we only want to do forward computations through the network. We can stop tracking computations by surrounding our computation code with torch.no_grad() block:

z = torch.matmul(x, w)+b
print(z.requires_grad)

with torch.no_grad():
    z = torch.matmul(x, w)+b
print(z.requires_grad)

Output:

True
False

Another way to achieve the same result is to use the detach() method on the tensor:

z = torch.matmul(x, w)+b
z_det = z.detach()
print(z_det.requires_grad)

Output:

False

There are reasons you might want to disable gradient tracking:

  • To mark some parameters in your neural network as frozen parameters. This is a very common scenario for finetuning a pretrained network; a freezing sketch follows this list.
  • To speed up computations when you are only doing the forward pass, because computations on tensors that do not track gradients would be more efficient.
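
As a sketch of the first point, freezing part of a small, purely illustrative model and handing only the still-trainable parameters to an optimizer might look like this:

from torch import nn

model = nn.Sequential(nn.Linear(5, 3), nn.ReLU(), nn.Linear(3, 3))

# freeze the first layer: it keeps its weights but no longer receives gradients
for param in model[0].parameters():
    param.requires_grad_(False)

# pass only the parameters that still require gradients to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.01)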

4 More on Computational Graphs

Conceptually, autograd keeps a record of data (tensors) and all executed operations (along with the resulting new tensors) in a directed acyclic graph (DAG) consisting of Function objects. In this DAG, leaves are the input tensors, roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.

In a forward pass, autograd does two things simultaneously:

  • run the requested operation to compute a resulting tensor
  • maintain the operation’s gradient function in the DAG.

The backward pass kicks off when .backward() is called on the DAG root. autograd then:

  • computes the gradients from each .grad_fn,
  • accumulates them in the respective tensor’s .grad attribute
  • using the chain rule, propagates all the way to the leaf tensors.
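
To see this DAG from Python, you can follow the grad_fn chain of a tensor. next_functions is an internal, undocumented attribute, so treat the following only as an illustration:

z = torch.matmul(x, w) + b
print(z.grad_fn)                 # the Function that will run for z during backward
print(z.grad_fn.next_functions)  # one entry per input: the matmul's backward node and AccumulateGrad for the leaf b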

NOTE

  • DAGs are dynamic in PyTorch. An important thing to note is that the graph is recreated from scratch: after each .backward() call, autograd starts populating a new graph. This is exactly what allows you to use control flow statements in your model; you can change the shape, size and operations at every iteration if needed (see the sketch after this note).
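
A small sketch of what this enables (the function name and shapes are made up for illustration): the recorded graph can differ from one iteration to the next because it follows the Python control flow that actually ran.

def forward(x, w):
    # the branch taken here decides which operations end up in the graph
    if x.sum() > 0:
        return torch.relu(x @ w).sum()
    return (x @ w).sum()

w2 = torch.randn(5, 3, requires_grad=True)
for step in range(2):
    x2 = torch.randn(5)
    loss2 = forward(x2, w2)  # a fresh graph is recorded on every forward pass
    loss2.backward()         # ...and consumed here
    w2.grad = None           # clear accumulated gradients between iterations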

5 Optional Reading: Tensor Gradients and Jacobian Products

In many cases, we have a scalar loss function, and we need to compute the gradient with respect to some parameters. However, there are cases when the output function is an arbitrary tensor. In this case, PyTorch allows you to compute so-called Jacobian product, and not the actual gradient.

For a vector function y = f(x), where x = <x1,...,xn> and y = <y1,...,ym>, the gradient of y with respect to x is given by the Jacobian matrix:

J = ( ∂y1/∂x1  ...  ∂y1/∂xn )
    (   ...    ...    ...   )
    ( ∂ym/∂x1  ...  ∂ym/∂xn )

Here x and y denote vectors.

Instead of computing the Jacobian matrix itself, PyTorch allows you to compute Jacobian Product v^T⋅J for a given input vector v=(v1,…,vm). This is achieved by calling backward with v as an argument. The size of v should be the same as the size of the original tensor, with respect to which we want to compute the product:

inp = torch.eye(5, requires_grad=True)                 # 5x5 identity matrix, a leaf tensor
out = (inp+1).pow(2)                                   # element-wise, so out has the same shape as inp
out.backward(torch.ones_like(inp), retain_graph=True)  # v is all ones; the result lands in inp.grad
print("First call\n", inp.grad)
out.backward(torch.ones_like(inp), retain_graph=True)  # gradients accumulate on top of the first call
print("\nSecond call\n", inp.grad)
inp.grad.zero_()                                       # reset the accumulated gradients
out.backward(torch.ones_like(inp), retain_graph=True)
print("\nCall after zeroing gradients\n", inp.grad)

Output:

First call
 tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.],
        [2., 2., 2., 2., 4.]])

Second call
 tensor([[8., 4., 4., 4., 4.],
        [4., 8., 4., 4., 4.],
        [4., 4., 8., 4., 4.],
        [4., 4., 4., 8., 4.],
        [4., 4., 4., 4., 8.]])

Call after zeroing gradients
 tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.],
        [2., 2., 2., 2., 4.]])

Notice that when we call backward for the second time with the same argument, the value of the gradient is different. This happens because when doing backward propagation, PyTorch accumulates the gradients, i.e. the value of the computed gradients is added to the grad property of all leaf nodes of the computational graph. If you want to compute the proper gradients, you need to zero out the grad property beforehand. In real-life training an optimizer helps us to do this, as in the sketch below.
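
A minimal sketch of that pattern, reusing x, y, w and b from the example above (torch.optim.SGD is just one possible optimizer):

optimizer = torch.optim.SGD([w, b], lr=0.1)

for _ in range(3):
    z = torch.matmul(x, w) + b
    loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
    optimizer.zero_grad()  # clear gradients left over from the previous iteration
    loss.backward()        # populate w.grad and b.grad for this iteration only
    optimizer.step()       # update w and b in place using the fresh gradients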

NOTE

  • Previously we were calling the backward() function without parameters. This is essentially equivalent to calling backward(torch.tensor(1.0)), which is a useful way to compute the gradients in the case of a scalar-valued function, such as the loss during neural network training (a quick check follows).
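
A quick check of this equivalence on a throwaway scalar t:

t = torch.tensor(2.0, requires_grad=True)

s = t ** 2
s.backward()                   # implicit gradient argument of 1.0
print(t.grad)                  # tensor(4.)

t.grad = None                  # reset before the second run
s = t ** 2                     # rebuild the graph (it was freed by the first backward)
s.backward(torch.tensor(1.0))  # explicit gradient argument
print(t.grad)                  # tensor(4.) again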

6 Further Reading

Note: These are my study notes; corrections are welcome if you spot any mistakes! Writing these articles takes effort, so please contact me before reposting.
