Deep Learning with PyTorch: A 60 Minute Blitz

Author: Soumith Chintala

What is PyTorch?

PyTorch is a Python-based scientific computing package serving two broad purposes:

  • A replacement for NumPy to use the power of GPUs and other accelerators.
  • An automatic differentiation library that is useful to implement neural networks.

Goal of this tutorial:

  • Understand PyTorch’s Tensor library and neural networks at a high level.
  • Train a small neural network to classify images

To run the tutorials below, make sure you have the torch, torchvision, and matplotlib packages installed.

Tensors

Tensors are a specialized data structure that are very similar to arrays and matrices. In PyTorch, we use tensors to encode the inputs and outputs of a model, as well as the model’s parameters.

Tensors are similar to NumPy’s ndarrays, except that tensors can run on GPUs or other specialized hardware to accelerate computing. If you’re familiar with ndarrays, you’ll be right at home with the Tensor API. If not, follow along in this quick API walkthrough.

import torch
import numpy as np

Tensor Initialization

Tensors can be initialized in various ways. Take a look at the following examples:

Directly from data

Tensors can be created directly from data. The data type is automatically inferred.

data = [[1, 2], [3, 4]]  # create a tensor directly from data
x_data = torch.tensor(data)

From a NumPy array

Tensors can be created from NumPy arrays (and vice versa - see Bridge with NumPy).

np_array = np.array(data)  # create a tensor from a NumPy array
x_np = torch.from_numpy(np_array)

From another tensor:

The new tensor retains the properties (shape, datatype) of the argument tensor, unless explicitly overridden.

x_ones = torch.ones_like(x_data)  # retains the properties of x_data
print(f"Ones Tensor: \n {x_ones} \n")

x_rand = torch.rand_like(x_data, dtype=torch.float)  # overrides the datatype of x_data
print(f"Random Tensor: \n {x_rand} \n")
Ones Tensor:
 tensor([[1, 1],
        [1, 1]])

Random Tensor:
 tensor([[0.8823, 0.9150],
        [0.3829, 0.9593]])

With random or constant values:

shape is a tuple of tensor dimensions. In the functions below, it determines the dimensionality of the output tensor.

shape = (2, 3,)
rand_tensor = torch.rand(shape)
ones_tensor = torch.ones(shape)
zeros_tensor = torch.zeros(shape)

print(f"Random Tensor: \n {rand_tensor} \n")
print(f"Ones Tensor: \n {ones_tensor} \n")
print(f"Zeros Tensor: \n {zeros_tensor}")
Random Tensor:
 tensor([[0.3904, 0.6009, 0.2566],
        [0.7936, 0.9408, 0.1332]])

Ones Tensor:
 tensor([[1., 1., 1.],
        [1., 1., 1.]])

Zeros Tensor:
 tensor([[0., 0., 0.],
        [0., 0., 0.]])

Tensor Attributes

Tensor attributes describe their shape, datatype, and the device on which they are stored.

tensor = torch.rand(3, 4)

print(f"Shape of tensor: {tensor.shape}") # 形状
print(f"Datatype of tensor: {tensor.dtype}") # 类型
print(f"Device tensor is stored on: {tensor.device}") # 存储张量的设备
Shape of tensor: torch.Size([3, 4])
Datatype of tensor: torch.float32
Device tensor is stored on: cpu

Tensor Operations

Over 100 tensor operations, including transposing, indexing, slicing, mathematical operations, linear algebra, random sampling, and more are comprehensively described here.

Each of them can be run on the GPU (at typically higher speeds than on a CPU). If you’re using Colab, allocate a GPU by going to Edit > Notebook Settings.

# We move our tensor to the GPU if available
if torch.cuda.is_available():
  tensor = tensor.to('cuda')
  print(f"Device tensor is stored on: {tensor.device}")
Device tensor is stored on: cuda:0

Try out some of the operations from the list. If you’re familiar with the NumPy API, you’ll find the Tensor API a breeze to use.

Standard numpy-like indexing and slicing:

tensor = torch.ones(4, 4)
tensor[:,1] = 0
print(tensor)
tensor([[1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.]])

Joining tensors You can use torch.cat to concatenate a sequence of tensors along a given dimension. See also torch.stack, another tensor joining op that is subtly different from torch.cat.

t1 = torch.cat([tensor, tensor, tensor], dim=1)
print(t1)
tensor([[1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1.],
        [1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1.],
        [1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1.],
        [1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1.]])
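For comparison, here is a minimal sketch (not part of the original walkthrough) showing how torch.stack differs: it joins the same tensors along a new dimension instead of an existing one.

t2 = torch.stack([tensor, tensor, tensor], dim=0)
print(t2.shape)  # torch.Size([3, 4, 4]) -- a new leading dimension is created
print(t1.shape)  # torch.Size([4, 12])  -- torch.cat grew an existing dimension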

Multiplying tensors

# This computes the element-wise product
print(f"tensor.mul(tensor) \n {tensor.mul(tensor)} \n")
# Alternative syntax:
print(f"tensor * tensor \n {tensor * tensor}")
tensor.mul(tensor)
 tensor([[1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.]])

tensor * tensor
 tensor([[1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.]])

This computes the matrix multiplication between two tensors

print(f"tensor.matmul(tensor.T) \n {tensor.matmul(tensor.T)} \n")
# Alternative syntax:
print(f"tensor @ tensor.T \n {tensor @ tensor.T}")
tensor.matmul(tensor.T)
 tensor([[3., 3., 3., 3.],
        [3., 3., 3., 3.],
        [3., 3., 3., 3.],
        [3., 3., 3., 3.]])

tensor @ tensor.T
 tensor([[3., 3., 3., 3.],
        [3., 3., 3., 3.],
        [3., 3., 3., 3.],
        [3., 3., 3., 3.]])

In-place operations Operations that have a _ suffix are in-place. For example: x.copy_(y), x.t_(), will change x.

print(tensor, "\n")
tensor.add_(5)
print(tensor)
tensor([[1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.]])

tensor([[6., 5., 6., 6.],
        [6., 5., 6., 6.],
        [6., 5., 6., 6.],
        [6., 5., 6., 6.]])

NOTE

In-place operations save some memory, but can be problematic when computing derivatives because of an immediate loss of history. Hence, their use is discouraged.

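As a small illustration of why (a sketch added here, not from the original tutorial): an in-place update on a leaf tensor that requires gradients is rejected outright, because autograd would lose the value it needs for the backward pass.

a = torch.ones(2, requires_grad=True)
# a.add_(1)  # uncommenting this raises a RuntimeError along the lines of
#            # "a leaf Variable that requires grad is being used in an in-place operation"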

Bridge with NumPy

Tensors on the CPU and NumPy arrays can share their underlying memory locations, and changing one will change the other.

Tensor to NumPy array

t = torch.ones(5)
print(f"t: {t}")
n = t.numpy()
print(f"n: {n}")
t: tensor([1., 1., 1., 1., 1.])
n: [1. 1. 1. 1. 1.]

A change in the tensor reflects in the NumPy array.

t.add_(1)
print(f"t: {t}")
print(f"n: {n}")
t: tensor([2., 2., 2., 2., 2.])
n: [2. 2. 2. 2. 2.]

NumPy array to Tensor

n = np.ones(5)
t = torch.from_numpy(n)

Changes in the NumPy array reflects in the tensor.

np.add(n, 1, out=n)
print(f"t: {t}")
print(f"n: {n}")
t: tensor([2., 2., 2., 2., 2.], dtype=torch.float64)
n: [2. 2. 2. 2. 2.]

Total running time of the script: ( 0 minutes 0.131 seconds)

A Gentle Introduction to torch.autograd

torch.autograd is PyTorch’s automatic differentiation engine that powers neural network training. In this section, you will get a conceptual understanding of how autograd helps a neural network train.

Background

Neural networks (NNs) are a collection of nested functions that are executed on some input data. These functions are defined by parameters (consisting of weights and biases), which in PyTorch are stored in tensors.

Training a NN happens in two steps:

Forward Propagation: In forward prop, the NN makes its best guess about the correct output. It runs the input data through each of its functions to make this guess.

Backward Propagation: In backprop, the NN adjusts its parameters proportionate to the error in its guess. It does this by traversing backwards from the output, collecting the derivatives of the error with respect to the parameters of the functions (gradients), and optimizing the parameters using gradient descent. For a more detailed walkthrough of backprop, check out this video from 3Blue1Brown.

Usage in PyTorch

Let’s take a look at a single training step. For this example, we load a pretrained resnet18 model from torchvision. We create a random data tensor to represent a single image with 3 channels, and height & width of 64, and its corresponding label initialized to some random values. Label in pretrained models has shape (1,1000).

NOTE

This tutorial works only on the CPU and will not work on GPU devices (even if tensors are moved to CUDA).

import torch
from torchvision.models import resnet18, ResNet18_Weights
model = resnet18(weights=ResNet18_Weights.DEFAULT)  # load a pretrained resnet18 model
data = torch.rand(1, 3, 64, 64)  # a random image with 3 channels and height & width of 64
labels = torch.rand(1, 1000)     # random values for its label, shape (1, 1000)
Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /root/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth

100%|##########| 44.7M/44.7M [00:00<00:00, 146MB/s]

Next, we run the input data through the model through each of its layers to make a prediction. This is the forward pass.

prediction = model(data)  # forward pass

We use the model’s prediction and the corresponding label to calculate the error (loss). The next step is to backpropagate this error through the network. Backward propagation is kicked off when we call .backward() on the error tensor. Autograd then calculates and stores the gradients for each model parameter in the parameter’s .grad attribute.

loss = (prediction - labels).sum()
loss.backward()  # backward pass
# Backward propagation is kicked off when we call .backward() on the error tensor.
# Autograd then calculates and stores the gradients for each model parameter
# in the parameter's .grad attribute.

Next, we load an optimizer, in this case SGD with a learning rate of 0.01 and momentum of 0.9. We register all the parameters of the model in the optimizer.

optim = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

Finally, we call .step() to initiate gradient descent. The optimizer adjusts each parameter by its gradient stored in .grad.

optim.step()  # initiate gradient descent; the optimizer adjusts each parameter by the gradient stored in .grad

At this point, you have everything you need to train your neural network. The below sections detail the workings of autograd - feel free to skip them.

Differentiation in Autograd

Let’s take a look at how autograd collects gradients. We create two tensors a and b with requires_grad=True. This signals to autograd that every operation on them should be tracked.

import torch

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

We create another tensor Q from a and b.

$$Q = 3a^3 - b^2$$

Q = 3*a**3 - b**2

Let’s assume a and b to be parameters of an NN, and Q to be the error. In NN training, we want gradients of the error w.r.t. parameters, i.e.

$$\frac{\partial Q}{\partial a} = 9a^2$$

$$\frac{\partial Q}{\partial b} = -2b$$

When we call .backward() on Q, autograd calculates these gradients and stores them in the respective tensors’ .grad attribute.

We need to explicitly pass a gradient argument in Q.backward() because it is a vector. gradient is a tensor of the same shape as Q, and it represents the gradient of Q w.r.t. itself, i.e.

$$\frac{dQ}{dQ} = 1$$

Equivalently, we can also aggregate Q into a scalar and call backward implicitly, like Q.sum().backward().

external_grad = torch.tensor([1., 1.])
Q.backward(gradient=external_grad)

Gradients are now deposited in a.grad and b.grad

# check if the collected gradients are correct
print(9*a**2 == a.grad)
print(-2*b == b.grad)
tensor([True, True])
tensor([True, True])

Optional Reading - Vector Calculus using autograd

Mathematically, if you have a vector valued function $\vec{y} = f(\vec{x})$, then the gradient of $\vec{y}$ with respect to $\vec{x}$ is a Jacobian matrix $J$:

$$J = \begin{pmatrix} \frac{\partial \vec{y}}{\partial x_1} & \cdots & \frac{\partial \vec{y}}{\partial x_n} \end{pmatrix} = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{pmatrix}$$

Generally speaking, torch.autograd is an engine for computing vector-Jacobian products. That is, given any vector $\vec{v}$, it computes the product $J^{T} \cdot \vec{v}$.

If $\vec{v}$ happens to be the gradient of a scalar function $l = g(\vec{y})$:

$$\vec{v} = \begin{pmatrix} \frac{\partial l}{\partial y_1} & \cdots & \frac{\partial l}{\partial y_m} \end{pmatrix}^{T}$$

then by the chain rule, the vector-Jacobian product would be the gradient of $l$ with respect to $\vec{x}$:

$$J^{T} \cdot \vec{v} = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_1}{\partial x_n} & \cdots & \frac{\partial y_m}{\partial x_n} \end{pmatrix} \begin{pmatrix} \frac{\partial l}{\partial y_1} \\ \vdots \\ \frac{\partial l}{\partial y_m} \end{pmatrix} = \begin{pmatrix} \frac{\partial l}{\partial x_1} \\ \vdots \\ \frac{\partial l}{\partial x_n} \end{pmatrix}$$

This characteristic of the vector-Jacobian product is what we use in the above example; external_grad represents $\vec{v}$.
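As a quick numerical check of the vector-Jacobian product (a small sketch, not part of the original example): for $\vec{y} = 2\vec{x}$ the Jacobian is $2I$, so calling backward with a vector $\vec{v}$ should deposit $2\vec{v}$ in x.grad.

x = torch.randn(3, requires_grad=True)
y = 2 * x                           # vector-valued output, J = 2I
v = torch.tensor([1.0, 0.1, 0.01])  # the vector v
y.backward(gradient=v)              # computes J.T @ v
print(x.grad)                       # tensor([2.0000, 0.2000, 0.0200])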

Computational Graph

Conceptually, autograd keeps a record of data (tensors) & all executed operations (along with the resulting new tensors) in a directed acyclic graph (DAG) consisting of Function objects. In this DAG, leaves are the input tensors, roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.

In a forward pass, autograd does two things simultaneously:

  • run the requested operation to compute a resulting tensor, and
  • maintain the operation’s gradient function in the DAG.

The backward pass kicks off when .backward() is called on the DAG root. autograd then:

  • computes the gradients from each .grad_fn,
  • accumulates them in the respective tensor’s .grad attribute, and
  • using the chain rule, propagates all the way to the leaf tensors.

Below is a visual representation of the DAG in our example. In the graph, the arrows are in the direction of the forward pass. The nodes represent the backward functions of each operation in the forward pass. The leaf nodes in blue represent our leaf tensors a and b.

../../_images/dag_autograd.png

NOTE

DAGs are dynamic in PyTorch An important thing to note is that the graph is recreated from scratch; after each .backward() call, autograd starts populating a new graph. This is exactly what allows you to use control flow statements in your model; you can change the shape, size and operations at every iteration if needed.

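A small sketch of what this enables (the function below is illustrative, not from the tutorial): the forward computation can branch on the data, and autograd simply records whichever operations actually ran on this call.

def forward(x):
    # the graph is rebuilt on every call, so data-dependent control flow is fine
    if x.sum() > 0:
        return (2 * x).sum()
    return (3 * x).sum()

x = torch.randn(3, requires_grad=True)
forward(x).backward()
print(x.grad)  # all 2s or all 3s, depending on which branch ran this time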

Exclusion from the DAG

torch.autograd tracks operations on all tensors which have their requires_grad flag set to True. For tensors that don’t require gradients, setting this attribute to False excludes it from the gradient computation DAG.

The output tensor of an operation will require gradients even if only a single input tensor has requires_grad=True.

x = torch.rand(5, 5)
y = torch.rand(5, 5)
z = torch.rand((5, 5), requires_grad=True)

a = x + y
print(f"Does `a` require gradients? : {a.requires_grad}")
b = x + z
print(f"Does `b` require gradients?: {b.requires_grad}")
Does `a` require gradients? : False
Does `b` require gradients?: True

In a NN, parameters that don’t compute gradients are usually called frozen parameters. It is useful to “freeze” part of your model if you know in advance that you won’t need the gradients of those parameters (this offers some performance benefits by reducing autograd computations).

In finetuning, we freeze most of the model and typically only modify the classifier layers to make predictions on new labels. Let’s walk through a small example to demonstrate this. As before, we load a pretrained resnet18 model, and freeze all the parameters.

from torch import nn, optim

model = resnet18(weights=ResNet18_Weights.DEFAULT)

# Freeze all the parameters in the network
for param in model.parameters():
    param.requires_grad = False

Let’s say we want to finetune the model on a new dataset with 10 labels. In resnet, the classifier is the last linear layer model.fc. We can simply replace it with a new linear layer (unfrozen by default) that acts as our classifier.

model.fc = nn.Linear(512, 10)

Now all parameters in the model, except the parameters of model.fc, are frozen. The only parameters that compute gradients are the weights and bias of model.fc.

# Optimize only the classifier
optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

Notice although we register all the parameters in the optimizer, the only parameters that are computing gradients (and hence updated in gradient descent) are the weights and bias of the classifier.

The same exclusionary functionality is available as a context manager in torch.no_grad()

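For example (a minimal sketch): operations performed inside the torch.no_grad() context are not tracked, so their outputs do not require gradients.

x = torch.ones(3, requires_grad=True)
with torch.no_grad():
    y = x * 2
print(y.requires_grad)  # False: the multiplication was not recorded in the DAG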


Further readings:

Total running time of the script: ( 0 minutes 0.850 seconds)

Neural Networks

Neural networks can be constructed using the torch.nn package.

Now that you had a glimpse of autograd, nn depends on autograd to define models and differentiate them. An nn.Module contains layers, and a method forward(input) that returns the output.

For example, look at this network that classifies digit images:

convnet

It is a simple feed-forward network. It takes the input, feeds it through several layers one after the other, and then finally gives the output.

A typical training procedure for a neural network is as follows:

  • Define the neural network that has some learnable parameters (or weights)
  • Iterate over a dataset of inputs
  • Process input through the network
  • Compute the loss (how far is the output from being correct)
  • Propagate gradients back into the network’s parameters
  • Update the weights of the network, typically using a simple update rule: weight = weight - learning_rate * gradient

Define the network

Let’s define this network:

import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 5x5 square convolution kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 5 * 5, 120)  # 5*5 from image dimension
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square, you can specify it with a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = torch.flatten(x, 1)  # flatten all dimensions except the batch dimension
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net()
print(net)
Net(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)

You just have to define the forward function, and the backward function (where gradients are computed) is automatically defined for you using autograd. You can use any of the Tensor operations in the forward function.

The learnable parameters of a model are returned by net.parameters()

params = list(net.parameters())
print(len(params))
print(params[0].size())  # conv1's .weight
10
torch.Size([6, 1, 5, 5])

Let’s try a random 32x32 input. Note: expected input size of this net (LeNet) is 32x32. To use this net on the MNIST dataset, please resize the images from the dataset to 32x32.

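One way to do that resizing with torchvision transforms (a sketch under the assumption that you use the standard torchvision MNIST loader; the dataset line is commented out to avoid an unintended download):

from torchvision import datasets, transforms

mnist_transform = transforms.Compose([
    transforms.Resize((32, 32)),  # LeNet expects 32x32 inputs
    transforms.ToTensor(),
])
# mnist_train = datasets.MNIST(root='./data', train=True, download=True,
#                              transform=mnist_transform)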

input = torch.randn(1, 1, 32, 32)
out = net(input)
print(out)
tensor([[ 0.1453, -0.0590, -0.0065,  0.0905,  0.0146, -0.0805, -0.1211, -0.0394,
         -0.0181, -0.0136]], grad_fn=<AddmmBackward0>)

Zero the gradient buffers of all parameters and backprops with random gradients:

net.zero_grad()
out.backward(torch.randn(1, 10))

NOTE

torch.nn only supports mini-batches. The entire torch.nn package only supports inputs that are a mini-batch of samples, and not a single sample.

For example, nn.Conv2d will take in a 4D Tensor of nSamples x nChannels x Height x Width.

If you have a single sample, just use input.unsqueeze(0) to add a fake batch dimension.

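For example (a small sketch using the net defined above):

single_image = torch.randn(1, 32, 32)  # C x H x W, no batch dimension
batch = single_image.unsqueeze(0)      # 1 x C x H x W, a fake batch of size 1
out = net(batch)
print(out.shape)                       # torch.Size([1, 10])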

Before proceeding further, let’s recap all the classes you’ve seen so far.

Recap:

  • torch.Tensor - A multi-dimensional array with support for autograd operations like backward(). Also holds the gradient w.r.t. the tensor.
  • nn.Module - Neural network module. Convenient way of encapsulating parameters, with helpers for moving them to GPU, exporting, loading, etc.
  • nn.Parameter - A kind of Tensor, that is automatically registered as a parameter when assigned as an attribute to a Module.
  • autograd.Function - Implements forward and backward definitions of an autograd operation. Every Tensor operation creates at least a single Function node that connects to functions that created a Tensor and encodes its history.

At this point, we covered:

  • Defining a neural network
  • Processing inputs and calling backward

Still Left:

  • Computing the loss
  • Updating the weights of the network

Loss Function

A loss function takes the (output, target) pair of inputs, and computes a value that estimates how far away the output is from the target.

There are several different loss functions under the nn package . A simple loss is: nn.MSELoss which computes the mean-squared error between the output and the target.

For example:

output = net(input)
target = torch.randn(10)     # a dummy target, for example
target = target.view(1, -1)  # make it the same shape as the output
criterion = nn.MSELoss()

loss = criterion(output, target)
print(loss)
tensor(1.3619, grad_fn=<MseLossBackward0>)

Now, if you follow loss in the backward direction, using its .grad_fn attribute, you will see a graph of computations that looks like this:

input -> conv2d -> relu -> maxpool2d -> conv2d -> relu -> maxpool2d
      -> flatten -> linear -> relu -> linear -> relu -> linear
      -> MSELoss
      -> loss

So, when we call loss.backward(), the whole graph is differentiated w.r.t. the neural net parameters, and all Tensors in the graph that have requires_grad=True will have their .grad Tensor accumulated with the gradient.

For illustration, let us follow a few steps backward:

print(loss.grad_fn)  # MSELoss
print(loss.grad_fn.next_functions[0][0])  # Linear
print(loss.grad_fn.next_functions[0][0].next_functions[0][0])  # ReLU
<MseLossBackward0 object at 0x7f0f0dc3a3e0>
<AddmmBackward0 object at 0x7f0f0dc39b40>
<AccumulateGrad object at 0x7f0eeb96a200>

Backprop

To backpropagate the error all we have to do is to loss.backward(). You need to clear the existing gradients though, else gradients will be accumulated to existing gradients.

Now we shall call loss.backward(), and have a look at conv1’s bias gradients before and after the backward.

net.zero_grad()     # zeroes the gradient buffers of all parameters

print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)

loss.backward()

print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)
conv1.bias.grad before backward
None
conv1.bias.grad after backward
tensor([ 0.0081, -0.0080, -0.0039,  0.0150,  0.0003, -0.0105])

Now, we have seen how to use loss functions.

Read Later:

The neural network package contains various modules and loss functions that form the building blocks of deep neural networks. A full list with documentation is here.

The only thing left to learn is:

  • Updating the weights of the network

Update the weights

The simplest update rule used in practice is the Stochastic Gradient Descent (SGD):

weight = weight - learning_rate * gradient

We can implement this using simple Python code:

learning_rate = 0.01
for f in net.parameters():
    f.data.sub_(f.grad.data * learning_rate)

However, as you use neural networks, you want to use various different update rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc. To enable this, we built a small package: torch.optim that implements all these methods. Using it is very simple:

import torch.optim as optim

# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

# in your training loop:
optimizer.zero_grad()   # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step()    # does the update

NOTE

Observe how gradient buffers had to be manually set to zero using optimizer.zero_grad(). This is because gradients are accumulated as explained in the Backprop section.

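A tiny sketch of that accumulation behaviour (not from the tutorial): calling backward twice without zeroing in between adds the new gradients onto the old ones.

w = torch.ones(2, requires_grad=True)
(2 * w).sum().backward()
print(w.grad)   # tensor([2., 2.])
(2 * w).sum().backward()
print(w.grad)   # tensor([4., 4.]) -- accumulated, not overwritten
w.grad.zero_()  # what optimizer.zero_grad() does for each registered parameter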

Total running time of the script: ( 0 minutes 1.909 seconds)

Training a Classifier

This is it. You have seen how to define neural networks, compute loss and make updates to the weights of the network.

Now you might be thinking,

What about data?

Generally, when you have to deal with image, text, audio or video data, you can use standard python packages that load data into a numpy array. Then you can convert this array into a torch.*Tensor.

  • For images, packages such as Pillow, OpenCV are useful
  • For audio, packages such as scipy and librosa
  • For text, either raw Python or Cython based loading, or NLTK and SpaCy are useful

Specifically for vision, we have created a package called torchvision, that has data loaders for common datasets such as ImageNet, CIFAR10, MNIST, etc. and data transformers for images, viz., torchvision.datasets and torch.utils.data.DataLoader.

This provides a huge convenience and avoids writing boilerplate code.

For this tutorial, we will use the CIFAR10 dataset. It has the classes: ‘airplane’, ‘automobile’, ‘bird’, ‘cat’, ‘deer’, ‘dog’, ‘frog’, ‘horse’, ‘ship’, ‘truck’. The images in CIFAR-10 are of size 3x32x32, i.e. 3-channel color images of 32x32 pixels in size.

cifar10

Training an image classifier

We will do the following steps in order:

  1. Load and normalize the CIFAR10 training and test datasets using torchvision
  2. Define a Convolutional Neural Network
  3. Define a loss function
  4. Train the network on the training data
  5. Test the network on the test data

1. Load and normalize CIFAR10

Using torchvision, it’s extremely easy to load CIFAR10.

import torch
import torchvision
import torchvision.transforms as transforms

The output of torchvision datasets are PILImage images of range [0, 1]. We transform them to Tensors of normalized range [-1, 1].

NOTE

If running on Windows and you get a BrokenPipeError, try setting the num_worker of torch.utils.data.DataLoader() to 0.

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

batch_size = 4

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz

100%|##########| 170498071/170498071 [00:01<00:00, 105794870.01it/s]
Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified

Let us show some of the training images, for fun.

import matplotlib.pyplot as plt
import numpy as np

# function to show an image

def imshow(img):
    img = img / 2 + 0.5     # unnormalize
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.show()


# get some random training images
dataiter = iter(trainloader)
images, labels = next(dataiter)

# show images
imshow(torchvision.utils.make_grid(images))
# print labels
print(' '.join(f'{classes[labels[j]]:5s}' for j in range(batch_size)))

cifar10 tutorial

frog  plane deer  car

2. Define a Convolutional Neural Network

Copy the neural network from the Neural Networks section before and modify it to take 3-channel images (instead of 1-channel images as it was defined).

import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net()

3. Define a Loss function and optimizer

Let’s use a Classification Cross-Entropy loss and SGD with momentum.

import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

4. Train the network

This is when things start to get interesting. We simply have to loop over our data iterator, and feed the inputs to the network and optimize.

for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}')
            running_loss = 0.0

print('Finished Training')
[1,  2000] loss: 2.144
[1,  4000] loss: 1.835
[1,  6000] loss: 1.677
[1,  8000] loss: 1.573
[1, 10000] loss: 1.526
[1, 12000] loss: 1.447
[2,  2000] loss: 1.405
[2,  4000] loss: 1.363
[2,  6000] loss: 1.341
[2,  8000] loss: 1.340
[2, 10000] loss: 1.315
[2, 12000] loss: 1.281
Finished Training

Let’s quickly save our trained model:

PATH = './cifar_net.pth'
torch.save(net.state_dict(), PATH)

See here for more details on saving PyTorch models.

5. Test the network on the test data

We have trained the network for 2 passes over the training dataset. But we need to check if the network has learnt anything at all.

We will check this by predicting the class label that the neural network outputs, and checking it against the ground-truth. If the prediction is correct, we add the sample to the list of correct predictions.

Okay, first step. Let us display an image from the test set to get familiar.

dataiter = iter(testloader)
images, labels = next(dataiter)

# print images
imshow(torchvision.utils.make_grid(images))
print('GroundTruth: ', ' '.join(f'{classes[labels[j]]:5s}' for j in range(4)))

cifar10 tutorial

GroundTruth:  cat   ship  ship  plane

Next, let’s load back in our saved model (note: saving and re-loading the model wasn’t necessary here, we only did it to illustrate how to do so):

net = Net()
net.load_state_dict(torch.load(PATH))
<All keys matched successfully>

Okay, now let us see what the neural network thinks these examples above are:

outputs = net(images)

The outputs are energies for the 10 classes. The higher the energy for a class, the more the network thinks that the image is of the particular class. So, let’s get the index of the highest energy:

_, predicted = torch.max(outputs, 1)

print('Predicted: ', ' '.join(f'{classes[predicted[j]]:5s}'
                              for j in range(4)))
Predicted:  cat   ship  truck ship

The results seem pretty good.

Let us look at how the network performs on the whole dataset.

correct = 0
total = 0
# since we're not training, we don't need to calculate the gradients for our outputs
with torch.no_grad():
    for data in testloader:
        images, labels = data
        # calculate outputs by running images through the network
        outputs = net(images)
        # the class with the highest energy is what we choose as prediction
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f'Accuracy of the network on the 10000 test images: {100 * correct // total} %')
Accuracy of the network on the 10000 test images: 54 %

That looks way better than chance, which is 10% accuracy (randomly picking a class out of 10 classes). Seems like the network learnt something.

Hmmm, what are the classes that performed well, and the classes that did not perform well:

# prepare to count predictions for each class
correct_pred = {classname: 0 for classname in classes}
total_pred = {classname: 0 for classname in classes}

# again no gradients needed
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predictions = torch.max(outputs, 1)
        # collect the correct predictions for each class
        for label, prediction in zip(labels, predictions):
            if label == prediction:
                correct_pred[classes[label]] += 1
            total_pred[classes[label]] += 1


# print accuracy for each class
for classname, correct_count in correct_pred.items():
    accuracy = 100 * float(correct_count) / total_pred[classname]
    print(f'Accuracy for class: {classname:5s} is {accuracy:.1f} %')
Accuracy for class: plane is 37.9 %
Accuracy for class: car   is 62.2 %
Accuracy for class: bird  is 45.6 %
Accuracy for class: cat   is 29.2 %
Accuracy for class: deer  is 50.3 %
Accuracy for class: dog   is 45.9 %
Accuracy for class: frog  is 60.1 %
Accuracy for class: horse is 70.3 %
Accuracy for class: ship  is 82.9 %
Accuracy for class: truck is 63.1 %

Okay, so what next?

How do we run these neural networks on the GPU?

Training on GPU

Just like how you transfer a Tensor onto the GPU, you transfer the neural net onto the GPU.

Let’s first define our device as the first visible cuda device if we have CUDA available:

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Assuming that we are on a CUDA machine, this should print a CUDA device

print(device)
cuda:0

The rest of this section assumes that device is a CUDA device.

Then these methods will recursively go over all modules and convert their parameters and buffers to CUDA tensors:

net.to(device)

Remember that you will have to send the inputs and targets at every step to the GPU too:

inputs, labels = data[0].to(device), data[1].to(device)

Why don’t I notice MASSIVE speedup compared to CPU? Because your network is really small.

Exercise: Try increasing the width of your network (argument 2 of the first nn.Conv2d, and argument 1 of the second nn.Conv2d – they need to be the same number), see what kind of speedup you get.
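A hypothetical sketch of that exercise (the class name WideNet and the width value are illustrative, not part of the tutorial; it reuses the imports and device defined above). Only the channel count shared by the two conv layers changes, so the rest of the network is untouched.

class WideNet(nn.Module):
    def __init__(self, width=32):             # try 32, 64, ... instead of 6
        super().__init__()
        self.conv1 = nn.Conv2d(3, width, 5)   # argument 2 of the first nn.Conv2d
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(width, 16, 5)  # argument 1 of the second nn.Conv2d
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

wide_net = WideNet().to(device)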

Goals achieved:

  • Understanding PyTorch’s Tensor library and neural networks at a high level.
  • Train a small neural network to classify images

Training on multiple GPUs

If you want to see even more MASSIVE speedup using all of your GPUs, please check out Optional: Data Parallelism.

Where do I go next?

del dataiter

Total running time of the script: ( 1 minutes 58.522 seconds)
