重点
- 继承torch.nn.Module来定义复杂的网络结构
- torch.nn下定义了常见的损失函数
- torch.optim负责优化算法配置
- autograd可以自己定义,以实现复杂的前向/后向逻辑(torch默认是启动autograd,除非显示调用torch.no_grad())
- torch采用的是动态图,可以用python的流程控制语句实现权值共享
- 本教程从繁琐到简单,一步步展示torch的功能和易用性
Author: Justin Johnson
本教程通过pythorch的应用实例介绍其基本概念。
本质上,pythorch提供两个主要功能:
* n维Tensor,类似一个运行在GPU上的numpy
* 对神经网络自动构建和差分
我们将使用一个全连接的ReLU网络作为例子。这个网络有一个隐含层,使用梯度下降算法学习,欧式距离作为损失函数。
Tensors
Warm-up: numpy
在介绍Pytorch之前,我们先利用numpy构建一个网络。Numpy提供n维数组对象和很多操作数组的函数。numpy是科学计算的一个通用框架,不支持计算图,深度学习或梯度。但是我们利用numpy提供的操作可以很容易实现一个两层网络的前向/后向逻辑,并训练这个网络学习一个随机生成的数据的分布。
# -*- coding: utf-8 -*-
import numpy as np
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)
# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)
learning_rate = 1e-6
for t in range(500):
# Forward pass: compute predicted y
h = x.dot(w1)
h_relu = np.maximum(h, 0)
y_pred = h_relu.dot(w2)
# Compute and print loss
loss = np.square(y_pred - y).sum()
print(t, loss)
# Backprop to compute gradients of w1 and w2 with respect to loss
grad_y_pred = 2.0 * (y_pred - y)
grad_w2 = h_relu.T.dot(grad_y_pred)
grad_h_relu = grad_y_pred.dot(w2.T)
grad_h = grad_h_relu.copy()
grad_h[h < 0] = 0
grad_w1 = x.T.dot(grad_h)
# Update weights
w1 -= learning_rate * grad_w1
w2 -= learning_rate * grad_w2
Pytorch: Tensors
numpy是一个很棒的框架,但是它不支持GPU以加速运算。现代深度神经网络,GPU常常提供50x以上的加速,所以numpy不能满足深度学习的需求。
我们先介绍pytorch最基础的概念:Tensor。逻辑上pytorch tensor和numpy array是一样的:tensor是一个n维数组,pytorch提供很多函数操作tensor。和numpy数组类似,pytorch也不支持深度学习,计算图或梯度;它只是一个科学计算的工具。
和numpy不同的是,pytorch可以利用GPU加速。只需要强制改变数据类型,就可以让tensor运行在GPU上。
我们利用tensor在随机数据上训练一个两层的网络。和前面的例子类似,我们需要实现前向/后向逻辑:
# -*- coding: utf-8 -*-
import torch
dtype = torch.float
device = torch.device("cpu")
# dtype = torch.device("cuda:0") # Uncomment this to run on GPU
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)
# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)
learning_rate = 1e-6
for t in range(500):
# Forward pass: compute predicted y
h = x.mm(w1)
h_relu = h.clamp(min=0)
y_pred = h_relu.mm(w2)
# Compute and print loss
loss = (y_pred - y).pow(2).sum().item()
print(t, loss)
# Backprop to compute gradients of w1 and w2 with respect to loss
grad_y_pred = 2.0 * (y_pred - y)
grad_w2 = h_relu.t().mm(grad_y_pred)
grad_h_relu = grad_y_pred.mm(w2.t())
grad_h = grad_h_relu.clone()
grad_h[h < 0] = 0
grad_w1 = x.t().mm(grad_h)
# Update weights using gradient descent
w1 -= learning_rate * grad_w1
w2 -= learning_rate * grad_w2
Autograd
pytorch: Tensors and autograd
在上面的例子里,需要我们手动实现前向/后向逻辑。对于简单的两层网络,手动实现前向/后向逻辑比较容易,但是对于复杂的网络就比较麻烦了。
庆幸的是,pythorc提供了autograd,可以自动计算后向逻辑。使用autograd时,网络前向逻辑将定义一个计算图,图中的节点是tensor,边是函数,这些函数是输入tensor到输出tensor的映射。这使得在网络中反向传播梯度十分简单。
听起来似乎很复杂,但实际应用中很简单。每个tensor表示图中的一个节点。如果
x
x
是一个标量,属性,那么
x.grad
x
.
g
r
a
d
就是
x
x
相对某个标量的梯度。
我们尝试用pytorch tensor和autogrand实现两层网络。这次我们不需要手动实现反向逻辑了:
# -*- coding: utf-8 -*-
import torch
dtype = torch.float
device = torch.device("cpu")
# dtype = torch.device("cuda:0") # Uncomment this to run on GPU
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold input and outputs.
# Setting requires_grad=False indicates that we do not need to compute gradients
# with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)
# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)
learning_rate = 1e-6
for t in range(500):
# Forward pass: compute predicted y using operations on Tensors; these
# are exactly the same operations we used to compute the forward pass using
# Tensors, but we do not need to keep references to intermediate values since
# we are not implementing the backward pass by hand.
y_pred = x.mm(w1).clamp(min=0).mm(w2)
# Compute and print loss using operations on Tensors.
# Now loss is a Tensor of shape (1,)
# loss.item() gets the a scalar value held in the loss.
loss = (y_pred - y).pow(2).sum()
print(t, loss.item())
# Use autograd to compute the backward pass. This call will compute the
# gradient of loss with respect to all Tensors with requires_grad=True.
# After this call w1.grad and w2.grad will be Tensors holding the gradient
# of the loss with respect to w1 and w2 respectively.
loss.backward()
# Manually update weights using gradient descent. Wrap in torch.no_grad()
# because weights have requires_grad=True, but we don't need to track this
# in autograd.
# An alternative way is to operate on weight.data and weight.grad.data.
# Recall that tensor.data gives a tensor that shares the storage with
# tensor, but doesn't track history.
# You can also use torch.optim.SGD to achieve this.
with torch.no_grad():
w1 -= learning_rate * w1.grad
w2 -= learning_rate * w2.grad
# Manually zero the gradients after updating weights
w1.grad.zero_()
w2.grad.zero_()
pytorch:Defining new autograd functions
本质上,任何autograd操作都是tensor上的两个函数。
* 前向函数从输入tensor计算输出tensor
* 后向函数接受输出tensor关于某标量的梯度,计算输入tensor相对这个标量的梯度
pytorch中我们可以通过继承torch.autograd.Function,并实现forward和backward函数,构造自己的autograd操作。我们可以实例化一个新的autograd,像一个函数那样输入包含数据的tensor。
下面的实验中,我们自定义一个autograd函数,展示ReLU非线性。并用它实现我们的两层网络:
# -*- coding: utf-8 -*-
import torch
class MyReLU(torch.autograd.Function):
"""
We can implement our own custom autograd Functions by subclassing
torch.autograd.Function and implementing the forward and backward passes
which operate on Tensors.
"""
@staticmethod
def forward(ctx, input):
"""
In the forward pass we receive a Tensor containing the input and return
a Tensor containing the output. ctx is a context object that can be used
to stash information for backward computation. You can cache arbitrary
objects for use in the backward pass using the ctx.save_for_backward method.
"""
ctx.save_for_backward(input)
return input.clamp(min=0)
@staticmethod
def backward(ctx, grad_output):
"""
In the backward pass we receive a Tensor containing the gradient of the loss
with respect to the output, and we need to compute the gradient of the loss
with respect to the input.
"""
input, = ctx.saved_tensors
grad_input = grad_output.clone()
grad_input[input < 0] = 0
return grad_input
dtype = torch.float
device = torch.device("cpu")
# dtype = torch.device("cuda:0") # Uncomment this to run on GPU
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold input and outputs.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)
# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)
learning_rate = 1e-6
for t in range(500):
# To apply our Function, we use Function.apply method. We alias this as 'relu'.
relu = MyReLU.apply
# Forward pass: compute predicted y using operations; we compute
# ReLU using our custom autograd operation.
y_pred = relu(x.mm(w1)).mm(w2)
# Compute and print loss
loss = (y_pred - y).pow(2).sum()
print(t, loss.item())
# Use autograd to compute the backward pass.
loss.backward()
# Update weights using gradient descent
with torch.no_grad():
w1 -= learning_rate * w1.grad
w2 -= learning_rate * w2.grad
# Manually zero the gradients after updating weights
w1.grad.zero_()
w2.grad.zero_()
TensorFlow:Static Graphs
pytorch autograd看起来很像TensorFlow:两个框架都定义了计算图并利用自动差分计算梯度。二者最大的差异是TensorFlow的计算图是静态的,而pytorch使用的是动态计算图。
在TensorFlow中,我们定义一个计算图然后就是反复送入不同数据,执行这个计算图。在pytorch中,每个前向逻辑都定义一个新的计算图。
静态图的优势在于你可以提前优化它;一个框架可以选择融合一些图操作以提高效率,或者采用某种策略把图分配到不同GPU或设备上。如果你会反复使用同样的计算图,预先优化所带来的开销将被分摊到每次运行时,变得可以忽略。
静态图和动态图的一个差异是控制流。对某些模型,我们可能希望对不同数据执行不同的计算;比如一个循环网络对于一个数据可能会被展开成不同长度;通过循环可以进行这种展开。对于静态图,这个循环的构造要内嵌到图中;因此TensorFlow提供类似tf.scan的函数在图中嵌入循环。对于动态图,处理这种问题就比较简单了:因为我们对每个例子动态构造图,我们可以对每个输入构造控制流。
为了和上面的例子做对比,我们用TensorFlow训练一个两层的网络:
# -*- coding: utf-8 -*-
import tensorflow as tf
import numpy as np
# First we set up the computational graph:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create placeholders for the input and target data; these will be filled
# with real data when we execute the graph.
x = tf.placeholder(tf.float32, shape=(None, D_in))
y = tf.placeholder(tf.float32, shape=(None, D_out))
# Create Variables for the weights and initialize them with random data.
# A TensorFlow Variable persists its value across executions of the graph.
w1 = tf.Variable(tf.random_normal((D_in, H)))
w2 = tf.Variable(tf.random_normal((H, D_out)))
# Forward pass: Compute the predicted y using operations on TensorFlow Tensors.
# Note that this code does not actually perform any numeric operations; it
# merely sets up the computational graph that we will later execute.
h = tf.matmul(x, w1)
h_relu = tf.maximum(h, tf.zeros(1))
y_pred = tf.matmul(h_relu, w2)
# Compute loss using operations on TensorFlow Tensors
loss = tf.reduce_sum((y - y_pred) ** 2.0)
# Compute gradient of the loss with respect to w1 and w2.
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])
# Update the weights using gradient descent. To actually update the weights
# we need to evaluate new_w1 and new_w2 when executing the graph. Note that
# in TensorFlow the the act of updating the value of the weights is part of
# the computational graph; in PyTorch this happens outside the computational
# graph.
learning_rate = 1e-6
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)
# Now we have built our computational graph, so we enter a TensorFlow session to
# actually execute the graph.
with tf.Session() as sess:
# Run the graph once to initialize the Variables w1 and w2.
sess.run(tf.global_variables_initializer())
# Create numpy arrays holding the actual data for the inputs x and targets
# y
x_value = np.random.randn(N, D_in)
y_value = np.random.randn(N, D_out)
for _ in range(500):
# Execute the graph many times. Each time it executes we want to bind
# x_value to x and y_value to y, specified with the feed_dict argument.
# Each time we execute the graph we want to compute the values for loss,
# new_w1, and new_w2; the values of these Tensors are returned as numpy
# arrays.
loss_value, _, _ = sess.run([loss, new_w1, new_w2],
feed_dict={x: x_value, y: y_value})
print(loss_value)
nn module
pytorch: nn
计算图和autograd是十分强大的工具,可以定义复杂的操作并自动求导;但是对于大规模的网络,autograd太过于底层。
在构造神经网络时,我们经常按层进行计算,层中定义一些需要在学习中学习的参数。
TensorFlow之上有Keras,TensorFlow-Slim和TFLearn提供高度抽象的接口,封装了底层的计算图,使得构建网络十分方便。
pytorch中的nn包完成了同样的功能。nn包中定义一组等价于层的模块。一个模块接受输入的tesnor,计算输出的tensor,而且还保存了一些内部状态,比如需要学习的tensor参数。nn包中也定义了一组损失函数,用来训练神经网络。
下面我们用nn包实现两层的网络:
# -*- coding: utf-8 -*-
import torch
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
torch.nn.Linear(D_in, H),
torch.nn.ReLU(),
torch.nn.Linear(H, D_out),
)
# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(size_average=False)
learning_rate = 1e-4
for t in range(500):
# Forward pass: compute predicted y by passing x to the model. Module objects
# override the __call__ operator so you can call them like functions. When
# doing so you pass a Tensor of input data to the Module and it produces
# a Tensor of output data.
y_pred = model(x)
# Compute and print loss. We pass Tensors containing the predicted and true
# values of y, and the loss function returns a Tensor containing the
# loss.
loss = loss_fn(y_pred, y)
print(t, loss.item())
# Zero the gradients before running the backward pass.
model.zero_grad()
# Backward pass: compute gradient of the loss with respect to all the learnable
# parameters of the model. Internally, the parameters of each Module are stored
# in Tensors with requires_grad=True, so this call will compute gradients for
# all learnable parameters in the model.
loss.backward()
# Update the weights using gradient descent. Each parameter is a Tensor, so
# we can access and gradients like we did before.
with torch.no_grad():
for param in model.parameters():
param -= learning_rate * param.grad
pytorch:optim
此前更新模型权重都是通过手动修改可学习的tensor(利用torch.no_grad()或.data()避免autograd跟踪变化历史)。对于
SGD这种简单的优化算法,这种操作不是很大负担。但实际中我们经常使用更加高级的优化算法,像AdaGrad,RMSProp和Adam等。
optim包对优化算法做了抽象,提供经常使用的优化算法的实现。
下面的实验中,我们利用nn包定义模型,但这次利用optim提供的Adam算法优化:
# -*- coding: utf-8 -*-
import torch
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
torch.nn.Linear(D_in, H),
torch.nn.ReLU(),
torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(size_average=False)
# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algoriths. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
# Forward pass: compute predicted y by passing x to the model.
y_pred = model(x)
# Compute and print loss.
loss = loss_fn(y_pred, y)
print(t, loss.item())
# Before the backward pass, use the optimizer object to zero all of the
# gradients for the variables it will update (which are the learnable
# weights of the model). This is because by default, gradients are
# accumulated in buffers( i.e, not overwritten) whenever .backward()
# is called. Checkout docs of torch.autograd.backward for more details.
optimizer.zero_grad()
# Backward pass: compute gradient of the loss with respect to model
# parameters
loss.backward()
# Calling the step function on an Optimizer makes an update to its
# parameters
optimizer.step()
pytorch: Custom nn Modules
你可能会需要定义一个比已有模块更加复杂的模块;你只需要继承nn.Module并定义forward逻辑,接收输入tensor,利用其他模块或其他autograd操作计算输出tensor。
下面的例子里我们用自定义的模块构建两层网络:
# -*- coding: utf-8 -*-
import torch
class TwoLayerNet(torch.nn.Module):
def __init__(self, D_in, H, D_out):
"""
In the constructor we instantiate two nn.Linear modules and assign them as
member variables.
"""
super(TwoLayerNet, self).__init__()
self.linear1 = torch.nn.Linear(D_in, H)
self.linear2 = torch.nn.Linear(H, D_out)
def forward(self, x):
"""
In the forward function we accept a Tensor of input data and we must return
a Tensor of output data. We can use Modules defined in the constructor as
well as arbitrary operators on Tensors.
"""
h_relu = self.linear1(x).clamp(min=0)
y_pred = self.linear2(h_relu)
return y_pred
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)
# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(size_average=False)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
# Forward pass: Compute predicted y by passing x to the model
y_pred = model(x)
# Compute and print loss
loss = criterion(y_pred, y)
print(t, loss.item())
# Zero gradients, perform a backward pass, and update the weights.
optimizer.zero_grad()
loss.backward()
optimizer.step()
pytorch: Control Flow + Weight Sharing
作为动态图和权值共享的例子,我们将实现一个奇怪的模型:一个全连接的ReLU网路,每次前向执行时,在中随机选择一个指,然后用同样的权重隐层这么多次(译注:相当于把隐层扩充成权重共享的N次,
N∈{1,2,3,4}
N
∈
{
1
,
2
,
3
,
4
}
。我们利用普通的python控制流实现循环,另外我们利用同一个模块连续相乘,来实现隐层权重共享。
我们利用Mudule的子类很容易实现这个模型:
# -*- coding: utf-8 -*-
import random
import torch
class DynamicNet(torch.nn.Module):
def __init__(self, D_in, H, D_out):
"""
In the constructor we construct three nn.Linear instances that we will use
in the forward pass.
"""
super(DynamicNet, self).__init__()
self.input_linear = torch.nn.Linear(D_in, H)
self.middle_linear = torch.nn.Linear(H, H)
self.output_linear = torch.nn.Linear(H, D_out)
def forward(self, x):
"""
For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
and reuse the middle_linear Module that many times to compute hidden layer
representations.
Since each forward pass builds a dynamic computation graph, we can use normal
Python control-flow operators like loops or conditional statements when
defining the forward pass of the model.
Here we also see that it is perfectly safe to reuse the same Module many
times when defining a computational graph. This is a big improvement from Lua
Torch, where each Module could be used only once.
"""
h_relu = self.input_linear(x).clamp(min=0)
for _ in range(random.randint(0, 3)):
h_relu = self.middle_linear(h_relu).clamp(min=0)
y_pred = self.output_linear(h_relu)
return y_pred
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)
# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(size_average=False)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
# Forward pass: Compute predicted y by passing x to the model
y_pred = model(x)
# Compute and print loss
loss = criterion(y_pred, y)
print(t, loss.item())
# Zero gradients, perform a backward pass, and update the weights.
optimizer.zero_grad()
loss.backward()
optimizer.step()