The Evolution of a Simple Two-Layer Network

Recently I came across an excellent PyTorch learning resource online. It implements a simple neural network in several different ways and nicely ties together basic neural-network concepts, NumPy, and PyTorch language features. The network itself is simple, but it makes a very good learning example. So that I can come back to it from time to time, I have copied it here.
Source: https://github.com/jcjohnson/pytorch-examples

1. A Simple Network Implemented in NumPy

Numpy is a generic framework for scientific computing; it does not know anything about computation graphs, or deep learning, or gradients. However we can easily use numpy to fit a two-layer network to random data by manually implementing the forward and backward passes through the network using numpy operations.
NumPy is a widely used scientific computing framework, but because it was developed long before the deep learning era it has no notion of GPUs, which makes it a poor fit for today's deep learning workloads. Its API is rich enough, however, that building a simple neural network with it is straightforward. Below is a two-layer fully connected network built with NumPy, using ReLU as the activation function.

# Code in file tensor/two_layer_net_numpy.py
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
  # Forward pass: compute predicted y
  h = x.dot(w1)
  h_relu = np.maximum(h, 0)
  y_pred = h_relu.dot(w2)
  
  # Compute and print loss
  loss = np.square(y_pred - y).sum()
  print(t, loss)
  
  # Backprop to compute gradients of w1 and w2 with respect to loss
  grad_y_pred = 2.0 * (y_pred - y)
  grad_w2 = h_relu.T.dot(grad_y_pred)
  grad_h_relu = grad_y_pred.dot(w2.T)
  grad_h = grad_h_relu.copy()
  grad_h[h < 0] = 0
  grad_w1 = x.T.dot(grad_h)
 
  # Update weights
  w1 -= learning_rate * grad_w1
  w2 -= learning_rate * grad_w2

NumPy has no automatic differentiation, so the forward and backward passes of the network must be implemented by hand.
(1) The forward pass
The forward pass consists of two steps:
1. Computing the prediction $\mathbf y_p$:
$$\mathbf y_p^k = W_2\cdot\sigma(W_1\cdot \mathbf x^k)\qquad(1)$$
where $\mathbf x^k = [x_1, x_2, \cdots, x_m]$ is an m-dimensional vector, $\mathbf y^k = [y_1, y_2, \cdots, y_n]$ is an n-dimensional vector, and
$$\sigma(x_i)=\begin{cases} x_i & x_i\ge 0 \\ 0 & \text{otherwise} \end{cases}$$
is the element-wise ReLU nonlinearity.
For simplicity, equation (1) omits the neuron biases and keeps only the weight matrices $W_1, W_2$. Here $W_1$ has shape $h \times m$, the input vector $\mathbf x$ is $m \times 1$, $W_2$ has shape $n \times h$, and the output vector $\mathbf y$ is $n \times 1$. The shape constraints of matrix multiplication give the following relation from input to output:
$$(n \times 1) = (n \times h)\cdot(h \times m)\cdot(m \times 1)$$
2. Constructing the loss:
$$\text{Loss: } L=\sum_{k=1}^N \Vert \mathbf y_p^k - \mathbf y^k \Vert^2 \qquad(2)$$
where N is the number of samples in the batch.
(2) The backward pass
1. Gradient of the loss with respect to $W_2$. Consider first a single sample in equation (2):
$$L^k=(\mathbf y_p^k - \mathbf y^k)^T(\mathbf y_p^k - \mathbf y^k)$$
Let $h^k=\sigma(W_1\cdot \mathbf x^k)$, so:
$$L^k=(W_2 h^k - \mathbf y^k)^T(W_2 h^k - \mathbf y^k)$$
$$\frac{\partial L^k}{\partial W_2} = 2 (W_2 h^k - \mathbf y^k)(h^k)^T \qquad(3)$$
Summing over the batch gives:
$$\frac{\partial L}{\partial W_2}=\sum_{k=1}^N 2 (W_2 h^k - \mathbf y^k)(h^k)^T =2\left\{W_2[h^1,h^2,\cdots,h^N]-[\mathbf y^1,\mathbf y^2,\cdots,\mathbf y^N]\right\}\cdot \left[ \begin{array}{c}(h^1)^T \\ (h^2)^T \\ \vdots \\ (h^N)^T \end{array} \right] = 2(W_2H-Y)H^T \qquad(4)$$
where $Y=[\mathbf y^1,\mathbf y^2,\cdots,\mathbf y^N]$ holds the ground truth of the N samples in the batch, and $H=[h^1,h^2,\cdots,h^N]$ is the first-layer output for the batch, which the code above stores in h_relu.
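A quick way to convince yourself that equation (4) agrees with the code is a finite-difference check on a tiny network. Note that the code stores samples as rows, so its grad_w2 = h_relu.T.dot(grad_y_pred) is the transpose of the column-vector expression in (4). The sketch below is only illustrative; the sizes, the seed, and the entry (i, j) being probed are arbitrary choices, not part of the original example.

import numpy as np

np.random.seed(0)
N, D_in, H, D_out = 4, 5, 3, 2
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

def loss_of(w2_):
  # Same forward pass and loss as the code above
  h_relu = np.maximum(x.dot(w1), 0)
  return np.square(h_relu.dot(w2_) - y).sum()

# Analytic gradient, identical to grad_w2 in the code above
h_relu = np.maximum(x.dot(w1), 0)
grad_w2 = h_relu.T.dot(2.0 * (h_relu.dot(w2) - y))

# Central finite difference for a single entry (i, j)
eps, i, j = 1e-6, 1, 0
w2_plus, w2_minus = w2.copy(), w2.copy()
w2_plus[i, j] += eps
w2_minus[i, j] -= eps
numeric = (loss_of(w2_plus) - loss_of(w2_minus)) / (2 * eps)
print(grad_w2[i, j], numeric)  # the two values should agree to several decimals
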
2. Gradient with respect to $W_1$
Consider a single sample; for convenience the superscript 'k' is dropped:
$$L=(\mathbf y_p-\mathbf y)^T(\mathbf y_p-\mathbf y)= [W_2\, h(W_1\mathbf x)-\mathbf y]^T[W_2\, h(W_1\mathbf x)-\mathbf y]$$
$$\frac{\partial L}{\partial W_1}=\frac{\partial L}{\partial h}\frac{\partial h}{\partial W_1}=2 (W_2 h - \mathbf y)(W_2)^T\frac{\partial h}{\partial W_1}=2 (\mathbf y_p - \mathbf y)(W_2)^T\frac{\partial h}{\partial W_1}\qquad(5)$$
Looking at the code once more:

# Backprop to compute gradients of w1 and w2 with respect to loss
grad_y_pred = 2.0 * (y_pred - y)
grad_w2 = h_relu.T.dot(grad_y_pred)
grad_h_relu = grad_y_pred.dot(w2.T)
grad_h = grad_h_relu.copy()
grad_h[h < 0] = 0
grad_w1 = x.T.dot(grad_h)

The first part of equation (5) can be read directly off the code; the only piece left to unpack is $\frac{\partial h}{\partial W_1}$, where:
$$h=\sigma(W_1 \mathbf x)$$
$$\frac{\partial h}{\partial W_1}= \sigma'(W_1 \mathbf x)\,\mathbf x$$
$$\frac{\partial L}{\partial W_1}=2 (\mathbf y_p - \mathbf y)(W_2)^T \sigma'(W_1 \mathbf x)\,\mathbf x \qquad(6)$$
$\sigma(x)$ is the ReLU. It consists of two straight-line segments, so its derivative also has two parts: 1 where the argument is greater than 0 and 0 where it is less than 0, that is:
$$\sigma'(W_1 \mathbf x)=\begin{cases}1 & \text{element of } W_1\mathbf x\ge 0 \\ 0 & \text{otherwise}\end{cases}$$
In code this appears as:

grad_h[h < 0] = 0
grad_w1 = x.T.dot(grad_h)

Here h < 0 is exactly $W_1\mathbf x < 0$. It acts like a filter: positions where $W_1\mathbf x < 0$ are set to 0 and blocked, while the others are let through. To avoid disturbing the original data, grad_h = grad_h_relu.copy() is used, i.e. a copy is taken first and the operation is carried out on the new array.
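A toy illustration of this filtering behaviour, using made-up numbers (the values of h and grad_h_relu below are hypothetical, chosen only to show the masking and why copy() matters):

import numpy as np

h = np.array([[ 1.0, -2.0],
              [-0.5,  3.0]])
grad_h_relu = np.array([[10.0, 20.0],
                        [30.0, 40.0]])

grad_h = grad_h_relu.copy()  # work on a copy so grad_h_relu itself is untouched
grad_h[h < 0] = 0            # block gradients where the pre-activation was negative
print(grad_h)
# [[10.  0.]
#  [ 0. 40.]]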

As the analysis above shows, the NumPy code faithfully carries out the network's entire forward and backward computation, step by step.

2. Replacing NumPy with PyTorch

Here we introduce the most fundamental PyTorch concept: the Tensor. A PyTorch Tensor is conceptually identical to a numpy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors.
PyTorch is a computation framework independent of NumPy. It implements a large number of NumPy-like interfaces, somewhat streamlined (or perhaps simply not yet complete). Most of the commonly used NumPy operations have a PyTorch counterpart; below is a PyTorch "translation" of the NumPy code above.

import torch

device = torch.device('cpu')
#device = torch.device('cuda') # Uncomment this to run on GPU

#N is batch size; D_in is input dimension;
#H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

#Create random input and output data
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

#Randomly initialize weights
w1 = torch.randn(D_in, H, device=device)
w2 = torch.randn(H, D_out, device=device)

learning_rate = 1e-6
for t in range(500):
  # Forward pass: compute predicted y
  h = x.mm(w1)
  h_relu = h.clamp(min=0)
  y_pred = h_relu.mm(w2)

  # Compute and print loss; loss is a scalar, and is stored in a PyTorch Tensor
  # of shape (); we can get its value as a Python number with loss.item().
  loss = (y_pred - y).pow(2).sum()
  print(t, loss.item())

  # Backprop to compute gradients of w1 and w2 with respect to loss
  grad_y_pred = 2.0 * (y_pred - y)
  grad_w2 = h_relu.t().mm(grad_y_pred)
  grad_h_relu = grad_y_pred.mm(w2.t())
  grad_h = grad_h_relu.clone()
  grad_h[h < 0] = 0
  grad_w1 = x.t().mm(grad_h)

  # Update weights using gradient descent
  w1 -= learning_rate * grad_w1
  w2 -= learning_rate * grad_w2
  

Comparing the two listings, the NumPy and PyTorch implementations differ very little; only the names of the API calls in the two frameworks change, for example:

# Numpy
h_relu = np.maximum(h, 0)
# Pytorch
h_relu = h.clamp(min=0)

In NumPy, x, y, w1, w2 are arrays; in PyTorch they are Tensors. The two implementations are used in almost exactly the same way, and both spell out the operations of the formulas explicitly. In both cases, methods are called directly on the object that owns them:

#Numpy
grad_w1 = x.T.dot(grad_h)
#Pytorch
grad_w1 = x.t().mm(grad_h)
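
As a quick sanity check that the two APIs really compute the same thing, the two worlds can be bridged with torch.from_numpy, which shares memory with the NumPy array. The sketch below uses arbitrary small shapes and is purely illustrative:

import numpy as np
import torch

a = np.random.randn(3, 4)
b = np.random.randn(4, 2)

np_out = np.maximum(a, 0).dot(b)                                      # NumPy version
torch_out = torch.from_numpy(a).clamp(min=0).mm(torch.from_numpy(b))  # PyTorch version

print(np.allclose(np_out, torch_out.numpy()))  # expected: True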

Note that when x, y, w1, w2 are defined here, requires_grad=True must not be added: in this implementation we do not want a computational graph to be built for these Tensors, unlike the implementations that follow.

3. PyTorch with Autograd

In the above examples, we had to manually implement both the forward and backward passes of our neural network. Manually implementing the backward pass is not a big deal for a small two-layer network, but can quickly get very hairy for large complex networks.

Thankfully, we can use automatic differentiation to automate the computation of backward passes in neural networks. The autograd package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your network will define a computational graph; nodes in the graph will be Tensors, and edges will be functions that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.
"Automatic differentiation" means that the backward gradient computation no longer appears in our code, but it still takes place: it is encapsulated in the nodes (Tensors/Variables) and edges (Functions) of the computational graph. The graph itself is created while the forward computation is executed.
If we want to compute gradients with respect to some Tensor, then we set requires_grad=True when constructing that Tensor. Any PyTorch operations on that Tensor will cause a computational graph to be constructed, allowing us to later perform backpropagation through the graph. If x is a Tensor with requires_grad=True, then after back propagation x.grad will be another Tensor holding the gradient of x with respect to some scalar value.
The code for this part is as follows:

import torch

device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Create random Tensors for weights; setting requires_grad=True means that we
# want to compute gradients for these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
  # Forward pass: compute predicted y using operations on Tensors. 
  # Since w1 and w2 have requires_grad=True, operations involving these
  # Tensors will cause PyTorch to build a computational graph, allowing 
  # automatic computation of gradients. Since we are no longer 
  # implementing the backward pass by hand we don't need to keep 
  # references to intermediate values.
  y_pred = x.mm(w1).clamp(min=0).mm(w2)
  
  # Compute and print loss. Loss is a Tensor of shape (), and loss.item()
  # is a Python number giving its value.
  loss = (y_pred - y).pow(2).sum()
  print(t, loss.item())

  # Use autograd to compute the backward pass. This call will compute the
  # gradient of loss with respect to all Tensors with requires_grad=True.
  # After this call w1.grad and w2.grad will be Tensors holding the gradient
  # of the loss with respect to w1 and w2 respectively.
  loss.backward()

  # Update weights using gradient descent. For this step we just want to mutate
  # the values of w1 and w2 in-place; we don't want to build up a computational
  # graph for the update steps, so we use the torch.no_grad() context manager
  # to prevent PyTorch from building a computational graph for the updates
  with torch.no_grad():
    w1 -= learning_rate * w1.grad
    w2 -= learning_rate * w2.grad

    # Manually zero the gradients after running the backward pass
    w1.grad.zero_()
    w2.grad.zero_()

In PyTorch, a Tensor is not a simple data type: besides its own data x, it can also carry gradient data x.grad, as can be seen in this line:

w1 -= learning_rate * w1.grad

Whether a Tensor owns a gradient member is decided when it is created; compare the two creation statements:

w1 = torch.randn(D_in, H, device=device) # from Section 2

w1 = torch.randn(D_in, H, device=device, requires_grad=True) # from Section 3
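
A minimal sketch of the difference (the tensors a and b below are just stand-ins, not part of the example): only a Tensor created with requires_grad=True accumulates a .grad tensor after backward().

import torch

a = torch.randn(3)                      # created without requires_grad
b = torch.randn(3, requires_grad=True)  # tracked by autograd

loss = (2 * b).sum()
loss.backward()

print(a.grad)  # None -- no gradient is ever accumulated for a
print(b.grad)  # tensor([2., 2., 2.]) -- d(loss)/db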

If you want to see the structure of this network, the approach at https://github.com/szagoruyko/pytorchviz can be used.
[Figure: computational graph of the autograd version, rendered with pytorchviz]
The figure shows that the backward pass is carried out by PyTorch itself. The graph contains three Backward nodes, MmBackward, ClampMinBackward, and MmBackward, corresponding to matrix multiply, ReLU, and matrix multiply respectively. Every edge of the computational graph defines, and already implements, its own derivative, so once the operations have been packaged into the graph, gradients can be computed automatically. (Admittedly a very broad-brush description.)
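The same chain of nodes can also be inspected directly from Python, without drawing the graph. The sketch below rebuilds a small version of the forward pass with its own x, w1, w2 (shapes are arbitrary); exact class names may carry a version suffix such as MmBackward0, depending on the PyTorch release.

import torch

x = torch.randn(4, 5)
w1 = torch.randn(5, 3, requires_grad=True)
w2 = torch.randn(3, 2, requires_grad=True)

y_pred = x.mm(w1).clamp(min=0).mm(w2)

print(type(y_pred.grad_fn).__name__)         # the last op: MmBackward
for fn, _ in y_pred.grad_fn.next_functions:  # the nodes feeding into it
  if fn is not None:
    print('  parent:', type(fn).__name__)    # ClampMinBackward and AccumulateGrad (for w2)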

4. Defining a Custom Autograd Function

torch.autograd provides the Function class, which packages the processing required by an edge of the computational graph into a unit with both a forward and a backward, so that it can be plugged into the graph and take part in automatic differentiation. A Function object holds no state, which makes it well suited to wrapping activation functions and loss functions.

import torch

class MyReLU(torch.autograd.Function):
  """
  We can implement our own custom autograd Functions by subclassing
  torch.autograd.Function and implementing the forward and backward passes
  which operate on Tensors.
  """
  @staticmethod
  def forward(ctx, x):
    """
    In the forward pass we receive a context object and a Tensor containing the
    input; we must return a Tensor containing the output, and we can use the
    context object to cache objects for use in the backward pass.
    """
    ctx.save_for_backward(x)
    return x.clamp(min=0)

  @staticmethod
  def backward(ctx, grad_output):
    """
    In the backward pass we receive the context object and a Tensor containing
    the gradient of the loss with respect to the output produced during the
    forward pass. We can retrieve cached data from the context object, and must
    compute and return the gradient of the loss with respect to the input to the
    forward function.
    """
    x, = ctx.saved_tensors
    grad_x = grad_output.clone()
    grad_x[x < 0] = 0
    return grad_x


device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and output
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
  # Forward pass: compute predicted y using operations on Tensors; we call our
  # custom ReLU implementation using the MyReLU.apply function
  y_pred = MyReLU.apply(x.mm(w1)).mm(w2)
 
  # Compute and print loss
  loss = (y_pred - y).pow(2).sum()
  print(t, loss.item())

  # Use autograd to compute the backward pass.
  loss.backward()

  with torch.no_grad():
    # Update weights using gradient descent
    w1 -= learning_rate * w1.grad
    w2 -= learning_rate * w2.grad

    # Manually zero the gradients after running the backward pass
    w1.grad.zero_()
    w2.grad.zero_()

In this code the backward pass is implemented by hand inside MyReLU, with a concrete implementation. This differs from Section 3, where loss.backward() was called directly without any explicit implementation, because there every step already contributed a graph node with its own backward method. Here in Section 4 we have written a custom Function that is equivalent to torch.clamp, as the figure below shows; torch.clamp comes with its own backward, ClampMinBackward.
[Figure: computational graph built with the custom MyReLU Function]
The official PyTorch tutorial has a sentence describing the relationship between Tensor and Function:
Tensor and Function are interconnected and build up an acyclic graph, that encodes a complete history of computation. Each tensor has a .grad_fn attribute that references a Function that has created the Tensor (except for Tensors created by the user - their grad_fn is None).
In other words, every Tensor has a .grad_fn attribute that points to the Function that created it, unless the Tensor was created by the user. In the code above:

y_pred = MyReLU.apply(x.mm(w1)).mm(w2)

MyReLU.apply(x.mm(w1)) produces a Tensor whose .grad_fn attribute points to the MyReLU Function, so that when automatic differentiation runs:

#Use autograd to compute the backward pass.
  loss.backward()

autograd automatically follows this chain from the Tensor back to MyReLU's backward method.
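A sanity check worth running on any hand-written Function is torch.autograd.gradcheck, which compares the custom backward against numerical gradients. The sketch below assumes the MyReLU class defined above; gradcheck requires double precision, and could in principle object if an input fell almost exactly on ReLU's kink at 0, which is vanishingly unlikely with random data.

import torch

# x_check is a fresh double-precision input used only for the check
x_check = torch.randn(8, 6, dtype=torch.double, requires_grad=True)
print(torch.autograd.gradcheck(MyReLU.apply, (x_check,)))  # expected: True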

5. Implementing It with torch.nn

When building neural networks we frequently think of arranging the computation into layers, some of which have learnable parameters which will be optimized during learning.
In PyTorch, the nn package serves this purpose. The nn package defines a set of Modules, which are roughly equivalent to neural network layers. A Module receives input Tensors and computes output Tensors, but may also hold internal state such as Tensors containing learnable parameters. The nn package also defines a set of useful loss functions that are commonly used when training neural networks.
In PyTorch, what ties the nodes and edges together is the Module, and a Module holds state, whereas a Function is stateless. How should we understand this? Looking at the code in Section 4, MyReLU (a Function) contains neither w1 nor w2, whereas in this section's implementation the model keeps the parameters to be optimized inside itself.

import torch

device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
# After constructing the model we use the .to() method to move it to the
# desired device.
model = torch.nn.Sequential(
          torch.nn.Linear(D_in, H),
          torch.nn.ReLU(),
          torch.nn.Linear(H, D_out),
        ).to(device)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function. Setting
# reduction='sum' means that we are computing the *sum* of squared errors rather
# than the mean; this is for consistency with the examples above where we
# manually compute the loss, but in practice it is more common to use mean
# squared error as a loss by setting reduction='elementwise_mean'.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
  # Forward pass: compute predicted y by passing x to the model. Module objects
  # override the __call__ operator so you can call them like functions. When
  # doing so you pass a Tensor of input data to the Module and it produces
  # a Tensor of output data.
  y_pred = model(x)

  # Compute and print loss. We pass Tensors containing the predicted and true
  # values of y, and the loss function returns a Tensor containing the loss.
  loss = loss_fn(y_pred, y)
  print(t, loss.item())
  
  # Zero the gradients before running the backward pass.
  model.zero_grad()

  # Backward pass: compute gradient of the loss with respect to all the learnable
  # parameters of the model. Internally, the parameters of each Module are stored
  # in Tensors with requires_grad=True, so this call will compute gradients for
  # all learnable parameters in the model.
  loss.backward()

  # Update the weights using gradient descent. Each parameter is a Tensor, so
  # we can access its data and gradients like we did before.
  with torch.no_grad():
    for param in model.parameters():
      param.data -= learning_rate * param.grad

model is a sequential container (nn.Sequential) into which several Modules are added. Each Module has its own forward and backward as well as its own state; in the figure below, the light blue nodes are that state, i.e. W1 and W2.
[Figure: computational graph of the nn.Sequential model; the light blue leaf nodes are the learnable parameters W1 and W2]
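That state can also be listed without a figure. Reusing the model defined above, named_parameters() enumerates the learnable tensors, which nn.Linear itself creates with requires_grad=True (the printed shapes follow nn.Linear's (out_features, in_features) convention):

for name, param in model.named_parameters():
  print(name, tuple(param.shape), param.requires_grad)
# 0.weight (100, 1000) True
# 0.bias (100,) True
# 2.weight (10, 100) True
# 2.bias (10,) True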

6. A Custom Module

Section 5 built the Module with the nn.Sequential container; PyTorch also provides a way to define your own Module, which suits Modules with complex internal connections.

import torch

class TwoLayerNet(torch.nn.Module):
  def __init__(self, D_in, H, D_out):
    """
    In the constructor we instantiate two nn.Linear modules and assign them as
    member variables.
    """
    super(TwoLayerNet, self).__init__()
    self.linear1 = torch.nn.Linear(D_in, H)
    self.linear2 = torch.nn.Linear(H, D_out)

  def forward(self, x):
    """
    In the forward function we accept a Tensor of input data and we must return
    a Tensor of output data. We can use Modules defined in the constructor as
    well as arbitrary (differentiable) operations on Tensors.
    """
    h_relu = self.linear1(x).clamp(min=0)
    y_pred = self.linear2(h_relu)
    return y_pred

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above.
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
loss_fn = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
  # Forward pass: Compute predicted y by passing x to the model
  y_pred = model(x)

  # Compute and print loss
  loss = loss_fn(y_pred, y)
  print(t, loss.item())

  # Zero gradients, perform a backward pass, and update the weights.
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()

Here several small Modules are packaged into one larger Module. Because every Module used already has its own backward, the custom TwoLayerNet does not need to implement a backward of its own.
[Figure: computational graph of the custom TwoLayerNet module]
As the figure shows, this structure is the same as the graph in Section 5.
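Printing the module is a quick, figure-free way to confirm the structure. This reuses the model instantiated above; the exact repr text may differ slightly between PyTorch versions.

print(model)
# Expected output, approximately:
# TwoLayerNet(
#   (linear1): Linear(in_features=1000, out_features=100, bias=True)
#   (linear2): Linear(in_features=100, out_features=10, bias=True)
# )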

7. A Custom Module Containing a Custom Function

This combines Section 4 and Section 6.

import torch

class TwoLayerNet(torch.nn.Module):
  def __init__(self, D_in, H, D_out):
    """
    In the constructor we instantiate two nn.Linear modules and assign them as
    member variables.
    """
    super(TwoLayerNet, self).__init__()
    self.linear1 = torch.nn.Linear(D_in, H)
    self.linear2 = torch.nn.Linear(H, D_out)
    self.relu = MyReLU.apply  # changed here; MyReLU is the custom Function defined in Section 4

  def forward(self, x):
    """
    In the forward function we accept a Tensor of input data and we must return
    a Tensor of output data. We can use Modules defined in the constructor as
    well as arbitrary (differentiable) operations on Tensors.
    """
    h_relu = self.relu(self.linear1(x))  # changed here
    y_pred = self.linear2(h_relu)
    return y_pred

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above.
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
loss_fn = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
  # Forward pass: Compute predicted y by passing x to the model
  y_pred = model(x)

  # Compute and print loss
  loss = loss_fn(y_pred, y)
  print(t, loss.item())

  # Zero gradients, perform a backward pass, and update the weights.
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()

[Figure: computational graph of TwoLayerNet using the custom MyReLU Function]
Comparing this figure with the previous one, the ClampMinBackward node in the Section 6 graph has been replaced here by MyReLUBackward.
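The swap can also be observed without a figure by looking at grad_fn directly. The sketch below reuses MyReLU (Section 4) and the model and x defined above; the printed class name is what one would expect for a custom Function, though the exact string may vary between PyTorch releases.

h = MyReLU.apply(model.linear1(x))
print(type(h.grad_fn).__name__)  # e.g. MyReLUBackward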

8. The Programming Advantage of Dynamic Computational Graphs

TensorFlow's computational graph is considered static: once generated, it does not change over the whole course of training and prediction, hence "static". What makes PyTorch remarkable is that the graph it generates is dynamic.
What is a dynamic computational graph? One that can change at run time. What is that good for? We can use the IF statements and LOOP constructs we are familiar with to change the running graph, reshaping it whenever needed. Perhaps this property will prove useful for AutoML. Here is an example:

import random
import torch

class DynamicNet(torch.nn.Module):
  def __init__(self, D_in, H, D_out):
    """
    In the constructor we construct three nn.Linear instances that we will use
    in the forward pass.
    """
    super(DynamicNet, self).__init__()
    self.input_linear = torch.nn.Linear(D_in, H)
    self.middle_linear = torch.nn.Linear(H, H)
    self.output_linear = torch.nn.Linear(H, D_out)

  def forward(self, x):
    """
    For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
    and reuse the middle_linear Module that many times to compute hidden layer
    representations.

    Since each forward pass builds a dynamic computation graph, we can use normal
    Python control-flow operators like loops or conditional statements when
    defining the forward pass of the model.

    Here we also see that it is perfectly safe to reuse the same Module many
    times when defining a computational graph. This is a big improvement from Lua
    Torch, where each Module could be used only once.
    """
    h_relu = self.input_linear(x).clamp(min=0)
    for _ in range(random.randint(0, 3)):
      h_relu = self.middle_linear(h_relu).clamp(min=0)
    y_pred = self.output_linear(h_relu)
    return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs.
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
  # Forward pass: Compute predicted y by passing x to the model
  y_pred = model(x)

  # Compute and print loss
  loss = criterion(y_pred, y)
  print(t, loss.item())

  # Zero gradients, perform a backward pass, and update the weights.
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()

In the code above, the forward of DynamicNet contains a piece of the computational graph controlled by a random loop:

for _ in range(random.randint(0, 3)):
      h_relu = self.middle_linear(h_relu).clamp(min=0)

As a result, the computational graph used in each training step is different.
[Figure: computational graph from one run, with three passes through middle_linear]
[Figure: computational graph from a second run, with one pass through middle_linear]
The two figures above show the computational graphs from two runs of the same program, and the structural difference is obvious: the first run went through the loop three times, the second only once. Changing the graph at run time makes graph construction very flexible on the one hand, and fits a programmer's habits well on the other. It is because of this dynamic-graph technique that PyTorch is called a second-generation deep learning framework. One question remains for me, though: does this cost anything in run-time efficiency?
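The varying depth can also be seen without drawing figures, by walking the grad_fn graph and counting its nodes after two separate forward passes. This rough sketch reuses the model and x defined above; the two counts will usually, though not always, differ, because the loop length is drawn anew on every call.

def count_backward_nodes(t):
  """Walk the autograd graph behind tensor t and count its nodes."""
  seen, stack = set(), [t.grad_fn]
  while stack:
    node = stack.pop()
    if node is None or node in seen:
      continue
    seen.add(node)
    stack.extend(fn for fn, _ in node.next_functions)
  return len(seen)

print(count_backward_nodes(model(x)), count_backward_nodes(model(x)))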

9. Summary

https://github.com/jcjohnson/pytorch-examples moves from simple to advanced, giving multiple ways to implement a two-layer network and covering PyTorch's basic language features. It is an excellent learning example.
