


Numpy is a generic framework for scientific computing; it does not know anything about computation graphs, or deep learning, or gradients. However we can easily use numpy to fit a two-layer network to random data by manually implementing the forward and backward passes through the network using numpy operations.
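NumPy is a widely used scientific computing framework, but it was developed early and without GPU support in mind, so it is no longer well suited to today's deep learning workloads. Its API is very rich, however, so building a simple neural network with it is still easy. Below is an example of a two-layer fully connected network built with NumPy; the activation function is ReLU.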

# Code in file tensor/two_layer_net_numpy.py
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
  # Forward pass: compute predicted y
  h = x.dot(w1)
  h_relu = np.maximum(h, 0)
  y_pred = h_relu.dot(w2)
  # Compute and print loss
  loss = np.square(y_pred - y).sum()
  print(t, loss)
  # Backprop to compute gradients of w1 and w2 with respect to loss
  grad_y_pred = 2.0 * (y_pred - y)
  grad_w2 = h_relu.T.dot(grad_y_pred)
  grad_h_relu = grad_y_pred.dot(w2.T)
  grad_h = grad_h_relu.copy()
  grad_h[h < 0] = 0
  grad_w1 = x.T.dot(grad_h)
  # Update weights
  w1 -= learning_rate * grad_w1
  w2 -= learning_rate * grad_w2

1. Generating the prediction $\mathbf y_p$:
$$
\mathbf y_p^k = W_2\cdot\sigma(W_1\cdot \mathbf x^k)\qquad(1)
$$
where $\mathbf x^k = [x_1, x_2, \cdots, x_m]$ is an m-dim vector, $\mathbf y^k = [y_1, y_2, \cdots, y_n]$ is an n-dim vector, and
$$
\sigma(x_i)=\left\{ \begin{array}{cc} x_i & x_i\ge 0 \\ 0 & \text{otherwise} \end{array}\right.
$$
is an element-wise ReLU nonlinear function.
For simplicity, (1) omits the neuron biases and considers only the weight matrices ($W_1, W_2$). $W_1$ has shape $h \times m$, the input vector $\mathbf x$ is $m \times 1$, $W_2$ has shape $n \times h$, and the output vector $\mathbf y$ is $n \times 1$. By the shape requirements of matrix multiplication, the shapes from input to output are related as follows:
$$
n \times 1 = (n \times h)\cdot (h \times m)\cdot (m \times 1)
$$
$$
\text{Loss: }L=\sum_{k=1}^N \Vert \mathbf y_p^k - \mathbf y^k \Vert^2 \qquad(2)
$$
1. Differentiating the loss with respect to $W_2$. First consider a single sample in (2):
$$
L^k=(\mathbf y_p^k - \mathbf y^k)^T(\mathbf y_p^k - \mathbf y^k)
$$
Let $h^k=\sigma(W_1\cdot \mathbf x^k)$, so:
$$
L^k=(W_2h^k - \mathbf y^k)^T(W_2h^k - \mathbf y^k)\\
\frac{\partial L^k}{\partial W_2} = 2 (W_2h^k - \mathbf y^k)(h^k)^T \qquad(3)
$$
$$
\frac{\partial L}{\partial W_2}=\sum_{k=1}^N 2 (W_2h^k - \mathbf y^k)(h^k)^T =2\{W_2[h^1,h^2,\cdots,h^N]-[\mathbf y^1,\mathbf y^2,\cdots,\mathbf y^N]\}\cdot \left[ \begin{array}{c}(h^1)^T \\(h^2)^T \\ \vdots \\(h^N)^T \end{array} \right]\\=2(W_2H-Y)H^T \qquad(4)
$$
Here $Y=[\mathbf y^1,\mathbf y^2,\cdots,\mathbf y^N]$ denotes the ground truth of the N samples in the batch, and $H=[h^1,h^2,\cdots,h^N]$ is the first-layer output for the batch, represented by h_relu in the code above. The code stores one sample per row, so h_relu corresponds to $H^T$ and grad_w2 to the transpose of (4).
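Formula (4) can be sanity-checked numerically. The sketch below is not part of the original code: it assumes small illustrative sizes and a fixed seed, and compares the analytic grad_w2 expression from the NumPy example with a central finite-difference estimate of the same gradient.

```python
import numpy as np

# Small illustrative sizes and a fixed seed (assumptions, not from the original).
rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

def loss_fn(w2_value):
    """Sum-of-squares loss of the two-layer net as a function of w2 only."""
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2_value) - y).sum()

# Analytic gradient, written exactly as in the NumPy example (row-per-sample layout).
h_relu = np.maximum(x.dot(w1), 0)
grad_w2 = h_relu.T.dot(2.0 * (h_relu.dot(w2) - y))

# Central finite differences, one weight at a time.
eps = 1e-6
numeric = np.zeros_like(w2)
for i in range(H):
    for j in range(D_out):
        step = np.zeros_like(w2)
        step[i, j] = eps
        numeric[i, j] = (loss_fn(w2 + step) - loss_fn(w2 - step)) / (2 * eps)

max_abs_diff = np.abs(numeric - grad_w2).max()
print("largest difference between analytic and numeric gradient:", max_abs_diff)
```

Because the loss is quadratic in w2, the two estimates should agree up to floating-point rounding.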
2. Gradient with respect to $W_1$
$$
L=(\mathbf y_p-\mathbf y)^T(\mathbf y_p-\mathbf y)= [W_2h(W_1\mathbf x)-\mathbf y]^T[W_2h(W_1\mathbf x)-\mathbf y]\\
\frac{\partial L}{\partial W_1}=\frac{\partial L}{\partial h}\frac{\partial h}{\partial W_1}=2 (W_2h - \mathbf y)(W_2)^T\frac{\partial h}{\partial W_1}=2 (\mathbf y_p - \mathbf y)(W_2)^T\frac{\partial h}{\partial W_1}\qquad(5)
$$

# Backprop to compute gradients of w1 and w2 with respect to loss
grad_y_pred = 2.0 * (y_pred - y)
grad_w2 = h_relu.T.dot(grad_y_pred)
grad_h_relu = grad_y_pred.dot(w2.T)
grad_h = grad_h_relu.copy()
grad_h[h < 0] = 0
grad_w1 = x.T.dot(grad_h)

The first half of (5) is clearly visible in the code (grad_y_pred and grad_h_relu); the only piece still unresolved is $\frac{\partial h}{\partial W_1}$, where:
$$
h=\sigma(W_1 \mathbf x)\\
\frac{\partial h}{\partial W_1}= \sigma'(W_1 \mathbf x)\mathbf x \\
\frac{\partial L}{\partial W_1}=2 (\mathbf y_p - \mathbf y)(W_2)^T \sigma'(W_1 \mathbf x)\mathbf x \qquad(6)
$$
$\sigma(x)$ is ReLU. The function consists of two straight-line segments, so its derivative also has two parts: it equals 1 where x is greater than 0 and 0 where x is less than 0, that is:
$$
\sigma'(W_1 \mathbf x)=\left \{ \begin{array}{cc}1 & \text{element of } W_1\mathbf x\ge 0 \\ 0 & \text{otherwise}\end{array} \right .
$$

grad_h[h < 0] = 0
grad_w1 = x.T.dot(grad_h)

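Here h<0 is exactly $W_1\mathbf x\lt 0$. It acts like a filter: positions where $W_1\mathbf x\lt 0$ are set to 0 and blocked, while everything else passes through. To avoid modifying the original data, the code uses grad_h = grad_h_relu.copy(), that is, it copies the array into a new variable and operates on that copy.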
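The filtering step can be written two ways, and the resulting grad_w1 can be checked numerically. The following sketch uses small illustrative sizes and a fixed seed (assumptions, not from the original). It shows that masking a copy is the same as multiplying by the ReLU derivative, that the copy does not share memory with grad_h_relu, and that grad_w1 agrees with a finite-difference estimate.

```python
import numpy as np

# Small illustrative sizes and a fixed seed (assumptions, not from the original).
rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

def loss_fn(w1_value):
    """Sum-of-squares loss of the two-layer net as a function of w1 only."""
    return np.square(np.maximum(x.dot(w1_value), 0).dot(w2) - y).sum()

h = x.dot(w1)
grad_y_pred = 2.0 * (np.maximum(h, 0).dot(w2) - y)
grad_h_relu = grad_y_pred.dot(w2.T)

# Variant 1: mask a copy, as in the example code.
grad_h = grad_h_relu.copy()
grad_h[h < 0] = 0

# Variant 2: multiply element-wise by the ReLU derivative (1 where h >= 0, else 0).
relu_derivative = (h >= 0).astype(h.dtype)
grad_h_alt = grad_h_relu * relu_derivative

masks_agree = np.array_equal(grad_h, grad_h_alt)
copy_is_independent = not np.shares_memory(grad_h, grad_h_relu)

# Finite-difference check of grad_w1.
grad_w1 = x.T.dot(grad_h)
eps = 1e-6
numeric = np.zeros_like(w1)
for i in range(D_in):
    for j in range(H):
        step = np.zeros_like(w1)
        step[i, j] = eps
        numeric[i, j] = (loss_fn(w1 + step) - loss_fn(w1 - step)) / (2 * eps)

max_abs_diff = np.abs(numeric - grad_w1).max()
print(masks_agree, copy_is_independent, max_abs_diff)
```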



Here we introduce the most fundamental PyTorch concept: the Tensor. A PyTorch Tensor is conceptually identical to a numpy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors.

import torch

device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device)
w2 = torch.randn(H, D_out, device=device)

learning_rate = 1e-6
for t in range(500):
  # Forward pass: compute predicted y
  h = x.mm(w1)
  h_relu = h.clamp(min=0)
  y_pred = h_relu.mm(w2)

  # Compute and print loss; loss is a scalar, and is stored in a PyTorch Tensor
  # of shape (); we can get its value as a Python number with loss.item().
  loss = (y_pred - y).pow(2).sum()
  print(t, loss.item())

  # Backprop to compute gradients of w1 and w2 with respect to loss
  grad_y_pred = 2.0 * (y_pred - y)
  grad_w2 = h_relu.t().mm(grad_y_pred)
  grad_h_relu = grad_y_pred.mm(w2.t())
  grad_h = grad_h_relu.clone()
  grad_h[h < 0] = 0
  grad_w1 = x.t().mm(grad_h)

  # Update weights using gradient descent
  w1 -= learning_rate * grad_w1
  w2 -= learning_rate * grad_w2


# Numpy
h_relu = np.maximum(h, 0)
# PyTorch
h_relu = h.clamp(min=0)


# Numpy
grad_w1 = x.T.dot(grad_h)
# PyTorch
grad_w1 = x.t().mm(grad_h)
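To see that the paired NumPy and PyTorch lines compute the same thing, the sketch below runs both on identical data. The array sizes and the seed are illustrative assumptions; torch.from_numpy is used only to share the same values between the two libraries.

```python
import numpy as np
import torch

# Illustrative data (sizes and seed are assumptions, not from the original).
rng = np.random.default_rng(0)
h_np = rng.standard_normal((4, 3))
x_np = rng.standard_normal((4, 5))
grad_h_np = rng.standard_normal((4, 3))

# Tensors that hold the same float64 values as the arrays.
h_t = torch.from_numpy(h_np)
x_t = torch.from_numpy(x_np)
grad_h_t = torch.from_numpy(grad_h_np)

# ReLU: np.maximum(h, 0) versus h.clamp(min=0)
relu_np = np.maximum(h_np, 0)
relu_t = h_t.clamp(min=0)

# Matrix product with a transpose: x.T.dot(...) versus x.t().mm(...)
grad_w1_np = x_np.T.dot(grad_h_np)
grad_w1_t = x_t.t().mm(grad_h_t)

relu_match = np.allclose(relu_np, relu_t.numpy())
grad_match = np.allclose(grad_w1_np, grad_w1_t.numpy())
print(relu_match, grad_match)
```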



In the above examples, we had to manually implement both the forward and backward passes of our neural network. Manually implementing the backward pass is not a big deal for a small two-layer network, but can quickly get very hairy for large complex networks.

Thankfully, we can use automatic differentiation to automate the computation of backward passes in neural networks. The autograd package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your network will define a computational graph; nodes in the graph will be Tensors, and edges will be functions that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.
With automatic differentiation, the backward gradient computation does not appear in the code, yet it still takes place: it is encapsulated in the nodes (Tensors, called Variables in older PyTorch versions) and edges (Functions) of the computational graph. The graph is created while the forward pass executes.
If we want to compute gradients with respect to some Tensor, then we set requires_grad=True when constructing that Tensor. Any PyTorch operations on that Tensor will cause a computational graph to be constructed, allowing us to later perform backpropagation through the graph. If x is a Tensor with requires_grad=True, then after backpropagation x.grad will be another Tensor holding the gradient of some scalar value with respect to x.
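A minimal sketch of this behaviour, using a hand-picked three-element tensor (an illustrative assumption): after backward(), x.grad holds the gradient of the scalar with respect to x, and a second backward() call adds to x.grad rather than replacing it, which is why training loops zero the gradients.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Scalar loss = sum(x_i^2), so d(loss)/dx = 2 * x.
loss = (x ** 2).sum()
loss.backward()
first_grad = x.grad.clone()

# A second forward/backward pass accumulates into x.grad.
loss = (x ** 2).sum()
loss.backward()
accumulated_grad = x.grad.clone()

# Reset the accumulated gradient in place.
x.grad.zero_()

print(first_grad, accumulated_grad, x.grad)
```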

import torch

device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Create random Tensors for weights; setting requires_grad=True means that we
# want to compute gradients for these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
  # Forward pass: compute predicted y using operations on Tensors. 
  # Since w1 and w2 have requires_grad=True, operations involving these
  # Tensors will cause PyTorch to build a computational graph, allowing 
  # automatic computation of gradients. Since we are no longer 
  # implementing the backward pass by hand we don't need to keep 
  # references to intermediate values.
  y_pred = x.mm(w1).clamp(min=0).mm(w2)
  # Compute and print loss. Loss is a Tensor of shape (), and loss.item()
  # is a Python number giving its value.
  loss = (y_pred - y).pow(2).sum()
  print(t, loss.item())

  # Use autograd to compute the backward pass. This call will compute the
  # gradient of loss with respect to all Tensors with requires_grad=True.
  # After this call w1.grad and w2.grad will be Tensors holding the gradient
  # of the loss with respect to w1 and w2 respectively.
  loss.backward()

  # Update weights using gradient descent. For this step we just want to mutate
  # the values of w1 and w2 in-place; we don't want to build up a computational
  # graph for the update steps, so we use the torch.no_grad() context manager
  # to prevent PyTorch from building a computational graph for the updates
  with torch.no_grad():
    w1 -= learning_rate * w1.grad
    w2 -= learning_rate * w2.grad

    # Manually zero the gradients after running the backward pass
    w1.grad.zero_()
    w2.grad.zero_()


w1 -= learning_rate * w1.grad


w1 = torch.randn(D_in, H, device=device) # from part (2)

w1 = torch.randn(D_in, H, device=device, requires_grad=True) # from part (3)
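The requires_grad flag also changes how the update line must be written. The sketch below uses a toy three-element weight (an illustrative assumption, not from the original): an in-place update of a leaf Tensor that requires grad raises a RuntimeError outside torch.no_grad(), while the same update inside the context manager succeeds and leaves the Tensor a leaf.

```python
import torch

w = torch.ones(3, requires_grad=True)
loss = (2.0 * w).sum()
loss.backward()  # w.grad is now a tensor of 2s

# Outside no_grad(), autograd rejects the in-place update of a leaf Tensor.
try:
    w -= 0.1 * w.grad
    in_place_error = None
except RuntimeError as err:
    in_place_error = err

# Inside no_grad(), the update is a plain in-place mutation of the values.
with torch.no_grad():
    w -= 0.1 * w.grad

still_leaf = w.is_leaf and w.grad_fn is None
print(type(in_place_error).__name__, w, still_leaf)
```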

To visualize the structure of this network, see the approach described at https://github.com/szagoruyko/pytorchviz.



import torch

class MyReLU(torch.autograd.Function):
  """
  We can implement our own custom autograd Functions by subclassing
  torch.autograd.Function and implementing the forward and backward passes
  which operate on Tensors.
  """
  @staticmethod
  def forward(ctx, x):
    """
    In the forward pass we receive a context object and a Tensor containing the
    input; we must return a Tensor containing the output, and we can use the
    context object to cache objects for use in the backward pass.
    """
    ctx.save_for_backward(x)
    return x.clamp(min=0)

  @staticmethod
  def backward(ctx, grad_output):
    """
    In the backward pass we receive the context object and a Tensor containing
    the gradient of the loss with respect to the output produced during the
    forward pass. We can retrieve cached data from the context object, and must
    compute and return the gradient of the loss with respect to the input to the
    forward function.
    """
    x, = ctx.saved_tensors
    grad_x = grad_output.clone()
    grad_x[x < 0] = 0
    return grad_x

device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and output
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
  # Forward pass: compute predicted y using operations on Tensors; we call our
  # custom ReLU implementation using the MyReLU.apply function
  y_pred = MyReLU.apply(x.mm(w1)).mm(w2)
  # Compute and print loss
  loss = (y_pred - y).pow(2).sum()
  print(t, loss.item())

  # Use autograd to compute the backward pass.
  loss.backward()

  with torch.no_grad():
    # Update weights using gradient descent
    w1 -= learning_rate * w1.grad
    w2 -= learning_rate * w2.grad

    # Manually zero the gradients after running the backward pass
    w1.grad.zero_()
    w2.grad.zero_()

Tensor and Function are interconnected and build up an acyclic graph that encodes a complete history of computation. Each Tensor has a .grad_fn attribute that references the Function that created it (except for Tensors created by the user, whose grad_fn is None).
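A short sketch of the .grad_fn attribute on an illustrative 2x3 tensor (the shapes and names are assumptions): the user-created Tensor has no grad_fn, while every Tensor produced by an operation on it does.

```python
import torch

a = torch.randn(2, 3, requires_grad=True)  # created by the user
b = a.clamp(min=0)                         # produced by an operation
c = b.sum()                                # produced by another operation

user_grad_fn = a.grad_fn                   # None for user-created Tensors
b_has_grad_fn = b.grad_fn is not None
c_has_grad_fn = c.grad_fn is not None

# The exact class names of the grad_fn objects depend on the PyTorch version.
print(user_grad_fn, b.grad_fn, c.grad_fn)
```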

y_pred = MyReLU.apply(x.mm(w1)).mm(w2)
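A hand-written backward pass is easy to get wrong, so it can help to compare it with numerical gradients. The sketch below is one possible way to do that with torch.autograd.gradcheck; SketchReLU is an illustrative re-statement of a custom ReLU Function (the name and the test input are assumptions), and the input avoids 0, where ReLU is not differentiable.

```python
import torch
from torch.autograd import gradcheck

class SketchReLU(torch.autograd.Function):
    """Illustrative custom ReLU, used here only for the gradient check."""

    @staticmethod
    def forward(ctx, x):
        # Cache the input so backward can tell where it was negative.
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        grad_x = grad_output.clone()
        grad_x[x < 0] = 0
        return grad_x

# gradcheck compares analytic and finite-difference gradients; it expects
# double-precision inputs that require grad.
inp = torch.tensor([[-2.0, -0.5, 0.5], [1.0, 3.0, -1.5]],
                   dtype=torch.double, requires_grad=True)
check_passed = gradcheck(SketchReLU.apply, (inp,), eps=1e-6, atol=1e-4)
print(check_passed)
```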


# Use autograd to compute the backward pass.
loss.backward()



When building neural networks we frequently think of arranging the computation into layers, some of which have learnable parameters which will be optimized during learning.
In PyTorch, the nn package serves this purpose. The nn package defines a set of Modules, which are roughly equivalent to neural network layers. A Module receives input Tensors and computes output Tensors, but may also hold internal state such as Tensors containing learnable parameters. The nn package also defines a set of useful loss functions that are commonly used when training neural networks.
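As a small illustration of both points (the layer sizes and the seed below are assumptions), a Linear Module exposes its learnable weight and bias through named_parameters(), and MSELoss with reduction='sum' matches the hand-written sum of squared errors, while reduction='mean' divides that sum by the number of elements.

```python
import torch

torch.manual_seed(0)

# A Module holds its learnable parameters as internal state.
layer = torch.nn.Linear(4, 2)
param_shapes = {name: tuple(p.shape) for name, p in layer.named_parameters()}
print(param_shapes)

x = torch.randn(3, 4)
y = torch.randn(3, 2)
y_pred = layer(x)

# Loss functions from the nn package versus the manual expression.
sum_loss = torch.nn.MSELoss(reduction='sum')(y_pred, y)
mean_loss = torch.nn.MSELoss(reduction='mean')(y_pred, y)
manual_sum = (y_pred - y).pow(2).sum()
print(sum_loss.item(), mean_loss.item(), manual_sum.item())
```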

import torch

device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
# After constructing the model we use the .to() method to move it to the
# desired device.
model = torch.nn.Sequential(
          torch.nn.Linear(D_in, H),
          torch.nn.ReLU(),
          torch.nn.Linear(H, D_out),
        ).to(device)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function. Setting
# reduction='sum' means that we are computing the *sum* of squared errors rather
# than the mean; this is for consistency with the examples above where we
# manually compute the loss, but in practice it is more common to use mean
# squared error as a loss by setting reduction='mean'.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
  # Forward pass: compute predicted y by passing x to the model. Module objects
  # override the __call__ operator so you can call them like functions. When
  # doing so you pass a Tensor of input data to the Module and it produces
  # a Tensor of output data.
  y_pred = model(x)

  # Compute and print loss. We pass Tensors containing the predicted and true
  # values of y, and the loss function returns a Tensor containing the loss.
  loss = loss_fn(y_pred, y)
  print(t, loss.item())
  # Zero the gradients before running the backward pass.
  model.zero_grad()

  # Backward pass: compute gradient of the loss with respect to all the learnable
  # parameters of the model. Internally, the parameters of each Module are stored
  # in Tensors with requires_grad=True, so this call will compute gradients for
  # all learnable parameters in the model.
  loss.backward()

  # Update the weights using gradient descent. Each parameter is a Tensor, so
  # we can access its data and gradients like we did before.
  with torch.no_grad():
    for param in model.parameters():
      param.data -= learning_rate * param.grad




import torch

class TwoLayerNet(torch.nn.Module):
  def __init__(self, D_in, H, D_out):
    """
    In the constructor we instantiate two nn.Linear modules and assign them as
    member variables.
    """
    super(TwoLayerNet, self).__init__()
    self.linear1 = torch.nn.Linear(D_in, H)
    self.linear2 = torch.nn.Linear(H, D_out)

  def forward(self, x):
    """
    In the forward function we accept a Tensor of input data and we must return
    a Tensor of output data. We can use Modules defined in the constructor as
    well as arbitrary (differentiable) operations on Tensors.
    """
    h_relu = self.linear1(x).clamp(min=0)
    y_pred = self.linear2(h_relu)
    return y_pred

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above.
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
loss_fn = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
  # Forward pass: Compute predicted y by passing x to the model
  y_pred = model(x)

  # Compute and print loss
  loss = loss_fn(y_pred, y)
  print(t, loss.item())

  # Zero gradients, perform a backward pass, and update the weights.
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()
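For plain SGD without momentum, the optimizer step performs the same update that the earlier examples wrote by hand. The sketch below checks this on a tiny Linear model; the sizes, the seed, and the learning rate are illustrative assumptions.

```python
import torch

torch.manual_seed(0)

# Tiny model and data (illustrative assumptions, not from the original).
model = torch.nn.Linear(3, 1)
x = torch.randn(5, 3)
y = torch.randn(5, 1)
lr = 1e-2

optimizer = torch.optim.SGD(model.parameters(), lr=lr)
loss_fn = torch.nn.MSELoss(reduction='sum')

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()

# What the manual rule "param -= lr * param.grad" would produce.
expected = [p.detach() - lr * p.grad for p in model.parameters()]

# Let the optimizer apply the update in place.
optimizer.step()

matches = all(torch.allclose(p.detach(), e)
              for p, e in zip(model.parameters(), expected))
print(matches)
```

With momentum or other optimizers the update rule differs, so this equality holds only for the vanilla SGD configuration used here.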




import torch

class TwoLayerNet(torch.nn.Module):
  def __init__(self, D_in, H, D_out):
    """
    In the constructor we instantiate two nn.Linear modules and assign them as
    member variables.
    """
    super(TwoLayerNet, self).__init__()
    self.linear1 = torch.nn.Linear(D_in, H)
    self.linear2 = torch.nn.Linear(H, D_out)
    self.relu = MyReLU.apply  # changed here: MyReLU is the custom Function defined earlier

  def forward(self, x):
    """
    In the forward function we accept a Tensor of input data and we must return
    a Tensor of output data. We can use Modules defined in the constructor as
    well as arbitrary (differentiable) operations on Tensors.
    """
    h_relu = self.relu(self.linear1(x))  # changed here
    y_pred = self.linear2(h_relu)
    return y_pred

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above.
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
loss_fn = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
  # Forward pass: Compute predicted y by passing x to the model
  y_pred = model(x)

  # Compute and print loss
  loss = loss_fn(y_pred, y)
  print(t, loss.item())

  # Zero gradients, perform a backward pass, and update the weights.
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()




import random
import torch

class DynamicNet(torch.nn.Module):
  def __init__(self, D_in, H, D_out):
    """
    In the constructor we construct three nn.Linear instances that we will use
    in the forward pass.
    """
    super(DynamicNet, self).__init__()
    self.input_linear = torch.nn.Linear(D_in, H)
    self.middle_linear = torch.nn.Linear(H, H)
    self.output_linear = torch.nn.Linear(H, D_out)

  def forward(self, x):
    """
    For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
    and reuse the middle_linear Module that many times to compute hidden layer
    representations.

    Since each forward pass builds a dynamic computation graph, we can use normal
    Python control-flow operators like loops or conditional statements when
    defining the forward pass of the model.

    Here we also see that it is perfectly safe to reuse the same Module many
    times when defining a computational graph. This is a big improvement from Lua
    Torch, where each Module could be used only once.
    """
    h_relu = self.input_linear(x).clamp(min=0)
    for _ in range(random.randint(0, 3)):
      h_relu = self.middle_linear(h_relu).clamp(min=0)
    y_pred = self.output_linear(h_relu)
    return y_pred

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs.
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
  # Forward pass: Compute predicted y by passing x to the model
  y_pred = model(x)

  # Compute and print loss
  loss = criterion(y_pred, y)
  print(t, loss.item())

  # Zero gradients, perform a backward pass, and update the weights.
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()


for _ in range(random.randint(0, 3)):
  h_relu = self.middle_linear(h_relu).clamp(min=0)
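Reusing one Module inside a Python loop means the same weights appear several times in the graph, and autograd sums the gradient contributions from every use. The sketch below shows this with a single-weight Linear layer; the weight value 3, the input 2, and the fixed repeat count are illustrative assumptions chosen so the result is easy to verify by hand.

```python
import torch

# One shared layer with a single weight w = 3 and no bias.
layer = torch.nn.Linear(1, 1, bias=False)
with torch.no_grad():
    layer.weight.fill_(3.0)

x = torch.tensor([[2.0]])
repeats = 2  # in a dynamic net this count could differ on every forward pass

out = x
for _ in range(repeats):
    out = layer(out)  # after two uses: out = w * w * x

out.sum().backward()

# d(w^2 * x)/dw = 2 * w * x, the sum of the contributions from both uses.
output_value = out.item()
weight_grad = layer.weight.grad.item()
print(output_value, weight_grad)
```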



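https://github.com/jcjohnson/pytorch-examples works from simple to advanced, presenting several ways to implement a two-layer network and covering the basic features of PyTorch. It is a good example to learn from.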





