Notes on the official PyTorch tutorial "What is torch.nn really?" (Part 2)


The original tutorial is here: What is torch.nn really?
I hadn't planned to split this into parts, but it kept getting longer as I wrote, so split it is.
What follows replaces the hand-rolled pieces of the network, step by step, with functions and classes from torch.nn.

Using torch.nn.functional

Using PyTorch's nn gives our code a few advantages: it becomes shorter, easier to understand, and more flexible.
Step one: use the activation and loss functions provided by torch.nn.functional in place of the ones we wrote by hand.

import torch.nn.functional as F

loss_func = F.cross_entropy

def model(xb):
    return xb @ weights + bias

By convention, torch.nn.functional is imported as F; that is just a so-called habit or convention.

First, compare with the earlier definitions:

def log_softmax(x):
    return x - x.exp().sum(-1).log().unsqueeze(-1)
    
def nll(input, target):
    return -input[range(target.shape[0]), target].mean()

loss_func = nll

def model(xb):
    return log_softmax(xb @ weights + bias)

Clearly, the differences are:

  • F.cross_entropy is used as the loss function, and the definitions of log_softmax and nll are removed.
  • model no longer calls log_softmax.

As mentioned earlier, in PyTorch nll combined with log_softmax is equivalent to cross_entropy; see the documentation for details.

However, the way we call it is unchanged; it is still

print(loss_func(model(xb), yb), accuracy(model(xb), yb))

The result is the same as before. Of course, no training has happened yet.
From here we can verify the relationship between nll, log_softmax and cross_entropy:
the new code effectively computes cross_entropy(xb @ weights + bias, yb),
whereas the original code computed nll(log_softmax(xb @ weights + bias), yb).
So that's all there is to it.
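As a minimal sketch (my own check, not part of the tutorial), the equivalence can also be verified numerically on random data; the shapes below are just an assumption for illustration.

import torch
import torch.nn.functional as F

logits = torch.randn(64, 10)          # a batch of 64 samples, 10 classes
target = torch.randint(0, 10, (64,))  # integer class labels

a = F.cross_entropy(logits, target)
b = F.nll_loss(F.log_softmax(logits, dim=-1), target)
print(torch.allclose(a, b))  # expected: True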

Refactor using nn.Module

Next we use nn.Module and nn.Parameter. Note that Module here is a class in nn, with a capital M; do not confuse it with the Python concept of a module.

from torch import nn

class Mnist_Logistic(nn.Module):
    def __init__(self):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(784, 10) / math.sqrt(784))
        self.bias = nn.Parameter(torch.zeros(10))

    def forward(self, xb):
        return xb @ self.weights + self.bias

Clearly, Mnist_Logistic as defined here is a subclass of nn.Module. The class stores the parameters weights and bias. The important point, different from before, is that there is no explicit requires_grad,
because Parameter is declared like this:
class torch.nn.parameter.Parameter(data=None, requires_grad=True)
Admittedly, magic numbers like 784 and 10 are still an eyesore, but let's not worry about that for now.
Besides holding the parameters, the class also defines a forward method whose body is equivalent to the model we defined earlier.
So the definition of model changes once more, from a function into an object, which is a natural evolution in terms of program design:

model = Mnist_Logistic()

However, the way we call it still does not change:

print(loss_func(model(xb), yb))

Don't be puzzled by the call above: PyTorch's machinery guarantees that model(xb) ends up calling model.forward(xb), so there should be no question here.
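A minimal sketch (assuming the Mnist_Logistic class above and the batch xb from earlier) of the two points just made: the call machinery dispatches to forward, and nn.Parameter attributes are registered with requires_grad=True automatically.

model = Mnist_Logistic()

# model(xb) goes through nn.Module.__call__, which (besides running hooks) calls forward:
print(torch.allclose(model(xb), model.forward(xb)))  # expected: True

# nn.Parameter attributes are picked up by model.parameters() and require gradients:
for name, p in model.named_parameters():
    print(name, p.shape, p.requires_grad)
# weights torch.Size([784, 10]) True
# bias torch.Size([10]) True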

Next, wrap the training procedure in a function:

def fit():
    for epoch in range(epochs):
        for i in range((n - 1) // bs + 1):
            start_i = i * bs
            end_i = start_i + bs
            xb = x_train[start_i:end_i]
            yb = y_train[start_i:end_i]
            pred = model(xb)
            loss = loss_func(pred, yb)

            loss.backward()
            with torch.no_grad():
                for p in model.parameters():
                    p -= p.grad * lr
                model.zero_grad()

As you can see, the changes inside the with torch.no_grad() context speak for themselves, so I won't belabor them.
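One aside worth sketching (my own illustration, not from the tutorial): the update has to sit inside torch.no_grad(), because an in-place update of a leaf tensor that requires grad is otherwise rejected by autograd.

import torch

w = torch.randn(3, requires_grad=True)
w.sum().backward()

try:
    w -= 0.1 * w.grad   # in-place update while autograd is still tracking w
except RuntimeError as e:
    print("error:", e)  # a leaf Variable that requires grad used in an in-place operation

with torch.no_grad():
    w -= 0.1 * w.grad   # fine: the update itself is not recorded by autograd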

Next, a quick check: train, then print the loss to see whether it has gone down:

fit()
print(loss_func(model(xb), yb))

Refactor using nn.Linear

Next, improve the definition of Mnist_Logistic:

class Mnist_Logistic(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(784, 10)

    def forward(self, xb):
        return self.lin(xb)

The nn.Linear class builds a linear layer, replacing the hand-defined self.weights and self.bias as well as the expression xb @ self.weights + self.bias.
nn.Linear is declared like this:

class torch.nn.Linear(in_features, out_features, bias=True, device=None, dtype=None)

Its definition contains, among other things, the following:

def __init__(self, in_features: int, out_features: int, bias: bool = True,
             device=None, dtype=None) -> None:
    factory_kwargs = {'device': device, 'dtype': dtype}
    super(Linear, self).__init__()
    self.in_features = in_features
    self.out_features = out_features
    self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
    if bias:
        self.bias = Parameter(torch.empty(out_features, **factory_kwargs))
    else:
        self.register_parameter('bias', None)
    self.reset_parameters()

def forward(self, input: Tensor) -> Tensor:
    # as mentioned before, F comes from "import torch.nn.functional as F"
    return F.linear(input, self.weight, self.bias)

I find the parameter names in PyTorch declarations rather peculiar; whatever in_features and out_features were meant to convey, the code above tells us:

  • nn.Linear(784, 10) constructs a weight that is a $10\times784$ matrix, i.e. weight.shape is torch.Size([10, 784]); note the (out_features, in_features) order in the torch.empty call above.
  • Two different bias names appear above, and they are obviously not the same thing. The self.bias one is the bias we have been seeing all along: a $1\times10$ vector, or a tensor whose shape is torch.Size([10]).
  • forward calls the linear function from torch.nn.functional, which computes $xW^{T}+b$; since the stored weight is the transpose of our old $784\times10$ matrix, this is exactly the old xb @ self.weights + self.bias.

So don't wonder why self.weights and self.bias are no longer specified, or why the computation xb @ self.weights + self.bias is nowhere to be seen: nn.Linear does all of it (a small check of this equivalence is sketched below).
Therefore, whenever our network needs a linear layer, that is, an expression of the form xb @ self.weights + self.bias, we construct an nn.Linear as above.
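A minimal sketch (shapes assumed, not from the tutorial) checking that nn.Linear performs the same computation as the manual version, up to the transpose of the stored weight:

import torch
from torch import nn

lin = nn.Linear(784, 10)
xb = torch.randn(64, 784)

manual = xb @ lin.weight.t() + lin.bias   # F.linear computes x @ W^T + b
print(torch.allclose(lin(xb), manual))    # expected: True
print(lin.weight.shape, lin.bias.shape)   # torch.Size([10, 784]) torch.Size([10])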

Usage is still the same:

model = Mnist_Logistic()
print(loss_func(model(xb), yb))
fit()
print(loss_func(model(xb), yb))

The two print statements show the loss before and after training.

Refactor using optim

from torch import optim  # note: imported from torch, not from nn

def get_model():
    model = Mnist_Logistic()
    return model, optim.SGD(model.parameters(), lr=lr)

model, opt = get_model()
print(loss_func(model(xb), yb))

for epoch in range(epochs):
    for i in range((n - 1) // bs + 1):
        start_i = i * bs
        end_i = start_i + bs
        xb = x_train[start_i:end_i]
        yb = y_train[start_i:end_i]
        pred = model(xb)
        loss = loss_func(pred, yb)

        loss.backward()
        opt.step()
        opt.zero_grad()

print(loss_func(model(xb), yb))

The notable differences here:

  • A new function get_model returns both model and opt; optim.SGD is discussed separately below.
  • The pair

	opt.step()
  	opt.zero_grad()

replaces

with torch.no_grad():
    for p in model.parameters(): p -= p.grad * lr
    model.zero_grad()

  • The fit() function is gone, but that is incidental: the loop is simply run inline here. Defining a fit() first and then calling it would work just as well.

Those are the differences. Next, let's look at SGD.
Its declaration is:

class torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0,
  weight_decay=0, nesterov=False, *, maximize=False, foreach=None, differentiable=False)

SGD stands for stochastic gradient descent. In PyTorch it also takes an optional momentum parameter, among others, which I won't go into here.
The main reason for singling out optim.SGD is to point out that optim provides many other optimizers as well; see the documentation for the full list.
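As a minimal sketch (my own check, simple shapes assumed), a plain SGD step with no momentum does exactly the manual update p -= lr * p.grad:

import torch
from torch import nn, optim

lr = 0.5
m1 = nn.Linear(784, 10)
m2 = nn.Linear(784, 10)
m2.load_state_dict(m1.state_dict())   # identical starting parameters

xb = torch.randn(64, 784)
yb = torch.randint(0, 10, (64,))
loss_fn = nn.functional.cross_entropy

# manual update
loss_fn(m1(xb), yb).backward()
with torch.no_grad():
    for p in m1.parameters():
        p -= lr * p.grad
    m1.zero_grad()

# optim.SGD update
opt = optim.SGD(m2.parameters(), lr=lr)
loss_fn(m2(xb), yb).backward()
opt.step()
opt.zero_grad()

print(torch.allclose(m1.weight, m2.weight))  # expected: True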

Refactor using Dataset

We have reworked the handling of training; now let's look at the handling of the data itself. The code first:

from torch.utils.data import TensorDataset

train_ds = TensorDataset(x_train, y_train)
model, opt = get_model()
for epoch in range(epochs):
    for i in range((n - 1) // bs + 1):
        xb, yb = train_ds[i * bs: i * bs + bs]
        pred = model(xb)
        loss = loss_func(pred, yb)

        loss.backward()
        opt.step()
        opt.zero_grad()

print(loss_func(model(xb), yb))

Compared with the previous version, the main difference is that

	xb, yb = train_ds[i * bs: i * bs + bs]

replaces

	start_i = i * bs
	end_i = start_i + bs
	xb = x_train[start_i:end_i]
	yb = y_train[start_i:end_i]

Whether start_i and end_i are computed separately beforehand is merely an optimization detail, so don't read too much into the difference from writing train_ds[i*bs : i*bs+bs] directly; that's simply how it is here.
Notice that only TensorDataset is imported, so where is the promised Dataset?
Look at the definition of TensorDataset first:

class TensorDataset(Dataset[Tuple[Tensor, ...]]):
    r"""Dataset wrapping tensors.

    Each sample will be retrieved by indexing tensors along the first dimension.

    Args:
        *tensors (Tensor): tensors that have the same size of the first dimension.
    """
    tensors: Tuple[Tensor, ...]

    def __init__(self, *tensors: Tensor) -> None:
        assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors), "Size mismatch between tensors"
        self.tensors = tensors

    def __getitem__(self, index):
        return tuple(tensor[index] for tensor in self.tensors)

    def __len__(self):
        return self.tensors[0].size(0)

Clearly, TensorDataset is a subclass of Dataset. As the docstring above says, it is a "Dataset wrapping tensors", which simply emphasizes that TensorDataset is a Dataset wrapper specialized for tensors.

In fact, the documentation also shows this definition:

__all__ = [
    "Dataset",
    "IterableDataset",
    "TensorDataset",
    "ConcatDataset",
    "ChainDataset",
    "Subset",
    "random_split",
]

So "Dataset wrapping tensors" is easy enough to understand.
The other point being emphasized is "tensors that have the same size of the first dimension": for the Tuple[Tensor, ...] of wrapped tensors, every Tensor element must have the same size along its first dimension; nothing is required of the remaining dimensions, not even that all elements have the same number of dimensions (a small sketch follows).
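A minimal sketch (toy tensors of my own) of what TensorDataset indexing returns and of the first-dimension requirement:

import torch
from torch.utils.data import TensorDataset

x = torch.arange(12).reshape(6, 2)   # first dimension: 6
y = torch.arange(6)                  # first dimension: 6 -> sizes match
ds = TensorDataset(x, y)

print(len(ds))             # 6
print(ds[0])               # (tensor([0, 1]), tensor(0)) -- a tuple, one element per wrapped tensor
xb, yb = ds[0:3]           # slicing works too, which is what the training loop above relies on
print(xb.shape, yb.shape)  # torch.Size([3, 2]) torch.Size([3])

# Tensors whose first dimensions differ are rejected:
# TensorDataset(torch.zeros(6, 2), torch.zeros(5))  -> AssertionError: Size mismatch between tensors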

Back to Dataset itself: it is an abstract class, defined roughly like this:

class Dataset(Generic[T_co]):
    r"""An abstract class representing a :class:`Dataset`.

    All datasets that represent a map from keys to data samples should subclass
    it. All subclasses should overwrite :meth:`__getitem__`, supporting fetching a
    data sample for a given key. Subclasses could also optionally overwrite
    :meth:`__len__`, which is expected to return the size of the dataset by many
    :class:`~torch.utils.data.Sampler` implementations and the default options
    of :class:`~torch.utils.data.DataLoader`.

    .. note::
      :class:`~torch.utils.data.DataLoader` by default constructs a index
      sampler that yields integral indices.  To make it work with a map-style
      dataset with non-integral indices/keys, a custom sampler must be provided.
    """

    def __getitem__(self, index) -> T_co:
        raise NotImplementedError

    def __add__(self, other: 'Dataset[T_co]') -> 'ConcatDataset[T_co]':
        return ConcatDataset([self, other])

There isn't much to say here, since we never use Dataset directly anyway; in actual use we just follow the pattern above, or rather we construct objects according to the declarations.
Concretely, the two classes above are declared as:

class torch.utils.data.Dataset(*args, **kwds)
class torch.utils.data.TensorDataset(*tensors)

There will of course be temporary objects and type conversions along the way, but that is not our concern, right? As an aside, a toy map-style Dataset of our own is sketched below.
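A minimal sketch (my own toy example, not from the tutorial) of a map-style Dataset subclass; the squaring logic is just a placeholder, the point is which methods need overriding:

import torch
from torch.utils.data import Dataset

class SquaresDataset(Dataset):
    """Returns (n, n*n) pairs for n in [0, size)."""
    def __init__(self, size: int):
        self.size = size

    def __len__(self):          # optional, but expected by samplers and DataLoader defaults
        return self.size

    def __getitem__(self, index):
        x = torch.tensor(float(index))
        return x, x * x

ds = SquaresDataset(5)
print(len(ds), ds[3])   # 5 (tensor(3.), tensor(9.))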

Refactor using DataLoader

Again, the code first:

from torch.utils.data import DataLoader

train_ds = TensorDataset(x_train, y_train)
train_dl = DataLoader(train_ds, batch_size=bs)

model, opt = get_model()

for epoch in range(epochs):
    for xb, yb in train_dl:
        pred = model(xb)
        loss = loss_func(pred, yb)

        loss.backward()
        opt.step()
        opt.zero_grad()

print(loss_func(model(xb), yb))

Clearly, this line has been added:

train_dl = DataLoader(train_ds, batch_size=bs)

and in the for loop,

for i in range((n - 1) // bs + 1):
        xb, yb = train_ds[i * bs: i * bs + bs]

has become

for xb, yb in train_dl:

The essence of the change is that the DataLoader object train_dl now handles batching. Let's first look at the declaration of DataLoader:

class torch.utils.data.DataLoader(
		dataset, batch_size=1, shuffle=None, sampler=None,
 		batch_sampler=None, num_workers=0, collate_fn=None, 
 		pin_memory=False, drop_last=False, timeout=0, 
 		worker_init_fn=None, multiprocessing_context=None, generator=None,
  		*, prefetch_factor=2, persistent_workers=False, pin_memory_device='')

We can create a DataLoader for any Dataset, and then conveniently iterate over each batch through it instead of indexing train_ds[i*bs : i*bs+bs] ourselves. Moreover, the batch_size argument passed when constructing the DataLoader means we no longer need to compute (n - 1) // bs + 1; hand-computing and maintaining loop counts is often a nuisance.
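A minimal sketch (toy sizes of my own) showing that DataLoader takes over the batching arithmetic the manual loop did with (n - 1) // bs + 1:

import torch
from torch.utils.data import TensorDataset, DataLoader

n, bs = 10, 4
ds = TensorDataset(torch.arange(n), torch.arange(n))
dl = DataLoader(ds, batch_size=bs)

print(len(dl), (n - 1) // bs + 1)   # 3 3 -- same number of batches
for xb, yb in dl:
    print(xb.tolist())              # [0, 1, 2, 3] [4, 5, 6, 7] [8, 9] -- the last batch is smaller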

So far we have refactored the hand-rolled network using torch.nn.functional, nn.Module, nn.Parameter, nn.Linear, optim, Dataset, and DataLoader. The complete code now looks like this:

import torch
import numpy as np
import requests
import pickle
import gzip
import torch.nn.functional as F
from torch import nn
from torch import optim
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader
from pathlib import Path
from matplotlib import pyplot

DATA_PATH = Path("data")
PATH = DATA_PATH / "mnist"
PATH.mkdir(parents=True, exist_ok=True)

class Mnist_Logistic(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(784, 10)

    def forward(self, xb):
        return self.lin(xb)
        
def get_model():
    model = Mnist_Logistic()
    return model, optim.SGD(model.parameters(), lr=lr)

#URL = "https://github.com/pytorch/tutorials/raw/master/_static/"
URL = "https://resources.oreilly.com/live-training/inside-unsupervised-learning/-/raw/9f262477e62c3f5a0aa7eb788e557fc7ad1310de/data/mnist_data/"
FILENAME = "mnist.pkl.gz"

# download the dataset first if it is not there yet, then load it
if not (PATH / FILENAME).exists():
    content = requests.get(URL + FILENAME).content
    (PATH / FILENAME).open("wb").write(content)

with gzip.open((PATH / FILENAME).as_posix(), "rb") as f:
    ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding="latin-1")

x_train, y_train, x_valid, y_valid = map(
    torch.tensor, (x_train, y_train, x_valid, y_valid)
)
  
lr = 0.5  # learning rate
epochs = 2  # how many epochs to train for
bs = 64
        
train_ds = TensorDataset(x_train, y_train)
train_dl = DataLoader(train_ds, batch_size=bs)

loss_func = F.cross_entropy
model, opt = get_model()

for epoch in range(epochs):
    for xb, yb in train_dl:
        pred = model(xb)
        loss = loss_func(pred, yb)

        loss.backward()
        opt.step()
        opt.zero_grad()

print(loss_func(model(xb), yb))

It really is quite short.

Add validation

The main purpose of a validation set is to check for overfitting, which I assume everyone knows.
First, introduce the validation set:

train_ds = TensorDataset(x_train, y_train)
train_dl = DataLoader(train_ds, batch_size=bs, shuffle=True)

valid_ds = TensorDataset(x_valid, y_valid)
valid_dl = DataLoader(valid_ds, batch_size=bs * 2)

As you can see, besides wrapping different TensorDatasets, the two DataLoaders are also constructed with different arguments:

  • train_dl = DataLoader(train_ds, batch_size=bs, shuffle=True) means the training data is reshuffled at every epoch, which is also one way of reducing overfitting (see the sketch after this list).
  • DataLoader(valid_ds, batch_size=bs * 2): the validation set needs no backpropagation, so it uses less memory in principle, which is why the batch size is doubled; computing the loss this way is also a bit faster.
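A minimal sketch (an index-only dataset of my own) of what shuffle=True does: the batch order differs on every pass over the DataLoader.

import torch
from torch.utils.data import TensorDataset, DataLoader

ds = TensorDataset(torch.arange(8))
dl = DataLoader(ds, batch_size=4, shuffle=True)

for epoch in range(2):
    order = [xb.tolist() for (xb,) in dl]
    print(epoch, order)   # e.g. [[5, 0, 7, 2], [3, 6, 1, 4]] -- a different order each epoch (with high probability)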

With that, the main part of the program becomes:

model, opt = get_model()

for epoch in range(epochs):
    model.train()  # note this call
    for xb, yb in train_dl:
        pred = model(xb) 
        loss = loss_func(pred, yb)

        loss.backward()
        opt.step()
        opt.zero_grad()

    model.eval()  # note this call
    with torch.no_grad():
        valid_loss = sum(loss_func(model(xb), yb) for xb, yb in valid_dl)

    print(epoch, valid_loss / len(valid_dl))

To save you from scrolling back up, here is the earlier version again:

model, opt = get_model()

for epoch in range(epochs):
    for xb, yb in train_dl:
        pred = model(xb)
        loss = loss_func(pred, yb)

        loss.backward()
        opt.step()
        opt.zero_grad()

print(loss_func(model(xb), yb))

As you can see,

  • Every training epoch now contains one call to model.train() and one call to model.eval(); the earlier hand-rolled code did not need such calls, though you could of course write your own. These two calls make sure the model behaves correctly in the two different phases, training and inference; that is determined by PyTorch's internal machinery, so just follow the pattern (see the sketch after this list).
  • Also, the sum in valid_loss = sum(loss_func(model(xb), yb) for xb, yb in valid_dl) is Python's built-in sum applied to a generator of per-batch losses; it is not the Tensor.sum() method we called earlier as a method on a tensor object, which is why it is invoked differently and takes different arguments. That's all that needs saying here; see the documentation of sum if you want more.
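A minimal sketch (a throwaway module with Dropout, my own example) of what train()/eval() actually toggle. Our Mnist_Logistic contains no dropout or batch-norm layers, so the calls change nothing here, but keeping them is the safe habit once such layers appear.

import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
print(drop(x))   # roughly half the entries zeroed, the rest scaled by 1/(1-p) = 2

drop.eval()
print(drop(x))   # identity: tensor([1., 1., 1., 1., 1., 1., 1., 1.])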

Create fit() and get_data()

We can slim the code down further:

def get_data(train_ds, valid_ds, bs):
    return (
        DataLoader(train_ds, batch_size=bs, shuffle=True),
        DataLoader(valid_ds, batch_size=bs * 2),
    )

def loss_batch(model, loss_func, xb, yb, opt=None):
    loss = loss_func(model(xb), yb)

    if opt is not None:
        loss.backward()
        opt.step()
        opt.zero_grad()

    return loss.item(), len(xb)

def fit(epochs, model, loss_func, opt, train_dl, valid_dl):
    for epoch in range(epochs):
        model.train()
        for xb, yb in train_dl:
            loss_batch(model, loss_func, xb, yb, opt)

        model.eval()
        with torch.no_grad():
            losses, nums = zip(
                *[loss_batch(model, loss_func, xb, yb) for xb, yb in valid_dl]
            )
        val_loss = np.sum(np.multiply(losses, nums)) / np.sum(nums)

        print(epoch, val_loss)

zip above is a Python built-in, so look it up if needed; also note that the sum used for val_loss is NumPy's np.sum, not PyTorch's.
The rest becomes clear from a side-by-side comparison, so I won't belabor it. A toy walk-through of the weighted-average computation is sketched below.
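A minimal sketch (made-up numbers) of the zip(*...) unpacking and of the weighted average that loss_batch and fit compute together:

import numpy as np

batches = [(0.50, 64), (0.40, 64), (0.30, 32)]   # (mean loss, batch size) per batch
losses, nums = zip(*batches)                      # (0.5, 0.4, 0.3) and (64, 64, 32)

val_loss = np.sum(np.multiply(losses, nums)) / np.sum(nums)
print(val_loss)   # about 0.42 -- a per-sample average that weights the smaller last batch correctly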

Finally, our code becomes:

import torch
import numpy as np
import requests
import pickle
import gzip
import torch.nn.functional as F
from torch import nn
from torch import optim
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader
from pathlib import Path
from matplotlib import pyplot

DATA_PATH = Path("data")
PATH = DATA_PATH / "mnist"
PATH.mkdir(parents=True, exist_ok=True)
FILENAME = "mnist.pkl.gz"

class Mnist_Logistic(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(784, 10)

    def forward(self, xb):
        return self.lin(xb)
        
def get_model():
    model = Mnist_Logistic()
    return model, optim.SGD(model.parameters(), lr=lr)

def get_data(train_ds, valid_ds, bs):
    return (
        DataLoader(train_ds, batch_size=bs, shuffle=True),
        DataLoader(valid_ds, batch_size=bs * 2),
    )

def loss_batch(model, loss_func, xb, yb, opt=None):
    loss = loss_func(model(xb), yb)

    if opt is not None:
        loss.backward()
        opt.step()
        opt.zero_grad()

    return loss.item(), len(xb)

def fit(epochs, model, loss_func, opt, train_dl, valid_dl):
    for epoch in range(epochs):
        model.train()
        for xb, yb in train_dl:
            loss_batch(model, loss_func, xb, yb, opt)

        model.eval()
        with torch.no_grad():
            losses, nums = zip(
                *[loss_batch(model, loss_func, xb, yb) for xb, yb in valid_dl]
            )
        val_loss = np.sum(np.multiply(losses, nums)) / np.sum(nums)

        print(epoch, val_loss)

#URL = "https://github.com/pytorch/tutorials/raw/master/_static/"
URL = "https://resources.oreilly.com/live-training/inside-unsupervised-learning/-/raw/9f262477e62c3f5a0aa7eb788e557fc7ad1310de/data/mnist_data/"

# download the dataset first if it is not there yet, then load it
if not (PATH / FILENAME).exists():
    content = requests.get(URL + FILENAME).content
    (PATH / FILENAME).open("wb").write(content)

with gzip.open((PATH / FILENAME).as_posix(), "rb") as f:
    ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding="latin-1")
        
#pyplot.imshow(x_train[0].reshape((28, 28)), cmap="gray")
#print(x_train.shape)
#print(x_train)

x_train, y_train, x_valid, y_valid = map(
    torch.tensor, (x_train, y_train, x_valid, y_valid)
)
  
lr = 0.5  # learning rate
epochs = 2  # how many epochs to train for
bs = 64
        
train_ds = TensorDataset(x_train, y_train)
valid_ds = TensorDataset(x_valid, y_valid)
train_dl, valid_dl = get_data(train_ds, valid_ds, bs)

loss_func = F.cross_entropy
model, opt = get_model()
fit(epochs, model, loss_func, opt, train_dl, valid_dl)

It does look as though the code has been slimmed down, yet none of this affects execution efficiency, and readability has not improved either, if it hasn't actually gotten worse. In theory the functions defined here have some reusability, but will they really be reused?
