Notes on the official PyTorch tutorial "What is torch.nn really?" (Part 2)


The original tutorial is here: What is torch.nn really?
I hadn't planned to split this into parts, but it kept getting longer as I wrote, so split it is.
What follows replaces the hand-rolled pieces of the network, step by step, with functions and classes from torch.nn.

Using torch.nn.functional

Using PyTorch's nn gives our code a few advantages: it becomes shorter, easier to understand, and more flexible.
Step one: use the activation and loss functions provided by torch.nn.functional in place of the ones we wrote by hand.

import torch.nn.functional as F

loss_func = F.cross_entropy

def model(xb):
    return xb @ weights + bias

By convention, torch.nn.functional is imported as F; that is just a so-called habit or convention.

First, compare with the earlier definitions:

def log_softmax(x):
    return x - x.exp().sum(-1).log().unsqueeze(-1)
    
def nll(input, target):
    return -input[range(target.shape[0]), target].mean()

loss_func = nll

def model(xb):
    return log_softmax(xb @ weights + bias)

Clearly, the differences are:

  • F.cross_entropy is used as the loss function, and the definitions of log_softmax and nll are removed.
  • model no longer calls log_softmax.

As mentioned earlier, in PyTorch nll combined with log_softmax is equivalent to cross_entropy; see the documentation for details.

However, the way we call it is unchanged; it is still

print(loss_func(model(xb), yb), accuracy(model(xb), yb))

The result is the same as before. Of course, no training has happened yet.
From here we can verify the relationship between nll, log_softmax and cross_entropy:
the new code effectively computes cross_entropy(xb @ weights + bias, yb),
whereas the original code computed nll(log_softmax(xb @ weights + bias), yb).
So that's all there is to it.
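As a minimal sketch (my own check, not part of the tutorial), the equivalence can also be verified numerically on random data; the shapes below are just an assumption for illustration.

import torch
import torch.nn.functional as F

logits = torch.randn(64, 10)          # a batch of 64 samples, 10 classes
target = torch.randint(0, 10, (64,))  # integer class labels

a = F.cross_entropy(logits, target)
b = F.nll_loss(F.log_softmax(logits, dim=-1), target)
print(torch.allclose(a, b))  # expected: True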

Refactor using nn.Module

Next we use nn.Module and nn.Parameter. Note that Module here is a class in nn, with a capital M; do not confuse it with the Python concept of a module.

from torch import nn

class Mnist_Logistic(nn.Module):
    def __init__(self):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(784, 10) / math.sqrt(784))
        self.bias = nn.Parameter(torch.zeros(10))

    def forward(self, xb):
        return xb @ self.weights + self.bias

Clearly, Mnist_Logistic as defined here is a subclass of nn.Module. The class stores the parameters weights and bias. The important point, different from before, is that there is no explicit requires_grad,
because Parameter is declared like this:
class torch.nn.parameter.Parameter(data=None, requires_grad=True)
Admittedly, magic numbers like 784 and 10 are still an eyesore, but let's not worry about that for now.
Besides holding the parameters, the class also defines a forward method whose body is equivalent to the model we defined earlier.
So the definition of model changes once more, from a function into an object, which is a natural evolution in terms of program design:

model = Mnist_Logistic()

However, the way we call it still does not change:

print(loss_func(model(xb), yb))

Don't be puzzled by the call above: PyTorch's machinery guarantees that model(xb) ends up calling model.forward(xb), so there should be no question here.
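A minimal sketch (assuming the Mnist_Logistic class above and the batch xb from earlier) of the two points just made: the call machinery dispatches to forward, and nn.Parameter attributes are registered with requires_grad=True automatically.

model = Mnist_Logistic()

# model(xb) goes through nn.Module.__call__, which (besides running hooks) calls forward:
print(torch.allclose(model(xb), model.forward(xb)))  # expected: True

# nn.Parameter attributes are picked up by model.parameters() and require gradients:
for name, p in model.named_parameters():
    print(name, p.shape, p.requires_grad)
# weights torch.Size([784, 10]) True
# bias torch.Size([10]) True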

Next, wrap the training procedure in a function:

def fit():
    for epoch in range(epochs):
        for i in range((n - 1) // bs + 1):
            start_i = i * bs
            end_i = start_i + bs
            xb = x_train[start_i:end_i]
            yb = y_train[start_i:end_i]
            pred = model(xb)
            loss = loss_func(pred, yb)

            loss.backward()
            with torch.no_grad():
                for p in model.parameters():
                    p -= p.grad * lr
                model.zero_grad()

As you can see, the changes inside the with torch.no_grad() context speak for themselves, so I won't belabor them.
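One aside worth sketching (my own illustration, not from the tutorial): the update has to sit inside torch.no_grad(), because an in-place update of a leaf tensor that requires grad is otherwise rejected by autograd.

import torch

w = torch.randn(3, requires_grad=True)
w.sum().backward()

try:
    w -= 0.1 * w.grad   # in-place update while autograd is still tracking w
except RuntimeError as e:
    print("error:", e)  # a leaf Variable that requires grad used in an in-place operation

with torch.no_grad():
    w -= 0.1 * w.grad   # fine: the update itself is not recorded by autograd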

Next, a quick check: train, then print the loss to see whether it has gone down:

fit()
print(loss_func(model(xb), yb))

Refactor using nn.Linear

Next, improve the definition of Mnist_Logistic:

class Mnist_Logistic(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(784, 10)

    def forward(self, xb):
        return self.lin(xb)

The nn.Linear class builds a linear layer, replacing the hand-defined self.weights and self.bias as well as the expression xb @ self.weights + self.bias.
nn.Linear is declared like this:

class torch.nn.Linear(in_features, out_features, bias=True, device=None, dtype=None)

Its definition contains, among other things, the following:

def __init__(self, in_features: int, out_features: int, bias: bool = True,
             device=None, dtype=None) -> None:
    factory_kwargs = {'device': device, 'dtype': dtype}
    super(Linear, self).__init__()
    self.in_features = in_features
    self.out_features = out_features
    self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
    if bias:
        self.bias = Parameter(torch.empty(out_features, **factory_kwargs))
    else:
        self.register_parameter('bias', None)
    self.reset_parameters()

def forward(self, input: Tensor) -> Tensor:
    # as mentioned before, F comes from "import torch.nn.functional as F"
    return F.linear(input, self.weight, self.bias)

I find the parameter names in PyTorch declarations rather peculiar; whatever in_features and out_features were meant to convey, the code above tells us:

  • nn.Linear(784, 10) constructs a weight that is a $10\times784$ matrix, i.e. weight.shape is torch.Size([10, 784]); note the (out_features, in_features) order in the torch.empty call above.
  • Two different bias names appear above, and they are obviously not the same thing. The self.bias one is the bias we have been seeing all along: a $1\times10$ vector, or a tensor whose shape is torch.Size([10]).
  • forward calls the linear function from torch.nn.functional, which computes $xW^{T}+b$; since the stored weight is the transpose of our old $784\times10$ matrix, this is exactly the old xb @ self.weights + self.bias.

So don't wonder why self.weights and self.bias are no longer specified, or why the computation xb @ self.weights + self.bias is nowhere to be seen: nn.Linear does all of it (a small check of this equivalence is sketched below).
Therefore, whenever our network needs a linear layer, that is, an expression of the form xb @ self.weights + self.bias, we construct an nn.Linear as above.
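A minimal sketch (shapes assumed, not from the tutorial) checking that nn.Linear performs the same computation as the manual version, up to the transpose of the stored weight:

import torch
from torch import nn

lin = nn.Linear(784, 10)
xb = torch.randn(64, 784)

manual = xb @ lin.weight.t() + lin.bias   # F.linear computes x @ W^T + b
print(torch.allclose(lin(xb), manual))    # expected: True
print(lin.weight.shape, lin.bias.shape)   # torch.Size([10, 784]) torch.Size([10])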

Usage is still the same:

model = Mnist_Logistic()
print(loss_func(model(xb), yb))
fit()
print(loss_func(model(xb), yb))

The two print statements show the loss before and after training.

Refactor using optim

from torch import optim  # note: imported from torch, not from nn

def get_model():
    model = Mnist_Logistic()
    return model, optim.SGD(model.parameters(), lr=lr)

model, opt = get_model()
print(loss_func(model(xb), yb))

for epoch in range(epochs):
    for i in range((n - 1) // bs + 1):
        start_i = i * bs
        end_i = start_i + bs
        xb = x_train[start_i:end_i]
        yb = y_train[start_i:end_i]
        pred = model(xb)
        loss = loss_func(pred, yb)

        loss.backward()
        opt.step()
        opt.zero_grad()

print(loss_func(model(xb), yb))

The notable differences here:

  • A new function get_model returns both model and opt; optim.SGD is discussed separately below.
  • The pair

	opt.step()
  	opt.zero_grad()

replaces

with torch.no_grad():
    for p in model.parameters(): p -= p.grad * lr
    model.zero_grad()

  • The fit() function is gone, but that is incidental: the loop is simply run inline here. Defining a fit() first and then calling it would work just as well.

Those are the differences. Next, let's look at SGD.
Its declaration is:

class torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0,
  weight_decay=0, nesterov=False, *, maximize=False, foreach=None, differentiable=False)

SGD stands for stochastic gradient descent. In PyTorch it also takes an optional momentum parameter, among others, which I won't go into here.
The main reason for singling out optim.SGD is to point out that optim provides many other optimizers as well; see the documentation for the full list.
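As a minimal sketch (my own check, simple shapes assumed), a plain SGD step with no momentum does exactly the manual update p -= lr * p.grad:

import torch
from torch import nn, optim

lr = 0.5
m1 = nn.Linear(784, 10)
m2 = nn.Linear(784, 10)
m2.load_state_dict(m1.state_dict())   # identical starting parameters

xb = torch.randn(64, 784)
yb = torch.randint(0, 10, (64,))
loss_fn = nn.functional.cross_entropy

# manual update
loss_fn(m1(xb), yb).backward()
with torch.no_grad():
    for p in m1.parameters():
        p -= lr * p.grad
    m1.zero_grad()

# optim.SGD update
opt = optim.SGD(m2.parameters(), lr=lr)
loss_fn(m2(xb), yb).backward()
opt.step()
opt.zero_grad()

print(torch.allclose(m1.weight, m2.weight))  # expected: True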

Refactor using Dataset

We have reworked the handling of training; now let's look at the handling of the data itself. The code first:

from torch.utils.data import TensorDataset

train_ds = TensorDataset(x_train, y_train)
model, opt = get_model()
for epoch in range(epochs):
    for i in range((n - 1) // bs + 1):
        xb, yb = train_ds[i * bs: i * bs + bs]
        pred = model(xb)
        loss = loss_func(pred, yb)

        loss.backward()
        opt.step()
        opt.zero_grad()

print(loss_func(model(xb), yb))

Compared with the previous version, the main difference is that

	xb, yb = train_ds[i * bs: i * bs + bs]

replaces

	start_i = i * bs
	end_i = start_i + bs
	xb = x_train[start_i:end_i]
	yb = y_train[start_i:end_i]

Whether start_i and end_i are computed separately beforehand is merely an optimization detail, so don't read too much into the difference from writing train_ds[i*bs : i*bs+bs] directly; that's simply how it is here.
Notice that only TensorDataset is imported, so where is the promised Dataset?
Look at the definition of TensorDataset first:

class TensorDataset(Dataset[Tuple[Tensor, ...]]):
    r"""Dataset wrapping tensors.

    Each sample will be retrieved by indexing tensors along the first dimension.

    Args:
        *tensors (Tensor): tensors that have the same size of the first dimension.
    """
    tensors: Tuple[Tensor, ...]

    def __init__(self, *tensors: Tensor) -> None:
        assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors), "Size mismatch between tensors"
        self.tensors = tensors

    def __getitem__(self, index):
        return tuple(tensor[index] for tensor in self.tensors)

    def __len__(self):
        return self.tensors[0].size(0)

Clearly, TensorDataset is a subclass of Dataset. As the docstring above says, it is a "Dataset wrapping tensors", which simply emphasizes that TensorDataset is a Dataset wrapper specialized for tensors.

In fact, the documentation also shows this definition:

__all__ = [
    "Dataset",
    "IterableDataset",
    "TensorDataset",
    "ConcatDataset",
    "ChainDataset",
    "Subset",
    "random_split",
]

So "Dataset wrapping tensors" is easy enough to understand.
The other point being emphasized is "tensors that have the same size of the first dimension": for the Tuple[Tensor, ...] of wrapped tensors, every Tensor element must have the same size along its first dimension; nothing is required of the remaining dimensions, not even that all elements have the same number of dimensions (a small sketch follows).
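A minimal sketch (toy tensors of my own) of what TensorDataset indexing returns and of the first-dimension requirement:

import torch
from torch.utils.data import TensorDataset

x = torch.arange(12).reshape(6, 2)   # first dimension: 6
y = torch.arange(6)                  # first dimension: 6 -> sizes match
ds = TensorDataset(x, y)

print(len(ds))             # 6
print(ds[0])               # (tensor([0, 1]), tensor(0)) -- a tuple, one element per wrapped tensor
xb, yb = ds[0:3]           # slicing works too, which is what the training loop above relies on
print(xb.shape, yb.shape)  # torch.Size([3, 2]) torch.Size([3])

# Tensors whose first dimensions differ are rejected:
# TensorDataset(torch.zeros(6, 2), torch.zeros(5))  -> AssertionError: Size mismatch between tensors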

Back to Dataset itself: it is an abstract class, defined roughly like this:

class Dataset(Generic[T_co]):
    r"""An abstract class representing a :class:`Dataset`.

    All datasets that represent a map from keys to data samples should subclass
    it. All subclasses should overwrite :meth:`__getitem__`, supporting fetching a
    data sample for a given key. Subclasses could also optionally overwrite
    :meth:`__len__`, which is expected to return the size of the dataset by many
    :class:`~torch.utils.data.Sampler` implementations and the default options
    of :class:`~torch.utils.data.DataLoader`.

    .. note::
      :class:`~torch.utils.data.DataLoader` by default constructs a index
      sampler that yields integral indices.  To make it work with a map-style
      dataset with non-integral indices/keys, a custom sampler must be provided.
    """

    def __getitem__(self, index) -> T_co:
        raise NotImplementedError

    def __add__(self, other: 'Dataset[T_co]') -> 'ConcatDataset[T_co]':
        return ConcatDataset([self, other])

There isn't much to say here, since we never use Dataset directly anyway; in actual use we just follow the pattern above, or rather we construct objects according to the declarations.
Concretely, the two classes above are declared as:

class torch.utils.data.Dataset(*args, **kwds)
class torch.utils.data.TensorDataset(*tensors)

There will of course be temporary objects and type conversions along the way, but that is not our concern, right? As an aside, a toy map-style Dataset of our own is sketched below.
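A minimal sketch (my own toy example, not from the tutorial) of a map-style Dataset subclass; the squaring logic is just a placeholder, the point is which methods need overriding:

import torch
from torch.utils.data import Dataset

class SquaresDataset(Dataset):
    """Returns (n, n*n) pairs for n in [0, size)."""
    def __init__(self, size: int):
        self.size = size

    def __len__(self):          # optional, but expected by samplers and DataLoader defaults
        return self.size

    def __getitem__(self, index):
        x = torch.tensor(float(index))
        return x, x * x

ds = SquaresDataset(5)
print(len(ds), ds[3])   # 5 (tensor(3.), tensor(9.))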

Refactor using DataLoader

Again, the code first:

from torch.utils.data import DataLoader

train_ds = TensorDataset(x_train, y_train)
train_dl = DataLoader(train_ds, batch_size=bs)

model, opt = get_model()

for epoch in range(epochs):
    for xb, yb in train_dl:
        pred = model(xb)
        loss = loss_func(pred, yb)

        loss.backward()
        opt.step()
        opt.zero_grad()

print(loss_func(model(xb), yb))

Clearly, this line has been added:

train_dl = DataLoader(train_ds, batch_size=bs)

and in the for loop,

for i in range((n - 1) // bs + 1):
        xb, yb = train_ds[i * bs: i * bs + bs]

has become

for xb, yb in train_dl:

The essence of the change is that the DataLoader object train_dl now handles batching. Let's first look at the declaration of DataLoader:

class torch.utils.data.DataLoader(
		dataset, batch_size=1, shuffle=None, sampler=None,
 		batch_sampler=None, num_workers=0, collate_fn=None, 
 		pin_memory=False, drop_last=False, timeout=0, 
 		worker_init_fn=None, multiprocessing_context=None, generator=None,
  		*, prefetch_factor=2, persistent_workers=False, pin_memory_device='')

We can create a DataLoader for any Dataset, and then conveniently iterate over each batch through it instead of indexing train_ds[i*bs : i*bs+bs] ourselves. Moreover, the batch_size argument passed when constructing the DataLoader means we no longer need to compute (n - 1) // bs + 1; hand-computing and maintaining loop counts is often a nuisance.
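A minimal sketch (toy sizes of my own) showing that DataLoader takes over the batching arithmetic the manual loop did with (n - 1) // bs + 1:

import torch
from torch.utils.data import TensorDataset, DataLoader

n, bs = 10, 4
ds = TensorDataset(torch.arange(n), torch.arange(n))
dl = DataLoader(ds, batch_size=bs)

print(len(dl), (n - 1) // bs + 1)   # 3 3 -- same number of batches
for xb, yb in dl:
    print(xb.tolist())              # [0, 1, 2, 3] [4, 5, 6, 7] [8, 9] -- the last batch is smaller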

So far we have refactored the hand-rolled network using torch.nn.functional, nn.Module, nn.Parameter, nn.Linear, optim, Dataset, and DataLoader. The complete code now looks like this:

import torch
import numpy as np
import requests
import pickle
import gzip
import torch.nn.functional as F
from torch import nn
from torch import optim
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader
from pathlib import Path
from matplotlib import pyplot

DATA_PATH = Path("data")
PATH = DATA_PATH / "mnist"
PATH.mkdir(parents=True, exist_ok=True)

class Mnist_Logistic(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(784, 10)

    def forward(self, xb):
        return self.lin(xb)
        
def get_model():
    model = Mnist_Logistic()
    return model, optim.SGD(model.parameters(), lr=lr)

#URL = "https://github.com/pytorch/tutorials/raw/master/_static/"
URL = "https://resources.oreilly.com/live-training/inside-unsupervised-learning/-/raw/9f262477e62c3f5a0aa7eb788e557fc7ad1310de/data/mnist_data/"
FILENAME = "mnist.pkl.gz"

# download the dataset first if it is not there yet, then load it
if not (PATH / FILENAME).exists():
    content = requests.get(URL + FILENAME).content
    (PATH / FILENAME).open("wb").write(content)

with gzip.open((PATH / FILENAME).as_posix(), "rb") as f:
    ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding="latin-1")

x_train, y_train, x_valid, y_valid = map(
    torch.tensor, (x_train, y_train, x_valid, y_valid)
)
  
lr = 0.5  # learning rate
epochs = 2  # how many epochs to train for
bs = 64
        
train_ds = TensorDataset(x_train, y_train)
train_dl = DataLoader(train_ds, batch_size=bs)

loss_func = F.cross_entropy
model, opt = get_model()

for epoch in range(epochs):
    for xb, yb in train_dl:
        pred = model(xb)
        loss = loss_func(pred, yb)

        loss.backward()
        opt.step()
        opt.zero_grad()

print(loss_func(model(xb), yb))

It really is quite short.

Add validation

The main purpose of a validation set is to check for overfitting, which I assume everyone knows.
First, introduce the validation set:

train_ds = TensorDataset(x_train, y_train)
train_dl = DataLoader(train_ds, batch_size=bs, shuffle=True)

valid_ds = TensorDataset(x_valid, y_valid)
valid_dl = DataLoader(valid_ds, batch_size=bs * 2)

As you can see, besides wrapping different TensorDatasets, the two DataLoaders are also constructed with different arguments:

  • train_dl = DataLoader(train_ds, batch_size=bs, shuffle=True) means the training data is reshuffled at every epoch, which is also one way of reducing overfitting (see the sketch after this list).
  • DataLoader(valid_ds, batch_size=bs * 2): the validation set needs no backpropagation, so it uses less memory in principle, which is why the batch size is doubled; computing the loss this way is also a bit faster.
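A minimal sketch (an index-only dataset of my own) of what shuffle=True does: the batch order differs on every pass over the DataLoader.

import torch
from torch.utils.data import TensorDataset, DataLoader

ds = TensorDataset(torch.arange(8))
dl = DataLoader(ds, batch_size=4, shuffle=True)

for epoch in range(2):
    order = [xb.tolist() for (xb,) in dl]
    print(epoch, order)   # e.g. [[5, 0, 7, 2], [3, 6, 1, 4]] -- a different order each epoch (with high probability)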

With that, the main part of the program becomes:

model, opt = get_model()

for epoch in range(epochs):
    model.train()  # note this call
    for xb, yb in train_dl:
        pred = model(xb) 
        loss = loss_func(pred, yb)

        loss.backward()
        opt.step()
        opt.zero_grad()

    model.eval()  # note this call
    with torch.no_grad():
        valid_loss = sum(loss_func(model(xb), yb) for xb, yb in valid_dl)

    print(epoch, valid_loss / len(valid_dl))

To save you from scrolling back up, here is the earlier version again:

model, opt = get_model()

for epoch in range(epochs):
    for xb, yb in train_dl:
        pred = model(xb)
        loss = loss_func(pred, yb)

        loss.backward()
        opt.step()
        opt.zero_grad()

print(loss_func(model(xb), yb))

As you can see,

  • Every training epoch now contains one call to model.train() and one call to model.eval(); the earlier hand-rolled code did not need such calls, though you could of course write your own. These two calls make sure the model behaves correctly in the two different phases, training and inference; that is determined by PyTorch's internal machinery, so just follow the pattern (see the sketch after this list).
  • Also, the sum in valid_loss = sum(loss_func(model(xb), yb) for xb, yb in valid_dl) is Python's built-in sum applied to a generator of per-batch losses; it is not the Tensor.sum() method we called earlier as a method on a tensor object, which is why it is invoked differently and takes different arguments. That's all that needs saying here; see the documentation of sum if you want more.
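A minimal sketch (a throwaway module with Dropout, my own example) of what train()/eval() actually toggle. Our Mnist_Logistic contains no dropout or batch-norm layers, so the calls change nothing here, but keeping them is the safe habit once such layers appear.

import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
print(drop(x))   # roughly half the entries zeroed, the rest scaled by 1/(1-p) = 2

drop.eval()
print(drop(x))   # identity: tensor([1., 1., 1., 1., 1., 1., 1., 1.])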

Create fit() and get_data()

We can slim the code down further:

def get_data(train_ds, valid_ds, bs):
    return (
        DataLoader(train_ds, batch_size=bs, shuffle=True),
        DataLoader(valid_ds, batch_size=bs * 2),
    )

def loss_batch(model, loss_func, xb, yb, opt=None):
    loss = loss_func(model(xb), yb)

    if opt is not None:
        loss.backward()
        opt.step()
        opt.zero_grad()

    return loss.item(), len(xb)

def fit(epochs, model, loss_func, opt, train_dl, valid_dl):
    for epoch in range(epochs):
        model.train()
        for xb, yb in train_dl:
            loss_batch(model, loss_func, xb, yb, opt)

        model.eval()
        with torch.no_grad():
            losses, nums = zip(
                *[loss_batch(model, loss_func, xb, yb) for xb, yb in valid_dl]
            )
        val_loss = np.sum(np.multiply(losses, nums)) / np.sum(nums)

        print(epoch, val_loss)

zip above is a Python built-in, so look it up if needed; also note that the sum used for val_loss is NumPy's np.sum, not PyTorch's.
The rest becomes clear from a side-by-side comparison, so I won't belabor it. A toy walk-through of the weighted-average computation is sketched below.
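A minimal sketch (made-up numbers) of the zip(*...) unpacking and of the weighted average that loss_batch and fit compute together:

import numpy as np

batches = [(0.50, 64), (0.40, 64), (0.30, 32)]   # (mean loss, batch size) per batch
losses, nums = zip(*batches)                      # (0.5, 0.4, 0.3) and (64, 64, 32)

val_loss = np.sum(np.multiply(losses, nums)) / np.sum(nums)
print(val_loss)   # about 0.42 -- a per-sample average that weights the smaller last batch correctly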

Finally, our code becomes:

import torch
import numpy as np
import requests
import pickle
import gzip
import torch.nn.functional as F
from torch import nn
from torch import optim
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader
from pathlib import Path
from matplotlib import pyplot

DATA_PATH = Path("data")
PATH = DATA_PATH / "mnist"
PATH.mkdir(parents=True, exist_ok=True)
FILENAME = "mnist.pkl.gz"

class Mnist_Logistic(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(784, 10)

    def forward(self, xb):
        return self.lin(xb)
        
def get_model():
    model = Mnist_Logistic()
    return model, optim.SGD(model.parameters(), lr=lr)

def get_data(train_ds, valid_ds, bs):
    return (
        DataLoader(train_ds, batch_size=bs, shuffle=True),
        DataLoader(valid_ds, batch_size=bs * 2),
    )

def loss_batch(model, loss_func, xb, yb, opt=None):
    loss = loss_func(model(xb), yb)

    if opt is not None:
        loss.backward()
        opt.step()
        opt.zero_grad()

    return loss.item(), len(xb)

def fit(epochs, model, loss_func, opt, train_dl, valid_dl):
    for epoch in range(epochs):
        model.train()
        for xb, yb in train_dl:
            loss_batch(model, loss_func, xb, yb, opt)

        model.eval()
        with torch.no_grad():
            losses, nums = zip(
                *[loss_batch(model, loss_func, xb, yb) for xb, yb in valid_dl]
            )
        val_loss = np.sum(np.multiply(losses, nums)) / np.sum(nums)

        print(epoch, val_loss)

#URL = "https://github.com/pytorch/tutorials/raw/master/_static/"
URL = "https://resources.oreilly.com/live-training/inside-unsupervised-learning/-/raw/9f262477e62c3f5a0aa7eb788e557fc7ad1310de/data/mnist_data/"

# download the dataset first if it is not there yet, then load it
if not (PATH / FILENAME).exists():
    content = requests.get(URL + FILENAME).content
    (PATH / FILENAME).open("wb").write(content)

with gzip.open((PATH / FILENAME).as_posix(), "rb") as f:
    ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding="latin-1")
        
#pyplot.imshow(x_train[0].reshape((28, 28)), cmap="gray")
#print(x_train.shape)
#print(x_train)

x_train, y_train, x_valid, y_valid = map(
    torch.tensor, (x_train, y_train, x_valid, y_valid)
)
  
lr = 0.5  # learning rate
epochs = 2  # how many epochs to train for
bs = 64
        
train_ds = TensorDataset(x_train, y_train)
valid_ds = TensorDataset(x_valid, y_valid)
train_dl, valid_dl = get_data(train_ds, valid_ds, bs)

loss_func = F.cross_entropy
model, opt = get_model()
fit(epochs, model, loss_func, opt, train_dl, valid_dl)

It does look as though the code has been slimmed down, yet none of this affects execution efficiency, and readability has not improved either, if it hasn't actually gotten worse. In theory the functions defined here have some reusability, but will they really be reused?
