1. Source Material
2. Implementation from Scratch
2.1 Parameter Initialization
An RNN differs from an MLP in that its hidden layer is recurrent: at each time step it takes the current input together with the previous hidden state and produces the current output, as follows:

$$H_t = \phi(X_t W_{xh} + H_{t-1} W_{hh} + b_h)$$
$$O_t = H_t W_{hq} + b_q$$

where $\phi$ is the activation function, $H_{t-1}$ and $X_t$ are the previous hidden state and the current input, and $H_t$ and $O_t$ are the current hidden state and output.
(Image source: 54 循环神经网络 RNN【动手学深度学习v2】_哔哩哔哩_bilibili)
The RNN's training objective is to make each output match the input token that arrives next: because an output is produced before the next token is observed, the model never gets to see the token it is being asked to predict.
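The recurrence above can be checked numerically. Below is a minimal sketch (the sizes and random weights are arbitrary toy values, not from the book) showing that one step of $H_t = \phi(X_t W_{xh} + H_{t-1} W_{hh} + b_h)$ preserves the hidden-state shape:

```python
import torch

torch.manual_seed(0)
batch_size, num_inputs, num_hiddens = 2, 5, 4

# Arbitrary toy values: current input, previous hidden state, small weights
X_t = torch.randn(batch_size, num_inputs)
H_prev = torch.zeros(batch_size, num_hiddens)
W_xh = torch.randn(num_inputs, num_hiddens) * 0.01
W_hh = torch.randn(num_hiddens, num_hiddens) * 0.01
b_h = torch.zeros(num_hiddens)

# One step of the recurrence: H_t = tanh(X_t W_xh + H_{t-1} W_hh + b_h)
H_t = torch.tanh(X_t @ W_xh + H_prev @ W_hh + b_h)
print(H_t.shape)  # torch.Size([2, 4]): hidden state stays (batch_size, num_hiddens)
```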
Code:

```python
%matplotlib inline
import math
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

def get_params(vocab_size, num_hiddens, device):
    num_inputs = num_outputs = vocab_size

    def normal(shape):
        return torch.randn(size=shape, device=device) * 0.01

    # Hidden-layer parameters
    W_xh = normal((num_inputs, num_hiddens))
    W_hh = normal((num_hiddens, num_hiddens))
    b_h = torch.zeros(num_hiddens, device=device)
    # Output-layer parameters
    W_hq = normal((num_hiddens, num_outputs))
    b_q = torch.zeros(num_outputs, device=device)
    # Attach gradients
    params = [W_xh, W_hh, b_h, W_hq, b_q]
    for param in params:
        param.requires_grad_(True)
    return params
```
At the very first time step there is no previous hidden state yet, so before producing the first output the hidden state must be initialized:
```python
def init_rnn_state(batch_size, num_hiddens, device):
    return (torch.zeros((batch_size, num_hiddens), device=device), )
```
(Mu Li returns a tuple here only for code reuse in a later chapter, where the state holds more than one tensor; it can be ignored for now.)
2.2 Forward Computation
With the formulas from 2.1, the forward pass simply follows the equations. Note that the input X has shape (num_steps, batch_size, vocab_size), where num_steps is the length of the extracted sentence (RNNs are not well suited to very long sentences).
Ignoring the batch dimension, each training sentence is traversed along the time-step dimension, and the per-step output tokens are concatenated into the output sentence. The loss compares the output sentence against the target sentence; since each output is a vocabulary-sized vector, this reduces to a sequence of classification problems, so cross-entropy applies. (NLP commonly reports perplexity, the exponential of the average cross-entropy: 1 is perfect, the vocabulary size corresponds to uniform random guessing, and infinity is the worst case.)
With batching, the model takes N sentences in and produces N sentences out. For convenient loss computation, the batch and time-step dimensions are merged, giving outputs of shape (num_steps * batch_size, vocab_size).
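To make the perplexity definition concrete, here is a small sketch (the helper `perplexity` and the toy numbers are illustrative, not from the book): perplexity is the exponential of the average per-token cross-entropy, so a perfect model scores 1 and uniform guessing over a vocabulary of size V scores V:

```python
import math

def perplexity(token_losses):
    """exp of the average per-token cross-entropy loss."""
    return math.exp(sum(token_losses) / len(token_losses))

# A perfect model assigns probability 1 to every target token: loss 0, perplexity 1
print(perplexity([0.0, 0.0, 0.0]))  # 1.0

# Uniform guessing over V tokens gives loss log(V) per token, so perplexity is V
V = 28
print(perplexity([math.log(V)] * 3))  # ~28
```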
Code:

```python
def rnn(inputs, state, params):
    W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    # The loop runs num_steps times, once per time step
    for X in inputs:
        H = torch.tanh(torch.mm(X, W_xh) +
                       torch.mm(H, W_hh) +
                       b_h)
        Y = torch.mm(H, W_hq) + b_q
        outputs.append(Y)
    return torch.cat(outputs, dim=0), (H, )
```
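The merged output shape can be verified with a self-contained sketch (toy sizes and random weights; it re-declares minimal parameters instead of calling `get_params` so it runs on its own):

```python
import torch

num_steps, batch_size, vocab_size, num_hiddens = 5, 2, 28, 16

# Dummy input of shape (num_steps, batch_size, vocab_size)
X = torch.zeros(num_steps, batch_size, vocab_size)
W_xh = torch.randn(vocab_size, num_hiddens) * 0.01
W_hh = torch.randn(num_hiddens, num_hiddens) * 0.01
b_h = torch.zeros(num_hiddens)
W_hq = torch.randn(num_hiddens, vocab_size) * 0.01
b_q = torch.zeros(vocab_size)

H = torch.zeros(batch_size, num_hiddens)
outputs = []
for X_t in X:  # iterate over the time-step dimension
    H = torch.tanh(X_t @ W_xh + H @ W_hh + b_h)
    outputs.append(H @ W_hq + b_q)
Y = torch.cat(outputs, dim=0)
print(Y.shape)  # torch.Size([10, 28]): (num_steps * batch_size, vocab_size)
```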
2.3 Building the RNN Model
Wrapping the functions from 2.1 and 2.2 gives an RNN class:
```python
class RNN_scratch:
    def __init__(self, vocab_size, num_hiddens, device,
                 get_params, init_state, forward_fn):
        self.vocab_size, self.num_hiddens = vocab_size, num_hiddens
        self.params = get_params(vocab_size, num_hiddens, device)
        self.init_state, self.forward_fn = init_state, forward_fn

    def __call__(self, X, state):
        # X.shape = (batch_size, num_steps)
        X = F.one_hot(X.T, self.vocab_size).type(torch.float32)
        return self.forward_fn(X, state, self.params)

    def begin_state(self, batch_size, device):
        return self.init_state(batch_size, self.num_hiddens, device)
```
2.4 Prediction Function
Prediction (i.e., at test time) means generating N characters after a given prefix. The weights W and b stay fixed here (retraining on the test-time prefix would contaminate the model). The prefix is first fed through to build up the hidden state; then the loop runs N times, each iteration using the previous hidden state to generate the current output.
Code:

```python
def predict(prefix, num_preds, net, vocab, device):
    state = net.begin_state(batch_size=1, device=device)
    outputs = [vocab[prefix[0]]]
    get_input = lambda: torch.tensor([outputs[-1]],
                                     device=device).reshape(1, 1)
    # Warm-up: feed the prefix to build up the hidden state
    for y in prefix[1:]:
        _, state = net(get_input(), state)
        outputs.append(vocab[y])
    # Predict num_preds steps
    for _ in range(num_preds):
        y, state = net(get_input(), state)
        outputs.append(int(y.argmax(dim=1).reshape(1)))
    return ''.join([vocab.idx_to_token[i] for i in outputs])
```
2.5 Gradient Clipping
Gradient clipping is a common trick in recurrent networks: after each backward pass, check whether the L2 norm of all parameter gradients taken together exceeds a preset threshold θ. If it does, rescale every gradient by θ/norm so the global norm becomes exactly θ; otherwise leave the gradients unchanged. This prevents exploding gradients.
Code:

```python
def grad_clipping(net, theta):
    if isinstance(net, nn.Module):
        params = [p for p in net.parameters() if p.requires_grad]
    else:
        params = net.params
    norm = torch.sqrt(sum(torch.sum((p.grad ** 2)) for p in params))
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm
```
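The clipping rule can be verified on a toy parameter. The sketch below uses a variant of `grad_clipping` that takes a parameter list directly (the function name and numbers are illustrative only):

```python
import torch

def clip_params(params, theta):
    """Rescale gradients in place so their global L2 norm is at most theta."""
    norm = torch.sqrt(sum(torch.sum(p.grad ** 2) for p in params))
    if norm > theta:
        for p in params:
            p.grad[:] *= theta / norm
    return norm

w = torch.ones(3, requires_grad=True)
(10 * w).sum().backward()          # w.grad = [10, 10, 10], norm = sqrt(300)
norm_before = clip_params([w], theta=1.0)
norm_after = torch.sqrt(torch.sum(w.grad ** 2))
print(round(norm_before.item(), 2), round(norm_after.item(), 2))  # 17.32 1.0
```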
2.6 Model Training
Mu Li wraps one epoch of recurrent-network training into a single function; the source:
```python
def train_epoch_ch8(net, train_iter, loss, updater, device, use_random_iter):
    """Train a net within one epoch (defined in Chapter 8).

    Defined in :numref:`sec_rnn_scratch`"""
    state, timer = None, d2l.Timer()
    metric = d2l.Accumulator(2)  # Sum of training loss, no. of tokens
    for X, Y in train_iter:
        if state is None or use_random_iter:
            # Initialize `state` when either it is the first iteration or
            # using random sampling
            state = net.begin_state(batch_size=X.shape[0], device=device)
        else:
            if isinstance(net, nn.Module) and not isinstance(state, tuple):
                # `state` is a tensor for `nn.GRU`
                state.detach_()
            else:
                # `state` is a tuple of tensors for `nn.LSTM` and
                # for our custom scratch implementation
                for s in state:
                    s.detach_()
        y = Y.T.reshape(-1)
        X, y = X.to(device), y.to(device)
        y_hat, state = net(X, state)
        l = loss(y_hat, y.long()).mean()
        if isinstance(updater, torch.optim.Optimizer):
            updater.zero_grad()
            l.backward()
            grad_clipping(net, 1)
            updater.step()
        else:
            l.backward()
            grad_clipping(net, 1)
            # Since the `mean` function has been invoked
            updater(batch_size=1)
        metric.add(l * d2l.size(y), d2l.size(y))
    return math.exp(metric[0] / metric[1]), metric[1] / timer.stop()
```
Differences from the earlier training functions:
- At the start of training the hidden state does not exist yet and must be initialized first
- With random sampling, consecutive minibatches are not contiguous in the text, so the previous hidden state is useless and must be re-initialized each iteration
- With sequential sampling, the previous hidden state is kept, but when the next minibatch begins we do not compute gradients through it: the state is detached, so its value carries over while backpropagation stops at the minibatch boundary
- Every time new gradients are computed, they are clipped
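The detach step in the sequential-sampling case can be illustrated in isolation (a minimal sketch, not the training loop itself): `detach` keeps the value of the carried-over state but cuts its computation graph, so backpropagation stops at the minibatch boundary:

```python
import torch

H = torch.zeros(1, requires_grad=True)
H = (H + 1.0) * 2.0      # pretend this state came from the previous minibatch
H = H.detach()           # value kept, graph history discarded

print(H.item(), H.requires_grad)  # 2.0 False
```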
Adding the live plot and periodic predict printouts, the outer wrapper:
```python
def train_ch8(net, train_iter, vocab, lr, num_epochs, device,
              use_random_iter=False):
    """Train a model (defined in Chapter 8).

    Defined in :numref:`sec_rnn_scratch`"""
    loss = nn.CrossEntropyLoss()
    animator = d2l.Animator(xlabel='epoch', ylabel='perplexity',
                            legend=['train'], xlim=[10, num_epochs])
    # Initialize
    if isinstance(net, nn.Module):
        updater = torch.optim.SGD(net.parameters(), lr)
    else:
        updater = lambda batch_size: d2l.sgd(net.params, lr, batch_size)
    predict = lambda prefix: predict_ch8(prefix, 50, net, vocab, device)
    # Train and predict
    for epoch in range(num_epochs):
        ppl, speed = train_epoch_ch8(
            net, train_iter, loss, updater, device, use_random_iter)
        if (epoch + 1) % 10 == 0:
            print(predict('time traveller'))
            animator.add(epoch + 1, [ppl])
    print(f'perplexity {ppl:.1f}, {speed:.1f} tokens/sec on {str(device)}')
    print(predict('time traveller'))
    print(predict('traveller'))
```
Training code:

```python
batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
num_hiddens = 512
net = RNN_scratch(len(vocab), num_hiddens, d2l.try_gpu(),
                  get_params, init_rnn_state, rnn)
num_epochs, lr = 500, 1
d2l.train_ch8(net, train_iter, vocab, lr, num_epochs, d2l.try_gpu())
```
Result:

```
perplexity 1.0, 51157.4 tokens/sec on cuda:0
time traveller for so it will be convenient to speak of himwas e
traveller with a slight accession ofcheerfulness really thi
```
The perplexity of 1.0 is essentially perfect: the RNN has memorized the entire text. But it only produces valid words; at the sentence level the output is still wrong, because the dataset is too small to learn sentence structure and the like.
3. Concise Implementation
3.1 Model Construction
```python
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

# Model
num_hiddens = 256
rnn_layer = nn.RNN(len(vocab), num_hiddens)
```
PyTorch's nn.RNN only needs the input width (the vocabulary size) and the number of hidden units. Unlike the hand-written version, its output is the hidden state at every time step, so a Linear layer must be appended to map it to vocabulary-sized scores.
Code:

```python
class RNN_Model(nn.Module):
    def __init__(self, rnn_layer, vocab_size, **kwargs):
        super(RNN_Model, self).__init__(**kwargs)
        self.rnn = rnn_layer
        self.vocab_size = vocab_size
        self.num_hiddens = self.rnn.hidden_size
        if not self.rnn.bidirectional:
            self.num_directions = 1
            self.linear = nn.Linear(self.num_hiddens, self.vocab_size)
        else:
            self.num_directions = 2
            self.linear = nn.Linear(self.num_hiddens * 2, self.vocab_size)

    def forward(self, inputs, state):
        X = F.one_hot(inputs.T.long(), self.vocab_size)
        X = X.to(torch.float32)
        Y, state = self.rnn(X, state)
        output = self.linear(Y.reshape((-1, Y.shape[-1])))
        return output, state

    def begin_state(self, device, batch_size=1):
        if not isinstance(self.rnn, nn.LSTM):
            return torch.zeros((self.num_directions * self.rnn.num_layers,
                                batch_size, self.num_hiddens),
                               device=device)
        else:
            return (torch.zeros((self.num_directions * self.rnn.num_layers,
                                 batch_size, self.num_hiddens), device=device),
                    torch.zeros((self.num_directions * self.rnn.num_layers,
                                 batch_size, self.num_hiddens), device=device))
```
The num_directions and LSTM branches exist only for code reuse with the bidirectional networks and LSTMs of later chapters and can be ignored for now. Otherwise the model matches the hand-written one; just remember the extra Linear layer needed to produce vocabulary-sized outputs.
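A quick shape check makes the need for the Linear layer concrete (toy sizes, not tied to the dataset): nn.RNN returns per-step hidden states of width num_hiddens, not vocabulary scores:

```python
import torch
from torch import nn

vocab_size, num_hiddens, num_steps, batch_size = 28, 256, 35, 32

rnn_layer = nn.RNN(vocab_size, num_hiddens)
X = torch.zeros(num_steps, batch_size, vocab_size)
state = torch.zeros(1, batch_size, num_hiddens)  # (num_layers, batch, hidden)

Y, new_state = rnn_layer(X, state)
print(Y.shape)          # torch.Size([35, 32, 256]): hidden states, not vocab scores
print(new_state.shape)  # torch.Size([1, 32, 256])

# Mapping to vocabulary scores requires the extra Linear layer
logits = nn.Linear(num_hiddens, vocab_size)(Y.reshape(-1, num_hiddens))
print(logits.shape)     # torch.Size([1120, 28])
```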
3.2 Model Training
```python
batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
device = d2l.try_gpu()
net = RNN_Model(rnn_layer, vocab_size=len(vocab))
net = net.to(device)
num_epochs, lr = 500, 1
d2l.train_ch8(net, train_iter, vocab, lr, num_epochs, device)
```
Result:

```
perplexity 1.3, 313781.3 tokens/sec on cuda:0
time travellerit s against reason said felbytaccand and thit tim
travelleryou can angathere imer veri that neghismo ge tot s
```
Compared with the hand-written version, the perplexity gap (1.3 vs 1.0) is negligible, but throughput rises from 51,157 to 313,781 tokens/sec. The framework fuses many small matrix multiplications into a single large one, avoiding wasted parallel capacity.
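The fused-matmul point can be illustrated with a rough sketch (this shows the idea, not the framework's actual kernel strategy): flattening the time dimension lets one large matrix multiplication replace num_steps small ones, with identical results:

```python
import torch

num_steps, batch_size, vocab_size, num_hiddens = 35, 32, 28, 256
X = torch.randn(num_steps, batch_size, vocab_size)
W = torch.randn(vocab_size, num_hiddens)

small = torch.cat([x @ W for x in X], dim=0)    # num_steps small matmuls
large = X.reshape(-1, vocab_size) @ W           # one large matmul
print(torch.allclose(small, large, atol=1e-5))  # True
```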