1. RNN Theory
1.1 Latent-Variable Autoregressive Models
- Use a latent variable $h_t$ to summarize the past information.
1.2 Recurrent Neural Networks
- The network defined in terms of $h_t$:
$$h_t=\phi(W_{hh}h_{t-1}+W_{hx}x_{t-1}+b_h)\tag{1}$$
Note 1: $h_t$ is influenced by $h_{t-1}$ and $x_{t-1}$; $O_t$ is influenced by $h_t$; $W_{hh}$ stores all the temporal information.
$$O_t=\phi(W_{ho}h_t+b_o)\tag{2}$$
Note 2: the loss is computed by comparing $O_t$ with $x_t$: $O_t$ plays the role of the prediction $\hat{y}$ and $x_t$ the role of the label $y$, which yields the loss value.
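Equations (1) and (2) can be sketched with plain tensor operations; the sizes below (batch 2, input dim 4, hidden dim 3, output dim 4) and the choice $\phi=\tanh$ are illustrative assumptions:

```python
import torch

n, d, h, q = 2, 4, 3, 4                      # batch, input dim, hidden dim, output dim
x, h_prev = torch.randn(n, d), torch.zeros(n, h)
W_hh, W_hx, b_h = torch.randn(h, h), torch.randn(d, h), torch.zeros(h)
W_ho, b_o = torch.randn(h, q), torch.zeros(q)

h_t = torch.tanh(h_prev @ W_hh + x @ W_hx + b_h)   # equation (1)
O_t = torch.tanh(h_t @ W_ho + b_o)                 # equation (2)
print(h_t.shape, O_t.shape)                        # [2, 3] and [2, 4]
```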
1.3 Perplexity
Perplexity means:
- The quality of a language model can be measured by the average cross-entropy
$$\pi=\frac{1}{n}\sum_{t=1}^n-\log p(x_t\mid x_{t-1},\ldots)\tag{3}$$
where $p$ is the probability the language model assigns and $x_t$ is the true token.
- For historical reasons, NLP uses the perplexity $\exp(\pi)$, i.e. the average number of candidate choices at each step.
- 1 means a perfect model; infinity is the worst case.
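As a quick sanity check of equation (3): a model that assigns the uniform probability $1/V$ to every true token has average cross-entropy $\log V$ and hence perplexity exactly $V$ (an illustrative vocabulary size $V=28$ is assumed):

```python
import torch

V, n = 28, 35                             # vocabulary size, sequence length
probs = torch.full((n,), 1.0 / V)         # p(x_t | x_{t-1}, ...) for each true token
pi = (-torch.log(probs)).mean()           # average cross-entropy, equation (3)
perplexity = torch.exp(pi)
print(perplexity.item())                  # 28.0: one of V equally likely choices per step
```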
1.4 Gradient Clipping
- Computing the gradients over T time steps in one iteration involves a long chain of matrix products; backpropagation produces a multiplication chain of length $O(T)$, which leads to numerical instability and exploding gradients.
- Gradient clipping effectively prevents gradient explosion.
- If the norm of the gradient $g$ over all layers of the network exceeds $\theta$, project it back to length $\theta$:
$$g\leftarrow\min\left(1,\frac{\theta}{\|g\|}\right)g\tag{4}$$
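A minimal sketch of equation (4) applied to a raw gradient vector ($\theta=1$ assumed):

```python
import torch

theta = 1.0
g = torch.ones(100)                          # gradient with norm sqrt(100) = 10
g = min(1.0, theta / g.norm().item()) * g    # equation (4): project back to length theta
print(g.norm().item())                       # ≈ 1.0
```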
2. RNN Illustrated
3. RNN in Code
3.1 RNN from Scratch
# -*- coding: utf-8 -*-
# @Project: zc
# @Author: zc
# @File name: RNN-test
# @Create time: 2022/1/23 14:45
# 1. Import the required libraries
import matplotlib.pyplot as plt
import math
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l
# 2. Define hyperparameters
# batch_size: batch size; num_steps: number of time steps per subsequence
batch_size, num_steps = 32, 35
# train_iter: training iterator; vocab: character-level vocabulary
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
# 3. Return the learnable parameters of the RNN. Since inputs and outputs come
#    from the same vocabulary, num_inputs = num_outputs = vocab_size.
def get_params(vocab_size, num_hiddens, device):
    num_inputs = num_outputs = vocab_size

    # Initialize weights from a normal distribution
    def normal(shape):
        return torch.randn(size=shape, device=device) * 0.01

    # From the formulas H_t = φ(X_t·W_xh + H_{t-1}·W_hh + b_h)
    #                   O_t = H_t·W_hq + b_q
    # with X ∈ R^(n×d)
    W_xh = normal((num_inputs, num_hiddens))
    W_hh = normal((num_hiddens, num_hiddens))
    b_h = torch.zeros(num_hiddens, device=device)
    W_hq = normal((num_hiddens, num_outputs))
    b_q = torch.zeros(num_outputs, device=device)
    # Collect all the RNN parameters in params
    params = [W_xh, W_hh, b_h, W_hq, b_q]
    for param in params:
        # Record a gradient for each parameter during backpropagation
        # (stored in tensor.grad) so that it can be updated
        param.requires_grad_(True)
    return params  # return all the parameters
# 4. Initialize the hidden state: a tuple containing one all-zero tensor
#    of shape (batch_size, num_hiddens)
def init_rnn_state(batch_size, num_hiddens, device):
    return (torch.zeros((batch_size, num_hiddens), device=device),)
# 5. Define how the hidden state and output are computed within one time step;
#    the RNN forward function returns the outputs O_t and the hidden state H_t
def rnn(inputs, state, params):
    # inputs shape: (num_steps, batch_size, vocab_size), e.g. (5, 2, 28);
    # time is the leading dimension to simplify the training loop
    W_xh, W_hh, b_h, W_hq, b_q = params
    # W_xh: [28, 512]; W_hh: [512, 512]; W_hq: [512, 28]; b_h: [512]; b_q: [28]
    H, = state  # state is a tuple; here H is a tensor of shape (2, 512)
    # outputs collects the output layer O_t per step; H is the hidden state
    outputs = []
    # each X has shape (batch_size, vocab_size) = (2, 28):
    # the samples of one particular time step
    for X in inputs:
        # Hidden layer with torch.tanh as the activation function:
        # H_t = tanh(X_t·W_xh + H_{t-1}·W_hh + b_h)
        H = torch.tanh(torch.mm(X, W_xh) + torch.mm(H, W_hh) + b_h)
        # Output layer: O_t = H_t·W_hq + b_q
        Y = torch.mm(H, W_hq) + b_q
        outputs.append(Y)
    # concatenated shape: (num_steps * batch_size, vocab_size)
    return torch.cat(outputs, dim=0), (H,)
class RNNModelScratch:
    """
    vocab_size: vocabulary size
    num_hiddens: number of hidden units
    device: CPU/GPU device
    get_params: function returning the parameters the RNN needs
    init_state: function initializing the hidden state
    forward_fn: forward-propagation function
    """

    def __init__(self, vocab_size, num_hiddens, device,
                 get_params, init_state, forward_fn):
        self.vocab_size, self.num_hiddens = vocab_size, num_hiddens
        self.params = get_params(vocab_size, num_hiddens, device)
        self.init_state, self.forward_fn = init_state, forward_fn

    # Called when the instance is invoked as net(X, state)
    def __call__(self, X, state):
        # 1. One-hot encode X
        # 2. Run the RNN forward pass
        # input X shape: (batch_size, num_steps)
        # after one-hot: (num_steps, batch_size, vocab_size)
        X = F.one_hot(X.T, self.vocab_size).type(torch.float32)
        return self.forward_fn(X, state, self.params)

    # Initialize the state: all zeros, shape (batch_size, num_hiddens)
    def begin_state(self, batch_size, device):
        return self.init_state(batch_size, self.num_hiddens, device)
num_hiddens = 512
net = RNNModelScratch(len(vocab), num_hiddens, d2l.try_gpu(), get_params, init_rnn_state, rnn)
# Quick shape check (net(X, state) invokes RNNModelScratch.__call__,
# i.e. the forward pass):
# X = torch.arange(10).reshape((2, 5))
# state = net.begin_state(X.shape[0], d2l.try_gpu())
# Y, new_state = net(X.to(d2l.try_gpu()), state)
# print(Y.shape, len(new_state), new_state[0].shape)  # [10, 28], 1, [2, 512]
def predict_ch8(prefix, num_preds, net, vocab, device):
    """
    Generate new characters following `prefix`.
    :param prefix: a given string; the model predicts the characters after it
    :param num_preds: number of characters to predict
    :param net: trained RNN model
    :param vocab: vocabulary
    :param device: CPU/GPU device
    :return: the prefix followed by the predicted characters
    """
    # Initialize the RNN hidden state
    state = net.begin_state(batch_size=1, device=device)
    # e.g. prefix = 'time traveller' tokenizes as t->3; i->5; m->13; e->2;
    # ' '->1; t->3; r->10; a->4; v->22; e->2; l->12; l->12; e->2; r->10,
    # i.e. [3, 5, 13, 2, 1, 3, 10, 4, 22, 2, 12, 12, 2, 10]
    # prefix[0] = 't'; vocab['t'] = 3; outputs = [3]
    outputs = [vocab[prefix[0]]]
    # Take the last character in outputs as the next input,
    # shaped (batch_size=1, num_steps=1)
    get_input = lambda: torch.tensor([outputs[-1]], device=device).reshape((1, 1))
    # Warm-up period: the characters in prefix[1:] are given, so no prediction
    # needs to be stored; each character is fed through the RNN only to
    # update the hidden state H_t
    for y in prefix[1:]:
        _, state = net(get_input(), state)
        outputs.append(vocab[y])
    # Predict num_preds characters after the prefix
    for _ in range(num_preds):
        y, state = net(get_input(), state)
        outputs.append(int(y.argmax(dim=1).reshape(1)))
    return ''.join([vocab.idx_to_token[i] for i in outputs])

print(predict_ch8('time traveller', 10, net, vocab, d2l.try_gpu()))
# Gradient clipping
def grad_clipping(net, theta):
    """
    :param net: network model
    :param theta: threshold θ; applies g <- min(1, θ/||g||)·g so the
                  gradient norm never exceeds θ
    """
    if not isinstance(net, nn.Module):
        params = net.params
    else:
        params = [p for p in net.parameters() if p.requires_grad]
    norm = torch.sqrt(sum(torch.sum((p.grad ** 2)) for p in params))
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm
def train_epoch_ch8(net, train_iter, loss, updater, device, use_random_iter):
    """Train the model for one epoch; returns (perplexity, tokens/sec)."""
    state, timer = None, d2l.Timer()
    metric = d2l.Accumulator(2)  # sum of training loss, number of tokens
    for X, Y in train_iter:
        if state is None or use_random_iter:
            # Initialize state on the first iteration, or on every iteration
            # when random sampling is used (adjacent minibatches are unrelated)
            state = net.begin_state(batch_size=X.shape[0], device=device)
        else:
            # Detach the state so gradients do not flow into earlier minibatches
            if isinstance(net, nn.Module) and not isinstance(state, tuple):
                state.detach_()
            else:
                for s in state:
                    s.detach_()
        y = Y.T.reshape(-1)
        X, y = X.to(device), y.to(device)
        y_hat, state = net(X, state)
        l = loss(y_hat, y.long()).mean()
        if isinstance(updater, torch.optim.Optimizer):
            updater.zero_grad()
            l.backward()
            grad_clipping(net, 1)
            updater.step()
        else:
            l.backward()
            grad_clipping(net, 1)
            # mean() has already been taken, so batch_size=1
            updater(batch_size=1)
        metric.add(l * y.numel(), y.numel())
    return math.exp(metric[0] / metric[1]), metric[1] / timer.stop()
def train_ch8(net, train_iter, vocab, lr, num_epochs, device, use_random_iter=False):
    loss = nn.CrossEntropyLoss()
    animator = d2l.Animator(xlabel='epoch', ylabel='perplexity',
                            legend=['train'], xlim=[10, num_epochs])
    if isinstance(net, nn.Module):
        updater = torch.optim.SGD(net.parameters(), lr)
    else:
        updater = lambda batch_size: d2l.sgd(net.params, lr, batch_size)
    predict = lambda prefix: predict_ch8(prefix, 50, net, vocab, device)
    for epoch in range(num_epochs):
        ppl, speed = train_epoch_ch8(
            net, train_iter, loss, updater, device, use_random_iter)
        if (epoch + 1) % 10 == 0:
            print(predict('time traveller'))
            animator.add(epoch + 1, [ppl])
    print(f'perplexity {ppl:.1f}, {speed:.1f} tokens/sec on {str(device)}')
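The per-time-step loop inside rnn() above can be shape-checked in isolation; the toy sizes below (5 steps, batch 2, vocab 28, 512 hidden units) match the comments in the code:

```python
import torch

num_steps, batch_size, vocab_size, num_hiddens = 5, 2, 28, 512
inputs = torch.randn(num_steps, batch_size, vocab_size)
W_xh = torch.randn(vocab_size, num_hiddens) * 0.01
W_hh = torch.randn(num_hiddens, num_hiddens) * 0.01
b_h = torch.zeros(num_hiddens)
W_hq = torch.randn(num_hiddens, vocab_size) * 0.01
b_q = torch.zeros(vocab_size)
H = torch.zeros(batch_size, num_hiddens)

outputs = []
for X in inputs:                              # one iteration per time step
    H = torch.tanh(X @ W_xh + H @ W_hh + b_h)
    outputs.append(H @ W_hq + b_q)
Y = torch.cat(outputs, dim=0)
print(Y.shape)                                # (num_steps * batch_size, vocab_size) = [10, 28]
```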
3.2 Concise RNN Implementation
3.2.1 torch.nn.RNN
- Reference: class torch.nn.RNN

class torch.nn.RNN(input_size, hidden_size, num_layers=1, nonlinearity='tanh',
                   bias=True, batch_first=False, dropout=0, bidirectional=False)

- input_size: dimensionality of each input feature x; in NLP, typically the word-vector dimension
- hidden_size: number of hidden units
- num_layers: number of recurrent layers. E.g. num_layers=2 stacks two RNNs, the second receiving the outputs of the first and computing the final result. Default: 1. Not to be confused with a bidirectional RNN.
- nonlinearity: the activation function, 'tanh' or 'relu'; default 'tanh':
$$h_t=\underbrace{\tanh}_{nonlinearity}(W_{ih}x_t+b_{ih}+W_{hh}h_{(t-1)}+b_{hh})$$
- bias: if False, the biases $b_{ih}, b_{hh}$ are not used; default True (they are used)
- batch_first: if True, inputs and outputs are provided as (batch, seq, feature) instead of (seq, batch, feature). Note this does not apply to the hidden or cell states; see the input/output section below. Default: False
- dropout: if nonzero, introduces a Dropout layer on the outputs of each RNN layer except the last, with the given dropout probability. Default: 0
- bidirectional: if True, the RNN is bidirectional. Default: False
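The batch_first flag can be verified directly; note that h_n keeps the (num_layers, batch, hidden) layout even when batch_first=True (toy sizes assumed):

```python
import torch
from torch import nn

rnn_bf = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
x = torch.randn(3, 5, 10)            # (batch=3, seq=5, feature=10)
out, h_n = rnn_bf(x)                 # h0 defaults to zeros when omitted
print(out.shape, h_n.shape)          # [3, 5, 20] and [1, 3, 20]
```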
3.2.2 Input and Output Shapes
Ⅰ. Two inputs:
(1) input: the input X
- batch_first=False: X has shape (sequence_length, batch_size, input_size)
- batch_first=True: X has shape (batch_size, sequence_length, input_size)
(2) h_0: the initial hidden state
- shape: (D * num_layers, batch_size, hidden_size), where D=2 for a bidirectional RNN and D=1 for a unidirectional one
- for a single-layer unidirectional RNN, $h_0$ has shape (1, batch_size, hidden_size)
Ⅱ. Two outputs (note nn.RNN does not include the output mapping $h_t \to O_t$):
(1) output:
- batch_first=False: shape (sequence_length, batch_size, D * hidden_size)
- batch_first=True: shape (batch_size, sequence_length, D * hidden_size)
(2) h_n: the final hidden state for each element in the batch; for a single-layer unidirectional RNN its shape is (1, batch_size, hidden_size)
(3) output (= Y) is the hidden state emitted by the hidden layer at every time step; h_n is the hidden state at the last time step only.
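The D factor can be checked with a bidirectional, multi-layer RNN: output's last dimension becomes D * hidden_size, while h_n's first dimension becomes D * num_layers (toy sizes assumed):

```python
import torch
from torch import nn

rnn_bi = nn.RNN(input_size=10, hidden_size=20, num_layers=2, bidirectional=True)
x = torch.randn(5, 3, 10)            # (seq=5, batch=3, feature=10)
out, h_n = rnn_bi(x)
print(out.shape, h_n.shape)          # [5, 3, 40] and [4, 3, 20]
```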
- Code:
# -*- coding: utf-8 -*-
# @Project: zc
# @Author: zc
# @File name: RNN_function_test
# @Create time: 2022/1/26 19:03
import torch
from torch import nn
vocab_size = input_size = 10
hidden_size = 20
num_layers = 2
rnn = nn.RNN(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers)
# input shape: (sequence_length, batch_size, input_size)
# h0 shape: (num_layers=2, batch_size=3, hidden_size=20)
sequence_length = 5
batch_size = 3
input = torch.randn(sequence_length, batch_size, input_size)
h0 = torch.randn(num_layers, batch_size, hidden_size)
# output shape: (sequence_length=5, batch_size=3, hidden_size=20) = [5, 3, 20]
# hn shape: (num_layers=2, batch_size=3, hidden_size=20) = [2, 3, 20]
output, hn = rnn(input, h0)
print(f"output_shape={output.shape},hn_shape={hn.shape}")
- Result:
output_shape=torch.Size([5, 3, 20]),hn_shape=torch.Size([2, 3, 20])
3.2.3 Concise RNN Implementation Code
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l
import matplotlib.pyplot as plt

batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
num_hiddens = 256
rnn_layer = nn.RNN(len(vocab), num_hiddens)

#@save
class RNNModel(nn.Module):
    """The RNN model."""
    def __init__(self, rnn_layer, vocab_size, **kwargs):
        super(RNNModel, self).__init__(**kwargs)
        self.rnn = rnn_layer
        self.vocab_size = vocab_size
        self.num_hiddens = self.rnn.hidden_size
        # If the RNN is bidirectional (introduced later),
        # num_directions should be 2; otherwise it should be 1
        if not self.rnn.bidirectional:
            self.num_directions = 1
            self.linear = nn.Linear(self.num_hiddens, self.vocab_size)
        else:
            self.num_directions = 2
            self.linear = nn.Linear(self.num_hiddens * 2, self.vocab_size)

    def forward(self, inputs, state):
        X = F.one_hot(inputs.T.long(), self.vocab_size)
        X = X.to(torch.float32)
        Y, state = self.rnn(X, state)
        # The fully connected layer first reshapes Y to
        # (num_steps * batch_size, num_hiddens);
        # its output has shape (num_steps * batch_size, vocab_size)
        output = self.linear(Y.reshape((-1, Y.shape[-1])))
        return output, state

    def begin_state(self, device, batch_size=1):
        if not isinstance(self.rnn, nn.LSTM):
            # nn.GRU (and nn.RNN) take a tensor as the hidden state
            return torch.zeros((self.num_directions * self.rnn.num_layers,
                                batch_size, self.num_hiddens),
                               device=device)
        else:
            # nn.LSTM takes a tuple as the hidden state
            return (torch.zeros((
                self.num_directions * self.rnn.num_layers,
                batch_size, self.num_hiddens), device=device),
                    torch.zeros((
                        self.num_directions * self.rnn.num_layers,
                        batch_size, self.num_hiddens), device=device))

device = d2l.try_gpu()
net = RNNModel(rnn_layer, vocab_size=len(vocab))
net = net.to(device)
d2l.predict_ch8('time traveller', 10, net, vocab, device)
num_epochs, lr = 500, 1
d2l.train_ch8(net, train_iter, vocab, lr, num_epochs, device)
plt.show()
- Result:
perplexity 1.3, 264351.2 tokens/sec on cuda:0
time traveller held in his hand was forelwis centitay llane so t
traveller held in his hand was forelwis centitay llane so t
4. RNN Summary
(1) A simple RNN language model consists of input encoding, the recurrent model itself, and output generation.
(2) The RNN state must be initialized before training; random sampling and sequential partitioning use different initialization strategies.
(3) Before making any prediction, the model updates itself over a warm-up period, obtaining a better hidden state than the initial one.
(4) Gradient clipping prevents exploding gradients, but it cannot fix vanishing gradients.
4.1 torch.nn.RNN Parameter Shape Tests
Note: the output here refers to the hidden-layer outputs $h_t$; that is the process the official torch.nn.RNN implements.
A fully connected layer therefore has to be added afterwards to compute:
$$O_t=\phi(W_{ho}h_t+b_o)\tag{5}$$
- Example 1 (annotated pseudocode; the keyword arguments only label the shapes and are not valid Python):
rnn = nn.RNN(input_size=10, hidden_size=20, num_layers=2)
input = torch.randn(sequence_length=5, batch_size=3, input_size=10)
h0 = torch.randn(num_layers=2, batch_size=3, hidden_size=20)
output, hn = rnn(input, h0)
# output shape: [sequence_length=5, batch_size=3, hidden_size=20]
# hn shape: [num_layers=2, batch_size=3, hidden_size=20]
- Example 2:
- Code:
import torch
from torch import nn

sequence_length = 6
batch_size = 10
input_size = 5
hidden_size = 12
num_layers = 3
input = torch.randn(sequence_length, batch_size, input_size)
h0 = torch.randn(num_layers, batch_size, hidden_size)
rnn = nn.RNN(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers)
output, hn = rnn(input, h0)
# output.shape = [sequence_length=6, batch_size=10, hidden_size=12]
# hn.shape = [num_layers=3, batch_size=10, hidden_size=12]
print(f"output.shape={output.shape},hn.shape={hn.shape}")
- Result:
output.shape=torch.Size([6, 10, 12]),hn.shape=torch.Size([3, 10, 12])
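Since nn.RNN only emits the hidden states $h_t$, the output layer of equation (5) can be added as an nn.Linear applied at every time step (toy sizes assumed; $\phi$ is taken as the identity, matching the from-scratch implementation above):

```python
import torch
from torch import nn

vocab_size, hidden_size = 28, 20
rnn = nn.RNN(input_size=vocab_size, hidden_size=hidden_size)
output_layer = nn.Linear(hidden_size, vocab_size)   # O_t = h_t·W_ho + b_o

x = torch.randn(5, 3, vocab_size)    # (seq=5, batch=3, vocab)
h, _ = rnn(x)                        # hidden states at every step: [5, 3, 20]
O = output_layer(h)                  # per-step outputs: [5, 3, 28]
print(O.shape)
```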