RNN: Recurrent Neural Networks from Scratch

1. RNN Theory

1.1 Latent-Variable Autoregressive Models

  • Use a latent variable $h_t$ to summarize the past information

1.2 Recurrent Neural Networks


  • The network computing $h_t$:
    $$h_t=\phi(W_{hh}h_{t-1}+W_{hx}x_{t-1}+b_h)\tag{1}$$
    Note 1: $h_t$ is determined by $h_{t-1}$ and $x_{t-1}$; $O_t$ is determined by $h_t$; $W_{hh}$ stores all the temporal information.
    $$O_t=\phi(W_{ho}h_t+b_o)\tag{2}$$
    Note 2: the loss is computed from $O_t$ and $x_t$: $O_t$ plays the role of the prediction $\hat y$ while $x_t$ plays the role of the label $y$, which gives us the loss value (a single-step sketch follows below).
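
A minimal single-step sketch of Eqs. (1)–(2) in PyTorch. The tensor names and toy sizes below are illustrative assumptions only, not part of the original model definition:

import torch

batch_size, num_inputs, num_hiddens, num_outputs = 2, 28, 512, 28  # toy sizes
x_prev = torch.randn(batch_size, num_inputs)    # x_{t-1}
h_prev = torch.zeros(batch_size, num_hiddens)   # h_{t-1}
W_hx = torch.randn(num_inputs, num_hiddens) * 0.01
W_hh = torch.randn(num_hiddens, num_hiddens) * 0.01
b_h = torch.zeros(num_hiddens)
W_ho = torch.randn(num_hiddens, num_outputs) * 0.01
b_o = torch.zeros(num_outputs)

h_t = torch.tanh(x_prev @ W_hx + h_prev @ W_hh + b_h)  # Eq. (1), with phi = tanh
o_t = h_t @ W_ho + b_o                                  # Eq. (2); no activation applied here
print(h_t.shape, o_t.shape)                             # (2, 512), (2, 28)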

1.3 Perplexity

Perplexity measures the following:

  • The quality of a language model can be measured by its average cross-entropy
    $$\pi=\frac{1}{n}\sum_{t=1}^{n}-\log p(x_t\mid x_{t-1},\ldots)\tag{3}$$
  • $p$ is the probability predicted by the language model, and $x_t$ is the actual token
  • For historical reasons NLP uses the perplexity $\exp(\pi)$, which can be read as the average number of candidate choices per prediction (a worked example follows this list)
  • 1 means a perfect model; infinity is the worst case
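
A small worked example of Eq. (3); the per-token probabilities below are made up purely for illustration:

import math

# assumed probabilities a language model assigns to the true next tokens
probs = [0.5, 0.25, 0.1, 0.2]
pi = sum(-math.log(p) for p in probs) / len(probs)  # average cross-entropy, Eq. (3)
perplexity = math.exp(pi)                           # "average number of choices"
print(pi, perplexity)                               # pi ≈ 1.498, perplexity ≈ 4.47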

1.4 Gradient Clipping

  • Each iteration computes the gradient over all $T$ time steps, so backpropagation builds a chain of matrix products of length $O(T)$; for long sequences this chain is numerically unstable and the gradient can explode
  • Gradient clipping is an effective way to prevent exploding gradients (a small numeric sketch follows this list)
  • If the norm $\|g\|$ of the gradient over all layers of the network exceeds $\theta$, project it back to length $\theta$:
    $$g\leftarrow \min\left(1,\frac{\theta}{\|g\|}\right)g\tag{4}$$
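
A tiny numeric sketch of Eq. (4) applied to a single made-up gradient vector (a full version over all model parameters appears as grad_clipping in the code below):

import torch

theta = 1.0
g = torch.tensor([3.0, 4.0])   # ||g|| = 5, which exceeds theta
norm = torch.norm(g)
if norm > theta:
    g = (theta / norm) * g     # Eq. (4): rescale g to length theta
print(g, torch.norm(g))        # tensor([0.6000, 0.8000]) tensor(1.)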

2. RNN Illustrated

(figure omitted)

3. RNN Code

3.1 RNN from Scratch

# -*- coding: utf-8 -*-
# @Project: zc
# @Author: zc
# @File name: RNN-test
# @Create time: 2022/1/23 14:45

# 1. Import the required libraries
import matplotlib.pyplot as plt
import math
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

# 2. Define the basic parameters
# batch_size: batch size; num_steps: number of time steps per subsequence
batch_size, num_steps = 32, 35
# train_iter: training data iterator; vocab: character-level vocabulary
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)


# 3. Return the learnable parameters of the RNN. Since the inputs and outputs come
#    from the same vocabulary, num_inputs = num_outputs = vocab_size.
def get_params(vocab_size, num_hiddens, device):
	num_inputs = num_outputs = vocab_size

	# Initialization helper: sample from a normal distribution, scaled by 0.01
	def normal(shape):
		return torch.randn(size=shape, device=device) * 0.01

	# Following the formulas H_t = φ(X_t·W_xh + H_{t-1}·W_hh + b_h)
	# and O_t = H_t·W_hq + b_q, with X ∈ R^(n x d)
	W_xh = normal((num_inputs, num_hiddens))
	W_hh = normal((num_hiddens, num_hiddens))
	b_h = torch.zeros(num_hiddens, device=device)
	W_hq = normal((num_hiddens, num_outputs))
	b_q = torch.zeros(num_outputs, device=device)
	# Collect all RNN parameters in params
	params = [W_xh, W_hh, b_h, W_hq, b_q]
	for param in params:
		# Make sure each parameter receives a gradient during backpropagation
		# (stored in tensor.grad) so that it can be updated
		param.requires_grad_(True)
	return params  # return all the parameters


# 4. Initialize and return the hidden state: a tuple containing one all-zero tensor of shape (batch_size, num_hiddens)
def init_rnn_state(batch_size, num_hiddens, device):
	return (torch.zeros((batch_size, num_hiddens), device=device),)


# 5. Define how the hidden state and the output are computed within one time step;
#    this is the forward-propagation function of the RNN.
#    It returns the outputs O_t and the hidden state H_t.
def rnn(inputs, state, params):
	# inputs has shape (num_steps, batch_size, vocab_size), e.g. (5, 2, 28);
	# time comes first so that we can iterate over the time steps below
	W_xh, W_hh, b_h, W_hq, b_q = params
	# e.g. W_xh: [28, 512]; W_hh: [512, 512]; b_h: [512]; W_hq: [512, 28]; b_q: [28]
	H, = state  # state is a tuple; its first element H is e.g. a tensor of shape (2, 512)
	# outputs collects the output-layer values O_t; H is the hidden state
	outputs = []
	# X has shape (batch_size, vocab_size), e.g. (2, 28):
	# the samples of one particular time step
	for X in inputs:
		# Hidden layer with tanh activation, applying
		# H_t = tanh(X_t·W_xh + H_{t-1}·W_hh + b_h)
		H = torch.tanh(torch.mm(X, W_xh) + torch.mm(H, W_hh) + b_h)
		# Output layer: O_t = H_t·W_hq + b_q
		Y = torch.mm(H, W_hq) + b_q
		outputs.append(Y)
	# Returned outputs have shape (num_steps * batch_size, vocab_size)
	return torch.cat(outputs, dim=0), (H,)


class RNNModelScratch:
	"""RNN language model implemented from scratch.

	vocab_size: vocabulary size
	num_hiddens: number of hidden units
	device: CPU/GPU device
	get_params: function returning the parameters the RNN needs
	init_state: function that initializes the hidden state
	forward_fn: forward-propagation function
	"""

	def __init__(self, vocab_size, num_hiddens, device,
				 get_params, init_state, forward_fn):
		self.vocab_size, self.num_hiddens = vocab_size, num_hiddens
		self.params = get_params(vocab_size, num_hiddens, device)
		self.init_state, self.forward_fn = init_state, forward_fn

	# Calling an instance with (X, state) invokes this method
	def __call__(self, X, state):
		# 1. One-hot encode X
		# 2. Run the RNN forward pass
		# 3. Input: X has shape (batch_size, num_steps)
		# 4. After encoding, X has shape (num_steps, batch_size, vocab_size)
		X = F.one_hot(X.T, self.vocab_size).type(torch.float32)
		return self.forward_fn(X, state, self.params)

	# Initialize the hidden state: all zeros, shape (batch_size, num_hiddens)
	def begin_state(self, batch_size, device):
		return self.init_state(batch_size, self.num_hiddens, device)


num_hiddens = 512
# X = torch.arange(10).reshape((2, 5))
net = RNNModelScratch(len(vocab), num_hiddens, d2l.try_gpu(), get_params, init_rnn_state, rnn)


# Optional shape check: calling net(X, state) invokes RNNModelScratch.__call__,
# i.e. the forward-propagation function; Y is the output, new_state the hidden state.
# Expected: Y.shape = [10, 28]; new_state = ([2, 512],)
# state = net.begin_state(X.shape[0], d2l.try_gpu())
# Y, new_state = net(X.to(d2l.try_gpu()), state)
# print(Y.shape, len(new_state), new_state[0].shape)


def predict_ch8(prefix, num_preds, net, vocab, device):
	"""Generate new characters following the given prefix.

	:param prefix: given string; the model predicts the characters that follow it
	:param num_preds: number of characters to predict
	:param net: the (trained) RNN model
	:param vocab: vocabulary
	:param device: CPU/GPU device
	:return: the prefix followed by the generated characters
	"""
	# Initialize the RNN hidden state (batch size 1 for prediction)
	state = net.begin_state(batch_size=1, device=device)
	# Example: prefix = 'time traveller' is tokenized character by character,
	# e.g. t->3, i->5, m->13, e->2, ' '->1, r->10, a->4, v->22, l->12, ...
	# prefix[0] = 't'; vocab['t'] = 3; so outputs starts as [3]
	outputs = [vocab[prefix[0]]]
	# Take the last element of outputs as the next input,
	# with shape (batch_size=1, num_steps=1)
	get_input = lambda: torch.tensor([outputs[-1]], device=device).reshape((1, 1))
	# Warm-up period: the characters of prefix[1:] are already given, so no
	# prediction is needed and the model outputs are discarded; we only feed
	# them through the RNN to update the hidden state H_t
	for y in prefix[1:]:
		_, state = net(get_input(), state)
		outputs.append(vocab[y])
	# Now predict the num_preds characters that follow the prefix
	for _ in range(num_preds):
		y, state = net(get_input(), state)
		outputs.append(int(y.argmax(dim=1).reshape(1)))
	return ''.join([vocab.idx_to_token[i] for i in outputs])


print(predict_ch8('time traveller', 10, net, vocab, d2l.try_gpu()))


# Gradient clipping
def grad_clipping(net, theta):
	"""Clip the gradient so that its overall norm does not exceed theta.

	:param net: the network model
	:param theta: the threshold θ in g <- min(1, θ/||g||)·g, which bounds the gradient norm
	"""
	if not isinstance(net, nn.Module):
		params = net.params
	else:
		params = [p for p in net.parameters() if p.requires_grad]
	norm = torch.sqrt(sum(torch.sum((p.grad ** 2)) for p in params))
	if norm > theta:
		for param in params:
			param.grad[:] *= theta / norm


def train_epoch_ch8(net, train_iter, loss, updater, device, use_random_iter):
	"""Train the model for one epoch; return (perplexity, tokens per second)."""
	state, timer = None, d2l.Timer()
	metric = d2l.Accumulator(2)  # sum of training loss, number of tokens
	for X, Y in train_iter:
		if state is None or use_random_iter:
			# Initialize the state on the first iteration, or whenever random
			# sampling is used (consecutive minibatches are then not adjacent)
			state = net.begin_state(batch_size=X.shape[0], device=device)
		else:
			# Detach the state so gradients do not flow across minibatches
			if isinstance(net, nn.Module) and not isinstance(state, tuple):
				state.detach_()
			else:
				for s in state:
					s.detach_()
		y = Y.T.reshape(-1)
		X, y = X.to(device), y.to(device)
		y_hat, state = net(X, state)
		l = loss(y_hat, y.long()).mean()
		if isinstance(updater, torch.optim.Optimizer):
			updater.zero_grad()
			l.backward()
			grad_clipping(net, 1)
			updater.step()
		else:
			l.backward()
			grad_clipping(net, 1)
			# mean() was already taken above, so use batch_size=1
			updater(batch_size=1)
		metric.add(l * y.numel(), y.numel())
	return math.exp(metric[0] / metric[1]), metric[1] / timer.stop()


def train_ch8(net, train_iter, vocab, lr, num_epochs, device, use_random_iter=False):
	"""Train the model and periodically print generated text."""
	loss = nn.CrossEntropyLoss()
	animator = d2l.Animator(xlabel='epoch', ylabel='perplexity',
							legend=['train'], xlim=[10, num_epochs])
	# Use torch.optim.SGD for nn.Module models, otherwise the from-scratch d2l.sgd
	if isinstance(net, nn.Module):
		updater = torch.optim.SGD(net.parameters(), lr)
	else:
		updater = lambda batch_size: d2l.sgd(net.params, lr, batch_size)
	predict = lambda prefix: predict_ch8(prefix, 50, net, vocab, device)
	for epoch in range(num_epochs):
		ppl, speed = train_epoch_ch8(
			net, train_iter, loss, updater, device, use_random_iter)
		if (epoch + 1) % 10 == 0:
			print(predict('time traveller'))
			animator.add(epoch + 1, [ppl])

	print(f'perplexity {ppl:.1f}, {speed:.1f} tokens/sec on {str(device)}')
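
A minimal sketch of how the from-scratch model could then be trained; the hyperparameters simply mirror those used in the concise implementation of Section 3.2.3:

num_epochs, lr = 500, 1
train_ch8(net, train_iter, vocab, lr, num_epochs, d2l.try_gpu())
plt.show()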

3.2 Concise Implementation of the RNN

3.2.1 torch.nn.RNN

class torch.nn.RNN(input_size,hidden_size,num_layers=1,nonlinearity='tanh',
				   bias=True,batch_first=False,dropout=0,bidirectional=False)
  • input_size: dimensionality of the input features x; in NLP this is usually the dimension of one word vector
  • hidden_size: number of hidden units
  • num_layers: number of recurrent layers. For example, num_layers=2 stacks two RNNs, with the second RNN taking the outputs of the first and computing the final result. Default: 1. Note that this is not the same thing as a bidirectional RNN.
  • nonlinearity: the activation function, either 'tanh' or 'relu'. Default: 'tanh'
    $$h_t=\underbrace{\tanh}_{\text{nonlinearity}}(W_{ih}x_t+b_{ih}+W_{hh}h_{(t-1)}+b_{hh})$$
  • bias: if False, the biases $b_{ih}$ and $b_{hh}$ are not used. Default: True, i.e. $b_{ih}$ and $b_{hh}$ are used
  • batch_first: if True, the input and output tensors are provided as (batch, seq, feature) instead of (seq, batch, feature). Note that this does not apply to the hidden or cell states; see the input/output section below for details. Default: False
  • dropout: if non-zero, introduces a Dropout layer on the outputs of each RNN layer except the last, with dropout probability equal to this value. Default: 0
  • bidirectional: if True, the RNN is bidirectional. Default: False

3.2.2 Input and Output Shapes

Ⅰ. The forward call takes two inputs:
(1) input: the input X

  • batch_first=False: X has shape (sequence_length, batch_size, input_size)
  • batch_first=True: X has shape (batch_size, sequence_length, input_size)

(2) h_0: the initial hidden state

  • Shape: (D*num_layers, batch_size, hidden_size), where D=2 for a bidirectional RNN and D=1 for a unidirectional RNN;
    for a unidirectional single-layer RNN, h_0 has shape (1, batch_size, hidden_size)

Ⅱ. The forward call returns two outputs (note that torch.nn.RNN does not include the output layer, i.e. the mapping $h_t \rightarrow O_t$):
(1) output:

  • batch_first=False: output has shape (sequence_length, batch_size, hidden_size)

  • batch_first=True: output has shape (batch_size, sequence_length, hidden_size) (a batch_first=True sketch follows the example below)

(2) h_n: the final hidden state for each element in the batch; for a unidirectional single-layer RNN its shape is (1, batch_size, hidden_size)

(3) output (= Y) is the hidden state produced by the hidden layer at every time step, whereas h_n is the hidden state of the last time step only

  • Code:

# -*- coding: utf-8 -*-
# @Project: zc
# @Author: zc
# @File name: RNN_function_test
# @Create time: 2022/1/26 19:03

import torch
from torch import nn

vocab_size = input_size = 10
hidden_size = 20
num_layers = 2
rnn = nn.RNN(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers)
# input shape = (sequence_length, batch_size, input_size)
# h0 shape = (num_layers=2, batch_size=3, hidden_size=20)
sequence_length = 5
batch_size = 3
output_size = 20
input = torch.randn(sequence_length, batch_size, input_size)
h0 = torch.randn(num_layers, batch_size, output_size)
# output_shape = (sequence_length=5,batch_size=3,output_size=20)=([5,3,20])
# hn_shape = (1*num_layers=2,batch_size=3,output_size=20)=([2,3,20])
output, hn = rnn(input, h0)
print(f"output_shape={output.shape},hn_shape={hn.shape}")
  • Result:
output_shape=torch.Size([5, 3, 20]),hn_shape=torch.Size([2, 3, 20])
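
For comparison, a short sketch (my own addition, not from the original test script) showing that batch_first=True only changes the layout of input and output, while h0/hn keep the (num_layers, batch_size, hidden_size) layout:

import torch
from torch import nn

rnn = nn.RNN(input_size=10, hidden_size=20, num_layers=2, batch_first=True)
input = torch.randn(3, 5, 10)   # (batch_size=3, sequence_length=5, input_size=10)
h0 = torch.randn(2, 3, 20)      # still (num_layers, batch_size, hidden_size)
output, hn = rnn(input, h0)
print(output.shape, hn.shape)   # torch.Size([3, 5, 20]) torch.Size([2, 3, 20])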

3.2.3 Concise RNN Implementation Code

import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l
import matplotlib.pyplot as plt


batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
num_hiddens = 256
rnn_layer = nn.RNN(len(vocab), num_hiddens)#@save


class RNNModel(nn.Module):
    """The recurrent neural network model."""
    def __init__(self, rnn_layer, vocab_size, **kwargs):
        super(RNNModel, self).__init__(**kwargs)
        self.rnn = rnn_layer
        self.vocab_size = vocab_size
        self.num_hiddens = self.rnn.hidden_size
        # If the RNN is bidirectional (introduced later), num_directions should
        # be 2, otherwise it should be 1
        if not self.rnn.bidirectional:
            self.num_directions = 1
            self.linear = nn.Linear(self.num_hiddens, self.vocab_size)
        else:
            self.num_directions = 2
            self.linear = nn.Linear(self.num_hiddens * 2, self.vocab_size)

    def forward(self, inputs, state):
        X = F.one_hot(inputs.T.long(), self.vocab_size)
        X = X.to(torch.float32)
        Y, state = self.rnn(X, state)
        # The fully connected layer first reshapes Y to
        # (num_steps * batch_size, num_hiddens);
        # its output has shape (num_steps * batch_size, vocab_size).
        output = self.linear(Y.reshape((-1, Y.shape[-1])))
        return output, state

    def begin_state(self, device, batch_size=1):
        if not isinstance(self.rnn, nn.LSTM):
            # nn.GRU (and nn.RNN) take a tensor as the hidden state
            return torch.zeros((self.num_directions * self.rnn.num_layers,
                                batch_size, self.num_hiddens),
                               device=device)
        else:
            # nn.LSTM takes a tuple as the hidden state
            return (torch.zeros((
                self.num_directions * self.rnn.num_layers,
                batch_size, self.num_hiddens), device=device),
                    torch.zeros((
                        self.num_directions * self.rnn.num_layers,
                        batch_size, self.num_hiddens), device=device))


device = d2l.try_gpu()
net = RNNModel(rnn_layer, vocab_size=len(vocab))
net = net.to(device)
d2l.predict_ch8('time traveller', 10, net, vocab, device)

num_epochs, lr = 500, 1
d2l.train_ch8(net, train_iter, vocab, lr, num_epochs, device)

plt.show()
  • Result:
perplexity 1.3, 264351.2 tokens/sec on cuda:0
time traveller held in his hand was forelwis centitay llane so t
traveller held in his hand was forelwis centitay llane so t


4. RNN Summary

(1) A simple RNN language model consists of input encoding, the recurrent neural network model, and output generation.
(2) The RNN model's state must be initialized before training; random sampling and sequential partitioning use different initialization strategies.
(3) Before making any prediction, the model updates itself through a warm-up period, obtaining a better hidden state than the initial one.
(4) Gradient clipping can prevent exploding gradients, but it cannot fix vanishing gradients.

4.1 Shape Test for torch.nn.RNN Parameters

Note: the output here refers to the outputs of the intermediate hidden layer, i.e. the h_t values; in other words, torch.nn.RNN only implements the recurrence over the hidden states.
So we have to add a fully connected layer afterwards, as in the sketch below, to compute:
$$O_t=\phi(W_{ho}h_t+b_o)\tag{5}$$
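
A minimal sketch of Eq. (5): putting a linear output layer on top of nn.RNN. The dimensions are illustrative assumptions; this mirrors what self.linear does in the RNNModel class above:

import torch
from torch import nn

vocab_size, hidden_size = 28, 256
rnn = nn.RNN(vocab_size, hidden_size)
linear = nn.Linear(hidden_size, vocab_size)  # maps h_t to O_t as in Eq. (5)

X = torch.randn(35, 2, vocab_size)           # (num_steps, batch_size, vocab_size)
H, h_n = rnn(X)                              # H: hidden states of all time steps
O = linear(H.reshape(-1, hidden_size))       # (num_steps * batch_size, vocab_size)
print(O.shape)                               # torch.Size([70, 28])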

  • Example 1:
# Annotated example; the shape hints are written as comments so the snippet is valid Python
rnn = nn.RNN(input_size=10, hidden_size=20, num_layers=2)
input = torch.randn(5, 3, 10)   # (sequence_length=5, batch_size=3, input_size=10)
h0 = torch.randn(2, 3, 20)      # (num_layers=2, batch_size=3, hidden_size=20)
output, hn = rnn(input, h0)
# output.shape = (sequence_length=5, batch_size=3, hidden_size=20)
# hn.shape = (num_layers=2, batch_size=3, hidden_size=20)
  • Example 2:
  • Code:
import torch
from torch import nn

sequence_length = 6
batch_size = 10
input_size = 5
output_size = 12
num_layers = 3
input = torch.randn(sequence_length, batch_size, input_size)
h0 = torch.randn(num_layers, batch_size, output_size)
rnn = nn.RNN(input_size=input_size, hidden_size=output_size, num_layers=num_layers)
output, hn = rnn(input, h0)
# output.shape = ([sequence_length=6,batch_size=10,output_size=12])
# hn.shape = ([num_layers=3,batch_size=10,output_size=12])
print(f"output.shape={output.shape},hn.shape={hn.shape}")
  • Result:
output.shape=torch.Size([6, 10, 12]),hn.shape=torch.Size([3, 10, 12])