PyTorch Study Notes: RNN


燥栋


Article link: https://blog.csdn.net/qq_45363979/article/details/108058881


1. Representing Time Series

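Data such as images and video are spatially correlated. A convolutional neural network exploits this by looking at one small region (receptive field) at a time and sharing the same weights everywhere, which lets the network be very deep while still extracting features layer by layer.
Speech and text, in contrast, are ordered in time. For 2D image data, the RGB values of a pixel represent its colour. For speech, each time segment has a waveform whose peak represents intensity. For text, PyTorch has no built-in representation of characters, so a string has to be converted: String ==> [seq_len, feature_len]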

1.1 Sequence representation

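[seq_len, feature_len]: seq_len is the number of steps in the sequence, and feature_len is the length of the vector that represents each step.

Example: house prices
[month, 1]: one price for each month
[100days, 1]: one price for each of 100 days

[words, word_vec] with One-Hot encoding
Each word gets its own One-Hot vector. [5, 3500] means the sentence has 5 words and the vocabulary contains 3500 words.
Drawbacks of One-Hot encoding: the vectors are very sparse, take up a lot of space, carry very little information, and are high-dimensional.
[words, word_vec] is more commonly built with encodings such as GloVe or word2vec. These map each word to a dense vector of a fixed length, and the relatedness of two words is then measured by the similarity of their vectors.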
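A minimal sketch of the One-Hot case described above, assuming a made-up sentence of 5 word indices in a 3500-word vocabulary (the indices are illustrative, not taken from any real vocabulary). It shows the [5, 3500] shape, how sparse the tensor is, and that two different One-Hot words always have zero similarity.

```python
import torch
import torch.nn.functional as F

vocab_size = 3500                                 # vocabulary size from the example above
word_ids = torch.tensor([12, 7, 345, 7, 3499])    # hypothetical indices of a 5-word sentence

# One row per word, one column per vocabulary entry -> [5, 3500]
one_hot = F.one_hot(word_ids, num_classes=vocab_size).float()

# Each row holds a single 1, so almost the whole tensor is zeros
nonzero_fraction = one_hot.sum().item() / one_hot.numel()

# Different words are orthogonal; only identical words have similarity 1
sim_different = F.cosine_similarity(one_hot[0], one_hot[1], dim=0).item()
sim_same = F.cosine_similarity(one_hot[1], one_hot[3], dim=0).item()

print(one_hot.shape)
print(nonzero_fraction, sim_different, sim_same)
```

Dense encodings such as GloVe or word2vec avoid both problems: the vectors are short, and their similarity is meaningful.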

1.2 Batch

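There are two possible positions for the batch dimension:
1. [b, word_num, word_vec]: batch first, as in a CNN. The first index selects a sentence, the second selects a word in that sentence, and the last dimension is the word's vector.
2. [word_num, b, word_vec]: sequence first. There are word_num time steps, each step holds b curves (one per sentence in the batch), and each point on a curve is described by word_vec values.
The second layout is easier to reason about for an RNN, because indexing the first dimension gives the whole batch at one time step.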
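The two layouts hold the same data and differ only in the order of the first two dimensions. A small sketch, with sizes chosen purely for illustration:

```python
import torch

b, word_num, word_vec = 3, 5, 100                    # illustrative sizes

batch_first = torch.randn(b, word_num, word_vec)     # layout 1: [b, word_num, word_vec]
seq_first = batch_first.permute(1, 0, 2)             # layout 2: [word_num, b, word_vec]

# In layout 2, indexing the first dimension gives word 0 of every sentence: [b, word_vec]
step_0 = seq_first[0]

print(seq_first.shape, step_0.shape)
```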

1.3 word2vec vs GloVe

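Start with how an embedding layer is used. nn.Embedding performs a table lookup: each word is first assigned an index, and the index is then used to look up the word's vector.
The figure in the original post shows a table of 10k words, each represented by a 300-dimensional vector. Given a word's index, the lookup returns the corresponding vector.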

import torch
from torch import nn

word_to_idx = {'hello' : 0, 'world' : 1}
lookup_tensor = torch.tensor([word_to_idx['hello']], dtype=torch.long)
embeds = nn.Embedding(2, 5)
hello_embed = embeds(lookup_tensor)
print(hello_embed)

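This nn.Embedding has not been loaded with any pretrained vectors, so the table it returns is random. The usual way to initialize it is to fill the table with word2vec or GloVe vectors. A pretrained table loaded this way is typically kept frozen, so it receives no gradients and is not optimized directly.
The lookup below uses GloVe through torchnlp, a package for natural language processing that provides GloVe vectors directly.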

from torchnlp.word_to_vector import GloVe

vectors = GloVe()
vectors['hello']

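In the rest of these notes each word is taken to be a 100-dimensional vector. The dimension actually returned depends on which GloVe vector set is loaded.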
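One way to put pretrained vectors into the lookup table is nn.Embedding.from_pretrained. The sketch below uses a random tensor as a stand-in for real word2vec or GloVe vectors (an assumption, so that the example runs without downloading anything); the freeze argument controls whether the table can receive gradients.

```python
import torch
from torch import nn

# Stand-in for pretrained vectors: 4 words, 100 dimensions each (random, for illustration only)
pretrained = torch.randn(4, 100)

# freeze=True (the default) keeps the table fixed; freeze=False allows fine-tuning
frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)
trainable = nn.Embedding.from_pretrained(pretrained, freeze=False)

idx = torch.tensor([0, 2], dtype=torch.long)          # indices of two words
print(frozen(idx).shape)
print(frozen.weight.requires_grad, trainable.weight.requires_grad)
```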

2. How RNNs Work

2.1 Sentiment Analysis

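Take the task of deciding whether a Taobao review is positive or negative. Feed I hate this boring movie into linear layers: with GloVe encoding each word is a [100] vector, and a linear layer reduces its dimension and extracts abstract features. Suppose each linear layer outputs [2]. The five outputs are combined into [5, 2] and passed through one more linear layer to produce P(pos|x), which solves a binary classification problem.

Drawbacks:
1. Real sentences can contain many words, so this introduces a large number of [w, b] parameters.
2. There is no context. In a long sentence, a single word is not enough to decide whether the sentiment is positive or negative.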
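The per-word design described above can be sketched as follows. This is only an illustration of the idea, with random vectors standing in for the five GloVe vectors; it makes the first drawback concrete by counting the parameters of the per-word layers.

```python
import torch
from torch import nn

seq_len, word_dim = 5, 100

# One separate linear layer per word position: [100] -> [2]
per_word = nn.ModuleList([nn.Linear(word_dim, 2) for _ in range(seq_len)])
# Final layer maps the combined [5, 2] features to a single logit
classifier = nn.Linear(seq_len * 2, 1)

sentence = torch.randn(seq_len, word_dim)             # stand-in for 5 word vectors
features = torch.stack([layer(word) for layer, word in zip(per_word, sentence)])  # [5, 2]
p_pos = torch.sigmoid(classifier(features.flatten())) # P(pos|x), shape [1]

# Parameters of the per-word layers grow linearly with sentence length
n_params = sum(p.numel() for p in per_word.parameters())
print(features.shape, p_pos.shape, n_params)
```

Each extra word adds another 100 * 2 + 2 parameters, and no layer ever sees more than one word.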

2.2 Weight Sharing and Consistent Memory

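For the first problem, replace the separate [w1, b1], [w2, b2], … with a single shared [w, b]. The linear layer then extracts features from every word, and so captures the context of the whole sentence, instead of serving one word only. This is like weight sharing in a CNN, where a kernel that sees only local information is reused across positions. An RNN does the same: the feature extractor is shared across time steps, which reduces the number of parameters.
For the second problem, a long-term memory is needed to hold the context. Seeing like is not enough to conclude that the review is positive; whether don't appears before it matters too, and ideally what comes after it as well. So there must be a persistent unit that runs through the whole sentence, so that each step looks at the earlier information and not only at the current input.

[Wxh, Whh] are thus shared globally, and the context is carried through the sentence: every step takes into account both the current input x and the earlier information h.
In the folded view of the model, suppose the input is [5, 3, 100]: a batch of 3, 5 words per sentence, each word a 100-dimensional vector. Then each input x is [3, 100], and h0 is initialized to all zeros.
In the unfolded view, which output to use is a flexible choice: take only the last ht, or combine all the ht.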

2.3 How to train?

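The update rule is $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$, with output $y_t = W_{yh} h_t$.
Here $W_{hh} h_{t-1}$ extracts features from the previous memory h, and $W_{xh} x_t$ extracts features from the current input x. The two are added and passed through the activation function, which for an RNN is tanh. The output y is obtained by passing ht through a linear layer.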
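The recurrence can be written out by hand for the [5, 3, 100] input used earlier. This is a sketch with random stand-in weights and no bias terms; hidden_len=20 and out_len=2 are arbitrary choices made for the example.

```python
import torch

seq_len, batch, feature_len, hidden_len, out_len = 5, 3, 100, 20, 2

x = torch.randn(seq_len, batch, feature_len)
# Random stand-ins for the shared weights (hypothetical values, small scale)
W_xh = torch.randn(hidden_len, feature_len) * 0.1
W_hh = torch.randn(hidden_len, hidden_len) * 0.1
W_yh = torch.randn(out_len, hidden_len) * 0.1

h = torch.zeros(batch, hidden_len)                # h0 is all zeros
states = []
for x_t in x:                                     # x_t: [3, 100]
    # h_t = tanh(W_hh h_{t-1} + W_xh x_t), applied to the whole batch at once
    h = torch.tanh(h @ W_hh.T + x_t @ W_xh.T)
    states.append(h)

all_h = torch.stack(states)                       # every h_t: [5, 3, 20]
y_last = states[-1] @ W_yh.T                      # y computed from the last h_t: [3, 2]
print(all_h.shape, y_last.shape)
```

The same three weight matrices are reused at every time step, which is the weight sharing described in 2.2.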
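Next, check whether the model can be trained with gradient descent. If the gradients can be computed, the weights can be updated with the gradient formula, and there is reason to expect that the model can be trained well.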

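In the formula from the original figure, W_I is Wxh, W_R is Whh, and W_O is Wyh. Wxh and Whh are the same at every time step (globally shared), so their use at each time step affects the output y at that step and also the output at the last time step. h0 is initialized to all zeros. Et is the error: the mean squared error between the output y at the last time step and the target.
When taking the partial derivative of Et with respect to Whh, weight sharing means that backpropagation has to account for the effect of Whh at every time step on the last y. The partial derivatives of Et with respect to Whh at each time step are therefore summed. Each factor of the derivative is then examined in turn.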
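Since the figures are not reproduced here, the derivation can be sketched as follows. This is the standard backpropagation-through-time expansion written with the symbols used in the text, not a transcription of the original figure; $\partial^{+}$ denotes the immediate derivative that treats $h_{i-1}$ as a constant, and biases are omitted.

```latex
\begin{aligned}
h_t &= \tanh\left(W_{hh} h_{t-1} + W_{xh} x_t\right), \qquad y_t = W_{yh} h_t \\
\frac{\partial E_t}{\partial W_{hh}}
  &= \sum_{i=1}^{t}
     \frac{\partial E_t}{\partial y_t}\,
     \frac{\partial y_t}{\partial h_t}\,
     \frac{\partial h_t}{\partial h_i}\,
     \frac{\partial^{+} h_i}{\partial W_{hh}} \\
\frac{\partial h_k}{\partial h_i}
  &= \prod_{j=i+1}^{k} \frac{\partial h_j}{\partial h_{j-1}}
   = \prod_{j=i+1}^{k} \operatorname{diag}\left(1 - h_j^{2}\right) W_{hh}
\end{aligned}
```

The last product contains $k-i$ factors of $W_{hh}$, which is where the repeated multiplication by $W_{hh}$ in the gradient comes from.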

3. Using the RNN Layer

3.1 Single Layer

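X: [seq_len, batch, feature_len]
Xt: [batch, feature_len]
Wxh: [hidden_len, feature_len]
Ht: [batch, hidden_len]
Whh: [hidden_len, hidden_len]

Xt@Wxh.T=[batch, hidden_len]: the contribution of the current word of each sentence to the memory
Ht@Whh.T=[batch, hidden_len]: the update computed from the previous memory Ht
The two results are added (and passed through tanh) to give Ht+1. This is why h0 cannot be initialized as [hidden_len]; it has to be [batch, hidden_len], so that the initial memory has one row per sentence in the batch and can be multiplied directly.

Checking in code: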

from torch import nn

rnn = nn.RNN(100, 10)
print(rnn._parameters.keys())

print(rnn.weight_hh_l0.shape)
print(rnn.bias_hh_l0.shape)
print(rnn.weight_ih_l0.shape)
print(rnn.bias_ih_l0.shape)

Output

odict_keys(['weight_ih_l0', 'weight_hh_l0', 'bias_ih_l0', 'bias_hh_l0'])
torch.Size([10, 10])
torch.Size([10])
torch.Size([10, 100])
torch.Size([10])
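These shapes fit the update h_t = tanh(x_t @ W_ih.T + b_ih + h_{t-1} @ W_hh.T + b_hh). As a check, the sketch below recomputes the final hidden state by hand from the layer's own parameters and compares it with what nn.RNN returns; seq_len=4 and batch=3 are arbitrary values picked for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)
rnn = nn.RNN(100, 10)                     # feature_len=100, hidden_len=10
x = torch.randn(4, 3, 100)                # [seq_len, batch, feature_len]

with torch.no_grad():
    out, h_n = rnn(x)                     # h0 defaults to zeros when omitted
    h = torch.zeros(3, 10)                # [batch, hidden_len]
    for x_t in x:
        h = torch.tanh(
            x_t @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
            + h @ rnn.weight_hh_l0.T + rnn.bias_hh_l0
        )

# True when the manual recurrence reproduces the layer's final hidden state
print(torch.allclose(h, h_n[0], atol=1e-5))
```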

3.2 nn.RNN()

__init__

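input_size – dimension of the word embedding; if each word is a 100-dimensional vector, input_size=100
hidden_size – dimension of the memory (hidden state)
num_layers – defaults to 1

out, ht = forward(x, h0): ht is the hidden state at the last time step (one per layer); out collects the last layer's hidden states at every time step, h1 to ht; h0 defaults to all zeros if omitted

X=[seq_len, batch, feature_len]
h0/ht=[num_layers, batch, hidden_len]
out=[seq_len, batch, hidden_len]

Test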

import torch
from torch import nn

rnn = nn.RNN(input_size=100, hidden_size=20, num_layers=1)
print(rnn)
x = torch.randn(10, 3, 100)
out,h = rnn(x, torch.zeros(1, 3, 20))
print('out.shape =', out.shape)
print('h.shape =', h.shape)

Output

RNN(100, 20)
out.shape = torch.Size([10, 3, 20])
h.shape = torch.Size([1, 3, 20])

3.3 Multi Layers

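One layer: h=[1,batch,hidden_len] out=[seq_len,batch,hidden_len]
Two layers: h=[2,batch,hidden_len] out=[seq_len,batch,hidden_len]

Comparing the one-layer and two-layer cases: h changes because the number of layers has grown (the vertical direction), while out keeps the same shape because it collects the ht of the last layer at every time step (the horizontal direction).

Check: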

from torch import nn

rnn = nn.RNN(100, 10, num_layers=2)
print(rnn._parameters.keys())

print('Whh_l0', rnn.weight_hh_l0.shape)
print('Whh_l1', rnn.weight_hh_l1.shape)
print('Wih_l0', rnn.weight_ih_l0.shape)
print('Wih_l1', rnn.weight_ih_l1.shape)

Output:

odict_keys(['weight_ih_l0', 'weight_hh_l0', 'bias_ih_l0', 'bias_hh_l0', 'weight_ih_l1', 'weight_hh_l1', 'bias_ih_l1', 'bias_hh_l1'])
Whh_l0 torch.Size([10, 10])
Whh_l1 torch.Size([10, 10])
Wih_l0 torch.Size([10, 100])
Wih_l1 torch.Size([10, 10])

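Note: Wih_l0 maps the 100-dimensional word embedding to the memory shape 10.
Wih_l1 needs no such conversion, because the input of the second layer is the memory of the first layer: 10 ==> 10.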
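The relation between out and h in the multi-layer case can be checked with a short sketch (seq_len=7 and batch=3 are arbitrary): h has one entry per layer, and the last time step of out is the final state of the top layer.

```python
import torch
from torch import nn

rnn = nn.RNN(input_size=100, hidden_size=10, num_layers=2)
x = torch.randn(7, 3, 100)                # [seq_len, batch, feature_len]
out, h = rnn(x)                           # h0 omitted, so it defaults to zeros

print(out.shape)                          # layout: [seq_len, batch, hidden_len]
print(h.shape)                            # layout: [num_layers, batch, hidden_len]
# out[-1] is the last time step of the top layer; h[-1] is the top layer's final state
print(torch.allclose(out[-1], h[-1]))
```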

3.4 nn.RNNCell()

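Instead of nn.RNN, which processes the whole sequence in one call, nn.RNNCell can be used to process each time step of the sequence separately.

For example, with 3 sentences of 10 words each, and each word a vector of length 100, the Tensor passed to nn.RNN has shape [10,3,100].

With nn.RNNCell each time step is handled separately: the Tensor passed in has shape [3,100], and the cell has to be run 10 times. This is more work, but it is also more flexible.

__init__

input_size – dimension of the word embedding; if each word is a 100-dimensional vector, input_size=100
hidden_size – dimension of the memory (hidden state)
nn.RNNCell has no num_layers argument; for several layers, stack several cells by hand

ht = cell(xt, ht_1)

xt=[b, word vec]
ht_1/ht=[b, h dim]
out = torch.stack([h1,h2,…,ht])

One layer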

import torch
from torch import nn

# feature_len=100, hidden_len=20
cell = nn.RNNCell(100, 20)
# Input at a single time step: 3 sequences (batch=3), 100 features each (feature_len=100)
x = torch.randn(3, 100)
# Inputs for all time steps: 10 steps in total, i.e. seq_len=10
xs = [torch.randn(3, 100) for i in range(10)]
# Initialize the hidden memory: batch=3, hidden_len=20
h = torch.zeros(3, 20)
# Feed each step's input and the previous h into the nn.RNNCell for the forward pass
for xt in xs:
    h = cell(xt, h)
# The final h still has shape [batch, hidden_len]
print(h.shape)  # torch.Size([3, 20])
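To get the equivalent of the out tensor of nn.RNN from a cell, the hidden state of every step can be collected and stacked, as in out = torch.stack([h1,h2,…,ht]). A minimal sketch with the same sizes as the one-layer example:

```python
import torch
from torch import nn

cell = nn.RNNCell(100, 20)                # feature_len=100, hidden_len=20
xs = torch.randn(10, 3, 100)              # [seq_len, batch, feature_len]

h = torch.zeros(3, 20)                    # [batch, hidden_len]
states = []
for xt in xs:                             # xt: [batch, feature_len]
    h = cell(xt, h)
    states.append(h)

out = torch.stack(states)                 # [seq_len, batch, hidden_len]
print(out.shape)
```

Because the loop is written by hand, each step can also be inspected or modified, which is the flexibility mentioned above.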

Two layers

import torch
from torch import nn

# Cells for layer 0 and layer 1
cell_l0 = nn.RNNCell(100, 30)  # feature_len=100, hidden_len_l0=30
cell_l1 = nn.RNNCell(30, 20)  # hidden_len_l0=30, hidden_len_l1=20

# Hidden memory of layer 0 and layer 1 (drawn in yellow and green in the original figure)
h_l0 = torch.zeros(3, 30)  # batch=3, hidden_len_l0=30
h_l1 = torch.zeros(3, 20)  # batch=3, hidden_len_l1=20

# Raw input, batch=3, feature_len=100
xs = [torch.randn(3, 100) for i in range(4)]  # seq_len=4, i.e. 4 time steps

for xt in xs:
    h_l0 = cell_l0(xt, h_l0)
    h_l1 = cell_l1(h_l0, h_l1)

# The two final hidden states (the rightmost outputs in the original figure)
print(h_l0.shape)  # torch.Size([3, 30])
print(h_l1.shape)  # torch.Size([3, 20])

4. Time Series Prediction in Practice

Waveform prediction

import  numpy as np
import  torch
import  torch.nn as nn
import  torch.optim as optim
from    matplotlib import pyplot as plt


num_time_steps = 50
input_size = 1
hidden_size = 16
output_size = 1
lr=0.01



class Net(nn.Module):

    def __init__(self, ):
        super(Net, self).__init__()

        self.rnn = nn.RNN(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=1,
            batch_first=True,
        )
        for p in self.rnn.parameters():
          nn.init.normal_(p, mean=0.0, std=0.001)

        self.linear = nn.Linear(hidden_size, output_size)

    def forward(self, x, hidden_prev):

       out, hidden_prev = self.rnn(x, hidden_prev)
       # [b, seq, h]
       out = out.view(-1, hidden_size)
       out = self.linear(out)
       out = out.unsqueeze(dim=0)
       return out, hidden_prev




model = Net()
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr)

hidden_prev = torch.zeros(1, 1, hidden_size)

for iter in range(6000):
    start = np.random.randint(3, size=1)[0]  # randint returns a length-1 ndarray; take element 0
    time_steps = np.linspace(start, start + 10, num_time_steps)  # start and end of the time window
    data = np.sin(time_steps)
    data = data.reshape(num_time_steps, 1)
    x = torch.tensor(data[:-1]).float().view(1, num_time_steps - 1, 1)  # samples 0..48, reshaped to [b, seq_len, feature]
    y = torch.tensor(data[1:]).float().view(1, num_time_steps - 1, 1)  # samples 1..49, reshaped to [b, seq_len, feature]

    output, hidden_prev = model(x, hidden_prev)
    hidden_prev = hidden_prev.detach()

    loss = criterion(output, y)
    model.zero_grad()
    loss.backward()
    # for p in model.parameters():
    #     print(p.grad.norm())
    # torch.nn.utils.clip_grad_norm_(p, 10)
    optimizer.step()

    if iter % 100 == 0:
        print("Iteration: {} loss {}".format(iter, loss.item()))

start = np.random.randint(3, size=1)[0]
time_steps = np.linspace(start, start + 10, num_time_steps)
data = np.sin(time_steps)
data = data.reshape(num_time_steps, 1)
x = torch.tensor(data[:-1]).float().view(1, num_time_steps - 1, 1)
y = torch.tensor(data[1:]).float().view(1, num_time_steps - 1, 1)

predictions = []
input = x[:, 0, :]
for _ in range(x.shape[1]):
  input = input.view(1, 1, 1)
  (pred, hidden_prev) = model(input, hidden_prev)
  input = pred
  predictions.append(pred.detach().numpy().ravel()[0])

x = x.data.numpy().ravel()
y = y.data.numpy()
plt.scatter(time_steps[:-1], x.ravel(), s=90)
plt.plot(time_steps[:-1], x.ravel())

plt.scatter(time_steps[1:], predictions)
plt.show()

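The training pairs in the script are built by shifting the sine wave by one sample: the input is samples 0..48 and the target is samples 1..49, so the network learns to predict the next point. A standalone sketch of just that step, using a fixed start of 0 instead of the random start in the script:

```python
import numpy as np

num_time_steps = 50
time_steps = np.linspace(0, 10, num_time_steps)   # fixed window, for illustration
data = np.sin(time_steps)

x = data[:-1]                                     # samples 0..48: the inputs
y = data[1:]                                      # samples 1..49: the inputs shifted one step ahead

print(x.shape, y.shape)
print(np.array_equal(x[1:], y[:-1]))              # y is x advanced by one sample
```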

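5. Vanishing and Exploding Gradients

5.1 Causes of vanishing and exploding gradients

From the earlier gradient derivation, the gradient contains the factor $W_{hh}^{(k-i)}$, that is, $W_{hh}$ multiplied by itself once for each time step between i and k.

When $W_{hh} > 1$, $W_{hh}^{(k-i)} \to \infty$: the gradient explodes.
When $W_{hh} < 1$, $W_{hh}^{(k-i)} \to 0$: the gradient vanishes.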
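The effect of the repeated product can be seen numerically. The sketch below uses a scaled identity matrix as a deliberately simple stand-in for Whh (an assumption made for illustration; a real weight matrix behaves according to its singular values) and multiplies it by itself 50 times.

```python
import torch

def product_norm(scale, steps=50, size=4):
    """Norm of W multiplied by itself `steps` times, for W = scale * I."""
    w = scale * torch.eye(size)
    p = torch.eye(size)
    for _ in range(steps):
        p = p @ w
    return p.norm().item()

print(product_norm(1.1))   # entries slightly above 1: the product blows up
print(product_norm(0.9))   # entries slightly below 1: the product shrinks towards 0
```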

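5.2 Fixing exploding gradients

The fix for exploding gradients is to set a threshold (gradient clipping).
According to the pseudocode in the original figure, once the norm of the gradient exceeds the threshold, the gradient is divided by its own norm, which gives a vector of norm 1, and is then multiplied by the threshold. The step becomes smaller but its direction is unchanged, so training keeps approaching the optimum.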

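loss = criterion(output, y)
model.zero_grad()
loss.backward()
# Print each parameter's gradient norm to watch for explosion
for p in model.parameters():
    print(p.grad.norm())
# Clip the total gradient norm over all parameters to at most 10
torch.nn.utils.clip_grad_norm_(model.parameters(), 10)
optimizer.step()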
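The thresholding rule from 5.2 can also be written out by hand for a single gradient tensor. clip_by_norm is a hypothetical helper written for this illustration, not a PyTorch function; in real training code torch.nn.utils.clip_grad_norm_ does the equivalent over all parameters.

```python
import torch

def clip_by_norm(grad, threshold):
    """Rescale grad so that its norm is at most threshold; the direction is unchanged."""
    norm = grad.norm()
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = torch.tensor([30.0, 40.0])            # a gradient with norm 50
clipped = clip_by_norm(g, 10.0)           # rescaled to norm 10, same direction
small = clip_by_norm(torch.tensor([3.0, 4.0]), 10.0)  # norm 5 is below the threshold

print(clipped, clipped.norm())
print(small)
```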