This is the last update of the semester; next up is preparing for finals. Updates will resume next semester.
This week's topic is the bidirectional LSTM.
Step 1. How the bidirectional LSTM works
I mainly followed the first article in the references below; it is very thorough and also covers RNN and LSTM (after all, the bidirectional LSTM is just an LSTM variant).
For the RNN and LSTM background, see weeks 6 & 7 (2021.10.23-2021.11.5); I won't repeat it here.
The bidirectional LSTM (Bi-LSTM) is an LSTM variant; let me start with the Bi-RNN as an example.
Bi-RNN structure diagram:
As the diagram shows, a reversed RNN layer is added, and the two opposing chains together form the Bi-RNN. Here S0 and S'0 are the initial states fed into the forward and backward chains, and Si and S'i are their outputs at step i.
RNNs and LSTMs can only use information from earlier time steps to predict the output at the next step, but in some problems the current output depends not only on previous states but possibly on future ones as well. For example, predicting a missing word in a sentence requires looking not only at the preceding text but also at what comes after it, i.e. truly judging from the full context. A BRNN consists of two RNNs stacked on top of each other, and the output is determined jointly by the states of both.
The figure makes this easy to grasp: a Bi-LSTM (or Bi-RNN) can be viewed as two layers of networks. The first layer takes the sequence starting from the left, which for text means feeding the sentence from its first word; the second layer takes the sequence starting from the right, i.e. feeding from the last word of the sentence, and processes it in reverse exactly as the first layer does. Finally the two results are combined.
A Bi-LSTM is the same idea, except the parameters come in pairs (s0, h0, ...); everything else is analogous.
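To make the two directions concrete, here is a minimal sketch of my own (not from the referenced article) using PyTorch's nn.LSTM: with bidirectional=True, the output at every time step is the concatenation of the forward and backward hidden states, so the feature dimension doubles.
import torch
import torch.nn as nn
# toy input: batch of 2 sequences, 5 time steps, 10 features per step
x = torch.randn(2, 5, 10)
uni = nn.LSTM(input_size=10, hidden_size=16, batch_first=True)
bi = nn.LSTM(input_size=10, hidden_size=16, batch_first=True, bidirectional=True)
out_uni, _ = uni(x)
out_bi, (h_n, c_n) = bi(x)
print(out_uni.shape)  # torch.Size([2, 5, 16]) -> hidden_dim
print(out_bi.shape)   # torch.Size([2, 5, 32]) -> hidden_dim * 2 (forward + backward concatenated)
print(h_n.shape)      # torch.Size([2, 2, 16]) -> num_layers * num_directions = 2 state slots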
Step 2. Code implementation
I. Dataset:
(1) reviews.txt is the raw text file: 25000 lines, one English movie-review text per line.
(2) labels.txt is the label file: 25000 lines, one label per line, positive or negative.
In the code: the training set is 20000 reviews, the validation set 2500, and the test set 2500.
II. Code
1. Load the training data
import os
os.environ["CUDA_VISIBLE_DEVICES"] = '1'
import numpy as np
###############################################################################
####################### 1. Load the training data ############################
###############################################################################
with open('./data/reviews.txt', 'r') as f:
    reviews = f.read()
with open('./data/labels.txt', 'r') as f:
    labels = f.read()
print(reviews[:30])  # first 30 characters
print()
print(labels[:20])  # first 20 characters; a[:n] means items 0 through n-1, same as a[0:n]. From labels.txt this comes out as positive\nnegative\npo
Result:
Analysis: the first print shows the first 30 characters of reviews.txt (spaces included), 30 in all.
The second shows the first 20 characters of labels.txt (newlines included), 20 in all.
2. Text preprocessing
###############################################################################
########################## 2. Text preprocessing #############################
###############################################################################
from string import punctuation
# remove punctuation
reviews = reviews.lower()  # lowercase, standardize
all_text = ''.join([c for c in reviews if c not in punctuation])
# split on newlines, then rejoin with spaces
reviews_split = all_text.split('\n')
all_text = ' '.join(reviews_split)
# build the word list
words = all_text.split()
# the following three lines are my own addition
print()
print(words[:30])  # check: the first 30 words this time -- compare with the earlier print of the first 30 characters
print()
Result:
Analysis: this extracts 30 words (note: words, not the first 30 characters). The words variable holds every word from reviews.txt after four changes: (1) everything is lowercased, (2) punctuation is gone, (3) newlines are dropped, (4) words are joined by single spaces.
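As a quick check of those transformations, a toy example of my own (not part of the original script):
from string import punctuation
sample = "Bromwell High is a cartoon comedy.\nIt ran at the same time!"
sample = sample.lower()                                      # (1) lowercase
sample = ''.join(c for c in sample if c not in punctuation)  # (2) strip punctuation ('\n' survives)
toy_words = ' '.join(sample.split('\n')).split()             # (3)+(4) drop newlines, split into words
print(toy_words[:8])  # ['bromwell', 'high', 'is', 'a', 'cartoon', 'comedy', 'it', 'ran']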
3. Build the vocabulary and convert the reviews to integers
###############################################################################
######## 3. Build the vocabulary and convert the reviews to integers #########
###############################################################################
from collections import Counter
## build a dictionary that maps words to integers
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}
## use the dict to tokenize each review in reviews_split
## store the tokenized reviews in reviews_ints
reviews_ints = []
for review in reviews_split:
    reviews_ints.append([vocab_to_int[word] for word in review.split()])
# vocabulary stats
print('Unique words: ', len((vocab_to_int)))  # should be ~ 74000+
print()
# print tokens in first review
print('Tokenized review: \n', reviews_ints[:1])  # the first integer list
Result:
Analysis: 1. counts tallies the words: how many distinct words occur and how often each one appears.
2. vocab lists the distinct words (without their frequencies), sorted from most to least frequent.
3. vocab_to_int maps each distinct word in reviews.txt to its rank, numbered from 1 upward, so its length equals the number of distinct words (74072 here).
4. reviews_ints converts each review into a list of integers (25001 of them, since the trailing newline yields one empty entry); the screenshot shows the first one.
5. The word-to-number conversion works as follows (a toy version is sketched after this list):
a. review.split() breaks each review into words; word iterates over them in the for loop.
b. Each word is numbered by its frequency rank (e.g. the is the most frequent word, so it becomes 1; is becomes 6, a becomes 3 -- all taken from vocab_to_int).
c. The review is then rewritten in integer form, e.g. bromwell high is a cartoon comedy: [21025, 308, 6, 3, 1050, 207]
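The same steps on a hypothetical mini-corpus of my own (indices will of course differ from the real 74072-word vocabulary):
from collections import Counter
toy = "the movie is the best movie".split()
toy_counts = Counter(toy)                                         # {'the': 2, 'movie': 2, 'is': 1, 'best': 1}
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)  # most frequent words first
toy_vocab_to_int = {w: i for i, w in enumerate(toy_vocab, 1)}     # start at 1; 0 is reserved for padding
print(toy_vocab_to_int)                                           # e.g. {'the': 1, 'movie': 2, 'is': 3, 'best': 4}
print([toy_vocab_to_int[w] for w in "the best movie".split()])    # [1, 4, 2]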
4. Convert the "positive"/"negative" labels to integers
###############################################################################
########## 4. Convert the "positive"/"negative" labels to integers ###########
###############################################################################
# label encoding: 1 = positive, 0 = negative
labels_split = labels.split('\n')  # one label per line
encoded_labels = np.array([1 if label == 'positive' else 0 for label in labels_split])
# outlier review stats
review_lens = Counter([len(x) for x in reviews_ints])
print()
print("Zero-length reviews: {}".format(review_lens[0]))
print("One-length reviews: {}".format(review_lens[1]))
print("Maximum review length: {}".format(max(review_lens)))
Result:
Analysis: encoded_labels is the integer encoding derived from labels.txt. review_lens groups the 25001 tokenized reviews by length; the output shows exactly one review of length 0, and a maximum length of 2514 (you can verify there is only one of those as well).
5. Remove the zero-length reviews
###############################################################################
#################### 5. Remove the zero-length reviews #######################
###############################################################################
print()
## remove any zero-length reviews and their labels from the reviews_ints list
print('Number of reviews before removing outliers: ', len(reviews_ints))
# get the indices of all reviews with non-zero length
non_zero_idx = [ii for ii, review in enumerate(reviews_ints) if len(review) != 0]
# drop the zero-length reviews and their labels
reviews_ints = [reviews_ints[ii] for ii in non_zero_idx]
encoded_labels = np.array([encoded_labels[ii] for ii in non_zero_idx])
print('Number of reviews after removing outliers: ', len(reviews_ints))
print()
Result:
6. Pad or truncate every review to a uniform length of 200 words
###############################################################################
###### 6. Pad or truncate every review to a uniform length of 200 words: #####
##########   reviews shorter than 200 are zero-padded on the left,   #########
##########   reviews longer than 200 keep only their first 200 words #########
###############################################################################
def pad_features(reviews_ints, seq_length):
    ''' Return features of review_ints, where each review is padded with 0's
        or truncated to the input seq_length.
    '''
    # getting the correct rows x cols shape
    features = np.zeros((len(reviews_ints), seq_length), dtype=int)
    # for each review, take up to seq_length tokens and right-align them
    for i, row in enumerate(reviews_ints):
        features[i, -len(row):] = np.array(row)[:seq_length]
    return features
# set the uniform length to 200
seq_length = 200
features = pad_features(reviews_ints, seq_length=seq_length)
## test statements - do not change - ##
assert len(features) == len(reviews_ints), "Your features should have as many rows as reviews."
assert len(features[0]) == seq_length, "Each feature row should contain seq_length values."
# print the first 10 values of the first 30 rows
print(features[:30, :10])  # (30 rows, 10 columns)
Result:
Analysis: the tokenized reviews all have different lengths (mostly 100-odd, some over 200), so we fix the length at 200: anything beyond 200 tokens is dropped, and shorter reviews are padded with zeros on the left. After removing the outlier there are 25000 reviews.
features holds the data from reviews.txt, so it is 25000 rows by 200 columns; the screenshot shows a 30-row, 10-column slice (see the sketch below).
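A tiny sketch of the pad-left / truncate behavior (my own example, with seq_length shrunk to 5 for readability):
toy_ints = [[7, 8],              # shorter than 5 -> zero-padded on the left
            [1, 2, 3, 4, 5, 6]]  # longer than 5 -> truncated to the first 5 words
print(pad_features(toy_ints, seq_length=5))
# [[0 0 0 7 8]
#  [1 2 3 4 5]]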
7. Split the data and build the (review, label) DataLoaders
###############################################################################
############## 7. SPLIT DATA & GET (REVIEW, LABEL) DATALOADER ################
###############################################################################
split_frac = 0.8
## split data into training, validation, and test data (features and labels, x and y)
split_idx = int(len(features) * split_frac)
train_x, remaining_x = features[:split_idx], features[split_idx:]
train_y, remaining_y = encoded_labels[:split_idx], encoded_labels[split_idx:]
test_idx = int(len(remaining_x) * 0.5)
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]
## print out the shapes of your resultant feature data
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape),
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))
import torch
from torch.utils.data import TensorDataset, DataLoader
# create Tensor datasets
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))
# dataloaders
batch_size = 50
# make sure to SHUFFLE your training data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)
# obtain one batch of training data
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)  # next(dataiter) rather than dataiter.next(), which newer PyTorch removed
print('Sample input size: ', sample_x.size())  # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size())  # batch_size
print('Sample label: \n', sample_y)
# First checking if GPU is available
train_on_gpu = torch.cuda.is_available()
if (train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')
Result:
Analysis: the 25000-row, 200-column features array is split into three parts: train set, validation set, and test set (see the result above).
train_x and friends are ndarrays, so each is converted to a tensor and paired with its labels to form train_data, valid_data, and test_data. From train_loader onward the training data is served as 20000 / batch_size = 400 batches.
The iter()/next() part just peeks at one batch (the values change on every run); it is only a sanity check and not required by the rest of the program. A check of the resulting sizes follows below.
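A quick sanity check of the split arithmetic (my own addition; the numbers assume the 25000 reviews left after step 5):
assert train_x.shape == (20000, 200)   # 25000 * 0.8
assert val_x.shape == (2500, 200)      # half of the remaining 5000
assert test_x.shape == (2500, 200)     # the other half
print(len(train_loader))               # 20000 / 50 = 400 batches per epoch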
8. Define the LSTM model
###############################################################################
######################### 8. Define the LSTM model ###########################
###############################################################################
import torch.nn as nn

class SentimentRNN(nn.Module):
    """
    The RNN model that will be used to perform Sentiment analysis.
    """
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, bidirectional=True, drop_prob=0.5):
        """
        Initialize the model by setting up the layers.
        """
        super(SentimentRNN, self).__init__()
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        self.bidirectional = bidirectional
        # embedding and LSTM layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers,
                            dropout=drop_prob, batch_first=True,
                            bidirectional=bidirectional)
        # dropout layer
        self.dropout = nn.Dropout(0.3)
        # linear and sigmoid layers: a bidirectional LSTM outputs 2 * hidden_dim features
        if bidirectional:
            self.fc = nn.Linear(hidden_dim * 2, output_size)
        else:
            self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()

    def forward(self, x, hidden):
        """
        Perform a forward pass of our model on some input and hidden state.
        """
        batch_size = x.size(0)
        # embeddings and lstm_out
        x = x.long()
        embeds = self.embedding(x)
        lstm_out, hidden = self.lstm(embeds, hidden)
        # if bidirectional:
        #     lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim*2)
        # else:
        #     lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        # dropout and fully-connected layer
        out = self.dropout(lstm_out)
        out = self.fc(out)
        # sigmoid function
        sig_out = self.sig(out)
        # reshape to be batch_size first
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1]  # keep only the output at the last time step
        # return last sigmoid output and hidden state
        return sig_out, hidden

    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes (n_layers * num_directions) x batch_size x hidden_dim,
        # initialized to zero, for the hidden state and cell state of the LSTM
        weight = next(self.parameters()).data
        number = 1
        if self.bidirectional:
            number = 2
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers * number, batch_size, self.hidden_dim).zero_().cuda(),
                      weight.new(self.n_layers * number, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers * number, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers * number, batch_size, self.hidden_dim).zero_())
        return hidden
9. Instantiate the model with hyperparameters
###############################################################################
################ 9. Instantiate the model with hyperparameters ###############
###############################################################################
vocab_size = len(vocab_to_int) + 1  # +1 for the 0 padding on top of our word tokens
output_size = 1
embedding_dim = 400
hidden_dim = 256
n_layers = 2
bidirectional = False  # set this to True for a bidirectional LSTM
net = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers, bidirectional)
print()
print(net)
Result:
for name, parameters in net.named_parameters():
    print(name, ':', parameters.size())
Result:
Analysis: the network consists of an embedding layer, the LSTM layers, a dropout layer, a fully-connected layer, and a sigmoid activation. The tensor shape after each layer is shown in the screenshot; at the end we take the last column of every row of the 50x200 tensor and print it (the values differ from run to run). The shape flow is traced below.
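To verify that shape flow, a small trace of my own through one forward pass (assumes the CPU case, i.e. train_on_gpu is False, and the hyperparameters just defined with bidirectional=False):
h0 = net.init_hidden(batch_size=50)     # tuple of two (n_layers, 50, hidden_dim) = (2, 50, 256) tensors
x = torch.zeros(50, 200).long()         # dummy batch: 50 reviews, 200 tokens each
emb = net.embedding(x)                  # -> (50, 200, 400)
lstm_out, h0 = net.lstm(emb, h0)        # -> (50, 200, 256)  (would be 512 with bidirectional=True)
out = net.fc(net.dropout(lstm_out))     # -> (50, 200, 1)
sig = net.sig(out).view(50, -1)[:, -1]  # -> (50,)  one probability per review
print(emb.shape, lstm_out.shape, out.shape, sig.shape)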
10. Define the loss and optimizer
###############################################################################
#################### 10. Define the loss and optimizer #######################
###############################################################################
lr = 0.001
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
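Note that nn.BCELoss expects probabilities in [0, 1], which is why the model ends in a Sigmoid. A quick check of the formula -y*log(p) - (1-y)*log(1-p) on made-up numbers:
p = torch.tensor([0.9, 0.2])   # predicted probabilities
y = torch.tensor([1.0, 0.0])   # true labels
print(nn.BCELoss()(p, y))      # tensor(0.1643)
# by hand: (-log(0.9) - log(0.8)) / 2 = (0.1054 + 0.2231) / 2 ≈ 0.1643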
11. Train the network
###############################################################################
########################## 11. Train the network #############################
###############################################################################
epochs = 4  # the loss stops decreasing after roughly 3-4 epochs
print_every = 100
clip = 5  # gradient clipping
# move model to GPU, if available
if (train_on_gpu):
    net.cuda()
net.train()
torch.backends.cudnn.enabled = False
# train for 4 epochs
for e in range(epochs):
    # initialize hidden state
    h = net.init_hidden(batch_size)
    counter = 0
    # batch loop
    for inputs, labels in train_loader:
        counter += 1
        if (train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()
        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history
        h = tuple([each.data for each in h])
        # zero accumulated gradients
        net.zero_grad()
        # get the output from the model
        output, h = net(inputs, h)
        # calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()
        # loss stats
        if counter % print_every == 0:
            # Get validation loss
            val_h = net.init_hidden(batch_size)
            val_losses = []
            net.eval()
            for inputs, labels in valid_loader:
                # Creating new variables for the hidden state, otherwise
                # we'd backprop through the entire training history
                val_h = tuple([each.data for each in val_h])
                if (train_on_gpu):
                    inputs, labels = inputs.cuda(), labels.cuda()
                output, val_h = net(inputs, val_h)
                val_loss = criterion(output.squeeze(), labels.float())
                val_losses.append(val_loss.item())
            net.train()
            print("Epoch: {}/{}...".format(e + 1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))
Result:
Analysis: this step covers both the training set and the validation set.
Every 100 batches (100 * batch_size = 5000 reviews) the current training Loss is printed, then the whole validation set (2500 reviews) is run through to get Val Loss.
Four such blocks of 100 batches make up one epoch (the 20000 training reviews = 400 batches), so the validation set is evaluated four times per epoch (4 * 2500 reviews).
This repeats for 4 epochs in total.
12. Evaluate the trained model on the test set
###############################################################################
############### 12. Evaluate the trained model on the test set ###############
###############################################################################
# Get test data loss and accuracy
test_losses = []  # track loss
num_correct = 0
# init hidden state
h = net.init_hidden(batch_size)
net.eval()
# iterate over test data
for inputs, labels in test_loader:
    # Creating new variables for the hidden state, otherwise
    # we'd backprop through the entire training history
    h = tuple([each.data for each in h])
    if (train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()
    # get predicted outputs
    output, h = net(inputs, h)
    # calculate loss
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze())  # rounds to the nearest integer
    # compare predictions to true label
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)
# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))
# accuracy over all test data
test_acc = num_correct / len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))
Result:
Analysis: the test set contains 2500 reviews; at the end we report the average loss and the classification accuracy over them.
References:
[深度学习] PyTorch 实现双向LSTM 情感分析_小墨鱼的专栏-CSDN博客_pytorch 双向lstm
RuntimeError: cudnn RNN backward can only be called in training mode_沉迷单车的追风少年-CSDN博客
RuntimeError:CUDA error:out of memory问题解决_Slim's Hello World-CSDN博客
......(too many articles to list -- whenever I ran into an unfamiliar structure, network, or function I went and looked it up; I can't compile them all)