This is the last update of the semester; next up is preparing for finals. Updates will resume next semester.
This week's topic is the bidirectional LSTM.
Step 1. How the bidirectional LSTM works
I mainly followed the first article in the references below; it is very thorough and also covers RNN and LSTM (after all, the bidirectional LSTM is just an LSTM variant).
For the RNN and LSTM background, see weeks 6 & 7 (2021.10.23-2021.11.5); I won't repeat it here.
The bidirectional LSTM (Bi-LSTM) is an LSTM variant; let me start with the Bi-RNN as an example.
Bi-RNN structure diagram:
As the diagram shows, a reversed RNN layer is added, and the two opposing chains together form the Bi-RNN. Here S0 and S'0 are the initial states fed into the forward and backward chains, and Si and S'i are their outputs at step i.
RNNs and LSTMs can only use information from earlier time steps to predict the output at the next step, but in some problems the current output depends not only on previous states but possibly on future ones as well. For example, predicting a missing word in a sentence requires looking not only at the preceding text but also at what comes after it, i.e. truly judging from the full context. A BRNN consists of two RNNs stacked on top of each other, and the output is determined jointly by the states of both.
The figure makes this easy to grasp: a Bi-LSTM (or Bi-RNN) can be viewed as two layers of networks. The first layer takes the sequence starting from the left, which for text means feeding the sentence from its first word; the second layer takes the sequence starting from the right, i.e. feeding from the last word of the sentence, and processes it in reverse exactly as the first layer does. Finally the two results are combined.
A Bi-LSTM is the same idea, except the parameters come in pairs (s0, h0, ...); everything else is analogous.
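To make the two directions concrete, here is a minimal sketch of my own (not from the referenced article) using PyTorch's nn.LSTM: with bidirectional=True, the output at every time step is the concatenation of the forward and backward hidden states, so the feature dimension doubles.
import torch
import torch.nn as nn
# toy input: batch of 2 sequences, 5 time steps, 10 features per step
x = torch.randn(2, 5, 10)
uni = nn.LSTM(input_size=10, hidden_size=16, batch_first=True)
bi = nn.LSTM(input_size=10, hidden_size=16, batch_first=True, bidirectional=True)
out_uni, _ = uni(x)
out_bi, (h_n, c_n) = bi(x)
print(out_uni.shape)  # torch.Size([2, 5, 16]) -> hidden_dim
print(out_bi.shape)   # torch.Size([2, 5, 32]) -> hidden_dim * 2 (forward + backward concatenated)
print(h_n.shape)      # torch.Size([2, 2, 16]) -> num_layers * num_directions = 2 state slots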
Step 2. Code implementation
I. Dataset:
(1) reviews.txt is the raw text file: 25000 lines, one English movie-review text per line.
(2) labels.txt is the label file: 25000 lines, one label per line, positive or negative.
In the code: the training set is 20000 reviews, the validation set 2500, and the test set 2500.
II. Code
1. Load the training data
import os
os.environ["CUDA_VISIBLE_DEVICES"] = '1'
import numpy as np
###############################################################################
####################### 1. Load the training data ############################
###############################################################################
with open('./data/reviews.txt', 'r') as f:
    reviews = f.read()
with open('./data/labels.txt', 'r') as f:
    labels = f.read()
print(reviews[:30])  # first 30 characters
print()
print(labels[:20])  # first 20 characters; a[:n] means items 0 through n-1, same as a[0:n]. From labels.txt this comes out as positive\nnegative\npo
Result:
Analysis: the first print shows the first 30 characters of reviews.txt (spaces included), 30 in all.
The second shows the first 20 characters of labels.txt (newlines included), 20 in all.
2. Text preprocessing
###############################################################################
########################## 2. Text preprocessing #############################
###############################################################################
from string import punctuation
# remove punctuation
reviews = reviews.lower()  # lowercase, standardize
all_text = ''.join([c for c in reviews if c not in punctuation])
# split on newlines, then rejoin with spaces
reviews_split = all_text.split('\n')
all_text = ' '.join(reviews_split)
# build the word list
words = all_text.split()
# the following three lines are my own addition
print()
print(words[:30])  # check: the first 30 words this time -- compare with the earlier print of the first 30 characters
print()
Result:
Analysis: this extracts 30 words (note: words, not the first 30 characters). The words variable holds every word from reviews.txt after four changes: (1) everything is lowercased, (2) punctuation is gone, (3) newlines are dropped, (4) words are joined by single spaces.
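As a quick check of those transformations, a toy example of my own (not part of the original script):
from string import punctuation
sample = "Bromwell High is a cartoon comedy.\nIt ran at the same time!"
sample = sample.lower()                                      # (1) lowercase
sample = ''.join(c for c in sample if c not in punctuation)  # (2) strip punctuation ('\n' survives)
toy_words = ' '.join(sample.split('\n')).split()             # (3)+(4) drop newlines, split into words
print(toy_words[:8])  # ['bromwell', 'high', 'is', 'a', 'cartoon', 'comedy', 'it', 'ran']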
3. Build the vocabulary and convert the reviews to integers
###############################################################################
######## 3. Build the vocabulary and convert the reviews to integers #########
###############################################################################
from collections import Counter
## build a dictionary that maps words to integers
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}
## use the dict to tokenize each review in reviews_split
## store the tokenized reviews in reviews_ints
reviews_ints = []
for review in reviews_split:
    reviews_ints.append([vocab_to_int[word] for word in review.split()])
# vocabulary stats
print('Unique words: ', len((vocab_to_int)))  # should be ~ 74000+
print()
# print tokens in first review
print('Tokenized review: \n', reviews_ints[:1])  # the first integer list
Result:
Analysis: 1. counts tallies the words: how many distinct words occur and how often each one appears.
2. vocab lists the distinct words (without their frequencies), sorted from most to least frequent.
3. vocab_to_int maps each distinct word in reviews.txt to its rank, numbered from 1 upward, so its length equals the number of distinct words (74072 here).
4. reviews_ints converts each review into a list of integers (25001 of them, since the trailing newline yields one empty entry); the screenshot shows the first one.
5. The word-to-number conversion works as follows (a toy version is sketched after this list):
a. review.split() breaks each review into words; word iterates over them in the for loop.
b. Each word is numbered by its frequency rank (e.g. the is the most frequent word, so it becomes 1; is becomes 6, a becomes 3 -- all taken from vocab_to_int).
c. The review is then rewritten in integer form, e.g. bromwell high is a cartoon comedy: [21025, 308, 6, 3, 1050, 207]
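The same steps on a hypothetical mini-corpus of my own (indices will of course differ from the real 74072-word vocabulary):
from collections import Counter
toy = "the movie is the best movie".split()
toy_counts = Counter(toy)                                         # {'the': 2, 'movie': 2, 'is': 1, 'best': 1}
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)  # most frequent words first
toy_vocab_to_int = {w: i for i, w in enumerate(toy_vocab, 1)}     # start at 1; 0 is reserved for padding
print(toy_vocab_to_int)                                           # e.g. {'the': 1, 'movie': 2, 'is': 3, 'best': 4}
print([toy_vocab_to_int[w] for w in "the best movie".split()])    # [1, 4, 2]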
4. Convert the "positive"/"negative" labels to integers
###############################################################################
########## 4. Convert the "positive"/"negative" labels to integers ###########
###############################################################################
# label encoding: 1 = positive, 0 = negative
labels_split = labels.split('\n')  # one label per line
encoded_labels = np.array([1 if label == 'positive' else 0 for label in labels_split])
# outlier review stats
review_lens = Counter([len(x) for x in reviews_ints])
print()
print("Zero-length reviews: {}".format(review_lens[0]))
print("One-length reviews: {}".format(review_lens[1]))
print("Maximum review length: {}".format(max(review_lens)))
Result:
Analysis: encoded_labels is the integer encoding derived from labels.txt. review_lens groups the 25001 tokenized reviews by length; the output shows exactly one review of length 0, and a maximum length of 2514 (you can verify there is only one of those as well).
5. Remove the zero-length reviews
###############################################################################
#################### 5. Remove the zero-length reviews #######################
###############################################################################
print()
## remove any zero-length reviews and their labels from the reviews_ints list
print('Number of reviews before removing outliers: ', len(reviews_ints))
# get the indices of all reviews with non-zero length
non_zero_idx = [ii for ii, review in enumerate(reviews_ints) if len(review) != 0]
# drop the zero-length reviews and their labels
reviews_ints = [reviews_ints[ii] for ii in non_zero_idx]
encoded_labels = np.array([encoded_labels[ii] for ii in non_zero_idx])
print('Number of reviews after removing outliers: ', len(reviews_ints))
print()
Result:
6. Pad or truncate every review to a uniform length of 200 words
###############################################################################
###### 6. Pad or truncate every review to a uniform length of 200 words: #####
##########   reviews shorter than 200 are zero-padded on the left,   #########
##########   reviews longer than 200 keep only their first 200 words #########
###############################################################################
def pad_features(reviews_ints, seq_length):
    ''' Return features of review_ints, where each review is padded with 0's
        or truncated to the input seq_length.
    '''
    # getting the correct rows x cols shape
    features = np.zeros((len(reviews_ints), seq_length), dtype=int)
    # for each review, take up to seq_length tokens and right-align them
    for i, row in enumerate(reviews_ints):
        features[i, -len(row):] = np.array(row)[:seq_length]
    return features
# set the uniform length to 200
seq_length = 200
features = pad_features(reviews_ints, seq_length=seq_length)
## test statements - do not change - ##
assert len(features) == len(reviews_ints), "Your features should have as many rows as reviews."
assert len(features[0]) == seq_length, "Each feature row should contain seq_length values."
# print the first 10 values of the first 30 rows
print(features[:30, :10])  # (30 rows, 10 columns)
Result:
Analysis: the tokenized reviews all have different lengths (mostly 100-odd, some over 200), so we fix the length at 200: anything beyond 200 tokens is dropped, and shorter reviews are padded with zeros on the left. After removing the outlier there are 25000 reviews.
features holds the data from reviews.txt, so it is 25000 rows by 200 columns; the screenshot shows a 30-row, 10-column slice (see the sketch below).
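A tiny sketch of the pad-left / truncate behavior (my own example, with seq_length shrunk to 5 for readability):
toy_ints = [[7, 8],              # shorter than 5 -> zero-padded on the left
            [1, 2, 3, 4, 5, 6]]  # longer than 5 -> truncated to the first 5 words
print(pad_features(toy_ints, seq_length=5))
# [[0 0 0 7 8]
#  [1 2 3 4 5]]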
7. Split the data and build the (review, label) DataLoaders
###############################################################################
############## 7. SPLIT DATA & GET (REVIEW, LABEL) DATALOADER ################
###############################################################################
split_frac = 0.8
## split data into training, validation, and test data (features and labels, x and y)
split_idx = int(len(features) * split_frac)
train_x, remaining_x = features[:split_idx], features[split_idx:]
train_y, remaining_y = encoded_labels[:split_idx], encoded_labels[split_idx:]
test_idx = int(len(remaining_x) * 0.5)
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]
## print out the shapes of your resultant feature data
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape),
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))
import torch
from torch.utils.data import TensorDataset, DataLoader
# create Tensor datasets
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))
# dataloaders
batch_size = 50
# make sure to SHUFFLE your training data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)
# obtain one batch of training data
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)  # next(dataiter) rather than dataiter.next(), which newer PyTorch removed
print('Sample input size: ', sample_x.size())  # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size())  # batch_size
print('Sample label: \n', sample_y)
# First checking if GPU is available
train_on_gpu = torch.cuda.is_available()
if (train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')
Result:
Analysis: the 25000-row, 200-column features array is split into three parts: train set, validation set, and test set (see the result above).
train_x and friends are ndarrays, so each is converted to a tensor and paired with its labels to form train_data, valid_data, and test_data. From train_loader onward the training data is served as 20000 / batch_size = 400 batches.
The iter()/next() part just peeks at one batch (the values change on every run); it is only a sanity check and not required by the rest of the program. A check of the resulting sizes follows below.
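A quick sanity check of the split arithmetic (my own addition; the numbers assume the 25000 reviews left after step 5):
assert train_x.shape == (20000, 200)   # 25000 * 0.8
assert val_x.shape == (2500, 200)      # half of the remaining 5000
assert test_x.shape == (2500, 200)     # the other half
print(len(train_loader))               # 20000 / 50 = 400 batches per epoch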
8. Define the LSTM model
###############################################################################
######################### 8. Define the LSTM model ###########################
###############################################################################
import torch.nn as nn

class SentimentRNN(nn.Module):
    """
    The RNN model that will be used to perform Sentiment analysis.
    """
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, bidirectional=True, drop_prob=0.5):
        """
        Initialize the model by setting up the layers.
        """
        super(SentimentRNN, self).__init__()
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        self.bidirectional = bidirectional
        # embedding and LSTM layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers,
                            dropout=drop_prob, batch_first=True,
                            bidirectional=bidirectional)
        # dropout layer
        self.dropout = nn.Dropout(0.3)
        # linear and sigmoid layers: a bidirectional LSTM outputs 2 * hidden_dim features
        if bidirectional:
            self.fc = nn.Linear(hidden_dim * 2, output_size)
        else:
            self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()

    def forward(self, x, hidden):
        """
        Perform a forward pass of our model on some input and hidden state.
        """
        batch_size = x.size(0)
        # embeddings and lstm_out
        x = x.long()
        embeds = self.embedding(x)
        lstm_out, hidden = self.lstm(embeds, hidden)
        # if bidirectional:
        #     lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim*2)
        # else:
        #     lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        # dropout and fully-connected layer
        out = self.dropout(lstm_out)
        out = self.fc(out)
        # sigmoid function
        sig_out = self.sig(out)
        # reshape to be batch_size first
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1]  # keep only the output at the last time step
        # return last sigmoid output and hidden state
        return sig_out, hidden

    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes (n_layers * num_directions) x batch_size x hidden_dim,
        # initialized to zero, for the hidden state and cell state of the LSTM
        weight = next(self.parameters()).data
        number = 1
        if self.bidirectional:
            number = 2
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers * number, batch_size, self.hidden_dim).zero_().cuda(),
                      weight.new(self.n_layers * number, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers * number, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers * number, batch_size, self.hidden_dim).zero_())
        return hidden
9. Instantiate the model with hyperparameters
###############################################################################
################ 9. Instantiate the model with hyperparameters ###############
###############################################################################
vocab_size = len(vocab_to_int) + 1  # +1 for the 0 padding on top of our word tokens
output_size = 1
embedding_dim = 400
hidden_dim = 256
n_layers = 2
bidirectional = False  # set this to True for a bidirectional LSTM
net = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers, bidirectional)
print()
print(net)
Result:
for name, parameters in net.named_parameters():
    print(name, ':', parameters.size())
Result:
Analysis: the network consists of an embedding layer, the LSTM layers, a dropout layer, a fully-connected layer, and a sigmoid activation. The tensor shape after each layer is shown in the screenshot; at the end we take the last column of every row of the 50x200 tensor and print it (the values differ from run to run). The shape flow is traced below.
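To verify that shape flow, a small trace of my own through one forward pass (assumes the CPU case, i.e. train_on_gpu is False, and the hyperparameters just defined with bidirectional=False):
h0 = net.init_hidden(batch_size=50)     # tuple of two (n_layers, 50, hidden_dim) = (2, 50, 256) tensors
x = torch.zeros(50, 200).long()         # dummy batch: 50 reviews, 200 tokens each
emb = net.embedding(x)                  # -> (50, 200, 400)
lstm_out, h0 = net.lstm(emb, h0)        # -> (50, 200, 256)  (would be 512 with bidirectional=True)
out = net.fc(net.dropout(lstm_out))     # -> (50, 200, 1)
sig = net.sig(out).view(50, -1)[:, -1]  # -> (50,)  one probability per review
print(emb.shape, lstm_out.shape, out.shape, sig.shape)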
10. Define the loss and optimizer
###############################################################################
#################### 10. Define the loss and optimizer #######################
###############################################################################
lr = 0.001
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
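Note that nn.BCELoss expects probabilities in [0, 1], which is why the model ends in a Sigmoid. A quick check of the formula -y*log(p) - (1-y)*log(1-p) on made-up numbers:
p = torch.tensor([0.9, 0.2])   # predicted probabilities
y = torch.tensor([1.0, 0.0])   # true labels
print(nn.BCELoss()(p, y))      # tensor(0.1643)
# by hand: (-log(0.9) - log(0.8)) / 2 = (0.1054 + 0.2231) / 2 ≈ 0.1643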
11. Train the network
###############################################################################
########################## 11. Train the network #############################
###############################################################################
epochs = 4  # the loss stops decreasing after roughly 3-4 epochs
print_every = 100
clip = 5  # gradient clipping
# move model to GPU, if available
if (train_on_gpu):
    net.cuda()
net.train()
torch.backends.cudnn.enabled = False
# train for 4 epochs
for e in range(epochs):
    # initialize hidden state
    h = net.init_hidden(batch_size)
    counter = 0
    # batch loop
    for inputs, labels in train_loader:
        counter += 1
        if (train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()
        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history
        h = tuple([each.data for each in h])
        # zero accumulated gradients
        net.zero_grad()
        # get the output from the model
        output, h = net(inputs, h)
        # calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()
        # loss stats
        if counter % print_every == 0:
            # Get validation loss
            val_h = net.init_hidden(batch_size)
            val_losses = []
            net.eval()
            for inputs, labels in valid_loader:
                # Creating new variables for the hidden state, otherwise
                # we'd backprop through the entire training history
                val_h = tuple([each.data for each in val_h])
                if (train_on_gpu):
                    inputs, labels = inputs.cuda(), labels.cuda()
                output, val_h = net(inputs, val_h)
                val_loss = criterion(output.squeeze(), labels.float())
                val_losses.append(val_loss.item())
            net.train()
            print("Epoch: {}/{}...".format(e + 1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))
Result:
Analysis: this step covers both the training set and the validation set.
Every 100 batches (100 * batch_size = 5000 reviews) the current training Loss is printed, then the whole validation set (2500 reviews) is run through to get Val Loss.
Four such blocks of 100 batches make up one epoch (the 20000 training reviews = 400 batches), so the validation set is evaluated four times per epoch (4 * 2500 reviews).
This repeats for 4 epochs in total.
12. Evaluate the trained model on the test set
###############################################################################
############### 12. Evaluate the trained model on the test set ###############
###############################################################################
# Get test data loss and accuracy
test_losses = []  # track loss
num_correct = 0
# init hidden state
h = net.init_hidden(batch_size)
net.eval()
# iterate over test data
for inputs, labels in test_loader:
    # Creating new variables for the hidden state, otherwise
    # we'd backprop through the entire training history
    h = tuple([each.data for each in h])
    if (train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()
    # get predicted outputs
    output, h = net(inputs, h)
    # calculate loss
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze())  # rounds to the nearest integer
    # compare predictions to true label
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)
# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))
# accuracy over all test data
test_acc = num_correct / len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))
Result:
Analysis: the test set contains 2500 reviews; at the end we report the average loss and the classification accuracy over them.
References:
[深度学习] PyTorch 实现双向LSTM 情感分析_小墨鱼的专栏-CSDN博客_pytorch 双向lstm
RuntimeError: cudnn RNN backward can only be called in training mode_沉迷单车的追风少年-CSDN博客
RuntimeError:CUDA error:out of memory问题解决_Slim's Hello World-CSDN博客
......(too many articles to list -- whenever I ran into an unfamiliar structure, network, or function I went and looked it up; I can't compile them all)