week11&12&13(2021.11.27-2021.12.17)

This is the last update of the semester; I'll be preparing for final exams next. Updates will continue next semester.

These weeks I studied the bidirectional LSTM.

Step 1. How a bidirectional LSTM works

        I mainly followed the article listed in the references below; it is very detailed and also covers RNN and LSTM (since the bidirectional LSTM is just a variant of the LSTM).

        For RNN and LSTM background see week6&7 (2021.10.23-2021.11.5); I won't repeat it here.

        The bidirectional LSTM (Bi-LSTM) is a variant of the LSTM. Let me start with the Bi-RNN as an example.

Bi-RNN structure diagram:

         As the diagram shows, a reversed RNN layer is added, and the two opposite directions together form the Bi-RNN. Here S0 and S'0 are the inputs (initial states), while Si and S'i are the outputs.

        An RNN or LSTM can only use information from earlier time steps to predict the output at the next step, but in some problems the output at the current step depends not only on past states but possibly also on future ones. For example, predicting a missing word in a sentence requires looking not only at the preceding text but also at what follows, i.e. a genuinely context-based decision. A Bi-RNN is composed of two RNNs stacked on top of each other, and the output is determined jointly by the states of both RNNs.

        The figure above makes this easy to understand: a Bi-LSTM (or Bi-RNN) can be viewed as two layers. The first layer takes the sequence from the left, which for text means reading from the beginning of the sentence; the second layer takes the sequence from the right, i.e. it starts from the last word and applies the same processing as the first layer in reverse. Finally the two resulting outputs are combined.

        For the Bi-LSTM, the parameters simply come in two sets (s0, h0)...... everything else follows in the same way, as sketched below.
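A minimal sketch (my own addition, not from the referenced article) showing how PyTorch's nn.LSTM behaves with bidirectional=True: the forward and backward passes are concatenated, so the output feature size becomes hidden_size * 2 and the hidden state gains a direction dimension.

import torch
import torch.nn as nn

# toy 1-layer bidirectional LSTM on a random batch
lstm = nn.LSTM(input_size=10, hidden_size=16, num_layers=1,
               batch_first=True, bidirectional=True)
x = torch.randn(4, 7, 10)          # (batch, seq_len, input_size)
out, (h_n, c_n) = lstm(x)
print(out.shape)                   # torch.Size([4, 7, 32])  = hidden_size * 2 directions
print(h_n.shape)                   # torch.Size([2, 4, 16])  = (num_layers * 2 directions, batch, hidden_size)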

Step 2. Code implementation

I. Dataset:
        (1) reviews.txt is the raw text file, 25000 entries in total; each line is one English movie review.


        (2) labels.txt is the label file, 25000 entries in total; each line is one label, positive or negative.

        In the code: the training set has 20000 entries, the validation set 2500, and the test set 2500.

II. Code

1. Loading the training data

import os
os.environ["CUDA_VISIBLE_DEVICES"] = '1'

import numpy as np

###############################################################################
#######################       1. LOAD TRAINING DATA       ######################
###############################################################################

with open('./data/reviews.txt', 'r') as f:
    reviews = f.read()
with open('./data/labels.txt', 'r') as f:
    labels = f.read()

print(reviews[:30])             # take the first 30 characters
print()
print(labels[:20])                # take the first 20 characters; a[:n] means elements 0..n-1, same as a[0:n]; from labels.txt this should come out as positive\nnegative\npo

Result:

 

分析 (Analysis): the first part takes the first 30 characters of reviews.txt (spaces included), 30 in total.

           The second part takes the first 20 characters of labels.txt (newlines included), 20 in total; a tiny slicing check follows below.
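A tiny slicing check (my own addition; the string below is what labels[:20] should look like):

s = "positive\nnegative\npo"       # expected start of labels.txt
print(s[:8])                       # 'positive'   (a[:n] is the same as a[0:n])
print(len(s))                      # 20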

2. Text preprocessing

###############################################################################
##########################      2. TEXT PREPROCESSING      #####################
###############################################################################

from string import punctuation

# remove punctuation
reviews = reviews.lower()  # lowercase, standardize
all_text = ''.join([c for c in reviews if c not in punctuation])

# split into individual reviews on newlines, then rejoin with spaces
reviews_split = all_text.split('\n')
all_text = ' '.join(reviews_split)

# build the list of words
words = all_text.split()
# the following 3 lines are my own addition
print()
print(words[:30])             # test: take the first 30 words and compare with the reviews[:30] print above
print()

Result:

 分析 (Analysis): 30 words are printed this time (note: words, not the first 30 characters). The variable words holds all the words from the original reviews.txt, where (1) every word is lowercased, (2) punctuation is removed, (3) newlines are dropped, and (4) the words are joined back with spaces. A toy walk-through follows below.
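A toy walk-through of the same preprocessing (my own addition; the sentence is just the opening words of the first review):

from string import punctuation

toy = "Bromwell High is a cartoon comedy.\nIt ran at the same time as some other programs..."
toy = toy.lower()
toy = ''.join(c for c in toy if c not in punctuation)
print(toy.split('\n'))      # two lowercased "reviews" with the punctuation stripped
print(toy.split()[:5])      # ['bromwell', 'high', 'is', 'a', 'cartoon']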

3. Building the vocabulary dictionary and converting reviews to integers

###############################################################################
##############    3. BUILD VOCAB DICTIONARY AND ENCODE REVIEWS AS INTEGERS    ##
###############################################################################

from collections import Counter

## build a dictionary that maps words to integers
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}

## use the dictionary to tokenize each review in reviews_split
## store the tokenized reviews in reviews_ints
reviews_ints = []
for review in reviews_split:
    reviews_ints.append([vocab_to_int[word] for word in review.split()])

# vocabulary statistics
print('Unique words: ', len((vocab_to_int)))  # should ~ 74000+
print()

# print tokens in first review
print('Tokenized review: \n', reviews_ints[:1])   # the first encoded review

Result:

分析 (Analysis): 1. counts tallies how many distinct words appear and how often each one occurs.

           2. vocab lists the distinct words (without their counts).

           3. vocab_to_int maps every distinct word in reviews.txt to a rank from 1 upward, so its length equals the number of distinct words (74072 in total).

           4. reviews_ints converts each review into a sequence of integers (25001 entries in total, since the trailing newline yields one empty entry); the figure shows the first one printed.

           5. How a review is converted from words to numbers (see the toy sketch below):

                a. review.split() separates the words; word runs over the words of each review inside the for loop.

                b. Each word is numbered by its frequency rank (e.g. 'the' occurs most often so it becomes 1, 'is' is 6, 'a' is 3; this comes from vocab_to_int).

                c. The review is then rewritten in numeric form, e.g. 'bromwell high is a cartoon comedy' becomes [21025, 308, 6, 3, 1050, 207].
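A toy sketch of the same frequency-ranked mapping (my own addition; the sentence and the counts are made up):

from collections import Counter

toy_words = "the movie was good the plot was the best".split()
toy_counts = Counter(toy_words)
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)
toy_vocab_to_int = {word: ii for ii, word in enumerate(toy_vocab, 1)}
print(toy_vocab_to_int)        # {'the': 1, 'was': 2, ...}  - most frequent word gets index 1
print([toy_vocab_to_int[w] for w in "the plot was good".split()])   # e.g. [1, 5, 2, 4]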

4. Converting the labels "positive" / "negative" to integers

###############################################################################
##############     4. ENCODE THE LABELS "positive"/"negative" AS INTEGERS     ##
###############################################################################

# label conversion: 1=positive, 0=negative
labels_split = labels.split('\n')    # split labels on newlines
encoded_labels = np.array([1 if label == 'positive' else 0 for label in labels_split])

# outlier review stats
review_lens = Counter([len(x) for x in reviews_ints])
print()
print("Zero-length reviews: {}".format(review_lens[0]))
print("One-length reviews: {}".format(review_lens[1]))
print("Maximum review length: {}".format(max(review_lens)))

Result:

分析 (Analysis): encoded_labels is the integer array built from labels.txt (a toy check follows below). review_lens groups the 25001 encoded reviews by their length; the result shows there is 1 review of length 0, and the longest review has length 2514 (you can verify there is only one).
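A quick toy check of the label encoding (my own addition):

import numpy as np

toy_labels = "positive\nnegative\npositive".split('\n')
print(np.array([1 if label == 'positive' else 0 for label in toy_labels]))   # [1 0 1]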

5. Removing zero-length reviews

###############################################################################
#####################     5. REMOVE ZERO-LENGTH REVIEWS     ####################
###############################################################################
print()
## remove any zero-length reviews (and their labels) from reviews_ints
print('Number of reviews before removing outliers: ', len(reviews_ints))
# get the indices of reviews with non-zero length
non_zero_idx = [ii for ii, review in enumerate(reviews_ints) if len(review) != 0]

# drop the zero-length reviews and their labels
reviews_ints = [reviews_ints[ii] for ii in non_zero_idx]
encoded_labels = np.array([encoded_labels[ii] for ii in non_zero_idx])

print('Number of reviews after removing outliers: ', len(reviews_ints))
print()

Result:

6. Normalizing every review to a length of 200 words (truncate or pad)

###############################################################################
######################  6. PAD/TRUNCATE EVERY REVIEW TO 200 WORDS  #############
##########################   reviews shorter than 200 are left-padded with 0   #
##########################   reviews longer than 200 keep only their first 200 words  #
###############################################################################
def pad_features(reviews_ints, seq_length):
    ''' Return features of review_ints, where each review is padded with 0's
        or truncated to the input seq_length.
    '''

    # getting the correct rows x cols shape
    features = np.zeros((len(reviews_ints), seq_length), dtype=int)

    # for each review, right-align up to seq_length tokens into its features row
    for i, row in enumerate(reviews_ints):
        features[i, -len(row):] = np.array(row)[:seq_length]

    return features


# set the sequence length to 200
seq_length = 200
features = pad_features(reviews_ints, seq_length=seq_length)

## test statements - do not change - ##
assert len(features) == len(reviews_ints), "Your features should have as many rows as reviews."
assert len(features[0]) == seq_length, "Each feature row should contain seq_length values."

# print the first 10 values of the first 30 rows
print(features[:30, :10])        # (30 rows, 10 columns)

Result:

分析 (Analysis): since each encoded review has a different length (most are 100-something, some over 200), we fix the length at 200: reviews longer than 200 are truncated to their first 200 numbers and the rest is dropped, while reviews shorter than 200 are padded with 0 at the front. There are 25000 reviews in total.

                Here features is the encoded data from reviews.txt, so features has 25000 rows and 200 columns. The screenshot shows a 30-row by 10-column slice of it; a quick toy sanity check follows.
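A quick sanity check on pad_features with a toy input (my own addition): a review shorter than seq_length is left-padded with 0, a longer one is cut to its first seq_length numbers.

toy_ints = [[5, 8, 2], [1, 2, 3, 4, 5, 6, 7]]
print(pad_features(toy_ints, seq_length=5))
# [[0 0 5 8 2]
#  [1 2 3 4 5]]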

7. Splitting the data and building the DataLoaders

###############################################################################
##############  7. SPLIT DATA & GET (REVIEW, LABEL) DATALOADER  ###############
###############################################################################

split_frac = 0.8
## split data into training, validation, and test data (features and labels, x and y)

split_idx = int(len(features) * split_frac)
train_x, remaining_x = features[:split_idx], features[split_idx:]
train_y, remaining_y = encoded_labels[:split_idx], encoded_labels[split_idx:]

test_idx = int(len(remaining_x) * 0.5)
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]

## print out the shapes of your resultant feature data
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape),
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))

import torch
from torch.utils.data import TensorDataset, DataLoader

# create Tensor datasets
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

# dataloaders
batch_size = 50

# make sure to SHUFFLE your training data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)

# obtain one batch of training data
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)

print('Sample input size: ', sample_x.size())  # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size())  # batch_size
print('Sample label: \n', sample_y)

# First checking if GPU is available
train_on_gpu = torch.cuda.is_available()

if (train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

Result:

分析 (Analysis): the 25000-row, 200-column features array is split into three parts: a train set, a validation set and a test set (see the result above).

            train_x etc. are ndarray data, so they are converted to tensors and paired with their labels to form train_data, valid_data and test_data. From train_loader onward the training data becomes 20000 / batch_size = 400 batches (checked below).

            The part with iter() and next() at the end just uses the standard iterator syntax (easy to look up); it prints one batch (the values change from run to run) and is not essential to the rest of the program, it is only a test.
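A small check of the batch counts mentioned above (my own addition, relying on the loaders defined in this section and assuming the 20000/2500/2500 split with batch_size = 50):

print(len(train_loader))   # 400  (20000 / 50 batches per epoch)
print(len(valid_loader))   # 50   (2500 / 50)
print(len(test_loader))    # 50   (2500 / 50)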

8. Defining the LSTM model

###############################################################################
#########################     8. DEFINE THE LSTM MODEL     #####################
###############################################################################
import torch.nn as nn
class SentimentRNN(nn.Module):
    """
    The RNN model that will be used to perform Sentiment analysis.
    """

    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, bidirectional=True, drop_prob=0.5):
        """
        Initialize the model by setting up the layers.
        """
        super(SentimentRNN, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        self.bidirectional = bidirectional

        # embedding and LSTM layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers,
                            dropout=drop_prob, batch_first=True,
                            bidirectional=bidirectional)

        # dropout layer
        self.dropout = nn.Dropout(0.3)

        # linear and sigmoid layers
        if bidirectional:
            self.fc = nn.Linear(hidden_dim * 2, output_size)
        else:
            self.fc = nn.Linear(hidden_dim, output_size)

        self.sig = nn.Sigmoid()

    def forward(self, x, hidden):
        """
        Perform a forward pass of our model on some input and hidden state.
        """
        batch_size = x.size(0)

        # embeddings and lstm_out
        x = x.long()
        embeds = self.embedding(x)
        lstm_out, hidden = self.lstm(embeds, hidden)

        #         if bidirectional:
        #           lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim*2)
        #         else:
        #           lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)

        # dropout and fully-connected layer
        out = self.dropout(lstm_out)
        out = self.fc(out)
        # sigmoid function
        sig_out = self.sig(out)

        # reshape to be batch_size first
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1]  # get last batch of labels

        # return last sigmoid output and hidden state
        return sig_out, hidden

    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data

        number = 1
        if self.bidirectional:
            number = 2

        if (train_on_gpu):
            hidden = (weight.new(self.n_layers * number, batch_size, self.hidden_dim).zero_().cuda(),
                      weight.new(self.n_layers * number, batch_size, self.hidden_dim).zero_().cuda()
                      )
        else:
            hidden = (weight.new(self.n_layers * number, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers * number, batch_size, self.hidden_dim).zero_()
                      )

        return hidden

9. Instantiating the model with hyperparameters

###############################################################################
################     9. INSTANTIATE THE MODEL WITH HYPERPARAMETERS     #########
###############################################################################
vocab_size = len(vocab_to_int) + 1  # +1 for the 0 padding + our word tokens
output_size = 1
embedding_dim = 400
hidden_dim = 256
n_layers = 2
bidirectional = False  # set this to True for a bidirectional LSTM

net = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers, bidirectional)
print()
print(net)

Result:

for name,parameters in net.named_parameters():
    print(name,':',parameters.size())

Result:

 

分析 (Analysis): the network contains an embedding layer, an LSTM layer, a dropout layer, a fully-connected layer and a sigmoid activation. The tensor shape after each layer is as shown in the figure; at the end we take the last column of every row of the 50*200 tensor and print it (these values differ from run to run). A shape-tracing sketch follows.
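A shape-tracing sketch (my own addition, using the net instantiated above with bidirectional=False and assuming a batch of 50 sequences of length 200) to make the per-layer sizes concrete:

with torch.no_grad():
    x = torch.zeros(50, 200, dtype=torch.long)     # (batch_size, seq_length)
    embeds = net.embedding(x)                      # -> (50, 200, 400)
    lstm_out, _ = net.lstm(embeds)                 # -> (50, 200, 256)
    out = net.fc(net.dropout(lstm_out))            # -> (50, 200, 1)
    sig_out = net.sig(out).view(50, -1)            # -> (50, 200)
    print(sig_out[:, -1].shape)                    # torch.Size([50]): one value per review (last time step)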

10. Defining the loss and optimizer

###############################################################################
#######################    10. DEFINE LOSS AND OPTIMIZER    ####################
###############################################################################
lr = 0.001

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
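A tiny illustration of what BCELoss computes for a single example (my own addition): for a sigmoid probability p and a target y, the loss is -[y*log(p) + (1-y)*log(1-p)].

p = torch.tensor([0.9])            # predicted probability of "positive"
y = torch.tensor([1.0])            # true label
print(nn.BCELoss()(p, y))          # tensor(0.1054), i.e. -log(0.9)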

11. Training the network

###############################################################################
##########################     11. TRAIN THE NETWORK     #######################
###############################################################################
epochs = 4  # the loss stops decreasing after roughly 3-4 epochs
print_every = 100
clip = 5  # gradient clipping

# move model to GPU, if available
if (train_on_gpu):
    net.cuda()

net.train()
torch.backends.cudnn.enabled=False
# train for 4 epochs
for e in range(epochs):
    # initialize hidden state
    h = net.init_hidden(batch_size)
    counter = 0

    # batch loop
    for inputs, labels in train_loader:
        counter += 1

        if (train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()

        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history
        h = tuple([each.data for each in h])
        # zero accumulated gradients
        net.zero_grad()

        # get the output from the model
        output, h = net(inputs, h)

        # calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float())

        loss.backward()
        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()

        # loss stats
        if counter % print_every == 0:
            # Get validation loss
            val_h = net.init_hidden(batch_size)
            val_losses = []
            net.eval()
            for inputs, labels in valid_loader:

                # Creating new variables for the hidden state, otherwise
                # we'd backprop through the entire training history
                val_h = tuple([each.data for each in val_h])

                if (train_on_gpu):
                    inputs, labels = inputs.cuda(), labels.cuda()

                output, val_h = net(inputs, val_h)
                val_loss = criterion(output.squeeze(), labels.float())

                val_losses.append(val_loss.item())

            net.train()
            print("Epoch: {}/{}...".format(e + 1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))

Result:

分析 (Analysis): this step covers both the training set and the validation set.

            After every 100 batches (100 * batch_size = 5000 reviews) we record the training Loss, then run the whole validation set (2500 reviews) to obtain Val Loss.

            Four blocks of 100 batches make up 1 epoch (the 20000 training reviews), so the validation set is evaluated 4 times per epoch (4 * 2500 reviews).

            This is then repeated for 4 epochs in total; the arithmetic is spelled out below.
12. Evaluating the trained model on the test set

###############################################################################
################     12. EVALUATE THE TRAINED MODEL ON THE TEST SET     ########
###############################################################################
# Get test data loss and accuracy

test_losses = []  # track loss
num_correct = 0

# init hidden state
h = net.init_hidden(batch_size)

net.eval()
# iterate over test data
for inputs, labels in test_loader:

    # Creating new variables for the hidden state, otherwise
    # we'd backprop through the entire training history
    h = tuple([each.data for each in h])

    if (train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()

    # get predicted outputs
    output, h = net(inputs, h)

    # calculate loss
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())

    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze())  # rounds to the nearest integer

    # compare predictions to true label
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)

# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# accuracy over all test data
test_acc = num_correct / len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))

Result:

分析 (Analysis): the test set contains 2500 reviews; at the end we compute the test loss and the classification accuracy.
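A small illustration of the rounding step used above (my own addition): sigmoid outputs above 0.5 round to class 1 (positive), outputs below 0.5 round to class 0 (negative).

probs = torch.tensor([0.81, 0.23, 0.55, 0.49])
print(torch.round(probs))          # tensor([1., 0., 1., 0.])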

References:

[深度学习] PyTorch 实现双向LSTM 情感分析 (小墨鱼的专栏, CSDN博客)

RuntimeError: cudnn RNN backward can only be called in training mode (沉迷单车的追风少年, CSDN博客)

RuntimeError: CUDA error: out of memory 问题解决 (Slim's Hello World, CSDN博客)

Python中numpy数组切片: print(a[0::2])、a[::2]、[:,2]、[1:,-1:]、a[::-1]、[ : n]、[m : ]、[-1]、[:-1]、[1:]等的含义(详细) (锵锵锵锵蒋的博客, CSDN博客)

...... (there are too many references to list them all; whenever I ran into an unfamiliar structure, network or function I looked it up, so the complete list cannot be compiled here)
