Building an LSTM in PyTorch for Sentiment Analysis on the IMDB Dataset (with a Detailed Walkthrough of the Data Analysis and Processing)


1. Dataset Introduction

Data link: data (Baidu Netdisk)
Extraction code: p1ua

The main difficulty of this task lies in processing the data, so let's first look at the raw data:
(screenshot of the raw dataset directory omitted)
We only need the neg and pos folders under the test set and the neg and pos folders under the training set. Take test/neg as an example:
(screenshot of the test/neg folder omitted)
Opening any one of the txt files:
(screenshot of a sample review file omitted)
As you can see, each txt file contains one long review, and our task is to classify that review.

2. Data Processing

Let's first describe what form the data must take for the LSTM. Suppose we have 25,000 reviews, each with 250 words (longer reviews are truncated and shorter ones padded, as described later), and each word is represented by a 50-dimensional vector, so each review has shape [250, 50]. If we split the 25,000 training reviews into 250 batches of 100 reviews each, the whole training set has shape [250, 100, 250, 50]: the first 250 is the number of batches, 100 is the number of reviews per batch, the second 250 is the number of words per review, and the final 50 is the dimension of each word vector. The rest of this section explains in detail how to build this dataset.
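As a quick sanity check of these shapes, here is a minimal sketch of feeding a single batch through an LSTM (the hidden size of 64 anticipates the model in Section 3; the random tensor simply stands in for one batch of word vectors):

import torch
import torch.nn as nn

batch = torch.randn(100, 250, 50)  # one batch: 100 reviews, 250 words, 50-dim vector per word
lstm = nn.LSTM(input_size=50, hidden_size=64, batch_first=True)
output, (h_n, c_n) = lstm(batch)
print(output.shape)  # torch.Size([100, 250, 64]) -- one hidden state per word
print(h_n.shape)     # torch.Size([1, 100, 64])   -- final hidden state of each review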

2.1 Building the Word Vector Table

  • First we need the 50-dimensional vector for every word. Here we use the pre-trained GloVe vectors available online:
    (screenshot of the GloVe files omitted)
    Each file contains 400,000 lines, and each line holds one word followed by its vector. The first file contains 50-dimensional vectors, and the remaining files contain 100-, 200- and 300-dimensional vectors. We read the first file and, from the word and vector on each line, build a word vector table:
import numpy as np

def load_cab_vector():
    word_list = []
    vocabulary_vectors = []
    with open('glove.6B.50d.txt', encoding='utf-8') as data:
        for line in data.readlines():
            temp = line.strip('\n').split(' ')  # one word followed by its vector components
            name = temp[0]
            word_list.append(name.lower())
            vector = [temp[i] for i in range(1, len(temp))]  # the vector part
            vector = list(map(float, vector))  # convert the strings to floats
            vocabulary_vectors.append(vector)
    # save both lists as .npy files so they can be reloaded quickly later
    vocabulary_vectors = np.array(vocabulary_vectors)
    word_list = np.array(word_list)
    np.save('npys/vocabulary_vectors', vocabulary_vectors)
    np.save('npys/word_list', word_list)
    return vocabulary_vectors, word_list

This gives us the word vector table. It consists of two lists: word_list contains the 400,000 words, and vocabulary_vectors contains the corresponding 400,000 50-dimensional vectors. Since parsing the text file is quite slow, we convert the two lists to arrays and save them with np.save, as done at the end of the function above (this pattern is used frequently later).


We thus obtain two npy files: vocabulary_vectors.npy and word_list.npy.
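To double-check the saved files, they can be loaded back with np.load (a minimal sketch; glove.6B.50d.txt contains 400,000 words):

import numpy as np

word_list = np.load('npys/word_list.npy', allow_pickle=True)
vocabulary_vectors = np.load('npys/vocabulary_vectors.npy', allow_pickle=True)
print(word_list.shape)           # (400000,)
print(vocabulary_vectors.shape)  # (400000, 50)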

2.2 Processing the Training and Test Sets

  • Next we process the training and test sets. We read all the files (training + test, 50,000 reviews in total):
import os
import re

def load_data(path, flag='train'):
    labels = ['pos', 'neg']
    data = []
    for label in labels:
        files = os.listdir(os.path.join(path, flag, label))
        # punctuation (and a few other characters) to strip out
        r = '[’!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n。!,]+'
        for file in files:
            with open(os.path.join(path, flag, label, file), 'r', encoding='utf8') as rf:
                temp = rf.read().replace('\n', '')
                temp = temp.replace('<br /><br />', ' ')
                temp = re.sub(r, '', temp)
                temp = temp.split(' ')
                temp = [temp[i].lower() for i in range(len(temp)) if temp[i] != '']
                if label == 'pos':
                    data.append([temp, 1])
                elif label == 'neg':
                    data.append([temp, 0])
    return data

The function returns a list in which every element is itself a list containing the review's words and its label (1 for pos, 0 for neg). For example, printing train_data[0]:

train_data = load_data('Imdb')
print(train_data[0])

The output is:

[['bromwell', 'high', 'is', 'a', 'cartoon', 'comedy', 'it', 'ran', 'at', 'the', 'same', 'time', 'as', 'some', 'other', 'programs', 'about', 'school', 'life', 'such', 'as', 'teachers', 'my', '35', 'years', 'in', 'the', 'teaching', 'profession', 'lead', 'me', 'to', 'believe', 'that', 'bromwell', 'highs', 'satire', 'is', 'much', 'closer', 'to', 'reality', 'than', 'is', 'teachers', 'the', 'scramble', 'to', 'survive', 'financially', 'the', 'insightful', 'students', 'who', 'can', 'see', 'right', 'through', 'their', 'pathetic', 'teachers', 'pomp', 'the', 'pettiness', 'of', 'the', 'whole', 'situation', 'all', 'remind', 'me', 'of', 'the', 'schools', 'i', 'knew', 'and', 'their', 'students', 'when', 'i', 'saw', 'the', 'episode', 'in', 'which', 'a', 'student', 'repeatedly', 'tried', 'to', 'burn', 'down', 'the', 'school', 'i', 'immediately', 'recalled', 'at', 'high', 'a', 'classic', 'line', 'inspector', 'im', 'here', 'to', 'sack', 'one', 'of', 'your', 'teachers', 'student', 'welcome', 'to', 'bromwell', 'high', 'i', 'expect', 'that', 'many', 'adults', 'of', 'my', 'age', 'think', 'that', 'bromwell', 'high', 'is', 'far', 'fetched', 'what', 'a', 'pity', 'that', 'it', 'isnt'], 1]

As you can see, the first element of this list is a list of words and the second element is the label.

  • Next we process every review by finding the index of each of its words in word_list. For the review above, for instance, we look up each word's index in word_list. We fix the maximum review length at 250 words: reviews longer than 250 words are truncated, and shorter ones are padded with 0 at the end:
def process_sentence(flag):
    sentence_code = []
    vocabulary_vectors = np.load('npys/vocabulary_vectors.npy', allow_pickle=True)
    word_list = np.load('npys/word_list.npy', allow_pickle=True)
    word_list = word_list.tolist()
    data = load_data('Imdb', flag)
    for i in range(len(data)):
        vec = data[i][0]
        temp = []
        index = 0
        for j in range(len(vec)):
            try:
                index = word_list.index(vec[j])
            except ValueError:
                # word not in the GloVe vocabulary: fall back to the last index (400,000 words -> 399999)
                index = 399999
            finally:
                temp.append(index)  # temp holds the vocabulary index of each word in the review
        if len(temp) < 250:
            for k in range(len(temp), 250):  # pad short reviews with index 0
                temp.append(0)
        else:
            temp = temp[0:250]  # keep only the first 250 words
        sentence_code.append(temp)

    sentence_code = np.array(sentence_code)
    if flag == 'train':
        np.save('npys/sentence_code_1', sentence_code)
    else:
        np.save('npys/sentence_code_2', sentence_code)

The code above produces two files: sentence_code_1.npy and sentence_code_2.npy. Each array has shape [25000, 250]: 25,000 reviews, each stored as the word_list indices of its 250 words.
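To verify the result, we can load an index array back and decode a few indices into words (a minimal sketch; note that out-of-vocabulary words were mapped to index 399999 and padding to index 0, so the decoded words may not match the original review exactly):

import numpy as np

word_list = np.load('npys/word_list.npy', allow_pickle=True).tolist()
sentence_code_1 = np.load('npys/sentence_code_1.npy', allow_pickle=True)
print(sentence_code_1.shape)                                # (25000, 250)
print([word_list[idx] for idx in sentence_code_1[0][:5]])   # first five words of the first review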

2.3 Batching

  • Finally, we batch the data. We split every 25,000 reviews into batches of 100 reviews each, and use word_list and vocabulary_vectors to replace each word index with its vector (the code below also merges and re-splits the training and test data; see the explanation after the code):
def process_batch(batch_size):
    # raw reviews: each list has shape (25000, 2) -- 25000 reviews, each a [word list, label] pair
    test_data = load_data('Imdb', flag='test')
    train_data = load_data('Imdb')
    # word-index sequences for the training reviews: (25000 reviews, 250 indices each)
    sentence_code_1 = np.load('npys/sentence_code_1.npy', allow_pickle=True)
    sentence_code_1 = sentence_code_1.tolist()
    # 25000 x 250 index sequences for the test reviews
    sentence_code_2 = np.load('npys/sentence_code_2.npy', allow_pickle=True)
    sentence_code_2 = sentence_code_2.tolist()
    vocabulary_vectors = np.load('npys/vocabulary_vectors.npy', allow_pickle=True)
    vocabulary_vectors = vocabulary_vectors.tolist()

    # replace every index with its word vector; each sentence_code becomes 25000 x 250 x 50
    for i in range(25000):
        sentence_code_1[i] = [vocabulary_vectors[x] for x in sentence_code_1[i]]
        sentence_code_2[i] = [vocabulary_vectors[x] for x in sentence_code_2[i]]
    # merge and re-split the data: 40000 reviews for training, 10000 for testing
    data = train_data + test_data
    sentence_code = np.r_[sentence_code_1, sentence_code_2]
    # shuffle the reviews and their vector sequences with the same permutation
    shuffle_ix = np.random.permutation(np.arange(len(data)))
    data = np.array(data, dtype=object)[shuffle_ix].tolist()
    sentence_code = sentence_code[shuffle_ix]

    train_data = data[:int(len(data) * 0.8)]
    test_data = data[int(len(data) * 0.8):]
    sentence_code_1 = sentence_code[:int(len(sentence_code) * 0.8)]
    sentence_code_2 = sentence_code[int(len(sentence_code) * 0.8):]

    labels_train = []
    labels_test = []
    arr_train = []
    arr_test = []

    # split into mini-batches of batch_size reviews each
    for i in range(1, int(len(train_data) / batch_size) + 1):
        arr_train.append(sentence_code_1[(i - 1) * batch_size:i * batch_size])
        labels_train.append([train_data[j][1] for j in range((i - 1) * batch_size, i * batch_size)])
    for i in range(1, int(len(test_data) / batch_size) + 1):
        arr_test.append(sentence_code_2[(i - 1) * batch_size:i * batch_size])
        labels_test.append([test_data[j][1] for j in range((i - 1) * batch_size, i * batch_size)])

    arr_train = np.array(arr_train)
    arr_test = np.array(arr_test)
    labels_train = np.array(labels_train)
    labels_test = np.array(labels_test)
    np.save('npys/arr_train', arr_train)
    np.save('npys/arr_test', arr_test)
    np.save('npys/labels_train', labels_train)
    np.save('npys/labels_test', labels_test)

    return arr_train, labels_train, arr_test, labels_test

The code above re-splits the data so that 80% is used for training and 20% for testing. It returns four arrays. Taking arr_train as an example, its shape is [400, 100, 250, 50]: 400 batches, 100 reviews per batch, 250 words per review, and a 50-dimensional vector per word.
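The shapes can be verified directly from the returned arrays (a minimal sketch, using batch_size=100 as above):

arr_train, labels_train, arr_test, labels_test = process_batch(100)
print(arr_train.shape)     # (400, 100, 250, 50)
print(labels_train.shape)  # (400, 100)
print(arr_test.shape)      # (100, 100, 250, 50)
print(labels_test.shape)   # (100, 100)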

3. Model

3.1 Building the Model

  • Build the LSTM network:
class LSTM(nn.Module):
    def __init__(self, hidden_size):
        super(LSTM, self).__init__()
        # input: one batch of reviews with shape [batch, 250, 50]
        self.lstm = nn.LSTM(input_size=50, hidden_size=hidden_size, num_layers=1,
                            batch_first=True)
        self.fc = nn.Sequential(nn.Dropout(0.5),
                                nn.Linear(hidden_size, 32),
                                nn.Linear(32, 2),
                                nn.ReLU())

    def forward(self, input_seq):
        # x: [batch, 250, hidden_size] -- one hidden state per word
        x, _ = self.lstm(input_seq)
        # apply the classifier to every time step, then keep only the last one
        x = self.fc(x)
        x = x[:, -1, :]  # [batch, 2]
        return x
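For reference, here is a minimal sketch of the imports and the device variable assumed by the training and testing code below, together with a quick shape check of the model (the random tensor simply stands in for one batch of 100 reviews):

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = LSTM(hidden_size=64).to(device)
dummy = torch.randn(100, 250, 50).to(device)  # one batch: 100 reviews, 250 words, 50-dim vectors
print(model(dummy).shape)                     # torch.Size([100, 2]) -- two class scores per review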

3.2 Training

def train():
    # load the pre-processed training batches
    print('loading...')
    epoch_num = 10
    arr_train = np.load('npys/arr_train.npy', allow_pickle=True)        # [400, 100, 250, 50]
    labels_train = np.load('npys/labels_train.npy', allow_pickle=True)  # [400, 100]
    print('training...')
    model = LSTM(hidden_size=64).to(device)
    optimizer = optim.Adam(model.parameters(), lr=0.00005)
    criterion = nn.CrossEntropyLoss().to(device)
    loss = 0
    for i in range(epoch_num):
        for j in range(400):  # 400 mini-batches per epoch
            x = arr_train[j]
            y = labels_train[j]
            input_ = torch.tensor(x, dtype=torch.float32).to(device)
            label = torch.tensor(y, dtype=torch.long).to(device)
            output = model(input_)
            optimizer.zero_grad()
            loss = criterion(output, label)
            loss.backward()
            optimizer.step()
        # note: this prints the loss of the last mini-batch of the epoch
        print('epoch:%d loss:%.5f' % (i, loss.item()))
    # save the model and optimizer state
    state = {'model': model.state_dict(), 'optimizer': optimizer.state_dict()}
    torch.save(state, 'models/LSTM.pkl')

3.3 Testing

def test():
    print('loading...')
    arr_test = np.load('npys/arr_test.npy', allow_pickle=True)        # [100, 100, 250, 50]
    labels_test = np.load('npys/labels_test.npy', allow_pickle=True)  # [100, 100]
    print('testing...')
    model = LSTM(hidden_size=64).to(device)
    model.load_state_dict(torch.load('models/LSTM.pkl')['model'])
    model.eval()
    num = 0
    for i in range(100):  # 100 test batches of 100 reviews each = 10000 reviews
        xx = arr_test[i]
        yy = labels_test[i]
        input_ = torch.tensor(xx, dtype=torch.float32).to(device)
        label = torch.tensor(yy, dtype=torch.long).to(device)
        output = model(input_)
        pred = output.max(dim=-1)[1]  # predicted class for each review
        for k in range(100):
            if pred[k] == label[k]:
                num += 1

    print('Accuracy:', num / 10000)

After training for 10 epochs, the accuracy is 65%.

4. How to Run the Code

Since many readers have asked how to run the code, here is a summary:

(1) Download the source code from the link at the end of this article, then download the data files from the Baidu Netdisk link.

(2) Put glove.6B.50d.txt from the Baidu Netdisk download into the LSTM-IMDB-Classification/ directory.

(3) Generate vocabulary_vectors.npy and word_list.npy
by simply running load_cab_vector:

if __name__ == '__main__':
    load_cab_vector()

After it finishes running, vocabulary_vectors.npy and word_list.npy appear under the npys/ directory (screenshot omitted).
(4) Generate sentence_code_1.npy and sentence_code_2.npy:

if __name__ == '__main__':
    # load_cab_vector()
    process_sentence('train')
    process_sentence('test')

This step takes quite a long time (word_list.index is a linear search over 400,000 words for every word of every review). After it finishes successfully, sentence_code_1.npy and sentence_code_2.npy appear under npys/ (screenshot omitted).
(5) Generate the training and test sets:

if __name__ == '__main__':
    # load_cab_vector()
    # process_sentence('train')
    # process_sentence('test')
    process_batch(100)

After it finishes successfully, arr_train.npy, labels_train.npy, arr_test.npy and labels_test.npy appear under npys/ (screenshot omitted).
(6) Train and test:

if __name__ == '__main__':
    train()
    test()
    # load_cab_vector()
    # process_sentence('train')
    # process_sentence('test')
    # process_batch(100)

5. Source Code

The source code will be released later.
