Building an LSTM in PyTorch for Sentiment Analysis on the IMDB Dataset (with a Detailed Walkthrough of the Data Analysis and Processing)


1. Dataset Introduction

Data link: data (Baidu Netdisk)
Extraction code: p1ua

The main difficulty of this task lies in processing the data, so let's first look at the raw data:
(screenshot of the raw dataset directory omitted)
We only need the neg and pos folders under the test set and the neg and pos folders under the training set. Take test/neg as an example:
(screenshot of the test/neg folder omitted)
Opening any one of the txt files:
(screenshot of a sample review file omitted)
As you can see, each txt file contains one long review, and our task is to classify that review.

2. Data Processing

Let's first describe what form the data must take for the LSTM. Suppose we have 25,000 reviews, each with 250 words (longer reviews are truncated and shorter ones padded, as described later), and each word is represented by a 50-dimensional vector, so each review has shape [250, 50]. If we split the 25,000 training reviews into 250 batches of 100 reviews each, the whole training set has shape [250, 100, 250, 50]: the first 250 is the number of batches, 100 is the number of reviews per batch, the second 250 is the number of words per review, and the final 50 is the dimension of each word vector. The rest of this section explains in detail how to build this dataset.
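As a quick sanity check of these shapes, here is a minimal sketch of feeding a single batch through an LSTM (the hidden size of 64 anticipates the model in Section 3; the random tensor simply stands in for one batch of word vectors):

import torch
import torch.nn as nn

batch = torch.randn(100, 250, 50)  # one batch: 100 reviews, 250 words, 50-dim vector per word
lstm = nn.LSTM(input_size=50, hidden_size=64, batch_first=True)
output, (h_n, c_n) = lstm(batch)
print(output.shape)  # torch.Size([100, 250, 64]) -- one hidden state per word
print(h_n.shape)     # torch.Size([1, 100, 64])   -- final hidden state of each review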

2.1 Building the Word Vector Table

  • First we need the 50-dimensional vector for every word. Here we use the pre-trained GloVe vectors available online:
    (screenshot of the GloVe files omitted)
    Each file contains 400,000 lines, and each line holds one word followed by its vector. The first file contains 50-dimensional vectors, and the remaining files contain 100-, 200- and 300-dimensional vectors. We read the first file and, from the word and vector on each line, build a word vector table:
import numpy as np

def load_cab_vector():
    word_list = []
    vocabulary_vectors = []
    with open('glove.6B.50d.txt', encoding='utf-8') as data:
        for line in data.readlines():
            temp = line.strip('\n').split(' ')  # one word followed by its vector components
            name = temp[0]
            word_list.append(name.lower())
            vector = [temp[i] for i in range(1, len(temp))]  # the vector part
            vector = list(map(float, vector))  # convert the strings to floats
            vocabulary_vectors.append(vector)
    # save both lists as .npy files so they can be reloaded quickly later
    vocabulary_vectors = np.array(vocabulary_vectors)
    word_list = np.array(word_list)
    np.save('npys/vocabulary_vectors', vocabulary_vectors)
    np.save('npys/word_list', word_list)
    return vocabulary_vectors, word_list

This gives us the word vector table. It consists of two lists: word_list contains the 400,000 words, and vocabulary_vectors contains the corresponding 400,000 50-dimensional vectors. Since parsing the text file is quite slow, we convert the two lists to arrays and save them with np.save, as done at the end of the function above (this pattern is used frequently later).


We thus obtain two npy files: vocabulary_vectors.npy and word_list.npy.
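To double-check the saved files, they can be loaded back with np.load (a minimal sketch; glove.6B.50d.txt contains 400,000 words):

import numpy as np

word_list = np.load('npys/word_list.npy', allow_pickle=True)
vocabulary_vectors = np.load('npys/vocabulary_vectors.npy', allow_pickle=True)
print(word_list.shape)           # (400000,)
print(vocabulary_vectors.shape)  # (400000, 50)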

2.2 Processing the Training and Test Sets

  • Next we process the training and test sets. We read all the files (training + test, 50,000 reviews in total):
import os
import re

def load_data(path, flag='train'):
    labels = ['pos', 'neg']
    data = []
    for label in labels:
        files = os.listdir(os.path.join(path, flag, label))
        # punctuation (and a few other characters) to strip out
        r = '[’!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n。!,]+'
        for file in files:
            with open(os.path.join(path, flag, label, file), 'r', encoding='utf8') as rf:
                temp = rf.read().replace('\n', '')
                temp = temp.replace('<br /><br />', ' ')
                temp = re.sub(r, '', temp)
                temp = temp.split(' ')
                temp = [temp[i].lower() for i in range(len(temp)) if temp[i] != '']
                if label == 'pos':
                    data.append([temp, 1])
                elif label == 'neg':
                    data.append([temp, 0])
    return data

The function returns a list in which every element is itself a list containing the review's words and its label (1 for pos, 0 for neg). For example, printing train_data[0]:

train_data = load_data('Imdb')
print(train_data[0])

The output is:

[['bromwell', 'high', 'is', 'a', 'cartoon', 'comedy', 'it', 'ran', 'at', 'the', 'same', 'time', 'as', 'some', 'other', 'programs', 'about', 'school', 'life', 'such', 'as', 'teachers', 'my', '35', 'years', 'in', 'the', 'teaching', 'profession', 'lead', 'me', 'to', 'believe', 'that', 'bromwell', 'highs', 'satire', 'is', 'much', 'closer', 'to', 'reality', 'than', 'is', 'teachers', 'the', 'scramble', 'to', 'survive', 'financially', 'the', 'insightful', 'students', 'who', 'can', 'see', 'right', 'through', 'their', 'pathetic', 'teachers', 'pomp', 'the', 'pettiness', 'of', 'the', 'whole', 'situation', 'all', 'remind', 'me', 'of', 'the', 'schools', 'i', 'knew', 'and', 'their', 'students', 'when', 'i', 'saw', 'the', 'episode', 'in', 'which', 'a', 'student', 'repeatedly', 'tried', 'to', 'burn', 'down', 'the', 'school', 'i', 'immediately', 'recalled', 'at', 'high', 'a', 'classic', 'line', 'inspector', 'im', 'here', 'to', 'sack', 'one', 'of', 'your', 'teachers', 'student', 'welcome', 'to', 'bromwell', 'high', 'i', 'expect', 'that', 'many', 'adults', 'of', 'my', 'age', 'think', 'that', 'bromwell', 'high', 'is', 'far', 'fetched', 'what', 'a', 'pity', 'that', 'it', 'isnt'], 1]

As you can see, the first element of this list is a list of words and the second element is the label.

  • Next we process every review by finding the index of each of its words in word_list. For the review above, for instance, we look up each word's index in word_list. We fix the maximum review length at 250 words: reviews longer than 250 words are truncated, and shorter ones are padded with 0 at the end:
def process_sentence(flag):
    sentence_code = []
    vocabulary_vectors = np.load('npys/vocabulary_vectors.npy', allow_pickle=True)
    word_list = np.load('npys/word_list.npy', allow_pickle=True)
    word_list = word_list.tolist()
    data = load_data('Imdb', flag)
    for i in range(len(data)):
        vec = data[i][0]
        temp = []
        index = 0
        for j in range(len(vec)):
            try:
                index = word_list.index(vec[j])
            except ValueError:
                # word not in the GloVe vocabulary: fall back to the last index (400,000 words -> 399999)
                index = 399999
            finally:
                temp.append(index)  # temp holds the vocabulary index of each word in the review
        if len(temp) < 250:
            for k in range(len(temp), 250):  # pad short reviews with index 0
                temp.append(0)
        else:
            temp = temp[0:250]  # keep only the first 250 words
        sentence_code.append(temp)

    sentence_code = np.array(sentence_code)
    if flag == 'train':
        np.save('npys/sentence_code_1', sentence_code)
    else:
        np.save('npys/sentence_code_2', sentence_code)

The code above produces two files: sentence_code_1.npy and sentence_code_2.npy. Each array has shape [25000, 250]: 25,000 reviews, each stored as the word_list indices of its 250 words.
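To verify the result, we can load an index array back and decode a few indices into words (a minimal sketch; note that out-of-vocabulary words were mapped to index 399999 and padding to index 0, so the decoded words may not match the original review exactly):

import numpy as np

word_list = np.load('npys/word_list.npy', allow_pickle=True).tolist()
sentence_code_1 = np.load('npys/sentence_code_1.npy', allow_pickle=True)
print(sentence_code_1.shape)                                # (25000, 250)
print([word_list[idx] for idx in sentence_code_1[0][:5]])   # first five words of the first review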

2.3 Batching

  • Finally, we batch the data. We split every 25,000 reviews into batches of 100 reviews each, and use word_list and vocabulary_vectors to replace each word index with its vector (the code below also merges and re-splits the training and test data; see the explanation after the code):
def process_batch(batch_size):
    # raw reviews: each list has shape (25000, 2) -- 25000 reviews, each a [word list, label] pair
    test_data = load_data('Imdb', flag='test')
    train_data = load_data('Imdb')
    # word-index sequences for the training reviews: (25000 reviews, 250 indices each)
    sentence_code_1 = np.load('npys/sentence_code_1.npy', allow_pickle=True)
    sentence_code_1 = sentence_code_1.tolist()
    # 25000 x 250 index sequences for the test reviews
    sentence_code_2 = np.load('npys/sentence_code_2.npy', allow_pickle=True)
    sentence_code_2 = sentence_code_2.tolist()
    vocabulary_vectors = np.load('npys/vocabulary_vectors.npy', allow_pickle=True)
    vocabulary_vectors = vocabulary_vectors.tolist()

    # replace every index with its word vector; each sentence_code becomes 25000 x 250 x 50
    for i in range(25000):
        sentence_code_1[i] = [vocabulary_vectors[x] for x in sentence_code_1[i]]
        sentence_code_2[i] = [vocabulary_vectors[x] for x in sentence_code_2[i]]
    # merge and re-split the data: 40000 reviews for training, 10000 for testing
    data = train_data + test_data
    sentence_code = np.r_[sentence_code_1, sentence_code_2]
    # shuffle the reviews and their vector sequences with the same permutation
    shuffle_ix = np.random.permutation(np.arange(len(data)))
    data = np.array(data, dtype=object)[shuffle_ix].tolist()
    sentence_code = sentence_code[shuffle_ix]

    train_data = data[:int(len(data) * 0.8)]
    test_data = data[int(len(data) * 0.8):]
    sentence_code_1 = sentence_code[:int(len(sentence_code) * 0.8)]
    sentence_code_2 = sentence_code[int(len(sentence_code) * 0.8):]

    labels_train = []
    labels_test = []
    arr_train = []
    arr_test = []

    # split into mini-batches of batch_size reviews each
    for i in range(1, int(len(train_data) / batch_size) + 1):
        arr_train.append(sentence_code_1[(i - 1) * batch_size:i * batch_size])
        labels_train.append([train_data[j][1] for j in range((i - 1) * batch_size, i * batch_size)])
    for i in range(1, int(len(test_data) / batch_size) + 1):
        arr_test.append(sentence_code_2[(i - 1) * batch_size:i * batch_size])
        labels_test.append([test_data[j][1] for j in range((i - 1) * batch_size, i * batch_size)])

    arr_train = np.array(arr_train)
    arr_test = np.array(arr_test)
    labels_train = np.array(labels_train)
    labels_test = np.array(labels_test)
    np.save('npys/arr_train', arr_train)
    np.save('npys/arr_test', arr_test)
    np.save('npys/labels_train', labels_train)
    np.save('npys/labels_test', labels_test)

    return arr_train, labels_train, arr_test, labels_test

The code above re-splits the data so that 80% is used for training and 20% for testing. It returns four arrays. Taking arr_train as an example, its shape is [400, 100, 250, 50]: 400 batches, 100 reviews per batch, 250 words per review, and a 50-dimensional vector per word.
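The shapes can be verified directly from the returned arrays (a minimal sketch, using batch_size=100 as above):

arr_train, labels_train, arr_test, labels_test = process_batch(100)
print(arr_train.shape)     # (400, 100, 250, 50)
print(labels_train.shape)  # (400, 100)
print(arr_test.shape)      # (100, 100, 250, 50)
print(labels_test.shape)   # (100, 100)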

3. Model

3.1 Building the Model

  • Build the LSTM network:
class LSTM(nn.Module):
    def __init__(self, hidden_size):
        super(LSTM, self).__init__()
        # input: one batch of reviews with shape [batch, 250, 50]
        self.lstm = nn.LSTM(input_size=50, hidden_size=hidden_size, num_layers=1,
                            batch_first=True)
        self.fc = nn.Sequential(nn.Dropout(0.5),
                                nn.Linear(hidden_size, 32),
                                nn.Linear(32, 2),
                                nn.ReLU())

    def forward(self, input_seq):
        # x: [batch, 250, hidden_size] -- one hidden state per word
        x, _ = self.lstm(input_seq)
        # apply the classifier to every time step, then keep only the last one
        x = self.fc(x)
        x = x[:, -1, :]  # [batch, 2]
        return x
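For reference, here is a minimal sketch of the imports and the device variable assumed by the training and testing code below, together with a quick shape check of the model (the random tensor simply stands in for one batch of 100 reviews):

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = LSTM(hidden_size=64).to(device)
dummy = torch.randn(100, 250, 50).to(device)  # one batch: 100 reviews, 250 words, 50-dim vectors
print(model(dummy).shape)                     # torch.Size([100, 2]) -- two class scores per review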

3.2 Training

def train():
    # load the pre-processed training batches
    print('loading...')
    epoch_num = 10
    arr_train = np.load('npys/arr_train.npy', allow_pickle=True)        # [400, 100, 250, 50]
    labels_train = np.load('npys/labels_train.npy', allow_pickle=True)  # [400, 100]
    print('training...')
    model = LSTM(hidden_size=64).to(device)
    optimizer = optim.Adam(model.parameters(), lr=0.00005)
    criterion = nn.CrossEntropyLoss().to(device)
    loss = 0
    for i in range(epoch_num):
        for j in range(400):  # 400 mini-batches per epoch
            x = arr_train[j]
            y = labels_train[j]
            input_ = torch.tensor(x, dtype=torch.float32).to(device)
            label = torch.tensor(y, dtype=torch.long).to(device)
            output = model(input_)
            optimizer.zero_grad()
            loss = criterion(output, label)
            loss.backward()
            optimizer.step()
        # note: this prints the loss of the last mini-batch of the epoch
        print('epoch:%d loss:%.5f' % (i, loss.item()))
    # save the model and optimizer state
    state = {'model': model.state_dict(), 'optimizer': optimizer.state_dict()}
    torch.save(state, 'models/LSTM.pkl')

3.3 Testing

def test():
    print('loading...')
    arr_test = np.load('npys/arr_test.npy', allow_pickle=True)        # [100, 100, 250, 50]
    labels_test = np.load('npys/labels_test.npy', allow_pickle=True)  # [100, 100]
    print('testing...')
    model = LSTM(hidden_size=64).to(device)
    model.load_state_dict(torch.load('models/LSTM.pkl')['model'])
    model.eval()
    num = 0
    for i in range(100):  # 100 test batches of 100 reviews each = 10000 reviews
        xx = arr_test[i]
        yy = labels_test[i]
        input_ = torch.tensor(xx, dtype=torch.float32).to(device)
        label = torch.tensor(yy, dtype=torch.long).to(device)
        output = model(input_)
        pred = output.max(dim=-1)[1]  # predicted class for each review
        for k in range(100):
            if pred[k] == label[k]:
                num += 1

    print('Accuracy:', num / 10000)

After training for 10 epochs, the accuracy is 65%.

4. How to Run the Code

Since many readers have asked how to run the code, here is a summary:

(1) Download the source code from the link at the end of this article, then download the data files from the Baidu Netdisk link.

(2) Put glove.6B.50d.txt from the Baidu Netdisk download into the LSTM-IMDB-Classification/ directory.

(3) Generate vocabulary_vectors.npy and word_list.npy
by simply running load_cab_vector:

if __name__ == '__main__':
    load_cab_vector()

After it finishes running, vocabulary_vectors.npy and word_list.npy appear under the npys/ directory (screenshot omitted).
(4) Generate sentence_code_1.npy and sentence_code_2.npy:

if __name__ == '__main__':
    # load_cab_vector()
    process_sentence('train')
    process_sentence('test')

This step takes quite a long time (word_list.index is a linear search over 400,000 words for every word of every review). After it finishes successfully, sentence_code_1.npy and sentence_code_2.npy appear under npys/ (screenshot omitted).
(5) Generate the training and test sets:

if __name__ == '__main__':
    # load_cab_vector()
    # process_sentence('train')
    # process_sentence('test')
    process_batch(100)

After it finishes successfully, arr_train.npy, labels_train.npy, arr_test.npy and labels_test.npy appear under npys/ (screenshot omitted).
(6) Train and test:

if __name__ == '__main__':
    train()
    test()
    # load_cab_vector()
    # process_sentence('train')
    # process_sentence('test')
    # process_batch(100)

5. Source Code

The source code will be released later.
