Deep Learning in Practice [Movie Review Classification]: A Binary Classification Problem

赵孝正

Last modified 2022-06-20 09:00:22


First published 2022-06-19 11:00:01

Article link: https://blog.csdn.net/weixin_46713695/article/details/125355666


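1. The IMDB dataset

This section uses the IMDB dataset, which contains 50,000 highly polarized reviews from the Internet Movie Database (IMDB). The dataset is split into 25,000 reviews for training and 25,000 for testing, and each split consists of 50% positive and 50% negative reviews.

The IMDB dataset ships with Keras. It has already been preprocessed: each review (a sequence of words) has been converted to a sequence of integers, where each integer stands for a specific word in a dictionary.

1.1 Checking the Keras version, as usual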

import keras
keras.__version__

 Out[1]: '2.9.0'

1.2 Loading the IMDB dataset

from keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

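This loads the IMDB dataset. The argument num_words=10000 keeps only the 10,000 most frequent words in the training data. Rarer words are dropped from the vocabulary (Keras replaces them with a reserved out-of-vocabulary index), which keeps the vocabulary, and therefore the input vectors, at a manageable size.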
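To make the effect of num_words concrete, here is a minimal pure-Python sketch of vocabulary capping. The helper name cap_vocabulary and the toy reviews are made up for illustration; this mimics what the loader does to rare words and is not the Keras implementation itself. The out-of-vocabulary index of 2 matches the default of imdb.load_data.

```python
def cap_vocabulary(sequences, num_words, oov_index=2):
    """Replace every word index >= num_words with the out-of-vocabulary index."""
    return [[w if w < num_words else oov_index for w in seq] for seq in sequences]


# Hypothetical toy reviews; 12000 and 10000 fall outside a 10,000-word vocabulary.
toy_reviews = [[1, 14, 12000, 22], [1, 9999, 10000]]
capped = cap_vocabulary(toy_reviews, num_words=10000)
print(capped)  # [[1, 14, 2, 22], [1, 9999, 2]]
```

After capping, no index is larger than num_words - 1, which is why the maximum index in the real data comes out as 9999.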

train_data[0]
train_labels[0]
max([max(sequence) for sequence in train_data])

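Output:

 Out[2]: [1, 14, 22, 16, ... 178, 32]  # train_data is a list of reviews; each review is a list of word indices (encoding a sequence of words).
 Out[3]: 1  # train_labels is a list of 0s and 1s, where 0 stands for negative and 1 stands for positive.
 Out[4]: 9999  # Because only the 10,000 most frequent words are kept, no word index exceeds 10000.

These three expressions show the first review, its label, and the largest index in the data. The first output is the first review as a list of word indices. The second is its label, where 1 means positive and 0 means negative. The third is the maximum index, which is 9999 because the vocabulary is limited to 10,000 words (indices 0 through 9999).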

word_index = imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])

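These three lines decode a review back into text. The first line fetches the dictionary that maps words to indices, the second inverts it so that indices map to words, and the third decodes the review. The indices are shifted down by 3 because 0, 1, and 2 are reserved for "padding", "start of sequence", and "unknown"; any index with no matching word is shown as '?'.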

decoded_review

Looking at the decoded text shows what the review actually says: “? this film was just brilliant casting location scenery story direction everyone’s really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy’s that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don’t you think the whole story was so lovely because it was true and was someone’s life after all that was shared with us all”
It is long enough that, honestly, even a human needs a moment to judge its sentiment.
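The offset of 3 used when decoding is easy to get wrong, so here is a small self-contained sketch of the same logic. The four-word vocabulary and the encoded review are invented for illustration; the real dictionary comes from imdb.get_word_index().

```python
# Hypothetical toy vocabulary standing in for imdb.get_word_index().
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}
reverse_toy_index = {index: word for word, index in toy_word_index.items()}

# Encoded review: 1 marks the start of the sequence, 2 marks an unknown word,
# and every real word is stored as its dictionary index plus 3.
encoded = [1, 4, 5, 6, 2, 7]

# Subtract 3 before the lookup; reserved indices find no word and become '?'.
decoded = " ".join(reverse_toy_index.get(i - 3, "?") for i in encoded)
print(decoded)  # ? the film was ? brilliant
```

This also explains the '?' marks in the decoded review: the leading one is the start-of-sequence marker, and the others are words outside the 10,000-word vocabulary.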

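2. Preparing the data

You cannot feed lists of integers directly into a neural network; the lists must first be converted to tensors. There are two main ways to do this: pad the lists so they all have the same length, or one-hot encode them. This article takes the one-hot approach to vectorize the data.

2.1 Encoding the integer sequences into a binary matrix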
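For comparison, the padding approach can be sketched in plain NumPy as follows. The helper pad_to_matrix and the toy data are assumptions made for illustration; Keras ships its own padding utility, and this sketch is not that implementation (it pads and truncates on the right, which may differ from the library defaults).

```python
import numpy as np


def pad_to_matrix(sequences, maxlen, pad_value=0):
    """Right-pad (or truncate) each list to maxlen and stack into a 2D integer array."""
    padded = np.full((len(sequences), maxlen), pad_value, dtype="int64")
    for i, seq in enumerate(sequences):
        trimmed = seq[:maxlen]               # drop anything beyond maxlen
        padded[i, :len(trimmed)] = trimmed   # remaining positions keep pad_value
    return padded


# Hypothetical toy reviews of different lengths.
toy = [[1, 14, 22], [1, 5], [1, 8, 9, 10, 11]]
padded = pad_to_matrix(toy, maxlen=4)
print(padded.shape)  # (3, 4)
```

A padded integer tensor like this is normally fed into an embedding layer rather than directly into Dense layers, which is one reason the simpler encoding is used in this article.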

import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))  # all-zero matrix of shape (len(sequences), dimension)
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1  # set the positions listed in this review to 1
    return results

x_train = vectorize_sequences(train_data)  # vectorize the training data
x_test = vectorize_sequences(test_data)  # vectorize the test data

y_train = np.asarray(train_labels).astype('float32')  # vectorize the labels
y_test = np.asarray(test_labels).astype('float32')

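As usual, the data is one-hot encoded here (strictly speaking multi-hot, since each review sets several positions at once). Each review becomes a 10,000-dimensional vector of 0s and 1s instead of a list of indices ranging from 0 to 9999, and the labels are converted to float32 arrays as well.
The data is now ready to be fed into a neural network.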
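A tiny worked example makes the encoding easier to see. This sketch repeats the same logic with a dimension of 8 instead of 10000; the function name vectorize_toy and the two toy reviews are made up for illustration.

```python
import numpy as np


def vectorize_toy(sequences, dimension):
    """Same logic as vectorize_sequences, with the dimension passed explicitly."""
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1  # fancy indexing sets every listed position at once
    return results


# Two hypothetical reviews; index 3 appears twice in the first one.
toy = vectorize_toy([[1, 3, 3], [0, 7]], dimension=8)
print(toy)
```

Note that the repeated index 3 still produces a single 1: the encoding records which words occur in a review, not how often or in what order.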

3. Building the network

3.1 Model definition

from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

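This is where the network architecture is defined. The first two layers use the relu activation, a nonlinear function that gives the network a richer hypothesis space; without it, stacking layers would add nothing, because a composition of linear layers is still linear. The final layer uses sigmoid to output a value between 0 and 1, which is the network's final judgment: the predicted probability that the review is positive.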
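To show what the three Dense layers compute, here is a NumPy-only sketch of the forward pass. The weights are random rather than trained, and the input dimension is shrunk to 20 to keep the example small; both are assumptions for illustration, and this is not how Keras implements the layers internally.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)  # zero out negative values


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # squash any real number into (0, 1)


rng = np.random.default_rng(0)
input_dim = 20  # stand-in for the 10,000-dimensional input

# Random, untrained weights for a 16 -> 16 -> 1 stack of dense layers.
W1, b1 = rng.normal(scale=0.1, size=(input_dim, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.1, size=(16, 16)), np.zeros(16)
W3, b3 = rng.normal(scale=0.1, size=(16, 1)), np.zeros(1)

# A batch of 4 hypothetical multi-hot review vectors.
x = rng.integers(0, 2, size=(4, input_dim)).astype("float32")

h1 = relu(x @ W1 + b1)            # first hidden layer
h2 = relu(h1 @ W2 + b2)           # second hidden layer
probs = sigmoid(h2 @ W3 + b3)     # one probability per review
print(probs.shape)  # (4, 1)
```

Each Dense layer is a matrix product plus a bias followed by the activation, so the whole model is three such steps chained together.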

3.2 Compiling the model

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

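The code above passes the optimizer, the loss function, and the metric as strings, which works because rmsprop, binary_crossentropy, and accuracy are all built into Keras. Sometimes you may want to configure the parameters of the optimizer, or pass in a custom loss function or metric function. The former is done by passing an optimizer class instance to the optimizer argument; the latter is done by passing function objects to the loss and metrics arguments.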
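It can help to see what the loss and the metric actually measure. The following NumPy sketch computes binary crossentropy and accuracy by hand on four made-up predictions; the function names mirror the Keras ones, but these are illustrative reimplementations under the usual textbook definitions, not the library code.

```python
import numpy as np


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], clipping p to avoid log(0)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    per_sample = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(per_sample))


def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of predictions that land on the correct side of the threshold."""
    return float(np.mean((y_pred > threshold) == (y_true == 1)))


# Hypothetical labels and predicted probabilities.
y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.9, 0.1, 0.8, 0.6])

loss = binary_crossentropy(y_true, y_pred)
acc = binary_accuracy(y_true, y_pred)
print(round(loss, 4), acc)  # 0.3375 0.75
```

The last prediction (0.6 for a negative review) is the only one on the wrong side of 0.5, so accuracy is 0.75, and it is also the sample that contributes most to the loss.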

3.3 Configuring the optimizer

from keras import optimizers

model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001),  # lr is a deprecated alias
              loss='binary_crossentropy',
              metrics=['accuracy'])
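To give a feel for what the learning rate controls, here is a sketch of a single RMSprop update in NumPy. It follows the textbook form of the algorithm with a decay factor rho of 0.9; the function rmsprop_step and the toy numbers are assumptions for illustration, and details such as where epsilon is added may differ from the Keras implementation.

```python
import numpy as np


def rmsprop_step(w, grad, avg_sq, learning_rate=0.001, rho=0.9, eps=1e-7):
    """One textbook RMSprop update: scale each gradient by its running RMS."""
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2          # running mean of squared gradients
    w = w - learning_rate * grad / (np.sqrt(avg_sq) + eps)   # normalized step
    return w, avg_sq


# Hypothetical weights and gradients of very different magnitudes.
w = np.array([1.0, -2.0])
avg_sq = np.zeros_like(w)
grad = np.array([2.0, -0.5])

w_new, avg_sq = rmsprop_step(w, grad, avg_sq)
print(w_new - w)
```

Both weights move by the same amount in magnitude even though one gradient is four times larger, because each step is divided by that weight's own gradient scale. The learning rate then sets the overall size of the step.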

3.4 Using custom losses and metrics

from keras import losses
from keras import metrics

model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001),  # lr is a deprecated alias
              loss=losses.binary_crossentropy,
              metrics=[metrics.binary_accuracy])