Keras: Movie Review Classification (Binary Classification)

Reference: https://zhuanlan.zhihu.com/p/63192044?utm_source=wechat_session

Topics covered:
- Encoding integer sequences as a binary matrix
- Using custom losses and metrics
- Plotting the training and validation loss
- Building the network

Here are the key points you should take away from this example:
- Raw data usually needs a fair amount of preprocessing before it can be fed into a neural network as tensors. A sequence of words can be encoded as a binary vector, but there are other encoding options as well.
- A stack of Dense layers with relu activations can solve a wide range of problems (including sentiment classification), and you will likely use this kind of model often.
- For a binary classification problem (two output classes), the network should end with a Dense layer with one unit and a sigmoid activation: the output is a scalar between 0 and 1, encoding a probability.
- With such a scalar sigmoid output on a binary classification problem, the loss function to use is binary_crossentropy.
- The rmsprop optimizer is generally a good enough choice, whatever your problem. That is one less thing to worry about.
- As they get better on their training data, neural networks eventually start overfitting and obtain increasingly worse results on data they have never seen before. Always monitor performance on data outside of the training set.
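The recipe described by these points fits in a few lines of Keras. The snippet below is only a compact sketch for reference (the 10,000-dimensional input matches the vectorized data used later); the full walkthrough follows.

from keras import models, layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))  # stack of Dense + relu layers
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))                      # single sigmoid unit: outputs a probability in [0, 1]
model.compile(optimizer='rmsprop',                                    # a solid default optimizer
              loss='binary_crossentropy',                             # the right loss for a scalar sigmoid output
              metrics=['accuracy'])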


import keras
# keras.__version__

# Classifying movie reviews: a binary classification example
# Binary classification is probably the most widely used kind of machine-learning problem.
# In this example you will learn to classify movie reviews as positive or negative based on
# the text content of the reviews.

# This section uses the IMDB dataset: 50,000 highly polarized reviews from the Internet
# Movie Database (IMDB). The dataset is split into 25,000 reviews for training and 25,000
# reviews for testing, each set consisting of 50% positive and 50% negative reviews.
# Like MNIST, the IMDB dataset ships with Keras. It has already been preprocessed: the
# reviews (sequences of words) have been turned into sequences of integers, where each
# integer stands for a specific word in a dictionary.
# The following code loads the IMDB dataset (about 80 MB of data is downloaded the first
# time you run it).
from keras.datasets import imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
The argument num_words=10000 means that we will only keep the top 10,000 most frequently occurring words in the training data. Rare words will be discarded. This allows us to work with vector data of manageable size.

The variables train_data and test_data are lists of reviews, each review being a list of word indices (encoding a sequence of words). train_labels and test_labels are lists of 0s and 1s, where 0 stands for "negative" and 1 stands for "positive":

train_data[0]
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, ...]

train_labels[0]
1
Since we restricted ourselves to the top 10,000 most frequent words, no word index will exceed 10,000:

max([max(sequence) for sequence in train_data])
9999
For kicks, here's how you can quickly decode one of these reviews back to English words:

# word_index is a dictionary mapping words to an integer index
word_index = imdb.get_word_index()
# We reverse it, mapping integer indices to words
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
# We decode the review; note that our indices were offset by 3
# because 0, 1 and 2 are reserved indices for "padding", "start of sequence", and "unknown".
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])

decoded_review
"? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"



# You cannot feed lists of integers directly into a neural network: the lists first have to be
# turned into tensors. There are two ways to do that:
# 1. Pad the lists so that they all have the same length, turn them into an integer tensor of
#    shape (samples, word_indices), and then use as the first layer of the network a layer
#    capable of handling such integer tensors (the Embedding layer, covered in detail later
#    in the book).
# 2. One-hot-encode the lists to turn them into vectors of 0s and 1s. For instance, the
#    sequence [3, 5] becomes a 10,000-dimensional vector that is all zeros except for
#    indices 3 and 5, which are ones. The first layer of the network can then be a Dense
#    layer, which can handle floating-point vector data.
# Here we go with the second option.
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set specific indices of results[i] to 1s
    return results

# Our vectorized training data
x_train = vectorize_sequences(train_data)
# Our vectorized test data
x_test = vectorize_sequences(test_data)
Here's what our samples look like now:

x_train[0]
array([ 0.,  1.,  1., ...,  0.,  0.,  0.])
We should also vectorize our labels, which is straightforward:

# Our vectorized labels
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')
Now our data is ready to be fed into a neural network.
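As an aside, here is roughly what the first option mentioned above (padding the sequences and using an Embedding layer) could look like. This sketch is not used in the rest of the walkthrough, and maxlen=256 and the 8-dimensional embedding are arbitrary choices.

from keras.preprocessing.sequence import pad_sequences
from keras import models, layers

maxlen = 256
x_train_padded = pad_sequences(train_data, maxlen=maxlen)  # integer tensor of shape (samples, maxlen)

embed_model = models.Sequential()
embed_model.add(layers.Embedding(10000, 8, input_length=maxlen))  # maps each word index to an 8-dimensional vector
embed_model.add(layers.Flatten())
embed_model.add(layers.Dense(1, activation='sigmoid'))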


# Building our network
from keras import models
from keras import layers
​
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
Lastly, we need to pick a loss function and an optimizer. Since we are facing a binary classification problem and the output of our network is a probability (we end our network with a single-unit layer with a sigmoid activation), it is best to use the binary_crossentropy loss. It isn't the only viable choice: you could use, for instance, mean_squared_error. But crossentropy is usually the best choice when you are dealing with models that output probabilities. Crossentropy is a quantity from the field of information theory that measures the "distance" between probability distributions, or in our case, between the ground-truth distribution and our predictions.
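To make that "distance" more concrete, here is a small NumPy illustration of binary crossentropy on single predictions (the numbers are made up): a confident correct prediction gets a small loss, a confident wrong one gets a large loss.

import numpy as np

def binary_crossentropy_np(y_true, y_pred, eps=1e-7):
    # mean of -[y*log(p) + (1-y)*log(1-p)], with clipping to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(binary_crossentropy_np(np.array([1.]), np.array([0.9])))  # ~0.105: confident and correct
print(binary_crossentropy_np(np.array([1.]), np.array([0.1])))  # ~2.303: confident and wrong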

Here's the step where we configure our model with the rmsprop optimizer and the binary_crossentropy loss function. Note that we will also monitor accuracy during training.

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
We are passing our optimizer, loss function and metrics as strings, which is possible because rmsprop, binary_crossentropy and accuracy are packaged as part of Keras. Sometimes you may want to configure the parameters of your optimizer, or pass a custom loss function or metric function. The former can be done by passing an optimizer class instance as the optimizer argument:

from keras import optimizers
​
model.compile(optimizer=optimizers.RMSprop(lr=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])
The latter can be done by passing function objects as the `loss` or `metrics` arguments:
from keras import losses
from keras import metrics
​
model.compile(optimizer=optimizers.RMSprop(lr=0.001),
              loss=losses.binary_crossentropy,
              metrics=[metrics.binary_accuracy])
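You are not limited to the built-in functions: a metric can be any function that takes (y_true, y_pred) tensors and returns a tensor. As a sketch, here is a hand-written binary accuracy built on the Keras backend (my_binary_accuracy is a made-up name; it mirrors what the built-in metric computes):

from keras import backend as K

def my_binary_accuracy(y_true, y_pred):
    # fraction of predictions that fall on the correct side of 0.5
    return K.mean(K.equal(y_true, K.round(y_pred)), axis=-1)

model.compile(optimizer=optimizers.RMSprop(lr=0.001),
              loss=losses.binary_crossentropy,
              metrics=[my_binary_accuracy])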
## Validating our approach

In order to monitor during training the accuracy of the model on data that it has never seen before, we will create a "validation set" by setting apart 10,000 samples from the original training data:
​
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
​
y_val = y_train[:10000]
partial_y_train = y_train[10000:]
We will now train our model for 20 epochs (20 iterations over all samples in the x_train and y_train tensors), in mini-batches of 512 samples. At the same time we will monitor loss and accuracy on the 10,000 samples that we set apart. This is done by passing the validation data as the validation_data argument:

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))
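As a side note, if you don't need to keep an explicit hold-out slice around, fit() can also carve one off for you with the validation_split argument (it takes the last fraction of the provided data, before any shuffling). The walkthrough keeps the explicit split above; this is just an alternative sketch:

history_alt = model.fit(x_train, y_train,
                        epochs=20,
                        batch_size=512,
                        validation_split=0.2)  # hold out the last 20% of x_train/y_train for validation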
Note that the call to model.fit() returns a History object. This object has a member history, which is a dictionary containing data about everything that happened during training. Let's take a look at it:

history_dict = history.history
history_dict.keys()
dict_keys(['val_acc', 'acc', 'val_loss', 'loss'])
It contains 4 entries: one per metric that was being monitored, during training and during validation. Let's use Matplotlib to plot the training and validation loss side by side, as well as the training and validation accuracy:

import matplotlib.pyplot as plt
​
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
​
epochs = range(1, len(acc) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
​
plt.show()


[Figure: training and validation loss]

plt.clf()   # clear figure
acc_values = history_dict['acc']
val_acc_values = history_dict['val_acc']

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

# The training loss decreases with every epoch, and the training accuracy increases with
# every epoch. That is what you would expect from gradient-descent optimization: the
# quantity you are trying to minimize gets smaller with every iteration. But that isn't the
# case for the validation loss and accuracy: they seem to peak at the fourth epoch. This is
# an example of what we warned against earlier: a model that performs better on the
# training data isn't necessarily a model that will do better on data it has never seen
# before. In precise terms, what you are seeing is overfitting: after the second epoch you
# are over-optimizing on the training data, and you end up learning representations that
# are specific to the training data and don't generalize to data outside of the training set.

# In this case, to prevent overfitting, you could simply stop training after three epochs.

[Figure: training and validation accuracy]
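Rather than hard-coding the number of epochs, you could also let Keras stop training automatically once the validation loss stops improving, using the EarlyStopping callback. A brief sketch (the patience value is an arbitrary choice):

from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=2)  # stop after 2 epochs without improvement

model.fit(partial_x_train, partial_y_train,
          epochs=20,
          batch_size=512,
          validation_data=(x_val, y_val),
          callbacks=[early_stop])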

# Retrain a model from scratch
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=4, batch_size=512)
results = model.evaluate(x_test, y_test)


results
[0.29184698499679568, 0.88495999999999997]
Our fairly naive approach achieves an accuracy of 88%. With state-of-the-art approaches, one should be able to get close to 95%.
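The two numbers are the test loss and the test accuracy, in the order given by model.metrics_names:

print(model.metrics_names)  # ['loss', 'acc'] (newer Keras versions may report 'accuracy')
test_loss, test_acc = results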

Using a trained network to generate predictions on new data
After having trained a network, you will want to use it in a practical setting. You can generate the likelihood of reviews being positive by using the predict method:

# Use the trained network to generate predictions on new data
model.predict(x_test)
array([[ 0.91966152],
       [ 0.86563045],
       [ 0.99936908],
       ..., 
       [ 0.45731062],
       [ 0.0038014 ],
       [ 0.79525089]], dtype=float32)
As you can see, the network is very confident for some samples (0.99 or more, or 0.01 or less) but less confident for others (0.6, 0.4).
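If you want hard class labels rather than probabilities, you can simply threshold the predictions at 0.5, for example:

import numpy as np

pred_probs = model.predict(x_test)
pred_labels = (pred_probs > 0.5).astype('int32')  # 1 = positive, 0 = negative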