NLP Series (1): Data Exploration - IMDB

丶谢尔

Category: nlp | Tags: IMDB

Article link: https://blog.csdn.net/weixin_40593658/article/details/90112955

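Dataset Exploration

I. Datasets

Datasets: one Chinese dataset and one English dataset.

1. Chinese dataset: THUCNews

THUCNews subset: https://pan.baidu.com/s/1hugrfRu (password: qfud)

2. English dataset: the IMDB dataset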

IMDB Sentiment Analysis

II. Data Exploration

1. Exploring the IMDB dataset

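This section follows the official TensorFlow tutorial "Text classification with movie reviews | TensorFlow" and 科赛 (Kesci.com).
First, a brief introduction to IMDB:
The IMDB dataset contains 50,000 highly polarized reviews from the Internet, split into 25,000 reviews for training and 25,000 for testing. Both the training set and the test set contain 50% positive and 50% negative reviews. The dataset has already been preprocessed: each review (a sequence of words) has been converted to a sequence of integers, where each integer stands for a specific word in a dictionary.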

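1-- Loading the dataset

import keras
import tensorflow as tf
from keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
# num_words=10000 keeps only the 10,000 most frequent words in the training data. Rarer words are not dropped from the reviews; they are replaced by the out-of-vocabulary index (2 by default). This keeps the vocabulary, and therefore the vectorized data, at a manageable size.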
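To make the effect of `num_words` concrete, here is a minimal sketch in plain Python. The helper name `cap_vocabulary` and the toy review are made up for illustration; the sketch only mimics the replacement rule and is not the Keras implementation itself.

```python
def cap_vocabulary(sequence, num_words, oov_index=2):
    """Replace every word index >= num_words with the out-of-vocabulary index."""
    return [i if i < num_words else oov_index for i in sequence]


# Hypothetical encoded review containing two rare words (indices 12000 and 25000)
toy_review = [1, 14, 22, 12000, 43, 25000]
capped = cap_vocabulary(toy_review, num_words=10000)
print(capped)  # [1, 14, 22, 2, 43, 2]
```

This is why the index 2 shows up repeatedly in the encoded reviews: each occurrence marks a word that fell outside the 10,000 most frequent ones.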

Check how many entries and labels the training set contains:

print("Training entries: {}, labels: {}".format(len(train_data), len(train_labels)))

Training entries: 25000, labels: 25000

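The review text has been converted to integers, where each integer represents a specific word in the dictionary. The first review looks like this: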

print(train_data[0])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]

Look at the lengths of the first two reviews:

len(train_data[0]), len(train_data[1])

(218, 189)

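This shows that reviews contain different numbers of words, but the inputs to a neural network must all have the same length, so the data needs further processing.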
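Before choosing a fixed input length, it can help to look at how review lengths are distributed. The sketch below uses a hypothetical helper, `length_summary`, on toy data; on the real dataset it would be called as `length_summary(train_data)`.

```python
import numpy as np


def length_summary(sequences):
    """Return the minimum, maximum and mean length of a list of sequences."""
    lengths = np.array([len(s) for s in sequences])
    return {
        "min": int(lengths.min()),
        "max": int(lengths.max()),
        "mean": float(lengths.mean()),
    }


# Toy stand-in for train_data: three encoded reviews of different lengths
toy_reviews = [[1, 14, 22], [1, 5], [1, 7, 8, 9, 10]]
summary = length_summary(toy_reviews)
print(summary)
```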
As the first review shows, the data consists entirely of integers, which is not human-readable. We can of course convert it back into words:

#A dictionary mapping words to an integer index
word_index = imdb.get_word_index()
#The first indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
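The `+3` offset is the part that is easy to miss: indices 0 to 3 are reserved for special tokens, so every real word is shifted up by three. A minimal sketch with a made-up three-word vocabulary (`raw_index` is hypothetical; the real mapping comes from `imdb.get_word_index()`):

```python
# Hypothetical vocabulary in the same word -> rank format as get_word_index()
raw_index = {"the": 1, "film": 2, "great": 3}

# Shift every word by 3 to make room for the reserved tokens
word_index = {word: rank + 3 for word, rank in raw_index.items()}
word_index.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3})
reverse_word_index = {idx: word for word, idx in word_index.items()}


def decode_review(encoded):
    """Map integer indices back to words; unknown indices become '?'."""
    return " ".join(reverse_word_index.get(i, "?") for i in encoded)


decoded = decode_review([1, 4, 5, 2, 6, 0])
print(decoded)  # <START> the film <UNK> great <PAD>
```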

With this in place, print the first review as text:

decode_review(train_data[0])

"<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"

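2-- Preparing the data

Now that we have a feel for the data, the next step is to process it into a form the neural network can take as input.
Lists of integers cannot be fed into a neural network directly; they first have to be converted to tensors. There are two ways to do this:

2.1 – One-hot encoding the lists

For example, the sequence [3, 5] is converted to a 10,000-dimensional vector that is 1 at indices 3 and 5 and 0 everywhere else. The first layer of the network can then be a Dense layer, which can handle floating-point vector data.
This approach is very memory-intensive: it requires a matrix of size num_words * num_reviews.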

import numpy as np
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set the columns listed in `sequence` to 1 in row i (rows and columns are 0-indexed)
    return results

#Our vectorized training data
x_train = vectorize_sequences(train_data)
#Our vectorized test data
x_test = vectorize_sequences(test_data)

Look at the first review after processing:

x_train[0]

array([0., 1., 1., ..., 0., 0., 0.])
Every review is now a vector of 0s and 1s.
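As a small worked example of the encoding and of its memory cost, the sketch below applies the same function to toy data with an 8-word vocabulary and then estimates the size of the full matrix. The helper `dense_matrix_bytes` is made up for this illustration, and the estimate assumes a dense NumPy array (float64 is the default dtype of `np.zeros`).

```python
import numpy as np


def vectorize_sequences(sequences, dimension=10000):
    """Turn each list of word indices into a 0/1 vector of length `dimension`."""
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0
    return results


# Two toy reviews over an 8-word vocabulary
toy = vectorize_sequences([[3, 5], [5, 5, 1]], dimension=8)
print(toy)


def dense_matrix_bytes(num_reviews, num_words, dtype=np.float64):
    """Memory needed for a dense num_reviews x num_words matrix."""
    return num_reviews * num_words * np.dtype(dtype).itemsize


print(dense_matrix_bytes(25000, 10000) / 1e9)              # 2.0 (GB, float64)
print(dense_matrix_bytes(25000, 10000, np.float32) / 1e9)  # 1.0 (GB, float32)
```

Note that the second toy review contains index 5 twice but still gets a single 1 in that column: this encoding records which words occur, not how often or in what order.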

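2.2 – Padding

Pad the lists so that they all have the same length, then create an integer tensor of shape max_length * num_reviews.
Here the pad_sequences function is used to standardize the lengths.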

train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        maxlen=256)

test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                       maxlen=256)

After padding, check the lengths again:

len(train_data[0]), len(train_data[1])

(256, 256)

print(train_data[0])

[ 1 14 22 16 43 530 973 1622 1385 65 458 4468 66 3941
4 173 36 256 5 25 100 43 838 112 50 670 2 9
35 480 284 5 150 4 172 112 167 2 336 385 39 4
172 4536 1111 17 546 38 13 447 4 192 50 16 6 147
2025 19 14 22 4 1920 4613 469 4 22 71 87 12 16
43 530 38 76 15 13 1247 4 22 17 515 17 12 16
626 18 2 5 62 386 12 8 316 8 106 5 4 2223
5244 16 480 66 3785 33 4 130 12 16 38 619 5 25
124 51 36 135 48 25 1415 33 6 22 12 215 28 77
52 5 14 407 16 82 2 8 4 107 117 5952 15 256
4 2 7 3766 5 723 36 71 43 530 476 26 400 317
46 7 4 2 1029 13 104 88 4 381 15 297 98 32
2071 56 26 141 6 194 7486 18 4 226 22 21 134 476
26 480 5 144 30 5535 18 51 36 28 224 92 25 104
4 226 65 16 38 1334 88 12 16 283 5 16 4472 113
103 32 15 16 5345 19 178 32 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0]
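What the padding step does can be sketched in a few lines of NumPy. The function `pad_post` below is a hypothetical re-implementation for illustration only, not the Keras code. It pads at the end (`padding='post'`) and, for reviews longer than `maxlen`, keeps the last `maxlen` tokens, which matches the Keras default `truncating='pre'`.

```python
import numpy as np


def pad_post(sequences, maxlen, value=0):
    """Pad each sequence at the end with `value`; cut long ones from the front."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]          # keep the last maxlen tokens
        out[row, :len(kept)] = kept   # everything after them stays `value`
    return out


# One short and one long toy review, padded/truncated to length 4
padded = pad_post([[1, 14, 22], [1, 5, 6, 7, 8, 9]], maxlen=4)
print(padded)
# [[ 1 14 22  0]
#  [ 6  7  8  9]]
```

With the settings used above, reviews longer than 256 tokens therefore lose their beginning, including the `<START>` token.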

The data is now ready, so we can start building the network.

3-- Building the network

# input shape is the vocabulary count used for the movie reviews (10,000 words)
vocab_size = 10000

model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))

model.summary()
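# The layers are stacked sequentially to build the classifier:
# The first layer is an Embedding layer. It takes the integer-encoded vocabulary and looks up the embedding vector for each word index.
# These vectors are learned as the model trains. They add a dimension to the output array; the resulting dimensions are (batch, sequence, embedding).
# Next, a GlobalAveragePooling1D layer returns a fixed-length output vector for each example by averaging over the sequence dimension.
# This lets the model handle inputs of variable length in the simplest possible way.
# The fixed-length output vector is passed to a fully connected (Dense) layer with 16 hidden units.
# The last layer is densely connected to a single output node. With the sigmoid activation, the result is a float between 0 and 1 representing a probability or confidence level.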
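The shapes described in these comments can be checked with a small NumPy sketch. It is only an illustration of the lookup-then-average idea with a random, untrained embedding table; the variable names are made up, and it does not use Keras.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10000, 16

# Embedding: one 16-dimensional row per vocabulary index (random here, learned in Keras)
embedding_table = rng.normal(size=(vocab_size, embed_dim))

batch = np.array([[1, 14, 22, 0], [1, 5, 6, 7]])  # (batch, sequence) integer input
embedded = embedding_table[batch]                  # (batch, sequence, embedding)
pooled = embedded.mean(axis=1)                     # (batch, embedding)
print(embedded.shape, pooled.shape)                # (2, 4, 16) (2, 16)

# Trainable parameters of the three weighted layers
embedding_params = vocab_size * embed_dim          # 160000
hidden_params = embed_dim * 16 + 16                # 272 (weights + biases)
output_params = 16 * 1 + 1                         # 17
total_params = embedding_params + hidden_params + output_params
print(total_params)                                # 160289
```

Almost all parameters sit in the embedding table. Note also that a plain average like this one includes the padded positions, so padding tokens contribute their own embedding vector to the mean.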

# 'adam' selects Keras' built-in Adam optimizer; tf.train.AdamOptimizer only exists in TensorFlow 1.x
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
## Split the data: hold out the first 10,000 training reviews as a validation set
x_val = train_data[:10000]
partial_x_train = train_data[10000:]

y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]

## Train the model
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=1)

## Evaluate the model
results = model.evaluate(test_data, test_labels)

print(results)

25000/25000 [==============================] - 1s 53us/step
[0.34126003795623777, 0.869]
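The two numbers returned by `model.evaluate` are the loss (binary cross-entropy, as set in `model.compile`) and the accuracy on the test set. As a rough illustration of what the loss measures, here is a NumPy sketch of binary cross-entropy on made-up predictions; the function and the clipping constant `eps` are assumptions for this example, not the Keras internals.

```python
import numpy as np


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions away from 0 and 1 so the logarithms stay finite
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    losses = y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    return float(-np.mean(losses))


confident = binary_crossentropy([1, 0], [0.9, 0.2])   # mostly correct predictions
uncertain = binary_crossentropy([1, 0], [0.5, 0.5])   # equals ln(2), about 0.693
print(confident, uncertain)
```

A model that always outputs 0.5 scores ln(2), roughly 0.693, so the test loss of about 0.34 reported above is well below that uninformed baseline.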

4-- Plotting and analysis

### Plot accuracy and loss over time
# model.fit() returns a History object that contains a dictionary with everything that happened during training:
history_dict = history.history
history_dict.keys()

import matplotlib.pyplot as plt

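# The accuracy key is 'acc' in older Keras releases and 'accuracy' in newer ones
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
acc = history.history[acc_key]
val_acc = history.history['val_' + acc_key]
loss = history.history['loss']
val_loss = history.history['val_loss']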

epochs = range(1, len(acc) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

[Figure: training and validation loss per epoch]

plt.clf()   # clear the previous figure before plotting accuracy

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

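[Figure: training and validation accuracy per epoch]
Notice that the training loss decreases with each epoch and the training accuracy increases with each epoch. This is expected when optimizing with gradient descent, which should reduce the objective as far as it can on every iteration.

The validation loss and accuracy behave differently: they appear to stop improving after about 20 epochs. This is overfitting: the model performs better on the training data than on data it has never seen. Beyond this point the model over-optimizes and learns representations specific to the training data that do not generalize to the test data.

In this particular case, we can stop training after about 20 epochs to prevent overfitting.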
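Instead of reading the stopping point off the plot, it can be found from the recorded validation losses with a simple patience rule. The function below is a hypothetical sketch of that rule, and the loss values are made up for illustration rather than taken from the training run above.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Made-up validation losses that bottom out at epoch 6
toy_val_loss = [0.69, 0.55, 0.42, 0.35, 0.31, 0.30, 0.31, 0.32, 0.33, 0.35]
stop, best = early_stop_epoch(toy_val_loss)
print(stop, best)  # 9 6
```

On the real run, the input would be `history.history['val_loss']`. Keras also offers a built-in `keras.callbacks.EarlyStopping` callback (with `monitor` and `patience` arguments) that can be passed to `model.fit` through its `callbacks` argument to stop training automatically.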
