IMDB Data
from tensorflow import keras

imdb = keras.datasets.imdb
(train_x, train_y), (test_x, test_y) = keras.datasets.imdb.load_data(num_words=10000)
# The argument num_words=10000 keeps only the 10,000 most frequently occurring words;
# rarer words are discarded to keep the data size manageable.
print(type(train_x))
The review text has already been converted into arrays of integers, where each integer represents a specific word in a dictionary. Here is what the first review looks like after conversion:
print(train_x[0])
Output:
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
Create dictionaries mapping between ids and words
word_index = imdb.get_word_index()
print(type(word_index))
# keys are words, values are ids
word2id = {k: (v + 3) for k, v in word_index.items()}
word2id['<PAD>'] = 0
word2id['<START>'] = 1
word2id['<UNK>'] = 2
word2id['<UNUSED>'] = 3
# swap the keys and values (id -> word)
id2word = {v:k for k, v in word2id.items()}
def get_words(sent_ids):
    # look up the word for each id; unknown ids become '?'
    return ' '.join([id2word.get(i, '?') for i in sent_ids])
sent = get_words(train_x[0])
print(sent)
Output:
<class 'dict'>
<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all
Prepare the Data
Before being fed into the neural network, the reviews (arrays of integers) must be converted to tensors. This conversion can be done in either of two ways:
Method 1: One-hot-encode the arrays, turning them into vectors of 0s and 1s. For example, the sequence [3, 5] would become a 10,000-dimensional vector that is all zeros except at indices 3 and 5, which are ones. Then use a fully connected (Dense) layer, capable of handling floating-point vector data, as the first layer of the network. This approach is memory intensive, however, requiring a matrix of size num_words * num_reviews.
Method 2: Pad the arrays so they all have the same length, then create an integer tensor of shape max_length * num_reviews. An embedding layer that handles this shape can then be the first layer of the network.
This tutorial uses the second approach; a sketch of the first is shown below for comparison.
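For reference, a minimal sketch of the multi-hot encoding in Method 1 might look like the following. The helper name vectorize_sequences is hypothetical and not part of this tutorial:
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # One row per review; set column i to 1 for every word id i in that review.
    results = np.zeros((len(sequences), dimension))
    for row, sequence in enumerate(sequences):
        results[row, sequence] = 1.0
    return results

# e.g. vectorize_sequences([[3, 5]]) produces a single 10,000-dim vector
# that is zero everywhere except at indices 3 and 5.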
Because the movie reviews must all have the same length, we standardize their lengths with the pad_sequences function.
# pad at the end of each sentence ('post') with the id corresponding to '<PAD>'
train_x = keras.preprocessing.sequence.pad_sequences(
    train_x, value=word2id['<PAD>'],
    padding='post', maxlen=256
)
test_x = keras.preprocessing.sequence.pad_sequences(
    test_x, value=word2id['<PAD>'],
    padding='post', maxlen=256
)
print(train_x[0])
print('len: ',len(train_x[0]), len(train_x[1]))
Build the Model
import tensorflow.keras.layers as layers
vocab_size = 10000  # why at least 10,000? -> we kept 10,000 words as the vocabulary, so the Embedding input size must be at least the maximum integer index + 1
model = keras.Sequential()
model.add(layers.Embedding(vocab_size, 16))  # Embedding can only be the first layer of a model; output word vectors have dimension 16
model.add(layers.GlobalAveragePooling1D())  # global average pooling over the sequence (time) dimension
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.summary()
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
In this model, the following four layers are stacked in order to build the classifier:
- The first layer is an Embedding layer. It takes the integer-encoded vocabulary and looks up the embedding vector for each word index. These vectors are learned as the model trains. They add a dimension to the output array, so the resulting shape is (batch, sequence, embedding).
- Next, a GlobalAveragePooling1D layer returns a fixed-length output vector for each review by averaging over the sequence dimension. This lets the model handle variable-length input in the simplest way possible. (A quick shape check follows this list.)
- That fixed-length output vector is fed through a fully connected (Dense) layer with 16 hidden units.
- The last layer is densely connected to a single output node. With the sigmoid activation function, its output is a floating-point value between 0 and 1, representing a probability or confidence level.
- If the model had more hidden units (a higher-dimensional representation space) and/or more layers, the network could learn more complex representations. However, this makes the network more computationally expensive and can lead to learning unwanted patterns: patterns that improve performance on the training data but not on the test data. This is called overfitting.
- On the loss function and optimizer: binary_crossentropy is better suited to dealing with probabilities, while mean_squared_error works better for regression problems.
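As a quick sanity check on these shapes, you can push a couple of padded reviews through the first two layers. This snippet is illustrative and not part of the original tutorial; it assumes train_x has already been padded to length 256 as above:
sample = train_x[:2]                  # shape: (2, 256)
embedded = model.layers[0](sample)    # Embedding output: (2, 256, 16)
pooled = model.layers[1](embedded)    # GlobalAveragePooling1D output: (2, 16)
print(embedded.shape, pooled.shape)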
Train the Model
val_size=10000
x_val = train_x[:val_size]
x_train = train_x[val_size:]
print(train_x.shape)
y_val = train_y[:val_size]
y_train = train_y[val_size:]
history = model.fit(x_train, y_train,
                    epochs=40, batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=1)
result = model.evaluate(test_x, test_y)
print(result)
- During training, we want to check the model's accuracy on data it has never seen before, so we create a validation set by splitting 10,000 reviews off the original training data. (Why not use the test set now? Our goal is to develop and tune the model using only the training data, then use the test data exactly once to evaluate the model's accuracy.)
- This tutorial trains the model with mini-batch gradient descent, with 512 samples (reviews) per mini-batch, for 40 epochs. That means 40 passes over every sample in the x_train and y_train tensors (see the quick arithmetic below). During training, the model's loss and accuracy on the 10,000-sample validation set are also recorded.
- Finally, the test set is used to check how the model performs. Evaluation returns two values: the loss (our error, lower is better) and the accuracy.
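For a rough sense of scale, assuming the 10,000-review validation split above leaves 15,000 of the 25,000 IMDB training reviews for training:
import math
steps_per_epoch = math.ceil(15000 / 512)
print(steps_per_epoch)  # about 30 mini-batch updates per epoch, repeated for 40 epochs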
Plotting
Plot how the accuracy and loss change over time.
The model.fit() function returns a History object, which contains a dictionary recording everything that happened during training.
history_dict = history.history
history_dict.keys()
Output:
dict_keys(['acc', 'val_loss', 'loss', 'val_acc'])
There are four entries in the dictionary, one per metric monitored during training and validation. We can use them to plot the training and validation loss, and the training and validation accuracy, side by side for comparison.
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
plt.clf() # clear figure
acc_values = history_dict['acc']
val_acc_values = history_dict['val_acc']
plt.plot(epochs, acc_values, 'bo', label='Training acc')
plt.plot(epochs, val_acc_values, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
In the two figures above, the dots show the training loss and accuracy, while the solid lines show the validation loss and accuracy.
The training loss decreases and the training accuracy increases as the epochs go by. This is expected when optimizing with gradient descent: the quantity being minimized should improve on every iteration.
The validation loss and accuracy, however, appear to peak after about twenty epochs, which is not what we want. This is an example of overfitting: the model performs better on the training data than on data it has never seen before. Beyond this point the model over-optimizes on the training set and learns representations that do not generalize to the test set.
For this particular case, we could prevent overfitting by simply stopping training after about twenty epochs. In a later tutorial you will see how to do this automatically with a callback.
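As a preview, here is a hedged sketch of that callback-based approach using Keras's built-in EarlyStopping; the exact argument values are illustrative, not this tutorial's own settings:
early_stop = keras.callbacks.EarlyStopping(
    monitor='val_loss',         # watch the validation loss
    patience=3,                 # stop after 3 epochs with no improvement
    restore_best_weights=True   # roll back to the best epoch's weights
)
history = model.fit(x_train, y_train,
                    epochs=40, batch_size=512,
                    validation_data=(x_val, y_val),
                    callbacks=[early_stop],
                    verbose=1)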