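These are notes taken while working through the official TensorFlow 2.0 tutorial; for the original tutorial, see Text Sentiment Classification.
Preparation
1. Install TensorFlow and import the required libraries
Skip this step if TensorFlow is already installed.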
!pip install tensorflow tensorflow-datasets
import tensorflow_datasets as tfds
import tensorflow as tf
import matplotlib.pyplot as plt
2. Prepare the dataset
2.1 Load the dataset
dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True,
                          as_supervised=True)
train_data, test_data = dataset['train'], dataset['test']
About the dataset
This is the IMDB movie review dataset.
tfds.core.DatasetInfo(
name='imdb_reviews',
version=1.0.0,
description='Large Movie Review Dataset.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.',
homepage='http://ai.stanford.edu/~amaas/data/sentiment/',
features=FeaturesDict({
'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
'text': Text(shape=(None,), dtype=tf.int64, encoder=),
}),
total_num_examples=100000,
splits={
'test': 25000,
'train': 25000,
'unsupervised': 50000,
},
supervised_keys=('text', 'label'),
citation="""@InProceedings{maas-EtAl:2011:ACL-HLT2011,
author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
title = {Learning Word Vectors for Sentiment Analysis},
booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
month = {June},
year = {2011},
address = {Portland, Oregon, USA},
publisher = {Association for Computational Linguistics},
pages = {142–150},
url = {http://www.aclweb.org/anthology/P11-1015}
}""",
redistribution_info=,
)
The dataset's info object already ships with an encoder, so we can use it directly.
# The dataset info includes the encoder
encoder = info.features['text'].encoder
Testing the encoder
sample_string = 'Hello world.'
encoded_string = encoder.encode(sample_string)
print('Encoded string is {}'.format(encoded_string))
original_string = encoder.decode(encoded_string)
print('The original string: "{}"'.format(original_string))
assert original_string == sample_string
for index in encoded_string:
    print('{} ----> {}'.format(index, encoder.decode([index])))
The output is:
Encoded string is [4025, 222, 562, 7975]
The original string: "Hello world."
4025 ----> Hell
222 ----> o
562 ----> world
7975 ----> .
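The output above shows a word being broken into pieces ("Hell" + "o"). One rough way to picture subword tokenization is greedy longest-match against a fixed vocabulary. The sketch below is only a toy illustration and is not the algorithm used by the tfds encoder; the function name `greedy_subword_split` and the vocabulary are made up for this example.

```python
def greedy_subword_split(text, vocab):
    """Split text into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at i first, then shrink it.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matches: fall back to a single character.
            pieces.append(text[i])
            i += 1
    return pieces


# Hypothetical vocabulary, chosen only to mirror the split shown above.
toy_vocab = {"Hell", "o ", "world", "."}
pieces = greedy_subword_split("Hello world.", toy_vocab)
print(pieces)
```

Joining the pieces reproduces the input string, which is the same round-trip property the `assert` in the encoder test checks.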
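2.2 Dataset preprocessing
Shuffle the training data so the model does not see the reviews in a fixed order, which helps guard against overfitting, and use padded_batch to pad the variable-length sequences in each batch to a common length for training. Note that in recent TensorFlow 2 releases (2.2 and later) padded_batch no longer requires the padded_shapes argument; in 2.0 and 2.1 it still has to be passed explicitly.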
BUFFER_SIZE = 10000
BATCH_SIZE = 64
train_dataset = (train_data
                 .shuffle(BUFFER_SIZE)
                 .padded_batch(BATCH_SIZE))
test_dataset = (test_data
                .padded_batch(BATCH_SIZE))
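To make the effect of padded batching concrete, here is a minimal pure-Python sketch of the idea: each batch is padded with zeros up to the length of its longest sequence. It is an illustration under simplifying assumptions (plain lists, a hypothetical helper named `padded_batches`), not how tf.data implements it.

```python
def padded_batches(sequences, batch_size, pad_value=0):
    """Yield batches in which every sequence is padded to the batch maximum."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        max_len = max(len(seq) for seq in batch)
        yield [seq + [pad_value] * (max_len - len(seq)) for seq in batch]


# Made-up token ID sequences of different lengths.
toy_sequences = [[5, 8], [3], [7, 1, 9, 4], [2, 6, 6]]
batches = list(padded_batches(toy_sequences, batch_size=2))
for batch in batches:
    print(batch)
```

Note that the padded length is decided per batch, so different batches can have different widths.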
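3 Build the model
The model here is a tf.keras.Sequential, with the structure shown in the figure below.
Embedding layer: produces the token vectors that serve as input to the network. The vectors here are 64-dimensional; in practice they are often larger.
LSTM layer: long short-term memory units. A bidirectional wrapper is used here, meaning the sequence is read by two LSTMs, one from the first token to the last and one from the last token back to the first, and their outputs are concatenated.
Dense layer 1: a fully connected layer with 64 units.
Dense layer 2: a fully connected layer that acts as the output layer and produces a single score.
For more background, see the recurrent neural network study notes.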
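The bidirectional idea can be sketched without any deep learning library. The toy below replaces the LSTM with a scalar recurrence (an assumption made purely for brevity) and shows the two passes and the concatenation of their final states; `run_rnn` and `bidirectional` are hypothetical names.

```python
def run_rnn(inputs, decay=0.5):
    """A toy recurrence h_t = decay * h_{t-1} + x_t; returns the final state."""
    state = 0.0
    for x in inputs:
        state = decay * state + x
    return state


def bidirectional(inputs):
    """Concatenate the final states of a forward and a backward pass."""
    forward = run_rnn(inputs)
    backward = run_rnn(list(reversed(inputs)))
    return [forward, backward]


features = bidirectional([1.0, 2.0, 4.0])
print(features)
```

The forward state is dominated by the last inputs and the backward state by the first ones, which is why combining both gives the next layer a view of the whole sequence. With the Keras default merge mode (concatenation), a bidirectional LSTM with 64 units therefore outputs 128 features.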
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
Set the loss function, optimizer, and metric:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])
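Because the last Dense layer has no activation and the loss is created with from_logits=True, the model's output is a raw logit rather than a probability. The following sketch shows, for a single example and in plain Python, how a logit maps to a probability and how binary cross-entropy can be computed from it in a numerically stable way. The helper names are hypothetical, and this is an illustration of the math rather than the Keras implementation.

```python
import math


def sigmoid(z):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_logit(label, logit):
    """Binary cross-entropy for one example, written in a stable form."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# A logit of 0 corresponds to a probability of 0.5.
print(sigmoid(0.0))
# The loss is small when the sign of the logit agrees with the label
# and large when it does not.
print(bce_from_logit(1, 3.0), bce_from_logit(0, 3.0))
```

One consequence is that the decision threshold on the raw model output is 0, not 0.5.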
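4 Train the model
With tf.keras you can call fit directly to train. Training runs for 10 epochs, with the test set used as validation data. Setting validation_steps=30 means that each validation pass, run at the end of every epoch, uses only 30 batches of the test set rather than all of it.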
history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset,
                    validation_steps=30)
Epoch 1/10
391/391 [] - 44s 112ms/step - loss: 0.6572 - accuracy: 0.5434 - val_loss: 0.4859 - val_accuracy: 0.7865
Epoch 2/10
391/391 [] - 43s 110ms/step - loss: 0.3448 - accuracy: 0.8572 - val_loss: 0.3440 - val_accuracy: 0.8458
Epoch 3/10
391/391 [] - 43s 110ms/step - loss: 0.2618 - accuracy: 0.8952 - val_loss: 0.3378 - val_accuracy: 0.8458
Epoch 4/10
391/391 [] - 43s 111ms/step - loss: 0.2110 - accuracy: 0.9204 - val_loss: 0.3278 - val_accuracy: 0.8594
Epoch 5/10
391/391 [] - 43s 110ms/step - loss: 0.1867 - accuracy: 0.9322 - val_loss: 0.3563 - val_accuracy: 0.8510
Epoch 6/10
391/391 [] - 43s 110ms/step - loss: 0.1624 - accuracy: 0.9432 - val_loss: 0.3610 - val_accuracy: 0.8615
Epoch 7/10
391/391 [] - 43s 110ms/step - loss: 0.2073 - accuracy: 0.9308 - val_loss: 0.3900 - val_accuracy: 0.8578
Epoch 8/10
391/391 [] - 43s 109ms/step - loss: 0.1370 - accuracy: 0.9542 - val_loss: 0.4124 - val_accuracy: 0.8641
Epoch 9/10
391/391 [] - 44s 112ms/step - loss: 0.1222 - accuracy: 0.9597 - val_loss: 0.4238 - val_accuracy: 0.8641
Epoch 10/10
391/391 [] - 44s 113ms/step - loss: 0.1205 - accuracy: 0.9600 - val_loss: 0.4685 - val_accuracy: 0.8568
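The step counts in the log follow from simple arithmetic on the split size and batch size used above. A small sketch, using only values that appear in this post:

```python
import math

TRAIN_EXAMPLES = 25000   # size of the 'train' split reported by DatasetInfo
BATCH_SIZE = 64
VALIDATION_STEPS = 30

# The last batch is smaller than BATCH_SIZE, hence the ceiling.
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
# Every validation batch is full, since 30 batches is far fewer than the total.
validation_examples = VALIDATION_STEPS * BATCH_SIZE

print(steps_per_epoch)
print(validation_examples)
```

This gives 391 steps per epoch, matching the "391/391" in the log, and 1920 validation reviews per pass. The val_accuracy column is therefore an estimate from under 8% of the test set, which is why it can differ from the full test accuracy.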
Evaluating on the full test set:
test_loss, test_acc = model.evaluate(test_dataset)
print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))
Test Loss: 0.44925960898399353
Test Accuracy: 0.8586400151252747
Accuracy and loss over the training epochs:
4 Improving the model
1. Use a two-layer RNN, still with LSTM as the recurrent unit.
2. Add a Dropout layer.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1)
])
Everything else is the same as above.
Test results
Test Loss: 0.5735374093055725
Test Accuracy: 0.829039990901947
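Judging by accuracy, the deeper model did not improve on the first one. A possible explanation is that the dataset is too small for this model, so it overfits and generalizes worse.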
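To put model size next to the 25,000 training reviews, the trainable parameter counts of both models can be worked out by hand. The sketch below uses the standard formulas for Embedding, LSTM (four gates, with bias), and Dense layers. The vocabulary size of 8185 is an assumption, since the value is not shown in this post; substitute encoder.vocab_size for the real number.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    return input_dim * units + units


VOCAB_SIZE = 8185  # assumed; use encoder.vocab_size for the real value

single = (embedding_params(VOCAB_SIZE, 64)
          + 2 * lstm_params(64, 64)      # bidirectional: two LSTMs
          + dense_params(128, 64)        # input is the 128-dim concatenation
          + dense_params(64, 1))

stacked = (embedding_params(VOCAB_SIZE, 64)
           + 2 * lstm_params(64, 64)
           + 2 * lstm_params(128, 32)    # second bidirectional layer
           + dense_params(64, 64)
           + dense_params(64, 1))        # Dropout adds no parameters

print(single, stacked)
```

Under that assumption both models have roughly 600,000 parameters, most of them in the embedding table, so the stacked variant adds depth without adding much capacity.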
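5 Making predictions
Now we feed in an arbitrary sentence and let the model classify its sentiment. Because the final Dense layer has no activation and the loss uses from_logits=True, the model outputs a raw logit: a score greater than or equal to 0 (a probability of at least 0.5 after a sigmoid) indicates a positive review, and a score below 0 indicates a negative one.
Because input sentences can differ in length, we pad the encoded sentence with zeros.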
def pad_to_size(vec, size):
    zeros = [0] * (size - len(vec))
    vec.extend(zeros)
    return vec

def sample_predict(sample_pred_text, pad):
    encoded_sample_pred_text = encoder.encode(sample_pred_text)
    if pad:
        encoded_sample_pred_text = pad_to_size(encoded_sample_pred_text, 64)
    encoded_sample_pred_text = tf.cast(encoded_sample_pred_text, tf.float32)
    predictions = model.predict(tf.expand_dims(encoded_sample_pred_text, 0))
    return (predictions)
sample_pred_text = ('The movie was cool. The animation and the graphics '
                    'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=False)
print(predictions)
predictions = sample_predict(sample_pred_text, pad=True)
print(predictions)
[[0.10079887]]
[[0.06816088]]
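Both scores are above 0, so the review is classified as positive with or without padding. The padded input gives a slightly lower score (0.068 versus 0.101), which shows that zero padding can shift the prediction, since this model does not mask the padded positions.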
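Since the model returns logits, the printed values are easier to read after converting them to probabilities with a sigmoid. A minimal sketch using the two numbers printed above (the `sigmoid` helper is defined here for illustration):

```python
import math


def sigmoid(z):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Raw model outputs printed above (without and with padding).
logits = {"no padding": 0.10079887, "padding": 0.06816088}

for name, logit in logits.items():
    probability = sigmoid(logit)
    label = "positive" if logit >= 0.0 else "negative"
    print(f"{name}: p(positive) = {probability:.3f} -> {label}")
```

Both probabilities come out only slightly above 0.5, so the model leans positive on this review but with low confidence.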