TensorFlow Notes: Text Sentiment Classification

This post is a set of notes from working through the official TensorFlow 2.0 tutorial; for the original, see the tutorial on text sentiment classification.

Preparation

1. Install TensorFlow and import the required libraries

If TensorFlow is already installed, you can skip this step. Note that the tensorflow_datasets package used below needs to be installed as well.
!pip install tensorflow tensorflow_datasets

import tensorflow_datasets as tfds
import tensorflow as tf
import matplotlib.pyplot as plt
2. Prepare the dataset
2.1 Load the dataset
dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True,
                          as_supervised=True)
train_data, test_data = dataset['train'], dataset['test']
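
To get a feel for the data, we can peek at a single example; with as_supervised=True each element is a (text, label) tuple, and the text is already encoded as int64 subword ids. A minimal sketch:

for example, label in train_data.take(1):
  print('Encoded text (first 10 ids): {}'.format(example.numpy()[:10]))
  print('Label: {}'.format(label.numpy()))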

Dataset description
This is the IMDB movie review dataset. Printing info shows:

tfds.core.DatasetInfo(
    name='imdb_reviews',
    version=1.0.0,
    description='Large Movie Review Dataset.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.',
    homepage='http://ai.stanford.edu/~amaas/data/sentiment/',
    features=FeaturesDict({
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
        'text': Text(shape=(None,), dtype=tf.int64, encoder=<SubwordTextEncoder vocab_size=8185>),
    }),
    total_num_examples=100000,
    splits={
        'test': 25000,
        'train': 25000,
        'unsupervised': 50000,
    },
    supervised_keys=('text', 'label'),
    citation="""@InProceedings{maas-EtAl:2011:ACL-HLT2011,
      author    = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
      title     = {Learning Word Vectors for Sentiment Analysis},
      booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
      month     = {June},
      year      = {2011},
      address   = {Portland, Oregon, USA},
      publisher = {Association for Computational Linguistics},
      pages     = {142--150},
      url       = {http://www.aclweb.org/anthology/P11-1015}
    }""",
    redistribution_info=,
)

Because the dataset's info object comes with the encoder, we can use it directly:

# The dataset info includes the encoder
encoder = info.features['text'].encoder
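
As a quick check, the encoder's vocabulary size can be printed (for the subwords8k configuration it is around 8k subwords):

print('Vocabulary size: {}'.format(encoder.vocab_size))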

Test the encoder:

sample_string = 'Hello world.'

encoded_string = encoder.encode(sample_string)
print('Encoded string is {}'.format(encoded_string))
original_string = encoder.decode(encoded_string)
print('The original string: "{}"'.format(original_string))
assert original_string == sample_string
for index in encoded_string:
  print('{} ----> {}'.format(index, encoder.decode([index])))

The output is:

Encoded string is [4025, 222, 562, 7975]
The original string: "Hello world."
4025 ----> Hell
222 ----> o
562 ----> world
7975 ----> .

2.2 Preprocess the dataset

Shuffle the training data so the model does not see examples in a fixed order, and apply padded_batch so that the variable-length reviews in each batch are padded to a common length for training. Note that in recent TensorFlow 2.x versions, padded_batch no longer requires an explicit padded_shapes argument; it defaults to padding every dimension to the longest element in the batch.

BUFFER_SIZE = 10000
BATCH_SIZE = 64
train_dataset = (train_data
                 .shuffle(BUFFER_SIZE)
                 .padded_batch(BATCH_SIZE))

test_dataset = (test_data
                .padded_batch(BATCH_SIZE))
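
On older TensorFlow versions, padded_shapes must be given explicitly. A sketch of the equivalent call, assuming (text, label) pairs where only the text axis is variable-length:

train_dataset = (train_data
                 .shuffle(BUFFER_SIZE)
                 .padded_batch(BATCH_SIZE, padded_shapes=([None], [])))

test_dataset = test_data.padded_batch(BATCH_SIZE, padded_shapes=([None], []))
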
3 Create the model

Here we create a tf.keras.Sequential model, shown in the figure below.
(Figure: a standard single-layer bidirectional LSTM model)
The Embedding layer maps each subword id to a word vector that serves as the network's input; the vectors here are 64-dimensional, though in practice larger sizes are common.
The LSTM layer consists of long short-term memory cells, used bidirectionally here: one pass reads the sequence forward to the last word, and a second pass reads it backward to the first.
Dense layer 1: a fully connected layer with 64 units.
Dense layer 2: the fully connected output layer.
For more background, see my notes on recurrent neural networks.

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
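
To double-check the layer output shapes and parameter counts, the model can be built with an unspecified batch and sequence length and then summarized; a minimal sketch, assuming TF 2.x Keras:

# Sequential has no fixed input shape yet, so build it before summary().
model.build(input_shape=(None, None))
model.summary()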

Set the loss function and optimizer. from_logits=True tells the loss that the final Dense layer outputs raw logits, with no sigmoid applied:

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])
4 Train the model

With tf.keras we can call fit directly to train the model: 10 epochs, with the test set as validation data, evaluating 30 validation batches per epoch.

history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset, 
                    validation_steps=30)

Epoch 1/10
391/391 [==============================] - 44s 112ms/step - loss: 0.6572 - accuracy: 0.5434 - val_loss: 0.4859 - val_accuracy: 0.7865
Epoch 2/10
391/391 [==============================] - 43s 110ms/step - loss: 0.3448 - accuracy: 0.8572 - val_loss: 0.3440 - val_accuracy: 0.8458
Epoch 3/10
391/391 [==============================] - 43s 110ms/step - loss: 0.2618 - accuracy: 0.8952 - val_loss: 0.3378 - val_accuracy: 0.8458
Epoch 4/10
391/391 [==============================] - 43s 111ms/step - loss: 0.2110 - accuracy: 0.9204 - val_loss: 0.3278 - val_accuracy: 0.8594
Epoch 5/10
391/391 [==============================] - 43s 110ms/step - loss: 0.1867 - accuracy: 0.9322 - val_loss: 0.3563 - val_accuracy: 0.8510
Epoch 6/10
391/391 [==============================] - 43s 110ms/step - loss: 0.1624 - accuracy: 0.9432 - val_loss: 0.3610 - val_accuracy: 0.8615
Epoch 7/10
391/391 [==============================] - 43s 110ms/step - loss: 0.2073 - accuracy: 0.9308 - val_loss: 0.3900 - val_accuracy: 0.8578
Epoch 8/10
391/391 [==============================] - 43s 109ms/step - loss: 0.1370 - accuracy: 0.9542 - val_loss: 0.4124 - val_accuracy: 0.8641
Epoch 9/10
391/391 [==============================] - 44s 112ms/step - loss: 0.1222 - accuracy: 0.9597 - val_loss: 0.4238 - val_accuracy: 0.8641
Epoch 10/10
391/391 [==============================] - 44s 113ms/step - loss: 0.1205 - accuracy: 0.9600 - val_loss: 0.4685 - val_accuracy: 0.8568

On the test set:

test_loss, test_acc = model.evaluate(test_dataset)
print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))

Test Loss: 0.44925960898399353
Test Accuracy: 0.8586400151252747

How the accuracy and loss change as training progresses over the epochs:

(Figure: loss on the left, accuracy on the right)
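
A minimal sketch of how these curves can be drawn from the fit history, using the matplotlib import from the setup (the history.history keys follow the compiled metric names):

def plot_graphs(history, metric):
  # Plot a training metric and its validation counterpart per epoch.
  plt.plot(history.history[metric])
  plt.plot(history.history['val_' + metric])
  plt.xlabel('Epochs')
  plt.ylabel(metric)
  plt.legend([metric, 'val_' + metric])
  plt.show()

plot_graphs(history, 'loss')
plot_graphs(history, 'accuracy')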

5 Improve the model

1. Use a stacked two-layer RNN, still with LSTM cells (the first LSTM sets return_sequences=True so the second layer receives the full output sequence rather than only the final state).
2. Add a Dropout layer.

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1)
])

Everything else is the same as above.

Test results:

Test Loss: 0.5735374093055725
Test Accuracy: 0.829039990901947

Judging by accuracy, the improved model did not do better. This is likely because the dataset is too small for a model of this size, leading to overfitting and worse results.

6 Make predictions

Now we feed the model an arbitrary sentence and classify its sentiment. Since the final Dense layer outputs a raw logit, a score of at least 0 (equivalently, at least 0.5 after a sigmoid) means a positive review, and below 0 a negative one.

Because input sentences can vary in length, we pad them with zeros to a fixed size.

def pad_to_size(vec, size):
  # Right-pad the id list with zeros up to the requested length.
  zeros = [0] * (size - len(vec))
  vec.extend(zeros)
  return vec

def sample_predict(sample_pred_text, pad):
  # Encode the raw text into subword ids, optionally pad, and run the model.
  encoded_sample_pred_text = encoder.encode(sample_pred_text)

  if pad:
    encoded_sample_pred_text = pad_to_size(encoded_sample_pred_text, 64)
  encoded_sample_pred_text = tf.cast(encoded_sample_pred_text, tf.float32)
  predictions = model.predict(tf.expand_dims(encoded_sample_pred_text, 0))

  return predictions

sample_pred_text = ('The movie was cool. The animation and the graphics '
                    'were out of this world. I would recommend this movie.')

# Predict without padding, then with padding.
predictions = sample_predict(sample_pred_text, pad=False)
print(predictions)

predictions = sample_predict(sample_pred_text, pad=True)
print(predictions)

[[0.10079887]]
[[0.06816088]]

From these results we can see that padding makes the prediction more accurate.
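
Since the model was compiled with from_logits=True, the printed values are raw logits rather than probabilities. A minimal sketch of converting one into a probability and a class label:

# The final Dense layer has no activation, so squash the logit into (0, 1).
prob = tf.sigmoid(predictions).numpy()[0][0]
print('P(positive) = {:.3f}'.format(prob))
print('Positive' if prob >= 0.5 else 'Negative')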
