NLP Primer, Task 1: Exploring a Dataset with TensorFlow

Reference material: "NLP Text Classification with TensorFlow: Sentiment Analysis on the IMDB Movie-Review Dataset"

API used: tf.keras

import tensorflow as tf
from tensorflow import keras

import numpy as np

print(tf.__version__)

Downloading the IMDB dataset:

imdb = keras.datasets.imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

Classification goal: label each review text as either "positive" or "negative".

1. Dataset overview: the IMDB dataset
The IMDB dataset contains 50,000 movie-review texts from the Internet Movie Database, split evenly into a training set (25,000 reviews) and a test set (25,000 reviews).

2. Dataset exploration

1) Understanding the dataset structure

print("Training entries: {},  labels: {}".format(len(train_data), len(train_labels)))

Output: Training entries: 25000, labels: 25000
The preprocessed data are arrays of integers; a label of 0 marks a negative review and 1 a positive one.
A neural network requires inputs of equal length, so first check the lengths of the first two reviews:

len(train_data[0]), len(train_data[1])

Output:

(218, 189)

The lengths differ, so the sequences need to be brought to a common length.
2) Converting the integer data back to text

word_index = imdb.get_word_index()

# The first indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()} 
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

Use decode_review to display the text of the first review:

decode_review(train_data[0])

Output:

" this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert  is an amazing actor and now the same being director  father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for  and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also  to the two little boy's that played the  of norman and paul they were just brilliant children are often left out of the  list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"
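The reserved-index convention above (shifting every word id up by 3 to make room for the special tokens) can be illustrated with a toy vocabulary. The three words here are hypothetical stand-ins; the real mapping comes from imdb.get_word_index():

```python
# Toy illustration of the reserved-index convention used above.
# The base vocabulary is hypothetical, not the real IMDB word index.
toy_index = {"the": 1, "movie": 2, "great": 3}

# Shift every index up by 3 to make room for the reserved tokens.
word_index = {w: i + 3 for w, i in toy_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2
word_index["<UNUSED>"] = 3

reverse_word_index = {i: w for w, i in word_index.items()}

def decode_review(ids):
    # Unknown ids fall back to '?', just like in the article's decoder.
    return " ".join(reverse_word_index.get(i, "?") for i in ids)

# <START> followed by the three words, plus one out-of-vocabulary id.
print(decode_review([1, 4, 5, 6, 99]))  # -> "<START> the movie great ?"
```

This is why the decoded review above begins with an implicit `<START>` token and shows `?` for words outside the top-10,000 vocabulary.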

3) Preprocessing
There are two ways to resolve the unequal lengths:
A. one-hot encoding
B. padding the arrays with zeros to a common length (used here)

train_data = keras.preprocessing.sequence.pad_sequences(train_data,value=word_index["<PAD>"],padding='post',maxlen=256)
 
test_data = keras.preprocessing.sequence.pad_sequences(test_data,value=word_index["<PAD>"],padding='post',maxlen=256)

Check the lengths now:

len(train_data[0]), len(train_data[1])

Output:

(256, 256)
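What pad_sequences does here can be sketched in plain Python. This is a simplified stand-in (it ignores keras's dtype handling and its pre/post truncation options, keeping only the behavior used above):

```python
def pad_post(seq, maxlen, value=0):
    # Simplified stand-in for keras.preprocessing.sequence.pad_sequences:
    # keep the last `maxlen` items (keras truncates from the front by
    # default), then pad at the end with `value` ('post' padding).
    seq = list(seq)[-maxlen:]
    return seq + [value] * (maxlen - len(seq))

print(pad_post([1, 2, 3], 6))                # -> [1, 2, 3, 0, 0, 0]
print(len(pad_post(list(range(300)), 256)))  # -> 256
```

Since `value=word_index["<PAD>"]` is 0, every review shorter than 256 tokens ends in a run of zeros, and every longer review is cut down to 256.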

4) Building the model
Two things need to be decided:
A. the number of layers in the model
B. the number of hidden units in each layer
The referenced tutorial uses the following model:


vocab_size = 10000

model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))

model.summary()

The classifier is built by stacking layers:
① an embedding layer that maps the sparse integer word indices to dense 16-dimensional vectors;
② a global average pooling layer that averages over the sequence dimension;
③ a fully connected layer;
④ a sigmoid activation, producing a float between 0 and 1 that can be read as a confidence score.
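As a sanity check on model.summary(), the parameter count of these layers can be computed by hand from the sizes above (pure arithmetic, no TensorFlow needed):

```python
vocab_size, embed_dim, hidden = 10000, 16, 16

embedding_params = vocab_size * embed_dim    # one 16-d vector per word: 160000
dense1_params = embed_dim * hidden + hidden  # weights + biases: 272
dense2_params = hidden * 1 + 1               # 16 weights + 1 bias: 17

total = embedding_params + dense1_params + dense2_params
print(total)  # -> 160289
```

Note that the embedding table dominates: over 99% of the parameters sit in the word vectors.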

5) Validation split and training
Split a validation set off from the training data:

x_val = train_data[:10000]
partial_x_train = train_data[10000:]

y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]

Start training.
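The article does not show the compile-and-fit code itself, so here is a sketch of that step. The optimizer and loss (adam, binary_crossentropy, which matches the single sigmoid output) follow the referenced tutorial, which trains for 40 epochs with batch_size=512 and validation_data=(x_val, y_val); the tiny random arrays below are only stand-ins so the snippet runs on its own, where the article would use partial_x_train and partial_y_train:

```python
import numpy as np
from tensorflow import keras

# Stand-in data so this sketch is self-contained; in the article these
# would be partial_x_train / partial_y_train from the split above.
x_train = np.random.randint(0, 10000, size=(64, 256))
y_train = np.random.randint(0, 2, size=64).astype(np.float32)

model = keras.Sequential([
    keras.layers.Embedding(10000, 16),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid'),
])

# Binary cross-entropy pairs with the single sigmoid output unit.
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])

# Small epoch/batch values here; the tutorial uses epochs=40, batch_size=512.
history = model.fit(x_train, y_train, epochs=2, batch_size=32, verbose=0)
print(sorted(history.history.keys()))
```

The returned History object records the per-epoch loss and accuracy, which the tutorial later plots to check for overfitting.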

6) Model evaluation
Evaluation metrics: loss and accuracy.

results = model.evaluate(test_data, test_labels)

print(results)

Output: [0.31364583273612129, 0.87561]
