Keras: Comparing CNN and RNN on the IMDB Movie Review Dataset

About the IMDB dataset

IMDB dataset download link 👉 (click to download)

IMDB dataset overview 👉 (click for details)

As the title says, the IMDB dataset contains movie review sentiment data: 50,000 labeled (scored) reviews, split into 25,000 training reviews and 25,000 test reviews. Reviews whose movie rating is 7 or higher are labeled positive, and reviews whose rating is 4 or lower are labeled negative. The dataset also includes 50,000 unlabeled reviews.

The dataset is a little over 80 MB. After downloading it, extract it into your working directory; alternatively, let the script download and extract it the first time it runs, and simply comment that step out later when you move on to tuning parameters. If you download it manually, unpack the archive on your local machine; the resulting directory structure looks like this:
(Screenshot: directory structure of the extracted aclImdb dataset)
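If you prefer the download-on-first-run route mentioned above, here is a minimal sketch using tf.keras.utils.get_file. The URL is the same one used in CNN_text.py below; extracting into the working directory via cache_dir="." is an assumption made to match the layout described here.

import os
import tensorflow as tf

# Download and extract aclImdb_v1.tar.gz into the current working directory.
# (cache_dir="." / cache_subdir="" keep the data next to the scripts; by default
# Keras would place it under ~/.keras/datasets instead.)
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
archive = tf.keras.utils.get_file(
    "aclImdb_v1", url, untar=True, cache_dir=".", cache_subdir=""
)

# The archive unpacks into an aclImdb/ folder containing train/ and test/.
data_dir = os.path.join(os.path.dirname(archive), "aclImdb")
print(sorted(os.listdir(data_dir)))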

Sentiment prediction with an RNN

Reference: the official Keras example 👉 KERAS·examples

The network structure of the RNN is shown in the figure below:
(Figure: RNN network structure)
In this example, the preprocessed IMDB review data is loaded first; an Embedding layer then maps the one-hot word indices into fixed-size dense vectors, and the model built on top of it learns relationships between words and constructs the feature space. Finally, for training, the loss and the optimization method are defined. Since this is a binary classification problem, cross-entropy (binary cross-entropy in the code) is used as the loss, i.e. the objective function.

The output after training and evaluating the model is shown in the screenshot below: the loss is 0.2194 and the accuracy is 0.9176.

(Screenshot: RNN training and evaluation output)
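These numbers are the validation metrics printed by model.fit; the attached RNN_text.py does not call evaluate explicitly. If you want to print the final validation loss and accuracy yourself, a minimal sketch that reuses the variables defined in that script:

# Evaluate the trained bidirectional-LSTM model on the padded validation data
# (model, x_val and y_val as defined in RNN_text.py below).
val_loss, val_accuracy = model.evaluate(x_val, y_val, batch_size=32)
print(f"Validation loss: {val_loss:.4f}, accuracy: {val_accuracy:.4f}")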

Text classification with a CNN

This is the same IMDB review sentiment classification task, except that Conv1D layers are used in place of the RNN structure.

Reference: the official Keras example 👉 KERAS·examples

A CNN language model works on the same basic principle as a CNN image model: convolutional layers extract features, pooling layers reduce the number of neurons, and a final softmax-like layer outputs the class probabilities.

The difference is the shape of the data. Image data is 3-dimensional (height, width, and channels), whereas a language model works on 2-dimensional data (sentence length and word-vector dimension). Image convolutions typically use tf.nn.conv2d, while text convolutions use conv1d. When convolving over word vectors, it is common to use several kernels of different lengths to extract features: taking windows of 3, 4, and 5 consecutive words as the kernel sizes, for example, often yields features that come quite close to capturing the meaning of a sentence. A sketch of this multi-kernel setup is shown below.
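A minimal sketch of the multi-kernel idea, assuming the same vocabulary size, embedding dimension, and sequence length as the scripts in the appendix; the kernel sizes 3/4/5 and the filter count of 128 are illustrative choices, not the parameters actually used in CNN_text.py.

import tensorflow as tf
from tensorflow.keras import layers

max_features = 20000     # vocabulary size
embedding_dim = 128      # word-vector length
sequence_length = 200    # padded sentence length

inputs = tf.keras.Input(shape=(sequence_length,), dtype="int64")
x = layers.Embedding(max_features, embedding_dim)(inputs)

# One Conv1D branch per kernel size (3, 4 and 5 consecutive words); each branch
# is reduced to a single feature vector by global max pooling, and the branches
# are then concatenated into one feature vector.
pooled = []
for kernel_size in (3, 4, 5):
    branch = layers.Conv1D(128, kernel_size, activation="relu")(x)
    pooled.append(layers.GlobalMaxPooling1D()(branch))
x = layers.Concatenate()(pooled)

x = layers.Dense(128, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()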

In this example, the IMDB data is preprocessed first, which includes removing tags and padding the text (in fact the downloaded dataset has already been processed, so this step can be skipped). Then the Embedding layer and the convolution/pooling layers are defined; the convolution kernels extract text features, the feature vectors produced by the different kernels are flattened and concatenated, and a fully connected layer outputs the class. Finally, the loss function is defined and a suitable optimizer is chosen for training.

To compare CNN and RNN on IMDB, the CNN parameters here were modified to match the ones used in the RNN example. The final result is shown in the screenshot below: the loss is 0.4450 and the accuracy is 0.8396.
(Screenshot: CNN training and evaluation output)

Problems you may run into, and how to fix them

Error during RNN training: Process finished with exit code -1073740791 (0xC0000409)

Fix: the crash is caused by insufficient GPU memory; the program stops as soon as training starts and the GPU runs out of memory.

Before any code that accesses the GPU, set os.environ["CUDA_VISIBLE_DEVICES"] = "-1". Setting this variable to -1 hides the GPU, so the program runs on the CPU and uses system memory instead of GPU memory.
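A minimal sketch of where to place the setting: it has to run before TensorFlow initializes its GPU context, so put it before the tensorflow import, exactly as is done at the top of the two scripts in the appendix.

import os

# Hide all GPUs so TensorFlow falls back to the CPU. This must run before
# TensorFlow creates its GPU context, i.e. before importing tensorflow.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf

# Sanity check: this should now print an empty list.
print(tf.config.list_physical_devices("GPU"))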

With this setting in place, the script runs successfully.

The CNN training fails: the loss becomes negative and its absolute value keeps growing

The error observed during training is shown in the screenshot below.
(Screenshot: negative and growing loss values during CNN training)
Fix: the likely cause is that text_dataset_from_directory treats every subfolder of aclImdb/train as a class, so the unsup folder adds a third label, which a single sigmoid output trained with binary cross-entropy (expecting only labels 0 and 1) cannot handle, driving the loss negative. Since the experiment only needs the neg and pos subfolders, delete the unsup data under the IMDB dataset, which also reduces the amount of training data.
After deleting the folder and re-running the code, training succeeds.
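If you prefer to remove the folder from Python rather than with the shell command in CNN_text.py, a minimal sketch (the path assumes the archive was extracted into the working directory):

import os
import shutil

# Path assumes aclImdb_v1.tar.gz was extracted into the current working directory.
unsup_dir = os.path.join("aclImdb", "train", "unsup")

# Remove the unlabeled reviews so that text_dataset_from_directory only sees
# the two class folders, neg and pos.
if os.path.isdir(unsup_dir):
    shutil.rmtree(unsup_dir)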

Appendix: source code of RNN_text.py and CNN_text.py

1. RNN_text.py

"""
Title: Bidirectional LSTM on IMDB
Author: [fchollet](https://twitter.com/fchollet)
Date created: 2020/05/03
Last modified: 2020/05/03
Description: Train a 2-layer bidirectional LSTM on the IMDB movie review sentiment classification dataset.
"""

"""
## Setup
"""

import numpy as np
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
from tensorflow import keras
from tensorflow.keras import layers

max_features = 20000  # Only consider the top 20k words
maxlen = 200  # Only consider the first 200 words of each movie review

"""
## Build the model
"""

# Input for variable-length sequences of integers
inputs = keras.Input(shape=(None,), dtype="int32")
# Embed each integer in a 128-dimensional vector
x = layers.Embedding(max_features, 128)(inputs)
# Add 2 bidirectional LSTMs
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(64))(x)
# Add a classifier
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.summary()

"""
## Load the IMDB movie review sentiment data
"""

(x_train, y_train), (x_val, y_val) = keras.datasets.imdb.load_data(
    num_words=max_features
)
print(len(x_train), "Training sequences")
print(len(x_val), "Validation sequences")
# Use pad_sequences to standardize sequence length:
# this will truncate sequences longer than 200 words and zero-pad sequences shorter than 200 words.
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_val = keras.preprocessing.sequence.pad_sequences(x_val, maxlen=maxlen)

"""
## Train and evaluate the model
"""

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=2, validation_data=(x_val, y_val))

2. CNN_text.py

"""
Title: Text classification from scratch
Authors: Mark Omernick, Francois Chollet
Date created: 2019/11/06
Last modified: 2020/05/17
Description: Text sentiment classification starting from raw text files.
"""
"""
## Introduction

This example shows how to do text classification starting from raw text (as
a set of text files on disk). We demonstrate the workflow on the IMDB sentiment
classification dataset (unprocessed version). We use the `TextVectorization` layer for
 word splitting & indexing.
"""

"""
## Setup
"""

import tensorflow as tf
import numpy as np

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

"""
## Load the data: IMDB movie review sentiment classification

Let's download the data and inspect its structure.
"""

"""shell
curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar -xf aclImdb_v1.tar.gz
"""

"""
The `aclImdb` folder contains a `train` and `test` subfolder:
"""

"""shell
ls aclImdb
"""

"""shell
ls aclImdb/test
"""

"""shell
ls aclImdb/train
"""

"""
The `aclImdb/train/pos` and `aclImdb/train/neg` folders contain text files, each of
 which represents one review (either positive or negative):
"""

"""shell
cat aclImdb/train/pos/6248_7.txt
"""

"""
We are only interested in the `pos` and `neg` subfolders, so let's delete the rest:
"""

"""shell
rm -r aclImdb/train/unsup
"""

"""
You can use the utility `tf.keras.preprocessing.text_dataset_from_directory` to
generate a labeled `tf.data.Dataset` object from a set of text files on disk filed
 into class-specific folders.

Let's use it to generate the training, validation, and test datasets. The validation
and training datasets are generated from two subsets of the `train` directory, with 20%
of samples going to the validation dataset and 80% going to the training dataset.

Having a validation dataset in addition to the test dataset is useful for tuning
hyperparameters, such as the model architecture, for which the test dataset should not
be used.

Before putting the model out into the real world however, it should be retrained using all
available training data (without creating a validation dataset), so its performance is maximized.

When using the `validation_split` & `subset` arguments, make sure to either specify a
random seed, or to pass `shuffle=False`, so that the validation & training splits you
get have no overlap.

"""

batch_size = 32
raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="training",
    seed=1337,
)
raw_val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="validation",
    seed=1337,
)
raw_test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)

print(
    "Number of batches in raw_train_ds: %d"
    % tf.data.experimental.cardinality(raw_train_ds)
)
print(
    "Number of batches in raw_val_ds: %d" % tf.data.experimental.cardinality(raw_val_ds)
)
print(
    "Number of batches in raw_test_ds: %d"
    % tf.data.experimental.cardinality(raw_test_ds)
)

"""
Let's preview a few samples:
"""

# It's important to take a look at your raw data to ensure your normalization
# and tokenization will work as expected. We can do that by taking a few
# examples from the training set and looking at them.
# This is one of the places where eager execution shines:
# we can just evaluate these tensors using .numpy()
# instead of needing to evaluate them in a Session/Graph context.
for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(5):
        print(text_batch.numpy()[i])
        print(label_batch.numpy()[i])

"""
## Prepare the data

In particular, we remove `<br />` tags.
"""

from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
import string
import re

# Having looked at our data above, we see that the raw text contains HTML break
# tags of the form '<br />'. These tags will not be removed by the default
# standardizer (which doesn't strip HTML). Because of this, we will need to
# create a custom standardization function.
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(
        stripped_html, "[%s]" % re.escape(string.punctuation), ""
    )


# Model constants.
max_features = 20000
embedding_dim = 128
sequence_length = 200

# Now that we have our custom standardization, we can instantiate our text
# vectorization layer. We are using this layer to normalize, split, and map
# strings to integers, so we set our 'output_mode' to 'int'.
# Note that we're using the default split function,
# and the custom standardization defined above.
# We also set an explicit maximum sequence length, since the CNNs later in our
# model won't support ragged sequences.
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)

# Now that the vocab layer has been created, call `adapt` on a text-only
# dataset to create the vocabulary. You don't have to batch, but for very large
# datasets this means you're not keeping spare copies of the dataset in memory.

# Let's make a text-only dataset (no labels):
text_ds = raw_train_ds.map(lambda x, y: x)
# Let's call `adapt`:
vectorize_layer.adapt(text_ds)

"""
## Two options to vectorize the data

There are 2 ways we can use our text vectorization layer:

**Option 1: Make it part of the model**, so as to obtain a model that processes raw
 strings, like this:
"""

"""

text_input = tf.keras.Input(shape=(1,), dtype=tf.string, name='text')
x = vectorize_layer(text_input)
x = layers.Embedding(max_features + 1, embedding_dim)(x)
...

**Option 2: Apply it to the text dataset** to obtain a dataset of word indices, then
 feed it into a model that expects integer sequences as inputs.

An important difference between the two is that option 2 enables you to do
**asynchronous CPU processing and buffering** of your data when training on GPU.
So if you're training the model on GPU, you probably want to go with this option to get
 the best performance. This is what we will do below.

If we were to export our model to production, we'd ship a model that accepts raw
strings as input, like in the code snippet for option 1 above. This can be done after
 training. We do this in the last section.


"""


def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text), label


# Vectorize the data.
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

# Do async prefetching / buffering of the data for best performance on GPU.
train_ds = train_ds.cache().prefetch(buffer_size=10)
val_ds = val_ds.cache().prefetch(buffer_size=10)
test_ds = test_ds.cache().prefetch(buffer_size=10)

"""
## Build a model

We choose a simple 1D convnet starting with an `Embedding` layer.
"""

from tensorflow.keras import layers

# A integer input for vocab indices.
inputs = tf.keras.Input(shape=(None,), dtype="int64")

# Next, we add a layer to map those vocab indices into a space of dimensionality
# 'embedding_dim'.
x = layers.Embedding(max_features, embedding_dim)(inputs)
x = layers.Dropout(0.5)(x)

# Conv1D + global max pooling.
# The second Conv1D layer from the original Keras example is left commented out
# here, presumably to keep the model closer to the RNN example's settings.
# x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
x = layers.GlobalMaxPooling1D()(x)

# We add a vanilla hidden layer:
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)

# We project onto a single unit output layer, and squash it with a sigmoid:
predictions = layers.Dense(1, activation="sigmoid", name="predictions")(x)

model = tf.keras.Model(inputs, predictions)

# Compile the model with binary crossentropy loss and an adam optimizer.
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

"""
## Train the model
"""

epochs = 3

# Fit the model using the train and test datasets.
model.fit(train_ds, validation_data=val_ds, epochs=epochs)

"""
## Evaluate the model on the test set
"""

model.evaluate(test_ds)

"""
## Make an end-to-end model

If you want to obtain a model capable of processing raw strings, you can simply
create a new model (using the weights we just trained):
"""

# A string input
inputs = tf.keras.Input(shape=(1,), dtype="string")
# Turn strings into vocab indices
indices = vectorize_layer(inputs)
# Turn vocab indices into predictions
outputs = model(indices)

# Our end to end model
end_to_end_model = tf.keras.Model(inputs, outputs)
end_to_end_model.compile(
    loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"]
)

# Test it with `raw_test_ds`, which yields raw strings
end_to_end_model.evaluate(raw_test_ds)
