Preface
In this post we will walk through a complete example of using Conv1D on a dataset, covering: an introduction to the dataset, building the network, and analyzing the results.
Dataset introduction: IMDb
Before learning a classification algorithm, you need to understand the input data format the algorithm requires, so the dataset we are about to use is described in detail below.
Data location
The dataset files can be accessed and downloaded from https://datasets.imdbws.com/. The data is refreshed daily.
IMDb dataset details
Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set. The first line in each file contains headers that describe what is in each column.
'\N' is used to denote that a particular field is missing or null for that title/name. The available datasets are listed below (a short loading sketch in Python follows the list):
title.akas.tsv.gz - contains the following information for titles:
- titleId (string) - a tconst, an alphanumeric unique identifier of the title
- ordering (integer) - a number to uniquely identify rows for a given titleId
- title (string) - the localized title
- region (string) - the region for this version of the title
- language (string) - the language of the title
- types (array) - enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
- attributes (array) - additional terms to describe this alternative title, not enumerated
- isOriginalTitle (boolean) - 0: not original title; 1: original title
title.basics.tsv.gz - contains the following information for titles:
- tconst (string) - alphanumeric unique identifier of the title
- titleType (string) - the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc.)
- primaryTitle (string) - the more popular title / the title used by the filmmakers on promotional materials at the point of release
- originalTitle (string) - original title, in the original language
- isAdult (boolean) - 0: non-adult title; 1: adult title
- startYear (YYYY) - represents the release year of a title. In the case of a TV series, it is the series start year
- endYear (YYYY) - TV series end year. '\N' for all other title types
- runtimeMinutes - primary runtime of the title, in minutes
- genres (string array) - includes up to three genres associated with the title
title.crew.tsv.gz - contains the director and writer information for all the titles in IMDb. Fields include:
- tconst (string) - alphanumeric unique identifier of the title
- directors (array of nconsts) - director(s) of the given title
- writers (array of nconsts) - writer(s) of the given title
title.episode.tsv.gz - contains the TV episode information. Fields include:
- tconst (string) - alphanumeric identifier of the episode
- parentTconst (string) - alphanumeric identifier of the parent TV series
- seasonNumber (integer) - season number the episode belongs to
- episodeNumber (integer) - episode number of the tconst in the TV series
title.principals.tsv.gz - contains the principal cast/crew for titles:
- tconst (string) - alphanumeric unique identifier of the title
- ordering (integer) - a number to uniquely identify rows for a given titleId
- nconst (string) - alphanumeric unique identifier of the name/person
- category (string) - the category of job that person was in
- job (string) - the specific job title if applicable, else '\N'
- characters (string) - the name of the character played if applicable, else '\N'
title.ratings.tsv.gz - contains the IMDb rating and votes information for titles:
- tconst (string) - alphanumeric unique identifier of the title
- averageRating - weighted average of all the individual user ratings
- numVotes - number of votes the title has received
name.basics.tsv.gz - contains the following information for names:
- nconst (string) - alphanumeric unique identifier of the name/person
- primaryName (string) - name by which the person is most often credited
- birthYear - in YYYY format
- deathYear - in YYYY format if applicable, else '\N'
- primaryProfession (array of strings) - the top-3 professions of the person
- knownForTitles (array of tconsts) - titles the person is known for
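Since these files are plain gzipped TSV, they can be inspected directly before any modeling. The following is a minimal sketch (not part of the original Keras example) that assumes pandas is installed and that title.ratings.tsv.gz has already been downloaded from the URL above; the local file path is hypothetical.
# Minimal sketch: read one of the IMDb TSV files with pandas (assumed installed).
# "title.ratings.tsv.gz" is a hypothetical local path to the downloaded file.
import pandas as pd

ratings = pd.read_csv(
    "title.ratings.tsv.gz",   # pandas decompresses .gz files transparently
    sep="\t",                 # tab-separated values
    na_values="\\N",          # IMDb marks missing/null fields with '\N'
)
print(ratings.columns.tolist())   # expected: ['tconst', 'averageRating', 'numVotes']
print(ratings.head())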
keras IMDB
keras provides a dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). The reviews have been preprocessed, and each review is encoded as a list of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that, for instance, the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
As a convention, "0" does not stand for a specific word, but instead is used to encode any unknown word.
Load the dataset with load_data from keras:
tf.keras.datasets.imdb.load_data(
    path="imdb.npz",
    num_words=None,
    skip_top=0,
    maxlen=None,
    seed=113,
    start_char=1,
    oov_char=2,
    index_from=3,
    **kwargs
)
load_data
Arguments:
path: where to cache the data (relative to ~/.keras/dataset).
num_words: integer or None. Words are ranked by how often they occur (in the training set) and only the num_words most frequent words are kept. Any less frequent word will appear as oov_char value in the sequence data. If None, all words are kept. Defaults to None, so all words are kept.
skip_top: skip the top N most frequently occurring words (which may not be informative). These words will appear as oov_char value in the dataset. Defaults to 0, so no words are skipped.
maxlen: int or None. Maximum sequence length. Any longer sequence will be truncated. Defaults to None, which means no truncation.
seed: int. Seed for reproducible data shuffling.
start_char: int. The start of a sequence will be marked with this character. Defaults to 1 because 0 is usually the padding character.
oov_char: int. The out-of-vocabulary character. Words that were cut out because of the num_words or skip_top limits will be replaced with this character.
index_from: int. Index actual words with this index and higher.
**kwargs: Used for backwards compatibility.
Returns:
Tuple of Numpy arrays: (x_train, y_train), (x_test, y_test).
x_train, x_test: lists of sequences, which are lists of indexes (integers). If the num_words argument was specified, the maximum possible index value is num_words - 1. If the maxlen argument was specified, the largest possible sequence length is maxlen.
y_train, y_test: lists of integer labels (1 or 0).
Note that the "out of vocabulary" character is only used for words that were present in the training set but were left out because they did not make the num_words cut.
Words that were not seen in the training set but appear in the test set have simply been skipped. (A quick usage check follows below.)
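To make the returned structure concrete, here is a small sanity-check sketch; it simply assumes keras is installed and the num_words value is chosen arbitrarily for illustration.
# Quick check of what imdb.load_data returns (illustrative, num_words chosen arbitrarily).
from keras.datasets import imdb

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)
print(len(x_train), 'train sequences;', len(x_test), 'test sequences')  # 25000 each
print(x_train[0][:10])                     # each review is a list of word indices
print(max(max(seq) for seq in x_train))    # stays below 10000 because of num_words
print(set(y_train))                        # {0, 1}: negative / positive labels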
In addition:
get_word_index
This function retrieves a dictionary that maps words to their index in the IMDB dataset (a decoding sketch follows this subsection).
Arguments
path: where to cache the data (relative to ~/.keras/dataset).
Returns
The word index dictionary. Keys are word strings, values are their index.
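Combining get_word_index with the default start_char=1, oov_char=2 and index_from=3 shown earlier, a review can be decoded back into (approximate) text. The sketch below assumes those defaults; the placeholder tokens <pad>, <start> and <oov> are just illustrative names.
# Sketch: decode an encoded review back to words, assuming the default offsets
# start_char=1, oov_char=2, index_from=3 used by imdb.load_data.
from keras.datasets import imdb

(x_train, y_train), _ = imdb.load_data(num_words=10000)
word_index = imdb.get_word_index()                          # word -> index (no offset)
reverse_index = {i + 3: w for w, i in word_index.items()}   # shift by index_from=3
reverse_index[0] = '<pad>'
reverse_index[1] = '<start>'
reverse_index[2] = '<oov>'

decoded = ' '.join(reverse_index.get(i, '<oov>') for i in x_train[0])
print(y_train[0], decoded[:200])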
Building the keras model
Just like stacking building blocks, we assemble a keras Conv1D classification model.
# from: https://keras.io/examples/imdb_cnn/
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Conv1D, GlobalMaxPooling1D

def createConv1D(max_features, embedding_dims, maxlen, filters, kernel_size, hidden_dims):
    model = Sequential()
    # we start off with an efficient embedding layer which maps
    # our vocab indices into embedding_dims dimensions
    model.add(Embedding(max_features,
                        embedding_dims,
                        input_length=maxlen))
    model.add(Dropout(0.2))
    # we add a Convolution1D, which will learn filters
    # word group filters of size filter_length:
    model.add(Conv1D(filters,
                     kernel_size,
                     padding='valid',
                     activation='relu',
                     strides=1))
    # we use max pooling:
    model.add(GlobalMaxPooling1D())
    # We add a vanilla hidden layer:
    model.add(Dense(hidden_dims))
    model.add(Dropout(0.2))
    model.add(Activation('relu'))
    # We project onto a single unit output layer, and squash it with a sigmoid:
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model
Training and testing on the data
from __future__ import print_function
from keras.preprocessing import sequence
from keras.datasets import imdb
# set parameters:
max_features = 5000
maxlen = 400
batch_size = 32
embedding_dims = 50
filters = 250
kernel_size = 3
hidden_dims = 250
epochs = 2
print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
print('Build model...')
model = createConv1D(max_features, embedding_dims, maxlen, filters, kernel_size, hidden_dims)
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test))
Complete code:
from __future__ import print_function
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Conv1D, GlobalMaxPooling1D
from keras.datasets import imdb
# set parameters:
max_features = 5000
maxlen = 400
batch_size = 32
embedding_dims = 50
filters = 250
kernel_size = 3
hidden_dims = 250
epochs = 2
print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
print('Build model...')
model = Sequential()
# we start off with an efficient embedding layer which maps
# our vocab indices into embedding_dims dimensions
model.add(Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen))
model.add(Dropout(0.2))
# we add a Convolution1D, which will learn filters
# word group filters of size filter_length:
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
# we use max pooling:
model.add(GlobalMaxPooling1D())
# We add a vanilla hidden layer:
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))
# We project onto a single unit output layer, and squash it with a sigmoid:
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test))
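For the result analysis mentioned in the preface, a simple starting point is to evaluate the trained model on the held-out test data and look at a few raw predictions. The snippet below is a sketch that reuses the model, x_test, y_test and batch_size variables from the complete code above; everything else is illustrative.
# Sketch: basic result analysis, reusing model / x_test / y_test / batch_size from above.
loss, acc = model.evaluate(x_test, y_test, batch_size=batch_size, verbose=0)
print('Test loss:', loss)
print('Test accuracy:', acc)

# Sigmoid outputs close to 1 mean "positive", close to 0 mean "negative".
probs = model.predict(x_test[:5])
for p, label in zip(probs.ravel(), y_test[:5]):
    print('predicted probability: %.3f  true label: %d' % (p, label))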