I. Introduction
While working on the IMDB dataset demo, many methods for converting text to one-hot vectors and to embeddings were used; they are organized below.
The later examples use the raw IMDB dataset, which contains users' movie reviews labeled as positive or negative:
Working with one of the best Shakespeare sources, this film manages to be creditable to it's source, whilst still appealing to a wider audience.<br /><br />Branagh steals the film from under Fishburne's nose, and there's a talented cast on good form.
Well...tremors I, the original started off in 1990 and i found the movie quite enjoyable to watch. however, they proceeded to make tremors II and III. Trust me, those movies started going downhill right after they finished the first one, i mean, ass blasters??? Now, only God himself is capable of answering the question "why in Gods name would they create another one of these dumpster dives of a movie?" Tremors IV cannot be considered a bad movie, in fact it cannot be even considered an epitome of a bad movie, for it lives up to more than that. As i attempted to sit though it, i noticed that my eyes started to bleed, and i hoped profusely that the little girl from the ring would crawl through the TV and kill me. did they really think that dressing the people who had stared in the other movies up as though they we're from the wild west would make the movie (with the exact same occurrences) any better? honestly, i would never suggest buying this movie, i mean, there are cheaper ways to find things that burn well.
Ouch! This one was a bit painful to sit through. It has a cute and amusing premise, but it all goes to hell from there. Matthew Modine is almost always pedestrian and annoying, and he does not disappoint in this one. Deborah Kara Unger and John Neville turned in surprisingly decent performances. Alan Bates and Jennifer Tilly, among others, played it way over the top. I know that's the way the parts were written, and it's hard to blame actors, when the script and director have them do such schlock. If you're going to have outrageous characters, that's OK, but you gotta have good material to make it work. It didn't here. Run away screaming from this movie if at all possible.
I've seen some crappy movies in my life, but this one must be among the very worst. Definately bottom 100 material (imo, that is).<br /><br />We follow two couples, the Dodds (Billy Bob Thornton as Lonnie Earl and Natasha Richardson as Darlene) and the Kirkendalls (Patrick Swayze as Roy and Charlize Theron as Candy) in one car on a roadtrip to Reno.<br /><br />Apparently, Lonnie isn't too happy with his sex-life, so he cheats on his wife with Candy, who's despirately trying to have a baby. Roy, meanwhile, isn't too sure if his sperm is OK so he's getting it checked by a doctor.<br /><br />Now, I had read the back of the DVD, but my girlfriend didn't, and she blurted out after about 20 minutes: 'oh yeah, she's gonna end up pregnant but her husband can't have any baby's'. Spot on, as this movie is soooo predictable. As well as boring. And annoying. Meaningless. Offensive. Terrible.<br /><br />An example of how much this movie stinks. The two couples set out in their big car towards Nevada, when they are stopped by 2 police-officers, as they didn't stop at a stop-sign. The guys know each other and finally bribe the two officers with a case of beer. Not only is this scene pointless and not important (or even relevant) for the movie, it takes about 5 minutes! It's just talk and talk and talk, without ever going somewhere.<br /><br />I still have to puke thinking about the ending though. Apparently, Roy ISN'T having problems down there so he IS the father of the child. How many times does that happen in the movies... try something new! The cheated wife ultimately forgives her husband and best friend for having the affair and they all live happily ever after. Yuck.<br /><br />Best scene of the movie is right at the end, with a couple of shots of the Grand Canyon. Why couldn't they just keep the camera on that for 90 minutes?<br /><br />One would expect more from this cast (although Thornton really tries), but you can't really blame them. Writers, shame on you!<br /><br />1/10.
Only 4 test reviews are shown here, just the tip of the iceberg, but this does not affect the steps that follow. To use the full data, download the tar.gz file from the original IMDB dataset, decompress it to a .tar file with the command below, and then double-click the .tar file to extract it.
gzip -d xx.tar.gz
Environment: Mac mini + Python 3.8 + TensorFlow 2.4.0-rc0
II. Formatting Text Samples with Tokenizer
1. Formatting text
tensorflow.keras.preprocessing.text.Tokenizer picks the max_words most common words in the given texts and builds a word index, then converts each text into a sequence of indices; it can be thought of as a simple text encoder.
> Read the raw data into the texts list (the handling of labels can be ignored here)
> Define max_words and initialize the Tokenizer
> Call fit_on_texts to fit the tokenizer on the texts, then texts_to_sequences to convert the texts into index sequences
import os

imdb_dir = os.path.expanduser('~/aclImdb')  # path to the unpacked aclImdb dataset
train_dir = os.path.join(imdb_dir, 'train')

labels = []
texts = []

# Read every review file and record its label (0 = negative, 1 = positive)
for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname[-4:] == '.txt':
            f = open(os.path.join(dir_name, fname))
            texts.append(f.read())
            f.close()
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)
# Tokenize the data
from tensorflow.keras.preprocessing.text import Tokenizer

max_words = 10000  # only consider the 10,000 most common words
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
The output Found 88582 unique tokens. means texts contains 88,582 distinct words. Since we only keep the top 10,000, the indices in the resulting sequences never exceed 9,999.
2. Verifying max_words
maxIndex = max([max(i) for i in sequences])
print("maxIndex:", maxIndex)
maxIndex: 9999
3. Verifying the word-to-index mapping
tokenizer.word_index can be viewed as a word-to-index map: any word that appears in texts can be looked up to get its index. Here we take the first entry of sequences and compare it with the index sequence obtained by looking up each word of the review in word_index. The two are largely identical; the differences come from my manual edits to the text (for example the 7s are the br tags, which I stripped out by hand, and there's becomes there, 222 versus 47). Also note that 19499 (creditable) only shows up in the manually built list, because texts_to_sequences drops words whose index is not below num_words.
Since the rest matches, an index sequence can later be decoded back through word_index as well; a small decoding sketch follows the output below.
print(sequences[0])
textList = []
text = "Working with one of the best Shakespeare sources, this film manages to be creditable to it source, whilst still appealing to a wider audience Branagh steals the film from under Fishburne nose, and there a talented cast on good form."
for i in text.split(" "):
    textList.append(word_index[i.lower().replace(',', '').replace('.', '')])
print(textList)
[777, 16, 28, 4, 1, 115, 2278, 6887, 11, 19, 1025, 5, 27, 5, 42, 2425, 1861, 128, 2270, 5, 3, 6985, 308, 7, 7, 3383, 2373, 1, 19, 36, 463, 3169, 2, 222, 3, 1016, 174, 20, 49, 808]
[777, 16, 28, 4, 1, 115, 2278, 6887, 11, 19, 1025, 5, 27, 19499, 5, 9, 2425, 1861, 128, 2270, 5, 3, 6985, 308, 3383, 2373, 1, 19, 36, 463, 7177, 3169, 2, 47, 3, 1016, 174, 20, 49, 808]
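As a rough sketch (assuming the word_index and sequences built above), an index sequence can be decoded back to words by inverting word_index:
# Minimal decoding sketch: invert word_index to map indices back to words.
# Assumes `word_index` and `sequences` from the code above.
index_word = {index: word for word, index in word_index.items()}
decoded = ' '.join(index_word.get(i, '?') for i in sequences[0])
print(decoded)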
III. Standardizing Sequence Samples
1. Unifying sequence length with pad_sequences
pad_sequences was mentioned earlier in the dimension-expansion post; it pads or truncates each array to a fixed length. All of its parameters are explained below:
def pad_sequences(sequences, maxlen=None, dtype='int32',
                  padding='pre', truncating='pre', value=0.):
    # sequences: a list of sequences (lists of integers)
    # maxlen: None or int, the maximum sequence length; longer sequences are truncated,
    #         shorter ones are padded with `value` (at the front by default, see `padding`)
    # dtype: dtype of the returned numpy array
    # padding: 'pre' or 'post', pad at the start or at the end of each sequence
    # truncating: 'pre' or 'post', truncate from the start or from the end
    # value: float, the value used for padding instead of the default 0
The padding argument decides whether the default value is filled in at the front (pre) or at the back (post) when a sequence is shorter than maxlen. Here we pad the sequences generated above:
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_len = 100  # pad/truncate to 100 words
data = pad_sequences([sequences[0]], maxlen=max_len, padding='pre')
print(data[0])
data = pad_sequences([sequences[0]], maxlen=max_len, padding='post')
print(data[0])
Result:
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 777 16 28 4 1 115 2278 6887 11 19
1025 5 27 5 42 2425 1861 128 2270 5 3 6985 308 7
7 3383 2373 1 19 36 463 3169 2 222 3 1016 174 20
49 808]
[ 777 16 28 4 1 115 2278 6887 11 19 1025 5 27 5
42 2425 1861 128 2270 5 3 6985 308 7 7 3383 2373 1
19 36 463 3169 2 222 3 1016 174 20 49 808 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0]
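The example above only shows padding, because sequences[0] is shorter than maxlen. As a quick sketch of the truncating behaviour (reusing sequences[0] with a deliberately small maxlen for illustration):
# Truncation sketch: maxlen smaller than the sequence length.
# truncating='pre' drops tokens from the front, 'post' drops them from the end.
short_pre = pad_sequences([sequences[0]], maxlen=10, truncating='pre')
short_post = pad_sequences([sequences[0]], maxlen=10, truncating='post')
print(short_pre[0])   # the last 10 indices of the sequence
print(short_post[0])  # the first 10 indices of the sequence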
2. Converting texts to a binary matrix with Tokenizer
Use the index map built by the tokenizer to turn all index sequences into a one-hot (binary) matrix:
# Convert texts to a binary (one-hot) matrix
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

def one_hot_on_keras(samples):
    tokenizer = Tokenizer(num_words=30)
    tokenizer.fit_on_texts(samples)
    one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
    word_index = tokenizer.word_index
    print('Found %s unique tokens.' % len(word_index))
    print('Result Matrix Shape: ', one_hot_results.shape)
    return one_hot_results

print(one_hot_on_keras(samples))
The texts_to_matrix method produces a 0-1 matrix of shape [len(samples) x num_words]:
Found 9 unique tokens.
Result Matrix Shape: (2, 30)
[[0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.]]
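Besides mode='binary', texts_to_matrix also supports the 'count', 'freq', and 'tfidf' modes. A small sketch reusing the samples above (num_words=30 is just the same illustrative value):
# Word-count matrix: each cell holds how often the word occurs in the sample,
# so the two occurrences of 'the' in the first sentence show up as 2.
tokenizer = Tokenizer(num_words=30)
tokenizer.fit_on_texts(samples)
count_matrix = tokenizer.texts_to_matrix(samples, mode='count')
tfidf_matrix = tokenizer.texts_to_matrix(samples, mode='tfidf')
print(count_matrix.shape, tfidf_matrix.shape)  # both (2, 30)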
3. Converting numeric labels to one-hot vectors with to_categorical
In the handwritten digit (MNIST) example, each image's label is a digit from 0 to 9, giving ten classes. Training uses a softmax to decide which class has the highest probability, so the labels to be predicted need to be converted into one-hot form, which can be done with to_categorical:
# to_categorical: convert numeric labels to one-hot label vectors
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
print(train_labels[0:5])
print('Max Label:', max(train_labels), ' Min Label:', min(train_labels))
train_labels = to_categorical(train_labels)
print(train_labels[0:5])
As shown, the labels are converted from digit labels into one-hot labels that can be used for subsequent model training:
[5 0 4 1 9]
Max Label: 9 Min Label: 0
[[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]
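Conversely, the original digit labels can be recovered from the one-hot vectors with argmax, which is also how class predictions are usually read out of a softmax output. A minimal sketch, assuming the train_labels converted above:
import numpy as np

# Recover the digit labels from the one-hot rows
recovered = np.argmax(train_labels[0:5], axis=1)
print(recovered)  # [5 0 4 1 9]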
IV. Text Embeddings
Besides the basic one-hot representations above, word embeddings can also be used to vectorize text for training. Here an Embedding layer vectorizes the samples:
# Embedding layer
import numpy as np
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import models
from tensorflow.keras.layers import Dense, Flatten, Embedding

max_features = 10000  # vocabulary size
maxLen = 20  # truncate each review to 20 words

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# Pad/truncate into a 2-D array of shape (samples, maxLen)
x_train = pad_sequences(x_train, maxlen=maxLen)
x_test = pad_sequences(x_test, maxlen=maxLen)

# Build a model with an Embedding layer
model = models.Sequential()
# input dim: vocabulary size, output dim: embedding size, input_length: truncated sequence length
model.add(Embedding(10000, 8, input_length=maxLen))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

history = model.fit(x_train, y_train,
                    epochs=1,
                    batch_size=32,
                    validation_split=0.2)

weight = model.layers[0].get_weights()
print(x_train.shape)
# 1 * 10000 * 8
print(np.array(weight).shape)
Here the pad_sequences method introduced above standardizes the sample data; the output of the Embedding layer is flattened with Flatten and fed into a Dense layer for training, and finally the word vectors are obtained with get_weights(). Note that the weights have shape 1 x 10000 x 8:
10000 is the maximum number of features we defined and 8 is the predefined embedding dimension. Because the vocabulary contains 10,000 word indices, one 8-dimensional embedding is learned for each, so the Embedding layer can be viewed as a map: given a word index, it returns the corresponding embedding vector.
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 20, 8) 80000
_________________________________________________________________
flatten (Flatten) (None, 160) 0
_________________________________________________________________
dense (Dense) (None, 1) 161
=================================================================
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________
625/625 [==============================] - 1s 693us/step - loss: 0.6837 - accuracy: 0.5810 - val_loss: 0.6128 - val_accuracy: 0.7010
(25000, 20)
(1, 10000, 8)
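Since the Embedding layer is essentially a lookup table, a word index can be mapped directly to its 8-dimensional vector. A minimal sketch, assuming the model trained above (index 42 is just an arbitrary example):
# The first element of get_weights() is the (10000, 8) embedding matrix
embedding_matrix = model.layers[0].get_weights()[0]
print(embedding_matrix.shape)  # (10000, 8)

# The learned 8-dimensional embedding for the word whose index is 42
print(embedding_matrix[42])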
This only covers the basic Embedding-layer approach; there are also other methods such as autoencoders and word2vec, which may be implemented another time.
For more deep learning applied to recommendation algorithms, see the 深度学习导读 (Deep Learning Primer) column.