Machine Learning CV Coding Exercise (5), Image Captioning: Text Preprocessing with keras.preprocessing.text.Tokenizer and Merging Text and Image Data into LSTM Inputs
Processing text with keras.preprocessing.text.Tokenizer
The preprocessing method used here is keras.preprocessing.text.Tokenizer, whose constructor signature is:
tf.keras.preprocessing.text.Tokenizer(
num_words=None,
filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
lower=True,
split=" ",
char_level=False,
oov_token=None,
document_count=0,
**kwargs
)
Parameters:
1. num_words: the maximum number of words to keep, based on word frequency. Only the num_words-1 most common words are kept.
2. filters: a string where each character is a character that will be filtered out of the texts. Defaults to all punctuation, plus tabs and line breaks, minus the ' character.
3. lower: boolean. Whether to convert the texts to lowercase.
4. split: string. Separator used to split the texts into words.
5. char_level: if True, every character is treated as a token.
6. oov_token: if given, it is added to word_index and used to replace out-of-vocabulary words during texts_to_sequences calls (see the sketch after the basic-usage example below).
- Basic usage:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
lines = ['this is good', 'that is a cat']
tokenizer.fit_on_texts(lines)  # fit on the training texts -- accepts a list of strings
results = tokenizer.texts_to_sequences(['cat is good'])  # convert texts to integer sequences; returns a list of lists
print(results[0])  # results is 2-D, so print the first (and only) row
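To make the oov_token parameter from the list above concrete, here is a small sketch; the '<unk>' token string is my own choice, not part of the exercise:

from keras.preprocessing.text import Tokenizer

lines = ['this is good', 'that is a cat']
tokenizer = Tokenizer(oov_token='<unk>')  # '<unk>' is an arbitrary placeholder token
tokenizer.fit_on_texts(lines)
print(tokenizer.word_index)  # every word seen during fitting, plus '<unk>', each mapped to an integer
print(tokenizer.texts_to_sequences(['my cat is good']))  # 'my' was never seen, so it maps to the '<unk>' index

Without oov_token, unseen words are silently dropped from the output sequences, which shortens them; with it, every input word produces exactly one index.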
The creat_tokenizer function for text preprocessing, and the create_input_data function that merges text and image data into LSTM model inputs
- LSTM model inputs:
* Complete preprocessing script (a sketch of the network these inputs feed into follows the script):
import util
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from numpy import array
from pickle import load
def creat_tokenizer():
    '''
    Build a Tokenizer from the image names in the training set and their corresponding captions.
    :return: the fitted tokenizer
    '''
    train_image_names = util.load_doc('Flickr_8k.trainImages.txt')  # read the list of training image names
    train_descriptions = util.load_clean_captions('descriptions.txt', train_image_names)  # read the captions into a dict (image name: list of captions)
    # print(train_descriptions['1000268201_693b08cb0e'])
    # '1000268201_693b08cb0e':
    # ['startseq A child in a pink dress is climbing up a set of stairs in an entry way . endseq',
    #  'startseq A girl going into a wooden building . endseq',
    #  'startseq A little girl climbing into a wooden playhouse . endseq',
    #  'startseq A little girl climbing the stairs to her playhouse . endseq',
    #  'startseq A little girl in a pink dress going into a wooden cabin . endseq']
    tokenizer = Tokenizer()
    lines = util.to_list(train_descriptions)  # flatten the caption dict into a plain Python list of strings
    print(lines[0])  # startseq A child in a pink dress is climbing up a set of stairs in an entry way . endseq
    tokenizer.fit_on_texts(lines)  # fit on the training captions -- accepts a list of strings
    results = tokenizer.texts_to_sequences(lines)  # convert the captions into integer sequences; returns a list of lists
    print(results[0])  # [2, 1, 43, 4, 1, 88, 172, 7, 117, 53, 1, 390, 12, 397, 4, 28, 4450, 629, 3]
    return tokenizer  # return the tokenizer itself so it can be pickled and reused below
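# Note: the __main__ block below unpickles a fitted tokenizer from 'tokenizer.pkl'.
# That file is not created in this script; a plausible (hypothetical) way to
# produce it from creat_tokenizer() would be:
#
#     from pickle import dump
#     dump(creat_tokenizer(), open('tokenizer.pkl', 'wb'))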
def create_input_data(tokenizer, max_length, descriptions, photos_features, vocab_size):
    """
    Build a set of LSTM inputs from the given image captions and image features.
    Args:
    :param tokenizer: keras.preprocessing.text.Tokenizer that converts English words to integers
    :param max_length: length of the longest caption in the training set
    :param descriptions: dict, key is the image name (without the .jpg suffix), value is a list of several different captions of that image
    :param photos_features: dict, key is the image name (without the .jpg suffix), value is a numpy array holding that image's features
    :param vocab_size: number of words in the training-set vocabulary
    :return: tuple:
        first element: numpy array whose elements are the image features (each repeated once per caption prefix), themselves numpy.arrays
        second element: numpy array whose elements are the caption prefixes (first word, first word + second word, ...), themselves numpy.arrays
        third element: numpy array whose elements are the next word after each prefix (given the image features and the prefix), a one-hot encoding of length vocab_size, also numpy.arrays
    (array([[ 0.434, 0.534, 0.212, 0.98 ],
            [ 0.434, 0.534, 0.212, 0.98 ],
            [ 0.434, 0.534, 0.212, 0.98 ],
            [ 0.434, 0.534, 0.212, 0.98 ],
            [ 0.434, 0.534, 0.212, 0.98 ],
            [ 0.434, 0.534, 0.212, 0.98 ],
            [ 0.434, 0.534, 0.212, 0.98 ],
            [ 0.434, 0.534, 0.212, 0.98 ],
            [ 0.434, 0.534, 0.212, 0.98 ],
            [ 0.434, 0.534, 0.212, 0.98 ],
            [ 0.534, 0.634, 0.712, 0.28 ],
            [ 0.534, 0.634, 0.712, 0.28 ],
            [ 0.534, 0.634, 0.712, 0.28 ],
            [ 0.534, 0.634, 0.712, 0.28 ],
            [ 0.534, 0.634, 0.712, 0.28 ],
            [ 0.534, 0.634, 0.712, 0.28 ],
            [ 0.534, 0.634, 0.712, 0.28 ],
            [ 0.534, 0.634, 0.712, 0.28 ],
            [ 0.534, 0.634, 0.712, 0.28 ],
            [ 0.534, 0.634, 0.712, 0.28 ],
            [ 0.534, 0.634, 0.712, 0.28 ]]),
     array([[ 0, 0, 0, 0, 0, 2],
            [ 0, 0, 0, 0, 2, 59],
            [ 0, 0, 0, 2, 59, 254],
            [ 0, 0, 2, 59, 254, 6],
            [ 0, 2, 59, 254, 6, 134],
            [ 0, 0, 0, 0, 0, 2],
            [ 0, 0, 0, 0, 2, 26],
            [ 0, 0, 0, 2, 26, 254],
            [ 0, 0, 2, 26, 254, 6],
            [ 0, 2, 26, 254, 6, 134],
            [ 0, 0, 0, 0, 0, 2],
            [ 0, 0, 0, 0, 2, 59],
            [ 0, 0, 0, 2, 59, 16],
            [ 0, 0, 2, 59, 16, 82],
            [ 0, 2, 59, 16, 82, 24],
            [ 0, 0, 0, 0, 0, 2],
            [ 0, 0, 0, 0, 2, 59],
            [ 0, 0, 0, 2, 59, 16],
            [ 0, 0, 2, 59, 16, 165],
            [ 0, 2, 59, 16, 165, 127],
            [ 2, 59, 16, 165, 127, 24]]),
     array([[ 0., 0., 0., ..., 0., 0., 0.],
            [ 0., 0., 0., ..., 0., 0., 0.],
            [ 0., 0., 0., ..., 0., 0., 0.],
            ...,
            [ 0., 0., 0., ..., 0., 0., 0.],
            [ 0., 0., 0., ..., 0., 0., 0.],
            [ 0., 0., 0., ..., 0., 0., 0.]]))
    """
    X1, X2, y = list(), list(), list()
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            seq = tokenizer.texts_to_sequences([desc])[0]
            # every prefix of the caption is one training sample; it predicts the next word
            for i in range(1, len(seq)):
                in_seq, out_seq = seq[:i], seq[i]
                # pad in_seq so that its length is max_length
                in_seq = pad_sequences([in_seq], maxlen=max_length)[0]  # truncate or left-pad each prefix
                out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]  # integer word index to a one-hot class vector
                X1.append(photos_features[key])
                X2.append(in_seq)
                y.append(out_seq)
    return array(X1), array(X2), array(y)
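# Quick sanity check of the two helpers used above (illustrative values of my own):
#     pad_sequences([[2, 59]], maxlen=6)  -> [[0, 0, 0, 0, 2, 59]]   (left-pads with zeros by default)
#     to_categorical([3], num_classes=5)  -> [[0., 0., 0., 1., 0.]]  (integer label to a one-hot row)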
if __name__ == "__main__":
    # creat_tokenizer()
    tokenizer = load(open('tokenizer.pkl', 'rb'))
    max_length = 6
    descriptions = {'1235345': ['startseq one bird on tree endseq', "startseq red bird on tree endseq"],
                    '1234546': ['startseq one boy play water endseq', "startseq one boy run across water endseq"]}
    photo_features = {'1235345': [0.434, 0.534, 0.212, 0.98],
                      '1234546': [0.534, 0.634, 0.712, 0.28]}
    vocab_size = 7378
    print(create_input_data(tokenizer, max_length, descriptions, photo_features, vocab_size))
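The script above only builds the (X1, X2, y) training arrays; the network itself never appears in it. For completeness, below is a minimal sketch of the merge-style captioning model such inputs are commonly fed into. All layer sizes (256 units, 0.5 dropout) and the feature_dim default are my assumptions, not part of this exercise; with the toy 4-dimensional features used in __main__ you would pass feature_dim=4.

from keras.models import Model
from keras.layers import Input, Dense, Embedding, LSTM, Dropout, add

def define_model(vocab_size, max_length, feature_dim=4096):  # 4096 assumes VGG16 fc-layer features
    # image branch: project the CNN feature vector (X1) into a 256-d space
    inputs1 = Input(shape=(feature_dim,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # text branch: embed the padded caption prefix (X2) and summarize it with an LSTM
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # merge the two branches and predict the next word (y) as a vocab_size-way softmax
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model

Training then pairs directly with create_input_data: model.fit([X1, X2], y, epochs=..., batch_size=...).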