Machine Learning CV Coding Exercise (5), Image Captioning: Text Preprocessing with keras.preprocessing.text.Tokenizer and Merging Text and Image Data into LSTM Inputs
Processing text with keras.preprocessing.text.Tokenizer
The preprocessing method used here is keras.preprocessing.text.Tokenizer, whose constructor signature is:
tf.keras.preprocessing.text.Tokenizer(
num_words=None,
filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
lower=True,
split=" ",
char_level=False,
oov_token=None,
document_count=0,
**kwargs
)
Parameters:
1. num_words: the maximum number of words to keep, based on word frequency. Only the num_words-1 most common words are kept.
2. filters: a string where each character is a character that will be filtered out of the texts. Defaults to all punctuation, plus tabs and line breaks, minus the ' character.
3. lower: boolean. Whether to convert the texts to lowercase.
4. split: string. Separator used to split the texts into words.
5. char_level: if True, every character is treated as a token.
6. oov_token: if given, it is added to word_index and used to replace out-of-vocabulary words during texts_to_sequences calls (see the sketch after the basic-usage example below).
- Basic usage:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
lines = ['this is good', 'that is a cat']
tokenizer.fit_on_texts(lines)  # fit on the training texts -- accepts a list of strings
results = tokenizer.texts_to_sequences(['cat is good'])  # convert texts to integer sequences; returns a list of lists
print(results[0])  # results is 2-D, so print the first (and only) row
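To make the oov_token parameter from the list above concrete, here is a small sketch; the '<unk>' token string is my own choice, not part of the exercise:

from keras.preprocessing.text import Tokenizer

lines = ['this is good', 'that is a cat']
tokenizer = Tokenizer(oov_token='<unk>')  # '<unk>' is an arbitrary placeholder token
tokenizer.fit_on_texts(lines)
print(tokenizer.word_index)  # every word seen during fitting, plus '<unk>', each mapped to an integer
print(tokenizer.texts_to_sequences(['my cat is good']))  # 'my' was never seen, so it maps to the '<unk>' index

Without oov_token, unseen words are silently dropped from the output sequences, which shortens them; with it, every input word produces exactly one index.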
The creat_tokenizer function for text preprocessing, and the create_input_data function that merges text and image data into LSTM model inputs
- LSTM model inputs:
* Complete preprocessing script (a sketch of the network these inputs feed into follows the script):
import util
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from numpy import array
from pickle import load
def creat_tokenizer():
    '''
    Build a Tokenizer from the image names in the training set and their corresponding captions.
    :return: the fitted tokenizer
    '''
    train_image_names = util.load_doc('Flickr_8k.trainImages.txt')  # read the list of training image names
    train_descriptions = util.load_clean_captions('descriptions.txt', train_image_names)  # read the captions into a dict (image name: list of captions)
    # print(train_descriptions['1000268201_693b08cb0e'])
    # '1000268201_693b08cb0e':
    # ['startseq A child in a pink dress is climbing up a set of stairs in an entry way . endseq',
    #  'startseq A girl going into a wooden building . endseq',
    #  'startseq A little girl climbing into a wooden playhouse . endseq',
    #  'startseq A little girl climbing the stairs to her playhouse . endseq',
    #  'startseq A little girl in a pink dress going into a wooden cabin . endseq']
    tokenizer = Tokenizer()
    lines = util.to_list(train_descriptions)  # flatten the caption dict into a plain Python list of strings
    print(lines[0])  # startseq A child in a pink dress is climbing up a set of stairs in an entry way . endseq
    tokenizer.fit_on_texts(lines)  # fit on the training captions -- accepts a list of strings
    results = tokenizer.texts_to_sequences(lines)  # convert the captions into integer sequences; returns a list of lists
    print(results[0])  # [2, 1, 43, 4, 1, 88, 172, 7, 117, 53, 1, 390, 12, 397, 4, 28, 4450, 629, 3]
    return tokenizer  # return the tokenizer itself so it can be pickled and reused below
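# Note: the __main__ block below unpickles a fitted tokenizer from 'tokenizer.pkl'.
# That file is not created in this script; a plausible (hypothetical) way to
# produce it from creat_tokenizer() would be:
#
#     from pickle import dump
#     dump(creat_tokenizer(), open('tokenizer.pkl', 'wb'))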
def create_input_data(tokenizer, max_length, descriptions, photos_features, vocab_size):
    """
    Build a set of LSTM inputs from the given image captions and image features.
    Args:
    :param tokenizer: keras.preprocessing.text.Tokenizer that converts English words to integers
    :param max_length: length of the longest caption in the training set
    :param descriptions: dict, key is the image name (without the .jpg suffix), value is a list of several different captions of that image
    :param photos_features: dict, key is the image name (without the .jpg suffix), value is a numpy array holding that image's features
    :param vocab_size: number of words in the training-set vocabulary
    :return: tuple:
        first element: numpy array whose elements are the image features (each repeated once per caption prefix), themselves numpy.arrays
        second element: numpy array whose elements are the caption prefixes (first word, first word + second word, ...), themselves numpy.arrays
        third element: numpy array whose elements are the next word after each prefix (given the image features and the prefix), a one-hot encoding of length vocab_size, also numpy.arrays
    (array([[ 0.434, 0.534, 0.212, 0.98 ],
            [ 0.434, 0.534, 0.212, 0.98 ],
            [ 0.434, 0.534, 0.212, 0.98 ],
            [ 0.434, 0.534, 0.212, 0.98 ],
            [ 0.434, 0.534, 0.212, 0.98 ],
            [ 0.434, 0.534, 0.212, 0.98 ],
            [ 0.434, 0.534, 0.212, 0.98 ],
            [ 0.434, 0.534, 0.212, 0.98 ],
            [ 0.434, 0.534, 0.212, 0.98 ],
            [ 0.434, 0.534, 0.212, 0.98 ],
            [ 0.534, 0.634, 0.712, 0.28 ],
            [ 0.534, 0.634, 0.712, 0.28 ],
            [ 0.534, 0.634, 0.712, 0.28 ],
            [ 0.534, 0.634, 0.712, 0.28 ],
            [ 0.534, 0.634, 0.712, 0.28 ],
            [ 0.534, 0.634, 0.712, 0.28 ],
            [ 0.534, 0.634, 0.712, 0.28 ],
            [ 0.534, 0.634, 0.712, 0.28 ],
            [ 0.534, 0.634, 0.712, 0.28 ],
            [ 0.534, 0.634, 0.712, 0.28 ],
            [ 0.534, 0.634, 0.712, 0.28 ]]),
     array([[ 0, 0, 0, 0, 0, 2],
            [ 0, 0, 0, 0, 2, 59],
            [ 0, 0, 0, 2, 59, 254],
            [ 0, 0, 2, 59, 254, 6],
            [ 0, 2, 59, 254, 6, 134],
            [ 0, 0, 0, 0, 0, 2],
            [ 0, 0, 0, 0, 2, 26],
            [ 0, 0, 0, 2, 26, 254],
            [ 0, 0, 2, 26, 254, 6],
            [ 0, 2, 26, 254, 6, 134],
            [ 0, 0, 0, 0, 0, 2],
            [ 0, 0, 0, 0, 2, 59],
            [ 0, 0, 0, 2, 59, 16],
            [ 0, 0, 2, 59, 16, 82],
            [ 0, 2, 59, 16, 82, 24],
            [ 0, 0, 0, 0, 0, 2],
            [ 0, 0, 0, 0, 2, 59],
            [ 0, 0, 0, 2, 59, 16],
            [ 0, 0, 2, 59, 16, 165],
            [ 0, 2, 59, 16, 165, 127],
            [ 2, 59, 16, 165, 127, 24]]),
     array([[ 0., 0., 0., ..., 0., 0., 0.],
            [ 0., 0., 0., ..., 0., 0., 0.],
            [ 0., 0., 0., ..., 0., 0., 0.],
            ...,
            [ 0., 0., 0., ..., 0., 0., 0.],
            [ 0., 0., 0., ..., 0., 0., 0.],
            [ 0., 0., 0., ..., 0., 0., 0.]]))
    """
    X1, X2, y = list(), list(), list()
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            seq = tokenizer.texts_to_sequences([desc])[0]
            # every prefix of the caption is one training sample; it predicts the next word
            for i in range(1, len(seq)):
                in_seq, out_seq = seq[:i], seq[i]
                # pad in_seq so that its length is max_length
                in_seq = pad_sequences([in_seq], maxlen=max_length)[0]  # truncate or left-pad each prefix
                out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]  # integer word index to a one-hot class vector
                X1.append(photos_features[key])
                X2.append(in_seq)
                y.append(out_seq)
    return array(X1), array(X2), array(y)
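# Quick sanity check of the two helpers used above (illustrative values of my own):
#     pad_sequences([[2, 59]], maxlen=6)  -> [[0, 0, 0, 0, 2, 59]]   (left-pads with zeros by default)
#     to_categorical([3], num_classes=5)  -> [[0., 0., 0., 1., 0.]]  (integer label to a one-hot row)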
if __name__ == "__main__":
    # creat_tokenizer()
    tokenizer = load(open('tokenizer.pkl', 'rb'))
    max_length = 6
    descriptions = {'1235345': ['startseq one bird on tree endseq', "startseq red bird on tree endseq"],
                    '1234546': ['startseq one boy play water endseq', "startseq one boy run across water endseq"]}
    photo_features = {'1235345': [0.434, 0.534, 0.212, 0.98],
                      '1234546': [0.534, 0.634, 0.712, 0.28]}
    vocab_size = 7378
    print(create_input_data(tokenizer, max_length, descriptions, photo_features, vocab_size))
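The script above only builds the (X1, X2, y) training arrays; the network itself never appears in it. For completeness, below is a minimal sketch of the merge-style captioning model such inputs are commonly fed into. All layer sizes (256 units, 0.5 dropout) and the feature_dim default are my assumptions, not part of this exercise; with the toy 4-dimensional features used in __main__ you would pass feature_dim=4.

from keras.models import Model
from keras.layers import Input, Dense, Embedding, LSTM, Dropout, add

def define_model(vocab_size, max_length, feature_dim=4096):  # 4096 assumes VGG16 fc-layer features
    # image branch: project the CNN feature vector (X1) into a 256-d space
    inputs1 = Input(shape=(feature_dim,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # text branch: embed the padded caption prefix (X2) and summarize it with an LSTM
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # merge the two branches and predict the next word (y) as a vocab_size-way softmax
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model

Training then pairs directly with create_input_data: model.fit([X1, X2], y, epochs=..., batch_size=...).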