CV Practice: Image Captioning, Part 2

枭诗梦

Tags: neural networks

Article link: https://blog.csdn.net/weixin_44270537/article/details/108490504

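Part 1 truncated the given VGG16 model, fed the image data through it to predict image features, and saved those features to a pkl file. In a typical training pipeline this step is not written out as a separate file, because it takes up a lot of disk space; the extracted features are usually fed straight into the model for training. Doing the steps separately here makes the whole training process easier to understand.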

With image feature extraction done, the next step is text feature extraction.

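Neural network inputs and outputs are generally numbers, so the English word sequences must be converted to numbers before they are passed to the network. Put simply, we turn the words into a dictionary whose keys are words and whose values are integers. From the training-set image names and their corresponding captions we can build a tokenizer.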
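
To make the word-to-number dictionary concrete, here is a minimal plain-Python sketch. It is not the Keras Tokenizer itself: `build_word_index` and `encode` are hypothetical names used only for illustration, and the two sample captions are made up. Like the Keras tokenizer, it leaves index 0 unused (it is reserved for padding) and gives more frequent words smaller indices.

```python
from collections import Counter


def build_word_index(captions):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for caption in captions for word in caption.lower().split())
    # sort by descending frequency; ties are broken alphabetically (a choice of this sketch)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ordered, start=1)}


def encode(caption, word_index):
    """Turn a caption into a list of integers, skipping unknown words."""
    return [word_index[w] for w in caption.lower().split() if w in word_index]


# made-up captions, already wrapped in startseq / endseq
captions = [
    'startseq a dog runs endseq',
    'startseq a dog sleeps endseq',
]
word_index = build_word_index(captions)
print(word_index)
print(encode('startseq a dog runs endseq', word_index))
```

The real tokenizer also strips punctuation and lowercases by default, which this sketch only partly imitates.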

def load_img_names(filename):
    """
    Load a set of image names from a text file.
    Args:
        filename: text file in which each line holds one image file name (including the .jpg suffix)
    Returns:
        set of file names with the .jpg suffix removed
    """
    # load_doc is assumed to return the whole file as a single string
    doc = load_doc(filename)
    dataset = list()
    # process line by line
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # get the image identifier
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)


def load_clean_captions(filename, dataset):
    """Wrap each caption in 'startseq ' and ' endseq', which mark where caption generation starts and stops.
    Args:
        filename: text file in which each line holds an image name followed by its caption; the captions are already cleaned
        dataset: set of image names
    Returns:
        dict, key is the image name, value is the list of captions with 'startseq' and 'endseq' added
    """

    # load document
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # skip empty lines, which would otherwise raise an IndexError below
        if len(tokens) < 1:
            continue
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # create list
            if image_id not in descriptions:
                descriptions[image_id] = list()
            # wrap description in tokens
            desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
            # store
            descriptions[image_id].append(desc)
    return descriptions


def to_list(captions):
    """Flatten a dict (key: file name, value: list of captions) into one list of captions.
    Args:
        captions: dict, key is the file name, value is the list of captions
    Returns:
        list of captions
    """
    all_desc = list()
    for key in captions.keys():
        all_desc.extend(captions[key])
    return all_desc


from keras.preprocessing.text import Tokenizer


def creat_tokenizer():
    """
    Build a tokenizer from the training-set image names and their captions.
    :return: the fitted tokenizer
    """

    tokenizer = Tokenizer()
    # fit on the training split only
    train_img_names = tool.load_img_names('Flickr_8k.trainImages.txt')
    train_descriptions = tool.load_clean_captions('descriptions.txt', train_img_names)
    lines = tool.to_list(train_descriptions)
    tokenizer.fit_on_texts(lines)

    return tokenizer

With the tokenizer in hand, we need to build the LSTM inputs, that is, the training data, from each image's caption list and its image features.

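Both the text input and the output here are numbers: the input is the integer sequence produced by the tokenizer, and the output is the one-hot encoding of the next word, also derived from the tokenizer.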
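
The way one caption expands into several training pairs can be sketched in plain Python. This is an illustration only: `make_pairs` is a hypothetical name, the integer sequence is made up, and the manual padding and one-hot steps stand in for the Keras helpers pad_sequences and to_categorical.

```python
def make_pairs(seq, max_len, vocab_size):
    """Split one encoded caption into (padded prefix, one-hot next word) pairs.

    Assumes len(seq) <= max_len + 1, so no prefix needs truncating.
    """
    pairs = []
    for i in range(1, len(seq)):
        prefix, next_word = seq[:i], seq[i]
        # left-pad with zeros to a fixed length (pad_sequences also pads on the left by default)
        padded = [0] * (max_len - len(prefix)) + prefix
        # one-hot vector with a single 1 at the index of the next word
        one_hot = [1 if j == next_word else 0 for j in range(vocab_size)]
        pairs.append((padded, one_hot))
    return pairs


# hypothetical encoding of "startseq a dog runs endseq"
seq = [4, 1, 2, 5, 3]
pairs = make_pairs(seq, max_len=6, vocab_size=7)
for padded, one_hot in pairs:
    print(padded, '->', one_hot)
```

A caption of n tokens yields n - 1 pairs, and each pair is stored together with the same image feature vector.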

from numpy import array
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical


def create_batches(desc_list, photo_features, tokenizer, max_len, vocab_size=7378):
    """Build one group of LSTM inputs from an image's caption list and its features.

    Args:
        desc_list: the captions belonging to one image (a list)
        photo_features: the features of that image
        tokenizer: keras.preprocessing.text.Tokenizer, converts between English words and integers
        max_len: length of the longest caption in the training set
        vocab_size: number of words in the training set, 7378 by default

    Returns:
        tuple:
            first element: array of image features
            second element: array of caption prefixes
            third element: array of next words (predicted from the image features and the caption prefix)
    """
    X1, X2, y = list(), list(), list()
    # walk through each description for the image
    for desc in desc_list:
        # encode the sequence
        seq = tokenizer.texts_to_sequences([desc])[0]
        # split one sequence into multiple X,y pairs
        for i in range(1, len(seq)):
            # split into input and output pair
            in_seq, out_seq = seq[:i], seq[i]
            # pad input sequence
            in_seq = pad_sequences([in_seq], maxlen=max_len)[0]
            # encode output sequence
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            # store
            X1.append(photo_features)
            X2.append(in_seq)
            y.append(out_seq)
    return array(X1), array(X2), array(y)

[Figures: image input, text input, and output]

Next we create a data generator. I train the model with fit_generator, whose first argument is a generator.

def data_generator(captions, photo_features, tokenizer, max_len):
    """Create a training data generator, passed as the first argument of model.fit_generator(generator, ...).

    Args:
        captions: dict, key is the image name (without the .jpg suffix), value is the list of training captions for that image
        photo_features: dict, key is the image name (without the .jpg suffix), value is the image features
        tokenizer: keras.preprocessing.text.Tokenizer, converts between English words and integers
        max_len: length of the longest caption in the training set

    Returns:
        generator that yields ({'input_1': image features, 'input_2': caption prefixes}, {'dense_2': expected next word of the caption})

    """
    # loop forever over images
    while True:
        for key, desc_list in captions.items():
            # retrieve the photo feature
            photo_feature = photo_features[key][0]
            in_img, in_seq, out_word = create_batches(desc_list, photo_feature, tokenizer, max_len)
            # one yield is one batch: all prefix/next-word pairs of a single image
            yield ({'input_1': in_img, 'input_2': in_seq}, {'dense_2': out_word})
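
The generator above never terminates on its own, so the caller decides how many batches make up an epoch. The plain-Python sketch below shows that idea with made-up data; `endless_batches` is a hypothetical stand-in that yields only the image key and caption count instead of real arrays.

```python
from itertools import islice


def endless_batches(captions):
    """Yield one (key, number_of_captions) item per image, looping forever."""
    while True:
        for key, desc_list in captions.items():
            yield key, len(desc_list)


# made-up data: three images with a few captions each
captions = {'img_a': ['c1', 'c2'], 'img_b': ['c3'], 'img_c': ['c4', 'c5', 'c6']}
steps_per_epoch = len(captions)

gen = endless_batches(captions)
epochs_seen = []
for epoch in range(2):
    # draw exactly steps_per_epoch items, as a training loop would per epoch
    seen = [key for key, _ in islice(gen, steps_per_epoch)]
    epochs_seen.append(seen)
    print(epoch, seen)
```

Setting the number of steps to the number of images means each epoch visits every image exactly once.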

Now we create a new network model that generates captions for images.

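from keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from keras.models import Model


def caption_model(vocab_size, max_len):
    """
    Args:
        vocab_size: number of caption words in the training set
        max_len: length of the longest caption in the training set

    Returns:
        the network model that generates captions for images

    """
    # image branch: 4096-dimensional VGG16 feature vector
    inputs1 = Input(shape=(4096,))   # the trailing comma makes this a tuple
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # text branch: padded caption prefix
    inputs2 = Input(shape=(max_len,))
    # Embedding layer: turns positive integers into dense vectors of fixed size,
    # e.g. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]; it can only be used as the first layer in a model
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # merge both branches by element-wise addition, then predict the next word
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')

    model.summary()
    return model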
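
As a rough cross-check of what model.summary() should report, the trainable parameter counts can be worked out by hand. The sketch below uses the standard formulas for Dense, Embedding, and LSTM layers; the helper names are hypothetical, and vocab_size=7378 is taken from the default in create_batches, so it is an assumption about the actual vocabulary. Dropout and the add merge have no trainable parameters.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def lstm_params(n_in, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (n_in + units) + units)


def caption_model_params(vocab_size, feature_dim=4096, hidden=256):
    """Per-layer trainable parameter counts for the caption model's layout."""
    return {
        'image_dense': dense_params(feature_dim, hidden),
        'embedding': vocab_size * hidden,
        'lstm': lstm_params(hidden, hidden),
        'decoder_dense': dense_params(hidden, hidden),
        'output_dense': dense_params(hidden, vocab_size),
    }


counts = caption_model_params(vocab_size=7378)
for name, n in counts.items():
    print(f'{name:>14}: {n:,}')
print(f'{"total":>14}: {sum(counts.values()):,}')
```

Under these assumptions the embedding and the output layer, both proportional to the vocabulary size, account for most of the parameters.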

Write the training function and train the model.

from pickle import load

# `tool` is the author's helper module; load_photo_features and get_max_length are not shown in this post


def train():
    # load training dataset (6K)
    filename = 'Flickr_8k.trainImages.txt'
    train_names = tool.load_img_names(filename)
    print('Dataset: %d' % len(train_names))
    train_captions = tool.load_clean_captions('descriptions.txt', train_names)
    print('Captions: train number=%d' % len(train_captions))
    # photo features
    train_features = tool.load_photo_features('../features.pkl', train_names)
    print('Photos: train=%d' % len(train_features))
    # prepare tokenizer
    with open('tokenizer.pkl', 'rb') as f:
        tokenizer = load(f)
    # +1 because index 0 is reserved for padding
    vocab_size = len(tokenizer.word_index) + 1
    print('Vocabulary Size: %d' % vocab_size)
    # determine the maximum sequence length
    max_len = tool.get_max_length(train_captions)
    print('Description Length: %d' % max_len)

    # define the model
    model = caption_model(vocab_size, max_len)
    # train the model, run epochs manually and save after each epoch
    epochs = 20
    # one step per image, so each epoch passes over every image once
    steps = len(train_captions)
    for i in range(epochs):
        # create the data generator
        generator = data_generator(train_captions, train_features, tokenizer, max_len)
        # fit for one epoch
        model.fit_generator(generator, epochs=1, steps_per_epoch=steps, verbose=1)
        # save model
        model.save('model_' + str(i) + '.h5')


if __name__ == '__main__':
    train()
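
The helper tool.get_max_length used in train() is not shown in this post. A minimal sketch, assuming it simply returns the word count of the longest caption in the dict, might look like the following; the sample captions are made up.

```python
def get_max_length(captions):
    """Return the number of words in the longest caption.

    captions: dict, key is the image name, value is a list of caption strings.
    """
    return max(len(desc.split()) for descs in captions.values() for desc in descs)


# made-up captions, already wrapped in startseq / endseq
captions = {
    'img_a': ['startseq a dog runs endseq'],
    'img_b': ['startseq two children play in the park endseq'],
}
max_len = get_max_length(captions)
print(max_len)
```

This value fixes the width of the padded prefix input, so it has to be computed the same way at training time and at caption generation time.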

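We set 20 epochs, so the data passes through the model 20 times.
[Figure: partial training log]
Only part of the log is shown here. The loss is decreasing steadily, so the model is converging. I have not optimized the training, since I am not yet comfortable with that part; adding attention, using more varied data, adjusting the network structure, or tuning the hyperparameters could all improve the results. Training is very slow, and I kept wishing I had a GPU: one epoch takes nearly an hour on average QAQ. The road of learning is long.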
