Natural language image search with a dual encoder

LiveVideoStack_

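How to build a dual encoder (also known as two-tower) neural network model to search for images using natural language.

Author / Khalid Salama

Original link / https://keras.io/examples/nlp/nl_image_search/

Introduction

This example demonstrates how to build a dual encoder (also known as two-tower) neural network model to search for images using natural language. The model is inspired by the CLIP approach introduced by Alec Radford et al. The idea is to train a vision encoder and a text encoder jointly, projecting the representations of images and their captions into the same embedding space so that each caption embedding lies near the embedding of the image it describes.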
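
To make the shared embedding space concrete, here is a minimal NumPy sketch of the retrieval step. It assumes the encoders already exist and uses small hand-written vectors in their place. The helper names (`l2_normalize`, `top_k_matches`) and the choice of cosine similarity as the measure of closeness are illustrative assumptions, not part of the original example.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def top_k_matches(query_embedding, image_embeddings, k=3):
    """Return indices and scores of the k images closest to the query."""
    query = l2_normalize(query_embedding)
    images = l2_normalize(image_embeddings)
    scores = images @ query  # one cosine similarity per image
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order, scores[order]


# Toy stand-ins for encoder outputs (hypothetical values, 3 images, 3 dimensions).
image_embeddings = np.array(
    [
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.7, 0.7, 0.0],
    ]
)
query_embedding = np.array([0.9, 0.1, 0.0])

indices, scores = top_k_matches(query_embedding, image_embeddings, k=2)
print(indices, scores)
```

In this toy setup the query points mostly along the first axis, so image 0 ranks first and the mixed image 2 ranks second.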

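This example requires TensorFlow 2.4 or higher. In addition, TensorFlow Hub and TensorFlow Text are required for the BERT model, and TensorFlow Addons is required for the AdamW optimizer. These libraries can be installed using the following command: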

pip install -q -U tensorflow-hub tensorflow-text tensorflow-addons
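
As a quick sanity check of the version requirement, the sketch below compares a version string against the 2.4 minimum. The helper `meets_minimum` is a hypothetical name introduced here for illustration. In a real session you would pass `tf.__version__` instead of the literal strings.

```python
import re


def meets_minimum(version, minimum=(2, 4)):
    """Return True if a version string such as '2.4.1' is at least `minimum`."""
    # Compare only the leading numeric major.minor components, as integers,
    # so that '2.10.0' is correctly treated as newer than '2.4.0'.
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups()) >= tuple(minimum)


print(meets_minimum("2.4.1"))
print(meets_minimum("2.3.0"))
```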

Setup

import os
import collections
import json
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_hub as hub
import tensorflow_text as text
import tensorflow_addons as tfa
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from tqdm import tqdm


# Suppressing tf.hub warnings
tf.get_logger().setLevel("ERROR")

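Prepare the data

We use the MS-COCO dataset to train our dual encoder model. MS-COCO contains over 82,000 images, each of which has at least 5 different caption annotations. The dataset is usually used for image captioning tasks, but we can repurpose the image-caption pairs to train a dual encoder model for image search.

Download and extract the data

First, download the dataset, which consists of two compressed folders: one with the images and the other with the associated image captions. Note that the compressed images folder is 13GB in size.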

root_dir = "datasets"
annotations_dir = os.path.join(root_dir, "annotations")
images_dir = os.path.join(root_dir, "train2014")
tfrecords_dir = os.path.join(root_dir, "tfrecords")
annotation_file = os.path.join(annotations_dir, "captions_train2014.json")


# Download caption annotation files
if not os.path.exists(annotations_dir):
    annotation_zip = tf.keras.utils.get_file(
        "captions.zip",
        cache_dir=os.path.abspath("."),
        origin="http://images.cocodataset.org/annotations/annotations_trainval2014.zip",
        extract=True,
    )
    os.remove(annotation_zip)


# Download image files
if not os.path.exists(images_dir):
    image_zip = tf.keras.utils.get_file(
        "train2014.zip",
        cache_dir=os.path.abspath("."),
        origin="http://images.cocodataset.org/zips/train2014.zip",
        extract=True,
    )
    os.remove(image_zip)


print("Dataset is downloaded and extracted successfully.")


with open(annotation_file, "r") as f:
    annotations = json.load(f)["annotations"]


image_path_to_caption = collections.defaultdict(list)
for element in annotations:
    caption = f"{element['caption'].lower().rstrip('.')}"
    image_path = images_dir + "/COCO_train2014_" + "%012d.jpg" % (element["image_id"])
    image_path_to_caption[image_path].append(caption)


image_paths = list(image_path_to_caption.keys())
print(f"Number of images: {len(image_paths)}")

Downloading data from http://images.cocodataset.org/annotations/annotations_trainval2014.zip
252878848/252872794 [==============================] - 5s 0us/step
Downloading data from http://images.cocodataset.org/zips/train2014.zip
13510574080/13510573713 [==============================] - 394s 0us/step
Dataset is downloaded and extracted successfully.
Number of images: 82783
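
The mapping from annotations to image paths can be tried without downloading anything. The sketch below repeats the same logic on a few made-up annotation records. These are hypothetical entries, not real MS-COCO data, and `build_caption_index` is a helper name introduced only for this illustration.

```python
import collections


def build_caption_index(annotations, images_dir):
    """Group normalized captions by the image file path they belong to."""
    index = collections.defaultdict(list)
    for element in annotations:
        # Lowercase the caption and drop any trailing period.
        caption = element["caption"].lower().rstrip(".")
        # File names zero-pad the image id to 12 digits.
        image_path = f"{images_dir}/COCO_train2014_{element['image_id']:012d}.jpg"
        index[image_path].append(caption)
    return index


# Hypothetical records in the same shape as the "annotations" list.
toy_annotations = [
    {"image_id": 42, "caption": "A dog runs on the beach."},
    {"image_id": 42, "caption": "Dog playing near the sea"},
    {"image_id": 7, "caption": "Two cats on a sofa."},
]

index = build_caption_index(toy_annotations, "datasets/train2014")
for path, captions in index.items():
    print(path, captions)
```

Captions that share an `image_id` end up in the same list, which is why the number of keys equals the number of distinct images rather than the number of captions.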

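Process and save the data to TFRecord files

You can change the train_size parameter to control how many image-caption pairs are used to train the dual encoder model. In this example, we set train_size to 30,000 images, which is about 35% of the dataset. We use 2 captions for each image, producing 60,000 image-caption pairs. The size of the training set affects the quality of the resulting encoders, but more examples also mean a longer training time.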
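
The numbers implied by this configuration can be checked with a short sketch. It uses the values from this section's configuration (including `valid_size` and `images_per_file`) and the image count printed after the download. The derived variable names such as `num_train_pairs` and `splits_are_disjoint` are introduced here only for illustration.

```python
import math

train_size = 30000
valid_size = 5000
captions_per_image = 2
images_per_file = 2000
total_images = 82783  # image count reported after loading the annotations

# Each training image contributes `captions_per_image` image-caption pairs.
num_train_pairs = train_size * captions_per_image  # 60000

# Images are split into shards of at most `images_per_file` images each.
num_train_files = math.ceil(train_size / images_per_file)  # 15
num_valid_files = math.ceil(valid_size / images_per_file)  # 3

# Training images come from the start of the list and validation images from
# the end, so the two sets cannot overlap while their sum fits in the dataset.
splits_are_disjoint = train_size + valid_size <= total_images  # True

print(num_train_pairs, num_train_files, num_valid_files, splits_are_disjoint)
```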

train_size = 30000
valid_size = 5000
captions_per_image = 2
images_per_file = 2000
train_image_paths = image_paths[:train_size]
num_train_files = int(np.ceil(train_size / images_per_file))
train_files_prefix = os.path.join(tfrecords_dir, "train")


valid_image_paths = image_paths[-valid_size:]
num_valid_files = int(np.ceil(valid_size / images_per_file))
valid_files_prefix = os.path.join(tfrecords_dir, "valid")


tf.io.gfile.makedirs(tfrecords_dir)




def bytes_feature(value):
    return tf.train.Feature(b