How to get multimodal embeddings from a CLIP model?

营赢盈英

Published 2024-07-26 10:05:26

Article link: https://blog.csdn.net/suiusoar/article/details/140706892

Background:

I'm hoping to use CLIP to get a single embedding for each row of multimodal (image and text) data.

Say I have the following model:

from PIL import Image
import torch
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()  # inference only

def convert_image_data_to_tensor(image_data):
    # Assumes image_data is array-like pixel data in a layout the processor accepts
    return torch.tensor(image_data)

# df is assumed to be a pandas DataFrame with 'image_data' and 'text_data' columns
dataset = df[['image_data', 'text_data']].to_dict('records')

embeddings = []
for data in dataset:
    image_tensor = convert_image_data_to_tensor(data['image_data'])
    text = data['text_data']

    # return_tensors must name a framework ("pt" for PyTorch), not True;
    # padding/truncation keep the text within CLIP's maximum token length
    inputs = processor(text=text, images=image_tensor, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        output = model(**inputs)

I want to get the embeddings calculated in output. I know that output has the attributes text_embeds and image_embeds, but I'm not sure how they interact later on. If I want a single embedding for each record, should I just concatenate these attributes? Is there another attribute that combines the two in some other way?

These are the attributes stored in output:

print(dir(output))

['__annotations__', '__class__', '__contains__', '__dataclass_fields__', '__dataclass_params__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__post_init__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'clear', 'copy', 'fromkeys', 'get', 'image_embeds', 'items', 'keys', 'logits_per_image', 'logits_per_text', 'loss', 'move_to_end', 'pop', 'popitem', 'setdefault', 'text_embeds', 'text_model_output', 'to_tuple', 'update', 'values', 'vision_model_output']

Also, is there a way to specify the size of the embedding that CLIP outputs, similar to how you can specify the embedding size in BERT configs?

Thanks in advance for any help. Feel free to correct me if I'm misunderstanding anything critical here.

Solution:

CLIP (Contrastive Language-Image Pre-training) is trained so that the text and image embeddings are projected onto a shared latent space. In fact, image-text similarity is what the model is trained to optimise.
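
To make the shared space concrete, here is a minimal sketch of comparing image and text embeddings by cosine similarity. The random tensors are stand-ins for output.image_embeds and output.text_embeds (512-dimensional for the ViT-B/32 checkpoint), and the helper name cosine_similarity_matrix is made up for this illustration.

```python
import torch
import torch.nn.functional as F

def cosine_similarity_matrix(image_embeds: torch.Tensor,
                             text_embeds: torch.Tensor) -> torch.Tensor:
    """Return an (n_images, n_texts) matrix of cosine similarities."""
    # L2-normalise so the dot product equals the cosine similarity
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    return image_embeds @ text_embeds.T

# Random stand-ins for output.image_embeds / output.text_embeds (assumption)
torch.manual_seed(0)
image_embeds = torch.randn(3, 512)
text_embeds = torch.randn(3, 512)

sims = cosine_similarity_matrix(image_embeds, text_embeds)
print(sims.shape)  # torch.Size([3, 3])
```

With real CLIP outputs, entry (i, j) would score how well image i matches text j. The logits_per_image attribute holds essentially the same similarities multiplied by a learned temperature.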

So a very typical use case of CLIP is to compare and match images and text by similarity. In your case, you don't seem to be interested in any measure of similarity: you already have an image and its text and want a joint embedding. There is no attribute on output that fuses the two; logits_per_image and logits_per_text are similarity scores, not embeddings. So concatenating text_embeds and image_embeds, as you described, is fine. An alternative is to take their mean: since both live in the same embedding space this is a reasonable thing to do, and it keeps the original dimensionality.
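
The two options can be sketched as follows. This is an illustration, not the only reasonable recipe: combine_embeddings is a hypothetical helper, the random tensors stand in for output.image_embeds and output.text_embeds, and re-normalising the mean is a design choice made here rather than something CLIP requires.

```python
import torch
import torch.nn.functional as F

def combine_embeddings(image_embeds: torch.Tensor,
                       text_embeds: torch.Tensor,
                       how: str = "concat") -> torch.Tensor:
    """Fuse per-record image and text embeddings into one vector per record."""
    # Put both modalities on the same scale before fusing
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    if how == "concat":
        # Keeps both vectors intact; the dimensionality doubles
        return torch.cat([image_embeds, text_embeds], dim=-1)
    if how == "mean":
        # Stays in the shared space; the dimensionality is unchanged
        return F.normalize((image_embeds + text_embeds) / 2, dim=-1)
    raise ValueError(f"unknown method: {how!r}")

# Random stand-ins for the real CLIP outputs (assumption)
torch.manual_seed(0)
image_embeds = torch.randn(4, 512)
text_embeds = torch.randn(4, 512)

concat = combine_embeddings(image_embeds, text_embeds, how="concat")
mean = combine_embeddings(image_embeds, text_embeds, how="mean")
print(concat.shape)  # torch.Size([4, 1024])
print(mean.shape)    # torch.Size([4, 512])
```

In the loop from the question, the result for each record could be appended to the embeddings list. Concatenation preserves everything but doubles the size; the mean is more compact but blends the two modalities.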

As for the size of the embedding, I don't think there is a way to change it for a pretrained checkpoint: it is fixed by the architecture when the model is trained (512 dimensions for openai/clip-vit-base-patch32). You could apply a dimensionality reduction technique such as PCA, or stack another fully connected layer with the dimensionality of your choice on top and fine-tune the model.
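
Both suggestions can be sketched on stand-in data. The target size of 64 is arbitrary, pca_reduce is a hypothetical helper built on torch.pca_lowrank, and the linear layer shown is untrained, so its outputs only become meaningful after fine-tuning on a task of your own.

```python
import torch
from torch import nn

def pca_reduce(embeddings: torch.Tensor, out_dim: int) -> torch.Tensor:
    """Project embeddings onto their top out_dim principal components."""
    mean = embeddings.mean(dim=0, keepdim=True)
    # pca_lowrank returns (U, S, V); the columns of V are principal directions
    _, _, v = torch.pca_lowrank(embeddings, q=out_dim, center=True)
    return (embeddings - mean) @ v[:, :out_dim]

# Random stand-ins for a batch of 512-d CLIP embeddings (assumption)
torch.manual_seed(0)
embeddings = torch.randn(100, 512)

# Option 1: dimensionality reduction, no training needed
reduced = pca_reduce(embeddings, out_dim=64)
print(reduced.shape)  # torch.Size([100, 64])

# Option 2: a fully connected layer on top of CLIP, to be fine-tuned
head = nn.Linear(512, 64)
projected = head(embeddings)
print(projected.shape)  # torch.Size([100, 64])
```

PCA needs enough records to estimate the components (at least as many rows as the target dimension here), whereas the linear head needs a training objective and labelled or paired data.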