open_clip installation and usage notes

AI算法网奇

Last modified: 2024-11-12 12:17:24

First published: 2024-10-04 23:19:21

Article link: https://blog.csdn.net/jacke121/article/details/142708984

Installing open_clip

pip install open-clip-torch==2.17.1
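
Since these notes pin a specific release, it can help to confirm which version is actually installed before debugging anything else. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of open_clip.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "open-clip-torch" is the distribution name used in the pip command above.
version = installed_version("open-clip-torch")
if version is None:
    print("open-clip-torch is not installed")
elif version != "2.17.1":
    print(f"open-clip-torch {version} is installed, but these notes pin 2.17.1")
else:
    print("open-clip-torch 2.17.1 is installed")
```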

The following call fails when it runs:

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14',
    pretrained='laion/CLIP-ViT-L-14-laion2B-s32B-b82K/pytorch_model.bin'
)

Error output:

  model, _, preprocess = open_clip.create_model_and_transforms(
  File "/data/.local/lib/python3.10/site-packages/open_clip/factory.py", line 382, in create_model_and_transforms
    model = create_model(
  File "/data/.local/lib/python3.10/site-packages/open_clip/factory.py", line 288, in create_model
    load_checkpoint(model, checkpoint_path)
  File "/data/.local/lib/python3.10/site-packages/open_clip/factory.py", line 159, in load_checkpoint
    incompatible_keys = model.load_state_dict(state_dict, strict=strict)
  File "/data/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for CLIP:
        Missing key(s) in state_dict: "positional_embedding", "text_projection", "visual.class_embedding", "visual.positional_embedding", "visual.proj"

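Solution

The missing keys listed in the error are open_clip parameter names, which suggests that pytorch_model.bin is not in the layout open_clip expects (in that repository it appears to be the Hugging Face transformers-format checkpoint). Download open_clip_pytorch_model.bin from https://huggingface.co/laion/CLIP-ViT-L-14-laion2B-s32B-b82K/tree/main and point the code at that file instead:

import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14',
    pretrained='laion/CLIP-ViT-L-14-laion2B-s32B-b82K/open_clip_pytorch_model.bin'
)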

Original source: https://blog.csdn.net/zengNLP/article/details/135644453
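
To see in advance whether a checkpoint file is likely to trigger the error above, one option is to compare its key names with the keys that the traceback reports as missing. This is only a sketch: the helper names are hypothetical, the expected-key set is copied from the traceback rather than from open_clip itself, and the two toy key lists are made up for illustration.

```python
# Parameter names reported as missing in the traceback above.
EXPECTED_OPEN_CLIP_KEYS = {
    "positional_embedding",
    "text_projection",
    "visual.class_embedding",
    "visual.positional_embedding",
    "visual.proj",
}


def missing_open_clip_keys(state_dict_keys):
    """Return the expected keys that are absent from the checkpoint, sorted."""
    return sorted(EXPECTED_OPEN_CLIP_KEYS - set(state_dict_keys))


def looks_like_open_clip_checkpoint(state_dict_keys):
    """True if every expected key is present in the checkpoint."""
    return not missing_open_clip_keys(state_dict_keys)


# Toy key lists (illustrative only, not read from real files).
open_clip_style = list(EXPECTED_OPEN_CLIP_KEYS) + ["logit_scale"]
other_style = [
    "text_model.embeddings.position_embedding.weight",
    "vision_model.embeddings.class_embedding",
]

print(looks_like_open_clip_checkpoint(open_clip_style))
print(missing_open_clip_keys(other_style))
```

With a real file, the keys could be obtained with something like `torch.load(path, map_location="cpu").keys()`, assuming the file stores a plain state dict.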

Usage in ViewCrafter

ViewCrafter calls open_clip in the following file:

/mnt/data-2/users/libanggeng/project/drag/ViewCrafter/lvdm/modules/encoders/condition.py

model, _, _ = open_clip.create_model_and_transforms(arch, device=torch.device('cpu'), pretrained=version)

FrozenOpenCLIPEmbedder

import open_clip
import torch


# AbstractEncoder is the base class from the same condition.py; it is not shown in this excerpt.
class FrozenOpenCLIPEmbedder(AbstractEncoder):
    """
    Uses the OpenCLIP transformer encoder for text
    """
    LAYERS = [
        # "pooled",
        "last",
        "penultimate"
    ]

    def __init__(self, arch="ViT-H-14", version="laion2b_s32b_b79k", device="cuda", max_length=77,
                 freeze=True, layer="last"):
        super().__init__()
        assert layer in self.LAYERS
        # Build the model on CPU, then drop the vision tower: only the text encoder is used.
        model, _, _ = open_clip.create_model_and_transforms(arch, device=torch.device('cpu'), pretrained=version)
        del model.visual
        self.model = model

        self.device = device
        self.max_length = max_length
        if freeze:
            self.freeze()
        # Map the layer name to an index that selects which transformer output to use.
        self.layer = layer
        if self.layer == "last":
            self.layer_idx = 0
        elif self.layer == "penultimate":
            self.layer_idx = 1
        else:
            raise NotImplementedError()

    def freeze(self):
        # Switch to eval mode and disable gradients for every parameter.
        self.model = self.model.eval()
        for param in self.parameters():
            param.requires_grad = False

    def forward(self, text):
        tokens = open_clip.tokenize(text)  # all clip models use 77 as context length
        z = self.encode_with_transformer(tokens.to(self.device))
        return z

    def encode_with_transformer(self, text):
        x = self.model.token_embedding(text)  # [batch_size, n_ctx, d_model]
        x = x + self.model.positional_embedding
        x = x.permute(1, 0, 2)  # NLD -> LND
        # text_transformer_forward is called here but is not included in this excerpt.
        x = self.text_transformer_forward(x, attn_mask=self.model.attn_mask)
        x = x.permute(1, 0, 2)  # LND -> NLD
        x = self.model.ln_final(x)
        return x
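
The excerpt does not include `text_transformer_forward`, so how `layer_idx` is consumed is not visible here. A plausible reading, given the names "last" and "penultimate", is that the transformer stops `layer_idx` blocks before the end. The sketch below illustrates that assumption with toy blocks; `run_blocks`, `LAYER_TO_IDX`, and the lambda blocks are hypothetical and not taken from ViewCrafter.

```python
LAYER_TO_IDX = {"last": 0, "penultimate": 1}


def run_blocks(x, blocks, layer_idx):
    """Apply all but the final `layer_idx` blocks to x, in order."""
    stop = len(blocks) - layer_idx
    for block in blocks[:stop]:
        x = block(x)
    return x


# Toy "blocks": each one appends its own index, so the result shows which blocks ran.
blocks = [lambda trace, i=i: trace + [i] for i in range(4)]

last_trace = run_blocks([], blocks, LAYER_TO_IDX["last"])
penultimate_trace = run_blocks([], blocks, LAYER_TO_IDX["penultimate"])

print(last_trace)         # [0, 1, 2, 3]
print(penultimate_trace)  # [0, 1, 2]
```

Under this reading, "penultimate" returns the hidden states produced just before the final transformer block.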

Contrastive Language-Image Pre-training

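CLIPTextModel is the text-encoder part of OpenAI's CLIP (Contrastive Language-Image Pre-training) model. It processes text input and does not handle images directly. A CLIP model has two parts: a vision encoder for images (CLIPVisionModel) and a text encoder for text (CLIPTextModel).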

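Purpose overview

CLIPTextModel is dedicated to turning text into text features. It encodes the input text into vector representations that can be compared with the image features produced by the image encoder. The core idea of CLIP is contrastive learning: image and text representations are learned in the same embedding space, so that images and text can be matched against each other.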

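Image input

Images are handled by the vision model (such as CLIPVisionModel), not by CLIPTextModel. When an image is passed to CLIP, the vision model encodes it into a vector.

The typical CLIP workflow is:

Image encoding: use CLIPVisionModel to encode the input image into a vector.
Text encoding: use CLIPTextModel to encode the input text description into a vector.
Contrastive matching: compute the similarity between the image and text vectors to decide which image a given text matches.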
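
The matching step above can be sketched in plain Python as cosine similarity followed by a softmax. This is a toy illustration under stated assumptions, not CLIP's actual implementation: the vectors are invented, the function names are hypothetical, and `scale=100.0` is an arbitrary stand-in for the model's learned temperature.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Dot product of the two unit-normalized vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def match_text_to_images(text_vec, image_vecs, scale=100.0):
    """Return (index of best-matching image, match probabilities)."""
    sims = [cosine_similarity(text_vec, img) for img in image_vecs]
    probs = softmax([scale * s for s in sims])
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Made-up 3-dimensional embeddings standing in for encoder outputs.
text_vec = [0.9, 0.1, 0.0]
image_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

best, probs = match_text_to_images(text_vec, image_vecs)
print(best)  # 0: the first image vector points in nearly the same direction as the text vector
```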

Basic code example using `CLIPTextModel`

To process text with CLIPTextModel, you can do the following:

from transformers import CLIPTextModel, CLIPTokenizer

# Load the pretrained CLIP text model and tokenizer
model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# Input text
text = ["a photo of a cat"]

# Encode the text input into tokens
inputs = tokenizer(text, return_tensors="pt")

# Use the text model to encode the text into vectors
text_features = model(**inputs).last_hidden_state

# Output the text features
print(text_features)

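text_features is CLIPTextModel's vector representation of the input text. Note that last_hidden_state holds one vector per token, with shape [batch size, sequence length, hidden size]. To compare text with image features in the shared embedding space, a single pooled and projected vector per text is normally used instead, for example the output of CLIPModel.get_text_features in transformers.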