CLIP: Paper Walkthrough and Hands-On Code

zyw2002

Category: Large Models and Multimodality. Tags: clip, multimodal, llm, AIGC

Article link: https://blog.csdn.net/zyw2002/article/details/137756591

Contents

CLIP Basics
- Motivation and Overview
- Method
CLIP Code

CLIP Basics

Paper: "Learning Transferable Visual Models From Natural Language Supervision"
Code: https://github.com/openai/CLIP

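Motivation and Overview

Research motivation
The authors were motivated by NLP, where pre-training on large-scale data with task-agnostic objectives has been remarkably successful, GPT-3 being a prominent example. They wanted to carry that success over to other fields such as computer vision. CLIP is pre-trained with contrastive learning and uses text prompts for zero-shot transfer. With both a large dataset and a large model, CLIP is competitive with task-specific supervised models, while still leaving considerable room for improvement.

CLIP overview

CLIP stands for Contrastive Language-Image Pre-training.

Existing computer vision systems are trained to predict a fixed set of predefined object categories, and this restricted form of supervision limits their generality and usability. CLIP instead learns image representations directly from raw text, which taps a much broader source of supervision.
CLIP is pre-trained on a dataset of 400 million (image, text) pairs collected from the internet, learning which caption goes with which image.
After pre-training, natural language is used to reference the learned visual concepts, which enables zero-shot transfer to downstream tasks.
The authors benchmark the approach on more than 30 existing computer vision datasets, covering tasks such as OCR, action recognition in videos, geo-localization, and many kinds of fine-grained object classification. Without any dataset-specific training, CLIP can compete with fully supervised baselines.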

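Method

The core idea of CLIP is to learn perception from the supervision contained in natural language.

Compared with other training approaches, learning from natural language has several advantages:

Compared with the standard labels used for image classification, natural language supervision is much easier to scale, because it does not require annotations in a particular format. (Object detection datasets, by contrast, need annotations in a specific format such as COCO, VOC, or YOLO, which takes a lot of manual labeling effort.) Large numbers of text-image pairs can instead be crawled from the internet and used as supervision, which saves much of the manual annotation cost.
Compared with most unsupervised or self-supervised methods, learning from natural language does not "just" learn a representation; it also connects that representation to language, which enables flexible zero-shot transfer. (The web provides many image-text pairs. For a zebra, for example, there are many textual descriptions of its appearance, such as its black-and-white stripes, and this text helps with visual perception.)
It supports open-vocabulary learning and can be used to detect new categories. For example, ViLD (Open-vocabulary Object Detection via Vision and Language Knowledge Distillation) applies CLIP to open-world object detection. As the figure shows, ViLD not only recognizes the base category (toy) but also extends to novel categories, such as the color and shape of the toys.

The architecture of CLIP is shown in the figure below. The following describes its pre-training and inference stages.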

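(1) Pre-training
CLIP is pre-trained on 400 million (image, text) pairs gathered from a variety of publicly available sources on the internet.
The CLIP model handles two modalities:

Text modality: the input sentence is passed through the Text Encoder (a Transformer) to obtain the text features (text_embedding).
Assuming each training batch contains N image-text pairs, this yields N text features ($T_1,T_2,T_3,\dots,T_N$).
Image modality: the input image is passed through the Image Encoder (a ResNet or a Vision Transformer) to obtain the visual features (visual_embedding).
Assuming each training batch contains N image-text pairs, this yields N image features ($I_1,I_2,I_3,\dots,I_N$).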

visual_embedding 	[N, embedding_size]
text_embedding		[N, embedding_size]

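Representations from different modalities may be separated by a gap and cannot be compared directly, so both are first mapped into a shared joint multimodal space, which makes the subsequent similarity computation possible.

CLIP then performs contrastive learning over these text and image pairs. Only the pairs on the diagonal (the blue cells in the figure above: $I_1 T_1, I_2 T_2, I_3 T_3, \dots, I_N T_N$) are matched; these are the positive samples ($N$ of them), and all the others are negative samples ($N^2-N$ of them).
With positive and negative samples defined this way, the model can be trained contrastively without any manual annotation; the only supervision is the natural pairing of each image with its text.
Taking the inner product of the L2-normalized visual_embedding and text_embedding gives an $N\times N$ matrix of cosine similarities between image and text vectors; the more similar an image embedding is to its corresponding text embedding, the larger their inner product.
Training then uses a cross-entropy loss, which pulls the image and text embeddings of the same sample close together and pushes embeddings of different samples apart. This lets the model learn features shared between images and text.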
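
As a small illustration of the pairing described above, the following NumPy sketch builds the N×N cosine-similarity matrix for a toy batch and counts the positive and negative pairs. The random embeddings, the sizes, and the names image_emb and text_emb are placeholders chosen for illustration, not CLIP outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 8  # toy batch size and embedding size (assumed values)

# Placeholder embeddings standing in for the encoder outputs
image_emb = rng.normal(size=(N, d))
text_emb = rng.normal(size=(N, d))

# L2-normalize so that the inner product equals the cosine similarity
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

similarity = image_emb @ text_emb.T      # shape [N, N]
positive_mask = np.eye(N, dtype=bool)    # diagonal entries are the matched pairs

num_pos = int(positive_mask.sum())       # N
num_neg = int((~positive_mask).sum())    # N^2 - N
print(similarity.shape, num_pos, num_neg)
```

Row i of the matrix compares image i with every text in the batch, so the "correct class" for row i is simply index i.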

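(2) Inference

For the image modality, CLIP passes the input image through the image encoder to obtain the image feature ($I_1$).
For the text modality, CLIP uses a prompt template to turn each of the N classes (in the figure: "plane", "car", "dog", ..., "bird") into a sentence: each class name replaces "{object}" in "A photo of a {object}", so N classes produce N sentences. Passing these N sentences through the pre-trained Text Encoder yields N text features ($T_1,T_2,T_3,\dots,T_N$).
Finally, the cosine similarity between the image feature and each text feature is computed, and the class with the highest similarity is the prediction (in the figure above, $I_1T_3$ has the highest score, so the result is dog).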
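
The inference procedure above can be sketched without the real encoders. In this minimal example the feature vectors are hand-written stand-ins (an assumption made purely for illustration); only the prompt template and the argmax over cosine similarities mirror the description.

```python
import numpy as np

class_names = ["plane", "car", "dog", "bird"]
prompts = [f"A photo of a {name}" for name in class_names]

# Hypothetical unit-norm features; a real run would obtain them from the encoders
text_features = np.eye(4)                       # one feature vector per prompt
image_feature = np.array([0.1, 0.2, 0.9, 0.1])
image_feature /= np.linalg.norm(image_feature)

cosine = text_features @ image_feature          # one similarity score per class
best = int(np.argmax(cosine))
predicted = class_names[best]
print(prompts[best], "->", predicted)
```

Because the class list is only used to build prompts, changing the label set requires no retraining, just a different list of strings.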

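Why contrastive learning?

A single image can be described in many different ways, and those descriptions can differ enormously. With a predictive pre-training task (generating the text for an image), there are too many possible outputs and training is slow.
If the task is instead made contrastive, so that the model only has to decide whether an image and a text belong together, the task becomes much simpler and the constraint much looser. In the paper's comparison, simply replacing the predictive objective with a contrastive one improves training efficiency by a factor of 4.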

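CLIP experimental results

Because CLIP learns textual semantics rather than one-hot single-class labels, it transfers better. CLIP performs well not only on the standard ImageNet dataset; on unconventional images such as ImageNet Sketch (sketches) and ImageNet-R (renditions such as cartoons), its transfer ability is far better than that of ResNet101.
Zero-Shot CLIP means transferring directly to other datasets for testing, without any fine-tuning.
Linear Probe CLIP means freezing the pre-trained weights, using the model only as a feature extractor, and training just the final fc classification head.
According to the results figure, Zero-Shot CLIP already outperforms the other supervised networks shown, and Linear Probe CLIP also achieves the best performance in the few-shot setting.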

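CLIP implementation
The paper gives pseudocode for the overall model. Its steps are as follows:

The image input is $I\in [n,h,w,c]$ and the text input is $T\in [n,l]$, where $n$ is the batch size, $l$ is the sequence length, and $h, w, c$ are the image height, width, and number of channels.
Extract the feature representations of the text and image modalities
The image and text inputs pass through the Image Encoder and the Text Encoder to give the image and text features $I_f\in [n,d_i]$ and $T_f\in [n,d_t]$, where $d_i,d_t$ are the dimensions of the encoded image and text features.
The Image Encoder can be a ResNet or a Vision Transformer, and the Text Encoder can be CBOW or a Text Transformer.
Project the text and image embeddings into a joint multimodal space
$I_f$ and $T_f$ are passed through two linear projections $W_i\in [d_i,d_e]$ and $W_t\in[d_t,d_e]$ (i.e. matrix multiplication). L2 normalization along the feature dimension then gives the features used for contrastive learning, $I_e\in [n,d_e]$ and $T_e\in [n,d_e]$.
Compute the cosine similarity between image and text pairs
Multiplying $I_e$ by the transpose of $T_e$ gives an $n\times n$ similarity matrix holding the pairwise similarity scores between the n images and the n texts. The matrix is then multiplied by $e^t$, where $t$ is a learnable temperature parameter.
Compute the loss
The cross-entropy loss is computed in the image direction and in the text direction, and the two are averaged to give the total loss.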
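
The steps above can be sketched in NumPy as follows. This is a minimal reading of the description, not the authors' training code: the encoders are replaced by random feature matrices, and the helper names l2_normalize, cross_entropy, and clip_loss are chosen here for illustration.

```python
import numpy as np

def l2_normalize(x, axis=1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_entropy(logits, labels, axis):
    # Numerically stable log-softmax along `axis`, then pick the true entries
    shifted = logits - logits.max(axis=axis, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))
    idx = np.arange(logits.shape[0])
    picked = log_probs[idx, labels] if axis == 1 else log_probs[labels, idx]
    return -picked.mean()

def clip_loss(I_f, T_f, W_i, W_t, t):
    I_e = l2_normalize(I_f @ W_i)            # [n, d_e]
    T_e = l2_normalize(T_f @ W_t)            # [n, d_e]
    logits = (I_e @ T_e.T) * np.exp(t)       # [n, n] scaled cosine similarities
    labels = np.arange(logits.shape[0])      # the i-th image matches the i-th text
    loss_i = cross_entropy(logits, labels, axis=1)  # softmax over texts for each image
    loss_t = cross_entropy(logits, labels, axis=0)  # softmax over images for each text
    return (loss_i + loss_t) / 2

rng = np.random.default_rng(0)
n, d_i, d_t, d_e = 4, 6, 5, 3                # toy sizes (assumed)
loss = clip_loss(rng.normal(size=(n, d_i)), rng.normal(size=(n, d_t)),
                 rng.normal(size=(d_i, d_e)), rng.normal(size=(d_t, d_e)),
                 t=np.log(1 / 0.07))
print(float(loss))
```

With perfectly aligned features and a large scale, the loss approaches zero, which is the behavior the diagonal-as-positive setup is meant to reward.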

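CLIP Code

Code walkthrough

Text encoder

CLIP encodes text with a Transformer.

Transformer
The Transformer class passes the input text embeddings through a stack of ResidualAttentionBlock modules (layers of them) applied in sequence.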

class Transformer(nn.Module):
    def __init__(self, width: int, layers: int, heads: int, attn_mask: torch.Tensor = None):
        super().__init__()
        self.width = width
        self.layers = layers
        # chain `layers` ResidualAttentionBlock modules in sequence
        self.resblocks = nn.Sequential(*[ResidualAttentionBlock(width, heads, attn_mask) for _ in range(layers)])

    def forward(self, x: torch.Tensor):
        return self.resblocks(x)

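ResidualAttentionBlock
The code below implements a Transformer encoder block in the pre-LayerNorm arrangement: LayerNorm is applied before the attention and before the MLP, and each sub-layer is wrapped in a residual connection.

For background on the Transformer, see the author's two blog posts "A Detailed Explanation of the Attention Mechanism and the Transformer" and "The Transformer Explained Through Code".

The main components of the block are multi-head self-attention (Multi-Head Attention), layer normalization (LayerNorm), and a multi-layer perceptron (MLP).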

class ResidualAttentionBlock(nn.Module):
    def __init__(self, d_model: int, n_head: int, attn_mask: torch.Tensor = None):
        super().__init__()

        self.attn = nn.MultiheadAttention(d_model, n_head) # multi-head attention
        self.ln_1 = LayerNorm(d_model) # layer normalization
        self.mlp = nn.Sequential(OrderedDict([ # feed-forward network
            ("c_fc", nn.Linear(d_model, d_model * 4)), # first linear layer expands the dimension 4x
            ("gelu", QuickGELU()), # fast approximation of the GELU activation
            ("c_proj", nn.Linear(d_model * 4, d_model)) # second linear layer (c_proj) projects back to d_model
        ])) # the wider hidden layer increases the model's representational capacity
        self.ln_2 = LayerNorm(d_model) # layer normalization
        self.attn_mask = attn_mask # mask used inside attention

    def attention(self, x: torch.Tensor):
        self.attn_mask = self.attn_mask.to(dtype=x.dtype, device=x.device) if self.attn_mask is not None else None
        return self.attn(x, x, x, need_weights=False, attn_mask=self.attn_mask)[0]

    def forward(self, x: torch.Tensor):
        x = x + self.attention(self.ln_1(x)) # LayerNorm -> multi-head self-attention -> residual connection
        x = x + self.mlp(self.ln_2(x)) # LayerNorm -> feed-forward -> residual connection
        return x

QuickGELU is a fast approximation of the GELU activation function, implemented as follows:

class QuickGELU(nn.Module):
    def forward(self, x: torch.Tensor):
        return x * torch.sigmoid(1.702 * x)
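
To get a feel for how close the approximation is, the following standard-library sketch compares the exact GELU, $x\,\Phi(x)$, with the sigmoid form used by QuickGELU on a grid of points. The grid range of [-6, 6] is an arbitrary choice for illustration.

```python
import math

def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def quick_gelu(x: float) -> float:
    # Sigmoid-based approximation: x * sigmoid(1.702 * x)
    return x / (1.0 + math.exp(-1.702 * x))

xs = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(gelu(x) - quick_gelu(x)) for x in xs)
print(f"max |GELU - QuickGELU| on [-6, 6]: {max_gap:.4f}")
```

The gap is small but not zero, so weights trained with QuickGELU are best run with the same activation rather than the exact GELU.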

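Image encoder

CLIP offers two choices of image encoder: Vision Transformer and ResNet.

ViT version

VisionTransformer
Vision Transformer (ViT, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale)
The core of the Vision Transformer is still the Transformer structure described above. The difference is on the input side: the image is split into patches, each patch is linearly projected, a class token is prepended, and positional embeddings are added.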

class VisionTransformer(nn.Module):
    def __init__(self, input_resolution: int, patch_size: int, width: int, layers: int, heads: int, output_dim: int):
        super().__init__()
        self.input_resolution = input_resolution
        self.output_dim = output_dim
        # conv1 splits the input image into patches: kernel size and stride both equal patch_size
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=width, kernel_size=patch_size, stride=patch_size, bias=False)

        scale = width ** -0.5
        self.class_embedding = nn.Parameter(scale * torch.randn(width)) # class token embedding
        self.positional_embedding = nn.Parameter(scale * torch.randn((input_resolution // patch_size) ** 2 + 1, width)) # positional embeddings
        self.ln_pre = LayerNorm(width) # layer normalization

        self.transformer = Transformer(width, layers, heads) # Transformer Block

        self.ln_post = LayerNorm(width)
        self.proj = nn.Parameter(scale * torch.randn(width, output_dim))

    def forward(self, x: torch.Tensor): # x: (b,3,h,w)
        # split the image into patches
        x = self.conv1(x)  # shape = [b, width, grid, grid], where grid = h / patch_size
        x = x.reshape(x.shape[0], x.shape[1], -1)  # shape = [b, width, grid ** 2]
        x = x.permute(0, 2, 1)  # shape = [b, grid ** 2, width]
        # prepend the class token
        x = torch.cat([self.class_embedding.to(x.dtype) + torch.zeros(x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device), x], dim=1)  # shape = [b, grid ** 2 + 1, width]
        # add positional embeddings
        x = x + self.positional_embedding.to(x.dtype)
        x = self.ln_pre(x) # LayerNorm

        x = x.permute(1, 0, 2)  # NLD -> LND: [b, grid ** 2 + 1, width] -> [grid ** 2 + 1, b, width]
        x = self.transformer(x) # Transformer blocks, shape stays [grid ** 2 + 1, b, width]
        x = x.permute(1, 0, 2)  # LND -> NLD: [grid ** 2 + 1, b, width] -> [b, grid ** 2 + 1, width]
        # take the output at the class-token position
        x = self.ln_post(x[:, 0, :]) # [b, width]

        if self.proj is not None:
            x = x @ self.proj #  [b,output_dim]

        return x # [b,output_dim]
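
As a quick check of the shapes in the comments above, this small sketch computes the sequence length seen by the Transformer, following the positional_embedding size in __init__. The helper name vit_sequence_length is hypothetical, and the two configurations assume a 224-pixel input.

```python
def vit_sequence_length(input_resolution: int, patch_size: int) -> int:
    grid = input_resolution // patch_size   # patches per side
    return grid * grid + 1                  # all patches plus one class token

for name, res, patch in [("ViT-B/32", 224, 32), ("ViT-B/16", 224, 16)]:
    print(name, vit_sequence_length(res, patch))
```

A smaller patch size means more tokens per image, which is why /16 models cost noticeably more compute than /32 models at the same resolution.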

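ModifiedResNet version

ModifiedResNet
ModifiedResNet is the other implementation of the image encoder.
It is similar to torchvision's ResNet class, with the following changes:

There are now 3 "stem" convolutions instead of 1, followed by an average pool instead of a max pool.
It performs anti-aliasing strided convolutions, where an average pool is prepended to convolutions with stride > 1.
The final pooling layer is a QKV attention instead of an average pool.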

class ModifiedResNet(nn.Module):
    def __init__(self, layers, output_dim, heads, input_resolution=224, width=64):
        super().__init__()
        self.output_dim = output_dim
        self.input_resolution = input_resolution

        # the 3-layer stem
        self.conv1 = nn.Conv2d(3, width // 2, kernel_size=3, stride=2, padding=1, bias=False)
        # (b,3,h,w)->(b,width/2,h/2,w/2)
        self.bn1 = nn.BatchNorm2d(width // 2) 
        self.relu1 = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(width // 2, width // 2, kernel_size=3, padding=1, bias=False)
        # (b,width/2,h/2,w/2)->(b,width/2,h/2,w/2)
        self.bn2 = nn.BatchNorm2d(width // 2)
        self.relu2 = nn.ReLU(inplace=True)
        self.conv3 = nn.Conv2d(width // 2, width, kernel_size=3, padding=1, bias=False)
        # (b,width/2,h/2,w/2)->(b,width,h/2,w/2)
        self.bn3 = nn.BatchNorm2d(width)
        self.relu3 = nn.ReLU(inplace=True)
        self.avgpool = nn.AvgPool2d(2)

        # residual layers
        self._inplanes = width  # this is a *mutable* variable used during construction
        self.layer1 = self._make_layer(width, layers[0]) # layers[0] Bottleneck blocks
        self.layer2 = self._make_layer(width * 2, layers[1], stride=2) # layers[1] Bottleneck blocks
        self.layer3 = self._make_layer(width * 4, layers[2], stride=2) # layers[2] Bottleneck blocks
        self.layer4 = self._make_layer(width * 8, layers[3], stride=2) # layers[3] Bottleneck blocks

        embed_dim = width * 32  # the ResNet feature dimension
        self.attnpool = AttentionPool2d(input_resolution // 32, embed_dim, heads, output_dim)

    def _make_layer(self, planes, blocks, stride=1): # chain `blocks` Bottleneck modules in sequence
        layers = [Bottleneck(self._inplanes, planes, stride)]

        self._inplanes = planes * Bottleneck.expansion
        for _ in range(1, blocks):
            layers.append(Bottleneck(self._inplanes, planes))

        return nn.Sequential(*layers)

    def forward(self, x): # x: (b,3,h,w)
        def stem(x):
            x = self.relu1(self.bn1(self.conv1(x)))
            x = self.relu2(self.bn2(self.conv2(x)))
            x = self.relu3(self.bn3(self.conv3(x)))
            x = self.avgpool(x)
            return x

        x = x.type(self.conv1.weight.dtype) # cast x to the dtype of the conv weights
        x = stem(x) # (b,3,h,w)->(b,width,h/4,w/4)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.attnpool(x) # AttentionPool2d

        return x
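
The constants width * 32 and input_resolution // 32 in __init__ follow from the strides and channel expansions of the stages. The sketch below traces channel count and spatial size through the network using plain arithmetic. The function name modified_resnet_shapes is hypothetical, and the code mirrors the class definition above rather than running it.

```python
def modified_resnet_shapes(input_resolution: int = 224, width: int = 64):
    size = input_resolution // 2       # stem: conv1 has stride 2
    size = size // 2                   # stem: AvgPool2d(2)
    channels = width                   # stem: conv3 outputs `width` channels
    shapes = [("stem", channels, size)]
    stages = [(width, 1), (width * 2, 2), (width * 4, 2), (width * 8, 2)]
    for i, (planes, stride) in enumerate(stages, start=1):
        size = size // stride          # layers 2-4 halve the spatial size
        channels = planes * 4          # Bottleneck.expansion = 4
        shapes.append((f"layer{i}", channels, size))
    return shapes

for name, channels, size in modified_resnet_shapes():
    print(f"{name}: channels={channels}, spatial={size}x{size}")
```

For the default arguments the last stage ends at 2048 channels on a 7x7 grid, matching embed_dim = width * 32 and the spatial size passed to AttentionPool2d.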

Bottleneck
The stages layer1 to layer4 of ModifiedResNet are built from Bottleneck blocks.

class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, stride=1):
        super().__init__()

        # all conv layers have stride 1. an avgpool is performed after the second convolution when stride > 1
        self.conv1 = nn.Conv2d(inplanes, planes, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.relu1 = nn.ReLU(inplace=True)

        self.conv2 = nn.Conv2d(planes, planes, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.relu2 = nn.ReLU(inplace=True)

        self.avgpool = nn.AvgPool2d(stride) if stride > 1 else nn.Identity()

        self.conv3 = nn.Conv2d(planes, planes * self.expansion, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes * self.expansion)
        self.relu3 = nn.ReLU(inplace=True)

        self.downsample = None
        self.stride = stride

        if stride > 1 or inplanes != planes * Bottleneck.expansion:
            # downsampling layer is prepended with an avgpool, and the subsequent convolution has stride 1
            self.downsample = nn.Sequential(OrderedDict([
                ("-1", nn.AvgPool2d(stride)),
                ("0", nn.Conv2d(inplanes, planes * self.expansion, 1, stride=1, bias=False)),
                ("1", nn.BatchNorm2d(planes * self.expansion))
            ]))

    def forward(self, x: torch.Tensor):
        identity = x
        # 1*1 conv -> BatchNorm2d -> ReLU
        out = self.relu1(self.bn1(self.conv1(x))) # (b,inplanes,h,w)->(b,planes,h,w)
        # 3*3 conv -> BatchNorm2d -> ReLU
        out = self.relu2(self.bn2(self.conv2(out))) # (b,planes,h,w)->(b,planes,h,w)
        out = self.avgpool(out) # AvgPool2d: downsamples by `stride` when stride > 1
        out = self.bn3(self.conv3(out)) # (b,planes,h/stride,w/stride)->(b,planes*expansion,h/stride,w/stride)

        if self.downsample is not None:
            identity = self.downsample(x) # downsample the shortcut branch to match

        out += identity # residual connection
        out = self.relu3(out)
        return out # (b,planes*expansion,h/stride,w/stride)

AttentionPool2d
AttentionPool2d is the final layer of ModifiedResNet.

class AttentionPool2d(nn.Module):
    def __init__(self, spacial_dim: int, embed_dim: int, num_heads: int, output_dim: int = None):
        super().__init__()
        self.positional_embedding = nn.Parameter(torch.randn(spacial_dim ** 2 + 1, embed_dim) / embed_dim ** 0.5)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.c_proj = nn.Linear(embed_dim, output_dim or embed_dim)
        self.num_heads = num_heads

    def forward(self, x): # (b,c,h,w)
        x = x.flatten(start_dim=2).permute(2, 0, 1)  # (b,c,h*w)->(h*w,b,c)
        x = torch.cat([x.mean(dim=0, keepdim=True), x], dim=0)  # (h*w+1,b,c)
        x = x + self.positional_embedding[:, None, :].to(x.dtype)  # add positional embeddings (h*w+1,b,c)
        x, _ = F.multi_head_attention_forward( # multi-head attention; the query is the mean token only
            query=x[:1], key=x, value=x,
            embed_dim_to_check=x.shape[-1],
            num_heads=self.num_heads,
            q_proj_weight=self.q_proj.weight,
            k_proj_weight=self.k_proj.weight,
            v_proj_weight=self.v_proj.weight,
            in_proj_weight=None,
            in_proj_bias=torch.cat([self.q_proj.bias, self.k_proj.bias, self.v_proj.bias]),
            bias_k=None,
            bias_v=None,
            add_zero_attn=False,
            dropout_p=0,
            out_proj_weight=self.c_proj.weight,
            out_proj_bias=self.c_proj.bias,
            use_separate_proj_weight=True,
            training=self.training,
            need_weights=False
        )
        return x.squeeze(0)

CLIP model

The core of the CLIP implementation (https://github.com/openai/CLIP) is the CLIP class defined in clip/model.py.

__init__
Initialization function

    def __init__(self,
                 embed_dim: int,
                 # vision
                 image_resolution: int,
                 vision_layers: Union[Tuple[int, int, int, int], int],
                 vision_width: int,
                 vision_patch_size: int,
                 # text
                 context_length: int,
                 vocab_size: int,
                 transformer_width: int,
                 transformer_heads: int,
                 transformer_layers: int
                 ):
        super().__init__()

        self.context_length = context_length
        # two forms of image encoder:
        # if vision_layers is a tuple or list, use ModifiedResNet
        if isinstance(vision_layers, (tuple, list)):
            vision_heads = vision_width * 32 // 64
            self.visual = ModifiedResNet(
                layers=vision_layers,
                output_dim=embed_dim,
                heads=vision_heads,
                input_resolution=image_resolution,
                width=vision_width
            )
        else: # otherwise encode images with a Vision Transformer
            vision_heads = vision_width // 64
            self.visual = VisionTransformer(
                input_resolution=image_resolution,
                patch_size=vision_patch_size,
                width=vision_width,
                layers=vision_layers,
                heads=vision_heads,
                output_dim=embed_dim
            )
        # the text encoder is a Transformer
        self.transformer = Transformer(
            width=transformer_width,
            layers=transformer_layers,
            heads=transformer_heads,
            attn_mask=self.build_attention_mask()
        )

        self.vocab_size = vocab_size
        self.token_embedding = nn.Embedding(vocab_size, transformer_width) # vocab_size is the vocabulary size; transformer_width is the dimension each token is embedded into
        self.positional_embedding = nn.Parameter(torch.empty(self.context_length, transformer_width))
        self.ln_final = LayerNorm(transformer_width)

        self.text_projection = nn.Parameter(torch.empty(transformer_width, embed_dim))
        self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))

        self.initialize_parameters()
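
The build_attention_mask() method called above is not shown in this post. As a hedged illustration, the NumPy sketch below builds the kind of causal mask a text Transformer typically uses: -inf strictly above the diagonal, so a token cannot attend to later positions, and 0 elsewhere. The function name causal_mask is hypothetical, NumPy stands in for torch, and the actual method in the repository may differ in detail.

```python
import numpy as np

def causal_mask(context_length: int) -> np.ndarray:
    # Additive attention mask: 0 keeps a position, -inf blocks it
    mask = np.full((context_length, context_length), -np.inf)
    return np.triu(mask, k=1)   # keep -inf strictly above the diagonal, zero the rest

print(causal_mask(4))
```

The mask is added to the attention scores before the softmax, so blocked positions receive zero attention weight.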

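encode_image
Image encoding: calls self.visual to encode the image

    def encode_image(self, image):
        # cast the image to the model's dtype, then pass it through the image encoder
        return self.visual(image.type(self.dtype))

self.dtype is implemented as follows; it returns the data type of the conv1 weights in the image encoder.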

    @property
    def dtype(self):
        return self.visual.conv1.weight.dtype

encode_text
Text encoding

    def encode_text(self, text):
        # text: [batch_size, n_ctx] token ids; each sequence is wrapped in start-of-text and end-of-text tokens
        x = self.token_embedding(text).type(self.dtype)  # [batch_size, n_ctx, d_model]

        x = x + self.positional_embedding.type(self.dtype) # add positional embeddings
        x = x.permute(1, 0, 2)  # NLD -> LND
        x = self.transformer(x)
        x = x.permute(1, 0, 2)  # LND -> NLD [batch_size, n_ctx, d_model]
        x = self.ln_final(x).type(self.dtype) # LayerNorm

        # x.shape = [batch_size, n_ctx, transformer.width]
        # take features from the eot embedding (eot_token is the highest number in each sequence)
        x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection

        return x
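
The indexing with text.argmax(dim=-1) works because the end-of-text token has the largest id in each sequence. The NumPy sketch below shows the same gather on toy data. The special-token ids 49406 and 49407 are assumed values for the CLIP vocabulary, and the other ids and the feature tensor are arbitrary placeholders.

```python
import numpy as np

SOT, EOT = 49406, 49407   # assumed special-token ids; EOT is the largest id
tokens = np.array([
    [SOT, 320, 1125, EOT, 0, 0],      # toy sequence 1, zero-padded
    [SOT, 320, 2368, 539, EOT, 0],    # toy sequence 2, zero-padded
])

eot_pos = tokens.argmax(axis=-1)      # position of EOT in each sequence

# Stand-in for the Transformer output, shape [batch, n_ctx, width]
features = np.arange(2 * 6 * 3, dtype=float).reshape(2, 6, 3)
pooled = features[np.arange(features.shape[0]), eot_pos]   # [batch, width]
print(eot_pos, pooled)
```

Since the text Transformer uses a causal mask, the end-of-text position is the only one that has attended to the whole sentence, which makes it a natural summary feature.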

forward
The forward pass of the CLIP model first encodes the image and the text, then normalizes both sets of features, and finally computes similarity scores from the normalized features.

    def forward(self, image, text):
        image_features = self.encode_image(image) # encode image features
        text_features = self.encode_text(text) # encode text features

        # L2-normalize the features
        image_features = image_features / image_features.norm(dim=1, keepdim=True)
        text_features = text_features / text_features.norm(dim=1, keepdim=True)

        # cosine similarity as logits
        logit_scale = self.logit_scale.exp() # learnable scale (inverse temperature)
        logits_per_image = logit_scale * image_features @ text_features.t() # similarity score of each image against every text
        logits_per_text = logits_per_image.t() # similarity score of each text against every image

        # shape = [global_batch_size, global_batch_size]
        return logits_per_image, logits_per_text

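Reproducing the code

API
The clip package provides the following API:

clip.available_models() returns the names of the available CLIP models.
clip.load(name, device=..., jit=False)
Returns the model and the TorchVision transform the model needs, for a model name returned by clip.available_models(). The model is downloaded if necessary. The name argument can also be a path to a local checkpoint.
The device to run the model on can optionally be specified; by default the first CUDA device is used if there is one, otherwise the CPU. When jit is False, the non-JIT version of the model is loaded.
clip.tokenize(text: Union[str, List[str]], context_length=77) returns a LongTensor containing the tokenized sequences of the input text(s).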

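The model returned by clip.load() has the following methods:

model.encode_image(image: Tensor) takes a batch of images and returns the encoded image features.
model.encode_text(text: Tensor) takes a batch of text tokens and returns the text features encoded by the CLIP model.
model(image: Tensor, text: Tensor) takes a batch of images and a batch of text tokens and returns two tensors containing the logit scores for each image and text input. The values are the cosine similarities between the corresponding image and text features, multiplied by 100.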
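
A short arithmetic note on the scale factor: in __init__ the logit_scale parameter is stored in log space and initialized to log(1/0.07), so the multiplier starts at about 14.29; the factor of 100 quoted above describes the trained models. The helper logits_to_cosine below is hypothetical and simply assumes a scale of 100 when converting a logit back to a cosine similarity.

```python
import math

init_log_scale = math.log(1 / 0.07)      # initial value of logit_scale (log space)
init_scale = math.exp(init_log_scale)    # multiplier at initialization

def logits_to_cosine(logit: float, scale: float = 100.0) -> float:
    # Assumes the model's learned scale equals `scale`
    return logit / scale

print(f"initial scale: {init_scale:.2f}")
print(f"a logit of 25.0 corresponds to cosine similarity {logits_to_cosine(25.0):.2f}")
```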

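Local environment

Environment setup

Set up the PyTorch environment and install the other required packages

conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
pip install ftfy regex tqdm

Download and install clip

# Option 1: install directly from GitHub
pip install git+https://github.com/openai/CLIP.git
# Option 2: download the clip source from GitHub, unzip it, then install from inside the folder
cd CLIP-main
pip install -v -e .

Inference test
Compute the similarity scores between one image and several texts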

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device) # load the model

image = preprocess(Image.open("../CLIP.png")).unsqueeze(0).to(device) # preprocess the image
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device) # tokenize the texts

with torch.no_grad():
    image_features = model.encode_image(image) # encode image features
    text_features = model.encode_text(text) # encode text features

    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()
print("logits_per_image: ",logits_per_image)
print("logits_per_text:", logits_per_text)
print("Label probs:", probs)  # prints: [[0.9927937  0.00421068 0.00299572]]

Zero-Shot prediction
Predict the class of a single image

import os
import clip
import torch
from torchvision.datasets import CIFAR100
# load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)
# download the dataset
cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)
# prepare the inputs
image, class_id = cifar100[3637]
image_input = preprocess(image).unsqueeze(0).to(device)
text_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in cifar100.classes]).to(device)
# compute image and text features
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)
# L2-normalize the features
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
# scaled cosine similarities, turned into probabilities with softmax
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
# keep the five highest-scoring classes
values, indices = similarity[0].topk(5)
# print the results
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{cifar100.classes[index]:>16s}: {100 * value.item():.2f}%")
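
The factor 100.0 in front of the similarity matters because cosine similarities lie in a narrow range. The NumPy sketch below, using made-up similarity values, compares the softmax with and without the scale.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract the max for numerical stability
    return e / e.sum()

cosine = np.array([0.30, 0.25, 0.20])   # made-up cosine similarities
plain = softmax(cosine)                 # nearly uniform
scaled = softmax(100.0 * cosine)        # sharply peaked on the best match
print(np.round(plain, 3), np.round(scaled, 3))
```

Without the scale, a difference of 0.05 in cosine similarity barely changes the probabilities; with it, the best-matching class takes almost all of the mass.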

Evaluation
Evaluate on many images (linear probe on CIFAR100)

import os
import clip
import torch
import numpy as np
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100
from tqdm import tqdm

# load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)
# load the training and test datasets
root = os.path.expanduser("~/.cache")
train = CIFAR100(root, download=True, train=True, transform=preprocess)
test = CIFAR100(root, download=True, train=False, transform=preprocess)

def get_features(dataset):
    all_features = []
    all_labels = []
    with torch.no_grad():
        for images, labels in tqdm(DataLoader(dataset, batch_size=100)):
            features = model.encode_image(images.to(device)) # encode the image features
            all_features.append(features)
            all_labels.append(labels)
    return torch.cat(all_features).cpu().numpy(), torch.cat(all_labels).cpu().numpy()

# encode the images of the training and test sets
train_features, train_labels = get_features(train)
test_features, test_labels = get_features(test)

# training: fit a logistic regression on the frozen features
classifier = LogisticRegression(random_state=0, C=0.316, max_iter=1000, verbose=1)
classifier.fit(train_features, train_labels)

# evaluate the classifier
predictions = classifier.predict(test_features)
accuracy = np.mean((test_labels == predictions).astype(float)) * 100.
print(f"Accuracy = {accuracy:.3f}") # overall classification accuracy in percent

Colab

https://colab.research.google.com/github/openai/clip/blob/master/notebooks/Interacting_with_CLIP.ipynb
(1) Environment setup
Install the required packages and CLIP

! pip install ftfy regex tqdm
! pip install git+https://github.com/openai/CLIP.git

Check the torch version

import numpy as np
import torch

print("Torch version:", torch.__version__)

(2) Load the model
List the pre-trained models available in clip

import clip

clip.available_models()

Load the clip model and print its key parameters

model, preprocess = clip.load("ViT-B/32") # load the model
model.cuda().eval() # move to the GPU and switch to evaluation mode
input_resolution = model.visual.input_resolution
context_length = model.context_length
vocab_size = model.vocab_size

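print("Model parameters:", f"{np.sum([int(np.prod(p.shape)) for p in model.parameters()]):,}") # number of model parameters
print("Input resolution:", input_resolution) # input image resolution
print("Context length:", context_length) # text context length
print("Vocab size:", vocab_size) # vocabulary size

(3) Image preprocessing
The preprocessing pipeline resizes the image (shorter side to 224 pixels for ViT-B/32), center-crops it to 224*224 with CenterCrop, and applies Normalization.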

preprocess

(4) Text preprocessing
Text preprocessing uses a case-insensitive tokenizer, which is called via clip.tokenize(). By default, the output is padded to a length of 77 tokens.

clip.tokenize("Hello World!")
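
To make the padded layout concrete, the sketch below shows how a list of token ids could be wrapped and zero-padded to the context length. pad_tokens is a hypothetical helper rather than part of the clip package, the special-token ids are assumed values, and the BPE step that produces the ids is omitted.

```python
def pad_tokens(bpe_ids, context_length=77, sot=49406, eot=49407):
    # Wrap the ids in start/end tokens (assumed ids), then zero-pad on the right
    ids = [sot] + list(bpe_ids) + [eot]
    if len(ids) > context_length:
        raise ValueError("input is too long for the context length")
    return ids + [0] * (context_length - len(ids))

row = pad_tokens([1, 2, 3])   # placeholder ids standing in for real BPE output
print(len(row), row[:6])
```

Every row therefore has the same length, which is what allows a batch of prompts to be stacked into a single tensor.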

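(5) Set up the input images and texts
We feed the model 8 example images with their text descriptions and compare the similarity between the corresponding features.
Because the tokenizer is case-insensitive, any suitable text description can be supplied freely.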

import os
import skimage
import IPython.display
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np

from collections import OrderedDict
import torch

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# images in skimage to use and their textual descriptions
descriptions = {
    "page": "a page of text about segmentation",
    "chelsea": "a facial photo of a tabby cat",
    "astronaut": "a portrait of an astronaut with the American flag",
    "rocket": "a rocket standing on a launchpad",
    "motorcycle_right": "a red motorcycle standing in a garage",
    "camera": "a person looking at a camera on a tripod",
    "horse": "a black-and-white silhouette of a horse", 
    "coffee": "a cup of coffee on a saucer"
}

The code below displays the test images and their text descriptions

original_images = []
images = []
texts = []
plt.figure(figsize=(16, 5))

for filename in [filename for filename in os.listdir(skimage.data_dir) if filename.endswith(".png") or filename.endswith(".jpg")]:
    name = os.path.splitext(filename)[0]
    if name not in descriptions:
        continue

    image = Image.open(os.path.join(skimage.data_dir, filename)).convert("RGB")
  
    plt.subplot(2, 4, len(images) + 1)
    plt.imshow(image)
    plt.title(f"{filename}\n{descriptions[name]}")
    plt.xticks([])
    plt.yticks([])

    original_images.append(image)
    images.append(preprocess(image))
    texts.append(descriptions[name])

plt.tight_layout()

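(6) Build the image and text features
The preprocessed (normalized) images are stacked into a batch, each text input is tokenized, and a forward pass through the model produces the image and text features.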

image_input = torch.tensor(np.stack(images)).cuda()
text_tokens = clip.tokenize(["This is " + desc for desc in texts]).cuda()
with torch.no_grad():
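    image_features = model.encode_image(image_input).float() # image features
    text_features = model.encode_text(text_tokens).float() # text features

(7) Compute cosine similarity
Normalize the features and compute the cosine similarity.

image_features /= image_features.norm(dim=-1, keepdim=True) # L2-normalize the image features
text_features /= text_features.norm(dim=-1, keepdim=True) # L2-normalize the text features
similarity = text_features.cpu().numpy() @ image_features.cpu().numpy().T # dot products of unit vectors = cosine similarities

Visualize the similarity matrix as a heatmap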

count = len(descriptions)

plt.figure(figsize=(20, 14))
plt.imshow(similarity, vmin=0.1, vmax=0.3)
# plt.colorbar()
plt.yticks(range(count), texts, fontsize=18)
plt.xticks([])
for i, image in enumerate(original_images):
    plt.imshow(image, extent=(i - 0.5, i + 0.5, -1.6, -0.6), origin="lower")
for x in range(similarity.shape[1]):
    for y in range(similarity.shape[0]):
        plt.text(x, y, f"{similarity[y, x]:.2f}", ha="center", va="center", size=12)

for side in ["left", "top", "right", "bottom"]:
  plt.gca().spines[side].set_visible(False)

plt.xlim([-0.5, count - 0.5])
plt.ylim([count + 0.5, -2])

plt.title("Cosine similarity between text and image features", size=20)

The matched image-text pairs lie on the diagonal and have the highest similarity values.
(8) Zero-shot image classification

from torchvision.datasets import CIFAR100

cifar100 = CIFAR100(os.path.expanduser("~/.cache"), transform=preprocess, download=True)

text_descriptions = [f"This is a photo of a {label}" for label in cifar100.classes] # insert each class name into the prompt
text_tokens = clip.tokenize(text_descriptions).cuda() # tokenize the prompts

with torch.no_grad():
    text_features = model.encode_text(text_tokens).float() # encode the text
    text_features /= text_features.norm(dim=-1, keepdim=True) # L2-normalize the text features

text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1) # softmax over scaled cosine similarities
top_probs, top_labels = text_probs.cpu().topk(5, dim=-1) # keep the 5 most probable classes
# visualize the classification results
plt.figure(figsize=(16, 16))
for i, image in enumerate(original_images):
    plt.subplot(4, 4, 2 * i + 1)
    plt.imshow(image)
    plt.axis("off")

    plt.subplot(4, 4, 2 * i + 2)
    y = np.arange(top_probs.shape[-1])
    plt.grid()
    plt.barh(y, top_probs[i])
    plt.gca().invert_yaxis()
    plt.gca().set_axisbelow(True)
    plt.yticks(y, [cifar100.classes[index] for index in top_labels[i].numpy()])
    plt.xlabel("probability")

plt.subplots_adjust(wspace=0.5)
plt.show()