CLIP: Study Notes

GF心流

Category: Multimodal. Tags: learning, deep learning, computer vision

Article link: https://blog.csdn.net/weixin_46133588/article/details/129285944

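CLIP is pre-trained with contrastive learning on a large number of image-text pairs collected from the internet, which gives it strong transfer ability, especially in zero-shot transfer. The model consists of an image encoder and a text encoder and matches across modalities via cosine similarity. These notes cover pre-training efficiency, the effect of prompt templates and ensembling on performance, and CLIP's limitations on abstract tasks and in certain specialized scenarios.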

CLIP

Preface

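Paper:
Learning Transferable Visual Models From Natural Language Supervision
CLIP paper, paragraph-by-paragraph close reading (Paper Close Reading series)
Project page and GitHub:
https://openai.com/research/clip
https://github.com/OpenAI/CLIP
Zhihu:
How do you rate OpenAI's latest work CLIP: connecting text and images, with zero-shot performance comparable to ResNet50?
Nearly a year after OpenAI released CLIP: a roundup of impressive CLIP-related work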

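1. Principle

Within a batch, the outputs of the image encoder (a ResNet or a Vision Transformer) and the text encoder are compared pairwise in a similarity matrix for contrastive learning. The entries on the diagonal (blue in the paper's figure) are the positive pairs; all other entries are negatives. How can the model classify at inference time without supervised training on ImageNet? By constructing a prompt template: each of the 1000 classes that a linear head would normally predict is written as "a photo of a [object label]" and passed through the pre-trained text encoder. The resulting vectors are then compared with the image encoder's output for the image using cosine similarity.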
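The inference procedure described above can be sketched in NumPy. This is a minimal illustration, not the official implementation: the embeddings are random stand-ins for real encoder outputs, and `zero_shot_classify` and `logit_scale=100.0` are names and values assumed for the example.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so that dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def zero_shot_classify(image_emb, class_text_embs, logit_scale=100.0):
    """image_emb: [d]; class_text_embs: [num_classes, d], one row per class prompt.
    logit_scale is an assumed value standing in for the learned temperature."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(class_text_embs, axis=1)
    cos_sim = txt @ img                      # [num_classes] cosine similarities
    probs = softmax(logit_scale * cos_sim)   # softmax over the class prompts
    return int(np.argmax(probs)), probs

# Hypothetical stand-in embeddings; real ones would come from the two encoders.
labels = ["dog", "cat", "plane"]
prompts = [f"a photo of a {name}" for name in labels]
rng = np.random.default_rng(0)
class_text_embs = rng.normal(size=(len(prompts), 16))
image_emb = class_text_embs[1] + 0.1 * rng.normal(size=16)  # an image close to "cat"

pred, probs = zero_shot_classify(image_emb, class_text_embs)
print(prompts[pred], probs.round(3))
```

Because the class set only enters through the list of prompts, changing the label set means re-encoding new text, with no retraining of a classification head.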

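Two further techniques apply to constructing prompt templates: prompt engineering and prompt ensembling.
Contrastive learning needs a large image-text dataset; OpenAI collected 400 million pairs for pre-training.

Results: transfer ability is very strong. Zero-shot performance on vision datasets is good, notably on ImageNet, and the model is no longer restricted to a fixed set of categorical labels.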

The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on

https://openai.com/research/clip
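Connection to NLP: the visual features CLIP learns are strongly tied to objects as described in language.

Interesting applications:
StyleCLIP
text-guided image editing
CLIPDraw
text-to-sketch generation, in an abstract style
Object detection and segmentation
open-vocabulary detector
Video retrieval: clifs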

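1.1 Abstract

Standard vision benchmarks use a fixed set of predetermined categories:
ImageNet: 1000 classes
CIFAR-10: 10 classes
CIFAR-100: 100 classes
Object detection
COCO: 80 classes
Semantic segmentation
Cityscapes: 19 classes
Video
Kinetics-400: 400 classes

Idea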
Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision.
The corresponding task is designed as:
We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
Experiments
The model is tested on 30 different CV datasets, and the transfer results are very good.
The open-source code covers inference only; the pre-training code was not released.

1.2 Introduction

The development of “text-to-text” as a standardized input-output interface (McCann et al., 2018;Radford et al., 2019; Raffel et al., 2019) has enabled task-agnostic architectures to zero-shot transfer to downstream datasets removing the need for specialized output heads or dataset specific customization

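The key point is that the pre-training architecture is independent of the downstream task, so no supervised signal is needed to learn a task-specific classification head.
The pre-training recipe from NLP: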

These results suggest that the aggregate supervision accessible to modern pre-training methods within web-scale collections of text surpasses that of high-quality crowd-labeled NLP datasets

This recipe is very effective, and the authors want to bring it to CV tasks.

The authors review related work from 1999 up to 2021.
The main point of comparison is the 2017 paper Learning Visual N-Grams from Web Data.
Once transformers and cloze-style self-supervised objectives were available, VirTex, ICMLM and ConVIRT followed. Other later work tried to make use of weaker supervision signals:

Instead, more narrowly scoped but well-targeted uses of weak supervision have improved performance. Mahajan et al. (2018) showed that predicting ImageNet-related hashtags on Instagram images is an effective pre-training task. When fine-tuned to ImageNet these pre-trained models increased accuracy by over 5% and improved the overall state of the art at the time. Kolesnikov et al. (2019) and Dosovitskiy et al. (2020) have also demonstrated large gains on a broader set of transfer benchmarks by pre-training models to predict the classes of the noisily labeled JFT-300M dataset.

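The advantage is that weakly supervised datasets are large. The authors therefore argue that gold labels are limited in supply and would rather use the practically inexhaustible supply of text. Even so, these weakly supervised approaches remain limited at the model level: they use a static softmax classification head and therefore lack zero-shot ability.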

Both works carefully design, and in the process limit, their supervision to 1000 and 18291 classes respectively. Natural language is able to express, and therefore supervise, a much wider set of visual concepts through its generality. Both approaches also use static softmax classifiers to perform prediction and lack a mechanism for dynamic outputs. This severely curtails their flexibility and limits their “zero-shot” capabilities.

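Earlier work on natural language supervision fell short mainly because both the dataset scale and the model scale have to go up; the paper frames the gap in terms of accelerator years of training. The authors therefore start from the data and collect 400 million image-text pairs. On the model side they try ResNets, EfficientNet-style scaling and Vision Transformers (up to ViT-Large), which leads to CLIP. Eight vision models are trained in total, and the largest and smallest differ by roughly 100x.

The authors evaluate on 30 datasets to examine generalization and transfer. Alongside zero-shot, they also run linear probes to further check the quality of the learned representations: the backbone is frozen and only a final classification layer is trained. CLIP beats earlier methods across the board.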

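1.3 Method

With deep contextual representations such as BERT, this abundant source of supervision can be put to use.
In short: using text as a supervision signal to train a vision model has great potential, provided the dataset is large enough. Among existing datasets, MS-COCO and Visual Genome are high quality but far too small; JFT-300M has 300 million samples and the Instagram dataset 3.5 billion. YFCC100M has poor annotation quality, and after cleaning only 15M samples remain.

The new dataset is called WIT. On the NLP side its scale is comparable to the data used for GPT-2; on the CV side, with 400 million pairs, it is about 100 million samples larger than JFT-300M.

Earlier models were already very expensive to train when predicting only 1000 classes, let alone open-set visual concepts. The authors note:

In the course of our efforts, we found training efficiency was key to successfully scaling natural language supervision and we selected our final pre-training method based on this metric

Training efficiency appears to be the key to scaling natural language supervision.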

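Step 1: similar to VirTex, jointly train an image CNN and a text transformer from scratch to predict the caption of an image. Result: very slow.
"contrastive objectives can learn better representations than their equivalent predictive objective": a contrastive objective is easier to learn than a predictive one.
Beyond that, it is also more efficient: switching to the contrastive objective gives a further 4x efficiency gain.

In the paper's efficiency figure, the orange curve treats the text as a global (bag-of-words) feature rather than predicting it word by word; relaxing the constraint further to a contrastive objective improves efficiency again.
Pseudocode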

# image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images [8, 224, 224, 3]
# T[n, l] - minibatch of aligned texts [8,512]
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t - learned temperature parameter

# extract feature representations of each modality
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T) #[n, d_t]

# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)

# symmetric loss function
# SimCLR, BYOL, and on to the more recent MoCo v3 and DINO all use a symmetric objective
labels = np.arange(n)  # the diagonal entries are the positive pairs
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2

Figure 3. Numpy-like pseudocode for the core of an implementation of CLIP.
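The pseudocode above leaves the encoders, `l2_normalize` and `cross_entropy_loss` undefined. A runnable sketch of just the loss, assuming the encoder outputs `I_f` and `T_f` are already available as arrays, might look like the following. The random features and projection matrices are stand-ins, the initial temperature `log(1/0.07)` is an assumed value, and `clip_loss` is a name chosen for this example.

```python
import numpy as np

def l2_normalize(x, axis=1, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def cross_entropy(logits, labels):
    """Mean cross-entropy over rows. logits: [n, k]; labels: [n] integer ids."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def clip_loss(I_f, T_f, W_i, W_t, t):
    """Symmetric contrastive loss over a batch of n aligned (image, text) features."""
    I_e = l2_normalize(I_f @ W_i)            # [n, d_e]
    T_e = l2_normalize(T_f @ W_t)            # [n, d_e]
    logits = (I_e @ T_e.T) * np.exp(t)       # [n, n] scaled cosine similarities
    labels = np.arange(logits.shape[0])      # pair i matches pair i (the diagonal)
    loss_rows = cross_entropy(logits, labels)    # each image against all texts
    loss_cols = cross_entropy(logits.T, labels)  # each text against all images
    return (loss_rows + loss_cols) / 2

rng = np.random.default_rng(0)
n, d_i, d_t, d_e = 8, 32, 32, 16
I_f = rng.normal(size=(n, d_i))              # stand-in image features
T_f = rng.normal(size=(n, d_t))              # stand-in (unrelated) text features
W_i = rng.normal(size=(d_i, d_e))
W_t = rng.normal(size=(d_t, d_e))
t = np.log(1 / 0.07)                         # assumed initial temperature

random_loss = clip_loss(I_f, T_f, W_i, W_t, t)    # unrelated pairs
aligned_loss = clip_loss(I_f, I_f, W_i, W_i, t)   # perfectly matched pairs
print(random_loss, aligned_loss)
```

Since the two directions are averaged, which of them is called the image loss and which the text loss does not change the final value.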

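Training details:

The dataset is so large that over-fitting is not a major concern.
Training is from scratch: neither ImageNet weights for the image encoder nor pre-trained weights for the text encoder are loaded.
No non-linear projection is used between the representation and the contrastive embedding space; only a linear projection.
The text transformation that samples a single sentence from the text is removed.
Image data augmentation is simplified: only cropping is used.
The temperature parameter of the contrastive loss is simply set up as a learnable scalar.
For vision, the model can be a ResNet (with some small modifications) or a Vision Transformer (ViT); for text, only a Transformer is used.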

As a base size we use a 63M-parameter 12-layer 512-wide model with 8 attention heads.

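BPE with a vocabulary of 49,152; maximum sequence length of 76
Some simple scaling of model width and depth was also tried
5 ResNets (RN50, RN101, RN50x4, RN50x16, RN50x64) and 3 ViTs (ViT-B/32, ViT-B/16, ViT-L/14)
All models are trained for 32 epochs with the Adam optimizer
Weight decay is applied to all weights that are not gains or biases
The learning rate follows a cosine schedule
Grid searches were run only on ResNet-50, trained for one epoch
Batch size of 32,768 (wow)
Mixed-precision training, gradient checkpointing, half-precision Adam statistics, half-precision stochastically rounded text encoder weights
The similarity computation is also sharded across GPUs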
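Two of the optimizer settings listed above can be illustrated with a short sketch. It makes two assumptions: the cosine schedule is written in its plain form without warmup (the notes do not mention warmup), and gains and biases are identified by being one-dimensional, which is a common heuristic and not something the notes specify. `cosine_lr` and `should_decay` are names made up for the example.

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Learning rate at `step` under a half-cosine decay from base_lr to min_lr."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def should_decay(param_ndim):
    """Assumed heuristic: matrices and conv kernels (ndim >= 2) get weight decay,
    while gains and biases (ndim < 2) do not."""
    return param_ndim >= 2

# Placeholder values, for illustration only.
total_steps, base_lr = 1000, 5e-4
for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr(step, total_steps, base_lr))

print(should_decay(2), should_decay(1))
```

The schedule starts at `base_lr`, passes through half of it at the midpoint, and reaches `min_lr` at the final step.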

Aside: OpenAI is keen on GPT-style approaches:
the GPT series, DALL-E, Image GPT and OpenAI Codex

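1.4 Experiments

1.4.1 Zero-Shot Transfer

Our focus on studying zero-shot transfer as an evaluation of task learning is inspired by work demonstrating task learning in the field of NLP
The core of the approach: take one image, compare it against 1000 sentences (one per class), then apply a softmax over the similarities to obtain the zero-shot prediction.

This improves results substantially: in the paper, zero-shot ImageNet accuracy rises from the 11.5% of Visual N-Grams to 76.2%.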

1.4.2 PROMPT ENGINEERING AND ENSEMBLING

A main problem is polysemy: the same word can have several senses.

When the name of a class is the only information provided to CLIP’s text encoder it is unable to differentiate which word sense is meant due to the lack of context

The paper proposes a prompt template:

A photo of a {label} to be a good default that helps specify the text is about the content of the image
If more is known about the dataset, results improve further:
For example on Oxford-IIIT Pets, using “A photo of a {label}, a type of pet.” to help provide context worked well. Likewise, on Food101 specifying a type of food and on FGVC Aircraft a type of aircraft helped too.

The authors used 80 prompt templates:
https://github.com/openai/CLIP/blob/main/notebooks/Prompt_Engineering_for_ImageNet.ipynb
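Prompt ensembling can be sketched as averaging the text embeddings of several templates per class, so that the ensemble costs nothing extra at inference time. The sketch below is illustrative only: `toy_text_encoder` is a hypothetical stand-in that produces deterministic pseudo-embeddings (it is not CLIP's text encoder), and only the two templates quoted above are used.

```python
import zlib
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def toy_text_encoder(text, dim=32):
    """Hypothetical stand-in for a text encoder: a deterministic pseudo-embedding
    seeded from a checksum of the text."""
    rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
    return rng.normal(size=dim)

def build_class_embeddings(class_names, templates, encode_text):
    """One embedding per class: normalize each prompt embedding, average over
    templates, then normalize again."""
    rows = []
    for name in class_names:
        embs = np.stack([encode_text(t.format(name)) for t in templates])
        embs = l2_normalize(embs, axis=1)
        rows.append(l2_normalize(embs.mean(axis=0)))
    return np.stack(rows)                    # [num_classes, dim]

templates = ["a photo of a {}.", "a photo of a {}, a type of pet."]
class_names = ["beagle", "siamese cat"]
class_embs = build_class_embeddings(class_names, templates, toy_text_encoder)
print(class_embs.shape)
```

The resulting matrix can be cached and reused as the zero-shot classifier weights for every image.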

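Experiments were then run on 27 datasets:

CLIP zero-shot is compared against a linear probe, where linear probe means freezing the backbone and training only a final linear layer
The linear probe is the baseline; green marks datasets where zero-shot CLIP is better than the probe, blue where it is worse
Classification of ordinary objects works better; harder datasets, such as texture classification and object counting, are more abstract and more difficult
Hard tasks may need few-shot learning
Few-shot experiments were also run: the x-axis is the number of labeled examples per class, the y-axis is the average accuracy over 20 datasets, and all methods use a linear probe; for CLIP, the frozen part is the image encoder
BiT was built specifically for transfer learning and was the best transfer learning model at the time, so it is a strong baseline
With 1, 2 or 4 shots, the few-shot results are still not as good as multimodal zero-shot, which shows how strong the text supervision signal is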
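A few-shot linear probe as described above can be sketched as softmax regression on frozen features. Everything here is an assumption made for illustration: the "frozen features" are synthetic clusters, not encoder outputs, and plain gradient descent is swapped in as a simple optimizer with no claim that it matches the paper's setup. `fit_linear_probe` and `predict` are names chosen for the example.

```python
import numpy as np

def fit_linear_probe(features, labels, num_classes, lr=0.5, steps=300, l2=1e-3):
    """Train only a linear layer (W, b) on fixed features with softmax regression."""
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n              # gradient of mean cross-entropy
        W -= lr * (features.T @ grad + l2 * W)
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(features, W, b):
    return np.argmax(features @ W + b, axis=1)

# Synthetic stand-in for frozen encoder features: one noisy cluster per class.
rng = np.random.default_rng(0)
num_classes, dim, shots = 3, 16, 4
centers = 3.0 * rng.normal(size=(num_classes, dim))

def sample(per_class):
    y = np.repeat(np.arange(num_classes), per_class)
    x = centers[y] + rng.normal(size=(len(y), dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True), y

x_train, y_train = sample(shots)                 # 4 labeled examples per class
x_test, y_test = sample(50)
W, b = fit_linear_probe(x_train, y_train, num_classes)
acc = float((predict(x_test, W, b) == y_test).mean())
print(acc)
```

The backbone never receives gradients in this setup, so the probe's accuracy reflects the quality of the features more than the amount of tuning.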

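With zero-shot and few-shot covered, the next question is what happens when the full labeled data of each downstream dataset is used.
There are two options:
1. linear probe
2. fine-tuning
The authors use only the first, so that the pre-trained model is changed as little as possible and the quality of pre-training itself can be judged. Fine-tuning has too many adjustable parameters, which makes results hard to compare.
The x-axis is the forward-pass compute for one image, and the y-axis is accuracy.

The authors then compare CLIP with EfficientNet: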

Fitting a linear classifier on CLIP’s features outperforms using the Noisy Student EfficientNet-L2 on 21 out of 27 datasets.

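The backbone is frozen and only the classification head is trained, on the full data.

How does the model perform under distribution shift?
For the comparison with humans, five people were recruited for the experiment.

The results in the table are very good.

What humans find hard, the model finds hard too.

A deduplication experiment was also run, and it still supports the conclusion that CLIP generalizes well.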

1.5 Limitations

Significant work is still needed to improve the task learning and transfer capabilities of CLIP. While scaling has so far steadily improved performance and suggests a route for continued improvement, we estimate around a 1000x increase in compute is required for zero-shot CLIP to reach overall state-of-the-art performance

Closing the gap to SOTA by scaling alone is not realistic.

CLIP also struggles with more abstract and systematic tasks such as counting the number of objects in an image. Finally for novel tasks which are unlikely to be included in CLIP’s pre-training dataset, such as classifying the distance to the nearest car in a photo, CLIP’s performance can be near random. We are confident that there are still many, many, tasks where CLIP’s zero-shot performance is near chance level.

On harder tasks CLIP indeed does not do well.

However, CLIP only achieves 88% accuracy on the handwritten digits of MNIST

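When the downstream data is out of distribution relative to the pre-training data, CLIP also does poorly.

Ideally the model would directly generate a caption for the image, which would be truly end to end, instead of only using natural language as a supervision signal; that is, it would be a generative model. One direction is to combine the contrastive objective with a generative objective.

Data is not used efficiently. Possible ways to improve data efficiency are self-supervision and pseudo-labeling.

During the experiments, hyperparameters were repeatedly tuned against evaluation-set performance, so the setting is not truly zero-shot.

The 27 selected datasets also carry subjective bias; a benchmark built specifically for zero-shot evaluation would be ideal.

The data is crawled from the web without filtering, so it carries social biases.

Some tasks are hard to describe in language and call for training examples, yet CLIP's performance with no training examples is better than its few-shot performance.

2. Summary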

We have investigated whether it is possible to transfer the success of task-agnostic web-scale pre-training in NLP to another domain. We find that adopting this formula results in similar behaviors emerging in the field of computer vision and discuss the social implications of this line of research. In order to optimize their training objective, CLIP models learn to perform a wide variety of tasks during pretraining. This task learning can then be leveraged via natural language prompting to enable zero-shot transfer to many existing datasets. At sufficient scale, the performance of this approach can be competitive with task-specific supervised models although there is still room for much improvement.

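CLIP breaks the fixed-label learning paradigm and learns without manually annotated labels, which makes data processing, modeling and inference all more convenient. Scores: novelty 100, effectiveness 100, problem significance 100.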