Full Translation of CLIP: Learning Transferable Visual Models From Natural Language Supervision

Paper Reading Series Index



Paper link

Abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.

1. Introduction and Motivating Work

  1. Pre-training methods which learn directly from raw text have revolutionized NLP over the last few years (Dai & Le, 2015; Peters et al., 2018; Howard & Ruder, 2018; Radford et al., 2018; Devlin et al., 2018; Raffel et al., 2019). Task-agnostic objectives such as autoregressive and masked language modeling have scaled across many orders of magnitude in compute, model capacity, and data, steadily improving capabilities. The development of “text-to-text” as a standardized input-output interface (McCann et al., 2018; Radford et al., 2019; Raffel et al., 2019) has enabled task-agnostic architectures to zero-shot transfer to downstream datasets, removing the need for specialized output heads or dataset specific customization. Flagship systems like GPT-3 (Brown et al., 2020) are now competitive across many tasks with bespoke models while requiring little to no dataset specific training data.
  2. These results suggest that the aggregate supervision accessible to modern pre-training methods within web-scale collections of text surpasses that of high-quality crowd-labeled NLP datasets. However, in other fields such as computer vision it is still standard practice to pre-train models on crowd-labeled datasets such as ImageNet (Deng et al., 2009). Could scalable pre-training methods which learn directly from web text result in a similar breakthrough in computer vision? Prior work is encouraging.
  3. Over 20 years ago Mori et al. (1999) explored improving content based image retrieval by training a model to predict the nouns and adjectives in text documents paired with images. Quattoni et al. (2007) demonstrated it was possible to learn more data efficient image representations via manifold learning in the weight space of classifiers trained to predict words in captions associated with images. Srivastava & Salakhutdinov (2012) explored deep representation learning by training multimodal Deep Boltzmann Machines on top of low-level image and text tag features. Joulin et al. (2016) modernized this line of work and demonstrated that CNNs trained to predict words in image captions learn useful image representations. They converted the title, description, and hashtag metadata of images in the YFCC100M dataset (Thomee et al., 2016) into a bag-of-words multi-label classification task and showed that pre-training AlexNet (Krizhevsky et al., 2012) to predict these labels learned representations which performed similarly to ImageNet-based pre-training on transfer tasks. Li et al. (2017) then extended this approach to predicting phrase n-grams in addition to individual words and demonstrated the ability of their system to zero-shot transfer to other image classification datasets by scoring target classes based on their dictionary of learned visual n-grams and predicting the one with the highest score. Adopting more recent architectures and pre-training approaches, VirTex (Desai & Johnson, 2020), ICMLM (Bulent Sariyildiz et al., 2020), and ConVIRT (Zhang et al., 2020) have recently demonstrated the potential of transformer-based language modeling, masked language modeling, and contrastive objectives to learn image representations from text.
  4. While exciting as proofs of concept, using natural language supervision for image representation learning is still rare. This is likely because demonstrated performance on common benchmarks is much lower than alternative approaches. For example, Li et al. (2017) reach only 11.5% accuracy on ImageNet in a zero-shot setting. This is well below the 88.4% accuracy of the current state of the art (Xie et al., 2020). It is even below the 50% accuracy of classic computer vision approaches (Deng et al., 2012). Instead, more narrowly scoped but well-targeted uses of weak supervision have improved performance. Mahajan et al. (2018) showed that predicting ImageNet-related hashtags on Instagram images is an effective pre-training task. When fine-tuned to ImageNet these pre-trained models increased accuracy by over 5% and improved the overall state of the art at the time. Kolesnikov et al. (2019) and Dosovitskiy et al. (2020) have also demonstrated large gains on a broader set of transfer benchmarks by pre-training models to predict the classes of the noisily labeled JFT-300M dataset.
  5. This line of work represents the current pragmatic middle ground between learning from a limited amount of supervised “gold-labels” and learning from practically unlimited amounts of raw text. However, it is not without compromises. Both works carefully design, and in the process limit, their supervision to 1000 and 18291 classes respectively. Natural language is able to express, and therefore supervise, a much wider set of visual concepts through its generality. Both approaches also use static softmax classifiers to perform prediction and lack a mechanism for dynamic outputs. This severely curtails their flexibility and limits their “zero-shot” capabilities.
  6. A crucial difference between these weakly supervised models and recent explorations of learning image representations directly from natural language is scale. While Mahajan et al. (2018) and Kolesnikov et al. (2019) trained their models for accelerator years on millions to billions of images, VirTex, ICMLM, and ConVIRT trained for accelerator days on one to two hundred thousand images. In this work, we close this gap and study the behaviors of image classifiers trained with natural language supervision at large scale. Enabled by the large amounts of publicly available data of this form on the internet, we create a new dataset of 400 million (image, text) pairs and demonstrate that a simplified version of ConVIRT trained from scratch, which we call CLIP, for Contrastive Language-Image Pre-training, is an efficient method of learning from natural language supervision. We study the scalability of CLIP by training a series of eight models spanning almost 2 orders of magnitude of compute and observe that transfer performance is a smoothly predictable function of compute (Hestness et al., 2017; Kaplan et al., 2020). We find that CLIP, similar to the GPT family, learns to perform a wide set of tasks during pre-training including OCR, geo-localization, action recognition, and many others. We measure this by benchmarking the zero-shot transfer performance of CLIP on over 30 existing datasets and find it can be competitive with prior task-specific supervised models. We also confirm these findings with linear-probe representation learning analysis and show that CLIP outperforms the best publicly available ImageNet model while also being more computationally efficient. We additionally find that zero-shot CLIP models are much more robust than equivalent accuracy supervised ImageNet models, which suggests that zero-shot evaluation of task-agnostic models is much more representative of a model’s capability. These results have significant policy and ethical implications, which we consider in Section 7.

2. Approach

2.1. Natural Language Supervision

  1. At the core of our approach is the idea of learning perception from supervision contained in natural language. As discussed in the introduction, this is not at all a new idea, however terminology used to describe work in this space is varied, even seemingly contradictory, and stated motivations are diverse. Zhang et al. (2020), Gomez et al. (2017), Joulin et al. (2016), and Desai & Johnson (2020) all introduce methods which learn visual representations from text paired with images but describe their approaches as unsupervised, self-supervised, weakly supervised, and supervised respectively.
  2. We emphasize that what is common across this line of work is not any of the details of the particular methods used but the appreciation of natural language as a training signal. All these approaches are learning from natural language supervision. Although early work wrestled with the complexity of natural language when using topic model and n-gram representations, improvements in deep contextual representation learning suggest we now have the tools to effectively leverage this abundant source of supervision (McCann et al., 2017).
  3. Learning from natural language has several potential strengths over other training methods. It’s much easier to scale natural language supervision compared to standard crowd-sourced labeling for image classification since it does not require annotations to be in a classic “machine learning compatible format” such as the canonical 1-of-N majority vote “gold label”. Instead, methods which work on natural language can learn passively from the supervision contained in the vast amount of text on the internet. Learning from natural language also has an important advantage over most unsupervised or self-supervised learning approaches in that it doesn’t “just” learn a representation but also connects that representation to language, which enables flexible zero-shot transfer. In the following subsections, we detail the specific approach we settled on.

2.2. Creating a Sufficiently Large Dataset

Existing work has mainly used three datasets, MS-COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2017), and YFCC100M (Thomee et al., 2016). While MS-COCO and Visual Genome are high quality crowd-labeled datasets, they are small by modern standards with approximately 100,000 training photos each. By comparison, other computer vision systems are trained on up to 3.5 billion Instagram photos (Mahajan et al., 2018). YFCC100M, at 100 million photos, is a possible alternative, but the metadata for each image is sparse and of varying quality. Many images use automatically generated filenames like 20160716 113957.JPG as “titles” or contain “descriptions” of camera exposure settings. After filtering to keep only images with natural language titles and/or descriptions in English, the dataset shrunk by a factor of 6 to only 15 million photos. This is approximately the same size as ImageNet. A major motivation for natural language supervision is the large quantities of data of this form available publicly on the internet. Since existing datasets do not adequately reflect this possibility, considering results only on them would underestimate the potential of this line of research. To address this, we constructed a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet. To attempt to cover as broad a set of visual concepts as possible, we search for (image, text) pairs as part of the construction process whose text includes one of a set of 500,000 queries. We approximately class balance the results by including up to 20,000 (image, text) pairs per query. The resulting dataset has a similar total word count as the WebText dataset used to train GPT-2. We refer to this dataset as WIT for WebImageText.
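As a rough illustration of the balancing rule described above, a minimal sketch might look like the following. Only the per-query cap of 20,000 pairs comes from the text; the substring-style matching rule and all helper names are assumptions for illustration.

```python
from collections import defaultdict

MAX_PAIRS_PER_QUERY = 20_000  # per-query cap used for approximate class balancing


def build_wit(pair_stream, queries):
    """Keep (image, text) pairs whose text contains one of the queries, capping
    each query at MAX_PAIRS_PER_QUERY. The matching rule here is a simple
    substring test; the paper does not specify the exact rule used."""
    counts = defaultdict(int)
    kept = []
    for image_url, text in pair_stream:
        lowered = text.lower()
        matched = next((q for q in queries if q in lowered), None)
        if matched is not None and counts[matched] < MAX_PAIRS_PER_QUERY:
            counts[matched] += 1
            kept.append((image_url, text))
    return kept
```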

2.3. Selecting an Efficient Pre-Training Method

State-of-the-art computer vision systems use very large amounts of compute. Mahajan et al. (2018) required 19 GPU years to train their ResNeXt101-32x48d and Xie et al. (2020) required 33 TPUv3 core-years to train their Noisy Student EfficientNet-L2. When considering that both these systems were trained to predict only 1000 ImageNet classes, the task of learning an open set of visual concepts from natural language seems daunting. In the course of our efforts, we found training efficiency was key to successfully scaling natural language supervision and we selected our final pre-training method based on this metric.
Our initial approach, similar to VirTex, jointly trained an image CNN and text transformer from scratch to predict the caption of an image. However, we encountered difficulties efficiently scaling this method. In Figure 2 we show that a 63 million parameter transformer language model, which already uses twice the compute of its ResNet-50 image encoder, learns to recognize ImageNet classes three times slower than a much simpler baseline that predicts a bag-of-words encoding of the same text.
Both these approaches share a key similarity. They try to predict the exact words of the text accompanying each image. This is a difficult task due to the wide variety of descriptions, comments, and related text that co-occur with images. Recent work in contrastive representation learning for images has found that contrastive objectives can learn better representations than their equivalent predictive objective (Tian et al., 2019). Other work has found that although generative models of images can learn high quality image representations, they require over an order of magnitude more compute than contrastive models with the same performance (Chen et al., 2020a). Noting these findings, we explored training a system to solve the potentially easier proxy task of predicting only which text as a whole is paired with which image and not the exact words of that text. Starting with the same bag-of-words encoding baseline, we swapped the predictive objective for a contrastive objective in Figure 2 and observed a further 4x efficiency improvement in the rate of zero-shot transfer to ImageNet.
[Figure 2]

Given a batch of N (image, text) pairs, CLIP is trained to predict which of the N × N possible (image, text) pairings across a batch actually occurred. To do this, CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the N real pairs in the batch while minimizing the cosine similarity of the embeddings of the N² − N incorrect pairings. We optimize a symmetric cross entropy loss over these similarity scores. In Figure 3 we include pseudocode of the core of an implementation of CLIP. To our knowledge this batch construction technique and objective was first introduced in the area of deep metric learning as the multi-class N-pair loss (Sohn, 2016), was popularized for contrastive representation learning by Oord et al. (2018) as the InfoNCE loss, and was recently adapted for contrastive (text, image) representation learning in the domain of medical imaging by Zhang et al. (2020).
[Figure 3: pseudocode for the core of a CLIP implementation]
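Since Figure 3 is not reproduced here, the following is a minimal PyTorch sketch of the symmetric contrastive objective just described. It is not the authors' original pseudocode; the projected encoder outputs and the logit scale (i.e. 1/τ) are assumed to be produced elsewhere.

```python
import torch
import torch.nn.functional as F


def clip_loss(image_features, text_features, logit_scale):
    """Symmetric contrastive loss over a batch of N (image, text) pairs.

    image_features, text_features: [N, d] encoder outputs after linear
    projection into the joint embedding space.
    logit_scale: exp(learnable log-temperature), i.e. 1 / tau.
    """
    # L2-normalize so that dot products are cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [N, N] matrix of temperature-scaled pairwise cosine similarities
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # the N matching (image, text) pairs lie on the diagonal
    labels = torch.arange(image_features.shape[0], device=image_features.device)
    loss_images = F.cross_entropy(logits_per_image, labels)
    loss_texts = F.cross_entropy(logits_per_text, labels)
    return (loss_images + loss_texts) / 2
```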

Due to the large size of our pre-training dataset, over-fitting is not a major concern and the details of training CLIP are simplified compared to the implementation of Zhang et al. (2020). We train CLIP from scratch without initializing the image encoder with ImageNet weights or the text encoder with pre-trained weights. We do not use the non-linear projection between the representation and the contrastive embedding space, a change which was introduced by Bachman et al. (2019) and popularized by Chen et al. (2020b). We instead use only a linear projection to map from each encoder’s representation to the multi-modal embedding space. We did not notice a difference in training efficiency between the two versions and speculate that non-linear projections may be co-adapted with details of current image-only self-supervised representation learning methods. We also remove the text transformation function t_u from Zhang et al. (2020), which samples a single sentence at uniform from the text, since many of the (image, text) pairs in CLIP’s pre-training dataset are only a single sentence. We also simplify the image transformation function t_v. A random square crop from resized images is the only data augmentation used during training. Finally, the temperature parameter which controls the range of the logits in the softmax, τ, is directly optimized during training as a log-parameterized multiplicative scalar to avoid tuning it as a hyper-parameter.
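A sketch of how these choices might look in code is given below. The module and parameter names are hypothetical; the 0.07 initialization and the cap of 100 on the logit scale are taken from Section 2.5 further down.

```python
import numpy as np
import torch
import torch.nn as nn


class CLIPHead(nn.Module):
    """Sketch of the simplifications above: purely linear projections into the
    joint embedding space, plus a temperature optimized as a log-parameterized
    multiplicative scalar rather than tuned as a hyper-parameter."""

    def __init__(self, image_dim, text_dim, embed_dim):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, embed_dim, bias=False)
        self.text_proj = nn.Linear(text_dim, embed_dim, bias=False)
        # exp() of this parameter acts as 1 / tau; the 0.07 initialization and
        # the clipping to 100 are described in Section 2.5
        self.log_logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))

    def forward(self, image_features, text_features):
        logit_scale = self.log_logit_scale.exp().clamp(max=100.0)
        return (self.image_proj(image_features),
                self.text_proj(text_features),
                logit_scale)
```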

2.4. Choosing and Scaling a Model

We consider two different architectures for the image encoder. For the first, we use ResNet-50 (He et al., 2016a) as the base architecture for the image encoder due to its widespread adoption and proven performance. We make several modifications to the original version using the ResNet-D improvements from He et al. (2019) and the antialiased rect-2 blur pooling from Zhang (2019). We also replace the global average pooling layer with an attention pooling mechanism. The attention pooling is implemented as a single layer of “transformer-style” multi-head QKV attention where the query is conditioned on the global average-pooled representation of the image. For the second architecture, we experiment with the recently introduced Vision Transformer (ViT) (Dosovitskiy et al., 2020). We closely follow their implementation with only the minor modification of adding an additional layer normalization to the combined patch and position embeddings before the transformer and use a slightly different initialization scheme.
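A minimal sketch of the attention pooling idea follows; the real implementation includes further details (e.g. positional embeddings) not described here, and the class and argument names are hypothetical.

```python
import torch
import torch.nn as nn


class AttentionPool(nn.Module):
    """Single multi-head QKV attention layer that replaces global average
    pooling: the query is the mean-pooled feature map, while keys and values
    are the spatial features themselves."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feature_map):
        # feature_map: [B, H*W, dim] flattened spatial grid from the CNN
        query = feature_map.mean(dim=1, keepdim=True)      # [B, 1, dim] global average pool
        pooled, _ = self.attn(query, feature_map, feature_map)
        return pooled.squeeze(1)                           # [B, dim] pooled image feature
```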
The text encoder is a Transformer (Vaswani et al., 2017) with the architecture modifications described in Radford et al. (2019). As a base size we use a 63M-parameter, 12-layer, 512-wide model with 8 attention heads. The transformer operates on a lower-cased byte pair encoding (BPE) representation of the text with a vocabulary size of 49,152 (Sennrich et al., 2015). For computational efficiency, the max sequence length was capped at 76. The text sequence is bracketed with [SOS] and [EOS] tokens and the activations of the highest layer of the transformer at the [EOS] token are treated as the feature representation of the text, which is layer normalized and then linearly projected into the multi-modal embedding space. Masked self-attention was used in the text encoder to preserve the ability to initialize with a pre-trained language model or add language modeling as an auxiliary objective, though exploration of this is left as future work.
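The feature extraction step described above can be sketched as follows, where `ln_final` stands for a LayerNorm over the transformer width and `text_projection` for a [width, embed_dim] matrix; both names are assumptions for illustration.

```python
import torch


def text_features(token_embeddings, eos_positions, ln_final, text_projection):
    """Take the transformer activations at the [EOS] token of each sequence,
    layer-normalize them, and project linearly into the joint embedding space.

    token_embeddings: [B, L, width] output of the text transformer
    eos_positions:    [B] index of the [EOS] token in each sequence
    """
    batch_idx = torch.arange(token_embeddings.shape[0], device=token_embeddings.device)
    eos_states = token_embeddings[batch_idx, eos_positions]   # [B, width]
    return ln_final(eos_states) @ text_projection             # [B, embed_dim]
```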
While previous computer vision research has often scaled models by increasing the width (Mahajan et al., 2018) or depth (He et al., 2016a) in isolation, for the ResNet image encoders we adapt the approach of Tan & Le (2019) which found that allocating additional compute across all of width, depth, and resolution outperforms allocating it to only one dimension of the model. While Tan & Le (2019) tune the ratio of compute allocated to each dimension for their EfficientNet architecture, we use a simple baseline of allocating additional compute equally to increasing the width, depth, and resolution of the model. For the text encoder, we only scale the width of the model to be proportional to the calculated increase in width of the ResNet and do not scale the depth at all, as we found CLIP’s performance to be less sensitive to the capacity of the text encoder.
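One way to read the “equal allocation” baseline is sketched below, assuming compute scales roughly as depth × width² × resolution²; this assumption and the rounding are ours, not the paper's.

```python
def scale_resnet_dims(depth, width, resolution, compute_multiplier):
    """Split a target compute multiplier equally (in log space) between the
    depth, width, and resolution terms, assuming
    compute ~ depth * width**2 * resolution**2."""
    per_term = compute_multiplier ** (1.0 / 3.0)   # equal share of compute per term
    return (round(depth * per_term),               # depth enters linearly
            round(width * per_term ** 0.5),        # width enters squared
            round(resolution * per_term ** 0.5))   # resolution enters squared
```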

2.5. Training

We train a series of 5 ResNets and 3 Vision Transformers. For the ResNets we train a ResNet-50, a ResNet-101, and then 3 more which follow EfficientNet-style model scaling and use approximately 4x, 16x, and 64x the compute of a ResNet-50. They are denoted as RN50x4, RN50x16, and RN50x64 respectively. For the Vision Transformers we train a ViT-B/32, a ViT-B/16, and a ViT-L/14. We train all models for 32 epochs. We use the Adam optimizer (Kingma & Ba, 2014) with decoupled weight decay regularization (Loshchilov & Hutter, 2017) applied to all weights that are not gains or biases, and decay the learning rate using a cosine schedule (Loshchilov & Hutter, 2016). Initial hyper-parameters were set using a combination of grid searches, random search, and manual tuning on the baseline ResNet-50 model when trained for 1 epoch. Hyper-parameters were then adapted heuristically for larger models due to computational constraints. The learnable temperature parameter τ was initialized to the equivalent of 0.07 from (Wu et al., 2018) and clipped to prevent scaling the logits by more than 100, which we found necessary to prevent training instability. We use a very large minibatch size of 32,768. Mixed-precision (Micikevicius et al., 2017) was used to accelerate training and save memory. To save additional memory, gradient checkpointing (Griewank & Walther, 2000; Chen et al., 2016), half-precision Adam statistics (Dhariwal et al., 2020), and half-precision stochastically rounded text encoder weights were used. The calculation of embedding similarities was also sharded with individual GPUs computing only the subset of the pairwise similarities necessary for their local batch of embeddings. The largest ResNet model, RN50x64, took 18 days to train on 592 V100 GPUs while the largest Vision Transformer took 12 days on 256 V100 GPUs. For the ViT-L/14 we also pre-train at a higher 336 pixel resolution for one additional epoch to boost performance similar to FixRes (Touvron et al., 2019). We denote this model as ViT-L/14@336px. Unless otherwise specified, all results reported in this paper as “CLIP” use this model which we found to perform best.
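A minimal sketch of the optimizer setup described above, assuming that “gains or biases” can be identified as 1-D parameters; the actual learning rate, weight decay, and schedule values are not reproduced here and are passed in as placeholders.

```python
import torch


def build_optimizer_and_schedule(model, lr, weight_decay, total_steps):
    """AdamW (Adam with decoupled weight decay) where gains and biases are
    excluded from weight decay, plus a cosine learning-rate schedule."""
    decay, no_decay = [], []
    for param in model.parameters():
        if not param.requires_grad:
            continue
        # treat 1-D parameters (LayerNorm/BatchNorm gains, biases) as "no decay"
        (no_decay if param.ndim <= 1 else decay).append(param)
    optimizer = torch.optim.AdamW(
        [{"params": decay, "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr)
    schedule = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
    return optimizer, schedule
```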

3. Experiments

3.1. Zero-Shot Transfer

3.1.1. MOTIVATION

In computer vision, zero-shot learning usually refers to the study of generalizing to unseen object categories in image classification (Lampert et al., 2009). We instead use the term in a broader sense and study generalization to unseen datasets. We motivate this as a proxy for performing unseen tasks, as aspired to in the zero-data learning paper of Larochelle et al. (2008). While much research in the field of unsupervised learning focuses on the representation learning capabilities of machine learning systems, we motivate studying zero-shot transfer as a way of measuring the task-learning capabilities of machine learning systems. In this view, a dataset evaluates performance on a task on a specific distribution. However, many popular computer vision datasets were created by the research community primarily as benchmarks to guide the development of generic image classification methods rather than measuring performance on a specific task. While it is reasonable to say that the SVHN dataset measures the task of street number transcription on the distribution of Google Street View photos, it is unclear what “real” task the CIFAR-10 dataset measures. It is clear, however, what distribution CIFAR-10 is drawn from - TinyImages (Torralba et al., 2008). On these kinds of datasets, zero-shot transfer is more an evaluation of CLIP’s robustness to distribution shift and domain generalization rather than task generalization. Please see Section 3.3 for analysis focused on this.
To our knowledge, Visual N-Grams (Li et al., 2017) first studied zero-shot transfer to existing image classification datasets in the manner described above. It is also the only other work we are aware of that has studied zero-shot transfer to standard image classification datasets using a generically pre-trained model and serves as the best reference point for contextualizing CLIP. Their approach learns the parameters of a dictionary of 142,806 visual n-grams (spanning 1- to 5-grams) and optimizes these n-grams using a differential version of Jelinek-Mercer smoothing to maximize the probability of all text n-grams for a given image. In order to perform zero-shot transfer, they first convert the text of each of the dataset’s class names into its n-gram representation and then compute its probability according to their model, predicting the one with the highest score.
Our focus on studying zero-shot transfer as an evaluation of task learning is inspired by work demonstrating task learning in the field of NLP. To our knowledge Liu et al. (2018) first identified task learning as an “unexpected side-effect” when a language model trained to generate Wikipedia articles learned to reliably transliterate names between languages. While GPT-1 (Radford et al., 2018) focused on pre-training as a transfer learning method to improve supervised fine-tuning, it also included an ablation study demonstrating that the performance of four heuristic zero-shot transfer methods improved steadily over the course of pre-training, without any supervised adaption. This analysis served as the basis for GPT-2 (Radford et al., 2019) which focused exclusively on studying the task-learning capabilities of language models via zero-shot transfer.

3.1.2. USING CLIP FOR ZERO-SHOT TRANSFER

CLIP is pre-trained to predict if an image and a text snippet are paired together in its dataset. To perform zero-shot classification, we reuse this capability. For each dataset, we use the names of all the classes in the dataset as the set of potential text pairings and predict the most probable (image, text) pair according to CLIP. In a bit more detail, we first compute the feature embedding of the image and the feature embedding of the set of possible texts by their respective encoders. The cosine similarity of these embeddings is then calculated, scaled by a temperature parameter τ, and normalized into a probability distribution via a softmax. Note that this prediction layer is a multinomial logistic regression classifier with L2-normalized inputs, L2-normalized weights, no bias, and temperature scaling. When interpreted this way, the image encoder is the computer vision backbone which computes a feature representation for the image and the text encoder is a hypernetwork (Ha et al., 2016) which generates the weights of a linear classifier based on the text specifying the visual concepts that the classes represent. Lei Ba et al. (2015) first introduced a zero-shot image classifier of this form while the idea of generating a classifier from natural language dates back to at least Elhoseiny et al. (2013). Continuing with this interpretation, every step of CLIP pre-training can be viewed as optimizing the performance of a randomly created proxy to a computer vision dataset which contains 1 example per class and has 32,768 total classes defined via natural language descriptions. For zero-shot evaluation, we cache the zero-shot classifier once it has been computed by the text encoder and reuse it for all subsequent predictions. This allows the cost of generating it to be amortized across all the predictions in a dataset.
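The procedure above can be sketched as follows; `encode_text`, `encode_image`, `logit_scale`, and the `tokenize` function are assumed interfaces of a CLIP-style model, and the prompt template is the default discussed in Section 3.1.4.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def zero_shot_classify(model, tokenize, images, class_names,
                       template="A photo of a {}."):
    """Build the zero-shot classifier once from the class names, then score a
    batch of images against it."""
    # 1) Embed one prompt per class; this weight matrix can be cached and
    #    reused for every image in the dataset.
    prompts = tokenize([template.format(name) for name in class_names])
    class_weights = F.normalize(model.encode_text(prompts), dim=-1)   # [C, d]

    # 2) Embed the images and take temperature-scaled cosine similarities.
    image_emb = F.normalize(model.encode_image(images), dim=-1)       # [B, d]
    logits = model.logit_scale.exp() * image_emb @ class_weights.t()  # [B, C]

    # 3) A softmax over classes gives per-image class probabilities.
    return logits.softmax(dim=-1)
```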

3.1.3. INITIAL COMPARISON TO VISUAL N-GRAMS

In Table 1 we compare Visual N-Grams to CLIP. The best CLIP model improves accuracy on ImageNet from a proof of concept 11.5% to 76.2% and matches the performance of the original ResNet-50 despite using none of the 1.28 million crowd-labeled training examples available for this dataset. Additionally, the top-5 accuracy of CLIP models is noticeably higher than their top-1, and this model has a 95% top-5 accuracy, matching Inception-V4 (Szegedy et al., 2016). The ability to match the performance of a strong, fully supervised baseline in a zero-shot setting suggests CLIP is a significant step towards flexible and practical zero-shot computer vision classifiers. As mentioned above, the comparison to Visual N-Grams is meant for contextualizing the performance of CLIP and should not be interpreted as a direct methods comparison between CLIP and Visual N-Grams as many performance relevant differences between the two systems were not controlled for. For instance, we train on a dataset that is 10x larger, use a vision model that requires nearly 100x more compute per prediction, likely used over 1000x their training compute, and use a transformer-based model which did not exist when Visual N-Grams was published. As a closer comparison, we trained a CLIP ResNet-50 on the same YFCC100M dataset that Visual N-Grams was trained on and found it matched their reported ImageNet performance within a V100 GPU day. This baseline was also trained from scratch instead of being initialized from pre-trained ImageNet weights as in Visual N-Grams.
[Table 1]
CLIP also outperforms Visual N-Grams on the other 2 reported datasets. On aYahoo, CLIP achieves a 95% reduction in the number of errors, and on SUN, CLIP more than doubles the accuracy of Visual N-Grams. To conduct a more comprehensive analysis and stress test, we implement a much larger evaluation suite detailed in Appendix A. In total we expand from the 3 datasets reported in Visual N-Grams to include over 30 datasets and compare to over 50 existing computer vision systems to contextualize results.

3.1.4. PROMPT ENGINEERING AND ENSEMBLING

Most standard image classification datasets treat the information naming or describing classes, which enables natural language based zero-shot transfer, as an afterthought. The vast majority of datasets annotate images with just a numeric id of the label and contain a file mapping these ids back to their names in English. Some datasets, such as Flowers102 and GTSRB, don’t appear to include this mapping at all in their released versions, preventing zero-shot transfer entirely. For many datasets, we observed these labels may be chosen somewhat haphazardly and do not anticipate issues related to zero-shot transfer which relies on task description in order to transfer successfully.
A common issue is polysemy. When the name of a class is the only information provided to CLIP’s text encoder it is unable to differentiate which word sense is meant due to the lack of context. In some cases multiple meanings of the same word might be included as different classes in the same dataset! This happens in ImageNet which contains both construction cranes and cranes that fly. Another example is found in classes of the Oxford-IIIT Pet dataset where the word boxer is, from context, clearly referring to a breed of dog, but to a text encoder lacking context could just as likely refer to a type of athlete.
Another issue we encountered is that it’s relatively rare in our pre-training dataset for the text paired with the image to be just a single word. Usually the text is a full sentence describing the image in some way. To help bridge this distribution gap, we found the prompt template “A photo of a {label}.” to be a good default that helps specify the text is about the content of the image. This often improves performance over the baseline of using only the label text. For instance, just using this prompt improves accuracy on ImageNet by 1.3%.
Similar to the “prompt engineering” discussion around GPT-3 (Brown et al., 2020; Gao et al., 2020), we have also observed that zero-shot performance can be significantly improved by customizing the prompt text to each task. A few, non exhaustive, examples follow. We found on several fine-grained image classification datasets that it helped to specify the category. For example on Oxford-IIIT Pets, using “A photo of a {label}, a type of pet.” to help provide context worked well. Likewise, on Food101 specifying a type of food and on FGVC Aircraft a type of aircraft helped too. For OCR datasets, we found that putting quotes around the text or number to be recognized improved performance. Finally, we found that on satellite image classification datasets it helped to specify that the images were of this form and we use variants of “a satellite photo of a {label}.”.
We also experimented with ensembling over multiple zero-shot classifiers as another way of improving performance. These classifiers are computed by using different context prompts such as “A photo of a big {label}.” and “A photo of a small {label}.”. We construct the ensemble over the embedding space instead of probability space. This allows us to cache a single set of averaged text embeddings so that the compute cost of the ensemble is the same as using a single classifier when amortized over many predictions. We’ve observed ensembling across many generated zero-shot classifiers to reliably improve performance and use it for the majority of datasets. On ImageNet, we ensemble 80 different context prompts and this improves performance by an additional 3.5% over the single default prompt discussed above. When considered together, prompt engineering and ensembling improve ImageNet accuracy by almost 5%. In Figure 4 we visualize how prompt engineering and ensembling change the performance of a set of CLIP models compared to the contextless baseline approach of directly embedding the class name as done in Li et al. (2017).
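A sketch of ensembling in embedding space follows, reusing the assumed model and tokenizer interfaces from the zero-shot example above; the resulting weight matrix can be cached, so prediction is no more expensive than with a single prompt.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def ensemble_zero_shot_weights(model, tokenize, class_names, templates):
    """Average each class's prompt embeddings in embedding space so that the
    ensemble costs the same as a single classifier at prediction time."""
    weights = []
    for name in class_names:
        prompts = tokenize([t.format(name) for t in templates])   # e.g. 80 templates
        emb = F.normalize(model.encode_text(prompts), dim=-1)     # [T, d]
        weights.append(F.normalize(emb.mean(dim=0), dim=-1))      # average, re-normalize
    return torch.stack(weights, dim=0)                            # [C, d], cacheable

# Usage mirrors the single-prompt classifier:
# logits = model.logit_scale.exp() * image_emb @ ensemble_zero_shot_weights(...).t()
```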
[Figure 4]

3.1.5. ANALYSIS OF ZERO-SHOT CLIP PERFORMANCE

Since task-agnostic zero-shot classifiers for computer vision have been understudied, CLIP provides a promising opportunity to gain a better understanding of this type of model. In this section, we conduct a study of various properties of CLIP’s zero-shot classifiers. As a first question, we look simply at how well zero-shot classifiers perform. To contextualize this, we compare to the performance of a simple off-the-shelf baseline: fitting a fully supervised, regularized, logistic regression classifier on the features of the canonical ResNet-50. In Figure 5 we show this comparison across 27 datasets. Please see Appendix A for details of datasets and setup.
[Figure 5]

Zero-shot CLIP outperforms this baseline slightly more often than not and wins on 16 of the 27 datasets. Looking at individual datasets reveals some interesting behavior. On fine-grained classification tasks, we observe a wide spread in performance. On two of these datasets, Stanford Cars and Food101, zero-shot CLIP outperforms logistic regression on ResNet-50 features by over 20% while on two others, Flowers102 and FGVCAircraft, zero-shot CLIP underperforms by over 10%. On OxfordPets and Birdsnap, performance is much closer. We suspect these differences are primarily due to varying amounts of per-task supervision between WIT and ImageNet. On “general” object classification datasets such as ImageNet, CIFAR10/100, STL10, and PascalVOC2007 performance is relatively similar with a slight advantage for zero-shot CLIP in all cases. On STL10, CLIP achieves 99.3% overall which appears to be a new state of the art despite not using any training examples. Zero-shot CLIP significantly outperforms a ResNet-50 on two datasets measuring action recognition in videos. On Kinetics700, CLIP outperforms a ResNet-50 by 14.5%. Zero-shot CLIP also outperforms a ResNet-50’s features by 7.7% on UCF101. We speculate this is due to natural language providing wider supervision for visual concepts involving verbs, compared to the noun-centric object supervision in ImageNet.
Looking at where zero-shot CLIP notably underperforms, we see that zero-shot CLIP is quite weak on several specialized, complex, or abstract tasks such as satellite image classification (EuroSAT and RESISC45), lymph node tumor detection (PatchCamelyon), counting objects in synthetic scenes (CLEVRCounts), and self-driving related tasks such as German traffic sign recognition (GTSRB) and recognizing distance to the nearest car (KITTI Distance). These results highlight the poor capability of zero-shot CLIP on more complex tasks. By contrast, non-expert humans can robustly perform several of these tasks, such as counting, satellite image classification, and traffic sign recognition, suggesting significant room for improvement. However, we caution that it is unclear whether measuring zero-shot transfer, as opposed to few-shot transfer, is a meaningful evaluation for difficult tasks that a learner has no prior experience with, such as lymph node tumor classification for almost all humans (and possibly CLIP).
While comparing zero-shot performance to fully supervised models contextualizes the task-learning capabilities of CLIP, comparing to few-shot methods is a more direct comparison, since zero-shot is its limit. In Figure 6, we visualize how zero-shot CLIP compares to few-shot logistic regression on the features of many image models including the best publicly available ImageNet models, self-supervised learning methods, and CLIP itself. While it is intuitive to expect zero-shot to underperform one-shot, we instead find that zero-shot CLIP matches the performance of 4-shot logistic regression on the same feature space. This is likely due to an important difference between the zero-shot and few-shot approach. First, CLIP’s zero-shot classifier is generated via natural language which allows for visual concepts to be directly specified (“communicated”). By contrast, “normal” supervised learning must infer concepts indirectly from training examples. Context-less example-based learning has the drawback that many different hypotheses can be consistent with the data, especially in the one-shot case. A single image often contains many different visual concepts. Although a capable learner is able to exploit visual cues and heuristics, such as assuming that the concept being demonstrated is the primary object in an image, there is no guarantee.
[Figure 6]
A potential resolution of this discrepancy between zero-shot and few-shot performance is to use CLIP’s zero-shot classifier as a prior for the weights of the few-shot classifier. While adding an L2 penalty towards the generated weights is a straightforward implementation of this idea, we found that hyperparameter optimization would often select for such a large value of this regularizer that the resulting few-shot classifier was “just” the zero-shot classifier. Research into better methods of combining the strength of zero-shot transfer with flexibility of few-shot learning is a promising direction for future work.
When comparing zero-shot CLIP to few-shot logistic regression on the features of other models, zero-shot CLIP roughly matches the performance of the best performing 16-shot classifier in our evaluation suite, which uses the features of a BiT-M ResNet-152x2 trained on ImageNet-21K. We are certain that a BiT-L model trained on JFT-300M would perform even better but these models have not been publicly released. That a BiT-M ResNet-152x2 performs best in a 16-shot setting is somewhat surprising since, as analyzed in Section 3.2, the Noisy Student EfficientNet-L2 outperforms it in a fully supervised setting by almost 5% on average across 27 datasets.
In addition to studying the average performance of zero-shot CLIP and few-shot logistic regression, we also examine performance on individual datasets. In Figure 7, we show estimates for the number of labeled examples per class that a logistic regression classifier on the same feature space requires to match the performance of zero-shot CLIP. Since zero-shot CLIP is also a linear classifier, this estimates the effective data efficiency of zero-shot transfer in this setting. In order to avoid training thousands of linear classifiers, we estimate the effective data efficiency based on a log-linear interpolation of the performance of a 1, 2, 4, 8, 16-shot (when possible), and a fully supervised linear classifier trained on each dataset. We find that zero-shot transfer can have widely varying efficiency per dataset from less than 1 labeled example per class to 184. Two datasets, Flowers102 and EuroSAT, underperform one-shot models. Half of the datasets require less than 5 examples per class with a median of 5.4. However, the mean estimated data efficiency is 20.8 examples per class. This is due to the 20% of datasets where supervised classifiers require many labeled examples per class in order to match performance. On ImageNet, zero-shot CLIP matches the performance of a 16-shot linear classifier trained on the same feature space.
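The helper below is our own reading of that estimate, not the authors' code: it log-linearly interpolates few-shot accuracy as a function of labeled examples per class and returns the point at which a linear probe would match the zero-shot accuracy.

```python
import numpy as np


def examples_to_match_zero_shot(shots, few_shot_accs, zero_shot_acc):
    """Log-linear interpolation of few-shot accuracy vs. examples per class.
    Assumes accuracy increases monotonically with the number of shots and
    returns a possibly fractional number of examples."""
    log_shots = np.log(np.asarray(shots, dtype=float))
    accs = np.asarray(few_shot_accs, dtype=float)
    return float(np.exp(np.interp(zero_shot_acc, accs, log_shots)))

# e.g. examples_to_match_zero_shot([1, 2, 4, 8, 16], probe_accs, clip_zero_shot_acc)
```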
[Figure 7: The data efficiency of zero-shot transfer varies widely. Calculating the number of labeled examples per class a linear classifier on the same CLIP feature space requires to match the performance of the zero-shot classifier contextualizes the effectiveness of zero-shot transfer. Values are estimated based on log-linear interpolation of 1, 2, 4, 8, 16-shot and fully supervised results. Performance varies widely, from still underperforming a one-shot classifier on two datasets to matching an estimated 184 labeled examples per class.]
If we assume that evaluation datasets are large enough that the parameters of linear classifiers trained on them are well estimated, then, because CLIP’s zero-shot classifier is also a linear classifier, the performance of the fully supervised classifiers roughly sets an upper bound for what zero-shot transfer can achieve. In Figure 8 we compare CLIP’s zero-shot performance with fully supervised linear classifiers across datasets. The dashed, y = x line represents an “optimal” zero-shot classifier that matches the performance of its fully supervised equivalent. For most datasets, the performance of zero-shot classifiers still underperforms fully supervised classifiers by 10% to 25%, suggesting that there is still plenty of headroom for improving CLIP’s task-learning and zero-shot transfer capabilities.
[Figure 8: Zero-shot performance is correlated with linear probe performance but still mostly sub-optimal. Comparing zero-shot and linear probe performance across datasets shows a strong correlation, with zero-shot performance mostly 10 to 25 points lower. Only on 5 datasets does zero-shot performance approach linear probe performance (≤3 point difference).]

There is a positive correlation of 0.82 (p-value < 10−6) between zero-shot performance and fully supervised perfor mance, suggesting that CLIP is relatively consistent at connecting underlying representation and task learning to zeroshot transfer. However, zero-shot CLIP only approaches fully supervised performance on 5 datasets: STL10, CIFAR10, Food101, OxfordPets, and Caltech101. On all 5 datasets, both zero-shot accuracy and fully supervised accuracy are over 90%. This suggests that CLIP may be more effective at zero-shot transfer for tasks where its underlying representations are also high quality. The slope of a linear regression model predicting zero-shot performance as a function of fully supervised performance estimates that for every 1% improvement in fully supervised performance, zero-shot performance improves by 1.28%. However, the 95th-percentile confidence intervals still include values of less than 1 (0.93-1.79).零触发表现和完全监督表现之间存在0.82的正相关(p值< 10−6),表明CLIP在将潜在表征和任务学习与零触发迁移联系起来方面相对一致。然而,zero-shot CLIP仅在5个数据集上接近完全监督性能:STL 10,CIFAR 10,Food 101,OxfordPets和Caltech 101。在所有5个数据集上,零炮精度和全监督精度均超过90%。这表明,CLIP可能是更有效的零杆转移的任务,其底层表示也是高质量的。线性回归模型预测零触发性能作为完全监督性能的函数的斜率估计,完全监督性能每提高1%,零触发性能就提高1.28%。然而,第95百分位数置信区间仍包括小于1的值(0.93-1.79)。
Over the past few years, empirical studies of deep learning systems have documented that performance is predictable as a function of important quantities such as training compute and dataset size (Hestness et al., 2017; Kaplan et al., 2020). The GPT family of models has so far demonstrated consistent improvements in zero-shot performance across a 1000x increase in training compute. In Figure 9, we check whether the zero-shot performance of CLIP follows a similar scaling pattern. We plot the average error rate of the 5 ResNet CLIP models across 39 evaluations on 36 different datasets and find that a similar log-log linear scaling trend holds for CLIP across a 44x increase in model compute. While the overall trend is smooth, we found that performance on individual evaluations can be much noisier. We are unsure whether this is caused by high variance between individual training runs on sub-tasks (as documented in D’Amour et al. (2020)) masking a steadily improving trend or whether performance is actually non-monotonic as a function of compute on some tasks.在过去几年中,对深度学习系统的实证研究表明,性能可以作为训练计算量和数据集大小等重要量的函数来预测(Hestness等人,2017; Kaplan等人,2020)。GPT系列模型迄今为止在训练计算量增加1000倍的范围内,零样本性能持续改善。在图9中,我们检验CLIP的零样本性能是否遵循类似的扩展规律。我们在36个不同数据集上的39项评估中绘制了5个ResNet CLIP模型的平均错误率,发现在模型计算量增加44倍的范围内,CLIP保持类似的对数-对数线性扩展趋势。虽然总体趋势平滑,但我们发现单项评估的表现可能噪声大得多。我们不确定这是由于子任务上各次训练之间的高方差(如D’Amour等人(2020)所述)掩盖了稳步提升的趋势,还是在某些任务上性能确实是计算量的非单调函数。
图9:Zero-shot CLIP的性能随模型计算量平滑扩展。在36个不同数据集上的39项评估中,跨越5个不同CLIP模型、44倍计算量范围的平均零样本错误率可以很好地用对数-对数线性趋势建模。浅色阴影线为单项评估的表现,表明尽管总体趋势平滑,各项评估的差异要大得多。
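A log-log linear trend of the kind shown in Figure 9 can be fit as follows; the compute/error pairs here are invented for illustration only.图9所示的对数-对数线性趋势可以按如下方式拟合;其中的计算量和错误率数值为虚构示例。

```python
import numpy as np

# Hypothetical (relative compute, average zero-shot error rate) pairs for
# five models spanning a ~44x range of compute; placeholders, not the
# actual values behind Figure 9.
compute = np.array([1.0, 2.6, 6.6, 17.0, 44.0])
avg_error = np.array([0.62, 0.56, 0.50, 0.45, 0.40])

# Fit error ~ a * compute^b, i.e. a straight line in log-log space.
b, log_a = np.polyfit(np.log(compute), np.log(avg_error), deg=1)
print(f"scaling exponent b = {b:.3f} (error falls roughly as compute^{b:.2f})")
```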

3.2. Representation Learning 表示学习

While we have extensively analyzed the task-learning capabilities of CLIP through zero-shot transfer in the previous section, it is more common to study the representation learning capabilities of a model. There exist many ways to evaluate the quality of representations as well as disagreements over what properties an “ideal” representation should have (Locatello et al., 2020). Fitting a linear classifier on a representation extracted from the model and measuring its performance on various datasets is a common approach. An alternative is measuring the performance of end-to-end fine-tuning of the model. This increases flexibility, and prior work has convincingly demonstrated that fine-tuning outperforms linear classification on most image classification datasets (Kornblith et al., 2019; Zhai et al., 2019). While the high performance of fine-tuning motivates its study for practical reasons, we still opt for linear classifier based evaluation for several reasons. Our work is focused on developing a high-performing task and dataset-agnostic pre-training approach. Fine-tuning, because it adapts representations to each dataset during the fine-tuning phase, can compensate for and potentially mask failures to learn general and robust representations during the pre-training phase. Linear classifiers, because of their limited flexibility, instead highlight these failures and provide clear feedback during development. For CLIP, training supervised linear classifiers has the added benefit of being very similar to the approach used for its zero-shot classifiers which enables extensive comparisons and analysis in Section 3.1. Finally, we aim to compare CLIP to a comprehensive set of existing models across many tasks. Studying 66 different models on 27 different datasets requires tuning 1782 different evaluations. Fine-tuning opens up a much larger design and hyperparameter space, which makes it difficult to fairly evaluate and computationally expensive to compare a diverse set of techniques as discussed in other large scale empirical studies (Lucic et al., 2018; Choi et al., 2019). By comparison, linear classifiers require minimal hyper-parameter tuning and have standardized implementations and evaluation procedures. Please see Appendix A for further details on evaluation.虽然我们在前一节中通过零触发迁移广泛分析了CLIP的任务学习能力,但更常见的是研究模型的表征学习能力。存在许多方法来评估表示的质量以及关于“理想”表示应当具有什么性质的分歧(Locatello等人,2020年)的报告。在从模型中提取的表示上拟合线性分类器并在各种数据集上测量其性能是一种常见的方法。另一种方法是测量模型的端到端微调性能。这增加了灵活性,并且先前的工作已经令人信服地证明,在大多数图像分类数据集上,微调优于线性分类(Kornblith等人,2019年; Zhai等人,(2019年版)。尽管微调的高性能出于实际原因而激励了其研究,但出于几个原因,我们仍然选择基于线性分类器的评估。我们的工作重点是开发一种高性能的任务和数据集无关的预训练方法。由于微调在微调阶段使表示适应每个数据集,因此微调可以补偿并潜在地掩盖故障,以便在预训练阶段学习通用和鲁棒的表示。线性分类器,由于其有限的灵活性,反而突出这些失败,并在开发期间提供明确的反馈。对于CLIP,训练监督线性分类器具有额外的好处,即非常类似于用于其零触发分类器的方法,这使得能够在3.1节中进行广泛的比较和分析。最后,我们的目标是将CLIP与跨多个任务的一组全面的现有模型进行比较.研究27个不同数据集上的66个不同模型需要调整1782个不同的评估。微调打开了大得多的设计和超参数空间,这使得很难公平地评估和计算昂贵地比较在其他大规模经验研究中所讨论的一组不同的技术(Lucic等人,2018年; Choi等人,(2019年版)。相比之下,线性分类器需要最小的超参数调整,并且具有标准化的实现和评估过程。有关评价的更多详细信息,请参见附录A。
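As a minimal sketch of the linear-probe protocol described above, assuming image features have already been extracted from a frozen encoder (random arrays stand in for them here), the evaluation reduces to fitting an L2-regularized logistic regression with scikit-learn.作为上述线性探针评估流程的最小示意(假设已从冻结的图像编码器提取特征,这里用随机数组代替),评估归结为用scikit-learn拟合一个L2正则化的逻辑回归。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Stand-in for features extracted from a frozen image encoder; the shapes
# and labels below are illustrative only.
rng = np.random.default_rng(0)
train_features = rng.normal(size=(1000, 512))
train_labels = rng.integers(0, 10, 1000)
test_features = rng.normal(size=(200, 512))
test_labels = rng.integers(0, 10, 200)

# L2-regularized logistic regression on frozen features: the "linear probe".
# In practice the regularization strength C would be tuned on a validation split.
probe = LogisticRegression(C=1.0, max_iter=1000)
probe.fit(train_features, train_labels)
print("linear probe accuracy:", accuracy_score(test_labels, probe.predict(test_features)))
```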
Figure 10 summarizes our findings. To minimize selection effects that could raise concerns of confirmation or reporting bias, we first study performance on the 12 dataset evaluation suite from Kornblith et al. (2019). While small CLIP models such as a ResNet-50 and ResNet-101 outperform other ResNets trained on ImageNet-1K (BiT-S and the originals), they underperform ResNets trained on ImageNet-21K (BiT-M). These small CLIP models also underperform models in the EfficientNet family with similar compute requirements. However, models trained with CLIP scale very well and the largest model we trained (ResNet-50x64) slightly outperforms the best performing existing model (a Noisy Student EfficientNet-L2) on both overall score and compute efficiency. We also find that CLIP vision transformers are about 3x more compute efficient than CLIP ResNets, which allows us to reach higher overall performance within our compute budget. These results qualitatively replicate the findings of Dosovitskiy et al. (2020) which reported that vision transformers are more compute efficient than convnets when trained on sufficiently large datasets. Our best overall model is a ViT-L/14 that is fine-tuned at a higher resolution of 336 pixels on our dataset for 1 additional epoch. This model outperforms the best existing model across this evaluation suite by an average of 2.6%.图10总结了我们的发现。为了尽量减少可能引起确认或报告偏倚担忧的选择效应,我们首先研究了Kornblith等人(2019)的12个数据集评估套件上的性能。虽然ResNet-50和ResNet-101等小型CLIP模型优于在ImageNet-1K上训练的其他ResNet(BiT-S和原始模型),但它们不如在ImageNet-21K上训练的ResNet(BiT-M)。这些小型CLIP模型也不如计算需求相近的EfficientNet系列模型。然而,用CLIP训练的模型扩展性非常好,我们训练的最大模型(ResNet-50x64)在总体得分和计算效率上都略优于现有表现最佳的模型(Noisy Student EfficientNet-L2)。我们还发现,CLIP视觉Transformer的计算效率比CLIP ResNet高约3倍,这使我们能够在计算预算内达到更高的总体性能。这些结果定性地重复了Dosovitskiy等人(2020)的发现,即在足够大的数据集上训练时,视觉Transformer比卷积网络的计算效率更高。我们最好的整体模型是ViT-L/14,它在我们的数据集上以336像素的更高分辨率额外微调了1个epoch。该模型在该评估套件上比现有最佳模型平均高出2.6%。
图10:CLIP模型与最先进计算机视觉模型的线性探针性能对比,包括EfficientNet(Tan & Le,2019; Xie等人,2020)、MoCo(Chen等人,2020d)、Instagram预训练的ResNeXt模型(Mahajan等人,2018; Touvron等人,2019)、BiT(Kolesnikov等人,2019)、ViT(Dosovitskiy等人,2020)、SimCLRv2(Chen等人,2020c)、BYOL(Grill等人,2020)以及原始ResNet模型(He等人,2016b)。(左)Kornblith等人(2019)研究的12个数据集上的平均得分。(右)在分布更广泛的27个数据集上的平均得分。虚线表示在比预训练更高的分辨率下微调或评估图像的模型。各项得分见表10,每个数据集的图见图20。

As Figure 21 qualitatively shows, CLIP models learn a wider set of tasks than has previously been demonstrated in a single computer vision model trained end-to-end from random initialization. These tasks include geo-localization, optical character recognition, facial emotion recognition, and action recognition. None of these tasks are measured in the evaluation suite of Kornblith et al. (2019). This could be argued to be a form of selection bias in Kornblith et al. (2019)’s study towards tasks that overlap with ImageNet. To address this, we also measure performance on a broader 27 dataset evaluation suite. This evaluation suite, detailed in Appendix A includes datasets representing the aforementioned tasks, German Traffic Signs Recognition Benchmark (Stallkamp et al., 2011), as well as several other datasets adapted from VTAB (Zhai et al., 2019).如图21定性所示,CLIP模型学习的任务范围比之前在通过随机初始化进行端到端训练的单个计算机视觉模型中所展示的更广。这些任务包括地理定位、光学字符识别、面部情绪识别和动作识别。这些任务均未在Kornblith等人(2019)的评价套件中进行测量。在Kornblith et al.(2019)对与ImageNet重叠的任务的研究中,这可以被认为是一种选择偏倚。为了解决这一问题,我们还在更广泛的27个数据集评估套件上测量了性能。附录A中详述的该评估套件包括代表上述任务的数据集、德国交通标志识别基准(Stallkamp等人,2011),以及改编自VTAB的其他几个数据集(Zhai等人,(2019年版)。
On this broader evaluation suite, the benefits of CLIP are more clear. All CLIP models, regardless of scale, outperform all evaluated systems in terms of compute efficiency. The improvement in average score of the best model over previous systems increases from 2.6% to 5%. We also find that self-supervised systems do noticeably better on our broader evaluation suite. For instance, while SimCLRv2 still underperforms BiT-M on average on the 12 datasets of Kornblith et al. (2019), SimCLRv2 outperforms BiT-M on our 27 dataset evaluation suite. These findings suggest continuing to expand task diversity and coverage in order to better understand the “general” performance of systems. We suspect additional evaluation efforts along the lines of VTAB to be valuable.在这个更广泛的评估套件中,CLIP的优势更加明显。无论规模如何,所有CLIP型号在计算效率方面都优于所有评估的系统。与以前的系统相比,最佳模型的平均得分的提高从2.6%增加到5%。我们还发现,自我监督系统在我们更广泛的评估套件中表现得明显更好。例如,虽然SimCLRv 2在Kornblith等人(2019)的12个数据集上的平均表现仍然低于BiT-M,但SimCLRv 2在我们的27个数据集评估套件上的表现优于BiT-M。这些研究结果表明,继续扩大任务的多样性和覆盖面,以更好地了解系统的“一般”性能。我们怀疑沿着VTAB路线的额外评估工作是有价值的。
In addition to the aggregate analysis above, we visualize per-dataset differences in the performance of the best CLIP model and the best model in our evaluation suite across all 27 datasets in Figure 11. CLIP outperforms the Noisy Student EfficientNet-L2 on 21 of the 27 datasets. CLIP improves the most on tasks which require OCR (SST2 and HatefulMemes), geo-localization and scene recognition (Country211, SUN397), and activity recognition in videos (Kinetics700 and UCF101). In addition CLIP also does much better on fine-grained car and traffic sign recognition (Stanford Cars and GTSRB). This may reflect a problem with overly narrow supervision in ImageNet. A result such as the 14.7% improvement on GTSRB could be indicative of an issue with ImageNet-1K, which has only a single label for all traffic and street signs. This could encourage a supervised representation to collapse intra-class details and hurt accuracy on a fine-grained downstream task. As mentioned, CLIP still underperforms the EfficientNet on several datasets. Unsurprisingly, the dataset that the EfficientNet does best relative to CLIP on is the one it was trained on: ImageNet. The EfficientNet also slightly outperforms CLIP on low-resolution datasets such as CIFAR10 and CIFAR100. We suspect this is at least partly due to the lack of scale-based data augmentation in CLIP. The EfficientNet also does slightly better on PatchCamelyon and CLEVRCounts, datasets where overall performance is still low for both approaches.除了上述汇总分析之外,我们还在图11中可视化了在全部27个数据集上最佳CLIP模型与评估套件中最佳模型的逐数据集性能差异。CLIP在27个数据集中的21个上优于Noisy Student EfficientNet-L2。CLIP在需要OCR(SST2和HatefulMemes)、地理定位和场景识别(Country211、SUN397)以及视频中的动作识别(Kinetics700和UCF101)的任务上改进最多。此外,CLIP在细粒度汽车和交通标志识别(Stanford Cars和GTSRB)上也做得更好。这可能反映了ImageNet中监督过于狭窄的问题。GTSRB上14.7%的改进可能表明ImageNet-1K存在问题:它对所有交通和街道标志只有一个标签。这可能促使监督表示塌缩类内细节,从而损害细粒度下游任务的准确性。如前所述,CLIP在若干数据集上的表现仍不如EfficientNet。毫不奇怪,EfficientNet相对于CLIP表现最好的数据集正是它训练所用的数据集:ImageNet。EfficientNet在CIFAR10和CIFAR100等低分辨率数据集上也略优于CLIP。我们怀疑这至少部分是由于CLIP缺乏基于尺度的数据增强。EfficientNet在PatchCamelyon和CLEVRCounts上也稍好,这两个数据集上两种方法的总体性能都还很低。
图11:CLIP的特征在各种数据集上优于最佳ImageNet模型的特征。在27个数据集中的21个上,基于CLIP特征拟合的线性分类器优于基于Noisy Student EfficientNet-L2特征的线性分类器。

3.3. Robustness to Natural Distribution Shift 对自然分布偏移的稳健性

In 2015, it was announced that a deep learning model exceeded human performance on the ImageNet test set (He et al., 2015). However, research in the subsequent years has repeatedly found that these models still make many simple mistakes (Dodge & Karam, 2017; Geirhos et al., 2018; Alcorn et al., 2019), and new benchmarks testing these systems has often found their performance to be much lower than both their ImageNet accuracy and human accuracy (Recht et al., 2019; Barbu et al., 2019). What explains this discrepancy? Various ideas have been suggested and studied (Ilyas et al., 2019; Geirhos et al., 2020). A common theme of proposed explanations is that deep learning models are exceedingly adept at finding correlations and patterns which hold across their training dataset and thus improve in-distribution performance. However many of these correlations and patterns are actually spurious and do not hold for other distributions and result in large drops in performance on other datasets.2015年,有消息称,在ImageNet测试集上,深度学习模型的表现超过了人类的表现(He等人,2015年)的报告。然而,随后几年的研究一再发现,这些模型仍然会犯很多简单的错误(Dodge & Karam,2017; Geirhos等人,2018年; Alcorn等人,2019),而测试这些系统的新基准测试通常发现其性能远低于ImageNet准确度和人工准确度(Recht等人,2019年; Barbu等人,(2019年版)。如何解释这种差异?已经提出和研究了各种想法(Ilyas等人,2019年; Geirhos等人,2020年)的报告。提出的解释的一个共同主题是,深度学习模型非常擅长发现在其训练数据集内保持的相关性和模式,从而提高分布内的性能。然而,这些相关性和模式中的许多实际上是虚假的,并且不适用于其他分布,从而导致其他数据集的性能大幅下降。
We caution that, to date, most of these studies limit their evaluation to models trained on ImageNet. Recalling the topic of discussion, it may be a mistake to generalize too far from these initial findings. To what degree are these failures attributable to deep learning, ImageNet, or some combination of the two? CLIP models, which are trained via natural language supervision on a very large dataset and are capable of high zero-shot performance, are an opportunity to investigate this question from a different angle.我们要提醒的是,到目前为止,这些研究中的大多数都将其评估限制在ImageNet上训练的模型上。回顾一下讨论的主题,从这些初步发现中过于概括可能是错误的。这些失败在多大程度上可归因于深度学习、ImageNet或两者的某种组合?CLIP模型通过自然语言监督在非常大的数据集上进行训练,并且能够实现高零触发性能,这是从不同角度研究这个问题的机会。
Taori et al. (2020) is a recent comprehensive study moving towards quantifying and understanding these behaviors for ImageNet models. Taori et al. (2020) study how the performance of ImageNet models change when evaluated on natural distribution shifts. They measure performance on a set of 7 distribution shifts: ImageNetV2 (Recht et al., 2019), ImageNet Sketch (Wang et al., 2019), Youtube-BB and ImageNet-Vid (Shankar et al., 2019), ObjectNet (Barbu et al., 2019), ImageNet Adversarial (Hendrycks et al., 2019), and ImageNet Rendition (Hendrycks et al., 2020a). They distinguish these datasets, which all consist of novel images collected from a variety of sources, from synthetic distribution shifts such as ImageNet-C (Hendrycks & Dietterich, 2019), Stylized ImageNet (Geirhos et al., 2018), or adversarial attacks (Goodfellow et al., 2014) which are created by perturbing existing images in various ways. They propose this distinction in part because they find that while several techniques have been demonstrated to improve performance on synthetic distribution shifts, they often fail to yield consistent improvements on natural distributions.Taori等人(2020)是近期一项旨在量化和理解ImageNet模型这些行为的综合研究。Taori等人(2020)研究了在自然分布偏移上评估时ImageNet模型的性能如何变化。他们在一组7个分布偏移上测量性能:ImageNetV2(Recht等人,2019)、ImageNet Sketch(Wang等人,2019)、Youtube-BB和ImageNet-Vid(Shankar等人,2019)、ObjectNet(Barbu等人,2019)、ImageNet Adversarial(Hendrycks等人,2019)以及ImageNet Rendition(Hendrycks等人,2020a)。他们将这些均由从各种来源收集的新图像组成的数据集,与通过以各种方式扰动现有图像而构建的合成分布偏移区分开来,例如ImageNet-C(Hendrycks & Dietterich,2019)、Stylized ImageNet(Geirhos等人,2018)或对抗攻击(Goodfellow等人,2014)。他们提出这种区分的部分原因是,他们发现虽然已有若干技术被证明能提高在合成分布偏移上的性能,但这些技术往往无法在自然分布上带来一致的改进。
Across these collected datasets, the accuracy of ImageNet models drop well below the expectation set by the ImageNet validation set. For the following summary discussion we report average accuracy across all 7 natural distribution shift datasets and average accuracy across the corresponding class subsets of ImageNet unless otherwise specified. Additionally, for Youtube-BB and ImageNet-Vid, which have two different evaluation settings, we use the average of pm-0 and pm-10 accuracy.在这些收集的数据集中,ImageNet模型的准确性远远低于ImageNet验证集的预期。对于下面的总结讨论,我们报告了所有7个自然分布偏移数据集的平均准确度以及ImageNet相应类别子集的平均准确度,除非另有说明。此外,对于具有两种不同评估设置的Youtube-BB和ImageNet-Vid,我们使用pm-0和pm-10精度的平均值。
A ResNet-101 makes 5 times as many mistakes when evaluated on these natural distribution shifts compared to the ImageNet validation set. Encouragingly however, Taori et al. (2020) find that accuracy under distribution shift increases predictably with ImageNet accuracy and is well modeled as a linear function of logit-transformed accuracy. Taori et al. (2020) use this finding to propose that robustness analysis should distinguish between effective and relative robustness. Effective robustness measures improvements in accuracy under distribution shift above what is predicted by the documented relationship between in-distribution and out-of-distribution accuracy. Relative robustness captures any improvement in out-of-distribution accuracy. Taori et al. (2020) argue that robustness techniques should aim to improve both effective robustness and relative robustness.与ImageNet验证集相比,ResNet-101在这些自然分布偏移上评估时犯的错误是在ImageNet验证集上的5倍。然而,令人鼓舞的是,Taori等人(2020)发现分布偏移下的准确率随ImageNet准确率可预测地增加,并且可以很好地建模为logit变换后准确率的线性函数。Taori等人(2020)利用这一发现提出,鲁棒性分析应区分有效鲁棒性和相对鲁棒性。有效鲁棒性衡量在分布偏移下,准确率超出由已记录的分布内与分布外准确率关系所预测水平的提升。相对鲁棒性则反映分布外准确率的任何提升。Taori等人(2020)认为,鲁棒性技术应致力于同时提高有效鲁棒性和相对鲁棒性。
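Following the framing described above, effective robustness can be estimated by fitting a line between logit-transformed in-distribution and out-of-distribution accuracies for a pool of baseline models and measuring how far a new model sits above that trend; the accuracies below are placeholders, not results from the paper.按照上面描述的框架,有效鲁棒性可以这样估计:对一组基线模型的logit变换后分布内与分布外准确率拟合一条直线,再衡量新模型高出该趋势线多少;下面的准确率仅为示意数值。

```python
import numpy as np
from scipy.special import logit, expit

# Placeholder (in-distribution, out-of-distribution) accuracies for a pool
# of baseline ImageNet models used to fit the trend line.
id_acc = np.array([0.70, 0.74, 0.78, 0.81, 0.84])
ood_acc = np.array([0.38, 0.43, 0.49, 0.53, 0.58])

# Linear fit in logit space, as in the robustness analysis described above.
slope, intercept = np.polyfit(logit(id_acc), logit(ood_acc), deg=1)

def effective_robustness(model_id_acc, model_ood_acc):
    """OOD accuracy above what the baseline trend predicts at this ID accuracy."""
    predicted_ood = expit(slope * logit(model_id_acc) + intercept)
    return model_ood_acc - predicted_ood

# A hypothetical zero-shot model: similar ID accuracy to a baseline, higher OOD accuracy.
print(f"effective robustness: {effective_robustness(0.76, 0.58):+.3f}")
```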
Almost all models studied in Taori et al. (2020) are trained or fine-tuned on the ImageNet dataset. Returning to the discussion in the introduction to this section - is training or adapting to the ImageNet dataset distribution the cause of the observed robustness gap? Intuitively, a zero-shot model should not be able to exploit spurious correlations or patterns that hold only on a specific distribution, since it is not trained on that distribution. Thus it is reasonable to expect zero-shot models to have much higher effective robustness. In Figure 13, we compare the performance of zero-shot CLIP with existing ImageNet models on natural distribution shifts. All zero-shot CLIP models improve effective robustness by a large amount and reduce the size of the gap between ImageNet accuracy and accuracy under distribution shift by up to 75%.Taori等人(2020)研究的几乎所有模型都是在ImageNet数据集上训练或微调的。回到本节引言中的讨论:在ImageNet数据集分布上训练或对其适应,是否就是观察到的鲁棒性差距的原因?直观地说,零样本模型不应能够利用仅在特定分布上成立的虚假相关性或模式,因为它并未在该分布上训练。因此,可以合理地期望零样本模型具有高得多的有效鲁棒性。在图13中,我们比较了zero-shot CLIP与现有ImageNet模型在自然分布偏移上的性能。所有zero-shot CLIP模型都大幅提高了有效鲁棒性,并将ImageNet准确率与分布偏移下准确率之间的差距缩小了多达75%。
While these results show that zero-shot models can be much more robust, they do not necessarily mean that supervised learning on ImageNet causes a robustness gap. Other details of CLIP, such as its large and diverse pre-training dataset or use of natural language supervision could also result in much more robust models regardless of whether they are zero-shot or fine-tuned. As an initial experiment to potentially begin narrowing this down, we also measure how the performance of CLIP models change after adapting to the ImageNet distribution via a L2 regularized logistic regression classifier fit to CLIP features on the ImageNet training set. We visualize how performance changes from the zero-shot classifier in Figure 14. Although adapting CLIP to the ImageNet distribution increases its ImageNet accuracy by 9.2% to 85.4% overall, and ties the accuracy of the 2018 SOTA from Mahajan et al. (2018), average accuracy under distribution shift slightly decreases.虽然这些结果表明,零触发模型可以更加鲁棒,但它们并不一定意味着ImageNet上的监督学习会导致鲁棒性差距。CLIP的其他细节,例如其庞大而多样化的预训练数据集或使用自然语言监督,也可以产生更强大的模型,无论它们是零射击还是微调。作为可能开始缩小范围的初始实验,我们还通过L2正则化逻辑回归分类器来测量CLIP模型在适应ImageNet分布后的性能变化,该分类器适合ImageNet训练集上的CLIP特征。我们在图14中可视化了零触发分类器的性能变化。尽管将CLIP适应ImageNet分布将其ImageNet准确度整体提高了9.2%至85.4%,并与Mahajan等人(2018)的2018 SOTA的准确度保持一致,但分布偏移下的平均准确度略有下降。
It is surprising to see a 9.2% increase in accuracy, which corresponds to roughly 3 years of improvement in SOTA, fail to translate into any improvement in average performance under distribution shift. We also break down the differences between zero-shot accuracy and linear classifier accuracy per dataset in Figure 14 and find performance still increases significantly on one dataset, ImageNetV2. ImageNetV2 closely followed the creation process of the original ImageNet dataset which suggests that gains in accuracy from supervised adaptation are closely concentrated around the ImageNet distribution. Performance decreases by 4.7% on ImageNet-R, 3.8% on ObjectNet, 2.8% on ImageNet Sketch, and 1.9% on ImageNet-A. The change in accuracy on the two other datasets, Youtube-BB and ImageNet Vid, is insignificant.令人惊讶的是,9.2%的准确率提升(大约相当于SOTA三年的进步)在分布偏移下未能转化为平均性能的任何改进。我们还在图14中分解了每个数据集上零样本准确率与线性分类器准确率之间的差异,发现在一个数据集ImageNetV2上性能仍显著提高。ImageNetV2严格遵循了原始ImageNet数据集的构建流程,这表明监督适应带来的准确率增益紧密集中在ImageNet分布附近。性能在ImageNet-R上下降了4.7%,在ObjectNet上下降了3.8%,在ImageNet Sketch上下降了2.8%,在ImageNet-A上下降了1.9%。另外两个数据集Youtube-BB和ImageNet Vid上的准确率变化不显著。
How is it possible to improve accuracy by 9.2% on the ImageNet dataset with little to no increase in accuracy under distribution shift? Is the gain primarily from “exploiting spurious correlations”? Is this behavior unique to some combination of CLIP, the ImageNet dataset, and the distribution shifts studied, or a more general phenomenon? Does it hold for end-to-end finetuning as well as linear classifiers? We do not have confident answers to these questions at this time. Prior work has also pre-trained models on distributions other than ImageNet, but it is common to study and release models only after they have been fine-tuned to ImageNet. As a step towards understanding whether pre-trained zero-shot models consistently have higher effective robustness than fine-tuned models, we encourage the authors of Mahajan et al. (2018), Kolesnikov et al. (2019), and Dosovitskiy et al. (2020) to, if possible, study these questions on their models as well.如何在ImageNet数据集上将准确率提高9.2%,而在分布偏移下的准确率几乎没有提高?这一增益是否主要来自“利用虚假相关性”?这种行为是CLIP、ImageNet数据集和所研究的分布偏移的某种组合所独有的,还是一种更普遍的现象?它是否同样适用于端到端微调和线性分类器?我们目前对这些问题没有确定的答案。先前的工作也在ImageNet以外的分布上预训练了模型,但通常只有在模型针对ImageNet微调之后才进行研究和发布。作为理解预训练零样本模型是否始终比微调模型具有更高有效鲁棒性的一步,我们鼓励Mahajan等人(2018)、Kolesnikov等人(2019)和Dosovitskiy等人(2020)的作者在可能的情况下也在他们的模型上研究这些问题。
We also investigate another robustness intervention enabled by flexible zero-shot natural-language-based image classifiers. The target classes across the 7 transfer datasets are not always perfectly aligned with those of ImageNet. Two datasets, Youtube-BB and ImageNet-Vid, consist of superclasses of ImageNet. This presents a problem when trying to use the fixed 1000-way classifier of an ImageNet model to make predictions. Taori et al. (2020) handle this by max pooling predictions across all sub-classes according to the ImageNet class hierarchy. Sometimes this mapping is much less than perfect. For the person class in Youtube-BB, predictions are made by pooling over the ImageNet classes for a baseball player, a bridegroom, and a scuba diver. With CLIP we can instead generate a custom zero-shot classifier for each dataset directly based on its class names. In Figure 14 we see that this improves average effective robustness by 5% but is concentrated in large improvements on only a few datasets. Curiously, accuracy on ObjectNet also increases by 2.3%. Although the dataset was designed to closely overlap with ImageNet classes, using the names provided for each class by ObjectNet’s creators still helps a small amount compared to using ImageNet class names and pooling predictions when necessary.我们还研究了灵活的、基于自然语言的零样本图像分类器所支持的另一种鲁棒性干预。这7个迁移数据集的目标类别并不总是与ImageNet的类别完全一致。其中两个数据集Youtube-BB和ImageNet-Vid由ImageNet的超类组成。当试图使用ImageNet模型固定的1000路分类器进行预测时,这就带来了问题。Taori等人(2020)的处理方式是根据ImageNet类别层次结构对所有子类的预测进行最大池化。有时这种映射远非完美。对于Youtube-BB中的person类,预测是通过在棒球运动员、新郎和潜水员等ImageNet类别上池化得到的。借助CLIP,我们可以直接根据每个数据集的类名为其生成自定义的零样本分类器。在图14中我们看到,这将平均有效鲁棒性提高了5%,但提升主要集中在少数数据集的大幅改进上。奇怪的是,ObjectNet上的准确率也提高了2.3%。尽管该数据集被设计为与ImageNet类别紧密重叠,但与使用ImageNet类名并在必要时池化预测相比,使用ObjectNet创建者为每个类别提供的名称仍有少量帮助。
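A dataset-specific zero-shot classifier of the kind described above can be built directly from class names with the released clip package; the class names, single prompt template, and image path below are illustrative only, and this simplified sketch omits any prompt ensembling.上面描述的那种针对具体数据集的零样本分类器,可以用开源的clip包直接从类名构建;下面的类名、提示模板和图像路径仅为示例,且省略了提示集成等细节。

```python
import torch
import clip
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")

# Illustrative class names for one target dataset; in practice these would
# be the dataset's own class names rather than ImageNet's.
class_names = ["baseball player", "bridegroom", "scuba diver"]
prompts = clip.tokenize([f"a photo of a {name}" for name in class_names])

with torch.no_grad():
    text_features = model.encode_text(prompts)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # hypothetical image path
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```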
While zero-shot CLIP improves effective robustness, Figure 14 shows that the benefit is almost entirely gone in a fully supervised setting. To better understand this difference, we investigate how effective robustness changes on the continuum from zero-shot to fully supervised. In Figure 15 we visualize the performance of 0-shot, 1-shot, 2-shot, 4-shot …, 128-shot, and fully supervised logistic regression classifiers on the best CLIP model’s features. We see that while few-shot models also show higher effective robustness than existing models, this benefit fades as in-distribution performance increases with more training data and is mostly, though not entirely, gone for the fully supervised model. Additionally, zero-shot CLIP is notably more robust than a few-shot model with equivalent ImageNet performance.虽然zero-shot CLIP提高了有效鲁棒性,但图14显示,在全监督设置下这种收益几乎完全消失。为了更好地理解这种差异,我们研究了从零样本到全监督这一连续过程中有效鲁棒性如何变化。在图15中,我们可视化了在最佳CLIP模型特征上训练的0样本、1样本、2样本、4样本、…、128样本以及全监督逻辑回归分类器的性能。我们看到,虽然少样本模型也表现出比现有模型更高的有效鲁棒性,但随着训练数据增多、分布内性能提升,这种优势逐渐消退,并且对于全监督模型来说大部分(尽管不是全部)消失了。此外,zero-shot CLIP明显比具有同等ImageNet性能的少样本模型更鲁棒。
图14:虽然针对ImageNet的监督适应将ImageNet准确率提高了9.2%,但它略微降低了平均鲁棒性。(左)与使用单个静态zero-shot ImageNet分类器并像Taori等人(2020)那样对相似类别的预测进行池化相比,为每个数据集定制zero-shot CLIP可以提高鲁棒性。适应ImageNet的CLIP模型具有与最佳ImageNet模型相似的有效鲁棒性。(右)两种鲁棒性干预在各数据集上准确率变化的细节。适应ImageNet显著提高了ImageNetV2上的准确率,但在其他几个分布上牺牲了准确率。数据集特定的零样本分类器可以大幅提高准确率,但仅限于少数类别与ImageNet不完全一致的数据集。
图15:与现有ImageNet模型相比,少样本CLIP也提高了有效鲁棒性,但不如zero-shot CLIP鲁棒。尽量减少用于适应的ImageNet训练数据量可以提高有效鲁棒性,但代价是降低相对鲁棒性。如图7所示,16样本逻辑回归CLIP在ImageNet上与zero-shot CLIP相当,但鲁棒性较差。

Across our experiments, high effective robustness seems to result from minimizing the amount of distribution specific training data a model has access to, but this comes at a cost of reducing dataset-specific performance.在我们的实验中,高有效鲁棒性似乎来自于尽量减少模型所能接触的特定分布训练数据的数量,但这是以降低数据集特定性能为代价的。
Taken together, these results suggest that the recent shift towards large-scale task and dataset agnostic pre-training combined with a reorientation towards zero-shot and few-shot benchmarking on broad evaluation suites (as advocated by Yogatama et al. (2019) and Linzen (2020)) promotes the development of more robust systems and provides a more accurate assessment of performance. We are curious to see if the same results hold for zero-shot models in the field of NLP such as the GPT family. While Hendrycks et al. (2020b) has reported that pre-training improves relative robustness on sentiment analysis, Miller et al. (2020)’s study of the robustness of question answering models under natural distribution shift finds, similar to Taori et al. (2020), little evidence of effective robustness improvements to date.总的来说,这些结果表明,最近向大规模任务和数据集无关的预训练的转变,再加上面向广泛评估套件上零样本和少样本基准测试的重新定位(如Yogatama等人(2019)和Linzen(2020)所倡导的),促进了更鲁棒系统的开发,并提供了更准确的性能评估。我们很想知道,对于NLP领域的零样本模型(如GPT家族),是否也有同样的结果。虽然Hendrycks等人(2020b)报告称预训练提高了情感分析的相对鲁棒性,但Miller等人(2020)对自然分布偏移下问答模型鲁棒性的研究发现,与Taori等人(2020)类似,迄今为止几乎没有有效鲁棒性改进的证据。

4. Comparison to Human Performance 与人类表现的比较

How does CLIP compare to human performance and human learning? To get a better understanding of how well humans perform in similar evaluation settings to CLIP, we evaluated humans on one of our tasks. We wanted to get a sense of how strong human zero-shot performance is at these tasks, and how much human performance is improved if they are shown one or two image samples. This can help us to compare task difficulty for humans and CLIP, and identify correlations and differences between them.CLIP与人类表现和人类学习相比如何?为了更好地了解人类在与CLIP类似的评估环境中的表现,我们对人类进行了一项任务的评估。我们希望了解人类在这些任务中的零拍摄性能有多强,以及如果向他们展示一两个图像样本,人类的性能会提高多少。这可以帮助我们比较人类和CLIP的任务难度,并确定它们之间的相关性和差异。
We had five different humans look at each of 3669 images in the test split of the Oxford IIT Pets dataset (Parkhi et al., 2012) and select which of the 37 cat or dog breeds best matched the image (or ‘I don’t know’ if they were completely uncertain). In the zero-shot case the humans were given no examples of the breeds and asked to label them to the best of their ability without an internet search. In the one-shot experiment the humans were given one sample image of each breed and in the two-shot experiment they were given two sample images of each breed.5我们让五个不同的人在牛津IIT宠物数据集的测试分割中查看3669个图像中的每一个(Parkhi等人,2012年),并选择37个猫或狗品种最匹配的图像(或’我不知道’,如果他们完全不确定)。在零射击的情况下,人类没有得到任何品种的例子,并要求他们尽最大努力在没有互联网搜索的情况下标记它们。在一次拍摄实验中,每个品种的人都有一个样本图像,而在两次拍摄实验中,每个品种的人都有两个样本图像。
One possible concern was that the human workers were not sufficiently motivated in the zero-shot task. High human accuracy of 94% on the STL-10 dataset (Coates et al., 2011) and 97-100% accuracy on the subset of attention check images increased our trust in the human workers.一个可能的担忧是,人类工作者在零射击任务中没有足够的动力。STL-10数据集上94%的高人类准确性(科茨et al.,2011)和97-100%的注意力检查图像子集的准确率增加了我们对人类工作者的信任。
Interestingly, humans went from a performance average of 54% to 76% with just one training example per class, and the marginal gain from an additional training example is minimal. The gain in accuracy going from zero to one shot is almost entirely on images that humans were uncertain about. This suggests that humans “know what they don’t know” and are able to update their priors on the images they are most uncertain in based on a single example. Given this, it seems that while CLIP is a promising training strategy for zero-shot performance (Figure 5) and does well on tests of natural distribution shift (Figure 13), there is a large difference between how humans learn from a few examples and the few-shot methods in this paper.有趣的是,每个类只需一个训练示例,人类的平均性能就从54%提高到了76%,而额外训练示例的边际收益微乎其微。从零到一次拍摄的准确性几乎完全取决于人类不确定的图像。这表明人类“知道他们不知道的”,并且能够基于单个示例更新他们最不确定的图像的先验知识。鉴于此,尽管CLIP似乎是一种很有前途的零触发性能训练策略(图5),并且在自然分布偏移测试中表现良好(图13),但人类如何从几个示例中学习与本文中的几个方法之间存在很大差异。
This suggests that there are still algorithmic improvements waiting to be made to decrease the gap between machine and human sample efficiency, as noted by Lake et al. (2016) and others. Because these few-shot evaluations of CLIP don’t make effective use of prior knowledge and the humans do, we speculate that finding a method to properly integrate prior knowledge into few-shot learning is an important step in algorithmic improvements to CLIP. To our knowledge, using a linear classifier on top of the features of a high quality pre-trained model is near state-of-the-art for few shot learning (Tian et al., 2020), which suggests that there is a gap between the best few-shot machine learning methods and human few-shot learning.这表明,仍有算法改进等待进行,以减少机器和人类样本效率之间的差距,如Lake et al.(2016)和其他人所指出的。由于CLIP的这些少量评估没有有效地利用先验知识,而人类却做到了,因此我们推测,找到一种将先验知识正确整合到少量学习中的方法是CLIP算法改进的重要一步。据我们所知,在高质量预训练模型的特征之上使用线性分类器对于少镜头学习来说接近最先进水平(Tian等人,2020年),这表明最好的几次机器学习方法和人类的几次学习之间存在差距。
If we plot human accuracy vs CLIP’s zero shot accuracy (Figure 16), we see that the hardest problems for CLIP are also hard for humans. To the extent that errors are consistent, our hypothesis is that this is due to at least two factors: noise in the dataset (including mislabeled images) and out-of-distribution images being hard for both humans and models.如果我们绘制人类准确率与CLIP零样本准确率的对比(图16),我们会看到对CLIP最难的问题对人类来说也很难。在误差一致的范围内,我们的假设是这至少有两个因素:数据集中的噪声(包括标注错误的图像),以及分布外图像对人类和模型都很困难。
图16:对CLIP最困难的问题往往对人类也最困难。这里我们按CLIP的难度(以正确标签的概率衡量)对图像类别进行排序。

5. Data Overlap Analysis

A concern with pre-training on a very large internet dataset is unintentional overlap with downstream evals. This is important to investigate since, in a worst-case scenario, a complete copy of an evaluation dataset could leak into the pre-training dataset and invalidate the evaluation as a meaningful test of generalization. One option to prevent this is to identify and remove all duplicates before training a model. While this guarantees reporting true hold-out performance, it requires knowing all possible data which a model might be evaluated on ahead of time. This has the downside of limiting the scope of benchmarking and analysis. Adding a new evaluation would require an expensive re-train or risk reporting an un-quantified benefit due to overlap.在非常大的互联网数据集上进行预训练的一个问题是与下游评估的无意重叠。这一点很重要,因为在最坏的情况下,评估数据集的完整副本可能会泄漏到预训练数据集中,并使评估作为有意义的泛化测试无效。防止这种情况的一个选择是在训练模型之前识别并删除所有重复项。虽然这保证了报告真实的保持性能,但它需要提前知道模型可能评估的所有可能数据。这有限制基准和分析范围的缺点。增加一个新的评价将需要昂贵的再培训或由于重叠而报告未量化的效益的风险。
Instead, we document how much overlap occurs and how performance changes due to these overlaps. To do this, we use the following procedure:相反,我们记录发生了多少重叠,以及性能如何因这些重叠而变化。为此,我们使用以下流程:

  1. For each evaluation dataset, we run a duplicate detector (see Appendix C) on its examples. We then manually inspect the found nearest neighbors and set a per dataset threshold to keep high precision while maximizing recall. Using this threshold, we then create two new subsets, Overlap, which contains all examples which have a similarity to a training example above the threshold, and Clean, which contains all examples that are below this threshold. We denote the unaltered full dataset All for reference. From this we first record the degree of data contamination as the ratio of the number of examples in Overlap to the size of All.1)对于每个评估数据集,我们在其示例上运行重复检测器(参见附录C)。然后,我们手动检查找到的最近邻居,并设置每个数据集的阈值,以保持高精度,同时最大化召回率。使用这个阈值,我们创建了两个新的子集,Overlap,包含与阈值以上的训练示例具有相似性的所有示例,以及Clean,包含低于此阈值的所有示例。我们表示未更改的完整数据集All以供参考。首先,我们将数据污染的程度记录为Overlap中的样本数量与All大小的比率。
  2. We then compute the zero-shot accuracy of CLIP RN50x64 on the three splits and report All - Clean as our main metric. This is the difference in accuracy due to contamination. When positive it is our estimate of how much the overall reported accuracy on the dataset was inflated by over-fitting to overlapping data.2)然后,我们计算CLIP RN 50 x64在三个分割上的零射击精度,并报告All - Clean作为我们的主要指标。这是由于污染造成的精度差异。当为正值时,它是我们对数据集上报告的总体准确性因过度拟合重叠数据而膨胀的估计。
  3. The amount of overlap is often small so we also run a binomial significance test where we use the accuracy on Clean as the null hypothesis and compute the one-tailed (greater) p-value for the Overlap subset. We also calculate 99.5% Clopper-Pearson confidence intervals on Dirty as another check.3)重叠量通常很小,因此我们还运行了一个二项式显著性检验,其中我们使用Clean的准确度作为零假设,并计算重叠子集的单侧(较大)p值。我们还计算了Dirty的99.5% Clopper-Pearson置信区间作为另一种检查。
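The significance checks in steps 2-3 can be sketched as follows with scipy (binomtest requires scipy 1.7 or later); the correct/total counts below are hypothetical.步骤2-3中的显著性检验可以用scipy大致实现如下(binomtest需scipy 1.7及以上版本);其中的正确数和总数为假设数值。

```python
from scipy.stats import binomtest

# Hypothetical counts for one dataset: the detector flagged 200 of 10,000
# test examples as Overlap; the remainder form the Clean split.
clean_correct, clean_total = 7_350, 9_800
overlap_correct, overlap_total = 160, 200

clean_acc = clean_correct / clean_total  # accuracy under the null hypothesis

# One-tailed (greater) binomial test: is accuracy on Overlap significantly
# higher than on Clean?
test = binomtest(overlap_correct, overlap_total, p=clean_acc, alternative="greater")
print(f"p-value: {test.pvalue:.4f}")

# 99.5% Clopper-Pearson ("exact") confidence interval for accuracy on Overlap.
ci = test.proportion_ci(confidence_level=0.995, method="exact")
print(f"Overlap accuracy: {overlap_correct / overlap_total:.3f}, "
      f"99.5% CI: [{ci.low:.3f}, {ci.high:.3f}]")
```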
A summary of this analysis is presented in Figure 17. Out of 35 datasets studied, 9 datasets have no detected overlap at all. Most of these datasets are synthetic or specialized making them unlikely to be posted as normal images on the internet (for instance MNIST, CLEVR, and GTSRB) or are guaranteed to have no overlap due to containing novel data from after the date our dataset was created (ObjectNet and Hateful Memes). This demonstrates our detector has a low-false positive rate which is important as false positives would under-estimate the effect of contamination in our analysis. There is a median overlap of 2.2% and an average overlap of 3.2%. Due to this small amount of overlap, overall accuracy is rarely shifted by more than 0.1% with only 7 datasets above this threshold. Of these, only 2 are statistically significant after Bonferroni correction. The max detected improvement is only 0.6% on Birdsnap which has the second largest overlap at 12.1%. The largest overlap is for Country211 at 21.5%. This is due to it being constructed out of YFCC100M, which our pre-training dataset contains a filtered subset of. Despite this large overlap there is only a 0.2% increase in accuracy on Country211. This may be because the training text accompanying an example is often not related to the specific task a downstream eval measures. Country211 measures geo-localization ability, but inspecting the training text for these duplicates showed they often do not mention the location of the image.该分析的总结见图17。在所研究的35个数据集中,有9个数据集完全没有检测到重叠。这些数据集大多是合成的或专门化的,因此不太可能作为普通图像发布在互联网上(例如MNIST、CLEVR和GTSRB),或者由于包含我们数据集创建日期之后的新数据(ObjectNet和Hateful Memes)而保证没有重叠。这表明我们的检测器具有较低的误报率,这很重要,因为误报会低估我们分析中污染的影响。重叠的中位数为2.2%,平均为3.2%。由于重叠量很小,总体准确率的变化很少超过0.1%,只有7个数据集高于此阈值。其中,只有2个在Bonferroni校正后具有统计学显著性。检测到的最大改善仅为Birdsnap上的0.6%,其重叠率为12.1%,居第二位。重叠最大的是Country211,为21.5%。这是因为它构建自YFCC100M,而我们的预训练数据集包含其过滤后的子集。尽管重叠如此之大,Country211上的准确率仅提高了0.2%。这可能是因为示例附带的训练文本通常与下游评估所测量的具体任务无关。Country211衡量地理定位能力,但检查这些重复样本的训练文本发现,它们通常没有提到图像的位置。
图17:检测到的数据重叠几乎没有带来统计学上显著的准确率提升。(左)虽然有几个数据集在检测到的重叠示例与干净示例上的零样本准确率有高达±20%的表面差异,但在总共35个数据集中,只有5个数据集的99.5% Clopper-Pearson置信区间排除了0%的准确率差异,其中2个数据集在重叠数据上表现更差。(右)由于检测到的重叠示例所占百分比几乎总是个位数,由重叠带来的总体测试准确率增益要小得多,Birdsnap上的最大估计增幅仅为0.6%。同样,使用单侧二项检验计算时,只有6个数据集的准确率改善具有统计学显著性。
We are aware of two potential concerns with our analysis. First our detector is not perfect. While it achieves near 100% accuracy on its proxy training task and manual inspection + threshold tuning results in very high precision with good recall among the found nearest-neighbors, we cannot tractably check its recall across 400 million examples. Another potential confounder of our analysis is that the underlying data distribution may shift between the Overlap and Clean subsets. For example, on Kinetics-700 many “overlaps” are in fact all black transition frames. This explains why Kinetics-700 has an apparent 20% accuracy drop on Overlap. We suspect more subtle distribution shifts likely exist. One possibility we noticed on CIFAR-100 is that, due to the very low resolution of its images, many duplicates were false positives of small objects such as birds or planes. Changes in accuracy could instead be due to changes in the class distribution or difficulty of the duplicates. Unfortunately, these distribution and difficulty shifts could also mask the effects of over-fitting.我们意识到我们的分析存在两个潜在问题。首先,我们的检测器并不完美。虽然它在代理训练任务上达到了接近100%的准确率,并且人工检查加阈值调整在找到的最近邻中实现了很高的精确率和良好的召回率,但我们无法切实地在4亿个样本上检验它的召回率。我们分析的另一个潜在混杂因素是,底层数据分布可能在Overlap和Clean子集之间发生变化。例如,在Kinetics-700上,许多“重叠”实际上都是全黑的过渡帧。这解释了为什么Kinetics-700在Overlap上有明显的20%的准确率下降。我们怀疑可能存在更微妙的分布变化。我们在CIFAR-100上注意到的一种可能性是,由于其图像分辨率非常低,许多重复项实际上是鸟或飞机等小物体的误报。准确率的变化也可能是由于类别分布或重复样本难度的变化。不幸的是,这些分布和难度的变化也可能掩盖过拟合的影响。
However, these results closely follow the findings of similar duplicate analysis in previous work on large scale pretraining. Mahajan et al. (2018) and Kolesnikov et al. (2019) detected similar overlap rates and found minimal changes in overall performance. Importantly, Kolesnikov et al. (2019) also compared the alternative de-duplication strategy discussed in the introduction to this section with the approach we settled on and observed little difference between the two approaches.然而,这些结果与以前大规模预训练工作中类似重复数据分析的发现密切吻合。Mahajan等人(2018)和Kolesnikov等人(2019)检测到相似的重叠率,并发现总体性能变化极小。重要的是,Kolesnikov等人(2019)还将本节引言中讨论的替代去重策略与我们最终采用的方法进行了比较,并观察到两种方法之间几乎没有差异。

6. Limitations

There are still many limitations to CLIP. While several of these are discussed as part of analysis in various sections, we summarize and collect them here.CLIP仍然有许多限制。虽然其中一些作为分析的一部分在各个部分进行了讨论,但我们在这里总结和收集它们。
On datasets with training splits, the performance of zero-shot CLIP is on average competitive with the simple supervised baseline of a linear classifier on top of ResNet-50 features. On most of these datasets, the performance of this baseline is now well below the overall state of the art. Significant work is still needed to improve the task learning and transfer capabilities of CLIP. While scaling has so far steadily improved performance and suggests a route for continued improvement, we estimate around a 1000x increase in compute is required for zero-shot CLIP to reach overall state-of-the-art performance. This is infeasible to train with current hardware. Further research into improving upon the computational and data efficiency of CLIP will be necessary.在具有训练集划分的数据集上,zero-shot CLIP的性能平均而言与基于ResNet-50特征的线性分类器这一简单监督基线相当。在这些数据集中的大多数上,该基线的性能现在远低于总体最先进水平。要提高CLIP的任务学习和迁移能力,仍需大量工作。虽然扩展规模迄今稳步提高了性能,并指出了持续改进的途径,但我们估计zero-shot CLIP要达到总体最先进性能,大约需要将计算量增加1000倍。以当前的硬件进行这样的训练是不可行的。有必要进一步研究如何提高CLIP的计算效率和数据效率。
Analysis in Section 3.1 found that CLIP’s zero-shot performance is still quite weak on several kinds of tasks. When compared to task-specific models, the performance of CLIP is poor on several types of fine-grained classification such as differentiating models of cars, species of flowers, and variants of aircraft. CLIP also struggles with more abstract and systematic tasks such as counting the number of objects in an image. Finally for novel tasks which are unlikely to be included in CLIP’s pre-training dataset, such as classifying the distance to the nearest car in a photo, CLIP’s performance can be near random. We are confident that there are still many, many tasks where CLIP’s zero-shot performance is near chance level.第3.1节的分析发现,CLIP的零样本性能在若干类任务上仍然相当弱。与特定任务模型相比,CLIP在若干类型的细粒度分类上表现较差,例如区分汽车型号、花卉种类和飞机型号。CLIP在更抽象、更系统的任务上也表现吃力,例如计数图像中物体的数量。最后,对于不太可能包含在CLIP预训练数据集中的新任务,例如对照片中最近汽车的距离进行分类,CLIP的性能可能接近随机水平。我们确信,还有很多很多任务上CLIP的零样本性能接近随机水平。
While zero-shot CLIP generalizes well to many natural image distributions as investigated in Section 3.3, we’ve observed that zero-shot CLIP still generalizes poorly to data that is truly out-of-distribution for it. An illustrative example occurs for the task of OCR as reported in Appendix E. CLIP learns a high quality semantic OCR representation that performs well on digitally rendered text, which is common in its pre-training dataset, as evidenced by performance on Rendered SST2. However, CLIP only achieves 88% accuracy on the handwritten digits of MNIST. An embarrassingly simple baseline of logistic regression on raw pixels outperforms zero-shot CLIP. Both semantic and near-duplicate nearest-neighbor retrieval verify that there are almost no images that resemble MNIST digits in our pre-training dataset. This suggests CLIP does little to address the underlying problem of brittle generalization of deep learning models. Instead CLIP tries to circumvent the problem and hopes that by training on such a large and varied dataset that all data will be effectively in-distribution. This is a naive assumption that, as MNIST demonstrates, is easy to violate.虽然零样本CLIP能很好地泛化到3.3节所研究的许多自然图像分布,但我们观察到,零样本CLIP对真正分布外的数据的泛化仍然很差。附录E中报告的OCR任务提供了一个说明性的例子。CLIP学到了高质量的语义OCR表示,在数字渲染的文本上表现良好,这类文本在其预训练数据集中很常见,Rendered SST2上的性能证明了这一点。然而,CLIP在MNIST手写数字上仅达到88%的准确率。对原始像素做逻辑回归这一极其简单的基线就优于零样本CLIP。语义检索和近重复最近邻检索都证实,我们的预训练数据集中几乎没有与MNIST数字相似的图像。这表明CLIP在解决深度学习模型泛化脆弱这一根本问题上作用甚微。CLIP只是试图绕过这个问题,寄希望于在如此庞大而多样的数据集上训练后,所有数据都能有效地处于分布内。正如MNIST所表明的,这是一个很容易被打破的朴素假设。
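The "embarrassingly simple" raw-pixel baseline mentioned above amounts to a few lines of scikit-learn; the sketch below uses scikit-learn's small 8x8 digits set as a stand-in for MNIST so that it runs without any download.上面提到的原始像素基线只需几行scikit-learn代码;为避免下载,下面的示意用scikit-learn自带的8x8 digits数据集代替MNIST。

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Logistic regression on raw pixel values. The small 8x8 digits set is used
# here as a stand-in for MNIST to keep the sketch self-contained.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X / 16.0, y, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("raw-pixel logistic regression accuracy:", clf.score(X_test, y_test))
```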
Although CLIP can flexibly generate zero-shot classifiers for a wide variety of tasks and datasets, CLIP is still limited to choosing from only those concepts in a given zero-shot classifier. This is a significant restriction compared to a truly flexible approach like image captioning which could generate novel outputs. Unfortunately, as described in Section 2.3 we found the computational efficiency of the image caption baseline we tried to be much lower than CLIP. A simple idea worth trying is joint training of a contrastive and generative objective with the hope of combining the efficiency of CLIP with the flexibility of a caption model. As another alternative, search could be performed at inference time over many natural language explanations of a given image, similar to approach proposed in Learning with Latent Language Andreas et al. (2017).尽管CLIP可以灵活地为各种各样的任务和数据集生成零样本分类器,但CLIP仍然只能从给定零样本分类器内的那些概念中进行选择。与图像描述(image captioning)这类能够生成全新输出的真正灵活的方法相比,这是一个显著的限制。不幸的是,如2.3节所述,我们发现所尝试的图像描述基线的计算效率远低于CLIP。一个值得尝试的简单想法是联合训练对比目标和生成目标,希望将CLIP的效率与描述模型的灵活性结合起来。另一种替代方案是,在推理时对给定图像的许多自然语言解释进行搜索,类似于Andreas等人(2017)在Learning with Latent Language中提出的方法。
CLIP also does not address the poor data efficiency of deep learning. Instead CLIP compensates by using a source of supervision that can be scaled to hundreds of millions of training examples. If every image seen during training of a CLIP model was presented at a rate of one per second, it would take 405 years to iterate through the 12.8 billion images seen over 32 training epochs. Combining CLIP with self-supervision (Henaff, 2020; Chen et al., 2020c) and self-training (Lee; Xie et al., 2020) methods is a promising direction given their demonstrated ability to improve data efficiency over standard supervised learning.CLIP也没有解决深度学习数据效率低的问题。相反,CLIP通过使用可以扩展到数亿个训练示例的监督来源来补偿。如果CLIP模型训练期间看到的每一张图像都以每秒一张的速度呈现,那么需要405年才能遍历32个训练周期中看到的128亿张图像。将CLIP与自监督(Henaff,2020; Chen等人,2020c)和自训练(Lee; Xie等人,2020)方法相结合是一个很有前途的方向,因为它们已被证明能够比标准监督学习提高数据效率。
Our methodology has several significant limitations. Despite our focus on zero-shot transfer, we repeatedly queried performance on full validation sets to guide the development of CLIP. These validation sets often have thousands of examples, which is unrealistic for true zero-shot scenarios. Similar concerns have been raised in the field of semi-supervised learning (Oliver et al., 2018). Another potential issue is our selection of evaluation datasets. While we have reported results on Kornblith et al. (2019)’s 12 dataset evaluation suite as a standardized collection, our main results use a somewhat haphazardly assembled collection of 27 datasets that is undeniably co-adapted with the development and capabilities of CLIP. Creating a new benchmark of tasks designed explicitly to evaluate broad zero-shot transfer capabilities, rather than re-using existing supervised datasets, would help address these issues.我们的方法有几个明显的局限性。尽管我们的重点是零触发传输,但我们反复询问了完整验证集的性能,以指导CLIP的开发。这些验证集通常有数千个示例,这对于真正的零触发场景是不现实的。在半监督学习领域中也提出了类似的问题(奥利弗等人,(2018年版)。另一个潜在问题是我们对评估数据集的选择。虽然我们已经报告了Kornblith等人(2019)的12个数据集评价套件作为标准化集合的结果,但我们的主要结果使用了27个数据集的有点随意的集合,不可否认,该集合与CLIP的开发和功能相适应。创建一个新的任务基准,明确设计用于评估广泛的零触发传输能力,而不是重复使用现有的监督数据集,将有助于解决这些问题。
CLIP is trained on text paired with images on the internet. These image-text pairs are unfiltered and uncurated and result in CLIP models learning many social biases. This has been previously demonstrated for image caption models (Bhargava & Forsyth, 2019). We refer readers to Section 7 for detailed analysis and quantification of these behaviors for CLIP as well as discussion of potential mitigation strategies.CLIP是在互联网上与图像配对的文本上训练的。这些图像-文本对未经过滤和整理,导致CLIP模型学到了许多社会偏见。这一点此前已在图像描述模型上得到证明(Bhargava & Forsyth,2019)。关于CLIP这些行为的详细分析和量化以及潜在缓解策略的讨论,请读者参阅第7节。
While we have emphasized throughout this work that specifying image classifiers through natural language is a flexible and general interface, it has its own limitations. Many complex tasks and visual concepts can be difficult to specify just through text. Actual training examples are undeniably useful but CLIP does not optimize for few-shot performance directly. In our work, we fall back to fitting linear classifiers on top of CLIP’s features. This results in a counter-intuitive drop in performance when transitioning from a zero-shot to a few-shot setting. As discussed in Section 4, this is notably different from human performance which shows a large increase from a zero to a one shot setting. Future work is needed to develop methods that combine CLIP’s strong zero-shot performance with efficient few-shot learning.虽然我们在整个工作中强调,通过自然语言指定图像分类器是一种灵活且通用的接口,但它也有自身的局限性。许多复杂的任务和视觉概念仅通过文本很难指定。不可否认,实际的训练示例很有用,但CLIP并不直接针对少样本性能进行优化。在我们的工作中,我们退而在CLIP的特征之上拟合线性分类器。这导致从零样本过渡到少样本设置时出现与直觉相反的性能下降。如第4节所讨论的,这与人类的表现明显不同:人类从零样本到单样本设置表现出大幅提升。未来需要开发将CLIP强大的零样本性能与高效的少样本学习相结合的方法。

7. Broader Impacts 更广泛的影响

CLIP has a wide range of capabilities due to its ability to carry out arbitrary image classification tasks. One can give it images of cats and dogs and ask it to classify cats, or give it images taken in a department store and ask it to classify shoplifters–a task with significant social implications and for which AI may be unfit. Like any image classification system, CLIP’s performance and fitness for purpose need to be evaluated, and its broader impacts analyzed in context. CLIP also introduces a capability that will magnify and alter such issues: CLIP makes it possible to easily create your own classes for categorization (to ‘roll your own classifier’) without a need for re-training. This capability introduces challenges similar to those found in characterizing other, large-scale generative models like GPT-3 (Brown et al., 2020); models that exhibit non-trivial zero-shot (or fewshot) generalization can have a vast range of capabilities, many of which are made clear only after testing for them.CLIP具有广泛的功能,因为它能够执行任意图像分类任务。人们可以给它猫和狗的图像,并要求它对猫进行分类,或者给它在百货公司拍摄的图像,并要求它对商店扒手进行分类-这是一项具有重大社会意义的任务,AI可能不适合。像任何图像分类系统一样,CLIP的性能和适用性需要进行评估,并在上下文中分析其更广泛的影响。CLIP还引入了一种放大和改变这些问题的能力:CLIP使您可以轻松地创建自己的分类类(“滚动自己的分类器”),而无需重新训练。这种能力引入了类似于在表征其他大规模生成模型(如GPT-3)中发现的挑战(Brown等人,2020年);表现出非平凡的零次(或少次)泛化的模型可以具有广泛的能力,其中许多能力只有在测试之后才变得清楚。
Our studies of CLIP in a zero-shot setting show that the model displays significant promise for widely-applicable tasks like image retrieval or search. For example, it can find relevant images in a database given text, or relevant text given an image. Further, the relative ease of steering CLIP toward bespoke applications with little or no additional data or training could unlock a variety of novel applications that are hard for us to envision today, as has occurred with large language models over the past few years.我们在零拍摄设置中对CLIP的研究表明,该模型在图像检索或搜索等广泛适用的任务中显示出显着的前景。例如,它可以在给定文本的数据库中找到相关图像,或者在给定图像的情况下找到相关文本。此外,将CLIP转向定制应用程序的相对容易性,很少或没有额外的数据或训练,可以解锁各种我们今天难以想象的新应用程序,就像过去几年大型语言模型一样。
In addition to the more than 30 datasets studied in earlier sections of this paper, we evaluate CLIP’s performance on the FairFace benchmark and undertake exploratory bias probes. We then characterize the model’s performance in a downstream task, surveillance, and discuss its usefulness as compared with other available systems. Many of CLIP’s capabilities are omni-use in nature (e.g. OCR can be used to make scanned documents searchable, to power screen reading technologies, or to read license plates). Several of the capabilities measured, from action recognition, object classification, and geo-localization, to facial emotion recognition, can be used in surveillance. Given its social implications, we address this domain of use specifically in the Surveillance section.除了本文前面部分研究的30多个数据集外,我们还评估了CLIP在FairFace基准上的性能,并进行了探索性的偏见探测。然后,我们刻画了该模型在监控这一下游任务中的性能,并与其他可用系统相比讨论其实用性。CLIP的许多能力本质上是多用途的(例如,OCR可用于使扫描文档可搜索、支持屏幕阅读技术或读取车牌)。从动作识别、物体分类、地理定位到面部情绪识别,所测量的若干能力都可以用于监控。鉴于其社会影响,我们在“监控”部分专门讨论这一使用领域。
We have also sought to characterize the social biases inherent to the model. Our bias tests represent our initial efforts to probe aspects of how the model responds in different scenarios, and are by nature limited in scope. CLIP and models like it will need to be analyzed in relation to their specific deployments to understand how bias manifests and identify potential interventions. Further community exploration will be required to develop broader, more contextual, and more robust testing schemes so that AI developers can better characterize biases in general purpose computer vision models.我们还试图描述该模型所固有的社会偏见。我们的偏差测试代表了我们探索模型在不同场景下如何响应的初步努力,并且本质上范围有限。CLIP和类似的模型需要根据其具体部署进行分析,以了解偏见如何表现并确定潜在的干预措施。需要进一步的社区探索,以开发更广泛,更上下文,更强大的测试方案,以便AI开发人员可以更好地描述通用计算机视觉模型中的偏见。

7.1. Bias

Algorithmic decisions, training data, and choices about how classes are defined and taxonomized (which we refer to informally as “class design”) can all contribute to and amplify social biases and inequalities resulting from the use of AI systems (Noble, 2018; Bechmann & Bowker, 2019; Bowker & Star, 2000). Class design is particularly relevant to models like CLIP, since any developer can define a class and the model will provide some result.算法决策、训练数据以及关于如何定义类别并对其分类的选择(我们非正式地称之为“类设计”),都可能造成并放大使用AI系统所带来的社会偏见和不平等(Noble,2018; Bechmann & Bowker,2019; Bowker & Star,2000)。类设计与CLIP这样的模型尤其相关,因为任何开发者都可以定义一个类,而模型会给出某种结果。
In this section, we provide preliminary analysis of some of the biases in CLIP, using bias probes inspired by those outlined in Buolamwini & Gebru (2018) and Kärkkäinen & Joo (2019). We also conduct exploratory bias research intended to find specific examples of biases in the model, similar to that conducted by Solaiman et al. (2019).在本节中,我们使用受Buolamwini & Gebru(2018)和Kärkkäinen & Joo(2019)概述的偏差探针启发的方法,对CLIP中的一些偏差进行了初步分析。我们还进行了探索性偏倚研究,旨在找到模型中偏倚的具体例子,类似于Solaiman等人(2019)进行的研究。
We start by analyzing the performance of Zero-Shot CLIP on the face image dataset FairFace (Kärkkäinen & Joo, 2019) as an initial bias probe, then probe the model further to surface additional biases and sources of biases, including class design. We evaluated two versions of CLIP on the FairFace dataset: a zero-shot CLIP model (“ZS CLIP”), and a logistic regression classifier fitted to FairFace’s dataset on top of CLIP’s features (“LR CLIP”). We find that LR CLIP gets higher accuracy on the FairFace dataset than both the ResNext-101 32x48d Instagram model (“Linear Probe Instagram”) (Mahajan et al., 2018) and FairFace’s own model on most of the classification tests we ran. ZS CLIP’s performance varies by category and is worse than that of FairFace’s model for a few categories, and better for others. (See Table 3 and Table 4).首先,我们分析Zero-Shot CLIP在人脸图像数据集FairFace(Kärkkäinen & Joo,2019)上的性能,作为初始偏差探测,然后进一步探测模型,以发现更多偏差及其来源,包括类设计。我们在FairFace数据集上评估了两个版本的CLIP:零样本CLIP模型(“ZS CLIP”),以及在CLIP特征之上针对FairFace数据集拟合的逻辑回归分类器(“LR CLIP”)。我们发现,在我们运行的大多数分类测试中,LR CLIP在FairFace数据集上的准确率高于ResNext-101 32x48d Instagram模型(“Linear Probe Instagram”)(Mahajan等人,2018)和FairFace自己的模型。ZS CLIP的性能因类别而异,在少数类别上不如FairFace的模型,在其他类别上则更好。(见表3和表4)。
表3:FairFace“白人”类别中图像的种族、性别和年龄分类准确率(%)。
表4:FairFace“黑人”、“印度人”、“东亚人”、“东南亚人”、“中东人”和“拉丁美洲人”类别(合并为FairFace的“非白人”类别)中图像的种族、性别和年龄分类准确率(%)。
Additionally, we test the performance of the LR CLIP and ZS CLIP models across intersectional race and gender categories as they are defined in the FairFace dataset. We find that model performance on gender classification is above 95% for all race categories. Table 5 summarizes these results.此外,我们还测试了LR CLIP和ZS CLIP模型在FairFace数据集所定义的种族与性别交叉类别上的性能。我们发现,在所有种族类别上,性别分类的模型性能都在95%以上。表5总结了这些结果。
While LR CLIP achieves higher accuracy than the Linear Probe Instagram model on the FairFace benchmark dataset for gender, race and age classification of images by intersectional categories, accuracy on benchmarks offers only one approximation of algorithmic fairness, as Raji et al. (2020) have shown, and often fails as a meaningful measure of fairness in real world contexts. Even if a model has both higher accuracy and lower disparities in performance on different sub-groups, this does not mean it will have lower disparities in impact (Scheuerman et al., 2019). For example, higher performance on underrepresented groups might be used by a company to justify their use of facial recognition, and to then deploy it ways that affect demographic groups disproportionately. Our use of facial classification benchmarks to probe for biases is not intended to imply that facial classification is an unproblematic task, nor to endorse the use of race, age, or gender classification in deployed contexts.虽然在FairFace基准数据集上按交叉类别对图像进行性别、种族和年龄分类时,LR CLIP的准确率高于Linear Probe Instagram模型,但正如Raji等人(2020)所表明的,基准上的准确率只是对算法公平性的一种近似,在真实世界情境中往往无法成为有意义的公平性衡量标准。即使一个模型在不同子群体上既有更高的准确率又有更小的性能差异,也不意味着它的影响差异更小(Scheuerman等人,2019)。例如,公司可能以在代表性不足群体上的更高性能为由来为使用人脸识别辩护,进而以不成比例地影响某些人口群体的方式部署它。我们使用人脸分类基准来探测偏见,并不意味着人脸分类是一项没有问题的任务,也不支持在实际部署环境中使用种族、年龄或性别分类。
We also probed the model using classification terms with high potential to cause representational harm, focusing on denigration harms in particular (Crawford, 2017). We carried out an experiment in which the ZS CLIP model was required to classify 10,000 images from the FairFace dataset. In addition to the FairFace classes, we added in the following classes: ‘animal’, ‘gorilla’, ‘chimpanzee’, ‘orangutan’, ‘thief’, ‘criminal’ and ‘suspicious person’. The goal of this experiment was to check if harms of denigration disproportionately impact certain demographic subgroups.我们还使用高可能造成代表性伤害的分类术语探索了该模型,特别关注诋毁伤害(Crawford,2017)。我们进行了一个实验,要求ZS CLIP模型对FairFace数据集中的10,000张图像进行分类。除了FairFace类之外,我们还添加了以下类:“动物”,“大猩猩”,“黑猩猩”,“猩猩”,“小偷”,“罪犯”和“可疑人员”。这项实验的目的是检查诋毁的危害是否不成比例地影响某些人口亚组。
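Once each FairFace image has a predicted label from the zero-shot classifier, per-subgroup rates of the kind reported below can be tabulated with a simple pandas groupby; the rows in this sketch are invented for illustration.一旦零样本分类器为每张FairFace图像给出预测标签,下文报告的各子群体比例就可以用pandas的groupby简单统计;此示意中的数据行为虚构。

```python
import pandas as pd

# Invented example rows: in the real probe each row would hold the FairFace
# metadata for one image plus the label predicted by the zero-shot classifier.
df = pd.DataFrame({
    "race": ["White", "Black", "Black", "East Asian", "White", "Indian"],
    "age_group": ["20-29", "0-2", "30-39", "10-19", "50-59", "20-29"],
    "prediction": ["doctor", "animal", "criminal", "student", "thief", "teacher"],
})

non_human = {"animal", "gorilla", "chimpanzee", "orangutan"}
crime = {"thief", "criminal", "suspicious person"}

df["non_human"] = df["prediction"].isin(non_human)
df["crime_related"] = df["prediction"].isin(crime)

# Misclassification rates broken down by FairFace race category.
print(df.groupby("race")[["non_human", "crime_related"]].mean())
```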
We found that 4.9% (confidence intervals between 4.6% and 5.4%) of the images were misclassified into one of the non-human classes we used in our probes (‘animal’, ‘chimpanzee’, ‘gorilla’, ‘orangutan’). Out of these, ‘Black’ images had the highest misclassification rate (approximately 14%; confidence intervals between [12.6% and 16.4%]) while all other races had misclassification rates under 8%. People aged 0-20 years had the highest proportion being classified into this category at 14% .我们发现,4.9%(置信区间在4.6%和5.4%之间)的图像被错误地分类为我们在探测中使用的非人类类别之一(“动物”,“黑猩猩”,“大猩猩”,“猩猩”)。其中,“黑人”图像的错误分类率最高(约14%;置信区间在[12.6%和16.4%]之间),而所有其他种族的错误分类率均低于8%。0-20岁的人被归入这一类别的比例最高,为14%。
We also found that 16.5% of male images were misclassified into classes related to crime (‘thief’, ‘suspicious person’ and ‘criminal’) as compared to 9.8% of female images. Interestingly, we found that people aged 0-20 years old were more likely to fall under these crime-related classes (approximately 18%) compared to images of people in different age ranges (approximately 12% for people aged 20-60 and 0% for people over 70). We found significant disparities in classifications across races for crime related terms, which is captured in Table 6.我们还发现,16.5%的男性图像被错误地归类为与犯罪有关的类别(“小偷”,“可疑人员”和“罪犯”),而女性图像的比例为9.8%。有趣的是,我们发现0-20岁的人更有可能属于这些犯罪相关类别(约18%),而不同年龄段的人的图像(20-60岁的人约为12%,70岁以上的人为0%)。我们发现不同种族在犯罪相关术语的分类上存在显著差异,见表6。
[Table 6: Percent of images classified into crime-related and non-human classes by FairFace race category. The label set included the 7 FairFace race categories, one each for men and women (14 in total), as well as the 3 crime-related classes and the 4 non-human classes.]
Given that we observed that people under 20 were the most likely to be classified in both the crime-related and non-human animal categories, we carried out classification for the images with the same classes but with an additional category ‘child’ added to the categories. Our goal here was to see if this category would significantly change the behaviour of the model and shift how the denigration harms are distributed by age. We found that this drastically reduced the number of images of people under 20 classified in either crime-related categories or non-human animal categories (Table 7). This points to how class design has the potential to be a key factor determining both the model performance and the unwanted biases or behaviour the model may exhibit, while also raising overarching questions about the use of face images to automatically classify people along such lines (y Arcas et al., 2017).
The results of these probes can change based on the class categories one chooses to include as well as the specific language one uses to describe each class. Poor class design can lead to poor real-world performance; this concern is particularly relevant to a model like CLIP, given how easily developers can design their own classes.
We also carried out experiments similar to those outlined by Schwemmer et al. (2020) to test how CLIP treated images of men and women differently, using images of Members of Congress. As part of these experiments, we studied how certain additional design decisions such as deciding thresholds for labels can impact the labels output by CLIP and how biases manifest.
We carried out three experiments: we tested for accuracy on gender classification, and we tested how labels were differentially distributed across two different label sets. For our first label set, we used a label set of 300 occupations, and for our second label set we used the combined set of labels that Google Cloud Vision, Amazon Rekognition and Microsoft Azure Computer Vision returned for all the images.
We first simply looked into the gender prediction performance of the model on the images of Members of Congress, in order to check whether the model correctly recognized men as men and women as women given the image of a person who appeared to be in an official setting/position of power. We found that the model got 100% accuracy on the images. This is slightly better performance than the model’s performance on the FairFace dataset. We hypothesize that one of the reasons for this is that all the images in the Members of Congress dataset were high-quality and clear, with the people clearly centered, unlike those in the FairFace dataset.
In order to study how the biases in returned labels depend on the thresholds set for label probability, we did an experiment in which we set threshold values at 0.5% and 4.0%. We found that the lower threshold led to lower quality of labels. However, even the differing distributions of labels under this threshold can hold signals for bias. For example, we find that under the 0.5% threshold, labels such as ‘nanny’ and ‘housekeeper’ start appearing for women whereas labels such as ‘prisoner’ and ‘mobster’ start appearing for men. This points to gendered associations similar to those that have previously been found for occupations (Schwemmer et al., 2020; Nosek et al., 2002; Bolukbasi et al., 2016).
At the higher 4% threshold, the labels with the highest probability across both genders include ‘lawmaker’, ‘legislator’ and ‘congressman’. However, the presence of these biases amongst lower probability labels nonetheless points to larger questions about what ‘sufficiently’ safe behaviour may look like for deploying such systems.
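The following sketch illustrates the thresholding mechanism: every label whose probability clears the threshold is returned for an image, and the resulting label histograms can then be compared across groups at 0.5% versus 4.0%. It assumes L2-normalized CLIP image and text embeddings as in the earlier sketch; the label list and group split are placeholders.

```python
# A sketch of the label-threshold comparison. `image_features` and `text_features`
# are assumed to be L2-normalized CLIP embeddings (shapes [n_images, d] and
# [n_labels, d]); the label list itself is an illustrative placeholder.
from collections import Counter

import torch

def labels_above_threshold(image_features: torch.Tensor,
                           text_features: torch.Tensor,
                           labels: list,
                           threshold: float) -> list:
    """For each image, keep every label whose softmax probability >= threshold."""
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    kept = probs >= threshold
    return [[labels[j] for j in row.nonzero(as_tuple=True)[0].tolist()]
            for row in kept]

def label_histogram(per_image_labels: list) -> Counter:
    """Count how often each label is returned across a group of images."""
    counts = Counter()
    for labs in per_image_labels:
        counts.update(labs)
    return counts

# Comparing label_histogram(...) for images of women vs. men at threshold=0.005
# and threshold=0.04 surfaces which labels are differentially attached.
```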
When given the combined set of labels that Google Cloud Vision (GCV), Amazon Rekognition and Microsoft returned for all the images, similar to the biases Schwemmer et al. (2020) found in GCV systems, we found our system also disproportionately attached labels to do with hair and appearance in general to women more than men. For example, labels such as ‘brown hair’, ‘blonde’ and ‘blond’ appeared significantly more often for women. Additionally, CLIP attached some labels that described high status occupations disproportionately more often to men, such as ‘executive’ and ‘doctor’. Of the only four occupations that it attached more often to women, three were ‘newscaster’, ‘television presenter’ and ‘newsreader’, and the fourth was ‘Judge’. This is again similar to the biases found in GCV and points to historical gendered differences (Schwemmer et al., 2020).
Interestingly, when we lowered the threshold to 0.5% for this set of labels, we found that the labels disproportionately describing men also shifted to appearance-oriented words such as ‘suit’, ‘tie’ and ‘necktie’ (Figure 18). Many occupation-oriented words such as ‘military person’ and ‘executive’, which were not used to describe images of women at the higher 4% threshold, were used for both men and women at the lower 0.5% threshold, which could have caused the change in labels for men. The reverse was not true. Descriptive words used to describe women were still uncommon amongst men.
[Figure 18]

Design decisions at every stage of building a model impact how biases manifest, and this is especially true for CLIP given the flexibility it offers. In addition to choices about training data and model architecture, decisions about things like class designs and thresholding values can alter the labels a model outputs and as a result heighten or lower certain kinds of harm, such as those described by Crawford (2017). People designing and developing models and AI systems have considerable power. Decisions about things like class design are a key determiner not only of model performance, but also of how and in what contexts model biases manifest.
These experiments are not comprehensive. They illustrate potential issues stemming from class design and other sources of bias, and are intended to spark inquiry.

7.2. Surveillance

We next sought to characterize model performance in relation to a downstream task for which there is significant societal sensitivity: surveillance. Our analysis aims to better embody the characterization approach described above, to help orient the research community towards the potential future impacts of increasingly general purpose computer vision models, and to aid the development of norms and checks around such systems. Our inclusion of surveillance is not intended to indicate enthusiasm for this domain; rather, we think surveillance is an important domain to try to make predictions about given its societal implications (Zuboff, 2015; Browne, 2015).
We measure the model’s performance on classification of images from CCTV cameras and on zero-shot celebrity identification. We first tested model performance on low-resolution images captured from surveillance cameras (e.g. CCTV cameras). We used the VIRAT dataset (Oh et al., 2011) and data captured by Varadarajan & Odobez (2009), both of which consist of real world outdoor scenes with non-actors.
Given CLIP’s flexible class construction, we tested 515 surveillance images captured from 12 different video sequences on self-constructed general classes for coarse and fine-grained classification. Coarse classification required the model to correctly identify the main subject of the image (i.e. determine if the image was a picture of an empty parking lot, school campus, etc.). For fine-grained classification, the model had to choose between two options constructed to determine if the model could identify the presence/absence of smaller features in the image, such as a person standing in the corner.
For coarse classification, we constructed the classes by hand-captioning the images ourselves to describe the contents of the image, and there were always at least 6 options for the model to choose from. Additionally, we carried out a ‘stress test’ where the class set included at least one more caption for something that was ‘close’ to the image (for example, ‘parking lot with white car’ vs. ‘parking lot with red car’). We found that the model had a top-1 accuracy of 91.8% on the CCTV images for the initial evaluation. The accuracy dropped significantly to 51.1% for the second evaluation, with the model incorrectly choosing the ‘close’ answer 40.7% of the time.
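A sketch of this coarse-classification setup is given below: each image is scored against its own hand-written caption options, and the ‘stress test’ simply appends one near-miss caption to that option list. The file name and captions are illustrative placeholders, not the actual VIRAT annotations.

```python
# A sketch of coarse CCTV-scene classification with hand-written caption options,
# assuming the open-source CLIP package. Captions and the image path are
# illustrative placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def pick_caption(image_path: str, captions: list) -> str:
    """Return the caption CLIP scores highest for this image."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize(captions).to(device)
    with torch.no_grad():
        logits_per_image, _ = model(image, text)  # shape: [1, len(captions)]
    return captions[logits_per_image.argmax(dim=-1).item()]

options = [
    "an empty parking lot", "a school campus", "a street at night",
    "an office lobby", "a warehouse interior", "a parking lot with a white car",
]
# The 'stress test' adds a caption that is deliberately close to the correct one.
stress_options = options + ["a parking lot with a red car"]

# prediction = pick_caption("frame_0001.jpg", options)
# stressed_prediction = pick_caption("frame_0001.jpg", stress_options)
```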
For fine-grained detection, the zero-shot model performed poorly, with results near random. Note that this experiment was targeted only towards detecting the presence or absence of small objects in image sequences.
We also tested CLIP’s zero-shot performance for ‘in the wild’ identity detection using the CelebA dataset. We did this to evaluate the model’s performance for identity detection using just the publicly available data it was pre-trained on. While we tested this on a dataset of celebrities who have a larger number of images on the internet, we hypothesize that the number of images in the pre-training data needed for the model to associate faces with names will keep decreasing as models get more powerful (see Table 8), which has significant societal implications (Garvie, 2019). This mirrors recent developments in natural language processing, in which recent large language models trained on Internet data often exhibit a surprising ability to provide information related to relatively minor public figures (Brown et al., 2020).
[Table 8: Zero-shot top-1 identity recognition accuracy]
We found that the model had 59.2% top-1 accuracy out of 100 possible classes for ‘in the wild’ 8k celebrity images. However, this performance dropped to 43.3% when we increased our class sizes to 1k celebrity names. This performance is not competitive when compared to production level models such as Google’s Celebrity Recognition (Google). However, what makes these results noteworthy is that this analysis was done using only zero-shot identification capabilities based on names inferred from pre-training data; we didn’t use any additional task-specific dataset, and so the (relatively) strong results further indicate that before deploying multimodal models, people will need to carefully study them for behaviors in a given context and domain.
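As a sketch of how such an identity probe scales with the size of the candidate set, the snippet below builds ‘a photo of {name}’ prompts and measures top-1 accuracy for 100 versus 1,000 candidate names. The name list, image paths, and ground-truth pairing are assumed to come from a local CelebA-style index; all of them are illustrative.

```python
# A sketch of zero-shot identity recognition over a growing candidate name list,
# assuming the open-source CLIP package. `samples` is a list of
# (image_path, true_name) pairs from a CelebA-style index (illustrative).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def top1_accuracy(samples, names):
    """Fraction of images whose highest-scoring name prompt matches the label."""
    text = clip.tokenize([f"a photo of {n}" for n in names]).to(device)
    text_features = model.encode_text(text)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    correct = 0
    for image_path, true_name in samples:
        image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
        image_features = model.encode_image(image)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        pred = names[(image_features @ text_features.T).argmax(dim=-1).item()]
        correct += int(pred == true_name)
    return correct / len(samples)

# e.g. compare top1_accuracy(samples, names[:100]) with top1_accuracy(samples, names[:1000])
# to see how accuracy degrades as the candidate set grows.
```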
CLIP offers significant benefit for tasks that have relatively little data, given its zero-shot capabilities. However, large datasets and high performing supervised models exist for many in-demand surveillance tasks such as facial recognition. As a result, CLIP’s comparative appeal for such uses is low. Additionally, CLIP is not designed for common surveillance-relevant tasks like object detection and semantic segmentation. This means it has limited use for certain surveillance tasks when models that are designed with these uses in mind, such as Detectron2 (Wu et al., 2019), are widely available.
However, CLIP does unlock a certain aspect of usability given how it removes the need for training data. Thus, CLIP and similar models could enable bespoke, niche surveillance use cases for which no well-tailored models or datasets exist, and could lower the skill requirements to build such applications. As our experiments show, ZS CLIP displays non-trivial, but not exceptional, performance on a few surveillance-relevant tasks today.

7.3. Future Work

This preliminary analysis is intended to illustrate some of the challenges that general purpose computer vision models pose and to give a glimpse into their biases and impacts. We hope that this work motivates future research on the characterization of the capabilities, shortcomings, and biases of such models, and we are excited to engage with the research community on such questions.
We believe one good step forward is community exploration to further characterize the capabilities of models like CLIP and, crucially, identify application areas where they have promising performance and areas where they may have reduced performance. This process of characterization can help researchers increase the likelihood models are used beneficially by:
• Identifying potentially beneficial downstream uses of models early in the research process, enabling other researchers to think about applications.
• Surfacing tasks with significant sensitivity and a large set of societal stakeholders, which may call for intervention by policymakers.
• Better characterizing biases in models, alerting other researchers to areas of concern and areas for interventions.
• Creating suites of tests to evaluate systems like CLIP on, so we can better characterize model capabilities earlier in the development cycle.
• Identifying potential failure modes and areas for further work.
We plan to contribute to this work, and hope this analysis provides some motivating examples for subsequent research.

8. Related Work

Any model that leverages written, spoken, signed or any other form of human language as part of its training signal is arguably using natural language as a source of supervision. This is an admittedly extremely broad area and covers most work in the field of distributional semantics, including topic models (Blei et al., 2003), word, sentence, and paragraph vectors (Mikolov et al., 2013; Kiros et al., 2015; Le & Mikolov, 2014), and language models (Bengio et al., 2003). It also includes much of the broader field of NLP that deals with predicting or modeling sequences of natural language in some way. Work in NLP intentionally leveraging natural language supervision in the form of explanations, feedback, instructions, and advice for tasks such as classification (as opposed to the commonly used representation of supervision as a set of arbitrarily encoded discrete category labels) has been explored in many creative and advanced ways. Dialog based learning (Weston, 2016; Li et al., 2016; Hancock et al., 2019) develops techniques to learn from interactive natural language feedback in dialog. Several papers have leveraged semantic parsing to convert natural language explanations into features (Srivastava et al., 2017) or additional training labels (Hancock et al., 2018). More recently, ExpBERT (Murty et al., 2020) uses feature representations produced by conditioning a deep contextual language model on natural language explanations and descriptions of relations to improve performance on the task of relation extraction.
CLIP is an example of using natural language as a training signal for learning about a domain other than language. In this context, the earliest use of the term natural language supervision that we are aware of is the work of Ramanathan et al. (2013), which showed that natural language descriptions could be used alongside other sources of supervision to improve performance on the task of video event understanding. However, as mentioned in the introduction and approach section, methods of leveraging natural language descriptions in computer vision well predate the use of this specific term, especially for image retrieval (Mori et al., 1999) and object classification (Wang et al., 2009). Other early work leveraged tags (but not natural language) associated with images for the task of semantic segmentation (Barnard et al., 2003). More recently, He & Peng (2017) and Liang et al. (2020) demonstrated using natural language descriptions and explanations to improve fine-grained visual classification of birds. Others have investigated how grounded language can be used to improve visual representations and classifiers on the ShapeWorld dataset (Kuhnle & Copestake, 2017; Andreas et al., 2017; Mu et al., 2019). Finally, techniques which combine natural language with reinforcement learning environments (Narasimhan et al., 2015) have demonstrated exciting emergent behaviors such as systematically accomplishing zero-shot tasks (Hill et al., 2019).
CLIP’s pre-training task optimizes for text-image retrieval. This area of research dates back to the mid-90s, with the previously mentioned Mori et al. (1999) as representative of early work. While initial efforts focused primarily on predictive objectives, over time research shifted towards learning joint multi-modal embedding spaces with techniques like kernel Canonical Correlation Analysis and various ranking objectives (Weston et al., 2010; Socher & Fei-Fei, 2010; Hodosh et al., 2013). Over time, work explored many combinations of training objective, transfer, and more expressive models and steadily improved performance (Frome et al., 2013; Socher et al., 2014; Karpathy et al., 2014; Kiros et al., 2014; Faghri et al., 2017).
Other work has leveraged natural language supervision for domains other than images. Stroud et al. (2020) explores large scale representation learning by training a system to pair descriptive text with videos instead of images. Several works have explored using dense spoken natural language supervision for videos (Miech et al., 2019; 2020b). When considered together with CLIP, these works suggest that large scale natural language supervision is a promising way to learn high quality perceptual systems for many domains. Alayrac et al. (2020) extended this line of work to an additional modality by adding raw audio as an additional supervision source and demonstrated benefits from combining all three sources of supervision.
As part of our work on CLIP we also construct a new dataset of image-text pairs. Modern work on image-text retrieval has relied on a set of crowd-sourced sentence level image caption evaluation datasets like Pascal1K (Rashtchian et al., 2010), Flickr8K (Hodosh et al., 2013), and Flickr30K (Young et al., 2014). However, these datasets are still relatively small and limit achievable performance. Several methods have been proposed to create larger datasets automatically, with Ordonez et al. (2011) as a notable early example. In the deep learning era, Mithun et al. (2018) demonstrated that an additional set of (image, text) pairs collected from the internet could improve retrieval performance, and several new automatically constructed datasets such as Conceptual Captions (Sharma et al., 2018), LAIT (Qi et al., 2020), and OCR-CC (Yang et al., 2020) have been created. However, these datasets still use significantly more aggressive filtering or are designed for a specific task such as OCR and as a result are still much smaller than WIT, with between 1 and 10 million training examples.
A related idea to CLIP is webly supervised learning. This line of work queries image search engines to build image datasets by querying for terms and uses the queries as the labels for the returned images (Fergus et al., 2005). Classifiers trained on these large but noisily labeled datasets can be competitive with those trained on smaller carefully labeled datasets. These image-query pairs are also often used to improve performance on standard datasets as additional training data (Chen & Gupta, 2015). CLIP also uses search queries as part of its dataset creation process. However, CLIP only uses full text sequences co-occurring with images as supervision rather than just the queries, which are often only a single word or short n-gram. We also restrict this step in CLIP to text-only querying for sub-string matches, while most webly supervised work uses standard image search engines which have their own complex retrieval and filtering pipelines that often involve computer vision systems. Of this line of work, Learning Everything about Anything: Webly-Supervised Visual Concept Learning (Divvala et al., 2014) has a notably similar ambition and goal as CLIP.
Finally, CLIP is related to a recent burst of activity on learning joint models of vision and language (Lu et al., 2019; Tan & Bansal, 2019; Chen et al., 2019; Li et al., 2020b; Yu et al., 2020). This line of work focuses on richly connecting vision and language in order to solve complex downstream tasks such as visual question answering, visual commonsense reasoning, or multimodal entailment. These approaches leverage impressively engineered models which combine 3 (or more) pre-trained subsystems, typically an image feature model, a region proposal / object detection model, and a pre-trained masked language model such as BERT. These systems are then jointly fine-tuned via various training objectives on image-text pairs and applied to the aforementioned tasks, achieving impressive results. CLIP is instead focused on learning visual models from scratch via natural language supervision and does not densely connect the two domains with a joint attention model. The only interaction in a CLIP model between the image and text domain is a single dot product in a learned joint embedding space. We are excited to see CLIP hybridized with this line of work.

9. Conclusion

We have investigated whether it is possible to transfer the success of task-agnostic web-scale pre-training in NLP to another domain. We find that adopting this formula results in similar behaviors emerging in the field of computer vision, and we discuss the social implications of this line of research. In order to optimize their training objective, CLIP models learn to perform a wide variety of tasks during pre-training. This task learning can then be leveraged via natural language prompting to enable zero-shot transfer to many existing datasets. At sufficient scale, the performance of this approach can be competitive with task-specific supervised models, although there is still room for much improvement.
