[Translation] AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

Paper: https://arxiv.org/pdf/2010.11929.pdf
Code: https://github.com/google-research/vision_transformer

ABSTRACT

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

1 INTRODUCTION

Self-attention-based architectures, in particular Transformers (Vaswani et al., 2017), have become the model of choice in natural language processing (NLP). The dominant approach is to pre-train on a large text corpus and then fine-tune on a smaller task-specific dataset (Devlin et al., 2019). Thanks to Transformers’ computational efficiency and scalability, it has become possible to train models of unprecedented size, with over 100B parameters (Brown et al., 2020; Lepikhin et al., 2020). With the models and datasets growing, there is still no sign of saturating performance.
  In computer vision, however, convolutional architectures remain dominant (LeCun et al., 1989; Krizhevsky et al., 2012; He et al., 2016). Inspired by NLP successes, multiple works try combining CNN-like architectures with self-attention (Wang et al., 2018; Carion et al., 2020), some replacing the convolutions entirely (Ramachandran et al., 2019; Wang et al., 2020a). The latter models, while theoretically efficient, have not yet been scaled effectively on modern hardware accelerators due to the use of specialized attention patterns. Therefore, in large-scale image recognition, classic ResNet-like architectures are still state of the art (Mahajan et al., 2018; Xie et al., 2020; Kolesnikov et al., 2020).
  Inspired by the Transformer scaling successes in NLP, we experiment with applying a standard Transformer directly to images, with the fewest possible modifications. To do so, we split an image into patches and provide the sequence of linear embeddings of these patches as an input to a Transformer. Image patches are treated the same way as tokens (words) in an NLP application. We train the model on image classification in supervised fashion.
  When trained on mid-sized datasets such as ImageNet without strong regularization, these models yield modest accuracies of a few percentage points below ResNets of comparable size. This seemingly discouraging outcome may be expected: Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data.
  However, the picture changes if the models are trained on larger datasets (14M-300M images). We find that large scale training trumps inductive bias. Our Vision Transformer (ViT) attains excellent results when pre-trained at sufficient scale and transferred to tasks with fewer datapoints. When pre-trained on the public ImageNet-21k dataset or the in-house JFT-300M dataset, ViT approaches or beats state of the art on multiple image recognition benchmarks. In particular, the best model reaches the accuracy of 88.55% on ImageNet, 90.72% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.63% on the VTAB suite of 19 tasks.

2 RELATED WORK

Transformers were proposed by Vaswani et al. (2017) for machine translation, and have since become the state of the art method in many NLP tasks. Large Transformer-based models are often pre-trained on large corpora and then fine-tuned for the task at hand: BERT (Devlin et al., 2019) uses a denoising self-supervised pre-training task, while the GPT line of work uses language modeling as its pre-training task (Radford et al., 2018; 2019; Brown et al., 2020).
  Naive application of self-attention to images would require that each pixel attends to every other pixel. With quadratic cost in the number of pixels, this does not scale to realistic input sizes. Thus, to apply Transformers in the context of image processing, several approximations have been tried in the past. Parmar et al. (2018) applied the self-attention only in local neighborhoods for each query pixel instead of globally. Such local multi-head dot-product self attention blocks can completely replace convolutions (Hu et al., 2019; Ramachandran et al., 2019; Zhao et al., 2020). In a different line of work, Sparse Transformers (Child et al., 2019) employ scalable approximations to global self-attention in order to be applicable to images. An alternative way to scale attention is to apply it in blocks of varying sizes (Weissenborn et al., 2019), in the extreme case only along individual axes (Ho et al., 2019; Wang et al., 2020a). Many of these specialized attention architectures demonstrate promising results on computer vision tasks, but require complex engineering to be implemented efficiently on hardware accelerators.
  Most related to ours is the model of Cordonnier et al. (2020), which extracts patches of size 2 × 2 from the input image and applies full self-attention on top. This model is very similar to ViT, but our work goes further to demonstrate that large scale pre-training makes vanilla transformers competitive with (or even better than) state-of-the-art CNNs. Moreover, Cordonnier et al. (2020) use a small patch size of 2 × 2 pixels, which makes the model applicable only to small-resolution images, while we handle medium-resolution images as well.
  There has also been a lot of interest in combining convolutional neural networks (CNNs) with forms of self-attention, e.g. by augmenting feature maps for image classification (Bello et al., 2019) or by further processing the output of a CNN using self-attention, e.g. for object detection (Hu et al., 2018; Carion et al., 2020), video processing (Wang et al., 2018; Sun et al., 2019), image classification (Wu et al., 2020), unsupervised object discovery (Locatello et al., 2020), or unified text-vision tasks (Chen et al., 2020c; Lu et al., 2019; Li et al., 2019).
  Another recent related model is image GPT (iGPT) (Chen et al., 2020a), which applies Transformers to image pixels after reducing image resolution and color space. The model is trained in an unsupervised fashion as a generative model, and the resulting representation can then be fine-tuned or probed linearly for classification performance, achieving a maximal accuracy of 72% on ImageNet.
  Our work adds to the increasing collection of papers that explore image recognition at larger scales than the standard ImageNet dataset. The use of additional data sources allows achieving state-of-the-art results on standard benchmarks (Mahajan et al., 2018; Touvron et al., 2019; Xie et al., 2020). Moreover, Sun et al. (2017) study how CNN performance scales with dataset size, and Kolesnikov et al. (2020); Djolonga et al. (2020) perform an empirical exploration of CNN transfer learning from large scale datasets such as ImageNet-21k and JFT-300M. We focus on these two latter datasets as well, but train Transformers instead of ResNet-based models used in prior works.

3 METHOD

In model design we follow the original Transformer (Vaswani et al., 2017) as closely as possible. An advantage of this intentionally simple setup is that scalable NLP Transformer architectures – and their efficient implementations – can be used almost out of the box.

3.1 VISION TRANSFORMER (VIT)

An overview of the model is depicted in Figure 1. The standard Transformer receives as input a 1D sequence of token embeddings. To handle 2D images, we reshape the image $\mathbf{x}\in\mathbb{R}^{H\times W\times C}$ into a sequence of flattened 2D patches $\mathbf{x}_p\in\mathbb{R}^{N\times(P^2\cdot C)}$, where $(H, W)$ is the resolution of the original image, $C$ is the number of channels, $(P, P)$ is the resolution of each image patch, and $N = HW/P^2$ is the resulting number of patches, which also serves as the effective input sequence length for the Transformer. The Transformer uses constant latent vector size $D$ through all of its layers, so we flatten the patches and map to $D$ dimensions with a trainable linear projection (Eq. 1). We refer to the output of this projection as the patch embeddings.
[Figure 1: Model overview. An image is split into fixed-size patches, each patch is linearly embedded, position embeddings are added, and the resulting sequence of vectors is fed to a standard Transformer encoder; an extra learnable "classification token" is prepended for classification.]
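To make the patch extraction and linear projection concrete, here is a minimal PyTorch sketch (the helper `patchify_and_embed` and the default sizes are illustrative assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn

def patchify_and_embed(x: torch.Tensor, proj: nn.Linear, patch_size: int = 16) -> torch.Tensor:
    """Flatten non-overlapping P x P patches and map them to D dimensions
    (the patch embeddings of Eq. 1). x: (B, C, H, W); proj: Linear(P*P*C -> D)."""
    B, C, H, W = x.shape
    P = patch_size
    x = x.unfold(2, P, P).unfold(3, P, P)                       # (B, C, H/P, W/P, P, P)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)   # (B, N, P*P*C), N = HW/P^2
    return proj(x)                                              # (B, N, D) patch embeddings

x = torch.randn(2, 3, 224, 224)            # a small batch of 224x224 RGB images
proj = nn.Linear(16 * 16 * 3, 768)         # trainable projection to D = 768 (ViT-Base width)
patches = patchify_and_embed(x, proj)      # (2, 196, 768): N = 224*224 / 16^2 = 196
```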
  Similar to BERT's [class] token, we prepend a learnable embedding to the sequence of embedded patches ($\mathbf{z}_0^0=\mathbf{x}_{class}$), whose state at the output of the Transformer encoder ($\mathbf{z}_L^0$) serves as the image representation $\mathbf{y}$ (Eq. 4). Both during pre-training and fine-tuning, a classification head is attached to $\mathbf{z}_L^0$. The classification head is implemented by an MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time.
  Position embeddings are added to the patch embeddings to retain positional information. We use standard learnable 1D position embeddings, since we have not observed significant performance gains from using more advanced 2D-aware position embeddings (Appendix D.4). The resulting sequence of embedding vectors serves as input to the encoder.
  The Transformer encoder (Vaswani et al., 2017) consists of alternating layers of multiheaded self-attention (MSA, see Appendix A) and MLP blocks (Eq. 2, 3). Layernorm (LN) is applied before every block, and residual connections after every block (Wang et al., 2019; Baevski & Auli, 2019).
  The MLP contains two layers with a GELU non-linearity.
$$\mathbf{z}_0 = [\mathbf{x}_{class};\, \mathbf{x}_p^1\mathbf{E};\, \mathbf{x}_p^2\mathbf{E};\, \cdots;\, \mathbf{x}_p^N\mathbf{E}] + \mathbf{E}_{pos}, \qquad \mathbf{E}\in\mathbb{R}^{(P^2\cdot C)\times D},\ \mathbf{E}_{pos}\in\mathbb{R}^{(N+1)\times D} \tag{1}$$
$$\mathbf{z}'_\ell = \mathrm{MSA}(\mathrm{LN}(\mathbf{z}_{\ell-1})) + \mathbf{z}_{\ell-1}, \qquad \ell = 1,\ldots,L \tag{2}$$
$$\mathbf{z}_\ell = \mathrm{MLP}(\mathrm{LN}(\mathbf{z}'_\ell)) + \mathbf{z}'_\ell, \qquad \ell = 1,\ldots,L \tag{3}$$
$$\mathbf{y} = \mathrm{LN}(\mathbf{z}_L^0) \tag{4}$$
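Putting Eq. 1-4 together, a compact and simplified PyTorch sketch of the encoder could look as follows (`MiniViT`, its default hyper-parameters, and the single linear head are illustrative assumptions rather than the reference implementation):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One encoder layer: pre-LayerNorm multi-head self-attention and an MLP with GELU,
    each followed by a residual connection (Eq. 2 and 3)."""
    def __init__(self, dim, heads, mlp_dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, z):
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]   # Eq. 2
        return z + self.mlp(self.norm2(z))                  # Eq. 3

class MiniViT(nn.Module):
    def __init__(self, num_patches=196, patch_dim=16 * 16 * 3,
                 dim=768, depth=12, heads=12, mlp_dim=3072, num_classes=1000):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)                                 # E in Eq. 1
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                 # x_class
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))   # E_pos
        self.blocks = nn.ModuleList(Block(dim, heads, mlp_dim) for _ in range(depth))
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)                               # classification head

    def forward(self, patches):                      # patches: (B, N, P*P*C)
        z = self.proj(patches)
        cls = self.cls_token.expand(z.shape[0], -1, -1)
        z = torch.cat([cls, z], dim=1) + self.pos_embed                       # Eq. 1
        for blk in self.blocks:
            z = blk(z)
        y = self.norm(z[:, 0])                                                # Eq. 4: y = LN(z_L^0)
        return self.head(y)
```

The single linear head above corresponds to the fine-tuning setup described earlier; at pre-training time the head is an MLP with one hidden layer.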

Inductive bias. We note that Vision Transformer has much less image-specific inductive bias than CNNs. In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance are baked into each layer throughout the whole model. In ViT, only MLP layers are local and translationally equivariant, while the self-attention layers are global. The two-dimensional neighborhood structure is used very sparingly: in the beginning of the model by cutting the image into patches and at fine-tuning time for adjusting the position embeddings for images of different resolution (as described below). Other than that, the position embeddings at initialization time carry no information about the 2D positions of the patches and all spatial relations between the patches have to be learned from scratch.
  Hybrid Architecture. As an alternative to raw image patches, the input sequence can be formed from feature maps of a CNN (LeCun et al., 1989). In this hybrid model, the patch embedding projection E (Eq. 1) is applied to patches extracted from a CNN feature map. As a special case, the patches can have spatial size 1x1, which means that the input sequence is obtained by simply flattening the spatial dimensions of the feature map and projecting to the Transformer dimension. The classification input embedding and position embeddings are added as described above.
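As a minimal sketch of this special case, each spatial position of a CNN feature map can be treated as a 1x1 "patch" (the shapes below are assumptions chosen for illustration, not the paper's exact backbone):

```python
import torch
import torch.nn as nn

feat = torch.randn(8, 1024, 14, 14)        # (B, C_feat, H', W') from an assumed ResNet stage
proj = nn.Linear(1024, 768)                # patch embedding projection E (Eq. 1)
seq = feat.flatten(2).transpose(1, 2)      # (B, H'*W', C_feat): flatten the spatial dimensions
tokens = proj(seq)                         # (B, N, D) input sequence for the Transformer encoder
```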

3.2 FINE-TUNING AND HIGHER RESOLUTION

Typically, we pre-train ViT on large datasets, and fine-tune to (smaller) downstream tasks. For this, we remove the pre-trained prediction head and attach a zero-initialized D × K feedforward layer, where K is the number of downstream classes. It is often beneficial to fine-tune at higher resolution than pre-training (Touvron et al., 2019; Kolesnikov et al., 2020). When feeding images of higher resolution, we keep the patch size the same, which results in a larger effective sequence length. The Vision Transformer can handle arbitrary sequence lengths (up to memory constraints), however, the pre-trained position embeddings may no longer be meaningful. We therefore perform 2D interpolation of the pre-trained position embeddings, according to their location in the original image. Note that this resolution adjustment and patch extraction are the only points at which an inductive bias about the 2D structure of the images is manually injected into the Vision Transformer.
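A simplified sketch of this 2D interpolation is given below (it assumes a square patch grid with the [class] token stored first, and is not the exact procedure of the released code):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, old_grid: int, new_grid: int) -> torch.Tensor:
    """2D-interpolate pre-trained position embeddings for a higher fine-tuning resolution.
    pos_embed: (1, 1 + old_grid**2, D), with the [class] token embedding first."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    D = patch_pos.shape[-1]
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)  # (1, D, g, g)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bilinear", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, D)
    return torch.cat([cls_pos, patch_pos], dim=1)    # (1, 1 + new_grid**2, D)
```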

4 EXPERIMENTS

We evaluate the representation learning capabilities of ResNet, Vision Transformer (ViT), and the hybrid. To understand the data requirements of each model, we pre-train on datasets of varying size and evaluate many benchmark tasks. When considering the computational cost of pre-training the model, ViT performs very favourably, attaining state of the art on most recognition benchmarks at a lower pre-training cost. Lastly, we perform a small experiment using self-supervision, and show that self-supervised ViT holds promise for the future.

4.1 SETUP

  Datasets. To explore model scalability, we use the ILSVRC-2012 ImageNet dataset with 1k classes and 1.3M images (we refer to it as ImageNet in what follows), its superset ImageNet-21k with 21k classes and 14M images (Deng et al., 2009), and JFT (Sun et al., 2017) with 18k classes and 303M high-resolution images. We de-duplicate the pre-training datasets w.r.t. the test sets of the downstream tasks following Kolesnikov et al. (2020). We transfer the models trained on these datasets to several benchmark tasks: ImageNet on the original validation labels and the cleaned-up ReaL labels (Beyer et al., 2020), CIFAR-10/100 (Krizhevsky, 2009), Oxford-IIIT Pets (Parkhi et al., 2012), and Oxford Flowers-102 (Nilsback & Zisserman, 2008). For these datasets, pre-processing follows Kolesnikov et al. (2020).
  We also evaluate on the 19-task VTAB classification suite (Zhai et al., 2019b). VTAB evaluates low-data transfer to diverse tasks, using 1 000 training examples per task. The tasks are divided into three groups: Natural – tasks like the above, Pets, CIFAR, etc. Specialized – medical and satellite imagery, and Structured – tasks that require geometric understanding like localization.
  Model Variants. We base ViT configurations on those used for BERT (Devlin et al., 2019), as summarized in Table 1. The “Base” and “Large” models are directly adopted from BERT and we add the larger “Huge” model. In what follows we use brief notation to indicate the model size and the input patch size: for instance, ViT-L/16 means the “Large” variant with 16 × 16 input patch size. Note that the Transformer’s sequence length is inversely proportional to the square of the patch size, thus models with smaller patch size are computationally more expensive.
[Table 1: Details of Vision Transformer model variants.]
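As a worked example of this sequence-length relation (assuming a 224×224 input):

```python
def num_patches(img_size: int, patch_size: int) -> int:
    return (img_size // patch_size) ** 2   # N = HW / P^2

print(num_patches(224, 16))   # 196 tokens for a /16 model
print(num_patches(224, 32))   # 49 tokens for a /32 model; self-attention cost grows roughly as N^2
```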

For the baseline CNNs, we use ResNet (He et al., 2016), but replace the Batch Normalization layers (Ioffe & Szegedy, 2015) with Group Normalization (Wu & He, 2018), and use standardized convolutions (Qiao et al., 2019). These modifications improve transfer (Kolesnikov et al., 2020), and we denote the modified model “ResNet (BiT)”. For the hybrids, we feed the intermediate feature maps into ViT with patch size of one “pixel”. To experiment with different sequence lengths, we either (i) take the output of stage 4 of a regular ResNet50 or (ii) remove stage 4, place the same number of layers in stage 3 (keeping the total number of layers), and take the output of this extended stage 3. Option (ii) results in a 4x longer sequence length, and a more expensive ViT model.
  Training & Fine-tuning. We train all models, including ResNets, using Adam (Kingma & Ba, 2015) with $\beta_1 = 0.9$, $\beta_2 = 0.999$, a batch size of 4096 and apply a high weight decay of 0.1, which we found to be useful for transfer of all models (Appendix D.1 shows that, in contrast to common practices, Adam works slightly better than SGD for ResNets in our setting). We use a linear learning rate warmup and decay, see Appendix B.1 for details. For fine-tuning we use SGD with momentum, batch size 512, for all models, see Appendix B.1.1. For ImageNet results in Table 2, we fine-tuned at higher resolution: 512 for ViT-L/16 and 518 for ViT-H/14, and also used Polyak & Juditsky (1992) averaging with a factor of 0.9999 (Ramachandran et al., 2019; Wang et al., 2020b).
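For illustration only, the sketch below sets up Adam with the stated betas and weight decay together with a linear warmup/decay schedule; the stand-in model, base learning rate, and step counts are assumptions, not the paper's exact values:

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 1000)                                   # stand-in for a ViT
base_lr, warmup_steps, total_steps = 1e-3, 10_000, 100_000     # illustrative values only

optimizer = torch.optim.Adam(model.parameters(), lr=base_lr,
                             betas=(0.9, 0.999), weight_decay=0.1)

def lr_scale(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)                                        # linear warmup
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))    # linear decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)
```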
  Metrics. We report results on downstream datasets either through few-shot or fine-tuning accuracy. Fine-tuning accuracies capture the performance of each model after fine-tuning it on the respective dataset. Few-shot accuracies are obtained by solving a regularized least-squares regression problem that maps the (frozen) representation of a subset of training images to $\{-1, 1\}^K$ target vectors. This formulation allows us to recover the exact solution in closed form. Though we mainly focus on fine-tuning performance, we sometimes use linear few-shot accuracies for fast on-the-fly evaluation where fine-tuning would be too costly.
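A minimal sketch of this closed-form few-shot evaluation (plain NumPy ridge regression; the regularization strength `lam` is an assumed value):

```python
import numpy as np

def fewshot_linear_accuracy(train_feats, train_labels, test_feats, test_labels,
                            num_classes, lam=1e-3):
    """Regularized least-squares from frozen representations to {-1, +1}^K targets,
    solved exactly in closed form, then evaluated as a classifier."""
    n, d = train_feats.shape
    Y = -np.ones((n, num_classes))
    Y[np.arange(n), train_labels] = 1.0                        # {-1, 1}^K target vectors
    W = np.linalg.solve(train_feats.T @ train_feats + lam * np.eye(d),
                        train_feats.T @ Y)                     # exact closed-form solution
    preds = (test_feats @ W).argmax(axis=1)
    return float((preds == test_labels).mean())
```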

4.2 COMPARISON TO STATE OF THE ART

We first compare our largest models – ViT-H/14 and ViT-L/16 – to state-of-the-art CNNs from the literature. The first comparison point is Big Transfer (BiT) (Kolesnikov et al., 2020), which performs supervised transfer learning with large ResNets. The second is Noisy Student (Xie et al., 2020), which is a large EfficientNet trained using semi-supervised learning on ImageNet and JFT-300M with the labels removed. Currently, Noisy Student is the state of the art on ImageNet and BiT-L on the other datasets reported here. All models were trained on TPUv3 hardware, and we report the number of TPUv3-core-days taken to pre-train each of them, that is, the number of TPU v3 cores (2 per chip) used for training multiplied by the training time in days.
  Table 2 shows the results. The smaller ViT-L/16 model pre-trained on JFT-300M outperforms BiT-L (which is pre-trained on the same dataset) on all tasks, while requiring substantially less computational resources to train. The larger model, ViT-H/14, further improves the performance, especially on the more challenging datasets – ImageNet, CIFAR-100, and the VTAB suite. Interestingly, this model still took substantially less compute to pre-train than prior state of the art. However, we note that pre-training efficiency may be affected not only by the architecture choice, but also other parameters, such as training schedule, optimizer, weight decay, etc. We provide a controlled study of performance vs. compute for different architectures in Section 4.4. Finally, the ViT-L/16 model pre-trained on the public ImageNet-21k dataset performs well on most datasets too, while taking fewer resources to pre-train: it could be trained using a standard cloud TPUv3 with 8 cores in approximately 30 days.
[Table 2: Comparison with state of the art on popular image classification benchmarks.]
  Figure 2 decomposes the VTAB tasks into their respective groups, and compares to previous SOTA methods on this benchmark: BiT, VIVI – a ResNet co-trained on ImageNet and Youtube (Tschannen et al., 2020), and S4L – supervised plus semi-supervised learning on ImageNet (Zhai et al., 2019a). ViT-H/14 outperforms BiT-R152x4, and other methods, on the Natural and Structured tasks. On the Specialized the performance of the top two models is similar.
[Figure 2: Breakdown of VTAB performance in the Natural, Specialized, and Structured task groups.]

4.3 PRE-TRAINING DATA REQUIREMENTS

The Vision Transformer performs well when pre-trained on a large JFT-300M dataset. With fewer inductive biases for vision than ResNets, how crucial is the dataset size? We perform two series of experiments.
  First, we pre-train ViT models on datasets of increasing size: ImageNet, ImageNet-21k, and JFT-300M. To boost the performance on the smaller datasets, we optimize three basic regularization parameters – weight decay, dropout, and label smoothing. Figure 3 shows the results after fine-tuning to ImageNet (results on other datasets are shown in Table 5). When pre-trained on the smallest dataset, ImageNet, ViT-Large models underperform compared to ViT-Base models, despite (moderate) regularization. With ImageNet-21k pre-training, their performances are similar. Only with JFT-300M, do we see the full benefit of larger models. Figure 3 also shows the performance region spanned by BiT models of different sizes. The BiT CNNs outperform ViT on ImageNet, but with the larger datasets, ViT overtakes.
[Figure 3: Transfer to ImageNet after pre-training on datasets of increasing size (ImageNet, ImageNet-21k, JFT-300M); the shaded region spans BiT models of different sizes.]
  Second, we train our models on random subsets of 9M, 30M, and 90M as well as the full JFT-300M dataset. We do not perform additional regularization on the smaller subsets and use the same hyper-parameters for all settings. This way, we assess the intrinsic model properties, and not the effect of regularization. We do, however, use early-stopping, and report the best validation accuracy achieved during training. To save compute, we report few-shot linear accuracy instead of full fine-tuning accuracy. Figure 4 contains the results. Vision Transformers overfit more than ResNets with comparable computational cost on smaller datasets. For example, ViT-B/32 is slightly faster than ResNet50; it performs much worse on the 9M subset, but better on 90M+ subsets. The same is true for ResNet152x2 and ViT-L/16. This result reinforces the intuition that the convolutional inductive bias is useful for smaller datasets, but for larger ones, learning the relevant patterns directly from data is sufficient, even beneficial.
[Figure 4: Linear few-shot evaluation on ImageNet versus pre-training size (random JFT subsets of 9M, 30M, 90M, and the full 300M).]
  Overall, the few-shot results on ImageNet (Figure 4), as well as the low-data results on VTAB (Table 2) seem promising for very low-data transfer. Further analysis of few-shot properties of ViT is an exciting direction of future work.

4.4 SCALING STUDY

We perform a controlled scaling study of different models by evaluating transfer performance from JFT-300M. In this setting data size does not bottleneck the models’ performances, and we assess performance versus pre-training cost of each model. The model set includes: 7 ResNets, R50x1, R50x2, R101x1, R152x1, R152x2, pre-trained for 7 epochs, plus R152x2 and R200x3 pre-trained for 14 epochs; 6 Vision Transformers, ViT-B/32, B/16, L/32, L/16, pre-trained for 7 epochs, plus L/16 and H/14 pre-trained for 14 epochs; and 5 hybrids, R50+ViT-B/32, B/16, L/32, L/16 pre-trained for 7 epochs, plus R50+ViT-L/16 pre-trained for 14 epochs (for hybrids, the number at the end of the model name stands not for the patch size, but for the total downsampling ratio in the ResNet backbone).
  Figure 5 contains the transfer performance versus total pre-training compute (see Appendix D.5 for details on computational costs). Detailed results per model are provided in Table 6 in the Appendix. A few patterns can be observed. First, Vision Transformers dominate ResNets on the performance/compute trade-off. ViT uses approximately 2 − 4× less compute to attain the same performance (average over 5 datasets). Second, hybrids slightly outperform ViT at small computational budgets, but the difference vanishes for larger models. This result is somewhat surprising, since one might expect convolutional local feature processing to assist ViT at any size. Third, Vision Transformers appear not to saturate within the range tried, motivating future scaling efforts.
[Figure 5: Transfer performance versus total pre-training compute for Vision Transformers, ResNets, and hybrids.]

4.5 INSPECTING VISION TRANSFORMER

To begin to understand how the Vision Transformer processes image data, we analyze its internal representations. The first layer of the Vision Transformer linearly projects the flattened patches into a lower-dimensional space (Eq. 1). Figure 7 (left) shows the top principal components of the learned embedding filters. The components resemble plausible basis functions for a low-dimensional representation of the fine structure within each patch.
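One way to reproduce such a visualization is a simple PCA over the learned projection weights, roughly as sketched below (it assumes the weight is stored as a (P²·C)×D matrix and is not the paper's plotting code):

```python
import torch

def embedding_filter_components(E: torch.Tensor, patch_size: int = 16,
                                in_chans: int = 3, k: int = 28) -> torch.Tensor:
    """Top-k principal components of the learned embedding filters, reshaped to
    patch-sized images for visualization. E: (P*P*C, D)."""
    X = E - E.mean(dim=1, keepdim=True)                  # center the D filters
    U, S, Vh = torch.linalg.svd(X, full_matrices=False)  # columns of U = principal components
    return U[:, :k].t().reshape(k, in_chans, patch_size, patch_size)
```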
  After the projection, a learned position embedding is added to the patch representations. Figure 7 (center) shows that the model learns to encode distance within the image in the similarity of position embeddings, i.e. closer patches tend to have more similar position embeddings. Further, the row-column structure appears; patches in the same row/column have similar embeddings. Finally, a sinusoidal structure is sometimes apparent for larger grids (Appendix D). That the position embeddings learn to represent 2D image topology explains why hand-crafted 2D-aware embedding variants do not yield improvements (Appendix D.4).
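This structure can be inspected directly by computing cosine similarities between the learned position embeddings, roughly as below (assumed layout: [class] token first, square patch grid):

```python
import torch
import torch.nn.functional as F

def pos_embed_similarity(pos_embed: torch.Tensor, grid: int) -> torch.Tensor:
    """Cosine similarity of each patch position embedding with every other one
    (the quantity visualized in Figure 7, center). pos_embed: (1, 1 + grid**2, D)."""
    p = F.normalize(pos_embed[0, 1:], dim=-1)    # (N, D), unit-norm patch position embeddings
    sim = p @ p.t()                              # (N, N) cosine similarities
    return sim.reshape(grid, grid, grid, grid)   # one (grid x grid) similarity map per position
```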
  Self-attention allows ViT to integrate information across the entire image even in the lowest layers. We investigate to what degree the network makes use of this capability. Specifically, we compute the average distance in image space across which information is integrated, based on the attention weights (Figure 7, right). This “attention distance” is analogous to receptive field size in CNNs. We find that some heads attend to most of the image already in the lowest layers, showing that the ability to integrate information globally is indeed used by the model. Other attention heads have consistently small attention distances in the low layers. This highly localized attention is less pronounced in hybrid models that apply a ResNet before the Transformer (Figure 7, right), suggesting that it may serve a similar function as early convolutional layers in CNNs. Further, the attention distance increases with network depth. Globally, we find that the model attends to image regions that are semantically relevant for classification (Figure 6).
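A simplified way to compute this quantity from a layer's attention weights, ignoring the class token and measuring distance in units of patches (an assumption-laden sketch, not the authors' exact definition):

```python
import torch

def mean_attention_distance(attn: torch.Tensor, grid: int) -> torch.Tensor:
    """Average image-space distance over which each head integrates information.
    attn: (heads, N, N) attention weights over the N = grid**2 patch tokens."""
    coords = torch.stack(torch.meshgrid(torch.arange(grid), torch.arange(grid),
                                        indexing="ij"), dim=-1).reshape(-1, 2).float()
    dist = torch.cdist(coords, coords)                               # (N, N) pairwise patch distances
    return (attn * dist).sum(dim=(-1, -2)) / attn.sum(dim=(-1, -2))  # one distance per head
```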
[Figure 6: Representative examples of attention from the output token to the input image. Figure 7: Left: principal components of the learned patch embedding filters. Center: similarity of position embeddings. Right: attention distance by head and network depth.]

4.6 SELF-SUPERVISION

Transformers show impressive performance on NLP tasks. However, much of their success stems not only from their excellent scalability but also from large scale self-supervised pre-training (Devlin et al., 2019; Radford et al., 2018). We also perform a preliminary exploration on masked patch prediction for self-supervision, mimicking the masked language modeling task used in BERT. With self-supervised pre-training, our smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant improvement of 2% to training from scratch, but still 4% behind supervised pre-training. Appendix B.1.2 contains further details. We leave exploration of contrastive pre-training (Chen et al., 2020b; He et al., 2020; Bachman et al., 2019; Hénaff et al., 2020) to future work.
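As a rough illustration of the input corruption used for this pre-training task (the masking ratio and the use of a single learned mask token follow the paper's appendix only loosely; treat these details as assumptions):

```python
import torch

def corrupt_patches(patch_tokens: torch.Tensor, mask_token: torch.Tensor,
                    mask_ratio: float = 0.5):
    """Replace a random subset of patch embeddings with a learned mask token; the model
    is then trained to predict properties of the masked patches.
    patch_tokens: (B, N, D); mask_token: (1, 1, D)."""
    B, N, D = patch_tokens.shape
    mask = torch.rand(B, N) < mask_ratio                    # True where a patch is corrupted
    corrupted = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, D), patch_tokens)
    return corrupted, mask
```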

5 CONCLUSION

We have explored the direct application of Transformers to image recognition. Unlike prior works using self-attention in computer vision, we do not introduce image-specific inductive biases into the architecture apart from the initial patch extraction step. Instead, we interpret an image as a sequence of patches and process it by a standard Transformer encoder as used in NLP. This simple, yet scalable, strategy works surprisingly well when coupled with pre-training on large datasets. Thus, Vision Transformer matches or exceeds the state of the art on many image classification datasets, whilst being relatively cheap to pre-train.
  While these initial results are encouraging, many challenges remain. One is to apply ViT to other computer vision tasks, such as detection and segmentation. Our results, coupled with those in Carion et al. (2020), indicate the promise of this approach. Another challenge is to continue exploring self-supervised pre-training methods. Our initial experiments show improvement from self-supervised pre-training, but there is still large gap between self-supervised and large-scale supervised pretraining. Finally, further scaling of ViT would likely lead to improved performance.
