Paper Close Reading: ViT

The introduction of ViT challenged the absolute dominance of CNNs in computer vision. ViT not only opened a new line of work in vision; by breaking down the barrier between CV and NLP, it also opened up a huge space for multimodal research.

The title is "AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE": an image is split into many patches, each of size 16x16.

Why Transformers are hard to apply to vision: for an input sequence of length n, self-attention has O(n²) computational complexity. If every pixel of a 224x224 image were treated as a token, the sequence length would be 224x224 = 50,176, which makes the computation prohibitively expensive.
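
A quick back-of-the-envelope comparison of pixel-level versus 16x16-patch-level sequence lengths (my own arithmetic, matching the numbers that come up later in the post):

```latex
n_{\text{pixel}} = 224 \times 224 = 50{,}176,
\qquad n_{\text{pixel}}^{2} \approx 2.5 \times 10^{9}
\\[4pt]
n_{\text{patch}} = \frac{224 \times 224}{16 \times 16} = 14 \times 14 = 196,
\qquad n_{\text{patch}}^{2} \approx 3.8 \times 10^{4}
```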

Abstract

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

Introduction

Self-attention-based architectures, in particular Transformers (Vaswani et al, 2017), have become the model of choice in natural language processing (NLP). The dominant approach is to pre-train on a large text corpus and then fine-tune on a smaller task-specific dataset (Devlin et al, 2019). Thanks to Transformers’ computational efficiency and scalability, it has become possible to train models of unprecedented size, with over 100B parameters (Brown et al, 2020; Lepikhin et al, 2020). With the models and datasets growing, there is still no sign of saturating performance.

In computer vision, however, convolutional architectures remain dominant (LeCun et al, 1989; Krizhevsky et al, 2012; He et al, 2016). Inspired by NLP successes, multiple works try combining CNN-like architectures with self-attention (Wang et al, 2018; Carion et al, 2020), some replacing the convolutions entirely (Ramachandran et al, 2019; Wang et al, 2020a). The latter models, while theoretically efficient, have not yet been scaled effectively on modern hardware accelerators due to the use of specialized attention patterns. Therefore, in large-scale image recognition, classic ResNet-like architectures are still state of the art (Mahajan et al, 2018; Xie et al, 2020; Kolesnikov et al, 2020).

Summary:

To bring Transformers into vision, people have tried different approaches:

(1) Mix a CNN with a Transformer to shorten the sequence: feed the Transformer an intermediate CNN feature map instead of raw pixels. For example, a 14x14 feature map from a late ResNet-50 stage gives a sequence of only 196 tokens, which keeps the sequence length manageable.

(2) Replace convolution with self-attention entirely. Stand-alone (local) self-attention uses a local window instead of the whole image; axial attention factorizes the 2D image into two 1D passes, first attending along the height dimension and then along the width dimension (a rough sketch follows below). These are efficient in theory, but because they rely on specialized attention operations they have not been accelerated well on current hardware, so it is hard to train them into large models.
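
To make the axial idea concrete, here is a minimal PyTorch sketch (my own toy illustration, not the implementation from the papers cited above): attention runs along the height axis and then along the width axis, so each position attends to H + W others instead of H x W.

```python
import torch
import torch.nn as nn

class AxialSelfAttention(nn.Module):
    """Toy axial self-attention: one attention pass per spatial axis."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.height_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.width_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C)
        B, H, W, C = x.shape

        # 1) attend along the height axis: each of the B*W columns is a sequence of H tokens
        cols = x.permute(0, 2, 1, 3).reshape(B * W, H, C)
        cols, _ = self.height_attn(cols, cols, cols)
        x = cols.reshape(B, W, H, C).permute(0, 2, 1, 3)

        # 2) attend along the width axis: each of the B*H rows is a sequence of W tokens
        rows = x.reshape(B * H, W, C)
        rows, _ = self.width_attn(rows, rows, rows)
        return rows.reshape(B, H, W, C)

feature_map = torch.randn(2, 14, 14, 64)        # e.g. a 14x14 feature map with 64 channels
out = AxialSelfAttention(dim=64)(feature_map)   # -> (2, 14, 14, 64)
```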

Inspired by the Transformer scaling successes in NLP, we experiment with applying a standard Transformer directly to images, with the fewest possible modifications. To do so, we split an image into patches and provide the sequence of linear embeddings of these patches as an input to a Transformer. Image patches are treated the same way as tokens (words) in an NLP application. We train the model on image classification in supervised fashion.

Summary:

The image is cut into 16x16 patches. A 224x224 image becomes 14x14 patches of size 16x16; each patch goes through a fully connected layer (a linear projection) to produce a linear embedding, and the resulting 14x14 = 196 linear embeddings are fed into the Transformer.
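
A minimal PyTorch sketch of this patchify-and-project step (variable names and the D = 768 embedding size are my own illustration, matching ViT-B/16):

```python
import torch
import torch.nn as nn

patch_size, embed_dim = 16, 768                      # ViT-B/16-style sizes
img = torch.randn(1, 3, 224, 224)                    # (B, C, H, W)

# Split into non-overlapping 16x16 patches and flatten each one
patches = img.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(1, 2).flatten(2)   # (1, 196, 3*16*16)

proj = nn.Linear(3 * patch_size * patch_size, embed_dim)               # the linear projection
patch_embeddings = proj(patches)                                       # (1, 196, 768)

# Equivalent shortcut used by most implementations: a strided convolution
conv_proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
patch_embeddings_conv = conv_proj(img).flatten(2).transpose(1, 2)      # (1, 196, 768)
```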

When trained on mid-sized datasets such as ImageNet without strong regularization, these models yield modest accuracies of a few percentage points below ResNets of comparable size. This seemingly discouraging outcome may be expected: Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data.

However, the picture changes if the models are trained on larger datasets (14M-300M images). We find that large scale training trumps inductive bias. Our Vision Transformer (ViT) attains excellent results when pre-trained at sufficient scale and transferred to tasks with fewer datapoints. When pre-trained on the public ImageNet-21k dataset or the in-house JFT-300M dataset, ViT approaches or beats state of the art on multiple image recognition benchmarks. In particular, the best model reaches the accuracy of 88.55% on ImageNet, 90.72% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.63% on the VTAB suite of 19 tasks.

Summary:

Compared with CNNs, Transformers lack certain inductive biases, i.e. prior knowledge. For CNNs we usually point to two inductive biases. One is locality: a CNN convolves with a sliding window over the image, so it assumes that neighboring regions of the image contain related features. The other is translation equivariance, f(g(x)) = g(f(x)): translating first and then convolving gives the same result as convolving first and then translating (a small numerical check follows below).

With these two inductive biases a CNN carries a lot of prior information, so it can learn a good model from relatively little data. A Transformer has no such priors and must learn everything from the data; on large-scale datasets, however, the Transformer's power shows.
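
A tiny PyTorch check of translation equivariance (my own illustration; circular padding and circular shifts are used so the image boundary does not break the equality):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode="circular", bias=False)
x = torch.randn(1, 1, 8, 8)

shift = lambda t: torch.roll(t, shifts=(2, 3), dims=(2, 3))   # g: translate the image

# f(g(x)) == g(f(x)) up to numerical precision -> the convolution is translation-equivariant
print(torch.allclose(conv(shift(x)), shift(conv(x)), atol=1e-6))   # True
```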

Related Work

Transformers were proposed by Vaswani et al (2017) for machine translation, and have since become the state of the art method in many NLP tasks. Large Transformer-based models are often pre-trained on large corpora and then fine-tuned for the task at hand: BERT (Devlin et al, 2019) uses a denoising self-supervised pre-training task, while the GPT line of work uses language modeling as its pre-training task (Radford et al, 2018; 2019; Brown et al, 2020).

Naive application of self-attention to images would require that each pixel attends to every other pixel. With quadratic cost in the number of pixels, this does not scale to realistic input sizes. Thus, to apply Transformers in the context of image processing, several approximations have been tried in the past. Parmar et al (2018) applied the self-attention only in local neighborhoods for each query pixel instead of globally. Such local multi-head dot-product self-attention blocks can completely replace convolutions (Hu et al, 2019; Ramachandran et al, 2019; Zhao et al, 2020). In a different line of work, Sparse Transformers (Child et al, 2019) employ scalable approximations to global self-attention in order to be applicable to images. An alternative way to scale attention is to apply it in blocks of varying sizes (Weissenborn et al, 2019), in the extreme case only along individual axes (Ho et al, 2019; Wang et al, 2020a). Many of these specialized attention architectures demonstrate promising results on computer vision tasks, but require complex engineering to be implemented efficiently on hardware accelerators.

Most related to ours is the model of Cordonnier et al (2020), which extracts patches of size 2 × 2 from the input image and applies full self-attention on top. This model is very similar to ViT, but our work goes further to demonstrate that large scale pre-training makes vanilla transformers competitive with (or even better than) state-of-the-art CNNs. Moreover, Cordonnier et al (2020) use a small patch size of 2 × 2 pixels, which makes the model applicable only to small-resolution images, while we handle medium-resolution images as well.

Summary:

The work most closely related to ViT extracted 2x2 patches on CIFAR-10 (where images are 32x32) and applied self-attention on top, which is almost the same recipe as ViT.

Brute force works wonders: Google has plenty of accelerators over there, so ViT can afford to run with bigger patches (😂)

There has also been a lot of interest in combining convolutional neural networks (CNNs) with forms of self-attention, e.g. by augmenting feature maps for image classification (Bello et al, 2019) or by further processing the output of a CNN using self-attention, e.g. for object detection (Hu et al, 2018; Carion et al, 2020), video processing (Wang et al, 2018; Sun et al, 2019), image classification (Wu et al, 2020), unsupervised object discovery (Locatello et al, 2020), or unified text-vision tasks (Chen et al, 2020c; Lu et al, 2019; Li et al, 2019).

Another recent related model is image GPT (iGPT) (Chen et al, 2020a), which applies Transformers to image pixels after reducing image resolution and color space. The model is trained in an unsupervised fashion as a generative model, and the resulting representation can then be fine-tuned or probed linearly for classification performance, achieving a maximal accuracy of 72% on ImageNet.

Our work adds to the increasing collection of papers that explore image recognition at larger scales than the standard ImageNet dataset. The use of additional data sources allows to achieve state-of-the-art results on standard benchmarks (Mahajan et al, 2018; Touvron et al, 2019; Xie et al, 2020). Moreover, Sun et al (2017) study how CNN performance scales with dataset size, and Kolesnikov et al (2020); Djolonga et al (2020) perform an empirical exploration of CNN transfer learning from large scale datasets such as ImageNet-21k and JFT-300M. We focus on these two latter datasets as well, but train Transformers instead of ResNet-based models used in prior works.

Method

In model design we follow the original Transformer (Vaswani et al, 2017) as closely as possible. An advantage of this intentionally simple setup is that scalable NLP Transformer architectures – and their efficient implementations – can be used almost out of the box.

Vision Transformer (ViT)

First the image is split into patches, and the patches form a sequence; each patch passes through a linear projection layer to give a feature vector (the patch embedding). Although self-attention lets all elements interact pairwise and is itself order-agnostic, the sequence order here must not be shuffled: the image is a whole, and shuffling the patches would no longer give the original image, so positional information is added to the embeddings. The sequence is then fed into the Transformer, which produces one output per token. How do we pick an output for classification? Following BERT's [cls] token, ViT prepends a classification token (drawn as * in the paper) whose position index is always 0; thanks to self-attention, this token aggregates information from all the other embeddings, so classification is done from its output alone.

The positional encoding is not literally the numbers 1, 2, 3, ...: those are just row indices into a table whose rows are learnable vectors carrying the positional information. The authors also tried 2D positional encodings, but the results were about the same as the learnable 1D vectors.
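
A minimal sketch (my own, with hypothetical names) of prepending the class token and adding learnable 1D position embeddings to the 196 patch embeddings from above:

```python
import torch
import torch.nn as nn

num_patches, embed_dim = 196, 768
patch_embeddings = torch.randn(1, num_patches, embed_dim)             # from the linear projection

cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))                # learnable [class] token
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))  # learnable 1D positions

tokens = torch.cat([cls_token.expand(1, -1, -1), patch_embeddings], dim=1)  # (1, 197, 768)
tokens = tokens + pos_embed                                            # add positional information
# `tokens` goes into the Transformer encoder; the classifier later reads tokens[:, 0]
```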

An overview of the model is depicted in Figure 1. The standard Transformer receives as input a 1D sequence of token embeddings. To handle 2D images, we reshape the image $\mathbf{x} \in \mathbb{R}^{H \times W \times C}$ into a sequence of flattened 2D patches $\mathbf{x}_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(H, W)$ is the resolution of the original image, $C$ is the number of channels, $(P, P)$ is the resolution of each image patch, and $N = HW/P^2$ is the resulting number of patches, which also serves as the effective input sequence length for the Transformer. The Transformer uses constant latent vector size $D$ through all of its layers, so we flatten the patches and map to $D$ dimensions with a trainable linear projection (Eq. 1). We refer to the output of this projection as the patch embeddings.

Similar to BERT's [class] token, we prepend a learnable embedding to the sequence of embedded patches ($\mathbf{z}_0^0 = \mathbf{x}_{\text{class}}$), whose state at the output of the Transformer encoder ($\mathbf{z}_L^0$) serves as the image representation $\mathbf{y}$ (Eq. 4). Both during pre-training and fine-tuning, a classification head is attached to $\mathbf{z}_L^0$. The classification head is implemented by a MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time.

Position embeddings are added to the patch embeddings to retain positional information. We use standard learnable 1D position embeddings, since we have not observed significant performance gains from using more advanced 2D-aware position embeddings (Appendix D.4). The resulting sequence of embedding vectors serves as input to the encoder.

The Transformer encoder (Vaswani et al, 2017) consists of alternating layers of multiheaded selfattention (MSA, see Appendix A) and MLP blocks (Eq. 2, 3). Layernorm (LN) is applied before every block, and residual connections after every block (Wang et al, 2019; Baevski & Auli, 2019).
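A minimal pre-norm encoder block in PyTorch, following the structure described here (LayerNorm before each sub-block, residual connection after it); this is my own sketch, not the official implementation:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12, mlp_ratio: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.ln1(z)
        z = z + self.attn(h, h, h)[0]      # MSA block (Eq. 2): LN -> attention -> residual
        z = z + self.mlp(self.ln2(z))      # MLP block (Eq. 3): LN -> MLP -> residual
        return z

tokens = torch.randn(1, 197, 768)          # [class] token + 196 patch embeddings
out = EncoderBlock()(tokens)               # same shape: (1, 197, 768)
```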

Inductive bias. We note that Vision Transformer has much less image-specific inductive bias than CNNs. In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance are baked into each layer throughout the whole model. In ViT, only MLP layers are local and translationally equivariant, while the self-attention layers are global. The two-dimensional neighborhood structure is used very sparingly: in the beginning of the model by cutting the image into patches and at fine-tuning time for adjusting the position embeddings for images of different resolution (as described below). Other than that, the position embeddings at initialization time carry no information about the 2D positions of the patches and all spatial relations between the patches have to be learned from scratch.

Hybrid Architecture. As an alternative to raw image patches, the input sequence can be formed from feature maps of a CNN (LeCun et al, 1989). In this hybrid model, the patch embedding projection E (Eq. 1) is applied to patches extracted from a CNN feature map. As a special case, the patches can have spatial size 1x1, which means that the input sequence is obtained by simply flattening the spatial dimensions of the feature map and projecting to the Transformer dimension. The classification input embedding and position embeddings are added as described above.

Summary:

ViT has little image-specific inductive bias, so it is understandable that it does worse than CNNs on small amounts of data.

In the hybrid architecture, the image first goes through a CNN; patches (possibly of spatial size 1x1) are then taken from the resulting feature map, flattened, and projected to give the patch embeddings (see the sketch below).
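
A rough sketch of the hybrid input pipeline, using torchvision's ResNet-50 backbone as an assumed example; the projection and shapes are illustrative only:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = nn.Sequential(*list(resnet50(weights=None).children())[:-2])  # keep conv stages only
img = torch.randn(1, 3, 224, 224)

feat = backbone(img)                            # (1, 2048, 7, 7) feature map
tokens = feat.flatten(2).transpose(1, 2)        # 1x1 "patches": (1, 49, 2048)

proj = nn.Linear(2048, 768)                     # patch embedding projection
patch_embeddings = proj(tokens)                 # (1, 49, 768) -> prepend [class], add positions
```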

D.3 Head Type And Class Token

In order to stay as close as possible to the original Transformer model, we made use of an additional [class] token, which is taken as image representation. The output of this token is then transformed into a class prediction via a small multi-layer perceptron (MLP) with tanh as non-linearity in the single hidden layer.

This design is inherited from the Transformer model for text, and we use it throughout the main paper. An initial attempt at using only image-patch embeddings, globally average-pooling (GAP) them, followed by a linear classifier—just like ResNet’s final feature map—performed very poorly.

However, we found that this is neither due to the extra token, nor to the GAP operation. Instead, the difference in performance is fully explained by the requirement for a different learning-rate, see Figure 9.

Summary:

Both readouts work: global average pooling (GAP) over the Transformer's output tokens, or the [cls] token. The [cls] token is used mainly so that the Transformer is modified as little as possible; the two options simply need different learning rates.
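
The two readout options side by side, as a small sketch (hypothetical shapes; `encoder_output` stands for the token sequence coming out of the Transformer):

```python
import torch
import torch.nn as nn

encoder_output = torch.randn(1, 197, 768)     # [class] token + 196 patch tokens
head = nn.Linear(768, 1000)                   # fine-tuning-style linear classifier

# Option 1: take the [class] token (what the main paper uses)
logits_cls = head(encoder_output[:, 0])

# Option 2: global average pooling over the patch tokens only
logits_gap = head(encoder_output[:, 1:].mean(dim=1))
```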

D.4 Position Embedding

We ran ablations on different ways of encoding spatial information using positional embedding. We tried the following cases:

• Providing no positional information: Considering the inputs as a bag of patches.

• 1-dimensional positional embedding: Considering the inputs as a sequence of patches in the raster order (default across all other experiments in this paper).

• 2-dimensional positional embedding: Considering the inputs as a grid of patches in two dimensions. In this case, two sets of embeddings are learned, each for one of the axes, X-embedding and Y-embedding, each with size D/2. Then, based on the coordinate of the patch in the input, we concatenate the X and Y embedding to get the final positional embedding for that patch.

• Relative positional embeddings: Considering the relative distance between patches to encode the spatial information, instead of their absolute position. To do so, we use 1-dimensional relative attention, in which we define the relative distance for all possible pairs of patches. Thus, for every given pair (one as query, and the other as key/value in the attention mechanism), we have an offset pq − pk, where each offset is associated with an embedding. Then, we simply run extra attention, where we use the original query (the content of the query), but use relative positional embeddings as keys. We then use the logits from the relative attention as a bias term and add it to the logits of the main attention (content-based attention) before applying the softmax.
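
Written out as a formula (my own notation for the scheme described above; whether the 1/√D_h scaling also covers the bias term is an assumption):

```latex
A_{ij} = \operatorname{softmax}_j\!\left(
    \frac{q_i k_j^{\top} + q_i\, r_{p_q - p_k}^{\top}}{\sqrt{D_h}}
\right),
\qquad r_{p_q - p_k} \in \mathbb{R}^{D_h}
\ \text{is the learned embedding of the offset } p_q - p_k .
```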

Summary:

1-D: 1, 2, 3, ..., 9

2-D: (1,1), (1,2), ..., (3,3)

Relative: encode relative offsets between patches instead of absolute positions
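
A small sketch of how the 1-D and 2-D variants could be constructed (shapes are illustrative, assuming a 14x14 patch grid and D = 768):

```python
import torch
import torch.nn as nn

grid, dim = 14, 768
num_patches = grid * grid

# 1-D: one learnable vector per raster-order position
pos_1d = nn.Parameter(torch.zeros(num_patches, dim))

# 2-D: learn an X table and a Y table of size D/2 each, then concatenate per patch
x_embed = nn.Parameter(torch.zeros(grid, dim // 2))
y_embed = nn.Parameter(torch.zeros(grid, dim // 2))
ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
pos_2d = torch.cat([x_embed[xs.flatten()], y_embed[ys.flatten()]], dim=-1)   # (196, 768)
```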

 Table 8 summarizes the results from this ablation study on a ViT-B/16 model. As we can see, while there is a large gap between the performances of the model with no positional embedding and models with positional embedding, there is little to no difference between different ways of encoding positional information. We speculate that since our Transformer encoder operates on patch-level inputs, as opposed to pixel-level, the differences in how to encode spatial information is less important. More precisely, in patch-level inputs, the spatial dimensions are much smaller than the original pixel-level inputs, e.g., 14 × 14 as opposed to 224 × 224, and learning to represent the spatial relations in this resolution is equally easy for these different positional encoding strategies. Even so, the specific pattern of position embedding similarity learned by the network depends on the training hyperparameters (Figure 10).

Summary:

There are far fewer patches than pixels, so positional information is much easier to pick up, which is why the different positional encodings all perform about the same.

Fine-Tuning And Higher Resolution

Typically, we pre-train ViT on large datasets, and fine-tune to (smaller) downstream tasks. For this, we remove the pre-trained prediction head and attach a zero-initialized D × K feedforward layer, where K is the number of downstream classes. It is often beneficial to fine-tune at higher resolution than pre-training (Touvron et al, 2019; Kolesnikov et al, 2020). When feeding images of higher resolution, we keep the patch size the same, which results in a larger effective sequence length. The Vision Transformer can handle arbitrary sequence lengths (up to memory constraints), however, the pre-trained position embeddings may no longer be meaningful. We therefore perform 2D interpolation of the pre-trained position embeddings, according to their location in the original image. Note that this resolution adjustment and patch extraction are the only points at which an inductive bias about the 2D structure of the images is manually injected into the Vision Transformer.

Summary:

The 2D interpolation here uses torch's built-in interpolate function. It can cost a bit of accuracy and is only a stop-gap, which is a limitation of ViT at fine-tuning time.
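
A sketch of interpolating the pre-trained patch position embeddings to a new grid size with `torch.nn.functional.interpolate` (the class token's embedding is kept as-is; exact details vary by implementation):

```python
import torch
import torch.nn.functional as F

old_grid, new_grid, dim = 14, 24, 768              # e.g. 224px -> 384px with 16x16 patches
pos_embed = torch.randn(1, 1 + old_grid**2, dim)   # [class] position + 196 patch positions

cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)  # (1, D, 14, 14)
patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                          mode="bicubic", align_corners=False)                  # (1, D, 24, 24)
patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid**2, dim)

new_pos_embed = torch.cat([cls_pos, patch_pos], dim=1)   # (1, 1 + 576, 768)
```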

Experiments

We evaluate the representation learning capabilities of ResNet, Vision Transformer (ViT), and the hybrid. To understand the data requirements of each model, we pre-train on datasets of varying size and evaluate many benchmark tasks. When considering the computational cost of pre-training the model, ViT performs very favourably, attaining state of the art on most recognition benchmarks at a lower pre-training cost. Lastly, we perform a small experiment using self-supervision, and show that self-supervised ViT holds promise for the future.

Summary:

The experiments are not covered in detail here; explore them yourself if you are interested.

Roughly, the paper compares CNNs and ViT in accuracy, training compute, training speed, and convergence, and compares against the hybrid model; it also looks at the similarity patterns learned by the position embeddings, what the patch embedding filters look like, how far self-attention reaches across the image, and a masked-patch self-supervised experiment.

Conclusion

We have explored the direct application of Transformers to image recognition. Unlike prior works using self-attention in computer vision, we do not introduce image-specific inductive biases into the architecture apart from the initial patch extraction step. Instead, we interpret an image as a sequence of patches and process it by a standard Transformer encoder as used in NLP. This simple, yet scalable, strategy works surprisingly well when coupled with pre-training on large datasets. Thus, Vision Transformer matches or exceeds the state of the art on many image classification datasets, whilst being relatively cheap to pre-train.

Summary:

The appeal of ViT is that it needs no particular understanding of or domain knowledge about vision: we can simply treat an image as a sequence of patches, just like words in a sentence, and feed it into a Transformer for image classification.

While these initial results are encouraging, many challenges remain. One is to apply ViT to other computer vision tasks, such as detection and segmentation. Our results, coupled with those in Carion et al (2020), indicate the promise of this approach. Another challenge is to continue exploring self-supervised pre-training methods. Our initial experiments show improvement from self-supervised pre-training, but there is still large gap between self-supervised and large-scale supervised pre-training. Finally, further scaling of ViT would likely lead to improved performance.

Summary:

Open problems and outlook: how to handle detection and segmentation; exploring self-supervision, since in NLP all the big Transformers are trained with self-supervision; and making the model even bigger, which will likely bring further gains.
