Paper Translation | AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

btee

Last modified: 2022-02-24 17:27:46

Tags: deep learning, computer vision, transformer

First published: 2022-02-21 15:33:51

Original paper: https://arxiv.org/abs/2010.11929

AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

Abstract:

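The Transformer architecture is now the de facto standard in natural language processing, but its use in computer vision is still limited. In vision, attention is either combined with convolutional networks or used to replace certain components of convolutional networks while keeping their overall structure.
We find that this reliance on CNNs is not necessary: a "pure" Transformer applied directly to sequences of image patches performs very well on image classification.
When the model is pre-trained on large amounts of data and then transferred to mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), the Vision Transformer (ViT) attains excellent results compared with state-of-the-art convolutional networks while needing substantially fewer computational resources to train.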

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

1. Introduction

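Self-attention-based architectures, Transformers in particular, are the model of choice in natural language processing (NLP). The dominant approach is to pre-train on a large text corpus and then fine-tune on a smaller task-specific dataset.
Thanks to the Transformer's computational efficiency and scalability, it has become possible to train very large models with more than 100B parameters. Moreover, as models and datasets keep growing, performance has shown no sign of saturating.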

Self-attention-based architectures, in particular Transformers (Vaswani et al., 2017), have become the model of choice in natural language processing (NLP). The dominant approach is to pre-train on a large text corpus and then fine-tune on a smaller task-specific dataset (Devlin et al., 2019). Thanks to Transformers’ computational efficiency and scalability, it has become possible to train models of unprecedented size, with over 100B parameters (Brown et al., 2020; Lepikhin et al., 2020). With the models and datasets growing, there is still no sign of saturating performance.

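In computer vision, however, convolutional architectures remain dominant. Inspired by the success in NLP, many works try to combine convolutions with self-attention, and some replace the convolutions entirely. The latter models are efficient in theory, but because they rely on specialized attention patterns they have not yet been scaled effectively on modern hardware accelerators. As a result, classic ResNet-like architectures are still state of the art in large-scale image recognition.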

In computer vision, however, convolutional architectures remain dominant (LeCun et al., 1989; Krizhevsky et al., 2012; He et al., 2016). Inspired by NLP successes, multiple works try combining CNN-like architectures with self-attention (Wang et al., 2018; Carion et al., 2020), some replacing the convolutions entirely (Ramachandran et al., 2019; Wang et al., 2020a). The latter models, while theoretically efficient, have not yet been scaled effectively on modern hardware accelerators due to the use of specialized attention patterns. Therefore, in large-scale image recognition, classic ResNet-like architectures are still state of the art (Mahajan et al., 2018; Xie et al., 2020; Kolesnikov et al., 2020).

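Inspired by the success of Transformer scaling in NLP, we experiment with applying a standard Transformer directly to images with as few modifications as possible. To do so, we split an image into patches and feed the sequence of linear embeddings of these patches to the Transformer. Image patches are treated like tokens (words) in NLP. We train the model on image classification in a supervised fashion.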

Inspired by the Transformer scaling successes in NLP, we experiment with applying a standard Transformer directly to images, with the fewest possible modifications. To do so, we split an image into patches and provide the sequence of linear embeddings of these patches as an input to a Transformer. Image patches are treated the same way as tokens (words) in an NLP application. We train the model on image classification in supervised fashion.

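When trained on mid-sized datasets without strong regularization, these models reach accuracies a few percentage points below ResNets of comparable size. This seemingly discouraging result is to be expected: Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, so they generalize poorly when trained on insufficient data.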

When trained on mid-sized datasets such as ImageNet without strong regularization, these models yield modest accuracies of a few percentage points below ResNets of comparable size. This seemingly discouraging outcome may be expected: Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data.

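However, the picture changes when the models are trained on larger datasets. We find that large-scale training trumps inductive bias. ViT attains excellent results when pre-trained at sufficient scale and transferred to tasks with fewer data points. When pre-trained on the ImageNet-21k dataset or the in-house JFT-300M dataset, ViT approaches or beats the state of the art on multiple image recognition benchmarks.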

However, the picture changes if the models are trained on larger datasets (14M-300M images). We find that large scale training trumps inductive bias. Our Vision Transformer (ViT) attains excellent results when pre-trained at sufficient scale and transferred to tasks with fewer datapoints. When pre-trained on the public ImageNet-21k dataset or the in-house JFT-300M dataset, ViT approaches or beats state of the art on multiple image recognition benchmarks. In particular, the best model reaches the accuracy of 88.55% on ImageNet, 90.72% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.63% on the VTAB suite of 19 tasks.

2. Related Work

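Transformers were proposed by Vaswani et al. in 2017 for machine translation and have since become the mainstream method in NLP. Large Transformer-based models are usually pre-trained on large corpora and then fine-tuned for the task at hand. BERT uses a denoising self-supervised pre-training task, while the GPT line of work uses language modeling as its pre-training task.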

Transformers were proposed by Vaswani et al. (2017) for machine translation, and have since become the state of the art method in many NLP tasks. Large Transformer-based models are often pre-trained on large corpora and then fine-tuned for the task at hand: BERT (Devlin et al., 2019) uses a denoising self-supervised pre-training task, while the GPT line of work uses language modeling as its pre-training task (Radford et al., 2018; 2019; Brown et al., 2020).

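A naive application of self-attention to images would require each pixel to attend to every other pixel. With a cost quadratic in the number of pixels, this does not scale to realistic input sizes. Thus, to apply Transformers to image processing, several approximations have been tried in the past. Parmar et al. applied self-attention only in a local neighborhood of each query pixel instead of globally. Such local multi-head dot-product self-attention blocks can completely replace convolutions. In a different line of work, Sparse Transformers employ scalable approximations to global self-attention so that it becomes applicable to images. An alternative way to scale attention is to apply it in blocks of varying sizes. Many of these specialized attention architectures show promising results on computer vision tasks but require complex engineering to be implemented efficiently on hardware accelerators.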

Naive application of self-attention to images would require that each pixel attends to every other pixel. With quadratic cost in the number of pixels, this does not scale to realistic input sizes. Thus, to apply Transformers in the context of image processing, several approximations have been tried in the past. Parmar et al. (2018) applied the self-attention only in local neighborhoods for each query pixel instead of globally. Such local multi-head dot-product self-attention blocks can completely replace convolutions (Hu et al., 2019; Ramachandran et al., 2019; Zhao et al., 2020). In a different line of work, Sparse Transformers (Child et al., 2019) employ scalable approximations to global self-attention in order to be applicable to images. An alternative way to scale attention is to apply it in blocks of varying sizes (Weissenborn et al., 2019), in the extreme case only along individual axes (Ho et al., 2019; Wang et al., 2020a). Many of these specialized attention architectures demonstrate promising results on computer vision tasks, but require complex engineering to be implemented efficiently on hardware accelerators.
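To make the quadratic cost concrete, the short sketch below counts the entries of a single self-attention score matrix when every pixel is a token versus when every 16×16 patch is a token. The 224×224 input resolution and the function name are assumptions chosen for illustration only.

```python
def attention_matrix_entries(num_tokens: int) -> int:
    """Number of entries in one attention score matrix (all tokens attend to all tokens)."""
    return num_tokens * num_tokens


side = 224                        # assumed input resolution (illustrative)
pixel_tokens = side * side        # one token per pixel
patch_tokens = (side // 16) ** 2  # one token per 16x16 patch

pixel_cost = attention_matrix_entries(pixel_tokens)
patch_cost = attention_matrix_entries(patch_tokens)
print(pixel_tokens, pixel_cost)   # tokens and score-matrix entries, pixel level
print(patch_tokens, patch_cost)   # tokens and score-matrix entries, patch level
print(pixel_cost // patch_cost)   # how many times larger the pixel-level matrix is
```

Under these assumptions the pixel-level score matrix has billions of entries per head and per layer, while the patch-level one has tens of thousands, which is why operating on patches keeps global attention tractable.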

3. Method

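In designing the model we follow the original Transformer as closely as possible. An advantage of this deliberately simple setup is that scalable NLP Transformer architectures and their efficient implementations can be used almost out of the box.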

In model design we follow the original Transformer (Vaswani et al., 2017) as closely as possible. An advantage of this intentionally simple setup is that scalable NLP Transformer architectures – and their efficient implementations – can be used almost out of the box.

3.1 Vision Transformer (ViT)

An overview of the model is depicted in Figure 1. The standard Transformer receives as input a 1D sequence of token embeddings. To handle 2D images, we reshape the image $x \in \mathbb{R}^{H \times W \times C}$ into a sequence of flattened 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(H, W)$ is the resolution of the original image, $C$ is the number of channels, $(P, P)$ is the resolution of each image patch, and $N = HW/P^2$ is the resulting number of patches, which also serves as the effective input sequence length for the Transformer. The Transformer uses constant latent vector size $D$ through all of its layers, so we flatten the patches and map to $D$ dimensions with a trainable linear projection (Eq. 1). We refer to the output of this projection as the patch embeddings.

[Figure 1: Model overview.]
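As shown on the left of the dashed line in the figure above, we split the image into fixed-size patches (the 3×3 grid of nine patches at the bottom left), linearly project them, add position embeddings to the resulting patch embeddings, and feed the sequence to a standard Transformer. To perform classification, in addition to the nine patches we prepend an extra learnable classification token to the sequence (the block marked * at position 0).
The standard Transformer takes a 1D sequence of token embeddings as input. To handle 2D images, we reshape an image of size H×W×C into a sequence of flattened patches of size N×(P²·C), where P is the side length of each patch, i.e. (P, P) is the patch resolution, and N = HW/P² is the number of patches, which is also the input sequence length. The Transformer uses a constant latent vector size D through all of its layers, so the flattened patches are mapped to D dimensions with a trainable linear projection. The output of this projection is called the patch embeddings.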
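The reshaping and projection described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function name `patchify`, the latent size `D = 64`, and the random stand-in for the trainable projection `E` are all assumptions made for the example.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Reshape an (H, W, C) image into (N, P*P*C) flattened patches, row by row."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)            # (gh, gw, P, P, C)
    return x.reshape(gh * gw, patch * patch * c)


rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))    # assumed H = W = 224, C = 3
patches = patchify(image, 16)                 # N = 224*224 / 16^2 = 196 patches

D = 64                                        # hypothetical latent size
E = rng.standard_normal((patches.shape[1], D)) * 0.02  # random stand-in for the trainable projection
patch_embeddings = patches @ E                # (N, D)
print(patches.shape, patch_embeddings.shape)
```

In a real model `E` would be learned together with the rest of the network; here it is only sampled so that the shapes can be checked.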
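Similar to BERT's [class] token, we prepend a learnable embedding to the sequence of patch embeddings; its state at the output of the Transformer encoder, z_L^0, serves as the image representation y. In both pre-training and fine-tuning, the classification head is attached to z_L^0. The head is implemented as an MLP with one hidden layer during pre-training and as a single linear layer during fine-tuning.
Position embeddings are added to the patch embeddings to retain positional information. We use standard learnable 1D position embeddings, because we did not observe significant gains from 2D-aware position embeddings.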
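A minimal sketch of how the class token and the 1D position embeddings combine with the patch embeddings is given below. The sizes `N = 196` and `D = 64`, the zero-valued class token, and the random position table are illustrative assumptions; in a trained model both `x_class` and `E_pos` are learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 196, 64                                  # assumed number of patches and latent size
patch_embeddings = rng.standard_normal((N, D))  # stand-in for the projected patches

x_class = np.zeros((1, D))                      # learnable [class] embedding (zeros here for illustration)
E_pos = rng.standard_normal((N + 1, D)) * 0.02  # learnable 1D position embeddings, one per token

# Prepend the class token, then add one position embedding per sequence position.
z0 = np.concatenate([x_class, patch_embeddings], axis=0) + E_pos
print(z0.shape)
```

The sequence has N + 1 tokens because the class token occupies position 0; the encoder output at that position is what the classification head reads.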
The Transformer encoder therefore consists of:

MSA: multi-headed self-attention layers, alternating with the MLP blocks
MLP blocks
Layernorm (LN): applied before every block
Residual connections: added after every block

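The equations are as follows:
$$z_0 = [x_{class};\ x_p^1 E;\ x_p^2 E;\ \cdots;\ x_p^N E] + E_{pos}, \qquad E \in \mathbb{R}^{(P^2 \cdot C) \times D},\ E_{pos} \in \mathbb{R}^{(N+1) \times D} \tag{1}$$
$$z'_\ell = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \qquad \ell = 1 \ldots L \tag{2}$$
$$z_\ell = \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell, \qquad \ell = 1 \ldots L \tag{3}$$
$$y = \mathrm{LN}(z_L^0) \tag{4}$$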
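The encoder structure listed above (LN before each block, a residual connection after each block, MSA alternating with an MLP) can be sketched as a single NumPy layer. This is a simplified illustration under stated assumptions: biases and the learnable LN scale and shift are omitted, GELU uses the tanh approximation, and all weight names, sizes, and the head count are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize each token over its feature dimension (no learnable scale/shift here)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def msa(x, w_qkv, w_out, heads):
    """Multi-headed self-attention over a (tokens, dim) sequence, biases omitted."""
    n, d = x.shape
    dh = d // heads
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)                 # each (n, d)
    q = q.reshape(n, heads, dh).transpose(1, 0, 2)            # (heads, n, dh)
    k = k.reshape(n, heads, dh).transpose(1, 0, 2)
    v = v.reshape(n, heads, dh).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))    # (heads, n, n)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)         # concatenate heads
    return out @ w_out


def encoder_block(z, params, heads):
    """One layer: LN -> MSA -> residual, then LN -> MLP -> residual."""
    z = msa(layer_norm(z), params["w_qkv"], params["w_out"], heads) + z
    z = gelu(layer_norm(z) @ params["w1"]) @ params["w2"] + z
    return z


rng = np.random.default_rng(0)
n, d, heads = 197, 64, 4                                      # hypothetical sizes
params = {
    "w_qkv": rng.standard_normal((d, 3 * d)) * 0.02,
    "w_out": rng.standard_normal((d, d)) * 0.02,
    "w1": rng.standard_normal((d, 4 * d)) * 0.02,
    "w2": rng.standard_normal((4 * d, d)) * 0.02,
}
z0 = rng.standard_normal((n, d))
z1 = encoder_block(z0, params, heads)
y = layer_norm(z1[0])                                         # image representation from the class token
print(z1.shape, y.shape)
```

A full encoder would stack L such layers, each with its own parameters; one layer is enough to show where the normalization and the skip connections sit.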
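Inductive bias: Vision Transformer has much less image-specific inductive bias than CNNs. In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance are baked into every layer of the model. In ViT, only the MLP layers are local and translationally equivariant, while the self-attention layers are global. The two-dimensional neighborhood structure is used very sparingly: at the beginning of the model, when the image is cut into patches, and at fine-tuning time, when the position embeddings are adjusted for images of a different resolution. Other than that, the position embeddings carry no information about the 2D positions of the patches at initialization, and all spatial relations between patches have to be learned from scratch.
Hybrid architecture: As an alternative to raw image patches, the input sequence can be formed from the feature maps of a CNN. In this hybrid model, the patch embedding projection E is applied to patches extracted from a CNN feature map. As a special case, the patches can have spatial size 1×1, which means that the input sequence is obtained by simply flattening the spatial dimensions of the feature map.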
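For the 1×1 special case of the hybrid, forming the input sequence amounts to flattening the spatial dimensions of the feature map and applying the projection to each position. The sketch below assumes a hypothetical 14×14×1024 feature map and latent size 64; neither value comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_map = rng.standard_normal((14, 14, 1024))  # hypothetical CNN output: (height, width, channels)
D = 64                                             # hypothetical latent size
E = rng.standard_normal((1024, D)) * 0.02          # random stand-in for the trainable projection

# With 1x1 patches every spatial position becomes one token.
tokens = feature_map.reshape(-1, feature_map.shape[-1]) @ E
print(tokens.shape)
```

The class token and position embeddings are then added to this sequence in the same way as for raw image patches.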

3.2 Fine-tuning and Higher Resolution

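Typically, we pre-train ViT on large datasets and then fine-tune on downstream tasks. For this, we remove the pre-trained prediction head and attach a zero-initialized D×K feedforward layer, where K is the number of downstream classes. Fine-tuning at a higher resolution than pre-training is often beneficial. When feeding images of higher resolution, we keep the patch size the same, which results in a longer effective sequence length. ViT can handle arbitrary sequence lengths (up to memory constraints), but the pre-trained position embeddings may then no longer be meaningful. We therefore perform 2D interpolation of the pre-trained position embeddings according to their location in the original image.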
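The 2D interpolation of position embeddings can be sketched as follows. The text does not say which interpolation kernel is used, so bilinear interpolation with aligned corners is an assumption here, as are the function names, the 14×14 to 24×24 grid sizes (224 and 384 pixel inputs with 16×16 patches), and the latent size.

```python
import numpy as np


def resize_bilinear(grid: np.ndarray, new_h: int, new_w: int) -> np.ndarray:
    """Bilinearly resize an (h, w, d) grid to (new_h, new_w, d), keeping the corners fixed."""
    h, w, _ = grid.shape
    ys = np.linspace(0, h - 1, new_h)
    xs = np.linspace(0, w - 1, new_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bottom = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


def interpolate_pos_embed(pos: np.ndarray, old_grid: int, new_grid: int) -> np.ndarray:
    """pos has shape (1 + old_grid**2, d) with the class-token embedding in row 0."""
    d = pos.shape[1]
    cls_pos, patch_pos = pos[:1], pos[1:]
    patch_pos = patch_pos.reshape(old_grid, old_grid, d)      # back to the 2D patch layout
    patch_pos = resize_bilinear(patch_pos, new_grid, new_grid)
    return np.concatenate([cls_pos, patch_pos.reshape(new_grid * new_grid, d)], axis=0)


rng = np.random.default_rng(0)
pos_pretrained = rng.standard_normal((1 + 14 * 14, 64))       # hypothetical pre-trained table
pos_finetune = interpolate_pos_embed(pos_pretrained, 14, 24)
print(pos_finetune.shape)
```

The class-token embedding has no spatial location, so it is carried over unchanged and only the patch part of the table is resized.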

4. Experiments

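We evaluate the representation learning capabilities of ResNet, ViT, and their hybrid. To understand the data requirements of each model, we pre-train on datasets of varying size and evaluate on many benchmark tasks. When the computational cost of pre-training is taken into account, ViT performs very favourably, reaching state of the art on most recognition benchmarks at a lower pre-training cost. Lastly, we run a small experiment with self-supervision and show that self-supervised ViT holds promise for the future.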

4.1 Setup

Datasets: To explore model scalability, we use the following pre-training datasets:

ILSVRC-2012 ImageNet: 1k classes and 1.3M images
ImageNet-21k: 21k classes and 14M images
JFT: 18k classes and 303M high-resolution images

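Following Kolesnikov et al., we de-duplicate the pre-training datasets with respect to the test sets of the downstream tasks. We transfer the models trained on these datasets to several benchmark tasks: ImageNet on the original validation labels and the cleaned-up ReaL labels, CIFAR-10/100, and Oxford-IIIT Pets. Pre-processing for these datasets follows Kolesnikov et al.
We also evaluate on the VTAB classification suite, which contains 19 tasks. VTAB evaluates low-data transfer to diverse tasks, using 1,000 training examples per task. The tasks are divided into three groups: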

Natural – tasks such as Pets and CIFAR
Specialized – medical and satellite imagery
Structured – tasks that require geometric understanding, such as localization

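Model Variants. We base the ViT configurations on those used for BERT, as summarized in Table 1. The "Base" and "Large" models are adopted directly from BERT, and we add the larger "Huge" model. In what follows we use brief notation to indicate the model size and the input patch size: for instance, ViT-L/16 means the "Large" variant with a 16×16 input patch size. Note that the Transformer's sequence length is inversely proportional to the square of the patch size, so models with smaller patch sizes are computationally more expensive.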
[Table 1: Details of the Vision Transformer model variants.]
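The inverse-square relationship between patch size and sequence length can be checked with a few lines of arithmetic. The 224×224 input resolution and the helper name are assumptions for illustration; the patch sizes are the ones encoded in the model names.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Sequence length N = (image_size / patch_size)^2 for a square image, class token excluded."""
    assert image_size % patch_size == 0, "image size must be divisible by patch size"
    return (image_size // patch_size) ** 2


image_size = 224  # assumed input resolution
lengths = {name: num_patches(image_size, p)
           for name, p in [("ViT-B/32", 32), ("ViT-L/16", 16), ("ViT-H/14", 14)]}
for name, n in lengths.items():
    print(name, n)
```

Halving the patch side from 32 to 16 quadruples the number of tokens, and the cost of each self-attention layer grows with the square of that number.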
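For the baseline CNNs we use ResNet, but replace the Batch Normalization layers with Group Normalization and use standardized convolutions. These modifications improve transfer, and we denote the modified model "ResNet (BiT)". For the hybrids, we feed the intermediate feature maps into ViT with a patch size of one "pixel". To experiment with different sequence lengths, we either (i) take the output of stage 4 of a regular ResNet50, or (ii) remove stage 4, place the same number of layers in stage 3 (keeping the total number of layers), and take the output of this extended stage 3. Option (ii) results in a 4x longer sequence and a more expensive ViT model.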
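The 4x figure follows from the spatial resolution of the two feature maps. The sketch below assumes the usual ResNet50 layout, in which stage 4 outputs features at 1/32 of the input resolution and stage 3 at 1/16, together with a 224×224 input; these strides and the helper name are assumptions, not values stated in the text.

```python
def hybrid_sequence_length(image_size: int, total_stride: int) -> int:
    """Number of 1x1 'patches' in a square feature map downsampled by total_stride."""
    side = image_size // total_stride
    return side * side


image_size = 224                                        # assumed input resolution
len_stage4 = hybrid_sequence_length(image_size, 32)     # option (i): assumed stride 32
len_stage3 = hybrid_sequence_length(image_size, 16)     # option (ii): assumed stride 16
print(len_stage4, len_stage3, len_stage3 // len_stage4)
```

Doubling the side length of the feature map quadruples the number of tokens that the ViT part has to process.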
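Training & fine-tuning: We train all models, including ResNets, using Adam with β1 = 0.9, β2 = 0.999, a batch size of 4096, and a weight decay of 0.1. In our setting, Adam works slightly better than SGD for ResNets. We use a linear learning rate warmup and decay. For fine-tuning we use SGD with momentum and a batch size of 512 for all models. For the ImageNet results in Table 2, we fine-tune at higher resolution: 512 for ViT-L/16 and 518 for ViT-H/14.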
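A linear warmup followed by a linear decay can be sketched as a simple function of the step count. The base learning rate, the warmup length, the total number of steps, and the choice to decay all the way to zero are hypothetical values picked for the example, not settings taken from the text.

```python
def lr_at_step(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate that rises linearly to base_lr, then falls linearly to 0 at total_steps."""
    assert 0 < warmup_steps < total_steps
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)


base_lr, warmup_steps, total_steps = 1e-3, 10_000, 100_000   # hypothetical schedule
for step in (0, 5_000, 10_000, 55_000, 100_000):
    print(step, lr_at_step(step, base_lr, warmup_steps, total_steps))
```

The warmup avoids large updates while the optimizer statistics are still unreliable, and the decay lets the model settle as training ends.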
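Table 2: Comparison with the state of the art on popular image classification benchmarks. We report the mean and standard deviation of the accuracies, averaged over three fine-tuning runs. Vision Transformer models pre-trained on the JFT-300M dataset outperform ResNet-based baselines on all datasets, while taking fewer computational resources to pre-train. ViT pre-trained on the smaller public ImageNet-21k dataset performs well too.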
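Metrics: We report results on downstream datasets through either fine-tuning or few-shot accuracy. Few-shot accuracies are obtained by solving a regularized least-squares regression problem.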
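The few-shot evaluation can be sketched as ridge regression solved in closed form on frozen features. The example assumes one-vs-rest targets in {-1, 1}^K, which is how the original paper describes the setup; the regularization strength `l2`, the function names, and the synthetic features standing in for a frozen backbone's outputs are illustrative assumptions.

```python
import numpy as np


def fit_few_shot_head(features: np.ndarray, labels: np.ndarray, num_classes: int, l2: float = 1.0) -> np.ndarray:
    """Closed-form regularized least squares from features (n, d) to +/-1 targets (n, K)."""
    n, d = features.shape
    targets = -np.ones((n, num_classes))
    targets[np.arange(n), labels] = 1.0
    gram = features.T @ features + l2 * np.eye(d)
    return np.linalg.solve(gram, features.T @ targets)        # weights of shape (d, K)


def predict(features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Pick the class with the largest regression output."""
    return (features @ weights).argmax(axis=1)


# Synthetic stand-in for frozen representations: 3 classes, 5 shots each.
rng = np.random.default_rng(0)
num_classes, shots, dim = 3, 5, 3
labels = np.repeat(np.arange(num_classes), shots)
features = 5.0 * np.eye(num_classes)[labels] + 0.1 * rng.standard_normal((num_classes * shots, dim))

weights = fit_few_shot_head(features, labels, num_classes)
predictions = predict(features, weights)
print((predictions == labels).mean())
```

Because the solution is available in closed form, this kind of probe is cheap enough to run repeatedly, which is useful when full fine-tuning would be too costly.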

4.2 Comparison to State of the Art

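We first compare our largest models, ViT-H/14 and ViT-L/16, with state-of-the-art CNNs from the literature. The first comparison point is Big Transfer (BiT), which performs supervised transfer learning with large ResNets. The second is Noisy Student, a large EfficientNet trained using semi-supervised learning on ImageNet and JFT-300M with the labels removed. Currently, Noisy Student is the state of the art on ImageNet and BiT-L on the other datasets. All models were trained on TPUv3 hardware, and we report the number of TPUv3-core-days taken to pre-train each of them, that is, the number of TPUv3 cores used for training multiplied by the training time in days.
In Table 2, the smaller ViT-L/16 model pre-trained on JFT-300M (third column) outperforms BiT-L (fifth column) on all tasks (first column), while requiring fewer computational resources (last row). The larger model, ViT-H/14 (second column), further improves performance, especially on the more challenging datasets: ImageNet, CIFAR-100, and the VTAB suite. Even this larger model still takes substantially less compute to pre-train than the prior state-of-the-art CNNs. We note, however, that pre-training efficiency may be affected not only by the architecture choice but also by other parameters such as the training schedule, optimizer, and weight decay. We provide a controlled study of performance vs. compute for different architectures in Section 4.4. Finally, the ViT-L/16 model pre-trained on the public ImageNet-21k dataset also performs well on most datasets while taking fewer resources to pre-train: it can be trained on a TPUv3 with 8 cores in approximately 30 days.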
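As a small worked example of the TPUv3-core-days unit, the 8-core, roughly 30-day run mentioned above converts as follows. The helper name is hypothetical; the two inputs are the figures quoted in the text.

```python
def tpu_core_days(num_cores: int, days: float) -> float:
    """TPUv3-core-days = number of cores used for training times training time in days."""
    return num_cores * days


vit_l16_i21k = tpu_core_days(num_cores=8, days=30)
print(vit_l16_i21k)
```

Expressing cost this way makes runs on different numbers of accelerators comparable, since a short run on many cores and a long run on few cores can land on the same figure.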
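Figure 2: Breakdown of VTAB performance into the Natural, Specialized, and Structured task groups, compared with previous SOTA methods on this benchmark: BiT, VIVI (a ResNet co-trained on ImageNet and YouTube), and S4L (supervised plus semi-supervised learning on ImageNet). ViT-H/14 outperforms BiT-R152x4 and the other methods on the Natural and Structured tasks; on the Specialized tasks the top two models perform similarly.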

4.3 Pre-training Data Requirements

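ViT performs well when pre-trained on the large JFT-300M dataset. With fewer inductive biases for vision than ResNets, how crucial is the dataset size? We perform two series of experiments.
First, we pre-train ViT models on datasets of increasing size: ImageNet, ImageNet-21k, and JFT-300M. To boost performance on the smaller datasets, we optimize three regularization parameters: weight decay, dropout, and label smoothing. Figure 3 shows the results after fine-tuning to ImageNet. When pre-trained on the smallest dataset, ImageNet, ViT-Large models underperform ViT-Base models despite the regularization. With ImageNet-21k pre-training, their performances are similar. Only with the large JFT-300M dataset do we see the full benefit of the larger models. Figure 3 also shows the performance region spanned by BiT models of different sizes. The BiT CNNs outperform ViT on ImageNet, but with the larger datasets ViT overtakes them.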
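Figure 5: Accuracy versus pre-training compute for three architectures: Vision Transformers, ResNets, and hybrids. At the same computational budget, ViT generally outperforms ResNets. Hybrids improve on pure ViT for small models, but the gap vanishes for larger models.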

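Second, we train our models on random subsets of 9M, 30M, and 90M images as well as on the full JFT-300M dataset. We do not perform additional regularization on the smaller subsets and use the same hyper-parameters for all settings. This way, we assess intrinsic model properties rather than the effect of regularization. We do, however, use early stopping and report the best validation accuracy achieved during training. To save compute, we report few-shot linear accuracy instead of full fine-tuning accuracy. Figure 4 shows that on smaller datasets ViT overfits more than ResNets of comparable computational cost. For example, ViT-B/32 is slightly faster than ResNet50; it performs much worse on the 9M subset, but better on the 90M+ subsets. The same is true for ResNet152x2 and ViT-L/16. This result reinforces the intuition that the convolutional inductive bias is useful for smaller datasets, but that for larger ones, learning the relevant patterns directly from data is sufficient, even beneficial.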
[Figure 4]

4.4 Scaling Study

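We perform a controlled scaling study of different models by evaluating transfer performance from JFT-300M. In this setting, data size does not bottleneck the models' performance, and we assess performance versus pre-training cost for each model. The model set includes: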

7 ResNets: R50x1, R50x2, R101x1, R152x1, R152x2, pre-trained for 7 epochs, plus R152x2 and R200x3 pre-trained for 14 epochs
6 Vision Transformers: ViT-B/32, B/16, L/32, L/16, pre-trained for 7 epochs, plus L/16 and H/14 pre-trained for 14 epochs
5 hybrids: R50+ViT-B/32, B/16, L/32, L/16, pre-trained for 7 epochs, plus R50+ViT-L/16 pre-trained for 14 epochs

The results are shown in Figure 5 and Table 6.

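First, ViT dominates ResNets on the performance/compute trade-off. ViT uses approximately 2-4× less compute to attain the same performance (averaged over 5 datasets).
Second, hybrids slightly outperform ViT at small computational budgets, but the difference vanishes for larger models.
Third, ViT appears not to saturate within the range tried, which motivates future scaling efforts.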

4.5 Inspecting ViT

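To understand how the Vision Transformer processes image data, we analyze its internal representations. The first layer of the Vision Transformer linearly projects the flattened patches into a lower-dimensional space. Figure 7 (left) shows the top principal components of the learned embedding filters. The components resemble plausible basis functions for a low-dimensional representation of the fine structure within each patch.
After the projection, a learned position embedding is added to the patch representations. Figure 7 (center) shows that the model learns to encode distance within the image in the similarity of position embeddings: closer patches tend to have more similar position embeddings. Further, a row-column structure appears: patches in the same row or column have similar embeddings. Finally, a sinusoidal structure is sometimes apparent for larger grids. That the position embeddings learn to represent the 2D image topology explains why hand-crafted 2D-aware embedding variants do not yield improvements.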
[Figure 7]