AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

The paper finds that a Transformer applied directly to sequences of image patches can handle image classification well without relying on CNNs. Splitting the image into patches sidesteps the computational-complexity problem of self-attention. Although accuracy on mid-sized datasets is slightly below comparable ResNets, after large-scale pre-training the Vision Transformer is competitive on computer vision tasks and even surpasses state-of-the-art CNN models.

Motivation:

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks.
【CC】The Transformer has become the de facto standard in NLP, so can a pure Transformer work in CV? This paper shows that it can.

The dominant approach is to pre-train on a large text corpus and then fine-tune on a smaller task-specific dataset.
【CC】This is the "standard" workflow in NLP: train the model on a large dataset (essentially representation learning), then fine-tune it on the specific task. The key point is that the representation-learning stage in NLP is largely self-supervised, which greatly improves productivity. This is also a gap this paper leaves open, later filled by Kaiming He's MAE.

Inspired by NLP successes, multiple works try combining CNN-like architectures with self-attention (Wang et al., 2018; Carion et al., 2020), some replacing the convolutions entirely (Ramachandran et al., 2019; Wang et al., 2020a). The latter models, while theoretically efficient, have not yet been scaled effectively on modern hardware accelerators due to the use of specialized attention patterns.
【CC】Before this paper, many works had already applied the Transformer to CV, along a few directions: replacing some blocks of a CNN with Transformer blocks, or appending Transformer blocks after an existing CNN. According to the authors, these directions are theoretically efficient but do not work well in practice, because no specialized hardware/software kernels have been built for these hybrid networks.

Approach:

Naive application of self-attention to images would require that each pixel attends to every other pixel. With quadratic cost in the number of pixels, this does not scale to realistic input sizes.
【CC】The most direct idea is to treat every pixel as a token and feed it to the network, but the computation is far too large, quadratic (N*N) in the number of pixels, and thus unrealistic, as the rough calculation below shows. The question then becomes how to extract tokens.
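【CC】A quick back-of-the-envelope calculation (illustrative numbers, assuming a 224x224 input; not taken from the paper) shows how bad pixel-level attention is and how much 16x16 patches help:

```python
# Number of pairwise attention scores per head per layer if every pixel is a
# token, versus one token per 16x16 patch. Purely illustrative arithmetic.
H = W = 224          # a typical ImageNet input resolution (assumption)
P = 16               # patch size from the paper's title

pixels = H * W                      # 50,176 tokens when each pixel is a token
patches = (H // P) * (W // P)       # 196 tokens with 16x16 patches

print(pixels ** 2)                  # ~2.5e9 attention scores
print(patches ** 2)                 # ~3.8e4 attention scores, ~65,000x fewer
```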

Parmar et al. (2018) applied the self-attention only in local neighborhoods for each query pixel instead of globally. Such local multi-head dot-product self attention blocks can completely replace convolutions (Hu et al., 2019; Ramachandran et al., 2019; Zhao et al., 2020). In a different line of work, Sparse Transformers (Child et al., 2019) employ scalable approximations to global self-attention in order to be applicable to images.
【CC】These papers all try to mimic the conv operation: use self-attention to extract local features (essentially producing tokens) and feed them downstream, which lowers the computational complexity. A simplified sketch follows.
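【CC】To make "local self-attention" concrete, here is a minimal, simplified sketch (single head, no learned query/key/value projections, zero padding at the borders); it is an assumption-laden illustration, not the code from the cited papers:

```python
# Each query pixel attends only to its k x k neighborhood instead of the whole image.
import torch
import torch.nn.functional as F

def local_self_attention(x, k=7):
    """x: (B, C, H, W); each position attends over its k x k neighborhood."""
    B, C, H, W = x.shape
    pad = k // 2
    # Gather the k*k neighbors of every position: (B, C*k*k, H*W)
    neighbors = F.unfold(x, kernel_size=k, padding=pad)
    neighbors = neighbors.view(B, C, k * k, H * W)        # (B, C, k*k, H*W)
    q = x.view(B, C, 1, H * W)                            # queries: the center pixels
    attn = (q * neighbors).sum(dim=1) / C ** 0.5          # dot products: (B, k*k, H*W)
    attn = attn.softmax(dim=1)                            # normalize over the neighborhood
    out = (neighbors * attn.unsqueeze(1)).sum(dim=2)      # weighted sum of neighbor values
    return out.view(B, C, H, W)
```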

An alternative way to scale attention is to apply it in blocks of varying sizes (Weissenborn et al., 2019), in the extreme case only along individual axes (Ho et al., 2019; Wang et al., 2020a).
【CC】This line of work is interesting: axial attention, which feeds each row and each column of the image as token sequences. It effectively reduces the dimensionality, turning an N*M problem into an N+M one.
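【CC】A minimal sketch of the axial idea, using PyTorch's nn.MultiheadAttention; the class name AxialAttention2d is made up for illustration, and real axial-attention implementations differ in the details:

```python
# Attend along each row (sequences of length W), then along each column
# (sequences of length H), so the per-token cost scales with H + W rather than H * W.
import torch
import torch.nn as nn

class AxialAttention2d(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                 # x: (B, H, W, C)
        B, H, W, C = x.shape
        # Row attention: every row is an independent sequence of W tokens.
        rows = x.reshape(B * H, W, C)
        rows, _ = self.row_attn(rows, rows, rows)
        x = rows.reshape(B, H, W, C)
        # Column attention: every column is an independent sequence of H tokens.
        cols = x.permute(0, 2, 1, 3).reshape(B * W, H, C)
        cols, _ = self.col_attn(cols, cols, cols)
        return cols.reshape(B, W, H, C).permute(0, 2, 1, 3)
```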

Inspired by the Transformer scaling successes in NLP, we experiment with applying a standard Transformer directly to images, with the fewest possible modifications. To do so, we split an image into patches and provide the sequence of linear embeddings of these patches as an input to a Transformer. Image patches are treated the same way as tokens (words) in an NLP application. We train the model on image classification in supervised fashion.
【CC】The routes above had all been tried. The authors' idea is to change as little of the Transformer structure as possible and instead turn the image into NLP-style tokens that can be fed directly to the Transformer. How is the image preprocessed? As the title says, the image is split into 16*16 patches; each patch is vectorized (plus a position embedding, of course) and fed to the Transformer as a token. That is essentially the main structure of this paper. Note that training here is supervised, just as in standard CV. A minimal sketch of this front end follows.
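【CC】A minimal sketch of the patch-embedding front end described above (dim=768 matches the ViT-Base hidden size; the class name and other defaults are illustrative assumptions, not the authors' code):

```python
# Split the image into 16x16 patches, linearly project each patch, prepend a
# [class] token, and add learned position embeddings.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2      # 196 for 224 / 16
        # A conv with stride = kernel = patch_size is equivalent to flattening
        # each patch and applying one shared linear projection.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                                     # x: (B, 3, 224, 224)
        x = self.proj(x)                                      # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)                      # (B, 196, dim): one token per patch
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)                        # (B, 197, dim)
        return x + self.pos_embed                             # add learned position embeddings

# tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))  # -> (2, 197, 768), ready for a standard Transformer encoder
```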

When trained on mid-sized datasets such as ImageNet without strong regularization, these models yield modest accuracies of a few percentage points below ResNets of comparable size. This seemingly discouraging outcome may be expected: Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data.
