1. Motivation
Applications of the Transformer to vision remain limited. In vision, attention has either been applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place.
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited.
In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place.
2. Contribution
The paper shows that this reliance on CNNs is not necessary, and proposes the Vision Transformer (ViT), a pure-transformer model for image classification that contains no CNNs. By pre-training on a larger dataset (JFT-300M) and then transferring to mid-sized or smaller benchmarks, ViT achieves results comparable to state-of-the-art CNNs.
We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks.
When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
3.1 VISION TRANSFORMER (VIT)
The ViT architecture is shown in Figure 1. An input image $x \in \mathbb{R}^{H \times W \times C}$ is first reshaped into a sequence of flattened 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(P, P)$ is the resolution of each patch and $N = HW/P^2$ is the resulting number of patches.
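The patch flattening step above can be sketched with plain NumPy reshapes; this is a minimal illustration (the dimensions below are hypothetical example values, with $P = 16$ as used by ViT-Base), not the paper's implementation.

```python
import numpy as np

# Hypothetical example dimensions: a 224x224 RGB image, patch size P = 16.
H, W, C, P = 224, 224, 3, 16
x = np.random.rand(H, W, C)          # input image x ∈ R^{H×W×C}

N = (H // P) * (W // P)              # number of patches, N = HW / P^2

# Split the image into non-overlapping P×P patches, then flatten each one.
x_p = (x.reshape(H // P, P, W // P, P, C)
        .transpose(0, 2, 1, 3, 4)    # bring the two patch-grid axes together
        .reshape(N, P * P * C))      # x_p ∈ R^{N×(P^2·C)}

print(x_p.shape)                     # (196, 768)
```

Each row of `x_p` is one flattened patch; in ViT these rows are then linearly projected to the model dimension and fed to a standard Transformer encoder.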