1. Motivation
Applications of the Transformer to vision remain limited. In vision, attention has either been applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place.
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited.
In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place.
2. Contribution
The paper shows that this reliance on CNNs is not necessary, and proposes the Vision Transformer (ViT), a pure-transformer model for image classification that contains no CNNs. By pre-training on a larger dataset (JFT-300M) and then transferring to mid-sized or smaller benchmarks, ViT achieves results comparable to state-of-the-art CNNs.
We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks.
When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
3.1 VISION TRANSFORMER (VIT)
The ViT architecture is shown in Figure 1. An input image $x \in \mathbb{R}^{H \times W \times C}$ is first reshaped into a sequence of flattened 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(P, P)$ is the resolution of each patch and $N = HW/P^2$ is the resulting number of patches.
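The patch flattening step above can be sketched with plain NumPy reshapes; this is a minimal illustration (the dimensions below are hypothetical example values, with $P = 16$ as used by ViT-Base), not the paper's implementation.

```python
import numpy as np

# Hypothetical example dimensions: a 224x224 RGB image, patch size P = 16.
H, W, C, P = 224, 224, 3, 16
x = np.random.rand(H, W, C)          # input image x ∈ R^{H×W×C}

N = (H // P) * (W // P)              # number of patches, N = HW / P^2

# Split the image into non-overlapping P×P patches, then flatten each one.
x_p = (x.reshape(H // P, P, W // P, P, C)
        .transpose(0, 2, 1, 3, 4)    # bring the two patch-grid axes together
        .reshape(N, P * P * C))      # x_p ∈ R^{N×(P^2·C)}

print(x_p.shape)                     # (196, 768)
```

Each row of `x_p` is one flattened patch; in ViT these rows are then linearly projected to the model dimension and fed to a standard Transformer encoder.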