[ViT] Vision Transformer

This paper proposes Vision Transformer (ViT), a pure-Transformer model applied directly to sequences of image patches, challenging the reliance on convolutional neural networks in computer vision. After pre-training on a large dataset, ViT matches state-of-the-art CNNs on multiple image-recognition benchmarks while requiring substantially fewer computational resources to train.

1. Motivation

Applications of the Transformer in vision remain limited. In vision, attention methods are either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping the overall structure in place.

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited.

In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place.

2. Contribution

This paper shows that the reliance on CNNs is not necessary and proposes Vision Transformer (ViT), a pure Transformer for image classification that contains no CNNs. By pre-training on a larger dataset (JFT-300M) and then transferring to mid-sized or smaller benchmarks, ViT is comparable to SOTA CNNs.

We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks.

When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

3.1 VISION TRANSFORMER (VIT)

[Figure 1: ViT model overview]

The ViT architecture is shown in Figure 1. Given an input image $x \in \mathbb{R}^{H \times W \times C}$, it is first reshaped into a sequence of flattened 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(P, P)$ is the resolution of each patch and $N = HW/P^2$ is the resulting number of patches, which serves as the effective input sequence length for the Transformer.
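The patch reshaping above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the function name `image_to_patches` is my own, and it assumes $H$ and $W$ are divisible by $P$ (as in the paper's standard settings, e.g. 224×224 images with 16×16 patches):

```python
import numpy as np

def image_to_patches(x, P):
    """Reshape an image of shape (H, W, C) into N flattened patches,
    each of dimension P*P*C, where N = H*W / P**2."""
    H, W, C = x.shape
    N = (H // P) * (W // P)
    # Split the image into a (H/P) x (W/P) grid of P x P blocks...
    patches = x.reshape(H // P, P, W // P, P, C)
    # ...then bring the grid axes together and flatten each block.
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(N, P * P * C)
    return patches

# Example: a 224x224 RGB image with 16x16 patches
# yields N = (224/16)^2 = 196 patches of dimension 16*16*3 = 768.
x = np.random.rand(224, 224, 3)
patches = image_to_patches(x, 16)
print(patches.shape)  # (196, 768)
```

In the full model, each of these $N$ flattened patches is then mapped by a trainable linear projection to the Transformer's hidden dimension, giving the patch embeddings.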
