Paper:
https://arxiv.org/pdf/2010.11929.pdf
Code:
https://github.com/google-research/vision_transformer
Vision Transformer (ViT) framework:
Model overview:
The input image is split into fixed-size patches; each patch is flattened and linearly projected, position embeddings are added, and the resulting token sequence is fed into a standard Transformer encoder. The architecture follows the original Transformer as closely as possible.
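The patch-embedding step described above can be sketched as follows. This is a minimal illustration with numpy, not the official implementation: the projection matrix, [CLS] token, and position embeddings are randomly initialized here purely to show the shapes, whereas in ViT they are learned parameters.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    p = patch_size
    patches = image.reshape(H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)
    return patches  # shape: (num_patches, p * p * C)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
patches = patchify(image, 16)  # 14 x 14 = 196 patches, each of dim 16*16*3 = 768

# Linear projection to the model dimension, a prepended [CLS] token, and
# additive position embeddings (all randomly initialized for illustration;
# in ViT these are learned).
d_model = 768
W_proj = rng.standard_normal((patches.shape[1], d_model)) * 0.02
cls_token = rng.standard_normal((1, d_model)) * 0.02
pos_emb = rng.standard_normal((patches.shape[0] + 1, d_model)) * 0.02

tokens = np.concatenate([cls_token, patches @ W_proj], axis=0) + pos_emb
print(tokens.shape)  # (197, 768): [CLS] token + 196 patch tokens
```

The resulting (197, 768) sequence is what the standard Transformer encoder consumes; the encoder itself is unchanged from NLP usage.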
Conclusions:
[原文]
We have explored the direct application of Transformers to image recognition. Unlike prior works using self-attention in computer vision, we do not introduce image-specific inductive biases into the architecture apart from the initial patch extraction step. Instead, we interpret an image as a sequence of patches and process it by a standard Transformer encoder as used in NLP. This simple, yet scalable, strategy works surprisingly well when coupled with pre-training on large datasets. Thus, Vision Transformer matches or exceeds the state of the art on many image classification datasets, whilst being relatively cheap to pre-train.
[SwinTrack]
The Vision Transformer (ViT, the first fully attentional model in vision tasks) and many of its successors were inferior to convnets in terms of performance, until the appearance of the Swin-Transformer.
[MixFormer]
The Vision Transformer (ViT) first presented a pure vision transformer architecture, obtaining an impressive performance on image classification.
[Swin transformer]
The pioneering work of ViT directly applies a Transformer architecture on nonoverlapping medium-sized image patches for image classification. It achieves an impressive speed-accuracy tradeoff on image classification compared to convolutional networks.