Minimalist Notes: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Original paper: https://arxiv.org/abs/2010.11929

This paper is among the first to apply the Transformer architecture to image classification; the method is called ViT (Vision Transformer). The approach is very simple: split the input image into multiple patches, flatten each patch into a vector, add a position embedding, and feed the result into the Transformer ...
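The patch step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `patchify` and `embed_patches`, the embedding width of 768, and the randomly initialised `proj` and `pos_embed` arrays are all assumptions made for the example (in a real model they would be learned parameters). The linear projection applied to each flattened patch follows the paper; the class token and the Transformer encoder itself are left out.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (gh * gw, P * P * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def embed_patches(image, patch_size, proj, pos_embed):
    """Project flattened patches to the model width and add position embeddings."""
    tokens = patchify(image, patch_size) @ proj  # (num_patches, dim)
    return tokens + pos_embed                    # same shape, position-aware


# Hypothetical sizes: a 224x224 RGB image cut into 16x16 patches.
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
patch_size, dim = 16, 768
num_patches = (224 // patch_size) ** 2                           # 14 * 14 = 196
proj = rng.normal(size=(patch_size * patch_size * 3, dim)) * 0.02  # stand-in for learned weights
pos_embed = rng.normal(size=(num_patches, dim)) * 0.02             # stand-in for learned embeddings

tokens = embed_patches(image, patch_size, proj, pos_embed)
print(tokens.shape)  # (196, 768): one token per patch, ready for a Transformer encoder
```

The reshape-then-transpose trick avoids explicit Python loops: it first exposes the patch grid as separate axes, then groups the pixels of each patch together before flattening.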




