This post focuses on the principles of ViT, and also briefly covers three related papers; the source code for all four papers is available at https://github.com/google-research/vision_transformer
arXiv:2010.11929: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (the ViT paper; work at a scale most people cannot afford to reproduce)
arXiv:2105.01601: MLP-Mixer: An All-MLP Architecture for Vision (replacing self-attention with MLPs achieves results as good as ViT's)