- An image is worth 16x16 words: Transformers for image recognition at scale.
- Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby.
- In International Conference on Learning Representations, 2021.
How ViT works:
1. Preprocessing before the input is passed to the Transformer: