Recent vision transformer reading notes
- All Tokens Matter: Token Labeling for Training Better Vision Transformers (NeurIPS 2021)
- DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification (NeurIPS 2021)
- Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer (CVPR 2022)
- NAT: Neighborhood Attention Transformer
- Multimodal Token Fusion for Vision Transformers (CVPR 2022)
All Tokens Matter: Token Labeling for Training Better Vision Transformers (NeurIPS 2021)
paper: http://proceedings.neurips.cc/paper/2021/file/9a49a25d845a483fae4be7e341368e36-Paper.pdf
github: https://github.com/zihangJiang/TokenLabeling
"All Tokens Matter" is a contrast with the vanilla ViT, which uses only the class token for the final prediction; here the information in every token is exploited. Concretely, the patch-token outputs are given location-aware supervision (i.e., labels that correspond to individual patches), which serves as an auxiliary signal helping the network localize more precisely.
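The two-term objective described above can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the helper names, the `beta` weight, and the way soft token labels are supplied are all assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_cross_entropy(logits, target_probs):
    # cross-entropy against soft labels, averaged over leading dims
    log_p = np.log(softmax(logits))
    return -(target_probs * log_p).sum(axis=-1).mean()

def token_labeling_loss(cls_logits, patch_logits, image_label, token_labels, beta=0.5):
    """Illustrative total loss: class-token CE on the image label
    plus beta * mean per-patch CE against location-aware token labels.

    cls_logits:   (C,)   prediction from the class token
    patch_logits: (N, C) predictions from the N patch tokens
    image_label:  int    ground-truth image class
    token_labels: (N, C) soft labels for each patch (e.g. from a pretrained annotator)
    """
    num_classes = cls_logits.shape[-1]
    image_onehot = np.eye(num_classes)[image_label]
    cls_loss = soft_cross_entropy(cls_logits[None, :], image_onehot[None, :])
    aux_loss = soft_cross_entropy(patch_logits, token_labels)
    return cls_loss + beta * aux_loss
```

The auxiliary term is a dense, per-patch classification loss, which is what gives every token its own supervision signal instead of letting only the class token receive gradients from the label.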
DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification (NeurIPS 2021)
paper: http://proceedings.neurips.cc/paper/2021/file/747d3443e319a22747fbb873e8b2f9f2-Paper.pdf
github: