Motivation:
Why pick this paper? Because the results are spectacular: its variants have topped the leaderboards on dataset after dataset for semantic segmentation / classification / object detection, and it shows up all over the top 10.
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision.
【CC】This picks up the question left open by the ViT paper: can a Transformer serve as a backbone for CV? ViT only attempted classification, leaving detection / semantic segmentation open; this paper answers directly that Swin Transformer can.
Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows.
【CC】CV poses two difficulties for Transformers: visual entities come at all kinds of scales, and images are high-resolution (far too much data to process). The Shifted-windows scheme looks a lot like a Conv block; no wonder people call Swin a Transformer dressed up as a CNN.
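【CC】A minimal sketch of the window idea, assuming PyTorch and a (B, H, W, C) layout; `window_partition` and the toy sizes are my own illustration, not the official code (the attention mask for the rolled edges is omitted). It shows how a feature map is cut into non-overlapping windows, and how cyclically rolling the map by half a window before partitioning yields the "shifted" windows that bridge neighboring windows:

```python
import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into (num_windows*B, ws, ws, C) windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    # Bring the window-grid dims together so each window's pixels are contiguous.
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

x = torch.randn(1, 8, 8, 4)                    # toy feature map
windows = window_partition(x, window_size=4)   # a 2x2 grid of 4x4 windows
# Shifted windows: cyclically roll by half a window before partitioning, so the
# next layer's windows straddle the previous layer's window boundaries.
shifted = torch.roll(x, shifts=(-2, -2), dims=(1, 2))
shifted_windows = window_partition(shifted, window_size=4)
print(windows.shape, shifted_windows.shape)    # both torch.Size([4, 4, 4, 4])
```

Self-attention is then computed only within each window, which is what makes the cost manageable and gives the whole thing its convolution-like locality.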
Approach
Designed for sequence modeling and transduction tasks, the Transformer is notable for its use of attention to model long-range dependencies in the data.
【CC】The Transformer was designed from the start to capture long-range dependencies between elements in sequence models.
visual elements can vary substantially in scale, a problem that receives attention in tasks such as object detection. In existing Transformer-based models, tokens are all of a fixed scale, a property unsuitable for these vision applications.
【CC】In object detection, object scales vary enormously, unlike NLP where one token is simply one word or character. So processing an image with tokens of a single fixed scale is a poor fit; this is aimed squarely at ViT! The patch-merging sketch below shows how tokens of growing scale can be built instead.
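【CC】A minimal sketch, under my own assumptions rather than the official code, of how a hierarchy of token scales can be built: patch merging concatenates each 2x2 group of neighboring tokens and projects 4C channels down to 2C, halving resolution and widening channels stage by stage, much like CNN stages do:

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # 2x2 neighbors concatenated (4*dim) then projected to 2*dim.
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x0 = x[:, 0::2, 0::2, :]   # top-left token of each 2x2 block
        x1 = x[:, 1::2, 0::2, :]   # bottom-left
        x2 = x[:, 0::2, 1::2, :]   # top-right
        x3 = x[:, 1::2, 1::2, :]   # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(x)                 # (B, H/2, W/2, 2C)

x = torch.randn(1, 56, 56, 96)       # stage-1 tokens for a 224x224 image
print(PatchMerging(96)(x).shape)     # torch.Size([1, 28, 28, 192])
```

Stacking a few of these stages yields feature maps at multiple scales, which is exactly what FPN-style detection and segmentation heads expect from a backbone.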
There exist many vision tasks such as semantic segmentation that require dense prediction at the pixel level, and this would be intractable for Transformer on high-resolution images, as the computational complexity of its self-attention is quadratic to image size.
【CC】For tasks like semantic segmentation, pixel-level prediction is far too expensive for a Transformer: self-attention cost is quadratic in image size, which makes it infeasible.
To overcome these issues, we propose a general-purpose Transformer backbone, called Swin Transformer, which constructs hierarchical feature maps and has linear computational complexity to image size.
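【CC】A back-of-the-envelope check of that complexity claim, using the FLOP estimates given in the paper: global self-attention costs 4hwC² + 2(hw)²C, while window attention with window size M costs 4hwC² + 2M²hwC, linear in the number of tokens hw. The snippet below is my own illustration:

```python
def msa_flops(h, w, C):
    # Global self-attention: the 2*(h*w)^2*C term dominates at high resolution.
    return 4 * h * w * C**2 + 2 * (h * w)**2 * C

def wmsa_flops(h, w, C, M=7):
    # Window attention: attention is confined to MxM windows, so cost is linear in h*w.
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

for h, w in [(56, 56), (112, 112)]:   # doubling the resolution
    print((h, w), f"global: {msa_flops(h, w, 96):.2e}",
                  f"window: {wmsa_flops(h, w, 96):.2e}")
# Doubling h and w multiplies the global cost ~16x but the window cost only ~4x.
```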