Content
Abstract
- propose Shifted Window (Swin) self-attention
- introduce Relative Position Encoding (RPE)
(a) The proposed Swin Transformer builds hierarchical feature maps by merging image patches (shown in gray) in deeper layers and has linear computation complexity to input image size due to computation of self-attention only within each local window (shown in red). It can thus serve as a general-purpose backbone for both image classification and dense recognition tasks. (b) In contrast, previous vision Transformers produce feature maps of a single low resolution and have quadratic computation complexity to input image size due to computation of self-attention globally.
Method
model architecture
(a) The architecture of a Swin Transformer (Swin-T); (b) two successive Swin Transformer Blocks. W-MSA and SW-MSA are multi-head self-attention modules with regular and shifted windowing configurations, respectively.
patch partition: split input into $4\times4$ non-overlapping patches
stage 1
linear embed: project features to dim=C
$\mathbb{R}^{B\times3\times H\times W}\rightarrow\mathbb{R}^{B\times(\frac{H}{P}\times\frac{W}{P})\times C},\ P=4,\ C=96$
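The two steps above (partition into $4\times4$ patches, then a linear projection to $C$ channels) can be sketched in NumPy; the function name and the `W_proj` weight are hypothetical stand-ins for the learned embedding:

```python
import numpy as np

def patch_partition_embed(x, P=4, C=96, W_proj=None):
    """Split (B, 3, H, W) into non-overlapping PxP patches and linearly
    embed each flattened patch to dimension C (hypothetical sketch)."""
    B, ch, H, W = x.shape
    if W_proj is None:
        W_proj = np.random.randn(ch * P * P, C) * 0.02  # stand-in weight
    # (B, ch, H/P, P, W/P, P) -> (B, H/P, W/P, ch, P, P)
    x = x.reshape(B, ch, H // P, P, W // P, P).transpose(0, 2, 4, 1, 3, 5)
    # flatten each patch, then project: (B, H/P * W/P, ch*P*P) @ (ch*P*P, C)
    x = x.reshape(B, (H // P) * (W // P), ch * P * P)
    return x @ W_proj

tokens = patch_partition_embed(np.zeros((2, 3, 224, 224)))
print(tokens.shape)  # (2, 3136, 96), since 224/4 = 56 and 56*56 = 3136
```

In the official code this projection is implemented as a strided convolution, which is equivalent to the reshape-then-matmul above.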
SwinTransformerBlock
stage 2
patch merge: concatenate features (dim=4C) of $2\times2$ neighboring patches
$\mathbb{R}^{B\times(h\times w)\times C}\rightarrow\mathbb{R}^{B\times(\frac{h}{2}\times\frac{w}{2})\times4C}\xrightarrow{\text{linear}}\mathbb{R}^{B\times(\frac{h}{2}\times\frac{w}{2})\times2C}$
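A minimal sketch of this merge in NumPy (the function name and the `W_red` reduction weight are hypothetical; the real module uses a learned linear layer):

```python
import numpy as np

def patch_merge(x, h, w, W_red=None):
    """Concatenate features of each 2x2 neighborhood (C -> 4C), then
    linearly reduce 4C -> 2C. x: (B, h*w, C)."""
    B, N, C = x.shape
    if W_red is None:
        W_red = np.random.randn(4 * C, 2 * C) * 0.02  # stand-in weight
    # group tokens into 2x2 neighborhoods and concatenate their features
    x = x.reshape(B, h // 2, 2, w // 2, 2, C).transpose(0, 1, 3, 2, 4, 5)
    x = x.reshape(B, (h // 2) * (w // 2), 4 * C)  # (B, h/2 * w/2, 4C)
    return x @ W_red                              # (B, h/2 * w/2, 2C)

out = patch_merge(np.zeros((1, 56 * 56, 96)), 56, 56)
print(out.shape)  # (1, 784, 192)
```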
SwinTransformerBlock
stages 3 and 4 are similar to stage 2, differing only in feature-map resolution
in the implementation, the block design differs from the figure:
input $\rightarrow$ PatchEmbed (patch partition, linear embed) $\rightarrow$ 3× BasicLayer (SwinTransformerBlock, PatchMerge) $\rightarrow$ output
shifted window (Swin) attention
An illustration of the shifted window approach for computing self-attention in the proposed Swin Transformer architecture. In layer $\ell$ (left), a regular window partitioning scheme is adopted, and self-attention is computed within each window. In the next layer $\ell+1$ (right), the window partitioning is shifted, resulting in new windows. The self-attention computation in the new windows crosses the boundaries of the previous windows in layer $\ell$, providing connections among them.
feature map $x\in\mathbb{R}^{h\times w\times C}$ divided into $ws\times ws$-size windows
self-attention is applied within each window, which contains $ws\times ws$ tokens
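The window split can be sketched as a pure reshape (the function name is hypothetical; the official code has an equivalent `window_partition` helper):

```python
import numpy as np

def window_partition(x, ws):
    """Split (B, h, w, C) into non-overlapping ws x ws windows,
    returning (B * num_windows, ws*ws, C) for batched attention."""
    B, h, w, C = x.shape
    x = x.reshape(B, h // ws, ws, w // ws, ws, C).transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, ws * ws, C)

windows = window_partition(np.zeros((2, 56, 56, 96)), ws=7)
print(windows.shape)  # (128, 49, 96): 2 * (56/7)**2 = 128 windows
```

Because the window axis is folded into the batch axis, a standard batched attention kernel can be applied unchanged to all windows at once.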
WMSA lacks connections across windows; shifted WMSA solves this problem
efficient batch computation for shifted window
problems of shifted window partition
- more windows: $\lceil\frac{h}{ws}\rceil\times\lceil\frac{w}{ws}\rceil\rightarrow(\lceil\frac{h}{ws}\rceil+1)\times(\lceil\frac{w}{ws}\rceil+1)$
- some windows are smaller than $ws\times ws$
Illustration of an efficient batch computation approach for self-attention in shifted window partitioning.
solution: cyclically shift patches toward the top-left
step 1: shift $x$ toward the top-left, with shift size $=\frac{1}{2}ws$
step 2: window partition on shifted $x$
step 3: masked window attention on the windows of $x$, with a precomputed top-left-shifted mask
step 4: reverse cyclic shift on attended results (bottom-right)
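Steps 1 and 4 can be sketched with `np.roll` (the function names are hypothetical; the official code uses `torch.roll` the same way):

```python
import numpy as np

def cyclic_shift(x, shift):
    """Roll the (B, h, w, C) feature map toward the top-left:
    a negative roll moves content up and to the left."""
    return np.roll(x, shift=(-shift, -shift), axis=(1, 2))

def reverse_cyclic_shift(x, shift):
    """Undo the shift by rolling toward the bottom-right."""
    return np.roll(x, shift=(shift, shift), axis=(1, 2))

x = np.arange(16).reshape(1, 4, 4, 1)
shifted = cyclic_shift(x, shift=2)            # e.g. ws=4 -> shift = ws // 2
restored = reverse_cyclic_shift(shifted, 2)
print(np.array_equal(restored, x))  # True
```

The roll keeps the number of windows unchanged; the attention mask in step 3 is what prevents tokens that were wrapped around from attending to spatially distant tokens.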
computational complexity
in ViT, given input feature map $x\in\mathbb{R}^{h\times w\times C}$, the FLOPs of MSA are
$\Omega(\mathrm{MSA})=4hwC^2+2(hw)^2C$
for (Shifted-)WMSA, replace $h\times w$ with window size $ws\times ws$ and multiply the batch size by $n_{windows}=\frac{hw}{(ws)^2}$
$\Omega(\mathrm{(Shifted\text{-})WMSA})=4hwC^2+2hw(ws)^2C$
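To make the linear-vs-quadratic gap concrete, the two formulas can be evaluated at stage-1 resolution (a sketch; the constants simply follow the FLOPs expressions above):

```python
def msa_flops(h, w, C):
    """Global MSA: 4*h*w*C^2 for the qkv/output projections,
    plus 2*(h*w)^2*C for attention scores and the weighted sum."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def wmsa_flops(h, w, C, ws):
    """Windowed MSA: same projections, but attention only spans
    ws*ws tokens in each of the h*w/(ws^2) windows."""
    n_windows = (h * w) // (ws * ws)
    return 4 * h * w * C**2 + n_windows * 2 * (ws * ws) ** 2 * C

# Stage-1 feature map of a 224x224 input: 56x56 tokens, C=96, ws=7
print(msa_flops(56, 56, 96) / 1e9)      # ~2.0 GFLOPs
print(wmsa_flops(56, 56, 96, 7) / 1e9)  # ~0.145 GFLOPs
```

Doubling $h$ and $w$ quadruples `wmsa_flops` (linear in $hw$) but multiplies the attention term of `msa_flops` by 16 (quadratic in $hw$).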
relative positional encoding (RPE)
include a relative position bias $B\in\mathbb{R}^{ws^2\times ws^2}$ for each head when computing similarity
$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}(\frac{QK^T}{\sqrt{d}}+B)V$
where $B$ is drawn from a smaller learned table $\hat{B}\in\mathbb{R}^{(2ws-1)\times(2ws-1)}$, indexed by the relative coordinates of each token pair
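Since relative offsets along each axis lie in $[-(ws-1), ws-1]$, the bias table only needs $(2ws-1)^2$ entries. A sketch of building the index into that table (the function name is hypothetical):

```python
import numpy as np

def relative_position_index(ws):
    """For every pair of positions in a ws x ws window, compute the
    index into the (2*ws-1)**2 learned bias table."""
    coords = np.stack(np.meshgrid(np.arange(ws), np.arange(ws),
                                  indexing="ij")).reshape(2, -1)  # (2, ws*ws)
    rel = coords[:, :, None] - coords[:, None, :]   # (2, ws*ws, ws*ws)
    rel = rel.transpose(1, 2, 0) + (ws - 1)         # shift offsets to start at 0
    return rel[:, :, 0] * (2 * ws - 1) + rel[:, :, 1]

idx = relative_position_index(7)
print(idx.shape, idx.max())  # (49, 49) 168 -> the table has 169 = 13*13 entries
```

Gathering the table at `idx` yields the $ws^2\times ws^2$ matrix $B$ added to the attention logits of each head.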
Swin transformer encoder
with consecutive encoder blocks alternating between WMSA and Shifted-WMSA, the transformer encoder is computed as

$$\begin{aligned} \widehat{z}_{\ell}&=\mathrm{WMSA}(\mathrm{LN}(z_{\ell-1}))+z_{\ell-1} \\ z_{\ell}&=\mathrm{FFN}(\mathrm{LN}(\widehat{z}_{\ell}))+\widehat{z}_{\ell} \\ \widehat{z}_{\ell+1}&=\mathrm{Shifted\text{-}WMSA}(\mathrm{LN}(z_{\ell}))+z_{\ell} \\ z_{\ell+1}&=\mathrm{FFN}(\mathrm{LN}(\widehat{z}_{\ell+1}))+\widehat{z}_{\ell+1} \end{aligned}$$
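The pre-norm residual alternation can be sketched with hypothetical stand-in callables for the real attention and FFN modules:

```python
import numpy as np

def layer_norm(x):
    """Minimal LN over the channel dimension (no learned affine)."""
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + 1e-5)

def swin_block_pair(x, wmsa, shifted_wmsa, ffn):
    """Two successive blocks: pre-norm residual WMSA + FFN, then the
    same with shifted windows. wmsa/shifted_wmsa/ffn are callables
    (hypothetical stand-ins for the real modules)."""
    x = wmsa(layer_norm(x)) + x          # z_hat_l
    x = ffn(layer_norm(x)) + x           # z_l
    x = shifted_wmsa(layer_norm(x)) + x  # z_hat_{l+1}
    x = ffn(layer_norm(x)) + x           # z_{l+1}
    return x

identity = lambda t: t
out = swin_block_pair(np.ones((1, 49, 96)), identity, identity, identity)
print(out.shape)  # (1, 49, 96)
```

This pairing is why the stage depths in the variants below are all even numbers.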
architecture variants
window size: $M=7$
query dimension of each head: $d=32$
expansion layer of each MLP: $\alpha=4$
architecture hyper-parameters of model variants
- Swin-T: C = 96, layer numbers = {2, 2, 6, 2}
- Swin-S: C = 96, layer numbers = {2, 2, 18, 2}
- Swin-B: C = 128, layer numbers = {2, 2, 18, 2}
where $C$ is the channel number of hidden layers in the first stage
Detailed architecture specifications.
Experiment
image classification
dataset ImageNet-1K
Comparison of different backbones on ImageNet-1K classification.
object detection and instance segmentation
framework Cascade Mask R-CNN, ATSS, RepPoints v2, Sparse R-CNN
dataset COCO 2017
Results on COCO object detection and instance segmentation. “$\dag$” denotes that additional deconvolution layers are used to produce hierarchical feature maps. “$\ast$” indicates multi-scale testing.
semantic segmentation
framework
dataset ADE20K
Results of semantic segmentation on the ADE20K val and test set. “$\dag$” indicates additional deconvolution layers are used to produce hierarchical feature maps. “$\ddag$” indicates that the model is pre-trained on ImageNet-22K.