[2106] [NIPS 2021] Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer

paper
code

Contribution

  • propose Shuffle self-attention with spatial shuffle for cross-window connection

Method

model architecture


The architecture of a Shuffle Transformer (Shuffle-T).

shuffle window self-attention
window-based self-attention

feature map $x\in R^{h\times w\times C}$ is divided into non-overlapping $ws\times ws$ windows
self-attention is applied within each sub-window, which contains $ws\times ws$ tokens
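
The window partitioning above can be sketched with plain NumPy reshapes (a minimal sketch; `window_partition` and the exact tensor layout are illustrative, not the paper's code):

```python
import numpy as np

def window_partition(x, ws):
    # Split an (h, w, C) feature map into non-overlapping ws x ws windows.
    h, w, C = x.shape
    x = x.reshape(h // ws, ws, w // ws, ws, C)
    x = x.transpose(0, 2, 1, 3, 4)        # (h/ws, w/ws, ws, ws, C)
    return x.reshape(-1, ws * ws, C)      # (n_windows, ws*ws tokens, C)

x = np.arange(8 * 8 * 3).reshape(8, 8, 3).astype(np.float32)
windows = window_partition(x, ws=4)
print(windows.shape)  # (4, 16, 3): four 4x4 windows of 16 tokens each
```

Self-attention is then run independently on each of the `n_windows` groups of `ws*ws` tokens.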

spatial shuffle for cross-window connection

problem of window-based self-attention: the receptive field is limited to the window, especially on high-resolution input
solution: introduce spatial shuffle, inspired by the channel shuffle in ShuffleNet


Spatial shuffle with two stacked window-based Transformer blocks. The MLP is omitted in the visualization because it does not affect information interaction in the spatial dimension. WMSA stands for window-based multi-head self-attention. a) two stacked window-based Transformer blocks with the same window size: each output token only relates to tokens within its window, with no cross-talk; b) tokens from different windows are fully related when WMSA2 takes data from different windows after WMSA1; c) an equivalent implementation of b) using spatial shuffle and alignment.

given window size $ws$ and total token number $N$
spatial shuffle: gather input data from different windows
reshape the spatial dimension into $(ws, \frac{N}{ws})$, transpose it into $(\frac{N}{ws}, ws)$, then flatten it back
spatial alignment: move tokens back to their original positions so features stay aligned with the image content
reshape the spatial dimension into $(\frac{N}{ws}, ws)$, transpose it into $(ws, \frac{N}{ws})$, then flatten it back

$$[a_{11}\dots a_{1j},\ a_{21}\dots a_{2j},\ \dots,\ a_{i1}\dots a_{ij}] \underset{alignment}{\overset{shuffle}{\rightleftharpoons}} [a_{11}a_{21}\dots a_{i1},\ a_{12}a_{22}\dots a_{i2},\ \dots,\ a_{1j}a_{2j}\dots a_{ij}]$$
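
The shuffle/alignment pair is just a reshape-transpose-flatten and its inverse; a minimal NumPy sketch (function names are my own):

```python
import numpy as np

def spatial_shuffle(tokens, ws):
    # tokens: (N, C). Reshape to (ws, N/ws), transpose, flatten back.
    N, C = tokens.shape
    return tokens.reshape(ws, N // ws, C).transpose(1, 0, 2).reshape(N, C)

def spatial_alignment(tokens, ws):
    # Inverse operation: reshape to (N/ws, ws), transpose, flatten back.
    N, C = tokens.shape
    return tokens.reshape(N // ws, ws, C).transpose(1, 0, 2).reshape(N, C)

t = np.arange(12).reshape(12, 1)
s = spatial_shuffle(t, ws=4)
print(s.flatten())  # [0 3 6 9 1 4 7 10 2 5 8 11]: tokens interleaved across windows
assert np.array_equal(spatial_alignment(s, ws=4), t)  # round-trip identity
```

Alignment undoes the shuffle exactly, so the feature map returns to image order after the shuffled attention.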

neighbor-window connection (NWC) enhancement

spatial shuffle in window-based self-attention builds cross-window connections, especially long-range ones
problem: a "grid issue" arises when processing a high-resolution image whose size is much greater than the window size
approaches to enhance the neighbor-window connection

  1. enlarge window size
  2. use shifted window
  3. introduce conv to shuffle transformer block

implemented as a depth-wise convolution with a skip connection between the WMSA and the MLP, whose kernel size equals the window size
strengthens information flow among nearby windows, thus alleviating the "grid issue"
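
A rough NumPy sketch of the NWC idea (the averaging kernel here is a placeholder for illustration; in the paper the depth-wise kernel is learned per channel, with kernel size equal to the window size):

```python
import numpy as np

def nwc(x, ws):
    # Neighbor-window connection: depth-wise conv (ws x ws kernel, 'same'
    # padding) plus a skip connection: y = x + dwconv(x).
    h, w, C = x.shape
    p = ws // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    k = np.full((ws, ws), 1.0 / (ws * ws))  # placeholder averaging kernel
    out = np.zeros_like(x)
    for i in range(h):
        for j in range(w):
            # Same spatial kernel applied to every channel (depth-wise).
            out[i, j] = np.tensordot(k, xp[i:i + ws, j:j + ws, :],
                                     axes=([0, 1], [0, 1]))
    return x + out  # skip connection

x = np.ones((8, 8, 2))
y = nwc(x, ws=3)
print(y.shape)  # (8, 8, 2)
```

Because the kernel spans a full window, each token mixes information from the adjacent windows before the MLP, which is what counters the grid artifacts.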

shuffle transformer encoder


Two successive Shuffle Transformer blocks. WMSA and Shuffle WMSA are window-based multi-head self-attention without/with spatial shuffle, respectively.

with consecutive encoder blocks alternating between WMSA and Shuffle-WMSA, the transformer encoder is computed as

$$\begin{aligned} \hat{z}_l &= \mathrm{WMSA}(\mathrm{LN}(z_{l-1})) + z_{l-1} \\ \hat{z}_l &= \mathrm{NWC}(\hat{z}_l) + \hat{z}_l \\ z_l &= \mathrm{FFN}(\mathrm{LN}(\hat{z}_l)) + \hat{z}_l \\ \hat{z}_{l+1} &= \mathrm{Shuffle\text{-}WMSA}(\mathrm{LN}(z_l)) + z_l \\ \hat{z}_{l+1} &= \mathrm{NWC}(\hat{z}_{l+1}) + \hat{z}_{l+1} \\ z_{l+1} &= \mathrm{FFN}(\mathrm{LN}(\hat{z}_{l+1})) + \hat{z}_{l+1} \end{aligned}$$

where $\mathrm{NWC}(\cdot)$ is the neighbor-window connection operation
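
The residual wiring of one block can be sketched with placeholder sub-layers (the `attn`, `nwc`, and `ffn` callables stand in for the real modules):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Parameter-free LayerNorm over the channel dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_block(z, attn, nwc, ffn):
    # One (Shuffle-)WMSA block: pre-norm attention, NWC, and pre-norm FFN,
    # each wrapped in a residual connection.
    z = attn(layer_norm(z)) + z
    z = nwc(z) + z
    z = ffn(layer_norm(z)) + z
    return z

# With zero-output sub-layers the block reduces to the identity,
# which checks the residual wiring.
z0 = np.random.randn(16, 96)
zero = lambda t: np.zeros_like(t)
z1 = encoder_block(z0, attn=zero, nwc=zero, ffn=zero)
assert np.allclose(z1, z0)
```

A full stage alternates `attn=WMSA` and `attn=Shuffle-WMSA` (shuffle before attention, alignment after) in consecutive blocks.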

computational complexity

in ViT, given an input feature map $x\in R^{h\times w\times C}$, the FLOPs of MSA is

$$\Omega(\mathrm{MSA}) = 4hwC^2 + 2(hw)^2C$$

for (Shuffle-)WMSA, attention is computed per $ws\times ws$ window, over $n_{windows}=\frac{hw}{(ws)^2}$ windows:

$$\Omega(\mathrm{(Shuffle\text{-})WMSA}) = 4hwC^2 + 2hw(ws)^2C$$
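
Plugging in numbers makes the gap concrete; a sketch assuming Shuffle-T's first stage (56x56 feature map, C=96, ws=7) at 224x224 input:

```python
def flops_msa(h, w, C):
    # 4*h*w*C^2 for the Q/K/V/output projections,
    # 2*(h*w)^2*C for the attention map and weighted sum.
    return 4 * h * w * C ** 2 + 2 * (h * w) ** 2 * C

def flops_wmsa(h, w, C, ws):
    # Windowing makes the attention term linear in h*w for a fixed ws.
    return 4 * h * w * C ** 2 + 2 * h * w * ws ** 2 * C

print(flops_msa(56, 56, 96))      # 2003828736  (~2.0 GFLOPs)
print(flops_wmsa(56, 56, 96, 7))  # 145108992   (~0.15 GFLOPs)
```

The quadratic attention term dominates global MSA at this resolution, while the windowed version is over an order of magnitude cheaper.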

architecture variants

window size: $M=7$
query dimension of each head: $d=32$
expansion ratio of each MLP: $\alpha=4$

architecture hyper-parameters of model variants

  • Shuffle-T: C = 96, layer numbers = {2, 2, 6, 2}
  • Shuffle-S: C = 96, layer numbers = {2, 2, 18, 2}
  • Shuffle-B: C = 128, layer numbers = {2, 2, 18, 2}

where C is the channel number of the hidden layers in the first stage
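
As a hypothetical config sketch (assuming the channel width doubles at each of the four stages, as in Swin-style hierarchical designs):

```python
# Variant table from above, as a config dict (names and layout are my own).
variants = {
    "Shuffle-T": {"C": 96,  "depths": (2, 2, 6, 2)},
    "Shuffle-S": {"C": 96,  "depths": (2, 2, 18, 2)},
    "Shuffle-B": {"C": 128, "depths": (2, 2, 18, 2)},
}

def stage_channels(C, n_stages=4):
    # Channel width doubles at each stage of the hierarchy.
    return [C * 2 ** i for i in range(n_stages)]

print(stage_channels(variants["Shuffle-B"]["C"]))  # [128, 256, 512, 1024]
```

Shuffle-S deepens the third stage relative to Shuffle-T; Shuffle-B additionally widens all stages.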

Experiment

image classification

dataset ImageNet-1K, with the same augmentation and regularization as Swin
optimizer AdamW: batchsize=1024, 300 epochs, init lr=1e-3, weight decay=0.05, cosine decay, linear warm-up for 20 epochs


Comparison of different backbones on ImageNet-1K classification. Throughput is measured with a batch size of 192 on a single V100 GPU. All models are trained and evaluated at 224x224 resolution.

object detection and instance segmentation

frameworks Mask R-CNN, Cascade Mask R-CNN
dataset COCO 2017
optimizer AdamW: batchsize=16, 36 epochs, init lr=1e-4, weight decay=0.05


Object detection and instance segmentation performance on the COCO val2017 dataset using the Mask R-CNN and Cascade Mask R-CNN framework. FLOPs is evaluated on 1280x800 resolution.

semantic segmentation

framework UPerNet
dataset ADE20K, with augmentation
optimizer AdamW: batchsize=16, 1500 iterations, init lr=6e-5, weight decay=0.01, cosine decay, linear warm-up 1500 iterations


Results of semantic segmentation on the ADE20K validation set. “+” indicates that the model is pretrained on ImageNet-22K. FLOPs is measured on 1024x1024 resolution. “*” indicates the FPS reproduced by us and is measured on 512x512 resolution.

ablation study

effect of spatial shuffle and NWC


Ablation study on the effect of spatial shuffle and the neighbor-window connection on two benchmarks, FLOPs is measured on 224x224 resolution.

way of spatial shuffle
three kinds of spatial shuffle are compared

  1. long-range spatial shuffle
  2. short-range spatial shuffle: reshape the output spatial dimension into $(\frac{N}{2M}, M, 2)$
  3. random spatial shuffle: reshape output spatial dimension randomly


Ablation study on different ways of spatial shuffle on two benchmarks.

long-range spatial shuffle performs best on classification and segmentation tasks
random spatial shuffle achieves comparable performance

position of NWC module in encoder


Left: Visualization of three different positions to insert the neighbor-window connection. A: before the shuffle WMSA; B: after the residual connection of the shuffle WMSA; C: inside the MLP block. Right: Ablation study on the effect of the neighbor-window connection inserted at different positions, where A, B and C refer to three positions depicted left, and “w/o NWC” means no neighbor-window connection is inserted. FLOPs is measured on 224x224 resolution.

NWC between the Shuffle-WMSA and the MLP (position B) achieves the best performance
