[2104] [NIPS 2021] Twins: Revisiting the Design of Spatial Attention in Vision Transformers

paper
code

Contribution

  • propose spatially separable self-attention (SSSA)
  • adopt conditional positional encoding (CPE) from CPVT

Method

Twins-PCPVT: based on PVT plus conditional positional encoding (CPE); uses only global attention
Twins-SVT: based on the proposed spatially separable self-attention (SSSA), which interleaves local and global attention

Twins-PCPVT

model architecture


Architecture of Twins-PCPVT-S. “PEG” is the positional encoding generator from CPVT.

conditional position encoding (CPE)

given an input image of size $H\times W$, it is split into patches of size $S\times S$, so the number of patches is $N=\frac{HW}{S^2}$
each patch is added with its own learnable absolute positional encoding vector, so there are $N$ such vectors
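
For concreteness, here is a minimal PyTorch sketch of this ViT-style baseline: patch splitting via a strided convolution plus a fixed-length table of learnable absolute positional encodings (class and parameter names are illustrative, not from the official code).

```python
import torch
import torch.nn as nn

class PatchEmbedWithAbsPE(nn.Module):
    """ViT-style patch embedding with learnable absolute positional encoding (sketch)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=384):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2          # N = HW / S^2
        # Patch splitting implemented as a strided convolution.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # One learnable PE vector per patch: fixed length, not translation-invariant.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                                 # x: (B, C, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)       # (B, N, embed_dim)
        return x + self.pos_embed                         # add absolute PE per patch
```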

limitations of previous positional encodings

  1. the fixed length prevents the model from handling sequences longer than the longest training sequence
  2. it makes the model not translation-invariant, because a unique positional encoding vector is added to each patch


Comparison of various positional encoding (PE) strategies on the ImageNet validation set in terms of top-1 accuracy. Removing the positional encodings greatly damages the performance. The relative positional encodings have inferior performance to the absolute ones.

solutions to the aforementioned limitations

  1. remove the positional encoding
    the order of the input sequence is an important clue, and the model has no way to exploit it without positional encodings.
  2. interpolate the positional encodings to match the new sequence length
    the model then requires fine-tuning, otherwise performance drops remarkably.
  3. introduce relative positional encoding (e.g. in Swin)
    cannot provide absolute position information, which is also important for classification tasks.
    too complex and inefficient, since the inner computation of the Transformer (the attention) has to be modified.

requirements for the desired positional encoding

  1. make the input sequence permutation-variant but translation-invariant
  2. be inductive, i.e. able to handle sequences longer than those seen during training
  3. provide absolute position information to a certain degree

positional encoding generator (PEG)

reshape the flattened input sequence $X\in R^{B\times N\times C}$ to $X_1\in R^{B\times C\times H\times W}$
apply a 2-D transformation $F$ to $X_1$ to get $X_2\in R^{B\times C\times H\times W}$, where $F$ is implemented by a 2-D convolution with kernel size $k$ ($k\geqslant 3$) and $\frac{k-1}{2}$ zero-padding
flatten $X_2$ back to produce the positional encoding $E\in R^{B\times N\times C}$, which is added to the input tokens
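
A minimal PyTorch sketch of a PEG along these lines, assuming a depth-wise convolution as the 2-D transformation $F$ and a residual add of the generated encoding (names and the residual choice follow common CPVT-style implementations, not necessarily the official code):

```python
import torch.nn as nn

class PEG(nn.Module):
    """Positional Encoding Generator: conditional PE from a depth-wise conv (sketch)."""
    def __init__(self, dim, k=3):
        super().__init__()
        # k >= 3 with (k-1)/2 zero-padding keeps the spatial size unchanged;
        # groups=dim makes the 2-D transformation F a depth-wise convolution.
        self.proj = nn.Conv2d(dim, dim, kernel_size=k,
                              padding=(k - 1) // 2, groups=dim)

    def forward(self, x, H, W):                          # x: (B, N, C), N = H * W
        B, N, C = x.shape
        feat = x.transpose(1, 2).reshape(B, C, H, W)     # X1: (B, C, H, W)
        pos = self.proj(feat)                            # X2 = F(X1): conditional PE
        pos = pos.flatten(2).transpose(1, 2)             # E: back to (B, N, C)
        return x + pos                                   # add PE conditioned on neighbors
```

Because the encoding is generated by a convolution over the actual feature map, its length automatically matches any input resolution, which is what makes it usable for sequences longer than those seen during training.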


Schematic illustration of the Positional Encoding Generator (PEG). Note d is the embedding size and N is the number of tokens. The function F can be a depth-wise convolution, a separable convolution, or another more complicated block.

Q: Why does this work? How does it inject positional information into the Transformer?
A: Injecting positional information into a Transformer essentially means assigning a position to each of the N vectors in the sequence. That position can be either absolute or relative; relative information means picking a reference point and giving every vector a value that describes its position relative to that reference. Here the positional encoding is produced by a convolution, and the convolution's zero-padding acts as the reference point: the convolution extracts each vector's position relative to that reference. In one sentence:
the convolution inside PEG uses zero-padding as the reference point and the convolution operation to extract relative position information, yielding a variable-length positional encoding suitable for the Transformer.
ref: zhihu

an extra learnable class token is needed to perform classification; it is not translation-invariant, although it can learn to be
replace the class token with global average pooling (GAP), which is inherently translation-invariant
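
A hedged sketch of such a GAP head, assuming the classifier sits on top of the final token sequence (layer names are illustrative, not from the official code):

```python
import torch.nn as nn

class GAPClassifier(nn.Module):
    """Classification head using global average pooling instead of a class token."""
    def __init__(self, dim, num_classes=1000):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):             # tokens: (B, N, C), no [cls] token
        x = self.norm(tokens).mean(dim=1)  # GAP over all tokens: translation-invariant
        return self.head(x)                # (B, num_classes)
```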


Vision Transformers: (a) ViT with explicit 1D learnable positional encodings (PE) (b) CPVT with conditional positional encoding from the proposed Position Encoding Generator (PEG) plugin, which is the default choice. (c) CPVT-GAP without class token (cls), but with global average pooling (GAP) over all items in the sequence. Note that GAP is a bonus version which has boosted performance.

architecture variants


Configuration details of Twins-PCPVT.

Twins-SVT

model architecture


Architecture of Twins-SVT-S. “PEG” is the positional encoding generator from CPVT.

spatially separable self-attention (SSSA)


(a) Twins-SVT interleaves locally-grouped attention (LSA) and global sub-sampled attention (GSA). (b) Schematic view of the locally-grouped attention (LSA) and global sub-sampled attention (GSA).

locally-grouped self-attention (LSA)

the feature map $x\in R^{h\times w\times C}$ is divided into windows of size $ws\times ws$
self-attention is applied within each sub-window, which contains $ws\times ws$ tokens
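
A minimal sketch of LSA under these definitions, assuming $h$ and $w$ are divisible by $ws$ and using nn.MultiheadAttention as a stand-in for the attention itself (not the official implementation):

```python
import torch.nn as nn

class LSA(nn.Module):
    """Locally-grouped self-attention: full attention inside ws x ws windows (sketch)."""
    def __init__(self, dim, num_heads=8, ws=7):
        super().__init__()
        self.ws = ws
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):                           # x: (B, N, C), N = H * W
        B, N, C = x.shape
        ws = self.ws                                      # assumes H % ws == W % ws == 0
        # Partition the token grid into non-overlapping ws x ws windows.
        x = x.reshape(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)   # (B * n_windows, ws*ws, C)
        x, _ = self.attn(x, x, x)                         # attention within each window only
        # Reverse the window partition back to the flat token sequence.
        x = x.reshape(B, H // ws, W // ws, ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)
        return x
```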

global sub-sampled attention (GSA)


Multi-head attention (MHA) vs. spatial-reduction attention (SRA). With the spatial-reduction operation, the computational/memory cost of our SRA is much lower than that of MHA.
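
A hedged sketch of GSA, assuming the sub-sampling is a strided convolution with stride equal to the reduction ratio (one of several possible sub-sampling functions; see the ablation below):

```python
import torch.nn as nn

class GSA(nn.Module):
    """Global sub-sampled attention: all queries attend to spatially reduced keys/values (sketch)."""
    def __init__(self, dim, num_heads=8, sr_ratio=8):
        super().__init__()
        # Spatial reduction implemented here as a strided convolution (assumed choice).
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):                           # x: (B, N, C), N = H * W
        B, N, C = x.shape
        kv = x.transpose(1, 2).reshape(B, C, H, W)
        kv = self.sr(kv).flatten(2).transpose(1, 2)       # (B, N / r^2, C) sub-sampled tokens
        kv = self.norm(kv)
        out, _ = self.attn(x, kv, kv)                     # full query set, reduced key/value set
        return out
```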

transformer encoder

with consecutive encoder blocks alternating between LSA and GSA, the transformer encoder is computed as
$$
\begin{aligned}
\widehat{z}_l&=LSA(LN(z_{l-1}))+z_{l-1} \\
z_l&=FFN(LN(\widehat{z}_l))+\widehat{z}_l \\
\widehat{z}_{l+1}&=GSA(LN(z_l))+z_l \\
z_{l+1}&=FFN(LN(\widehat{z}_{l+1}))+\widehat{z}_{l+1}
\end{aligned}
$$
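
A sketch of one LSA/GSA block pair matching these equations, reusing the LSA and GSA sketches above (pre-norm residual layout; the FFN expansion ratio is an assumed default):

```python
import torch.nn as nn

# assumes the LSA and GSA sketches above are in scope
class TwinsBlockPair(nn.Module):
    """One LSA block followed by one GSA block, mirroring the equations above (sketch)."""
    def __init__(self, dim, num_heads=8, ws=7, sr_ratio=8, mlp_ratio=4):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm3, self.norm4 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.lsa = LSA(dim, num_heads, ws)
        self.gsa = GSA(dim, num_heads, sr_ratio)
        self.ffn1, self.ffn2 = ffn(), ffn()

    def forward(self, x, H, W):                    # x: (B, N, C)
        x = x + self.lsa(self.norm1(x), H, W)      # z_hat_l
        x = x + self.ffn1(self.norm2(x))           # z_l
        x = x + self.gsa(self.norm3(x), H, W)      # z_hat_{l+1}
        x = x + self.ffn2(self.norm4(x))           # z_{l+1}
        return x
```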

computational complexity

in ViT, given an input feature map $x\in R^{h\times w\times C}$, the FLOPs of MSA are
$$\Omega(MSA)=4hwC^2+2(hw)^2C$$
for LSA, replace $h\times w$ with the window size $ws\times ws$ and multiply by the number of windows $n_{windows}=\frac{hw}{(ws)^2}$
$$\Omega(LSA)=4hwC^2+2hw(ws)^2C$$
for GSA, with reduction ratio $r$
$$\Omega(GSA)=4hwC^2+\frac{2(hw)^2C}{r^2}$$
summing up, for SSSA (with local sub-windows of size $m\times n=k_1\times k_2$, i.e. $(ws)^2=k_1k_2$, and the reduction ratio chosen so that $r^2=k_1k_2$)
$$
\begin{aligned}
\Omega(SSSA)&=8hwC^2+2\left(hw\,mn+\frac{(hw)^2}{r^2}\right)C \\
&=8hwC^2+2\left(k_1k_2\,hw+\frac{(hw)^2}{k_1k_2}\right)C \\
&\geqslant 8hwC^2+2(hw)^{\frac 32}C
\end{aligned}
$$
where the minimum is attained when $k_1k_2=\sqrt{hw}$ (by the AM-GM inequality)
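
A quick numerical check of the attention term, with assumed example values ($h=w=56$, $C=64$), confirming the minimum at $k_1k_2=\sqrt{hw}$:

```python
import math

h = w = 56          # example feature-map size (e.g. stage 1 of a 224x224 input)
C = 64              # example channel width
hw = h * w

def sssa_attn_flops(k1k2):
    # attention term of Omega(SSSA): 2 * (k1*k2*hw + (hw)^2 / (k1*k2)) * C
    return 2 * (k1k2 * hw + hw ** 2 / k1k2) * C

for k1k2 in (16, 49, math.sqrt(hw), 64, 196):
    print(f"k1*k2 = {k1k2:6.1f} -> {sssa_attn_flops(k1k2):,.0f} FLOPs")
# The minimum 2*(hw)^(3/2)*C is attained at k1*k2 = sqrt(hw) = 56.
```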

architecture variants


Configuration details of Twins-SVT.

Experiment

image classification

dataset ImageNet-1K, with the same regularization as DeiT
optimizer AdamW: batchsize=1024, 300 epochs, init lr=1e-3, linear warm-up 5 epochs, cosine decay to 0
stochastic depth 0.2, 0.3, 0.5 for Twins-S, Twins-B, Twins-L
max gradient norm clipped to 5.0
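
A hedged sketch of this training recipe in PyTorch (the model, dataloader, and the weight-decay value are assumptions, not specified above):

```python
import torch

def train_classification(model, dataloader, epochs=300, base_lr=1e-3, warmup_epochs=5):
    """Sketch of the ImageNet-1K recipe above; weight decay is an assumed DeiT-style value."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr,
                                  weight_decay=0.05)  # assumed; not stated above
    warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-3,
                                               total_iters=warmup_epochs)
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                        T_max=epochs - warmup_epochs,
                                                        eta_min=0.0)
    scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, cosine],
                                                      milestones=[warmup_epochs])
    for _ in range(epochs):
        for images, targets in dataloader:
            loss = torch.nn.functional.cross_entropy(model(images), targets)
            optimizer.zero_grad()
            loss.backward()
            # clip the max gradient norm to 5.0, as in the settings above
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
            optimizer.step()
        scheduler.step()  # 5-epoch linear warm-up, then cosine decay to 0
```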


Comparisons with SOTA methods for ImageNet-1K classification. Throughput is tested with a batch size of 192 on a single V100 GPU. All models are trained and evaluated at 224x224 resolution on the ImageNet-1K dataset. “+”: w/ CPVT’s position encodings.

object detection and instance segmentation

framework RetinaNet
dataset COCO
optimizer AdamW: batchsize=16, 12 epochs, init lr=1e-4, weight decay=1e-4, warm-up 500 iterations, lr decayed by 10x at epochs 8 and 11


Object detection performance on the COCO val2017 split using the RetinaNet framework. 1x is 12 epochs and 3x is 36 epochs. “MS”: Multi-scale training. FLOPs are evaluated on 800x600 resolution.

framework Mask R-CNN
dataset COCO
optimizer AdamW: init lr=2e-4, other settings follow the mmdetection defaults


Object detection and instance segmentation performance on the COCO val2017 dataset using the Mask R-CNN framework. FLOPs are evaluated on an 800x600 image.

semantic segmentation

framework Semantic FPN
dataset ADE20K, ImageNet (pre-training)

  • Twins-PCPVT
    optimizer AdamW: batchsize=16, init lr=1e-4, polynomial lr decay (0.9)
  • Twins-SVT
optimizer AdamW: batchsize=16, 160K iterations, init lr=6e-6, weight decay=5e-4, warm-up 1500 iterations, linear decay to 0


Performance comparisons with different backbones on ADE20K validation dataset. FLOPs are tested on 512x512 resolution. All backbones are pretrained on ImageNet-1k except SETR, which is pretrained on ImageNet-21k dataset.

ablation study

configuration of LSA and GSA blocks


Classification performance for different combinations of LSA (L) and GSA (G) blocks based on the small model.

sub-sampling function


ImageNet classification performance of different forms of sub-sampled functions for the global sub-sampled attention (GSA).

position encoding


Object detection performance on the COCO using different positional encoding strategies.

Swin equipped with CPVT's positional encoding does not achieve better performance,
indicating that the improvement comes from the Twins-SVT paradigm rather than from the positional encoding

