[2104] [NIPS 2021] Twins: Revisiting the Design of Spatial Attention in Vision Transformers

paper
code

Contribution

  • propose spatially separable self-attention (SSSA)
  • adopt conditional positional encoding (CPE) from CPVT

Method

Twins-PCPVT: based on PVT plus conditional positional encoding (CPE); uses only global attention
Twins-SVT: based on the proposed spatially separable self-attention (SSSA), which interleaves local and global attention

Twins-PCPVT

model architecture


Architecture of Twins-PCPVT-S. “PEG” is the positional encoding generator from CPVT.

conditional position encoding (CPE)

given an input image of size $H\times W$, it is split into patches of size $S\times S$, so the number of patches is $N=\frac{HW}{S^2}$
each patch is added with its own learnable absolute positional encoding vector, so there are $N$ such vectors
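
For concreteness, here is a minimal PyTorch sketch of this ViT-style baseline: patch splitting via a strided convolution plus a fixed-length table of learnable absolute positional encodings (class and parameter names are illustrative, not from the official code).

```python
import torch
import torch.nn as nn

class PatchEmbedWithAbsPE(nn.Module):
    """ViT-style patch embedding with learnable absolute positional encoding (sketch)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=384):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2          # N = HW / S^2
        # Patch splitting implemented as a strided convolution.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # One learnable PE vector per patch: fixed length, not translation-invariant.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                                 # x: (B, C, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)       # (B, N, embed_dim)
        return x + self.pos_embed                         # add absolute PE per patch
```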

limitations of previous positional encodings

  1. the fixed length prevents the model from handling sequences longer than the longest training sequence
  2. it makes the model not translation-invariant, because a unique positional encoding vector is added to each patch


Comparison of various positional encoding (PE) strategies on the ImageNet validation set in terms of top-1 accuracy. Removing the positional encodings greatly damages the performance. The relative positional encodings have inferior performance to the absolute ones.

solutions to the aforementioned limitations

  1. remove the positional encoding
    the order of the input sequence is an important clue, and the model has no way to exploit it without positional encodings.
  2. interpolate the positional encodings to match the new sequence length
    the model then requires fine-tuning, otherwise performance drops remarkably.
  3. introduce relative positional encoding (e.g. in Swin)
    cannot provide absolute position information, which is also important for classification tasks.
    too complex and inefficient, since the inner computation of the Transformer (the attention) has to be modified.

requirements for the desired positional encoding

  1. make the input sequence permutation-variant but translation-invariant
  2. be inductive, i.e. able to handle sequences longer than those seen during training
  3. provide absolute position information to a certain degree

positional encoding generator (PEG)

reshape the flattened input sequence $X\in R^{B\times N\times C}$ to $X_1\in R^{B\times C\times H\times W}$
apply a 2-D transformation $F$ to $X_1$ to get $X_2\in R^{B\times C\times H\times W}$, where $F$ is implemented by a 2-D convolution with kernel size $k$ ($k\geqslant 3$) and $\frac{k-1}{2}$ zero-padding
flatten $X_2$ back to produce the positional encoding $E\in R^{B\times N\times C}$, which is added to the input tokens
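
A minimal PyTorch sketch of a PEG along these lines, assuming a depth-wise convolution as the 2-D transformation $F$ and a residual add of the generated encoding (names and the residual choice follow common CPVT-style implementations, not necessarily the official code):

```python
import torch.nn as nn

class PEG(nn.Module):
    """Positional Encoding Generator: conditional PE from a depth-wise conv (sketch)."""
    def __init__(self, dim, k=3):
        super().__init__()
        # k >= 3 with (k-1)/2 zero-padding keeps the spatial size unchanged;
        # groups=dim makes the 2-D transformation F a depth-wise convolution.
        self.proj = nn.Conv2d(dim, dim, kernel_size=k,
                              padding=(k - 1) // 2, groups=dim)

    def forward(self, x, H, W):                          # x: (B, N, C), N = H * W
        B, N, C = x.shape
        feat = x.transpose(1, 2).reshape(B, C, H, W)     # X1: (B, C, H, W)
        pos = self.proj(feat)                            # X2 = F(X1): conditional PE
        pos = pos.flatten(2).transpose(1, 2)             # E: back to (B, N, C)
        return x + pos                                   # add PE conditioned on neighbors
```

Because the encoding is generated by a convolution over the actual feature map, its length automatically matches any input resolution, which is what makes it usable for sequences longer than those seen during training.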


Schematic illustration of the Positional Encoding Generator (PEG). Note d is the embedding size and N is the number of tokens. The function F can be a depth-wise convolution, a separable convolution, or another more complicated block.

Q: Why does this work? How does it inject positional information into the Transformer?
A: Injecting positional information into a Transformer essentially means assigning a position to each of the N vectors in the sequence. That position can be either absolute or relative; relative information means picking a reference point and giving every vector a value that describes its position relative to that reference. Here the positional encoding is produced by a convolution, and the convolution's zero-padding acts as the reference point: the convolution extracts each vector's position relative to that reference. In one sentence:
the convolution inside PEG uses zero-padding as the reference point and the convolution operation to extract relative position information, yielding a variable-length positional encoding suitable for the Transformer.
ref: zhihu

an extra learnable class token is needed to perform classification; it is not translation-invariant, although it can learn to be
replace the class token with global average pooling (GAP), which is inherently translation-invariant
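
A hedged sketch of such a GAP head, assuming the classifier sits on top of the final token sequence (layer names are illustrative, not from the official code):

```python
import torch.nn as nn

class GAPClassifier(nn.Module):
    """Classification head using global average pooling instead of a class token."""
    def __init__(self, dim, num_classes=1000):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):             # tokens: (B, N, C), no [cls] token
        x = self.norm(tokens).mean(dim=1)  # GAP over all tokens: translation-invariant
        return self.head(x)                # (B, num_classes)
```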


Vision Transformers: (a) ViT with explicit 1D learnable positional encodings (PE) (b) CPVT with conditional positional encoding from the proposed Position Encoding Generator (PEG) plugin, which is the default choice. (c) CPVT-GAP without class token (cls), but with global average pooling (GAP) over all items in the sequence. Note that GAP is a bonus version which has boosted performance.

architecture variants


Configuration details of Twins-PCPVT.

Twins-SVT

model architecture


Architecture of Twins-SVT-S. “PEG” is the positional encoding generator from CPVT.

spatially separable self-attention (SSSA)


(a) Twins-SVT interleaves locally-grouped attention (LSA) and global sub-sampled attention (GSA). (b) Schematic view of the locally-grouped attention (LSA) and global sub-sampled attention (GSA).

locally-grouped self-attention (LSA)

the feature map $x\in R^{h\times w\times C}$ is divided into windows of size $ws\times ws$
self-attention is applied within each sub-window, which contains $ws\times ws$ tokens
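
A minimal sketch of LSA under these definitions, assuming $h$ and $w$ are divisible by $ws$ and using nn.MultiheadAttention as a stand-in for the attention itself (not the official implementation):

```python
import torch.nn as nn

class LSA(nn.Module):
    """Locally-grouped self-attention: full attention inside ws x ws windows (sketch)."""
    def __init__(self, dim, num_heads=8, ws=7):
        super().__init__()
        self.ws = ws
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):                           # x: (B, N, C), N = H * W
        B, N, C = x.shape
        ws = self.ws                                      # assumes H % ws == W % ws == 0
        # Partition the token grid into non-overlapping ws x ws windows.
        x = x.reshape(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)   # (B * n_windows, ws*ws, C)
        x, _ = self.attn(x, x, x)                         # attention within each window only
        # Reverse the window partition back to the flat token sequence.
        x = x.reshape(B, H // ws, W // ws, ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)
        return x
```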

global sub-sampled attention (GSA)


Multi-head attention (MHA) vs. spatial-reduction attention (SRA). With the spatial-reduction operation, the computational/memory cost of our SRA is much lower than that of MHA.
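
A hedged sketch of GSA, assuming the sub-sampling is a strided convolution with stride equal to the reduction ratio (one of several possible sub-sampling functions; see the ablation below):

```python
import torch.nn as nn

class GSA(nn.Module):
    """Global sub-sampled attention: all queries attend to spatially reduced keys/values (sketch)."""
    def __init__(self, dim, num_heads=8, sr_ratio=8):
        super().__init__()
        # Spatial reduction implemented here as a strided convolution (assumed choice).
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):                           # x: (B, N, C), N = H * W
        B, N, C = x.shape
        kv = x.transpose(1, 2).reshape(B, C, H, W)
        kv = self.sr(kv).flatten(2).transpose(1, 2)       # (B, N / r^2, C) sub-sampled tokens
        kv = self.norm(kv)
        out, _ = self.attn(x, kv, kv)                     # full query set, reduced key/value set
        return out
```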

transformer encoder

with consecutive encoder blocks alternating between LSA and GSA, the transformer encoder is computed as
$$
\begin{aligned}
\widehat{z}_l&=LSA(LN(z_{l-1}))+z_{l-1} \\
z_l&=FFN(LN(\widehat{z}_l))+\widehat{z}_l \\
\widehat{z}_{l+1}&=GSA(LN(z_l))+z_l \\
z_{l+1}&=FFN(LN(\widehat{z}_{l+1}))+\widehat{z}_{l+1}
\end{aligned}
$$
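
A sketch of one LSA/GSA block pair matching these equations, reusing the LSA and GSA sketches above (pre-norm residual layout; the FFN expansion ratio is an assumed default):

```python
import torch.nn as nn

# assumes the LSA and GSA sketches above are in scope
class TwinsBlockPair(nn.Module):
    """One LSA block followed by one GSA block, mirroring the equations above (sketch)."""
    def __init__(self, dim, num_heads=8, ws=7, sr_ratio=8, mlp_ratio=4):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm3, self.norm4 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.lsa = LSA(dim, num_heads, ws)
        self.gsa = GSA(dim, num_heads, sr_ratio)
        self.ffn1, self.ffn2 = ffn(), ffn()

    def forward(self, x, H, W):                    # x: (B, N, C)
        x = x + self.lsa(self.norm1(x), H, W)      # z_hat_l
        x = x + self.ffn1(self.norm2(x))           # z_l
        x = x + self.gsa(self.norm3(x), H, W)      # z_hat_{l+1}
        x = x + self.ffn2(self.norm4(x))           # z_{l+1}
        return x
```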

computational complexity

in ViT, given an input feature map $x\in R^{h\times w\times C}$, the FLOPs of MSA are
$$\Omega(MSA)=4hwC^2+2(hw)^2C$$
for LSA, replace $h\times w$ with the window size $ws\times ws$ and multiply by the number of windows $n_{windows}=\frac{hw}{(ws)^2}$
$$\Omega(LSA)=4hwC^2+2hw(ws)^2C$$
for GSA, with reduction ratio $r$
$$\Omega(GSA)=4hwC^2+\frac{2(hw)^2C}{r^2}$$
summing up, for SSSA (with local sub-windows of size $m\times n=k_1\times k_2$, i.e. $(ws)^2=k_1k_2$, and the reduction ratio chosen so that $r^2=k_1k_2$)
$$
\begin{aligned}
\Omega(SSSA)&=8hwC^2+2\left(hw\,mn+\frac{(hw)^2}{r^2}\right)C \\
&=8hwC^2+2\left(k_1k_2\,hw+\frac{(hw)^2}{k_1k_2}\right)C \\
&\geqslant 8hwC^2+2(hw)^{\frac 32}C
\end{aligned}
$$
where the minimum is attained when $k_1k_2=\sqrt{hw}$ (by the AM-GM inequality)
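
A quick numerical check of the attention term, with assumed example values ($h=w=56$, $C=64$), confirming the minimum at $k_1k_2=\sqrt{hw}$:

```python
import math

h = w = 56          # example feature-map size (e.g. stage 1 of a 224x224 input)
C = 64              # example channel width
hw = h * w

def sssa_attn_flops(k1k2):
    # attention term of Omega(SSSA): 2 * (k1*k2*hw + (hw)^2 / (k1*k2)) * C
    return 2 * (k1k2 * hw + hw ** 2 / k1k2) * C

for k1k2 in (16, 49, math.sqrt(hw), 64, 196):
    print(f"k1*k2 = {k1k2:6.1f} -> {sssa_attn_flops(k1k2):,.0f} FLOPs")
# The minimum 2*(hw)^(3/2)*C is attained at k1*k2 = sqrt(hw) = 56.
```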

architecture variants


Configuration details of Twins-SVT.

Experiment

image classification

dataset ImageNet-1K, with the same regularization as DeiT
optimizer AdamW: batchsize=1024, 300 epochs, init lr=1e-3, linear warm-up 5 epochs, cosine decay to 0
stochastic depth 0.2, 0.3, 0.5 for Twins-S, Twins-B, Twins-L
max gradient norm clipped to 5.0
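
A hedged sketch of this training recipe in PyTorch (the model, dataloader, and the weight-decay value are assumptions, not specified above):

```python
import torch

def train_classification(model, dataloader, epochs=300, base_lr=1e-3, warmup_epochs=5):
    """Sketch of the ImageNet-1K recipe above; weight decay is an assumed DeiT-style value."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr,
                                  weight_decay=0.05)  # assumed; not stated above
    warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-3,
                                               total_iters=warmup_epochs)
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                        T_max=epochs - warmup_epochs,
                                                        eta_min=0.0)
    scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, cosine],
                                                      milestones=[warmup_epochs])
    for _ in range(epochs):
        for images, targets in dataloader:
            loss = torch.nn.functional.cross_entropy(model(images), targets)
            optimizer.zero_grad()
            loss.backward()
            # clip the max gradient norm to 5.0, as in the settings above
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
            optimizer.step()
        scheduler.step()  # 5-epoch linear warm-up, then cosine decay to 0
```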


Comparisons with SOTA methods for ImageNet-1K classification. Throughput is tested with a batch size of 192 on a single V100 GPU. All models are trained and evaluated at 224x224 resolution on the ImageNet-1K dataset. “+”: w/ CPVT’s position encodings.

object detection and instance segmentation

framework RetinaNet
dataset COCO
optimizer AdamW: batchsize=16, 12 epochs, init lr=1e-4, weight decay=1e-4, warm-up 500 iterations, lr decayed by 10x at epochs 8 and 11


Object detection performance on the COCO val2017 split using the RetinaNet framework. 1x is 12 epochs and 3x is 36 epochs. “MS”: Multi-scale training. FLOPs are evaluated on 800x600 resolution.

framework Mask R-CNN
dataset COCO
optimizer AdamW: init lr=2e-4, other settings follow the mmdetection defaults


Object detection and instance segmentation performance on the COCO val2017 dataset using the Mask R-CNN framework. FLOPs are evaluated on an 800x600 image.

semantic segmentation

framework Semantic FPN
dataset ADE20K, ImageNet (pre-training)

  • Twins-PCPVT
    optimizer AdamW: batchsize=16, init lr=1e-4, polynomial lr decay (0.9)
  • Twins-SVT
optimizer AdamW: batchsize=16, 160K iterations, init lr=6e-6, weight decay=5e-4, warm-up 1500 iterations, linear decay to 0


Performance comparisons with different backbones on ADE20K validation dataset. FLOPs are tested on 512x512 resolution. All backbones are pretrained on ImageNet-1k except SETR, which is pretrained on ImageNet-21k dataset.

ablation study

configuration of LSA and GSA blocks


Classification performance for different combinations of LSA (L) and GSA (G) blocks based on the small model.

sub-sampling function


ImageNet classification performance of different forms of sub-sampled functions for the global sub-sampled attention (GSA).

position encoding


Object detection performance on the COCO using different positional encoding strategies.

Swin equipped with CPVT's positional encoding does not achieve better performance,
indicating that the improvement comes from the Twins-SVT paradigm rather than from the positional encoding

