Content
Abstract
- propose Separable Vision Transformer (SepViT) with depth-wise separable self-attention
  - local information interaction within windows
  - global information exchange among windows
- propose window token embedding
  - learn global feature representations of each window
  - model attention relationship among windows with negligible computational cost
- extend depth-wise separable self-attention to grouped self-attention
  - capture more contextual concepts across multiple windows
Comparison of throughput and latency on ImageNet-1K classification. The throughput and the latency are tested based on the PyTorch framework with a V100 GPU and TensorRT framework with a T4 GPU, respectively.
Method
model architecture
Separable Vision Transformer (SepViT). The top row is the overall hierarchical architecture of SepViT. The bottom row is the SepViT block and the detailed visualization of our depth-wise separable self-attention and the window token embedding scheme.
depth-wise separable self-attention (DSSA)
depth-wise self-attention (DWA)
depth-wise convolution: fuse spatial information within each channel
depth-wise attention: fuse spatial information within each window
step 1 partition input features $z^{\ell-1}$ into windows
$$z^{\ell-1}\in\Reals^{B\times C\times H\times W}\xrightarrow{partition}z^{\ell-1}\in\Reals^{(n_{windows}\times B)\times C\times N}$$
where $n_{windows}=\frac{H}{H_{window}}\times\frac{W}{W_{window}}$, $N=H_{window}\times W_{window}$
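As a sanity check on the shapes, here is a minimal PyTorch sketch of the partition step (the sizes and the windows-major layout are illustrative assumptions, not the authors' code):

```python
import torch

B, C, H, W = 2, 96, 56, 56          # batch, channels, feature map size (illustrative)
Hw, Ww = 7, 7                       # window height/width
n_windows = (H // Hw) * (W // Ww)   # 64 non-overlapping windows
N = Hw * Ww                         # 49 tokens per window

z = torch.randn(B, C, H, W)

# B x C x H x W -> (n_windows * B) x C x N
z = z.reshape(B, C, H // Hw, Hw, W // Ww, Ww)
z = z.permute(2, 4, 0, 1, 3, 5)     # nH, nW, B, C, Hw, Ww
z = z.reshape(n_windows * B, C, N)
print(z.shape)                      # torch.Size([128, 96, 49])
```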
step 2 concatenate window token $wt$ and windowed features
$$wt\in\Reals^{(n_{windows}\times B)\times C\times 1},\ z^{\ell-1}\in\Reals^{(n_{windows}\times B)\times C\times N}\xrightarrow{concat}\tilde{z}^{\ell}\in\Reals^{(n_{windows}\times B)\times C\times(N+1)}$$
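A minimal sketch of the concatenation, assuming a fixed zero window token; whether the token is prepended or appended is an implementation choice (prepended here):

```python
import torch

n_windows_B, C, N = 128, 96, 49
z = torch.randn(n_windows_B, C, N)                     # windowed features from step 1

wt = torch.zeros(1, C, 1).expand(n_windows_B, -1, -1)  # one zero token per window
z_tilde = torch.cat([wt, z], dim=-1)                   # (n_windows*B) x C x (N+1)
print(z_tilde.shape)                                   # torch.Size([128, 96, 50])
```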
step 3 project features into query, key, value
$$\tilde{z}^{\ell}\in\Reals^{(n_{windows}\times B)\times C\times(N+1)}\xrightarrow{linear}qkv\in\Reals^{(n_{windows}\times B)\times 3C\times(N+1)}\xrightarrow{split}q,k,v\in\Reals^{(n_{windows}\times B)\times C\times(N+1)}$$
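A sketch of the projection, using a 1x1 Conv1d to stand in for the linear layer so the channel-first layout of the equations is preserved (the actual implementation may instead apply nn.Linear on a channel-last layout):

```python
import torch
import torch.nn as nn

n_windows_B, C, N1 = 128, 96, 50               # N+1 = 50 tokens per window
z_tilde = torch.randn(n_windows_B, C, N1)

to_qkv = nn.Conv1d(C, 3 * C, kernel_size=1)    # acts as a per-token linear layer
qkv = to_qkv(z_tilde)                          # (n_windows*B) x 3C x (N+1)
q, k, v = qkv.chunk(3, dim=1)                  # each (n_windows*B) x C x (N+1)
```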
step 4 split query, key, value into multi-head version
$$q,k,v\in\Reals^{(n_{windows}\times B)\times C\times(N+1)}\xrightarrow{split}q,k,v\in\Reals^{(n_{windows}\times B)\times n_{heads}\times(N+1)\times C_{head}}$$
where $C=n_{heads}\times C_{head}$
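The multi-head split is pure reshaping; a sketch with illustrative sizes:

```python
import torch

n_windows_B, C, N1, n_heads = 128, 96, 50, 3
C_head = C // n_heads                    # C = n_heads * C_head = 3 * 32

q = torch.randn(n_windows_B, C, N1)
# ... x C x (N+1) -> ... x n_heads x (N+1) x C_head
q = q.reshape(n_windows_B, n_heads, C_head, N1).transpose(-1, -2)
print(q.shape)                           # torch.Size([128, 3, 50, 32])
```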
step 5 produce features with depth-wise attention
$$\ddot{z}^{\ell}=\mathrm{Attention}(q,k,v)\in\Reals^{(n_{windows}\times B)\times n_{heads}\times(N+1)\times C_{head}}$$
to sum up, depth-wise attention is formulated as
$$\mathrm{DWA}(z^{\ell-1})=\mathrm{Attention}(z^{\ell-1}\cdot W_Q,\ z^{\ell-1}\cdot W_K,\ z^{\ell-1}\cdot W_V)$$
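The $\mathrm{Attention}(\cdot)$ here is ordinary scaled dot-product attention, applied independently within each window since windows live in the batch dimension; a minimal sketch:

```python
import torch

def attention(q, k, v):
    """Scaled dot-product attention over the last two dims."""
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)).mul(scale).softmax(dim=-1)
    return attn @ v

# q, k, v: (n_windows*B) x n_heads x (N+1) x C_head
q = k = v = torch.randn(128, 3, 50, 32)
z_ddot = attention(q, k, v)              # same shape as the inputs
```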
window token embedding
aim: model the attention relationship among windows
straightforward solution: employ all pixel tokens $\implies$ huge computational cost
new solution: window token embedding $\implies$ negligible computational cost
- a fixed zero vector
- a learnable vector with initialization of zero
in implementation, the window token is a 1D tensor with the same channel dimension as the input features
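A sketch of the learnable variant, assuming the token is a zero-initialized nn.Parameter shared across all windows (the module name is hypothetical):

```python
import torch
import torch.nn as nn

class WindowToken(nn.Module):
    """Learnable window token, zero-initialized, shared across windows."""
    def __init__(self, dim: int):
        super().__init__()
        self.token = nn.Parameter(torch.zeros(1, dim, 1))  # 1D along the token axis

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (n_windows*B) x C x N -> prepend one token per window
        wt = self.token.expand(z.shape[0], -1, -1)
        return torch.cat([wt, z], dim=-1)                  # ... x C x (N+1)
```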
point-wise self-attention (PWA)
point-wise convolution: fuse information from different channels
point-wise attention: fuse information across windows
step 1 split window token and windowed features from the DWA output $\ddot{z}^{\ell}$
$$\ddot{z}^{\ell}\in\Reals^{(n_{windows}\times B)\times n_{heads}\times(N+1)\times C_{head}}\xrightarrow{slice}\dot{wt}\in\Reals^{B\times n_{heads}\times n_{windows}\times C_{head}},\ \dot{z}^{\ell}\in\Reals^{B\times n_{heads}\times n_{windows}\times N\times C_{head}}$$
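A sketch of the slice step, assuming the window token sits at index 0 of the token axis (consistent with the prepend in step 2 of DWA):

```python
import torch

n_windows, B, n_heads, N, C_head = 64, 2, 3, 49, 32
z_ddot = torch.randn(n_windows * B, n_heads, N + 1, C_head)

# regroup the window axis out of the batch: B x n_heads x n_windows x (N+1) x C_head
z_ddot = z_ddot.reshape(n_windows, B, n_heads, N + 1, C_head).permute(1, 2, 0, 3, 4)
wt_dot = z_ddot[:, :, :, 0, :]           # B x n_heads x n_windows x C_head
z_dot = z_ddot[:, :, :, 1:, :]           # B x n_heads x n_windows x N x C_head
```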
step 2 project window token into window query, key
$$\dot{wt}\in\Reals^{B\times n_{heads}\times n_{windows}\times C_{head}}\xrightarrow{norm+act}\xrightarrow{conv}\xrightarrow{split}q_w,k_w\in\Reals^{B\times n_{heads}\times n_{windows}\times C_{head}}$$
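A sketch of this window-token projection, assuming LayerNorm over the head channels and a 1x1 conv that doubles the per-head channels before the split (the exact projection layer is an assumption):

```python
import torch
import torch.nn as nn

B, n_heads, n_windows, C_head = 2, 3, 64, 32
wt_dot = torch.randn(B, n_heads, n_windows, C_head)

norm = nn.LayerNorm(C_head)
act = nn.GELU()
proj = nn.Conv2d(n_heads, 2 * n_heads, kernel_size=1)  # doubles per-head channels

x = proj(act(norm(wt_dot)))              # B x 2*n_heads x n_windows x C_head
q_w, k_w = x.chunk(2, dim=1)             # each B x n_heads x n_windows x C_head
```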
step 3 produce features with point-wise attention
$$\hat{z}^{\ell}=\mathrm{PWA}(\dot{z}^{\ell},\dot{wt})=\mathrm{Attention}(q_w,k_w,\dot{z}^{\ell})\in\Reals^{B\times n_{heads}\times n_{windows}\times N\times C_{head}}$$
where the windowed features $\dot{z}^{\ell}$ from the DWA output are directly used as the window value
to sum up, point-wise attention is formulated as
$$\mathrm{PWA}(\dot{z}^{\ell},\dot{wt})=\mathrm{Attention}(\mathrm{GELU}(\mathrm{LN}(\dot{wt}))\cdot W_Q,\ \mathrm{GELU}(\mathrm{LN}(\dot{wt}))\cdot W_K,\ \dot{z}^{\ell})$$
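Since each window's value is a whole $N\times C_{head}$ feature map, the window-level attention weights mix entire windows; a sketch of one way to realize this contraction (the einsum is my reading of the shapes, not the authors' code):

```python
import torch

B, n_heads, n_windows, N, C_head = 2, 3, 64, 49, 32
q_w = torch.randn(B, n_heads, n_windows, C_head)
k_w = torch.randn(B, n_heads, n_windows, C_head)
z_dot = torch.randn(B, n_heads, n_windows, N, C_head)  # window values from DWA

attn = (q_w @ k_w.transpose(-2, -1)) * C_head ** -0.5  # B x h x n_win x n_win
attn = attn.softmax(dim=-1)

# out[b,h,i] = sum_j attn[b,h,i,j] * z_dot[b,h,j]  (mix whole windows)
z_hat = torch.einsum('bhij,bhjnc->bhinc', attn, z_dot)
print(z_hat.shape)                        # torch.Size([2, 3, 64, 49, 32])
```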
grouped self-attention (GSA)
A macro view of the similarities and differences between the depth-wise separable self-attention and the grouped self-attention.
transformer encoder
to sum up, each SepViT block is formulated as
$$\begin{aligned} \tilde{z}^{\ell}&=\mathrm{Concat}(z^{\ell-1}, wt) \\ \ddot{z}^{\ell}&=\mathrm{DWA}(\mathrm{LN}(\tilde{z}^{\ell})) \\ \dot{z}^{\ell}, \dot{wt}&=\mathrm{Slice}(\ddot{z}^{\ell}) \\ \hat{z}^{\ell}&=\mathrm{PWA}(\dot{z}^{\ell}, \dot{wt})+z^{\ell-1} \\ z^{\ell}&=\mathrm{MLP}(\mathrm{LN}(\hat{z}^{\ell}))+\hat{z}^{\ell} \end{aligned}$$
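A high-level sketch wiring these five equations together; `dwa`, `slice_fn`, `pwa`, and `merge` are assumed callables implementing the earlier step-by-step sketches, and tokens are kept channel-last here so the LayerNorms apply to the channel dim:

```python
import torch

def sepvit_block(z, wt, dwa, slice_fn, pwa, merge, norm1, norm2, mlp):
    """One SepViT block following the equations above (a sketch, not the
    reference implementation). z: (n_windows*B) x N x C windowed features,
    wt: (n_windows*B) x 1 x C window token."""
    z_tilde = torch.cat([wt, z], dim=1)          # prepend window token
    z_ddot = dwa(norm1(z_tilde))                 # depth-wise attention (DWA)
    z_dot, wt_dot = slice_fn(z_ddot)             # split features / window token
    z_hat = merge(pwa(z_dot, wt_dot)) + z        # point-wise attention (PWA) + residual
    return mlp(norm2(z_hat)) + z_hat             # single MLP + residual
```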
Complexity comparison of information interaction within and among windows: a single SepViT block vs. the two-block pattern used by other works in each stage.
reasons for the low computational cost
- more lightweight
- removes many redundant layers: 1 MLP + 2 LN in a single SepViT block vs. 2 MLP + 4 LN in two successive Swin or Twins blocks
architecture variants
Detailed configurations of SepViT variants in different stages.
Experiment
image classification
dataset ImageNet-1K
optimizer AdamW: batch size=1024, 300 epochs, init lr=1e-3, weight decay=0.05 for SepViT-T/S or 0.1 for SepViT-B, linear warm-up 5 epochs for SepViT-T/S or 20 epochs for SepViT-B, cosine decay
stochastic depth 0.2, 0.3, 0.5 for SepViT-T, SepViT-S, SepViT-B
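A hedged sketch of this recipe for SepViT-T/S with stock PyTorch schedulers (the model is a stand-in and per-step scheduler granularity is an assumption):

```python
import torch

model = torch.nn.Linear(8, 8)            # stand-in; use a SepViT model in practice
epochs, warmup_epochs = 300, 5
steps_per_epoch = 1281167 // 1024        # ImageNet-1K images / batch size

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-3, total_iters=warmup_epochs * steps_per_epoch)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=(epochs - warmup_epochs) * steps_per_epoch)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine],
    milestones=[warmup_epochs * steps_per_epoch])
```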
Comparison of different state-of-the-art methods on ImageNet-1K classification. Throughput and latency are tested based on the PyTorch framework with a V100 GPU (batch size=192) and TensorRT framework with a T4 GPU (batch size=8).
object detection and instance segmentation
framework RetinaNet, Mask R-CNN
dataset COCO 2017
- 1x schedule
  optimizer AdamW: batch size=16, 12 epochs, init lr=1e-4, weight decay=1e-3 for SepViT-T or 1e-4 for SepViT-S, warm-up 500 iterations, decay rate=0.1 at epochs 8 and 11
  stochastic depth 0.2, 0.3 for SepViT-T, SepViT-S
- 3x-MS schedule
  optimizer AdamW: batch size=16, 36 epochs, init lr=1e-4, weight decay=0.05 for SepViT-T or 0.1 for SepViT-S, warm-up 500 iterations, decay rate=0.1 at epochs 27 and 33
  stochastic depth 0.3 for SepViT-T/S
Comparison of different backbones on RetinaNet-based object detection task. FLOPs are measured with the input size of $800\times1280$.
Comparison of different backbones on Mask R-CNN-based object detection and instance segmentation tasks. FLOPs are measured with the input size of $800\times1280$. The superscripts $b$ and $m$ denote box detection and mask instance segmentation.
semantic segmentation
framework Semantic FPN
dataset ADE20K, ImageNet (pre-training)
optimizer AdamW: batch size=16, 80K iterations, init lr=1e-4, weight decay=1e-4, polynomial lr decay (power 0.9)
stochastic depth 0.2, 0.3, 0.4 for SepViT-T, SepViT-S, SepViT-B
framework UperNet
dataset ADE20K, ImageNet (pre-training)
optimizer AdamW: batch size=16, 160K iterations, init lr=6e-5, weight decay=0.01 for SepViT-T/S or 0.03 for SepViT-B
stochastic depth 0.2, 0.3, 0.5 for SepViT-T, SepViT-S, SepViT-B
Comparison of different backbones on ADE20K semantic segmentation task. FLOPs are measured with the input size of $512\times2048$.
ablation studies
efficient components
SepViT adopts conditional position encoding (CPE) and overlapping patch embedding (OPE)
Swin-T+CPVT: taken as baseline
SepViT-T$^\dag$: with CPE but without OPE
Ablation studies of the key components in our SepViT. LWT means initializing the window tokens with learnable vectors.
window token embedding
window token initialized as
- a fixed zero vector
- a learnable vector with initialization of zero
schemes for learning global representation
- Win_Tokens: window token embedding
- Avg_Pool: average pooling
- Dw_Conv: depth-wise convolution
parameters and FLOPs comparison between Win_Tokens and Avg_Pool methods $\implies$ window token embedding brings negligible computational cost
Comparison of different approaches of getting the global representation of each window in SepViT.
comparison with lite models
Comparison of lite models on ImageNet-1K classification.