[2107] [NIPS 2021] Focal Self-attention for Local-Global Interactions in Vision Transformers

paper
code

Contribution

  • propose Focal self-attention (FSA), which applies fine-grained attention locally and coarse-grained attention globally

Method

model architecture


Model architecture for our Focal Transformers. As highlighted in light blue boxes, our main innovation is the proposed focal self-attention mechanism in each Transformer layer.

focal self-attention (FSA)


Left: Visualization of the attention maps of the three heads at the given query patch (blue) in the first layer of the DeiT-Tiny model. Right: An illustrative depiction of the focal self-attention mechanism. Three granularity levels are used to compose the attention region for the blue query.

FSA attends fine-grained tokens only locally, instead of attending all tokens at fine granularity
it covers as many regions as standard self-attention, but at much lower cost


The size of the receptive field (y-axis) with the increase of used tokens (x-axis) for standard and our focal self-attention. For focal self-attention, we assume increasing the window granularity by a factor of 2 gradually, but no more than 8. Note that the y-axis is logarithmic.

for a given query position, using gradually coarser granularity for its far surroundings lets FSA attain a significantly larger receptive field while attending the same number of visual tokens as the baseline
the focal mechanism thus enables long-range self-attention with much less time and memory cost
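
the trade-off in the plot above can be made concrete with a toy calculation; the fixed 8x8 region per level and the granularity-doubling-capped-at-8 schedule follow the figure description, everything else is an illustrative assumption

```python
# Toy calculation: receptive field vs. attended tokens for focal attention,
# assuming each level attends a fixed 8x8 grid of (pooled) tokens and the
# token granularity doubles per level, capped at 8 (as described in the figure).
def focal_receptive_field(num_levels, region=8, max_granularity=8):
    tokens, span, granularity = 0, 0, 1
    for _ in range(num_levels):
        tokens += region * region          # tokens attended at this level
        span = region * granularity        # side length covered by the coarsest level so far
        granularity = min(granularity * 2, max_granularity)
    return tokens, span

for levels in range(1, 5):
    n, s = focal_receptive_field(levels)
    # standard self-attention would need s*s tokens to cover the same s x s region
    print(f"levels={levels}: focal tokens={n:4d}, receptive field={s}x{s}, standard tokens={s * s}")
```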

window-wise attention


An illustration of our focal self-attention at the window level. Each of the finest square cells represents a visual token, either from the original feature map or the squeezed ones. Suppose we have an input feature map of size 20x20. We first partition it into 5x5 windows of size 4x4. Taking the 4x4 blue window in the middle as the query, we extract its surrounding tokens at multiple granularity levels as its keys and values. For the first level, we extract the 8x8 tokens which are closest to the blue window at the finest grain. Then at the second level, we expand the attention region and pool the surrounding 2x2 sub-windows, which results in 6x6 pooled tokens. At the third level, we attend an even larger region covering the whole feature map and pool 4x4 sub-windows. Finally, these three levels of tokens are concatenated to compute the keys and values for the 4x4=16 tokens (queries) in the blue window.

first define 3 terms for clarity

  1. focal levels $L$: the number of granularity levels at which tokens are extracted for focal self-attention
  2. focal window size $s_w^l$: the size of the sub-window over which tokens are summarized at level $l \in \{1, ..., L\}$
  3. focal region size $s_r^l$: the number of sub-windows attended horizontally and vertically at level $l$

focal self-attention then proceeds in 2 main steps (a minimal code sketch follows this list)

  1. sub-window pooling
    given an input feature map $x \in \mathbb{R}^{h\times w\times C}$, split it into sub-windows of size $s_w^l \times s_w^l$
    $\hat{x} = \mathrm{Reshape}(x) \in \mathbb{R}^{\frac{h}{s_w^l}\times\frac{w}{s_w^l}\times C\times(s_w^l\times s_w^l)}$
    then use a linear layer $f_p^l$ to pool each sub-window spatially
    $x^l = f_p^l(\hat{x}) \in \mathbb{R}^{\frac{h}{s_w^l}\times\frac{w}{s_w^l}\times C}$
  2. attention computation
    given the pooled feature maps $\{x^l\}_{1}^{L}$, compute the queries, keys, and values with linear projection layers
    $\begin{aligned} Q&=f_q(x^1) \\ K&=\{K^l\}_{1}^{L}=f_k(\{x^1, ..., x^L\}) \\ V&=\{V^l\}_{1}^{L}=f_v(\{x^1, ..., x^L\}) \end{aligned}$
    first extract the surrounding tokens for each query token in the feature map
    note that all tokens inside a window partition of size $s_p \times s_p$ share the same set of surroundings
    for the queries in the i-th window $Q_i \in \mathbb{R}^{s_p\times s_p\times C}$, extract the $s_r^l \times s_r^l$ keys and values from $K^l$ and $V^l$ around the window where the query lies
    then gather the keys and values from all $L$ levels to obtain
    $K_i = \{K_i^1, ..., K_i^L\} \in \mathbb{R}^{s\times C}, \quad V_i = \{V_i^1, ..., V_i^L\} \in \mathbb{R}^{s\times C}$
    where $s$ is the sum of the focal regions over all levels, i.e., $s = \sum_{l=1}^{L}(s_r^l)^2$
    note that a strict version of focal self-attention requires excluding the regions that overlap across different levels
    finally, include a relative position bias and compute the focal self-attention as
    $\mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{Softmax}\left(\frac{Q_i K_i^T}{\sqrt{d}} + B\right)V_i$
    where $B = \{B^l\}_{1}^{L}$ is a learnable relative position bias, consisting of $L$ subsets, one per focal level
  • for the first level, parameterize the bias as $B^1 \in \mathbb{R}^{(2s_p-1)\times(2s_p-1)}$
    since the horizontal and vertical relative positions both lie in the range $[-s_p+1, s_p-1]$
  • for the other levels, because the pooled tokens have a different granularity from the queries, treat all queries inside a window equally
    and use $B^l \in \mathbb{R}^{s_r^l \times s_r^l}$ to represent the relative position bias between the query window and each of the $s_r^l \times s_r^l$ pooled tokens
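
a minimal single-head PyTorch sketch of the two steps above, using the illustration's setting (20x20 map, 4x4 query windows, 8x8 / 6x6 / 5x5 regions at three levels); this is a simplification and not the official implementation: relative position bias, multi-head splitting, overlap removal across levels, and boundary masking are omitted, and the `F.unfold`-based neighbourhood gathering and the name `FocalSelfAttentionSketch` are my own choices

```python
# Hedged sketch of focal self-attention: single head, no relative position
# bias, no removal of overlapped regions across levels, zero-padded borders.
# Defaults follow the illustration above (20x20 map, 4x4 windows,
# 8x8 / 6x6 / 5x5 regions at three levels); names are not the paper's API.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FocalSelfAttentionSketch(nn.Module):
    def __init__(self, dim, window_size=4, focal_windows=(1, 2, 4), focal_regions=(8, 6, 5)):
        super().__init__()
        self.sp = window_size                    # s_p: query window partition size
        self.sw, self.sr = focal_windows, focal_regions
        # step 1: one spatial pooling layer per level, (s_w * s_w) positions -> 1
        self.pool = nn.ModuleList([nn.Linear(s * s, 1) for s in focal_windows])
        self.f_q = nn.Linear(dim, dim)
        self.f_k = nn.Linear(dim, dim)
        self.f_v = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (B, H, W, C)
        B, H, W, C = x.shape
        sp = self.sp
        nWh, nWw = H // sp, W // sp

        # queries: window partition, then linear projection -> (B, nW, sp*sp, C)
        q = x.view(B, nWh, sp, nWw, sp, C).permute(0, 1, 3, 2, 4, 5)
        q = self.f_q(q.reshape(B, nWh * nWw, sp * sp, C))

        keys, values = [], []
        for sw, sr, pool in zip(self.sw, self.sr, self.pool):
            stride = sp // sw                    # pooled tokens per query window per axis
            pad = (sr - stride) // 2             # centre the s_r x s_r region (assumes sr - stride is even)
            # step 1: sub-window pooling -> (B, H/sw, W/sw, C)
            xl = x.view(B, H // sw, sw, W // sw, sw, C).permute(0, 1, 3, 5, 2, 4)
            xl = pool(xl.reshape(B, H // sw, W // sw, C, sw * sw)).squeeze(-1)
            # step 2: project, then gather an s_r x s_r neighbourhood per query window
            k = self.f_k(xl).permute(0, 3, 1, 2)                         # (B, C, H/sw, W/sw)
            v = self.f_v(xl).permute(0, 3, 1, 2)
            k = F.unfold(k, kernel_size=sr, stride=stride, padding=pad)
            v = F.unfold(v, kernel_size=sr, stride=stride, padding=pad)
            keys.append(k.view(B, C, sr * sr, -1).permute(0, 3, 2, 1))   # (B, nW, sr*sr, C)
            values.append(v.view(B, C, sr * sr, -1).permute(0, 3, 2, 1))

        K = torch.cat(keys, dim=2)               # (B, nW, s, C) with s = sum_l (s_r^l)^2
        V = torch.cat(values, dim=2)
        attn = (q @ K.transpose(-2, -1)) / C ** 0.5
        out = attn.softmax(dim=-1) @ V           # (B, nW, sp*sp, C)
        out = out.view(B, nWh, nWw, sp, sp, C).permute(0, 1, 3, 2, 4, 5)
        return out.reshape(B, H, W, C)


# usage on the illustration's setting: 20x20 feature map, 4x4 query windows
x = torch.randn(2, 20, 20, 96)
fsa = FocalSelfAttentionSketch(dim=96)
print(fsa(x).shape)                              # torch.Size([2, 20, 20, 96])
```
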
focal transformer encoder

with encoder blocks containing FSA, each Transformer encoder layer is computed as
$\begin{aligned} \hat{z}_l &= \mathrm{FSA}(\mathrm{LN}(z_{l-1})) + z_{l-1} \\ z_l &= \mathrm{FFN}(\mathrm{LN}(\hat{z}_l)) + \hat{z}_l \end{aligned}$
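
a pre-norm block matching the two equations, reusing `FocalSelfAttentionSketch` from the sketch above; the FFN expansion ratio of 4 and GELU activation are the usual ViT choices (assumptions here), and stochastic depth is omitted

```python
import torch.nn as nn

class FocalBlockSketch(nn.Module):
    """Pre-norm encoder block: z_hat = FSA(LN(z)) + z, z = FFN(LN(z_hat)) + z_hat."""
    def __init__(self, dim, window_size=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.fsa = FocalSelfAttentionSketch(dim, window_size)   # from the sketch above
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, z):                         # z: (B, H, W, C)
        z = self.fsa(self.norm1(z)) + z           # z_hat = FSA(LN(z_{l-1})) + z_{l-1}
        z = self.ffn(self.norm2(z)) + z           # z_l   = FFN(LN(z_hat)) + z_hat
        return z
```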

computational complexity

in ViT, given an input feature map $x \in \mathbb{R}^{h\times w\times C}$, the FLOPs of standard MSA are
$\Omega(\mathrm{MSA}) = 4hwC^2 + 2(hw)^2C$
for FSA on the same input feature map, with $\frac{h}{s_p}\times\frac{w}{s_p}$ query windows and focal level $l$:
pooling each $s_w^l \times s_w^l$-size sub-window costs
$\Omega(\mathrm{pool}) = (s_w^l)^2 C$
aggregating the sub-windows over the whole $h\times w$ feature map at each level costs
$\Omega(\mathrm{aggr}) = hwC$
the attention cost for one $s_p \times s_p$-size query window is
$\Omega(\mathrm{attn}_{win}) = (s_p)^2 C \sum_{l}(s_r^l)^2$
so the attention cost over the whole feature map is
$\Omega(\mathrm{attn}_{feat}) = hwC\sum_{l}(s_r^l)^2$
summing up, for FSA
$\Omega(\mathrm{FSA}) = L \cdot \Omega(\mathrm{aggr}) + \Omega(\mathrm{attn}_{feat}) = hwC(L + \sum_{l}(s_r^l)^2)$
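
plugging illustrative numbers into the two formulas above; the chosen h, w, C and focal regions are examples rather than the paper's settings, and note the FSA formula counts pooling and attention only (the qkv/output projection cost is not included)

```python
# Cost formulas from above, evaluated on illustrative numbers.
def msa_flops(h, w, c):
    # Omega(MSA) = 4hwC^2 + 2(hw)^2 C
    return 4 * h * w * c**2 + 2 * (h * w) ** 2 * c

def fsa_flops(h, w, c, focal_regions):
    # Omega(FSA) = hwC (L + sum_l (s_r^l)^2); projections excluded, as in the text
    L = len(focal_regions)
    return h * w * c * (L + sum(r * r for r in focal_regions))

h = w = 56
c = 96
print(f"MSA: {msa_flops(h, w, c) / 1e9:.2f} GFLOPs")
print(f"FSA: {fsa_flops(h, w, c, focal_regions=(8, 6, 5)) / 1e9:.3f} GFLOPs")
```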

architecture variants


Model configurations for our focal Transformers. We introduce three configurations Focal-Tiny, Focal-Small and Focal-Base with different model capacities.

Experiment

image classification

dataset ImageNet-1K, with augmentation and regularization as in DeiT
optimizer AdamW: batch size=1024, 300 epochs, init lr=1e-3, weight decay=0.05, linear warm-up 20 epochs, cosine decay
stochastic depth 0.2, 0.2, 0.3 for Focal-T, Focal-S, Focal-B
max gradient norm clipped to 5.0
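
the recipe above in plain PyTorch, as a hedged sketch: the tiny linear model and random batch are stand-ins, and the DeiT augmentation pipeline and stochastic depth are not reproduced

```python
import math
import torch
import torch.nn as nn

# toy stand-ins so the recipe runs end-to-end; swap in a Focal Transformer and
# the ImageNet-1K loader (batch size 1024, DeiT augmentations) for real training
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1000))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

epochs, warmup_epochs = 300, 20
def lr_lambda(epoch):
    if epoch < warmup_epochs:                          # linear warm-up for 20 epochs
        return (epoch + 1) / warmup_epochs
    t = (epoch - warmup_epochs) / (epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * t))         # cosine decay to 0
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

images, targets = torch.randn(8, 3, 32, 32), torch.randint(0, 1000, (8,))
for epoch in range(2):                                 # 300 epochs in the paper
    loss = criterion(model(images), targets)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # clip grad norm to 5.0
    optimizer.step()
    scheduler.step()
```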


Comparison of image classification on ImageNet-1K for different models. Except for ViT-Base/16, all other models are trained and evaluated on 224x224 resolution.

object detection and instance segmentation

framework Mask R-CNN, Cascade Mask R-CNN
dataset COCO 2017
optimizer AdamW: 12 or 36 epochs, init lr=1e-4, weight decay=0.05
stochastic depth 0.2, 0.2, 0.3 for Focal-T, Focal-S, Focal-B


Comparisons with CNN and Transformer baselines and SoTA methods on COCO object detection. The box mAP ( A P b AP^b APb) and mask mAP ( A P m AP^m APm) are reported for RetinaNet and Mask R-CNN trained with 1x schedule.


COCO object detection and segmentation results with RetinaNet and Mask R-CNN. All models are trained with the 3x schedule and multi-scale inputs (MS). The numbers before and after “/” in columns 2 and 3 are the model size and complexity for RetinaNet and Mask R-CNN, respectively.

dataset COCO 2017
optimizer AdamW: 36 epochs, init lr=1e-4, weight decay=0.05
stochastic depth 0.2, 0.2, 0.3 for Focal-T, Focal-S, Focal-B


Comparison with ResNet-50, Swin-Tiny across different object detection methods. We use Focal-Tiny as the backbone and train all models using 3x schedule.

semantic segmentation

dataset ADE20K
optimizer AdamW: batch size=16, 160K iterations, init lr=6e-5, weight decay=0.01, polynomial decay
scaling ratios [0.5, 0.75, 1.0, 1.25, 1.5, 1.75] for multi-scale evaluation
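
a hedged sketch of multi-scale evaluation with the listed scaling ratios: run the model at each scale, resize the logits back to the original resolution, and average them (the exact fusion can differ between codebases); the 1x1 conv is a stand-in for the segmentation model

```python
import torch
import torch.nn.functional as F

def multi_scale_inference(model, image, scales=(0.5, 0.75, 1.0, 1.25, 1.5, 1.75)):
    """Average per-pixel logits over rescaled copies of the input image."""
    _, _, H, W = image.shape
    logits_sum = 0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode='bilinear', align_corners=False)
        logits = model(scaled)                                    # (B, num_classes, h', w')
        logits_sum = logits_sum + F.interpolate(logits, size=(H, W),
                                                mode='bilinear', align_corners=False)
    return (logits_sum / len(scales)).argmax(dim=1)               # (B, H, W) class map

# toy usage: a 1x1 conv stands in for the segmentation model (150 ADE20K classes)
model = torch.nn.Conv2d(3, 150, kernel_size=1)
pred = multi_scale_inference(model, torch.randn(1, 3, 512, 512))
print(pred.shape)                                                 # torch.Size([1, 512, 512])
```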


Comparison with SoTA methods for semantic segmentation on ADE20K val set. Both single- and multi-scale evaluations are reported in the last two columns. Models marked in the table are pretrained on ImageNet-22K.

ablation study

window size
one question is whether further increasing the window size helps model learning, given the enlarged receptive field


Impact of different window sizes (WSize). We change the default window size from 7 to 14 and observe consistent improvements for both methods.

necessity of window shift
window shift operations enable cross-window interactions between two successive layers


Impact of window shift (W-Shift) on Swin Transformer and Focal Transformer. Tiny models are used.

short- and long-range interactions
ablate the Focal-Tiny model into three variants:

  1. Focal-Tiny-Window: performs attention only inside each window
  2. Focal-Tiny-Local: additionally attends the fine-grained surrounding tokens
  3. Focal-Tiny-Global: additionally attends the coarse-grained squeezed tokens


Ablating Focal-Tiny model by adding local, global and both interactions, respectively. Blue bars are for image classification and orange bars indicate object detection performance. Both local and global interactions are essential to obtain good performance.

model depth
since focal attention promotes local and global interactions at each Transformer layer, one question is whether fewer layers are needed to reach a modeling capacity similar to models without global interactions
reduce the number of Transformer layers at stage 3 of Swin-Tiny and Focal-Tiny from 6 to 4 and then 2


Impact of the change of model depth. We gradually reduce the number of transformer layers at the third stage from the original 6 to 4 and further to 2. This apparently hurts the performance, but our Focal Transformer has a much slower drop rate than Swin Transformer.
