Abstract
demerits of self-attention (SA)
- treats images as 1D sequences and neglects their 2D structure
- quadratic complexity is too expensive for high-resolution images
- achieves spatial adaptability but ignores channel adaptability
demerits of multilayer perceptron (MLP)
- sensitive to input size; can only process fixed-size images
- considers global information but ignores local structure
contributions
- propose large kernel attention (LKA): local structural information, long-range dependence, and adaptability in the channel dimension
- present Visual Attention Network (VAN) as a backbone based on LKA: SOTA performance with fewer parameters and FLOPs
Results of different models on the ImageNet-1K validation set. Left: Comparing the performance of recent models DeiT, PVT, Swin Transformer, ConvNeXt, Focal Transformer and our VAN. All these models have a similar number of parameters. Right: Comparing the performance of recent models and our VAN while keeping the computational cost similar.
Method
large kernel attention (LKA)
a large-kernel convolution brings a huge amount of computational overhead and parameters
solution: decompose the large-kernel convolution
Decomposition diagram of large-kernel convolution. A standard convolution can be decomposed into three parts: a depth-wise convolution (DW-Conv), a depth-wise dilation convolution (DW-D-Conv), and a pointwise convolution ($1\times1$ Conv). The colored grids represent the location of the convolution kernel and the yellow grid means the center point. The diagram shows that a $13\times13$ convolution is decomposed into a $5\times5$ depth-wise convolution, a $5\times5$ depth-wise dilation convolution with dilation rate 3, and a pointwise convolution. Note: zero paddings are omitted in the figure.
a $K\times K$ large-kernel convolution is divided into 3 components:
- a spatial local convolution: a $(2d-1)\times(2d-1)$ depth-wise conv $\implies$ local contextual information
- a spatial long-range convolution: a $\lceil\frac Kd\rceil\times\lceil\frac Kd\rceil$ depth-wise dilated conv $\implies$ large receptive field
- a channel convolution: a $1\times1$ conv $\implies$ adaptability in the channel dimension

where $K$ is the kernel size and $d$ is the dilation rate
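As a concrete check, the defaults $K=21$, $d=3$ give a $5\times5$ depth-wise conv, a $7\times7$ depth-wise conv with dilation 3, and a $1\times1$ conv. Below is a minimal PyTorch sketch of that decomposition; the class and layer names are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

class DecomposedLargeKernelConv(nn.Module):
    """Approximates a 21x21 convolution (K=21, d=3) with three cheap convolutions."""
    def __init__(self, dim: int):
        super().__init__()
        # spatial local convolution: (2d-1)=5, depth-wise (groups=dim)
        self.dw_conv = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim)
        # spatial long-range convolution: ceil(K/d)=7, depth-wise, dilation 3;
        # padding = dilation * (kernel_size - 1) // 2 = 9 preserves spatial size
        self.dw_d_conv = nn.Conv2d(dim, dim, kernel_size=7, padding=9,
                                   dilation=3, groups=dim)
        # channel convolution: 1x1 pointwise
        self.pw_conv = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pw_conv(self.dw_d_conv(self.dw_conv(x)))

x = torch.randn(1, 64, 56, 56)
assert DecomposedLargeKernelConv(64)(x).shape == x.shape  # same-size output
```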
Desirable properties belonging to convolution, self-attention and LKA.
write the LKA module as

$$
\begin{aligned}
Attention&=Conv_{1\times1}(DW\text{-}D\text{-}Conv(DW\text{-}Conv(F))) \\
Output&=Attention\otimes F
\end{aligned}
$$

where $F\in\Reals^{C\times H\times W}$ is the input feature and $Attention\in\Reals^{C\times H\times W}$ is the attention map
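Given the decomposition sketched earlier, the LKA module simply uses that stack as an attention map. A minimal sketch, reusing the hypothetical `DecomposedLargeKernelConv` from the block above:

```python
class LKA(nn.Module):
    """Attention = Conv_1x1(DW-D-Conv(DW-Conv(F))); Output = Attention ⊗ F."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = DecomposedLargeKernelConv(dim)  # defined in the earlier sketch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attention = self.attn(x)  # attention map with the same shape as x
        return attention * x      # element-wise product (⊗)
```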
The structure of different modules: (a) the proposed Large Kernel Attention (LKA); (b) non-attention module; (c) the self-attention module; (d) a stage of our Visual Attention Network (VAN). "CFF" means convolutional feed-forward network. Residual connection is omitted in (d). The difference between (a) and (b) is the element-wise multiply. It is worth noting that (c) is designed for 1D sequences.
computational complexity
assume the input and output have the same size $\Reals^{C\times H\times W}$

$$
\begin{aligned}
\mathrm{Param}&=\lceil\frac Kd\rceil\times\lceil\frac Kd\rceil\times C+(2d-1)\times(2d-1)\times C+C\times C \\
\mathrm{FLOPs}&=(\lceil\frac Kd\rceil\times\lceil\frac Kd\rceil\times C+(2d-1)\times(2d-1)\times C+C\times C)\times H\times W
\end{aligned}
$$

where $K$ is the kernel size (default $K=21$) and $d$ is the dilation rate (default $d=3$)
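A quick sanity check of these formulas in Python; the values of C, H, W below are arbitrary examples, not from the notes:

```python
import math

def lka_conv_params_flops(C: int, H: int, W: int, K: int = 21, d: int = 3):
    """Params and FLOPs of the decomposed large-kernel convolution (bias omitted)."""
    params = (math.ceil(K / d) ** 2) * C + ((2 * d - 1) ** 2) * C + C * C
    flops = params * H * W
    return params, flops

# For C=64: 7*7*64 + 5*5*64 + 64*64 = 8,832 parameters,
# versus 21*21*64*64 = 1,806,336 for a standard 21x21 convolution.
print(lka_conv_params_flops(C=64, H=56, W=56))
```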
Comparison of parameters of different manners for a $21\times21$ convolution. X, Y and Ours denote standard convolution, MobileNet decomposition and our decomposition, respectively. The input and output features have the same size $H\times W\times C$. Note: bias is omitted for simplicity.
architecture variants
The detailed settings for different versions of the VAN. "e.r." represents the expansion ratio in the feed-forward network.
Experiment
image classification
dataset ImageNet-1K, with augmentation
optimizer AdamW: batch size=1024, 310 epochs, momentum=0.9, weight decay=5e-2, initial lr=5e-4, warm-up, cosine decay
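A hedged PyTorch sketch of this recipe (for AdamW, momentum 0.9 corresponds to $\beta_1=0.9$; the warm-up length is an assumption, since the notes only say "warm-up"):

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(8, 8)  # stand-in for a VAN variant
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4,
                              betas=(0.9, 0.999), weight_decay=5e-2)

warmup_epochs = 10  # assumption: the notes do not give the warm-up length
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=1e-3, total_iters=warmup_epochs),
        CosineAnnealingLR(optimizer, T_max=310 - warmup_epochs),  # cosine decay
    ],
    milestones=[warmup_epochs],
)

for epoch in range(310):
    # ... one training epoch with batch size 1024 ...
    scheduler.step()
```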
Comparison with the state-of-the-art methods on the ImageNet validation set. Params means parameters. GFLOPs denotes floating point operations. Top-1 Acc represents Top-1 accuracy.
object detection and instance segmentation
framework RetinaNet, Mask R-CNN, Cascade Mask R-CNN, Sparse R-CNN
dataset COCO 2017
Object detection on the COCO 2017 dataset. #P means parameters. RetinaNet $1\times$ denotes models based on RetinaNet and trained for 12 epochs.
Object detection and instance segmentation on the COCO 2017 dataset. #P means parameters. Mask R-CNN $1\times$ denotes models based on Mask R-CNN and trained for 12 epochs. $AP^b$ and $AP^m$ refer to bounding box AP and mask AP, respectively.
Comparison with the state-of-the-art vision backbones on the COCO 2017 benchmark. All models are trained for 36 epochs. We calculate FLOPs with input size $1280\times800$.
semantic segmentation
framework Semantic FPN, UperNet
dataset ADE20K
Results of semantic segmentation on the ADE20K validation set. The upper and lower parts are obtained under two different training/validation schemes. We calculate FLOPs with input size $512\times512$ for Semantic FPN and $2048\times512$ for UperNet.
ablation studies
architecture components
Ablation study of different modules in LKA. Results show that each part is critical. Acc(%) means Top-1 accuracy on ImageNet validation set.
key findings
- local structural information, long-range dependence, and adaptability in the channel dimension are all critical
- the attention mechanism helps the network achieve adaptivity
kernel size and dilation
Ablation study of different kernel sizes in LKA. Acc(%) means Top-1 accuracy on the ImageNet validation set.
key findings
- decomposing a $21\times21$ convolution works better than decomposing a $7\times7$ convolution $\implies$ a large kernel is critical for visual tasks
- when decomposing a larger $35\times35$ convolution, the gain is not obvious compared with decomposing a $21\times21$ convolution
visualization
Visualization results. All images come from different categories in the ImageNet validation set. CAMs are produced using Grad-CAM. We compare the CAMs produced by Swin-T, ConvNeXt-T and VAN-Base.
key findings
- the activation area is more accurate
- VAN shows obvious advantages when the object is dominant in the image $\implies$ ability to capture long-range dependence
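For reference, a minimal Grad-CAM sketch in plain PyTorch; the choice of `target_layer` (e.g. the last stage of the backbone) and the helper name are illustrative, not the paper's visualization code:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, x, target_layer, class_idx=None):
    """Weight the target layer's activations by the spatially averaged
    gradients of the class score, sum over channels, then ReLU."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    logits = model(x)
    if class_idx is None:
        class_idx = logits.argmax(dim=1)  # explain the predicted class by default
    model.zero_grad()
    logits[torch.arange(x.size(0)), class_idx].sum().backward()
    h1.remove(); h2.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)            # GAP of gradients
    cam = F.relu((weights * feats[0]).sum(dim=1, keepdim=True))  # weighted channel sum
    return F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
```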