Abstract
demerits of self-attention (SA)
- treats images as 1D sequences and neglects their 2D structure
- quadratic complexity is too expensive for high-resolution images
- achieves spatial adaptability but ignores channel adaptability
demerits of multilayer perceptron (MLP)
- sensitive to input size; can only process fixed-size images
- considers global information but ignores local structure
contributions
- propose large kernel attention (LKA): local structural information, long-range dependence, and adaptability in the channel dimension
- present Visual Attention Network (VAN) as a backbone based on LKA: SOTA performance with fewer parameters and FLOPs
Results of different models on the ImageNet-1K validation set. Left: Comparing the performance of recent models DeiT, PVT, Swin Transformer, ConvNeXt, Focal Transformer and our VAN. All these models have a similar number of parameters. Right: Comparing the performance of recent models and our VAN while keeping the computational cost similar.
Method
large kernel attention (LKA)
a large-kernel convolution brings a huge amount of computational overhead and parameters
solution: decompose the large-kernel convolution
Decomposition diagram of large-kernel convolution. A standard convolution can be decomposed into three parts: a depth-wise convolution (DW-Conv), a depth-wise dilation convolution (DW-D-Conv), and a pointwise convolution ($1\times1$ Conv). The colored grids represent the location of the convolution kernel and the yellow grid means the center point. The diagram shows that a $13\times13$ convolution is decomposed into a $5\times5$ depth-wise convolution, a $5\times5$ depth-wise dilation convolution with dilation rate 3, and a pointwise convolution. Note: zero paddings are omitted in the figure.
a $K\times K$ large-kernel convolution is divided into 3 components:
- a spatial local convolution: a $(2d-1)\times(2d-1)$ depth-wise conv $\implies$ local contextual information
- a spatial long-range convolution: a $\lceil\frac Kd\rceil\times\lceil\frac Kd\rceil$ depth-wise dilated conv $\implies$ large receptive field
- a channel convolution: a $1\times1$ conv $\implies$ adaptability in the channel dimension

where $K$ is the kernel size and $d$ is the dilation rate
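As a concrete check, the defaults $K=21$, $d=3$ give a $5\times5$ depth-wise conv, a $7\times7$ depth-wise conv with dilation 3, and a $1\times1$ conv. Below is a minimal PyTorch sketch of that decomposition; the class and layer names are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

class DecomposedLargeKernelConv(nn.Module):
    """Approximates a 21x21 convolution (K=21, d=3) with three cheap convolutions."""
    def __init__(self, dim: int):
        super().__init__()
        # spatial local convolution: (2d-1)=5, depth-wise (groups=dim)
        self.dw_conv = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim)
        # spatial long-range convolution: ceil(K/d)=7, depth-wise, dilation 3;
        # padding = dilation * (kernel_size - 1) // 2 = 9 preserves spatial size
        self.dw_d_conv = nn.Conv2d(dim, dim, kernel_size=7, padding=9,
                                   dilation=3, groups=dim)
        # channel convolution: 1x1 pointwise
        self.pw_conv = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pw_conv(self.dw_d_conv(self.dw_conv(x)))

x = torch.randn(1, 64, 56, 56)
assert DecomposedLargeKernelConv(64)(x).shape == x.shape  # same-size output
```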
Desirable properties belonging to convolution, self-attention and LKA.
write the LKA module as

$$
\begin{aligned}
Attention&=Conv_{1\times1}(DW\text{-}D\text{-}Conv(DW\text{-}Conv(F))) \\
Output&=Attention\otimes F
\end{aligned}
$$

where $F\in\Reals^{C\times H\times W}$ is the input feature and $Attention\in\Reals^{C\times H\times W}$ is the attention map
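Given the decomposition sketched earlier, the LKA module simply uses that stack as an attention map. A minimal sketch, reusing the hypothetical `DecomposedLargeKernelConv` from the block above:

```python
class LKA(nn.Module):
    """Attention = Conv_1x1(DW-D-Conv(DW-Conv(F))); Output = Attention ⊗ F."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = DecomposedLargeKernelConv(dim)  # defined in the earlier sketch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attention = self.attn(x)  # attention map with the same shape as x
        return attention * x      # element-wise product (⊗)
```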
The structure of different modules: (a) the proposed Large Kernel Attention (LKA); (b) non-attention module; (c) the self-attention module; (d) a stage of our Visual Attention Network (VAN). "CFF" means convolutional feed-forward network. Residual connection is omitted in (d). The difference between (a) and (b) is the element-wise multiply. It is worth noting that (c) is designed for 1D sequences.
computational complexity
assume the input and output have the same size $\Reals^{C\times H\times W}$

$$
\begin{aligned}
\mathrm{Param}&=\lceil\frac Kd\rceil\times\lceil\frac Kd\rceil\times C+(2d-1)\times(2d-1)\times C+C\times C \\
\mathrm{FLOPs}&=(\lceil\frac Kd\rceil\times\lceil\frac Kd\rceil\times C+(2d-1)\times(2d-1)\times C+C\times C)\times H\times W
\end{aligned}
$$

where $K$ is the kernel size (default $K=21$) and $d$ is the dilation rate (default $d=3$)
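A quick sanity check of these formulas in Python; the values of C, H, W below are arbitrary examples, not from the notes:

```python
import math

def lka_conv_params_flops(C: int, H: int, W: int, K: int = 21, d: int = 3):
    """Params and FLOPs of the decomposed large-kernel convolution (bias omitted)."""
    params = (math.ceil(K / d) ** 2) * C + ((2 * d - 1) ** 2) * C + C * C
    flops = params * H * W
    return params, flops

# For C=64: 7*7*64 + 5*5*64 + 64*64 = 8,832 parameters,
# versus 21*21*64*64 = 1,806,336 for a standard 21x21 convolution.
print(lka_conv_params_flops(C=64, H=56, W=56))
```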
Comparison of parameters of different manners for a $21\times21$ convolution. X, Y and Ours denote standard convolution, MobileNet decomposition and our decomposition, respectively. The input and output features have the same size $H\times W\times C$. Note: bias is omitted for simplicity.
architecture variants
The detailed settings for different versions of the VAN. "e.r." represents the expansion ratio in the feed-forward network.
Experiment
image classification
dataset ImageNet-1K, with augmentation
optimizer AdamW: batch size=1024, 310 epochs, momentum=0.9, weight decay=5e-2, initial lr=5e-4, warm-up, cosine decay
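A hedged PyTorch sketch of this recipe (for AdamW, momentum 0.9 corresponds to $\beta_1=0.9$; the warm-up length is an assumption, since the notes only say "warm-up"):

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(8, 8)  # stand-in for a VAN variant
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4,
                              betas=(0.9, 0.999), weight_decay=5e-2)

warmup_epochs = 10  # assumption: the notes do not give the warm-up length
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=1e-3, total_iters=warmup_epochs),
        CosineAnnealingLR(optimizer, T_max=310 - warmup_epochs),  # cosine decay
    ],
    milestones=[warmup_epochs],
)

for epoch in range(310):
    # ... one training epoch with batch size 1024 ...
    scheduler.step()
```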
Comparison with the state-of-the-art methods on the ImageNet validation set. Params means parameters. GFLOPs denotes floating point operations. Top-1 Acc represents Top-1 accuracy.
object detection and instance segmentation
framework RetinaNet, Mask R-CNN, Cascade Mask R-CNN, Sparse R-CNN
dataset COCO 2017
Object detection on the COCO 2017 dataset. #P means parameters. RetinaNet $1\times$ denotes models based on RetinaNet and trained for 12 epochs.
Object detection and instance segmentation on the COCO 2017 dataset. #P means parameters. Mask R-CNN $1\times$ denotes models based on Mask R-CNN and trained for 12 epochs. $AP^b$ and $AP^m$ refer to bounding box AP and mask AP, respectively.
Comparison with the state-of-the-art vision backbones on the COCO 2017 benchmark. All models are trained for 36 epochs. We calculate FLOPs with input size $1280\times800$.
semantic segmentation
framework Semantic FPN, UperNet
dataset ADE20K
Results of semantic segmentation on the ADE20K validation set. The upper and lower parts are obtained under two different training/validation schemes. We calculate FLOPs with input size $512\times512$ for Semantic FPN and $2048\times512$ for UperNet.
ablation studies
architecture components
Ablation study of different modules in LKA. Results show that each part is critical. Acc(%) means Top-1 accuracy on ImageNet validation set.
key findings
- local structural information, long-range dependence, and adaptability in the channel dimension are all critical
- the attention mechanism helps the network achieve adaptivity
kernel size and dilation
Ablation study of different kernel sizes in LKA. Acc(%) means Top-1 accuracy on the ImageNet validation set.
key findings
- decomposing a $21\times21$ convolution works better than decomposing a $7\times7$ convolution $\implies$ a large kernel is critical for visual tasks
- when decomposing a larger $35\times35$ convolution, the gain is not obvious compared with decomposing a $21\times21$ convolution
visualization
Visualization results. All images come from different categories in the ImageNet validation set. CAMs are produced using Grad-CAM. We compare the CAMs produced by Swin-T, ConvNeXt-T and VAN-Base.
key findings
- the activation area is more accurate
- VAN shows obvious advantages when the object is dominant in the image $\implies$ ability to capture long-range dependence
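For reference, a minimal Grad-CAM sketch in plain PyTorch; the choice of `target_layer` (e.g. the last stage of the backbone) and the helper name are illustrative, not the paper's visualization code:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, x, target_layer, class_idx=None):
    """Weight the target layer's activations by the spatially averaged
    gradients of the class score, sum over channels, then ReLU."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    logits = model(x)
    if class_idx is None:
        class_idx = logits.argmax(dim=1)  # explain the predicted class by default
    model.zero_grad()
    logits[torch.arange(x.size(0)), class_idx].sum().backward()
    h1.remove(); h2.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)            # GAP of gradients
    cam = F.relu((weights * feats[0]).sum(dim=1, keepdim=True))  # weighted channel sum
    return F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
```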