[2111] [CVPR 2022] MetaFormer is Actually What You Need for Vision

paper
code

Abstract

  • demonstrate that success of transformer/MLP-like models is largely attributed to MetaFormer architecture
  • hope to inspire more future research dedicated to improving MetaFormer instead of focusing on token mixer modules

Method

MetaFormer

a transformer encoder consists of 2 parts:

  1. attention module: token mixer, mix information among tokens
  2. remaining modules: channel MLPs and residual connections


MetaFormer architecture. We present MetaFormer as a general architecture abstracted from transformers by not specifying the token mixer. When using attention/spatial MLP as the token mixer, MetaFormer is instantiated as transformer/MLP-like models. We argue that the competence of transformer/MLP-like models primarily stems from the general architecture MetaFormer instead of the equipped specific token mixers. To demonstrate this, we exploit an embarrassingly simple non-parametric operator, pooling, to conduct extremely basic token mixing.

MetaFormer is a general architecture where the token mixer is not specified while the other components are kept the same as transformers.

the input $I$ is first processed by an input embedding, such as patch embedding:
$$X = \mathrm{PatchEmbed}(I)$$

where $X\in\mathbb{R}^{N\times C}$ denotes the embedding tokens, $N$ is the sequence length, and $C$ is the embedding dimension
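As a concrete example of this input embedding, below is a minimal PyTorch sketch of a convolution-based patch embedding (kernel size and stride equal to the patch size); the class name and default arguments are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each patch
    to an embedding of dimension C via a strided convolution."""
    def __init__(self, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, C, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)  # (B, N, C), N = H*W / patch^2

# Example: a 224x224 image becomes N = 196 tokens of dimension 768.
tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```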
the embedding tokens are then fed into repeated MetaFormer blocks, each of which includes two residual sub-blocks

the first sub-block is the token mixer:
$$Y = \mathrm{TokenMixer}(\mathrm{Norm}(X)) + X$$

where $\mathrm{Norm}(\cdot)$ is a normalization layer (such as LN or BN), and $\mathrm{TokenMixer}(\cdot)$ is a module that can be implemented by attention or MLP-like modules
note that the main function of the token mixer is to propagate information among tokens, although some token mixers (e.g., attention) can also mix channels

the second sub-block is the channel MLP:
$$Z = \sigma(\mathrm{Norm}(Y)W_1)W_2 + Y$$

where $W_1\in\mathbb{R}^{C\times rC}$ and $W_2\in\mathbb{R}^{rC\times C}$ are learnable weights with MLP expansion ratio $r$, and $\sigma(\cdot)$ is an activation function
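Putting the two residual sub-blocks together, here is a minimal PyTorch sketch of a generic MetaFormer block operating on token sequences $X\in\mathbb{R}^{N\times C}$, with the token mixer passed in as an arbitrary module; class and argument names (`ChannelMLP`, `MetaFormerBlock`, `mlp_ratio`, ...) are illustrative, not the official implementation.

```python
import torch
import torch.nn as nn

class ChannelMLP(nn.Module):
    """Implements sigma(x W1) W2; Norm and the residual are applied
    in the enclosing MetaFormer block."""
    def __init__(self, dim, mlp_ratio=4, act_layer=nn.GELU):
        super().__init__()
        self.fc1 = nn.Linear(dim, int(dim * mlp_ratio))   # W1: C -> rC
        self.act = act_layer()
        self.fc2 = nn.Linear(int(dim * mlp_ratio), dim)   # W2: rC -> C

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))

class MetaFormerBlock(nn.Module):
    """Generic MetaFormer block: the token mixer is left unspecified."""
    def __init__(self, dim, token_mixer, mlp_ratio=4, norm_layer=nn.LayerNorm):
        super().__init__()
        self.norm1 = norm_layer(dim)
        self.token_mixer = token_mixer           # e.g. attention, spatial MLP, pooling
        self.norm2 = norm_layer(dim)
        self.mlp = ChannelMLP(dim, mlp_ratio)

    def forward(self, x):                        # x: (B, N, C)
        x = x + self.token_mixer(self.norm1(x))  # Y = TokenMixer(Norm(X)) + X
        x = x + self.mlp(self.norm2(x))          # Z = sigma(Norm(Y) W1) W2 + Y
        return x

# Example: plug in multi-head self-attention as the token mixer -> a transformer block.
attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)
mixer = lambda x: attn(x, x, x, need_weights=False)[0]
block = MetaFormerBlock(dim=768, token_mixer=mixer)
out = block(torch.randn(1, 196, 768))            # (1, 196, 768)
```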

PoolFormer

model architecture

the MetaFormer general architecture contributes most to the success of the recent transformer and MLP-like models
to demonstrate this, an embarrassingly simple operator, pooling, is employed as the token mixer


(a) The overall framework of PoolFormer. PoolFormer adopts a hierarchical architecture with 4 stages. For a model with $L$ PoolFormer blocks, stages [1, 2, 3, 4] have $[\frac{L}{6}, \frac{L}{6}, \frac{L}{2}, \frac{L}{6}]$ blocks, respectively. The feature dimension $D_i$ of stage $i$ is shown in the figure. (b) The architecture of the PoolFormer block. Compared with the transformer block, it replaces attention with an extremely simple non-parametric operator, pooling, to conduct only basic token mixing.

the pooling operator is expressed as
$$T'_{:,i,j} = \frac{1}{K\times K}\sum_{p,q=1}^{K} T_{:,\,i+p-\frac{K+1}{2},\,j+q-\frac{K+1}{2}} - T_{:,i,j}$$

where $T\in\mathbb{R}^{C\times H\times W}$ and $K$ is the pooling size
note that since the MetaFormer block already has a residual connection, the input itself, $T_{:,i,j}$, needs to be subtracted inside the token mixer
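A minimal PyTorch sketch of this pooling token mixer, written here as `nn.AvgPool2d` followed by subtraction of the input on channel-first feature maps, as in the equation above; this is an illustrative sketch, not necessarily the official implementation.

```python
import torch
import torch.nn as nn

class Pooling(nn.Module):
    """Pooling token mixer: average over a K x K neighbourhood and subtract
    the input itself, since the MetaFormer block already adds a residual."""
    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x):          # x: (B, C, H, W)
        return self.pool(x) - x    # T' = AvgPool(T) - T

# Sanity check: a constant feature map is mapped to zero.
x = torch.ones(1, 64, 56, 56)
print(Pooling()(x).abs().max())    # tensor(0.)
```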

computational complexity

self-attention: quadratic in token sequence length, with extra learnable parameters
spatial MLP: quadratic in token sequence length, with even more learnable parameters
pooling: linear in token sequence length, with no learnable parameters
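As a rough back-of-the-envelope comparison for $N$ tokens with $C$ channels and pooling size $K$ (constants dropped; only the token-mixing computation is counted, and the spatial-MLP hidden width is taken proportional to $N$):

$$\underbrace{O(N^{2}C)}_{\text{self-attention}}\qquad \underbrace{O(N^{2}C)}_{\text{spatial MLP}}\qquad \underbrace{O(K^{2}NC)}_{\text{pooling},\;K\ll N}$$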

architecture variants


Configurations of different PoolFormer models. There are two groups of embedding dimensions, i.e., small size with [64, 128, 320, 512] dimensions and medium size with [96, 192, 384, 768]. Notation “S24” means the model is in the small size of embedding dimensions with 24 PoolFormer blocks in total.
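The per-stage depths can be derived from the $[\frac{L}{6}, \frac{L}{6}, \frac{L}{2}, \frac{L}{6}]$ split in the framework figure; the sketch below (a plain Python dict) summarizes the variants under that rule, with the depths computed from the rule rather than copied from the paper's table.

```python
# Per-stage block counts follow the [L/6, L/6, L/2, L/6] split, where L is the
# total number of PoolFormer blocks encoded in the model name (S12 -> L = 12).
SMALL_DIMS  = [64, 128, 320, 512]   # "S" models
MEDIUM_DIMS = [96, 192, 384, 768]   # "M" models

def stage_blocks(total_blocks):
    """Derive per-stage depths from the L/6, L/6, L/2, L/6 rule."""
    unit = total_blocks // 6
    return [unit, unit, 3 * unit, unit]

POOLFORMER_VARIANTS = {
    "S12": {"depths": stage_blocks(12), "dims": SMALL_DIMS},   # [2, 2, 6, 2]
    "S24": {"depths": stage_blocks(24), "dims": SMALL_DIMS},   # [4, 4, 12, 4]
    "S36": {"depths": stage_blocks(36), "dims": SMALL_DIMS},   # [6, 6, 18, 6]
    "M36": {"depths": stage_blocks(36), "dims": MEDIUM_DIMS},  # [6, 6, 18, 6]
    "M48": {"depths": stage_blocks(48), "dims": MEDIUM_DIMS},  # [8, 8, 24, 8]
}
```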

Experiment

image classification

dataset ImageNet-1K
data augmentation MixUp, CutMix, CutOut, RandAugment
optimizer AdamW: batch size=4096, 300 epochs
learning rate initial 1e-3, weight decay=0.05, warm-up 5 epochs, cosine decay
label smoothing 0.1
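A hedged sketch of the optimization setup above in plain PyTorch (AdamW plus linear warm-up followed by cosine decay); the defaults mirror the listed settings, while `model` and `steps_per_epoch` are placeholders, and the official repository uses its own training scripts.

```python
import math
import torch

def build_optimizer_and_scheduler(model, steps_per_epoch,
                                  epochs=300, warmup_epochs=5,
                                  base_lr=1e-3, weight_decay=0.05):
    """AdamW with linear warm-up then cosine decay, per the settings above
    (weight decay 0.05 assumed, DeiT-style)."""
    optimizer = torch.optim.AdamW(model.parameters(),
                                  lr=base_lr, weight_decay=weight_decay)
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = epochs * steps_per_epoch

    def lr_lambda(step):
        if step < warmup_steps:                      # linear warm-up
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```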


Performance of different types of models on ImageNet-1K classification. All these models are trained only on the ImageNet-1K training set, and the accuracy on the validation set is reported. “∗” denotes results of ViT trained with extra regularization, from the MLP-Mixer paper.


ImageNet-1K validation accuracy vs. MACs/Model Size.

object detection and instance segmentation

framework RetinaNet
dataset COCO
initialization Xavier
optimizer AdamW: batch size=16, 12 epochs
learning rate initial 1e-4


Performance of object detection on COCO val2017. All models are based on RetinaNet, and the 1× training schedule (i.e., 12 epochs) is used for training detection models.

framework Mask R-CNN
dataset COCO
initialization Xavier
optimizer AdamW: batch size=16, 12 epochs
learning rate initial 1e-4


Performance of object detection and instance segmentation on COCO val2017. $AP^b$ and $AP^m$ represent bounding box AP and mask AP, respectively. All models are based on Mask R-CNN and trained with the 1× training schedule (i.e., 12 epochs).

semantic segmentation

framework Semantic FPN
dataset ADE20K
initialization Xavier for newly added layers, backbone from ImageNet-pretrained checkpoints
optimizer AdamW: batch size=32, 40K iterations
learning rate initial 2e-4, polynomial decay with power 0.9


Performance of semantic segmentation on the ADE20K validation set. All models are equipped with Semantic FPN.

ablation study


Ablation studies for PoolFormer on the ImageNet-1K classification benchmark. PoolFormer-S12 is utilized as the baseline. The top-1 accuracy on the validation set is reported.

pooling size
similar performance with pooling size 3, 5, or 7, but an obvious drop of 0.5% with pooling size 9
pooling size 3 is adopted as the default for PoolFormer

normalization
Group Norm is 0.7% or 0.8% higher than Layer Norm or Batch Norm, respectively
Group Norm is adopted as the default normalization

activation
GELU is 0.8% higher than ReLU and on par with SiLU
GELU is adopted as the default activation

hybrid stages

  • the pooling-based mixer can handle much longer input sequences, while attention and spatial MLP are good at capturing global information
  • stack blocks with pooling in the bottom stages to handle long sequences, and use attention- or spatial MLP-based mixers in the top stages, since the sequences there have already been largely shortened

replace the pooling token mixer with attention or spatial FC in the top one or two stages
all hybrid variants outperform the baseline, and the variant with pooling in the bottom two stages and attention in the top two stages achieves a highly competitive 81.0% accuracy
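The hybrid configurations can be summarized as a per-stage choice of token mixer; the mapping below is a small illustrative sketch (the names are hypothetical shorthand), with the best-reported variant from above annotated.

```python
# Per-stage token mixers for the hybrid MetaFormer variants discussed above.
# "pool" = pooling mixer (bottom stages, long sequences);
# "attn" = attention mixer (top stages, shortened sequences);
# a spatial MLP could be substituted for "attn" in the same positions.
HYBRID_STAGE_MIXERS = {
    "baseline (all pooling)":      ["pool", "pool", "pool", "pool"],
    "pool-pool-pool-attn":         ["pool", "pool", "pool", "attn"],
    "pool-pool-attn-attn (81.0%)": ["pool", "pool", "attn", "attn"],
}
```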
