[2111] [CVPR 2022] MetaFormer is Actually What You Need for Vision

paper
code

Abstract

  • demonstrate that success of transformer/MLP-like models is largely attributed to MetaFormer architecture
  • hope to inspire more future research dedicated to improving MetaFormer instead of focusing on token mixer modules

Method

MetaFormer

a transformer encoder consists of 2 parts:

  1. attention module: token mixer, mix information among tokens
  2. remaining modules: channel MLPs and residual connections


MetaFormer architecture. We present MetaFormer as a general architecture abstracted from transformers by not specifying the token mixer. When using attention/spatial MLP as the token mixer, MetaFormer is instantiated as transformer/MLP-like models. We argue that the competence of transformer/MLP-like models primarily stems from the general architecture MetaFormer instead of the equipped specific token mixers. To demonstrate this, we exploit an embarrassingly simple non-parametric operator, pooling, to conduct extremely basic token mixing.

MetaFormer is a general architecture where the token mixer is not specified while the other components are kept the same as transformers.

the input $I$ is first processed by an input embedding, such as patch embedding:
$$X = \mathrm{PatchEmbed}(I)$$

where $X\in\mathbb{R}^{N\times C}$ denotes the embedding tokens, $N$ is the sequence length, and $C$ is the embedding dimension
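As a concrete example of this input embedding, below is a minimal PyTorch sketch of a convolution-based patch embedding (kernel size and stride equal to the patch size); the class name and default arguments are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each patch
    to an embedding of dimension C via a strided convolution."""
    def __init__(self, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, C, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)  # (B, N, C), N = H*W / patch^2

# Example: a 224x224 image becomes N = 196 tokens of dimension 768.
tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```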
the embedding tokens are then fed into repeated MetaFormer blocks, each of which includes two residual sub-blocks

the first sub-block is the token mixer:
$$Y = \mathrm{TokenMixer}(\mathrm{Norm}(X)) + X$$

where $\mathrm{Norm}(\cdot)$ is a normalization layer (such as LN or BN), and $\mathrm{TokenMixer}(\cdot)$ is a module that can be implemented by attention or MLP-like modules
note that the main function of the token mixer is to propagate information among tokens, although some token mixers (e.g., attention) can also mix channels

the second sub-block is the channel MLP:
$$Z = \sigma(\mathrm{Norm}(Y)W_1)W_2 + Y$$

where $W_1\in\mathbb{R}^{C\times rC}$ and $W_2\in\mathbb{R}^{rC\times C}$ are learnable weights with MLP expansion ratio $r$, and $\sigma(\cdot)$ is an activation function
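Putting the two residual sub-blocks together, here is a minimal PyTorch sketch of a generic MetaFormer block operating on token sequences $X\in\mathbb{R}^{N\times C}$, with the token mixer passed in as an arbitrary module; class and argument names (`ChannelMLP`, `MetaFormerBlock`, `mlp_ratio`, ...) are illustrative, not the official implementation.

```python
import torch
import torch.nn as nn

class ChannelMLP(nn.Module):
    """Implements sigma(x W1) W2; Norm and the residual are applied
    in the enclosing MetaFormer block."""
    def __init__(self, dim, mlp_ratio=4, act_layer=nn.GELU):
        super().__init__()
        self.fc1 = nn.Linear(dim, int(dim * mlp_ratio))   # W1: C -> rC
        self.act = act_layer()
        self.fc2 = nn.Linear(int(dim * mlp_ratio), dim)   # W2: rC -> C

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))

class MetaFormerBlock(nn.Module):
    """Generic MetaFormer block: the token mixer is left unspecified."""
    def __init__(self, dim, token_mixer, mlp_ratio=4, norm_layer=nn.LayerNorm):
        super().__init__()
        self.norm1 = norm_layer(dim)
        self.token_mixer = token_mixer           # e.g. attention, spatial MLP, pooling
        self.norm2 = norm_layer(dim)
        self.mlp = ChannelMLP(dim, mlp_ratio)

    def forward(self, x):                        # x: (B, N, C)
        x = x + self.token_mixer(self.norm1(x))  # Y = TokenMixer(Norm(X)) + X
        x = x + self.mlp(self.norm2(x))          # Z = sigma(Norm(Y) W1) W2 + Y
        return x

# Example: plug in multi-head self-attention as the token mixer -> a transformer block.
attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)
mixer = lambda x: attn(x, x, x, need_weights=False)[0]
block = MetaFormerBlock(dim=768, token_mixer=mixer)
out = block(torch.randn(1, 196, 768))            # (1, 196, 768)
```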

PoolFormer

model architecture

the MetaFormer general architecture contributes most to the success of the recent transformer and MLP-like models
to demonstrate this, an embarrassingly simple operator, pooling, is employed as the token mixer


(a) The overall framework of PoolFormer. PoolFormer adopts a hierarchical architecture with 4 stages. For a model with $L$ PoolFormer blocks, stages [1, 2, 3, 4] have $[\frac{L}{6}, \frac{L}{6}, \frac{L}{2}, \frac{L}{6}]$ blocks, respectively. The feature dimension $D_i$ of stage $i$ is shown in the figure. (b) The architecture of the PoolFormer block. Compared with the transformer block, it replaces attention with an extremely simple non-parametric operator, pooling, to conduct only basic token mixing.

the pooling operator is expressed as
$$T'_{:,i,j} = \frac{1}{K\times K}\sum_{p,q=1}^{K} T_{:,\,i+p-\frac{K+1}{2},\,j+q-\frac{K+1}{2}} - T_{:,i,j}$$

where $T\in\mathbb{R}^{C\times H\times W}$ and $K$ is the pooling size
note that since the MetaFormer block already has a residual connection, the input itself, $T_{:,i,j}$, needs to be subtracted inside the token mixer
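A minimal PyTorch sketch of this pooling token mixer, written here as `nn.AvgPool2d` followed by subtraction of the input on channel-first feature maps, as in the equation above; this is an illustrative sketch, not necessarily the official implementation.

```python
import torch
import torch.nn as nn

class Pooling(nn.Module):
    """Pooling token mixer: average over a K x K neighbourhood and subtract
    the input itself, since the MetaFormer block already adds a residual."""
    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x):          # x: (B, C, H, W)
        return self.pool(x) - x    # T' = AvgPool(T) - T

# Sanity check: a constant feature map is mapped to zero.
x = torch.ones(1, 64, 56, 56)
print(Pooling()(x).abs().max())    # tensor(0.)
```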

computational complexity

self-attention: quadratic in token sequence length, with extra learnable parameters
spatial MLP: quadratic in token sequence length, with even more learnable parameters
pooling: linear in token sequence length, with no learnable parameters
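As a rough back-of-the-envelope comparison for $N$ tokens with $C$ channels and pooling size $K$ (constants dropped; only the token-mixing computation is counted, and the spatial-MLP hidden width is taken proportional to $N$):

$$\underbrace{O(N^{2}C)}_{\text{self-attention}}\qquad \underbrace{O(N^{2}C)}_{\text{spatial MLP}}\qquad \underbrace{O(K^{2}NC)}_{\text{pooling},\;K\ll N}$$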

architecture variants


Configurations of different PoolFormer models. There are two groups of embedding dimensions, i.e., small size with [64, 128, 320, 512] dimensions and medium size with [96, 192, 384, 768]. Notation “S24” means the model is in the small size of embedding dimensions with 24 PoolFormer blocks in total.
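The per-stage depths can be derived from the $[\frac{L}{6}, \frac{L}{6}, \frac{L}{2}, \frac{L}{6}]$ split in the framework figure; the sketch below (a plain Python dict) summarizes the variants under that rule, with the depths computed from the rule rather than copied from the paper's table.

```python
# Per-stage block counts follow the [L/6, L/6, L/2, L/6] split, where L is the
# total number of PoolFormer blocks encoded in the model name (S12 -> L = 12).
SMALL_DIMS  = [64, 128, 320, 512]   # "S" models
MEDIUM_DIMS = [96, 192, 384, 768]   # "M" models

def stage_blocks(total_blocks):
    """Derive per-stage depths from the L/6, L/6, L/2, L/6 rule."""
    unit = total_blocks // 6
    return [unit, unit, 3 * unit, unit]

POOLFORMER_VARIANTS = {
    "S12": {"depths": stage_blocks(12), "dims": SMALL_DIMS},   # [2, 2, 6, 2]
    "S24": {"depths": stage_blocks(24), "dims": SMALL_DIMS},   # [4, 4, 12, 4]
    "S36": {"depths": stage_blocks(36), "dims": SMALL_DIMS},   # [6, 6, 18, 6]
    "M36": {"depths": stage_blocks(36), "dims": MEDIUM_DIMS},  # [6, 6, 18, 6]
    "M48": {"depths": stage_blocks(48), "dims": MEDIUM_DIMS},  # [8, 8, 24, 8]
}
```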

Experiment

image classification

dataset ImageNet-1K
data augmentation MixUp, CutMix, CutOut, RandAugment
optimizer AdamW: batch size=4096, 300 epochs
learning rate initial 1e-3, weight decay=0.05, warm-up 5 epochs, cosine decay
label smoothing 0.1
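A hedged sketch of the optimization setup above in plain PyTorch (AdamW plus linear warm-up followed by cosine decay); the defaults mirror the listed settings, while `model` and `steps_per_epoch` are placeholders, and the official repository uses its own training scripts.

```python
import math
import torch

def build_optimizer_and_scheduler(model, steps_per_epoch,
                                  epochs=300, warmup_epochs=5,
                                  base_lr=1e-3, weight_decay=0.05):
    """AdamW with linear warm-up then cosine decay, per the settings above
    (weight decay 0.05 assumed, DeiT-style)."""
    optimizer = torch.optim.AdamW(model.parameters(),
                                  lr=base_lr, weight_decay=weight_decay)
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = epochs * steps_per_epoch

    def lr_lambda(step):
        if step < warmup_steps:                      # linear warm-up
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```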


Performance of different types of models on ImageNet-1K classification. All these models are trained only on the ImageNet-1K training set, and the accuracy on the validation set is reported. “∗” denotes results of ViT trained with extra regularization, from the MLP-Mixer paper.


ImageNet-1K validation accuracy vs. MACs/Model Size.

object detection and instance segmentation

framework RetinaNet
dataset COCO
initialization Xavier
optimizer AdamW: batch size=16, 12 epochs
learning rate initial 1e-4


Performance of object detection on COCO val2017. All models are based on RetinaNet, and the 1× training schedule (i.e., 12 epochs) is used for training detection models.

framework Mask R-CNN
dataset COCO
initialization Xavier
optimizer AdamW: batch size=16, 12 epochs
learning rate initial 1e-4


Performance of object detection and instance segmentation on COCO val2017. $AP^b$ and $AP^m$ represent bounding box AP and mask AP, respectively. All models are based on Mask R-CNN and trained with the 1× training schedule (i.e., 12 epochs).

semantic segmentation

framework Semantic FPN
dataset ADE20K
initialization Xavier for newly added layers, backbone from ImageNet-pretrained checkpoints
optimizer AdamW: batch size=32, 40K iterations
learning rate initial 2e-4, polynomial decay with power 0.9


Performance of semantic segmentation on the ADE20K validation set. All models are equipped with Semantic FPN.

ablation study


Ablation studies for PoolFormer on the ImageNet-1K classification benchmark. PoolFormer-S12 is utilized as the baseline. The top-1 accuracy on the validation set is reported.

pooling size
similar performance with pooling size 3, 5, or 7, but an obvious drop of 0.5% with pooling size 9
pooling size 3 is adopted as the default for PoolFormer

normalization
Group Norm is 0.7% or 0.8% higher than Layer Norm or Batch Norm, respectively
Group Norm is adopted as the default normalization

activation
GELU is 0.8% higher than ReLU and on par with SiLU
GELU is adopted as the default activation

hybrid stages

  • the pooling-based mixer can handle much longer input sequences, while attention and spatial MLP are good at capturing global information
  • stack blocks with pooling in the bottom stages to handle long sequences, and use attention- or spatial MLP-based mixers in the top stages, since the sequences there have already been largely shortened

replace the pooling token mixer with attention or spatial FC in the top one or two stages
all hybrid variants outperform the baseline, and the variant with pooling in the bottom two stages and attention in the top two stages achieves a highly competitive 81.0% accuracy
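The hybrid configurations can be summarized as a per-stage choice of token mixer; the mapping below is a small illustrative sketch (the names are hypothetical shorthand), with the best-reported variant from above annotated.

```python
# Per-stage token mixers for the hybrid MetaFormer variants discussed above.
# "pool" = pooling mixer (bottom stages, long sequences);
# "attn" = attention mixer (top stages, shortened sequences);
# a spatial MLP could be substituted for "attn" in the same positions.
HYBRID_STAGE_MIXERS = {
    "baseline (all pooling)":      ["pool", "pool", "pool", "pool"],
    "pool-pool-pool-attn":         ["pool", "pool", "pool", "attn"],
    "pool-pool-attn-attn (81.0%)": ["pool", "pool", "attn", "attn"],
}
```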
