A Brief Look at PoolFormer: Rethinking the Architecture of ViTs

Original link: https://aistudio.baidu.com/aistudio/projectdetail/3031966


Special large-print edition

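The success of Transformers in NLP drew researchers' attention to CV, and the line of work from ViT to Swin Transformer and the MLP-like models shows how well this family does on vision tasks. This project implements PoolFormer, a recent Transformer-related work, in Paddle. By pairing the paper with hands-on code, it aims to help everyone understand the paper's ideas and to study cutting-edge Transformer work together.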

[Figure: https://ai-studio-static-online.cdn.bcebos.com/5b1032e7d69d484e8e588dc3f16583a5d64fc76b282b409d8ee1764f0f4e909b]

We argue that the competence of transformer/MLP-like models primarily stems from the general architecture MetaFormer instead of the equipped specific token mixer.

Paper: arXiv

Code: GitHub

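A quick plug 🔥🔥🔥

Check out PPViT, which also includes two other works from Shuicheng Yan's team: ViP and VOLO

And take a look at PASSL too

Hi guys, we meet again. This time let's dig into PoolFormer, a really interesting paper.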

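Where does the strength of ViTs come from? Many earlier studies credit MSA (multi-head self-attention) and its global representation, while others think it is the patch embedding operation.

A while ago ConvMixer blew up online. Why? Because it seemed to prove that patch embedding is what matters. Impressive, sure, but the newly retrained ResNet already reaches about 80 points of top-1 accuracy, and, carrying the honor of convs, it still can't beat EfficientNetV2.

Now look at ConvMixer's performance: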

[Figure: https://ai-studio-static-online.cdn.bcebos.com/d41840552c564897848d6ca84eb24459c7c932cfcf1d463eae74e113b84644c7]

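Whoa, what's the patch size? 7!!! Everyone knows that in ViTs the computational cost of MSA is quadratic in the number of tokens (W-MSA is not discussed here). That is exactly why ViTs usually pick a larger patch size: to keep the token count down. Which ViT have you seen with a patch size this small?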
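
To put rough numbers on that, here is a minimal sketch that counts tokens and attention pairs for a few patch sizes. It assumes a 224x224 input and non-overlapping patches; the helper names `num_tokens` and `attention_pairs` are made up for illustration and do not come from the paper or from PPViT.

```python
def num_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping patches (tokens) for a square image."""
    return (image_size // patch_size) ** 2


def attention_pairs(image_size: int, patch_size: int) -> int:
    """Token-to-token pairs scored by one global self-attention map (N^2)."""
    n = num_tokens(image_size, patch_size)
    return n * n


# Assumed input resolution: 224x224
for p in (32, 16, 7):
    print(f"patch={p:2d}  tokens={num_tokens(224, p):5d}  pairs={attention_pairs(224, p):8d}")
```

Going from patch size 16 to 7 raises the token count from 196 to 1024, so the attention map grows by roughly 27x. That is the gap a fair comparison would have to account for.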

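Here's the issue: strictly speaking, to be convincing you would have to use the same patch size in the patch embedding and win at the same token count. To be fair, the paper itself says it is not chasing the strongest model; it aims to make people think.

The patch operation itself is also worth studying. For example, work from FAIR modifies the stem to fix training issues.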

[Figure: https://ai-studio-static-online.cdn.bcebos.com/982dcf79e9f84a87b112aad2cb7f994535b6d6036d23429ca47f03decfdaeca4]

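But that's a story for another day. No study has conclusively shown that the strength of ViTs comes from patch embedding.

Back to this paper: Dr. Yan says the strength of ViTs comes from the architecture. The architecture.

References

ConvMixer: implemented in PPViT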

Early Convolutions Help Transformers See Better

PoolFormer

[Figure: https://ai-studio-static-online.cdn.bcebos.com/8fe466c6dfe84d4099a30c800c4b944b1b86383c835d40a4997b89bcb19d6c95]

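The figure above is easy to follow: it simply swaps the token mixer between attention, MLP, and pooling. The first two are ViTs and MLP-like models, and the last one is this paper's work.

Let's see what the code looks like: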

class Pooling(nn.Layer):
    """
    Implementation of pooling for PoolFormer
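    --kernel_size: pooling size
    """
    def __init__(self, kernel_size=3):
        super().__init__()
        # exclusive=True leaves the padded zeros out of the average
        # (the counterpart of count_include_pad=False in PyTorch)
        self.pool = nn.AvgPool2D(
            kernel_size, stride=1, padding=kernel_size//2, exclusive=True)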

    def forward(self, x):
        return self.pool(x) - x

This is the replacement for MSA. Can you believe it?

Look at it: no learnable parameters, and very little compute.
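
To see what "average pool, then subtract the input" does, here is a framework-free sketch on a single 2D channel. It assumes stride 1, "same" padding, and that padded positions are left out of the average; `pool_minus_identity` is a hypothetical helper, not part of Paddle.

```python
def pool_minus_identity(grid, kernel_size=3):
    """Average-pool one 2D channel (stride 1, 'same' padding), then subtract the input.

    Positions outside the grid are excluded from the average.
    """
    h, w = len(grid), len(grid[0])
    r = kernel_size // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            window = [grid[y][x]
                      for y in range(max(0, i - r), min(h, i + r + 1))
                      for x in range(max(0, j - r), min(w, j + r + 1))]
            out[i][j] = sum(window) / len(window) - grid[i][j]
    return out


flat = [[2.0] * 4 for _ in range(4)]                              # constant feature map
spike = [[0.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0]]      # one bright pixel

print(pool_minus_identity(flat))    # every entry is 0.0: nothing to mix
print(pool_minus_identity(spike))   # the spike is spread to its neighbours
```

A constant map gives all zeros, so the mixer only reacts to differences between a position and its neighbourhood. The subtraction is there because the block adds the input back through its residual connection.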

FLOPs counting (taken from PASSL, where support is coming soon). You can work out roughly how many FLOPs this takes:

def count_avgpool(m, x, y):
    num_elements = y.numel()
    m.flops += int(num_elements)
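
As a rough worked example of that convention (one FLOP per output element), the sketch below adds up the pooling cost over all blocks of the S12 configuration. It assumes a 224x224 input and the stage resolutions 56, 28, 14, 7 that follow from the stride-4 stem and stride-2 downsampling; `avgpool_flops` is a hypothetical stand-in for `count_avgpool`.

```python
def avgpool_flops(channels, height, width, batch=1):
    """Mirror count_avgpool: one FLOP per output element (y.numel())."""
    return batch * channels * height * width


# (channels, spatial size, number of blocks) per stage for the S12 config,
# assuming a 224x224 input
stages = [(64, 56, 2), (128, 28, 2), (320, 14, 6), (512, 7, 2)]

total = sum(blocks * avgpool_flops(c, s, s) for c, s, blocks in stages)
print("one stage-1 pooling layer:", avgpool_flops(64, 56, 56))
print("all pooling layers in S12:", total)
```

Under this counting rule all token mixers together cost about one million FLOPs, which is tiny next to the 1.8G reported for the whole S12 model.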

Enough talk, show me the code

Let's just get it done

First, define a few simple pieces that lay the groundwork for building the network later

import paddle
import paddle.nn as nn
import paddle.nn.functional as F

# Define some initializers
trunc_normal_ = nn.initializer.TruncatedNormal(std=0.02)
zeros_ = nn.initializer.Constant(value=0.0)
ones_ = nn.initializer.Constant(value=1.0)

# Does nothing; the counterpart of torch.nn.Identity
class Identity(nn.Layer):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x

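# DropPath (stochastic depth), a regularization method
def drop_path(x, drop_prob=0.0, training=False):
    # No-op at inference time, or when drop_prob is 0.0 / None
    if not drop_prob or not training:
        return x
    keep_prob = 1 - drop_prob
    # One random value per sample, broadcast over all remaining dims
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    random_tensor = paddle.to_tensor(keep_prob) + paddle.rand(shape)
    random_tensor = paddle.floor(random_tensor)  # 1 with probability keep_prob, else 0
    # Rescale the kept samples so the expected output equals x
    output = x / keep_prob * random_tensor
    return output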

class DropPath(nn.Layer):
    def __init__(self, drop_prob=None):
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        return drop_path(x, self.drop_prob, self.training)
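
The `floor(keep_prob + rand)` trick in `drop_path` is easy to miss, so here is a scalar sketch of it in plain Python. It only mirrors the training-mode arithmetic for a single sample; `keep_mask` and `drop_path_scalar` are illustrative names, and the uniform random draw is replaced by an evenly spaced grid of values so the result is deterministic.

```python
import math


def keep_mask(u, drop_prob):
    """floor(keep_prob + u) for u in [0, 1): 1 keeps the branch, 0 drops it."""
    keep_prob = 1.0 - drop_prob
    return math.floor(keep_prob + u)


def drop_path_scalar(x, u, drop_prob):
    """Scalar version of drop_path in training mode for one sample."""
    keep_prob = 1.0 - drop_prob
    return x / keep_prob * keep_mask(u, drop_prob)


# Sweep u over an even grid in [0, 1) instead of sampling it
us = [i / 1000 for i in range(1000)]
kept = sum(keep_mask(u, 0.25) for u in us)
mean_out = sum(drop_path_scalar(3.0, u, 0.25) for u in us) / len(us)
print(kept, mean_out)
```

With `drop_prob=0.25` three quarters of the grid points are kept, and dividing by `keep_prob` brings the average output back to the input value 3.0.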

Let's look at the network structure

[Figure: https://ai-studio-static-online.cdn.bcebos.com/c24c898cdf2b451590bfd0385777aca71a578856902a430fafdddde45d213256]

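It has two parts

One is the PatchEmbed operation

The other is the block

Patch embedding needs no explanation. If you can't follow it, you weren't paying attention in class [funny face]

One thing worth pointing out: the idea is the same as in ViTs, but the output is not [B, N, C]. The reshape and transpose are dropped, because here the layer is mainly used for downsampling.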

class PatchEmbed(nn.Layer):
    """
    Patch Embedding that is implemented by a layer of conv. 
    Input: tensor in shape [B, C, H, W]
    Output: tensor in shape [B, C, H/stride, W/stride]
    """
    def __init__(self, patch_size=16, stride=16, padding=0, 
                 in_chans=3, embed_dim=768, norm_layer=None):
        super().__init__()
        patch_size = (patch_size, patch_size)
        stride = (stride, stride)
        padding = (padding, padding)
        self.proj = nn.Conv2D(in_chans, embed_dim, kernel_size=patch_size, 
                              stride=stride, padding=padding)
        self.norm = norm_layer(embed_dim) if norm_layer else Identity()

    def forward(self, x):
        x = self.proj(x)
        x = self.norm(x)
        return x
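
For contrast, this is roughly what the dropped reshape and transpose would do: turn a [C, H, W] feature map into a sequence of N = H * W tokens with C values each. The sketch uses nested lists for a single image; `to_token_sequence` is a hypothetical helper that follows the usual row-major flattening, not code from PPViT.

```python
def to_token_sequence(feature_map):
    """[C][H][W] nested lists -> [N][C] with N = H * W (flatten + transpose)."""
    C, H, W = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    return [[feature_map[c][h][w] for c in range(C)]
            for h in range(H) for w in range(W)]


fmap = [[[1, 2], [3, 4]],          # channel 0
        [[10, 20], [30, 40]]]      # channel 1  (C=2, H=2, W=2)
tokens = to_token_sequence(fmap)
print(tokens)
```

PoolFormer's PatchEmbed stops before this step and keeps the 2D layout, which is what lets pooling and 1x1 convolutions run directly on the feature map.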

Next, building the PoolFormer block

[Figure: https://ai-studio-static-online.cdn.bcebos.com/6dd99cb8057847e5b494ce3486355acaa9f1c4a2fe294f73b616a00ff9b8376c]

Norm operation

For lack of time I haven't looked into why `nn.LayerNorm` isn't used. Think about it yourselves, and see you in the comments.

class LayerNormChannel(nn.Layer):
    """
    LayerNorm only for Channel Dimension.
    Input: tensor in shape [B, C, H, W]
    """
    def __init__(self, num_channels, epsilon=1e-05):
        super().__init__()
        self.weight = paddle.create_parameter(
            shape=[num_channels],
            dtype='float32',
            default_initializer=ones_)
        self.bias = paddle.create_parameter(
            shape=[num_channels],
            dtype='float32',
            default_initializer=zeros_)
        self.epsilon = epsilon

    def forward(self, x):
        u = x.mean(1, keepdim=True)
        s = (x - u).pow(2).mean(1, keepdim=True)
        x = (x - u) / paddle.sqrt(s + self.epsilon)
        x = self.weight.unsqueeze(-1).unsqueeze(-1) * x \
            + self.bias.unsqueeze(-1).unsqueeze(-1)
        return x


class GroupNorm(nn.GroupNorm):
    """
    Group Normalization with 1 group.
    Input: tensor in shape [B, C, H, W]
    """
    def __init__(self, num_channels, **kwargs):
        super().__init__(1, num_channels, **kwargs)
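
This doesn't settle the question above, but it may help to see what statistics the two classes use. The sketch below works on one image stored as nested lists [C][H][W] and leaves out the learnable weight and bias. `layer_norm_channel` normalizes each pixel across channels only, while `group_norm_1` normalizes over C, H and W together; both names are illustrative.

```python
def mean_var(values):
    """Mean and (biased) variance of a flat list."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v


def layer_norm_channel(x, eps=1e-5):
    """Normalize every pixel over the channel axis only."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for h in range(H):
        for w in range(W):
            m, v = mean_var([x[c][h][w] for c in range(C)])
            for c in range(C):
                out[c][h][w] = (x[c][h][w] - m) / (v + eps) ** 0.5
    return out


def group_norm_1(x, eps=1e-5):
    """Normalize over all of C, H, W at once (one group)."""
    flat = [val for ch in x for row in ch for val in row]
    m, v = mean_var(flat)
    return [[[(val - m) / (v + eps) ** 0.5 for val in row] for row in ch] for ch in x]


x = [[[1.0, 10.0]], [[3.0, 30.0]]]   # C=2, H=1, W=2; the second pixel is 10x brighter
a = layer_norm_channel(x)
b = group_norm_1(x)
print(a)
print(b)
```

The per-pixel version maps both pixels to roughly -1 and +1, so the brightness difference between positions disappears. The one-group version keeps it, because both pixels share one mean and one variance.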

Pooling operation

Their highlight

class Pooling(nn.Layer):
    """
    Implementation of pooling for PoolFormer
    --kernel_size: pooling size
    """
    def __init__(self, kernel_size=3):
        super().__init__()
        self.pool = nn.AvgPool2D(
            kernel_size, stride=1, padding=kernel_size//2, exclusive=True)

    def forward(self, x):
        return self.pool(x) - x

MLP operation

Look closely: how does it differ from the one in ViT?

I call it the fake MLP

class Mlp(nn.Layer):
    """
    Implementation of MLP with 1*1 convolutions.
    Input: tensor with shape [B, C, H, W]
    """
    def __init__(self, in_features, hidden_features=None, 
                 out_features=None, act_layer=nn.GELU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Conv2D(in_features, hidden_features, 1)
        self.act = act_layer()
        self.fc2 = nn.Conv2D(hidden_features, out_features, 1)
        self.drop = nn.Dropout(drop)
        self.apply(self._init_weights)

    def _init_weights(self, m):
        if isinstance(m, nn.Conv2D):
            trunc_normal_(m.weight)
            if m.bias is not None:
                zeros_(m.bias)

    def forward(self, x):
        x = self.fc1(x)     # (B, C, H, W) --> (B, hidden, H, W)
        x = self.act(x)     
        x = self.drop(x)
        x = self.fc2(x)     # (B, hidden, H, W) --> (B, C_out, H, W)
        x = self.drop(x)
        return x
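
One way to read the "fake MLP": a 1x1 convolution applies the same fully connected layer to the channel vector at every pixel. The plain-Python sketch below checks this on a tiny example. `conv1x1` and `linear` are illustrative helpers written out by hand, not Paddle calls.

```python
def conv1x1(x, weight, bias):
    """x[c_in][h][w], weight[c_out][c_in], bias[c_out] -> out[c_out][h][w]."""
    c_in, H, W = len(x), len(x[0]), len(x[0][0])
    return [[[bias[o] + sum(weight[o][i] * x[i][h][w] for i in range(c_in))
              for w in range(W)] for h in range(H)] for o in range(len(weight))]


def linear(vec, weight, bias):
    """Plain fully connected layer on one channel vector."""
    return [bias[o] + sum(wi * vi for wi, vi in zip(weight[o], vec))
            for o in range(len(weight))]


x = [[[1.0, 2.0]], [[3.0, 4.0]]]                    # C=2, H=1, W=2
weight = [[1.0, 0.0], [0.5, 0.5], [0.0, -1.0]]      # maps 2 channels to 3
bias = [0.0, 1.0, 0.0]

y = conv1x1(x, weight, bias)
for w in range(2):
    pixel = [x[0][0][w], x[1][0][w]]
    print([y[o][0][w] for o in range(3)], linear(pixel, weight, bias))
```

Both columns of the printout agree, so the only real difference from a ViT MLP is the tensor layout: [B, C, H, W] instead of [B, N, C].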

Now let's snap it all together like LEGO

[Figure: https://ai-studio-static-online.cdn.bcebos.com/6dd99cb8057847e5b494ce3486355acaa9f1c4a2fe294f73b616a00ff9b8376c]

class PoolFormerBlock(nn.Layer):
    """
    Implementation of one PoolFormer block.
    --dim: embedding dim
    --pool_size: pooling size
    --mlp_ratio: mlp expansion ratio
    --act_layer: activation
    --norm_layer: normalization
    --drop: dropout rate
    --drop path: Stochastic Depth, 
        refer to https://arxiv.org/abs/1603.09382
    --use_layer_scale, --layer_scale_init_value: LayerScale, 
        refer to https://arxiv.org/abs/2103.17239
    """
    def __init__(self, dim, pool_size=3, mlp_ratio=4., 
                 act_layer=nn.GELU, norm_layer=GroupNorm, 
                 drop=0., drop_path=0., 
                 use_layer_scale=True, layer_scale_init_value=1e-5):

        super().__init__()

        self.norm1 = norm_layer(dim)
        self.token_mixer = Pooling(kernel_size=pool_size)   # ViTs use MSA, MLP-like models use an MLP; here pooling takes its place
        self.norm2 = norm_layer(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, 
                       act_layer=act_layer, drop=drop)

        # The following two techniques are useful to train deep PoolFormers.
        self.drop_path = DropPath(drop_path) if drop_path > 0. \
            else Identity()
        self.use_layer_scale = use_layer_scale
        if use_layer_scale:

            self.layer_scale_1 = paddle.create_parameter(
                shape=[dim],
                dtype='float32',
                default_initializer=nn.initializer.Constant(value=layer_scale_init_value))

            self.layer_scale_2 = paddle.create_parameter(
                shape=[dim],
                dtype='float32',
                default_initializer=nn.initializer.Constant(value=layer_scale_init_value))

    def forward(self, x):
        if self.use_layer_scale:
            x = x + self.drop_path(
                self.layer_scale_1.unsqueeze(-1).unsqueeze(-1)
                * self.token_mixer(self.norm1(x)))
            x = x + self.drop_path(
                self.layer_scale_2.unsqueeze(-1).unsqueeze(-1)
                * self.mlp(self.norm2(x)))
        else:
            x = x + self.drop_path(self.token_mixer(self.norm1(x)))
            x = x + self.drop_path(self.mlp(self.norm2(x)))
        return x

A word on `use_layer_scale`: it enables a learnable parameter that scales the features

[Figure: https://ai-studio-static-online.cdn.bcebos.com/ee9f8783b29649baa3c234694101431ed2b1647b439c428e87afafc55e4f9142]

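Intuitively, it guards against a branch's output accidentally becoming far larger than the original input. The hope is that every branch contributes, without one of them dominating because its scale got out of hand. In short, it is meant to improve the model's representational capacity.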
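
A tiny numeric sketch of that intuition: with the initial scale of 1e-5 used in the block, even a branch whose output is 100x its input barely moves the residual stream at the start of training. The `big_branch` function is a made-up stand-in for a token mixer or MLP, and the per-channel scale is kept as a plain list.

```python
def residual_with_layer_scale(x, branch, scale):
    """x + scale * branch(x), element-wise over a list of channel values."""
    fx = branch(x)
    return [xi + s * fi for xi, s, fi in zip(x, scale, fx)]


def big_branch(v):
    # Stand-in for a branch whose output is much larger than its input
    return [100.0 * t for t in v]


x = [1.0, -2.0, 0.5]
init = [1e-5] * len(x)                    # layer_scale_init_value from the block above
y = residual_with_layer_scale(x, big_branch, init)
no_scale = residual_with_layer_scale(x, big_branch, [1.0] * len(x))
print(y)
print(no_scale)
```

With the scale the output stays within 0.1% of the input, while without it the branch swamps the input by a factor of 100. Since the scale is learnable, training can then raise it per channel as needed.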

Stack several of the blocks above

[Figure: https://ai-studio-static-online.cdn.bcebos.com/d5577a73650b4d9ab5b648dc1f0b508461d70cda108049f081efc46d2db4a1d5]

def basic_blocks(dim, index, layers, 
                 pool_size=3, mlp_ratio=4., 
                 act_layer=nn.GELU, norm_layer=GroupNorm, 
                 drop_rate=.0, drop_path_rate=0., 
                 use_layer_scale=True, layer_scale_init_value=1e-5):
    """
    generate PoolFormer blocks for a stage
    return: PoolFormer blocks 
    """
    blocks = []
    for block_idx in range(layers[index]):
        block_dpr = drop_path_rate * (
            block_idx + sum(layers[:index])) / (sum(layers) - 1)
        blocks.append(PoolFormerBlock(
            dim, pool_size=pool_size, mlp_ratio=mlp_ratio, 
            act_layer=act_layer, norm_layer=norm_layer, 
            drop=drop_rate, drop_path=block_dpr, 
            use_layer_scale=use_layer_scale, 
            layer_scale_init_value=layer_scale_init_value, 
            ))
    blocks = nn.Sequential(*blocks)

    return blocks
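
The `block_dpr` formula in `basic_blocks` gives each block its own drop-path rate, growing linearly with depth. The sketch below lists the rates for the S12 layout. The value `drop_path_rate=0.1` is only an assumed example, and `drop_path_schedule` is a hypothetical helper that repeats the same arithmetic.

```python
def drop_path_schedule(layers, drop_path_rate):
    """Per-block drop-path rate, using the same formula as basic_blocks."""
    total = sum(layers)
    rates = []
    for index in range(len(layers)):
        for block_idx in range(layers[index]):
            global_idx = block_idx + sum(layers[:index])   # position across all stages
            rates.append(drop_path_rate * global_idx / (total - 1))
    return rates


rates = drop_path_schedule([2, 2, 6, 2], drop_path_rate=0.1)   # 0.1 is an assumed value
print([round(r, 4) for r in rates])
```

The first block is never dropped and the last one is dropped at the full rate, so shallow blocks train almost undisturbed while deep ones get the most regularization.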

def poolformer_s12(**kwargs):
    """
    PoolFormer-S12 model, Params: 12M
    --layers: [x,x,x,x], numbers of layers for the four stages
    --embed_dims, --mlp_ratios: 
        embedding dims and mlp ratios for the four stages
    --downsamples: flags to apply downsampling or not in four blocks
    """
    layers = [2, 2, 6, 2]
    embed_dims = [64, 128, 320, 512]
    mlp_ratios = [4, 4, 4, 4]
    downsamples = [True, True, True, True]
    model = PoolFormer(
        layers, embed_dims=embed_dims, 
        mlp_ratios=mlp_ratios, downsamples=downsamples, 
        **kwargs)
    return model

Here comes the key part

[Figure: https://ai-studio-static-online.cdn.bcebos.com/e7937f951bf142ebb1efbcb8756bc4f64a7de8c071dd462785a540beca393a76]

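Once you understand the part circled in red, you've got it: the soul is convolution, the architecture is ViT, and the notion of tokens is dropped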

class PoolFormer(nn.Layer):
    """
    PoolFormer, the main class of our model
    --layers: [x,x,x,x], number of blocks for the 4 stages
    --embed_dims, --mlp_ratios, --pool_size: the embedding dims, mlp ratios and 
        pooling size for the 4 stages
    --downsamples: flags to apply downsampling or not
    --norm_layer, --act_layer: define the types of normalization and activation
    --num_classes: number of classes for the image classification
    --in_patch_size, --in_stride, --in_pad: specify the patch embedding
        for the input image
    --down_patch_size --down_stride --down_pad: 
        specify the downsample (patch embed.)
    """
    def __init__(self, layers, embed_dims=None, 
                 mlp_ratios=None, downsamples=None, 
                 pool_size=3, 
                 norm_layer=GroupNorm, act_layer=nn.GELU, 
                 num_classes=1000,
                 in_patch_size=7, in_stride=4, in_pad=2, 
                 down_patch_size=3, down_stride=2, down_pad=1, 
                 drop_rate=0., drop_path_rate=0.,
                 use_layer_scale=True, layer_scale_init_value=1e-5, 
                 **kwargs):

        super().__init__()
        
        ### Define the input patch embedding; PatchEmbed gets reused several times below
        self.patch_embed = PatchEmbed(
            patch_size=in_patch_size, stride=in_stride, padding=in_pad, 
            in_chans=3, embed_dim=embed_dims[0])

        # set the main block in network
        network = []
        for i in range(len(layers)):
            stage = basic_blocks(embed_dims[i], i, layers, 
                                 pool_size=pool_size, mlp_ratio=mlp_ratios[i],
                                 act_layer=act_layer, norm_layer=norm_layer, 
                                 drop_rate=drop_rate, 
                                 drop_path_rate=drop_path_rate,
                                 use_layer_scale=use_layer_scale, 
                                 layer_scale_init_value=layer_scale_init_value)
            network.append(stage)
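            if i >= len(layers) - 1: # last stage reached, stop building; in practice there are only 4 stages
                break
            if downsamples[i] or embed_dims[i] != embed_dims[i+1]: # this is the red-circled part: PatchEmbed is applied repeatedly to shrink the feature map, just like a ConvNet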
                # downsampling between two stages
                network.append(
                    PatchEmbed(
                        patch_size=down_patch_size, stride=down_stride, 
                        padding=down_pad, 
                        in_chans=embed_dims[i], embed_dim=embed_dims[i+1]
                        )
                    )

        self.network = nn.LayerList(network)

        # Classifier head
        self.norm = norm_layer(embed_dims[-1])
        self.head = nn.Linear(
            embed_dims[-1], num_classes) if num_classes > 0 \
            else Identity()

        self.apply(self.cls_init_weights)

    # init for classification
    def cls_init_weights(self, m):
        if isinstance(m, nn.Linear):
            trunc_normal_(m.weight)
            if isinstance(m, nn.Linear) and m.bias is not None:
                zeros_(m.bias)


    def forward_embeddings(self, x):
        x = self.patch_embed(x)
        return x

    def forward_tokens(self, x):
        outs = []
        for idx, block in enumerate(self.network):
            x = block(x)
        return x

    def forward(self, x):
        # input embedding
        x = self.forward_embeddings(x)
        # through backbone
        x = self.forward_tokens(x)
        x = self.norm(x)
        cls_out = self.head(x.mean([-2, -1]))
        # for image classification
        return cls_out
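
To make the stage-by-stage downsampling concrete, the sketch below traces the feature-map shape through the network using the standard convolution output-size formula and the default kernel, stride and padding values from `PoolFormer.__init__`. It assumes a 224x224 input; `conv_out_size` and `trace_shapes` are illustrative helpers.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(embed_dims, downsamples, image_size=224):
    """Return the (channels, height, width) seen by each stage."""
    size = conv_out_size(image_size, 7, 4, 2)        # stem: in_patch_size=7, in_stride=4, in_pad=2
    shapes = []
    for i, dim in enumerate(embed_dims):
        shapes.append((dim, size, size))
        if i >= len(embed_dims) - 1:
            break
        if downsamples[i] or embed_dims[i] != embed_dims[i + 1]:
            size = conv_out_size(size, 3, 2, 1)      # down_patch_size=3, down_stride=2, down_pad=1
    return shapes


shapes = trace_shapes([64, 128, 320, 512], [True, True, True, True])
for s in shapes:
    print(s)
```

The resolution halves at every stage while the channel count grows, the same pyramid a ConvNet has. After the last stage, `x.mean([-2, -1])` averages over the remaining 7x7 grid, leaving one 512-dimensional vector per image for the linear head.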

[Figure: https://ai-studio-static-online.cdn.bcebos.com/8d440c293f934fb1a7c27d1d175b07f8a37f9bd3b9114a87b089d82ce52d9edf]

def poolformer_s12(**kwargs):
    """
    PoolFormer-S12 model, Params: 12M
    --layers: [x,x,x,x], numbers of layers for the four stages
    --embed_dims, --mlp_ratios: 
        embedding dims and mlp ratios for the four stages
    --downsamples: flags to apply downsampling or not in four blocks
    """
    layers = [2, 2, 6, 2]
    embed_dims = [64, 128, 320, 512]
    mlp_ratios = [4, 4, 4, 4]
    downsamples = [True, True, True, True]
    model = PoolFormer(
        layers, embed_dims=embed_dims, 
        mlp_ratios=mlp_ratios, downsamples=downsamples, 
        **kwargs)
    return model

def poolformer_s24(**kwargs):
    """
    PoolFormer-S24 model, Params: 21M
    """
    layers = [4, 4, 12, 4]
    embed_dims = [64, 128, 320, 512]
    mlp_ratios = [4, 4, 4, 4]
    downsamples = [True, True, True, True]
    model = PoolFormer(
        layers, embed_dims=embed_dims, 
        mlp_ratios=mlp_ratios, downsamples=downsamples, 
        **kwargs)
    return model

def poolformer_s36(**kwargs):
    """
    PoolFormer-S36 model, Params: 31M
    """
    layers = [6, 6, 18, 6]
    embed_dims = [64, 128, 320, 512]
    mlp_ratios = [4, 4, 4, 4]
    downsamples = [True, True, True, True]
    model = PoolFormer(
        layers, embed_dims=embed_dims, 
        mlp_ratios=mlp_ratios, downsamples=downsamples, 
        layer_scale_init_value=1e-6, 
        **kwargs)
    return model


def poolformer_m36(**kwargs):
    """
    PoolFormer-M36 model, Params: 56M
    """
    layers = [6, 6, 18, 6]
    embed_dims = [96, 192, 384, 768]
    mlp_ratios = [4, 4, 4, 4]
    downsamples = [True, True, True, True]
    model = PoolFormer(
        layers, embed_dims=embed_dims, 
        mlp_ratios=mlp_ratios, downsamples=downsamples, 
        layer_scale_init_value=1e-6, 
        **kwargs)
    return model


def poolformer_m48(**kwargs):
    """
    PoolFormer-M48 model, Params: 73M
    """
    layers = [8, 8, 24, 8]
    embed_dims = [96, 192, 384, 768]
    mlp_ratios = [4, 4, 4, 4]
    downsamples = [True, True, True, True]
    model = PoolFormer(
        layers, embed_dims=embed_dims, 
        mlp_ratios=mlp_ratios, downsamples=downsamples, 
        layer_scale_init_value=1e-6, 
        **kwargs)
    return model

Let's test the performance with PPMA

If you find it useful, give PPMA a star

# Install ppma
# Unpack the ImageNet dataset
! pip install ppma
! tar -xf /home/aistudio/data/data96753/ILSVRC2012_img_val.tar -C /home/aistudio/data/data96753

The accuracy matches the paper

import ppma

m = poolformer_s12()
m.set_state_dict(paddle.load('/home/aistudio/data/data118603/poolformer_s12.pdparams')) 
data_path = "/home/aistudio/data/data96753"    

ppma.imagenet.val(m, data_path, batch_size=128, image_size=224, crop_pct=0.9, normalize=0.485)

Model stats

import ppma
ppma.modelstat.flops(model=m, img_size=224, per_op=True)

Results on ImageNet-1k

Model            # Params   Top-1 Acc.   Top-5 Acc.   Crop   FLOPs
poolformer_s12   12M        0.7724       0.9351       0.9    1.8G
poolformer_s24   -          -            -            0.9    -
poolformer_s36   -          -            -            0.9    -
poolformer_m36   56M        0.8211       0.9569       0.95   8.9G
poolformer_m48   -          -            -            0.95   -
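
As a sanity check on the parameter column, the count can be estimated by hand from the layer definitions above. The sketch assumes every conv and norm has a bias, that each norm carries one weight and one bias per channel, and that downsampling is applied between all stages. The helper names are made up for illustration.

```python
def conv_params(c_in, c_out, k):
    """Weights plus bias of a k x k convolution."""
    return c_in * c_out * k * k + c_out


def block_params(dim, mlp_ratio=4):
    """Parameters of one PoolFormer block; the pooling mixer adds none."""
    hidden = int(dim * mlp_ratio)
    norms = 2 * (2 * dim)                                   # two norms, weight + bias each
    mlp = conv_params(dim, hidden, 1) + conv_params(hidden, dim, 1)
    layer_scale = 2 * dim
    return norms + mlp + layer_scale


def poolformer_params(layers, embed_dims, num_classes=1000):
    total = conv_params(3, embed_dims[0], 7)                # stem patch embedding
    for i, (n, dim) in enumerate(zip(layers, embed_dims)):
        total += n * block_params(dim)
        if i < len(layers) - 1:
            total += conv_params(dim, embed_dims[i + 1], 3)  # downsampling PatchEmbed
    total += 2 * embed_dims[-1]                             # final norm
    total += embed_dims[-1] * num_classes + num_classes     # linear head
    return total


s12 = poolformer_params([2, 2, 6, 2], [64, 128, 320, 512])
print(f"poolformer_s12: {s12 / 1e6:.2f}M parameters")
```

The estimate lands close to the 12M listed for poolformer_s12, and almost all of it sits in the 1x1-conv MLPs and the downsampling convolutions.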

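Test the other models yourselves; I'm lazy

Summary

This figure looks decent. The models it compares against are a few months old, but the performance still looks fine.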

[Figure: https://ai-studio-static-online.cdn.bcebos.com/4a25a198c1ac485a91e808e69151dd1d6bf8c8485c7645e7bcc6e18c316a8cb7]

The paper is mainly about architecture, arguing that this kind of architecture works well

[Figure: https://ai-studio-static-online.cdn.bcebos.com/1555d736ad5f4a8abc9d1a1aec5d9afcbcada921f9e64101a7f34c342a3c385f]

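How to put it... it reminds me of the MLP wave from a while back, though at least they didn't claim "Pool is all you need" [laughs]

To show a model is really good you have to look at downstream tasks. This work ran experiments on both detection and segmentation. Kudos, that's much better than some other works.

Here I only cover classification. On to the important part: the ablation.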

Ablation

The baseline is S12

[Figure: https://ai-studio-static-online.cdn.bcebos.com/bc44e7e714be4a1bb5bf69945fa2bec7399d41d62ca54a3b93bffa0055d800ca]