DFFormer in Practice: Image Classification with DFFormer (Part 1)

AI浩

Last modified: 2025-01-27 09:39:37

First published: 2025-01-26 21:30:00

This is an original article by the author. Do not repost it without the author's permission.

Article link: https://blog.csdn.net/hhhhhhhhhhwwwwwwwwww/article/details/145368717

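Abstract

Paper information

Title: FFT-based Dynamic Token Mixer for Vision

Paper link: https://arxiv.org/pdf/2303.03932

Key contribution

The paper proposes a new token mixer, the Dynamic Filter, to address the computational cost of multi-head self-attention (MHSA) on high-resolution images. The cost of MHSA grows quadratically with the number of pixels in the input feature map, which makes it slow at high resolution. By using a dynamic filter built on the fast Fourier transform (FFT), the paper shows that computational complexity can be reduced substantially while the mixing operation stays global.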

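Method

The DFFormer and CDFFormer models proposed in the paper use the dynamic filter as their token mixer. Specifically:

Dynamic filter: processes global information efficiently via the FFT, lowering computational complexity.
DFFormer: uses the dynamic filter directly for image recognition.
CDFFormer: combines convolutional layers with DFFormer to further improve performance.

Both models are designed to fit the rapidly developing MetaFormer architecture and to address the weaknesses of MHSA on high-resolution images.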
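
To make the idea of a frequency-domain token mixer concrete, here is a minimal 1D sketch in plain Python. It is an illustration only, not the paper's implementation: DFFormer works on 2D feature maps with torch.fft.rfft2 and learned complex weights, while this toy uses a naive O(n^2) DFT and a fixed, hypothetical kernel. It shows that element-wise multiplication in the frequency domain is equivalent to a circular convolution in which every output position sees every input token.

```python
import cmath


def dft(signal, inverse=False):
    """Naive discrete Fourier transform (a real model would use an FFT)."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(signal[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                    for t in range(n))
        out.append(total / n if inverse else total)
    return out


def filter_in_frequency_domain(tokens, kernel):
    """Mix tokens globally: transform, multiply element-wise, transform back."""
    spectrum = dft(tokens)
    weights = dft(kernel)  # frequency-domain filter weights
    mixed = dft([s * w for s, w in zip(spectrum, weights)], inverse=True)
    return [value.real for value in mixed]


def circular_convolution(tokens, kernel):
    """Reference result computed directly in the token domain."""
    n = len(tokens)
    return [sum(tokens[j] * kernel[(i - j) % n] for j in range(n))
            for i in range(n)]


tokens = [1.0, 2.0, 3.0, 4.0]      # toy 1D "feature map"
kernel = [0.5, 0.25, 0.0, 0.25]    # hypothetical filter, chosen for illustration
via_frequency = filter_in_frequency_domain(tokens, kernel)
direct = circular_convolution(tokens, kernel)
```

The two results agree up to floating-point error. With a true FFT the frequency-domain route costs O(N log N) in the number of tokens, compared with the quadratic cost of attention.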

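Results

DFFormer and CDFFormer perform well on several image recognition tasks, particularly with high-resolution images. CDFFormer reaches 85.0% Top-1 accuracy on ImageNet-1K, close to hybrid architectures that combine convolution and MHSA.

This article uses DFFormer for an image classification task. The model is dfformer_s18, and it reaches an accuracy (ACC) above 96% on the plant seedlings classification task.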

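After reading this article, you should have a working grasp of the following skills and topics:

Data augmentation strategies: basic augmentation with PyTorch's transforms library, plus more advanced techniques such as CutOut, MixUp and CutMix, which can noticeably improve generalization.
Training DFFormer: how to build and train DFFormer end to end, covering model definition, data loading and the training loop.
Mixed-precision training: how to use PyTorch's built-in mixed precision to speed up training and reduce memory use.
Gradient clipping: how to apply gradient clipping to guard against exploding gradients and keep training stable.
Data-parallel (DP) training: how to use PyTorch's data-parallel support on multiple GPUs to speed up training.
Visualizing training: how to plot loss and accuracy curves to monitor learning.
Evaluation and reporting: how to evaluate on the validation set and generate a detailed report including ACC and other metrics.
Writing a test script: how to write a script that predicts on the test set and assesses real-world performance.
Learning-rate scheduling: how to adjust the learning rate dynamically with cosine annealing.
Custom statistics tools: how to use an AverageMeter class or a similar tool to track ACC, loss and other metrics during training for later analysis.
ACC1 and ACC5: what Top-1 accuracy (ACC1) and Top-5 accuracy (ACC5) mean in image classification and how they are computed.
Exponential moving average (EMA): how to apply EMA during training to further improve test-set performance.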
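
Two of the items above, the AverageMeter helper and top-k accuracy, are easy to show in a few lines. The following is a minimal sketch in plain Python rather than the training script's actual code: the class layout follows the common AverageMeter pattern, and topk_accuracy is a hypothetical helper working on nested lists instead of tensors. With 1000 classes one would use k=1 and k=5; the toy data has only three classes, so k=2 stands in for Top-5.

```python
class AverageMeter:
    """Tracks the latest value and the running average of a metric."""

    def __init__(self):
        self.val = 0.0
        self.sum = 0.0
        self.count = 0
        self.avg = 0.0

    def update(self, val, n=1):
        # n is the number of samples that val was averaged over (the batch size)
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


def topk_accuracy(scores, targets, k=1):
    """Percentage of samples whose true class is among the k highest scores."""
    hits = 0
    for row, target in zip(scores, targets):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(targets)


scores = [
    [0.1, 0.7, 0.2],  # highest score: class 1
    [0.5, 0.3, 0.2],  # highest score: class 0
    [0.2, 0.3, 0.5],  # highest score: class 2
    [0.6, 0.3, 0.1],  # highest score: class 0
]
targets = [1, 1, 2, 2]
acc1 = topk_accuracy(scores, targets, k=1)
acc2 = topk_accuracy(scores, targets, k=2)

# Aggregate per-batch accuracies into an epoch-level average.
meter = AverageMeter()
meter.update(acc1, n=4)
meter.update(100.0, n=4)  # a second, hypothetical batch that was fully correct
```

Weighting each update by the batch size keeps the epoch average correct even when the last batch is smaller than the others.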

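If any of these topics is new to you or hard to follow, see my column “经典主干网络精讲与实战” (Classic Backbone Networks: Explanation and Practice), which covers all of them step by step from scratch.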

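Installing packages

Installing timm

Install it with pip:

pip install timm

Both the mixup augmentation and EMA use timm.

Data augmentation: Cutout and Mixup

To improve generalization and performance, I add two augmentations at the preprocessing stage: Cutout and Mixup. Cutout randomly masks part of an image so that the model has to learn more robust features. Mixup blends two images and their labels into a new training sample, which increases data diversity. The Cutout used here comes from torchtoolbox (Mixup comes from timm, installed above). Install it with:

pip install torchtoolbox

Cutout is applied inside the transforms pipeline.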

from torchvision import transforms
from torchtoolbox.transform import Cutout

# Data preprocessing
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    Cutout(),
    transforms.ToTensor(),
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
])
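
For intuition about what Cutout does to an image, here is a minimal sketch in plain Python. It is not torchtoolbox's implementation, which has its own parameters that this sketch does not try to reproduce. The function name, the fixed square size and the nested-list image are assumptions made for illustration.

```python
import random


def cutout(image, size, fill=0.0, rng=random):
    """Return a copy of a 2D image with one random size x size square erased."""
    height, width = len(image), len(image[0])
    # Pick the top-left corner so that the square lies fully inside the image.
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)
    erased = [row[:] for row in image]  # copy, so the input stays untouched
    for y in range(top, top + size):
        for x in range(left, left + size):
            erased[y][x] = fill
    return erased


random.seed(0)
image = [[1.0] * 8 for _ in range(8)]  # toy 8x8 single-channel image
augmented = cutout(image, size=3)
erased_pixels = sum(value == 0.0 for row in augmented for value in row)
```

Because the erased region moves on every call, the model cannot rely on any single image region, which is the source of the robustness described above.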

Import Mixup from timm, then define mixup_fn together with SoftTargetCrossEntropy:

from timm.data.mixup import Mixup
from timm.loss import SoftTargetCrossEntropy

mixup_fn = Mixup(
    mixup_alpha=0.8, cutmix_alpha=1.0, cutmix_minmax=None,
    prob=0.1, switch_prob=0.5, mode='batch',
    label_smoothing=0.1, num_classes=12)
criterion_train = SoftTargetCrossEntropy()

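Mixup is a data augmentation technique commonly used in image classification. It creates new samples and labels by linearly combining two images and their labels.
Parameters:

mixup_alpha (float): mixup alpha value; mixup is active if > 0.

cutmix_alpha (float): cutmix alpha value; cutmix is active if > 0.

cutmix_minmax (List[float]): cutmix min/max image ratio; if not None, cutmix is active and uses this instead of alpha.

If cutmix_minmax is set, cutmix_alpha defaults to 1.0.

prob (float): probability of applying mixup or cutmix per batch or element.

switch_prob (float): probability of switching to cutmix instead of mixup when both are active.

mode (str): how to apply the mixup/cutmix parameters: per 'batch', 'pair' (pair of elements) or 'elem' (element).

correct_lam (bool): apply lambda correction when the cutmix bbox is clipped by the image borders.

label_smoothing (float): label smoothing applied to the mixed target tensor.

num_classes (int): number of target classes.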
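
The arithmetic behind mixup and the soft-target loss can be sketched in plain Python. This is a simplified illustration under stated assumptions, not timm's code: it handles a single pair of samples, ignores cutmix, and all function names are hypothetical. The values alpha=0.8, label_smoothing=0.1 and num_classes=12 mirror the configuration above.

```python
import math
import random


def smooth_one_hot(label, num_classes, smoothing):
    """One-hot target with the smoothing mass spread over all classes."""
    off_value = smoothing / num_classes
    on_value = 1.0 - smoothing + off_value
    return [on_value if c == label else off_value for c in range(num_classes)]


def mixup_pair(x_a, y_a, x_b, y_b, lam):
    """Blend two samples and their soft targets with weight lam."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


def soft_target_cross_entropy(logits, target):
    """Cross entropy against a soft target: -sum(target * log_softmax(logits))."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return -sum(t * (v - log_norm) for t, v in zip(target, logits))


random.seed(0)
lam = random.betavariate(0.8, 0.8)  # mixing weight drawn from Beta(alpha, alpha)
y_a = smooth_one_hot(2, num_classes=12, smoothing=0.1)
y_b = smooth_one_hot(7, num_classes=12, smoothing=0.1)
x_mix, y_mix = mixup_pair([0.0, 1.0], y_a, [1.0, 0.0], y_b, lam)
loss = soft_target_cross_entropy([0.0] * 12, y_mix)
```

The mixed target is no longer a class index but a probability vector, which is why training uses SoftTargetCrossEntropy instead of the usual cross-entropy loss on integer labels.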

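EMA

EMA (Exponential Moving Average) is a technique for smoothing model parameters in deep learning. It maintains an exponential moving average of the parameters, which smooths the learning process and helps stability and generalization, especially late in training. A summary follows.

EMA overview

EMA is a weighted moving average in which each new average is a weighted sum of the previous average and the current value. In deep learning it is applied to model parameters to damp their rapid fluctuations during training, giving smoother and more stable model behavior.

How it works

During training, a separate copy of EMA parameters is kept alongside the current model parameters. At every training step (or every few steps), the EMA parameters are updated from the current model parameters with exponential decay. The update rule is usually:

$\text{EMA}_{\text{new}} = \text{decay} \times \text{EMA}_{\text{old}} + (1 - \text{decay}) \times \text{model\_parameters}$
Here decay is a hyperparameter between 0 and 1 that controls the weighting between the old EMA value and the new model parameters. A larger decay makes the update rely more on the old value, so the smoothing is stronger.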
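
As a small worked example of the update rule, the sketch below applies it to a two-element parameter list in plain Python. The function name and the numbers are hypothetical, and decay=0.9 is chosen only so that the effect is visible after a few steps.

```python
def ema_update(ema_params, model_params, decay):
    """Apply EMA_new = decay * EMA_old + (1 - decay) * model_parameters."""
    return [decay * e + (1.0 - decay) * p
            for e, p in zip(ema_params, model_params)]


ema = [1.0, -2.0]
# Pretend the optimizer produced these weights on three consecutive steps.
for weights in ([2.0, 0.0], [2.0, 0.0], [2.0, 0.0]):
    ema = ema_update(ema, weights, decay=0.9)
```

After three updates the EMA has moved only part of the way from [1.0, -2.0] towards [2.0, 0.0]: the remaining gap is 0.9 ** 3 = 0.729 of the original one.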

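Advantages

Stability: by smoothing parameter updates, EMA reduces fluctuations during training and makes the model more stable.
Generalization: because the EMA parameters are a smoothed version of past parameters, they tend to capture the overall trend of training, so evaluating with EMA parameters often generalizes better.
Convergence: EMA does not speed up training directly, but by stabilizing the parameters it may indirectly help the model reach a better solution sooner.

Use cases

EMA is widely used in deep learning, especially in tasks that need high stability and good generalization, such as image classification and object detection. It is particularly useful when training large models, where it can help reduce the risk of overfitting and improve performance on unseen data.

The implementation is as follows: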

import logging
from collections import OrderedDict
from copy import deepcopy
import torch
import torch.nn as nn

_logger = logging.getLogger(__name__)

class ModelEma:
    def __init__(self, model, decay=0.9999, device='', resume=''):
        # make a copy of the model for accumulating moving average of weights
        self.ema = deepcopy(model)
        self.ema.eval()
        self.decay = decay
        self.device = device  # perform ema on different device from model if set
        if device:
            self.ema.to(device=device)
        self.ema_has_module = hasattr(self.ema, 'module')
        if resume:
            self._load_checkpoint(resume)
        for p in self.ema.parameters():
            p.requires_grad_(False)

    def _load_checkpoint(self, checkpoint_path):
        checkpoint = torch.load(checkpoint_path, map_location='cpu')
        assert isinstance(checkpoint, dict)
        if 'state_dict_ema' in checkpoint:
            new_state_dict = OrderedDict()
            for k, v in checkpoint['state_dict_ema'].items():
                # ema model may have been wrapped by DataParallel, and need module prefix
                if self.ema_has_module:
                    name = 'module.' + k if not k.startswith('module') else k
                else:
                    name = k
                new_state_dict[name] = v
            self.ema.load_state_dict(new_state_dict)
            _logger.info("Loaded state_dict_ema")
        else:
            _logger.warning("Failed to find state_dict_ema, starting from loaded model weights")

    def update(self, model):
        # correct a mismatch in state dict keys
        needs_module = hasattr(model, 'module') and not self.ema_has_module
        with torch.no_grad():
            msd = model.state_dict()
            for k, ema_v in self.ema.state_dict().items():
                if needs_module:
                    k = 'module.' + k
                model_v = msd[k].detach()
                if self.device:
                    model_v = model_v.to(device=self.device)
                ema_v.copy_(ema_v * self.decay + (1. - self.decay) * model_v)

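Add it to the training script.

# Initialization
model_ema = None
if use_ema:
    model_ema = ModelEma(
        model_ft,
        decay=model_ema_decay,
        device='cpu',
        resume=resume)

# During training, update the shadow weights right after each optimizer step
def train():
    optimizer.step()
    if model_ema is not None:
        model_ema.update(model)  # model is the network being trained


# Pass the EMA model to the validation function
val(model_ema.ema, DEVICE, test_loader)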

Note that for models trained without pretrained weights, EMA often fails to improve accuracy, so keep this in mind.
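
One possible explanation, offered here as an assumption rather than a measured result, is the lag of the average. After n updates the starting weights still carry a share of decay ** n in the EMA, so with a decay of 0.9999 and a randomly initialized model the EMA stays close to poor early weights for a long time. The sketch below computes that share with hypothetical helper names.

```python
def initial_weight_share(decay, steps):
    """Fraction of the EMA still contributed by the starting weights."""
    return decay ** steps


def effective_window(decay):
    """Rough number of recent steps that dominate the average: 1 / (1 - decay)."""
    return 1.0 / (1.0 - decay)


for steps in (1_000, 10_000, 50_000):
    share = initial_weight_share(0.9999, steps)
    print(f"after {steps:>6} updates: {share:.3f} of the EMA is the initial weights")
```

If a run has only a few thousand optimizer steps in total, a smaller decay may be worth trying so that the averaging window fits inside the run.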

Project structure

DFFormer_Demo
├─data1
│  ├─Black-grass
│  ├─Charlock
│  ├─Cleavers
│  ├─Common Chickweed
│  ├─Common wheat
│  ├─Fat Hen
│  ├─Loose Silky-bent
│  ├─Maize
│  ├─Scentless Mayweed
│  ├─Shepherds Purse
│  ├─Small-flowered Cranesbill
│  └─Sugar beet
├─models
│  └─dfformer.py
├─mean_std.py
├─makedata.py
├─train.py
└─test.py

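mean_std.py: computes the mean and std values.
makedata.py: generates the dataset.
train.py: trains the DFFormer model from the models folder.
dfformer.py: taken from the official code, with some modifications.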

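Computing mean and std

In deep learning, and especially with image data, computing the mean and standard deviation (std) of the data and normalizing with them is one of the key steps for faster convergence and better performance. This section explains both statistics and how they help the model learn.

Mean

The mean is the sum of all values divided by their count. For images, the mean is usually computed separately for each color channel (the three channels of an RGB image). With a dataset of many images, this means computing the mean of all pixel values in the R channel across all images, and likewise for the G and B channels.

Standard deviation (std)

The standard deviation measures how spread out the data is, that is, how far data points deviate from the mean. For images it is also computed per color channel. A channel with a large standard deviation has widely varying pixel values, while a channel with a small one is more uniform.

Normalization

Normalization rescales data to a common scale. Min-max normalization maps values into a fixed interval such as [0, 1] or [-1, 1]. Normalization with the mean and std, which is what image pipelines usually apply, instead centers each channel at 0 with unit standard deviation:

$\text{Normalized Value} = \frac{\text{Original Value} - \text{Mean}}{\text{Std}}$

In some cases, to simplify computation and keep the data non-negative, data is scaled to [0, 1] with min-max normalization rather than with the mean and std. This article focuses on mean/std normalization because it preserves the shape of the data distribution.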
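
A short numeric check of the formula, as a plain-Python sketch with a hypothetical helper name. The first case uses the mean and std of 0.5 from the Cutout transform shown earlier. The second uses the red-channel statistics that mean_std.py reports for this dataset.

```python
def normalize(value, mean, std):
    """Channel-wise normalization: (value - mean) / std."""
    return (value - mean) / std


# With mean=0.5 and std=0.5, the [0, 1] range produced by ToTensor maps onto [-1, 1].
low = normalize(0.0, 0.5, 0.5)
high = normalize(1.0, 0.5, 0.5)

# With dataset statistics the output is centered but not limited to a fixed range.
red_mean, red_std = 0.3281186, 0.09407319
bright_pixel = normalize(1.0, red_mean, red_std)
dark_pixel = normalize(0.0, red_mean, red_std)
```

This is why mean/std normalization should not be described as mapping values into [-1, 1]: that only happens for the special choice of 0.5 for both statistics.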

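Why normalize?

Faster convergence: normalized data has a similar scale across features, which helps gradient descent find a good solution faster. Gradient updates for different features are of the same order of magnitude, which avoids slow training or vanishing/exploding gradients caused by features that are too large or too small.
Better accuracy: normalization can improve generalization, because the model learns relative relationships between features more easily instead of being dominated by their absolute magnitudes.
Stability: normalized data reduces fluctuations during training and helps the model converge more steadily.

How to compute and use mean and std

Compute global mean and std: compute them over the whole dataset, usually before training starts, and use the same values to normalize the training, validation and test sets.
Use library functions: deep learning frameworks such as PyTorch and TensorFlow provide tensor operations for computing mean and std, and transforms that apply them to normalize a dataset.
Dynamic adjustment: when the dataset is very large or continuously updated, mean and std may need to be computed on the fly, typically by updating them with a moving average (such as EMA) during training.

Normalizing with the dataset's mean and std is a basic and important preprocessing step that helps convergence, performance and stability. Create mean_std.py with the following code: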

from torchvision.datasets import ImageFolder
import torch
from torchvision import transforms

def get_mean_and_std(train_data):
    # batch_size=1 lets images of different sizes be loaded without resizing
    train_loader = torch.utils.data.DataLoader(
        train_data, batch_size=1, shuffle=False, num_workers=0,
        pin_memory=True)
    mean = torch.zeros(3)
    std = torch.zeros(3)
    for X, _ in train_loader:
        for d in range(3):
            # accumulate per-image, per-channel statistics
            mean[d] += X[:, d, :, :].mean()
            std[d] += X[:, d, :, :].std()
    # average over images; std is the mean of per-image stds, not the global std
    mean.div_(len(train_data))
    std.div_(len(train_data))
    return list(mean.numpy()), list(std.numpy())

if __name__ == '__main__':
    train_dataset = ImageFolder(root=r'data1', transform=transforms.ToTensor())
    print(get_mean_and_std(train_dataset))

Output:

([0.3281186, 0.28937867, 0.20702125], [0.09407319, 0.09732835, 0.106712654])

Record this result, because it is needed later.
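
mean_std.py averages each image's own standard deviation, which is a common shortcut but is not the same as the standard deviation over all pixels of the dataset. The plain-Python sketch below contrasts the two on a toy, single-channel example. The function names and data are hypothetical, and whether the difference matters for this dataset is not something the article measures.

```python
import math


def per_image_average(images):
    """Average each image's own mean and std (the approach in mean_std.py)."""
    means, stds = [], []
    for pixels in images:
        m = sum(pixels) / len(pixels)
        # n - 1 in the denominator matches the default of torch.Tensor.std()
        var = sum((p - m) ** 2 for p in pixels) / (len(pixels) - 1)
        means.append(m)
        stds.append(math.sqrt(var))
    return sum(means) / len(means), sum(stds) / len(stds)


def global_statistics(images):
    """Accumulate the sum and sum of squares over every pixel in the dataset."""
    count = total = total_sq = 0.0
    for pixels in images:
        count += len(pixels)
        total += sum(pixels)
        total_sq += sum(p * p for p in pixels)
    mean = total / count
    return mean, math.sqrt(total_sq / count - mean * mean)


# Two single-channel "images": one dark, one bright.
images = [[0.1, 0.1, 0.2, 0.2], [0.8, 0.8, 0.9, 0.9]]
avg_mean, avg_std = per_image_average(images)
glob_mean, glob_std = global_statistics(images)
```

The means agree here because both images have the same number of pixels, but the global std is much larger, since it also captures brightness differences between images. Either set of statistics can be used for normalization as long as training, validation and test data share the same values.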

Generating the dataset

The image classification dataset we prepared is organized like this:

data
├─Black-grass
├─Charlock
├─Cleavers
├─Common Chickweed
├─Common wheat
├─Fat Hen
├─Loose Silky-bent
├─Maize
├─Scentless Mayweed
├─Shepherds Purse
├─Small-flowered Cranesbill
└─Sugar beet

The default loaders in PyTorch and Keras expect the ImageNet folder layout, with separate train and val directories:

├─data
│  ├─val
│  │   ├─Black-grass
│  │   ├─Charlock
│  │   ├─Cleavers
│  │   ├─Common Chickweed
│  │   ├─Common wheat
│  │   ├─Fat Hen
│  │   ├─Loose Silky-bent
│  │   ├─Maize
│  │   ├─Scentless Mayweed
│  │   ├─Shepherds Purse
│  │   ├─Small-flowered Cranesbill
│  │   └─Sugar beet
│  └─train
│      ├─Black-grass
│      ├─Charlock
│      ├─Cleavers
│      ├─Common Chickweed
│      ├─Common wheat
│      ├─Fat Hen
│      ├─Loose Silky-bent
│      ├─Maize
│      ├─Scentless Mayweed
│      ├─Shepherds Purse
│      ├─Small-flowered Cranesbill
│      └─Sugar beet

Create the format conversion script makedata.py with the following code:

import glob
import os
import shutil

from sklearn.model_selection import train_test_split

image_list = glob.glob('data1/*/*.png')
print(image_list)
file_dir = 'data'
if os.path.exists(file_dir):
    print('true')
    shutil.rmtree(file_dir)  # remove the old output folder before rebuilding it
os.makedirs(file_dir)

train_files, val_files = train_test_split(image_list, test_size=0.3, random_state=42)
train_root = os.path.join(file_dir, 'train')
val_root = os.path.join(file_dir, 'val')


def copy_files(files, root):
    # Copy each image into root/<class name>/, creating the class folder if needed.
    for file in files:
        parts = file.replace("\\", "/").split('/')
        class_dir = os.path.join(root, parts[-2])
        os.makedirs(class_dir, exist_ok=True)
        shutil.copy(file, os.path.join(class_dir, parts[-1]))


copy_files(train_files, train_root)
copy_files(val_files, val_root)
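
The split above is purely random, so small classes can end up with a slightly different train/val ratio than large ones. As an optional variation, and not what makedata.py does, the sketch below splits each class separately using only the standard library. The function name and the file list are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(paths, val_fraction=0.3, seed=42):
    """Split file paths into train/val so that each class keeps the same ratio."""
    by_class = defaultdict(list)
    for path in paths:
        # The class name is the parent folder, as in data1/<class>/<file>.png.
        class_name = path.replace("\\", "/").split("/")[-2]
        by_class[class_name].append(path)

    rng = random.Random(seed)
    train, val = [], []
    for class_name in sorted(by_class):
        files = sorted(by_class[class_name])
        rng.shuffle(files)
        n_val = max(1, round(len(files) * val_fraction))
        val.extend(files[:n_val])
        train.extend(files[n_val:])
    return train, val


# Hypothetical file list standing in for glob.glob('data1/*/*.png').
paths = [f"data1/Maize/{i}.png" for i in range(10)] + \
        [f"data1/Charlock/{i}.png" for i in range(20)]
train_files, val_files = stratified_split(paths)
```

scikit-learn's train_test_split also accepts a stratify argument that achieves the same effect when given the list of class labels.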

With the steps above complete, you can start training and testing.

DFFormer code

# Copyright 2022 Garena Online Private Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
MetaFormer baselines including IdentityFormer, RandFormer, PoolFormerV2,
ConvFormer and CAFormer.
Some implementations are modified from timm (https://github.com/rwightman/pytorch-image-models).
"""
import os
from functools import partial
import torch
import torch.nn as nn
import torch.nn.functional as F
from timm.models.layers import trunc_normal_, DropPath
from timm.models.registry import register_model
from timm.data import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD
from timm.models.layers import to_2tuple


def _cfg(url='', **kwargs):
    return {
        'url': url,
        'num_classes': 1000, 'input_size': (3, 224, 224), 'pool_size': None,
        'crop_pct': 1.0, 'interpolation': 'bicubic',
        'mean': IMAGENET_DEFAULT_MEAN, 'std': IMAGENET_DEFAULT_STD,
        'classifier': 'head',
        **kwargs
    }


default_cfgs = {
    'dfformer_s18': _cfg(url="https://github.com/okojoalg/dfformer/releases/download/weights/dfformer_s18.pth"),
    'dfformer_s36': _cfg(url="https://github.com/okojoalg/dfformer/releases/download/weights/dfformer_s36.pth"),
    'dfformer_m36': _cfg(url="https://github.com/okojoalg/dfformer/releases/download/weights/dfformer_m36.pth"),
    'dfformer_b36': _cfg(url="https://github.com/okojoalg/dfformer/releases/download/weights/dfformer_b36.pth"),
    'gfformer_s18': _cfg(),
    'cdfformer_s18': _cfg(url="https://github.com/okojoalg/dfformer/releases/download/weights/cdfformer_s18.pth"),
    'cdfformer_s36': _cfg(url="https://github.com/okojoalg/dfformer/releases/download/weights/cdfformer_s36.pth"),
    'cdfformer_m36': _cfg(url="https://github.com/okojoalg/dfformer/releases/download/weights/cdfformer_m36.pth"),
    'cdfformer_b36': _cfg(url="https://github.com/okojoalg/dfformer/releases/download/weights/cdfformer_b36.pth"),
    'dfformer_s18_k2': _cfg(),
    'dfformer_s18_d8': _cfg(),
    'dfformer_s18_gelu': _cfg(),
    'dfformer_s18_relu': _cfg(),
}


class Downsampling(nn.Module):
    """
    Downsampling implemented by a layer of convolution.
    """

    def __init__(self, in_channels, out_channels,
                 kernel_size, stride=1, padding=0,
                 pre_norm=None, post_norm=None, pre_permute=False):
        super().__init__()
        self.pre_norm = pre_norm(in_channels) if pre_norm else nn.Identity()
        self.pre_permute = pre_permute
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size,
                              stride=stride, padding=padding)
        self.post_norm = post_norm(out_channels) if post_norm else nn.Identity()

    def forward(self, x):
        x = self.pre_norm(x)
        if self.pre_permute:
            # if take [B, H, W, C] as input, permute it to [B, C, H, W]
            x = x.permute(0, 3, 1, 2)
        x = self.conv(x)
        x = x.permute(0, 2, 3, 1)  # [B, C, H, W] -> [B, H, W, C]
        x = self.post_norm(x)
        return x


class Scale(nn.Module):
    """
    Scale vector by element multiplications.
    """

    def __init__(self, dim, init_value=1.0, trainable=True):
        super().__init__()
        self.scale = nn.Parameter(init_value * torch.ones(dim), requires_grad=trainable)

    def forward(self, x):
        return x * self.scale


class SquaredReLU(nn.Module):
    """
        Squared ReLU: https://arxiv.org/abs/2109.08668
    """

    def __init__(self, inplace=False):
        super().__init__()
        self.relu = nn.ReLU(inplace=inplace)

    def forward(self, x):
        return torch.square(self.relu(x))


class StarReLU(nn.Module):
    """
    StarReLU: s * relu(x) ** 2 + b
    """

    def __init__(self, scale_value=1.0, bias_value=0.0,
                 scale_learnable=True, bias_learnable=True,
                 mode=None, inplace=False):
        super().__init__()
        self.inplace = inplace
        self.relu = nn.ReLU(inplace=inplace)
        self.scale = nn.Parameter(scale_value * torch.ones(1),
                                  requires_grad=scale_learnable)
        self.bias = nn.Parameter(bias_value * torch.ones(1),
                                 requires_grad=bias_learnable)

    def forward(self, x):
        return self.scale * self.relu(x) ** 2 + self.bias


class Attention(nn.Module):
    """
    Vanilla self-attention from Transformer: https://arxiv.org/abs/1706.03762.
    Modified from timm.
    """

    def __init__(self, dim, head_dim=32, num_heads=None, qkv_bias=False,
                 attn_drop=0., proj_drop=0., proj_bias=False, **kwargs):
        super().__init__()

        self.head_dim = head_dim
        self.scale = head_dim ** -0.5

        self.num_heads = num_heads if num_heads else dim // head_dim
        if self.num_heads == 0:
            self.num_heads = 1

        self.attention_dim = self.num_heads * self.head_dim

        self.qkv = nn.Linear(dim, self.attention_dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(self.attention_dim, dim, bias=proj_bias)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x):
        B, H, W, C = x.shape
        N = H * W
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim).permute(2, 0,
                                                                                  3, 1,
                                                                                  4)
        q, k, v = qkv.unbind(0)  # make torchscript happy (cannot use tensor as tuple)

        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        x = (attn @ v).transpose(1, 2).reshape(B, H, W, self.attention_dim)
        x = self.proj(x)
        x = self.proj_drop(x)
        return x


class RandomMixing(nn.Module):
    def __init__(self, num_tokens=196, **kwargs):
        super().__init__()
        self.random_matrix = nn.parameter.Parameter(
            data=torch.softmax(torch.rand(num_tokens, num_tokens), dim=-1),
            requires_grad=False)

    def forward(self, x):
        B, H, W, C = x.shape
        x = x.reshape(B, H * W, C)
        x = torch.einsum('mn, bnc -> bmc', self.random_matrix, x)
        x = x.reshape(B, H, W, C)
        return x


class LayerNormGeneral(nn.Module):
    r""" General LayerNorm for different situations.
    Args:
        affine_shape (int, list or tuple): The shape of affine weight and bias.
            Usually the affine_shape=C, but in some implementation, like torch.nn.LayerNorm,
            the affine_shape is the same as normalized_dim by default. 
            To adapt to different situations, we offer this argument here.
        normalized_dim (tuple or list): Which dims to compute mean and variance. 
        scale (bool): Flag indicates whether to use scale or not.
        bias (bool): Flag indicates whether to use bias or not.
        We give several examples to show how to specify the arguments.
        LayerNorm (https://arxiv.org/abs/1607.06450):
            For input shape of (B, *, C) like (B, N, C) or (B, H, W, C),
                affine_shape=C, normalized_dim=(-1, ), scale=True, bias=True;
            For input shape of (B, C, H, W),
                affine_shape=(C, 1, 1), normalized_dim=(1, ), scale=True, bias=True.
        Modified LayerNorm (https://arxiv.org/abs/2111.11418)
            that is identical to partial(torch.nn.GroupNorm, num_groups=1):
            For input shape of (B, N, C),
                affine_shape=C, normalized_dim=(1, 2), scale=True, bias=True;
            For input shape of (B, H, W, C),
                affine_shape=C, normalized_dim=(1, 2, 3), scale=True, bias=True;
            For input shape of (B, C, H, W),
                affine_shape=(C, 1, 1), normalized_dim=(1, 2, 3), scale=True, bias=True.
        For the several metaformer baselines,
            IdentityFormer, RandFormer and PoolFormerV2 utilize Modified LayerNorm without bias (bias=False);
            ConvFormer and CAFormer utilizes LayerNorm without bias (bias=False).
    """

    def __init__(self, affine_shape=None, normalized_dim=(-1,), scale=True,
                 bias=True, eps=1e-5):
        super().__init__()
        self.normalized_dim = normalized_dim
        self.use_scale = scale
        self.use_bias = bias
        self.weight = nn.Parameter(torch.ones(affine_shape)) if scale else None
        self.bias = nn.Parameter(torch.zeros(affine_shape)) if bias else None
        self.eps = eps

    def forward(self, x):
        c = x - x.mean(self.normalized_dim, keepdim=True)
        s = c.pow(2).mean(self.normalized_dim, keepdim=True)
        x = c / torch.sqrt(s + self.eps)
        if self.use_scale:
            x = x * self.weight
        if self.use_bias:
            x = x + self.bias
        return x


class GlobalFilter(nn.Module):
    def __init__(self, dim, expansion_ratio=2,
                 act1_layer=StarReLU, act2_layer=nn.Identity,
                 bias=False, size=14,
                 **kwargs, ):
        super().__init__()
        size = to_2tuple(size)
        self.size = size[0]
        self.filter_size = size[1] // 2 + 1
        self.dim = dim
        self.med_channels = int(expansion_ratio * dim)
        self.pwconv1 = nn.Linear(dim, self.med_channels, bias=bias)
        self.act1 = act1_layer()
        self.complex_weights = nn.Parameter(
            torch.randn(self.size, self.filter_size, self.med_channels, 2,
                        dtype=torch.float32) * 0.02)
        self.act2 = act2_layer()
        self.pwconv2 = nn.Linear(self.med_channels, dim, bias=bias)

    def forward(self, x):
        B, H, W, _ = x.shape
        x = self.pwconv1(x)
        x = self.act1(x)
        x = x.to(torch.float32)
        x = torch.fft.rfft2(x, dim=(1, 2), norm='ortho')
        complex_weights = torch.view_as_complex(self.complex_weights)
        x = x * complex_weights
        x = torch.fft.irfft2(x, s=(H, W), dim=(1, 2), norm='ortho')
        x = self.act2(x)
        x = self.pwconv2(x)
        return x


class DynamicFilter(nn.Module):
    def __init__(self, dim, expansion_ratio=2, reweight_expansion_ratio=.25,
                 act1_layer=StarReLU, act2_layer=nn.Identity,
                 bias=False, num_filters=4, size=14, weight_resize=False,
                 **kwargs):
        super().__init__()
        size = to_2tuple(size)
        self.size = size[0]
        self.filter_size = size[1] // 2 + 1
        self.num_filters = num_filters
        self.dim = dim
        self.med_channels = int(expansion_ratio * dim)
        self.weight_resize = weight_resize
        self.pwconv1 = nn.Linear(dim, self.med_channels, bias=bias)
        self.act1 = act1_layer()
        self.reweight = Mlp(dim, reweight_expansion_ratio, num_filters * self.med_channels)
        self.complex_weights = nn.Parameter(
            torch.randn(self.size, self.filter_size, num_filters, 2,
                        dtype=torch.float32) * 0.02)
        self.act2 = act2_layer()
        self.pwconv2 = nn.Linear(self.med_channels, dim, bias=bias)

    def forward(self, x):
        B, H, W, _ = x.shape

        routeing = self.reweight(x.mean(dim=(1, 2))).view(B, self.num_filters,
                                                          -1).softmax(dim=1)
        x = self.pwconv1(x)
        x = self.act1(x)
        x = x.to(torch.float32)
        x = torch.fft.rfft2(x, dim=(1, 2), norm='ortho')

        if self.weight_resize:
            complex_weights = resize_complex_weight(self.complex_weights, x.shape[1],
                                                    x.shape[2])
            complex_weights = torch.view_as_complex(complex_weights.contiguous())
        else:
            complex_weights = torch.view_as_complex(self.complex_weights)
        routeing = routeing.to(torch.complex64)
        weight = torch.einsum('bfc,hwf->bhwc', routeing, complex_weights)
        if self.weight_resize:
            weight = weight.view(-1, x.shape[1], x.shape[2], self.med_channels)
        else:
            weight = weight.view(-1, self.size, self.filter_size, self.med_channels)
        x = x * weight
        x = torch.fft.irfft2(x, s=(H, W), dim=(1, 2), norm='ortho')

        x = self.act2(x)
        x = self.pwconv2(x)
        return x


class SepConv(nn.Module):
    r"""
    Inverted separable convolution from MobileNetV2: https://arxiv.org/abs/1801.04381.
    """

    def __init__(self, dim, expansion_ratio=2,
                 act1_layer=StarReLU, act2_layer=nn.Identity,
                 bias=False, kernel_size=7, padding=3,
                 **kwargs, ):
        super().__init__()
        med_channels = int(expansion_ratio * dim)
        self.pwconv1 = nn.Linear(dim, med_channels, bias=bias)
        self.act1 = act1_layer()
        self.dwconv = nn.Conv2d(
            med_channels, med_channels, kernel_size=kernel_size,
            padding=padding, groups=med_channels, bias=bias)  # depthwise conv
        self.act2 = act2_layer()
        self.pwconv2 = nn.Linear(med_channels, dim, bias=bias)

    def forward(self, x):
        x = self.pwconv1(x)
        x = self.act1(x)
        x = x.permute(0, 3, 1, 2)
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)
        x = self.act2(x)
        x = self.pwconv2(x)
        return x


# ref https://github.com/NVlabs/AFNO-transformer/blob/master/afno/afno2d.py
class AFNO2D(nn.Module):
    def __init__(self, dim, expansion_ratio=2,
                 act1_layer=StarReLU, act2_layer=nn.Identity,
                 bias=False, size=14,
                 num_blocks=8, sparsity_threshold=0.01,
                 hard_thresholding_fraction=1, hidden_size_factor=1,
                 **kwargs):
        super().__init__()
        size = to_2tuple(size)
        self.size = size[0]
        self.filter_size = size[1] // 2 + 1
        self.dim = dim
        self.med_channels = int(expansion_ratio * dim)

        assert self.med_channels % num_blocks == 0, f"hidden_size {self.med_channels} should be divisible by num_blocks {num_blocks}"

        self.sparsity_threshold = sparsity_threshold
        self.num_blocks = num_blocks
        self.block_size = self.med_channels // self.num_blocks
        self.hard_thresholding_fraction = hard_thresholding_fraction
        self.hidden_size_factor = hidden_size_factor
        self.scale = 0.02

        self.pwconv1 = nn.Linear(dim, self.med_channels, bias=bias)
        self.act1 = act1_layer()
        self.w1 = nn.Parameter(self.scale * torch.randn(2, self.num_blocks, self.block_size, self.block_size * self.hidden_size_factor))
        self.b1 = nn.Parameter(self.scale * torch.randn(2, self.num_blocks, self.block_size * self.hidden_size_factor))
        self.w2 = nn.Parameter(self.scale * torch.randn(2, self.num_blocks, self.block_size * self.hidden_size_factor, self.block_size))
        self.b2 = nn.Parameter(self.scale * torch.randn(2, self.num_blocks, self.block_size))
        self.act2 = act2_layer()
        self.pwconv2 = nn.Linear(self.med_channels, dim, bias=bias)

    def forward(self, x):
        x = self.pwconv1(x)
        x = self.act1(x)
        x = x.to(torch.float32)
        bias = x
        B, H, W, C = x.shape
        x = torch.fft.rfft2(x, dim=(1, 2), norm='ortho')
        x = x.reshape(B, self.size, self.filter_size, self.num_blocks, self.block_size)

        o1_real = torch.zeros([B, self.size, self.filter_size, self.num_blocks, self.block_size * self.hidden_size_factor], device=x.device)
        o1_imag = torch.zeros([B, self.size, self.filter_size, self.num_blocks, self.block_size * self.hidden_size_factor], device=x.device)
        o2_real = torch.zeros(x.shape, device=x.device)
        o2_imag = torch.zeros(x.shape, device=x.device)

        kept_modes = int(self.filter_size * self.hard_thresholding_fraction)

        o1_real[:, :, :kept_modes] = F.relu(
            torch.einsum('...bi,bio->...bo', x[:, :, :kept_modes].real, self.w1[0]) - \
            torch.einsum('...bi,bio->...bo', x[:, :, :kept_modes].imag, self.w1[1]) + \
            self.b1[0]
        )

        o1_imag[:, :, :kept_modes] = F.relu(
            torch.einsum('...bi,bio->...bo', x[:, :, :kept_modes].imag, self.w1[0]) + \
            torch.einsum('...bi,bio->...bo', x[:, :, :kept_modes].real, self.w1[1]) + \
            self.b1[1]
        )

        o2_real[:, :, :kept_modes] = (
            torch.einsum('...bi,bio->...bo', o1_real[:, :, :kept_modes], self.w2[0]) - \
            torch.einsum('...bi,bio->...bo', o1_imag[:, :, :kept_modes], self.w2[1]) + \
            self.b2[0]
        )

        o2_imag[:, :, :kept_modes] = (
            torch.einsum('...bi,bio->...bo', o1_imag[:, :, :kept_modes], self.w2[0]) + \
            torch.einsum('...bi,bio->...bo', o1_real[:, :, :kept_modes], self.w2[1]) + \
            self.b2[1]
        )

        x = torch.stack([o2_real, o2_imag], dim=-1)
        x = F.softshrink(x, lambd=self.sparsity_threshold)
        x = torch.view_as_complex(x)
        x = x.reshape(B, self.size, self.filter_size, C)
        x = torch.fft.irfft2(x, s=(H, W), dim=(1, 2), norm='ortho')
        x = x + bias
        x = self.act2(x)
        x = self.pwconv2(x)
        return x


class Pooling(nn.Module):
    """
    Implementation of pooling for PoolFormer: https://arxiv.org/abs/2111.11418
    Modified for [B, H, W, C] input
    """

    def __init__(self, pool_size=3, **kwargs):
        super().__init__()
        self.pool = nn.AvgPool2d(
            pool_size, stride=1, padding=pool_size // 2, count_include_pad=False)

    def forward(self, x):
        y = x.permute(0, 3, 1, 2)
        y = self.pool(y)
        y = y.permute(0, 2, 3, 1)
        return y - x


class Mlp(nn.Module):
    """ MLP as used in MetaFormer models, eg Transformer, MLP-Mixer, PoolFormer, MetaFormer baselines and related networks.
    Mostly copied from timm.
    """

    def __init__(self, dim, mlp_ratio=4, out_features=None, act_layer=StarReLU, drop=0.,
                 bias=False, **kwargs):
        super().__init__()
        in_features = dim
        out_features = out_features or in_features
        hidden_features = int(mlp_ratio * in_features)
        drop_probs = to_2tuple(drop)

        self.fc1 = nn.Linear(in_features, hidden_features, bias=bias)
        self.act = act_layer()
        self.drop1 = nn.Dropout(drop_probs[0])
        self.fc2 = nn.Linear(hidden_features, out_features, bias=bias)
        self.drop2 = nn.Dropout(drop_probs[1])

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop1(x)
        x = self.fc2(x)
        x = self.drop2(x)
        return x


class MlpHead(nn.Module):
    """ MLP classification head
    """

    def __init__(self, dim, num_classes=1000, mlp_ratio=4, act_layer=SquaredReLU,
                 norm_layer=nn.LayerNorm, head_dropout=0., bias=True):
        super().__init__()
        hidden_features = int(mlp_ratio * dim)
        self.fc1 = nn.Linear(dim, hidden_features, bias=bias)
        self.act = act_layer()
        self.norm = norm_layer(hidden_features)
        self.fc2 = nn.Linear(hidden_features, num_classes, bias=bias)
        self.head_dropout = nn.Dropout(head_dropout)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.norm(x)
        x = self.head_dropout(x)
        x = self.fc2(x)
        return x


class MetaFormerBlock(nn.Module):
    """
    Implementation of one MetaFormer block.
    """

    def __init__(self, dim,
                 token_mixer=nn.Identity, mlp=Mlp,
                 norm_layer=nn.LayerNorm,
                 drop=0., drop_path=0.,
                 layer_scale_init_value=None, res_scale_init_value=None,
                 size=14,
                 ):
        super().__init__()

        self.norm1 = norm_layer(dim)
        self.token_mixer = token_mixer(dim=dim, drop=drop, size=size)
        self.drop_path1 = DropPath(drop_path) if drop_path > 0. else nn.Identity()
        self.layer_scale1 = Scale(dim=dim, init_value=layer_scale_init_value) \
            if layer_scale_init_value else nn.Identity()
        self.res_scale1 = Scale(dim=dim, init_value=res_scale_init_value) \
            if res_scale_init_value else nn.Identity()

        self.norm2 = norm_layer(dim)
        self.mlp = mlp(dim=dim, drop=drop)
        self.drop_path2 = DropPath(drop_path) if drop_path > 0. else nn.Identity()
        self.layer_scale2 = Scale(dim=dim, init_value=layer_scale_init_value) \
            if layer_scale_init_value else nn.Identity()
        self.res_scale2 = Scale(dim=dim, init_value=res_scale_init_value) \
            if res_scale_init_value else nn.Identity()

    def forward(self, x):
        x = self.res_scale1(x) + \
            self.layer_scale1(
                self.drop_path1(
                    self.token_mixer(self.norm1(x))
                )
            )
        x = self.res_scale2(x) + \
            self.layer_scale2(
                self.drop_path2(
                    self.mlp(self.norm2(x))
                )
            )
        return x


r"""
The downsampling (stem) before the first stage is a conv layer with k7, s4 and p2.
The downsampling before each of the last 3 stages is a conv layer with k3, s2 and p1.
DOWNSAMPLE_LAYERS_FOUR_STAGES format: [Downsampling, Downsampling, Downsampling, Downsampling]
Use `partial` to specify some arguments.
"""
DOWNSAMPLE_LAYERS_FOUR_STAGES = [partial(Downsampling,
                                         kernel_size=7, stride=4, padding=2,
                                         post_norm=partial(LayerNormGeneral, bias=False,
                                                           eps=1e-6)
                                         )] + \
                                [partial(Downsampling,
                                         kernel_size=3, stride=2, padding=1,
                                         pre_norm=partial(LayerNormGeneral, bias=False,
                                                          eps=1e-6), pre_permute=True
                                         )] * 3


class MetaFormer(nn.Module):
    r""" MetaFormer
        A PyTorch impl of : `MetaFormer Baselines for Vision`  -
          https://arxiv.org/abs/2210.13452
    Args:
        in_chans (int): Number of input image channels. Default: 3.
        num_classes (int): Number of classes for classification head. Default: 1000.
        depths (list or tuple): Number of blocks at each stage. Default: [2, 2, 6, 2].
        dims (list or tuple): Feature dimension at each stage. Default: [64, 128, 320, 512].
        downsample_layers (list or tuple): Downsampling layers before each stage.
        token_mixers (list, tuple or token_fcn): Token mixer for each stage. Default: nn.Identity.
        mlps (list, tuple or mlp_fcn): Mlp for each stage. Default: Mlp.
        norm_layers (list, tuple or norm_fcn): Norm layers for each stage. Default: partial(LayerNormGeneral, eps=1e-6, bias=False).
        drop_path_rate (float): Stochastic depth rate. Default: 0.
        head_dropout (float): Dropout for MLP classifier. Default: 0.
        layer_scale_init_values (list, tuple, float or None): Init value for Layer Scale. Default: None.
            None means layer scale is not used. From: https://arxiv.org/abs/2103.17239.
        res_scale_init_values (list, tuple, float or None): Init value for Res Scale. Default: [None, None, 1.0, 1.0].
            None means res scale is not used. From: https://arxiv.org/abs/2110.09456.
        fork_feat (bool): Whether to output the features of the 4 stages, for dense prediction. Default: False.
        output_norm: Norm before classifier head. Default: partial(nn.LayerNorm, eps=1e-6).
        head_fn: Classification head. Default: nn.Linear.
        input_size (tuple): Input shape (C, H, W), used to derive the feature-map size passed to
            each stage's token mixer. Default: (3, 224, 224).
    """

    def __init__(self, in_chans=3, num_classes=1000,
                 depths=[2, 2, 6, 2],
                 dims=[64, 128, 320, 512],
                 downsample_layers=DOWNSAMPLE_LAYERS_FOUR_STAGES,
                 token_mixers=nn.Identity,
                 mlps=Mlp,
                 norm_layers=partial(LayerNormGeneral, eps=1e-6, bias=False),
                 drop_path_rate=0.,
                 head_dropout=0.0,
                 layer_scale_init_values=None,
                 res_scale_init_values=[None, None, 1.0, 1.0],
                 fork_feat=False,
                 output_norm=partial(nn.LayerNorm, eps=1e-6),
                 head_fn=nn.Linear,
                 input_size=(3, 224, 224),
                 **kwargs,
                 ):
        super().__init__()
        if not fork_feat:
            self.num_classes = num_classes
        self.fork_feat = fork_feat

        if not isinstance(depths, (list, tuple)):
            depths = [depths]  # it means the model has only one stage
        if not isinstance(dims, (list, tuple)):
            dims = [dims]

        num_stage = len(depths)
        self.num_stage = num_stage

        if not isinstance(downsample_layers, (list, tuple)):
            downsample_layers = [downsample_layers] * num_stage
        down_dims = [in_chans] + dims
        self.downsample_layers = nn.ModuleList(
            [downsample_layers[i](down_dims[i], down_dims[i + 1]) for i in
             range(num_stage)]
        )

        if not isinstance(token_mixers, (list, tuple)):
            token_mixers = [token_mixers] * num_stage

        if not isinstance(mlps, (list, tuple)):
            mlps = [mlps] * num_stage

        if not isinstance(norm_layers, (list, tuple)):
            norm_layers = [norm_layers] * num_stage

        dp_rates = [x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))]

        if not isinstance(layer_scale_init_values, (list, tuple)):
            layer_scale_init_values = [layer_scale_init_values] * num_stage
        if not isinstance(res_scale_init_values, (list, tuple)):
            res_scale_init_values = [res_scale_init_values] * num_stage

        self.stages = nn.ModuleList()  # each stage consists of multiple metaformer blocks
        cur = 0
        for i in range(num_stage):
            stage = nn.Sequential(
                *[MetaFormerBlock(dim=dims[i],
                                  token_mixer=token_mixers[i],
                                  mlp=mlps[i],
                                  norm_layer=norm_layers[i],
                                  drop_path=dp_rates[cur + j],
                                  layer_scale_init_value=layer_scale_init_values[i],
                                  res_scale_init_value=res_scale_init_values[i],
                                  size=(input_size[1] // (2 ** (i + 2)),
                                        input_size[2] // (2 ** (i + 2))),
                                  ) for j in range(depths[i])]
            )
            self.stages.append(stage)
            cur += depths[i]

        if self.fork_feat:
            # add a norm layer for each output
            for i in range(4):
                if i == 0 and os.environ.get('FORK_LAST3', None):
                    # TODO: more elegant way
                    """For RetinaNet, `start_level=1`. The first norm layer will not be used.
                    cmd: `FORK_LAST3=1 python -m torch.distributed.launch ...`
                    """
                    layer = nn.Identity()
                else:
                    layer = output_norm(dims[i])
                layer_name = f'norm{i}'
                self.add_module(layer_name, layer)
        else:
            self.norm = output_norm(dims[-1])

            if head_dropout > 0.0:
                self.head = head_fn(dims[-1], num_classes, head_dropout=head_dropout)
            else:
                self.head = head_fn(dims[-1], num_classes)

        self.apply(self._init_weights)

    def _init_weights(self, m):
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            trunc_normal_(m.weight, std=.02)
            if m.bias is not None:
                nn.init.constant_(m.bias, 0)

    @torch.jit.ignore
    def no_weight_decay(self):
        return {'norm'}

    def forward_features(self, x):
        outs = []
        for i in range(self.num_stage):
            x = self.downsample_layers[i](x)
            x = self.stages[i](x)
            if self.fork_feat:
                norm_layer = getattr(self, f'norm{i}')
                x_out = norm_layer(x)
                outs.append(x_out.permute(0, 3, 1, 2))  # (B, H, W, C) -> (B, C, H, W)
        if self.fork_feat:
            # output the features of four stages for dense prediction
            return outs
        return self.norm(x.mean([1, 2]))  # (B, H, W, C) -> (B, C)

    def forward(self, x):
        x = self.forward_features(x)
        if self.fork_feat:
            # output features of four stages for dense prediction
            return x
        x = self.head(x)
        return x
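

# --- Illustrative sketch (not part of the original model file) ---
# The helper below mirrors, in plain Python, how MetaFormer.__init__ above derives
# each stage's feature-map size from `input_size` and spreads the stochastic-depth
# rate linearly over all blocks. `sketch_stage_plan` is a hypothetical name
# introduced for illustration only; nothing in the model calls it.
def sketch_stage_plan(depths=(3, 3, 9, 3), drop_path_rate=0.1, input_size=(3, 224, 224)):
    total = sum(depths)
    # Approximates torch.linspace(0, drop_path_rate, total) without needing torch.
    step = drop_path_rate / (total - 1) if total > 1 else 0.0
    dp_rates = [step * n for n in range(total)]
    plan, cur = [], 0
    for i, depth in enumerate(depths):
        # The stem downsamples by 4 and each later stage by 2, hence 2 ** (i + 2).
        size = (input_size[1] // 2 ** (i + 2), input_size[2] // 2 ** (i + 2))
        plan.append({'stage': i, 'size': size, 'drop_path': dp_rates[cur:cur + depth]})
        cur += depth
    return plan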


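def resize_complex_weight(origin_weight, new_h, new_w):
    """Bicubically resize a complex filter stored as a real tensor of shape
    (h, w, num_heads, 2) to (new_h, new_w, num_heads, 2)."""
    h, w, num_heads = origin_weight.shape[0:3]  # origin_weight: (h, w, num_heads, 2)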
    origin_weight = origin_weight.reshape(1, h, w, num_heads * 2).permute(0, 3, 1, 2)
    new_weight = torch.nn.functional.interpolate(
        origin_weight,
        size=(new_h, new_w),
        mode='bicubic',
        align_corners=True
    ).permute(0, 2, 3, 1).reshape(new_h, new_w, num_heads, 2)
    return new_weight


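def load_weights(model, input_size):
    """Load pretrained weights from model.default_cfg['url'], resizing every
    `complex_weights` tensor to the filter size implied by input_size (C, H, W)."""
    out_dict = {}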
    state_dict = torch.hub.load_state_dict_from_url(
        url=model.default_cfg['url'], map_location="cpu", check_hash=True)
    for k, v in state_dict.items():
        if 'complex_weights' in k:
            if 'stages.0' in k:
                size = input_size[1] // 4
                filter_size = input_size[2] // 8 + 1
            elif 'stages.1' in k:
                size = input_size[1] // 8
                filter_size = input_size[2] // 16 + 1
            elif 'stages.2' in k:
                size = input_size[1] // 16
                filter_size = input_size[2] // 32 + 1
            elif 'stages.3' in k:
                size = input_size[1] // 32
                filter_size = input_size[2] // 64 + 1
            v = resize_complex_weight(v, size, filter_size)
        out_dict[k] = v
    model.load_state_dict(out_dict, strict=False)
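

# --- Illustrative sketch (not part of the original model file) ---
# load_weights above resizes every `complex_weights` tensor to (size, filter_size).
# The hypothetical helper below only tabulates those two numbers per stage so the
# arithmetic is easy to check. The "+ 1" matches the width of a one-sided real FFT
# (assumption: the token mixer uses torch.fft.rfft2, which keeps W // 2 + 1 columns).
def sketch_filter_shapes(input_size=(3, 224, 224), num_stages=4):
    shapes = []
    for i in range(num_stages):
        stride = 2 ** (i + 2)  # cumulative downsampling: 4, 8, 16, 32
        size = input_size[1] // stride  # feature-map height at stage i
        filter_size = input_size[2] // (2 * stride) + 1  # one-sided spectrum width
        shapes.append((size, filter_size))
    return shapes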


@register_model
def dfformer_s18(pretrained=False, **kwargs):
    default_cfg = default_cfgs['dfformer_s18']
    model = MetaFormer(
        depths=[3, 3, 9, 3],
        dims=[64, 128, 320, 512],
        token_mixers=DynamicFilter,
        head_fn=MlpHead,
        input_size=default_cfg['input_size'],
        **kwargs)
    model.default_cfg = default_cfg
    if pretrained:
        load_weights(model, default_cfg['input_size'])
    return model


@register_model
def dfformer_s36(pretrained=False, **kwargs):
    default_cfg = default_cfgs['dfformer_s36']
    model = MetaFormer(
        depths=[3, 12, 18, 3],
        dims=[64, 128, 320, 512],
        token_mixers=DynamicFilter,
        head_fn=MlpHead,
        input_size=default_cfg['input_size'],
        **kwargs)
    model.default_cfg = default_cfg
    if pretrained:
        load_weights(model, default_cfg['input_size'])
    return model


@register_model
def dfformer_m36(pretrained=False, **kwargs):
    default_cfg = default_cfgs['dfformer_m36']
    model = MetaFormer(
        depths=[3, 12, 18, 3],
        dims=[96, 192, 384, 576],
        token_mixers=DynamicFilter,
        head_fn=MlpHead,
        input_size=default_cfg['input_size'],
        **kwargs)
    model.default_cfg = default_cfg
    if pretrained:
        load_weights(model, default_cfg['input_size'])
    return model


@register_model
def dfformer_b36(pretrained=False, **kwargs):
    default_cfg = default_cfgs['dfformer_b36']
    model = MetaFormer(
        depths=[3, 12, 18, 3],
        dims=[128, 256, 512, 768],
        token_mixers=DynamicFilter,
        head_fn=MlpHead,
        input_size=default_cfg['input_size'],
        **kwargs)
    model.default_cfg = default_cfg
    if pretrained:
        load_weights(model, default_cfg['input_size'])
    return model


@register_model
def gfformer_s18(pretrained=False, **kwargs):
    default_cfg = default_cfgs['gfformer_s18']
    model = MetaFormer(
        depths=[3, 3, 9, 3],
        dims=[64, 128, 320, 512],
        token_mixers=GlobalFilter,
        head_fn=MlpHead,
        input_size=default_cfg['input_size'],
        **kwargs)
    model.default_cfg = default_cfg
    if pretrained:
        load_weights(model, default_cfg['input_size'])
    return model


@register_model
def cdfformer_s18(pretrained=False, **kwargs):
    default_cfg = default_cfgs['cdfformer_s18']
    model = MetaFormer(
        depths=[3, 3, 9, 3],
        dims=[64, 128, 320, 512],
        token_mixers=[SepConv, SepConv, DynamicFilter, DynamicFilter],
        head_fn=MlpHead,
        input_size=default_cfg['input_size'],
        **kwargs)
    model.default_cfg = default_cfg
    if pretrained:
        load_weights(model, default_cfg['input_size'])
    return model


@register_model
def cdfformer_s36(pretrained=False, **kwargs):
    default_cfg = default_cfgs['cdfformer_s36']
    model = MetaFormer(
        depths=[3, 12, 18, 3],
        dims=[64, 128, 320, 512],
        token_mixers=[SepConv, SepConv, DynamicFilter, DynamicFilter],
        head_fn=MlpHead,
        input_size=default_cfg['input_size'],
        **kwargs)
    model.default_cfg = default_cfg
    if pretrained:
        load_weights(model, default_cfg['input_size'])
    return model


@register_model
def cdfformer_m36(pretrained=False, **kwargs):
    default_cfg = default_cfgs['cdfformer_m36']
    model = MetaFormer(
        depths=[3, 12, 18, 3],
        dims=[96, 192, 384, 576],
        token_mixers=[SepConv, SepConv, DynamicFilter, DynamicFilter],
        head_fn=MlpHead,
        input_size=default_cfg['input_size'],
        **kwargs)
    model.default_cfg = default_cfg
    if pretrained:
        load_weights(model, default_cfg['input_size'])
    return model


@register_model
def cdfformer_b36(pretrained=False, **kwargs):
    default_cfg = default_cfgs['cdfformer_b36']
    model = MetaFormer(
        depths=[3, 12, 18, 3],
        dims=[128, 256, 512, 768],
        token_mixers=[SepConv, SepConv, DynamicFilter, DynamicFilter],
        head_fn=MlpHead,
        input_size=default_cfg['input_size'],
        **kwargs)
    model.default_cfg = default_cfg
    if pretrained:
        load_weights(model, default_cfg['input_size'])
    return model


# Ablation variants of dfformer_s18 (activation, number of filters, reweight ratio, AFNO token mixer)


@register_model
def dfformer_s18_gelu(pretrained=False, **kwargs):
    default_cfg = default_cfgs['dfformer_s18_gelu']
    model = MetaFormer(
        depths=[3, 3, 9, 3],
        dims=[64, 128, 320, 512],
        mlps=partial(Mlp, act_layer=nn.GELU),
        token_mixers=partial(DynamicFilter, act1_layer=nn.GELU),
        head_fn=MlpHead,
        input_size=default_cfg['input_size'],
        **kwargs)
    model.default_cfg = default_cfg
    if pretrained:
        load_weights(model, default_cfg['input_size'])
    return model


@register_model
def dfformer_s18_relu(pretrained=False, **kwargs):
    default_cfg = default_cfgs['dfformer_s18_relu']
    model = MetaFormer(
        depths=[3, 3, 9, 3],
        dims=[64, 128, 320, 512],
        mlps=partial(Mlp, act_layer=nn.ReLU),
        token_mixers=partial(DynamicFilter, act1_layer=nn.ReLU),
        head_fn=MlpHead,
        input_size=default_cfg['input_size'],
        **kwargs)
    model.default_cfg = default_cfg
    if pretrained:
        load_weights(model, default_cfg['input_size'])
    return model


@register_model
def dfformer_s18_k2(pretrained=False, **kwargs):
    default_cfg = default_cfgs['dfformer_s18_k2']
    model = MetaFormer(
        depths=[3, 3, 9, 3],
        dims=[64, 128, 320, 512],
        token_mixers=partial(DynamicFilter, num_filters=2),
        head_fn=MlpHead,
        input_size=default_cfg['input_size'],
        **kwargs)
    model.default_cfg = default_cfg
    if pretrained:
        load_weights(model, default_cfg['input_size'])
    return model


@register_model
def dfformer_s18_d8(pretrained=False, **kwargs):
    default_cfg = default_cfgs['dfformer_s18_d8']
    model = MetaFormer(
        depths=[3, 3, 9, 3],
        dims=[64, 128, 320, 512],
        token_mixers=partial(DynamicFilter, reweight_expansion_ratio=.125),
        head_fn=MlpHead,
        input_size=default_cfg['input_size'],
        **kwargs)
    model.default_cfg = default_cfg
    if pretrained:
        load_weights(model, default_cfg['input_size'])
    return model


@register_model
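def dfformer_s18_afno(pretrained=False, **kwargs):
    # NOTE: no dedicated config is referenced here; this AFNO2D variant reuses the
    # 'dfformer_s18_k2' entry, so pretrained=True would fetch that entry's weights.
    default_cfg = default_cfgs['dfformer_s18_k2']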
    model = MetaFormer(
        depths=[3, 3, 9, 3],
        dims=[64, 128, 320, 512],
        token_mixers=AFNO2D,
        head_fn=MlpHead,
        input_size=default_cfg['input_size'],
        **kwargs)
    model.default_cfg = default_cfg
    if pretrained:
        load_weights(model, default_cfg['input_size'])
    return model