Separable Convolutions

AI算法网奇

Category: Deep Learning Basics

Article link: https://blog.csdn.net/jacke121/article/details/105858529

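Grouped convolution

A grouped convolution whose number of groups equals the number of input channels is equivalent to a depthwise convolution.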

https://zhuanlan.zhihu.com/p/65377955
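The equivalence between a grouped convolution and a depthwise convolution can be checked numerically. The following is a minimal sketch, assuming PyTorch and an arbitrary 12x12 RGB input; the variable names are made up for illustration. It compares `nn.Conv2d` with `groups` equal to the number of input channels against a per-channel convolution done by hand with the same weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 3, 12, 12)  # one 3-channel 12x12 input

# Grouped convolution with groups == in_channels: weight shape is (3, 1, 5, 5)
grouped = nn.Conv2d(3, 3, kernel_size=5, groups=3, bias=False)
y_grouped = grouped(x)

# Depthwise convolution done by hand: each channel is convolved with its own kernel
per_channel = [
    F.conv2d(x[:, c:c + 1], grouped.weight[c:c + 1])  # shape (1, 1, 8, 8)
    for c in range(3)
]
y_manual = torch.cat(per_channel, dim=1)

same = torch.allclose(y_grouped, y_manual, atol=1e-6)
print(tuple(y_grouped.shape), same)
```

Both paths use exactly the same 3 kernels, so the outputs should agree up to floating-point error.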

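Original article: https://baijiahao.baidu.com/s?id=1634399239921135758&wfr=spider&for=pc

Separable convolutions come in two kinds:

Spatial separable convolutions
Depthwise separable convolutions

Part 1 - depthwise convolution

Part 2 - pointwise convolution, i.e. a 1x1 convolution

Standard convolution:

If you don't know how a convolution works from a 2D point of view, please read this article or check out this site.

However, a typical image is not 2D; it has depth as well as width and height. Let's assume we have an input image of 12x12x3 pixels, i.e. an RGB image of size 12x12.

Let's do a 5x5 convolution on the image with no padding and a stride of 1. If we only consider the width and height of the image, the convolution looks like this: 12x12 - (5x5) -> 8x8. The 5x5 kernel is multiplied element-wise with each 25-pixel patch and the products are summed, giving 1 number per position. We end up with an 8x8 pixel image because there is no padding (12 - 5 + 1 = 8).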
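The 12 - 5 + 1 = 8 arithmetic is a special case of the usual output-size formula for a convolution. Here is a small sketch of that formula; the helper name `conv_output_size` is invented for this example and does not come from any library.

```python
def conv_output_size(size: int, kernel: int, padding: int = 0, stride: int = 1) -> int:
    """Output length along one spatial dimension of a convolution."""
    return (size + 2 * padding - kernel) // stride + 1

print(conv_output_size(12, 5))                        # 8, the case in the text
print(conv_output_size(12, 5, padding=2))             # 12, padding keeps the size
print(conv_output_size(224, 3, padding=1, stride=2))  # 112, a typical stem layer
```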

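However, because the image has 3 channels, our convolution kernel needs 3 channels too. That means that each time the kernel moves we actually perform 5x5x3 = 75 multiplications instead of 5x5 = 25.

Just as in the 2D case, each position of the kernel yields 1 number, now computed from 75 values instead of 25. After going through a 5x5x3 kernel, the 12x12x3 image becomes an 8x8x1 image.

Figure 4: A standard convolution with an 8x8x1 output

What if we want to increase the number of channels in the output image? What if we want an output of size 8x8x256?

Well, we can create 256 kernels to produce 256 8x8x1 images, then stack them together to get an 8x8x256 output.

Figure 5: A standard convolution with an 8x8x256 output

This is how a standard convolution works. I like to think of it as a function: 12x12x3 - (5x5x3x256) -> 8x8x256 (where 5x5x3x256 gives the kernel's height, width, number of input channels, and number of output channels). Note that this is not matrix multiplication: we are not multiplying the whole image by the kernel, but moving the kernel over every part of the image and multiplying it with one small patch at a time.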

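A depthwise separable convolution can be split into two parts: a depthwise convolution and a pointwise convolution.

Part 1 - Depthwise convolution:

In the first part, the depthwise convolution, we convolve the input image without changing its depth. We do this with 3 kernels of shape 5x5x1.

Video 1: Iterating 3 kernels through a 3-channel image:

https://www.youtube.com/watch?v=D_VJoaSew7Q

Figure 6: Depthwise convolution, using 3 kernels to turn a 12x12x3 image into an 8x8x3 image

Each 5x5x1 kernel iterates over one channel of the image (note: one channel, not all of them), taking the scalar product of every 25-pixel patch and producing an 8x8x1 image. Stacking these images gives an 8x8x3 image.

Part 2 - Pointwise convolution:

Remember, the original convolution turned a 12x12x3 image into an 8x8x256 image. So far the depthwise convolution has turned the 12x12x3 image into an 8x8x3 image. Now we need to increase the number of channels.

The pointwise convolution gets its name from its 1x1 kernel, a kernel that iterates over every single point. This kernel is as deep as the input has channels; in our case, 3. So we iterate a 1x1x3 kernel over our 8x8x3 image to get an 8x8x1 image.

Figure 7: Pointwise convolution, turning a 3-channel image into a 1-channel image

We can create 256 1x1x3 kernels that each output an 8x8x1 image, giving a final image of shape 8x8x256.

Figure 8: Pointwise convolution with 256 kernels, outputting an image with 256 channels

And that's it! We have split the convolution into two parts: a depthwise convolution and a pointwise convolution. More abstractly, if the original convolution is 12x12x3 - (5x5x3x256) -> 8x8x256, the new one can be written as 12x12x3 - (3 kernels of 5x5x1) -> 8x8x3 - (1x1x3x256) -> 8x8x256.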
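The two-step pipeline can be reproduced in a few lines to check the shapes. This is a sketch assuming PyTorch, which stores images as (batch, channels, height, width), so 12x12x3 becomes (1, 3, 12, 12); the helper `n_params` is a name introduced here for illustration.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 12, 12)  # PyTorch layout: (batch, channels, height, width)

# Part 1: depthwise, 3 kernels of 5x5x1, one per input channel
depthwise = nn.Conv2d(3, 3, kernel_size=5, groups=3, bias=False)
# Part 2: pointwise, 256 kernels of 1x1x3
pointwise = nn.Conv2d(3, 256, kernel_size=1, bias=False)
# Reference: standard convolution, 256 kernels of 5x5x3
standard = nn.Conv2d(3, 256, kernel_size=5, bias=False)

mid = depthwise(x)
out = pointwise(mid)

def n_params(module: nn.Module) -> int:
    """Total number of learnable weights in a module."""
    return sum(p.numel() for p in module.parameters())

print(tuple(mid.shape))  # (1, 3, 8, 8)
print(tuple(out.shape))  # (1, 256, 8, 8)
print(n_params(depthwise) + n_params(pointwise), n_params(standard))  # 843 19200
```

The separable pair reaches the same output shape as the standard layer with 843 weights instead of 19,200.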

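OK, but what is the point of a depthwise separable convolution?

Let's count the multiplications the computer has to do in the original convolution. There are 256 5x5x3 kernels that each move 8x8 times. That is 256x3x5x5x8x8 = 1,228,800 multiplications.

What about the separable convolution? In the depthwise convolution, we have 3 5x5x1 kernels that move 8x8 times. That is 3x5x5x8x8 = 4,800 multiplications. In the pointwise convolution, we have 256 1x1x3 kernels that move 8x8 times. That is 256x1x1x3x8x8 = 49,152 multiplications. Added together, that is 53,952 multiplications.

53,952 is a lot less than 1,228,800. With less computation, the network can process more data in less time.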
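The multiplication counts can be recomputed for any layer size. The sketch below uses two helper functions whose names are invented for this example; it only counts multiplications and ignores additions and bias terms.

```python
def standard_conv_mults(k: int, c_in: int, c_out: int, out_h: int, out_w: int) -> int:
    """Multiplications in a standard k x k convolution."""
    return k * k * c_in * c_out * out_h * out_w

def separable_conv_mults(k: int, c_in: int, c_out: int, out_h: int, out_w: int) -> int:
    """Multiplications in a depthwise (k x k) plus pointwise (1x1) convolution."""
    depthwise = k * k * c_in * out_h * out_w
    pointwise = c_in * c_out * out_h * out_w
    return depthwise + pointwise

standard = standard_conv_mults(5, 3, 256, 8, 8)
separable = separable_conv_mults(5, 3, 256, 8, 8)
print(standard, separable, round(standard / separable, 1))  # 1228800 53952 22.8
```

In general the ratio of separable to standard cost works out to 1/c_out + 1/k^2, which is why the saving grows with the kernel size and the number of output channels.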

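But how does this work? The first time I came across this explanation, it didn't really make intuitive sense to me. Aren't the two convolutions doing the same thing? In both cases we pass the image through a 5x5 kernel, shrink it down to one channel, then expand it to 256 channels. How can one be more than 20 times faster than the other?

After thinking about it for a while, I realized that the main difference is this: in the normal convolution, we transform the image 256 times, and every transformation uses up 5x5x3x8x8 = 4,800 multiplications. In the separable convolution, we only really transform the image once, in the depthwise convolution. Then we take the transformed image and simply lengthen it to 256 channels. Because we don't have to transform the image over and over again, we save computation.

It is worth noting that both Keras and Tensorflow have a parameter called the "depth multiplier". It is set to 1 by default. By changing it, we can change the number of output channels of the depthwise convolution. For example, with a depth multiplier of 2, each input channel gets 2 5x5x1 kernels and so produces an 8x8x2 output, making the total (stacked) output of the depthwise convolution 8x8x6 instead of 8x8x3. Some people choose to set the depth multiplier manually to increase the number of parameters in the network so that it can learn more features.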
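PyTorch's `nn.Conv2d` has no argument called depth multiplier, but the same effect can be had by making `out_channels` a multiple of `in_channels` while keeping `groups=in_channels`. The sketch below assumes that approach; the variable `depth_multiplier` is just a local name mirroring the Keras/TensorFlow parameter.

```python
import torch
import torch.nn as nn

in_channels = 3
depth_multiplier = 2  # hypothetical local name, mirroring the Keras/TensorFlow parameter

depthwise = nn.Conv2d(
    in_channels,
    in_channels * depth_multiplier,  # 2 output channels per input channel
    kernel_size=5,
    groups=in_channels,
    bias=False,
)

x = torch.randn(1, in_channels, 12, 12)
y = depthwise(x)
print(tuple(y.shape))                 # (1, 6, 8, 8)
print(tuple(depthwise.weight.shape))  # (6, 1, 5, 5)
```

The weight shape shows 6 kernels that each still see a single input channel, which is the 8x8x6 stacked output described above.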

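Are there any drawbacks to depthwise separable convolutions? Definitely! Because they reduce the number of parameters in a convolution, if your network is already small you may end up with too few parameters, and the network may fail to learn properly during training. Used well, however, they improve efficiency without significantly reducing effectiveness, which makes them a very popular choice.

1x1 kernels:

Finally, since pointwise convolutions use this concept, I want to talk about the uses of a 1x1 kernel.

A 1x1 kernel, or rather n 1x1xm kernels, where n is the number of output channels and m is the number of input channels, can be used outside separable convolutions too. One obvious purpose of a 1x1 kernel is to increase or decrease the depth of an image. If you find that your convolution has too many or too few channels, a 1x1 kernel can help balance it out.

For me, however, the main purpose of a 1x1 kernel is to apply non-linearity. After every layer of a neural network we can apply an activation layer. Whether it is ReLU, PReLU, Softmax, or something else, an activation layer is non-linear, unlike a convolution layer. A linear combination of lines is still a line. Non-linear layers expand what the model can represent, which is what generally makes a "deep" network better than a "wide" one. To increase the number of non-linear layers without significantly increasing the parameters and computation, we can apply a 1x1 kernel and add an activation layer after it. This gives the network one more layer of depth.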
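Both uses of a 1x1 kernel can be seen in a short sketch, assuming PyTorch and channel counts picked arbitrarily for illustration (m = 256 in, n = 64 out). It checks that the 1x1 convolution is the same n x m linear map applied to the channel vector at every pixel, and then follows it with a ReLU to add one cheap non-linear layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
m, n = 256, 64  # input and output channel counts, picked for illustration
x = torch.randn(1, m, 8, 8)

conv1x1 = nn.Conv2d(m, n, kernel_size=1, bias=False)
y = conv1x1(x)  # depth reduced from 256 to 64, spatial size unchanged

# Same result as applying one (n x m) matrix to the channel vector at every pixel
w = conv1x1.weight.view(n, m)
y_matmul = torch.einsum("oc,bchw->bohw", w, x)
same = torch.allclose(y, y_matmul, atol=1e-4)

# Adding an activation after it gives one more non-linear layer at low cost
block = nn.Sequential(conv1x1, nn.ReLU())
z = block(x)
print(tuple(y.shape), same, bool((z >= 0).all()))
```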

Related links and references for this article are available at 可分离卷积基础介绍 (an introduction to separable convolutions):

https://ai.yanxishe.com/page/TextTranslation/1639

A plain-language explanation

https://blog.csdn.net/qq_27825451/article/details/102457264

Reference: https://www.lizenghai.com/archives/49372.html

PyTorch implementation of the MobileNet V1 network structure

import torch.nn as nn
class MobileNet_V1(nn.Module):
    def __init__(self):
        super(MobileNet_V1, self).__init__()
        # Network definition
        self.model = nn.Sequential(
            self.conv_bn(3, 32, 2),
            self.conv_dw(32, 64, 1),
            self.conv_dw(64, 128, 2),
            self.conv_dw(128, 128, 1),
            self.conv_dw(128, 256, 2),
            self.conv_dw(256, 256, 1),
            self.conv_dw(256, 512, 2),
            self.conv_dw(512, 512, 1),
            self.conv_dw(512, 512, 1),
            self.conv_dw(512, 512, 1),
            self.conv_dw(512, 512, 1),
            self.conv_dw(512, 512, 1),
            self.conv_dw(512, 1024, 2),
            self.conv_dw(1024, 1024, 1),
            nn.AvgPool2d(7)
        )
        self.fc = nn.Linear(1024, 1000)
    def forward(self, input):
        output = self.model(input)
        output = output.view(input.size(0), -1)
        output = self.fc(output)
        return output
    # Standard convolution
    def conv_bn(self, in_channel, out_channel, stride):
        return nn.Sequential(
            nn.Conv2d(in_channel, out_channel, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channel),
            nn.ReLU(inplace=True)
        )
    # Depthwise separable convolution
    def conv_dw(self, in_channel, out_channel, stride):
        return nn.Sequential(
            # Depthwise convolution
            nn.Conv2d(in_channel, in_channel, kernel_size=3, stride=stride, padding=1, groups=in_channel,
                      bias=False),
            nn.BatchNorm2d(in_channel),
            nn.ReLU6(inplace=True),
            # Pointwise convolution
            nn.Conv2d(in_channel, out_channel, kernel_size=1, stride=1, padding=0, bias=False),
            nn.BatchNorm2d(out_channel),
            nn.ReLU6(inplace=True)
        )

MobileNet V2

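Innovations: 1. Linear Bottlenecks. 2. Inverted Residuals. Both are stated plainly right in the title of the paper :)
I. Linear Bottlenecks:
Remove the non-linear activation that V1 applied after the pointwise convolution.

Linear Bottlenecks

Reason: depthwise separable convolution does cut the computation dramatically, and an NxN depthwise + 1x1 pointwise structure can come close to an NxN standard convolution in accuracy. In practice, though, the depthwise kernels turn out to be easy to ruin during training: after training, quite a few of the depthwise kernels are empty (the situation shown in the figure below).
Because the input has so few channels, values below 0 occur easily. Passing them through a non-linear activation creates dead nodes whose output is 0, and the kernel is wasted: the gradient of ReLU is 0 wherever its output is 0, so once a unit is stuck at 0 it cannot recover. The problem is amplified further in fixed-point, low-precision training. The activation after the pointwise convolution is therefore removed to reduce the damage ReLU does to the features.

The problem with depthwise convolution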
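The "cannot recover" argument comes down to the gradient of ReLU being zero for negative inputs. A tiny autograd sketch, assuming PyTorch and made-up pre-activation values, contrasts that with a linear output of the kind a linear bottleneck keeps.

```python
import torch

# Pre-activations of one unit over a small batch, all negative (assumed values)
pre = torch.tensor([-2.0, -0.5, -1.3], requires_grad=True)
torch.relu(pre).sum().backward()
relu_grad = pre.grad.clone()

# Same values through a linear (identity) output, as in a linear bottleneck
pre_lin = pre.detach().clone().requires_grad_(True)
pre_lin.sum().backward()
linear_grad = pre_lin.grad.clone()

print(relu_grad.tolist(), linear_grad.tolist())  # [0.0, 0.0, 0.0] [1.0, 1.0, 1.0]
```

With ReLU no gradient reaches the weights that produced these values, so they cannot be updated; without the activation the gradient passes through unchanged.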

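II. Inverted Residuals: this borrows from residual networks. ResNet reduces the dimension first and then expands it, whereas MobileNet V2 does the opposite, expanding first and then reducing, hence the name Inverted Residuals.

An inverted residual has two benefits: 1. Feature reuse. 2. Inside the side-branch block, a 1x1 convolution first expands the channels, followed by a depthwise conv and ReLU; widening the input to ReLU mitigates the degradation of features.
Combining the two innovations: finally, the two are put together and compared with V1.

Comparison of V1 and V2

PyTorch implementation of the MobileNet V2 network structure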

import torch.nn as nn
class ResidualBlock(nn.Module):
    def __init__(self, in_channel, out_channel, stride, expand):
        super(ResidualBlock, self).__init__()
        self.stride = stride
        # Pointwise convolution that expands the channels
        self.conv_pw1 = nn.Sequential(
            nn.Conv2d(in_channel, in_channel * expand, kernel_size=1, stride=1, bias=False),
            nn.BatchNorm2d(in_channel * expand),
            nn.ReLU6(inplace=True)
        )
        # Depthwise convolution
        self.conv_dw = nn.Sequential(
            nn.Conv2d(in_channel * expand, in_channel * expand, kernel_size=3, stride=stride, padding=1,
                      groups=in_channel * expand, bias=False),
            nn.BatchNorm2d(in_channel * expand),
            nn.ReLU6(inplace=True)
        )
        # Pointwise convolution that reduces the channels (linear, no activation)
        self.conv_pw2 = nn.Sequential(
            nn.Conv2d(in_channel * expand, out_channel, kernel_size=1, stride=1, bias=False),
            nn.BatchNorm2d(out_channel)
        )
        # Identity shortcut, used only when input and output shapes match
        self.use_shortcut = self.stride == 1 and in_channel == out_channel
    def forward(self, input):
        output = self.conv_pw1(input)
        output = self.conv_dw(output)
        output = self.conv_pw2(output)
        if self.use_shortcut:
            output = output + input
        return output
class MobileNet_v2(nn.Module):
    def __init__(self, num_classes=10):
        super(MobileNet_v2, self).__init__()
        self.module_list = nn.ModuleList()
        self.module_list.add_module('stem', self.conv_bn(3, 32, 2))
        self.in_channels = 32
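        self.layers = [1, 2, 3, 4, 3, 3, 1]  # number of times each stage is repeated
        self.strides = [1, 2, 2, 2, 1, 2, 1]  # stride of the first block in each stage
        self.expand = [1, 6, 6, 6, 6, 6, 6]  # expansion factor applied to the input channels
        self.out_channel = [16, 24, 32, 64, 96, 160, 320]  # output channels of each stage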
        for index in range(len(self.layers)):
            self.module_list.add_module('bottleneck{}'.format(index),
                                        self.make_layer(ResidualBlock, self.out_channel[index], self.layers[index],
                                                        self.strides[index], self.expand[index]))
        self.module_list.add_module('conv1', nn.Sequential(
            nn.Conv2d(self.in_channels, 1280, kernel_size=1, stride=1, bias=False),
            nn.BatchNorm2d(1280),
            nn.ReLU(inplace=True)
        ))
        self.module_list.add_module('avgpool', nn.Sequential(
            nn.AvgPool2d(kernel_size=7)
        ))
        self.module_list.add_module('linear', nn.Sequential(
            nn.Linear(1280, num_classes)
        ))
    def forward(self, input):
        output = input
        for index, module in enumerate(self.module_list):
            if index == len(self.module_list) - 1:
                output = output.view(input.size(0), -1)
            output = module(output)
        return output
    # Standard convolution
    def conv_bn(self, in_channel, out_channel, stride):
        return nn.Sequential(
            nn.Conv2d(in_channel, out_channel, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channel),
            nn.ReLU(inplace=True)
        )
    def make_layer(self, block, out_channel, blocks, stride, expand):
        layers = []
        model = block(self.in_channels, out_channel, stride, expand)
        layers.append(model)
        for num in range(1, blocks):
            model = block(out_channel, out_channel, stride=1, expand=expand)
            layers.append(model)
        self.in_channels = out_channel
        return nn.Sequential(*layers)

MobileNet V3

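Innovations: 1. A lightweight attention module based on the squeeze-and-excitation structure, which assigns a weight to each channel of the features coming out of the depthwise convolution (see https://www.jianshu.com/p/40ee2e9c9530 for details). 2. A new activation function, h-swish. 3. Changes to the tail of the network to speed up computation.
I. Squeeze-and-excitation attention:

Comparison of MobileNet V2 and V3

II. The h-swish activation function:
The authors of the swish paper argue that swish is unbounded above, bounded below, smooth, and non-monotonic, and found that it outperforms ReLU on deep models. However, computing a sigmoid is not friendly to mobile devices, so the MobileNet V3 authors looked for a function with similar values to replace swish, and h-swish was the result. The figure below shows that h-swish and swish differ very little, and h-swish contains no sigmoid, which makes it cheaper to compute on mobile devices.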

Formulas for swish and h-swish

Values of swish and h-swish
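Since the formula and value figures are not reproduced here, the two functions can be written out directly: swish(x) = x * sigmoid(x) and h-swish(x) = x * ReLU6(x + 3) / 6, the latter matching the `HardSwish` module in the code further down. The following pure-Python sketch compares them on a grid; the range of -8 to 8 and the step of 0.01 are arbitrary choices for illustration.

```python
import math

def swish(x: float) -> float:
    return x / (1.0 + math.exp(-x))  # x * sigmoid(x)

def relu6(x: float) -> float:
    return min(max(x, 0.0), 6.0)

def h_swish(x: float) -> float:
    return x * relu6(x + 3.0) / 6.0  # piecewise, no exponential needed

# Largest gap between the two functions on a grid from -8 to 8
xs = [i / 100.0 for i in range(-800, 801)]
max_gap = max(abs(swish(x) - h_swish(x)) for x in xs)
print(swish(1.0), h_swish(1.0))
print(round(max_gap, 3))
```

On this grid the two curves stay within 0.2 of each other, and h-swish needs only a clamp, an add, and a multiply.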

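The authors also note that the cost of applying a non-linearity drops as the network gets deeper, and they found that most of the benefit of swish comes from using it in the deeper layers. So the V3 architecture only uses h-swish (HS) in the second half of the model.

MobileNet V3-Large network structure

III. Changes to the tail of the network:
The original design builds the last stage with a 1x1 convolution that expands to a higher-dimensional feature space. This gives the prediction richer features to work with, but it also adds computation and latency. To keep the high-dimensional features while reducing latency, some of the final layers are removed, and global average pooling is applied first so that the remaining layers run on a much smaller feature map.

Changes to the tail of the network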
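To get a feel for why pooling first helps, one can count the multiplications of a 1x1 convolution before and after pooling. The sizes below are assumptions taken from the `large_last_stage` code in this article (a 960 to 1280 channel 1x1 convolution, a 7x7 map before pooling and 1x1 after), and `conv1x1_mults` is a helper name made up for this sketch.

```python
def conv1x1_mults(c_in: int, c_out: int, h: int, w: int) -> int:
    """Multiplications in a 1x1 convolution over an h x w feature map."""
    return c_in * c_out * h * w

# Assumed sizes: 960 -> 1280 channels, 7x7 map before pooling, 1x1 after
before_pool = conv1x1_mults(960, 1280, 7, 7)
after_pool = conv1x1_mults(960, 1280, 1, 1)
print(before_pool, after_pool, before_pool // after_pool)  # 60211200 1228800 49
```

Running the same layer after global average pooling costs 49 times fewer multiplications, one for each of the 7x7 positions that pooling collapses.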

PyTorch implementation of the MobileNet V3 network structure

import torch.nn as nn
class HardSwish(nn.Module):
    def __init__(self, inplace=True):
        super(HardSwish, self).__init__()
        self.relu6 = nn.ReLU6(inplace)
    def forward(self, x):
        return x * self.relu6(x + 3) / 6
# Depthwise convolution
def DwBNActivation(in_channels, out_channels, kernel_size, stride, activate):
    return nn.Sequential(
        nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=kernel_size, stride=stride,
                  padding=(kernel_size - 1) // 2, groups=in_channels),
        nn.BatchNorm2d(out_channels),
        nn.ReLU6(inplace=True) if activate == 'relu' else HardSwish()
    )
# Pointwise convolution
def PwBNActivation(in_channels, out_channels, activate):
    return nn.Sequential(
        nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=1, stride=1),
        nn.BatchNorm2d(out_channels),
        nn.ReLU6(inplace=True) if activate == 'relu' else HardSwish()
    )
def Conv1x1BN(in_channels, out_channels):
    return nn.Sequential(
        nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=1, stride=1),
        nn.BatchNorm2d(out_channels)
    )
# SEblock
class SqueezeAndExcite(nn.Module):
    def __init__(self, in_channels, out_channels, se_kernel_size, divide=4):
        super(SqueezeAndExcite, self).__init__()
        mid_channels = in_channels // divide
        self.pool = nn.AvgPool2d(kernel_size=se_kernel_size, stride=1)
        self.SEblock = nn.Sequential(
            nn.Linear(in_features=in_channels, out_features=mid_channels),
            nn.ReLU6(inplace=True),
            nn.Linear(in_features=mid_channels, out_features=out_channels),
            # Channel gate in [0, 1]: hard-sigmoid, i.e. relu6(x + 3) / 6
            nn.Hardsigmoid(inplace=True),
        )
    def forward(self, x):
        b, c, h, w = x.size()
        out = self.pool(x)
        out = out.view(b, -1)
        out = self.SEblock(out)
        out = out.view(b, c, 1, 1)
        return out * x
# 1. Pointwise convolution to expand. 2. Depthwise convolution. 3. SE block. 4. Pointwise convolution to reduce. 5. Shortcut (if stride is 1)
class SEInvertedBottleneck(nn.Module):
    def __init__(self, in_channels, mid_channels, out_channels, kernel_size, stride, activate, use_se,
                 se_kernel_size=1):
        super(SEInvertedBottleneck, self).__init__()
        self.stride = stride
        self.use_se = use_se
        # mid_channels = (in_channels * expansion_factor)
        self.conv = PwBNActivation(in_channels, mid_channels, activate)
        self.depth_conv = DwBNActivation(mid_channels, mid_channels, kernel_size, stride, activate)
        if self.use_se:
            self.SEblock = SqueezeAndExcite(mid_channels, mid_channels, se_kernel_size)
        self.point_conv = PwBNActivation(mid_channels, out_channels, activate)
        if self.stride == 1:
            self.shortcut = Conv1x1BN(in_channels, out_channels)
    def forward(self, x):
        out = self.depth_conv(self.conv(x))
        if self.use_se:
            out = self.SEblock(out)
        out = self.point_conv(out)
        out = (out + self.shortcut(x)) if self.stride == 1 else out
        return out
class MobileNetV3(nn.Module):
    def __init__(self, num_classes=1000, type='large'):
        super(MobileNetV3, self).__init__()
        self.type = type
        self.first_conv = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(16),
            HardSwish(inplace=True),
        )
        if type == 'large':
            self.large_bottleneck = nn.Sequential(
                SEInvertedBottleneck(in_channels=16, mid_channels=16, out_channels=16, kernel_size=3, stride=1,
                                     activate='relu', use_se=False),
                SEInvertedBottleneck(in_channels=16, mid_channels=64, out_channels=24, kernel_size=3, stride=2,
                                     activate='relu', use_se=False),
                SEInvertedBottleneck(in_channels=24, mid_channels=72, out_channels=24, kernel_size=3, stride=1,
                                     activate='relu', use_se=False),
                SEInvertedBottleneck(in_channels=24, mid_channels=72, out_channels=40, kernel_size=5, stride=2,
                                     activate='relu', use_se=True, se_kernel_size=28),
                SEInvertedBottleneck(in_channels=40, mid_channels=120, out_channels=40, kernel_size=5, stride=1,
                                     activate='relu', use_se=True, se_kernel_size=28),
                SEInvertedBottleneck(in_channels=40, mid_channels=120, out_channels=40, kernel_size=5, stride=1,
                                     activate='relu', use_se=True, se_kernel_size=28),
                SEInvertedBottleneck(in_channels=40, mid_channels=240, out_channels=80, kernel_size=3, stride=2,
                                     activate='hswish', use_se=False),
                SEInvertedBottleneck(in_channels=80, mid_channels=200, out_channels=80, kernel_size=3, stride=1,
                                     activate='hswish', use_se=False),
                SEInvertedBottleneck(in_channels=80, mid_channels=184, out_channels=80, kernel_size=3, stride=1,
                                     activate='hswish', use_se=False),
                SEInvertedBottleneck(in_channels=80, mid_channels=184, out_channels=80, kernel_size=3, stride=1,
                                     activate='hswish', use_se=False),
                SEInvertedBottleneck(in_channels=80, mid_channels=480, out_channels=112, kernel_size=3, stride=1,
                                     activate='hswish', use_se=True, se_kernel_size=14),
                SEInvertedBottleneck(in_channels=112, mid_channels=672, out_channels=112, kernel_size=3, stride=1,
                                     activate='hswish', use_se=True, se_kernel_size=14),
                SEInvertedBottleneck(in_channels=112, mid_channels=672, out_channels=160, kernel_size=5, stride=2,
                                     activate='hswish', use_se=True, se_kernel_size=7),
                SEInvertedBottleneck(in_channels=160, mid_channels=960, out_channels=160, kernel_size=5, stride=1,
                                     activate='hswish', use_se=True, se_kernel_size=7),
                SEInvertedBottleneck(in_channels=160, mid_channels=960, out_channels=160, kernel_size=5, stride=1,
                                     activate='hswish', use_se=True, se_kernel_size=7),
            )
            self.large_last_stage = nn.Sequential(
                nn.Conv2d(in_channels=160, out_channels=960, kernel_size=1, stride=1),
                nn.BatchNorm2d(960),
                HardSwish(inplace=True),
                nn.AvgPool2d(kernel_size=7, stride=1),
                nn.Conv2d(in_channels=960, out_channels=1280, kernel_size=1, stride=1),
                HardSwish(inplace=True),
            )
        else:
            self.small_bottleneck = nn.Sequential(
                SEInvertedBottleneck(in_channels=16, mid_channels=16, out_channels=16, kernel_size=3, stride=2,
                                     activate='relu', use_se=True, se_kernel_size=56),
                SEInvertedBottleneck(in_channels=16, mid_channels=72, out_channels=24, kernel_size=3, stride=2,
                                     activate='relu', use_se=False),
                SEInvertedBottleneck(in_channels=24, mid_channels=88, out_channels=24, kernel_size=3, stride=1,
                                     activate='relu', use_se=False),
                SEInvertedBottleneck(in_channels=24, mid_channels=96, out_channels=40, kernel_size=5, stride=2,
                                     activate='hswish', use_se=True, se_kernel_size=14),
                SEInvertedBottleneck(in_channels=40, mid_channels=240, out_channels=40, kernel_size=5, stride=1,
                                     activate='hswish', use_se=True, se_kernel_size=14),
                SEInvertedBottleneck(in_channels=40, mid_channels=240, out_channels=40, kernel_size=5, stride=1,
                                     activate='hswish', use_se=True, se_kernel_size=14),
                SEInvertedBottleneck(in_channels=40, mid_channels=120, out_channels=48, kernel_size=5, stride=1,
                                     activate='hswish', use_se=True, se_kernel_size=14),
                SEInvertedBottleneck(in_channels=48, mid_channels=144, out_channels=48, kernel_size=5, stride=1,
                                     activate='hswish', use_se=True, se_kernel_size=14),
                SEInvertedBottleneck(in_channels=48, mid_channels=288, out_channels=96, kernel_size=5, stride=2,
                                     activate='hswish', use_se=True, se_kernel_size=7),
                SEInvertedBottleneck(in_channels=96, mid_channels=576, out_channels=96, kernel_size=5, stride=1,
                                     activate='hswish', use_se=True, se_kernel_size=7),
                SEInvertedBottleneck(in_channels=96, mid_channels=576, out_channels=96, kernel_size=5, stride=1,
                                     activate='hswish', use_se=True, se_kernel_size=7),
            )
            self.small_last_stage = nn.Sequential(
                nn.Conv2d(in_channels=96, out_channels=576, kernel_size=1, stride=1),
                nn.BatchNorm2d(576),
                HardSwish(inplace=True),
                nn.AvgPool2d(kernel_size=7, stride=1),
                nn.Conv2d(in_channels=576, out_channels=1280, kernel_size=1, stride=1),
                HardSwish(inplace=True),
            )
        self.classifier = nn.Conv2d(in_channels=1280, out_channels=num_classes, kernel_size=1, stride=1)
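    def init_params(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight)
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                # Constant weights would make every SE unit identical, so use small random values
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)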
    def forward(self, x):
        x = self.first_conv(x)
        if self.type == 'large':
            x = self.large_bottleneck(x)
            x = self.large_last_stage(x)
        else:
            x = self.small_bottleneck(x)
            x = self.small_last_stage(x)
        out = self.classifier(x)
        out = out.view(out.size(0), -1)
        return out