【EfficientNet】EfficientNet Network Structure and Code Explained

1 Starting from Three Questions

  1. For a network, is a higher input image resolution always better?
  2. When designing a network, is deeper always better?
  3. When designing a network, is wider always better?

Answer: broadly speaking, yes.

  1. A higher image resolution potentially gives the network finer-grained features. But there is a point of diminishing returns, and the computation grows; if the resolution is too high, the cost outweighs the benefit.
  2. A deeper network learns richer, more complex features. But if it is too deep, vanishing gradients make training difficult.
  3. A wider network captures finer-grained features and is easier to train. But width alone, without depth, cannot learn high-level features, and the extra computation also has to be weighed.

Enter EfficientNet.
EfficientNet uses NAS (Neural Architecture Search) and considers the input resolution, network depth, and network width jointly, producing a better-balanced network structure.
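
For reference, the original paper formalizes this joint adjustment as compound scaling (this is the paper's formulation, not part of the code below): the B0 baseline is the network found by NAS, and B1~B7 scale it with a single coefficient $\phi$:

$$\text{depth: } d = \alpha^{\phi}, \quad \text{width: } w = \beta^{\phi}, \quad \text{resolution: } r = \gamma^{\phi}, \qquad \text{s.t. } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,\ \alpha, \beta, \gamma \ge 1$$

The paper reports $\alpha \approx 1.2$, $\beta \approx 1.1$, $\gamma \approx 1.15$, found by a small grid search.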

2 The MBConv Block

The network is built by stacking MBConv blocks (similar to the block in MobileNetV3). So what does an MBConv block look like?

As shown in the figure below (images in this post courtesy of 太阳花的小绿豆):
[Figure: MBConv block structure]
An MBConv block consists of:
a 1x1 regular convolution (dimension expansion, with BN and Swish)
a kxk Depthwise Conv (with BN and Swish), where kxk is either 3x3 or 5x5
an SE module
a 1x1 regular convolution (dimension reduction, with BN and a linear activation y = x)
a Dropout layer
A few points to note when building it:

  • The first 1x1 conv layer performs the dimension expansion; its number of filters is n times the input channel count, with n ∈ {1, 6}.
    When n = 1, the first 1x1 expansion conv is omitted, similar to MobileNetV3.
  • The shortcut connection exists only when the feature map entering the MBConv block has the same shape (spatial size and channel count) as the output (in code this is checked with stride == 1 and input_channels == output_channels).
  • The SE module consists of a global average pooling and two fully connected layers. The figure below shows the SE module from MobileNetV3; note two differences here:
    the first FC layer has 1/4 of the channels entering the MBConv block as its number of nodes (not 1/4 of the expanded channels), and uses the Swish activation;
    the second FC layer has as many nodes as the channels output by the Depthwise Conv, and uses the Sigmoid activation.
    [Figure: SE module in MobileNetV3]
  • The dropout_rate of this Dropout layer corresponds to drop_connect_rate and must not be confused with the dropout applied before the fully connected layer (note: in the reference implementation the Dropout layer only exists when the shortcut is used and drop_rate > 0; Section 4 explains why). The channel bookkeeping of these rules is sketched in the snippet below.
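
Below is a minimal standalone sketch of the channel bookkeeping described above, with assumed example numbers (input_c=24, expansion factor n=6, squeeze factor 4); it only illustrates the rules and is not part of the model code later in this post:

input_c, n = 24, 6                     # channels entering the MBConv, expansion factor
expanded_c = input_c * n               # 1x1 expansion conv output: 144 channels
se_fc1_nodes = input_c // 4            # first SE FC: 6 nodes (1/4 of the block input, not of expanded_c)
se_fc2_nodes = expanded_c              # second SE FC: back to 144 channels (= Depthwise Conv output)
stride, output_c = 1, 24
use_shortcut = (stride == 1 and input_c == output_c)    # True -> residual add (+ DropPath if drop_rate > 0)
print(expanded_c, se_fc1_nodes, se_fc2_nodes, use_shortcut)    # 144 6 144 True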

3 Detailed Network Structure

The EfficientNet authors provide eight networks. EfficientNet-B0 is used as the example below, because EfficientNet-B1~B7 are built on top of EfficientNet-B0 by jointly adjusting the input Resolution, the network depth (Layers), and the network width (Channels) with the scaling factors described in Section 4.

Overall, the EfficientNet-B0 framework is divided into 9 Stages:

  • Stage1 is a regular convolution with a 3x3 kernel and stride 2 (with BN and the Swish activation)
  • Stage2~Stage8 repeatedly stack the MBConv block
  • Stage9 consists of a regular 1x1 convolution (with BN and the Swish activation) + an average pooling layer + a fully connected layer

[Table: EfficientNet-B0 architecture]
Explanation of each column in the table:

  • Each MBConv in the table is followed by a number, 1 or 6. This is the expansion factor n (how much the channels are widened), i.e. the first 1x1 conv in the MBConv expands the input channels to n times.
  • k3x3 or k5x5 is the kernel size used by the Depthwise Conv inside the MBConv.
  • Resolution is the spatial size of the feature map fed into that Stage.
  • Channels is the number of channels of the feature map output by that Stage.
  • Layers is how many times the MBConv block is repeated in that Stage.
  • stride is the stride; for a Stage repeated several times it only applies to the first block (stride=2 halves the Resolution height and width), all remaining blocks use stride 1, as illustrated in the sketch after this list.
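
Below is a minimal sketch (with assumed example values) of how a Stage with Layers=2 and stride=2 expands into individual MBConv blocks: only the first block uses stride=2 and the configured input channels; the remaining blocks use stride=1 with input channels equal to output channels:

stage = {"kernel": 5, "in_c": 24, "out_c": 40, "expand": 6, "stride": 2, "layers": 2}
blocks = []
for i in range(stage["layers"]):
    in_c = stage["in_c"] if i == 0 else stage["out_c"]    # later repeats: in_c == out_c
    stride = stage["stride"] if i == 0 else 1             # stride only applies to the first repeat
    blocks.append((stage["kernel"], in_c, stage["out_c"], stage["expand"], stride))
print(blocks)   # [(5, 24, 40, 6, 2), (5, 40, 40, 6, 1)]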

4 EfficientNet-B0~B7 Variants

The differences between EfficientNet-B0~B7 are as follows:
[Table: EfficientNet-B0~B7 scaling parameters]
Explanation of each parameter in the table:

  • input_size is the image size the network is trained with.
  • width_coefficient is the scaling factor along the channel dimension. For example, the 3x3 conv in Stage1 of EfficientNet-B0 uses 32 kernels, so in B6 this becomes 32 × 1.8 = 57.6, which is then rounded to the nearest multiple of 8, i.e. 56; the other Stages are handled the same way.
  • depth_coefficient is the scaling factor along the depth dimension (applied only to Stage2~Stage8). For example, the MBConv6 in Stage7 of EfficientNet-B0 is repeated 4 times (Layers=4), so in B6 this becomes 4 × 2.6 = 10.4, which is rounded up to 11.
  • drop_connect_rate is the drop_rate used by the Dropout layer inside the MBConv blocks; the per-block drop_rate increases linearly from 0 up to drop_connect_rate (in the reference implementation the Dropout layer only exists when the shortcut is used and drop_rate > 0).
    This Dropout layer in MBConv randomly drops the whole main branch of the block, leaving only the shortcut branch, which is equivalent to skipping the block entirely (it can be understood as stochastically reducing the network depth).
  • dropout_rate is the dropout rate of the dropout layer before the final fully connected layer (between the Pooling and FC in Stage9). Both rounding rules above are demonstrated in the snippet below.
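
As a quick check of the two rounding rules above, the snippet below reuses the same _make_divisible logic that appears in the code in the next section (1.8 and 2.6 are the B6 factors from the table):

import math

def _make_divisible(ch, divisor=8, min_ch=None):
    # same rounding rule as in the code below: round to the nearest multiple of 8,
    # but never go more than 10% below the original value
    if min_ch is None:
        min_ch = divisor
    new_ch = max(min_ch, int(ch + divisor / 2) // divisor * divisor)
    if new_ch < 0.9 * ch:
        new_ch += divisor
    return new_ch

print(_make_divisible(32 * 1.8))   # 56   (width: 57.6 -> nearest multiple of 8)
print(math.ceil(4 * 2.6))          # 11   (depth: 10.4 -> rounded up)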

5 Code Walkthrough

The code below is runnable as-is; see the inline comments for details.

import math
import copy
from functools import partial
from collections import OrderedDict
from typing import Optional, Callable

import torch
import torch.nn as nn
from torch import Tensor
from torch.nn import functional as F

# Round the number of channels to a multiple of 8
def _make_divisible(ch, divisor=8, min_ch=None):
    """
    This function is taken from the original tf repo.
    It ensures that all layers have a channel number that is divisible by 8
    It can be seen here:
    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
    """
    if min_ch is None:
        min_ch = divisor
    new_ch = max(min_ch, int(ch + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_ch < 0.9 * ch:
        new_ch += divisor
    return new_ch


def drop_path(x, drop_prob: float = 0., training: bool = False):
    """
    Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
    "Deep Networks with Stochastic Depth", https://arxiv.org/pdf/1603.09382.pdf

    This function is taken from the rwightman.
    It can be seen here:
    https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/layers/drop.py#L140
    """
    if drop_prob == 0. or not training:     
        return x
    keep_prob = 1 - drop_prob       # probability of keeping the path
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)  # work with diff dim tensors, not just 2D ConvNets
    random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
    random_tensor.floor_()  # binarize
    # ---------------------------------------------------------------------------------------#
    #   Why is x first scaled with x.div(keep_prob)?
    #   Answer: the scaling keeps the expected output unchanged.
    #       Suppose a neuron's activation is a. Without dropout its expected output is a.
    #       With dropout the neuron is either kept or dropped, i.e. it follows a 0-1
    #       (Bernoulli) distribution, so the expected activation becomes p*a + (1-p)*0 = p*a.
    #       To keep the expectation the same as without dropout, we divide by p (keep_prob).
    # ---------------------------------------------------------------------------------------#
    output = x.div(keep_prob) * random_tensor
    return output


class DropPath(nn.Module):
    """
    Drop paths (Stochastic Depth) per sample  (when applied in main path of residual blocks).
    "Deep Networks with Stochastic Depth", https://arxiv.org/pdf/1603.09382.pdf
    """
    def __init__(self, drop_prob=None):
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        return drop_path(x, self.drop_prob, self.training)


class ConvBNActivation(nn.Sequential):
    def __init__(self,
                 in_planes: int,
                 out_planes: int,
                 kernel_size: int = 3,
                 stride: int = 1,
                 groups: int = 1,   # groups switches between a depthwise conv and a regular conv
                 norm_layer: Optional[Callable[..., nn.Module]] = None,
                 activation_layer: Optional[Callable[..., nn.Module]] = None):
        padding = (kernel_size - 1) // 2
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        if activation_layer is None:
            activation_layer = nn.SiLU  # a.k.a. Swish (torch >= 1.7)

        super(ConvBNActivation, self).__init__(nn.Conv2d(in_channels=in_planes,
                                                         out_channels=out_planes,
                                                         kernel_size=kernel_size,
                                                         stride=stride,
                                                         padding=padding,
                                                         groups=groups,
                                                         bias=False),   # BN follows, so no bias is needed
                                               norm_layer(out_planes),
                                               activation_layer())


# SE (Squeeze-and-Excitation) module
class SqueezeExcitation(nn.Module):
    def __init__(self,
                 input_c: int,   # block input channel
                 expand_c: int,  # block expanded channels (the DW conv keeps the channel count unchanged)
                 squeeze_factor: int = 4):  # the first FC layer has input_c // squeeze_factor nodes
        super(SqueezeExcitation, self).__init__()
        squeeze_c = input_c // squeeze_factor
        self.fc1 = nn.Conv2d(expand_c, squeeze_c, 1)    # a 1x1 conv acts as the fully connected layer
        self.ac1 = nn.SiLU()  # alias Swish
        self.fc2 = nn.Conv2d(squeeze_c, expand_c, 1)
        self.ac2 = nn.Sigmoid()

    def forward(self, x: Tensor) -> Tensor:
        # output_size=(1, 1): global average pooling over each channel
        scale = F.adaptive_avg_pool2d(x, output_size=(1, 1))
        scale = self.fc1(scale)
        scale = self.ac1(scale)
        scale = self.fc2(scale)
        scale = self.ac2(scale)     # per-channel attention weights
        return scale * x


# Configuration parameters for each MBConv block
class InvertedResidualConfig:
    # kernel_size, in_channel, out_channel, exp_ratio, strides, use_SE, drop_connect_rate
    def __init__(self,
                 kernel: int,          # 3 or 5
                 input_c: int,         # channels entering the MBConv
                 out_c: int,           # channels output by the MBConv
                 expanded_ratio: int,  # 1 or 6     expansion factor
                 stride: int,          # 1 or 2
                 use_se: bool,         # always True here
                 drop_rate: float,     # drop rate (stochastic depth) inside the MBConv
                 index: str,           # 1a, 2a, 2b, ...    records the name of this MBConv block, handy for later analysis
                 width_coefficient: float):     # width scaling factor
        # channels * width factor, then rounded to the nearest multiple of 8
        # defaults are the EfficientNet-B0 values; B1~B7 are obtained via these scaling factors
        self.input_c = self.adjust_channels(input_c, width_coefficient)     
        self.kernel = kernel
        #   expanded channels = input channels * expansion factor
        self.expanded_c = self.input_c * expanded_ratio 
        self.out_c = self.adjust_channels(out_c, width_coefficient)
        self.use_se = use_se
        self.stride = stride
        self.drop_rate = drop_rate
        self.index = index

    @staticmethod
    def adjust_channels(channels: int, width_coefficient: float):
        return _make_divisible(channels * width_coefficient, 8)


# MBConv block
class InvertedResidual(nn.Module):
    def __init__(self,
                 cnf: InvertedResidualConfig,       # block configuration
                 norm_layer: Callable[..., nn.Module]): # BN
        super(InvertedResidual, self).__init__()

        if cnf.stride not in [1, 2]:
            raise ValueError("illegal stride value.")

        # the shortcut is used only when cnf.stride == 1 and cnf.input_c == cnf.out_c
        self.use_res_connect = (cnf.stride == 1 and cnf.input_c == cnf.out_c)

        # ordered dict used to assemble the MBConv block
        layers = OrderedDict()
        activation_layer = nn.SiLU  # alias Swish

        # expand
        #   skipped for MBConv1 blocks, where expanded_c == input_c
        if cnf.expanded_c != cnf.input_c:
            layers.update({"expand_conv": ConvBNActivation(cnf.input_c,
                                                           cnf.expanded_c,
                                                           kernel_size=1,
                                                           norm_layer=norm_layer,
                                                           activation_layer=activation_layer)})

        # depthwise
        layers.update({"dwconv": ConvBNActivation(cnf.expanded_c,
                                                  cnf.expanded_c,
                                                  kernel_size=cnf.kernel,
                                                  stride=cnf.stride,
                                                  groups=cnf.expanded_c,    # depthwise conv
                                                  norm_layer=norm_layer,
                                                  activation_layer=activation_layer)})

        if cnf.use_se:
            layers.update({"se": SqueezeExcitation(cnf.input_c,
                                                   cnf.expanded_c)})

        # project
        layers.update({"project_conv": ConvBNActivation(cnf.expanded_c,
                                                        cnf.out_c,
                                                        kernel_size=1,
                                                        norm_layer=norm_layer,
                                                        activation_layer=nn.Identity)})     # linear activation (identity, no further processing)

        self.block = nn.Sequential(layers)  # pass the ordered dict to nn.Sequential
        self.out_channels = cnf.out_c
        self.is_strided = cnf.stride > 1    # False when cnf.stride == 1

        # the dropout layer is used only when the shortcut is used and drop_rate > 0
        if self.use_res_connect and cnf.drop_rate > 0:
            self.dropout = DropPath(cnf.drop_rate)      # the MBConv dropout is implemented via the DropPath class
        else:
            self.dropout = nn.Identity()        # nn.Identity() does nothing

    def forward(self, x: Tensor) -> Tensor:
        result = self.block(x)
        result = self.dropout(result)
        if self.use_res_connect:
            result += x

        return result


class EfficientNet(nn.Module):
    def __init__(self,
                 width_coefficient: float,      # width scaling factor, switches between B0~B7
                 depth_coefficient: float,      # depth scaling factor
                 num_classes: int = 1000,
                 dropout_rate: float = 0.2,     # dropout before the final fully connected layer
                 drop_connect_rate: float = 0.2,    # DropPath (stochastic depth) rate inside the MBConv blocks
                 block: Optional[Callable[..., nn.Module]] = None,      # MBConv block class
                 norm_layer: Optional[Callable[..., nn.Module]] = None  # BN
                 ):
        super(EfficientNet, self).__init__()

        # default configuration table for stage2~stage8
        # kernel_size, in_channel, out_channel, exp_ratio, strides, use_SE, drop_connect_rate, repeats
        default_cnf = [[3, 32, 16, 1, 1, True, drop_connect_rate, 1],
                       [3, 16, 24, 6, 2, True, drop_connect_rate, 2],
                       [5, 24, 40, 6, 2, True, drop_connect_rate, 2],
                       [3, 40, 80, 6, 2, True, drop_connect_rate, 3],
                       [5, 80, 112, 6, 1, True, drop_connect_rate, 3],
                       [5, 112, 192, 6, 2, True, drop_connect_rate, 4],
                       [3, 192, 320, 6, 1, True, drop_connect_rate, 1]]

        # controls how many times each MBConv is repeated in B0~B7
        def round_repeats(repeats):
            """Round number of repeats based on depth multiplier."""
            return int(math.ceil(depth_coefficient * repeats))      # math.ceil(): round up

        # defaults to the MBConv block (InvertedResidual)
        if block is None:
            block = InvertedResidual

        if norm_layer is None:
            # partial(): bind two default arguments to BN so they need not be passed again later
            norm_layer = partial(nn.BatchNorm2d, eps=1e-3, momentum=0.1)    
        
        # partial(): bind width_coefficient to InvertedResidualConfig.adjust_channels
        adjust_channels = partial(InvertedResidualConfig.adjust_channels,   
                                  width_coefficient=width_coefficient)

        # build inverted_residual_setting
        bneck_conf = partial(InvertedResidualConfig,
                             width_coefficient=width_coefficient)

        b = 0   # running index of the MBConv blocks
        # iterate over default_cnf, take the last element (repeats) of each row,
        #   pass it to round_repeats, then sum everything up
        num_blocks = float(sum(round_repeats(i[-1]) for i in default_cnf))
        # stores the configuration of every MBConv block
        inverted_residual_setting = []
        # stage is the index, args is the row
        for stage, args in enumerate(default_cnf):
            cnf = copy.copy(args)
            # pop the last element (repeats) of the row into round_repeats to get how many
            #   times this MBConv block should be built; note that cnf no longer contains repeats
            for i in range(round_repeats(cnf.pop(-1))):
                # for repeated MBConvs, blocks after the first need some parameters adjusted
                if i > 0:
                    # strides equal 1 except first cnf
                    cnf[-3] = 1         # stride       (repeats has already been popped from cnf)
                    cnf[1] = cnf[2]     # input_channel equal output_channel

                # the MBConv drop rate increases gradually across blocks
                cnf[-1] = args[-2] * b / num_blocks  # update dropout ratio 
                # index = str(stage + 1) + chr(i + 97)  # 1a, 2a, 2b, ...     original naming, also used in the pretrained weights
                index = 'stage' + str(stage + 1) + chr(i + 97)  # stage1a, stage2a, stage2b, ...     modified naming used here
                # *cnf unpacks the row as positional arguments
                # index records the name of this MBConv block, handy for later analysis
                inverted_residual_setting.append(bneck_conf(*cnf, index))   
                b += 1

        # create layers via an ordered dict
        #   new entries are added with update()
        layers = OrderedDict()

        # first conv, named stem_conv
        layers.update({"stem_conv": ConvBNActivation(in_planes=3,
                                                    # out_planes is 32 for B0; other variants differ and must be
                                                    #   adjusted with the width factor via adjust_channels()
                                                     out_planes=adjust_channels(32),
                                                     kernel_size=3,
                                                     stride=2,
                                                     norm_layer=norm_layer)})

        # building inverted residual blocks
        for cnf in inverted_residual_setting:
            # use index as the layer name; block is the MBConv class (InvertedResidual) defined above
            layers.update({cnf.index: block(cnf, norm_layer)})

        # build top
        # last_conv_input_c: the input channel count of that final 1x1 conv
        last_conv_input_c = inverted_residual_setting[-1].out_c
        last_conv_output_c = adjust_channels(1280)
        # the head 1x1 conv, named "top"
        layers.update({"top": ConvBNActivation(in_planes=last_conv_input_c,
                                               out_planes=last_conv_output_c,
                                               kernel_size=1,
                                               norm_layer=norm_layer)})

        # pass the ordered dict layers to nn.Sequential to get self.features
        self.features = nn.Sequential(layers)
        self.avgpool = nn.AdaptiveAvgPool2d(1)

        classifier = []
        if dropout_rate > 0:    # add a dropout layer before the fully connected layer
            classifier.append(nn.Dropout(p=dropout_rate, inplace=True))
        classifier.append(nn.Linear(last_conv_output_c, num_classes))
        # unpack the classifier list via *list into nn.Sequential to get self.classifier
        self.classifier = nn.Sequential(*classifier)

        # initial weights
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode="fan_out")
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.zeros_(m.bias)

    def _forward_impl(self, x: Tensor) -> Tensor:
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)

        return x

    def forward(self, x: Tensor) -> Tensor:
        return self._forward_impl(x)


# build efficientnet_b0 ~ efficientnet_b7 from the EfficientNet class above
def efficientnet_b0(num_classes=1000):
    # input image size 224x224
    return EfficientNet(width_coefficient=1.0,
                        depth_coefficient=1.0,
                        dropout_rate=0.2,
                        num_classes=num_classes)


def efficientnet_b1(num_classes=1000):
    # input image size 240x240
    return EfficientNet(width_coefficient=1.0,
                        depth_coefficient=1.1,
                        dropout_rate=0.2,
                        num_classes=num_classes)


def efficientnet_b2(num_classes=1000):
    # input image size 260x260
    return EfficientNet(width_coefficient=1.1,
                        depth_coefficient=1.2,
                        dropout_rate=0.3,
                        num_classes=num_classes)


def efficientnet_b3(num_classes=1000):
    # input image size 300x300
    return EfficientNet(width_coefficient=1.2,
                        depth_coefficient=1.4,
                        dropout_rate=0.3,
                        num_classes=num_classes)


def efficientnet_b4(num_classes=1000):
    # input image size 380x380
    return EfficientNet(width_coefficient=1.4,
                        depth_coefficient=1.8,
                        dropout_rate=0.4,
                        num_classes=num_classes)


def efficientnet_b5(num_classes=1000):
    # input image size 456x456
    return EfficientNet(width_coefficient=1.6,
                        depth_coefficient=2.2,
                        dropout_rate=0.4,
                        num_classes=num_classes)


def efficientnet_b6(num_classes=1000):
    # input image size 528x528
    return EfficientNet(width_coefficient=1.8,
                        depth_coefficient=2.6,
                        dropout_rate=0.5,
                        num_classes=num_classes)


def efficientnet_b7(num_classes=1000):
    # input image size 600x600
    return EfficientNet(width_coefficient=2.0,
                        depth_coefficient=3.1,
                        dropout_rate=0.5,
                        num_classes=num_classes)


if __name__ == "__main__":
    model = efficientnet_b0()
    print(model)

    from torchsummaryX import summary
    summary(model, torch.zeros(1, 3, 224, 224))

Output:

EfficientNet(
  (features): Sequential(
    (stem_conv): ConvBNActivation(
      (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (1): BatchNorm2d(32, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
      (2): SiLU()
    )
    (1a): InvertedResidual(
      (block): Sequential(
        (dwconv): ConvBNActivation(
          (0): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=32, bias=False)
          (1): BatchNorm2d(32, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
          (2): SiLU()
        )
        (se): SqueezeExcitation(
        ...
  )
  (avgpool): AdaptiveAvgPool2d(output_size=1)
  (classifier): Sequential(
    (0): Dropout(p=0.2, inplace=True)
    (1): Linear(in_features=1280, out_features=1000, bias=True)
  )
)
=============================================================================================================
                                                       Kernel Shape  \
Layer
0_features.stem_conv.Conv2d_0                         [3, 32, 3, 3]
1_features.stem_conv.BatchNorm2d_1                             [32]
2_features.stem_conv.SiLU_2                                       -
...
227_avgpool                                                -
228_classifier.Dropout_0                                   -
229_classifier.Linear_1                                1.28M
-------------------------------------------------------------------------------------------------------------
                          Totals
Total params           5.288548M
Trainable params       5.288548M
Non-trainable params         0.0
Mult-Adds             385.83576M

6 Reference Links

https://www.bilibili.com/video/BV1XK4y1U7PX/?spm_id_from=333.788
https://blog.csdn.net/qq_37541097/article/details/114434046