MobileNet Explained (2017)

周周11周周

Tags: deep learning, computer vision, CNN

Article link: https://blog.csdn.net/zhou123333/article/details/129039370

Paper: MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

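I. Lightweight Networks

Conventional convolutional neural networks need a lot of memory and computation, which makes them impractical to run on mobile and embedded devices.

Work on lightweight networks approaches acceleration and compression from roughly four directions:

1. Compressing an already trained model: knowledge distillation (a large teacher model trains a small student model), weight quantization, pruning (weight pruning, channel pruning), and attention transfer.

2. Training a lightweight network directly: SqueezeNet, MobileNet (v1, v2, v3), ShuffleNet, Xception, EfficientNet, NASNet, DARTS.

3. Speeding up the convolution itself: im2col + GEMM, Winograd, low-rank decomposition.

4. Hardware deployment: TensorRT, Jetson, TensorFlow-Slim, TensorFlow, TensorFlow Lite, OpenVINO.

The quantities these techniques try to improve include parameter count, computation, memory access, latency, energy consumption, and carbon emissions; CUDA acceleration and adversarial learning are related topics.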

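II. MobileNetV1 Network Structure in Detail

MobileNet was proposed by a Google team in 2017 as a lightweight CNN aimed at mobile and embedded devices. Compared with conventional convolutional networks, it greatly reduces the number of parameters and the amount of computation at the cost of a small drop in accuracy. (Compared with VGG16, accuracy drops by 0.9%, while the model has only about 1/32 of VGG16's parameters.)

Highlights of the network:

➢ Depthwise Convolution (greatly reduces computation and parameter count)
➢ Two added hyperparameters: α (controls the number of convolution kernels in each layer) and β (controls the input image resolution; the paper writes this one as ρ)

Both hyperparameters are set by hand.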

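Standard convolution

Kernel channels = input feature map channels

Number of kernels = output feature map channels

[Figure: standard convolution]

DW (Depthwise Convolution):

Kernel channels = 1

Input feature map channels = number of kernels = output feature map channels

Each kernel is responsible for exactly one channel.

[Figure: depthwise convolution]

PW (Pointwise Convolution):

This is an ordinary convolution whose kernel size is 1×1.

Kernel channels = input feature map channels

Number of kernels = output feature map channels

[Figure: pointwise convolution]

DW and PW are normally used together; the combination is the depthwise separable convolution.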
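
To make the DW + PW pairing concrete, here is a minimal pure-Python sketch on nested lists. The function names `depthwise_conv` and `pointwise_conv`, the no-padding and stride-1 setting, and the toy input are all assumptions made for illustration; real code would use `nn.Conv2d` with the `groups` argument, as the PyTorch listing later in this post does.

```python
def depthwise_conv(x, kernels):
    """Depthwise convolution: channel c of x is filtered only by kernels[c].

    x: list of C channels, each an H x W list of lists.
    kernels: list of C square k x k kernels (depth 1 each).
    No padding, stride 1 (assumed for brevity).
    """
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        h, w = len(channel), len(channel[0])
        out.append([
            [sum(channel[i + a][j + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(w - k + 1)]
            for i in range(h - k + 1)
        ])
    return out  # same number of channels as the input


def pointwise_conv(x, weights):
    """Pointwise (1x1) convolution: each output channel mixes all input channels.

    weights: list of N kernels, each a list of C numbers (a 1x1 kernel of depth C).
    """
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(wc * x[c][i][j] for c, wc in enumerate(kernel)) for j in range(w)]
         for i in range(h)]
        for kernel in weights
    ]


# Toy input: 2 channels of size 3x3.
x = [[[1] * 3 for _ in range(3)], [[2] * 3 for _ in range(3)]]
dw_kernels = [[[1] * 3 for _ in range(3)] for _ in range(2)]  # one 3x3 kernel per channel
pw_weights = [[1, 1], [1, -1], [0, 0.5]]                      # 3 kernels -> 3 output channels

dw_out = depthwise_conv(x, dw_kernels)       # still 2 channels, each 1x1
pw_out = pointwise_conv(dw_out, pw_weights)  # 3 channels, each 1x1
print(len(dw_out), len(pw_out))
```

The depthwise step never mixes channels, so the channel count can only change in the pointwise step.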

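Comparing the computation

[Figure: standard convolution versus depthwise separable convolution]

Notation: $D_F$ is the spatial size (height and width) of the input feature map, $D_K$ is the kernel size, $M$ is the depth of the input feature map, and $N$ is the depth of the output feature map.

Computation = kernel size × input feature map depth × number of kernels × input feature map size.

So the cost of a standard convolution is:

$$D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$$

The cost of DW (kernel depth = 1) plus PW (kernel size = 1) is:

$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$$

Dividing the second by the first gives:

$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2}$$

MobileNet mostly uses 3×3 kernels, so this becomes:

$$\frac{1}{N} + \frac{1}{9}$$

In theory, then, a standard convolution costs 8 to 9 times as much as DW + PW.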
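
The two cost expressions are easy to check numerically. The sketch below plugs in a made-up layer (3×3 kernel, 512 input and output channels, 14×14 feature map); the function names and the layer sizes are assumptions chosen only to illustrate the ratio.

```python
def standard_conv_cost(dk, m, n, df):
    """Multiply-accumulates of a standard convolution: DK*DK*M*N*DF*DF."""
    return dk * dk * m * n * df * df


def separable_conv_cost(dk, m, n, df):
    """Multiply-accumulates of DW followed by PW: DK*DK*M*DF*DF + M*N*DF*DF."""
    return dk * dk * m * df * df + m * n * df * df


# Hypothetical layer: 3x3 kernel, 512 input and output channels, 14x14 feature map.
dk, m, n, df = 3, 512, 512, 14
std = standard_conv_cost(dk, m, n, df)
sep = separable_conv_cost(dk, m, n, df)
print(f"standard: {std}, DW+PW: {sep}, speedup: {std / sep:.2f}x")
print(f"ratio from the formula: {1 / n + 1 / dk ** 2:.4f}, computed: {sep / std:.4f}")
```

The larger N is, the closer the saving gets to the factor of 9 set by the 3×3 kernel.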

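[Table: MobileNetV1 network architecture]

conv / s2 denotes a convolution layer with stride 2.

Filter shape = kernel height × kernel width × kernel depth × number of kernels.

Rows marked dw are depthwise convolutions; their kernel depth is 1 and is not written out.

[Table: accuracy, computation, and model parameters]

[Table: results for different values of α and β]

α controls the number of convolution kernels in each layer; β controls the input image resolution.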
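
As a rough illustration of what the two hyperparameters do to the cost, the sketch below applies a width multiplier and a resolution multiplier to the DW + PW cost expression, following the scaled form given in the MobileNets paper. The function name, the argument names `alpha` and `rho` (this post writes the second one as β), and the layer sizes are assumptions, and the rounding of channel counts done in real implementations is ignored.

```python
def scaled_separable_cost(dk, m, n, df, alpha=1.0, rho=1.0):
    """Cost of DW + PW after scaling channels by alpha and resolution by rho."""
    m_s, n_s, df_s = alpha * m, alpha * n, rho * df
    return dk * dk * m_s * df_s * df_s + m_s * n_s * df_s * df_s


# Hypothetical layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map.
base = scaled_separable_cost(3, 512, 512, 14)
for alpha, rho in [(1.0, 1.0), (0.75, 1.0), (0.5, 1.0), (1.0, 0.5)]:
    cost = scaled_separable_cost(3, 512, 512, 14, alpha, rho)
    print(f"alpha={alpha}, rho={rho}: {cost / base:.3f} of the baseline cost")
```

Halving the resolution cuts the cost to exactly a quarter, while halving the width cuts it to roughly a quarter because the pointwise term dominates.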

The depthwise kernels tend to die during training, meaning that most of their parameters become zero. MobileNetV2 improves on this.

III. MobileNet v2

MobileNet v2 was proposed by a Google team in 2018. Compared with MobileNet v1 it is more accurate and the model is smaller.

Paper: MobileNetV2: Inverted Residuals and Linear Bottlenecks

Highlights of the network:

➢ Inverted Residuals
➢ Linear Bottlenecks

1. Inverted Residuals

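Review: the residual block in ResNet

[Figure: ResNet residual block (left) and MobileNet v2 inverted residual block (right)]

In the ResNet residual block (left), a 1×1 convolution first compresses the input feature map, that is, reduces its channel count; a 3×3 convolution then processes it; and a final 1×1 convolution expands the channels again.

In the MobileNet inverted residual block (right), a 1×1 convolution first increases the dimension, making the channels deeper; a 3×3 DW convolution follows; and a final 1×1 convolution reduces the dimension.

The inverted residual block expands first and then reduces; the residual block does the opposite, reducing first and then expanding.

Expanding or reducing the dimension simply means increasing or decreasing the number of channels.

Because the order is the reverse of the ordinary residual block, it is called an inverted residual.

The ordinary residual block uses ReLU as its activation, while the inverted residual block uses ReLU6.

ReLU outputs 0 when x < 0 and leaves x unchanged when x ≥ 0.

[Figure: ReLU6 activation function]

ReLU6 outputs 0 when x < 0, leaves x unchanged when 0 ≤ x ≤ 6, and outputs 6 when x > 6.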
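
The difference between the two activations fits in a couple of lines. This is a minimal scalar sketch; the function names are chosen here for illustration, and in PyTorch the corresponding modules are `nn.ReLU` and `nn.ReLU6`.

```python
def relu(x):
    """ReLU: 0 for negative inputs, identity otherwise."""
    return max(0.0, x)


def relu6(x):
    """ReLU6: like ReLU, but the output is capped at 6."""
    return min(max(0.0, x), 6.0)


for x in (-2.0, 0.0, 3.0, 6.0, 8.5):
    print(f"x={x:+.1f}  relu={relu(x):.1f}  relu6={relu6(x):.1f}")
```

The two functions agree everywhere except above 6, where ReLU6 saturates.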

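2. Linear Bottlenecks

The last 1×1 convolution of the inverted residual block uses a linear activation instead of ReLU, because ReLU destroys a lot of information in low-dimensional features.

The paper reaches this conclusion through an experiment.

In the inverted residual block, the feature map produced by the last 1×1 convolution has few channels, so a linear activation is the better fit.

[Figure: MobileNet v2 inverted residual block and its per-layer table]

The left side shows the MobileNet v2 inverted residual block: a 1×1 convolution with ReLU6, a 3×3 DW convolution with ReLU6, and a 1×1 convolution with a linear activation.

The table gives the details of each layer:

Layer 1: the input feature map is height × width × depth = h×w×k. A 1×1 convolution expands it. The t in the output column is the expansion factor: the layer has tk 1×1 kernels, so the output depth is tk and the output feature map is h×w×tk.

Layer 2: the input is the layer 1 output, h×w×tk. It passes through a 3×3 DW convolution (input channels = number of kernels = output channels). With stride s, the output height and width shrink to 1/s of the input; the depth stays the same, so the output feature map is h/s × w/s × tk.

Layer 3: the input is the layer 2 output, h/s × w/s × tk. A 1×1 convolution with k' kernels reduces the dimension, so the output feature map is h/s × w/s × k'.

A shortcut connection (the curved arrow in the figure) exists only when stride = 1 and the input and output feature maps have the same shape.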
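
The three-layer shape bookkeeping and the shortcut rule can be traced with a small helper. This is only a sketch: the function name `bottleneck_shapes` and its argument names are made up here, and it assumes h and w are divisible by the stride.

```python
def bottleneck_shapes(h, w, k, t, s, k_out):
    """Trace (height, width, channels) through one inverted residual block.

    k: input channels, t: expansion factor, s: stride of the DW convolution,
    k_out: output channels (k' in the text). Assumes h and w are divisible by s.
    """
    expand = (h, w, t * k)                 # layer 1: 1x1 conv, expands to tk channels
    depthwise = (h // s, w // s, t * k)    # layer 2: 3x3 DW conv, stride s
    project = (h // s, w // s, k_out)      # layer 3: 1x1 conv, linear activation
    use_shortcut = (s == 1 and k == k_out)  # output shape equals input shape
    return [expand, depthwise, project], use_shortcut


print(bottleneck_shapes(56, 56, 24, t=6, s=2, k_out=32))
print(bottleneck_shapes(14, 14, 96, t=6, s=1, k_out=96))
```

Only the second call reports a shortcut, since the first changes both the spatial size and the channel count.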

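[Table: MobileNet v2 network structure]

The table above lists the parameters of the MobileNet v2 network.

t is the expansion factor: it multiplies the k input channels to give the number of 1×1 kernels in the first layer of the bottleneck.

c is the depth (channels) of the output feature map, that is, the number of 1×1 kernels in the third layer. n is the number of times the bottleneck is repeated.

A bottleneck is the inverted residual block from the paper; n = 1 means it appears once. A block consists of a sequence of bottlenecks.

s is the stride of the first bottleneck in each block; every other bottleneck in the block has stride 1. For example, the third row has n = 2 and s = 2: the bottleneck is repeated twice, the first with stride 2 and the second with stride 1.

In the row with t = 1, the number of kernels stays at k, so the output still has k channels and no expansion takes place. The first 1×1 convolution of that bottleneck can therefore be omitted.

How should we read "a shortcut exists only when stride = 1 and the input and output feature maps have the same shape"?

Look at the 14×14×64 row. n = 3, so there are three bottlenecks. The first one has stride = 1 but still gets no shortcut: its input has 64 channels and its output has 96, and tensors with different depths cannot be added.

The second bottleneck also has stride 1, and now its input depth equals its output depth (both 96). With stride 1 the height and width do not change either, so the input and output shapes are identical and the shortcut branch can be used.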
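
The way each (t, c, n, s) row unrolls into individual bottlenecks can be sketched as follows. The helper name `expand_stages` and the dictionary keys are made up for illustration, the rows are the ones used in the model_v2.py listing in this post, and the sketch assumes α = 1 so that no channel rounding is needed.

```python
def expand_stages(settings, input_channel=32):
    """Turn (t, c, n, s) rows into one entry per bottleneck."""
    blocks = []
    for t, c, n, s in settings:
        for i in range(n):
            stride = s if i == 0 else 1  # only the first bottleneck of a row uses s
            blocks.append({
                "in": input_channel, "out": c, "t": t, "stride": stride,
                "shortcut": stride == 1 and input_channel == c,
            })
            input_channel = c
    return blocks


# The (t, c, n, s) rows of the MobileNet v2 table.
settings = [
    (1, 16, 1, 1), (6, 24, 2, 2), (6, 32, 3, 2), (6, 64, 4, 2),
    (6, 96, 3, 1), (6, 160, 3, 2), (6, 320, 1, 1),
]
blocks = expand_stages(settings)
stage_96 = [b for b in blocks if b["out"] == 96]
for b in stage_96:
    print(b)
```

For the 14×14×64 row this prints three bottlenecks, of which only the second and third have a shortcut.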

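The last layer is written as a conv2d layer but acts as a fully connected layer: its input is 1×1×1280, which is just a 1-D vector once the 1×1 spatial size is ignored, so the convolution and a fully connected layer do the same thing. k is the number of classes.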

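Performance comparison from the MobileNetV2 paper

[Table: classification results]

With alpha = 1.4, MobileNet v2 reaches an accuracy of 74.7%.

For object detection, MobileNet v2 is combined with SSDLite: MobileNet v2 serves as the backbone, and the convolution layers in SSDLite are also built from DW and PW convolutions. Compared with other models, its accuracy is lower than SSD300 and SSD512 but better than YOLOv2, while its parameter count, computation, and running time are all lower than with MobileNet v1.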

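IV. MobileNet v3

Paper: Searching for MobileNetV3

Highlights:

An updated block (bneck)

Parameters found with NAS (Neural Architecture Search)

Redesigned expensive layers

v3 is more accurate and more efficient.

[Table: accuracy and latency comparison]

The last three columns are inference latency.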

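Highlight 1: the updated block

1. An SE module, that is, an attention mechanism, is added

2. The activation functions are updated

SE module

[Figure: MobileNet v2 block]

The solid line with an arrow is the shortcut (residual) connection.

stride = 1 refers to the stride of the Depthwise Convolution.

[Figure: MobileNet v3 block]

NL stands for a nonlinear activation function.

The biggest change in the v3 block is the added SE module, an attention mechanism.

Each channel of the feature map is average-pooled, giving a 1-D vector with as many elements as the feature map has channels. This vector passes through two fully connected layers to produce the output vector. The first fully connected layer has 1/4 as many nodes as the input feature map has channels; the second has the same number of nodes as the feature map has channels. After average pooling plus the two fully connected layers, the output vector can be read as a weight for each channel of the feature map that entered the SE module: channels judged more important get a larger weight, and less important channels get a smaller one.

[Figure: worked SE example]

As the figure above shows, suppose the feature map has 2 channels. Average pooling takes the mean of each channel, which gives a vector of 2 elements. The vector then goes through the two fully connected layers in turn. The first has 1/4 as many channels as the input and is followed by ReLU. The second has the same number of channels as the feature map and is followed by h-sigmoid. The result is a vector with one element per channel, each element being that channel's weight. If the first element is 0.5, for instance, every value in the first channel of the feature map is multiplied by 0.5 to give the new channel.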
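
The pooling, two fully connected layers, and channel-wise rescaling described above can be walked through in pure Python. This is a sketch under stated assumptions: the helper names are made up, the input has 4 channels so that 1/4 of them is a whole node, and the weights are arbitrary numbers rather than trained values.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def hard_sigmoid(v):
    # h-sigmoid(x) = ReLU6(x + 3) / 6
    return [min(max(0.0, a + 3.0), 6.0) / 6.0 for a in v]


def linear(v, weight, bias):
    """Fully connected layer: weight has one row per output node."""
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weight, bias)]


def squeeze_excitation(x, w1, b1, w2, b2):
    """x: list of C channels (each an H x W list of lists)."""
    # Squeeze: global average pooling gives one number per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: FC -> ReLU -> FC -> h-sigmoid gives one weight per channel.
    scale = hard_sigmoid(linear(relu(linear(pooled, w1, b1)), w2, b2))
    # Scale: multiply every value in channel c by scale[c].
    out = [[[s * a for a in row] for row in ch] for ch, s in zip(x, scale)]
    return out, scale


# Toy input: 4 channels of size 2x2, squeezed to 4 // 4 = 1 hidden node.
x = [[[1.0, 1.0], [1.0, 1.0]],
     [[2.0, 2.0], [2.0, 2.0]],
     [[0.0, 4.0], [4.0, 0.0]],
     [[3.0, 3.0], [3.0, 3.0]]]
w1, b1 = [[0.25, 0.25, 0.25, 0.25]], [0.0]          # 4 -> 1, made-up weights
w2, b2 = [[1.0], [0.0], [-1.0], [2.0]], [0.0] * 4   # 1 -> 4, made-up weights
out, scale = squeeze_excitation(x, w1, b1, w2, b2)
print(scale)
```

With these weights the second channel receives a weight of 0.5, so each of its values is halved, matching the kind of example given in the text.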

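Updated activation functions

[Figure: MobileNet v3 block]

NL in the figure stands for a nonlinear activation function. Different layers use different activations, so the figure labels it NL rather than naming one. Note that the last 1×1 convolution uses no nonlinear activation; its activation is linear.

The MobileNetV3 block is essentially the same as the MobileNetV2 block; the main changes are the added SE structure and the updated activation functions.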

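Redesigning the expensive layers

The paper covers two parts:

1. Reducing the number of kernels in the first convolution layer (32 -> 16)

[Table: first layers of the network]

v1 and v2 use 32 kernels in the first convolution layer; v3 uses only 16.

The authors report that reducing the number of filters from 32 to 16 leaves accuracy unchanged. Since accuracy is unaffected, fewer kernels simply mean less computation. This saves about 2 ms of running time.

2. Streamlining the last stage

The final part of the network found by NAS is called the Original Last Stage. Its structure is:

[Figure: Original Last Stage and Efficient Last Stage]

It is built mainly by stacking convolutions. The authors found that this Original Last Stage is relatively slow, so they streamlined it and proposed the Efficient Last Stage.

The Efficient Last Stage has far fewer convolution layers than the Original Last Stage. The authors found that accuracy barely changes after the update, while 7 ms of running time is saved. Those 7 ms account for 11% of inference time, so the Efficient Last Stage gives a noticeable speedup.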

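Redesigning the activation functions

v2 uses ReLU6 almost everywhere. A commonly used activation nowadays is swish:

$$\text{swish}(x) = x \cdot \sigma(x)$$

where σ is:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

swish does improve accuracy, but it has two problems:

Its value and derivative are expensive to compute, and it is unfriendly to quantization (on mobile devices, models are almost always quantized for speed).
For these reasons the authors proposed the h-swish activation. Before turning to h-swish, consider h-sigmoid.

h-sigmoid is a modification of ReLU6:

$$\text{h-sigmoid}(x) = \frac{\text{ReLU6}(x + 3)}{6}$$

[Figure: sigmoid versus h-sigmoid (left) and swish versus h-swish (right)]

The figure shows that h-sigmoid is close to sigmoid, so h-sigmoid is used in place of sigmoid in many settings. Replacing σ in swish with h-sigmoid gives h-swish:

$$\text{h-swish}(x) = x \cdot \frac{\text{ReLU6}(x + 3)}{6}$$

The right side of the figure compares swish and h-swish. The two curves are clearly very similar, so h-swish is a good substitute for swish.

The authors state that replacing swish with h-swish and sigmoid with h-sigmoid helps inference speed and is also much friendlier to quantization.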
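
How close the hard versions are to the originals can be checked with a few scalar functions. This is a minimal sketch using only the standard library; the function names are chosen here, and in PyTorch the corresponding modules are `nn.Hardsigmoid` and `nn.Hardswish`.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def swish(x):
    return x * sigmoid(x)


def relu6(x):
    return min(max(0.0, x), 6.0)


def h_sigmoid(x):
    # piecewise-linear stand-in for sigmoid: ReLU6(x + 3) / 6
    return relu6(x + 3.0) / 6.0


def h_swish(x):
    # swish with sigmoid replaced by h-sigmoid
    return x * h_sigmoid(x)


for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f}  swish={swish(x):+.4f}  h-swish={h_swish(x):+.4f}")
```

h-swish needs only an addition, a clamp, and a multiplication, with no exponential, which is why it is cheaper to compute and easier to quantize.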

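MobileNetV3-Large network structure

A quick look at what each column in the table means:
[Table: MobileNetV3-Large specification]

Input is the shape of the feature map entering each layer (height × width × channels).
Operator is the operation; the first one is a conv2d layer.
#out is the number of channels of the output feature map. As noted above, the first convolution in v3 uses 16 kernels.
NL is the activation function: HS is hard swish and RE is ReLU.
s is the stride of the DW convolution.
bneck corresponds to the structure on the left of the figure.
exp size is the dimension that the first 1x1 convolution expands to.
SE indicates whether the attention mechanism is used: only bneck rows marked with √ use it.
NBN on the operator of the last two convolutions means they do not use BN; these two convolutions act as fully connected layers.

Notes:

The first bneck is a special case: its exp size equals its input channel count. Normally the first convolution in a bneck expands the dimension, but here there is no expansion. In the implementation, the first bneck therefore has no 1x1 expansion convolution and applies the DW convolution directly to the feature map.

bneck
A 1x1 convolution first expands to exp size. The DW convolution does not change the channel count, and neither does SE. A final 1x1 convolution reduces the dimension to the value given in #out.
A shortcut branch exists only when the DW convolution has stride 1 and the bneck has input_channel = output_channel.
With this table we can build the MobileNetV3 network.

MobileNetV3-Small
[Table: MobileNetV3-Small specification]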

V. Building MobileNetV2 in PyTorch and Training It with Transfer Learning

Official model code on GitHub:

https://github.com/pytorch/vision/tree/main/torchvision/models

1.model_v2.py

from torch import nn
import torch

# Helper taken from the official implementation; min_ch is the minimum number of channels.
# It rounds the channel count to the nearest multiple of 8.
def _make_divisible(ch, divisor=8, min_ch=None):
    """
    This function is taken from the original tf repo.
    It ensures that all layers have a channel number that is divisible by 8
    It can be seen here:
    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
    """
    if min_ch is None:
        min_ch = divisor
    # Round to the nearest multiple of divisor, e.g. ch=8 gives new_ch=8 and ch=12 gives new_ch=16.
    new_ch = max(min_ch, int(ch + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_ch < 0.9 * ch:
        new_ch += divisor
    return new_ch


class ConvBNReLU(nn.Sequential):
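    # groups=1 is an ordinary convolution; setting groups to the input depth (in_channel) makes it a DW convolution
    def __init__(self, in_channel, out_channel, kernel_size=3, stride=1, groups=1):
        # derive the padding from kernel_size
        padding = (kernel_size - 1) // 2
        super(ConvBNReLU, self).__init__(
            # the number of kernels is out_channel; bias=False because the BatchNorm that follows makes a bias redundant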
            nn.Conv2d(in_channel, out_channel, kernel_size, stride, padding, groups=groups, bias=False),
            nn.BatchNorm2d(out_channel),
            nn.ReLU6(inplace=True)
        )

# Inverted residual block; its DW convolution uses kernels of depth 1, so input channels = output channels there
# A shortcut branch is used only when stride == 1 and the input and output feature maps have the same shape
class InvertedResidual(nn.Module):
    # expand_ratio is the expansion factor, t in the paper
    def __init__(self, in_channel, out_channel, stride, expand_ratio):
        super(InvertedResidual, self).__init__()
        # hidden_channel is the number of kernels in the first convolution: tk = expansion factor x input depth
        hidden_channel = in_channel * expand_ratio
        # decide whether to use the shortcut branch
        self.use_shortcut = stride == 1 and in_channel == out_channel
        # list of layers
        layers = []
        # When the expansion factor is 1, the first convolution would not change the depth, so it is skipped
        if expand_ratio != 1:
            # 1x1 pointwise conv
            layers.append(ConvBNReLU(in_channel, hidden_channel, kernel_size=1))
        # append adds a single element; extend adds every element of an iterable at once
        layers.extend([
            # 3x3 depthwise conv; groups=hidden_channel makes it a DW convolution
            ConvBNReLU(hidden_channel, hidden_channel, stride=stride, groups=hidden_channel),
            # 1x1 pointwise conv (linear): the activation is linear, so use a plain Conv2d instead of ConvBNReLU
            # a linear activation is just y = x, so no activation layer needs to be defined
            nn.Conv2d(hidden_channel, out_channel, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channel),
        ])
        # unpack layers into positional arguments
        self.conv = nn.Sequential(*layers)

    def forward(self, x):  # x is the input feature map
        # use_shortcut was decided in __init__
        if self.use_shortcut:
            return x + self.conv(x)
        else:
            return self.conv(x)

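# MobileNetV2 network
class MobileNetV2(nn.Module):
    # alpha is the width multiplier α introduced in v1; it scales the number of kernels in each convolution layer
    def __init__(self, num_classes=1000, alpha=1.0, round_nearest=8):
        super(MobileNetV2, self).__init__()
        # use InvertedResidual as the building block
        block = InvertedResidual
        # _make_divisible rounds the number of kernels (the output depth) to a multiple of round_nearest, presumably to suit the hardware better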
        input_channel = _make_divisible(32 * alpha, round_nearest)
        last_channel = _make_divisible(1280 * alpha, round_nearest)

        inverted_residual_setting = [
            # parameters from the table in the paper
            # t, c, n, s
            [1, 16, 1, 1],
            [6, 24, 2, 2],
            [6, 32, 3, 2],
            [6, 64, 4, 2],
            [6, 96, 3, 1],
            [6, 160, 3, 2],
            [6, 320, 1, 1],
        ]

        features = []
        # first convolution layer (conv1)
        features.append(ConvBNReLU(3, input_channel, stride=2))

        # build the inverted residual blocks
        for t, c, n, s in inverted_residual_setting:
            output_channel = _make_divisible(c * alpha, round_nearest)
            for i in range(n):
                # s is the stride of the first bottleneck in the block; all others use stride 1
                stride = s if i == 0 else 1
                features.append(block(input_channel, output_channel, stride, expand_ratio=t))
                input_channel = output_channel
        # building last several layers
        features.append(ConvBNReLU(input_channel, last_channel, 1))
        # combine feature layers
        self.features = nn.Sequential(*features)

        # building classifier
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.classifier = nn.Sequential(
            nn.Dropout(0.2),
            nn.Linear(last_channel, num_classes)
        )

        # weight initialization
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out')
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.zeros_(m.bias)

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x
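
The rounding done by `_make_divisible` is easiest to see on a few values. The sketch below repeats the function from the listing above under the name `make_divisible` so that it runs on its own; the choice of alpha values is an assumption made purely for illustration.

```python
def make_divisible(ch, divisor=8, min_ch=None):
    """Same logic as _make_divisible in model_v2.py."""
    if min_ch is None:
        min_ch = divisor
    # round to the nearest multiple of divisor
    new_ch = max(min_ch, int(ch + divisor / 2) // divisor * divisor)
    # never round down by more than 10%
    if new_ch < 0.9 * ch:
        new_ch += divisor
    return new_ch


# First-layer and last-layer channel counts for a few illustrative width multipliers.
for alpha in (0.35, 0.5, 0.75, 1.0, 1.4):
    print(alpha, make_divisible(32 * alpha), make_divisible(1280 * alpha))
```

With alpha = 0.35 the first layer would round down from 11.2 to 8, which is a drop of more than 10%, so the function bumps it up to 16.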

2.model_v3.py

from typing import Callable, List, Optional

import torch
from torch import nn, Tensor
from torch.nn import functional as F
from functools import partial


def _make_divisible(ch, divisor=8, min_ch=None):
    """
    This function is taken from the original tf repo.
    It ensures that all layers have a channel number that is divisible by 8
    It can be seen here:
    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
    """
    if min_ch is None:
        min_ch = divisor
    new_ch = max(min_ch, int(ch + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_ch < 0.9 * ch:
        new_ch += divisor
    return new_ch


class ConvBNActivation(nn.Sequential):
    def __init__(self,
                 in_planes: int,
                 out_planes: int,
                 kernel_size: int = 3,
                 stride: int = 1,
                 groups: int = 1,
                 norm_layer: Optional[Callable[..., nn.Module]] = None,
                 activation_layer: Optional[Callable[..., nn.Module]] = None):
        padding = (kernel_size - 1) // 2
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        if activation_layer is None:
            activation_layer = nn.ReLU6
        super(ConvBNActivation, self).__init__(nn.Conv2d(in_channels=in_planes,
                                                         out_channels=out_planes,
                                                         kernel_size=kernel_size,
                                                         stride=stride,
                                                         padding=padding,
                                                         groups=groups,
                                                         bias=False),
                                               norm_layer(out_planes),
                                               activation_layer(inplace=True))

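# SE module, i.e. the attention module
class SqueezeExcitation(nn.Module):
    # the first fully connected layer has 1/4 as many nodes as the input feature map has channels
    def __init__(self, input_c: int, squeeze_factor: int = 4):
        super(SqueezeExcitation, self).__init__()
        # take 1/4 of the input channels, then round to the nearest multiple of 8
        squeeze_c = _make_divisible(input_c // squeeze_factor, 8)
        # a 1x1 Conv2d on a 1x1 feature map does the same job as a fully connected layer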
        self.fc1 = nn.Conv2d(input_c, squeeze_c, 1)
        self.fc2 = nn.Conv2d(squeeze_c, input_c, 1)

    def forward(self, x: Tensor) -> Tensor:

        scale = F.adaptive_avg_pool2d(x, output_size=(1, 1))
        scale = self.fc1(scale)
        scale = F.relu(scale, inplace=True)
        scale = self.fc2(scale)
        #hardsigmoid
        scale = F.hardsigmoid(scale, inplace=True)
        return scale * x

# InvertedResidualConfig holds the parameters of one bneck block in MobileNetV3
class InvertedResidualConfig:
    def __init__(self,
                 input_c: int,
                 kernel: int,
                 expanded_c: int,
                 out_c: int,
                 use_se: bool,
                 activation: str,
                 stride: int,
                 width_multi: float):  # width_multi is the width multiplier α
        self.input_c = self.adjust_channels(input_c, width_multi)
        self.kernel = kernel
        self.expanded_c = self.adjust_channels(expanded_c, width_multi)
        self.out_c = self.adjust_channels(out_c, width_multi)
        self.use_se = use_se
        self.use_hs = activation == "HS"  # whether using hard-swish activation
        self.stride = stride

    @staticmethod
    def adjust_channels(channels: int, width_multi: float):
        return _make_divisible(channels * width_multi, 8)


class InvertedResidual(nn.Module):
    def __init__(self,
                 cnf: InvertedResidualConfig,
                 norm_layer: Callable[..., nn.Module]):
        super(InvertedResidual, self).__init__()

        if cnf.stride not in [1, 2]:
            raise ValueError("illegal stride value.")

        self.use_res_connect = (cnf.stride == 1 and cnf.input_c == cnf.out_c)

        layers: List[nn.Module] = []
        activation_layer = nn.Hardswish if cnf.use_hs else nn.ReLU

        # the first bneck has no expansion convolution, so check before adding one
        if cnf.expanded_c != cnf.input_c:
            layers.append(ConvBNActivation(cnf.input_c,
                                           cnf.expanded_c,
                                           kernel_size=1,
                                           norm_layer=norm_layer,
                                           activation_layer=activation_layer))

        # depthwise
        layers.append(ConvBNActivation(cnf.expanded_c,
                                       cnf.expanded_c,
                                       kernel_size=cnf.kernel,
                                       stride=cnf.stride,
                                       groups=cnf.expanded_c,  # one kernel per channel, so groups equals expanded_c
                                       norm_layer=norm_layer,
                                       activation_layer=activation_layer))

        if cnf.use_se:
            layers.append(SqueezeExcitation(cnf.expanded_c))

        # 1x1 projection convolution that reduces the dimension (linear, hence nn.Identity)
        layers.append(ConvBNActivation(cnf.expanded_c,
                                       cnf.out_c,
                                       kernel_size=1,
                                       norm_layer=norm_layer,
                                       activation_layer=nn.Identity))

        self.block = nn.Sequential(*layers)
        self.out_channels = cnf.out_c
        self.is_strided = cnf.stride > 1

    def forward(self, x: Tensor) -> Tensor:
        result = self.block(x)
        # add the input if there is a shortcut connection; otherwise return result as is
        if self.use_res_connect:
            result += x

        return result


class MobileNetV3(nn.Module):
    def __init__(self,
                 inverted_residual_setting: List[InvertedResidualConfig],
                 last_channel: int,
                 num_classes: int = 1000,
                 block: Optional[Callable[..., nn.Module]] = None,
                 norm_layer: Optional[Callable[..., nn.Module]] = None):
        super(MobileNetV3, self).__init__()

        if not inverted_residual_setting:
            raise ValueError("The inverted_residual_setting should not be empty.")
        elif not (isinstance(inverted_residual_setting, List) and
                  all([isinstance(s, InvertedResidualConfig) for s in inverted_residual_setting])):
            raise TypeError("The inverted_residual_setting should be List[InvertedResidualConfig]")

        if block is None:
            block = InvertedResidual

        if norm_layer is None:
            # partial pre-fills eps and momentum for BatchNorm2d, so they do not have to be passed each time it is used
            norm_layer = partial(nn.BatchNorm2d, eps=0.001, momentum=0.01)

        layers: List[nn.Module] = []
        # start building the network
        # building first layer
        firstconv_output_c = inverted_residual_setting[0].input_c
        layers.append(ConvBNActivation(3,
                                       firstconv_output_c,
                                       kernel_size=3,
                                       stride=2,
                                       norm_layer=norm_layer,
                                       activation_layer=nn.Hardswish))
        # building inverted residual blocks
        for cnf in inverted_residual_setting:
            layers.append(block(cnf, norm_layer))

        # building last several layers (the layers after the bnecks)
        lastconv_input_c = inverted_residual_setting[-1].out_c
        lastconv_output_c = 6 * lastconv_input_c
        layers.append(ConvBNActivation(lastconv_input_c,
                                       lastconv_output_c,
                                       kernel_size=1,
                                       norm_layer=norm_layer,
                                       activation_layer=nn.Hardswish))
        self.features = nn.Sequential(*layers)
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        # the last two fully connected layers
        self.classifier = nn.Sequential(nn.Linear(lastconv_output_c, last_channel),
                                        nn.Hardswish(inplace=True),
                                        nn.Dropout(p=0.2, inplace=True),
                                        nn.Linear(last_channel, num_classes))

        # initial weights
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode="fan_out")
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.zeros_(m.bias)

    def _forward_impl(self, x: Tensor) -> Tensor:
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)

        return x

    def forward(self, x: Tensor) -> Tensor:
        return self._forward_impl(x)


def mobilenet_v3_large(num_classes: int = 1000,
                       reduced_tail: bool = False) -> MobileNetV3:
    """
    Constructs a large MobileNetV3 architecture from
    "Searching for MobileNetV3" <https://arxiv.org/abs/1905.02244>.

    weights_link:
    https://download.pytorch.org/models/mobilenet_v3_large-8738ca79.pth

    Args:
        num_classes (int): number of classes
        reduced_tail (bool): If True, reduces the channel counts of all feature layers
            between C4 and C5 by 2. It is used to reduce the channel redundancy in the
            backbone for Detection and Segmentation.
    """
    width_multi = 1.0
    bneck_conf = partial(InvertedResidualConfig, width_multi=width_multi)
    adjust_channels = partial(InvertedResidualConfig.adjust_channels, width_multi=width_multi)

    reduce_divider = 2 if reduced_tail else 1

    inverted_residual_setting = [
        # input_c, kernel, expanded_c, out_c, use_se, activation, stride
        bneck_conf(16, 3, 16, 16, False, "RE", 1),
        bneck_conf(16, 3, 64, 24, False, "RE", 2),  # C1
        bneck_conf(24, 3, 72, 24, False, "RE", 1),
        bneck_conf(24, 5, 72, 40, True, "RE", 2),  # C2
        bneck_conf(40, 5, 120, 40, True, "RE", 1),
        bneck_conf(40, 5, 120, 40, True, "RE", 1),
        bneck_conf(40, 3, 240, 80, False, "HS", 2),  # C3
        bneck_conf(80, 3, 200, 80, False, "HS", 1),
        bneck_conf(80, 3, 184, 80, False, "HS", 1),
        bneck_conf(80, 3, 184, 80, False, "HS", 1),
        bneck_conf(80, 3, 480, 112, True, "HS", 1),
        bneck_conf(112, 3, 672, 112, True, "HS", 1),
        bneck_conf(112, 5, 672, 160 // reduce_divider, True, "HS", 2),  # C4
        bneck_conf(160 // reduce_divider, 5, 960 // reduce_divider, 160 // reduce_divider, True, "HS", 1),
        bneck_conf(160 // reduce_divider, 5, 960 // reduce_divider, 160 // reduce_divider, True, "HS", 1),
    ]
    last_channel = adjust_channels(1280 // reduce_divider)  # C5

    return MobileNetV3(inverted_residual_setting=inverted_residual_setting,
                       last_channel=last_channel,
                       num_classes=num_classes)


def mobilenet_v3_small(num_classes: int = 1000,
                       reduced_tail: bool = False) -> MobileNetV3:
    """
    Constructs a small MobileNetV3 architecture from
    "Searching for MobileNetV3" <https://arxiv.org/abs/1905.02244>.

    weights_link:
    https://download.pytorch.org/models/mobilenet_v3_small-047dcff4.pth

    Args:
        num_classes (int): number of classes
        reduced_tail (bool): If True, reduces the channel counts of all feature layers
            between C4 and C5 by 2. It is used to reduce the channel redundancy in the
            backbone for Detection and Segmentation.
    """
    width_multi = 1.0
    bneck_conf = partial(InvertedResidualConfig, width_multi=width_multi)
    # partial gives InvertedResidualConfig a preset width_multi hyperparameter
    adjust_channels = partial(InvertedResidualConfig.adjust_channels, width_multi=width_multi)

    reduce_divider = 2 if reduced_tail else 1

    inverted_residual_setting = [
        # matches the parameters in the paper's table
        # input_c, kernel, expanded_c, out_c, use_se, activation, stride
        bneck_conf(16, 3, 16, 16, True, "RE", 2),  # C1
        bneck_conf(16, 3, 72, 24, False, "RE", 2),  # C2
        bneck_conf(24, 3, 88, 24, False, "RE", 1),
        bneck_conf(24, 5, 96, 40, True, "HS", 2),  # C3
        bneck_conf(40, 5, 240, 40, True, "HS", 1),
        bneck_conf(40, 5, 240, 40, True, "HS", 1),
        bneck_conf(40, 5, 120, 48, True, "HS", 1),
        bneck_conf(48, 5, 144, 48, True, "HS", 1),
        bneck_conf(48, 5, 288, 96 // reduce_divider, True, "HS", 2),  # C4
        bneck_conf(96 // reduce_divider, 5, 576 // reduce_divider, 96 // reduce_divider, True, "HS", 1),
        bneck_conf(96 // reduce_divider, 5, 576 // reduce_divider, 96 // reduce_divider, True, "HS", 1)
    ]
    last_channel = adjust_channels(1024 // reduce_divider)  # C5

    return MobileNetV3(inverted_residual_setting=inverted_residual_setting,
                       last_channel=last_channel,
                       num_classes=num_classes)

3.train.py

import os
import sys
import json

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import transforms, datasets
from tqdm import tqdm

#from model_v2 import MobileNetV2
from model_v3 import mobilenet_v3_large

def main():
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    print("using {} device.".format(device))

    batch_size = 16
    epochs = 5

    data_transform = {
        "train": transforms.Compose([transforms.RandomResizedCrop(224),
                                     transforms.RandomHorizontalFlip(),
                                     transforms.ToTensor(),
                                     transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])]),
        "val": transforms.Compose([transforms.Resize(256),
                                   transforms.CenterCrop(224),
                                   transforms.ToTensor(),
                                   transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])}

    data_root = os.path.abspath(os.path.join(os.getcwd(), "../.."))  # get data root path
    image_path = os.path.join(data_root, "data_set", "flower_data")  # flower data set path
    assert os.path.exists(image_path), "{} path does not exist.".format(image_path)
    train_dataset = datasets.ImageFolder(root=os.path.join(image_path, "train"),
                                         transform=data_transform["train"])
    train_num = len(train_dataset)

    # {'daisy':0, 'dandelion':1, 'roses':2, 'sunflower':3, 'tulips':4}
    flower_list = train_dataset.class_to_idx
    cla_dict = dict((val, key) for key, val in flower_list.items())
    # write dict into json file
    json_str = json.dumps(cla_dict, indent=4)
    with open('class_indices.json', 'w') as json_file:
        json_file.write(json_str)

    nw = min([os.cpu_count(), batch_size if batch_size > 1 else 0, 8])  # number of workers
    print('Using {} dataloader workers every process'.format(nw))

    train_loader = torch.utils.data.DataLoader(train_dataset,
                                               batch_size=batch_size, shuffle=True,
                                               num_workers=nw)

    validate_dataset = datasets.ImageFolder(root=os.path.join(image_path, "val"),
                                            transform=data_transform["val"])
    val_num = len(validate_dataset)
    validate_loader = torch.utils.data.DataLoader(validate_dataset,
                                                  batch_size=batch_size, shuffle=False,
                                                  num_workers=nw)

    print("using {} images for training, {} images for validation.".format(train_num,
                                                                           val_num))

    # create model
    net = mobilenet_v3_large(num_classes=5)
    #net = MobileNetV2(num_classes=5)

    # load pretrain weights
    # download url: https://download.pytorch.org/models/mobilenet_v3_large-8738ca79.pth
    model_weight_path = "./mobilenet_v3_large.pth"
    assert os.path.exists(model_weight_path), "file {} does not exist.".format(model_weight_path)
    # load the pretrained weights
    pre_weights = torch.load(model_weight_path, map_location='cpu')

    # The official weights were pretrained on ImageNet, so the last layer has 1000 outputs while ours has 5.
    # Those classifier weights cannot be reused, so drop every tensor whose size does not match.
    pre_dict = {k: v for k, v in pre_weights.items() if net.state_dict()[k].numel() == v.numel()}
    missing_keys, unexpected_keys = net.load_state_dict(pre_dict, strict=False)

    # freeze features weights
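    # Freeze the parameters in features and train only the last two fully connected layers.
    # Comment out the next two lines to train the whole network.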
    for param in net.features.parameters():
        param.requires_grad = False

    net.to(device)

    # define loss function
    loss_function = nn.CrossEntropyLoss()

    # construct an optimizer
    params = [p for p in net.parameters() if p.requires_grad]
    optimizer = optim.Adam(params, lr=0.0001)

    best_acc = 0.0
    save_path = './MobileNetV3.pth'  # separate file so the pretrained weights loaded above are not overwritten
    train_steps = len(train_loader)
    for epoch in range(epochs):
        # train
        net.train()
        running_loss = 0.0
        train_bar = tqdm(train_loader, file=sys.stdout)
        for step, data in enumerate(train_bar):
            images, labels = data
            optimizer.zero_grad()
            logits = net(images.to(device))
            loss = loss_function(logits, labels.to(device))
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()

            train_bar.desc = "train epoch[{}/{}] loss:{:.3f}".format(epoch + 1,
                                                                     epochs,
                                                                     loss)

        # validate
        net.eval()
        acc = 0.0  # accumulate accurate number / epoch
        with torch.no_grad():
            val_bar = tqdm(validate_loader, file=sys.stdout)
            for val_data in val_bar:
                val_images, val_labels = val_data
                outputs = net(val_images.to(device))
                # loss = loss_function(outputs, test_labels)
                predict_y = torch.max(outputs, dim=1)[1]
                acc += torch.eq(predict_y, val_labels.to(device)).sum().item()

                val_bar.desc = "valid epoch[{}/{}]".format(epoch + 1,
                                                           epochs)
        val_accurate = acc / val_num
        print('[epoch %d] train_loss: %.3f  val_accuracy: %.3f' %
              (epoch + 1, running_loss / train_steps, val_accurate))

        if val_accurate > best_acc:
            best_acc = val_accurate
            torch.save(net.state_dict(), save_path)

    print('Finished Training')


if __name__ == '__main__':
    main()

4.predict.py

import os
import json

import torch
from PIL import Image
from torchvision import transforms
import matplotlib.pyplot as plt

from model_v2 import MobileNetV2


def main():
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    data_transform = transforms.Compose(
        [transforms.Resize(256),
         transforms.CenterCrop(224),
         transforms.ToTensor(),
         transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

    # load image
    img_path = "tulip.jpg"
    assert os.path.exists(img_path), "file: '{}' does not exist.".format(img_path)
    img = Image.open(img_path)
    plt.imshow(img)
    # [N, C, H, W]
    img = data_transform(img)
    # expand batch dimension
    img = torch.unsqueeze(img, dim=0)

    # read class_indict
    json_path = './class_indices.json'
    assert os.path.exists(json_path), "file: '{}' does not exist.".format(json_path)

    with open(json_path, "r") as f:
        class_indict = json.load(f)

    # create model (this script assumes MobileNetV2 weights; use the model class that matches what train.py saved)
    model = MobileNetV2(num_classes=5).to(device)
    # load model weights
    model_weight_path = "./MobileNetV2.pth"
    model.load_state_dict(torch.load(model_weight_path, map_location=device))
    model.eval()
    with torch.no_grad():
        # predict class
        output = torch.squeeze(model(img.to(device))).cpu()
        predict = torch.softmax(output, dim=0)
        predict_cla = torch.argmax(predict).numpy()

    print_res = "class: {}   prob: {:.3}".format(class_indict[str(predict_cla)],
                                                 predict[predict_cla].numpy())
    plt.title(print_res)
    for i in range(len(predict)):
        print("class: {:10}   prob: {:.3}".format(class_indict[str(i)],
                                                  predict[i].numpy()))
    plt.show()


if __name__ == '__main__':
    main()