[Paper Reading] MobileNet Series [V1-V3]

何如千泷

Last modified: 2024-05-28 11:08:39


First published: 2021-10-30 17:03:10

Article link: https://blog.csdn.net/qq_42735631/article/details/121053364


1. MobileNet V1

1.1 Abstract

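We present MobileNets, a class of efficient models for mobile and embedded vision applications that use depthwise separable convolutions to build lightweight deep neural networks. We also introduce two hyperparameters that trade off latency (model running time) against accuracy.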

1.2 Introduction

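The general trend in computer vision is to build deeper and more complicated networks to reach higher accuracy. Real-world applications, however, need recognition tasks to run with low latency on computationally limited platforms.

Many recent approaches focus only on model size and ignore speed; they mainly either compress a pretrained network or train a small network directly.

This paper describes an efficient network architecture (based on depthwise separable convolutions) and a set of two hyperparameters for building very small, low-latency models that can easily be matched to the design requirements of mobile and embedded vision applications.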

1.3 MobileNet Architecture

1.3.1 Depthwise Separable Convolution

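MobileNet is built on depthwise separable convolutions, which factorize a standard convolution into a depthwise convolution (depthwise conv) and a 1x1 pointwise convolution (pointwise conv).

depthwise conv: applies a single filter to each input channel
pointwise conv: uses a 1x1 convolution to combine the outputs of the depthwise conv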
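To make the two steps concrete, here is a minimal pure-Python sketch on nested lists. It is not the paper's implementation: the function names, the toy tensors, and the choice of stride 1 with no padding are all assumptions made for illustration.

```python
def depthwise_conv(x, dw_kernels):
    """x: [M][H][W], dw_kernels: [M][K][K]. Stride 1, no padding (assumed)."""
    k = len(dw_kernels[0])
    out_h = len(x[0]) - k + 1
    out_w = len(x[0][0]) - k + 1
    out = []
    for c in range(len(x)):  # one single filter per input channel
        plane = [[sum(x[c][i + di][j + dj] * dw_kernels[c][di][dj]
                      for di in range(k) for dj in range(k))
                  for j in range(out_w)]
                 for i in range(out_h)]
        out.append(plane)
    return out


def pointwise_conv(x, pw_weights):
    """x: [M][H][W], pw_weights: [N][M]. A 1x1 convolution that mixes channels."""
    m, h, w = len(x), len(x[0]), len(x[0][0])
    return [[[sum(pw_weights[n][c] * x[c][i][j] for c in range(m))
              for j in range(w)]
             for i in range(h)]
            for n in range(len(pw_weights))]


def depthwise_separable_conv(x, dw_kernels, pw_weights):
    # Spatial filtering per channel first, then channel mixing.
    return pointwise_conv(depthwise_conv(x, dw_kernels), pw_weights)


# Toy input with M=2 channels of size 3x3 (hypothetical values).
x = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
dw = [
    [[1, 0], [0, 1]],  # 2x2 filter for channel 0
    [[1, 1], [1, 1]],  # 2x2 filter for channel 1
]
pw = [[1, 1], [1, -1], [0, 2]]  # N=3 output channels, each a mix of 2 inputs
y = depthwise_separable_conv(x, dw, pw)
print(y)
```

The depthwise step never mixes channels and the pointwise step never looks at neighbouring pixels, which is exactly the factorization described above.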

[Figure omitted]

Worked example:
[Figure omitted]

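Computational cost of standard convolution versus depthwise separable convolution:

input feature map: $D_F \times D_F \times M$

output feature map: $D_G \times D_G \times N$

conv kernel size: $D_K \times D_K \times M \times N$

$D_F$: width and height of the input feature map
$D_G$: width and height of the output feature map
$M$: number of input channels
$N$: number of output channels
$D_K$: size of the convolution kernel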

Cost of a standard convolution:

Each output value costs $D_K \times D_K \times M$ multiply-adds, and there are $D_G \times D_G \times N$ output values, so the total cost is: $D_K \times D_K \times M \times N \times D_G \times D_G$

Cost of a depthwise separable convolution:

depthwise conv: each output value costs $D_K \times D_K$ multiply-adds, and there are $D_G \times D_G \times M$ output values, so the total cost is: $D_K \times D_K \times M \times D_G \times D_G$
pointwise conv: the cost is: $1 \times 1 \times M \times N \times D_G \times D_G$

$\frac {\text{depthwise separable convolution cost}} {\text{standard convolution cost}} = \frac {D_K \times D_K \times M \times D_G \times D_G + M \times N \times D_G \times D_G } {D_K \times D_K \times M \times N \times D_G \times D_G} = \frac {1} {N} + \frac {1} {D_K^2}$

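MobileNet uses 3x3 depthwise separable convolutions, which need 8 to 9 times less computation than standard convolutions at only a small loss in accuracy.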
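The ratio can be checked numerically. The sketch below plugs one example layer into the two cost formulas; the layer shape (3x3 kernel, 512 input and output channels, 14x14 output) is an assumed illustration, and the function names are hypothetical.

```python
from fractions import Fraction


def standard_conv_cost(d_k, m, n, d_g):
    # D_K * D_K * M * N * D_G * D_G
    return d_k * d_k * m * n * d_g * d_g


def separable_conv_cost(d_k, m, n, d_g):
    depthwise = d_k * d_k * m * d_g * d_g  # D_K * D_K * M * D_G * D_G
    pointwise = m * n * d_g * d_g          # M * N * D_G * D_G
    return depthwise + pointwise


# Assumed example layer: 3x3 kernel, 512 -> 512 channels, 14x14 output.
d_k, m, n, d_g = 3, 512, 512, 14
std = standard_conv_cost(d_k, m, n, d_g)
sep = separable_conv_cost(d_k, m, n, d_g)

# The exact ratio matches 1/N + 1/D_K^2 from the formula above.
assert Fraction(sep, std) == Fraction(1, n) + Fraction(1, d_k ** 2)
print(std, sep, std / sep)
```

For this layer the separable version is a little under 9 times cheaper, which is consistent with the 8 to 9 times figure quoted for 3x3 kernels: when $N$ is large the $\frac{1}{D_K^2} = \frac{1}{9}$ term dominates.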

1.3.2 Network Structure and Training

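MobileNet is built from depthwise separable convolutions, except for the first layer, which is a standard convolution. Every convolutional layer is followed by a BatchNorm layer and a ReLU layer, and the network ends with a fully connected layer and a softmax layer.
[Figure omitted]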

[Figure omitted]

1.3.3 Width Multiplier: Thinner Models

The width multiplier $\alpha$ thins the network uniformly at every layer. It changes the number of input channels $M$ to $\alpha M$ and the number of output channels $N$ to $\alpha N$, so the cost becomes: $D_K \times D_K \times \alpha M \times D_G \times D_G + \alpha M \times \alpha N \times D_G \times D_G$

1.3.4 Resolution Multiplier: Reduced Representation

The resolution multiplier $\rho$ changes the size of the input image and therefore the size of every layer's feature map. It changes the feature map size $D_G$ to $\rho D_G$, so the cost becomes: $D_K \times D_K \times M \times \rho D_G \times \rho D_G + M \times N \times \rho D_G \times \rho D_G$
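A small sketch of how the two multipliers scale the cost of one depthwise separable layer. The function name, the truncation with `int`, and the example layer shape are assumptions for illustration only.

```python
def mobilenet_layer_cost(d_k, m, n, d_g, alpha=1.0, rho=1.0):
    """Cost of one depthwise separable layer under width/resolution multipliers."""
    m_a = int(alpha * m)   # thinned input channels
    n_a = int(alpha * n)   # thinned output channels
    d_r = int(rho * d_g)   # reduced feature map size
    depthwise = d_k * d_k * m_a * d_r * d_r
    pointwise = m_a * n_a * d_r * d_r
    return depthwise + pointwise


# Assumed example layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map.
base = mobilenet_layer_cost(3, 512, 512, 14)
thin = mobilenet_layer_cost(3, 512, 512, 14, alpha=0.5)
small = mobilenet_layer_cost(3, 512, 512, 14, rho=0.5)

print(thin / base)   # close to alpha^2, because the pointwise term dominates
print(small / base)  # exactly rho^2, because both terms scale with D_G^2
```

The width multiplier reduces cost roughly quadratically (the depthwise term only shrinks linearly in $\alpha$), while the resolution multiplier reduces it by exactly $\rho^2$.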

1.4 Experiments

[Figure omitted]

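Table 4 shows that depthwise separable convolutions reduce accuracy by only about 1%, while cutting computation (mult-adds) by about 8.5x and model size by nearly 7x.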

[Figure omitted]


2. MobileNet V2

2.1. Abstract

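We present MobileNetV2, an architecture based on an inverted residual structure in which the shortcut connections run between the thin bottleneck layers. The intermediate expansion layer uses depthwise convolutions. We also find that removing the non-linearities in the narrow layers is necessary to maintain representational power.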

2.2. Introduction

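Our main contribution is a novel layer module: the inverted residual with linear bottleneck. The module takes a low-dimensional compressed representation as input, first expands it to a high dimension and filters it with a depthwise convolution, and then projects it back to a low-dimensional representation with a linear layer.

This convolutional module is particularly well suited to mobile designs because it significantly reduces the memory needed during inference by never fully materializing large intermediate tensors. This reduces the need for main-memory access on many embedded hardware designs, which provide only a small amount of very fast software-controlled cache memory.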

2.3 Preliminaries, discussion and intuition

2.3.1 Depthwise Separable Convolutions

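Depthwise separable convolutions are a key building block of many efficient network architectures. The basic idea is to factorize a full convolution into two separate parts. The first is a depthwise convolution, which performs lightweight filtering by applying a single convolutional filter to each input channel. The second is a 1x1 convolution, called a pointwise convolution, which builds new features by computing linear combinations of the input channels.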

2.3.2 Linear Bottlenecks

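[Figure omitted]

As the figure above shows, ReLU loses feature information when it is applied in low dimensions. For this reason, after the inverted residual block expands the channels and then projects them back to a low dimension, it no longer applies the non-linear ReLU activation. This is the so-called linear bottleneck.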

2.3.3 Inverted residuals

[Figure omitted]

2.4 Model Architecture

[Figure omitted]

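We use ReLU6 as the non-linearity because it is robust under low-precision computation.

Apart from the first layer, we use a constant expansion rate throughout the network; in our main experiments the expansion factor is set to 6.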
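To see what an expansion factor of 6 means for a single block, the sketch below applies the cost formulas from section 1.3.1 to the three convolutions of an inverted residual block. The function name, the returned dictionary, and the example shape (56x56 feature map, 24 channels) are assumptions for illustration; it also assumes the spatial size is divisible by the stride.

```python
def inverted_residual_cost(h, w, in_ch, out_ch, stride=1, expand_ratio=6, k=3):
    """Rough multiply-add count for one inverted residual block (biases and BN ignored)."""
    hidden = in_ch * expand_ratio
    out_h, out_w = h // stride, w // stride  # assumes h and w are divisible by stride
    # 1x1 expansion runs at the input resolution; it is skipped when expand_ratio == 1.
    expand = h * w * in_ch * hidden if expand_ratio != 1 else 0
    # kxk depthwise conv: one filter per hidden channel, applies the stride.
    depthwise = out_h * out_w * k * k * hidden
    # 1x1 linear projection back to the narrow bottleneck.
    project = out_h * out_w * hidden * out_ch
    return {
        "hidden": hidden,
        "expand": expand,
        "depthwise": depthwise,
        "project": project,
        "total": expand + depthwise + project,
        # The shortcut is only valid when input and output shapes match.
        "use_shortcut": stride == 1 and in_ch == out_ch,
    }


block = inverted_residual_cost(h=56, w=56, in_ch=24, out_ch=24, stride=1)
print(block)
```

In this example the depthwise convolution on the wide 144-channel tensor costs less than either 1x1 convolution, which is why expanding before the depthwise step is affordable.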

2.5 PyTorch Implementation

import torch
import torch.nn as nn
import torchvision


class ConvBNReLu(nn.Sequential):
    def __init__(self, in_channel, out_channel, kernel_size=3, stride=1, groups=1):
        padding = (kernel_size - 1) // 2
        super(ConvBNReLu, self).__init__(
            nn.Conv2d(in_channels=in_channel, out_channels=out_channel, kernel_size=kernel_size, stride=stride, padding=padding, groups=groups, bias=False),
            nn.BatchNorm2d(out_channel),
            nn.ReLU6(inplace=True)
        )


class InvertedResidual(nn.Module):
    def __init__(self, in_channel, out_channel, stride, expand_ratio):
        super(InvertedResidual, self).__init__()
        hidden_channel = in_channel * expand_ratio
        self.use_shortcut = stride == 1 and in_channel == out_channel

        layers = []
        if expand_ratio != 1:
            # 1x1 pointWise conv
            layers.append(ConvBNReLu(in_channel, hidden_channel, kernel_size=1))
        layers.extend([
            # 3x3 depthWise conv
            ConvBNReLu(hidden_channel, hidden_channel, stride=stride, groups=hidden_channel),
            # 1x1 pointWise conv(linear)
            nn.Conv2d(in_channels=hidden_channel, out_channels=out_channel, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channel),
        ])

        self.conv = nn.Sequential(*layers)

    def forward(self, x):
        if self.use_shortcut:
            return x + self.conv(x)
        else:
            return self.conv(x)


class MobileNetV2(nn.Module):
    def __init__(self, num_classes=1000):
        super(MobileNetV2, self).__init__()
        block = InvertedResidual
        input_channel = 32
        last_channel = 1280

        inverted_residual_setting = [
            # t, c, n, s
            [1, 16, 1, 1],
            [6, 24, 2, 2],
            [6, 32, 3, 2],
            [6, 64, 4, 2],
            [6, 96, 3, 1],
            [6, 160, 3, 2],
            [6, 320, 1, 1],
        ]

        features = []
        features.append(ConvBNReLu(in_channel=3, out_channel=input_channel, stride=2))
        for t, c, n, s in inverted_residual_setting:
            for i in range(n):
                stride = s if i == 0 else 1
                features.append(block(in_channel=input_channel, out_channel=c, stride=stride, expand_ratio=t))
                input_channel = c
        features.append(ConvBNReLu(in_channel=input_channel, out_channel=last_channel, kernel_size=1))
        self.features = nn.Sequential(*features)

        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.2),
            nn.Linear(last_channel, num_classes)
        )

        # weight init
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out')
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, mean=0, std=0.01)
                nn.init.zeros_(m.bias)

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, start_dim=1)
        x = self.classifier(x)
        return x

3. MobileNet V3

3.1 Efficient Mobile Building Blocks

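Mobile models have been built on increasingly efficient building blocks. MobileNetV1 introduced depthwise separable convolutions as an efficient replacement for traditional convolution layers. Depthwise separable convolutions effectively factorize traditional convolution by separating spatial filtering from the feature generation mechanism. They are defined by two separate layers: a lightweight depthwise convolution for spatial filtering and a 1x1 pointwise convolution for feature generation.

MobileNetV2 introduced the linear bottleneck and inverted residual structure to make layer structures more efficient by exploiting the low-rank nature of the problem. The structure, shown in the figure below, is defined by a 1x1 expansion convolution followed by a depthwise convolution and a 1x1 projection layer. The input and output are connected by a residual connection if and only if they have the same number of channels. The structure keeps a compact representation at the input and output while expanding internally to a higher-dimensional feature space to increase the expressiveness of the non-linear per-channel transformations.
[Figure omitted]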

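MnasNet builds on the MobileNetV2 structure by introducing lightweight attention modules based on squeeze-and-excitation (SE) into the bottleneck structure. Note that the SE module is integrated at a different location than in the ResNet-based modules proposed in SENet. The module is placed after the depthwise filters in the expansion so that attention is applied to the largest representation, as shown in the figure below.
[Figure omitted]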

The SE module is:
[Figure omitted]

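For MobileNetV3, we use a combination of these layers as building blocks to build the most effective models. The layers are also upgraded with a modified swish non-linearity. Both SE and the swish non-linearity use the sigmoid, which is inefficient to compute and hard to keep accurate in fixed-point arithmetic, so we replace it with the hard sigmoid.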

3.2 Network Improvements

3.2.1 Redesigning Expensive Layers

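The first modification reworks how the last few layers of the network interact so that the final features are produced more efficiently. Current models based on MobileNetV2's inverted bottleneck structure and its variants use a 1x1 convolution as the final layer to expand to a higher-dimensional feature space. This layer is critically important for having rich features for prediction, but it comes at the cost of extra latency. To reduce latency while preserving the high-dimensional features, we move this layer past the final average pooling. The final set of features is now computed at 1x1 spatial resolution instead of 7x7. As a result, computing the features becomes nearly free in terms of computation and latency.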
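The effect of moving a 1x1 convolution past the average pooling can be estimated with the pointwise cost formula from section 1.3.1. The channel counts below (960 to 1280) are borrowed from the large last stage in the code in section 3.4 purely as an illustration; they are not the paper's measured savings, and the function name is hypothetical.

```python
def pointwise_cost(in_ch, out_ch, h, w):
    # Cost of a 1x1 convolution: M * N * D_G * D_G
    return in_ch * out_ch * h * w


before_pool = pointwise_cost(960, 1280, 7, 7)  # applied on the 7x7 feature map
after_pool = pointwise_cost(960, 1280, 1, 1)   # applied after global average pooling

print(before_pool, after_pool, before_pool // after_pool)
```

Whatever the channel counts, the same layer is 49 times cheaper at 1x1 resolution than at 7x7, because its cost is proportional to the number of spatial positions.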

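Once the cost of this feature generation layer has been mitigated, the previous bottleneck projection layer is no longer needed to reduce computation. This observation lets us remove the projection and filtering layers in the previous bottleneck layer, further reducing computational complexity. The original and optimized last stages can be seen in the figure below. The efficient last stage reduces latency by 7 milliseconds, which is 11% of the running time, and removes 30 million MAdds with almost no loss of accuracy.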

[Figure omitted]

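Another expensive layer is the initial set of filters. Current mobile models tend to use 32 filters in a full 3x3 convolution to build the initial filter bank for edge detection. These filters are often mirror images of each other. We experimented with reducing the number of filters and with different non-linearities to try to reduce redundancy. We settled on the hard swish non-linearity for this layer because it performed as well as the other non-linearities tested. We were able to reduce the number of filters to 16 while maintaining the same accuracy as 32 filters using either ReLU or swish. This saves an additional 2 milliseconds and 10 million MAdds.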

3.2.2 Nonlinearities

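A non-linearity called swish was introduced that, when used as a drop-in replacement for ReLU, significantly improves the accuracy of neural networks. It is defined as $\text{swish}(x) = x \cdot \sigma(x)$

Although this non-linearity improves accuracy, it comes at a non-zero cost in embedded environments, because the sigmoid function is much more expensive to compute on mobile devices. We deal with this problem in two ways:

We replace the sigmoid function with its piecewise-linear analog $\frac {ReLU6(x+3)} {6}$; the minor difference is that we use ReLU6 rather than a custom clipping constant. Similarly, the hard version of swish becomes $\text{h-swish}(x) = x \cdot \frac {ReLU6(x+3)} {6}$
The cost of applying the non-linearity decreases as we go deeper into the network, because each layer's activation memory typically halves every time the resolution drops. Incidentally, we find that most of the benefit of swish is realized by using it only in the deeper layers. In our architectures we therefore use h-swish only in the second half of the model.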
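The hard variants are easy to compare with the originals on scalars. The following is a minimal pure-Python sketch of the formulas above; the function names and the sample points are chosen for illustration.

```python
import math


def relu6(x):
    return min(max(x, 0.0), 6.0)


def hard_sigmoid(x):
    # Piecewise-linear stand-in for sigmoid: ReLU6(x + 3) / 6
    return relu6(x + 3.0) / 6.0


def hard_swish(x):
    # h-swish(x) = x * ReLU6(x + 3) / 6
    return x * hard_sigmoid(x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def swish(x):
    return x * sigmoid(x)


for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.3f}  hard_sigmoid={hard_sigmoid(x):.3f}  "
          f"swish={swish(x):+.3f}  hard_swish={hard_swish(x):+.3f}")
```

The hard versions need only an add, a clamp, and a multiply, with no exponential, which is what makes them cheap and well behaved in fixed-point arithmetic.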

3.2.3 Large squeeze-and-excite

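In MnasNet, the size of the SE bottleneck was relative to the size of the convolutional bottleneck. Instead, we fix them all to 1/4 of the number of channels in the expansion layer. We find that doing so increases accuracy at the cost of a modest increase in the number of parameters, and with no discernible latency cost.
[Figure omitted]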
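As a rough illustration of the 1/4 rule, the sketch below computes the SE bottleneck width and the parameter count of the two fully connected layers for a few expansion sizes. It mirrors the floor division used in the code in section 3.4; other implementations may round the width differently, and the function names here are hypothetical.

```python
def se_mid_channels(expansion_channels, divide=4):
    # SE bottleneck width fixed to 1/4 of the expansion layer's channels.
    return expansion_channels // divide


def se_param_count(expansion_channels, divide=4):
    mid = se_mid_channels(expansion_channels, divide)
    squeeze = expansion_channels * mid + mid                # first FC layer: weights + biases
    excite = mid * expansion_channels + expansion_channels  # second FC layer: weights + biases
    return squeeze + excite


for exp in (72, 120, 480, 672, 960):
    print(exp, se_mid_channels(exp), se_param_count(exp))
```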

3.2.4 MobileNetV3 Definitions

[Figure omitted]

3.4 PyTorch Implementation

import torch
import torch.nn as nn
import torchvision


class HardSwish(nn.Module):
    def __init__(self, inplace=True):
        super(HardSwish, self).__init__()
        self.relu6 = nn.ReLU6(inplace=inplace)

    def forward(self, x):
        return x * self.relu6(x+3)/6


class ConvBNActivation(nn.Sequential):
    def __init__(self, in_channel, out_channel, kernel_size, stride, groups, activate):
        padding = (kernel_size - 1) // 2
        super(ConvBNActivation, self).__init__(
            nn.Conv2d(in_channels=in_channel, out_channels=out_channel, kernel_size=kernel_size, stride=stride, padding=padding, groups=groups, bias=False),
            nn.BatchNorm2d(out_channel),
            nn.ReLU6(inplace=True) if activate == 'relu' else HardSwish()
        )


class SqueezeAndExcite(nn.Module):
    def __init__(self, in_channel, out_channel, divide=4):
        super(SqueezeAndExcite, self).__init__()
        mid_channel = in_channel // divide
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        self.SEblock = nn.Sequential(
            nn.Linear(in_features=in_channel, out_features=mid_channel),
            nn.ReLU6(inplace=True),
            nn.Linear(in_features=mid_channel, out_features=out_channel),
            # The SE gate should lie in [0, 1], so use hard sigmoid (ReLU6(x + 3) / 6), not hard swish
            nn.Hardsigmoid(),
        )

    def forward(self, x):
        b, c, h, w = x.size()
        out = self.pool(x)
        out = torch.flatten(out, start_dim=1)
        out = self.SEblock(out)
        out = out.view(b, c, 1, 1)
        return out * x


class SEInverteBottleneck(nn.Module):
    def __init__(self, in_channel, mid_channel, out_channel, kernel_size, use_se, activate, stride):
        super(SEInverteBottleneck, self).__init__()
        self.use_shortcut = stride == 1 and in_channel == out_channel
        self.use_se = use_se

        self.conv = ConvBNActivation(in_channel=in_channel, out_channel=mid_channel, kernel_size=1, stride=1, groups=1, activate=activate)
        self.depth_conv = ConvBNActivation(in_channel=mid_channel, out_channel=mid_channel, kernel_size=kernel_size, stride=stride, groups=mid_channel, activate=activate)
        if self.use_se:
            self.SEblock = SqueezeAndExcite(in_channel=mid_channel, out_channel=mid_channel)

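        # 1x1 projection is linear (no activation), following the linear-bottleneck design
        self.point_conv = nn.Sequential(
            nn.Conv2d(in_channels=mid_channel, out_channels=out_channel, kernel_size=1, stride=1, bias=False),
            nn.BatchNorm2d(out_channel),
        )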

    def forward(self, x):
        out = self.conv(x)
        out = self.depth_conv(out)
        if self.use_se:
            out = self.SEblock(out)
        out = self.point_conv(out)
        if self.use_shortcut:
            return x + out
        return out


class MobileNetV3(nn.Module):
    def __init__(self, num_classes=1000, type='large'):
        super(MobileNetV3, self).__init__()
        self.type = type

        self.first_conv = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(16),
            HardSwish(),
        )

        if self.type == 'large':
            self.large_bottleneck = nn.Sequential(
                SEInverteBottleneck(in_channel=16, mid_channel=16, out_channel=16, kernel_size=3, use_se=False, activate='relu', stride=1),
                SEInverteBottleneck(in_channel=16, mid_channel=64, out_channel=24, kernel_size=3, use_se=False, activate='relu', stride=2),
                SEInverteBottleneck(in_channel=24, mid_channel=72, out_channel=24, kernel_size=3, use_se=False, activate='relu', stride=1),
                SEInverteBottleneck(in_channel=24, mid_channel=72, out_channel=40, kernel_size=5, use_se=True, activate='relu', stride=2),
                SEInverteBottleneck(in_channel=40, mid_channel=120, out_channel=40, kernel_size=5, use_se=True, activate='relu', stride=1),
                SEInverteBottleneck(in_channel=40, mid_channel=120, out_channel=40, kernel_size=5, use_se=True, activate='relu', stride=1),
                SEInverteBottleneck(in_channel=40, mid_channel=240, out_channel=80, kernel_size=3, use_se=False, activate='hswish', stride=2),
                SEInverteBottleneck(in_channel=80, mid_channel=200, out_channel=80, kernel_size=3, use_se=False, activate='hswish', stride=1),
                SEInverteBottleneck(in_channel=80, mid_channel=184, out_channel=80, kernel_size=3, use_se=False, activate='hswish', stride=1),
                SEInverteBottleneck(in_channel=80, mid_channel=184, out_channel=80, kernel_size=3, use_se=False, activate='hswish', stride=1),
                SEInverteBottleneck(in_channel=80, mid_channel=480, out_channel=112, kernel_size=3, use_se=True, activate='hswish', stride=1),
                SEInverteBottleneck(in_channel=112, mid_channel=672, out_channel=112, kernel_size=3, use_se=True, activate='hswish', stride=1),
                SEInverteBottleneck(in_channel=112, mid_channel=672, out_channel=160, kernel_size=5, use_se=True, activate='hswish', stride=2),
                SEInverteBottleneck(in_channel=160, mid_channel=960, out_channel=160, kernel_size=5, use_se=True, activate='hswish', stride=1),
                SEInverteBottleneck(in_channel=160, mid_channel=960, out_channel=160, kernel_size=5, use_se=True, activate='hswish', stride=1),
            )
            self.large_last_stage = nn.Sequential(
                nn.Conv2d(in_channels=160, out_channels=960, kernel_size=1, stride=1, bias=False),
                nn.BatchNorm2d(960),
                HardSwish(),
                nn.AdaptiveAvgPool2d((1, 1)),
                nn.Conv2d(in_channels=960, out_channels=1280, kernel_size=1, stride=1, bias=False),
                HardSwish(),
            )
        else:
            self.small_bottleneck = nn.Sequential(
                SEInverteBottleneck(in_channel=16, mid_channel=16, out_channel=16, kernel_size=3, use_se=True, activate='relu', stride=2),
                SEInverteBottleneck(in_channel=16, mid_channel=72, out_channel=24, kernel_size=3, use_se=False, activate='relu', stride=2),
                SEInverteBottleneck(in_channel=24, mid_channel=88, out_channel=24, kernel_size=3, use_se=False, activate='relu', stride=1),
                SEInverteBottleneck(in_channel=24, mid_channel=96, out_channel=40, kernel_size=5, use_se=True, activate='hswish', stride=2),
                SEInverteBottleneck(in_channel=40, mid_channel=240, out_channel=40, kernel_size=5, use_se=True, activate='hswish', stride=1),
                SEInverteBottleneck(in_channel=40, mid_channel=240, out_channel=40, kernel_size=5, use_se=True, activate='hswish', stride=1),
                SEInverteBottleneck(in_channel=40, mid_channel=120, out_channel=48, kernel_size=5, use_se=True, activate='hswish', stride=1),
                SEInverteBottleneck(in_channel=48, mid_channel=144, out_channel=48, kernel_size=5, use_se=True, activate='hswish', stride=1),
                SEInverteBottleneck(in_channel=48, mid_channel=288, out_channel=96, kernel_size=5, use_se=True, activate='hswish', stride=2),
                SEInverteBottleneck(in_channel=96, mid_channel=576, out_channel=96, kernel_size=5, use_se=True, activate='hswish', stride=1),
                SEInverteBottleneck(in_channel=96, mid_channel=576, out_channel=96, kernel_size=5, use_se=True, activate='hswish', stride=1),
            )
            self.small_last_stage = nn.Sequential(
                nn.Conv2d(in_channels=96, out_channels=576, kernel_size=1, stride=1, bias=False),
                nn.BatchNorm2d(576),
                HardSwish(),
                nn.AdaptiveAvgPool2d((1, 1)),
                nn.Conv2d(in_channels=576, out_channels=1280, kernel_size=1, stride=1, bias=False),
                HardSwish(),
            )

        self.classifier = nn.Sequential(
            nn.Dropout(p=0.2),
            nn.Linear(in_features=1280, out_features=num_classes),
        )

        # weight init
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out')
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, mean=0, std=0.01)
                nn.init.zeros_(m.bias)

    def forward(self, x):
        x = self.first_conv(x)
        if self.type == 'large':
            x = self.large_bottleneck(x)
            x = self.large_last_stage(x)
        else:
            x = self.small_bottleneck(x)
            x = self.small_last_stage(x)

        x = torch.flatten(x, start_dim=1)
        x = self.classifier(x)
        return x
