【MobileNet】《MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications》

bryant_meng

Last modified: 2024-07-03 09:58:23


Category: CNN / Transformer. Tags: Mobilenet

First published: 2019-01-06 22:59:18

Article link: https://blog.csdn.net/bryant_meng/article/details/85798881

arXiv 2017 (arXiv:1704.04861)

Caffe code: https://github.com/yihui-he/Xception-caffe
Caffe model visualization tool: http://ethereon.github.io/netscope/#/editor
Keras code: https://github.com/Hedlen/Mobilenet-Keras/blob/master/model/mobilenet.py
For a small experiment on CIFAR-10, see the blog post 【Keras-MobileNet v1】CIFAR-10

1 Background and Motivation


CNNs are powerful. However, deploying them in real products still runs into many constraints:

In many real world applications such as robotics, self-driving car and augmented reality, the recognition tasks need to be carried out in a timely fashion on a computationally limited platform.

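Common techniques for shrinking models include¹:

(1) Kernel factorization: replacing an N×N kernel with a 1×N kernel followed by an N×1 kernel

(2) Bottleneck structures, with SqueezeNet as the representative example

(3) Storing weights as low-precision floating-point numbers, e.g. Deep Compression

(4) Pruning redundant kernels and Huffman coding

The idea in this paper comes from depth-wise separable convolution.

Replacing standard convolutions with depth-wise separable ones greatly reduces both computation and parameter count. To borrow the old verse, the swallows that once nested in noble halls (servers) can now fly into ordinary homes (mobile and embedded vision applications).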

2 Innovations

This paper applies depth-wise separable convolution to classification networks and proposes a class of efficient models called MobileNets, greatly reducing computation and parameter count.
It introduces two shrinking hyper-parameters, the width multiplier (channels) and the resolution multiplier (resolution), to trade off between latency and accuracy.
The shrunken models perform well on many tasks: Fine Grained Recognition, Large Scale Geolocalization, Face Attributes, Object Detection, and Face Embeddings.
(We concluded by demonstrating MobileNet’s effectiveness when applied to a wide variety of tasks)

3 Advantages / Contributions

Greatly reduces parameter count and computation (making deployment on phones and embedded vision devices feasible)
Widely applicable

4 Method

4.1 Depthwise Separable Convolution

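1) Standard convolution
$G_{k,l,n} = \sum_{i,j,m} K_{i,j,m,n} \cdot F_{k+i-1,\,l+j-1,\,m}$

G: the feature map after convolution
K: the filter (convolution kernel)
F: the input feature map
m, n: channel indices of the input and output feature maps
k, l: coordinates in the output feature map (row k, column l)
i, j: coordinates within the filter (kernel)

I do not quite understand what the -1 is doing here.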
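One plausible reading of the -1 (an interpretation, not something the paper spells out): if the indices i, j, k, l all start at 1, then the first kernel tap (i = 1) has to line up with input row k, and k + i - 1 does exactly that. The sketch below is a plain-Python version of the formula, assuming stride 1 and no padding; the function name `standard_conv` and the nested-list layout are illustrative choices.

```python
def standard_conv(F, K):
    """Direct standard convolution G[k,l,n] = sum_{i,j,m} K[i,j,m,n] * F[k+i-1, l+j-1, m].

    F: input feature map as nested lists, shape H x W x M.
    K: kernel as nested lists, shape Dk x Dk x M x N.
    Stride 1 and no padding are assumed, so the output is (H-Dk+1) x (W-Dk+1) x N.
    """
    H, W, M = len(F), len(F[0]), len(F[0][0])
    Dk, N = len(K), len(K[0][0][0])
    H_out, W_out = H - Dk + 1, W - Dk + 1
    G = [[[0.0] * N for _ in range(W_out)] for _ in range(H_out)]
    # Loop with 1-based indices, as in the formula.
    for k in range(1, H_out + 1):
        for l in range(1, W_out + 1):
            for n in range(1, N + 1):
                total = 0.0
                for i in range(1, Dk + 1):
                    for j in range(1, Dk + 1):
                        for m in range(1, M + 1):
                            # (k + i - 1, l + j - 1) is the 1-based input position;
                            # the extra "- 1" on each index converts to Python's 0-based lists.
                            total += (K[i - 1][j - 1][m - 1][n - 1]
                                      * F[k + i - 1 - 1][l + j - 1 - 1][m - 1])
                G[k - 1][l - 1][n - 1] = total
    return G


# 3 x 3 single-channel input holding 1..9, and a 2 x 2 all-ones kernel (M = N = 1).
F = [[[float(3 * r + c + 1)] for c in range(3)] for r in range(3)]
K = [[[[1.0]] for _ in range(2)] for _ in range(2)]
G = standard_conv(F, K)
print(G)  # [[[12.0], [16.0]], [[24.0], [28.0]]]
```

With i = 1 the kernel's first row reads input row k itself, so without the -1 the window would be shifted one row down and one column right.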

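2) Computational cost of standard convolution

$D_K$: filter size
$M, N$: number of input and output channels of the feature map
$D_F$: spatial resolution of the output feature map, i.e. its h and w

Most readers know how to count the parameters of a convolution, but what about the computation, i.e. the floating-point operations (multiplies and adds)? It is the parameter count multiplied by the resolution of the output feature map. Why? One application of a $D_K*D_K*M*N$ convolution produces a $1 * 1 * N$ result, and to fill the whole output feature map ( $D_F*D_F*N$ ) it has to be applied $D_F*D_F$ times, so the total computation is:

$D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$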
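The counting argument above can be checked by brute force. The sketch below (helper names are illustrative, not from the paper) walks over every output position and output channel, adds one multiply-add per kernel weight touched, and compares the tally with the closed form.

```python
def standard_conv_cost(dk, m, n, df):
    """Closed-form parameter count and multiply-add count of a standard convolution."""
    params = dk * dk * m * n
    mult_adds = params * df * df  # every weight is used once per output position
    return params, mult_adds


def count_mult_adds_by_loops(dk, m, n, df):
    """Count multiply-adds by walking over the loops of a direct convolution."""
    count = 0
    for _row in range(df):            # output rows
        for _col in range(df):        # output columns
            for _out_ch in range(n):  # output channels
                count += dk * dk * m  # one dk x dk x m dot product per output value
    return count


params, mult_adds = standard_conv_cost(dk=3, m=4, n=8, df=5)
print(params, mult_adds)                      # 288 7200
print(count_mult_adds_by_loops(3, 4, 8, 5))   # 7200
```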

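3) Depth-wise separable convolution
[Figure: depth-wise separable convolution]
Image source: https://zhuanlan.zhihu.com/p/28749411

It has two steps:

First, a depth-wise convolution filters each channel separately, usually with a filter size of 3.

Compare this with standard convolution, where every filter spans all input channels.
Then a point-wise convolution applies a $1 * 1 * M * N$ convolution to the $D_F*D_F*M$ feature map produced by the previous step.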

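4) Computational cost of depth-wise separable convolution
[Figure: cost of the depth-wise and point-wise steps]
Image source: https://zhuanlan.zhihu.com/p/28749411

The depth-wise convolution costs $D_K*D_K*M*D_F*D_F$ (a $D_K*D_K$ convolution on each channel, $M$ channels in total)
The point-wise convolution costs $1*1*M*N*D_F*D_F$

The total cost is:
$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$

$D_K$: filter size
$M, N$: number of input and output channels of the feature map
$D_F$: spatial resolution of the output feature map, i.e. its h and w

The cost relative to standard convolution is:
$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2}$

Note that $N$ ( $32/64/128/256/512/1024$ ) is usually much larger than $D_K^2$ ( $3^2$ ), so the $\frac{1}{D_K^2}$ term dominates the ratio and the computation drops by a factor of about 8 to 9 overall. The reduction factor for each value of $N$ is:

# Reduction factor 1 / (1/N + 1/D_K^2) for N = 32 ... 1024 with D_K = 3
for i in range(5,11):
    print(2**i,':',1/(1/2**i+ 1/9))

Output: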

32 : 7.024390243902439
64 : 7.89041095890411
128 : 8.408759124087592
256 : 8.69433962264151
512 : 8.844529750479847
1024 : 8.9215876089061
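As a cross-check on the closed form, the same ratio can be computed from the two raw costs. The sketch below uses exact fractions so the comparison does not depend on floating-point rounding; the helper names and the choice of $M = 512$, $D_F = 14$ are illustrative.

```python
from fractions import Fraction


def standard_cost(dk, m, n, df):
    """Mult-adds of a standard convolution."""
    return dk * dk * m * n * df * df


def separable_cost(dk, m, n, df):
    """Mult-adds of a depth-wise convolution followed by a point-wise convolution."""
    depthwise = dk * dk * m * df * df
    pointwise = m * n * df * df
    return depthwise + pointwise


dk, m, df = 3, 512, 14
for n in (32, 64, 128, 256, 512, 1024):
    ratio = Fraction(separable_cost(dk, m, n, df), standard_cost(dk, m, n, df))
    # The ratio depends only on N and D_K; M and D_F cancel out.
    assert ratio == Fraction(1, n) + Fraction(1, dk * dk)
    print(n, ':', float(1 / ratio))
```

Because $M$ and $D_F$ cancel, the reduction factor is the same for every layer that shares the same $N$ and kernel size.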

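A PyTorch implementation of depth-wise separable convolution:

import torch.nn as nn

class depthwise_separable_conv(nn.Module):
    def __init__(self, nin, nout):
        super(depthwise_separable_conv, self).__init__()
        # groups=nin gives every input channel its own 3x3 filter (depth-wise step)
        self.depthwise = nn.Conv2d(nin, nin, kernel_size=3, padding=1, groups=nin)
        # the 1x1 convolution mixes channels and maps nin -> nout (point-wise step)
        self.pointwise = nn.Conv2d(nin, nout, kernel_size=1)

    def forward(self, x):
        out = self.depthwise(x)
        out = self.pointwise(out)
        return out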

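Reference: 如何在pytorch中使用可分离卷积 depth-wise Separable convolution ("How to use depth-wise separable convolution in PyTorch")

The depth-wise layer differs from an ordinary 3x3 convolution mainly in the groups argument: an ordinary conv uses groups=1, so every filter sees all input channels.

An ordinary 3x3 convolution in PyTorch:

import torch.nn as nn

class conv(nn.Module):
    def __init__(self, nin, nout):
        super(conv, self).__init__()
        # groups=1: each of the nout filters spans all nin input channels
        self.conv3 = nn.Conv2d(nin, nout, kernel_size=3, padding=1, groups=1)

    def forward(self, x):
        out = self.conv3(x)
        return out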
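To see what the groups argument buys, the parameter counts of the two modules can be worked out by hand. The sketch below does not import PyTorch; it relies on the fact that a `Conv2d` weight has shape (out_channels, in_channels / groups, kH, kW) plus one bias per output channel. The helper name `conv2d_num_params` and the sizes 32 and 64 are illustrative.

```python
def conv2d_num_params(in_channels, out_channels, kernel_size, groups=1, bias=True):
    """Parameter count of a 2D convolution whose weight has shape
    (out_channels, in_channels // groups, kernel_size, kernel_size)."""
    assert in_channels % groups == 0 and out_channels % groups == 0
    weights = out_channels * (in_channels // groups) * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


nin, nout = 32, 64
ordinary = conv2d_num_params(nin, nout, kernel_size=3, groups=1)
depthwise = conv2d_num_params(nin, nin, kernel_size=3, groups=nin)
pointwise = conv2d_num_params(nin, nout, kernel_size=1)
print(ordinary, depthwise + pointwise)  # 18496 2432
```

For these sizes the separable pair needs about 7.6 times fewer parameters than the ordinary 3x3 convolution, in line with the ratio derived earlier for the computation.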

4.2 Network Structure and Training

Left: standard convolution. Right: depth-wise separable convolution.
[Figure: standard convolution block vs depth-wise separable convolution block]
The overall architecture is as follows:

$s_1$ denotes $stride = 1$, and $s_2$ denotes $stride = 2$

Down sampling ( $stride = 2$ ) is handled with strided convolution in the depthwise convolutions as well as in the first layer.
use less regularization and data augmentation techniques because small models have less trouble with overfitting.
put very little or no weight decay (l2 regularization) on the depthwise filters since there are so few parameters in them.
using RMSprop with asynchronous gradient descent

Image source: http://cs231n.github.io/neural-networks-3/

Most of the parameters sit in the 1x1 convolutions
[Table: share of computation and parameters per layer type]
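The claim about 1x1 convolutions can be reproduced with a short tally. The layer list below is recalled from Table 1 of the paper rather than taken from this post, so treat it as an assumption; biases and batch-norm parameters are ignored, and the function name `mobilenet_budget` is illustrative.

```python
# Depthwise separable blocks as (in_channels, out_channels, stride of the depthwise conv).
# Recalled from Table 1 of the paper; an assumption, not a quotation.
SEPARABLE_BLOCKS = (
    [(32, 64, 1), (64, 128, 2), (128, 128, 1), (128, 256, 2), (256, 256, 1), (256, 512, 2)]
    + [(512, 512, 1)] * 5
    + [(512, 1024, 2), (1024, 1024, 1)]
)


def mobilenet_budget(resolution=224, num_classes=1000):
    """Return per-layer-type (params, mult_adds), ignoring biases and batch norm."""
    params = {"conv3x3": 0, "depthwise": 0, "pointwise": 0, "fc": 0}
    mult_adds = dict(params)

    def add(kind, weights, size):
        params[kind] += weights
        mult_adds[kind] += weights * size * size

    size = resolution // 2                  # first layer: 3x3 conv, 3 -> 32 channels, stride 2
    add("conv3x3", 3 * 3 * 3 * 32, size)
    for cin, cout, stride in SEPARABLE_BLOCKS:
        size //= stride                     # only the depthwise conv changes the resolution
        add("depthwise", 3 * 3 * cin, size)
        add("pointwise", cin * cout, size)
    add("fc", SEPARABLE_BLOCKS[-1][1] * num_classes, 1)  # classifier after global average pooling
    return params, mult_adds


params, mult_adds = mobilenet_budget()
total_params, total_mult_adds = sum(params.values()), sum(mult_adds.values())
print(f"parameters: {total_params / 1e6:.2f} M, mult-adds: {total_mult_adds / 1e6:.0f} M")
print(f"1x1 conv share: {params['pointwise'] / total_params:.1%} of parameters, "
      f"{mult_adds['pointwise'] / total_mult_adds:.1%} of mult-adds")
```

Under these assumptions the totals come out at roughly 4.2 million parameters and 569 million mult-adds, with about 75% of the parameters and 95% of the mult-adds in the 1x1 convolutions.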
This can be implemented with highly optimized general matrix multiply (GEMM) functions. (That is, the 1×1 convolutions are computed with optimized GEMM routines.)
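Why a 1×1 convolution maps onto GEMM so directly: each output pixel is just the input pixel's channel vector multiplied by an $M \times N$ weight matrix, so flattening the $H \times W$ grid into rows gives a single $(H \cdot W) \times M$ by $M \times N$ product, with no im2col-style reordering of the input. The sketch below shows the equivalence in plain Python; the function names and toy data are illustrative, and a real implementation would call an optimized BLAS routine instead of `matmul`.

```python
def pointwise_conv_direct(F, Wt):
    """1x1 convolution by explicit loops. F: H x W x M, Wt: M x N -> H x W x N."""
    H, W, M, N = len(F), len(F[0]), len(Wt), len(Wt[0])
    return [[[sum(F[r][c][m] * Wt[m][n] for m in range(M)) for n in range(N)]
             for c in range(W)] for r in range(H)]


def matmul(A, B):
    """Plain (rows x inner) @ (inner x cols) matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def pointwise_conv_gemm(F, Wt):
    """The same 1x1 convolution as one matrix multiply: (H*W x M) @ (M x N)."""
    H, W = len(F), len(F[0])
    flat = [F[r][c] for r in range(H) for c in range(W)]  # (H*W) x M, pixels become rows
    out = matmul(flat, Wt)                                # (H*W) x N
    return [out[r * W:(r + 1) * W] for r in range(H)]     # back to H x W x N


# Toy input: 2 x 2 spatial grid, M = 3 input channels, N = 2 output channels.
F = [[[1, 2, 3], [4, 5, 6]],
     [[7, 8, 9], [1, 0, 1]]]
Wt = [[1, 0],
      [0, 1],
      [1, 1]]
print(pointwise_conv_gemm(F, Wt))  # [[[4, 5], [10, 11]], [[16, 17], [2, 1]]]
```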

4.3 Width Multiplier: Thinner Models

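The hyper-parameter $\alpha$ thins the network by reducing the number of channels in every feature map. The total cost becomes:
$D_K \cdot D_K \cdot \alpha M \cdot D_F \cdot D_F + \alpha M \cdot \alpha N \cdot D_F \cdot D_F$
$\alpha \in (0,1]$

Compare with the cost before applying $\alpha$:
$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$

The cost ratio relative to standard convolution is given below; the script that follows prints its reciprocal, i.e. the reduction factor:
$\frac{D_K \cdot D_K \cdot \alpha M \cdot D_F \cdot D_F + \alpha M \cdot \alpha N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{\alpha}{N} + \frac{\alpha ^2}{D_K^2}$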

for a in (1,0.75,0.5,0.25):
    for i in range(5,11):
        print(2**i,', a =',a,':',1/(a/2**i+ a**2/9))
    print('\n')

Output:

32 , a = 1 : 7.024390243902439
64 , a = 1 : 7.89041095890411
128 , a = 1 : 8.408759124087592
256 , a = 1 : 8.69433962264151
512 , a = 1 : 8.844529750479847
1024 , a = 1 : 8.9215876089061

32 , a = 0.75 : 11.636363636363637
64 , a = 0.75 : 13.473684210526315
128 , a = 0.75 : 14.628571428571428
256 , a = 0.75 : 15.283582089552239
512 , a = 0.75 : 15.633587786259541
1024 , a = 0.75 : 15.814671814671815

32 , a = 0.5 : 23.04
64 , a = 0.5 : 28.097560975609756
128 , a = 0.5 : 31.56164383561644
256 , a = 0.5 : 33.63503649635037
512 , a = 0.5 : 34.77735849056604
1024 , a = 0.5 : 35.37811900191939

32 , a = 0.25 : 67.76470588235294
64 , a = 0.25 : 92.16
128 , a = 0.25 : 112.39024390243902
256 , a = 0.25 : 126.24657534246576
512 , a = 0.25 : 134.54014598540147
1024 , a = 0.25 : 139.10943396226415

4.4 Resolution Multiplier: Reduced Representation

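A second hyper-parameter, $\rho$, scales the resolution of the input image, which in turn reduces the resolution of every feature map. The total cost becomes:
$D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F$
The cost ratio relative to standard convolution is:
$\frac{D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{\alpha \cdot \rho^2 }{N} + \frac{\alpha ^2 \cdot \rho^2 }{D_K^2}$
so the computation is reduced by a further factor of $\rho^2$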
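The algebra can be verified with exact fractions. In the sketch below, both terms of the cost carry the factor $\rho^2$, so the ratio works out to $\frac{\alpha \rho^2}{N} + \frac{\alpha^2 \rho^2}{D_K^2}$. The helper names and the sample values ($\alpha = 3/4$, $\rho = 6/7$, i.e. 192/224) are illustrative.

```python
from fractions import Fraction


def separable_cost(dk, m, n, df, alpha=1, rho=1):
    """Mult-adds of a depthwise separable layer with width multiplier alpha
    and resolution multiplier rho (exact arithmetic)."""
    alpha, rho = Fraction(alpha), Fraction(rho)
    m_thin, n_thin, df_small = alpha * m, alpha * n, rho * df
    depthwise = dk * dk * m_thin * df_small * df_small
    pointwise = m_thin * n_thin * df_small * df_small
    return depthwise + pointwise


def standard_cost(dk, m, n, df):
    """Mult-adds of the unmodified standard convolution."""
    return dk * dk * m * n * df * df


dk, m, n, df = 3, 512, 512, 14
alpha, rho = Fraction(3, 4), Fraction(6, 7)
ratio = separable_cost(dk, m, n, df, alpha, rho) / standard_cost(dk, m, n, df)
# Both terms of the ratio are scaled by rho squared.
assert ratio == alpha * rho**2 / n + alpha**2 * rho**2 / dk**2
print(float(ratio), float(1 / ratio))
```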

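A small example
[Table: resource usage of one layer with $D_K = 3$, $M = 512$, $N = 512$, $D_F = 14$ under each modification]
Following the formulas, the calculations are as follows (in each pair, the first line is the computation and the second line is the parameter count).

Standard convolution:
$3 * 3 * 14 * 14 * 512 * 512 = 462,422,016$
$3 * 3 * 512 * 512 = 2,359,296$

Depth-wise separable convolution:
$3 * 3 * 512 * 14 * 14 + 1 * 1 * 512 * 512 * 14 * 14 = 52,283,392$
$3 * 3 * 512 + 512 * 1 * 1 * 512 = 266,752$

$\alpha = 0.75$:
$3 * 3 * 0.75 * 512 * 14 * 14 + 1 * 1 * 0.75 * 512 * 0.75 * 512 * 14 * 14 = 29,578,752$
$3 * 3 * 0.75 * 512 + 1 * 1 * 0.75 * 512 * 0.75 * 512 = 150,912$

$\alpha = 0.75$, $\rho = 0.714$:
$3 * 3 * 0.75 * 512 * (0.714 * 14)^2 + 1 * 1 * 0.75 * 512 * 0.75 * 512 * (0.714 * 14)^2 \approx 15,079,129$ (fractional part truncated)
$3 * 3 * 0.75 * 512 + 1 * 1 * 0.75 * 512 * 0.75 * 512 = 150,912$

Note that $\rho$ changes the resolution and therefore only affects the computation; it does not change the parameter count.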
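The four rows of the example can be reproduced with a few lines of Python. This is a sketch using the formulas above; the function name `layer_cost` is illustrative, and $\rho = 0.714$ is the value that reproduces the truncated figure of 15,079,129.

```python
def layer_cost(dk, m, n, df, alpha=1.0, rho=1.0, separable=True):
    """Return (mult_adds, params) for one layer; both are truncated to integers."""
    m_eff, n_eff, df_eff = alpha * m, alpha * n, rho * df
    if separable:
        params = dk * dk * m_eff + m_eff * n_eff   # depth-wise + point-wise weights
    else:
        params = dk * dk * m_eff * n_eff           # standard convolution weights
    # rho only enters through df_eff, so it changes mult_adds but not params.
    return int(params * df_eff * df_eff), int(params)


rows = [
    ("standard convolution", dict(separable=False)),
    ("depthwise separable", dict()),
    ("alpha = 0.75", dict(alpha=0.75)),
    ("alpha = 0.75, rho = 0.714", dict(alpha=0.75, rho=0.714)),
]
for name, kwargs in rows:
    print(name, layer_cost(3, 512, 512, 14, **kwargs))
```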

5 Experiments

5.1 Model Choices

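1) Standard convolution vs depth-wise separable convolution
[Table: MobileNet built with standard convolutions vs depth-wise separable convolutions]
Accuracy is similar, while parameters and computation drop enormously. The dataset is ImageNet, and the metric should be top-1 accuracy, as can be inferred from the later comparisons with VGG and GoogleNet. Remember this 70.6% baseline; many of the later experiments use it.

2) Full model (baseline) vs slimmed-down model
[Figure: MobileNet architecture with part of the network outlined in red]

The full model is the baseline, i.e. the architecture shown in the figure above
The slimmed-down model drops the part outlined in red

[Table: thinner model vs slimmed-down model]
To keep the comparison fair, $\alpha =0.75$ is chosen for the thinner model so that the two models have similar computation and parameter counts.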

5.2 Model Shrinking Hyperparameters

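1) Same $\rho$, different $\alpha$
The first row is the baseline. Metric: ImageNet top-1 accuracy
[Table: accuracy, computation and parameters for different $\alpha$]
2) Same $\alpha$, different $\rho$
The first row is the baseline. Metric: ImageNet top-1 accuracy

3) Accuracy vs computation cost
$\alpha = 1/0.75/0.5/0.25$ and input resolution (set by $\rho$ ) $= 224/192/160/128$ give 16 combinations, which correspond to the 16 points in the figure. The authors summarize the relationship as log linear. Metric: ImageNet top-1 accuracy
[Figure: accuracy vs computation for the 16 models]
4) Accuracy vs parameters
Note that for a given $\alpha$ the parameter count is the same at every resolution, because the parameters come from the convolution kernels. Again $\alpha = 1/0.75/0.5/0.25$ and resolution $= 224/192/160/128$ give 16 combinations. Metric: ImageNet top-1 accuracy

5) Baseline vs VGG / GoogleNet
Metric: ImageNet top-1 accuracy
[Table: MobileNet baseline vs VGG and GoogleNet]
Accuracy is similar, while parameters and computation are far lower

6) Smaller MobileNets vs popular models
Metric: ImageNet top-1 accuracy
[Table: smaller MobileNets vs popular models]
More accurate, with less computation and fewer parameters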

5.3 Various applications

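The authors ran comparison experiments in the following areas:

Fine Grained Recognition
Large Scale Geolocalization (the paper's Section 4.4 heading reads "Large Scale Geolocalizaton", which may be a typo)
Face Attributes
Object Detection
Face Embeddings

1) A closer look at the Object Detection experiment
[Table: object detection results]

Accuracy is still fairly poor, haha. Computation drops considerably, and the parameter count drops dramatically.

2) Face Embeddings
Knowledge distillation is used, with MobileNet as the student network. For background on knowledge distillation, see the following three blog posts
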
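In the paper, the mobile FaceNet model is described as being trained by minimizing the squared differences between the outputs of FaceNet (the teacher) and MobileNet (the student) on the training data. A minimal, framework-free sketch of that objective follows; the function name and the toy embeddings are illustrative, and a real training loop would compute this on batches and backpropagate through the student only.

```python
def embedding_distillation_loss(teacher_embeddings, student_embeddings):
    """Mean over samples of the squared L2 distance between teacher and student embeddings."""
    assert len(teacher_embeddings) == len(student_embeddings)
    total = 0.0
    for t, s in zip(teacher_embeddings, student_embeddings):
        total += sum((ti - si) ** 2 for ti, si in zip(t, s))
    return total / len(teacher_embeddings)


# Two toy samples with 2-dimensional embeddings.
teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[1.0, 0.0], [0.5, 0.5]]
print(embedding_distillation_loss(teacher, student))  # 0.25
```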

6 Conclusion

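The MobileNet models are built on depth-wise separable convolution
Two hyper-parameters, $\alpha$ and $\rho$, trade off between latency and accuracy
A range of applications demonstrates the effectiveness of the design

Q1: How should the -1 in the following formula be understood?
$G_{k,l,n} = \sum_{i,j,m} K_{i,j,m,n} \cdot F_{k+i-1,\,l+j-1,\,m}$
Q2: The details of how GEMM (General matrix multiply) is used to optimize 1*1 convolution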

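Appendix: R TALK on building visual computing for cloud, edge, and chip

R TALK | Megvii's Sun Jian: how to build visual computing for cloud, edge, and chip

Sun Jian, chief scientist of Megvii and head of Megvii Research, gave the talk "Visual Computing on Cloud, Edge, and Chip" at the 2018 Global AI and Robotics Summit (CCF-GAIR)

Cameras are everywhere in daily life and across industries: phones, security, industry, retail, self-driving vehicles, robots, homes, drones, medical care, remote sensing, and so on

DorefaNet is the first work that also quantizes gradients, which makes it possible to train on FPGAs or even ASICs

There are even more applications on the edge, the first being phones. The vivo V7 was the first flagship phone launched overseas to carry our face unlock technology, along with face unlock on the Xiaomi Note 3. We helped vivo and Xiaomi ship face-unlock phones before the iPhone X was released. Huawei's Honor V10 and 7C also use our technology. Why did Huawei hire Sun Yang as a spokesperson? Because after years of swimming his fingerprints have worn away, so he needs face unlock to use his phone comfortably

¹ 深度解读谷歌MobileNet ("An In-Depth Look at Google's MobileNet")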
