CVPR-2017
Caffe code: https://github.com/yihui-he/Xception-caffe
Caffe model visualization tool: http://ethereon.github.io/netscope/#/editor
Keras code: https://github.com/Hedlen/Mobilenet-Keras/blob/master/model/mobilenet.py
For a small experiment on CIFAR-10, see the blog post 【Keras-MobileNet v1】CIFAR-10
1 Background and Motivation
CNNs are powerful; however, deploying them in real products still runs into many constraints.
In many real world applications such as robotics, self-driving car and augmented reality, the recognition tasks need to be carried out in a timely fashion on a computationally limited platform.
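Common techniques for making models smaller include:
(1) Kernel factorization: replace an N×N kernel with a 1×N and an N×1 kernel
(2) Bottleneck structures, with SqueezeNet as the representative example
(3) Storing weights as low-precision floating-point numbers, e.g. Deep Compression
(4) Pruning redundant kernels and Huffman coding
The idea comes from depth-wise separable convolution.
Replacing standard convolution with depth-wise separable convolution greatly reduces both computation and parameter count, so that models which used to live only on servers can reach everyday devices (mobile and embedded vision applications).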
2 Innovations
- The paper applies depth-wise separable convolution to classification networks and proposes a class of efficient models called MobileNets, greatly reducing computation and parameter count
- It introduces two shrinking hyper-parameters, the width multiplier (channels) and the resolution multiplier (resolution), to trade off between latency and accuracy.
- It achieves good shrinking results on many tasks: Fine Grained Recognition, Large Scale Geolocalization, Face Attributes, Object Detection, Face Embeddings.
(We concluded by demonstrating MobileNet’s effectiveness when applied to a wide variety of tasks)
3 Advantages / Contributions
- Greatly reduces parameter count and computation (making deployment on phones and embedded vision devices feasible)
- Widely applicable
4 Method
4.1 Depthwise Separable Convolution
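1) Standard convolution
- G: the feature map after convolution
- K: the filter (convolution kernel)
- F: the input feature map
- m, n: the channels of the input and output feature maps
- k, l: coordinates in the output feature map, row k and column l
- i, j: coordinates within the filter (convolution kernel)
I do not quite understand what the -1 in the formula is doing.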
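One possible reading of the -1, assuming the formula in question is Eq. (1) of the MobileNets paper, $G_{k,l,n} = \sum_{i,j,m} K_{i,j,m,n} \cdot F_{k+i-1,l+j-1,m}$: all indices are 1-based, so for $i = 1$ the row index $k+i-1$ is just $k$, which places the top-left corner of the kernel window at output position $(k, l)$. With 0-based indices the -1 would simply disappear. The sketch below evaluates the sum literally with 1-based loop variables (stride 1, no padding); the function name `conv_1based` and the nested-list layout are assumptions made for illustration only.

```python
def conv_1based(F, K):
    """Direct evaluation of G[k,l,n] = sum_{i,j,m} K[i,j,m,n] * F[k+i-1, l+j-1, m].

    F: input feature map as nested lists, shape [H][W][M]
    K: kernel as nested lists, shape [D][D][M][N]
    The loop variables k, l, i, j are 1-based as in the formula; the extra
    `- 1` used when indexing the Python lists converts them to 0-based.
    """
    H, W, M = len(F), len(F[0]), len(F[0][0])
    D, N = len(K), len(K[0][0][0])
    out_h, out_w = H - D + 1, W - D + 1
    G = [[[0.0] * N for _ in range(out_w)] for _ in range(out_h)]
    for k in range(1, out_h + 1):
        for l in range(1, out_w + 1):
            for n in range(N):
                total = 0.0
                for i in range(1, D + 1):
                    for j in range(1, D + 1):
                        for m in range(M):
                            row = k + i - 1  # 1-based row in F; equals k when i == 1
                            col = l + j - 1  # 1-based column in F; equals l when j == 1
                            total += K[i - 1][j - 1][m][n] * F[row - 1][col - 1][m]
                G[k - 1][l - 1][n] = total
    return G


# 3x3 single-channel input and a 2x2 all-ones kernel: each output is a window sum.
F = [[[1.0], [2.0], [3.0]],
     [[4.0], [5.0], [6.0]],
     [[7.0], [8.0], [9.0]]]
K = [[[[1.0]], [[1.0]]],
     [[[1.0]], [[1.0]]]]
G = conv_1based(F, K)
print(G)  # [[[12.0], [16.0]], [[24.0], [28.0]]]
```

The first output, 12, is the sum of the window whose top-left corner is input position (1, 1), which is exactly what the -1 arranges.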
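2) Computational cost of standard convolution
- $D_K$: filter size
- $M, N$: number of input and output channels of the feature map
- $D_F$: resolution of the output feature map, i.e. its h and w
The parameter count of a convolution is probably familiar, but what about the computation? Computation here means the number of floating-point operations (multiply + add), and it equals the parameter count multiplied by the resolution of the output feature map. Why? One application of the $D_K*D_K*M*N$ kernel produces a $1*1*N$ result, and filling the whole output feature map ($D_F*D_F*N$) takes $D_F*D_F$ such applications, so the total computation is $D_K*D_K*M*N*D_F*D_F$.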
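The counting argument above can be sketched in a few lines. The helper name `standard_conv_cost` is made up for this note, bias terms are ignored, and the example numbers assume MobileNet's first layer as listed in the paper (a 3×3×3×32 convolution with stride 2 on a 224×224 input, giving a 112×112 output).

```python
def standard_conv_cost(d_k, m, n, d_f):
    """Return (parameters, mult-adds) of a standard convolution, bias ignored.

    d_k: filter size, m / n: input / output channels, d_f: output resolution.
    """
    params = d_k * d_k * m * n          # one D_K x D_K x M filter per output channel
    mult_adds = params * d_f * d_f      # the kernel is applied at every output position
    return params, mult_adds


params, mult_adds = standard_conv_cost(d_k=3, m=3, n=32, d_f=112)
print(params, mult_adds)  # 864 10838016
```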
3) Depth-wise separable convolution
Image source: https://zhuanlan.zhihu.com/p/28749411
It has two steps:
- First a depth-wise convolution: each channel is convolved separately, usually with filter size 3
Compare with standard convolution
- Then a point-wise convolution: a $1*1*M*N$ convolution is applied to the $D_F*D_F*M$ feature map produced by the previous step
4) Computational cost of depth-wise separable convolution
Image source: https://zhuanlan.zhihu.com/p/28749411
The depth-wise convolution costs $D_K*D_K*M*D_F*D_F$ (a $D_K*D_K$ convolution on each channel, $M$ channels in total)
The point-wise convolution costs $1*1*M*N*D_F*D_F$
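The total cost is $D_K*D_K*M*D_F*D_F + M*N*D_F*D_F$, where
- $D_K$: filter size
- $M, N$: number of input and output channels of the feature map
- $D_F$: resolution of the output feature map, i.e. its h and w
The reduction in computation relative to standard convolution is
$\frac{D_K*D_K*M*D_F*D_F + M*N*D_F*D_F}{D_K*D_K*M*N*D_F*D_F} = \frac{1}{N} + \frac{1}{D_K^2}$
Note that $N$ ($32/64/128/256/512/1024$) is usually much larger than $D_K^2$ ($3^2$), so the $\frac{1}{D_K^2}$ term dominates and the overall saving in computation is about 8 to 9 times. The reduction factor for each $N$ is: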
for i in range(5, 11):
    print(2**i, ':', 1 / (1 / 2**i + 1 / 9))
Output:
32 : 7.024390243902439
64 : 7.89041095890411
128 : 8.408759124087592
256 : 8.69433962264151
512 : 8.844529750479847
1024 : 8.9215876089061
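A PyTorch implementation of depth-wise separable convolution:
import torch.nn as nn

class depthwise_separable_conv(nn.Module):
    def __init__(self, nin, nout):
        super(depthwise_separable_conv, self).__init__()
        # groups=nin gives every input channel its own 3x3 filter (depth-wise step)
        self.depthwise = nn.Conv2d(nin, nin, kernel_size=3, padding=1, groups=nin)
        # the 1x1 convolution mixes channels and maps nin -> nout (point-wise step)
        self.pointwise = nn.Conv2d(nin, nout, kernel_size=1)

    def forward(self, x):
        out = self.depthwise(x)
        out = self.pointwise(out)
        return out
Reference: the post 如何在pytorch中使用可分离卷积 depth-wise Separable convolution (how to use depth-wise separable convolution in PyTorch)
The depth-wise layer differs from an ordinary 3x3 convolution only in the groups argument; an ordinary conv uses groups = 1
An ordinary 3x3 convolution in PyTorch looks like this:
import torch.nn as nn

class conv(nn.Module):
    def __init__(self, nin, nout):
        super(conv, self).__init__()
        # groups=1: every output channel sees all nin input channels
        self.conv3 = nn.Conv2d(nin, nout, kernel_size=3, padding=1, groups=1)

    def forward(self, x):
        out = self.conv3(x)
        return out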
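To see what the groups argument does to the size of the two modules above, here is a rough parameter count in plain Python. It assumes the usual grouped-convolution weight layout of (out_channels, in_channels / groups, k, k) plus one bias per output channel; `conv2d_param_count` is a hypothetical helper written for this note, not a PyTorch function, and the channel sizes 64 and 128 are arbitrary.

```python
def conv2d_param_count(c_in, c_out, kernel_size, groups=1, bias=True):
    """Parameter count of a 2D convolution with weights c_out x (c_in / groups) x k x k."""
    assert c_in % groups == 0 and c_out % groups == 0
    weights = c_out * (c_in // groups) * kernel_size * kernel_size
    return weights + (c_out if bias else 0)


nin, nout = 64, 128
separable = (conv2d_param_count(nin, nin, 3, groups=nin)   # depth-wise 3x3
             + conv2d_param_count(nin, nout, 1))           # point-wise 1x1
plain = conv2d_param_count(nin, nout, 3)                   # ordinary 3x3, groups=1
print(separable, plain, round(plain / separable, 2))  # 8960 73856 8.24
```

The ratio lands in the same 8 to 9 times range as the computation estimate, since both are dominated by the $\frac{1}{D_K^2}$ term.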
4.2 Network Structure and Training
Left: standard convolution; right: depth-wise separable convolution
The overall architecture is as follows:
$s_1$ denotes stride = 1, and $s_2$ denotes stride = 2
- Down sampling (stride = 2) is handled with strided convolution in the depthwise convolutions as well as in the first layer.
- use less regularization and data augmentation techniques because small models have less trouble with overfitting.
- put very little or no weight decay (l2 regularization) on the depthwise filters since there are so few parameters in them.
- using RMSprop with asynchronous gradient descent
![](https://i-blog.csdnimg.cn/blog_migrate/c8a4b4a7359e8792066672954f9c4b0e.gif)
![](https://i-blog.csdnimg.cn/blog_migrate/f6ebaad2cb1c6bd44d179f13f281036f.gif)
Image source: http://cs231n.github.io/neural-networks-3/
The parameters are concentrated in the 1x1 convolutions
This can be implemented with highly optimized general matrix multiply (GEMM) functions. (In other words, GEMM is used to speed up the 1×1 convolutions.)
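Why a 1×1 convolution maps so directly onto GEMM: every output pixel is a length-$M$ input vector times an $M \times N$ weight matrix, so flattening the feature map to an $(H \cdot W) \times M$ matrix turns the whole layer into a single matrix multiply, and, as the paper notes, no im2col reordering in memory is needed. The sketch below checks this equivalence with plain Python lists; the function names are made up here and `matmul` is only a stand-in for an optimized GEMM routine.

```python
def pointwise_conv_direct(F, W):
    """1x1 convolution computed pixel by pixel. F: [H][W][M], W: [M][N]."""
    M, N = len(W), len(W[0])
    return [[[sum(pixel[m] * W[m][n] for m in range(M)) for n in range(N)]
             for pixel in row] for row in F]


def matmul(A, B):
    """Plain (rows x M) @ (M x N) product; a stand-in for an optimized GEMM call."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def pointwise_conv_gemm(F, W):
    """The same 1x1 convolution expressed as one matrix multiply."""
    width = len(F[0])
    flat = [pixel for row in F for pixel in row]   # (H*W) x M matrix, no reordering
    out = matmul(flat, W)                          # (H*W) x N matrix
    return [out[r * width:(r + 1) * width] for r in range(len(F))]


F = [[[1, 2], [3, 4]],
     [[5, 6], [7, 8]]]        # 2 x 2 feature map with M = 2 channels
W = [[1, 0, 1],
     [0, 1, 1]]               # M = 2 input channels -> N = 3 output channels
assert pointwise_conv_direct(F, W) == pointwise_conv_gemm(F, W)
print(pointwise_conv_gemm(F, W))
# [[[1, 2, 3], [3, 4, 7]], [[5, 6, 11], [7, 8, 15]]]
```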
4.3 Width Multiplier: Thinner Models
The hyper-parameter $\alpha$ reduces the number of channels of every feature map. The total cost becomes
$D_K \cdot D_K \cdot \alpha M \cdot D_F \cdot D_F + \alpha M \cdot \alpha N \cdot D_F \cdot D_F$
with $\alpha \in (0,1]$
Compare with the cost before applying $\alpha$:
$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$
The ratio of computation (relative to standard convolution) is:
$\frac{D_K \cdot D_K \cdot \alpha M \cdot D_F \cdot D_F + \alpha M \cdot \alpha N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{\alpha}{N} + \frac{\alpha ^2}{D_K^2}$
for a in (1, 0.75, 0.5, 0.25):
    for i in range(5, 11):
        print(2**i, ', a =', a, ':', 1 / (a / 2**i + a**2 / 9))
    print('\n')
Output:
32 , a = 1 : 7.024390243902439
64 , a = 1 : 7.89041095890411
128 , a = 1 : 8.408759124087592
256 , a = 1 : 8.69433962264151
512 , a = 1 : 8.844529750479847
1024 , a = 1 : 8.9215876089061
32 , a = 0.75 : 11.636363636363637
64 , a = 0.75 : 13.473684210526315
128 , a = 0.75 : 14.628571428571428
256 , a = 0.75 : 15.283582089552239
512 , a = 0.75 : 15.633587786259541
1024 , a = 0.75 : 15.814671814671815
32 , a = 0.5 : 23.04
64 , a = 0.5 : 28.097560975609756
128 , a = 0.5 : 31.56164383561644
256 , a = 0.5 : 33.63503649635037
512 , a = 0.5 : 34.77735849056604
1024 , a = 0.5 : 35.37811900191939
32 , a = 0.25 : 67.76470588235294
64 , a = 0.25 : 92.16
128 , a = 0.25 : 112.39024390243902
256 , a = 0.25 : 126.24657534246576
512 , a = 0.25 : 134.54014598540147
1024 , a = 0.25 : 139.10943396226415
4.4 Resolution Multiplier: Reduced Representation
A further hyper-parameter, $\rho$, controls the resolution of the input image and thereby reduces the resolution of every feature map. The total cost becomes
$D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F$
The ratio of computation (relative to standard convolution) is:
$\frac{D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{\alpha \cdot \rho^2 }{N} + \frac{\alpha ^2 \cdot \rho^2 }{D_K^2}$
Computation is further reduced by a factor of $\rho^2$
A small example
Following the formulas, with $D_K = 3$, $M = N = 512$ and $D_F = 14$, the calculations are:
Standard convolution, computation: $3*3*14*14*512*512 = 462,422,016$
Standard convolution, parameters: $3*3*512*512 = 2,359,296$
Depth-wise separable convolution, computation: $3*3*512*14*14+1*1*512*512*14*14 = 52,283,392$
Depth-wise separable convolution, parameters: $3*3*512+512*1*1*512 = 266,752$
$\alpha = 0.75$, computation: $3*3*0.75*512*14*14+1*1*0.75*512*0.75*512*14*14 = 29,578,752$
$\alpha = 0.75$, parameters: $3*3*0.75*512+1*1*0.75*512*0.75*512 = 150,912$
$\alpha = 0.75$, $\rho = 0.714$, computation: $3*3*0.75*512*0.714*14*0.714*14+1*1*0.75*512*0.75*512*0.714*14*0.714*14 \approx 15,079,129$ (fractional part truncated)
$\alpha = 0.75$, $\rho = 0.714$, parameters: $3*3*0.75*512+1*1*0.75*512*0.75*512 = 150,912$
One thing to note: $\rho$ controls the resolution and therefore only affects computation; it does not change the parameter count.
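The hand calculation for the 14×14×512 layer can be reproduced with a short script. This is only a sketch of the cost formulas from sections 4.1 to 4.4; the helper names `standard_cost` and `separable_cost` are invented here and bias terms are ignored.

```python
def standard_cost(d_k, m, n, d_f):
    """(parameters, mult-adds) of a standard convolution, bias ignored."""
    params = d_k * d_k * m * n
    return params, params * d_f * d_f


def separable_cost(d_k, m, n, d_f, alpha=1.0, rho=1.0):
    """(parameters, mult-adds) of a depth-wise separable layer, bias ignored.

    alpha scales the channel counts m and n; rho scales the feature map side d_f.
    """
    m_a, n_a, side = alpha * m, alpha * n, rho * d_f
    params = d_k * d_k * m_a + m_a * n_a      # depth-wise + point-wise weights
    return params, params * side * side       # rho enters only through the side length


rows = [
    ("standard conv", standard_cost(3, 512, 512, 14)),
    ("separable conv", separable_cost(3, 512, 512, 14)),
    ("alpha = 0.75", separable_cost(3, 512, 512, 14, alpha=0.75)),
    ("alpha = 0.75, rho = 0.714", separable_cost(3, 512, 512, 14, alpha=0.75, rho=0.714)),
]
for name, (params, mult_adds) in rows:
    # int() truncates the fractional part, as the hand calculation does
    print(f"{name:28s} params={int(params):>10,d}  mult-adds={int(mult_adds):>12,d}")
```

The last two rows print the same parameter count, 150,912, which is the point made above: $\rho$ changes only the mult-adds.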
5 Experiments
5.1 Model Choices
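1) Standard convolution vs. depth-wise separable convolution
Accuracy is comparable, while parameter count and computation drop dramatically! The dataset is ImageNet and the metric should be top-1 accuracy, as can be inferred from the later comparison with VGG and GoogleNet. Remember this 70.6% baseline; many of the later experiments use it.
2) Full model (baseline) vs. shallow model
- The full model is the baseline, i.e. the architecture shown in the figure above
- The shallow model drops the part outlined in red
To keep the comparison fair, the thinner model it is compared against uses $\alpha = 0.75$, so that the two models have similar computation and parameter count.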
5.2 Model Shrinking Hyperparameters
1) Same $\rho$, different $\alpha$
The first row is the baseline; the metric is ImageNet top-1 accuracy
2) Same $\alpha$, different $\rho$
The first row is the baseline; the metric is ImageNet top-1 accuracy
3) Relationship between accuracy and computation cost
$\alpha = 1/0.75/0.5/0.25$ and input resolution (set by $\rho$) $= 224/192/160/128$ give 16 combinations, corresponding to the 16 points in the plot below. The authors describe the relationship as log-linear. The metric is ImageNet top-1 accuracy
4) Relationship between accuracy and parameters
Note that for a given $\alpha$ the parameter count is the same at every resolution, because the parameters come from the convolution kernels. $\alpha = 1/0.75/0.5/0.25$ and input resolution (set by $\rho$) $= 224/192/160/128$ give 16 combinations. The metric is ImageNet top-1 accuracy
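5) Baseline vs. VGG / GoogleNet
The metric is ImageNet top-1 accuracy
Accuracy is comparable, with far fewer parameters and far less computation
6) Smaller MobileNet vs. popular models
The metric is ImageNet top-1 accuracy
More accurate, with less computation and fewer parameters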
5.3 Various applications
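The authors ran comparison experiments in the following areas
- Fine Grained Recognition
- Large Scale Geolocalization (I am not sure whether "4.4 Large Scale Geolocalizaton" in the original paper is a spelling mistake)
- Face Attributes
- Object Detection
- Face Embeddings
1) A few words on the Object Detection experiment
The accuracy is still rather poor, ha. Computation drops a fair amount and the parameter count drops a great deal
2) Face Embeddings
Knowledge distillation is used here, with MobileNet as the student network. For background on knowledge distillation, see the following three blog posts
- 【Distilling】《Distilling the Knowledge in a Neural Network》
- 【Mimic】《Mimicking Very Efficient Network for Object Detection》
- 【Tiny CNN】《Quantization Mimic: Towards Very Tiny CNN for Object Detection》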
6 Conclusion
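- Proposes the MobileNet model, built on depth-wise separable convolution
- Two hyper-parameters, $\alpha$ and $\rho$, trade off between latency and accuracy
- A range of applications demonstrates the effectiveness of the design
Q1: How should the -1 in the standard convolution formula be understood?
Q2: The details of how GEMM (General matrix multiply) is used to optimize 1*1 convolution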
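Appendix: R TALK, how to build visual computing on the cloud, on the edge, and on chips
Sun Jian, chief scientist and head of the research institute at Megvii (旷视), gave an excellent talk titled "Visual Computing on the Cloud, on the Edge, and on Chips" at the 2018 Global Artificial Intelligence and Robotics Summit (CCF-GAIR)
Cameras are everywhere in daily life and across industries: phones, security, industry, retail, self-driving cars, robots, homes, drones, medical care, remote sensing, and so on
DorefaNet was the first research work to quantize gradients as well, which makes it possible to train on FPGAs or even ASICs
There are even more applications on the edge, and the first is the phone. The vivo V7 was the first flagship phone launched overseas and it carried our face-unlock technology, as did the Xiaomi Note 3. We helped vivo and Xiaomi ship face-unlock phones before the iPhoneX was released. The Huawei Honor V10 and 7C phones use our technology too. Why did Huawei pick Sun Yang as its spokesperson? Because after years of swimming his fingerprints have worn away, so he needs face unlock to use a phone comfortably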