ICLR-2017
1 Background and Motivation
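CNN research has largely been a race for accuracy, yet for a given accuracy level there are usually many architectures that reach it. The authors approach the problem from the model compression angle: match the accuracy with a smaller model.
Smaller CNN architectures offer three advantages:
- More efficient distributed training
  Communication overhead is proportional to the number of model parameters (Forrest N. Iandola, Khalid Ashraf, Matthew W. Moskewicz, and Kurt Keutzer. FireCaffe: near-linear acceleration of deep neural network training on compute clusters. In CVPR, 2016.)
- Less overhead when exporting new models to clients
  e.g. autonomous driving: over-the-air updates become faster and can be pushed more frequently
- Feasible FPGA and embedded deployment
By then Inception-v1, v2, v3, and v4 had already been published.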
The overarching goal of our work is to identify a model that has very few parameters while preserving accuracy.
2 Advantages / Contributions
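- Proposes SqueezeNet, a new network architecture aimed at model compression
- AlexNet-level accuracy on ImageNet with 50× fewer parameters (AlexNet 240 MB, SqueezeNet 4.8 MB); combined with compression techniques, the model shrinks to <0.5 MB (510× smaller than AlexNet)
- Examines how network architecture design affects accuracy and model size (Design Space Exploration, from both the microarchitecture and macroarchitecture sides; details in the later sections)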
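The two ratios quoted in the list can be checked with a few lines of arithmetic. This is only a sanity check on the numbers given in this post; the variable names are mine.

```python
# Sizes quoted in the contributions list, in MB.
alexnet_mb = 240.0
squeezenet_mb = 4.8

# Ratio between the two uncompressed models.
ratio = alexnet_mb / squeezenet_mb

# Model size implied by "510x smaller than AlexNet".
compressed_mb = alexnet_mb / 510

print(f"AlexNet / SqueezeNet = {ratio:.0f}x")
print(f"240 MB / 510 = {compressed_mb:.2f} MB")
```

The second number lands just under the 0.5 MB bound stated in the list.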
3 Notions / Innovations
3.1 Innovations
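- Proposes the SqueezeNet architecture: accuracy comparable to AlexNet with a drastically smaller parameter count
- Explores how CNN architecture design choices impact model size and accuracy (from the microarchitecture and macroarchitecture perspectives, with real insight of the authors' own)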
3.2 Notions
- microarchitecture: the organization and dimensionality of individual layers and modules. (The form of the convolutions, how convolutions are combined into modules, and the number of filters; see Figure 3.)
- macroarchitecture: the system-level organization of multiple modules into an end-to-end CNN architecture. (In this paper this covers the bypass structure, i.e. residual connections; see Figure 2.)
4 Related work
Related work is covered from four angles:
- Model Compression
- CNN microarchitecture (LeNet, VGG, the Inception family: $5 \times 5$, $3 \times 3$, $1 \times 1$)
- CNN macroarchitecture
  Early on, most attention went to depth. The choice of connections across multiple layers or modules is an emerging area of CNN macroarchitectural research.
- Neural Network Design Space Exploration
  Much of the work on design space exploration (DSE) of NNs has focused on developing automated approaches for finding NN architectures that deliver higher accuracy.
  - Bayesian optimization
  - simulated annealing
  - randomized search
  - genetic algorithms

  However, these papers make no attempt to provide intuition about the shape of the NN design space.
5 Method
5.1 Architecture design strategy
The parameter count of a layer of $3 \times 3$ filters is computed as:
$$(\text{number of input channels}) * (\text{number of filters}) * (3 * 3)$$
- strategy 1: Replace 3x3 filters with 1x1 filters (a 1x1 filter has 9× fewer parameters)
- strategy 2: Decrease the number of input channels to 3x3 filters
- strategy 3: Downsample late in the network so that convolution layers have large activation maps (delayed downsampling)
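To see how strategies 1 and 2 act on the parameter-count formula, here is a minimal sketch. The layer sizes (128 channels, 128 filters, a reduced input of 16 channels) are hypothetical values picked for illustration, and biases are ignored.

```python
def conv_weights(in_channels: int, num_filters: int, kernel: int) -> int:
    """Weight count of a conv layer: input channels * filters * (k * k)."""
    return in_channels * num_filters * kernel * kernel

# Hypothetical layer sizes, chosen only for illustration.
baseline = conv_weights(128, 128, 3)    # plain 3x3 layer
strategy1 = conv_weights(128, 128, 1)   # same layer with 1x1 filters
strategy2 = conv_weights(16, 128, 3)    # 3x3 filters fed by 16 channels, not 128

print("baseline  :", baseline)
print("strategy 1:", strategy1, f"({baseline // strategy1}x fewer)")
print("strategy 2:", strategy2, f"({baseline // strategy2}x fewer)")
```

Strategy 1 removes the $3 * 3$ factor, while strategy 2 shrinks the input-channel factor; the two can be combined.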
Regarding strategy 3: Our intuition is that large activation maps (due to delayed downsampling) can lead to higher classification accuracy. This is easy to accept: a higher-resolution feature map can carry more information. In the extreme case of a $2 \times 2$ feature map and a $3 \times 3$ kernel, the convolution cannot be expected to work well.
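One way to make the extreme case concrete is to measure how much of each kernel window falls on real pixels rather than padding. The sketch below assumes stride 1 and zero padding of one pixel per side (my assumptions, not stated in the post); the map sizes 13 and 55 are also just illustrative.

```python
def real_input_fraction(n: int, kernel: int = 3) -> float:
    """Average fraction of a kernel window that lands on real pixels.

    Assumes stride 1 and zero padding of kernel // 2 per side, so the
    output is also n x n. Both choices are assumptions for this sketch.
    """
    pad = kernel // 2
    covered = 0
    for j in range(n):                       # window centre along one axis
        lo, hi = j - pad, j + pad
        covered += min(hi, n - 1) - max(lo, 0) + 1
    per_axis = covered / (n * kernel)
    return per_axis ** 2                     # rows and columns are independent

for size in (2, 13, 55):
    print(size, round(real_input_fraction(size), 3))
```

On a $2 \times 2$ map, more than half of every window is padding, whereas on larger maps almost every window sees only real activations.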
Strategies 1 and 2 aim at reducing the parameter count, while strategy 3 is about maximizing accuracy on a limited budget of parameters.
5.2 The Fire Module
fire module = squeeze layer + expand layer
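Using $1 \times 1$ filters reflects strategy 1.
$s_{1x1} < e_{1x1} + e_{3x3}$ reflects strategy 2: the $3 \times 3$ expand filters only see the $s_{1x1}$ channels produced by the squeeze layer.
In the implementation, the $1 \times 1$ and $3 \times 3$ expand convolutions run in parallel and their outputs are concatenated at the end.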
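The bookkeeping of a Fire module can be sketched in plain Python. This is not a trainable layer, only a shape and weight counter. The `Fire` class is a hypothetical helper; the filter counts are those of the first Fire module as computed later in this post, and the 96 input channels are an assumption about the preceding layer. Biases are ignored.

```python
from dataclasses import dataclass

@dataclass
class Fire:
    in_channels: int
    s1x1: int   # squeeze layer: number of 1x1 filters
    e1x1: int   # expand layer: number of 1x1 filters
    e3x3: int   # expand layer: number of 3x3 filters

    def out_channels(self) -> int:
        # The two expand branches run in parallel and are concatenated
        # along the channel axis.
        return self.e1x1 + self.e3x3

    def weights(self) -> int:
        squeeze = self.in_channels * self.s1x1
        expand_1x1 = self.s1x1 * self.e1x1
        expand_3x3 = self.s1x1 * self.e3x3 * 3 * 3
        return squeeze + expand_1x1 + expand_3x3

# in_channels=96 is an assumed width for the layer feeding this module.
fire = Fire(in_channels=96, s1x1=16, e1x1=64, e3x3=64)

# A plain 3x3 conv with the same input and output widths, for comparison.
plain_3x3 = fire.in_channels * fire.out_channels() * 3 * 3

print("fire output channels:", fire.out_channels())
print("fire weights        :", fire.weights())
print("plain 3x3 weights   :", plain_3x3)
```

Under these assumptions the Fire module needs roughly a tenth of the weights of the plain layer.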
5.3 The SqueezeNet architecture
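This is the left structure in Figure 2.
The max-pooling placement reflects strategy 3: maxpool8 already sits very late in the network, somewhat like [Xception] "Xception: Deep Learning with Depthwise Separable Convolutions". Note that ImageNet networks typically downsample 5 times, while SqueezeNet downsamples only 4 times.
The compression info column applies the compression techniques from Deep Compression; the data type goes from 32 bit to 6 bit.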
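A back-of-the-envelope check shows how far the data type change alone can go. It assumes storage scales linearly with bits per weight and nothing else changes, which is a simplification of mine.

```python
squeezenet_mb = 4.8               # uncompressed size quoted earlier (32-bit weights)
bits_before, bits_after = 32, 6

# Assumption: storage scales linearly with bits per weight.
quantized_only_mb = squeezenet_mb * bits_after / bits_before

print(f"6-bit weights alone: {quantized_only_mb:.2f} MB")
```

Bit width alone gives about 0.9 MB, so reaching <0.5 MB needs more than the data type change. Deep Compression also prunes weights, which is presumably where the remaining reduction comes from.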
6 Experiments
Dataset:ImageNet
6.1 Evaluation of SqueezeNet
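The authors ask:
are small models amenable to compression, or do small models "need" all of the representational power afforded by dense floating-point values? (Can SqueezeNet be compressed further? Does it really need a 32-bit representation?)
The last two rows of Table 2 give the answer: applying Deep Compression on top of SqueezeNet compresses the model even further.
Our small model is indeed amenable to compression. (This also shows, indirectly, how effective Deep Compression is.)
Changing the data type does not look easy to implement in code, though.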
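As a rough idea of what a 6-bit representation could look like in code, here is a sketch of plain uniform quantization: each weight is replaced by the index of the nearest of $2^6 = 64$ evenly spaced levels. This is a stand-in for illustration, not the Deep Compression procedure; the weights are random made-up values, and the Python lists do not actually pack the indices into 6 bits.

```python
import random

def quantize(weights, bits=6):
    """Map each float to the index of the nearest of 2**bits evenly spaced levels."""
    levels = 2 ** bits
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / (levels - 1)
    codebook = [lo + k * step for k in range(levels)]
    indices = [round((w - lo) / step) for w in weights]
    return indices, codebook

def dequantize(indices, codebook):
    """Look each index up in the codebook to recover approximate weights."""
    return [codebook[k] for k in indices]

random.seed(0)
weights = [random.gauss(0.0, 0.1) for _ in range(1000)]  # made-up weights

indices, codebook = quantize(weights)
restored = dequantize(indices, codebook)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codebook size:", len(codebook))
print("largest index:", max(indices))
print("max abs error:", max_error)
```

Only the 64-entry codebook stays in floating point; every weight is stored as a small integer index, which is where the saving comes from.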
6.2 CNN microarchitecture design space exploration
However, SqueezeNet and other models reside in a broad and largely unexplored design space of CNN architectures.
microarchitecture: the organization and dimensionality of individual layers and modules. (The form of the convolutions, how convolutions are combined into modules, and the number of filters; see Figure 3.)
6.2.1 CNN microarchitecture metaparameters
Metaparameters can be understood as the parameters that are used to control other parameters.
Each Fire module has 3 hyperparameters ($s_{1x1}, e_{1x1}, e_{3x3}$), so the 8 modules have 24 hyperparameters in total. To make these easier to control, the authors designed a set of metaparameters, as follows:
Let $e_i$ be the number of expand layer filters, where $i$ is the index of the Fire module.
The expand layer contains $1 \times 1$ and $3 \times 3$ convolutions, so $e_i = e_{i,1x1} + e_{i,3x3}$.
Let $pct_{3x3}$ be the fraction of $3 \times 3$ filters in the expand layer, so that
$$e_{i,3x3} = e_i * pct_{3x3}, \quad e_{i,1x1} = e_i * (1 - pct_{3x3})$$
$s_{i,1x1} = e_i * SR$, where $SR$ is the squeeze ratio, a value between 0 and 1.
So once $e_i$ is known, $pct_{3x3}$ yields $e_{i,1x1}$ and $e_{i,3x3}$, and $SR$ yields $s_{i,1x1}$.
All hyperparameters of every Fire module then follow from this formula:
$$e_i = base_e + (incr_e) * \left\lfloor \frac{i}{freq} \right\rfloor$$
We define $base_e$ as the number of expand filters in the first Fire module in a CNN. After every $freq$ Fire modules, we increase the number of expand filters by $incr_e$.
$base_e = 128$
$incr_e = 128$
$freq = 2$
$pct_{3x3} = 0.5$
$SR = 0.125$
$i = 0 \sim 7$
Compare with Table 1.
Let's compute it in code:
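from math import floor

# Metaparameter values listed above
base_e = 128
incr_e = 128
freq = 2
pct_3x3 = 0.5
SR = 0.125

print("s1 e1 e3 ")
for i in range(0, 8):
    # Expand filters grow by incr_e after every freq Fire modules
    e_i = base_e + (incr_e * floor(i / freq))
    e_i_3x3 = e_i * pct_3x3        # 3x3 expand filters
    e_i_1x1 = e_i * (1 - pct_3x3)  # 1x1 expand filters
    s_i_1x1 = e_i * SR             # squeeze filters
    print("%d %d %d" % (s_i_1x1, e_i_1x1, e_i_3x3))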
output
s1 e1 e3
16 64 64
16 64 64
32 128 128
32 128 128
48 192 192
48 192 192
64 256 256
64 256 256
6.2.2 Squeeze Ratio
The x-axis is model size and the y-axis is accuracy. Accuracy plateaus at 86.0% with $SR = 0.75$.
The models are trained from scratch.
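To get a feel for why model size grows with SR, the sketch below totals the Fire module weights for a few squeeze ratios using the metaparameter formulas from 6.2.1. It counts Fire module weights only (no biases, no other layers), and the 96 input channels of the first module are an assumption. `fire_weights_total` is a hypothetical helper name.

```python
def fire_weights_total(sr, pct_3x3=0.5, base_e=128, incr_e=128, freq=2,
                       n_modules=8, first_in_channels=96):
    """Total weights (no biases) over all Fire modules for a squeeze ratio."""
    total = 0
    in_ch = first_in_channels          # assumed width before the first module
    for i in range(n_modules):
        e_i = base_e + incr_e * (i // freq)
        e3 = int(e_i * pct_3x3)        # 3x3 expand filters
        e1 = e_i - e3                  # 1x1 expand filters
        s = int(e_i * sr)              # squeeze filters
        total += in_ch * s + s * e1 + s * e3 * 9
        in_ch = e_i                    # concatenated output feeds the next module
    return total

for sr in (0.125, 0.5, 0.75):
    print(f"SR={sr}: {fire_weights_total(sr):,} Fire weights")
```

Every term in the sum is proportional to the squeeze width, so under these simplifications the total grows linearly with SR.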
6.2.3 Trading off 1x1 and 3x3 filters
$SR = 0.5$, with the expand layer swept from mostly 1x1 to mostly 3x3 filters.
The x-axis is model size and the y-axis is accuracy. Accuracy plateaus at 85.3% with $pct_{3x3} = 50\%$.
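Why model size grows with $pct_{3x3}$ follows from the definitions in 6.2.1. A short derivation, counting only the expand-layer weights of Fire module $i$ and ignoring biases (the symbol $W_{expand}$ and both simplifications are mine):

$$
\begin{aligned}
W_{expand} &= s_{i,1x1} * e_{i,1x1} + 9 * s_{i,1x1} * e_{i,3x3} \\
&= s_{i,1x1} * e_i * (1 - pct_{3x3}) + 9 * s_{i,1x1} * e_i * pct_{3x3} \\
&= s_{i,1x1} * e_i * (1 + 8 * pct_{3x3})
\end{aligned}
$$

At a fixed SR the expand weights therefore grow linearly in $pct_{3x3}$, from about $s_{i,1x1} * e_i$ when the layer is mostly 1x1 to about $9 * s_{i,1x1} * e_i$ when it is mostly 3x3.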
6.3 CNN macroarchitecture design space exploration
macroarchitecture: the system-level organization of multiple modules into an end-to-end CNN architecture. (In this paper this covers the bypass structure, i.e. residual connections; see Figure 2.)
- Vanilla SqueezeNet (Figure 2, left)
- SqueezeNet with simple bypass connection (just a wire; Figure 2, middle)
- SqueezeNet with complex bypass connection (1x1 conv; Figure 2, right)
Advantages of the bypass structure:
- would help to alleviate the representational bottleneck introduced by squeeze layers.
- when SR squeezes too hard, the bypass gives information a side path around the squeeze layer
Due to this severe dimensionality reduction, a limited amount of information can pass through squeeze layers. However, by adding bypass connections to SqueezeNet, we open up avenues for information to flow around the squeeze layers.
Interestingly, the simple bypass connection works better.
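A plain wire can only skip a Fire module whose input and output widths match, since the skipped input is added to the module output elementwise (as in residual connections). The sketch below uses the metaparameter formula from 6.2.1 to list which modules qualify; indices are 0-based, and the 96 input channels of the first module are an assumption.

```python
base_e, incr_e, freq = 128, 128, 2
first_in_channels = 96  # assumed width of the layer feeding the first Fire module

# Output width of Fire module i is its number of expand filters e_i.
out_channels = [base_e + incr_e * (i // freq) for i in range(8)]
in_channels = [first_in_channels] + out_channels[:-1]

# A plain "wire" bypass adds input to output elementwise, so widths must match.
simple_ok = [i for i in range(8) if in_channels[i] == out_channels[i]]
# The remaining modules would need a 1x1 conv on the bypass to match widths.
needs_1x1 = [i for i in range(8) if in_channels[i] != out_channels[i]]

print("simple bypass possible:", simple_ok)
print("needs 1x1 conv bypass :", needs_1x1)
```

With $freq = 2$ the width changes at every second module, so every other Fire module can take a simple bypass and the rest need the complex variant.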
7 Conclusion / Future work
SqueezeNet can be deployed on FPGAs.
We think SqueezeNet will be a good candidate CNN architecture for a variety of applications, especially those in which small model size is of importance. (Its citation count is staggering.)
January 16, 2019, 20:47:32
We hope that SqueezeNet will inspire the reader to consider and explore the broad range of possibilities in the design space of CNN architectures and to perform that exploration in a more systematic manner. (Heh, AutoML.)