This paper applies Bayesian optimization (BO) to model compression, proposing an optimization framework in which the concrete compression method can be chosen flexibly. The hyperparameter $\theta$ controls how small the final network is: for example, with pruning it is a threshold, and with SVD it is a rank.
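To make this concrete, here is a minimal numpy sketch (the function names are mine, not the paper's) showing how a single hyperparameter $\theta$ steers two very different compressors: for magnitude pruning it acts as a threshold, for SVD it acts as a rank.

```python
import numpy as np

def prune_by_threshold(W, theta):
    """Magnitude pruning: zero out weights with |w| below the threshold theta."""
    W_pruned = W.copy()
    W_pruned[np.abs(W_pruned) < theta] = 0.0
    return W_pruned

def svd_low_rank(W, theta):
    """Low-rank approximation: keep only the top-theta singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    rank = int(theta)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

W = np.random.default_rng(0).standard_normal((64, 64))
W_sparse = prune_by_threshold(W, theta=0.5)   # theta is a threshold here
W_lowrank = svd_low_rank(W, theta=8)          # theta is a rank here
```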
This post focuses on how to choose the compression hyperparameter $\theta$. Following the BO recipe, we need to specify an objective function and an acquisition function.
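These two ingredients plug into the standard BO loop. A generic skeleton of that loop (my own sketch, not the paper's implementation), demonstrated with a toy objective and a purely exploratory acquisition:

```python
import numpy as np

def bayes_opt(objective, candidates, surrogate_fit, acquisition, n_iter=10):
    """Fit a surrogate on observed (theta, J(theta)) pairs, pick the next theta
    by maximizing the acquisition function, evaluate, and repeat."""
    thetas = [candidates[0]]
    values = [objective(candidates[0])]
    for _ in range(n_iter):
        model = surrogate_fit(thetas, values)          # e.g. a GP posterior
        scores = [acquisition(model, t) for t in candidates]
        t_next = candidates[int(np.argmax(scores))]
        thetas.append(t_next)
        values.append(objective(t_next))
    best = int(np.argmin(values))                      # we minimize J(theta)
    return thetas[best], values[best]

# Toy usage: the "surrogate" is just the visited points, and the acquisition
# rewards distance to them (pure exploration).
objective = lambda t: (t - 0.3) ** 2
candidates = list(np.linspace(0.0, 1.0, 21))
fit = lambda ts, vs: ts
acq = lambda model, t: min(abs(t - s) for s in model)
best_theta, best_value = bayes_opt(objective, candidates, fit, acq)
```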
Objective function
The objective function needs to account for two things: (1) the quality of the compressed network, where $\mathcal{Q}(\tilde{f}_\theta)$ denotes the model's performance and $\mathcal{L}(\tilde{f}_\theta, f^*)$ denotes its fidelity to the original network $f^*$; and (2) the size of the obtained network, where $R(\tilde{f}_\theta, f^*)$ denotes the compression ratio. The optimization problem can then be written as:
$$\arg\max_\theta \underbrace{\left(\gamma\,\mathcal{Q}(\tilde{f}_\theta) + R(\tilde{f}_\theta, f^*)\right)^{-1}}_{J_Q(\theta)} \quad \text{or} \quad \arg\min_\theta \underbrace{\left(\kappa\,\mathcal{L}(\tilde{f}_\theta, f^*) + R(\tilde{f}_\theta, f^*)\right)}_{J_L(\theta)}$$
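To see why either form trades fidelity against size, here is a toy sketch of $J_L(\theta) = \kappa\,\mathcal{L} + R$ with hypothetical stand-in curves (not the paper's actual measurements): the loss term favors weak compression, the size term favors strong compression, and the sum has an interior minimum.

```python
import numpy as np

KAPPA = 1.0  # trade-off weight kappa (an illustrative value, not from the paper)

def distill_loss(theta):
    # Stand-in for L(f~_theta, f*): fidelity degrades as theta compresses more
    # (here theta in [0, 1] is the kept fraction; smaller = more compression).
    return (1.0 - theta) ** 2

def size_ratio(theta):
    # Stand-in for R(f~_theta, f*): relative size of the compressed model.
    return theta

def J_L(theta):
    return KAPPA * distill_loss(theta) + size_ratio(theta)

thetas = np.linspace(0.0, 1.0, 101)
best = thetas[np.argmin([J_L(t) for t in thetas])]   # interior optimum at 0.5
```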
Here the knowledge-distillation objective is used as the fidelity term:
$$\mathcal{L}(\tilde{f}_\theta, f^*) := \mathbb{E}_{x \sim P}\left(\|\tilde{f}_\theta(x) - f^*(x)\|_2^2\right) = \|f^* - \tilde{f}_\theta\|_2^2$$
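The expectation can be approximated by Monte Carlo over samples drawn from $P$. A self-contained sketch with stand-in linear "networks" (purely illustrative, not the paper's models):

```python
import numpy as np

rng = np.random.default_rng(0)
W_teacher = rng.standard_normal((10, 5))
W_student = W_teacher + 0.1 * rng.standard_normal((10, 5))  # "compressed" copy

def f_star(x):     # teacher network f*
    return x @ W_teacher.T

def f_student(x):  # compressed network f~_theta
    return x @ W_student.T

# Monte Carlo estimate of E_{x~P} ||f~_theta(x) - f*(x)||_2^2 with x ~ N(0, I)
x = rng.standard_normal((10_000, 5))
loss = np.mean(np.sum((f_student(x) - f_star(x)) ** 2, axis=1))
```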
Acquisition function
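A common acquisition function in BO is expected improvement (EI). A minimal sketch for minimization under a Gaussian surrogate posterior $(\mu, \sigma)$ at a candidate $\theta$ (a generic textbook form, not necessarily the exact choice in the paper):

```python
import math

def expected_improvement(mu, sigma, best, xi=0.01):
    """EI for minimization: E[max(best - f(theta) - xi, 0)] with f ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        return 0.0
    z = (best - mu - xi) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))        # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    return (best - mu - xi) * cdf + sigma * pdf
```

Candidates whose posterior mean is well below the best value observed so far, or whose posterior uncertainty is large, receive a higher score and are evaluated next.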
Experiments
Comparison of different model selection methods on Resnet18
Knowledge distillation as a proxy for risk
A natural question is whether the knowledge-distillation objective (the $L_2$ function norm above) is actually a good proxy in network compression. The experimental results show that using the function norm performs on par with using the top-1 error rate.
Compression of VGG-16
In this section, we demonstrate that our method finds compression parameters that compare favorably to state-of-the-art compression results reported on VGG-16 [10]. We first apply our method to compress the convolutional layers of VGG-16 using tensor decomposition, which has 13 parameters. After that, we fine-tune the compressed model for 5 epochs, using Stochastic Gradient Descent (SGD) with momentum 0.9 and learning rate 1e-4, decreased by a factor of 10 every epoch. Second, we apply another pass of our algorithm to compress the fully-connected layers of the fine-tuned model using SVD, which has 3 parameters. A single optimization takes approximately 10 minutes. Again, after the compression, we fine-tune the compressed model…
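The SVD step on a fully-connected layer can be sketched as follows (a scaled-down toy, not the actual VGG-16 dimensions): replacing the dense weight $W$ (out × in) by two rank-$r$ factors cuts the parameter count from out·in to r·(out + in), and $r$ is exactly one of the compression hyperparameters being optimized.

```python
import numpy as np

def factorize_fc(W, r):
    """Replace a dense FC weight by two low-rank factors A (out, r) and B (r, in)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * s[:r]   # absorb singular values into the left factor
    B = Vt[:r, :]
    return A, B

W = np.random.default_rng(0).standard_normal((512, 512))  # toy FC layer
A, B = factorize_fc(W, r=32)
orig_params = W.size                 # 512 * 512 = 262,144
compressed_params = A.size + B.size  # 32 * (512 + 512) = 32,768, i.e. 8x fewer
```

At inference time the layer computes $x \mapsto A(Bx)$, two thin matrix multiplies instead of one dense one.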
Conclusion
In this work, we have developed a principled, fast, and flexible framework for optimizing neural network compression parameters…