To prune, or not to prune: exploring the efficacy of pruning for model compression
1 Introduction
The central question: given a bound on the model’s memory footprint, how can we arrive at the most accurate model?
The authors compare two kinds of models with an equivalent memory footprint:
- (1) large-sparse
- (2) small-dense
2 Related work
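- Early work includes LeCun's OBD (Optimal Brain Damage), among others
- Recent work focuses on weight pruning; the paper builds on this to propose AGP (automated gradual pruning)
- There is also structured pruning, which can speed up inference, but it may not be directly extensible to other NN architectures
- Other approaches include quantization, low-rank matrix factorization, and group sparsity regularization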
3 Methods
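First, pretrain for $t_0$ steps. Then, every $\Delta t$ steps, compute the new sparsity value $s_t$ and update the binary weight masks to prune the network to that sparsity. This is repeated for $n$ pruning steps, spanning $n\Delta t$ training steps in total; $\Delta t$ is typically set between 100 and 1000. Once the target sparsity $s_f$ is reached, the masks are no longer updated.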
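The mask update described above could be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a magnitude-based criterion (the weights with the smallest absolute values are masked first), which this note does not spell out, and the names `magnitude_mask` and `apply_mask` are hypothetical.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that zeroes the `sparsity` fraction of weights
    with the smallest absolute value (assumed pruning criterion)."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(round(sparsity * len(weights)))
    # Indices sorted by ascending magnitude; the first n_prune are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


def apply_mask(weights, mask):
    """Zero out the masked weights; surviving weights are left unchanged."""
    return [w if m else 0.0 for w, m in zip(weights, mask)]


# Toy layer with six weights, pruned to 50% sparsity.
weights = [0.8, -0.05, 0.3, -0.6, 0.01, 0.2]
mask = magnitude_mask(weights, sparsity=0.5)
pruned = apply_mask(weights, mask)
```

In a real training loop the mask would be recomputed every $\Delta t$ steps with the current target sparsity and applied to the layer's weights in between.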
The sparsity value at step $t$ is given by:
$$
s_t = s_f + (s_i - s_f)\left(1 - \frac{t - t_0}{n\Delta t}\right)^3 \quad \text{for} \quad t \in \{t_0,\ t_0+\Delta t,\ \ldots,\ t_0+n\Delta t\} \qquad (1)
$$
- $s_i$ is the initial sparsity value (usually 0)
- $s_f$ is the final sparsity value
- $s_t$ is the current sparsity value
- $n$ is the total number of pruning steps (one every $\Delta t$ training steps)
- $\Delta t$ is the pruning frequency (unit: steps)
- $t_0$ is the training step at which pruning starts (unit: steps)
- $t$ is the current training step (unit: steps)
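As a rough sketch, Eq. (1) translates into a small helper. The function name `agp_sparsity` is hypothetical, and two behaviors are assumptions on top of the formula: the sparsity stays at $s_i$ before step $t_0$, and it is held at $s_f$ after the last pruning step.

```python
def agp_sparsity(t, s_f, t0, n, dt, s_i=0.0):
    """Target sparsity at training step t, following Eq. (1).

    Intended to be called at pruning steps t0, t0 + dt, ..., t0 + n * dt.
    """
    if t < t0:
        # Assumption: no pruning has happened yet, so sparsity is s_i.
        return s_i
    # Fraction of the pruning phase completed, capped at 1 so that the
    # sparsity is held at s_f once the masks stop being updated.
    progress = min((t - t0) / (n * dt), 1.0)
    return s_f + (s_i - s_f) * (1.0 - progress) ** 3


# Illustrative values: prune to 90% sparsity over 10 pruning steps,
# one every 100 training steps, starting at step 1000.
start = agp_sparsity(1000, s_f=0.9, t0=1000, n=10, dt=100)
middle = agp_sparsity(1500, s_f=0.9, t0=1000, n=10, dt=100)
end = agp_sparsity(2000, s_f=0.9, t0=1000, n=10, dt=100)
```

With these illustrative settings the schedule starts at 0, is already at 0.7875 halfway through the pruning phase, and ends at 0.9.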
Put another way: let $t = t_0 + a\Delta t$; then Eq. (1) can be rewritten as:
$$
s_t = s_f + (s_i - s_f)\left(1 - \frac{a}{n}\right)^3 \quad \text{for} \quad a \in \{0,\ 1,\ \ldots,\ n-1,\ n\} \qquad (2)
$$
Since $s_i = 0$ in the usual case, Eq. (2) can be simplified further to:
$$
s_t = s_f\left\{1 - \left(1 - \frac{a}{n}\right)^3\right\} \quad \text{for} \quad a \in \{0,\ 1,\ \ldots,\ n-1,\ n\} \qquad (3)
$$
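For completeness, the simplification is a direct substitution of $s_i = 0$ into Eq. (2):

$$
s_t = s_f + (0 - s_f)\left(1 - \frac{a}{n}\right)^3 = s_f - s_f\left(1 - \frac{a}{n}\right)^3 = s_f\left\{1 - \left(1 - \frac{a}{n}\right)^3\right\}
$$

As an illustrative check (the values $s_f = 0.9$, $n = 10$, $a = 5$ are chosen here only as an example), halfway through the pruning phase:

$$
s_t = 0.9\left\{1 - \left(1 - \frac{5}{10}\right)^3\right\} = 0.9 \times 0.875 = 0.7875
$$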
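The sparsity function in this equation prunes the network rapidly in the initial phase, when redundant connections are abundant, and gradually reduces the number of weights pruned at each step as fewer and fewer weights remain in the network.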
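This front-loaded behavior can be checked numerically. The sketch below evaluates Eq. (3) at every pruning step and prints how much additional sparsity each step adds; the function name `sparsity_schedule` and the values $s_f = 0.9$, $n = 10$ are illustrative assumptions.

```python
def sparsity_schedule(s_f, n):
    """Sparsity after each pruning step a = 0..n, following Eq. (3) (s_i = 0)."""
    return [s_f * (1.0 - (1.0 - a / n) ** 3) for a in range(n + 1)]


schedule = sparsity_schedule(s_f=0.9, n=10)

# Additional fraction of all weights removed at each pruning step.
increments = [after - before for before, after in zip(schedule, schedule[1:])]

for a, (s, d) in enumerate(zip(schedule[1:], increments), start=1):
    print(f"pruning step {a:2d}: sparsity = {s:.4f}, newly pruned = {d:.4f}")
```

With these settings the first pruning step removes about 24% of all weights, while the last one removes under 0.1%, and every step prunes less than the one before it.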