Towards Accurate Binary Convolutional Neural Network
Paper link (November 30, 2017)
Video link (YouTube)
Introduction
Main contributions:
1. Approximate the full-precision weights with a linear combination of multiple binary weight bases.
2. Introduce multiple binary activations, which improves the accuracy of BNNs on ImageNet by nearly 5%.
Related Work
The paper relies on the idea of finding the best approximation of the full-precision convolution using multiple binary operations, and of employing multiple binary activations to allow more information to pass through.
Binarization methods
Weight approximation
Denote the weight tensor of a layer by $(w,h,c_{in},c_{out})$. There are two quantization schemes: 1) approximate the weights as a whole, and 2) approximate the weights channel-wise.
Approximate weights as a whole
We use $M$ binarized filters $B_1,B_2,\cdots,B_M \in \{-1, +1\}^{w\times h\times c_{in}\times c_{out}}$ to approximate the real-valued weights $W\in \mathbb{R}^{w\times h\times c_{in}\times c_{out}}$, i.e., $W \approx \alpha_1B_1+\alpha_2B_2+\dots+\alpha_MB_M$. A straightforward approach is to solve the following problem:
$$\min _{\boldsymbol{\alpha}, \boldsymbol{B}} J(\boldsymbol{\alpha}, \boldsymbol{B})= \|\boldsymbol{w}-\boldsymbol{B}\boldsymbol{\alpha}\|^2 \quad \text{s.t. } \boldsymbol{B}_{i j} \in\{-1,+1\} \tag{1}$$
where $\boldsymbol{B}=\left[\operatorname{vec}\left(\boldsymbol{B}_{1}\right), \operatorname{vec}\left(\boldsymbol{B}_{2}\right), \cdots, \operatorname{vec}\left(\boldsymbol{B}_{M}\right)\right]$, $\boldsymbol{w}=\operatorname{vec}(\boldsymbol{W})$, and $\boldsymbol{\alpha}=\left[\alpha_{1}, \alpha_{2}, \cdots, \alpha_{M}\right]^{\mathrm{T}}$, with $\operatorname{vec}(\cdot)$ denoting vectorization. Let $\operatorname{mean}(\boldsymbol{W})$ and $\operatorname{std}(\boldsymbol{W})$ denote the mean and standard deviation of $\boldsymbol{W}$, respectively; the $\boldsymbol{B}_i$ are then taken as:
$$\boldsymbol{B}_{i}=F_{u_{i}}(\boldsymbol{W}):=\operatorname{sign}\left(\overline{\boldsymbol{W}}+u_{i} \operatorname{std}(\boldsymbol{W})\right),\quad i=1,2, \cdots, M\tag{2}$$
where $\overline{\boldsymbol{W}}=\boldsymbol{W}-\operatorname{mean}(\boldsymbol{W})$ and $u_i$ is a shift parameter. For example, the $u_i$ can be fixed as $u_i=-1+(i-1)\frac{2}{M-1},\ i=1,2,\cdots,M$ (e.g., for $M=3$: $u_1=-1$, $u_2=0$, $u_3=1$) so as to cover the whole range $[-\operatorname{std}(\boldsymbol{W}),\operatorname{std}(\boldsymbol{W})]$, or they can be learned by the network.
Once the $\boldsymbol{B}_i$ are fixed, the problem above reduces to a linear regression problem:
$$\min _{\boldsymbol{\alpha}} J(\boldsymbol{\alpha})=\|\boldsymbol{w}-\boldsymbol{B} \boldsymbol{\alpha}\|^{2}\tag{3}$$
where the $\boldsymbol{B}_i$ serve as the bases in the design/dictionary matrix. The $\boldsymbol{B}_i$ are then updated via the straight-through estimator (STE). Let $c$ be the cost function, and let $\boldsymbol{A}$ and $\boldsymbol{O}$ be the input and output tensors of the convolution; the forward and backward passes are computed as follows:
$$\begin{aligned} \text{Forward: } & B_{1}, B_{2}, \cdots, B_{M}=F_{u_{1}}(W), F_{u_{2}}(W), \cdots, F_{u_{M}}(W) \\ & \text{Solve (3) for } \alpha \\ & O=\sum_{m=1}^{M} \alpha_{m} \operatorname{Conv}\left(B_{m}, A\right) \\ \text{Backward: } & \frac{\partial c}{\partial W}=\frac{\partial c}{\partial O}\left(\sum_{m=1}^{M} \alpha_{m} \frac{\partial O}{\partial B_{m}} \frac{\partial B_{m}}{\partial W}\right) \stackrel{\text{STE}}{=} \frac{\partial c}{\partial O}\left(\sum_{m=1}^{M} \alpha_{m} \frac{\partial O}{\partial B_{m}}\right)=\sum_{m=1}^{M} \alpha_{m} \frac{\partial c}{\partial B_{m}} \end{aligned}$$
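To make the weight approximation concrete, here is a minimal PyTorch sketch of Eqs. (2)-(3) (my own function and variable names, not the authors' code), building the binary bases with fixed, evenly spaced shifts and solving the least-squares problem for $\boldsymbol{\alpha}$:

```python
import torch

def approximate_weights(W: torch.Tensor, M: int = 3):
    """Sketch of Eqs. (2)-(3): approximate W with M binary bases B_i and
    least-squares coefficients alpha. Assumes M >= 2 and fixed shifts u_i;
    the paper also allows the shifts to be learned."""
    w = W.reshape(-1)                      # w = vec(W)
    W_bar = W - W.mean()                   # centered weights
    std = W.std()
    # u_i = -1 + (i-1) * 2/(M-1), covering [-std(W), +std(W)]
    us = [-1 + i * 2.0 / (M - 1) for i in range(M)]
    # B_i = sign(W_bar + u_i * std); torch.where avoids sign(0) = 0
    bases = [torch.where(W_bar + u * std >= 0,
                         torch.ones_like(W), -torch.ones_like(W)) for u in us]
    # B = [vec(B_1), ..., vec(B_M)] as columns; Eq. (3) is a linear regression
    B = torch.stack([b.reshape(-1) for b in bases], dim=1)
    alpha = torch.linalg.lstsq(B, w.unsqueeze(1)).solution.squeeze(1)
    return bases, alpha

# Usage: W is approximated by sum_m alpha_m * B_m
W = torch.randn(3, 3, 64, 128)
bases, alpha = approximate_weights(W, M=3)
W_approx = sum(a * b for a, b in zip(alpha, bases))
```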
Multiple binary activations and bitwise convolution
To enable bitwise operations, the activations must also be binarized, since they are the inputs to the convolutions. The activations are first passed through a bounded activation function $h(x)\in [0,1]$:
$$h_v(x)=\operatorname{clip}(x+v,0,1)\tag{4}$$
where $v$ is a shift parameter. The binarization function is then:
$$H_{v}(\boldsymbol{R}):=2 \mathbb{I}_{h_{v}(\boldsymbol{R}) \geq 0.5}-1\tag{5}$$
where $\mathbb{I}$ is the indicator function. The forward and backward passes of the activation are computed as:
$$\begin{aligned} \text{Forward: } & \boldsymbol{A}=H_{v}(\boldsymbol{R}) \\ \text{Backward: } & \frac{\partial c}{\partial \boldsymbol{R}}=\frac{\partial c}{\partial \boldsymbol{A}} \circ \mathbb{I}_{0 \leq \boldsymbol{R}-v \leq 1} \text{ (using STE)} \end{aligned}$$
where $\circ$ denotes the Hadamard (element-wise) product.
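As an illustration, a minimal sketch of this binarization as a custom autograd function (my own naming, not the authors' code); the backward mask here passes gradients through the non-saturated region of $h_v$, i.e., where $0 \le R+v \le 1$:

```python
import torch

class BinarizeActivation(torch.autograd.Function):
    """Sketch of Eqs. (4)-(5) with a straight-through (STE) backward."""

    @staticmethod
    def forward(ctx, R, v):
        ctx.save_for_backward(R)
        ctx.v = v
        h = torch.clamp(R + v, 0.0, 1.0)   # h_v(R) = clip(R + v, 0, 1)
        # A = 2 * 1[h_v(R) >= 0.5] - 1, giving values in {-1, +1}
        return torch.where(h >= 0.5, torch.ones_like(h), -torch.ones_like(h))

    @staticmethod
    def backward(ctx, grad_A):
        (R,) = ctx.saved_tensors
        # STE: let gradients through only where h_v is not saturated
        mask = ((R + ctx.v >= 0) & (R + ctx.v <= 1)).to(grad_A.dtype)
        return grad_A * mask, None          # no gradient for the scalar v here

# Usage
R = torch.randn(8, 64, 32, 32, requires_grad=True)
A = BinarizeActivation.apply(R, 0.1)
```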
First, to keep the distribution of the activations relatively stable, batch normalization is applied immediately before the activation function. Then a linear combination of $N$ binary activations is used to approximate the real-valued activations, $R\approx \beta_1\boldsymbol{A}_1+\beta_2\boldsymbol{A}_2+\dots+\beta_N\boldsymbol{A}_N$, where
$$\boldsymbol{A}_1,\boldsymbol{A}_2,\dots,\boldsymbol{A}_N=H_{v_1}(\boldsymbol{R}),H_{v_2}(\boldsymbol{R}),\dots,H_{v_N}(\boldsymbol{R}) \tag{6}$$
Here $\beta_n$ and $v_n$ are trainable (fixed at test time), allowing the network to learn the statistics of the data distribution. The overall convolution then becomes:
$$\operatorname{Conv}(\boldsymbol{W}, \boldsymbol{R}) \approx \operatorname{Conv}\left(\sum_{m=1}^{M} \alpha_{m} \boldsymbol{B}_{m}, \sum_{n=1}^{N} \beta_{n} \boldsymbol{A}_{n}\right)=\sum_{m=1}^{M} \sum_{n=1}^{N} \alpha_{m} \beta_{n} \operatorname{Conv}\left(\boldsymbol{B}_{m}, \boldsymbol{A}_{n}\right)\tag{7}$$
This also means the $M\times N$ bitwise convolutions can be computed in parallel.
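Putting the pieces together, a minimal sketch of Eqs. (6)-(7) (module and parameter names are my own; the float `conv2d` calls stand in for the XNOR/popcount kernels a real deployment would use):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ABCConv2d(nn.Module):
    """Sketch of the forward pass of Eq. (7): M binary weight bases times
    N binary activations, combined as M*N independent binary convolutions."""

    def __init__(self, in_ch, out_ch, k=3, M=3, N=3):
        super().__init__()
        self.weight = nn.Parameter(0.05 * torch.randn(out_ch, in_ch, k, k))
        self.register_buffer("us", torch.linspace(-1, 1, M))  # fixed u_m
        self.vs = nn.Parameter(torch.linspace(-0.5, 0.5, N))  # trainable v_n
        self.betas = nn.Parameter(torch.full((N,), 1.0 / N))  # trainable beta_n

    def forward(self, R):
        W_bar = self.weight - self.weight.mean()
        std = self.weight.std()
        # Eq. (2): binary bases; Eq. (3): least-squares alphas
        B = torch.stack([torch.sign(W_bar + u * std) for u in self.us])
        Bmat = B.reshape(len(self.us), -1).T
        alphas = torch.linalg.lstsq(
            Bmat, self.weight.reshape(-1, 1)).solution.squeeze(1)
        out = 0
        for alpha, Bm in zip(alphas, B):
            for beta, v in zip(self.betas, self.vs):
                # Eqs. (5)-(6): A_n = H_{v_n}(R); clip(R+v,0,1) >= 0.5 iff R+v >= 0.5
                An = torch.where(R + v >= 0.5,
                                 torch.ones_like(R), -torch.ones_like(R))
                out = out + alpha * beta * F.conv2d(An, Bm, padding=1)
        return out

# Usage: the M*N terms are independent and could run in parallel
layer = ABCConv2d(64, 64)
y = layer(torch.randn(2, 64, 32, 32))
```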
Training algorithm
The authors note that the typical ordering of layers in a block is $\text{Conv}\rightarrow \text{BN}\rightarrow \text{Activation}\rightarrow \text{Pooling}$. In practice, however, max-pooling applied after the binary activation turns most of its outputs into $+1$s, which degrades accuracy. The max-pooling layer is therefore moved in front of the BN layer, as sketched below. The detailed training procedure is given in the supplementary material.
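A minimal sketch of the reordered block (placeholder modules; a simple hard binarization stands in for the full multi-activation step):

```python
import torch
import torch.nn as nn

class BinActive(nn.Module):
    """Placeholder for the multiple-binary-activation step described above."""
    def forward(self, x):
        return torch.where(x >= 0.5, torch.ones_like(x), -torch.ones_like(x))

# Typical ordering is Conv -> BN -> Activation -> Pooling, but max-pooling
# over {-1,+1} activations returns mostly +1; pooling is therefore moved
# directly after the convolution, in front of BN:
def abc_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.MaxPool2d(2),               # moved before BN
        nn.BatchNorm2d(out_ch),
        BinActive(),
    )
```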
Experiment results
Experiment results on ImageNet dataset
ResNet is used as the base network, and images are resized to 224×224.
Effect of weight approximation
ResNet-18 is used as the base network; BWN denotes the Binary-Weights-Network and FP the full-precision network. (The comparison table from the original post is not reproduced here.)
Comparison with the state-of-the-art
Discussion
Why does adding a shift parameter work?
The authors argue that, much like the mean and standard deviation in a BN layer, the shift parameters let the network learn the statistics of the data distribution.
Advantage over the fixed-point quantization scheme
The authors argue that a scheme with K binary bases is better than K-bit fixed-point quantization because: 1) it admits bitwise operations; 2) K 1-bit multipliers consume fewer resources than a single K-bit multiplier; 3) it preserves the impulse response.
Watching the video helps to better understand the ideas of the paper.