AI Study Notes
Quantization
Basic Knowledge
Concept
Convert weights and activation values from a high bit-width representation to a low bit width, e.g. Float32 -> Float16 / Int8 (most popular, range -128 to 127) / UInt8 (unsigned, range 0 to 255).
Pros
- Reduce model size: an Int8 model is about 1/4 the size of its Float32 counterpart
- Improve inference speed and process more data in the same amount of time
- Fit hardware accelerators such as DSPs/NPUs
Symmetric & Asymmetric Quantization
Symmetric Quantization
Float -> Int8
Quantization scale:
\Delta = \max(abs(r_{max}), abs(r_{min}))
Quantization procedure:
Round to the nearest integer using the scale, then clip to the target range to obtain a fixed-point value.
x_{init} = round(\frac{x}{\Delta})
x_Q = clip(x_{init},\ -\frac{N_{levels}}{2},\ \frac{N_{levels}}{2}-1),\ if\ signed
N_{levels} = 256\ \text{for 8 bits of precision}
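The symmetric path above can be sketched in numpy. Note one assumption: the note's $\Delta$ is the raw max magnitude, so to land values on the int8 grid this sketch divides it by $2^{n-1} = 128$, consistent with the weight scale formula given later in these notes.

```python
import numpy as np

def symmetric_quantize(x, n_bits=8):
    """Symmetric quantization: a single scale, zero point fixed at 0."""
    n_half = 2 ** (n_bits - 1)                       # 128 for 8 bits
    # scale: max magnitude spread over the signed integer grid (assumption, see lead-in)
    delta = max(abs(float(x.max())), abs(float(x.min()))) / n_half
    x_init = np.round(x / delta)                     # x_init = round(x / delta)
    x_q = np.clip(x_init, -n_half, n_half - 1)       # clip to [-N/2, N/2 - 1]
    return x_q.astype(np.int8), delta

x = np.array([-1.0, -0.5, 0.0, 0.5, 0.99])
x_q, delta = symmetric_quantize(x)                   # x_q -> [-128, -64, 0, 64, 127]
```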
Asymmetric Quantization
The procedure and principle are the same as symmetric quantization, except the range changes and a zero-point offset Z is added.
\Delta = (r_{max} - r_{min})/255
z = -\frac{r_{min}}{\Delta}
x_{init} = round(\frac{x}{\Delta}) + z
x_Q = clip(x_{init},\ 0,\ N_{levels}-1)
N_{levels} = 256\ \text{for 8 bits of precision}
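A minimal numpy sketch of the asymmetric formulas above (the integer rounding of the zero point is an implementation choice, not stated in the note):

```python
import numpy as np

def asymmetric_quantize(x, n_bits=8):
    """Asymmetric (affine) quantization: scale delta plus a zero point z."""
    n_levels = 2 ** n_bits
    rmin, rmax = float(x.min()), float(x.max())
    delta = (rmax - rmin) / (n_levels - 1)           # delta = (rmax - rmin) / 255
    z = int(round(-rmin / delta))                    # z = -rmin / delta, as an integer grid point
    x_init = np.round(x / delta) + z                 # x_init = round(x / delta) + z
    x_q = np.clip(x_init, 0, n_levels - 1)           # clip to [0, N_levels - 1]
    return x_q.astype(np.uint8), delta, z

x = np.array([-1.0, 0.0, 1.0])
x_q, delta, z = asymmetric_quantize(x)               # x_q -> [0, 128, 255], z = 128
```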
However, this unsaturated linear quantization is generally considered to lose a relatively large amount of precision.
Post-Training Quantization & Quantization-Aware Training
(Post-Training) TensorRT Quantization
Activation values -> saturated quantization, choosing an appropriate threshold abs(T)
Weights -> direct unsaturated quantization
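TensorRT calibrates the activation threshold T by minimizing the KL divergence between the float and quantized distributions; the calibration itself is beyond this note, but given a threshold, the saturated quantization step can be sketched as follows (the function name and the passed-in T are hypothetical):

```python
import numpy as np

def saturated_quantize(x, T, n_bits=8):
    """Saturated quantization: values beyond the calibrated threshold |T|
    are clipped, instead of stretching the scale to cover outliers."""
    n_half = 2 ** (n_bits - 1)
    delta = T / n_half                               # the scale covers only [-T, T]
    x_q = np.clip(np.round(x / delta), -n_half, n_half - 1)
    return x_q.astype(np.int8), delta

x = np.array([0.1, 0.5, 3.0])                        # 3.0 is an outlier
x_q, delta = saturated_quantize(x, T=1.0)            # x_q -> [13, 64, 127]
```

The outlier saturates at 127 rather than forcing a coarse scale on the small values, which is the point of choosing T below the raw max.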
Quantization-Aware Training (simulated quantization)
In the forward pass, weights and activations are quantized to 8-bit and then dequantized back to 32-bit floats that carry the quantization error; training itself still runs in floating point.
In the backward pass, the gradient is computed with respect to the simulated-quantized weights, and this gradient is used to update the original (pre-quantization) float weights.
Taking symmetric quantization as an example:
x_{init} = round(\frac{x}{\Delta})
x_Q = clip(x_{init},\ -\frac{N_{levels}}{2},\ \frac{N_{levels}}{2}-1),\ if\ signed
x_{out} = x_Q \Delta
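The three steps above (round, clip, dequantize) can be sketched as a single fake-quantization function; the name `sim_quant` and the per-tensor scale choice are assumptions:

```python
import numpy as np

def sim_quant(x, n_bits=8):
    """Quantize then immediately dequantize: the result stays in float
    but carries the rounding error that real int8 inference would introduce."""
    n_half = 2 ** (n_bits - 1)
    delta = max(abs(float(x.max())), abs(float(x.min()))) / n_half
    x_q = np.clip(np.round(x / delta), -n_half, n_half - 1)
    return x_q * delta                               # x_out = x_Q * delta

x = np.array([0.1, -0.7, 0.73])
x_out = sim_quant(x)                                 # close to x, but on the quantized grid
```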
Here $x_{out}$ is the dequantized output, which carries some quantization error; this value is then used in the forward pass.
For the gradient:
\omega_{float} = \omega_{float} - \eta \frac{\partial L}{\partial \omega_{out}} \cdot I_{\omega_{out}\in (\omega_{min}, \omega_{max})}
\omega_{out} = SimQuant(\omega_{float})
Here $SimQuant$ is the same procedure used above to compute $x_{out}$, and $\eta$ is the learning rate.
The goal is for the network to learn the error introduced by quantization.
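The update rule above is the straight-through estimator: the gradient with respect to the quantized weight is applied directly to the float weight, masked by the indicator so clipped weights receive no update. A minimal sketch (function name and toy values are hypothetical):

```python
import numpy as np

def ste_update(w_float, grad_out, w_min, w_max, lr=0.01):
    """One SGD step with the straight-through estimator: dL/dw_out is applied
    to w_float only where the weight fell inside the clipping range."""
    inside = (w_float > w_min) & (w_float < w_max)   # indicator I_{w in (w_min, w_max)}
    return w_float - lr * grad_out * inside

w = np.array([-2.0, 0.3, 2.0])                       # first/last lie outside the range
g = np.ones(3)                                       # pretend dL/dw_out = 1 everywhere
w_new = ste_update(w, g, w_min=-1.0, w_max=1.0)      # -> [-2.0, 0.29, 2.0]
```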
The weight scale is computed directly from the maximum absolute value in each forward pass:
weight\ scale = \max(abs(weight))/128
The activation scale is computed similarly, but its max is tracked during training with an exponential moving average (EMA):
max = max \cdot momenta + \max(abs(activation)) \cdot (1 - momenta),\ momenta = 0.95
scale = max/128
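The EMA range tracking above can be sketched as (the state-variable name is an assumption):

```python
import numpy as np

def update_ema_max(ema_max, activation, momenta=0.95):
    """max = max*momenta + max(abs(activation))*(1 - momenta)"""
    return ema_max * momenta + float(np.max(np.abs(activation))) * (1.0 - momenta)

ema = 1.0                                            # running max from earlier batches
ema = update_ema_max(ema, np.array([-2.0, 0.5]))     # 0.95*1.0 + 0.05*2.0 = 1.05
scale = ema / 128                                    # scale = max / 128
```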
Simulated-quantization training also needs batch norm folded into the convolution parameters to match inference. A convolution layer takes the original float inputs and computes the activation values, from which the batch-norm statistics (mean and variance) and parameters $\gamma$ and $\beta$ are obtained; these are folded back into the convolution weights, which are then quantized before the convolution is carried out.
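The standard folding identities are $w_{fold} = \gamma w / \sqrt{var + \epsilon}$ and $b_{fold} = \beta - \gamma \cdot mean / \sqrt{var + \epsilon}$. A per-output-channel sketch (the flat 2-D weight shape is a simplification of a real conv kernel):

```python
import numpy as np

def fold_batch_norm(w, gamma, beta, mean, var, eps=1e-5):
    """Fold BN parameters into conv weights/bias so the fused weights
    can be quantized directly. w: (out_channels, fan_in)."""
    inv_std = gamma / np.sqrt(var + eps)
    w_fold = w * inv_std[:, None]                    # scale each output channel's weights
    b_fold = beta - mean * inv_std                   # absorb the BN shift into the bias
    return w_fold, b_fold

w = np.ones((2, 3))
gamma, beta = np.array([2.0, 1.0]), np.zeros(2)
mean, var = np.array([1.0, 0.0]), np.array([1.0, 1.0])
w_fold, b_fold = fold_batch_norm(w, gamma, beta, mean, var)
```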
Implementation Details
- Quantized weights are restricted to (-127, 127). Normal 8-bit values lie in $[-2^7, 2^7-1]$; multiplying two of them gives a result in $(-2^{14}, 2^{14}]$, and accumulating two such products reaches $(-2^{15}, 2^{15}]$, which can exceed $2^{15}-1$, the largest positive value an int16 can represent. Restricting weights to (-127, 127) keeps each product strictly below $2^{14}$.
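The arithmetic behind this restriction, as a worked check:

```python
# With weights restricted to (-127, 127), a single int8*int8 product
# stays strictly below 2**14:
p = 127 * 127                  # 16129 < 2**14 = 16384
acc = 2 * p                    # 32258: two accumulations still fit in int16

# Allowing the full range down to -128 lets a product reach 2**14 exactly,
# so two such products overflow the int16 positive maximum 2**15 - 1:
p_full = (-128) * (-128)       # 16384
acc_full = 2 * p_full          # 32768 > 32767, overflow

INT16_MAX = 2**15 - 1
print(p, acc <= INT16_MAX, acc_full > INT16_MAX)
```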