0. Preface
I originally wanted to read the GEMM-related code to see how TensorFlow does efficient int8 inference, and then came across the note "most of gemmlowp is superseded by ruy", so I spent some time studying ruy.
1. Building the project
Clone the source and build it:
git clone https://github.com/google/ruy.git
cd ruy/
mkdir build
cd build/
cmake ..
During the build I hit this error:
CMake Error at CMakeLists.txt:76 (message):
This file does not exist:
xxx/ruy/third_party/cpuinfo/CMakeLists.txt
That typically means that the git submodules of the ruy repository haven't
been checked out.
So I fetched the git submodules as well:
git submodule update --init
cmake ..
make -j8
The build succeeded.
2. Running the demo
Run the bundled example to see it work:
./example/ruy_example_example
Example Mul, float:
LHS:
1 2
3 4
RHS:
1 3
2 4
Result:
5 11
11 25
Example Mul, float with bias addition and clamp:
LHS:
1 2
3 4
RHS:
1 3
2 4
Result:
6 12
11 15
Example Mul, uint8 quantized with asymmetric zero points:
LHS:
124 125
126 127
RHS:
129 131
130 132
Result:
131 130
126 129
Example Mul, int8 quantized with per-channel multipliers
LHS:
1 2
3 4
RHS:
1 3
2 4
Result:
4 8
2 4
Example Mul, returning raw int32 accumulators:
LHS:
1 2
3 4
RHS:
1 3
2 4
Result:
5 11
11 25
Example Mul, int8 times int16 quantized with per-channel multipliers
LHS:
1 2
3 4
RHS:
1000 3000
2000 4000
Result:
3750 8250
1719 3906
3. Example analysis
3.1 ExampleMulFloat
Test code
const float lhs_data[] = {1, 2, 3, 4};
const float rhs_data[] = {1, 2, 3, 4};
float dst_data[4];
ruy::Matrix<float> lhs;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kRowMajor, lhs.mutable_layout());
lhs.set_data(lhs_data);
ruy::Matrix<float> rhs;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, rhs.mutable_layout());
rhs.set_data(rhs_data);
ruy::Matrix<float> dst;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, dst.mutable_layout());
dst.set_data(dst_data);
ruy::MulParams<float, float> mul_params;
ruy::Mul(lhs, rhs, mul_params, context, &dst);
std::cout << "Example Mul, float:\n";
std::cout << "LHS:\n" << lhs;
std::cout << "RHS:\n" << rhs;
std::cout << "Result:\n" << dst << "\n";
Output
Example Mul, float:
LHS:
1 2
3 4
RHS:
1 3
2 4
Result:
5 11
11 25
Analysis
$$\left[ \begin{matrix} 1 & 2 \\ 3 & 4 \end{matrix} \right] \left[ \begin{matrix} 1 & 3 \\ 2 & 4 \end{matrix} \right] = \left[ \begin{matrix} 5 & 11 \\ 11 & 25 \end{matrix} \right]$$
This is a plain floating-point multiply; as long as the results match, there is nothing here I need to pay special attention to.
3.2 ExampleMulFloatWithBiasAddAndClamp
Test code
const float lhs_data[] = {1, 2, 3, 4};
const float rhs_data[] = {1, 2, 3, 4};
const float bias_data[] = {1, 0};
float dst_data[4];
ruy::Matrix<float> lhs;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kRowMajor, lhs.mutable_layout());
lhs.set_data(lhs_data);
ruy::Matrix<float> rhs;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, rhs.mutable_layout());
rhs.set_data(rhs_data);
ruy::Matrix<float> dst;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, dst.mutable_layout());
dst.set_data(dst_data);
ruy::MulParams<float, float> mul_params;
mul_params.set_bias(bias_data);
mul_params.set_clamp_min(0);
mul_params.set_clamp_max(15);
ruy::Mul(lhs, rhs, mul_params, context, &dst);
std::cout << "Example Mul, float with bias addition and clamp:\n";
std::cout << "LHS:\n" << lhs;
std::cout << "RHS:\n" << rhs;
std::cout << "Result:\n" << dst << "\n";
Output
Example Mul, float with bias addition and clamp:
LHS:
1 2
3 4
RHS:
1 3
2 4
Result:
6 12
11 15
Analysis
$$\left[ \begin{matrix} 1 & 2 \\ 3 & 4 \end{matrix} \right] \left[ \begin{matrix} 1 & 3 \\ 2 & 4 \end{matrix} \right] + \left[ \begin{matrix} 1 \\ 0 \end{matrix} \right] = \left[ \begin{matrix} 6 & 12 \\ 11 & 25 \end{matrix} \right]$$
The printed result differs from this because set_clamp_max(15) clamps the 25 down to 15.
3.3 ExampleMulUint8AsymmetricQuantized
Test code
const std::uint8_t lhs_data[] = {124, 125, 126, 127};
const std::uint8_t rhs_data[] = {129, 130, 131, 132};
std::uint8_t dst_data[4];
ruy::Matrix<std::uint8_t> lhs;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kRowMajor, lhs.mutable_layout());
lhs.set_data(lhs_data);
lhs.set_zero_point(125);
ruy::Matrix<std::uint8_t> rhs;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, rhs.mutable_layout());
rhs.set_data(rhs_data);
rhs.set_zero_point(132);
ruy::Matrix<std::uint8_t> dst;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, dst.mutable_layout());
dst.set_data(dst_data);
dst.set_zero_point(129);
ruy::MulParams<std::int32_t, std::uint8_t> mul_params;
mul_params.set_multiplier_fixedpoint(1 << 30);
mul_params.set_multiplier_exponent(0);
ruy::Mul(lhs, rhs, mul_params, context, &dst);
std::cout << "Example Mul, uint8 quantized with asymmetric zero points:\n";
std::cout << "LHS:\n" << lhs;
std::cout << "RHS:\n" << rhs;
std::cout << "Result:\n" << dst << "\n";
Output
Example Mul, uint8 quantized with asymmetric zero points:
LHS:
124 125
126 127
RHS:
129 131
130 132
Result:
131 130
126 129
Analysis
Now for the main event. Let me start from my own reading of what this code computes: after subtracting the zero_point from each input, the multiply should effectively be
$$\left[ \begin{matrix} -1 & 0 \\ 1 & 2 \end{matrix} \right] \left[ \begin{matrix} -3 & -1 \\ -2 & 0 \end{matrix} \right] = \left[ \begin{matrix} 3 & 1 \\ -7 & -1 \end{matrix} \right]$$
But the actual result, after the destination zero_point is added back, is
$$\left[ \begin{matrix} 132 & 130 \\ 122 & 128 \end{matrix} \right] \neq \left[ \begin{matrix} 131 & 130 \\ 126 & 129 \end{matrix} \right]$$
with no obvious pattern. So here is the key point: what does this int8 inference actually do?
The conclusion first: in this computation, shift = 0, quantized_multiplier = 1 << 30, and zero_point = 129, so
$$result = \lfloor{(x * 2^{30} + 2^{30}) \over 2^{31}}\rfloor + 129$$
Then:
result(3) = 2 + 129 = 131
result(1) = 1 + 129 = 130
result(-7) = -3 + 129 = 126
result(-1) = 0 + 129 = 129
Rescaling matters a great deal in int8 computation; a well-chosen rescale preserves as much precision as possible.
The formula above comes from this piece of ruy code:
// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
// Warning: this code is not meant to be bit-exact-normative.
// Please refer to the class comment of ruy::MulParams, in mul_params.h.
// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
// Preconditions:
// - quantized_multiplier >= 0
// - shift is -31 to +7 (negative for right shift)
std::int32_t MultiplyByQuantizedMultiplier(std::int32_t x,
                                           std::int32_t quantized_multiplier,
                                           int shift) {
  RUY_CHECK_GE(shift, -31);
  int total_shift = 31 - shift;
  std::int64_t x_64(x);
  std::int64_t quantized_multiplier_64(quantized_multiplier);
  std::int64_t round = (int64_t)1 << (total_shift - 1);
  int64_t result = x_64 * quantized_multiplier_64 + round;
  result = result >> total_shift;
  RUY_DCHECK_GE(result, std::numeric_limits<std::int32_t>::lowest());
  RUY_DCHECK_LE(result, std::numeric_limits<std::int32_t>::max());
  return static_cast<std::int32_t>(result);
}
The two key lines,
result = x_64 * quantized_multiplier_64 + round;
result = result >> total_shift;
written as a formula for clarity:
$$\begin{aligned}
result &= (x * quantized\_multiplier + round) >> total\_shift \\
&= (x * quantized\_multiplier + 2^{total\_shift - 1}) >> total\_shift \\
&= (x * quantized\_multiplier + 2^{30-shift}) >> (31-shift) \\
&= \lfloor{(x * quantized\_multiplier + 2^{30-shift}) \over 2^{31-shift}}\rfloor
\end{aligned}$$
This explains the computed results of this example.
That covers how the result is rescaled; the computation path itself is just as important.
I had originally assumed that the inputs first have their zero_points subtracted and are then multiplied:
$$\left[ \begin{matrix} -1 & 0 \\ 1 & 2 \end{matrix} \right] \left[ \begin{matrix} -3 & -1 \\ -2 & 0 \end{matrix} \right] = \left[ \begin{matrix} 3 & 1 \\ -7 & -1 \end{matrix} \right]$$
But in fact the input's zero_point is never subtracted; instead, the product is expanded algebraically:
$$Y = fun(XW + bias)$$
where X is the input, W the weights, bias the bias, and fun() the rescaling function. In the quantized version,
Y
=
f
u
n
[
(
X
−
Z
x
)
(
W
−
Z
w
)
+
b
i
a
s
]
+
Z
y
Y=fun[(X-Z_x)(W-Z_w)+bias]+Z_y \\\hspace{50cm}
Y=fun[(X−Zx)(W−Zw)+bias]+Zy
=
f
u
n
[
X
W
−
X
Z
w
−
Z
x
W
+
Z
x
Z
w
+
b
i
a
s
]
+
Z
y
\\\hspace{0.45cm} =fun[XW-XZ_w-Z_xW+Z_xZ_w+bias]+Z_y \\\hspace{50cm}
=fun[XW−XZw−ZxW+ZxZw+bias]+Zy
In this example, that is
$$(\left[ \begin{matrix} 124 & 125 \\ 126 & 127 \end{matrix} \right] - \left[ \begin{matrix} 125 & 125 \\ 125 & 125 \end{matrix} \right])(\left[ \begin{matrix} 129 & 131 \\ 130 & 132 \end{matrix} \right] - \left[ \begin{matrix} 132 & 132 \\ 132 & 132 \end{matrix} \right])$$
$$= \left[ \begin{matrix} 124 & 125 \\ 126 & 127 \end{matrix} \right] \left[ \begin{matrix} 129 & 131 \\ 130 & 132 \end{matrix} \right] - \left[ \begin{matrix} 125 & 125 \\ 125 & 125 \end{matrix} \right] \left[ \begin{matrix} 129 & 131 \\ 130 & 132 \end{matrix} \right] - \left[ \begin{matrix} 124 & 125 \\ 126 & 127 \end{matrix} \right] \left[ \begin{matrix} 132 & 132 \\ 132 & 132 \end{matrix} \right] + \left[ \begin{matrix} 125 & 125 \\ 125 & 125 \end{matrix} \right] \left[ \begin{matrix} 132 & 132 \\ 132 & 132 \end{matrix} \right]$$
Written out like this, the expansion looks more complicated, but in the code the zero_points are just the scalars 125 and 132; no such matrices are ever materialized. A few small checks also cut down the work. The logic is roughly as follows (reading the source is recommended):
// Take the first output element as an example
accum = 124 * 129 + 125 * 130 = 32246
// There is no bias in this example, so this step is skipped
if (bias)
  accum += bias
// A nice property: when the zero_point is 0, this correction is skipped entirely.
// Also, summing the column first and multiplying once is cheaper than per-element work.
if (lhs.zero_point)
  accum -= lhs.zero_point * rhs.sums[j]
        = 32246 - 125 * (129 + 130)
        = -129
// Same as above
if (rhs.zero_point)
  accum -= rhs.zero_point * lhs.sums[i]
        = -129 - 132 * (124 + 125)
        = -32997
// Skipped when either zero_point is 0; using depth turns this into a single multiply
if (lhs.zero_point && rhs.zero_point)
  accum += lhs.zero_point * rhs.zero_point * depth
        = -32997 + 125 * 132 * 2
        = 3
$$result(3) = \lfloor{(3 * 2^{30} + 2^{30}) \over 2^{31}}\rfloor = 2$$
accum += dst->zero_point
      = 2 + 129
      = 131
3.4 ExampleMulInt8PerChannelQuantized
Test code
const std::int8_t lhs_data[] = {1, 2, 3, 4};
const std::int8_t rhs_data[] = {1, 2, 3, 4};
const std::int32_t multiplier_data[] = {3 << 28, 5 << 28};
const int exponent_data[] = {1, -2};
std::int8_t dst_data[4];
ruy::Matrix<std::int8_t> lhs;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kRowMajor, lhs.mutable_layout());
lhs.set_data(lhs_data);
ruy::Matrix<std::int8_t> rhs;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, rhs.mutable_layout());
rhs.set_data(rhs_data);
ruy::Matrix<std::int8_t> dst;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, dst.mutable_layout());
dst.set_data(dst_data);
ruy::MulParams<std::int32_t, std::int8_t> mul_params;
mul_params.set_multiplier_fixedpoint_perchannel(multiplier_data);
mul_params.set_multiplier_exponent_perchannel(exponent_data);
ruy::Mul(lhs, rhs, mul_params, context, &dst);
std::cout << "Example Mul, int8 quantized with per-channel multipliers\n";
std::cout << "LHS:\n" << lhs;
std::cout << "RHS:\n" << rhs;
std::cout << "Result:\n" << dst << "\n";
Output
Example Mul, int8 quantized with per-channel multipliers
LHS:
1 2
3 4
RHS:
1 3
2 4
Result:
4 8
2 4
Analysis
$$\left[ \begin{matrix} 1 & 2 \\ 3 & 4 \end{matrix} \right] \left[ \begin{matrix} 1 & 3 \\ 2 & 4 \end{matrix} \right] = \left[ \begin{matrix} 5 & 11 \\ 11 & 25 \end{matrix} \right]$$
With the previous section understood, the analysis here is much easier.
Take shift = 1, quantized_multiplier = 3 << 28 (the first output row):
$$result = \lfloor{(x * 3 * 2^{28} + 2^{29}) \over 2^{30}}\rfloor$$
result(5) = 4
result(11) = 8
Take shift = -2, quantized_multiplier = 5 << 28 (the second output row):
$$result = \lfloor{(x * 5 * 2^{28} + 2^{32}) \over 2^{33}}\rfloor$$
result(11) = 2
result(25) = 4
3.5 raw int32 accumulators
Not covered here for now.
3.6 int8 times int16 quantized with per-channel multipliers
Not covered here for now.
4. How the pack step handles the ZeroPoint
ruy/pack_common.h
PackedScalar Pack(Scalar x) {
return x - SymmetricZeroPoint<Scalar>() + SymmetricZeroPoint<PackedScalar>();
}
where:
Scalar SymmetricZeroPoint() {
if (std::is_floating_point<Scalar>::value) {
return 0;
}
if (std::is_signed<Scalar>::value) {
return 0;
}
return std::numeric_limits<Scalar>::max() / 2 + 1;
}
Here Scalar is unsigned char, so std::numeric_limits<Scalar>::max() = 255 and the function returns 128; PackedScalar is a signed type, so it returns 0. Given x = 125, the packed result is 125 - 128 + 0 = -3.
This step is invoked from the PopulateTrMulParamsAllCompiledPaths function in ruy/create_trmul_params.h.
5. Source code analysis
These notes are fairly rough; reading the source yourself is recommended.
Let's analyze the Mul function:
ruy::Mul(lhs, rhs, mul_params, context, &dst);
This function lives in ./ruy/ruy.h:
void Mul(const Matrix<LhsScalar>& lhs, const Matrix<RhsScalar>& rhs,
         const MulParams<AccumScalar, DstScalar>& mul_params, Context* context,
         Matrix<DstScalar>* dst) {
  RUY_TRACE_SCOPE;
  RUY_TRACE_INFO(MUL);
  Mat<LhsScalar> internal_lhs = ToInternal(lhs);
  Mat<RhsScalar> internal_rhs = ToInternal(rhs);
  Mat<DstScalar> internal_dst = ToInternal(*dst);
  MulFrontEnd<CompiledPaths>(internal_lhs, internal_rhs, mul_params,
                             get_ctx(context), &internal_dst);
}
- Mul(lhs, rhs, mul_params, context, &dst);
- Mat internal_lhs = ToInternal(lhs); // Matrix -> Mat
- Mat internal_rhs = ToInternal(rhs);
- Mat internal_dst = ToInternal(*dst);
- MulFrontEnd(internal_lhs, internal_rhs, mul_params, get_ctx(context), &internal_dst);
- ctx->clear_performance_advisories();
- sets up performance_advisory_
- TrMulParams params;
- MulFrontEndUpToCreateTrMulParams(lhs, rhs, *dst, mul_params, ctx, &params);
- The first half of front-end work, up to the point where we have TrMulParams. In other words, this is the part of the front-end work that needs to be templatized like the entry point, and that performs the initial work that requires this templatization, and the de-templatization. The output of this function is the TrMulParams, which contain enough information to allow the un-templatized code to take over from there.
- Validate(lhs, rhs, dst); // Perform validation of parameters early so that failures are easier to map to user errors. In particular, perform this validation before the transposition.
- CreateTrMulParams(Transpose(lhs), rhs, dst, mul_params, ctx, params); // The Transpose(lhs) here is where we switch from 'Mul' to 'TrMul'.
- MulFrontEndFromTrMulParams(ctx, &params);
- PreparePackedMatrices(ctx, params);
- TrMul(ctx, params);
- We're done with the front-end work, now enter the middle-end.
- TrMul is ruy's middle-end. It contains the high-level logic to perform a ruy::Mul, down to the calls into the back-end Kernel and Pack functions: deciding how many threads to use, computing the BlockMap, and executing tasks on the thread pool. TrMul itself runs on the main thread; the code that may run on worker threads is in TrMulTask::Run().
- ctx->GetMainAllocator()->FreeAll();
- ctx->clear_performance_advisories();