0. Preface
I originally wanted to read the GEMM-related code to see how TensorFlow does efficient int8 inference, and then came across the note "most of gemmlowp is superseded by ruy", so I spent some time studying ruy.
1. Building the project
Clone the source and build it:
git clone https://github.com/google/ruy.git
cd ruy/
mkdir build
cd build/
cmake ..
During the build I hit this error:
CMake Error at CMakeLists.txt:76 (message):
This file does not exist:
xxx/ruy/third_party/cpuinfo/CMakeLists.txt
That typically means that the git submodules of the ruy repository haven't
been checked out.
So I fetched the git submodules as well:
git submodule update --init
cmake ..
make -j8
The build succeeded.
2. Running the demo
Run the bundled example to see it work:
./example/ruy_example_example
Example Mul, float:
LHS:
1 2
3 4
RHS:
1 3
2 4
Result:
5 11
11 25
Example Mul, float with bias addition and clamp:
LHS:
1 2
3 4
RHS:
1 3
2 4
Result:
6 12
11 15
Example Mul, uint8 quantized with asymmetric zero points:
LHS:
124 125
126 127
RHS:
129 131
130 132
Result:
131 130
126 129
Example Mul, int8 quantized with per-channel multipliers
LHS:
1 2
3 4
RHS:
1 3
2 4
Result:
4 8
2 4
Example Mul, returning raw int32 accumulators:
LHS:
1 2
3 4
RHS:
1 3
2 4
Result:
5 11
11 25
Example Mul, int8 times int16 quantized with per-channel multipliers
LHS:
1 2
3 4
RHS:
1000 3000
2000 4000
Result:
3750 8250
1719 3906
3. Example analysis
3.1 ExampleMulFloat
Test code
const float lhs_data[] = {1, 2, 3, 4};
const float rhs_data[] = {1, 2, 3, 4};
float dst_data[4];
ruy::Matrix<float> lhs;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kRowMajor, lhs.mutable_layout());
lhs.set_data(lhs_data);
ruy::Matrix<float> rhs;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, rhs.mutable_layout());
rhs.set_data(rhs_data);
ruy::Matrix<float> dst;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, dst.mutable_layout());
dst.set_data(dst_data);
ruy::MulParams<float, float> mul_params;
ruy::Mul(lhs, rhs, mul_params, context, &dst);
std::cout << "Example Mul, float:\n";
std::cout << "LHS:\n" << lhs;
std::cout << "RHS:\n" << rhs;
std::cout << "Result:\n" << dst << "\n";
Output
Example Mul, float:
LHS:
1 2
3 4
RHS:
1 3
2 4
Result:
5 11
11 25
Analysis
$$\left[ \begin{matrix} 1 & 2 \\ 3 & 4 \end{matrix} \right] \left[ \begin{matrix} 1 & 3 \\ 2 & 4 \end{matrix} \right] = \left[ \begin{matrix} 5 & 11 \\ 11 & 25 \end{matrix} \right]$$
This is a plain floating-point multiply; as long as the results match, there is nothing here I need to pay special attention to.
3.2 ExampleMulFloatWithBiasAddAndClamp
Test code
const float lhs_data[] = {1, 2, 3, 4};
const float rhs_data[] = {1, 2, 3, 4};
const float bias_data[] = {1, 0};
float dst_data[4];
ruy::Matrix<float> lhs;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kRowMajor, lhs.mutable_layout());
lhs.set_data(lhs_data);
ruy::Matrix<float> rhs;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, rhs.mutable_layout());
rhs.set_data(rhs_data);
ruy::Matrix<float> dst;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, dst.mutable_layout());
dst.set_data(dst_data);
ruy::MulParams<float, float> mul_params;
mul_params.set_bias(bias_data);
mul_params.set_clamp_min(0);
mul_params.set_clamp_max(15);
ruy::Mul(lhs, rhs, mul_params, context, &dst);
std::cout << "Example Mul, float with bias addition and clamp:\n";
std::cout << "LHS:\n" << lhs;
std::cout << "RHS:\n" << rhs;
std::cout << "Result:\n" << dst << "\n";
Output
Example Mul, float with bias addition and clamp:
LHS:
1 2
3 4
RHS:
1 3
2 4
Result:
6 12
11 15
Analysis
$$\left[ \begin{matrix} 1 & 2 \\ 3 & 4 \end{matrix} \right] \left[ \begin{matrix} 1 & 3 \\ 2 & 4 \end{matrix} \right] + \left[ \begin{matrix} 1 \\ 0 \end{matrix} \right] = \left[ \begin{matrix} 6 & 12 \\ 11 & 25 \end{matrix} \right]$$
The printed result differs from this because set_clamp_max(15) clamps the 25 down to 15.
3.3 ExampleMulUint8AsymmetricQuantized
Test code
const std::uint8_t lhs_data[] = {124, 125, 126, 127};
const std::uint8_t rhs_data[] = {129, 130, 131, 132};
std::uint8_t dst_data[4];
ruy::Matrix<std::uint8_t> lhs;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kRowMajor, lhs.mutable_layout());
lhs.set_data(lhs_data);
lhs.set_zero_point(125);
ruy::Matrix<std::uint8_t> rhs;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, rhs.mutable_layout());
rhs.set_data(rhs_data);
rhs.set_zero_point(132);
ruy::Matrix<std::uint8_t> dst;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, dst.mutable_layout());
dst.set_data(dst_data);
dst.set_zero_point(129);
ruy::MulParams<std::int32_t, std::uint8_t> mul_params;
mul_params.set_multiplier_fixedpoint(1 << 30);
mul_params.set_multiplier_exponent(0);
ruy::Mul(lhs, rhs, mul_params, context, &dst);
std::cout << "Example Mul, uint8 quantized with asymmetric zero points:\n";
std::cout << "LHS:\n" << lhs;
std::cout << "RHS:\n" << rhs;
std::cout << "Result:\n" << dst << "\n";
Output
Example Mul, uint8 quantized with asymmetric zero points:
LHS:
124 125
126 127
RHS:
129 131
130 132
Result:
131 130
126 129
Analysis
Now for the main event. Let me start from my own reading of what this code computes: after subtracting the zero_point from each input, the multiply should effectively be
$$\left[ \begin{matrix} -1 & 0 \\ 1 & 2 \end{matrix} \right] \left[ \begin{matrix} -3 & -1 \\ -2 & 0 \end{matrix} \right] = \left[ \begin{matrix} 3 & 1 \\ -7 & -1 \end{matrix} \right]$$
But the actual result, after the destination zero_point is added back, is
$$\left[ \begin{matrix} 132 & 130 \\ 122 & 128 \end{matrix} \right] \neq \left[ \begin{matrix} 131 & 130 \\ 126 & 129 \end{matrix} \right]$$
with no obvious pattern. So here is the key point: what does this int8 inference actually do?
The conclusion first: in this computation, shift = 0, quantized_multiplier = 1 << 30, and zero_point = 129, so
$$result = \lfloor{(x * 2^{30} + 2^{30}) \over 2^{31}}\rfloor + 129$$
Then:
result(3) = 2 + 129 = 131
result(1) = 1 + 129 = 130
result(-7) = -3 + 129 = 126
result(-1) = 0 + 129 = 129
Rescaling matters a great deal in int8 computation; a well-chosen rescale preserves as much precision as possible.
The formula above comes from this piece of ruy code:
// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
// Warning: this code is not meant to be bit-exact-normative.
// Please refer to the class comment of ruy::MulParams, in mul_params.h.
// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
// Preconditions:
// - quantized_multiplier >= 0
// - shift is -31 to +7 (negative for right shift)
std::int32_t MultiplyByQuantizedMultiplier(std::int32_t x,
                                           std::int32_t quantized_multiplier,
                                           int shift) {
  RUY_CHECK_GE(shift, -31);
  int total_shift = 31 - shift;
  std::int64_t x_64(x);
  std::int64_t quantized_multiplier_64(quantized_multiplier);
  std::int64_t round = (int64_t)1 << (total_shift - 1);
  int64_t result = x_64 * quantized_multiplier_64 + round;
  result = result >> total_shift;
  RUY_DCHECK_GE(result, std::numeric_limits<std::int32_t>::lowest());
  RUY_DCHECK_LE(result, std::numeric_limits<std::int32_t>::max());
  return static_cast<std::int32_t>(result);
}
The two key lines,
result = x_64 * quantized_multiplier_64 + round;
result = result >> total_shift;
written as a formula for clarity:
$$\begin{aligned}
result &= (x * quantized\_multiplier + round) >> total\_shift \\
&= (x * quantized\_multiplier + 2^{total\_shift - 1}) >> total\_shift \\
&= (x * quantized\_multiplier + 2^{30-shift}) >> (31-shift) \\
&= \lfloor{(x * quantized\_multiplier + 2^{30-shift}) \over 2^{31-shift}}\rfloor
\end{aligned}$$
This explains the computed results of this example.
That covers how the result is rescaled; the computation path itself is just as important.
I had originally assumed that the inputs first have their zero_points subtracted and are then multiplied:
$$\left[ \begin{matrix} -1 & 0 \\ 1 & 2 \end{matrix} \right] \left[ \begin{matrix} -3 & -1 \\ -2 & 0 \end{matrix} \right] = \left[ \begin{matrix} 3 & 1 \\ -7 & -1 \end{matrix} \right]$$
But in fact the input's zero_point is never subtracted; instead, the product is expanded algebraically:
$$Y = fun(XW + bias)$$
where X is the input, W the weights, bias the bias, and fun() the rescaling function. In the quantized version,
Y
=
f
u
n
[
(
X
−
Z
x
)
(
W
−
Z
w
)
+
b
i
a
s
]
+
Z
y
Y=fun[(X-Z_x)(W-Z_w)+bias]+Z_y \\\hspace{50cm}
Y=fun[(X−Zx)(W−Zw)+bias]+Zy
=
f
u
n
[
X
W
−
X
Z
w
−
Z
x
W
+
Z
x
Z
w
+
b
i
a
s
]
+
Z
y
\\\hspace{0.45cm} =fun[XW-XZ_w-Z_xW+Z_xZ_w+bias]+Z_y \\\hspace{50cm}
=fun[XW−XZw−ZxW+ZxZw+bias]+Zy
In this example, that is
$$(\left[ \begin{matrix} 124 & 125 \\ 126 & 127 \end{matrix} \right] - \left[ \begin{matrix} 125 & 125 \\ 125 & 125 \end{matrix} \right])(\left[ \begin{matrix} 129 & 131 \\ 130 & 132 \end{matrix} \right] - \left[ \begin{matrix} 132 & 132 \\ 132 & 132 \end{matrix} \right])$$
$$= \left[ \begin{matrix} 124 & 125 \\ 126 & 127 \end{matrix} \right] \left[ \begin{matrix} 129 & 131 \\ 130 & 132 \end{matrix} \right] - \left[ \begin{matrix} 125 & 125 \\ 125 & 125 \end{matrix} \right] \left[ \begin{matrix} 129 & 131 \\ 130 & 132 \end{matrix} \right] - \left[ \begin{matrix} 124 & 125 \\ 126 & 127 \end{matrix} \right] \left[ \begin{matrix} 132 & 132 \\ 132 & 132 \end{matrix} \right] + \left[ \begin{matrix} 125 & 125 \\ 125 & 125 \end{matrix} \right] \left[ \begin{matrix} 132 & 132 \\ 132 & 132 \end{matrix} \right]$$
Written out like this, the expansion looks more complicated, but in the code the zero_points are just the scalars 125 and 132; no such matrices are ever materialized. A few small checks also cut down the work. The logic is roughly as follows (reading the source is recommended):
// Take the first output element as an example
accum = 124 * 129 + 125 * 130 = 32246
// There is no bias in this example, so this step is skipped
if (bias)
  accum += bias
// A nice property: when the zero_point is 0, this correction is skipped entirely.
// Also, summing the column first and multiplying once is cheaper than per-element work.
if (lhs.zero_point)
  accum -= lhs.zero_point * rhs.sums[j]
        = 32246 - 125 * (129 + 130)
        = -129
// Same as above
if (rhs.zero_point)
  accum -= rhs.zero_point * lhs.sums[i]
        = -129 - 132 * (124 + 125)
        = -32997
// Skipped when either zero_point is 0; using depth turns this into a single multiply
if (lhs.zero_point && rhs.zero_point)
  accum += lhs.zero_point * rhs.zero_point * depth
        = -32997 + 125 * 132 * 2
        = 3
$$result(3) = \lfloor{(3 * 2^{30} + 2^{30}) \over 2^{31}}\rfloor = 2$$
accum += dst->zero_point
      = 2 + 129
      = 131
3.4 ExampleMulInt8PerChannelQuantized
Test code
const std::int8_t lhs_data[] = {1, 2, 3, 4};
const std::int8_t rhs_data[] = {1, 2, 3, 4};
const std::int32_t multiplier_data[] = {3 << 28, 5 << 28};
const int exponent_data[] = {1, -2};
std::int8_t dst_data[4];
ruy::Matrix<std::int8_t> lhs;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kRowMajor, lhs.mutable_layout());
lhs.set_data(lhs_data);
ruy::Matrix<std::int8_t> rhs;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, rhs.mutable_layout());
rhs.set_data(rhs_data);
ruy::Matrix<std::int8_t> dst;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, dst.mutable_layout());
dst.set_data(dst_data);
ruy::MulParams<std::int32_t, std::int8_t> mul_params;
mul_params.set_multiplier_fixedpoint_perchannel(multiplier_data);
mul_params.set_multiplier_exponent_perchannel(exponent_data);
ruy::Mul(lhs, rhs, mul_params, context, &dst);
std::cout << "Example Mul, int8 quantized with per-channel multipliers\n";
std::cout << "LHS:\n" << lhs;
std::cout << "RHS:\n" << rhs;
std::cout << "Result:\n" << dst << "\n";
Output
Example Mul, int8 quantized with per-channel multipliers
LHS:
1 2
3 4
RHS:
1 3
2 4
Result:
4 8
2 4
Analysis
$$\left[ \begin{matrix} 1 & 2 \\ 3 & 4 \end{matrix} \right] \left[ \begin{matrix} 1 & 3 \\ 2 & 4 \end{matrix} \right] = \left[ \begin{matrix} 5 & 11 \\ 11 & 25 \end{matrix} \right]$$
With the previous section understood, the analysis here is much easier.
Take shift = 1, quantized_multiplier = 3 << 28 (the first output row):
$$result = \lfloor{(x * 3 * 2^{28} + 2^{29}) \over 2^{30}}\rfloor$$
result(5) = 4
result(11) = 8
Take shift = -2, quantized_multiplier = 5 << 28 (the second output row):
$$result = \lfloor{(x * 5 * 2^{28} + 2^{32}) \over 2^{33}}\rfloor$$
result(11) = 2
result(25) = 4
3.5 raw int32 accumulators
Not covered here for now.
3.6 int8 times int16 quantized with per-channel multipliers
Not covered here for now.
4. How the pack step handles the ZeroPoint
ruy/pack_common.h
PackedScalar Pack(Scalar x) {
return x - SymmetricZeroPoint<Scalar>() + SymmetricZeroPoint<PackedScalar>();
}
where:
Scalar SymmetricZeroPoint() {
if (std::is_floating_point<Scalar>::value) {
return 0;
}
if (std::is_signed<Scalar>::value) {
return 0;
}
return std::numeric_limits<Scalar>::max() / 2 + 1;
}
Here Scalar is unsigned char, so std::numeric_limits<Scalar>::max() = 255 and the function returns 128; PackedScalar is a signed type, so it returns 0. Given x = 125, the packed result is 125 - 128 + 0 = -3.
This step is invoked from the PopulateTrMulParamsAllCompiledPaths function in ruy/create_trmul_params.h.
5. Source code analysis
These notes are fairly rough; reading the source yourself is recommended.
Let's analyze the Mul function:
ruy::Mul(lhs, rhs, mul_params, context, &dst);
This function lives in ./ruy/ruy.h:
void Mul(const Matrix<LhsScalar>& lhs, const Matrix<RhsScalar>& rhs,
         const MulParams<AccumScalar, DstScalar>& mul_params, Context* context,
         Matrix<DstScalar>* dst) {
  RUY_TRACE_SCOPE;
  RUY_TRACE_INFO(MUL);
  Mat<LhsScalar> internal_lhs = ToInternal(lhs);
  Mat<RhsScalar> internal_rhs = ToInternal(rhs);
  Mat<DstScalar> internal_dst = ToInternal(*dst);
  MulFrontEnd<CompiledPaths>(internal_lhs, internal_rhs, mul_params,
                             get_ctx(context), &internal_dst);
}
- Mul(lhs, rhs, mul_params, context, &dst);
- Mat internal_lhs = ToInternal(lhs); // Matrix -> Mat
- Mat internal_rhs = ToInternal(rhs);
- Mat internal_dst = ToInternal(*dst);
- MulFrontEnd(internal_lhs, internal_rhs, mul_params, get_ctx(context), &internal_dst);
- ctx->clear_performance_advisories();
- sets up performance_advisory_
- TrMulParams params;
- MulFrontEndUpToCreateTrMulParams(lhs, rhs, *dst, mul_params, ctx, &params);
- The first half of front-end work, up to the point where we have TrMulParams. In other words, this is the part of the front-end work that needs to be templatized like the entry point, and that performs the initial work that requires this templatization, and the de-templatization. The output of this function is the TrMulParams, which contain enough information to allow the un-templatized code to take over from there.
- Validate(lhs, rhs, dst); // Perform validation of parameters early so that failures are easier to map to user errors. In particular, perform this validation before the transposition.
- CreateTrMulParams(Transpose(lhs), rhs, dst, mul_params, ctx, params); // The Transpose(lhs) here is where we switch from 'Mul' to 'TrMul'.
- MulFrontEndFromTrMulParams(ctx, &params);
- PreparePackedMatrices(ctx, params);
- TrMul(ctx, params);
- We're done with the front-end work, now enter the middle-end.
- TrMul is ruy's middle-end. It contains the high-level logic to perform a ruy::Mul, down to the calls into the back-end Kernel and Pack functions: deciding how many threads to use, computing the BlockMap, and executing tasks on the thread pool. TrMul itself runs on the main thread; the code that may run on worker threads is in TrMulTask::Run().
- ctx->GetMainAllocator()->FreeAll();
- ctx->clear_performance_advisories();