【RUY】tensorflow int8量化推理代码学习

0、写在前面

在这里插入图片描述
本来想学习gemm的相关代码,看一看tensorflow是怎么高效的进行int8推理,然后在这里看到most of gemmlowp is superseded by ruy,于是研究了一会儿ruy。

1、工程编译

将源码clone下来进行编译

git clone https://github.com/google/ruy.git
cd ruy/

mkdir build
cd build/
cmake ..

期间遇到报错

CMake Error at CMakeLists.txt:76 (message):
  This file does not exist:

  xxx/ruy/third_party/cpuinfo/CMakeLists.txt

  That typically means that the git submodules of the ruy repository haven't
  been checked out. 

于是把git的submodule也下载下来

git submodule update --init
cmake ..
make -j8

编译成功

2、跑个demo

跑个例子看一下

./example/ruy_example_example 
Example Mul, float:
LHS:
1 2 
3 4 
RHS:
1 3 
2 4 
Result:
5 11 
11 25 

Example Mul, float with bias addition and clamp:
LHS:
1 2 
3 4 
RHS:
1 3 
2 4 
Result:
6 12 
11 15 

Example Mul, uint8 quantized with asymmetric zero points:
LHS:
124 125 
126 127 
RHS:
129 131 
130 132 
Result:
131 130 
126 129 

Example Mul, int8 quantized with per-channel multipliers
LHS:
1 2 
3 4 
RHS:
1 3 
2 4 
Result:
4 8 
2 4 

Example Mul, returning raw int32 accumulators:
LHS:
1 2 
3 4 
RHS:
1 3 
2 4 
Result:
5 11 
11 25 

Example Mul, int8 times int16 quantized with per-channel multipliers
LHS:
1 2 
3 4 
RHS:
1000 3000 
2000 4000 
Result:
3750 8250 
1719 3906 

3、Example分析

3.1 ExampleMulFloat

测试代码

const float lhs_data[] = {1, 2, 3, 4};
const float rhs_data[] = {1, 2, 3, 4};
float dst_data[4];
  
ruy::Matrix<float> lhs;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kRowMajor, lhs.mutable_layout());
lhs.set_data(lhs_data);
ruy::Matrix<float> rhs;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, rhs.mutable_layout());
rhs.set_data(rhs_data);
ruy::Matrix<float> dst;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, dst.mutable_layout());
dst.set_data(dst_data);
  
ruy::MulParams<float, float> mul_params;
ruy::Mul(lhs, rhs, mul_params, context, &dst);
  
std::cout << "Example Mul, float:\n";
std::cout << "LHS:\n" << lhs;
std::cout << "RHS:\n" << rhs;
std::cout << "Result:\n" << dst << "\n";

运行结果

Example Mul, float:
LHS:
1 2 
3 4 
RHS:
1 3 
2 4 
Result:
5 11 
11 25 

分析

[ 1 2 3 4 ] [ 1 3 2 4 ] = [ 5 11 11 25 ] \left[ \begin{matrix} 1 & 2 \\ 3 & 4 \end{matrix} \right] \left[ \begin{matrix} 1 & 3 \\ 2 & 4 \end{matrix} \right] = \left[ \begin{matrix} 5 & 11 \\ 11 & 25 \end{matrix} \right] [1324][1234]=[5111125]

这个是浮点的计算,在我看来只要结果一直就可以,没什么我需要特别关注的。

3.2 ExampleMulFloatWithBiasAddAndClamp

测试代码

const float lhs_data[] = {1, 2, 3, 4};
const float rhs_data[] = {1, 2, 3, 4};
const float bias_data[] = {1, 0};
float dst_data[4];
  
ruy::Matrix<float> lhs;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kRowMajor, lhs.mutable_layout());
lhs.set_data(lhs_data);
ruy::Matrix<float> rhs;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, rhs.mutable_layout());
rhs.set_data(rhs_data);
ruy::Matrix<float> dst;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, dst.mutable_layout());
dst.set_data(dst_data);
  
ruy::MulParams<float, float> mul_params;
mul_params.set_bias(bias_data);
mul_params.set_clamp_min(0);
mul_params.set_clamp_max(15);
ruy::Mul(lhs, rhs, mul_params, context, &dst);
  
std::cout << "Example Mul, float with bias addition and clamp:\n";
std::cout << "LHS:\n" << lhs;
std::cout << "RHS:\n" << rhs;
std::cout << "Result:\n" << dst << "\n";

运行结果

Example Mul, float with bias addition and clamp:
LHS:
1 2 
3 4 
RHS:
1 3 
2 4 
Result:
6 12 
11 15 

分析

[ 1 2 3 4 ] [ 1 3 2 4 ] + [ 1 0 ] = [ 6 12 11 25 ] \left[ \begin{matrix} 1 & 2 \\ 3 & 4 \end{matrix} \right] \left[ \begin{matrix} 1 & 3 \\ 2 & 4 \end{matrix} \right] + \left[ \begin{matrix} 1 \\ 0 \end{matrix} \right] = \left[ \begin{matrix} 6 & 12 \\ 11 & 25 \end{matrix} \right] [1324][1234]+[10]=[6111225]

结果不一致是因为set_clamp_max(15)对25做了截断

3.3 ExampleMulUint8AsymmetricQuantized

测试代码

const std::uint8_t lhs_data[] = {124, 125, 126, 127};
const std::uint8_t rhs_data[] = {129, 130, 131, 132};
std::uint8_t dst_data[4];
  
ruy::Matrix<std::uint8_t> lhs;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kRowMajor, lhs.mutable_layout());
lhs.set_data(lhs_data);
lhs.set_zero_point(125);
ruy::Matrix<std::uint8_t> rhs;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, rhs.mutable_layout());
rhs.set_data(rhs_data);
rhs.set_zero_point(132);
ruy::Matrix<std::uint8_t> dst;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, dst.mutable_layout());
dst.set_data(dst_data);
dst.set_zero_point(129);
  
ruy::MulParams<std::int32_t, std::uint8_t> mul_params;
mul_params.set_multiplier_fixedpoint(1 << 30);
  
mul_params.set_multiplier_exponent(0);
ruy::Mul(lhs, rhs, mul_params, context, &dst);
  
std::cout << "Example Mul, uint8 quantized with asymmetric zero points:\n";
std::cout << "LHS:\n" << lhs;
std::cout << "RHS:\n" << rhs;
std::cout << "Result:\n" << dst << "\n";

运行结果

Example Mul, uint8 quantized with asymmetric zero points:
LHS:
124 125 
126 127 
RHS:
129 131 
130 132 
Result:
131 130 
126 129 

分析

重头戏来了~

我先从我的角度来看,这段代码到底计算了什么,input减去zero_point后,其实是进行了如下计算:
[ − 1 0 1 2 ] [ − 3 − 1 − 2 0 ] = [ 3 1 − 7 − 1 ] \left[ \begin{matrix} -1 & 0 \\ 1 & 2 \end{matrix} \right] \left[ \begin{matrix} -3 & -1 \\ -2 & 0 \end{matrix} \right] = \left[ \begin{matrix} 3 & 1 \\ -7 & -1 \end{matrix} \right] [1102][3210]=[3711]
但是结果加上zero_point,却是
[ 132 130 122 128 ] ! = [ 131 130 126 129 ] \left[ \begin{matrix} 132 & 130 \\ 122 & 128 \end{matrix} \right] != \left[ \begin{matrix} 131 & 130 \\ 126 & 129 \end{matrix} \right] [132122130128]!=[131126130129]
而且看不出来什么规律,所以关键点来了,要看一看这个int8推理到底做了什么。

先说结论:
在这次的计算中,shift=0quantized_multiplier = 1 << 30zero_point = 129所以
r e s u l t = ⌊ ( x ∗ 2 30 + 2 30 ) 2 31 ⌋ + 129 result = \lfloor{(x * 2^{30} + 2^{30}) \over 2^{31}}\rfloor + 129\\\hspace{50cm} result=231(x230+230)+129

result(3)  = 2 + 129 = 131
result(1)  =  1 + 129 = 130
result(-7)  = -3 + 129 = 126
result(-1)  =  0 + 129 = 129

int8计算的放缩很重要,合理的放缩可以尽可能地保留精度。
我上述公式的计算来源于ruy的这一段代码

// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
// Warning: this code is not meant to be bit-exact-normative.
// Please refer to the class comment of ruy::MulParams, in mul_params.h.
// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
// Preconditions:
// - quantized_multiplier >= 0
// - shift is -31 to +7 (negative for right shift)
std::int32_t MultiplyByQuantizedMultiplier(std::int32_t x,
std::int32_t quantized_multiplier,
int shift) {
RUY_CHECK_GE(shift, -31);
  
int total_shift = 31 - shift;
  
std::int64_t x_64(x);
std::int64_t quantized_multiplier_64(quantized_multiplier);
std::int64_t round = (int64_t)1 << (total_shift - 1);
int64_t result = x_64 * quantized_multiplier_64 + round;
result = result >> total_shift;
  
RUY_DCHECK_GE(result, std::numeric_limits<std::int32_t>::lowest());
RUY_DCHECK_LE(result, std::numeric_limits<std::int32_t>::max());
  
return static_cast<std::int32_t>(result);
}

其中,关键的两句

result = x_64 * quantized_multiplier_64 + round;
result = result >> total_shift

写成公式便于理解:
r e s u l t = ( x ∗ q u a n t i z e d _ m u l t i p l i e r + r o u n d ) > > t o t a l _ s h i f t result = (x * quantized\_multiplier + round) >> total\_shift\\\hspace{50cm} result=(xquantized_multiplier+round)>>total_shift
= ( x ∗ q u a n t i z e d _ m u l t i p l i e r + 2 t o t a l _ s h i f t − 1 ) > > t o t a l _ s h i f t \\\hspace{1.3cm} = (x * quantized\_multiplier + 2^{total\_shift - 1}) >> total\_shift\\\hspace{50cm} =(xquantized_multiplier+2total_shift1)>>total_shift
= ( x ∗ q u a n t i z e d _ m u l t i p l i e r + 2 30 − s h i f t ) > > ( 31 − s h i f t ) \\\hspace{1.3cm} = (x * quantized\_multiplier + 2^{30-shift}) >> (31-shift)\\\hspace{50cm} =(xquantized_multiplier+230shift)>>(31shift)
= ⌊ ( x ∗ q u a n t i z e d _ m u l t i p l i e r + 2 30 − s h i f t ) 2 31 − s h i f t ⌋ \\\hspace{1.3cm} = \lfloor{(x * quantized\_multiplier + 2^{30-shift}) \over 2^{31-shift}}\rfloor\\\hspace{50cm} =231shift(xquantized_multiplier+230shift)
这也就解释了这个example的计算结果。

以上是对结果修正的解释,除此之外,计算过程也同样重要。


我本来理解的是,input减去zero_point后,进行如下计算:
[ − 1 0 1 2 ] [ − 3 − 1 − 2 0 ] = [ 3 1 − 7 − 1 ] \left[ \begin{matrix} -1 & 0 \\ 1 & 2 \end{matrix} \right] \left[ \begin{matrix} -3 & -1 \\ -2 & 0 \end{matrix} \right] = \left[ \begin{matrix} 3 & 1 \\ -7 & -1 \end{matrix} \right] [1102][3210]=[3711]
但事实上是,input并没有减去zero_point,而是做了展开操作:
Y = f u n ( X W + b i a s ) Y=fun(XW+bias) \\\hspace{50cm} Y=fun(XW+bias)
其中,X为输入,W为权值,bias为偏置,fun()为缩放函数
Y = f u n [ ( X − Z x ) ( W − Z w ) + b i a s ] + Z y Y=fun[(X-Z_x)(W-Z_w)+bias]+Z_y \\\hspace{50cm} Y=fun[(XZx)(WZw)+bias]+Zy
= f u n [ X W − X Z w − Z x W + Z x Z w + b i a s ] + Z y \\\hspace{0.45cm} =fun[XW-XZ_w-Z_xW+Z_xZ_w+bias]+Z_y \\\hspace{50cm} =fun[XWXZwZxW+ZxZw+bias]+Zy


在本例中,也就是

( [ 124 125 126 127 ] − [ 125 125 125 125 ] ) ( [ 129 131 130 132 ] − [ 132 132 132 132 ] ) (\left[ \begin{matrix} 124 & 125 \\ 126 & 127 \end{matrix} \right] - \left[ \begin{matrix} 125 & 125\\ 125 & 125 \end{matrix} \right] ) (\left[ \begin{matrix} 129 & 131 \\ 130 & 132 \end{matrix} \right]-\left[ \begin{matrix} 132 & 132\\ 132 & 132\end{matrix} \right])\\\hspace{50cm} ([124126125127][125125125125])([129130131132][132132132132])

= [ 124 125 126 127 ] [ 129 131 130 132 ] − [ 125 125 125 125 ] [ 129 131 130 132 ] − [ 124 125 126 127 ] [ 132 132 132 132 ] + [ 125 125 125 125 ] [ 132 132 132 132 ] = \left[ \begin{matrix} 124 & 125 \\ 126 & 127 \end{matrix} \right] \left[ \begin{matrix} 129 & 131 \\ 130 & 132 \end{matrix} \right] - \left[ \begin{matrix} 125 & 125 \\ 125 & 125 \end{matrix} \right] \left[ \begin{matrix} 129 & 131 \\ 130 & 132 \end{matrix} \right] - \left[ \begin{matrix} 124 & 125 \\ 126 & 127 \end{matrix} \right]\left[ \begin{matrix} 132 & 132 \\ 132 & 132 \end{matrix} \right] + \left[ \begin{matrix} 125 & 125\\ 125 & 125 \end{matrix} \right]\left[ \begin{matrix} 132 & 132\\ 132 & 132 \end{matrix} \right] \\\hspace{50cm} =[124126125127][129130131132][125125125125][129130131132][124126125127][132132132132]+[125125125125][132132132132]
这是公式展开似乎显得更复杂,但在代码中,zero-point只是125和132,不需要展开成这样的矩阵的。另外还有一些小的判断可以减少计算量,代码逻辑如下(建议看源码):

// 以计算第一个数为例
accum = 124 * 129 + 125 * 130 = 32246

// 本例中没有bias,所以可跳过
if(bias)
accum += bias

// 这是比较好的一点,如果zero_point是0可以直接免掉该计算
// 另外先求和再乘法可以加速计算
if(lhs.zero_point)
accum -= lhs.zero_point * rhs.sums[j]
       = 32246 - 125 * (129 + 130)
       = -12
// 同上
if(rhs.zero_point)
accum -= rhs.zero_point * lhs.sums[i]
       = -129 - 132 * (124 + 125)
       = -32997
// 如果任一zero_point是0可以直接免掉该计算,另外depth使用可以减少乘加
if(lhs.zero_point && rhs.zero_point)
accum += lhs.zero_point * rhs.zero_point * depth
            = -32997 + 125 * 132 * 2
            = 3

r e s u l t ( 3 ) = ⌊ ( 3 ∗ 2 30 + 2 30 ) 2 31 ⌋ = 2 result(3) = \lfloor{(3 * 2^{30} + 2^{30}) \over 2^{31}}\rfloor =2\\\hspace{50cm} result(3)=231(3230+230)=2

accun += dst->zero_point
       = 2 + 129
       = 131

3.4 ExampleMulInt8PerChannelQuantized

测试代码

const std::int8_t lhs_data[] = {1, 2, 3, 4};
const std::int8_t rhs_data[] = {1, 2, 3, 4};
const std::int32_t multiplier_data[] = {3 << 28, 5 << 28};
const int exponent_data[] = {1, -2};
std::int8_t dst_data[4];
  
ruy::Matrix<std::int8_t> lhs;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kRowMajor, lhs.mutable_layout());
lhs.set_data(lhs_data);
ruy::Matrix<std::int8_t> rhs;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, rhs.mutable_layout());
rhs.set_data(rhs_data);
ruy::Matrix<std::int8_t> dst;
ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, dst.mutable_layout());
dst.set_data(dst_data);
  
ruy::MulParams<std::int32_t, std::int8_t> mul_params;
mul_params.set_multiplier_fixedpoint_perchannel(multiplier_data);
mul_params.set_multiplier_exponent_perchannel(exponent_data);
ruy::Mul(lhs, rhs, mul_params, context, &dst);
  
std::cout << "Example Mul, int8 quantized with per-channel multipliers\n";
std::cout << "LHS:\n" << lhs;
std::cout << "RHS:\n" << rhs;
std::cout << "Result:\n" << dst << "\n";

运行结果

Example Mul, int8 quantized with per-channel multipliers
LHS:
1 2 
3 4 
RHS:
1 3 
2 4 
Result:
4 8 
2 4 

分析

[ 1 2 3 4 ] [ 1 3 2 4 ] = [ 5 11 11 25 ] \left[ \begin{matrix} 1 & 2 \\ 3 & 4 \end{matrix} \right] \left[ \begin{matrix} 1 & 3 \\ 2 & 4 \end{matrix} \right] = \left[ \begin{matrix} 5 & 11 \\ 11 & 25 \end{matrix} \right] [1324][1234]=[5111125]

有上一小节的理解,这里的分析就容易多了。

shift=1quantized_multiplier = 3 << 28为例
r e s u l t = ⌊ ( x ∗ 3 ∗ 2 28 + 2 29 ) 2 30 ⌋ result = \lfloor{(x * 3 * 2^{28} + 2^{29}) \over 2^{30}}\rfloor\\\hspace{50cm} result=230(x3228+229)
result(5) = 4
result(11) = 8

shift=-2quantized_multiplier = 5 << 28为例
r e s u l t = ⌊ ( x ∗ 5 ∗ 2 28 + 2 32 ) 2 33 ⌋ result = \lfloor{(x * 5 * 2^{28} + 2^{32}) \over 2^{33}}\rfloor\\\hspace{50cm} result=233(x5228+232)
result(11) = 2
result(25) = 4

3.5 raw int32 accumulators

暂不关注

3.6 int8 times int16 quantized with per-channel multipliers

暂不关注

4、对于ZeroPoint的pack操作

ruy/pack_common.h

PackedScalar Pack(Scalar x) {
	return x - SymmetricZeroPoint<Scalar>() + SymmetricZeroPoint<PackedScalar>();
}

其中,

Scalar SymmetricZeroPoint() {
	if (std::is_floating_point<Scalar>::value) {
	return 0;
	}
	if (std::is_signed<Scalar>::value) {
	return 0;
	}
	return std::numeric_limits<Scalar>::max() / 2 + 1;
}

Scalar是unsigned char,std::numeric_limits::max() = 255,所以return 128,
PackedScalar是signed,return 0,

假设x = 125,则返回结果是125 - 128 + 0 = -3

这一步的调用在ruy/create_trmul_params.hPopulateTrMulParamsAllCompiledPaths函数内

5、源码分析

这部分记录比较乱,建议自行阅读源码

分析一下Mul函数
ruy::Mul(lhs, rhs, mul_params, context, &dst);

该函数位于./ruy/ruy.h,

void Mul(const Matrix<LhsScalar>& lhs, const Matrix<RhsScalar>& rhs,
const MulParams<AccumScalar, DstScalar>& mul_params, Context* context,
Matrix<DstScalar>* dst) {
RUY_TRACE_SCOPE;
RUY_TRACE_INFO(MUL);
Mat<LhsScalar> internal_lhs = ToInternal(lhs);
Mat<RhsScalar> internal_rhs = ToInternal(rhs);
Mat<DstScalar> internal_dst = ToInternal(*dst);
MulFrontEnd<CompiledPaths>(internal_lhs, internal_rhs, mul_params,
get_ctx(context), &internal_dst);
}
  • Mul(lhs, rhs, mul_params, context, &dst);
    • Mat internal_lhs = ToInternal(lhs); // Matrix–>Mat
    • Mat internal_rhs = ToInternal(rhs);
    • Mat internal_dst = ToInternal(*dst);
    • MulFrontEnd(internal_lhs, internal_rhs, mul_params,get_ctx(context), &internal_dst);
      • ctx->clear_performance_advisories();
        • 对 performance_advisory_ 的配置
      • TrMulParams params;
      • MulFrontEndUpToCreateTrMulParams(lhs, rhs, *dst, mul_params,ctx, &params);
        • The first half of front-end work, up to the point where we have TrMulParams. In other words, this is the part of the front-end work that needs to be templatized like the entry point, and that performs the initial work that requires this templatization, and the de-templatization. The output of this function is the TrMulParams, which contain enough information to allow the un-templatized code to take over from there.
        • 前端工作的前半部分,直到我们有TrMulParams为止。换句话说,这是前端工作的一部分,需要像入口点一样进行模板化,并执行需要模板化和去模板化的初始工作。该函数的输出是TrMulParams,其中包含足够的信息,允许未模板化的代码从那里接管。
        • Validate(lhs, rhs, dst); //Perform validation of parameters early so that failures are easier to map to user errors. In particular, perform this validation before the transposition.
        • CreateTrMulParams(Transpose(lhs), rhs, dst, mul_params, ctx,params); //The Transpose(lhs) here is where we switch from ‘Mul’ to ‘TrMul’.
      • MulFrontEndFromTrMulParams(ctx, &params);
        • PreparePackedMatrices(ctx, params);
        • TrMul(ctx, params);
          • We’re done with the front-end work, now enter the middle-end.
          • TrMul是ruy的中端。它包含执行ruy::Mul工作的高级逻辑,直至对后端Kernel和Pack函数的调用。这包括决定使用多少线程,计算BlockMap,在线程池上执行任务。TrMul函数本身运行在主线程上,可能运行在工作线程上的代码在TrMulTask::Run()中。
        • ctx->GetMainAllocator()->FreeAll();
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值