[TVM PR Development Notes] quantized::leakyrelu

Latest recommended article published on 2022-12-13 15:51:10

yuanfz1998


Category: tvm. Tags: ai, editor, compiler

Article link: https://blog.csdn.net/Eurypterid/article/details/125859678


quantized::leakyrelu

Quantization basics

https://zhuanlan.zhihu.com/p/149659607

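Quantization is nothing new; we already use it when preprocessing images. Recall that we usually normalize a uint8 image with values in 0~255 into a float32 tensor with values in 0.0~1.0; this is dequantization. Similarly, we often convert a network output tensor with values in 0.0~1.0 into uint8 image data with values in 0~255; this is quantization. So quantization is essentially a rescaling of the value range, and can be "roughly" understood as a linear mapping. (I say "roughly" because some papers use non-linear quantization, but what is deployed in industry today is still linear quantization, so this article only discusses linear schemes.)

Clearly, though, dequantization generally loses no information, whereas quantization generally loses precision. This is easy to see: float32 can represent far more distinct values than uint8, so many values cannot be represented exactly in uint8 and have to be rounded to the nearest uint8 value. The error between a quantized model and the full-precision model likewise comes from the rounding and clip operations.

This article uses a few formulas. Here r denotes a floating-point real number and q the fixed-point integer after quantization. The conversion between floating point and integer is: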

$r = S (q - Z)$

$q=round(\frac{r}{S}+Z)$

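Here S is the scale, the ratio between real values and integers, and Z is the zero point, the integer that the real value 0 maps to after quantization. They are computed as: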

$S=\frac{r_{max}-r_{min}}{q_{max}-q_{min}}$

$Z=round(q_{max}-\frac{r_{max}}{S})$
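The four formulas above can be checked with a small sketch. The helper names (`choose_qparams`, `quantize`, `dequantize`) and the real range [-1.0, 1.55] are illustrative assumptions, not TVM or PyTorch APIs. The clamp to [q_min, q_max] is an addition, since the formula for q alone does not keep results inside the uint8 range.

```python
def choose_qparams(r_min, r_max, q_min=0, q_max=255):
    """Compute scale S and zero point Z from the real and integer ranges."""
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = round(q_max - r_max / scale)
    return scale, zero_point


def quantize(r, scale, zero_point, q_min=0, q_max=255):
    """q = round(r / S + Z), clamped to the representable integer range."""
    q = round(r / scale + zero_point)
    return max(q_min, min(q_max, q))


def dequantize(q, scale, zero_point):
    """r = S * (q - Z)."""
    return scale * (q - zero_point)


# Illustrative range: S comes out close to 0.01 and Z to 100
scale, zero_point = choose_qparams(-1.0, 1.55)
q = quantize(0.3, scale, zero_point)
r = dequantize(q, scale, zero_point)
print(scale, zero_point, q, r)
```

The round trip 0.3 → q → r recovers the input only up to half a quantization step, which is the precision loss described earlier.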

https://github.com/apache/tvm/issues/11451

Start with https://github.com/apache/tvm/pull/7606/files, which is a very good reference.

Implementing hard_sigmoid

Non-quantized version

$\frac{relu6(x+3)}{6}$

$relu6(x) = min(max(0, x), 6)$

Code

TVM: python/tvm/relay/frontend/pytorch.py

def hard_sigmoid(self, inputs, input_types):
        def _relu6(x):
            return _op.tensor.clip(x, 0.0, 6.0)

        def func(x):
            return _relu6(x + _expr.const(3.0)) / _expr.const(6.0)

        if self.is_quantized_tensor(inputs[0]):
            input_scale = _expr.const(inputs[1])
            input_zero_point = _expr.const(inputs[2])
            # PyTorch seems to use the following output qparams, but accuracy
            # is broken if we use this.
            # TODO(masahi): Revisit this parameter choice
            #
            # Taken from src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp
            # output_scale = _expr.const(0.00390625)  # 1.0 / 2^8
            # output_zero_point = _expr.const(-128)
            output_scale = input_scale
            output_zero_point = input_zero_point

            data = qnn.op.dequantize(inputs[0], input_scale, input_zero_point, axis=1)
            out = func(data)
            return qnn.op.quantize(out, output_scale, output_zero_point, out_dtype="uint8")

        return func(inputs[0])

def clip(a, a_min, a_max):
    """Clip the elements in `a` between `a_min` and `a_max`.
    `a_min` and `a_max` are cast to `a`'s dtype.

    Parameters
    ----------
    a : relay.Expr
        The input tensor.
    a_min : float
        The clip minimum.
    a_max : float
        The clip maximum.

    Returns
    -------
    result : relay.Expr
        `a` with elements clipped between `a_min` and `a_max`.

    Examples
    --------
    .. code:: python

      x = relay.Constant(tvm.nd.array([0, 1, 5, 3, 4, 2]))
      relay.clip(x, 1., 4.)
      # [1, 1, 4, 3, 4, 2]
    """
    return _make.clip(a, a_min, a_max)

Non-quantized case:

clip(a, a_min, a_max) returns min(max(a_min, a), a_max); that is, a itself when a_min <= a <= a_max, otherwise the nearer bound.

From the relu6 formula, relu6 = clip(x, 0, 6)

hardsigmoid = relu6(x+3)/6 = clip(x+3, 0, 6)/6
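As a quick check of the identity above, here is a scalar sketch in plain Python. The functions are stand-ins written for illustration; they are not the Relay ops used in the TVM frontend.

```python
def clip(a, a_min, a_max):
    """Clamp a into [a_min, a_max], i.e. min(max(a_min, a), a_max)."""
    return min(max(a_min, a), a_max)


def relu6(x):
    return clip(x, 0.0, 6.0)


def hard_sigmoid(x):
    return relu6(x + 3.0) / 6.0


for x in (-4.0, -3.0, 0.0, 1.5, 3.0, 5.0):
    print(x, hard_sigmoid(x))
```

Inputs at or below -3 saturate to 0, inputs at or above 3 saturate to 1, and the segment in between is linear.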

Quantized case:

First dequantize the input
Run the non-quantized hardsigmoid to get out
Quantize out, reusing the zp and scale of the input
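The three steps can be sketched on scalars as follows. This is only an illustration under assumed values (scale 0.0625, zero point 128, uint8 range); the real frontend builds Relay expressions with qnn.op.dequantize and qnn.op.quantize instead.

```python
def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)


def quantize(r, scale, zero_point, q_min=0, q_max=255):
    return max(q_min, min(q_max, round(r / scale + zero_point)))


def hard_sigmoid(x):
    return min(max(x + 3.0, 0.0), 6.0) / 6.0


def quantized_hard_sigmoid(q_in, scale, zero_point):
    x = dequantize(q_in, scale, zero_point)  # step 1: back to float
    out = hard_sigmoid(x)                    # step 2: float computation
    return quantize(out, scale, zero_point)  # step 3: reuse the input qparams


scale, zero_point = 0.0625, 128  # assumed values for illustration
q_out = [quantized_hard_sigmoid(q, scale, zero_point) for q in (0, 128, 255)]
print(q_out)
```

Because the output reuses the input scale, the whole [0, 1] output range only occupies a handful of integer levels here, which hints at why the choice of output qparams affects accuracy.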

Per-channel (axis) quantization and dequantization

https://www.tensorflow.org/lite/performance/quantization_spec

Ordinary tensor quantization (per tensor): the whole tensor has a single scale and zero_point.
Per-axis quantization means there is one scale and zp for each index along the quantized dimension. The quantized dimension specifies the dimension of the Tensor’s shape that the scales and zero-points correspond to. For example, a tensor t, with dims=[4, 3, 2, 1] with quantization params: scale=[1.0, 2.0, 3.0], zero_point=[1, 2, 3], quantization_dimension=1 will be quantized across the second dimension of t:

t[:, 0, :, :] will have scale[0]=1.0, zero_point[0]=1
t[:, 1, :, :] will have scale[1]=2.0, zero_point[1]=2
t[:, 2, :, :] will have scale[2]=3.0, zero_point[2]=3
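A small sketch of per-axis dequantization along dimension 1, using the same scale and zero_point values as the example. The helper `dequantize_per_axis` and the smaller [2, 3, 2] tensor are hypothetical, chosen to keep the example short.

```python
def dequantize_per_axis(q, scales, zero_points):
    """Dequantize nested lists of shape [N][C][...], one (scale, zp) per index on axis 1."""

    def apply(block, scale, zp):
        # Recurse through the nested lists below the channel dimension
        if isinstance(block, list):
            return [apply(b, scale, zp) for b in block]
        return scale * (block - zp)

    return [
        [apply(channel, scales[c], zero_points[c]) for c, channel in enumerate(sample)]
        for sample in q
    ]


scales = [1.0, 2.0, 3.0]
zero_points = [1, 2, 3]
# Shape [2, 3, 2]: two samples, three channels, two values per channel
q = [
    [[1, 2], [2, 4], [3, 6]],
    [[0, 5], [3, 2], [4, 3]],
]
r = dequantize_per_axis(q, scales, zero_points)
print(r)
```

Every value in channel c uses scales[c] and zero_points[c], regardless of its position in the other dimensions.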

leakyrelu

Non-quantized version

Parameters:

torch.nn.LeakyReLU(negative_slope: float=0.01, inplace: bool=False)

$LeakyReLU(x) = max(0, x) + negative\_slope * min(0, x)$

TVM implementation

def leaky_relu(self, inputs, input_types):
        data = inputs[0]
        alpha = float(inputs[1])
        return _op.nn.leaky_relu(data, alpha)

def leaky_relu(data, alpha=0.01):
    """This operator takes data as input and does Leaky version
    of a Rectified Linear Unit.

    .. math::

        `y = x > 0 ? x : alpha * x`

    Parameters
    ----------
    data : tvm.relay.Expr
        The input data to the operator.

    alpha : float
        Slope coefficient for the negative half axis.

    Returns
    -------
    result : tvm.relay.Expr
        The computed result.
    """
    return _make.leaky_relu(data, alpha)

y = x > 0 ? x : alpha * x
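The PyTorch formula (max/min form) and the TVM docstring (select form) describe the same function. A scalar sketch, with both function names made up for illustration:

```python
def leaky_relu_max_min(x, negative_slope=0.01):
    """PyTorch-style definition: max(0, x) + negative_slope * min(0, x)."""
    return max(0.0, x) + negative_slope * min(0.0, x)


def leaky_relu_select(x, alpha=0.01):
    """TVM-style definition: x > 0 ? x : alpha * x."""
    return x if x > 0 else alpha * x


samples = [-2.0, -0.5, 0.0, 0.5, 2.0]
for x in samples:
    print(x, leaky_relu_max_min(x, 0.25), leaky_relu_select(x, 0.25))
```

This is why the frontend can pass negative_slope straight through as alpha.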

Test script:

@tvm.testing.uses_gpu
def test_forward_leakyrelu():
    torch.set_grad_enabled(False)
    input_shape = [1, 3, 10, 10]
    input_data = torch.rand(input_shape).float()
    verify_model(torch.nn.LeakyReLU().eval(), input_data=input_data)
    verify_model(torch.nn.LeakyReLU(negative_slope=0.05).eval(), input_data=input_data)
    verify_model(torch.nn.LeakyReLU(negative_slope=1.0, inplace=True).eval(), input_data=input_data)
    verify_model(
        torch.nn.LeakyReLU(negative_slope=1.25, inplace=True).eval(), input_data=input_data
    )

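Quantized leakyrelu

First dequantize the input
Run the non-quantized leakyrelu to get out
Quantize out with the op's output scale and zp, which PyTorch passes as arguments to quantized::leaky_relu

Implementation

1. Register the op's information in qnn_torch.py

Register the argument indices of the scale and zp.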

python/tvm/relay/frontend/qnn_torch.py

def _get_quant_param_for_input(input_value):
    """
    We want to know the input scale and zp of this input_value, since
    input quant params are not explicitly passed around in torch (they
    are embedded in a QTensor data structure, not visible statically).
    We know that it is quantized using output scale and zp
    of some previous quantized op. The purpose of this function
    is to find that pair of parameters.
    """
    # Indices for output scale and zp
    # For example, in quantized::conv2d(%input, %1, %2, %3, %4, %5, %6, %7),
    # 6th and 7th arg are output scale and zp respectively.

    # PyTorch 1.6 changed qconv API
    if is_version_greater_than("1.5.1"):
        qconv_indices = (2, 3)
    else:
        qconv_indices = (6, 7)

    output_quant_param_indices = {
        "aten::quantize_per_tensor": (1, 2),
        "quantized::conv2d": qconv_indices,
        "quantized::conv2d_relu": qconv_indices,
        "quantized::linear": (2, 3),
        "quantized::linear_relu": (2, 3),
        "quantized::add_relu": (2, 3),
        "quantized::add": (2, 3),
        "quantized::mul_relu": (2, 3),
        "quantized::mul": (2, 3),
        "quantized::cat": (2, 3),
        "quantized::mul_scalar": (2, 3),
        "quantized::add_scalar": (2, 3),
        "quantized::hardswish": (1, 2),
        "quantized::conv_transpose2d": qconv_indices,
    }

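_get_quant_param_for_input is where the output_scale and zp indices are registered.

Source of the information: the PyTorch source code. Searching for quantized::leaky_relu finds the op's registration in library.cpp: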

m.def(TORCH_SELECTIVE_SCHEMA("quantized::leaky_relu(Tensor qx, Scalar negative_slope, bool inplace, float output_scale, int output_zero_point) -> Tensor"));

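So the scale and zp are at argument indices 3 and 4 (zero-based), and the entry to add to output_quant_param_indices is "quantized::leaky_relu": (3, 4).

Check this against other examples: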

quantized::linear →(2,3)

m.def(TORCH_SELECTIVE_SCHEMA("quantized::linear(Tensor X, __torch__.torch.classes.quantized.LinearPackedParamsBase W_prepack, float Y_scale_i, int Y_zero_point_i) -> Tensor Y"));

quantized::hardswish → (1,2)

m.def(TORCH_SELECTIVE_SCHEMA("quantized::hardswish(Tensor input, float output_scale, int output_zero_point) -> Tensor"));
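The index lookup can also be done mechanically from the schema string. The sketch below uses a hypothetical helper, `schema_arg_names`, and only works for schemas that name their arguments output_scale and output_zero_point (quantized::linear uses Y_scale_i and Y_zero_point_i instead, so it would need a different lookup).

```python
def schema_arg_names(schema):
    """Return the argument names of a TorchScript-style schema string."""
    args = schema[schema.index("(") + 1 : schema.index(")")]
    # Each argument is "<type> <name>"; keep the last whitespace-separated token
    return [arg.split()[-1] for arg in args.split(",")]


def output_qparam_indices(schema):
    """Zero-based positions of output_scale and output_zero_point."""
    names = schema_arg_names(schema)
    return names.index("output_scale"), names.index("output_zero_point")


leaky_relu_schema = (
    "quantized::leaky_relu(Tensor qx, Scalar negative_slope, bool inplace, "
    "float output_scale, int output_zero_point) -> Tensor"
)
hardswish_schema = (
    "quantized::hardswish(Tensor input, float output_scale, "
    "int output_zero_point) -> Tensor"
)

print(output_qparam_indices(leaky_relu_schema))
print(output_qparam_indices(hardswish_schema))
```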

Register the number of quantized input tensors: how many quantized tensors does each op take as inputs?

def add_input_quant_params_to_op_inputs(graph):
    """
    In Torch, input quant params are not explicitly passed around
    Instead, they are stored in QTensor data structure, and retrieved
    at runtime by each quantized ops.
    However, they need to be known statically for QNN translation.
    To workaround and simplify the translation of inputs, we manually add
    input quant params to inputs of Torch quantized operators listed below.
    See _quantized_conv2d() below for example of why this is helpful.

    For example,
      %input : QUInt8(1, 512, 7, 7) = quantized::add(%x.8, %x.9, %434, %435)
    becomes
      %395 : float = prim::Constant[value=0.036212071776390076]()
      %396 : int = prim::Constant[value=0]()
      %430 : float = prim::Constant[value=0.16080744564533234]()
      %431 : int = prim::Constant[value=42]()
      %input : QUInt8(1, 512, 7, 7) = quantized::add(%x.8, %x.9, %434, %435,
                                                     %430, %431, %395, %396)

    %434, %435 are output scale and zp of quantized::add op
    %430, %431, %395, %396 are two pairs of input (scale, zp) for two tensors
    added by this function
    """
    # How many quantized tensors each op takes as inputs?
    # A pair of (scale, zp) for each input quantized tensor will be added
    # to the input nodes
    num_quantized_inputs = {
        "quantized::conv2d": 1,
        "quantized::conv2d_relu": 1,
        "quantized::linear": 1,
        "quantized::linear_relu": 1,
        "quantized::add_relu": 2,
        "quantized::add": 2,
        "quantized::mul_relu": 2,
        "quantized::mul": 2,
        "aten::dequantize": 1,
        "aten::mean": 1,
        "aten::sigmoid": 1,
        "aten::upsample_nearest2d": 1,
        "aten::upsample_bilinear2d": 1,
        "aten::relu_": 1,
        "aten::relu": 1,
        "quantized::add_scalar": 1,
        "quantized::mul_scalar": 1,
        "quantized::relu6": 1,
        "quantized::hardswish": 1,
        "aten::hardsigmoid": 1,
        "quantized::conv_transpose2d": 1,
    }

quantized::leaky_relu takes one quantized tensor as input, so its entry in num_quantized_inputs is "quantized::leaky_relu": 1.
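To see what this number controls, here is a toy model of the rewrite described in the docstring: one (scale, zp) pair is appended to the op's arguments per quantized input. The helper `add_input_qparams` and the placeholder values are hypothetical; the real pass edits the TorchScript graph.

```python
def add_input_qparams(op_args, input_qparams):
    """Append one (scale, zp) pair per quantized input tensor to the op's args."""
    extended = list(op_args)
    for scale, zero_point in input_qparams:
        extended += [scale, zero_point]
    return extended


# quantized::leaky_relu(qx, negative_slope, inplace, output_scale, output_zero_point)
leaky_relu_args = ["qx", 0.01, False, 0.0039, 0]
inputs = add_input_qparams(leaky_relu_args, [(0.0039, 0)])

# quantized::add(a, b, output_scale, output_zero_point) has two quantized inputs
add_args = ["%x.8", "%x.9", "%434", "%435"]
add_inputs = add_input_qparams(add_args, [("%430", "%431"), ("%395", "%396")])

print(len(inputs), inputs)
print(len(add_inputs), add_inputs)
```

With one quantized input, the five schema arguments of quantized::leaky_relu grow to seven, with the input scale and zp at indices 5 and 6.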

Register the conversion function

convert_map = {
    "aten::quantize_per_tensor": _quantize_per_tensor(),
    "quantized::conv2d_relu": _quantized_conv2d(with_relu=True),
    "aten::dequantize": _dequantize(),
    "quantized::conv2d": _quantized_conv2d(),
    "quantized::add_relu": _binop(relay.qnn.op.add, with_relu=True),
    "quantized::add": _binop(relay.qnn.op.add),
    "quantized::mul_relu": _binop(relay.qnn.op.mul, with_relu=True),
    "quantized::mul": _binop(relay.qnn.op.mul),
    "quantized::linear": _linear(),
    "quantized::linear_relu": _linear(with_relu=True),
    "quantized::cat": _cat(),
    "quantized::add_scalar": _add_scalar(),
    "quantized::mul_scalar": _mul_scalar(),
    "quantized::relu6": _relu6(),
    "quantized::leaky_relu": _leaky_relu(),
    "quantized::linear_dynamic": _linear_dynamic(),
    "quantized::hardswish": _hswish(),
}

The dictionary maps each op name to the function that implements its conversion.

The actual implementation

PyTorch source:

class QLeakyRelu final {
 public:
  static Tensor run(Tensor self, const Scalar& negative_slope, bool inplace, double output_scale, int64_t output_zero_point) {
    // inplace argument is ignored now, TODO:support inplace
    if (inplace) {
      TORCH_WARN("inplace=True is not supported for quantized::leaky_relu yet");
    }
    const auto qx = self.contiguous(self.suggest_memory_format());
    auto qy = at::_empty_affine_quantized(qx.sizes(),
      at::device(kCPU).dtype(self.scalar_type()),
      output_scale,
      output_zero_point,
      self.suggest_memory_format());
    qrelu_leaky_stub(self.device().type(), qy, qx, negative_slope);
    return qy;
  }
};

TORCH_LIBRARY_IMPL(quantized, QuantizedCPU, m) {
  m.impl(TORCH_SELECTIVE_NAME("quantized::relu6"), TORCH_FN(QRelu6::run));
  m.impl(TORCH_SELECTIVE_NAME("quantized::leaky_relu"), TORCH_FN(QLeakyRelu::run));
}

CPU Only (2022/06/14)

qrelu_leaky_stub(self.device().type(), qy, qx, negative_slope); dispatches to:

aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp

static void leaky_qrelu_out_kernel(Tensor& out, const Tensor& qx,
                                   const Scalar& negval_) {
  int64_t i_zp = qx.q_zero_point();
  // NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)
  float i_scale = qx.q_scale();

  int64_t o_zp = out.q_zero_point();
  // NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)
  float o_scale = out.q_scale();
  float o_inv_scale = 1.0f / o_scale;

  float negval = negval_.to<float>();

  AT_DISPATCH_QINT_TYPES(out.scalar_type(), "leaky_qrelu", [&] {
    using Vec = Vectorized<float>;  // Naive implementation uses dequant/quant loop.
    using qVec = Vectorized<scalar_t>;
    Vec zero_vec = Vec(0.0f);
    Vec one_vec = Vec(1.0f);

    Vec i_scale_vec = Vec((float)i_scale);
    Vec i_zp_vec = Vec((float)i_zp);
    Vec i_scale_zp_neg_premul_vec = i_scale_vec * i_zp_vec.neg();

    Vec negval_vec = Vec(negval);

    auto iter = TensorIterator::unary_op(out, qx);

    cpu_kernel_vec(
        iter,
        [&](scalar_t value_qx) -> scalar_t {
          auto value_dx = at::native::dequantize_val(i_scale, i_zp, value_qx);
          auto value_dy = value_dx > 0 ? value_dx : value_dx * negval;
          return at::native::quantize_val<scalar_t>(o_scale, o_zp, value_dy);
        },
        [&](qVec qx_vec) -> qVec {
          /* Vectorized implementation creates a multiplicand vector, which has
           * "alpha" for all negative dx values and ones-vector for all
           * positive values of dx. The multiplicand then is multiplied by the
           * input.
           */
          auto dx_vec_vec = qx_vec.dequantize(i_scale_vec, i_zp_vec,
                                              i_scale_zp_neg_premul_vec);
          for (auto & dx_vec : dx_vec_vec) {
            const auto multiplicand = Vec::blendv(negval_vec, one_vec,
                                                  dx_vec > zero_vec);
            dx_vec *= multiplicand;
          }
          return qVec::quantize(dx_vec_vec, o_scale, o_zp, o_inv_scale);
        });
  });
}

The core logic is:

auto value_dx = at::native::dequantize_val(i_scale, i_zp, value_qx);
auto value_dy = value_dx > 0 ? value_dx : value_dx * negval;
return at::native::quantize_val<scalar_t>(o_scale, o_zp, value_dy);

The implementation follows the conventional dequant → calculate → quant pattern.
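The scalar path of the kernel can be mimicked in plain Python. This is a sketch under assumptions: a uint8 output range of 0..255, and Python's round (which rounds ties to even) standing in for the rounding inside quantize_val, so tie-breaking may not match PyTorch exactly. The qparams below are made up for illustration.

```python
def dequantize_val(scale, zero_point, q):
    return scale * (q - zero_point)


def quantize_val(scale, zero_point, value, q_min=0, q_max=255):
    q = round(value / scale) + zero_point
    return max(q_min, min(q_max, q))


def leaky_qrelu(q_in, i_scale, i_zp, o_scale, o_zp, negval):
    value_dx = dequantize_val(i_scale, i_zp, q_in)               # dequant
    value_dy = value_dx if value_dx > 0 else value_dx * negval   # calculate
    return quantize_val(o_scale, o_zp, value_dy)                 # quant


# Same qparams on input and output, negative slope 0.25
same = [leaky_qrelu(q, 0.5, 128, 0.5, 128, 0.25) for q in (100, 128, 200)]
# Different output qparams: negative results clamp to 0 when o_zp is 0
other = [leaky_qrelu(q, 0.5, 128, 1.0, 0, 0.25) for q in (100, 200)]
print(same, other)
```

The second call shows why the output zero point matters for leaky_relu: with o_zp = 0 there is no room below zero, so every negative output is clipped.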

TVM implementation:

def _leaky_relu():
    # refer to src/ATen/native/quantized/cpu/qrelu.cpp
    def _impl(inputs, _):
        for i, input in enumerate(inputs):
            print("_leaky_relu inputs", i, input)
        print("_leaky_relu len(inputs)=", len(inputs))
'''
_leaky_relu inputs 0 free_var %input: Tensor[(1, 3, 224, 224), float32];
qnn.quantize(%input, 0.00392155f, 0, out_dtype="uint8", axis=1)
_leaky_relu inputs 1 0.01
_leaky_relu inputs 2 False
_leaky_relu inputs 3 0.003921554423868656
_leaky_relu inputs 4 0
_leaky_relu inputs 5 0.003921554423868656
_leaky_relu inputs 6 0
_leaky_relu len(inputs)= 7
'''

input3 and input4 are output_scale and output_zp respectively.

input5 and input6 are input_scale and input_zp respectively.

Complete version:

def _leaky_relu():
    # refer to src/ATen/native/quantized/cpu/qrelu.cpp
    def _impl(inputs, _):
        assert len(inputs) == 7, "Input quant params not found in op inputs"
        assert not inputs[2], "inplace=True is not supported for quantized::leaky_relu yet"
        alpha = inputs[1]
        output_scale = _expr.const(inputs[3])
        output_zero_point = _expr.const(inputs[4])
        input_scale = _expr.const(inputs[5])
        input_zero_point = _expr.const(inputs[6])
        # dequant -> leaky_relu in float -> quant with the output qparams
        dequant = relay.qnn.op.dequantize(inputs[0], input_scale, input_zero_point)
        dequantized = _op.nn.leaky_relu(dequant, alpha)
        return relay.qnn.op.quantize(
            dequantized, output_scale, output_zero_point, out_dtype="uint8"
        )

    return _impl

Add a test case in tests/python/frontend/pytorch/qnn_test.py.

Example:

class Hsigmoid(nn.Module):
    def __init__(self, add_stub=False):
        super().__init__()
        self.quant = QuantStub()
        self.dequant = DeQuantStub()
        self.add_stub = add_stub
        self.hsigmoid = nn.Hardsigmoid()

    def forward(self, x):
        if self.add_stub:
            x = self.quant(x)
        x = self.hsigmoid(x)
        if self.add_stub:
            x = self.dequant(x)
        return x

    def fuse_model(self):
        pass

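Note that the PyTorch module being called is still nn.Hardsigmoid; forward follows float x → x = quant(x) → x = nn.Hardsigmoid(x) → x = dequant(x). The LeakyReLU test likewise wraps the ordinary nn.LeakyReLU; it does not call torch.nn.quantized.LeakyReLU or torch.nn.quantized.functional.leaky_relu.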

class LeakyReLU(nn.Module):
    def __init__(self):
        super().__init__()
        self.leaky_relu = QuantWrapper(nn.LeakyReLU())

    def forward(self, x):
        return self.leaky_relu(x)

    def fuse_model(self):
        pass