PyTorch Quantization

weixin_45919003

Last modified: 2023-04-26 16:11:30

First published: 2023-04-13 16:37:19

Article link: https://blog.csdn.net/weixin_45919003/article/details/130115760

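I. References

PyTorch official documentation
Quantization: https://pytorch.org/docs/stable/quantization.html?highlight=quantization
Introduction to quantization in PyTorch: https://pytorch.org/blog/introduction-to-quantization-on-pytorch/
Reference article:
Gemfield: PyTorch的量化 (Quantization in PyTorch)

PyTorch provides two quantization modes:

Eager Mode Quantization: the user fuses modules manually and specifies where quantization and dequantization happen.
FX Graph Mode Quantization: fusion and the placement of quantize/dequantize operations are handled automatically.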

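II. Quantization types supported in Eager mode

PTQ supports: static, dynamic
QAT supports: static
Dynamic quantization is typically used for NLP models.
Static quantization is typically used in computer vision, mainly for CNNs.

1. Post Training Dynamic Quantization

This is the simplest form of quantization: the weights are quantized ahead of time, while the inputs are quantized dynamically during inference.
Activations are read from and written to memory in floating-point format.
PTDQ API: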

import torch

# define a floating point model
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 4)

    def forward(self, x):
        x = self.fc(x)
        return x

# create a model instance
model_fp32 = M()
# create a quantized model instance
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32,  # the original model
    {torch.nn.Linear},  # a set of layers to dynamically quantize
    dtype=torch.qint8)  # the target dtype for quantized weights

# run the model
input_fp32 = torch.randn(4, 4, 4, 4)
res = model_int8(input_fp32)

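Post Training Dynamic Quantization, or Dynamic Quantization for short, is also called weight-only quantization.

It can give higher accuracy, because the clipping range is calibrated exactly for each input.
Currently only linear layers and recurrent layers (LSTM, GRU, RNN) support dynamic quantization. Calibrating and quantizing each layer's activations at run time also adds compute overhead.

1.1 How dynamic quantization is computed

By default only some ops are converted: Linear, LSTM, LSTMCell, RNNCell, GRUCell.

The PlaceholderObserver used for activations is just a placeholder and does nothing.
The MinMaxObserver used for weights records the minimum and maximum values of the tensor it sees, which are then used to compute the scale and zero point (zp). min_val and max_val are the minimum and maximum of the op's weight data (or input tensor data); qmin and qmax are the minimum and maximum of the quantized value range (-128 and 127). The symmetric quantization formula is used.
So the weights are in fact quantized "statically". The scheme is called "dynamic quantization" because the float input tensor is converted into a quantized tensor on the fly during the forward pass.
During the forward pass, nnqd.Linear calls torch.ops.quantized.linear_dynamic. Its inputs are the packed quantized weight and the floating-point bias. PyTorch eventually dispatches linear_dynamic to the C++ function apply_dynamic_impl. To convert the input to quantized form, apply_dynamic_impl quantizes it as follows: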

Tensor q_input = at::quantize_per_tensor(input_contig, q_params.scale, q_params.zero_point, c10::kQUInt8);

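The essence of dynamic quantization is that the scale used to quantize the input is determined on the fly from the data range observed at run time, so the input tensor's scale is tuned to the actual input data. The model parameters, in contrast, are converted to INT8 ahead of time. Once the input has also been quantized, the arithmetic in the network is carried out with vectorized INT8 instructions. When the current layer produces its output, the result still has to be dequantized back to float32.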
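The run-time step described above can be sketched in Python. This is a minimal illustration, not the C++ implementation: the helper name dynamic_quantize_input is hypothetical, and the scale and zero point are derived with the plain min/max affine formula, which is an assumption about what q_params contains.

```python
import torch

def dynamic_quantize_input(x: torch.Tensor, qmin: int = 0, qmax: int = 255):
    """Quantize a float tensor to quint8 using its own observed range.

    Hypothetical helper for illustration only.
    """
    # Extend the range to include 0 so that zero is exactly representable.
    min_val = min(x.min().item(), 0.0)
    max_val = max(x.max().item(), 0.0)
    # Affine (asymmetric) parameters derived from the observed range.
    scale = max((max_val - min_val) / float(qmax - qmin), 1e-8)
    zero_point = int(round(qmin - min_val / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return torch.quantize_per_tensor(x, scale, zero_point, torch.quint8)

torch.manual_seed(0)
x = torch.randn(2, 4)
q_x = dynamic_quantize_input(x)

# Round-trip error introduced by quantizing this particular input.
max_err = (q_x.dequantize() - x).abs().max().item()
print(q_x.q_scale(), q_x.q_zero_point(), max_err)
```

Because the parameters are recomputed for every input, a different batch gives a different scale, which is the "dynamic" part of the scheme.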

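2. Post Training Static Quantization

Both weights and activations are quantized statically, and activations are fused into the preceding layer. A dataset is needed for calibration, to determine the best quantization parameters for the activations.

What it shares with dynamic quantization: both convert the network's weights from float32 to int8. What differs: the training set, or data with a similar distribution, has to be fed through the model (without backpropagation), and the quantization parameters of the activations are computed from the distribution of each op's input. This step is called calibration. Static quantization includes activation quantization, that is, the processing applied after an op's forward computation.

PTSQ API: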

import torch

# define a floating point model where some layers could be statically quantized
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        # manually specify where tensors will be converted from floating
        # point to quantized in the quantized model
        x = self.quant(x)
        x = self.conv(x)
        x = self.relu(x)
        # manually specify where tensors will be converted from quantized
        # to floating point in the quantized model
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to eval mode for static quantization logic to work
model_fp32.eval()

# attach a global qconfig, which contains information about what kind
# of observers to attach. Use 'x86' for server inference and 'qnnpack'
# for mobile inference. Other quantization configurations such as selecting
# symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
# can be specified here.
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default
# for server inference.
# model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('x86')

# Fuse the activations to preceding layers, where applicable.
# This needs to be done manually depending on the model architecture.
# Common fusions include `conv + relu` and `conv + batchnorm + relu`
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32, [['conv', 'relu']])

# Prepare the model for static quantization. This inserts observers in
# the model that will observe activation tensors during calibration.
model_fp32_prepared = torch.ao.quantization.prepare(model_fp32_fused)

# calibrate the prepared model to determine quantization parameters for activations
# in a real world setting, the calibration would be done with a representative dataset
input_fp32 = torch.randn(4, 1, 4, 4)
model_fp32_prepared(input_fp32)

# Convert the observed model to a quantized model. This does several things:
# quantizes the weights, computes and stores the scale and bias value to be
# used with each activation tensor, and replaces key operators with quantized
# implementations.
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)

# run the model, relevant calculations will happen in int8
res = model_int8(input_fp32)

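As the API above shows, static quantization has five main steps:

1. fuse_model: merge some layers to improve speed and accuracy.
2. Set the qconfig: an instance of QConfig, which holds the quantization observers.
– The two observers held by the default qconfig of each backend are listed in the table below:

Backend	activation	weight
fbgemm	HistogramObserver (reduce_range=True)	PerChannelMinMaxObserver (default_per_channel_weight_observer)
qnnpack	HistogramObserver (reduce_range=False)	MinMaxObserver (default_weight_observer)
default (neither fbgemm nor qnnpack)	MinMaxObserver (default_observer)	MinMaxObserver (default_weight_observer)

3. prepare: insert an Observer into each submodule to collect calibration statistics.
4. Feed data: this is not training. The goal is to capture the distribution of the data so that the activation scale and zp can be computed well. Feed at least a few hundred iterations of data.
5. Convert the model: this is similar to dynamic quantization. The type of each op in the model is looked up, and if it is a key of the dictionary DEFAULT_STATIC_QUANT_MODULE_MAPPINGS (note that this dictionary differs from the one used for dynamic quantization), the op is replaced by the value stored under that key.
Activations are not calibrated at run time; instead the clipping range is calibrated in advance with validation data and then fixed (static).
Static quantization gives faster inference than dynamic quantization because it removes the float/int conversion overhead between layers.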
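Steps 2 to 4 can be sketched as a small calibration loop. This is only an illustration under assumptions: TinyNet is a hypothetical model, random tensors stand in for a representative dataset, and 8 batches are used instead of the few hundred iterations recommended above. The sketch stops before convert, so it only exercises the observers.

```python
import torch
from torch.ao.quantization import (
    DeQuantStub, QuantStub, get_default_qconfig, prepare)

class TinyNet(torch.nn.Module):
    """Hypothetical model used only for this sketch."""
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.dequant = DeQuantStub()

    def forward(self, x):
        return self.dequant(self.conv(self.quant(x)))

torch.manual_seed(0)
model = TinyNet().eval()
model.qconfig = get_default_qconfig('fbgemm')   # step 2: set the qconfig
prepared = prepare(model)                       # step 3: insert observers

# Step 4: feed data. No labels, no loss, no backward pass.
with torch.no_grad():
    for _ in range(8):
        prepared(torch.randn(4, 1, 4, 4))       # stand-in for real data

# The observer attached to QuantStub now holds the calibrated parameters.
scale, zero_point = prepared.quant.activation_post_process.calculate_qparams()
print(float(scale), int(zero_point))
```

The parameters printed here are the ones that convert would freeze into the quantized model.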

2.1 Computing scale and zero point in static quantization

PyTorch's logic for computing scale and zero point:

# when qscheme is torch.per_tensor_symmetric or torch.per_channel_symmetric
max_val = torch.max(-min_val, max_val)
scale = max_val / (float(qmax - qmin) / 2)
scale = torch.max(scale, torch.tensor(self.eps, device=device, dtype=scale.dtype))
if self.dtype == torch.quint8:
    zero_point = zero_point.new_full(zero_point.size(), 128)

# when qscheme is torch.per_tensor_affine
scale = (max_val - min_val) / float(qmax - qmin)
scale = torch.max(scale, torch.tensor(self.eps, device=device, dtype=scale.dtype))
zero_point = qmin - torch.round(min_val / scale)
zero_point = torch.max(zero_point, torch.tensor(qmin, device=device, dtype=zero_point.dtype))
zero_point = torch.min(zero_point, torch.tensor(qmax, device=device, dtype=zero_point.dtype))

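QuantStub scale and zp: computed with the asymmetric formula. QuantStub uses HistogramObserver; for an input distributed over [-3, 3], HistogramObserver yields min_val and max_val of -3 and 2.9971, while qmin and qmax are 0 and 127.
conv activation scale and zp: the tensor produced by the convolution uses the asymmetric formula. The observer (quint8) is HistogramObserver with reduce_range, so qmin, qmax = 0, 127; min_val and max_val are determined from the input data and the weights according to the L2Norm.
conv weight scale and zp: the convolution weight tensor uses the symmetric formula. The weight observer (qint8) is PerChannelMinMaxObserver without reduce_range, so qmin, qmax = -128, 127; min_val and max_val are the minimum and maximum of the data fed to the observer.
fc activation scale and zp: computed the same way as for conv.
fc weight scale and zp:
relu activation scale and zp: computed with the asymmetric formula.
In the conv, suppose the weight is -0.7898 and the first value of the input tensor is -0.9912. The convolution should give -0.7898 x -0.9912 = 0.7828, but the value actually obtained is 0.7801, so error has already been introduced (0.34%). Every layer introduces a similar error, which is why fuse_modules can improve accuracy.
The biggest difference between static and dynamic quantization: in static quantization the float input must pass through QuantStub to become int, and everything stays int until the output; in dynamic quantization the float input is quantized to int with a scale and zp computed on the fly, and each op converts its output back to float.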
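As a worked check of the two formulas quoted above, the sketch below recomputes them in plain Python. The helper names affine_qparams and symmetric_qparams are hypothetical, and the inputs are the values reported in the text (min_val = -3, max_val = 2.9971 for QuantStub; a single weight of -0.7898). Treating that single weight as the whole weight range is an assumption made for the example.

```python
EPS = 1.1920929e-07  # float32 machine epsilon, used as a lower bound on scale

def affine_qparams(min_val, max_val, qmin, qmax):
    """scale and zero_point for torch.per_tensor_affine (hypothetical helper)."""
    scale = max((max_val - min_val) / float(qmax - qmin), EPS)
    zero_point = qmin - round(min_val / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point

def symmetric_qparams(min_val, max_val, qmin, qmax):
    """scale and zero_point for the symmetric qschemes with qint8."""
    max_abs = max(-min_val, max_val)
    scale = max(max_abs / (float(qmax - qmin) / 2), EPS)
    return scale, 0  # zero_point would be 128 for quint8

# QuantStub: asymmetric, reduce_range, so the range is 0..127.
act_scale, act_zp = affine_qparams(-3.0, 2.9971, qmin=0, qmax=127)
# conv weight: symmetric, range -128..127 (single weight assumed).
w_scale, w_zp = symmetric_qparams(-0.7898, 0.0, qmin=-128, qmax=127)

print(act_scale, act_zp)  # scale is about 0.0472, zero_point is 64
print(w_scale, w_zp)      # scale is about 0.0062, zero_point is 0
```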

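3. Static quantization aware training

All weights and biases are stored in FP32. In the forward pass, quantization is simulated internally by FakeQuantize modules (the data is quantized and then immediately dequantized).
QAT API: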

import torch

# define a floating point model where some layers could benefit from QAT
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.bn = torch.nn.BatchNorm2d(1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to eval for fusion to work
model_fp32.eval()

# attach a global qconfig, which contains information about what kind
# of observers to attach. Use 'x86' for server inference and 'qnnpack'
# for mobile inference. Other quantization configurations such as selecting
# symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
# can be specified here.
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default
# for server inference.
# model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
model_fp32.qconfig = torch.ao.quantization.get_default_qat_qconfig('x86')

# fuse the activations to preceding layers, where applicable
# this needs to be done manually depending on the model architecture
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32,
    [['conv', 'bn', 'relu']])

# Prepare the model for QAT. This inserts observers and fake_quants in
# the model that will observe weight and activation tensors during calibration.
# The model needs to be set to train for QAT logic to work.
model_fp32_prepared = torch.ao.quantization.prepare_qat(model_fp32_fused.train())

# run the training loop (not shown); training_loop is a placeholder for user code
training_loop(model_fp32_prepared)

# Convert the observed model to a quantized model. This does several things:
# quantizes the weights, computes and stores the scale and bias value to be
# used with each activation tensor, fuses modules where appropriate,
# and replaces key operators with quantized implementations.
model_fp32_prepared.eval()
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)

# run the model, relevant calculations will happen in int8
input_fp32 = torch.randn(4, 1, 4, 4)
res = model_int8(input_fp32)

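1. Set the qconfig: before setting it, put the model in training mode.
– In the QAT qconfig, the observers for both activations and weights become FakeQuantize modules (a has-a relationship: each FakeQuantize contains an observer), and their parameters differ (quant_min, quant_max, dtype, qscheme, reduce_range).
– The observer contained in FakeQuantize is MovingAverageMinMaxObserver. It inherits from the MinMaxObserver mentioned earlier, but computes the minimum and maximum slightly differently.
2. fuse_modules: same as in static quantization.
3. prepare_qat: this step uses the prepare_qat API. There are two main differences from prepare. First, prepare_qat attaches the qconfig to every op, and the content of the qconfig is itself different (see step 1). Second, prepare_qat has the extra job of converting submodules: it replaces some submodules of the model in place, swapping each key of DEFAULT_QAT_MODULE_MAPPINGS for its value. This dictionary is also defined differently.
4. Feed data: completely different from static quantization, because in QAT this step is actual training. Each op's weight passes through self.weight_fake_quant and its output passes through self.activation_post_process. Both are FakeQuantize instances; only the observers they contain differ.
– In the forward function of FakeQuantize, fake_quantize_per_channel_or_tensor_affine implements quantize followed by dequantize. As a formula: out = (clamp(round(x/scale + zero_point), quant_min, quant_max) - zero_point) * scale. In other words, the quantization error is introduced into the training loss.
– In QAT, all weights and activations are fake-quantized in this way and take part in the forward and backward computation of training. Float values are rounded to (simulated) int8 values, but all computation is still done in float. As a result, every weight feels the effect of quantization during optimization, which is why this is called quantization aware training (supported on CPU and CUDA), and why accuracy is higher.
5. convert: same as in static quantization. Note that in QAT some modules have already been replaced by new modules during prepare.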
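The fake-quantize formula from step 4 can be checked against PyTorch's built-in per-tensor operator. This is a minimal sketch: the function fake_quantize is a hypothetical helper that transcribes the formula directly, and the scale, zero point, and input values are arbitrary choices for the example.

```python
import torch

def fake_quantize(x, scale, zero_point, quant_min, quant_max):
    """Quantize, then immediately dequantize, staying in floating point."""
    q = torch.clamp(torch.round(x / scale + zero_point), quant_min, quant_max)
    return (q - zero_point) * scale

# Arbitrary example values; the input range is wide enough to show clamping.
x = torch.linspace(-20, 20, steps=101)
scale, zero_point, quant_min, quant_max = 0.125, 3, -128, 127

manual = fake_quantize(x, scale, zero_point, quant_min, quant_max)
builtin = torch.fake_quantize_per_tensor_affine(
    x, scale, zero_point, quant_min, quant_max)

print(torch.allclose(manual, builtin))
print(manual.min().item(), manual.max().item())  # clamped ends of the range
```

The output is still a float tensor, but every value lies on the int8 grid, so the loss computed from it already contains the rounding and clamping error.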

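IV. Quantization Stack

Components used in the quantization workflow:

Observer and FakeQuantize
– Observer: collects tensor statistics, such as the minimum and maximum of a tensor, and computes the quantization parameters
– FakeQuantize: the fake-quantization module
QConfig: a namedtuple of Observer or FakeQuantize module classes, which can be configured
– different types of Observer/FakeQuantize
– separate configuration for weights and activations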

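V. Quantization API

Official documentation: https://pytorch.org/docs/stable/quantization-support.html
Reference: https://zhuanlan.zhihu.com/p/299108528

1. Top-level API

1.1 quantize & quantize_dynamic & quantize_qat: apply post-training static quantization / dynamic (weight-only) quantization / quantization aware training.

Fine-grained control is configured through qconfig.

quantize: the model has to be prepared and calibrated first. The API is as follows: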

torch.ao.quantization.quantize(model, run_fn, run_args, mapping=None, inplace=False)

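quantize_dynamic: in essence it looks up the type of each op in the model, and if the type is a key of the dictionary DEFAULT_DYNAMIC_QUANT_MODULE_MAPPINGS, the op is replaced by the value stored under that key. The API is shown after the notes below.
– The qconfig_spec parameter specifies a set of qconfigs, that is, which op (operation, for example conv, linear, or batchnorm in a CNN) uses which qconfig
– Each qconfig is an instance of the QConfig class and wraps two observers;
– the two observers are the weight observer and the activation observer
– qconfig_spec=None selects the default behavior
– qconfig_spec given as a set, for example {nn.LSTM, nn.Linear}, specifies which layers of the model are to be dynamically quantized;
– qconfig_spec given as a dict maps the name or type of a submodule (key) to a QConfigDynamic instance (value)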

torch.ao.quantization.quantize_dynamic(model, qconfig_spec=None, dtype=torch.qint8, mapping=None, inplace=False)

The dictionary DEFAULT_DYNAMIC_QUANT_MODULE_MAPPINGS:

# Default map for swapping dynamic modules
DEFAULT_DYNAMIC_QUANT_MODULE_MAPPINGS = {
    nn.GRUCell: nnqd.GRUCell,
    nn.Linear: nnqd.Linear,
    nn.LSTM: nnqd.LSTM,
    nn.LSTMCell: nnqd.LSTMCell,
    nn.RNNCell: nnqd.RNNCell,
}

When a type is swapped from key to value, the new type has to be instantiated and must reuse the existing weights. This is generally done through from_float().
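The key-to-value swap can be observed directly on a small model. This is a sketch under assumptions: TinyMLP is a hypothetical model, and the example presumes a PyTorch build with a quantized engine (such as fbgemm or qnnpack) available, since packing the weight depends on it.

```python
import torch
import torch.ao.nn.quantized.dynamic as nnqd

class TinyMLP(torch.nn.Module):
    """Hypothetical model used only for this sketch."""
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 4)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        return self.relu(self.fc(x))

model_fp32 = TinyMLP().eval()
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8)

# nn.Linear (a key of the mapping) has been replaced by nnqd.Linear (its value);
# nn.ReLU is not in the mapping and is left untouched.
print(type(model_fp32.fc).__name__, "->", type(model_int8.fc).__name__)
print(type(model_int8.relu).__name__)
# The new module carries the old weights, now stored as qint8.
print(model_int8.fc.weight().dtype)
```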

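1.2 prepare & prepare_qat: prepare a copy of the model for calibration or for quantization aware training. The .qconfig must be configured first.

Post-training static quantization (PTQ) uses prepare: it inserts observer modules so that activation tensors can be observed during calibration.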

torch.ao.quantization.prepare(model, inplace=False, allow_list=None, observer_non_leaf_module_list=None, prepare_custom_config_dict=None)

model_fp32_prepared = torch.ao.quantization.prepare(model_fp32_fused)

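Quantization aware training (QAT) uses prepare_qat: it inserts observer and fake_quants modules, which observe weight and activation tensors during calibration. The model must be in train() mode for this to work.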

torch.ao.quantization.prepare_qat(model, mapping=None, inplace=False)

model_fp32_prepared = torch.ao.quantization.prepare_qat(model_fp32_fused.train())

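1.3 convert: converts submodules of the input module into different modules according to the mapping, by calling the from_float method of the target module class. If remove_qconfig is True, the qconfig is removed at the end.

In QAT the whole computation is carried out in floating point; when training ends, convert turns the floating-point model into a quantized one.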

torch.ao.quantization.convert(module, mapping=None, inplace=False, remove_qconfig=True, is_reference=False, convert_custom_config_dict=None)

model_int8 = torch.ao.quantization.convert(model_fp32_prepared)

from_float()
The from_float method of the nnqat.Linear module is as follows:

    @classmethod
    def from_float(cls, mod):
        r"""Create a qat module from a float module or qparams_dict
            Args: `mod` a float module, either produced by torch.ao.quantization utilities
            or directly from user
        """
        assert type_before_parametrizations(mod) == cls._FLOAT_MODULE, (
            " qat."
            + cls.__name__
            + ".from_float only works for "
            + cls._FLOAT_MODULE.__name__
        )
        assert hasattr(mod, "qconfig"), "Input float module must have qconfig defined"
        assert mod.qconfig, "Input float module must have a valid qconfig"
        if type_before_parametrizations(mod) == LinearReLU:
            mod = mod[0]

        qconfig = mod.qconfig
        qat_linear = cls(mod.in_features, mod.out_features, bias=mod.bias is not None, qconfig=qconfig)

        if is_parametrized(mod, "weight"):
            transfer_parametrizations_and_params(mod, qat_linear, "weight")
        else:
            qat_linear.weight = mod.weight

        if is_parametrized(mod, "bias"):
            transfer_parametrizations_and_params(mod, qat_linear, "bias")
        else:
            qat_linear.bias = mod.bias

        return qat_linear

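This method constructs a qat_linear instance.
For the dynamically quantized module nnqd.Linear, from_float() mainly does the following:

It uses MinMaxObserver to compute the maximum and minimum of the weight tensor of each op in the model (in this example there is only the Linear op), which narrows the range of original values to be quantized and improves quantization accuracy;
From the min_val and max_val of the 4-tuple obtained in the previous step, together with the qmin and qmax fixed by the algorithm, it computes scale and zp, and then the quantized weight;
It instantiates nnqd.Linear and then calls qlinear.set_weight_bias to set the quantized weight and the original bias on the new layer. This last step also involves packing the weight and bias, which looks like this in the source code: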

#ifdef USE_FBGEMM
    if (ctx.qEngine() == at::QEngine::FBGEMM) {
      return PackedLinearWeight::prepack(std::move(weight), std::move(bias));
    }
#endif

#ifdef USE_PYTORCH_QNNPACK
    if (ctx.qEngine() == at::QEngine::QNNPACK) {
      return PackedLinearWeightsQnnp::prepack(std::move(weight), std::move(bias));
    }
#endif
    TORCH_CHECK(false,"Didn't find engine for operation quantized::linear_prepack ",toString(ctx.qEngine()));

In other words, this step relies on backends such as FBGEMM and QNNPACK.

2. Preparation before quantization

2.1 fuse_modules: fuses modules. Common fusions include "conv+ReLU" and "conv+BN+ReLU". Fusion has to be done manually, according to the model architecture.

model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32, [['conv', 'relu']])
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32, [['conv', 'bn', 'relu']])
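What fusion does to the module tree can be seen in a short sketch. ConvBnRelu is a hypothetical model, and the example assumes eval mode, where the BatchNorm statistics are folded into the convolution weights.

```python
import torch
import torch.ao.nn.intrinsic as nni

class ConvBnRelu(torch.nn.Module):
    """Hypothetical model used only for this sketch."""
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.bn = torch.nn.BatchNorm2d(1)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

torch.manual_seed(0)
model = ConvBnRelu().eval()
fused = torch.ao.quantization.fuse_modules(model, [['conv', 'bn', 'relu']])

# The first module of the group becomes the fused module;
# the remaining ones are replaced by Identity placeholders.
print(type(fused.conv).__name__, type(fused.bn).__name__, type(fused.relu).__name__)

# Fusion does not change the float result.
x = torch.randn(2, 1, 4, 4)
same_output = torch.allclose(model(x), fused(x), atol=1e-5)
print(same_output)
```

After fusion, only one output needs to be quantized for the whole group instead of one per layer.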

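2.2 QuantStub & DeQuantStub: quantization and dequantization

These have to be inserted manually into the CNN structure.

QuantStub: the quantize stub module. Before calibration it behaves like an observer; convert turns it into nnq.Quantize.
DeQuantStub: the dequantize stub module. During prepare it is equivalent to Identity; convert turns it into nnq.DeQuantize.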
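The life cycle of the two stubs can be traced with a model that contains nothing else. This is a sketch under assumptions: StubsOnly is a hypothetical module, and a single random batch stands in for calibration data.

```python
import torch
import torch.ao.nn.quantized as nnq
from torch.ao.quantization import (
    DeQuantStub, QuantStub, convert, get_default_qconfig, prepare)

class StubsOnly(torch.nn.Module):
    """Hypothetical module: quantize, then immediately dequantize."""
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()
        self.dequant = DeQuantStub()

    def forward(self, x):
        return self.dequant(self.quant(x))

torch.manual_seed(0)
model = StubsOnly().eval()
model.qconfig = get_default_qconfig('fbgemm')

prepared = prepare(model)
prepared(torch.randn(16, 4))   # calibration: QuantStub's observer records the range

quantized = convert(prepared)
out = quantized(torch.randn(2, 4))

# The stubs have been swapped for the real quantize/dequantize modules.
print(type(quantized.quant).__name__, type(quantized.dequant).__name__)
print(out.dtype)               # the output is float again after DeQuantize
```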

3、torch.ao.quantization.observer

ObserverBase
MinMaxObserver
and others

4、torch.ao.quantization.qconfig

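Defines the QConfig objects used to configure the quantization settings of individual ops.

4.1 QConfig: describes how to quantize a layer or part of the network by providing observer classes for the activations and the weights respectively.

It must contain observer classes (such as MinMaxObserver), or callables that return an instance when called, not concrete observer instances.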
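A custom QConfig built from observer factories might look like the sketch below. The choice of MinMaxObserver for both fields and the sample tensors are assumptions made for the example; they are not the defaults of any backend.

```python
import torch
from torch.ao.quantization import MinMaxObserver, QConfig

# with_args returns a factory (a callable), not an observer instance.
qconfig = QConfig(
    activation=MinMaxObserver.with_args(
        dtype=torch.quint8, qscheme=torch.per_tensor_affine),
    weight=MinMaxObserver.with_args(
        dtype=torch.qint8, qscheme=torch.per_tensor_symmetric),
)

# Each call to a factory creates a fresh observer, so every layer gets its own.
act_observer = qconfig.activation()
other_observer = qconfig.activation()

act_observer(torch.tensor([-1.0, 0.0, 3.0]))       # record min/max
act_scale, act_zp = act_observer.calculate_qparams()

w_observer = qconfig.weight()
w_observer(torch.tensor([-0.5, 0.25]))
w_scale, w_zp = w_observer.calculate_qparams()

print(float(act_scale), int(act_zp))  # scale is 4/255, zero_point is 64
print(float(w_scale), int(w_zp))      # symmetric: zero_point is 0
```

Passing factories rather than instances is what lets prepare attach an independent observer to each module that shares the same qconfig.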

4.2 default_qconfig: the default qconfig

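VI. Quantization .qconfig

The function that returns the qconfig is defined as follows. Two backends are commonly used: fbgemm quantizes weights per channel, while qnnpack quantizes them per tensor (per layer). "fbgemm" can now be replaced by "x86", which is the recommended default.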

def get_default_qconfig(backend='fbgemm', version=0):
    """
    Returns the default PTQ qconfig for the specified backend.
    Args:
      * `backend`: a string representing the target backend. Currently supports `fbgemm`,
        `qnnpack` and `onednn`.
    Return:
        qconfig
    """
    if version == 0:
        if backend == 'fbgemm':
            qconfig = QConfig(activation=HistogramObserver.with_args(reduce_range=True),
                              weight=default_per_channel_weight_observer)
        elif backend == 'qnnpack':
            qconfig = QConfig(activation=HistogramObserver.with_args(reduce_range=False),
                              weight=default_weight_observer)
        elif backend == 'onednn':
            qconfig = QConfig(activation=HistogramObserver.with_args(reduce_range=False),
                              weight=default_per_channel_weight_observer)
        else:
            qconfig = default_qconfig
    else:
        raise AssertionError("Version number: " + str(version) +
                             " in get_default_qconfig is not supported. Version number must be 0")

    return qconfig

myModel.qconfig = torch.quantization.default_qconfig
per_channel_quantized_model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
qat_model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')

The function calls with_args, which is defined as follows: