Introduction to Quantization in PyTorch 1.3.0

张博208

Original link: https://blog.csdn.net/zym19941119/article/details/102523719/

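Introduction to Quantization
Quantization refers to techniques for performing computation and storing tensors at lower bit widths than floating-point precision. A quantized model executes some or all of its operations on integer tensors rather than floating-point tensors. This gives a more compact model representation and makes it possible to use high-performance vectorized operations on many hardware platforms. PyTorch supports INT8 quantization, which, compared with FP32, reduces model size by 4x and memory bandwidth requirements by 4x. With hardware support, INT8 computation is typically 2 to 4 times faster than FP32 computation. Quantization is primarily a technique for speeding up inference, and quantized operators support only the forward pass.

PyTorch supports several ways of quantizing a deep learning model. In most cases the model is trained in FP32 and then converted to INT8. In addition, PyTorch supports quantization-aware training, which models quantization error in both the forward and backward passes using fake-quantization modules. Note that all of this computation is still carried out in floating point. At the end of quantization-aware training, PyTorch provides conversion functions that convert the trained model to lower precision.

At a lower level, PyTorch provides a way to represent quantized tensors and to perform operations on them. These tensors can be used to build models directly that carry out all or part of their computation in lower precision. Higher-level APIs are also provided that cover the typical workflows for converting an FP32 model to lower precision with minimal accuracy loss.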

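Quantized Tensors
PyTorch supports both per-tensor and per-channel asymmetric linear quantization. Per-tensor quantization means that all values within the tensor are scaled the same way. Per-channel quantization means that, for a given dimension (typically the channel dimension of the tensor), each slice along that dimension uses its own scale and offset, so the scale and offset can be represented as vectors. This reduces the error introduced when converting tensors to quantized values.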
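As a rough illustration of the difference, the sketch below quantizes a small weight matrix both ways and compares the reconstruction error per row. It assumes a recent PyTorch release that provides `torch.quantize_per_tensor` and `torch.quantize_per_channel`. The tensor values and the simple max-abs scale choice with a zero point of 0 are made up for the example and do not come from the original article.

```python
import torch

# Two "channels" (rows) with very different value ranges (illustrative values).
w = torch.tensor([[0.01, -0.02, 0.03],
                  [10.0, -20.0, 30.0]])

# Per-tensor: a single scale and zero point shared by every element.
scale_t = (w.abs().max() / 127).item()
q_tensor = torch.quantize_per_tensor(w, scale=scale_t, zero_point=0, dtype=torch.qint8)

# Per-channel: one scale and zero point for each slice along axis 0.
scales = w.abs().max(dim=1).values / 127
zero_points = torch.zeros(2, dtype=torch.int64)
q_channel = torch.quantize_per_channel(w, scales, zero_points, axis=0, dtype=torch.qint8)

# Largest absolute reconstruction error in each row.
err_tensor = (q_tensor.dequantize() - w).abs().max(dim=1).values
err_channel = (q_channel.dequantize() - w).abs().max(dim=1).values
print("per-tensor error per row: ", err_tensor)
print("per-channel error per row:", err_channel)
```

With a shared scale, the small-valued row is dominated by the range of the large-valued row, which is the situation per-channel quantization is meant to help with.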

The conversion from floating point to fixed point uses the following mapping:

$$Q(x, \mathrm{scale}, \mathrm{zero\_point}) = \mathrm{round}\left(\frac{x}{\mathrm{scale}} + \mathrm{zero\_point}\right)$$

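Note that zero in floating point is guaranteed to be represented with no error after quantization; that is, floating-point zero maps exactly onto one of the quantized values. This ensures that operations such as zero padding introduce no additional quantization error.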
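The mapping and the exact-zero property can be sketched in plain Python. The helper names (`choose_qparams`, `quantize`, `dequantize`), the unsigned 8-bit range, and the clamping step are assumptions made for this illustration and are not PyTorch APIs.

```python
def choose_qparams(x_min, x_max, qmin=0, qmax=255):
    """Pick scale and zero_point so that [x_min, x_max] maps onto [qmin, qmax]."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Q(x, scale, zero_point) = round(x / scale + zero_point), clamped to the integer range."""
    q = round(x / scale + zero_point)
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map a quantized integer back to an approximate floating-point value."""
    return (q - zero_point) * scale


scale, zero_point = choose_qparams(-1.0, 3.0)
for x in (0.0, 1.234, -1.0, 3.0):
    q = quantize(x, scale, zero_point)
    print(x, "->", q, "->", dequantize(q, scale, zero_point))
```

Because `zero_point` is itself an integer in the quantized range, the value 0.0 quantizes to `zero_point` and dequantizes back to exactly 0.0, while other values carry an error of at most half a quantization step.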

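To do quantization in PyTorch, we need to be able to represent quantized data in tensors. A quantized tensor stores the quantized data (represented as int8/uint8/int32) together with quantization parameters such as the scale and zero point. Quantized tensors support many useful operations that make quantized arithmetic easy, and they can also be serialized in their quantized form.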
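A minimal sketch of what such a tensor looks like from Python, assuming a recent PyTorch release. The input values, scale, and zero point are arbitrary choices for illustration.

```python
import torch

x = torch.tensor([-1.0, 0.0, 0.5, 2.0])

# Quantize to unsigned 8-bit with an explicitly chosen scale and zero point.
q = torch.quantize_per_tensor(x, scale=0.1, zero_point=10, dtype=torch.quint8)

print(q.dtype)            # quantized dtype of the tensor
print(q.int_repr())       # the underlying integer storage
print(q.q_scale())        # quantization parameter: scale
print(q.q_zero_point())   # quantization parameter: zero point
print(q.dequantize())     # back to a regular float tensor
```

The integer storage and the quantization parameters travel together inside one tensor object, which is what lets quantized operators interpret the stored integers correctly.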

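Operations
Quantized tensors support a limited subset of the data manipulation methods of regular full-precision tensors. For the NN operators included in PyTorch, support is restricted to:

8-bit weights (data type torch.qint8)

8-bit activations (data type torch.quint8)

Operator implementations currently support per-channel quantization only for the weights of the convolution and linear operators. In addition, the minimum and maximum of the input data are mapped linearly to the minimum and maximum of the quantized data type, in such a way that zero is represented with no quantization error.

Many operations on quantized tensors share the same API as their full-precision counterparts. Quantized versions of the NN modules that perform requantization are available in torch.nn.quantized. These operations explicitly take the output quantization parameters (scale and zero point) in their signature.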
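To make the idea of explicit output quantization parameters concrete, the sketch below uses the `torch.nn.quantized.Quantize` module and the low-level `torch.ops.quantized.add` operator. It assumes a recent PyTorch release in which both are available on CPU. The scales and input values are arbitrary, and this is only one way to illustrate the point, not the article's own example.

```python
import torch

# Module from torch.nn.quantized that takes scale and zero point explicitly.
to_quint8 = torch.nn.quantized.Quantize(scale=0.1, zero_point=0, dtype=torch.quint8)
a = to_quint8(torch.tensor([1.0, 2.0]))

# A second operand quantized with a different scale.
b = torch.quantize_per_tensor(torch.tensor([0.5, 1.5]), scale=0.05, zero_point=0,
                              dtype=torch.quint8)

# The quantized add takes the OUTPUT scale and zero point as explicit arguments,
# because the result has to be requantized into some integer range.
out = torch.ops.quantized.add(a, b, 0.1, 0)

print(out.q_scale(), out.q_zero_point())
print(out.dequantize())
```

The inputs carry their own quantization parameters, but the result's parameters cannot be inferred from them alone, which is why they appear in the operator signature.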

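In addition, fused versions of modules, corresponding to common fusion patterns that affect quantization, are provided in torch.nn.intrinsic.quantized.

For quantization-aware training, the supporting modules are provided in torch.nn.qat and torch.nn.intrinsic.qat.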

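Quantization Workflows
PyTorch provides three approaches to quantizing a model:

Post-training dynamic quantization: This is the simplest form of quantization to apply. The weights are quantized ahead of time, while the activations are quantized dynamically during inference. It is used when model execution time is dominated by loading weights from memory rather than by computing the matrix multiplications, which is common for LSTM and Transformer type models with small batch sizes. Dynamic quantization can be applied to a whole model with a single call to torch.quantization.quantize_dynamic().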
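A minimal sketch of that single call, assuming a recent PyTorch release and a CPU build with the `fbgemm` or `qnnpack` quantized backend. The two-layer model and its sizes are hypothetical stand-ins for a real network.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Pick a quantized backend that this build supports (assumption: one of these exists).
engines = torch.backends.quantized.supported_engines
torch.backends.quantized.engine = "fbgemm" if "fbgemm" in engines else "qnnpack"

# Hypothetical float model standing in for an LSTM/Transformer-style network.
float_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).eval()

# Quantize the weights of all nn.Linear layers to int8; activations are
# quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    float_model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(2, 16)
with torch.no_grad():
    y_float = float_model(x)
    y_quant = quantized_model(x)

print(quantized_model)
print("max abs difference:", (y_float - y_quant).abs().max().item())
```

No calibration data is needed here, since the activation ranges are measured at run time for each batch.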

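Post-training static quantization: This is the most commonly used form of quantization. The weights are quantized ahead of time, and the scale and zero point of the activation tensors are pre-computed from statistics observed while running the model during a calibration process. Post-training static quantization is typically used when both memory bandwidth and compute savings matter, with CNNs being a typical use case. The general process is:

a. Prepare the model: 1) Specify where activations are quantized and dequantized by explicitly adding QuantStub and DeQuantStub modules. 2) Ensure that modules are not reused. 3) Convert any operations that require requantization into modules.

b. Fuse operators such as conv + bn, conv + bn + relu, and conv + relu to improve both model accuracy and performance.

c. Specify the configuration of the quantization method, such as symmetric or asymmetric quantization and MinMax or L2Norm calibration.

d. Use torch.quantization.prepare() to insert modules that observe activation tensors during calibration.

e. Calibrate the model by running inference on a calibration dataset.

f. Finally, convert the model with torch.quantization.convert(). This step quantizes the weights, computes and stores the scale and zero point to be used for each activation tensor, and replaces key operators with their quantized implementations.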
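The steps above can be sketched end to end as follows. This assumes a recent PyTorch release in which `fuse_modules`, `prepare`, and `convert` return new models by default, and a CPU build with the `fbgemm` or `qnnpack` backend. The `SmallConvNet` class, the layer sizes, and the random calibration data are hypothetical placeholders.

```python
import torch
import torch.nn as nn


class SmallConvNet(nn.Module):
    """Hypothetical model with explicit quantize/dequantize boundaries."""

    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # float -> quantized
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()  # quantized -> float

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.bn(self.conv(x)))
        return self.dequant(x)


torch.manual_seed(0)
engines = torch.backends.quantized.supported_engines
backend = "fbgemm" if "fbgemm" in engines else "qnnpack"
torch.backends.quantized.engine = backend

model = SmallConvNet().eval()

# Fuse conv + bn + relu into a single module.
fused = torch.quantization.fuse_modules(model, [["conv", "bn", "relu"]])

# Choose a quantization configuration (here, the backend's default).
fused.qconfig = torch.quantization.get_default_qconfig(backend)

# Insert observers, then calibrate on random stand-in data.
prepared = torch.quantization.prepare(fused)
with torch.no_grad():
    for _ in range(8):
        prepared(torch.randn(4, 3, 16, 16))

# Convert to a model that uses quantized operators.
quantized = torch.quantization.convert(prepared)

out = quantized(torch.randn(1, 3, 16, 16))
print(quantized)
print(out.shape, out.dtype)
```

In practice the calibration loop would iterate over a representative sample of real inputs, since the observed ranges determine the activation scales.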

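Quantization-aware training: In the rare cases where post-training quantization does not provide adequate accuracy, training can be done with simulated quantization using torch.quantization.FakeQuantize. Computation takes place in FP32, but values are clamped to a fixed dynamic range and rounded to simulate the effects of INT8 quantization. The sequence of steps is very similar:

Perform steps a and b of post-training static quantization above.

Specify the configuration of the fake-quantization method, such as symmetric or asymmetric quantization and MinMax, Moving Average, or L2Norm calibration.

Use torch.quantization.prepare_qat() to insert modules that simulate quantization during training.

Train or fine-tune the model.

Perform step f of post-training static quantization above.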
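A minimal sketch of this sequence, assuming a recent PyTorch release and a CPU build with the `fbgemm` or `qnnpack` backend. The `TinyNet` class, the random training data, and the five-step training loop are hypothetical, and the fusion step is skipped to keep the example short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyNet(nn.Module):
    """Hypothetical model used only to illustrate the QAT workflow."""

    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc1 = nn.Linear(4, 8)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(8, 2)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.fc2(self.relu(self.fc1(x)))
        return self.dequant(x)


torch.manual_seed(0)
engines = torch.backends.quantized.supported_engines
backend = "fbgemm" if "fbgemm" in engines else "qnnpack"
torch.backends.quantized.engine = backend

model = TinyNet().train()

# Choose a fake-quantization configuration (here, the backend's default for QAT).
model.qconfig = torch.quantization.get_default_qat_qconfig(backend)

# Insert fake-quantization modules that simulate INT8 during training.
prepared = torch.quantization.prepare_qat(model)

# Train (or fine-tune) in FP32 with simulated quantization; random stand-in data.
optimizer = torch.optim.SGD(prepared.parameters(), lr=0.01)
for _ in range(5):
    inputs, targets = torch.randn(16, 4), torch.randn(16, 2)
    loss = F.mse_loss(prepared(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Convert the trained model to use quantized operators.
prepared.eval()
quantized = torch.quantization.convert(prepared)

out = quantized(torch.randn(1, 4))
print(quantized)
print(out.shape)
```

Gradients flow through the fake-quantization modules during training, so the weights can adapt to the rounding and clamping they will see after conversion.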

NOTE: To avoid misunderstandings introduced by translation, the original text is reproduced below:

PyTorch provides three approaches to quantize models.

1. Post Training Dynamic Quantization: This is the simplest to apply form of quantization where the weights are quantized ahead of time but the activations are dynamically quantized during inference. This is used for situations where the model execution time is dominated by loading weights from memory rather than computing the matrix multiplications. This is true for LSTM and Transformer type models with small batch size. Applying dynamic quantization to a whole model can be done with a single call to torch.quantization.quantize_dynamic().

2. Post Training Static Quantization: This is the most commonly used form of quantization where the weights are quantized ahead of time and the scale factor and bias for the activation tensors are pre-computed based on observing the behavior of the model during a calibration process. Post Training Quantization is typically used when both memory bandwidth and compute savings are important, with CNNs being a typical use case. The general process for doing post training quantization is:
1. Prepare the model: a. Specify where the activations are quantized and dequantized explicitly by adding QuantStub and DeQuantStub modules. b. Ensure that modules are not reused. c. Convert any operations that require requantization into modules.
2. Fuse operations like conv + relu or conv + batchnorm + relu together to improve both model accuracy and performance.
3. Specify the configuration of the quantization methods, such as selecting symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques.
4. Use torch.quantization.prepare() to insert modules that will observe activation tensors during calibration.
5. Calibrate the model by running inference against a calibration dataset.
6. Finally, convert the model itself with the torch.quantization.convert() method. This does several things: it quantizes the weights, computes and stores the scale and bias value to be used with each activation tensor, and replaces key operators with quantized implementations.

3. Quantization Aware Training: In the rare cases where post training quantization does not provide adequate accuracy, training can be done with simulated quantization using torch.quantization.FakeQuantize. Computations will take place in FP32 but with values clamped and rounded to simulate the effects of INT8 quantization. The sequence of steps is very similar.
1. Steps (1) and (2) are identical.
3. Specify the configuration of the fake quantization methods, such as selecting symmetric or asymmetric quantization and MinMax or Moving Average or L2Norm calibration techniques.
4. Use torch.quantization.prepare_qat() to insert modules that will simulate quantization during training.
5. Train or fine tune the model.
6. Identical to step (6) for post training quantization.

While default implementations of observers to select the scale factor and bias based on observed tensor data are provided, developers can provide their own quantization functions. Quantization can be applied selectively to different parts of the model or configured differently for different parts of the model.

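Although default observer implementations are provided that select the scale and zero point from the observed tensor data, developers can supply their own quantization functions. Quantization can be applied selectively to different parts of the model, or configured differently for different parts.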
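As a small sketch of what this looks like in code, the example below runs an observer by hand and then builds a custom configuration from observer classes. It assumes a recent PyTorch release that exposes `QConfig` and `MinMaxObserver` under `torch.quantization`. The specific observer settings, the two-layer model, and the choice to leave the second layer in floating point are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.quantization import MinMaxObserver, QConfig

# An observer records statistics of the tensors it sees and derives
# the scale and zero point from them.
observer = MinMaxObserver(dtype=torch.quint8, qscheme=torch.per_tensor_affine)
observer(torch.tensor([-1.0, 0.0, 3.0]))
scale, zero_point = observer.calculate_qparams()
print(scale, zero_point)

# A custom configuration: asymmetric uint8 activations, symmetric int8 weights.
custom_qconfig = QConfig(
    activation=MinMaxObserver.with_args(dtype=torch.quint8),
    weight=MinMaxObserver.with_args(dtype=torch.qint8,
                                    qscheme=torch.per_tensor_symmetric),
)

# Apply it to the whole model, but exclude one submodule from quantization.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
model.qconfig = custom_qconfig
model[1].qconfig = None  # this layer is left in floating point
```

A submodule's own `qconfig` takes precedence over the one set on its parent, which is how different parts of a model can be configured differently.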

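The quantization workflows operate by adding submodules (for example, adding an observer as an .observer submodule) or replacing submodules (for example, converting nn.Conv2d to nn.quantized.Conv2d) in the model's module hierarchy. The model therefore remains a regular nn.Module-based instance throughout the process and can work with the rest of the PyTorch APIs.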

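Preparing a Model for Quantization
Before quantization, the model definition needs some modification, because quantization currently works on a module-by-module basis. Specifically, the user needs to do the following:

Convert any operations that require requantization of their output from functional form to module form.

Specify which parts of the model need to be quantized, either by assigning .qconfig attributes to submodules or by specifying a qconfig_dict.

For static quantization, which quantizes activations, the user additionally needs to do the following:

Specify where activations are quantized and dequantized, using the QuantStub and DeQuantStub modules.

Use torch.nn.quantized.FloatFunctional to wrap tensor operations that need special handling for quantization into modules. Examples are operations such as add and cat, for which the output quantization parameters have to be determined separately.

Fuse modules: combine operations or modules into a single module to improve accuracy and performance. This is done with the torch.quantization.fuse_modules() API, which takes lists of submodules to be fused. Currently only the following fusions are supported: [Conv, ReLU], [Conv, BatchNorm], [Conv, BatchNorm, ReLU], [Linear, ReLU].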
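Two of these preparation steps, wrapping an add in FloatFunctional and fusing modules, can be sketched on a float model as follows. This assumes a recent PyTorch release in which `fuse_modules` returns a new model by default. The `ResidualBlock` class and its sizes are hypothetical.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Hypothetical block whose skip connection uses a module-form add."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(4, 4, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(4)
        self.relu = nn.ReLU()
        # Module wrapper for the add, so that it can later carry its own
        # output quantization parameters.
        self.skip_add = torch.nn.quantized.FloatFunctional()

    def forward(self, x):
        out = self.relu(self.bn(self.conv(x)))
        return self.skip_add.add(out, x)  # instead of `out + x`


torch.manual_seed(0)
block = ResidualBlock().eval()

# Fuse [Conv, BatchNorm, ReLU] into one module; names refer to attributes of `block`.
fused = torch.quantization.fuse_modules(block, [["conv", "bn", "relu"]])
print(fused)

# In float mode the fused block should compute the same function as the original.
x = torch.randn(1, 4, 8, 8)
with torch.no_grad():
    same = torch.allclose(block(x), fused(x), atol=1e-5)
print("outputs match:", same)
```

After fusion, the first module in each list holds the combined operation and the remaining positions are replaced with identity modules, so the forward method needs no changes.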
