PyTorch Python API: Quantization || Intro

Latest recommended article published on 2023-12-27 09:45:22

Bitterest

Category: PyTorch API. Tags: pytorch, python, deep learning, edge computing

Article link: https://blog.csdn.net/Mr_Menace/article/details/121254712

Reference: https://pytorch.org/docs/stable/quantization.html
(This post is best suited to readers who already have some grasp of model quantization concepts.)

Intro

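Quantization is a technique for performing tensor computation and storage at bit widths lower than floating-point precision. A quantized model executes some or all of its tensor operations on integers rather than floating-point values. This gives a more compact model representation and lets hardware platforms use high-performance tensor operations. Note that PyTorch currently provides no CUDA implementation of the quantized operators (that is, no GPU support), so a quantized model has to be moved to the CPU to be run and tested. PyTorch also supports quantization-aware training (QAT), which uses fake-quantization modules to model the quantization error in both the forward and backward passes. Unlike quantized inference, QAT can run on the GPU.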

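To use quantization in PyTorch, you need to know a few concepts:

Quantization Config (Qconfig): specifies how activations and weights are quantized. To create a quantized model, you first create a Qconfig.
Backend: the kernels that support quantization, usually with differing numerics.
Quantization engine (Qengine): when a quantized model is executed, the Qengine specifies which Backend is used. The Qengine must be consistent with the Qconfig.
Observer: records statistics about the distribution of activations during calibration.
Operator Fusion: fuses several operators into one to save memory accesses and speed up computation.
Per-channel Quantization: quantizes the weights of each output channel of a convolution/linear layer independently.

PyTorch currently supports two backends: FBGEMM (server-side inference) and QNNPACK (mobile inference).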
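
To make the Qengine/Qconfig consistency requirement concrete, here is a minimal sketch. It assumes that at least one of the two backends named above (`fbgemm`, `qnnpack`) is compiled into the local PyTorch build; the variable names are illustrative only.

```python
import torch

# Engines compiled into this PyTorch build (typically 'fbgemm' on x86 servers
# and 'qnnpack' on ARM/mobile builds).
supported = torch.backends.quantized.supported_engines

# Assumption: at least one of the two backends named in the text is available.
backend = next(name for name in ("fbgemm", "qnnpack") if name in supported)

# Qengine: the backend that will execute the quantized kernels.
torch.backends.quantized.engine = backend

# Qconfig: requested for the same backend, so that the quantized value ranges
# match what the engine expects.
qconfig = torch.quantization.get_default_qconfig(backend)

print(backend)
print(qconfig)
```

Deriving both the engine and the Qconfig from the same `backend` string is one simple way to keep them from drifting apart.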

Quantization API Summary

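PyTorch currently offers two quantization modes: Eager Mode Quantization and FX Graph Mode Quantization.

Eager Mode Quantization requires you to do fusion yourself and to specify where quantization and dequantization happen. It currently supports only modules, not functionals.

FX Graph Mode Quantization is a new automated quantization framework that is currently only a prototype. It requires a symbolically traceable model and relies on the FX framework.

PyTorch officially provides four quantization methods:

Weight Only Quantization. Only the weights are actually quantized. All other parameters stay in full precision, and computation is also in full precision. Similar to PTQ, but without calibration.
Dynamic Quantization. The weights are actually quantized, while activations are stored and read in full precision and are quantized only when they are needed for computation (hence "dynamic"). This suits cases where loading the weights from memory dominates the runtime and the matrix multiplications themselves are comparatively cheap. Similar to PTQ, but without calibration: the scale and zero point of each activation are computed, and the activation quantized, during inference.
Static Quantization. Plain PTQ: after calibration, both activations and weights are actually quantized, and all subsequent computation is carried out in integer format.
Static Quantization Aware Training. That is, QAT: weights and activations are fake-quantized so the model can still be trained. All computation is carried out in full precision. After training, the model can be converted to an actually quantized one.

Eager Mode Quantization supports Dynamic Quantization, Static Quantization, and QAT (methods 2, 3, and 4 above). FX Graph Mode Quantization supports all four.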
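
As a rough illustration of what "symbolically traceable" means, the sketch below runs `torch.fx.symbolic_trace` on two toy modules. Both modules and the helper `is_symbolically_traceable` are hypothetical names made up for this example; the second module fails because its control flow depends on tensor values.

```python
import torch
import torch.fx


class Traceable(torch.nn.Module):
    """Hypothetical module with a fixed dataflow: FX can trace it."""

    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 4)

    def forward(self, x):
        return torch.relu(self.fc(x))


class DataDependent(torch.nn.Module):
    """Hypothetical module whose branch depends on the input values."""

    def forward(self, x):
        if x.sum() > 0:  # a traced value cannot drive Python control flow
            return x
        return -x


def is_symbolically_traceable(module):
    """Return True if torch.fx.symbolic_trace succeeds on the module."""
    try:
        torch.fx.symbolic_trace(module)
        return True
    except Exception:
        return False


traced = torch.fx.symbolic_trace(Traceable())
print(traced.graph)
print(is_symbolically_traceable(Traceable()))
print(is_symbolically_traceable(DataDependent()))
```

A model that fails this check usually has to be restructured before FX Graph Mode Quantization can be applied to it.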

Eager Mode Quantization

Dynamic Quantization

API example:

import torch
# define a floating point model
class M(torch.nn.Module):
    def __init__(self):
        super(M, self).__init__()
        self.fc = torch.nn.Linear(4, 4)
    def forward(self, x):
        x = self.fc(x)
        return x

# create a model instance
model_fp32 = M()
# create a quantized model instance
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32,  # the original model
    {torch.nn.Linear},  # a set of layers to dynamically quantize
    dtype=torch.qint8)  # the target dtype for quantized weights

# run the model
input_fp32 = torch.randn(4, 4, 4, 4)
res = model_int8(input_fp32)

Static Quantization

API example:

import torch

# define a floating point model where some layers could be statically quantized
class M(torch.nn.Module):
    def __init__(self):
        super(M, self).__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        # manually specify where tensors will be converted from floating
        # point to quantized in the quantized model
        x = self.quant(x)
        x = self.conv(x)
        x = self.relu(x)
        # manually specify where tensors will be converted from quantized
        # to floating point in the quantized model
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M().to('cpu')

# model must be set to eval mode for static quantization logic to work
model_fp32.eval()

# Get-Qconfig stage: attach a global qconfig. Here we simply take the official
# defaults, which define the observers, symmetric/asymmetric quantization,
# calibration method, and so on.
model_fp32.qconfig = torch.quantization.get_default_qconfig('fbgemm')

# Fusion stage: specify which layers take part in fusion.
# Supported patterns include `conv + relu` and `conv + bn + relu`.
model_fp32_fused = torch.quantization.fuse_modules(model_fp32, [['conv', 'relu']])

# Prepare stage: insert observers.
model_fp32_prepared = torch.quantization.prepare(model_fp32_fused)

# Calibrate stage: just feed data through the model to produce the intermediate activations.
# in a real world setting, the calibration would be done with a representative dataset
input_fp32 = torch.randn(4, 1, 4, 4)
model_fp32_prepared(input_fp32)

# Convert stage: turn the model into an actually quantized one. This quantizes the
# weights, computes and stores the scales and zero points, and swaps in the quantized operators.
model_int8 = torch.quantization.convert(model_fp32_prepared)

# int8 inference is now possible.
res = model_int8(input_fp32)

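In summary, the whole process goes through the Get Qconfig, Fusion, Prepare, Calibration, and Convert stages. Strictly speaking, Fusion can be skipped, but if the goal is deployment it should still be done, since fused operators save memory accesses and run faster.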
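
To show what an observer actually records during calibration, here is a small standalone sketch with `MinMaxObserver`. It is chosen only for simplicity; the default Qconfig may use other observer types, and the input values are made up.

```python
import torch

# An observer for unsigned 8-bit activations (per-tensor affine by default).
observer = torch.quantization.MinMaxObserver(dtype=torch.quint8)

# "Calibration": every forward call updates the running min / max.
observer(torch.tensor([-1.0, 0.5]))
observer(torch.tensor([0.0, 3.0]))

# The quantization parameters are derived from the recorded range.
scale, zero_point = observer.calculate_qparams()

print("min:", float(observer.min_val), "max:", float(observer.max_val))
print("scale:", float(scale), "zero_point:", int(zero_point))
```

In the Prepare stage, observers like this one are attached to the modules; the Convert stage then reads their statistics to fix the scale and zero point of each activation.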

Static Quantization Aware Training (QAT)

API example:

import torch

# define a floating point model where some layers could benefit from QAT
class M(torch.nn.Module):
    def __init__(self):
        super(M, self).__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.bn = torch.nn.BatchNorm2d(1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to train mode for QAT logic to work
model_fp32.train()

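# Get-Qconfig stage: attach a global qconfig. Here we simply take the official
# defaults, which define the fake-quantizer type, symmetric/asymmetric quantization,
# calibration method, and so on.
model_fp32.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')

# Fusion stage: specify which layers take part in fusion.
# Supported patterns include `conv + relu` and `conv + bn + relu`.
# Note: from PyTorch 1.11 on, train-mode fusion is exposed as
# torch.ao.quantization.fuse_modules_qat; fuse_modules there expects eval mode.
model_fp32_fused = torch.quantization.fuse_modules(model_fp32,
    [['conv', 'bn', 'relu']])

# Prepare stage: insert fake-quantize modules.
model_fp32_prepared = torch.quantization.prepare_qat(model_fp32_fused)

# Train stage: update the parameters with gradients.
# training_loop is a user-defined function (not shown here).
training_loop(model_fp32_prepared)

# Convert stage: turn the model into an actually quantized one. This quantizes the
# weights, computes and stores the scales and zero points, and swaps in the quantized operators.
model_fp32_prepared.eval()
model_int8 = torch.quantization.convert(model_fp32_prepared)

# int8 inference is now possible.
input_fp32 = torch.randn(4, 1, 4, 4)
res = model_int8(input_fp32)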

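In summary, the whole process goes through the Get Qconfig, Fusion, Prepare, Training, and Convert stages. Note that the training loop itself can run on either the CPU or the GPU (only the converted model is restricted to the CPU), and that the quantizers' scales and zero points are updated from the observed statistics during training. To stop updating the scales and zero points: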

model_fp32_prepared.apply(torch.quantization.disable_observer)

To make BN use its running mean and running variance (that is, to freeze the BN statistics):

model_fp32_prepared.apply(torch.nn.intrinsic.qat.freeze_bn_stats)
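
The QAT listing above leaves `training_loop` undefined. The following is a minimal end-to-end sketch under stated assumptions: `TinyQATNet` and `training_loop` are hypothetical stand-ins, the data is random, and the model has no BN so that the fusion step can be skipped for brevity. Halfway through training it freezes the scales and zero points with `disable_observer`.

```python
import torch
import torch.nn as nn


class TinyQATNet(nn.Module):
    """Hypothetical toy model (conv + relu only, no BN, no fusion)."""

    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = nn.Conv2d(1, 1, 1)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))


def training_loop(model, steps=20):
    """Hypothetical training loop: regress relu(x) from random inputs."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
    for step in range(steps):
        if step == steps // 2:
            # Second half of training: stop updating scales / zero points.
            model.apply(torch.quantization.disable_observer)
        x = torch.randn(8, 1, 4, 4)
        loss = nn.functional.mse_loss(model(x), torch.relu(x))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


# Assumption: at least one of 'fbgemm' / 'qnnpack' is available in this build.
supported = torch.backends.quantized.supported_engines
backend = next(name for name in ("fbgemm", "qnnpack") if name in supported)
torch.backends.quantized.engine = backend

model_fp32 = TinyQATNet().train()
model_fp32.qconfig = torch.quantization.get_default_qat_qconfig(backend)
model_prepared = torch.quantization.prepare_qat(model_fp32)
training_loop(model_prepared)

model_prepared.eval()
model_int8 = torch.quantization.convert(model_prepared)
output = model_int8(torch.randn(4, 1, 4, 4))
print(type(model_int8.conv), output.shape)
```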

FX Graph Mode Quantization (Prototype)

Weight Only Quantization

API example:

import torch.nn as nn
import torch.quantization.quantize_fx as quantize_fx
from torch.quantization import float_qparams_weight_only_qconfig

model_to_quantize = UserModel(...)

model_to_quantize.eval()
# the qconfig below marks a given layer type for weight-only quantization
qconfig_dict = {
    "object_type": [
        (nn.Embedding, float_qparams_weight_only_qconfig),
        #(nn.LSTM, default_dynamic_qconfig),
        #(nn.Linear, default_dynamic_qconfig)
    ]
}
# prepare: fuse modules and insert observers
# (this is the signature at the time of writing; newer PyTorch releases
# also take an example_inputs argument)
model_prepared = quantize_fx.prepare_fx(model_to_quantize, qconfig_dict)
# no calibration needed when we only have dynamic/weight_only quantization
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)

Dynamic Quantization

API example:

import torch
import torch.quantization.quantize_fx as quantize_fx

model_to_quantize = UserModel(...)
model_to_quantize.eval()
# A qconfig_dict like this one is all it takes to get Dynamic Quantization.
# The "" key applies the qconfig to every layer.
qconfig_dict = {"": torch.quantization.default_dynamic_qconfig}
# prepare: fuse modules and insert observers
model_prepared = quantize_fx.prepare_fx(model_to_quantize, qconfig_dict)
# no calibration needed when we only have dynamic/weight_only quantization
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)

Static Quantization

API example:

import torch
import torch.quantization.quantize_fx as quantize_fx

model_to_quantize = UserModel(...)
model_to_quantize.eval()

qconfig_dict = {"": torch.quantization.get_default_qconfig('qnnpack')}
# prepare: fuse modules and insert observers
model_prepared = quantize_fx.prepare_fx(model_to_quantize, qconfig_dict)
# calibrate (details omitted)

# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)

Static Quantization Aware Training (QAT)

API example:

import torch
import torch.quantization.quantize_fx as quantize_fx

model_to_quantize = UserModel(...)
# the key is really the content of the qconfig
qconfig_dict = {"": torch.quantization.get_default_qat_qconfig('qnnpack')}
# the model must be in train mode for QAT
model_to_quantize.train()

# fusion can optionally be done first
# model_fused = quantize_fx.fuse_fx(model_to_quantize)

# prepare
model_prepared = quantize_fx.prepare_qat_fx(model_to_quantize, qconfig_dict)
# training loop (details omitted)

# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)

Other Detail

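PyTorch supports asymmetric uniform quantization, in both per-tensor and per-channel form.
A Quantized Tensor can store its data as int8/uint8/int32 and carries the quantized data together with the scale and zero point.
Quantized inference operators only support 8-bit weights (data_type = qint8) and 8-bit activations (data_type = quint8).
The functions, operations, and operators that this module supports are listed in the official documentation referenced at the top of this post. It seems that only typical CNN and RNN models are supported.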
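
The points above about Quantized Tensors can be tried directly with `torch.quantize_per_tensor` and `torch.quantize_per_channel`. In this sketch the input values, scales, and zero points are arbitrary illustrative choices, picked so that every value fits in the representable range.

```python
import torch

x = torch.tensor([[-1.0, 0.0, 0.5],
                  [1.0, 2.0, 4.0]])

# Per-tensor asymmetric quantization: one scale and zero point for the tensor.
# Stored value: q = clamp(round(x / scale) + zero_point, 0, 255) for quint8.
q_tensor = torch.quantize_per_tensor(x, scale=0.05, zero_point=128,
                                     dtype=torch.quint8)
print(q_tensor.int_repr())      # underlying uint8 data
print(q_tensor.dequantize())    # scale * (q - zero_point), back in float32

# Per-channel quantization: one scale / zero point per slice along `axis`.
scales = torch.tensor([0.01, 0.05])
zero_points = torch.tensor([0, 0])
q_channel = torch.quantize_per_channel(x, scales, zero_points, axis=0,
                                       dtype=torch.qint8)
print(q_channel.int_repr())     # underlying int8 data
print(q_channel.q_per_channel_scales())
```

Row 0 has a smaller value range than row 1, so giving it its own, finer scale is exactly the benefit per-channel quantization is meant to provide.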

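Personal Thoughts

After fusion at the PyTorch level, layers such as Conv and BN should no longer exist as separate layers, so the saved model could in principle be deployed directly. According to the official documentation, however, it can only be used for CPU inference (server-side and mobile), and other hardware does not seem to be supported yet.
To deploy on other hardware, you probably still have to go through some lower-level AI compiler. For GPU inference, for example, you would train the FP32 model as usual, export it through ONNX, and then let TensorRT do the quantization instead of PyTorch.
QuantStub / DeQuantStub act as placeholders. A close look at the examples shows that they have to be declared in the model definition, and they are needed whenever PTQ is involved. They do not actually have to be added by hand in the model definition, though: the QuantWrapper class appears to wrap an existing model with them directly.
I personally find the official documentation a bit disorganized, which is why I wrote this summary. If there are any mistakes, corrections are welcome.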
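
The remark about QuantWrapper can be checked with a short sketch. It assumes `torch.quantization.QuantWrapper` behaves as its name suggests (adding a QuantStub in front of a module and a DeQuantStub behind it); the toy model and random calibration data are made up for illustration.

```python
import torch
import torch.nn as nn

# Assumption: at least one of 'fbgemm' / 'qnnpack' is available in this build.
supported = torch.backends.quantized.supported_engines
backend = next(name for name in ("fbgemm", "qnnpack") if name in supported)
torch.backends.quantized.engine = backend

# A plain float model with no QuantStub / DeQuantStub in its definition.
float_model = nn.Sequential(nn.Conv2d(1, 1, 1), nn.ReLU()).eval()

# QuantWrapper adds the stubs around the wrapped module.
wrapped = torch.quantization.QuantWrapper(float_model)
wrapped.qconfig = torch.quantization.get_default_qconfig(backend)

# Usual static PTQ flow: prepare, calibrate, convert.
prepared = torch.quantization.prepare(wrapped)
prepared(torch.randn(4, 1, 4, 4))  # calibration with random data
model_int8 = torch.quantization.convert(prepared)

output = model_int8(torch.randn(4, 1, 4, 4))
print(model_int8)
```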
