A deep dive into the PyTorch quantization APIs: eager mode and FX mode

I. Definitions

  1. API summary
  2. Quantization modes explained

II. Implementation

  1. API summary
    1. PyTorch provides three different quantization modes: Eager Mode Quantization, FX Graph Mode Quantization (in maintenance mode), and PyTorch 2 Export Quantization.
    2. Eager Mode Quantization is a beta feature. The user needs to do fusion and specify where quantization and dequantization happen manually, and it only supports modules, not functionals.
    3. FX Graph Mode Quantization is an automated quantization workflow in PyTorch. It is currently a prototype feature and is in maintenance mode now that PyTorch 2 Export Quantization exists. It improves on Eager Mode Quantization by adding support for functionals and by automating the quantization process, although the model may need to be refactored to make it compatible with FX Graph Mode Quantization (symbolically traceable with torch.fx). Note that FX Graph Mode Quantization is not expected to work on arbitrary models, because a model may not be symbolically traceable. It is being integrated into domain libraries such as torchvision, so users will be able to quantize models similar to those in the supported domain libraries out of the box. For arbitrary models, general guidelines are provided, but to actually make it work the user may need to be familiar with torch.fx, in particular with how to make a model symbolically traceable.
    4. PyTorch 2 Export Quantization is the new full-graph-mode quantization workflow, released as a prototype feature in PyTorch 2.1. With PyTorch 2 the project is moving to a better full-program-capture solution (torch.export), since it captures a higher percentage of models (88.8% on a 14K-model benchmark) than torch.fx.symbolic_trace, the program-capture solution used by FX Graph Mode Quantization (72.7% on the same 14K models). torch.export still has limitations around certain Python constructs and requires user involvement to support dynamism in the exported model, but overall it is an improvement over the previous program-capture solutions. PyTorch 2 Export Quantization is built for models captured by torch.export, and is designed with flexibility and productivity for both modeling users and backend developers in mind. Its main features are: (1) a programmable API for configuring how a model is quantized, which can scale to many more use cases; (2) a simplified user experience for modeling users and backend developers, who only need to interact with a single object (the Quantizer) to express how the model should be quantized and what the backend supports; (3) an optional reference quantized model representation that expresses quantized computation with integer operations, closer to the quantized computation that actually happens in hardware.
    5. New users of quantization are encouraged to try PyTorch 2 Export Quantization first; if it does not work well, they can fall back to Eager Mode Quantization.
    The table below compares Eager Mode Quantization, FX Graph Mode Quantization, and PyTorch 2 Export Quantization:
    [Comparison table shown as an image in the original post]
    Three types of quantization are supported:

    1. Dynamic quantization: once training is finished the weight values are fixed, so the quantization parameters for the weights are fixed as well; the scale factors for the activations, however, are computed dynamically for each input. (post-training quantization)
    2. Static quantization: a statically quantized model goes through a calibration step before use. A batch of representative inputs is prepared (for an image classification model, some images; similarly for other tasks) and run through the prepared model, and the activation scale factors are adjusted according to the distribution of that data; see the sketch after this list. (post-training quantization)
    3. Static quantization-aware training (QAT): fake quantization is inserted directly into the training process, which removes the need for a separate post-training calibration step. (quantization during training)
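    As an illustration of what calibration computes, here is a minimal sketch (assumptions: affine/asymmetric quantization to quint8 and a simple min/max observer; PyTorch's built-in observers differ in details) of how a scale and zero point are derived from the observed range of a tensor:

import torch

# pretend these are activation values seen while running calibration data
x = torch.randn(100) * 3
xmin, xmax = x.min().item(), x.max().item()
xmin, xmax = min(xmin, 0.0), max(xmax, 0.0)   # make sure 0 is exactly representable

qmin, qmax = 0, 255                           # quint8 range
scale = (xmax - xmin) / (qmax - qmin)
zero_point = int(round(qmin - xmin / scale))
zero_point = max(qmin, min(qmax, zero_point))

# quantize / dequantize round trip
xq = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
x_hat = (xq - zero_point) * scale
print(scale, zero_point, (x - x_hat).abs().max())   # worst-case rounding error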
  2. Quantization modes explained
    The quantization modes are: eager mode, FX mode, and PyTorch 2 (export) mode.
    Quantization in eager mode
    1. PTQ: dynamic quantization

import torch
# define a floating point model
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 4)

    def forward(self, x):
        x = self.fc(x)
        return x

# create a model instance
model_fp32 = M()
# create a quantized model instance
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32,  # the original model
    {torch.nn.Linear},  # the set of layer types to dynamically quantize
    dtype=torch.qint8)  # the target dtype for quantized weights

# run the model
input_fp32 = torch.randn(4, 4, 4, 4)
res = model_int8(input_fp32)
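A quick sanity check (illustrative, not part of the official snippet): after quantize_dynamic, the Linear submodule has been swapped for a dynamically quantized version, and its output should stay close to the fp32 output since only the weights are stored in int8.

print(model_int8.fc)                   # a dynamically quantized Linear module
res_fp32 = model_fp32(input_fp32)      # the original fp32 model is left untouched
print((res - res_fp32).abs().max())    # error introduced by int8 weights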

2. PTQ: static quantization

import torch

# define a floating point model where some layers could be statically quantized
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        # manually specify where tensors will be converted from floating
        # point to quantized in the quantized model
        x = self.quant(x)
        x = self.conv(x)
        x = self.relu(x)
        # manually specify where tensors will be converted from quantized
        # to floating point in the quantized model
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to eval mode for static quantization logic to work
model_fp32.eval()


# attach a global qconfig, which contains information about what kind of
# observers to attach; 'x86' is the recommended default for server inference
model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('x86')

# fuse conv + relu; in eager mode fusion must be specified manually
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32, [['conv', 'relu']])

# prepare the model: this inserts observers that will record activation
# statistics during calibration
model_fp32_prepared = torch.ao.quantization.prepare(model_fp32_fused)

# calibrate the prepared model to determine quantization parameters for activations
# in a real world setting, the calibration would be done with a representative dataset
input_fp32 = torch.randn(4, 1, 4, 4)
model_fp32_prepared(input_fp32)

# Convert the observed model to a quantized model. This does several things:
# quantizes the weights, computes and stores the scale and bias value to be
# used with each activation tensor, and replaces key operators with quantized
# implementations.
model_int8 = torch.ao.quantization.convert(model_fp32_prepared)    # quantize the calibrated model

# run the model, relevant calculations will happen in int8
res = model_int8(input_fp32)
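An illustrative follow-up (assuming the fused module kept the attribute name 'conv'): after convert(), the fused Conv+ReLU is replaced by a quantized implementation that carries the output scale and zero point determined during calibration.

print(model_int8.conv)                                    # quantized, fused Conv+ReLU module
print(model_int8.conv.scale, model_int8.conv.zero_point)  # activation quantization parameters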

3. QAT: quantization-aware training

import torch

# define a floating point model where some layers could benefit from QAT
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.bn = torch.nn.BatchNorm2d(1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# model must be set to eval for fusion to work
model_fp32.eval()

# attach a global qconfig, which contains information about what kind
# of observers to attach. Use 'x86' for server inference and 'qnnpack'
# for mobile inference. Other quantization configurations such as selecting
# symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
# can be specified here.
# Note: the old 'fbgemm' is still available but 'x86' is the recommended default
# for server inference.
# model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
model_fp32.qconfig = torch.ao.quantization.get_default_qat_qconfig('x86')

# fuse the activations to preceding layers, where applicable
# this needs to be done manually depending on the model architecture
model_fp32_fused = torch.ao.quantization.fuse_modules(model_fp32,
    [['conv', 'bn', 'relu']])

# Prepare the model for QAT. This inserts observers and fake_quant modules that
# will observe weight and activation tensors during training/calibration.
# The model needs to be in train mode for the QAT logic to work.
model_fp32_prepared = torch.ao.quantization.prepare_qat(model_fp32_fused.train())

# run the training loop (not shown); train_one_epoch, evaluate, criterion,
# optimizer and the data loaders come from the full QAT tutorial (see the sketch
# below for what train_one_epoch roughly looks like)
num_train_batches = 20

# QAT takes time and one needs to train over a few epochs.
# Train and check accuracy after each epoch
for nepoch in range(8):
    train_one_epoch(model_fp32_prepared, criterion, optimizer, data_loader, torch.device('cpu'), num_train_batches)
    if nepoch > 3:
        # Freeze quantizer parameters
        model_fp32_prepared.apply(torch.ao.quantization.disable_observer)
    if nepoch > 2:
        # Freeze batch norm mean and variance estimates
        model_fp32_prepared.apply(torch.nn.intrinsic.qat.freeze_bn_stats)

    # Check the accuracy after each epoch
    quantized_model = torch.ao.quantization.convert(model_fp32_prepared.eval(), inplace=False)
    quantized_model.eval()
    top1, top5 = evaluate(quantized_model, criterion, data_loader_test, neval_batches=num_eval_batches)
    print('Epoch %d: Evaluation accuracy on %d images, %2.2f' % (nepoch, num_eval_batches * eval_batch_size, top1.avg))
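A minimal sketch of the train_one_epoch helper assumed above, for a standard classification setup (illustrative only; the tutorial's version also tracks loss and accuracy meters):

def train_one_epoch(model, criterion, optimizer, data_loader, device, num_batches):
    # fine-tune the QAT-prepared model for a limited number of batches
    model.train()
    for i, (image, target) in enumerate(data_loader):
        if i >= num_batches:
            break
        image, target = image.to(device), target.to(device)
        loss = criterion(model(image), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()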

Quantization in FX mode (advantage: operator fusion and quantization are automated)
1. PTQ: static quantization

import torch
from torch.ao.quantization import get_default_qconfig
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
from torch.ao.quantization import QConfigMapping
# float_model is the fp32 model to be quantized (defined elsewhere in the tutorial)
float_model.eval()
# The old 'fbgemm' is still available but 'x86' is the recommended default.
qconfig = get_default_qconfig("x86")
qconfig_mapping = QConfigMapping().set_global(qconfig)
def calibrate(model, data_loader):
    model.eval()
    with torch.no_grad():
        for image, target in data_loader:
            model(image)
example_inputs = (next(iter(data_loader))[0],)  # get an example input (as a tuple)
prepared_model = prepare_fx(float_model, qconfig_mapping, example_inputs)  # fuse modules and insert observers
calibrate(prepared_model, data_loader_test)  # run calibration on sample data; the observers record activation statistics
quantized_model = convert_fx(prepared_model)  # convert the calibrated model to a quantized model

For details see: https://pytorch.org/tutorials/prototype/fx_graph_mode_ptq_static.html
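Besides set_global, QConfigMapping also lets you target specific module types or named submodules; a small sketch (the submodule name "head" is hypothetical, and passing None disables quantization for that module):

qconfig_mapping = (
    QConfigMapping()
    .set_global(qconfig)                        # default for everything
    .set_object_type(torch.nn.Linear, qconfig)  # per module type
    .set_module_name("head", None)              # leave the hypothetical "head" submodule unquantized
)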
2. PTQ: dynamic quantization

import torch
from torch.ao.quantization import default_dynamic_qconfig, QConfigMapping
# Note that this is temporary, we'll expose these functions to torch.ao.quantization after the official release
from torch.quantization.quantize_fx import prepare_fx, convert_fx

float_model.eval()
qconfig = default_dynamic_qconfig  # dynamic qconfig: weights quantized, activations quantized dynamically at runtime
qconfig_mapping = QConfigMapping().set_global(qconfig)
# example_inputs: a tuple of example inputs for the model, as in the static example above
prepared_model = prepare_fx(float_model, qconfig_mapping, example_inputs)  # fuse modules and insert observers
# no calibration is required for dynamic quantization
quantized_model = convert_fx(prepared_model)  # convert the model to a dynamically quantized model
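An optional check (illustrative): the converted model is a torch.fx.GraphModule, so printing its generated code shows which ops were swapped for dynamically quantized versions.

print(quantized_model.code)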

Quantization in PyTorch 2 (export) mode
1. PTQ: post-training quantization

import torch

from torch._export import capture_pre_autograd_graph
class M(torch.nn.Module):
   def __init__(self):
      super().__init__()
      self.linear = torch.nn.Linear(5, 10)

   def forward(self, x):
      return self.linear(x)


example_inputs = (torch.randn(1, 5),)
m = M().eval()

# Step 1. program capture
# NOTE: this API will be updated to torch.export API in the future, but the captured
# result shoud mostly stay the same
m = capture_pre_autograd_graph(m, *example_inputs)         #获取动态图
# we get a model with aten ops


# Step 2. quantization
from torch.ao.quantization.quantize_pt2e import (
  prepare_pt2e,
  convert_pt2e,
)

from torch.ao.quantization.quantizer.xnnpack_quantizer import (
  XNNPACKQuantizer,
  get_symmetric_quantization_config,
)
# backend developers will write their own Quantizer and expose methods that let
# users express how they want the model to be quantized;
# here we use the XNNPACK quantizer with a symmetric int8 configuration
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
m = prepare_pt2e(m, quantizer)

# calibration: run representative data through the prepared model so the observers
# can record activation statistics (in a real setting this would iterate over a
# calibration dataset)
def calibrate(model, data_loader):
    with torch.no_grad():
        for image, target in data_loader:
            model(image)

# for this toy example we simply feed the example inputs once
with torch.no_grad():
    m(*example_inputs)

m = convert_pt2e(m)
# we have a model with aten ops doing integer computations when possible
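# Optional check (illustrative, not from the tutorial): the converted model is a
# torch.fx.GraphModule whose graph now contains explicit quantize/dequantize ops
# around the integer compute; listing the call targets makes this visible.
print([str(n.target) for n in m.graph.nodes if "quantize_per_tensor" in str(n.target)])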


######################## Lowering to C++ with TorchInductor (cpp wrapper)
import torch._inductor.config as config
config.cpp_wrapper = True

with torch.no_grad():
    optimized_model = torch.compile(m)

    # Running some benchmark
    optimized_model(*example_inputs)


    res = optimized_model(example_inputs[0])
    print(res)

    # tensor([[0.0312, 0.0998, -0.7920, 0.0748, 0.7982, 0.1808, 0.4365, 0.0998,
    #          0.5800, 0.4428]])

2. QAT: quantization-aware training

# simplified basic steps
import torch
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import (
  prepare_qat_pt2e,
  convert_pt2e,
)
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
  XNNPACKQuantizer,
  get_symmetric_quantization_config,
)

class M(torch.nn.Module):
   def __init__(self):
      super().__init__()
      self.linear = torch.nn.Linear(5, 10)

   def forward(self, x):
      return self.linear(x)


example_inputs = (torch.randn(1, 5),)
m = M()

# Step 1. program capture
# NOTE: this API will be updated to torch.export API in the future, but the captured
# result should mostly stay the same
m = capture_pre_autograd_graph(m, *example_inputs)
# we get a model with aten ops

# Step 2. quantization-aware training
# backend developer will write their own Quantizer and expose methods to allow
# users to express how they want the model to be quantized
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
m = prepare_qat_pt2e(m, quantizer)

# train omitted

m = convert_pt2e(m)
# we have a model with aten ops doing integer computations when possible

# move the quantized model to eval mode, equivalent to `m.eval()`
torch.ao.quantization.move_exported_model_to_eval(m)

For details see: https://pytorch.org/tutorials/prototype/pt2e_quant_qat.html
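The "# train omitted" step above, i.e. the fine-tuning that happens between prepare_qat_pt2e and convert_pt2e, could look roughly like the sketch below; the optimizer, loss function and random data are placeholders for illustration, and a real run would use the task's actual training loop. Note that an exported model should not be toggled with .train()/.eval() directly; the dedicated helpers such as move_exported_model_to_eval are used instead.

# hypothetical minimal fine-tuning loop for the prepared QAT model `m`
# (placed between prepare_qat_pt2e and convert_pt2e; data, loss and optimizer are placeholders)
optimizer = torch.optim.SGD(m.parameters(), lr=1e-2)
loss_fn = torch.nn.MSELoss()
for _ in range(10):
    x = torch.randn(1, 5)            # placeholder batch
    target = torch.randn(1, 10)      # placeholder target
    optimizer.zero_grad()
    loss = loss_fn(m(x), target)     # fake-quant ops are active during training
    loss.backward()
    optimizer.step()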
