Model Quantization: NVIDIA QAT

Overview

    As of this writing (2023-04-18), the CUDA implementation of QAT is not part of the native PyTorch package (this is not the same as PyTorch's built-in QAT, which mainly targets CPU); it requires NVIDIA's third-party package "pytorch-quantization", plus TensorRT 8+ and PyTorch 1.8+. The main workflow is: insert quantized modules into the model, calibrate to obtain quantization parameters, fine-tune, then export to TensorRT. (The original workflow and tool-chain diagrams are not reproduced here.)

    Based on my current understanding, confirmed with NVIDIA staff, only the following operators currently run in INT8:

QuantConv1d, QuantConv2d, QuantConv3d,

QuantConvTranspose1d, QuantConvTranspose2d, QuantConvTranspose3d

QuantLinear

QuantAvgPool1d, QuantAvgPool2d, QuantAvgPool3d,

QuantMaxPool1d, QuantMaxPool2d, QuantMaxPool3d

QuantAdaptiveAvgPool1d, QuantAdaptiveAvgPool2d, QuantAdaptiveAvgPool3d

Clip

QuantLSTM, QuantLSTMCell

    To support another operator, you only need to follow the pattern in TensorRT/tools/pytorch-quantization/pytorch_quantization/nn/modules/quant_conv.py at release/8.6 · NVIDIA/TensorRT · GitHub: as long as PyTorch already provides the FP32 operator, you simply wrap a quantization layer around the operator's input and weight. You can then run calibration to obtain the quantization parameters, continue training, and export to TensorRT, which automatically splits and fuses the Q/DQ layers; see Developer Guide :: NVIDIA Deep Learning TensorRT Documentation.
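Numerically, that wrapping amounts to symmetric fake quantization (a Q/DQ pair): quantize to int8 with a scale derived from the calibrated amax, then immediately dequantize. A minimal pure-Python sketch of the math (an illustration, not the library's actual implementation; pytorch-quantization applies this element-wise to tensors):

```python
def fake_quantize(x, amax, num_bits=8):
    """Symmetric fake quantization (a Q/DQ pair) for one value.

    amax is the calibrated absolute maximum of the tensor; the scale
    maps it onto the largest representable integer (127 for int8).
    """
    qmax = 2 ** (num_bits - 1) - 1      # 127 for int8
    scale = amax / qmax
    q = round(x / scale)                # quantize (round-half-to-even)
    q = max(-qmax - 1, min(qmax, q))    # clamp to [-128, 127]
    return q * scale                    # dequantize
```

Values outside the calibrated range saturate, which is why the choice of amax (max vs. histogram calibration in the YOLOv6 code later in this post) matters.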

    It is recommended not to change the quantization scales during training, or at least not too frequently; otherwise convergence suffers. In practice, first run a calibration pass (as in PTQ) to obtain the scales, then fine-tune. Fine-tuning usually needs only about 10% of the original training schedule (YOLOv6 uses 10 epochs), starting at 1% of the initial training learning rate with a cosine schedule that follows the decreasing half of a cosine period. Original text: "Do not change quantization representation (scale) during training, at least not too frequently. Changing scale every step, it is effectively like changing data format (e8m7, e5m10, e3m4, et.al) every step, which will easily affect convergence. After calibration is done, Quantization Aware Training is simply select a training schedule and continue training the calibrated model. Usually, it doesn't need to fine tune very long. We usually use around 10% of the original training schedule, starting at 1% of the initial training learning rate, and a cosine annealing learning rate schedule that follows the decreasing half of a cosine period, down to 1% of the initial fine tuning learning rate (0.01% of the initial training learning rate)."
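The recommended schedule (start at 1% of the initial training LR, follow the decreasing half of a cosine down to 1% of that starting LR) can be sketched as follows; the function name and parameter defaults are mine, only the fractions come from the recommendation:

```python
import math

def qat_finetune_lr(step, total_steps, train_base_lr,
                    start_frac=0.01, end_frac=0.01):
    """Cosine schedule for QAT fine-tuning (decreasing half-period).

    Starts at start_frac * train_base_lr and decays to end_frac of
    that starting LR, i.e. 0.01% of train_base_lr with the defaults.
    """
    lr_start = train_base_lr * start_frac
    lr_end = lr_start * end_frac
    # 0.5 * (1 + cos) sweeps from 1 down to 0 over [0, total_steps]
    factor = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_end + (lr_start - lr_end) * factor
```

For a model originally trained at LR 0.1, fine-tuning would start near 1e-3 and end near 1e-5.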

    Official example

    1. Main workflow

    It implements building the quantized network (ResNet), calibration + fine-tuning, and layer sensitivity analysis; see this file:

TensorRT/tools/pytorch-quantization/examples/torchvision/classification_flow.py at 96e23978cd6e4a8fe869696d3d8ec2b47120629b · NVIDIA/TensorRT · GitHub

    2. Network implementation

    For the supported operators, quantized versions of the network modules are reimplemented:

TensorRT/tools/pytorch-quantization/examples/torchvision/models/classification/resnet.py at 96e23978cd6e4a8fe869696d3d8ec2b47120629b · NVIDIA/TensorRT · GitHub

    Custom networks

    If you already have your own network, you can refer to the YOLOv6 implementation.

    1. Main workflow

How to quantize and speed up YOLOv6 (YOLOv6 docs)

    2. Key functions

    1) Loop over the model's modules and replace each with its quantized counterpart.

YOLOv6/tools/partial_quantization/ptq.py at 6b9f5f4ea3185496b5f62a934c3c8f2d095c0318 · meituan/YOLOv6 · GitHub

import copy

import torch.nn as nn
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import tensor_quant
from pytorch_quantization.tensor_quant import QuantDescriptor
# set_module(model, name, new_module) is YOLOv6's helper that replaces a
# submodule by its dotted name

def quant_model_init(model, device):

    model_ptq = copy.deepcopy(model)
    model_ptq.eval()
    model_ptq.to(device)
    conv2d_weight_default_desc = tensor_quant.QUANT_DESC_8BIT_CONV2D_WEIGHT_PER_CHANNEL
    conv2d_input_default_desc = QuantDescriptor(num_bits=8, calib_method='histogram')

    convtrans2d_weight_default_desc = tensor_quant.QUANT_DESC_8BIT_CONVTRANSPOSE2D_WEIGHT_PER_CHANNEL
    convtrans2d_input_default_desc = QuantDescriptor(num_bits=8, calib_method='histogram')

    for k, m in model_ptq.named_modules():
        if 'proj_conv' in k:
            print("Skip Layer {}".format(k))
            continue

        if isinstance(m, nn.Conv2d):
            in_channels = m.in_channels
            out_channels = m.out_channels
            kernel_size = m.kernel_size
            stride = m.stride
            padding = m.padding
            quant_conv = quant_nn.QuantConv2d(in_channels,
                                              out_channels,
                                              kernel_size,
                                              stride,
                                              padding,
                                              quant_desc_input = conv2d_input_default_desc,
                                              quant_desc_weight = conv2d_weight_default_desc)
            quant_conv.weight.data.copy_(m.weight.detach())
            if m.bias is not None:
                quant_conv.bias.data.copy_(m.bias.detach())
            else:
                quant_conv.bias = None
            set_module(model_ptq, k, quant_conv)
        elif isinstance(m, nn.ConvTranspose2d):
            in_channels = m.in_channels
            out_channels = m.out_channels
            kernel_size = m.kernel_size
            stride = m.stride
            padding = m.padding
            quant_convtrans = quant_nn.QuantConvTranspose2d(in_channels,
                                                       out_channels,
                                                       kernel_size,
                                                       stride,
                                                       padding,
                                                       quant_desc_input = convtrans2d_input_default_desc,
                                                       quant_desc_weight = convtrans2d_weight_default_desc)
            quant_convtrans.weight.data.copy_(m.weight.detach())
            if m.bias is not None:
                quant_convtrans.bias.data.copy_(m.bias.detach())
            else:
                quant_convtrans.bias = None
            set_module(model_ptq, k, quant_convtrans)
        elif isinstance(m, nn.MaxPool2d):
            kernel_size = m.kernel_size
            stride = m.stride
            padding = m.padding
            dilation = m.dilation
            ceil_mode = m.ceil_mode
            quant_maxpool2d = quant_nn.QuantMaxPool2d(kernel_size,
                                                      stride,
                                                      padding,
                                                      dilation,
                                                      ceil_mode,
                                                      quant_desc_input = conv2d_input_default_desc)
            set_module(model_ptq, k, quant_maxpool2d)
        else:
            # module can not be quantized, continue
            continue

    return model_ptq.to(device) 
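Note that the weight descriptor above is per-channel (QUANT_DESC_8BIT_CONV2D_WEIGHT_PER_CHANNEL): each output channel of the conv weight gets its own amax and therefore its own scale. A toy illustration of that scale computation (plain Python with channels given as value lists; not the library's code):

```python
def per_channel_weight_scales(weight, num_bits=8):
    """Compute one int8 scale per output channel.

    weight: list of per-output-channel value lists (a real conv weight
    would be reduced over every axis except the output-channel axis).
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127
    scales = []
    for channel_values in weight:
        amax = max(abs(v) for v in channel_values)
        scales.append(amax / qmax)
    return scales
```

A channel with small weights then keeps fine resolution instead of being crushed by one large channel, which is why per-channel quantization is the usual default for conv weights.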

    2) Calibration

YOLOv6/tools/partial_quantization/ptq.py at 6b9f5f4ea3185496b5f62a934c3c8f2d095c0318 · meituan/YOLOv6 · GitHub

Here disable_quant simply switches quantization off, so the layer runs in plain FP32 while the calibrator collects statistics.

from pytorch_quantization import calib
from pytorch_quantization import nn as quant_nn

def collect_stats(model, data_loader, batch_number, device='cuda'):
    """Feed data to the network and collect statistic"""

    # Enable calibrators
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.disable_quant()
                module.enable_calib()
            else:
                module.disable()

    for i, data_tuple in enumerate(data_loader):
        image = data_tuple[0]
        image = image.float()/255.0
        model(image.to(device))
        if i + 1 >= batch_number:
            break

    # Disable calibrators
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.enable_quant()
                module.disable_calib()
            else:
                module.enable()


def compute_amax(model, **kwargs):
    # Load calib result
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            print(F"{name:40}: {module}")
            if module._calibrator is not None:
                if isinstance(module._calibrator, calib.MaxCalibrator):
                    module.load_calib_amax()
                else:
                    module.load_calib_amax(**kwargs)
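Conceptually, the enable_calib/disable_quant dance above reduces to: run FP32 batches through a statistic collector, then freeze the statistic into a scale. A toy max-calibrator sketch (a hypothetical class, not pytorch-quantization's actual MaxCalibrator):

```python
class ToyMaxCalibrator:
    """Toy stand-in for the collect -> load_calib_amax flow above."""

    def __init__(self):
        self.amax = 0.0

    def collect(self, values):
        # Called per calibration batch while quantization is disabled;
        # only the running absolute maximum is kept.
        self.amax = max(self.amax, max(abs(v) for v in values))

    def load_calib_amax(self, num_bits=8):
        # Freeze the statistic into an int8 scale for QAT / export.
        return self.amax / (2 ** (num_bits - 1) - 1)
```

Histogram calibrators differ only in the statistic: they keep a histogram and pick an amax that clips outliers instead of taking the raw maximum.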

References

PyTorch built-in quantization: Practical Quantization in PyTorch | PyTorch

NVIDIA TRT8 quantization: Quantization Aware Training in PyTorch with TensorRT 8.0 | GTC Digital April 2021 | NVIDIA On-Demand

https://www.cnblogs.com/wujianming-110117/p/16015708.html

pytorch-quantization docs: https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/tutorials/quant_resnet50.html

pytorch-quantization repo: TensorRT/tools/pytorch-quantization at release/8.6 · NVIDIA/TensorRT · GitHub

NVIDIA quantization paper: https://arxiv.org/pdf/2004.09602.pdf

https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/userguide.html#some-recommendations

YOLOv6 quantization: How to quantize and speed up YOLOv6 (YOLOv6 docs)
