Overview
As of this writing (2023-04-18), the CUDA implementation of QAT is not part of the native PyTorch package (it is not the same thing as PyTorch's built-in QAT, which mainly targets the CPU); you need NVIDIA's third-party package "pytorch-quantization", which requires TensorRT 8+ and PyTorch 1.8+. The main flow is roughly: insert fake-quantization (Q/DQ) layers into the model, calibrate on a small amount of data to obtain the scales (the PTQ step), briefly finetune the calibrated model (the QAT step), export to ONNX with explicit Q/DQ nodes, and build a TensorRT engine from that.

The tools flow in one direction: PyTorch (with pytorch-quantization) → ONNX → TensorRT.
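Condensed into code, the flow looks roughly like the sketch below. This is my own toy illustration, not official sample code; it assumes a CUDA device, and the model, batch shapes and loop counts are placeholders:

import torch
import torch.nn as nn
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn

quant_modules.initialize()   # patch torch.nn so Conv2d/Linear are built as Quant* layers
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Flatten(),
                      nn.Linear(8 * 30 * 30, 10)).cuda()

# PTQ calibration: bypass fake-quant and let every TensorQuantizer record statistics
for m in model.modules():
    if isinstance(m, quant_nn.TensorQuantizer):
        m.disable_quant()
        m.enable_calib()
with torch.no_grad():
    for _ in range(4):
        model(torch.randn(8, 3, 32, 32).cuda())
for m in model.modules():
    if isinstance(m, quant_nn.TensorQuantizer):
        m.enable_quant()
        m.disable_calib()
        m.load_calib_amax()  # turn the collected statistics into scales

# ... the short QAT finetune discussed below would run here ...

# export with explicit ONNX Q/DQ nodes so TensorRT can pick them up
model.eval()
quant_nn.TensorQuantizer.use_fb_fake_quant = True
torch.onnx.export(model, torch.randn(1, 3, 32, 32).cuda(),
                  "model_qat.onnx", opset_version=13)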
So, based on my current understanding plus checking with NVIDIA's own people, only the following operators currently have INT8 implementations:
QuantConv1d, QuantConv2d, QuantConv3d,
QuantConvTranspose1d, QuantConvTranspose2d, QuantConvTranspose3d
QuantLinear
QuantAvgPool1d, QuantAvgPool2d, QuantAvgPool3d,
QuantMaxPool1d, QuantMaxPool2d, QuantMaxPool3d
QuantAdaptiveAvgPool1d, QuantAdaptiveAvgPool2d, QuantAdaptiveAvgPool3d
Clip
QuantLSTM, QuantLSTMCell
To implement any other operator you only need to mimic TensorRT/tools/pytorch-quantization/pytorch_quantization/nn/modules/quant_conv.py at release/8.6 · NVIDIA/TensorRT · GitHub. In essence, as long as PyTorch already supports your fp32 operator, you just wrap a quantization layer around that operator's input and its weight (a sketch of the pattern follows below). You can then calibrate to obtain the quantization parameters, finetune, and export to TensorRT, which automatically splits and fuses the Q/DQ layers; see the Developer Guide :: NVIDIA Deep Learning TensorRT Documentation.
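To make the pattern concrete, here is a minimal sketch written the same way quant_conv.py is. QuantMyLinear is a made-up name for illustration (the toolkit already ships a real QuantLinear); the whole trick is to fake-quantize the input and the weight, then call the ordinary fp32 op:

import torch.nn as nn
import torch.nn.functional as F
from pytorch_quantization import tensor_quant
from pytorch_quantization.nn import TensorQuantizer

class QuantMyLinear(nn.Linear):
    # hypothetical example: nn.Linear with fake-quant on its input and weight
    def __init__(self, in_features, out_features, bias=True):
        super().__init__(in_features, out_features, bias)
        # per-tensor fake-quant for activations, per-row (per-channel) for weights
        self._input_quantizer = TensorQuantizer(tensor_quant.QUANT_DESC_8BIT_PER_TENSOR)
        self._weight_quantizer = TensorQuantizer(tensor_quant.QUANT_DESC_8BIT_LINEAR_WEIGHT_PER_ROW)

    def forward(self, input):
        quant_input = self._input_quantizer(input)
        quant_weight = self._weight_quantizer(self.weight)
        return F.linear(quant_input, quant_weight, bias=self.bias)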
It is best not to change the quantization scales during training, or at least not too frequently, otherwise the model's convergence suffers. In other words: first run a calibration pass (exactly as in PTQ) to obtain the scales and biases, then finetune. The finetune usually lasts about 10% of the original training schedule (YOLOv6 uses 10 epochs), starting at 1% of the initial training learning rate, with a cosine-annealing schedule that follows the decreasing half of a cosine period down to 1% of the initial finetune learning rate (sketched in code below). NVIDIA's original wording: "Do not change quantization representation (scale) during training, at least not too frequently. Changing scale every step, it is effectively like changing data format (e8m7, e5m10, e3m4, et.al) every step, which will easily affect convergence. After calibration is done, Quantization Aware Training is simply select a training schedule and continue training the calibrated model. Usually, it doesn't need to fine tune very long. We usually use around 10% of the original training schedule, starting at 1% of the initial training learning rate, and a cosine annealing learning rate schedule that follows the decreasing half of a cosine period, down to 1% of the initial fine tuning learning rate (0.01% of the initial training learning rate)."
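In PyTorch terms that schedule looks roughly like this; the concrete numbers are placeholders I chose for illustration, not values from the sources above:

import torch

original_lr = 0.01                     # LR of the original fp32 training (placeholder)
original_epochs = 300                  # original schedule length (placeholder)

finetune_epochs = original_epochs // 10           # ~10% of the original schedule
finetune_lr = original_lr * 0.01                  # start at 1% of the original LR

params = [torch.nn.Parameter(torch.zeros(1))]     # stand-in for model.parameters()
optimizer = torch.optim.SGD(params, lr=finetune_lr, momentum=0.9)
# decreasing half of a cosine period, ending at 1% of the finetune LR
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=finetune_epochs, eta_min=finetune_lr * 0.01)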
Official example
1. The main flow
It shows how to build a quantized network (ResNet), run calibration + finetune, and do layer-wise sensitivity analysis; see the quant_resnet50 tutorial linked in the references below.
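One detail from that tutorial worth highlighting: the quantized ResNet is not hand-written. quant_modules.initialize() monkey-patches torch.nn before the model is built, so a stock torchvision model comes out assembled from Quant* layers. Roughly:

from pytorch_quantization import quant_modules
quant_modules.initialize()   # must run before the model is instantiated

import torchvision
model = torchvision.models.resnet50(pretrained=True)   # now built from QuantConv2d etc.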
2. The corresponding network implementations
For each supported operator the toolkit reimplements a quantized version (the Quant* modules listed above).
Custom networks
If you already have a network of your own, you can follow YOLOv6's implementation.
1. The main flow
How to quantize and accelerate YOLOv6 — YOLOv6_docs documentation
2. The main functions
1) Walk the model's modules and replace each supported op with its quantized counterpart. (The set_module helper it calls is sketched right after the function.)
import copy

import torch.nn as nn
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import tensor_quant
from pytorch_quantization.tensor_quant import QuantDescriptor


def quant_model_init(model, device):
    model_ptq = copy.deepcopy(model)
    model_ptq.eval()
    model_ptq.to(device)
    # per-channel max quantization for weights, histogram calibration for inputs
    conv2d_weight_default_desc = tensor_quant.QUANT_DESC_8BIT_CONV2D_WEIGHT_PER_CHANNEL
    conv2d_input_default_desc = QuantDescriptor(num_bits=8, calib_method='histogram')
    convtrans2d_weight_default_desc = tensor_quant.QUANT_DESC_8BIT_CONVTRANSPOSE2D_WEIGHT_PER_CHANNEL
    convtrans2d_input_default_desc = QuantDescriptor(num_bits=8, calib_method='histogram')
    for k, m in model_ptq.named_modules():
        if 'proj_conv' in k:
            # YOLOv6 deliberately keeps this layer in fp32
            print("Skip Layer {}".format(k))
            continue
        if isinstance(m, nn.Conv2d):
            in_channels = m.in_channels
            out_channels = m.out_channels
            kernel_size = m.kernel_size
            stride = m.stride
            padding = m.padding
            quant_conv = quant_nn.QuantConv2d(in_channels,
                                              out_channels,
                                              kernel_size,
                                              stride,
                                              padding,
                                              quant_desc_input=conv2d_input_default_desc,
                                              quant_desc_weight=conv2d_weight_default_desc)
            # carry the fp32 weights/bias over into the quantized replacement
            quant_conv.weight.data.copy_(m.weight.detach())
            if m.bias is not None:
                quant_conv.bias.data.copy_(m.bias.detach())
            else:
                quant_conv.bias = None
            set_module(model_ptq, k, quant_conv)
        elif isinstance(m, nn.ConvTranspose2d):
            in_channels = m.in_channels
            out_channels = m.out_channels
            kernel_size = m.kernel_size
            stride = m.stride
            padding = m.padding
            quant_convtrans = quant_nn.QuantConvTranspose2d(in_channels,
                                                            out_channels,
                                                            kernel_size,
                                                            stride,
                                                            padding,
                                                            quant_desc_input=convtrans2d_input_default_desc,
                                                            quant_desc_weight=convtrans2d_weight_default_desc)
            quant_convtrans.weight.data.copy_(m.weight.detach())
            if m.bias is not None:
                quant_convtrans.bias.data.copy_(m.bias.detach())
            else:
                quant_convtrans.bias = None
            set_module(model_ptq, k, quant_convtrans)
        elif isinstance(m, nn.MaxPool2d):
            kernel_size = m.kernel_size
            stride = m.stride
            padding = m.padding
            dilation = m.dilation
            ceil_mode = m.ceil_mode
            # pass ceil_mode by keyword: the fifth positional slot is return_indices
            quant_maxpool2d = quant_nn.QuantMaxPool2d(kernel_size,
                                                      stride,
                                                      padding,
                                                      dilation,
                                                      ceil_mode=ceil_mode,
                                                      quant_desc_input=conv2d_input_default_desc)
            set_module(model_ptq, k, quant_maxpool2d)
        else:
            # module cannot be quantized, leave it as-is
            continue
    return model_ptq.to(device)
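The function above depends on set_module, a small YOLOv6 helper that swaps a submodule by its dotted name. A minimal equivalent (my own sketch, not the original implementation), plus a hypothetical call:

def set_module(model, submodule_key, module):
    # walk e.g. "backbone.stage1.conv" down to the parent, then replace the child
    tokens = submodule_key.split('.')
    parent = model
    for token in tokens[:-1]:
        parent = getattr(parent, token)
    setattr(parent, tokens[-1], module)

model_ptq = quant_model_init(model, device='cuda')   # 'model' is your fp32 YOLOv6 model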
2) Calibration
Here disable_quant simply switches the fake-quantization off, so the forward pass runs in plain fp32 while the calibrators collect statistics.
from pytorch_quantization import calib
from pytorch_quantization import nn as quant_nn


def collect_stats(model, data_loader, batch_number, device='cuda'):
    """Feed data to the network and collect statistics."""
    # Enable the calibrators and bypass fake-quant, so statistics come from fp32 data
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.disable_quant()
                module.enable_calib()
            else:
                module.disable()
    for i, data_tuple in enumerate(data_loader):
        image = data_tuple[0]
        image = image.float() / 255.0
        model(image.to(device))
        if i + 1 >= batch_number:
            break
    # Disable the calibrators and switch fake-quant back on
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.enable_quant()
                module.disable_calib()
            else:
                module.enable()


def compute_amax(model, **kwargs):
    # Load the calibration result (amax) into every quantizer
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            print(F"{name:40}: {module}")
            if module._calibrator is not None:
                if isinstance(module._calibrator, calib.MaxCalibrator):
                    module.load_calib_amax()
                else:
                    # histogram calibrators take kwargs such as method='percentile'
                    module.load_calib_amax(**kwargs)
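A hypothetical driver that ties the two functions together; model_ptq and the data loader come from the previous step, and the keyword arguments are only consumed by histogram calibrators (max calibrators need none):

import torch

with torch.no_grad():
    collect_stats(model_ptq, train_loader, batch_number=64)
    compute_amax(model_ptq, method='percentile', percentile=99.99)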
References
PyTorch built-in quantization: Practical Quantization in PyTorch | PyTorch
NVIDIA TRT8 quantization: Quantization Aware Training in PyTorch with TensorRT 8.0 | GTC Digital April 2021 | NVIDIA On-Demand
https://www.cnblogs.com/wujianming-110117/p/16015708.html
pytorch-quantization docs: https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/tutorials/quant_resnet50.html
pytorch-quantization source: TensorRT/tools/pytorch-quantization at release/8.6 · NVIDIA/TensorRT · GitHub
NVIDIA's quantization paper: https://arxiv.org/pdf/2004.09602.pdf
YOLOv6 quantization: How to quantize and accelerate YOLOv6 — YOLOv6_docs documentation