TensorRT Model Optimizer: Quantization and Model Export

Luchang-Li

Last modified: 2024-08-14 16:31:18


First published: 2024-08-07 17:38:45

Article link: https://blog.csdn.net/u013701860/article/details/140997386


References

Accelerate Generative AI Inference Performance with NVIDIA TensorRT Model Optimizer, Now Publicly Available | NVIDIA Technical Blog

https://github.com/NVIDIA/TensorRT-Model-Optimizer

https://nvidia.github.io/TensorRT-Model-Optimizer/index.html

Installation

pip install "nvidia-modelopt[all]" --extra-index-url https://pypi.nvidia.com

ResNet-50 experiment

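Start with a simple ResNet experiment to get the whole pipeline working: INT8 quantization, export to an ONNX model, and conversion to TensorRT.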

from transformers import AutoImageProcessor, ResNetForImageClassification
import torch
from diffusers.utils import load_image
import modelopt.torch.quantization as mtq

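# Model source: https://huggingface.co/microsoft/resnet-50
# "resnet_50" is assumed here to be a local directory holding a downloaded copy of that model
processor = AutoImageProcessor.from_pretrained("resnet_50")
model = ResNetForImageClassification.from_pretrained("resnet_50")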

img_url = "cat1.jpg"

image = load_image(img_url).resize((512, 512))
inputs = processor(image, return_tensors="pt")

pixel_values = inputs["pixel_values"]
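# Calibration data: a single preprocessed image batch
data_loader = [pixel_values]

def forward_loop(model):
    # Run the calibration data through the model so the quantizers can collect ranges
    for batch in data_loader:
        model(batch)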

# Alternative config: mtq.INT8_SMOOTHQUANT_CFG
# Quantize the model and run calibration (post-training quantization, PTQ)
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

torch.onnx.export(model,  # model being run
                  (pixel_values,),  # model inputs as a tuple (one entry per input)
                  "resnet50_quant.onnx",   # where to save the model (can be a file or file-like object)
                  export_params=True,        # store the trained parameter weights inside the model file
                  opset_version=15,          # the ONNX opset version to export with
                  do_constant_folding=True,  # whether to execute constant folding for optimization
                  input_names=['pixel_values'],   # the model's input names
                  output_names=['output'],  # the model's output names
                  dynamic_axes={'pixel_values': {0: 'batch'}},
                  )

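The core flow is: define a calibration function that runs inference on the calibration data, choose a quantization config, and pass both to quantize, which inserts quantization nodes into the model. Finally, export the ONNX model with the quantization nodes inserted and convert it with TensorRT's trtexec.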
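To make concrete what "calibrate, then insert quantization nodes" means numerically, here is a minimal pure-Python sketch of per-tensor symmetric INT8 quantization with max calibration. It is an illustration of the general idea only; the helper names `calibrate_amax` and `fake_quantize` are hypothetical and are not part of modelopt, whose actual calibrators may differ.

```python
# Illustrative only: per-tensor symmetric INT8 "fake quantization" with max calibration.
# The helper names below are hypothetical, not modelopt APIs.

def calibrate_amax(batches):
    """Return the largest absolute value seen across all calibration batches."""
    return max(abs(x) for batch in batches for x in batch)

def fake_quantize(values, amax, num_bits=8):
    """Quantize to signed integers, clamp, then dequantize back to float."""
    qmax = 2 ** (num_bits - 1) - 1      # 127 for INT8
    scale = amax / qmax                 # float value represented by one integer step
    out = []
    for x in values:
        q = round(x / scale)            # quantize
        q = max(-qmax, min(qmax, q))    # clamp to the representable range
        out.append(q * scale)           # dequantize
    return out

# "Calibration": observe activations and record their range
calib_batches = [[0.1, -0.5, 2.0], [1.2, -2.54, 0.3]]
amax = calibrate_amax(calib_batches)

# Values outside the calibrated range saturate at +/- amax
dequantized = fake_quantize([0.1, 1.0, -3.0], amax)
print(amax, dequantized)
```

The quantize/dequantize pair is what ends up in the exported ONNX graph as Q/DQ nodes; TensorRT then decides which layers actually run in INT8.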

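Convert the model with trtexec and check the performance summary it reports. trt-engine-explorer can also be used to inspect the structure of the TensorRT engine and the per-layer performance.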
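The post does not show the exact trtexec invocation, so the following is only a sketch of one plausible command, built in Python so it can be reused for both the INT8 and FP16 runs. The engine file name and the 1x3x224x224 input shape are assumptions (the shape has to be given because the batch axis was exported as dynamic); `build_trtexec_cmd` is a hypothetical helper.

```python
def build_trtexec_cmd(onnx_path, engine_path, input_name="pixel_values",
                      shape=(1, 3, 224, 224), int8=True):
    """Assemble a trtexec command line as a list of arguments."""
    shape_str = "x".join(str(d) for d in shape)
    cmd = [
        "trtexec",
        f"--onnx={onnx_path}",                  # ONNX model to convert
        f"--saveEngine={engine_path}",          # where to write the built engine
        f"--shapes={input_name}:{shape_str}",   # concrete shape for the dynamic batch axis (assumed)
        "--fp16",                               # allow FP16 kernels
    ]
    if int8:
        cmd.append("--int8")                    # also allow INT8 kernels for the Q/DQ model
    return cmd

# File names below are placeholders, not taken from the original post
int8_cmd = build_trtexec_cmd("resnet50_quant.onnx", "resnet50_quant.engine")
fp16_cmd = build_trtexec_cmd("resnet50.onnx", "resnet50_fp16.engine", int8=False)
print(" ".join(int8_cmd))
print(" ".join(fp16_cmd))
```

The printed command can be pasted into a shell, or passed to `subprocess.run(int8_cmd, check=True)` on a machine where trtexec is on the PATH.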

Testing the INT8 model above on a V100:

Latency: min = 1.51245 ms, max = 1.69751 ms, mean = 1.54279 ms

FP16 model:

Latency: min = 0.972412 ms, max = 1.14563 ms, mean = 0.993796 ms
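To put the two measurements side by side, the mean latency can be pulled out of each trtexec summary line and compared. This is a small sketch; the regular expression is written against the two lines quoted above and is an assumption about the log format in general.

```python
import re

# Matches the "mean = <number> ms" part of a trtexec latency summary line
LATENCY_RE = re.compile(r"mean = ([0-9.]+) ms")

def mean_latency_ms(line):
    """Extract the mean latency in milliseconds from a summary line."""
    match = LATENCY_RE.search(line)
    if match is None:
        raise ValueError(f"no mean latency found in: {line!r}")
    return float(match.group(1))

int8_line = "Latency: min = 1.51245 ms, max = 1.69751 ms, mean = 1.54279 ms"
fp16_line = "Latency: min = 0.972412 ms, max = 1.14563 ms, mean = 0.993796 ms"

int8_mean = mean_latency_ms(int8_line)
fp16_mean = mean_latency_ms(fp16_line)
slowdown = int8_mean / fp16_mean

print(f"INT8 mean latency is {slowdown:.2f}x the FP16 mean latency")
```

On these numbers the INT8 engine takes roughly 1.55 times as long per inference as the FP16 engine.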

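Why is INT8 slower than FP16 here? In the ONNX model exported above, batch norm was not fused into the conv layers, and the inserted quantize/dequantize operators further prevent that fusion. In the FP16 model, batch norm is fused, which gives better performance. The figure below shows the quantized ONNX model, where BN is visibly not fused into conv. In trt-engine-explorer, the INT8 conv layers themselves are actually faster, but the BN layers also account for a significant share of the time.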
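For readers unfamiliar with what "fusing BN into conv" means, the arithmetic is shown below for a single channel, treating the conv as a scalar multiply-add. This is a generic sketch of BN folding, not code from the post or from TensorRT; `fold_bn` and `conv_then_bn` are hypothetical helpers and all numeric values are made up.

```python
import math

def conv_then_bn(x, weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Unfused path: conv (here a scalar multiply-add) followed by batch norm."""
    y = weight * x + bias
    return gamma * (y - mean) / math.sqrt(var + eps) + beta

def fold_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold the batch norm statistics into the conv weight and bias."""
    factor = gamma / math.sqrt(var + eps)
    return weight * factor, (bias - mean) * factor + beta

# Made-up single-channel parameters
w, b = 0.8, 0.1
gamma, beta, mean, var = 1.5, -0.2, 0.05, 0.25

w_folded, b_folded = fold_bn(w, b, gamma, beta, mean, var)

x = 2.0
unfused = conv_then_bn(x, w, b, gamma, beta, mean, var)
fused = w_folded * x + b_folded          # one multiply-add instead of conv + BN

print(unfused, fused)
```

Because the folded layer is a single multiply-add, a fused engine skips the separate BN kernel entirely. PyTorch provides `torch.nn.utils.fusion.fuse_conv_bn_eval` for eval-mode Conv2d/BatchNorm2d pairs; whether applying it before `mtq.quantize` removes the slowdown seen here was not tested in this post.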