Getting Started with TensorRT (7): INT8 Quantization

清欢守护者

Article link: https://blog.csdn.net/irving512/article/details/116460466

0. Preface

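TensorRT provides both FP16 and INT8 quantization.
- FP16 needs no extra data: the engine can be built directly from the FP32 model (for example, an ONNX model).
- INT8 requires one more step: calibration, which produces a calibration file.
NVIDIA provides two official samples, covered in sections 1 and 2.
References:
- TensorRT(5)-INT8校准原理 (INT8 calibration principles)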

1. sampleINT8

sampleINT8 - Performing Inference In INT8 Using Custom Calibration

1.1 Sample overview

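GitHub; reference: TensorRT(6)-INT8 inference
Overview: the sample takes a Caffe model as input, uses the MNIST dataset to compute the parameters needed for calibration, and performs INT8 quantization.
Compared with an ordinary conversion of a Caffe model to a TensorRT engine, INT8 quantization adds the following steps:
- When parsing the Caffe model with the parser, the weights data type must be specified as FP32.
- The BuilderFlag::kINT8 flag must be set on IBuilderConfig.
- A calibration object (the calibrator) must be built and attached to IBuilderConfig.
How to build the calibrator
- The goal is to build an IInt8Calibrator object and pass it to IBuilderConfig.
- IInt8Calibrator is only an abstract class. Its key functions are:
  - getBatchSize: called once; returns the batch size used during calibration.
  - bool getBatch(void* bindings[], const char* names[], int nbBindings): called repeatedly to fetch calibration inputs until it returns false.
    - The first two arguments have the same length, and their entries correspond one to one.
    - The third argument is the length of the first two.
    - The GPU address of each loaded input must be stored in bindings.
  - writeCalibrationCache: writes the calibration result to disk; the builder calls it automatically while it runs.
  - readCalibrationCache: reads a calibration file saved on disk; the builder calls it automatically while it runs.
- TensorRT derives several calibrator interfaces from IInt8Calibrator, one per calibration algorithm; the most commonly used is IInt8EntropyCalibrator2.
The saved calibration file has a fixed format, which is not covered in detail here; see the linked reference.
The rest of the inference code is the same whether or not INT8 quantization is used.
Interpreting the results
- The program runs three models in turn (FP32, FP16, and INT8) and checks whether their results differ by more than a given threshold.
Key code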

// Set up the quantization-related parameters while constructing the network
if (dataType == DataType::kINT8)
{
    // Calibration-related configuration for INT8 quantization

    // Reads the MNIST data and is passed to the calibrator below
    // Two key functions of IInt8Calibrator rely on this class
    MNISTBatchStream calibrationStream(mParams.calBatchSize, mParams.nbCalBatches, "train-images-idx3-ubyte",
        "train-labels-idx1-ubyte", mParams.dataDirs);

    // Initialize the IInt8Calibrator object
    // Int8EntropyCalibrator2 is used as the concrete type
    calibrator.reset(new Int8EntropyCalibrator2<MNISTBatchStream>(
        calibrationStream, 0, mParams.networkName.c_str(), mParams.inputTensorNames[0].c_str()));

    // Attach the calibrator to IBuilderConfig
    config->setInt8Calibrator(calibrator.get());
}

1.2 Further reading

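Link to the official documentation
How the builder uses the calibrator
- It first calls getBatchSize to get the batch size used during calibration.
- It then calls getBatch repeatedly; each call must supply as many samples as getBatchSize reported, and the loop ends when getBatch returns false.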
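
The call order described above can be mimicked without TensorRT. The following is a minimal sketch in plain Python: `ListCalibrator` and `run_calibration` are hypothetical names, the "statistic" collected is just the largest absolute value seen, and the rule that a non-empty cache skips calibration is an assumption about the builder's behavior rather than something taken from the sample.

```python
class ListCalibrator:
    """Hypothetical stand-in that mirrors the four-method calibrator contract."""

    def __init__(self, samples, batch_size):
        self.samples = samples
        self.batch_size = batch_size
        self.index = 0
        self.cache = None

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        # Return one entry per input name, or None when the data runs out.
        end = self.index + self.batch_size
        if end > len(self.samples):
            return None
        batch = self.samples[self.index:end]
        self.index = end
        return [batch]

    def read_calibration_cache(self):
        return self.cache

    def write_calibration_cache(self, cache):
        self.cache = cache


def run_calibration(calibrator, input_names):
    """Drive the calibrator the way the builder is described to (a sketch)."""
    batch_size = calibrator.get_batch_size()      # called once
    cached = calibrator.read_calibration_cache()  # assumption: a cache skips calibration
    if cached is not None:
        return cached, 0
    max_abs, batches = 0.0, 0
    while True:
        bindings = calibrator.get_batch(input_names)  # called until exhausted
        if bindings is None:
            break
        assert len(bindings) == len(input_names)
        assert len(bindings[0]) == batch_size
        max_abs = max(max_abs, max(abs(v) for v in bindings[0]))
        batches += 1
    cache = "max_abs={}".format(max_abs).encode()
    calibrator.write_calibration_cache(cache)
    return cache, batches
```

Calling `run_calibration` a second time on the same object returns the stored cache without requesting any batches, which is the point of the two cache functions.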
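How an INT8 engine is built
- Build an FP32 engine, run forward inference on the calibration dataset, and collect a histogram of each layer's outputs.
- Build the calibration table from those histograms.
- Build the INT8 engine from the network and the calibration table.
The same calibration table can be used on different hardware platforms, as long as the network and the calibration dataset have not changed.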
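
To make the histogram-to-table step concrete, here is a small sketch in plain Python. It is not TensorRT's algorithm: the entropy calibrator chooses thresholds by minimizing information loss, whereas this sketch swaps in a much simpler percentile rule. All function names, the bin count, and `keep_fraction` are assumptions made for illustration.

```python
def build_histogram(values, num_bins, max_abs):
    """Histogram of absolute activation values over [0, max_abs]."""
    hist = [0] * num_bins
    width = max_abs / num_bins
    for v in values:
        idx = min(int(abs(v) / width), num_bins - 1)
        hist[idx] += 1
    return hist, width


def percentile_threshold(hist, width, keep_fraction):
    """Smallest bin edge that covers at least keep_fraction of the samples."""
    total = sum(hist)
    running = 0
    for i, count in enumerate(hist):
        running += count
        if running >= keep_fraction * total:
            return (i + 1) * width
    return len(hist) * width


def calibration_table(activations_per_tensor, num_bins=8, keep_fraction=0.9):
    """Map each tensor name to an INT8 scale (threshold / 127)."""
    table = {}
    for name, values in activations_per_tensor.items():
        max_abs = max(abs(v) for v in values)
        if max_abs == 0:
            table[name] = 0.0  # degenerate tensor: nothing to scale
            continue
        hist, width = build_histogram(values, num_bins, max_abs)
        threshold = percentile_threshold(hist, width, keep_fraction)
        table[name] = threshold / 127.0
    return table
```

With a tensor that is mostly small values plus one outlier, a lower `keep_fraction` clips the outlier and yields a finer scale, which is the trade-off calibration is meant to settle.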

2. sampleINT8API

sampleINT8API - Performing Inference In INT8 Precision

2.1 Sample overview

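Differences from the previous sample
- The previous sample computes the calibration table by feeding in a large number of images.
- This sample does not use a calibrator; instead, the user specifies the INT8 value range of every layer.
What it demonstrates
- Using nvinfer1::ITensor::setDynamicRange to set a tensor's dynamic range
- Using nvinfer1::ILayer::setPrecision to set the precision a layer computes in (probably the data type used for intermediate computation)
- Using nvinfer1::ILayer::setOutputType to set the data type of a layer's output
- Running INT8 inference without INT8 calibration
Workflow
- Check that the hardware platform supports INT8 (according to the sample, only compute capability 6.1 or 7.x).
- Enable INT8 mode on the builder and set the calibrator to nullptr.
- Call tensor->setDynamicRange(min, max) to set the dynamic range of each tensor.
- Call layer->setPrecision(nvinfer1::DataType::kINT8) to set the precision used for the layer's computation.
- Call layer->setOutputType(j, nvinfer1::DataType::kFLOAT) to set the data type of the layer's j-th output.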

2.2 Further reading

Setting Per-Tensor Dynamic Range Using C++
How is the dynamic range obtained?
- During the last training epoch, record the minimum and maximum value of every intermediate tensor.
- Or obtain it through quantization aware training.
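
The first option amounts to keeping a running minimum and maximum per tensor. A minimal sketch in plain Python follows; `RangeTracker` and the tensor names are hypothetical, and in a real training loop the values would come from the framework's intermediate outputs rather than from Python lists.

```python
class RangeTracker:
    """Keeps a running (min, max) per tensor name across many batches."""

    def __init__(self):
        self.ranges = {}

    def observe(self, name, values):
        lo, hi = min(values), max(values)
        if name in self.ranges:
            old_lo, old_hi = self.ranges[name]
            lo, hi = min(lo, old_lo), max(hi, old_hi)
        self.ranges[name] = (lo, hi)


# Example: two batches observed for the same (hypothetical) tensor.
tracker = RangeTracker()
tracker.observe("conv1_out", [-0.5, 0.25, 1.5])
tracker.observe("conv1_out", [-2.0, 0.75])
tracker.observe("fc_out", [0.0, 4.0])
```

The recorded pairs are the values that would later be passed as `min_float` and `max_float` to `setDynamicRange`.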
Iterate over all tensors and set the dynamic range of each one, as follows:

ITensor* tensor = network->getLayer(layer_index)->getOutput(output_index);
tensor->setDynamicRange(min_float, max_float);

ITensor* input_tensor = network->getInput(input_index);
input_tensor->setDynamicRange(min_float, max_float);
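
What a dynamic range means numerically can be shown with a few lines of plain Python. This sketch assumes a symmetric range in which the larger magnitude of `(min_float, max_float)` maps to 127, which is how the TensorRT documentation describes the scale derived from a dynamic range. The function names are hypothetical and this is not TensorRT code.

```python
def scale_from_dynamic_range(min_float, max_float):
    """Symmetric INT8 scale: the larger magnitude of the range maps to 127."""
    return max(abs(min_float), abs(max_float)) / 127.0


def quantize(x, scale):
    """Round to the nearest integer (ties to even) and clamp to the INT8 range."""
    q = round(x / scale)
    return max(-128, min(127, q))


def dequantize(q, scale):
    """Map an INT8 value back to an approximate float."""
    return q * scale


scale = scale_from_dynamic_range(-127.0, 63.5)  # larger magnitude is 127.0
inside = quantize(5.4, scale)                   # within the range
clipped_high = quantize(200.0, scale)           # saturates at the upper bound
clipped_low = quantize(-300.0, scale)           # saturates at the lower bound
```

Values outside the range saturate, so a range that is too narrow clips large activations, while one that is too wide wastes resolution on values that never occur.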

3. Python Caffe MNIST INT8

python/int8_caffe_mnist - INT8 Calibration In Python
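This is essentially the Python version of sampleINT8.
The core of the code is building the calibrator, MNISTEntropyCalibrator.
- The class is a subclass of trt.IInt8EntropyCalibrator2.
- It likewise overrides four functions: get_batch_size, get_batch, read_calibration_cache, and write_calibration_cache.
- The names and parameters of these four functions may differ slightly from the C++ version.
Otherwise there is little to add: when configuring the builder, set the INT8 flag and attach the calibrator.
Note that the calibration table is generated during the engine build step.
Engine-building code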
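# The engine is built as follows.
# Here, calib is a MNISTEntropyCalibrator object.
# Note: the builder attributes used below come from older TensorRT releases;
# newer releases configure these settings through IBuilderConfig instead.
# This function builds an engine from a Caffe model.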
def build_int8_engine(deploy_file, model_file, calib, batch_size=32):
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.CaffeParser() as parser:
        # We set the builder batch size to be the same as the calibrator's, as we use the same batches
        # during inference. Note that this is not required in general, and inference batch size is
        # independent of calibration batch size.
        builder.max_batch_size = batch_size
        builder.max_workspace_size = common.GiB(1)
        builder.int8_mode = True
        builder.int8_calibrator = calib
        # Parse Caffe model
        model_tensors = parser.parse(deploy=deploy_file, model=model_file, network=network, dtype=ModelData.DTYPE)
        network.mark_output(model_tensors.find(ModelData.OUTPUT_NAME))
        # Build engine and do int8 calibration.
        return builder.build_cuda_engine(network)

MNISTEntropyCalibrator source code

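import os

import pycuda.autoinit  # creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

# load_mnist_data is a helper defined alongside this class in the sample.
class MNISTEntropyCalibrator(trt.IInt8EntropyCalibrator2):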
    def __init__(self, training_data, cache_file, batch_size=64):
        # Whenever you specify a custom constructor for a TensorRT class,
        # you MUST call the constructor of the parent explicitly.
        trt.IInt8EntropyCalibrator2.__init__(self)

        self.cache_file = cache_file

        # Every time get_batch is called, the next batch of size batch_size will be copied to the device and returned.
        self.data = load_mnist_data(training_data)
        self.batch_size = batch_size
        self.current_index = 0

        # Allocate enough memory for a whole batch.
        self.device_input = cuda.mem_alloc(self.data[0].nbytes * self.batch_size)

    def get_batch_size(self):
        return self.batch_size

    # TensorRT passes along the names of the engine bindings to the get_batch function.
    # You don't necessarily have to use them, but they can be useful to understand the order of
    # the inputs. The bindings list is expected to have the same ordering as 'names'.
    def get_batch(self, names):
        if self.current_index + self.batch_size > self.data.shape[0]:
            return None

        current_batch = int(self.current_index / self.batch_size)
        if current_batch % 10 == 0:
            print("Calibrating batch {:}, containing {:} images".format(current_batch, self.batch_size))

        batch = self.data[self.current_index:self.current_index + self.batch_size].ravel()
        cuda.memcpy_htod(self.device_input, batch)
        self.current_index += self.batch_size
        return [self.device_input]


    def read_calibration_cache(self):
        # If there is a cache, use it instead of calibrating again. Otherwise, implicitly return None.
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)