TensorRT: INT8 Quantization Acceleration, Principles and Common Questions

Latest recommended article published on 2025-03-12 14:05:19

郑小路

Category: Model Deployment. Tags: neural networks, artificial intelligence, edge computing

Article link: https://blog.csdn.net/yitiaoxiaolu/article/details/136339471

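Contents

Preface
I. INT8 Quantization
II. Why Is INT8 Quantization Fast?
III. Why Doesn't INT8 Quantization Cause a Large Loss of Accuracy?
IV. Introduction to INT8 Quantization Algorithms
V. INT8 Quantization with TensorRT
Supplement:
Is INT8 Quantization Always Faster?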

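Preface

This article first covers the basics of INT8 quantization and answers some fundamental questions about INT8. It then introduces three common INT8 quantization algorithms: dynamic symmetric quantization, dynamic asymmetric quantization, and static symmetric quantization. Finally, it walks through an example of INT8 quantization in TensorRT: INT8 post-training quantization (PTQ) of an MNIST handwritten-digit classifier.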

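I. INT8 Quantization

When studying computer fundamentals we learn about the various data types. A single-precision floating-point number (float) occupies 4 bytes, or 32 bits. An int integer also occupies 4 bytes (32 bits) and can represent values from -2^31 to 2^31-1. INT8 uses 8 bits to represent an integer, so its representable range is -128 to 127.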
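The sizes and ranges mentioned above can be checked directly. The following is a small sketch that assumes NumPy is available; the variable names are only illustrative.

```python
import numpy as np

int8_info = np.iinfo(np.int8)
int32_info = np.iinfo(np.int32)

# Representable integer ranges
print(int8_info.min, int8_info.max)      # -128 127
print(int32_info.min, int32_info.max)    # -2147483648 2147483647

# Storage size in bytes: FP32 and INT32 take 4 bytes, INT8 takes 1 byte
sizes = [np.dtype(d).itemsize for d in (np.float32, np.int32, np.int8)]
print(sizes)                             # [4, 4, 1]
```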

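INT8 quantization converts a model based on FP32 floating-point values so that it computes with low-precision INT8 values, which speeds up inference.
[Figure: flow of a regular neural network compared with a quantized neural network]
The figure above shows the flow of a regular neural network and of a quantized neural network. In practice, quantization mainly targets matrix multiplication (convolution is essentially matrix multiplication as well). To speed up inference, we can therefore quantize part of the network structure so that those matrix multiplications run in INT8.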
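To make the idea of an INT8 matrix multiplication concrete, here is a minimal NumPy sketch. It assumes a simple per-tensor symmetric scheme (scale = max absolute value / 127), and the helper `quantize_symmetric` is a hypothetical name introduced only for this illustration. It emulates the arithmetic, not the speed, of a real INT8 kernel.

```python
import numpy as np

def quantize_symmetric(x):
    """Map an FP32 array to INT8 using a single max-abs based scale."""
    scale = float(np.max(np.abs(x))) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 64)).astype(np.float32)   # activations
w = rng.standard_normal((64, 8)).astype(np.float32)   # weights

qa, sa = quantize_symmetric(a)
qw, sw = quantize_symmetric(w)

# INT8 x INT8 products are accumulated in INT32 to avoid overflow.
acc = qa.astype(np.int32) @ qw.astype(np.int32)

# Dequantize the result with the product of the two scales.
y_int8 = acc.astype(np.float32) * (sa * sw)
y_fp32 = a @ w

rel_err = np.linalg.norm(y_int8 - y_fp32) / np.linalg.norm(y_fp32)
print("relative error of the INT8 result: %.4f" % rel_err)
```

Only the matrix multiplication itself runs on integers; the inputs are quantized before it and the INT32 accumulator is converted back to floating point after it.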

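II. Why Is INT8 Quantization Fast?

1. INT8 uses fewer bits per value. An 8-bit INT8 value takes a quarter of the storage of a 32-bit FP32 value, so the amount of data that has to be stored and moved drops by a factor of 4, which speeds up inference.
2. Computing platforms provide efficient instructions for INT8 arithmetic. On NVIDIA platforms, for example, GPUs with compute capability SM_61 or higher, such as the Tesla P4/P40, support DP4A, an instruction for INT8 dot products. In theory it can deliver up to a 4x performance improvement. On ARM architectures, NEON instructions can likewise accelerate INT8 computation.

The essence of why INT8 quantization accelerates inference:
With instruction-level or hardware support, more INT8 operations than FP32 operations can be completed per clock cycle.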
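As an illustration of what a DP4A-style instruction computes, the sketch below emulates its arithmetic in NumPy: a dot product of four INT8 pairs added to an INT32 accumulator. The function name `dp4a` is chosen here for illustration only. The emulation shows the semantics, not the performance, since the speedup comes from the hardware doing all four multiply-adds in one instruction.

```python
import numpy as np

def dp4a(a4, b4, c):
    """Emulate a 4-way INT8 dot product with INT32 accumulation: c + sum(a4[i] * b4[i])."""
    a4 = np.asarray(a4, dtype=np.int8).astype(np.int32)
    b4 = np.asarray(b4, dtype=np.int8).astype(np.int32)
    return np.int32(c) + np.sum(a4 * b4, dtype=np.int32)

# 1*5 + (-2)*6 + 3*(-7) + 4*8 = 4, plus the accumulator 10
print(dp4a([1, -2, 3, 4], [5, 6, -7, 8], 10))  # 14
```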

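III. Why Doesn't INT8 Quantization Cause a Large Loss of Accuracy?

1. Neural networks are robust to some degree. When a network is used for detection, segmentation, prediction, and similar tasks, its parameters can tolerate small changes. A neural network can be viewed as a complex function: it takes the input data, outputs probabilities, and a decision is made from those probabilities. As long as a change in the parameters does not seriously affect the output probabilities, the network remains effective.
2. Training data is generally noisy, and training a neural network amounts to extracting useful information from that noise. The error introduced by low-precision computation can therefore be regarded as just another kind of noise.
3. Most of a neural network's weights are approximately normally distributed, with a small and symmetric value range. Practice shows that in a well-performing network most weights follow a normal distribution, so the high precision and wide range of FP32 are largely wasted on the model. Most parameters are concentrated near zero, which means that the bulk of FP32's dynamic range goes unused while still costing storage and compute.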
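The argument that low-precision error behaves like a small amount of noise can be checked numerically. The sketch below uses synthetic, normally distributed values as stand-in weights (an assumption; they are not taken from a trained model), quantizes them to INT8 with a max-abs scale, and measures the error that quantization introduces.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "weights": zero-mean, small standard deviation
w = rng.normal(loc=0.0, scale=0.05, size=100_000).astype(np.float32)

scale = float(np.max(np.abs(w))) / 127.0
w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dq = w_q.astype(np.float32) * scale          # back to FP32 for comparison

noise = w_dq - w
max_err = float(np.max(np.abs(noise)))
snr_db = float(10.0 * np.log10(np.sum(w ** 2) / np.sum(noise ** 2)))

print("max abs error: %.6f (half a quantization step is %.6f)" % (max_err, scale / 2))
print("signal-to-noise ratio: %.1f dB" % snr_db)
```

The rounding error of any single value is bounded by half a quantization step, so for values concentrated near zero the quantization noise stays small relative to the signal.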

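IV. Introduction to INT8 Quantization Algorithms

1. Dynamic symmetric quantization

Dynamic symmetric quantization first finds the maximum absolute value |max| and then computes the quantization scale from it. The quantized value quantized_value is the real value real_value divided by scale, rounded to an integer.
[Figure: dynamic symmetric quantization]
The advantage of this algorithm is that it is simple and the quantization step is quick. The drawback is that it wastes bit width and hurts accuracy: when the data is asymmetric, the quantized data is still asymmetric, so part of the INT8 range holds no data. The algorithm is currently used in:
(1) PyTorch dynamic quantization (2) ONNX quantization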
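The description above can be sketched in a few lines of NumPy. This is a minimal sketch that assumes scale = |max| / 127 and round-to-nearest, since the original formula was in a figure; `dynamic_symmetric_quantize` is a hypothetical helper name and not an API of PyTorch or ONNX.

```python
import numpy as np

def dynamic_symmetric_quantize(real_value):
    """scale = |max| / 127, quantized_value = round(real_value / scale)."""
    abs_max = float(np.max(np.abs(real_value)))
    scale = abs_max / 127.0 if abs_max > 0 else 1.0   # guard against all-zero input
    quantized_value = np.clip(np.round(real_value / scale), -127, 127).astype(np.int8)
    return quantized_value, scale

# Asymmetric data, e.g. the non-negative output of a ReLU
x = np.array([0.0, 0.5, 1.0, 3.0, 4.0], dtype=np.float32)
q, scale = dynamic_symmetric_quantize(x)
print(q)                              # quantized values: 0, 16, 32, 95, 127
print(q.astype(np.float32) * scale)   # dequantized approximation of x
```

Because the input is non-negative, every quantized value lands in 0 to 127 and the negative half of the INT8 range stays empty, which is the bit-width waste described above.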

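2. Dynamic asymmetric quantization

Dynamic asymmetric quantization handles asymmetric data well: it maps the asymmetric data onto a distribution that is symmetric about the quantization zero point, which avoids wasting bit width.
[Figure: dynamic asymmetric quantization]
Advantage: no bit width is wasted, so accuracy is preserved.
Disadvantage: the algorithm is more complex and the quantization step takes longer.

Used in: (1) Google Gemmlowp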
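A minimal sketch of the asymmetric scheme follows. It assumes the common affine form, where scale = (max - min) / (qmax - qmin) and an integer zero point shifts the data so that the full INT8 range is used. The function names are hypothetical and this is not Gemmlowp's actual implementation.

```python
import numpy as np

def dynamic_asymmetric_quantize(real_value, qmin=-128, qmax=127):
    """Affine quantization: q = round(r / scale) + zero_point."""
    r_min = float(np.min(real_value))
    r_max = float(np.max(real_value))
    scale = (r_max - r_min) / (qmax - qmin) if r_max > r_min else 1.0
    zero_point = int(round(qmin - r_min / scale))      # integer that r_min maps onto qmin
    q = np.clip(np.round(real_value / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# Same non-negative data as a ReLU output would produce
x = np.array([0.0, 0.5, 1.0, 3.0, 4.0], dtype=np.float32)
q, scale, zero_point = dynamic_asymmetric_quantize(x)
print(q)                                  # quantized values: -128, -96, -64, 63, 127
print(dequantize(q, scale, zero_point))   # approximation of x
```

With the zero point in play, the same data now spans the whole range from -128 to 127, so the quantization step is roughly half the size it would be with the symmetric scheme, at the cost of extra work to compute and apply the zero point.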

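3. Static symmetric quantization

[Figure: static symmetric quantization]
A dynamic quantization algorithm has to compute the maximum absolute value |max| at inference time, so quantization is relatively slow. A static quantization algorithm instead uses a scaling threshold T computed in advance at inference time and saturates (clips) the data that falls outside the threshold.

We will not go into the details of static symmetric quantization here, and only briefly describe the INT8 quantization process in TensorRT:
1. Run FP32 inference on the calibration dataset.
2. For each layer:
collect a histogram of the activations;
generate many quantized distributions with different saturation thresholds;
pick the threshold that minimizes the KL divergence between the reference distribution and the quantized distribution.
In short, static INT8 quantization in TensorRT runs the model on a calibration dataset to compute, ahead of time, the saturation threshold T that loses the least information, and then uses that threshold directly during subsequent INT8 inference.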
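The threshold search in step 2 can be sketched as follows. This is a simplified illustration of the entropy-calibration idea, not TensorRT's internal code: the bin count, the step of the search, the handling of empty bins, and the names `kl_divergence` and `find_threshold` are all assumptions made for this example.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) for two unnormalized histograms of equal length."""
    p = p / p.sum()
    q = q / q.sum()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))

def find_threshold(activations, num_bins=2048, num_levels=128, step=16):
    abs_act = np.abs(activations)
    hist, edges = np.histogram(abs_act, bins=num_bins, range=(0.0, float(abs_act.max())))
    hist = hist.astype(np.float64)
    best_kl, best_i = np.inf, num_bins
    for i in range(num_levels, num_bins + 1, step):
        # Reference P: keep the first i bins and fold the clipped tail into the last one.
        p = hist[:i].copy()
        p[-1] += hist[i:].sum()
        # Candidate Q: merge the first i bins into num_levels levels, then spread
        # each level's count evenly over its originally non-empty bins.
        q = np.zeros(i)
        for idx in np.array_split(np.arange(i), num_levels):
            nonzero = hist[idx] > 0
            if nonzero.any():
                q[idx[nonzero]] = hist[idx].sum() / nonzero.sum()
        if q.sum() == 0:
            continue
        kl = kl_divergence(p, q)
        if kl < best_kl:
            best_kl, best_i = kl, i
    return float(edges[best_i])   # upper edge of the last bin that is kept

rng = np.random.default_rng(0)
# Synthetic activations: a Gaussian bulk plus a few large outliers
acts = np.concatenate([rng.standard_normal(200_000), np.full(5, 40.0)])
T = find_threshold(acts)
print("abs max = %.2f, chosen threshold T = %.2f" % (np.max(np.abs(acts)), T))
```

With a handful of large outliers in the data, the chosen threshold ends up well below the maximum absolute value: saturating the outliers costs less information than spending most of the 8-bit range on them.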

Used in:
(1) PyTorch static quantization
(2) TensorRT
(3) NCNN
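Once a threshold T is known, applying static symmetric quantization at inference time is cheap, because no statistics have to be gathered. The sketch below assumes scale = T / 127 and uses a made-up value for T; `static_symmetric_quantize` is a hypothetical helper name.

```python
import numpy as np

def static_symmetric_quantize(real_value, threshold):
    """Quantize with a precomputed threshold; values beyond +/-threshold saturate."""
    scale = threshold / 127.0
    q = np.clip(np.round(real_value / scale), -127, 127).astype(np.int8)
    return q, scale

T = 4.0  # hypothetical threshold, assumed to come from an offline calibration run
x = np.array([-9.0, -1.0, 0.0, 1.0, 3.0, 4.0, 25.0], dtype=np.float32)
q, scale = static_symmetric_quantize(x, T)
print(q)   # quantized values: -127, -32, 0, 32, 95, 127, 127
```

The outliers -9.0 and 25.0 are clipped to -127 and 127, while the values inside the threshold keep a finer resolution than they would if the scale had been derived from the largest absolute value, 25.0.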

V. INT8 Quantization with TensorRT

We use the MNIST example here:


import os
from datetime import datetime as dt
from glob import glob

import calibrator
import cv2
import numpy as np
import tensorrt as trt
import torch as t
import torch.nn.functional as F
from cuda import cudart
from torch.autograd import Variable

np.random.seed(31193)
t.manual_seed(97)  # set the PyTorch random seed
t.cuda.manual_seed_all(97)  # set the PyTorch random seed on all CUDA devices
t.backends.cudnn.deterministic = True  # force cuDNN into deterministic mode so the same operation yields the same result
nTrainBatchSize = 128
nHeight = 28
nWidth = 28
paraFile = "./para.npz"  # stores the model parameters (weights and biases)
trtFile = "./model.plan"  # stores the serialized TensorRT engine
dataPath = os.path.dirname(os.path.realpath(__file__)) + "/../../00-MNISTData/"
trainFileList = sorted(glob(dataPath + "train/*.jpg"))
testFileList = sorted(glob(dataPath + "test/*.jpg"))
inferenceImage = dataPath + "8.png"

# for FP16 mode
bUseFP16Mode = False
# for INT8 mode
bUseINT8Mode = True
nCalibration = 1  # number of calibration iterations used during INT8 quantization
cacheFile = "./int8.cache"  # file path used to cache the calibration result
calibrationDataPath = dataPath + "test/"  # path to the data used for calibration

os.system("rm -rf ./*.npz ./*.plan ./*.cache")
np.set_printoptions(precision=3, linewidth=200, suppress=True)
cudart.cudaDeviceSynchronize()

# Create network and train model in pyTorch ------------------------------------
class Net(t.nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = t.nn.Conv2d(1, 32, (5, 5), padding=(2, 2), bias=True)
        self.conv2 = t.nn.Conv2d(32, 64, (5, 5), padding=(2, 2), bias=True)
        self.fc1 = t.nn.Linear(64 * 7 * 7, 1024, bias=True)
        self.fc2 = t.nn.Linear(1024, 10, bias=True)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv2(x)), (2, 2))
        x = x.reshape(-1, 64 * 7 * 7)
        x = F.relu(self.fc1(x))
        y = self.fc2(x)
        z = F.softmax(y, dim=1)
        z = t.argmax(z, dim=1)
        return y, z

class MyData(t.utils.data.Dataset):

    def __init__(self, isTrain=True):
        if isTrain:
            self.data = trainFileList
        else:
            self.data = testFileList

    def __getitem__(self, index):
        imageName = self.data[index]
        data = cv2.imread(imageName, cv2.IMREAD_GRAYSCALE)
        label = np.zeros(10, dtype=np.float32)
        index = int(imageName[-7])
        label[index] = 1
        return t.from_numpy(data.reshape(1, nHeight, nWidth).astype(np.float32)), t.from_numpy(label)

    def __len__(self):
        return len(self.data)

model = Net().cuda()
ceLoss = t.nn.CrossEntropyLoss()
opt = t.optim.Adam(model.parameters(), lr=0.001)
trainDataset = MyData(True)
testDataset = MyData(False)
trainLoader = t.utils.data.DataLoader(dataset=trainDataset, batch_size=nTrainBatchSize, shuffle=True)
testLoader = t.utils.data.DataLoader(dataset=testDataset, batch_size=nTrainBatchSize, shuffle=True)

for epoch in range(10):
    for xTrain, yTrain in trainLoader:
        xTrain = Variable(xTrain).cuda()
        yTrain = Variable(yTrain).cuda()
        opt.zero_grad()
        y_, z = model(xTrain)
        loss = ceLoss(y_, yTrain)
        loss.backward()
        opt.step()

    with t.no_grad():
        acc = 0
        n = 0
        for xTest, yTest in testLoader:
            xTest = Variable(xTest).cuda()
            yTest = Variable(yTest).cuda()
            y_, z = model(xTest)
            acc += t.sum(z == t.matmul(yTest, t.Tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]).to("cuda:0"))).cpu().numpy()
            n += xTest.shape[0]
        print("%s, epoch %2d, loss = %f, test acc = %f" % (dt.now(), epoch + 1, loss.data, acc / n))

para = {}  # save weight as file
for name, parameter in model.named_parameters():
    #print(name, parameter.detach().cpu().numpy().shape)
    para[name] = parameter.detach().cpu().numpy()
np.savez(paraFile, **para)

del para
print("Succeeded building model in pyTorch!")

# Rebuild network, load weights and do inference in TensorRT -------------------
logger = trt.Logger(trt.Logger.ERROR)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
profile = builder.create_optimization_profile()
config = builder.create_builder_config()
if bUseFP16Mode:
    config.set_flag(trt.BuilderFlag.FP16)
if bUseINT8Mode:
    config.set_flag(trt.BuilderFlag.INT8)
    config.int8_calibrator = calibrator.MyCalibrator(calibrationDataPath, nCalibration, (1, 1, nHeight, nWidth), cacheFile)

inputTensor = network.add_input("inputT0", trt.float32, [-1, 1, nHeight, nWidth])
profile.set_shape(inputTensor.name, [1, 1, nHeight, nWidth], [4, 1, nHeight, nWidth], [8, 1, nHeight, nWidth])
config.add_optimization_profile(profile)

para = np.load(paraFile)
print(para.files)
print(para["conv1.weight"].shape)#(32, 1, 5, 5)
print(para["conv1.bias"].shape)#(32,)

w = np.ascontiguousarray(para["conv1.weight"])
b = np.ascontiguousarray(para["conv1.bias"])
_0 = network.add_convolution_nd(inputTensor, 32, [5, 5], trt.Weights(w), trt.Weights(b))
_0.padding_nd = [2, 2]
_1 = network.add_activation(_0.get_output(0), trt.ActivationType.RELU)
_2 = network.add_pooling_nd(_1.get_output(0), trt.PoolingType.MAX, [2, 2])
_2.stride_nd = [2, 2]

w = np.ascontiguousarray(para["conv2.weight"])
b = np.ascontiguousarray(para["conv2.bias"])
_3 = network.add_convolution_nd(_2.get_output(0), 64, [5, 5], trt.Weights(w), trt.Weights(b))
_3.padding_nd = [2, 2]
_4 = network.add_activation(_3.get_output(0), trt.ActivationType.RELU)
_5 = network.add_pooling_nd(_4.get_output(0), trt.PoolingType.MAX, [2, 2])
_5.stride_nd = [2, 2]

_6 = network.add_shuffle(_5.get_output(0))
_6.reshape_dims = (-1, 64 * 7 * 7)

w = np.ascontiguousarray(para["fc1.weight"].transpose())
b = np.ascontiguousarray(para["fc1.bias"].reshape(1, -1))
_7 = network.add_constant(w.shape, trt.Weights(w))
_8 = network.add_matrix_multiply(_6.get_output(0), trt.MatrixOperation.NONE, _7.get_output(0), trt.MatrixOperation.NONE)
_9 = network.add_constant(b.shape, trt.Weights(b))
_10 = network.add_elementwise(_8.get_output(0), _9.get_output(0), trt.ElementWiseOperation.SUM)
_11 = network.add_activation(_10.get_output(0), trt.ActivationType.RELU)

w = np.ascontiguousarray(para["fc2.weight"].transpose())
b = np.ascontiguousarray(para["fc2.bias"].reshape(1, -1))
_12 = network.add_constant(w.shape, trt.Weights(w))
_13 = network.add_matrix_multiply(_11.get_output(0), trt.MatrixOperation.NONE, _12.get_output(0), trt.MatrixOperation.NONE)
_14 = network.add_constant(b.shape, trt.Weights(b))
_15 = network.add_elementwise(_13.get_output(0), _14.get_output(0), trt.ElementWiseOperation.SUM)

_16 = network.add_softmax(_15.get_output(0))
_16.axes = 1 << 1

_17 = network.add_topk(_16.get_output(0), trt.TopKOperation.MAX, 1, 1 << 1)

network.mark_output(_17.get_output(1))

engineString = builder.build_serialized_network(network, config)
if engineString is None:
    print("Failed building engine!")
    exit()
print("Succeeded building engine!")
with open(trtFile, "wb") as f:
    f.write(engineString)
engine = trt.Runtime(logger).deserialize_cuda_engine(engineString)
nIO = engine.num_io_tensors
print("nIO==", nIO)  # 2: only the input and the output are marked as I/O tensors
lTensorName = [engine.get_tensor_name(i) for i in range(nIO)]
nInput = [engine.get_tensor_mode(lTensorName[i]) for i in range(nIO)].count(trt.TensorIOMode.INPUT)

context = engine.create_execution_context()
context.set_input_shape(lTensorName[0], [1, 1, nHeight, nWidth])
for i in range(nIO):
    print("[%2d]%s->" % (i, "Input " if i < nInput else "Output"), engine.get_tensor_dtype(lTensorName[i]), engine.get_tensor_shape(lTensorName[i]), context.get_tensor_shape(lTensorName[i]), lTensorName[i])

bufferH = []
data = cv2.imread(inferenceImage, cv2.IMREAD_GRAYSCALE).astype(np.float32).reshape(1, 1, nHeight, nWidth)
bufferH.append(np.ascontiguousarray(data))
for i in range(nInput, nIO):
    bufferH.append(np.empty(context.get_tensor_shape(lTensorName[i]), dtype=trt.nptype(engine.get_tensor_dtype(lTensorName[i]))))
bufferD = []
for i in range(nIO):
    bufferD.append(cudart.cudaMalloc(bufferH[i].nbytes)[1])

for i in range(nInput):
    cudart.cudaMemcpy(bufferD[i], bufferH[i].ctypes.data, bufferH[i].nbytes, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice)

for i in range(nIO):
    context.set_tensor_address(lTensorName[i], int(bufferD[i]))

context.execute_async_v3(0)

for i in range(nInput, nIO):
    cudart.cudaMemcpy(bufferH[i].ctypes.data, bufferD[i], bufferH[i].nbytes, cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost)

for i in range(nIO):
    print(lTensorName[i])
    print(bufferH[i])

for b in bufferD:
    cudart.cudaFree(b)

print("Succeeded running model in TensorRT!")

The calibrator.py file is as follows:


import os
from glob import glob

import cv2
import numpy as np
import tensorrt as trt
from cuda import cudart


class MyCalibrator(trt.IInt8EntropyCalibrator2):

    def __init__(self, calibrationDataPath, nCalibration, inputShape, cacheFile):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.imageList = glob(calibrationDataPath + "*.jpg")[:100]
        self.nCalibration = nCalibration
        self.shape = inputShape  # (N,C,H,W)
        self.buffeSize = trt.volume(inputShape) * trt.float32.itemsize
        self.cacheFile = cacheFile
        _, self.dIn = cudart.cudaMalloc(self.buffeSize)
        self.oneBatch = self.batchGenerator()

        print(int(self.dIn))

    def __del__(self):
        cudart.cudaFree(self.dIn)

    def batchGenerator(self):
        for i in range(self.nCalibration):
            print("> calibration %d" % i)
            subImageList = np.random.choice(self.imageList, self.shape[0], replace=False)
            yield np.ascontiguousarray(self.loadImageList(subImageList))

    def loadImageList(self, imageList):
        res = np.empty(self.shape, dtype=np.float32)
        for i in range(self.shape[0]):
            res[i, 0] = cv2.imread(imageList[i], cv2.IMREAD_GRAYSCALE).astype(np.float32)
        return res

    def get_batch_size(self):  # necessary API
        return self.shape[0]

    def get_batch(self, nameList=None, inputNodeName=None):  # necessary API
        try:
            data = next(self.oneBatch)
            cudart.cudaMemcpy(self.dIn, data.ctypes.data, self.buffeSize, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice)
            return [int(self.dIn)]
        except StopIteration:
            return None

    def read_calibration_cache(self):  # necessary API
        if os.path.exists(self.cacheFile):
            print("Succeed finding cache file: %s" % (self.cacheFile))
            with open(self.cacheFile, "rb") as f:
                cache = f.read()
                return cache
        else:
            print("Failed finding int8 cache!")
            return

    def write_calibration_cache(self, cache):  # necessary API
        with open(self.cacheFile, "wb") as f:
            f.write(cache)
        print("Succeed saving int8 cache!")
        return

if __name__ == "__main__":
    cudart.cudaDeviceSynchronize()
    m = MyCalibrator("../../00-MNISTData/test/", 5, (1, 1, 28, 28), "./int8.cache")
    m.get_batch("FakeNameList")
    m.get_batch("FakeNameList")
    m.get_batch("FakeNameList")
    m.get_batch("FakeNameList")
    m.get_batch("FakeNameList")