Model Quantization: onnx2tensorrt

wholetus

Last modified: 2023-06-27 09:43:13

First published: 2023-01-10 15:31:22

Article link: https://blog.csdn.net/wholetus/article/details/128371559

Converting an ONNX model to a TensorRT engine

Clone the TensorRT repository

git clone -b release/8.4 --recursive https://github.com/NVIDIA/TensorRT.git TensorRT-8.4

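Install the matching version of TensorRT

Download link: https://developer.nvidia.com/nvidia-tensorrt-8x-download

Build TensorRT

Change into the root directory of the TensorRT source tree (the directory created by the clone command above):

cd TensorRT-8.4

Open CMakeLists.txt and set the path to the TensorRT libraries by making the following replacement (the paths shown are the author's; adjust them to your own installation):
#set_ifndef(TRT_LIB_DIR ${CMAKE_BINARY_DIR})
set_ifndef(TRT_LIB_DIR /home/peekaboo/TensorRT-8.4.2.4/lib)
set_ifndef(TRT_OUT_DIR /media/peekaboo/T7/project/TensorRT/out)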

By default, the official CMakeLists.txt builds the parsers, the plugins and the samples. The samples are not needed for a custom operator, so they can be switched off.

option(BUILD_PLUGINS "Build TensorRT plugin" ON)
option(BUILD_PARSERS "Build TensorRT parsers" ON)
option(BUILD_SAMPLES "Build TensorRT samples" OFF)

cmake -B build
cd build
make

When the build finishes, the trtexec tool has been generated:
[100%] Linking CXX executable …/…/trtexec
[100%] Built target trtexec

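Registering a new operator in TensorRT

Convert the model with trtexec from the command line:
./trtexec --onnx=custom.onnx --saveEngine=custom.trt
The parser does not recognize the custom operator and reports an error: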

[E] [TRT] ModelImporter.cpp:773: While parsing node number 2 [Custom -> "41"]:
[E] [TRT] ModelImporter.cpp:774: --- Begin node ---
[E] [TRT] ModelImporter.cpp:775: input: "40"
output: "41"
name: "LeakyRelu_2"
op_type: "Custom"
attribute {
  name: "alpha"
  f: 0.2
  type: FLOAT
}

Open plugin/api/InferPlugin.cpp and add the header:
#include "customPlugin.h"
Add the call that initializes the plugin:
initializePlugin<nvinfer1::plugin::CustomPluginCreator>(logger, libNamespace);
In plugin/CMakeLists.txt, add the Custom plugin to the list of plugins to build:
customPlugin
In plugin/common/kernel.h, declare CustomInference following the pattern of lReluInference:
pluginStatus_t CustomInference(cudaStream_t stream, int32_t n, float negativeSlope, const void* input, void* output);
Finally, the ONNX node has to be mapped to the TRT plugin.
This is done in parsers/onnx/builtin_op_importers.cpp:

DEFINE_BUILTIN_OP_IMPORTER(Custom)
{
    OnnxAttrs attrs(node, ctx);
    float alpha = attrs.get<float>("alpha", 0.01f);
    nvinfer1::ITensor* input = &convertToTensor(inputs.at(0), ctx);

    const std::string pluginName = "Custom_TRT";
    const std::string pluginVersion = "1";
    std::vector<nvinfer1::PluginField> f;
    f.emplace_back("alpha", &alpha, nvinfer1::PluginFieldType::kFLOAT32, 1);
    const auto plugin = createPlugin(getNodeName(node), importPluginCreator(pluginName, pluginVersion), f);

    auto* layer = ctx->network()->addPluginV2(&input, 1, *plugin);
    ctx->registerLayer(layer, getNodeName(node));
    nvinfer1::ITensor* tensorPtr = layer->getOutput(0);

    return{{tensorPtr}};
}

Complete the implementation of the custom operator in the customPlugin directory, which contains:

CMakeLists.txt
customLayer.cu
customPlugin.cpp
customPlugin.h
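
The article does not show the contents of customLayer.cu. As a reference for checking the plugin's output, here is a minimal CPU sketch in plain C++. It assumes that the Custom operator has LeakyReLU semantics, which the node name "LeakyRelu_2" and the alpha attribute in the error log suggest but do not prove. The function name customReferenceCpu is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for the Custom operator, ASSUMING LeakyReLU semantics:
// y = x when x > 0, otherwise y = negativeSlope * x.
std::vector<float> customReferenceCpu(const std::vector<float>& input, float negativeSlope)
{
    std::vector<float> output(input.size());
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        const float x = input[i];
        output[i] = x > 0.0f ? x : negativeSlope * x;
    }
    return output;
}
```

Comparing the engine's output with such a reference on a few inputs is a cheap way to catch a wrong alpha value or an indexing bug in the kernel.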

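CUDA programming

Kernel functions

A kernel function is the program that runs on each GPU thread. It must be defined with the __global__ function type qualifier, in the following form: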

            __global__ void kernel(param list){  }

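A kernel is launched from host code (leaving dynamic parallelism aside), and the launch must state the execution configuration. The call has the following form: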

            Kernel<<<Dg,Db, Ns, S>>>(param list);

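The <<<>>> operator holds the execution configuration of the kernel. It tells the runtime how to launch the kernel: how many threads the kernel uses and how they are organized.

The complete form of the execution configuration is <<<Dg, Db, Ns, S>>>.

Dg defines the dimensions and size of the grid, that is, how many blocks the grid contains. It is of type dim3. dim3 Dg(Dg.x, Dg.y, 1) describes a grid with Dg.x blocks per row and Dg.y blocks per column, Dg.x*Dg.y blocks in total. A kernel launch always creates exactly one grid. On early hardware (compute capability 1.x) the third dimension was fixed at 1, and Dg.x and Dg.y were limited to 65535. Current GPUs (compute capability 3.0 and later) also accept a third grid dimension, with Dg.x up to 2^31-1 and Dg.y and Dg.z up to 65535.

Db defines the dimensions and size of a block, that is, how many threads each block contains. It is also of type dim3. dim3 Db(Db.x, Db.y, Db.z) describes a block with Db.x threads per row, Db.y threads per column and a height of Db.z, Db.x*Db.y*Db.z threads in total. On compute capability 1.x, Db.x and Db.y were limited to 512, Db.z to 64, and the product to 512. On current GPUs, Db.x and Db.y may be up to 1024 and Db.z up to 64, and the product may not exceed 1024 threads per block. The figures 768 (compute capability 1.0 and 1.1) and 1024 (compute capability 1.2 and 1.3) are the maximum number of resident threads per multiprocessor on that older hardware, not per-block limits.

Ns is optional. It gives the number of bytes of shared memory to allocate dynamically for each block, in addition to the statically allocated shared memory. Set it to 0 or omit it when no dynamic allocation is needed.

S is an optional argument of type cudaStream_t with a default value of 0. It selects the stream in which the kernel runs.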
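
The arithmetic behind Dg and Db can be sketched on the host in plain C++. The Dim3 struct below is a hypothetical stand-in for CUDA's dim3, and the limits checked by isValidBlock are the ones assumed here for compute capability 3.0 and later (x and y up to 1024, z up to 64, at most 1024 threads per block). For a specific GPU, the values reported by cudaGetDeviceProperties are authoritative.

```cpp
#include <cstdint>

// Host-side stand-in for CUDA's dim3 (hypothetical; the real type comes from the CUDA headers).
struct Dim3
{
    std::uint32_t x;
    std::uint32_t y;
    std::uint32_t z;
};

// Number of blocks in a grid, or threads in a block: the product of the three extents.
std::uint64_t volume(const Dim3& d)
{
    return static_cast<std::uint64_t>(d.x) * d.y * d.z;
}

// Total number of threads started by Kernel<<<Dg, Db>>>(...).
std::uint64_t totalThreads(const Dim3& Dg, const Dim3& Db)
{
    return volume(Dg) * volume(Db);
}

// Checks Db against the limits ASSUMED here for compute capability 3.0+:
// x <= 1024, y <= 1024, z <= 64 and x * y * z <= 1024.
bool isValidBlock(const Dim3& Db)
{
    const bool nonEmpty = Db.x > 0 && Db.y > 0 && Db.z > 0;
    return nonEmpty && Db.x <= 1024 && Db.y <= 1024 && Db.z <= 64 && volume(Db) <= 1024;
}
```

For example, totalThreads(Dim3{2, 1, 1}, Dim3{4, 1, 1}) is 8, and a 32 x 32 x 1 block passes the check while a 32 x 32 x 2 block does not.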

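#include<stdio.h>

__global__ void hello_from_gpu()
{
   printf("hello world from the gpu!\n");
}

int main()
{
   // Launch one block containing one thread.
   hello_from_gpu<<<1,1>>>();
   // Wait for the kernel to finish so that its printf output is flushed.
   cudaDeviceSynchronize();
   printf("hello world\n");
   return 0;
}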

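A kernel call differs from an ordinary C++ function call: between the kernel name and the parentheses there is a pair of triple angle brackets containing two comma-separated numbers. A GPU has many compute cores and can run many threads, so when the host launches a kernel it must say how many threads to assign on the device. The numbers in the triple angle brackets specify the number of threads and how they are arranged. The threads of a kernel are organized into thread blocks.
The first number is the number of thread blocks and the second is the number of threads in each block. All thread blocks of a kernel form a grid. The number of blocks is the grid size, and every block contains the same number of threads, the block size. The total number of threads is therefore the grid size times the block size, and the launch has the form <<<grid size, block size>>>.
printf inside a kernel is used in much the same way as the C library printf, and it likewise needs the header <stdio.h>. C++ iostream is not supported in kernels.
The statement cudaDeviceSynchronize(); calls a CUDA runtime API function. Without it, the string from the kernel is not printed. Kernel printf output is first written to a buffer, and that buffer is not flushed automatically but only at certain synchronization points. cudaDeviceSynchronize() makes the host wait for the device, which also flushes the buffer.

Kernels that use multiple threads

A kernel may be assigned a very large number of threads. A GPU often has thousands of compute cores, and the total number of threads must be at least equal to the number of cores to have any chance of using all of them. In practice the GPU is used more fully when the thread count is well above the core count, because this lets computation overlap with memory access and with other computation, which reduces the time cores sit idle.
The program below uses a grid size of 2 and a block size of 4, so the kernel launch assigns 2 x 4 = 8 threads.
Kernel code runs in "single instruction, multiple threads" (SIMT) fashion: every thread executes the same instructions, so each of the 8 threads prints the same line.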

#include<stdio.h>

__global__ void hello_from_gpu()
{
   printf("hello world from the gpu!\n");
}

int main()
{
   // Launch 2 blocks of 4 threads each, 8 threads in total.
   hello_from_gpu<<<2,4>>>();
   cudaDeviceSynchronize();
   printf("hello world\n");
   return 0;
}

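Using thread indices

A kernel can be assigned many threads, and their organization is determined by the execution configuration (<<<grid size, block size>>>). The grid size and block size are in general variables of the struct type dim3, but they can also be plain integers.

The number of threads that can be assigned to one kernel is huge, enough for almost any application. The number of threads that are active at the same moment is a different matter (inactive threads wait): it is determined by the hardware (mainly the number of CUDA cores) and by the software (the body of the kernel).
Every thread has a unique identity within the kernel. Because two parameters in the triple angle brackets specify the number of threads, a thread's identity is also given by two values. Inside the kernel, the execution configuration parameters grid_size and block_size are available in the following built-in variables:
gridDim.x: equal to grid_size in the execution configuration.
blockDim.x: equal to block_size in the execution configuration.
The following built-in variables, predefined in the kernel, identify a thread:
blockIdx.x: the index of the thread's block within the grid, ranging from 0 to gridDim.x-1.
threadIdx.x: the index of the thread within its block, ranging from 0 to blockDim.x-1.
The code is as follows.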

#include<stdio.h>
__global__ void hello_from_gpu()
{
   // Index of this thread's block in the grid and of the thread in its block.
   const int bid = blockIdx.x;
   const int tid = threadIdx.x;
   printf("hello world from block %d and thread %d\n",bid,tid);
}
int main()
{
   hello_from_gpu<<<2,4>>>();
   cudaDeviceSynchronize();
   printf("hello world\n");
   return 0;
}
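
The block index and thread index are commonly combined into a single global index with threadIdx.x + blockIdx.x * blockDim.x. The following plain C++ sketch emulates that calculation on the host for a one-dimensional launch. The function names and the parameters bid, tid and blockSize are hypothetical stand-ins for the built-in variables, and the sequential loop only enumerates the indices, whereas on a real GPU the threads run concurrently and in no fixed order.

```cpp
#include <cassert>
#include <vector>

// Mirrors threadIdx.x + blockIdx.x * blockDim.x for a one-dimensional launch.
// bid stands in for blockIdx.x, tid for threadIdx.x, blockSize for blockDim.x.
int globalThreadIndex(int bid, int blockSize, int tid)
{
    assert(tid >= 0 && tid < blockSize);
    return tid + bid * blockSize;
}

// Enumerates the global index of every thread in a <<<gridSize, blockSize>>> launch,
// block by block.
std::vector<int> allGlobalIndices(int gridSize, int blockSize)
{
    std::vector<int> indices;
    for (int bid = 0; bid < gridSize; ++bid)
    {
        for (int tid = 0; tid < blockSize; ++tid)
        {
            indices.push_back(globalThreadIndex(bid, blockSize, tid));
        }
    }
    return indices;
}
```

For the <<<2,4>>> launch used above, allGlobalIndices(2, 4) yields the eight distinct indices 0 to 7, so each thread can be assigned its own array element.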

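Limits on grid size and block size

CUDA limits the grid size and block size that can be specified. In particular, a thread block can contain at most 1024 threads.

The following sample code compares two ways of copying memory from device to device in CUDA:

using cudaMemcpy with cudaMemcpyDeviceToDevice;
using a copy kernel.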

#include <stdio.h>
#include <stdlib.h>

#include "Utilities.cuh" 
#include "TimingGPU.cuh" 

#define BLOCKSIZE 512 

/***************/ 
/* COPY KERNEL */ 
/***************/ 
__global__ void copyKernel(const double * __restrict__ d_in, double * __restrict__ d_out, const int N) { 

    const int tid = threadIdx.x + blockIdx.x * blockDim.x; 

    if (tid >= N) return; 

    d_out[tid] = d_in[tid]; 

} 

/********/ 
/* MAIN */ 
/********/ 
int main() { 

    const int N = 1000000; 

    TimingGPU timerGPU; 

    double *h_test = (double *)malloc(N * sizeof(double)); 

    for (int k = 0; k < N; k++) h_test[k] = 1.; 

    double *d_in; gpuErrchk(cudaMalloc(&d_in, N * sizeof(double))); 
    gpuErrchk(cudaMemcpy(d_in, h_test, N * sizeof(double), cudaMemcpyHostToDevice)); 

    double *d_out; gpuErrchk(cudaMalloc(&d_out, N * sizeof(double))); 

    timerGPU.StartCounter(); 
    gpuErrchk(cudaMemcpy(d_out, d_in, N * sizeof(double), cudaMemcpyDeviceToDevice)); 
    printf("cudaMemcpy timing = %f [ms]\n", timerGPU.GetCounter()); 

    timerGPU.StartCounter(); 
    copyKernel<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_in, d_out, N);
    gpuErrchk(cudaPeekAtLastError()); 
    gpuErrchk(cudaDeviceSynchronize()); 
    printf("Copy kernel timing = %f [ms]\n", timerGPU.GetCounter()); 

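    // Release device and host memory.
    gpuErrchk(cudaFree(d_in));
    gpuErrchk(cudaFree(d_out));
    free(h_test);

    return 0;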
}
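
The helper iDivUp comes from Utilities.cuh, which the article does not show. A minimal sketch of what it presumably does, namely ceiling division so that the grid covers all N elements, is given below in plain C++. The body of iDivUp is an assumption, and idleThreads is a hypothetical helper added only to show why the kernel needs the check if (tid >= N) return;.

```cpp
#include <cassert>

// Ceiling division: the smallest number of blocks of size b that covers a elements.
// ASSUMED behaviour of the iDivUp helper from Utilities.cuh.
int iDivUp(int a, int b)
{
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

// Number of launched threads whose global index falls outside [0, n).
// These threads must exit early through the bounds check in the kernel.
int idleThreads(int n, int blockSize)
{
    return iDivUp(n, blockSize) * blockSize - n;
}
```

With N = 1000000 and BLOCKSIZE = 512, iDivUp returns 1954 blocks, which start 1000448 threads, so 448 threads in the last block do no copying.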

Building TensorRT

Return to the root directory of the TRT source tree and build with the following commands:
cmake -B build
cd build
make
