PyTorch Quantization + Deployment

I. Quantization

PyTorch supports quantization in three ways:

1) Quantization applied after training, with activation ranges computed on the fly: post training dynamic quantization

2) Quantization applied after training, with statically calibrated ranges: post training static quantization

3) Quantization enabled during training: quantization aware training (QAT)

For post training dynamic/static quantization, see the following blog post:

Pytorch模型量化 (凌逆战, CSDN): https://blog.csdn.net/qq_34218078/article/details/127521819

For post training static quantization and quantization aware training, see:

https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html
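As a concrete illustration of option 1), here is a minimal sketch of post-training dynamic quantization; the toy two-layer network and its sizes are made up for illustration and are not taken from any specific project:

import torch
import torch.nn as nn

# toy stand-in for a real trained model (hypothetical layer sizes)
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

# post-training dynamic quantization: the weights of the listed module types are
# converted to int8 up front, while activations are quantized on the fly at runtime
quantized_model = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 128)
with torch.no_grad():
    print(quantized_model(x).shape)  # the quantized model keeps the same interface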


II. Deployment

1. Route 1: PyTorch --> ONNX --> TensorRT (NVIDIA), suitable for deployment on NVIDIA GPUs

About ONNX: the Open Neural Network Exchange (ONNX) format is a standard for representing deep learning models that lets models move between frameworks. ONNX is an open file format designed for machine learning and used to store trained models, so that different AI frameworks (e.g. PyTorch, MXNet) can store model data in the same format and interoperate. The ONNX specification and code are developed jointly by Microsoft, Amazon, Facebook, IBM and others. Frameworks that officially support loading ONNX models for inference include Caffe2, PyTorch, MXNet, ML.NET, TensorRT and Microsoft CNTK, and TensorFlow also has unofficial ONNX support.[1][2][3]

About TensorRT: TensorRT is a C++ library for high-performance inference on NVIDIA GPUs. It is designed to work in a complementary way with training frameworks such as TensorFlow, Caffe, PyTorch and MXNet, focusing specifically on running network inference quickly and efficiently on the GPU.[4][5]

step 1: PyTorch --> ONNX

Converting a PyTorch model to ONNX is straightforward: call torch.onnx.export().

Official documentation:

https://onnxruntime.ai/docs/tutorials/accelerate-pytorch/pytorch.html

https://pytorch.org/docs/stable/onnx.html

Function description:

torch.onnx.export()[6]

torch.onnx.export(model, args, f, export_params=True, verbose=False, training=False, input_names=None, output_names=None, aten=False, export_raw_ir=False, operator_export_type=None, opset_version=None, _retain_param_name=True, do_constant_folding=False, example_outputs=None, strip_doc_string=True, dynamic_axes=None, keep_initializers_as_inputs=None)

Purpose: export a .pth model to an ONNX file

Parameters:

  • model (torch.nn.Module): the loaded PyTorch model
  • args (tuple of arguments): an example input to the model, which fixes the input shape
  • export_params (bool, default True): if True (the default), the trained parameters are exported as well; set it to False to export an untrained model
  • verbose (bool, default False): print a debug description of the exported trace
  • training (bool, default False): export the model in training mode. ONNX models are currently exported for inference only, so this usually stays False
  • input_names (list of strings, default empty list): names for the inputs of the ONNX graph; can be chosen freely
  • output_names (list of strings, default empty list): names for the outputs of the ONNX graph; can be chosen freely
  • opset_version: defaults to 9
  • dynamic_axes: e.g. {'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}} to mark the listed axes as dynamic

Examples:

1) Model saved with torch.save(model, path) (this saves both the model's parameters and its structure to path, but loading it may fail when package versions differ; also, the model-definition code must be kept on the same path when saving the whole model, otherwise torch.load cannot resolve the model's class).

Create a test input in the format [batch, channel, height, width]:

x = torch.randn(1, 3, 224, 224, device=device)

Also remember to call model.eval() before exporting; otherwise the ONNX model's outputs may not match (dropout and batch-norm layers would stay in training mode).

import torch
import torch.nn
import onnx
 
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device = torch.device('cpu')
 
model = torch.load('***.pth', map_location=device)
model.eval()
 
input_names = ['input']
output_names = ['output']
 
x = torch.randn(1, 3, 224, 224, device=device)  # only the shape must match the real input; the values don't matter, so random numbers are fine
 
torch.onnx.export(model, x, 'name.onnx', input_names=input_names, output_names=output_names, verbose=True)

2) Model saved with torch.save(model.state_dict(), model_path) (saves only the parameters; recommended)

This way of saving requires the network-definition file as well.

import torch.onnx
import onnxruntime as ort
from model import Net
 
# build the same network structure that was used for training
model = Net()

# load the weights
model_path = '***.pth'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model_statedict = torch.load(model_path, map_location=device)
model.load_state_dict(model_statedict)
 
model.to(device)
model.eval()
 
input_data = torch.randn(1, 3, 224, 224, device=device)
 
# export to ONNX
input_names = ['input']
output_names = ['output']
 
torch.onnx.export(model, input_data, 'name.onnx', opset_version=9, verbose=True, input_names=input_names, output_names = output_names)

Note: the shape of input_data is the input shape the exported model will expect at run time (unless dynamic axes are specified, as sketched below).
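If some dimensions (typically the batch size) should stay flexible at runtime, the dynamic_axes argument from the parameter list above can be passed. A minimal sketch, reusing the model, input_data and name.onnx from the example above (the axis name 'batch_size' is arbitrary):

torch.onnx.export(
    model, input_data, 'name.onnx',
    opset_version=11,
    input_names=['input'], output_names=['output'],
    # mark dimension 0 of the input and output as dynamic, so the exported
    # model accepts any batch size at inference time
    dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}},
)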

Verifying that the ONNX model is valid:[7][8]

import onnx

# Preprocessing: load the ONNX model
model_path = 'path/to/the/model.onnx'
onnx_model = onnx.load(model_path)

print('The model is:\n{}'.format(onnx_model))

# Check the model
try:
    onnx.checker.check_model(onnx_model)
except onnx.checker.ValidationError as e:
    print('The model is invalid: %s' % e)
else:
    print('The model is valid!')

Running the ONNX model with onnxruntime:[9]

Official tutorial:

ONNX Runtime documentation: https://onnxruntime.ai/docs/api/python/api_summary.html

Usage example (an RL policy in this case):[10]

import onnxruntime as ort
import onnx
import numpy as np
import env_class

env = env_class(render=True)
#onnx_model = onnx.load_model('xxx.onnx')
model_path = 'xxx.onnx'
onnx_model = onnx.load(model_path)
onnx.checker.check_model(onnx_model)

#sess = ort.InferenceSession(onnx_model.SerializeToString())
#sess.set_providers(['CPUExecutionProvider'])
sess = ort.InferenceSession(model_path,providers=['CPUExecutionProvider'])

input_name = sess.get_inputs()[0].name
output_name = sess.get_outputs()[0].name

n_eval = 10  # number of evaluation episodes (choose as needed)
for j in range(n_eval):
    o,info = env.reset()
    r = 0
    d = False
    ep_ret = 0
    ep_len = 0
    while not (d or (ep_len == 1000)):
        o = np.expand_dims(o,axis=0).astype(np.float32)
        a = sess.run([output_name],{input_name:o})[0][0]
        o,r,d,_ = env.step(a)
        ep_ret += r
        ep_len += 1

When two or more networks are needed, simply create one session per model, as in the sketch below.
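For instance, a minimal sketch with two sessions (the file names actor.onnx / critic.onnx and the observation size are placeholders, not from any real project):

import numpy as np
import onnxruntime as ort

# one InferenceSession per network (hypothetical file names)
actor_sess = ort.InferenceSession('actor.onnx', providers=['CPUExecutionProvider'])
critic_sess = ort.InferenceSession('critic.onnx', providers=['CPUExecutionProvider'])

obs = np.random.randn(1, 37).astype(np.float32)  # dummy observation batch
action = actor_sess.run(None, {actor_sess.get_inputs()[0].name: obs})[0]
value = critic_sess.run(None, {critic_sess.get_inputs()[0].name: obs})[0]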

Viewing the ONNX model's structure:[11]

Netron: https://netron.app/

Open the page and load the converted model to see its structure.
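Netron can also be installed as a Python package and launched locally; a small sketch, assuming pip install netron and the exported name.onnx from above:

import netron

netron.start('name.onnx')  # serves a local web page that renders the graph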

step 2: ONNX --> TensorRT[12]

1) Inference with the TensorRT Python API (using an RL policy as the example):

Method 1: convert the ONNX model to .trt with the trtexec tool and then run inference

How to install TensorRT is described later in this article. Converting a .onnx model to a .trt engine relies on the trtexec tool shipped with TensorRT, located in the TensorRT-8.2.5.1/bin directory. Copy your ONNX model into that folder, open a terminal there, and run the following command to obtain the converted model output_model_name.trt:

./trtexec --onnx=onnx_model_name.onnx --saveEngine=output_model_name.trt

Here --onnx and --saveEngine are the path of the ONNX model and the path where the TensorRT engine will be saved. Two other commonly used command-line options:

  • --explicitBatch: tells trtexec to fix the input batch size during optimization (the value is inferred from the ONNX file, i.e. it equals the batch size used when exporting to ONNX). Recommended when the model's input batch size is known: a fixed batch size enables extra optimizations and removes the need to specify an optimization profile.
  • --fp16: use FP16 precision, trading a little accuracy for a lighter model (lower memory use and faster inference). TensorRT supports TF32/FP32/FP16/INT8 (depending on what the GPU supports). FP32 is the default training precision of most frameworks; FP16 noticeably improves inference speed and memory use, and the accuracy loss is usually negligible; INT8 sacrifices more accuracy in exchange for even lower latency and memory requirements, but needs an extra careful calibration step to keep the accuracy loss small.

With these options added, the command becomes:

./trtexec --onnx=onnx_model_name.onnx --saveEngine=output_model_name.trt --explicitBatch --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --fp16

For the full set of trtexec options, see:

TensorRT - 自带工具trtexec的参数使用说明: https://stubbornhuang.blog.csdn.net/article/details/120360642

【TensorRT】trtexec工具转engine (CSDN): https://blog.csdn.net/dou3516/article/details/125976347

Once output_model_name.trt has been produced, run inference with the Python API:

import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import time
import copy
import env_class


class TRTPolicy():
    def __init__(self):
        f = open("output_model_name.trt","rb")                                     # 读取trt模型
        self.runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))   # 创建一个Runtime(传入记录器Logger)
        self.engine = self.runtime.deserialize_cuda_engine(f.read()) # 从文件中加载trt引擎
        self.context = self.engine.create_execution_context()        # 创建context

        # 分配input和output内存
        self.batch_size = 1
        self.state_dim = 37
        self.act_dim = 12

        dummy_input = np.random.rand(self.batch_size,self.state_dim).astype(np.float16)
        input_batch = np.array(dummy_input)
        self.output = np.empty([self.batch_size,self.act_dim],dtype=np.float16)

        self.d_input = cuda.mem_alloc(1 * input_batch.nbytes)
        self.d_output = cuda.mem_alloc(1 * self.output.nbytes)

        self.bindings = [int(self.d_input),int(self.d_output)]
        self.stream = cuda.Stream()

    def predict(self,batch):
        batch = batch.astype(np.float16)
        # copy the input data to the device
        cuda.memcpy_htod_async(self.d_input,batch,self.stream)
        # run inference asynchronously; for synchronous inference replace execute_async_v2 with execute_v2
        self.context.execute_async_v2(self.bindings, self.stream.handle, None)
        # copy the result back from the device
        cuda.memcpy_dtoh_async(self.output,self.d_output,self.stream)
        # synchronize the stream
        self.stream.synchronize()
        output = copy.deepcopy(self.output)
        return output

policy = TRTPolicy()
env = env_class(render=True)
n_eval = 10  # number of evaluation episodes (choose as needed)
for j in range(n_eval):
    o,info = env.reset()
    r = 0
    d = False
    ep_ret = 0
    ep_len = 0
    while not (d or (ep_len == 100000)):
        o = np.expand_dims(o,axis=0).astype(np.float32)
        a = policy.predict(o)[0]
        o,r,d,_ = env.step(a)
        ep_ret += r
        ep_len += 1

Method 2: use TensorRT's parser interface to parse the ONNX model and build the engine directly. This approach is fairly simple, does not depend on other tools, and supports dynamic shapes. A Python sketch is given after the links below.

Official example:

NVIDIA Deep Learning TensorRT Documentation: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#python_topics

Unofficial examples:

pytorch模型(.pth)转tensorrt模型(.engine)几种方式 (51CTO): https://blog.51cto.com/u_15435490/5293165

https://murphypei.github.io/blog/2019/09/trt-useage
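For reference, a minimal Python sketch of this parser route, written against the TensorRT 8.2-era API used elsewhere in this article; the file names are placeholders, and on newer TensorRT releases max_workspace_size is replaced by config.set_memory_pool_limit:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path='your_onnx_model.onnx', engine_path='your_model.trt'):
    builder = trt.Builder(TRT_LOGGER)
    # ONNX models require an explicit-batch network
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    with open(onnx_path, 'rb') as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError('failed to parse ' + onnx_path)

    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 28  # 256 MB of builder scratch space

    serialized_engine = builder.build_serialized_network(network, config)
    with open(engine_path, 'wb') as f:
        f.write(serialized_engine)
    return engine_path

The saved file can then be loaded with trt.Runtime and deserialize_cuda_engine exactly as in the trtexec-based example above.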

2) Inference with the TensorRT C++ API

Official tutorial:

NVIDIA Deep Learning TensorRT Documentation: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#c_topics

Method 1: convert the ONNX model to .trt with the trtexec tool and then run inference[13]

First convert the model with trtexec:

cd TensorRT-8.2.5.1/bin
./trtexec --onnx=onnx_model_name.onnx --saveEngine=output_model_name.trt

Then run inference with infer_trt_model.cpp:

// TensorRT include
#include <NvInfer.h>
#include <NvInferRuntime.h>

//cuda include
#include <cuda_runtime.h>

//system include
#include <stdio.h>
#include <math.h>

#include <iostream>
#include <fstream>
#include <vector>

using namespace std;

static const int batch_size = 1;
static const int input_dim = 37;
static const int output_dim = 12;


class TRTLogger : public nvinfer1::ILogger{
public:
 virtual void log(Severity severity, nvinfer1::AsciiChar const* msg) noexcept override{
 if(severity <= Severity::kINFO){
 printf("%d: %s\n", severity, msg);
        }
    }
} logger;


vector<unsigned char> load_file(const string& file){
 ifstream in(file, ios::in | ios::binary); // open in binary mode
 if (!in.is_open())
 return {};

 in.seekg(0, ios::end);      // seek to the end of the file
 size_t length = in.tellg(); // get the file length

 std::vector<uint8_t> data;
 if (length > 0){
 in.seekg(0, ios::beg); // seek back to the beginning
 data.resize(length);

 in.read((char*)&data[0], length); // read the raw bytes
    }
 in.close();
 return data;
}


void inference(){
    //-------------------- 1. Load the serialized engine --------------------
 TRTLogger logger;
 auto engine_data = load_file("output_model_name.trt");              // read the serialized .trt file into a byte vector
 nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger); // create the runtime
 nvinfer1::ICudaEngine* engine = runtime->deserializeCudaEngine(engine_data.data(),engine_data.size()); // deserialize the engine data
 if(engine == nullptr){
 printf("Deserialize cuda engine failed.\n");
 runtime->destroy();
 return;
    }

 nvinfer1::IExecutionContext* execution_context = engine->createExecutionContext();  // create an execution context
 cudaStream_t stream = nullptr;                                                      // create a CUDA stream for async execution
 cudaStreamCreate(&stream);

    //-------------------- 2. Prepare the input data and copy it to the GPU --------------------
 float input_data_host[batch_size * input_dim] = {1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
                                                        1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
                                                        1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
                                                        1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0};
 float* input_data_device = nullptr;

 float output_data_host[batch_size * output_dim];
 float* output_data_device = nullptr;

 cudaMalloc(&input_data_device,sizeof(input_data_host));
 cudaMalloc(&output_data_device,sizeof(output_data_host));
 cudaMemcpyAsync(input_data_device,input_data_host,sizeof(input_data_host),cudaMemcpyHostToDevice,stream); // copy the input from host to device
 float* bindings[] = {input_data_device, output_data_device}; // array of device pointers for the input and output bindings

    //-------------------- 3. Run inference and copy the result back to the host --------------------
 bool success = execution_context->enqueueV2((void**)bindings,stream,nullptr);
 cudaMemcpyAsync(output_data_host,output_data_device,sizeof(output_data_host),cudaMemcpyDeviceToHost, stream);
 cudaStreamSynchronize(stream);

    //-------------------- 4. Release resources --------------------
 printf("Clean memory\n");
 cudaStreamDestroy(stream);
 execution_context->destroy();
 engine->destroy();
 runtime->destroy();

}

int main(){
 inference();
}

CMakeLists.txt[14]

cmake_minimum_required(VERSION 3.11)
project(tensorRT_test)
set(CMAKE_CXX_STANDARD 14)
file(COPY output_model_name.trt DESTINATION .)

# CUDA
find_package(CUDA REQUIRED)
include_directories(${CUDA_INCLUDE_DIRS})
message("CUDA_TOOLKIT_ROOT_DIR = ${CUDA_TOOLKIT_ROOT_DIR}")
message("CUDA_INCLUDE_DIRS = ${CUDA_INCLUDE_DIRS}")
message("CUDA_LIBRARIES = ${CUDA_LIBRARIES}")


link_libraries(nvinfer nvonnxparser ${CUDA_LIBRARIES})

add_executable(infer infer_trt_model.cpp)

Build and run:

rm -r build
mkdir build
cd build
cmake ..
make
./infer

Method 2: parse the ONNX model with NvOnnxParser and then run inference[15][16]

First convert the model by parsing the ONNX file with NvOnnxParser.

onnx_to_engine.cpp

#include <NvInfer.h>
#include <NvInferRuntime.h>
#include <NvOnnxParser.h>
#include <cuda_runtime.h>

#include <stdio.h>
#include <math.h>
#include <string>
#include <iostream>
#include <fstream>
#include <memory>
#include <functional>
#include <unistd.h>
#include <chrono>
#include <sys/stat.h>

#include <vector>



inline const char* severity_string(nvinfer1::ILogger::Severity t) {
 switch (t) {
 case nvinfer1::ILogger::Severity::kINTERNAL_ERROR: return "internal_error";
 case nvinfer1::ILogger::Severity::kERROR: return "error";
 case nvinfer1::ILogger::Severity::kWARNING: return "warning";
 case nvinfer1::ILogger::Severity::kINFO: return "info";
 case nvinfer1::ILogger::Severity::kVERBOSE: return "verbose";
 default: return "unknown";
    }
}

class TRTLogger : public nvinfer1::ILogger {
public:
 virtual void log(Severity severity, nvinfer1::AsciiChar const* msg) noexcept override {
 if (severity <= Severity::kWARNING) {
 if (severity == Severity::kWARNING) printf("\033[33m%s: %s\033[0m\n", severity_string(severity), msg);
 else if (severity == Severity::kERROR) printf("\033[31m%s: %s\033[0m\n", severity_string(severity), msg);
 else printf("%s: %s\n", severity_string(severity), msg);
        }
    }
};



bool isFileExist(const std::string name) 
{
 struct stat buffer;
 return (stat (name.c_str(), &buffer) == 0);
}

bool build_model(){
 if (isFileExist ("your_model.trtmodel")){
 printf("your_model.trtmodel already exists.\n");
 return true;
    }

 TRTLogger logger;

    // create the builder, config and network
 nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);  //create an instance of the builder
 nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
 nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1); //create a network definition

    // parse the ONNX model with the ONNX parser
 auto parser = nvonnxparser::createParser(*network,logger);           // create an ONNX parser to populate the network
 if (!parser->parseFromFile("your_onnx_model.onnx",1)){
 printf("Failed to parse your_onnx_model.onnx.\n");
 return false;
    }

    // set the builder workspace size
 printf("Workspace Size = %.2f MB\n", (1 << 28) / 1024.0f / 1024.0f);
 config->setMaxWorkspaceSize(1 << 28);

    // build the TensorRT engine
 nvinfer1::ICudaEngine* engine = builder->buildEngineWithConfig(*network,*config);
 if (engine == nullptr){
 printf("Build engine failed.\n");
 return false;
    }

    // serialize the built engine and save it to a file
 nvinfer1::IHostMemory* model_data = engine->serialize();
 FILE* f = fopen("your_model.trtmodel","wb");
 fwrite(model_data->data(),1,model_data->size(),f);
 fclose(f);

    // destroy the objects
 model_data->destroy();
 engine->destroy();
 network->destroy();
 config->destroy();
 builder->destroy();

 printf("Build Done.\n");

 return true;

}

int main()
{
 build_model();
}

CMakeLists.txt[17]

cmake_minimum_required(VERSION 3.11)
project(tensorRT_test)
set(CMAKE_CXX_STANDARD 14)
file(COPY your_onnx_model.onnx DESTINATION .)

# CUDA
find_package(CUDA REQUIRED)
include_directories(${CUDA_INCLUDE_DIRS})
message("CUDA_TOOLKIT_ROOT_DIR = ${CUDA_TOOLKIT_ROOT_DIR}")
message("CUDA_INCLUDE_DIRS = ${CUDA_INCLUDE_DIRS}")
message("CUDA_LIBRARIES = ${CUDA_LIBRARIES}")


link_libraries(nvinfer nvonnxparser ${CUDA_LIBRARIES})

add_executable(engine onnx_to_engine.cpp)

Build and run to produce your_model.trtmodel (saved in the build folder):

mkdir build
cd build
cmake ..
make
./engine

Next, run inference with the your_model.trtmodel produced in the previous step.

infer_engine.cpp

#include <fstream> 
#include <iostream> 
 
// TensorRT include
#include <NvInfer.h>
#include <NvInferRuntime.h>

//cuda include
#include <cuda_runtime.h>
#include <cassert>


class TRTLogger : public nvinfer1::ILogger{
public:
 virtual void log(Severity severity, nvinfer1::AsciiChar const* msg) noexcept override{
 if(severity <= Severity::kINFO){
 printf("%d: %s\n", severity, msg);
        }
    }
} logger;


#define CHECK(status)                                          \
    do                                                         \
    {                                                          \
        auto ret = (status);                                   \
        if (ret != 0)                                          \
        {                                                      \
            std::cerr << "Cuda failure: " << ret << std::endl; \
            abort();                                           \
        }                                                      \
    } while (0)
 
using namespace nvinfer1; 

 
const char* IN_NAME = "input"; 
const char* OUT_NAME = "output"; 
static const int input_dim = 37; 
static const int output_dim = 12; 
static const int batch_size = 1; 
static const int EXPLICIT_BATCH = 1 << (int)(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH); 
 
 
void doInference(IExecutionContext& context, float* input, float* output, int batchSize) 
{ 
 const ICudaEngine& engine = context.getEngine(); 
 
        // Pointers to input and output device buffers to pass to engine. 
        // Engine requires exactly IEngine::getNbBindings() number of buffers. 
 assert(engine.getNbBindings() == 2); 
 void* buffers[2]; 
 
        // In order to bind the buffers, we need to know the names of the input and output tensors. 
        // Note that indices are guaranteed to be less than IEngine::getNbBindings() 
 const int inputIndex = engine.getBindingIndex(IN_NAME); 
 const int outputIndex = engine.getBindingIndex(OUT_NAME); 
 
        // Create GPU buffers on device 
 CHECK(cudaMalloc(&buffers[inputIndex], batchSize * input_dim * sizeof(float))); 
 CHECK(cudaMalloc(&buffers[outputIndex], batchSize * output_dim * sizeof(float))); 
 
        // Create stream 
 cudaStream_t stream; 
 CHECK(cudaStreamCreate(&stream)); 
 
        // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host 
 CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * input_dim * sizeof(float), cudaMemcpyHostToDevice, stream)); 
 context.enqueue(batchSize, buffers, stream, nullptr); 
 CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * output_dim * sizeof(float), cudaMemcpyDeviceToHost, stream)); 
 cudaStreamSynchronize(stream); 
 
        // Release stream and buffers 
 cudaStreamDestroy(stream); 
 CHECK(cudaFree(buffers[inputIndex])); 
 CHECK(cudaFree(buffers[outputIndex])); 
} 
 
int main(int argc, char** argv) 
{ 
        // create a model using the API directly and serialize it to a stream 
 char *trtModelStream{ nullptr }; 
 size_t size{ 0 }; 
 
 std::ifstream file("your_model.trtmodel", std::ios::binary); 
 if (file.good()) { 
 file.seekg(0, file.end); 
 size = file.tellg(); 
 file.seekg(0, file.beg); 
 trtModelStream = new char[size]; 
 assert(trtModelStream); 
 file.read(trtModelStream, size); 
 file.close(); 
        } 
 
 TRTLogger m_logger; 
 IRuntime* runtime = createInferRuntime(m_logger); 
 assert(runtime != nullptr); 
 ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr); 
 assert(engine != nullptr); 
 IExecutionContext* context = engine->createExecutionContext(); 
 assert(context != nullptr); 
 
        // generate input data 
 float data[batch_size * input_dim]; 
 for (int i = 0; i < batch_size * input_dim; i++) 
 data[i] = 1.0; 
 
        // Run inference 
 float prob[batch_size * output_dim]; 
 doInference(*context, data, prob, batch_size); 
 
        // Destroy the engine 
 context->destroy(); 
 engine->destroy(); 
 runtime->destroy(); 
 return 0; 
} 

CMakeLists.txt[18]

cmake_minimum_required(VERSION 3.11)
project(tensorRT_test)
set(CMAKE_CXX_STANDARD 14)
file(COPY your_model.trtmodel DESTINATION .)

# CUDA
find_package(CUDA REQUIRED)
include_directories(${CUDA_INCLUDE_DIRS})
message("CUDA_TOOLKIT_ROOT_DIR = ${CUDA_TOOLKIT_ROOT_DIR}")
message("CUDA_INCLUDE_DIRS = ${CUDA_INCLUDE_DIRS}")
message("CUDA_LIBRARIES = ${CUDA_LIBRARIES}")


link_libraries(nvinfer nvonnxparser ${CUDA_LIBRARIES})

add_executable(infer infer_engine.cpp)

Build and run:

rm -r build
mkdir build
cd build
cmake ..
make
./infer

Tips:

1) If inference needs two or more models, create a single runtime, then a separate engine for each trtmodel, and a separate context from each engine (see the sketch after the quoted note below):

Reference:[19]

The TensorRT runtime can be used by multiple threads simultaneously, so long as each object uses a different execution context.
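A rough Python-side sketch of that structure, with hypothetical engine file names:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)                 # one shared runtime

engines, contexts = {}, {}
for name in ('actor.trt', 'critic.trt'):      # hypothetical engine files
    with open(name, 'rb') as f:
        engines[name] = runtime.deserialize_cuda_engine(f.read())
    contexts[name] = engines[name].create_execution_context()  # one context per engine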

2) The difference between explicit batch and implicit batch:

See:

TensorRT 初探(3) -- explicit_batch vs implicit_batch: https://blog.csdn.net/weixin_45252450/article/details/124691175

3) Getting the input and output dimensions:

#include <sstream>
std::string DimsToStr(Dims dims)
{
 std::stringstream ss;
 for (size_t i = 0; i < dims.nbDims; i++)
    {
 ss << dims.d[i] << " ";
    }
 return ss.str();
}
 const int inputIndex = engine.getBindingIndex(IN_NAME);
 const int outputIndex = engine.getBindingIndex(OUT_NAME);

 auto input_dimensions = engine.getBindingDimensions(inputIndex);
 auto output_dimensions = engine.getBindingDimensions(outputIndex);
 cout << "The input_dimensions is:" << DimsToStr(input_dimensions) << endl;   
 cout << "The output_dimensions is:" << DimsToStr(output_dimensions) << endl; 

2. Route 2: PyTorch --> ONNX --> OpenVINO (Intel), suitable for deployment on Intel CPUs

About OpenVINO: OpenVINO is a comprehensive toolkit from Intel for rapidly deploying applications and solutions; it supports more than 200 CNN architectures for computer vision. It is a pipeline toolset compatible with models trained in various open-source frameworks and provides everything needed to put a model into production: once you know the tool, a pretrained model can be deployed quickly on Intel CPUs.[20]

  • Supported hardware: MKLDNN plugin --> CPUs: Xeon / Core / Atom; clDNN plugin --> GPU: HD Graphics; Myriad plugin --> VPU: Myriad; FPGA plugin --> FPGA Arria 10
  • Supported frameworks: Caffe, Keras, TensorFlow, ONNX, PyTorch, MXNet
  • Supported operating systems: Windows, Ubuntu, macOS

step 1: PyTorch --> ONNX (same as above)

step 2: ONNX --> OpenVINO

First install OpenVINO as described later in this article.

Go to the following directory:

cd /opt/intel/openvino_2021/deployment_tools/model_optimizer

Run the following command (note: do not use a virtual environment's Python):

sudo python3 mo.py --input_model yolox_s.onnx --input_shape [1,3,416,416] --data_type FP16

Here yolox_s.onnx is the original .onnx file. The command generates two files: the .xml describes the network topology, and the .bin contains the weights and biases.

Inference with the IR model (Python API)[21]

Steps when calling from Python:

1) Load the plugin

ie = IECore()
ie.add_extension(os.getenv('INTEL_OPENVINO_DIR')+'/deployment_tools/inference_engine/lib/intel64/libcpu_extension.dylib', device)

2) Read the model

net = IENetwork(model=modelpath+'.xml', weights=modelpath+'.bin')

3) Identify the input and output blobs

input_blob = next(iter(net.inputs))
out_blob = next(iter(net.outputs))

4) Run inference

exec_net = ie.load_network(network=net, device_name=device)
image = cv2.imread(imgpath)
res = exec_net.infer(inputs={input_blob: image})

Putting it all together:

from openvino.inference_engine import IENetwork, IECore
model_xml = "*.xml"
model_bin = "*.bin"
device = 'CPU'
ie = IECore()
net = IENetwork(model=model_xml, weights=model_bin)
input_blob,out_blob,net.batch_size = next(iter(net.inputs)),next(iter(net.outputs)),1
n, c, h, w = net.inputs[input_blob].shape
exec_net = ie.load_network(network=net, device_name=device)
output = exec_net.infer(inputs={input_blob: cpux.numpy()})[out_blob]  # cpux: an input tensor on the CPU with shape (n, c, h, w)

The official YOLOX example (~/YOLOX/demo/OpenVINO/python):

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Copyright (C) 2018-2021 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
# Copyright (c) Megvii, Inc. and its affiliates.

import argparse
import logging as log
import os
import sys

import cv2
import numpy as np

from openvino.inference_engine import IECore

from yolox.data.data_augment import preproc as preprocess
from yolox.data.datasets import COCO_CLASSES
from yolox.utils import mkdir, multiclass_nms, demo_postprocess, vis


def parse_args() -> argparse.Namespace:
    """Parse and return command line arguments"""
    parser = argparse.ArgumentParser(add_help=False)
    args = parser.add_argument_group('Options')
    args.add_argument(
        '-h',
        '--help',
        action='help',
        help='Show this help message and exit.')
    args.add_argument(
        '-m',
        '--model',
        required=True,
        type=str,
        help='Required. Path to an .xml or .onnx file with a trained model.')
    args.add_argument(
        '-i',
        '--input',
        required=True,
        type=str,
        help='Required. Path to an image file.')
    args.add_argument(
        '-o',
        '--output_dir',
        type=str,
        default='demo_output',
        help='Path to your output dir.')
    args.add_argument(
        '-s',
        '--score_thr',
        type=float,
        default=0.3,
        help="Score threshould to visualize the result.")
    args.add_argument(
        '-d',
        '--device',
        default='CPU',
        type=str,
        help='Optional. Specify the target device to infer on; CPU, GPU, \
              MYRIAD, HDDL or HETERO: is acceptable. The sample will look \
              for a suitable plugin for device specified. Default value \
              is CPU.')
    args.add_argument(
        '--labels',
        default=None,
        type=str,
        help='Optional. Path to a labels mapping file.')
    args.add_argument(
        '-nt',
        '--number_top',
        default=10,
        type=int,
        help='Optional. Number of top results.')
    return parser.parse_args()


def main():
    log.basicConfig(format='[ %(levelname)s ] %(message)s', level=log.INFO, stream=sys.stdout)
    args = parse_args()

    # ---------------------------Step 1. Initialize inference engine core--------------------------------------------------
    log.info('Creating Inference Engine')
    ie = IECore()

    # ---------------------------Step 2. Read a model in OpenVINO Intermediate Representation or ONNX format---------------
    log.info(f'Reading the network: {args.model}')
    # (.xml and .bin files) or (.onnx file)
    net = ie.read_network(model=args.model)
    if len(net.input_info) != 1:
        log.error('Sample supports only single input topologies')
        return -1
    if len(net.outputs) != 1:
        log.error('Sample supports only single output topologies')
        return -1

    # ---------------------------Step 3. Configure input & output----------------------------------------------------------
    log.info('Configuring input and output blobs')
    # Get names of input and output blobs
    input_blob = next(iter(net.input_info))
    out_blob = next(iter(net.outputs))
    # Set input and output precision manually
    net.input_info[input_blob].precision = 'FP32'
    net.outputs[out_blob].precision = 'FP16'

    # Get a number of classes recognized by a model
    num_of_classes = max(net.outputs[out_blob].shape)

    # ---------------------------Step 4. Loading model to the device-------------------------------------------------------
    log.info('Loading the model to the plugin')
    exec_net = ie.load_network(network=net, device_name=args.device)

    # ---------------------------Step 5. Create infer request--------------------------------------------------------------
    # load_network() method of the IECore class with a specified number of requests (default 1) returns an ExecutableNetwork
    # instance which stores infer requests. So you already created Infer requests in the previous step.

    # ---------------------------Step 6. Prepare input---------------------------------------------------------------------
    origin_img = cv2.imread(args.input)
    _, _, h, w = net.input_info[input_blob].input_data.shape
    image, ratio = preprocess(origin_img, (h, w))

    # ---------------------------Step 7. Do inference----------------------------------------------------------------------
    log.info('Starting inference in synchronous mode')
    res = exec_net.infer(inputs={input_blob: image})
    # ---------------------------Step 8. Process output--------------------------------------------------------------------
    res = res[out_blob]

    predictions = demo_postprocess(res, (h, w), p6=False)[0]

    boxes = predictions[:, :4]
    scores = predictions[:, 4, None] * predictions[:, 5:]

    boxes_xyxy = np.ones_like(boxes)
    boxes_xyxy[:, 0] = boxes[:, 0] - boxes[:, 2]/2.
    boxes_xyxy[:, 1] = boxes[:, 1] - boxes[:, 3]/2.
    boxes_xyxy[:, 2] = boxes[:, 0] + boxes[:, 2]/2.
    boxes_xyxy[:, 3] = boxes[:, 1] + boxes[:, 3]/2.
    boxes_xyxy /= ratio
    dets = multiclass_nms(boxes_xyxy, scores, nms_thr=0.45, score_thr=0.1)

    if dets is not None:
        final_boxes = dets[:, :4]
        final_scores, final_cls_inds = dets[:, 4], dets[:, 5]
        origin_img = vis(origin_img, final_boxes, final_scores, final_cls_inds,
                         conf=args.score_thr, class_names=COCO_CLASSES)

    mkdir(args.output_dir)
    output_path = os.path.join(args.output_dir, os.path.basename(args.input))
    cv2.imwrite(output_path, origin_img)


if __name__ == '__main__':
    sys.exit(main())

III. Installing OpenVINO

1) Download and install the Intel OpenVINO toolkit

https://www.intel.cn/content/www/cn/zh/developer/tools/openvino-toolkit/download.html

Select the appropriate options and click download to get an l_openvino_toolkit_p_xxx.tgz file, e.g. l_openvino_toolkit_p_2021.4.752.tgz.

Extract and install:

tar -xvzf l_openvino_toolkit_p_2021.4.752.tgz
cd l_openvino_toolkit_p_2021.4.752
sudo ./install_GUI.sh

Run the following command to use Intel's optimized build of OpenCV:

source /opt/intel/openvino_2021/opencv/setupvars.sh

2) Install the software dependencies

cd /opt/intel/openvino_2021/install_dependencies
sudo -E ./install_openvino_dependencies.sh

3) Set the environment variables

Run in a terminal:

source /opt/intel/openvino_2021/bin/setupvars.sh
source /opt/intel/openvino_2021/opencv/setupvars.sh

Then:

sudo gedit ~/.bashrc

Add the first of those lines (bin/setupvars.sh) to your .bashrc and source it:

source /opt/intel/openvino_2021/bin/setupvars.sh
source ~/.bashrc

4) Configure the Model Optimizer

cd /opt/intel/openvino_2021/deployment_tools/model_optimizer/install_prerequisites
sudo ./install_prerequisites_onnx.sh

5) Test whether the installation succeeded

cd /opt/intel/openvino_2021/deployment_tools/demo/
bash demo_security_barrier_camera.sh

If the demo runs successfully, it displays its detection results.

IV. Installing TensorRT[22]

Installation method 1 (download the tar package directly):

1) Download the archive:

My machine has an RTX 3080 Ti + driver 510.60.02 + CUDA 11.2 + cuDNN 8.1.1, which matches TensorRT 8.2 (the newer 8.5 release also lists support for CUDA 11.0-11.8, but I have not tried it; feel free to experiment).

The TensorRT 8.x downloads are available here:

https://developer.nvidia.com/nvidia-tensorrt-8x-download

I chose TensorRT 8.2 GA Update 4:

Download the first package (about 1.2 GB); the downloaded file is named TensorRT-8.2.5.1.Linux.x86_64-gnu.cuda-11.4.cudnn8.2.tar.gz

2) Extract it:

tar -zxvf TensorRT-8.2.5.1.Linux.x86_64-gnu.cuda-11.4.cudnn8.2.tar.gz

3) Install pycuda:

conda create -n deployment python=3.8
conda activate deployment
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple 'pycuda<2021.1'

4) Install onnxruntime:

pip install onnxruntime-gpu

5) Install the bundled Python packages (these wheel files are all in the extracted TensorRT-8.2.5.1 folder):

# step 1: install the TensorRT Python bindings. The python folder contains wheels for py36, py37, py38, py39, etc.;
# since the conda environment created in step 3 uses Python 3.8, I picked the py38 wheel
cd TensorRT-8.2.5.1/python
pip install tensorrt-8.2.5.1-cp38-none-linux_x86_64.whl 

# step 2: install UFF, which supports TensorFlow model conversion; not needed if you only use PyTorch
cd TensorRT-8.2.5.1/uff
pip install uff-0.6.9-py2.py3-none-any.whl

# step 3: install graphsurgeon, which supports custom network structures
cd TensorRT-8.2.5.1/graphsurgeon
pip install graphsurgeon-0.4.5-py2.py3-none-any.whl

(If you later need to upgrade, the packages installed above can be removed with:)

pip uninstall tensorrt
pip uninstall uff
pip uninstall graphsurgeon

6) Configure the environment variables

sudo gedit ~/.bashrc

Append at the end of the file:

export LD_LIBRARY_PATH=~/TensorRT-8.2.5.1/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=~/TensorRT-8.2.5.1/lib:$LIBRARY_PATH

I placed the extracted TensorRT-8.2.5.1 folder in my home directory, so my paths start with ~/; adjust them to wherever you extracted the archive.

source ~/.bashrc

7) Verify that TensorRT is installed correctly

import onnxruntime as ort
import tensorrt
print(ort.get_device())
print(ort.get_available_providers())
print(tensorrt.__version__ )

If these print the device, the available providers and the TensorRT version without errors, the installation succeeded.

Installation method 2 (using the .deb package)

See the following article for details:

本初ben: 使用TensorRT加速Pytorch模型推理 (Zhihu)


Troubleshooting common errors:

1. Error: Unexpected input data type. Actual: (N11onnxruntime17PrimitiveDataTypeIdEE)

Solution:

报错:Unexpected input data type. Actual: (N11onnxruntime17PrimitiveDataTypeIdEE): https://blog.csdn.net/sinat_24899403/article/details/114268313

2. Onnx export fails on torch.distributions.Normal #30517

Solution: upgrade PyTorch to a newer release (1.11 or later); verified to work without errors on PyTorch 1.13.0+cu116.

https://github.com/pytorch/pytorch/issues/30517

3. RuntimeError: Exporting the operator normal to ONNX opset version 9 is not supported. Support for this operator was added in version 11, try exporting with this version

Solution: set opset_version=11 in torch.onnx.export().

4. For the same input, the ONNX model's output does not match the original PyTorch model

Solution: check whether the model's forward() contains any source of randomness (e.g. sampling); if so, remove it before exporting. A quick way to verify that the two models agree is sketched below.
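A minimal agreement check, assuming model is the original PyTorch model already in eval() mode and name.onnx is the exported file from the examples above:

import numpy as np
import torch
import onnxruntime as ort

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    torch_out = model(x).numpy()          # model: the original PyTorch model, in eval() mode

sess = ort.InferenceSession('name.onnx', providers=['CPUExecutionProvider'])
onnx_out = sess.run(None, {sess.get_inputs()[0].name: x.numpy()})[0]

# raises an AssertionError if the two outputs diverge beyond the tolerances
np.testing.assert_allclose(torch_out, onnx_out, rtol=1e-3, atol=1e-5)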

5. ImportError: libnvinfer.so.7: cannot open shared object file: No such file or directory

Solution: the TensorRT library path is misconfigured; fix the LD_LIBRARY_PATH entry.

https://blog.csdn.net/qq_42178122/article/details/120035310

6. Runs fine in a terminal but fails inside PyCharm with: ImportError: libcusolver.so.8.0: cannot open shared object file: No such file or directory

Solution:

https://blog.csdn.net/qq_31347869/article/details/104788423

7. Using pycuda raises ModuleNotFoundError: No module named 'six'

Solution: pip install six

8. pycuda._driver.LogicError: cuMemcpyHtoDAsync failed: invalid argument

Solution: this error usually has one of two causes: the input's shape is wrong, or its dtype is wrong; see the sketch after the link below.

pycuda._driver.LogicError: cuMemcpyHtoDAsync failed: invalid argument: https://blog.csdn.net/racesu/article/details/125811605
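A minimal guard before the host-to-device copy, assuming batch is the host-side array passed to cuda.memcpy_htod_async and the (1, 37) float16 input used by the TRTPolicy example above:

import numpy as np

# make the host buffer's dtype, shape and memory layout match what the engine
# was built with; a mismatch here is what triggers the "invalid argument" error
batch = np.ascontiguousarray(batch.astype(np.float16).reshape(1, 37))
assert batch.nbytes == 1 * 37 * 2  # bytes expected by the matching cuda.mem_alloc call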

9. pycuda._driver.LogicError: explicit_context_dependent failed: invalid device context - no currently

Solution: pycuda.driver was not initialized, so no CUDA context could be obtained. Import pycuda.autoinit right after pycuda.driver, as shown below:[23]

import pycuda.driver as cuda
import pycuda.autoinit

10. fatal error: NvInfer.h: No such file or directory

Solution: this error came up when compiling with CUDA 10.2 + TensorRT-8.0.0.3; the fix is to add the following lines to CMakeLists.txt[24]

include_directories(/home/chenci/TensorRT-8.0.0.3/include/)
link_directories(/home/chenci/TensorRT-8.0.0.3/lib/)

References

  1. ONNX简介_onnx是什么-CSDN博客
  2. ONNX(Open Neural Network Exchange)介绍-CSDN博客
  3. https://zhuanlan.zhihu.com/p/346511883
  4. https://zhuanlan.zhihu.com/p/356072366
  5. TensorRT介绍-CSDN博客
  6. https://blog.csdn.net/weixin_50980847/article/details/126409707
  7. https://zhuanlan.zhihu.com/p/474135529
  8. https://zhuanlan.zhihu.com/p/371177698
  9. https://zhuanlan.zhihu.com/p/474135529
  10. https://zhuanlan.zhihu.com/p/346544539
  11. https://zhuanlan.zhihu.com/p/448651888
  12. https://zhuanlan.zhihu.com/p/482473219
  13. TensorRT模型推理_createinferruntime-CSDN博客
  14. GitHub - agrechnev/trt-cpp-min: TensorRT 7 C++ (almost) minimal examples
  15. 模型部署入门教程(七):TensorRT 模型构建与推理_tensorrt模型格式-CSDN博客
  16. Pytorch导出onnx模型,C++转化为TensorRT并实现推理过程_pytorch_Adenialzz-华为云开发者联盟
  17. GitHub - agrechnev/trt-cpp-min: TensorRT 7 C++ (almost) minimal examples
  18. GitHub - agrechnev/trt-cpp-min: TensorRT 7 C++ (almost) minimal examples
  19. Documentation Archives :: NVIDIA Deep Learning TensorRT Documentation
  20. https://zhuanlan.zhihu.com/p/91882515
  21. 深度学习系列8:openvino-CSDN博客
  22. https://zhuanlan.zhihu.com/p/467401558
  23. pycuda._driver.LogicError: explicit_context_dependent failed: invalid device context - no currently-CSDN博客
  24. fatal error: NvInfer.h: No such file or directory | TensorRT 报错处理-CSDN博客

 




Original article: https://zhuanlan.zhihu.com/p/589411751