I. Quantization
PyTorch supports quantization in three ways:
1) Post training dynamic quantization: quantize after the model is trained, computing activation scales on the fly at inference.
2) Post training static quantization: quantize after training, with activation ranges calibrated in advance on sample data.
3) Quantization aware training (QAT): simulate quantized arithmetic during training.
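For a quick feel of the first mode, here is a minimal sketch of post training dynamic quantization; the toy two-layer network stands in for a real model:
import torch
import torch.nn as nn

# A toy float32 model; replace with your own network.
model_fp32 = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
model_fp32.eval()
# Dynamic quantization: weights are quantized ahead of time,
# activations are quantized on the fly at inference.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8)
print(model_int8)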
For the post training dynamic/static quantization methods, see this blog:
Pytorch模型量化 (凌逆战, CSDN): blog.csdn.net/qq_34218078/article/details/127521819
For post training static quantization and quantization aware training, see the blog below.
II. Deployment
1. Route 1: PyTorch --> ONNX --> TensorRT (NVIDIA), for deployment on NVIDIA GPUs
About ONNX: Open Neural Network Exchange (ONNX) is an open standard for representing deep learning models so that they can be moved between frameworks. It is an open file format designed for machine learning, used to store trained models; different AI frameworks (such as PyTorch and MXNet) can store model data in the same format and interoperate. The ONNX specification and code are developed jointly by Microsoft, Amazon, Facebook, IBM and other companies. Deep learning frameworks that officially support loading and running ONNX models include Caffe2, PyTorch, MXNet, ML.NET, TensorRT and Microsoft CNTK; TensorFlow also supports ONNX unofficially.[1][2][3]
About TensorRT: TensorRT is a C++ library for high-performance inference on NVIDIA graphics processing units (GPUs). It is designed to work in a complementary fashion with training frameworks such as TensorFlow, Caffe, PyTorch and MXNet, and focuses on running trained networks quickly and efficiently on the GPU.[4][5]
Step 1: PyTorch --> ONNX
Converting a PyTorch model to ONNX is straightforward: call the torch.onnx.export() function.
Official documentation:
https://pytorch.org/docs/stable/onnx.html
Function reference:
torch.onnx.export()[6]
torch.onnx.export(model, args, f, export_params=True, verbose=False, training=False, input_names=None, output_names=None, aten=False, export_raw_ir=False, operator_export_type=None, opset_version=None, _retain_param_name=True, do_constant_folding=False, example_outputs=None, strip_doc_string=True, dynamic_axes=None, keep_initializers_as_inputs=None)
(This signature is from an older PyTorch release; recent versions have removed several of these arguments.)
Purpose: export a .pth model to an ONNX file
Arguments:
- model (torch.nn.Module): the model to export
- args (tuple of arguments): example inputs to the model, which also fix the input shapes
- export_params (bool, default True): if True (the default), the parameters are exported as well; set it to False to export an untrained graph
- verbose (bool, default False): print a debug description of the exported trace
- training (bool, default False): export the model in training mode; ONNX export currently targets inference only, so this normally stays False
- input_names (list of strings, default empty list): names for the inputs of the ONNX file; any names will do
- output_names (list of strings, default empty list): names for the outputs of the ONNX file; any names will do
- opset_version: defaults to 9
- dynamic_axes: marks axes as dynamic, e.g. {'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}} makes the batch dimension variable (see the sketch below)
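For example, a hedged sketch of exporting with a dynamic batch dimension; the tiny linear model is a placeholder, and 'input'/'output' match the input_names/output_names passed to export():
import torch
import torch.nn as nn

model = nn.Linear(8, 2).eval()
x = torch.randn(1, 8)
torch.onnx.export(
    model, x, 'dynamic_batch.onnx',
    input_names=['input'], output_names=['output'],
    # allow any batch size at inference time
    dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}})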
Examples:
1) Model saved with torch.save(model, path). This saves both the parameters and the model structure to path, but loading can fail if package versions differ; when saving a whole model this way, keep the model-definition code at the same path, otherwise load() cannot resolve the model class.
The test input below has the shape [batch, channel, height, width]. Be sure to call model.eval() before exporting, otherwise it may affect the ONNX output.
import torch
import torch.nn
import onnx

# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device = torch.device('cpu')
model = torch.load('***.pth', map_location=device)
model.eval()  # switch to inference mode before exporting
input_names = ['input']
output_names = ['output']
x = torch.randn(1, 3, 224, 224, device=device)  # only the shape must match the real input; the values do not matter, hence random data
torch.onnx.export(model, x, 'name.onnx', input_names=input_names, output_names=output_names, verbose=True)
2) Model saved with torch.save(model.state_dict(), model_path) (saves only the parameters; recommended)
This saving method requires the network-definition file.
import torch
import torch.onnx
from model import Net

# Build the same network structure as at training time
model = Net()
# Load the weights
model_path = '***.pth'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model_statedict = torch.load(model_path, map_location=device)
model.load_state_dict(model_statedict)
model.to(device)
model.eval()
input_data = torch.randn(1, 3, 224, 224, device=device)
# Convert to an ONNX model
input_names = ['input']
output_names = ['output']
torch.onnx.export(model, input_data, 'name.onnx', opset_version=9, verbose=True, input_names=input_names, output_names=output_names)
Note: the shape of input_data is the input shape the model will expect at inference time.
After exporting, validate the ONNX file with onnx.checker:
import onnx

# Preprocessing: load the ONNX model
model_path = 'path/to/the/model.onnx'
onnx_model = onnx.load(model_path)
print('The model is:\n{}'.format(onnx_model))

# Check the model
try:
    onnx.checker.check_model(onnx_model)
except onnx.checker.ValidationError as e:
    print('The model is invalid: %s' % e)
else:
    print('The model is valid!')
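Beyond the structural check, it is worth confirming numerically that onnxruntime reproduces the PyTorch output; a hedged sketch, where model and the 'input' name refer to the export examples above:
import numpy as np
import onnxruntime as ort
import torch

sess = ort.InferenceSession('name.onnx', providers=['CPUExecutionProvider'])
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    torch_out = model(x).numpy()  # model: the PyTorch network exported above
onnx_out = sess.run(None, {'input': x.numpy()})[0]
print('max abs diff:', np.abs(torch_out - onnx_out).max())  # ~1e-6 or less is expected for float32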
Running the ONNX model with onnxruntime:[9]
Official tutorial:
https://onnxruntime.ai/docs/api/python/api_summary.html
Usage example (an RL policy):[10]
import onnxruntime as ort
import onnx
import numpy as np
import env_class

env = env_class(render=True)
# onnx_model = onnx.load_model('xxx.onnx')
model_path = 'xxx.onnx'
onnx_model = onnx.load(model_path)
onnx.checker.check_model(onnx_model)
# sess = ort.InferenceSession(onnx_model.SerializeToString())
# sess.set_providers(['CPUExecutionProvider'])
sess = ort.InferenceSession(model_path, providers=['CPUExecutionProvider'])
input_name = sess.get_inputs()[0].name
output_name = sess.get_outputs()[0].name
n_eval = 10  # number of evaluation episodes (set as needed)
for j in range(n_eval):
    o, info = env.reset()
    r = 0
    d = False
    ep_ret = 0
    ep_len = 0
    while not (d or (ep_len == 1000)):
        o = np.expand_dims(o, axis=0).astype(np.float32)
        a = sess.run([output_name], {input_name: o})[0][0]
        o, r, d, _ = env.step(a)
        ep_ret += r
        ep_len += 1
When two or more network models are needed, simply create a separate session for each.
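For instance (a hedged sketch; actor.onnx and critic.onnx are hypothetical file names):
sess_actor = ort.InferenceSession('actor.onnx', providers=['CPUExecutionProvider'])
sess_critic = ort.InferenceSession('critic.onnx', providers=['CPUExecutionProvider'])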
Viewing the ONNX model structure:[11]
Open a viewer such as Netron and load the converted model to see the network structure.
Step 2: ONNX --> TensorRT[12]
1) Inference with the TensorRT Python API (again with RL inference as the example):
Method 1: convert the ONNX model to .trt with the trtexec tool, then run inference
Installing TensorRT is covered later in this article. Converting a .onnx model to a .trt model relies on the trtexec tool that ships with TensorRT, located in the TensorRT-8.2.5.1/bin directory. Copy your ONNX model into that folder, open a terminal there, and run the following command to obtain the converted model output_model_name.trt:
./trtexec --onnx=onnx_model_name.onnx --saveEngine=output_model_name.trt
Here --onnx and --saveEngine are the path of the ONNX model and the path where the .trt model is saved. Two other commonly used command-line options:
- --explicitBatch: tells trtexec to fix the input batch size during optimization (the concrete value is inferred from the ONNX file, i.e. it matches the batch size used when exporting to ONNX). Recommended whenever the model's input batch size is known, since a fixed batch size lets trtexec apply extra optimizations and removes the need to specify a separate optimization profile.
- --fp16: use FP16 precision, trading a little model accuracy for lower memory use and faster inference. TensorRT supports TF32/FP32/FP16/INT8 precision (subject to what the GPU supports). FP32 is the default training precision of most frameworks; FP16 substantially improves inference speed and memory use, usually with negligible accuracy loss. INT8 sacrifices more accuracy while further reducing latency and memory, but requires an extra careful calibration step to keep the accuracy loss small.
With these options added, the command becomes:
./trtexec --onnx=onnx_model_name.onnx --saveEngine=output_model_name.trt --explicitBatch --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --fp16
For the full set of trtexec command-line options, see the trtexec documentation.
Once output_model_name.trt exists, run inference with the Python API:
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import time
import copy
import env_class

class TRTPolicy():
    def __init__(self):
        f = open("output_model_name.trt", "rb")  # read the .trt engine file
        self.runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))  # create a Runtime (pass in a Logger)
        self.engine = self.runtime.deserialize_cuda_engine(f.read())  # load the TRT engine from the file
        f.close()
        self.context = self.engine.create_execution_context()  # create the execution context
        # Allocate input and output memory
        self.batch_size = 1
        self.state_dim = 37
        self.act_dim = 12
        dummy_input = np.random.rand(self.batch_size, self.state_dim).astype(np.float16)
        input_batch = np.array(dummy_input)
        self.output = np.empty([self.batch_size, self.act_dim], dtype=np.float16)
        self.d_input = cuda.mem_alloc(1 * input_batch.nbytes)
        self.d_output = cuda.mem_alloc(1 * self.output.nbytes)
        self.bindings = [int(self.d_input), int(self.d_output)]
        self.stream = cuda.Stream()

    def predict(self, batch):
        batch = batch.astype(np.float16)
        # Copy the input data to the device
        cuda.memcpy_htod_async(self.d_input, batch, self.stream)
        # Run inference, asynchronously here; for synchronous inference replace execute_async_v2 with execute_v2
        self.context.execute_async_v2(self.bindings, self.stream.handle, None)
        # Copy the result back from the device
        cuda.memcpy_dtoh_async(self.output, self.d_output, self.stream)
        # Synchronize the stream
        self.stream.synchronize()
        output = copy.deepcopy(self.output)
        return output
policy = TRTPolicy()
env = env_class(render=True)
n_eval = 10  # number of evaluation episodes (set as needed)
for j in range(n_eval):
    o, info = env.reset()
    r = 0
    d = False
    ep_ret = 0
    ep_len = 0
    while not (d or (ep_len == 100000)):
        o = np.expand_dims(o, axis=0).astype(np.float32)
        a = policy.predict(o)[0]
        o, r, d, _ = env.step(a)
        ep_ret += r
        ep_len += 1
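To see what the conversion buys you, here is a hedged micro-benchmark of the TRTPolicy defined above, using the same (1, 37) input shape as in the example; timings will of course vary by GPU:
import time
import numpy as np

dummy = np.random.rand(1, 37).astype(np.float32)
for _ in range(10):  # warm-up iterations
    policy.predict(dummy)
n = 1000
t0 = time.perf_counter()
for _ in range(n):
    policy.predict(dummy)
print('avg latency: %.3f ms' % ((time.perf_counter() - t0) / n * 1e3))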
Method 2: use TensorRT's parser interface (nvonnxparser) to parse the ONNX model and build the engine directly. This route is fairly simple, does not depend on other libraries, and supports dynamic-shape inference.
Official example:
Unofficial example:
https://murphypei.github.io/blog/2019/09/trt-useage
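On the Python side, a hedged sketch of this parser route with the TensorRT 8.x Python API (file names are placeholders):
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# Networks parsed from ONNX must use explicit batch mode
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open('onnx_model_name.onnx', 'rb') as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError('failed to parse the ONNX model')
config = builder.create_builder_config()
config.max_workspace_size = 1 << 28  # 256 MiB of build workspace
engine_bytes = builder.build_serialized_network(network, config)
with open('output_model_name.trt', 'wb') as f:
    f.write(engine_bytes)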
2) Inference with the TensorRT C++ API
Official tutorial:
Method 1: convert the ONNX model to .trt with the trtexec tool, then run inference[13]
First convert with trtexec:
cd TensorRT-8.2.5.1/bin
./trtexec --onnx=onnx_model_name.onnx --saveEngine=output_model_name.trt
Then run inference with infer_trt_model.cpp:
// TensorRT include
#include <NvInfer.h>
#include <NvInferRuntime.h>
// cuda include
#include <cuda_runtime.h>
// system include
#include <stdio.h>
#include <math.h>
#include <iostream>
#include <fstream>
#include <vector>

using namespace std;

static const int batch_size = 1;
static const int input_dim = 37;
static const int output_dim = 12;

class TRTLogger : public nvinfer1::ILogger{
public:
    virtual void log(Severity severity, nvinfer1::AsciiChar const* msg) noexcept override{
        if(severity <= Severity::kINFO){
            printf("%d: %s\n", severity, msg);
        }
    }
} logger;

vector<unsigned char> load_file(const string& file){
    ifstream in(file, ios::in | ios::binary); // open in binary mode
    if (!in.is_open())
        return {};
    in.seekg(0, ios::end);      // seek to the end of the file
    size_t length = in.tellg(); // get the file length
    std::vector<uint8_t> data;
    if (length > 0){
        in.seekg(0, ios::beg);  // seek back to the beginning
        data.resize(length);
        in.read((char*)&data[0], length); // read the raw bytes
    }
    in.close();
    return data;
}

void inference(){
    // -------------------- 1. Prepare and load the model --------------------
    TRTLogger logger;
    auto engine_data = load_file("output_model_name.trt"); // read the serialized .trt engine into a byte vector
    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger); // create the runtime
    nvinfer1::ICudaEngine* engine = runtime->deserializeCudaEngine(engine_data.data(), engine_data.size()); // deserialize engine_data
    if(engine == nullptr){
        printf("Deserialize cuda engine failed.\n");
        runtime->destroy();
        return;
    }
    nvinfer1::IExecutionContext* execution_context = engine->createExecutionContext(); // create the execution context
    cudaStream_t stream = nullptr; // create a stream for asynchronous execution
    cudaStreamCreate(&stream);

    // -------------------- 2. Prepare the input data and copy it to the GPU --------------------
    float input_data_host[batch_size * input_dim] = {1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
                                                     1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
                                                     1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
                                                     1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0};
    float* input_data_device = nullptr;
    float output_data_host[batch_size * output_dim];
    float* output_data_device = nullptr;
    cudaMalloc(&input_data_device, sizeof(input_data_host));
    cudaMalloc(&output_data_device, sizeof(output_data_host));
    cudaMemcpyAsync(input_data_device, input_data_host, sizeof(input_data_host), cudaMemcpyHostToDevice, stream); // copy the input from host to device
    float* bindings[] = {input_data_device, output_data_device}; // pointer array giving the GPU addresses of input and output

    // -------------------- 3. Run inference and copy the result back to the CPU --------------------
    bool success = execution_context->enqueueV2((void**)bindings, stream, nullptr);
    cudaMemcpyAsync(output_data_host, output_data_device, sizeof(output_data_host), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    // -------------------- 4. Release resources --------------------
    printf("Clean memory\n");
    cudaStreamDestroy(stream);
    execution_context->destroy();
    engine->destroy();
    runtime->destroy();
}

int main(){
    inference();
}
CMakeLists.txt[14]
cmake_minimum_required(VERSION 3.11)
project(tensorRT_test)
set(CMAKE_CXX_STANDARD 14)
file(COPY output_model_name.trt DESTINATION .)
# CUDA
find_package(CUDA REQUIRED)
include_directories(${CUDA_INCLUDE_DIRS})
message("CUDA_TOOLKIT_ROOT_DIR = ${CUDA_TOOLKIT_ROOT_DIR}")
message("CUDA_INCLUDE_DIRS = ${CUDA_INCLUDE_DIRS}")
message("CUDA_LIBRARIES = ${CUDA_LIBRARIES}")
link_libraries(nvinfer nvonnxparser ${CUDA_LIBRARIES})
add_executable(infer infer_trt_model.cpp)
Compile and run:
rm -r build
mkdir build
cd build
cmake ..
make
./infer
Method 2: parse the model with NvOnnxParser, then run inference[15][16]
First convert the model, using NvOnnxParser to parse the ONNX file.
onnx_to_engine.cpp
#include <NvInfer.h>
#include <NvInferRuntime.h>
#include <NvOnnxParser.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <math.h>
#include <string>
#include <iostream>
#include <fstream>
#include <memory>
#include <functional>
#include <unistd.h>
#include <chrono>
#include <sys/stat.h>
#include <vector>

inline const char* severity_string(nvinfer1::ILogger::Severity t) {
    switch (t) {
        case nvinfer1::ILogger::Severity::kINTERNAL_ERROR: return "internal_error";
        case nvinfer1::ILogger::Severity::kERROR: return "error";
        case nvinfer1::ILogger::Severity::kWARNING: return "warning";
        case nvinfer1::ILogger::Severity::kINFO: return "info";
        case nvinfer1::ILogger::Severity::kVERBOSE: return "verbose";
        default: return "unknown";
    }
}

class TRTLogger : public nvinfer1::ILogger {
public:
    virtual void log(Severity severity, nvinfer1::AsciiChar const* msg) noexcept override {
        if (severity <= Severity::kWARNING) {
            if (severity == Severity::kWARNING) printf("\033[33m%s: %s\033[0m\n", severity_string(severity), msg);    // yellow
            else if (severity == Severity::kERROR) printf("\033[31m%s: %s\033[0m\n", severity_string(severity), msg); // red
            else printf("%s: %s\n", severity_string(severity), msg);
        }
    }
};

bool isFileExist(const std::string name)
{
    struct stat buffer;
    return (stat(name.c_str(), &buffer) == 0);
}

bool build_model(){
    if (isFileExist("your_model.trtmodel")){
        printf("your_model.trtmodel already exists.\n");
        return true;
    }
    TRTLogger logger;
    // Create the builder, config and network
    nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);  // create an instance of the builder
    nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1); // create a network definition (1 = explicit batch flag)
    // Parse the ONNX model with the ONNX parser
    auto parser = nvonnxparser::createParser(*network, logger);          // create an ONNX parser to populate the network
    if (!parser->parseFromFile("your_onnx_model.onnx", 1)){
        printf("Failed to parse your_onnx_model.onnx.\n");
        return false;
    }
    // Set the workspace size
    printf("Workspace Size = %.2f MB\n", (1 << 28) / 1024.0f / 1024.0f);
    config->setMaxWorkspaceSize(1 << 28);
    // Build the TensorRT engine
    nvinfer1::ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
    if (engine == nullptr){
        printf("Build engine failed.\n");
        return false;
    }
    // Serialize the engine and save it to a file
    nvinfer1::IHostMemory* model_data = engine->serialize();
    FILE* f = fopen("your_model.trtmodel", "wb");
    fwrite(model_data->data(), 1, model_data->size(), f);
    fclose(f);
    // Destroy the objects
    model_data->destroy();
    engine->destroy();
    network->destroy();
    config->destroy();
    builder->destroy();
    printf("Build Done.\n");
    return true;
}

int main()
{
    build_model();
}
CMakeLists.txt[17]
cmake_minimum_required(VERSION 3.11)
project(tensorRT_test)
set(CMAKE_CXX_STANDARD 14)
file(COPY your_onnx_model.onnx DESTINATION .)
# CUDA
find_package(CUDA REQUIRED)
include_directories(${CUDA_INCLUDE_DIRS})
message("CUDA_TOOLKIT_ROOT_DIR = ${CUDA_TOOLKIT_ROOT_DIR}")
message("CUDA_INCLUDE_DIRS = ${CUDA_INCLUDE_DIRS}")
message("CUDA_LIBRARIES = ${CUDA_LIBRARIES}")
link_libraries(nvinfer nvonnxparser ${CUDA_LIBRARIES})
add_executable(engine onnx_to_engine.cpp)
Compile and run to obtain your_model.trtmodel (written to the build folder):
mkdir build
cd build
cmake ..
make
./engine
Next, run inference based on the your_model.trtmodel produced in the previous step.
infer_engine.cpp
#include <fstream>
#include <iostream>
// TensorRT include
#include <NvInfer.h>
#include <NvInferRuntime.h>
// cuda include
#include <cuda_runtime.h>
#include <cassert>

class TRTLogger : public nvinfer1::ILogger{
public:
    virtual void log(Severity severity, nvinfer1::AsciiChar const* msg) noexcept override{
        if(severity <= Severity::kINFO){
            printf("%d: %s\n", severity, msg);
        }
    }
} logger;

#define CHECK(status) \
    do \
    { \
        auto ret = (status); \
        if (ret != 0) \
        { \
            std::cerr << "Cuda failure: " << ret << std::endl; \
            abort(); \
        } \
    } while (0)

using namespace nvinfer1;

const char* IN_NAME = "input";
const char* OUT_NAME = "output";
static const int input_dim = 37;
static const int output_dim = 12;
static const int batch_size = 1;
static const int EXPLICIT_BATCH = 1 << (int)(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);

void doInference(IExecutionContext& context, float* input, float* output, int batchSize)
{
    const ICudaEngine& engine = context.getEngine();
    // Pointers to input and output device buffers to pass to engine.
    // Engine requires exactly IEngine::getNbBindings() number of buffers.
    assert(engine.getNbBindings() == 2);
    void* buffers[2];
    // In order to bind the buffers, we need to know the names of the input and output tensors.
    // Note that indices are guaranteed to be less than IEngine::getNbBindings()
    const int inputIndex = engine.getBindingIndex(IN_NAME);
    const int outputIndex = engine.getBindingIndex(OUT_NAME);
    // Create GPU buffers on device
    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * input_dim * sizeof(float)));
    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * output_dim * sizeof(float)));
    // Create stream
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));
    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host
    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * input_dim * sizeof(float), cudaMemcpyHostToDevice, stream));
    context.enqueueV2(buffers, stream, nullptr); // explicit-batch engines (ONNX-parsed) use enqueueV2
    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * output_dim * sizeof(float), cudaMemcpyDeviceToHost, stream));
    cudaStreamSynchronize(stream);
    // Release stream and buffers
    cudaStreamDestroy(stream);
    CHECK(cudaFree(buffers[inputIndex]));
    CHECK(cudaFree(buffers[outputIndex]));
}

int main(int argc, char** argv)
{
    // Read the serialized engine from file
    char *trtModelStream{ nullptr };
    size_t size{ 0 };
    std::ifstream file("your_model.trtmodel", std::ios::binary);
    if (file.good()) {
        file.seekg(0, file.end);
        size = file.tellg();
        file.seekg(0, file.beg);
        trtModelStream = new char[size];
        assert(trtModelStream);
        file.read(trtModelStream, size);
        file.close();
    }
    TRTLogger m_logger;
    IRuntime* runtime = createInferRuntime(m_logger);
    assert(runtime != nullptr);
    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);
    assert(engine != nullptr);
    IExecutionContext* context = engine->createExecutionContext();
    assert(context != nullptr);
    delete[] trtModelStream; // the engine keeps its own copy of the data
    // Generate input data
    float data[batch_size * input_dim];
    for (int i = 0; i < batch_size * input_dim; i++)
        data[i] = 1.0;
    // Run inference
    float prob[batch_size * output_dim];
    doInference(*context, data, prob, batch_size);
    // Destroy the engine
    context->destroy();
    engine->destroy();
    runtime->destroy();
    return 0;
}
CMakeLists.txt[18]
cmake_minimum_required(VERSION 3.11)
project(tensorRT_test)
set(CMAKE_CXX_STANDARD 14)
file(COPY your_model.trtmodel DESTINATION .)
# CUDA
find_package(CUDA REQUIRED)
include_directories(${CUDA_INCLUDE_DIRS})
message("CUDA_TOOLKIT_ROOT_DIR = ${CUDA_TOOLKIT_ROOT_DIR}")
message("CUDA_INCLUDE_DIRS = ${CUDA_INCLUDE_DIRS}")
message("CUDA_LIBRARIES = ${CUDA_LIBRARIES}")
link_libraries(nvinfer nvonnxparser ${CUDA_LIBRARIES})
add_executable(infer infer_engine.cpp)
Compile and run:
rm -r build
mkdir build
cd build
cmake ..
make
./infer
Tips:
1) If inference needs two or more models, create a single runtime, then build a separate engine from each trtmodel, and a separate context from each engine:
Reference:[19]
The TensorRT runtime can be used by multiple threads simultaneously, so long as each object uses a different execution context.
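A hedged sketch of this pattern in Python (model_a.trt and model_b.trt are hypothetical file names):
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)  # one shared runtime
with open('model_a.trt', 'rb') as f:
    engine_a = runtime.deserialize_cuda_engine(f.read())
with open('model_b.trt', 'rb') as f:
    engine_b = runtime.deserialize_cuda_engine(f.read())
ctx_a = engine_a.create_execution_context()  # one context per engine
ctx_b = engine_b.create_execution_context()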
2) The difference between explicit batch and implicit batch:
See the TensorRT documentation on explicit versus implicit batch modes.
3) Getting the input and output dimensions:
#include <sstream>

std::string DimsToStr(Dims dims)
{
    std::stringstream ss;
    for (int i = 0; i < dims.nbDims; i++)
    {
        ss << dims.d[i] << " ";
    }
    return ss.str();
}

const int inputIndex = engine.getBindingIndex(IN_NAME);
const int outputIndex = engine.getBindingIndex(OUT_NAME);
auto input_dimensions = engine.getBindingDimensions(inputIndex);
auto output_dimensions = engine.getBindingDimensions(outputIndex);
cout << "The input_dimensions is: " << DimsToStr(input_dimensions) << endl;
cout << "The output_dimensions is: " << DimsToStr(output_dimensions) << endl;
2. Route 2: PyTorch --> ONNX --> OpenVINO (Intel), for deployment on Intel CPUs
About OpenVINO: OpenVINO is Intel's comprehensive toolkit for rapidly deploying applications and solutions; it supports more than 200 CNN architectures for computer vision. OpenVINO is a pipeline toolset, compatible with models trained in the major open-source frameworks, and covers everything needed to put an algorithm into production; once you master it, deploying a pretrained model on Intel CPUs is straightforward.[20]
- Supported hardware: MKLDNN plugin --> CPUs: Xeon/Core/Atom; clDNN plugin --> GPU: HD Graphics; Myriad plugin --> VPU: Myriad; FPGA plugin --> FPGA Arria 10
- Supported frameworks: Caffe, Keras, TensorFlow, ONNX, PyTorch, MXNet
- Supported systems: Windows, Ubuntu, macOS
Step 1: PyTorch --> ONNX (same as above)
Step 2: ONNX --> OpenVINO
First install OpenVINO as described later in this article.
Enter the following directory:
cd /opt/intel/openvino_2021/deployment_tools/model_optimizer
Run the following command (note: do not use a virtual-environment python):
sudo python3 mo.py --input_model yolox_s.onnx --input_shape [1,3,416,416] --data_type FP16
Here yolox_s.onnx is the original .onnx file. The command generates two files, .xml and .bin: the .xml describes the network topology, and the .bin contains the weights and biases.
Running inference on the IR model (Python API)[21]
Steps for calling it from Python:
1) Load the plugin
ie = IECore()
ie.add_extension(os.getenv('INTEL_OPENVINO_DIR')+'/deployment_tools/inference_engine/lib/intel64/libcpu_extension.dylib', device)
2) Read the model
net = IENetwork(model=modelpath+'.xml', weights=modelpath+'.bin')
3) Get the input and output names
input_blob = next(iter(net.inputs))
out_blob = next(iter(net.outputs))
4) Run inference
exec_net = ie.load_network(network=net, device_name=device)
image = cv2.imread(imgpath)
res = exec_net.infer(inputs={input_blob: image})
Putting it all together:
from openvino.inference_engine import IENetwork, IECore

model_xml = "*.xml"
model_bin = "*.bin"
device = 'CPU'
ie = IECore()
net = IENetwork(model=model_xml, weights=model_bin)
input_blob, out_blob, net.batch_size = next(iter(net.inputs)), next(iter(net.outputs)), 1
n, c, h, w = net.inputs[input_blob].shape
exec_net = ie.load_network(network=net, device_name=device)
# cpux is the preprocessed input tensor with shape (n, c, h, w)
output = exec_net.infer(inputs={input_blob: cpux.numpy()})[out_blob]
print(output)
Official YOLOX example (~/YOLOX/demo/OpenVINO/python):
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Copyright (C) 2018-2021 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
# Copyright (c) Megvii, Inc. and its affiliates.
import argparse
import logging as log
import os
import sys

import cv2
import numpy as np

from openvino.inference_engine import IECore

from yolox.data.data_augment import preproc as preprocess
from yolox.data.datasets import COCO_CLASSES
from yolox.utils import mkdir, multiclass_nms, demo_postprocess, vis


def parse_args() -> argparse.Namespace:
    """Parse and return command line arguments"""
    parser = argparse.ArgumentParser(add_help=False)
    args = parser.add_argument_group('Options')
    args.add_argument(
        '-h',
        '--help',
        action='help',
        help='Show this help message and exit.')
    args.add_argument(
        '-m',
        '--model',
        required=True,
        type=str,
        help='Required. Path to an .xml or .onnx file with a trained model.')
    args.add_argument(
        '-i',
        '--input',
        required=True,
        type=str,
        help='Required. Path to an image file.')
    args.add_argument(
        '-o',
        '--output_dir',
        type=str,
        default='demo_output',
        help='Path to your output dir.')
    args.add_argument(
        '-s',
        '--score_thr',
        type=float,
        default=0.3,
        help="Score threshold to visualize the result.")
    args.add_argument(
        '-d',
        '--device',
        default='CPU',
        type=str,
        help='Optional. Specify the target device to infer on; CPU, GPU, \
              MYRIAD, HDDL or HETERO: is acceptable. The sample will look \
              for a suitable plugin for device specified. Default value \
              is CPU.')
    args.add_argument(
        '--labels',
        default=None,
        type=str,
        help='Optional. Path to a labels mapping file.')
    args.add_argument(
        '-nt',
        '--number_top',
        default=10,
        type=int,
        help='Optional. Number of top results.')
    return parser.parse_args()


def main():
    log.basicConfig(format='[ %(levelname)s ] %(message)s', level=log.INFO, stream=sys.stdout)
    args = parse_args()
    # ---------------------------Step 1. Initialize inference engine core--------------------------------------------------
    log.info('Creating Inference Engine')
    ie = IECore()
    # ---------------------------Step 2. Read a model in OpenVINO Intermediate Representation or ONNX format---------------
    log.info(f'Reading the network: {args.model}')
    # (.xml and .bin files) or (.onnx file)
    net = ie.read_network(model=args.model)
    if len(net.input_info) != 1:
        log.error('Sample supports only single input topologies')
        return -1
    if len(net.outputs) != 1:
        log.error('Sample supports only single output topologies')
        return -1
    # ---------------------------Step 3. Configure input & output----------------------------------------------------------
    log.info('Configuring input and output blobs')
    # Get names of input and output blobs
    input_blob = next(iter(net.input_info))
    out_blob = next(iter(net.outputs))
    # Set input and output precision manually
    net.input_info[input_blob].precision = 'FP32'
    net.outputs[out_blob].precision = 'FP16'
    # Get a number of classes recognized by a model
    num_of_classes = max(net.outputs[out_blob].shape)
    # ---------------------------Step 4. Loading model to the device-------------------------------------------------------
    log.info('Loading the model to the plugin')
    exec_net = ie.load_network(network=net, device_name=args.device)
    # ---------------------------Step 5. Create infer request--------------------------------------------------------------
    # load_network() method of the IECore class with a specified number of requests (default 1) returns an ExecutableNetwork
    # instance which stores infer requests. So you already created Infer requests in the previous step.
    # ---------------------------Step 6. Prepare input---------------------------------------------------------------------
    origin_img = cv2.imread(args.input)
    _, _, h, w = net.input_info[input_blob].input_data.shape
    image, ratio = preprocess(origin_img, (h, w))
    # ---------------------------Step 7. Do inference----------------------------------------------------------------------
    log.info('Starting inference in synchronous mode')
    res = exec_net.infer(inputs={input_blob: image})
    # ---------------------------Step 8. Process output--------------------------------------------------------------------
    res = res[out_blob]
    predictions = demo_postprocess(res, (h, w), p6=False)[0]
    boxes = predictions[:, :4]
    scores = predictions[:, 4, None] * predictions[:, 5:]
    boxes_xyxy = np.ones_like(boxes)
    boxes_xyxy[:, 0] = boxes[:, 0] - boxes[:, 2]/2.
    boxes_xyxy[:, 1] = boxes[:, 1] - boxes[:, 3]/2.
    boxes_xyxy[:, 2] = boxes[:, 0] + boxes[:, 2]/2.
    boxes_xyxy[:, 3] = boxes[:, 1] + boxes[:, 3]/2.
    boxes_xyxy /= ratio
    dets = multiclass_nms(boxes_xyxy, scores, nms_thr=0.45, score_thr=0.1)
    if dets is not None:
        final_boxes = dets[:, :4]
        final_scores, final_cls_inds = dets[:, 4], dets[:, 5]
        origin_img = vis(origin_img, final_boxes, final_scores, final_cls_inds,
                         conf=args.score_thr, class_names=COCO_CLASSES)
    mkdir(args.output_dir)
    output_path = os.path.join(args.output_dir, os.path.basename(args.input))
    cv2.imwrite(output_path, origin_img)


if __name__ == '__main__':
    sys.exit(main())
III. Installing OpenVINO
1) Download and install the Intel OpenVINO toolkit
Select the appropriate options on the download page and download the l_openvino_toolkit_p_xxx.tgz file, e.g. l_openvino_toolkit_p_2021.4.752.tgz.
Extract and install:
tar -xvzf l_openvino_toolkit_p_2021.4.752.tgz
cd l_openvino_toolkit_p_2021.4.752
sudo ./install_GUI.sh
Run the following command to use the Intel-optimized build of OpenCV:
source /opt/intel/openvino_2021/opencv/setupvars.sh
2) Install the software dependencies
cd /opt/intel/openvino_2021/install_dependencies
sudo -E ./install_openvino_dependencies.sh
3) Set the environment variables
In a terminal run:
source /opt/intel/openvino_2021/bin/setupvars.sh
source /opt/intel/openvino_2021/opencv/setupvars.sh
In addition, run:
sudo gedit ~/.bashrc
Add the first of the two lines above to the .bashrc file, then source it:
source /opt/intel/openvino_2021/bin/setupvars.sh
source ~/.bashrc
4) Configure the Model Optimizer
cd /opt/intel/openvino_2021/deployment_tools/model_optimizer/install_prerequisites
sudo ./install_prerequisites_onnx.sh
5) Verify the installation
cd /opt/intel/openvino_2021/deployment_tools/demo/
bash demo_security_barrier_camera.sh
If the demo runs successfully, it displays its detection results.
IV. Installing TensorRT[22]
Option 1 (install from the tar package):
1) Download the archive:
My machine has a 3080 Ti GPU + driver 510.60.02 + CUDA 11.2 + cuDNN 8.1.1, which matches TensorRT 8.2 (the newer 8.5 release also lists support for CUDA 11.0-11.8, but I have not tried it; feel free to experiment).
TensorRT 8.x downloads are available here:
I chose TensorRT 8.2 GA Update 4:
Download the first item (about 1.2 GB); the resulting file is named TensorRT-8.2.5.1.Linux.x86_64-gnu.cuda-11.4.cudnn8.2.tar.gz
2) Extract the archive:
tar -zxvf TensorRT-8.2.5.1.Linux.x86_64-gnu.cuda-11.4.cudnn8.2.tar.gz
3) Install pycuda:
conda create -n deployment python=3.8
conda activate deployment
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple 'pycuda<2021.1'
4) Install onnxruntime:
pip install onnxruntime-gpu
5) Install the bundled wheels (these files are all in the extracted TensorRT-8.2.5.1 folder):
# step 1: lets Python call TensorRT. The python folder ships wheels for py36, py37, py38, py39 and so on;
# since the conda env created in step 3 uses py38, I picked the py38 wheel
cd TensorRT-8.2.5.1/python
pip install tensorrt-8.2.5.1-cp38-none-linux_x86_64.whl
# step 2: install UFF for TensorFlow model conversion; not needed when using PyTorch
cd TensorRT-8.2.5.1/uff
pip install uff-0.6.9-py2.py3-none-any.whl
# step 3: install graphsurgeon for custom network structures
cd TensorRT-8.2.5.1/graphsurgeon
pip install graphsurgeon-0.4.5-py2.py3-none-any.whl
(To uninstall these packages later, e.g. before upgrading:)
pip uninstall tensorrt
pip uninstall uff
pip uninstall graphsurgeon
6) Configure the environment variables
sudo gedit ~/.bashrc
Append the following at the end of the file:
export LD_LIBRARY_PATH=~/TensorRT-8.2.5.1/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=~/TensorRT-8.2.5.1/lib:$LIBRARY_PATH
I extracted the TensorRT-8.2.5.1 folder into my home directory, hence the ~/... path; adjust it to your own location.
source ~/.bashrc
7) Verify that TensorRT is installed
import onnxruntime as ort
import tensorrt
print(ort.get_device())
print(ort.get_available_providers())
print(tensorrt.__version__)
If the device, the available providers and the TensorRT version are printed without errors, the installation succeeded.
Option 2 (install from the .deb package)
See the following article for details:
本初ben: 使用TensorRT加速Pytorch模型推理
Common errors and solutions:
1. Error: Unexpected input data type. Actual: (N11onnxruntime17PrimitiveDataTypeIdEE)
Solution: the dtype of the input array does not match the model's expected input type; cast the input to the expected type (typically np.float32).
2. Onnx export fails on torch.distributions.Normal #30517
Solution: upgrade PyTorch to 1.11.0 or later; verified to work without errors on PyTorch 1.13.0+cu116.
https://github.com/pytorch/pytorch/issues/30517
3. RuntimeError: Exporting the operator normal to ONNX opset version 9 is not supported. Support for this operator was added in version 11, try exporting with this version
Solution: set opset_version=11 in torch.onnx.export().
4. Given the same input, the ONNX model's output differs from the original PyTorch model's
Solution: check whether the model's forward() function contains any source of randomness; if so, remove it.
5. ImportError: libnvinfer.so.7: cannot open shared object file: No such file or directory
Solution: the TensorRT path is misconfigured; correct it.
6. Works in a terminal, but running in PyCharm raises: ImportError: libcusolver.so.8.0: cannot open shared object file: No such file or directory
Solution: PyCharm does not source ~/.bashrc, so add LD_LIBRARY_PATH to the run configuration's environment variables.
7. ModuleNotFoundError: No module named "six" when using pycuda
Solution: pip install six
8. pycuda._driver.LogicError: cuMemcpyHtoDAsync failed: invalid argument
Solution: this error usually has one of two causes: the input data has the wrong shape, or the wrong dtype (see the sketch below).
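A hedged illustration of the fix, reusing the (1, 37) float16 input of the TRTPolicy example above (batch is the host array about to be copied):
import numpy as np

# The engine above expects a contiguous (1, 37) float16 array; a mismatched
# shape or dtype is exactly what triggers cuMemcpyHtoDAsync: invalid argument.
batch = np.ascontiguousarray(batch.reshape(1, 37).astype(np.float16))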
9. pycuda._driver.LogicError: explicit_context_dependent failed: invalid device context - no currently active context
Solution: pycuda.driver was not initialized, so no context could be obtained; import pycuda.autoinit right after importing pycuda.driver, as follows:[23]
import pycuda.driver as cuda
import pycuda.autoinit
10. fatal error: NvInfer.h: No such file or directory
Solution: encountered when compiling with CUDA 10.2 + TensorRT-8.0.0.3; fix it by adding the TensorRT include and lib paths to CMakeLists.txt:[24]
include_directories(/home/chenci/TensorRT-8.0.0.3/include/)
link_directories(/home/chenci/TensorRT-8.0.0.3/lib/)
References
- ^ONNX简介_onnx是什么-CSDN博客
- ^ONNX(Open Neural Network Exchange)介绍-CSDN博客
- ^https://zhuanlan.zhihu.com/p/346511883
- ^https://zhuanlan.zhihu.com/p/356072366
- ^TensorRT介绍-CSDN博客
- ^https://blog.csdn.net/weixin_50980847/article/details/126409707
- ^https://zhuanlan.zhihu.com/p/474135529
- ^https://zhuanlan.zhihu.com/p/371177698
- ^https://zhuanlan.zhihu.com/p/474135529
- ^https://zhuanlan.zhihu.com/p/346544539
- ^https://zhuanlan.zhihu.com/p/448651888
- ^https://zhuanlan.zhihu.com/p/482473219
- ^TensorRT模型推理_createinferruntime-CSDN博客
- ^GitHub - agrechnev/trt-cpp-min: TensorRT 7 C++ (almost) minimal examples
- ^模型部署入门教程(七):TensorRT 模型构建与推理_tensorrt模型格式-CSDN博客
- ^Pytorch导出onnx模型,C++转化为TensorRT并实现推理过程_pytorch_Adenialzz-华为云开发者联盟
- ^GitHub - agrechnev/trt-cpp-min: TensorRT 7 C++ (almost) minimal examples
- ^GitHub - agrechnev/trt-cpp-min: TensorRT 7 C++ (almost) minimal examples
- ^Documentation Archives :: NVIDIA Deep Learning TensorRT Documentation
- ^https://zhuanlan.zhihu.com/p/91882515
- ^深度学习系列8:openvino-CSDN博客
- ^https://zhuanlan.zhihu.com/p/467401558
- ^pycuda._driver.LogicError: explicit_context_dependent failed: invalid device context - no currently_cuda error: invalid device context-CSDN博客
- ^fatal error: NvInfer.h: No such file or directory | TensorRT 报错处理 | 【成功解决】-CSDN博客