Ubuntu 20.04.02 environment setup
Anaconda3/Python 3.8.8
Pytorch 1.9.0+cu102
CUDA 11.3.r11.3
Onnx 1.10.2
Onnxruntime 1.9.0
Common PyTorch deployment paths
CPU: pytorch->onnx->onnxruntime
GPU: pytorch->onnx->onnx2trt->tensorRT
ARM: pytorch->onnx->ncnn/mace/mnn, etc.
PyTorch GPU deployment workflow
- Step 1: pytorch2onnx
- Step 2: onnx2TensorRT Engine
- Step 3: TensorRT Inference
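Step 1: Pytorch2Onnx
- pytorch2onnx
  - Install the onnx, onnxsim, and onnxruntime modules:
      pip install onnx             # installs onnx
      pip install onnx-simplifier  # installs onnxsim
      pip install onnxruntime      # installs onnxruntime
  - Precision can be lowered from FP32 to FP16 during conversion (FP16 is supported on CUDA only); alternatively, convert at full precision and quantize later in TensorRT.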
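  - Models without special operations can be exported directly with the torch.onnx.export() function.
  - Models with special operations need those operations rewritten first; the torch.onnx documentation describes how to convert them.
  - Use onnxsim.simplify to simplify the converted ONNX model.
  - After conversion, the result can be checked with the onnx module or with ONNX Runtime: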
    - Method 1:
        import onnx
        from onnxsim import simplify

        model = onnx.load('xxx.onnx')
        onnx.checker.check_model(model)  # raises if the model is malformed
        print(onnx.helper.printable_graph(model.graph))

        # simplify the onnx model
        model_sim, check = simplify(model)
        assert check, 'simplified ONNX model could not be validated'
        onnx.save(model_sim, 'xxx-sim.onnx')
    - Method 2:
        import onnxruntime as ort
        import numpy as np

        ort_session = ort.InferenceSession('xxx.onnx')
        # 'input_name' must match the input name of the exported model
        outputs = ort_session.run(None, {
            'input_name': np.random.randn(10, 3, 224, 224).astype(np.float32)
        })
        print(outputs[0])
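  - Verify that the ONNX model and the original PyTorch model produce the same results; if they agree, the conversion is correct.
      import numpy as np
      import onnxruntime as ort
      import torch
      from original_model import original_model  # placeholder for your own model class

      # load pytorch model
      torch_model = original_model()
      torch_model.load_state_dict(torch.load('xxx.pth', map_location='cpu'))
      torch_model.eval()  # inference behaviour for dropout / batch-norm layers
      img = torch.ones((1, 3, 640, 640)) / 255

      # pytorch model inference
      with torch.no_grad():
          res = torch_model(img)
      print(res[0])

      # load onnx model
      sess = ort.InferenceSession('xxx.onnx')
      # get input name
      input_name = sess.get_inputs()[0].name
      outputs = sess.run(None, {input_name: img.numpy()})
      print(outputs[0])

      # verify that the two results agree within floating-point tolerance
      # (assumes the model returns a single tensor)
      np.testing.assert_allclose(res.numpy(), outputs[0], rtol=1e-3, atol=1e-5)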
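In practice, "the same results" means agreement within a tolerance rather than bit-for-bit equality, and the tolerance has to be looser if the model was converted to FP16. The sketch below uses only the standard library to illustrate this: `to_fp16` round-trips a value through half precision, and `outputs_match` applies the same rule as `numpy.allclose`. Both helper names and the sample values are made up for illustration and are not part of any library.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def outputs_match(reference, candidate, rtol=1e-3, atol=1e-5):
    """True if every |c - r| <= atol + rtol * |r| (the numpy.allclose rule)."""
    if len(reference) != len(candidate):
        return False
    return all(abs(c - r) <= atol + rtol * abs(r)
               for r, c in zip(reference, candidate))


# Hypothetical FP32 outputs and the same values after an FP16 round trip.
fp32_out = [0.1, 0.7312, 12.5, 4097.0]
fp16_out = [to_fp16(v) for v in fp32_out]

print(fp16_out)
# A tolerance suited to FP32 rejects the FP16 values ...
print(outputs_match(fp32_out, fp16_out, rtol=1e-6, atol=0.0))
# ... while a tolerance sized for FP16 (about 3 decimal digits) accepts them.
print(outputs_match(fp32_out, fp16_out, rtol=1e-3, atol=0.0))
```

A relative tolerance around 1e-3 is a reasonable starting point for FP16 comparisons, since half precision carries an 11-bit significand; the right value for a given model is still a judgment call.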
  - The ONNX model can also be inspected with the visualization tool netron. Installing and using netron:
      # on Linux (e.g. Ubuntu), install with pip
      pip install netron
      # to visualize the ONNX model, run the `netron` command and press Enter
      netron
      # after pressing Enter, netron shows its interface
Step 2: Onnx2TensorRT Engine
- TensorRT installation
  Installing TensorRT for Python (using the .tar package, which avoids missing-dependency problems):
  - Make sure the installed cuda version matches the one pytorch was built for.
  - Install the pycuda module: pip install pycuda
  - Make sure the onnx module is installed.
  - Download the TensorRT 8.x .tar file that corresponds to your cuda version from: NVIDIA TensorRT | NVIDIA Developer
  - Extract the archive, change into its python directory, and install TensorRT, graphsurgeon, and onnx_graphsurgeon:
      cd TensorRT-8.x.x/python
      pip install tensorrt-8.x.x-xxx.whl
      cd ../graphsurgeon
      pip install graphsurgeon-x.x.x.whl
      cd ../onnx_graphsurgeon
      pip install onnx_graphsurgeon-x.x.x.whl
  - Add TensorRT/lib to the environment variables (put these exports in ~/.bashrc, then reload it):
      export LD_LIBRARY_PATH=/home/YOUR_USERNAME/TensorRT-8.0.3.4/lib:$LD_LIBRARY_PATH
      export PATH=/home/YOUR_USERNAME/TensorRT-8.0.3.4/bin:$PATH
      export LD_LIBRARY_PATH=/home/YOUR_USERNAME/TensorRT-8.0.3.4/targets/x86_64-linux-gnu/lib:$LD_LIBRARY_PATH
      source ~/.bashrc
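A quick way to confirm that the exports took effect is to look for the TensorRT lib directory among the entries of `LD_LIBRARY_PATH`. The following is a minimal sketch; `dir_on_search_path` is a hypothetical helper, and the example runs against a made-up environment dictionary so that its result does not depend on the machine. Pass no `environ` argument to check the real environment.

```python
import os


def dir_on_search_path(directory, var="LD_LIBRARY_PATH", environ=None):
    """Return True if `directory` is one of the entries of the path variable `var`."""
    environ = os.environ if environ is None else environ
    wanted = os.path.normpath(directory)
    entries = environ.get(var, "").split(os.pathsep)
    return any(os.path.normpath(entry) == wanted for entry in entries if entry)


# Made-up environment for illustration; the paths follow the exports above.
trt_lib = "/home/YOUR_USERNAME/TensorRT-8.0.3.4/lib"
fake_env = {"LD_LIBRARY_PATH": os.pathsep.join([trt_lib, "/usr/local/cuda/lib64"])}

print(dir_on_search_path(trt_lib, environ=fake_env))           # directory is listed
print(dir_on_search_path("/opt/other/lib", environ=fake_env))  # directory is not listed
```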
  - Verify that TensorRT is installed correctly:
      import tensorrt  # a successful import means the installation worked
      print(tensorrt.__version__)
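Before importing everything, it can help to check which of the required modules are visible to the interpreter at all. The sketch below uses `importlib.util.find_spec`, which locates a module without importing it; `missing_modules` is a hypothetical helper name. Note that this only confirms that the packages can be found; it does not prove that their shared libraries load, so the import test above is still needed.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names used in this guide.
required = ["onnx", "onnxsim", "onnxruntime", "pycuda", "tensorrt"]
missing = missing_modules(required)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required modules were found.")
```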
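- Basic TensorRT workflow (flowchart)
  - First, with a trt Logger as the argument, create a builder and use it to create the computation graph, which is of type INetworkDefinition.
  - Then use a parser to populate the computation graph with the structure from ONNX or another framework; the graph can also be built directly with the tensorrt API.
  - Create the cuda engine from the computation graph.
  - Inference is ultimately run by the ExecutionContext created from the cuda engine:
      engine.create_execution_context()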
- Onnx2TensorRT Engine
  - Method 1
    - The trtexec tool that ships with TensorRT can convert an ONNX model directly into a TensorRT engine. The following command generates the engine:
        trtexec --onnx=your_model.onnx --saveEngine=resnet_engine.trt --explicitBatch
    - trtexec accepts the following optional arguments:
      - --explicitBatch builds the network with an explicit batch dimension. This flag must not be omitted, because the ONNX parser supports only explicit-batch mode. For more information, refer to the Working With Dynamic Shapes section in the TensorRT Developer Guide.
      - --fp16 enables FP16 precision for layers that support it, in addition to FP32. For more information, refer to the Working With Mixed Precision section in the TensorRT Developer Guide.
      - --int8 enables INT8 precision for layers that support it, in addition to FP32. For more information, refer to the Working With Mixed Precision section in the TensorRT Developer Guide.
      - --best enables all supported precisions to achieve the best performance for every layer.
      - --workspace controls the maximum amount of persistent scratch memory available (in MB) for algorithms considered by the builder. This should be set as high as possible for a given platform based on availability; at runtime TensorRT will allocate only what is required, not exceeding the max.
      - --minShapes and --maxShapes specify the range of dimensions for each network input and --optShapes specifies the dimensions that the auto-tuner should use for optimization. For more information, refer to the Optimization Profiles section in the TensorRT Developer Guide.
      - --buildOnly requests that inference performance measurements be skipped.
      - --saveEngine specifies the file into which the serialized engine must be saved.
      - --safe enables building safety certified engines. This switch is used for prototyping automotive safety restricted flows in the TensorRT safe runtime.
      - --tacticSources can be used to add or remove tactics from the default tactic sources (cuDNN, cuBLAS and cuBLASLt).
      - --minTiming and --avgTiming respectively set the minimum and average number of iterations used in tactic selection.
      - --noBuilderCache disables the layer timing cache in the TensorRT builder. The timing cache helps to reduce the time taken in the builder phase by caching the layer profiling information and should work for most cases. Use this switch for the problematic cases. For more information, refer to the Builder Layer Timing Cache section in the TensorRT Developer Guide.
      - --timingCacheFile can be used to save or load the serialized global timing cache.
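When the same conversion is run for several models or precisions, it can be convenient to assemble the trtexec command line in a script. The sketch below only builds and prints the argument list from the options described above; it does not run trtexec. `build_trtexec_command` is a hypothetical helper, `input` is a placeholder for the model's real input tensor name, and the shape options only make sense for a model that was exported with dynamic axes.

```python
import shlex


def build_trtexec_command(onnx_path, engine_path, explicit_batch=True, fp16=False,
                          workspace_mb=None, min_shapes=None, opt_shapes=None,
                          max_shapes=None):
    """Assemble a trtexec argument list.

    Shapes are dicts such as {"input": (1, 3, 224, 224)}, keyed by tensor name.
    """
    def fmt(shapes):
        # trtexec expects name:d0xd1x..., with several inputs separated by commas.
        return ",".join(f"{name}:{'x'.join(map(str, dims))}"
                        for name, dims in shapes.items())

    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if explicit_batch:
        cmd.append("--explicitBatch")
    if fp16:
        cmd.append("--fp16")
    if workspace_mb is not None:
        cmd.append(f"--workspace={workspace_mb}")  # value in MB
    for flag, shapes in (("--minShapes", min_shapes),
                         ("--optShapes", opt_shapes),
                         ("--maxShapes", max_shapes)):
        if shapes:
            cmd.append(f"{flag}={fmt(shapes)}")
    return cmd


cmd = build_trtexec_command(
    "your_model.onnx", "resnet_engine.trt", fp16=True, workspace_mb=256,
    min_shapes={"input": (1, 3, 224, 224)},
    opt_shapes={"input": (8, 3, 224, 224)},
    max_shapes={"input": (16, 3, 224, 224)},
)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where trtexec is on the PATH.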
  - Method 2
    - Build the TensorRT engine with the TensorRT API (Python version, using the ONNX parser that ships with TensorRT, i.e. the Parser in the TensorRT workflow flowchart):
        import os
        import sys
        import tensorrt as trt

        # logger that prints TensorRT messages
        TRT_LOGGER = trt.Logger()
        model_path = 'resnet18.onnx'
        engine_file_path = 'resnet18.trt'
        # create the network with an explicit batch dimension (required by the ONNX parser)
        EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

        # check that the onnx file exists
        if not os.path.exists(model_path):
            sys.exit('ONNX file {} not found.'.format(model_path))

        # create the builder, network, builder config and ONNX parser
        with trt.Builder(TRT_LOGGER) as builder, \
                builder.create_network(EXPLICIT_BATCH) as network, \
                builder.create_builder_config() as config, \
                trt.OnnxParser(network, TRT_LOGGER) as parser:
            # in TensorRT 8.x the workspace limit is set on the builder config
            config.max_workspace_size = 1 << 28  # 256 MiB
            print('Loading ONNX file from path {}...'.format(model_path))
            with open(model_path, 'rb') as model:
                print('Beginning ONNX file parsing')
                if not parser.parse(model.read()):
                    print('ERROR: Failed to parse the ONNX file.')
                    for error in range(parser.num_errors):
                        print(parser.get_error(error))
                    sys.exit(1)
            network.get_input(0).shape = [1, 3, 32, 32]
            print('Completed parsing of ONNX file')
            # build and serialize the engine, then write it to disk
            serialized_engine = builder.build_serialized_network(network, config)
            if serialized_engine is None:
                sys.exit('ERROR: Failed to build the engine.')
            with open(engine_file_path, 'wb') as f:
                f.write(serialized_engine)
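The workspace limit in the Python API is given in bytes, written above as the bit shift `1 << 28`, whereas the `--workspace` option of trtexec is given in MB. A small sketch of the conversion, with hypothetical helper names, makes the relationship explicit:

```python
def mib_to_bytes(mib: int) -> int:
    """Convert mebibytes to bytes (1 MiB = 2**20 bytes)."""
    return mib * (1 << 20)


def describe_workspace(num_bytes: int) -> str:
    """Render a byte count together with its size in MiB."""
    return f"{num_bytes} bytes = {num_bytes / (1 << 20):g} MiB"


print(describe_workspace(1 << 28))   # the value used in the script above
print(mib_to_bytes(256) == 1 << 28)  # 256 MiB is the same limit
```

So `1 << 28` in the script corresponds to `--workspace=256` on the trtexec command line.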
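Step 3: Model inference with the TensorRT engine
- The four basic steps of inference
  - 1. Input preprocessing: apply the same preprocessing as in training, so that the input image format matches the one used for training.
  - 2. Memory allocation: the allocate_buffers function; no changes needed.
  - 3. Inference: the do_inference_v2 function; no changes needed. To run inference on several images at once, the do_inference function can be used instead.
  - 4. Results: the tensorrt inference result is a list.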
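The four steps can be sketched as a small driver loop. The version below is only an outline that runs without a GPU: `run_inference` and the three stand-in functions are hypothetical, and in a real setup `infer` would wrap the buffer allocation and the do_inference_v2 call.

```python
import time


def run_inference(raw_inputs, preprocess, infer, postprocess):
    """Apply preprocess -> infer -> postprocess to each input and time each item."""
    results = []
    for raw in raw_inputs:
        start = time.perf_counter()
        batch = preprocess(raw)        # step 1: same preprocessing as in training
        outputs = infer(batch)         # steps 2-3: buffers and engine execution
        label = postprocess(outputs)   # step 4: the outputs arrive as a list
        results.append((label, time.perf_counter() - start))
    return results


# Stand-ins for illustration only.
def fake_preprocess(pixels):
    return [p / 255.0 for p in pixels]


def fake_infer(batch):
    mean = sum(batch) / len(batch)
    return [[1.0 - mean, mean]]        # one output holding two class scores


def pick_class(outputs):
    scores = outputs[0]
    return 0 if scores[0] > scores[1] else 1


results = run_inference([[0, 10, 20], [250, 255, 240]],
                        fake_preprocess, fake_infer, pick_class)
for label, seconds in results:
    print(label, f"{seconds * 1e3:.3f} ms")
```

Keeping the three stages as separate callables makes it easy to swap the stand-in `infer` for the TensorRT call while leaving preprocessing and result handling unchanged.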
- TensorRT + PyCUDA implementation
import pycuda.driver as cuda
import pycuda.autoinit
import cv2
import numpy as np
import os
import tensorrt as trt
import time
from PIL import Image
TRT_LOGGER = trt.Logger()
engine_file_path = "resnet18.trt"
# Simple helper data class that's a little nicer to use than a 2-tuple.
class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()
# Allocates all buffers required for an engine, i.e. host/device inputs/outputs.
def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream
# Inference function; can be used as-is.
def do_inference_v2(context, bindings, inputs, outputs, stream):
    # Transfer input data to the GPU.
    for inp in inputs:
        cuda.memcpy_htod_async(inp.device, inp.host, stream)
    # Run inference.
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    for out in outputs:
        cuda.memcpy_dtoh_async(out.host, out.device, stream)
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]
# Preprocessing: load an image as RGB and convert HWC uint8 -> 1xCxHxW float32 in [0, 1].
# The image size is assumed to match the engine's input size.
def preprocess(image_path):
    img = np.array(Image.open(image_path).convert("RGB"))
    img = img.transpose((2, 0, 1))  # HWC -> CHW
    img = img.astype(np.float32) / 255.0
    img = img[np.newaxis, ...]
    return np.ascontiguousarray(img)

# Classifies every image in `directory`; returns (number of class 0, number of class 1).
def classify_dir(context, inputs, outputs, bindings, stream, directory):
    count0 = 0
    count1 = 0
    for name in os.listdir(directory):
        t1 = time.perf_counter()  # time.clock() no longer exists in Python 3.8
        # preprocessing
        img = preprocess(os.path.join(directory, name))
        print(img.shape)
        # copy the image into the page-locked host buffer, then run inference
        np.copyto(inputs[0].host, img.ravel())
        trt_outputs = do_inference_v2(context, bindings=bindings,
                                      inputs=inputs, outputs=outputs, stream=stream)
        print(trt_outputs)
        # decide the class from the two output scores
        if trt_outputs[0][0] > trt_outputs[0][1]:
            print("0")
            count0 += 1
        else:
            print("1")
            count1 += 1
        print("Time:", time.perf_counter() - t1)
    return count0, count1

# Deserialize the engine once and reuse it for both directories.
with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime, \
        runtime.deserialize_cuda_engine(f.read()) as engine, \
        engine.create_execution_context() as context:
    inputs, outputs, bindings, stream = allocate_buffers(engine)
    print(inputs, outputs, bindings, stream)
    i, j = classify_dir(context, inputs, outputs, bindings, stream, "val/white/")
    m, n = classify_dir(context, inputs, outputs, bindings, stream, "val/yellow/")

print("i = ", i)
print("j = ", j)
print("m = ", m)
print("n = ", n)
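The four counters printed at the end can be turned into a single accuracy figure. The sketch below assumes that images in val/white/ belong to class 0 and images in val/yellow/ to class 1; the script does not state this mapping, so treat it as an assumption. The counts are made-up numbers and `accuracy_from_counts` is a hypothetical helper.

```python
def accuracy_from_counts(counts, expected):
    """Compute overall accuracy.

    counts:   {directory: [n_predicted_class_0, n_predicted_class_1]}
    expected: {directory: index of the true class for that directory}
    """
    correct = sum(c[expected[d]] for d, c in counts.items())
    total = sum(sum(c) for c in counts.values())
    return correct / total if total else 0.0


# Made-up example values standing in for (i, j) and (m, n).
counts = {"val/white/": [48, 2], "val/yellow/": [5, 45]}
expected = {"val/white/": 0, "val/yellow/": 1}  # assumed class mapping

acc = accuracy_from_counts(counts, expected)
print(f"accuracy = {acc:.2%}")
```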