Deploying a PyTorch Model with TensorRT

Environment setup (Ubuntu 20.04.2)

Anaconda3/Python    3.8.8
PyTorch             1.9.0+cu102
CUDA                11.3.r11.3
ONNX                1.10.2
ONNX Runtime        1.9.0


Common PyTorch deployment paths

CPU: PyTorch -> ONNX -> ONNX Runtime
GPU: PyTorch -> ONNX -> onnx2trt -> TensorRT
ARM: PyTorch -> ONNX -> ncnn / MACE / MNN, etc.

PyTorch GPU deployment workflow

  • Step 1: PyTorch2ONNX

  • Step 2: ONNX2TensorRT Engine

  • Step 3: TensorRT Inference

Step 1: PyTorch2ONNX

  • pytorch2onnx

    • Install the onnx, onnx-simplifier (onnxsim) and onnxruntime packages:

      pip install onnx             # install onnx
      pip install onnx-simplifier  # install onnxsim
      pip install onnxruntime      # install onnxruntime
    • Precision can be lowered from FP32 to FP16 (FP16 is only supported on CUDA); alternatively, export at full precision first and quantize later in TensorRT;

    • Models without special operations can be exported directly with torch.onnx.export() (a minimal export sketch follows);
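
      A minimal export sketch (the model class OriginalModel, the weight path, the input shape and the tensor names are placeholders, not from the original post):

      import torch
      from original_model import OriginalModel  # placeholder: your model class

      model = OriginalModel()
      model.load_state_dict(torch.load('xxx.pth', map_location='cpu'))
      model.eval()
      dummy_input = torch.ones((1, 3, 640, 640))  # must match the training input layout
      # optional FP16 export (CUDA only): model = model.half().cuda(); dummy_input = dummy_input.half().cuda()
      torch.onnx.export(
          model, dummy_input, 'xxx.onnx',
          opset_version=11,
          input_names=['input'],
          output_names=['output'],
          dynamic_axes={'input': {0: 'batch'}, 'output': {0: 'batch'}},  # optional dynamic batch
      )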

    • If the model contains special operations, they need to be rewritten; torch.onnx provides a mechanism for converting such operations (see the sketch below);
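
      For example, torch.onnx.register_custom_op_symbolic can map a custom operator onto standard ONNX nodes. A hypothetical sketch (the operator name mydomain::my_relu6 and its Clip mapping are invented for illustration):

      import torch
      import torch.onnx

      def my_relu6_symbolic(g, input):
          # express the custom op with standard ONNX operators (Clip with min=0, max=6)
          return g.op("Clip", input,
                      g.op("Constant", value_t=torch.tensor(0.0)),
                      g.op("Constant", value_t=torch.tensor(6.0)))

      torch.onnx.register_custom_op_symbolic("mydomain::my_relu6", my_relu6_symbolic, opset_version=11)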

    • Use onnxsim.simplify to simplify the exported ONNX model;

    • After conversion, the onnx / ONNX Runtime packages can be used to verify that the exported model is valid

      • Option 1:

      import onnx
      from onnxsim import simplify

      # load and check the exported model
      model = onnx.load('xxx.onnx')
      onnx.checker.check_model(model)
      print(onnx.helper.printable_graph(model.graph))
      # simplify onnx model
      model_sim, check = simplify(model)
      assert check, 'simplified ONNX model could not be validated'
      onnx.save(model_sim, 'xxx-sim.onnx')
      • Option 2:

      import onnxruntime as ort
      import numpy as np

      ort_session = ort.InferenceSession('xxx.onnx')
      # the key must be the model's actual input name and the array shape must match its input shape
      outputs = ort_session.run(None, {
          'input_name': np.random.randn(10, 3, 224, 224).astype(np.float32)
      })
      print(outputs[0])

    • Verify that the ONNX model and the original PyTorch model produce the same results; if they match, the conversion is correct;

    import torch
    import numpy as np
    import onnxruntime as ort
    from original_model import OriginalModel  # placeholder: your model class

    # load the PyTorch model
    torch_model = OriginalModel()
    torch_model.load_state_dict(torch.load('xxx.pth', map_location='cpu'))
    torch_model.eval()
    img = torch.ones((1, 3, 640, 640)) / 255
    # PyTorch inference
    with torch.no_grad():
        res = torch_model(img)
        print(res[0])

    # load the ONNX model
    sess = ort.InferenceSession('xxx.onnx')
    # get the input name
    input_name = sess.get_inputs()[0].name
    outputs = sess.run(None, {input_name: img.numpy()})
    print(outputs[0])
    # verify that the two results correspond with each other
    # (e.g. compare them element-wise with np.testing.assert_allclose)
    • The ONNX model can also be visualized with the netron tool

    Installing and using netron:

    # on Linux/Ubuntu, install with pip
    pip install netron
    # to visualize an ONNX model, run the `netron` command and press `Enter`;
    # netron will then start its interface, where the model can be opened
    netron
    # the model path can also be passed directly, e.g. `netron xxx-sim.onnx`

 

Step 2: ONNX2TensorRT Engine

  • Installing TensorRT

    Install TensorRT for Python using the .tar archive (this installation method avoids problems with missing dependency packages)

    • Confirm that CUDA and a matching PyTorch build are installed

    • Install the pycuda package: pip install pycuda

    • Confirm that the onnx package is installed

    • Download the TensorRT 8.x .tar file matching your CUDA version from: NVIDIA TensorRT | NVIDIA Developer

    • Extract the archive, change into the corresponding subdirectories, and install the tensorrt, graphsurgeon and onnx_graphsurgeon wheels:

    cd TensorRT-8.x.x/python
    pip install tensorrt-8.x.x-xxx.whl
    cd TensorRT-8.x.x/graphsurgeon
    pip install graphsurgeon-x.x.x.whl
    cd TensorRT-8.x.x/onnx_graphsurgeon
    pip install onnx_graphsurgeon-x.x.x.whl
    • Add TensorRT/lib to the environment variables (e.g. append the following lines to ~/.bashrc):

    export LD_LIBRARY_PATH=/home/YOUR_USERNAME/TensorRT-8.0.3.4/lib:$LD_LIBRARY_PATH  
    export PATH=/home/YOUR_USERNAME/TensorRT-8.0.3.4/bin:$PATH  
    export LD_LIBRARY_PATH=/home/YOUR_USERNAME/TensorRT-8.0.3.4/targets/x86_64-linux-gnu/lib:$LD_LIBRARY_PATH
    source ~/.bashrc
    • Verify that TensorRT is installed correctly

    import tensorrt  # a successful import means the installation succeeded
    print(tensorrt.__version__)
  • Basic TensorRT execution flow


    • First, a Builder is created from a trt Logger and used to create an INetworkDefinition (the computation graph);

    • Then a Parser fills the computation graph from an ONNX (or other framework) model; the network can also be built directly with the TensorRT API;

    • A CUDA engine is then built from the computation graph;

    • Inference is finally performed through the ExecutionContext created from the CUDA engine:

      engine.create_execution_context()
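
    • Condensed into code, this object chain looks roughly like the following sketch (TensorRT 8 Python API; the complete, runnable version is Option 2 below):

      import tensorrt as trt

      logger = trt.Logger(trt.Logger.WARNING)
      builder = trt.Builder(logger)                                   # Builder created from the Logger
      flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
      network = builder.create_network(flags)                         # empty INetworkDefinition
      parser = trt.OnnxParser(network, logger)                        # Parser fills the network from ONNX
      # parser.parse(onnx_model_bytes) ...
      config = builder.create_builder_config()
      serialized = builder.build_serialized_network(network, config)  # build the engine
      engine = trt.Runtime(logger).deserialize_cuda_engine(serialized)
      context = engine.create_execution_context()                     # inference runs through this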

  • Onnx2TensorRT Engine

    • Option 1

      • The trtexec tool shipped with TensorRT can convert an ONNX model directly into a TensorRT engine; the following command generates the engine:

      trtexec --onnx=your_model.onnx --saveEngine=resnet_engine.trt  --explicitBatch

      trtexec accepts the following optional arguments (a combined example follows this list):

      • --explicitBatch forces the batch dimension of the network inputs to be explicit; this flag cannot be omitted here, because the ONNX parser only supports explicit-batch mode. For more information, refer to the Working With Dynamic Shapes section in the TensorRT Developer Guide;

      • --fp16 enables FP16 precision for layers that support it, in addition to FP32. For more information, refer to the Working With Mixed Precision section in the TensorRT Developer Guide.

      • --int8 enables INT8 precision for layers that support it, in addition to FP32. For more information, refer to the Working With Mixed Precision section in the TensorRT Developer Guide.

      • --best enables all supported precisions to achieve the best performance for every layer.

      • --workspace controls the maximum amount of persistent scratch memory available (in MB) for algorithms considered by the builder. This should be set as high as possible for a given platform based on availability; at runtime TensorRT will allocate only what is required, not exceeding the max.

      • --minShapes and --maxShapes specify the range of dimensions for each network input and --optShapes specifies the dimensions that the auto-tuner should use for optimization. For more information, refer to the Optimization Profiles section in the TensorRT Developer Guide.

      • --buildOnly requests that inference performance measurements be skipped.

      • --saveEngine specifies the file into which the serialized engine must be saved.

      • --safe enables building safety certified engines. This switch is used for prototyping automotive safety restricted flows in the TensorRT safe runtime.

      • --tacticSources can be used to add or remove tactics from the default tactic sources (cuDNN, cuBLAS and cuBLASLt).

      • --minTiming and --avgTiming respectively set the minimum and average number of iterations used in tactic selection.

      • --noBuilderCache disables the layer timing cache in the TensorRT builder. The timing cache helps to reduce the time taken in the builder phase by caching the layer profiling information and should work for most cases. Use this switch for the problematic cases. For more information, refer to the Builder Layer Timing Cache section in the TensorRT Developer Guide.

      • --timingCacheFile can be used to save or load the serialized global timing cache.
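
      For instance, a combined invocation with FP16 and dynamic shapes might look like the following (the input tensor name `input` and the shape values are placeholders for your model):

      trtexec --onnx=your_model.onnx --saveEngine=your_model_fp16.trt --fp16 --workspace=2048 \
              --minShapes=input:1x3x224x224 --optShapes=input:8x3x224x224 --maxShapes=input:32x3x224x224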

    • Option 2

      • Use the TensorRT Python API with the built-in ONNX parser (the Parser in the TensorRT execution flow above) to build the TensorRT engine:

      import os
      import tensorrt as trt

      # logger for build messages
      TRT_LOGGER = trt.Logger()
      model_path = 'resnet18.onnx'
      engine_file_path = "resnet18.trt"
      # the ONNX parser requires an explicit-batch network
      EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

      # create the builder, network and ONNX parser
      with trt.Builder(TRT_LOGGER) as builder, builder.create_network(EXPLICIT_BATCH) \
              as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
          # TensorRT 8: the workspace size is set on the builder config
          config = builder.create_builder_config()
          config.max_workspace_size = 1 << 28
          # check that the onnx file exists
          if not os.path.exists(model_path):
              print('ONNX file {} not found.'.format(model_path))
              exit(0)
          print('Loading ONNX file from path {}...'.format(model_path))
          with open(model_path, 'rb') as model:
              print('Beginning ONNX file parsing')
              if not parser.parse(model.read()):
                  print('ERROR: Failed to parse the ONNX file.')
                  for error in range(parser.num_errors):
                      print(parser.get_error(error))
                  exit(0)

          network.get_input(0).shape = [1, 3, 32, 32]
          print('Completed parsing of ONNX file')
          # build and serialize the engine (build_cuda_engine was removed in TensorRT 8)
          serialized_engine = builder.build_serialized_network(network, config)
          with open(engine_file_path, "wb") as f:
              f.write(serialized_engine)

 

Step 3: Inference with the TensorRT Engine

  • The four basic steps of inference

    • 1. Input preprocessing: apply the same preprocessing as during training so that the input image format matches the training pipeline;

    • 2. Memory allocation: the allocate_buffers function, which does not need to be modified;

    • 3. Inference: the do_inference_v2 function, which does not need to be modified; for batched (multi-image) inference the do_inference variant can be used instead (a sketch follows do_inference_v2 in the listing below);

    • 4. Results: TensorRT returns the inference output as a list;

  • Implementation with TensorRT + PyCUDA

import pycuda.driver as cuda
import pycuda.autoinit
import cv2
import numpy as np
import os
import tensorrt as trt
import time
from PIL import Image
 
TRT_LOGGER = trt.Logger()
engine_file_path = "resnet18.trt"
 
# Simple helper data class that's a little nicer to use than a 2-tuple.
class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem
 
    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)
 
    def __repr__(self):
        return self.__str__()
 
# Allocates all buffers required for an engine, i.e. host/device inputs/outputs.
def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream
 
# inference function (fixed; no modification needed)
def do_inference_v2(context, bindings, inputs, outputs, stream):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]
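 
# Sketch of the batched do_inference variant mentioned above, for engines built
# with an implicit batch dimension (execute_async takes an explicit batch_size).
# This follows the standard TensorRT Python samples and is an assumption, not
# code from the original post.
def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference for a batch of inputs.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream.
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]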
 
# prediction counters for the val/white set (i: class 0, j: class 1)
i = 0
j = 0
with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime, \
        runtime.deserialize_cuda_engine(f.read()) as engine, engine.create_execution_context() as context:
    inputs, outputs, bindings, stream = allocate_buffers(engine)
    print(inputs,outputs,bindings,stream)
    dir = "val/white/"
 
    for name in os.listdir(dir):
        # preprocessing
        t1 = time.perf_counter()  # time.clock() was removed in Python 3.8
        image_path = os.path.join(dir,name)
 
        img = Image.open(image_path)
        img = np.array(img)
        img = img.transpose((1, 0, 2))
        img = img.transpose((2, 1, 0))  # the two transposes together convert HWC to CHW
        img = img.astype(np.float32) / 255.0
 
        img = img[np.newaxis, :, :].astype(np.float32)
        print(img.shape)
        img = np.ascontiguousarray(img)
        # end of preprocessing
 
        # run inference
        inputs[0].host = img
        trt_outputs = do_inference_v2(context, bindings=bindings, \
                                      inputs=inputs, outputs=outputs, stream=stream)
        print(trt_outputs)
 
        # interpret the result
        if trt_outputs[0][0] > trt_outputs[0][1]:
            print("0")
            i = i +1
        else:
            print("1")
            j = j +1
        print("Time:",time.clock()-t1)
 
# prediction counters for the val/yellow set (m: class 0, n: class 1)
m = 0
n = 0
with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime, \
        runtime.deserialize_cuda_engine(f.read()) as engine, engine.create_execution_context() as context:
    inputs, outputs, bindings, stream = allocate_buffers(engine)
    print(inputs,outputs,bindings,stream)
    dir = "val/yellow/"
 
    for name in os.listdir(dir):
        t1 = time.perf_counter()
        image_path = os.path.join(dir,name)
 
        image = cv2.imread(image_path)
        img = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
        img = np.array(img)
        img = img.transpose((1, 0, 2))
        img = img.transpose((2, 1, 0))
        img = img.astype(np.float32) / 255.0
 
        img = img[np.newaxis, :, :].astype(np.float32)
        print(img.shape)
        img = np.ascontiguousarray(img)
        inputs[0].host = img
        # run inference
        trt_outputs = do_inference_v2(context, bindings=bindings, \
                                      inputs=inputs, outputs=outputs, stream=stream)
        print(trt_outputs)
        if trt_outputs[0][0] > trt_outputs[0][1]:
            print("0")
            m = m+1
        else:
            print("1")
            n = n +1
        print("Time:",time.clock()-t1)
#
print("i = ",i)
print("j = ",j)
print("m = ",m)
print("n = ",n)

