【pytorch】Deploying Models to Production with TensorRT 8: Code Optimization and Deployment (Part 1): Python Interface Implementation


(I) Introduction to TensorRT:
TensorRT is a C++ library for high-performance inference on NVIDIA graphics processing units (GPUs), dedicated to running network inference quickly and efficiently on the GPU.

TensorRT can compress, optimize, and deploy a network at runtime without framework overhead, improving the network's latency, throughput, and efficiency.

TensorRT is usually used asynchronously: when input data arrives, the program calls the enqueue function with an input buffer and a buffer into which TensorRT places the result.

Below is the TensorRT architecture diagram:
[Figure: TensorRT architecture overview]

Network Definition: the network definition interface provides methods for specifying a network definition. You can specify the network's input and output tensors and add layers, although networks are generally not built by hand with TensorRT.

Builder Configuration: the builder configuration interface specifies the details used to create the engine. It lets the application set optimization profiles, the maximum workspace size, the minimum acceptable precision level, the timing iterations used for auto-tuning, and the interface for quantizing the network to run at 8-bit precision.

Builder: the builder interface creates an optimized engine from a network definition and a builder configuration.

Engine: the engine interface lets the application run inference. It supports synchronous and asynchronous execution, profiling, and enumerating and querying the engine's input and output bindings.
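As a quick illustration of the synchronous and asynchronous execution paths, here is a minimal sketch, assuming an execution context, bindings list and CUDA stream already prepared as in the common module at the end of this post:

# Sketch only: `context`, `bindings` and `stream` are assumed to exist
# (see common.allocate_buffers() at the end of this post).
context.execute_v2(bindings)                        # synchronous: blocks until inference finishes
context.execute_async_v2(bindings, stream.handle)   # asynchronous: returns immediately
stream.synchronize()                                # wait for the asynchronous call to complete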

TensorRT optimizes the network according to its definition (including platform-specific optimizations) and generates an inference engine. This process is called the build phase, so a typical application builds the engine only once and serializes it to a plan file for later use. [Note: the generated plan file is not portable across platforms or TensorRT versions. A plan file is specific to the exact GPU model it was built on, so to run it on a different GPU it must be rebuilt for that GPU.]
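Because a plan file is tied to a specific GPU and TensorRT version, one option is to encode both in the file name when serializing; a minimal sketch (the naming scheme is only an illustration):

import tensorrt as trt
import torch

# Encode the TensorRT version and GPU model into the plan file name so that
# a mismatch is obvious when the file is deserialized later.
gpu_name = torch.cuda.get_device_name(0).replace(' ', '_')
engine_file_path = f'class10_trt{trt.__version__}_{gpu_name}.trt'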

(II) Conversion approach and model preparation:
The conversion path is: pytorch -> onnx -> onnx2trt -> TensorRT
For the pytorch -> onnx step, the scripting method (torch.jit.script) is used and the model is exported to onnx; the code is as follows:

import numpy as np
import torch
import cv2
from torch import nn
# (The class skeleton below can be generated with PyCharm's Code > Generate feature.)
# Add explicit annotations (scripting) to the model:
class myCustomerNetWork(torch.jit.ScriptModule):
    def __init__(self):
        super().__init__()
        # Feature extractor: 3 input channels, growing to 256 channels:
        self.features=nn.Sequential(nn.Conv2d(3, 64, (3, 3)),nn.ReLU(),nn.Conv2d(64,128,(3,3)),
                                    nn.ReLU(),nn.Conv2d(128,256,(3,3)),nn.ReLU(),nn.AdaptiveAvgPool2d(1))

        self.classfired=nn.Sequential(nn.Flatten(),nn.Linear(256,80),nn.Dropout(),nn.Linear(80,10))

    @torch.jit.script_method
    def forward(self,x):
        return self.classfired(self.features(x))
# The network expects input of shape torch.Size([32, 3, 32, 32])
# The class subclasses torch.jit.ScriptModule, so instantiating it already yields a ScriptModule;
# no separate torch.jit.script() call is needed.
myNet=myCustomerNetWork()
pthfile = r'D:\flask_pytorch\saveTextOnlyParams.pth'
# With strict=False, parameters that match the file are loaded; the rest keep their default initialization.
myNet.load_state_dict(torch.load(pthfile),strict=False)
if torch.cuda.is_available():
    myNet=myNet.cuda()
myNet.eval()

if __name__ == '__main__':
    imagePath = r"C:\Users\25360\Desktop\monodepth.jpeg"
    img = cv2.imdecode(np.fromfile(imagePath, np.uint8), -1)
    img = cv2.resize(img, (32, 32))
    # BGR to RGB
    img = img[:, :, ::-1].copy()
    inputX = torch.FloatTensor(img).cuda()
    inputX = inputX.permute(2, 0, 1).contiguous()
    inputX = inputX.unsqueeze(0)
    #torch_out=myNet(inputX)
    # Serialize the model
    #myNet.save('jit_model2.pth')
    # At run time torch.onnx.export first checks whether the model is a ScriptModule; if not, it calls
    # torch.jit.trace, which is why export needs a sample input.
    # When a ScriptModule is passed, older versions additionally required example_outputs to obtain the
    # output shape and dtype without running the model.
    # The model above was obtained via scripting, so it does not need to be run, but the input and output
    # shapes must still be given; in general, tracing is used more often unless there is a special reason.
    dynamic_axes = {'input': {0: 'batch'}, 'output': {0: 'batch'}}  # configure a dynamic batch dimension
    # In recent PyTorch versions, inputX is still required for scripted models, but it is only used to
    # derive the output shape; the example_outputs argument has been removed.
    torch.onnx.export(myNet, inputX, r'./modelForTensorRT.onnx', input_names=['input'], output_names=['output'], dynamic_axes=dynamic_axes)
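Before moving on to TensorRT, the exported ONNX file can be checked and the dynamic batch axis verified; a small self-contained sketch using onnx and onnxruntime:

import onnx
import onnxruntime
import numpy as np

# Structural check of the exported graph.
onnx.checker.check_model(onnx.load(r'./modelForTensorRT.onnx'))

# Feed a dummy batch of 4 through onnxruntime to confirm the dynamic batch dimension works.
sess = onnxruntime.InferenceSession(r'./modelForTensorRT.onnx')
dummy = np.random.randn(4, 3, 32, 32).astype(np.float32)
out = sess.run(['output'], {'input': dummy})[0]
print(out.shape)  # expected: (4, 10)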

Installing the Python version of TensorRT:
Installation commands on Linux:

pip install tensorrt
pip install nvidia-pyindex
pip install nvidia-tensorrt

Installing the Python and C++ versions on Win10 is more involved; see the separate post 【pytorch】Win10安装C++版及python版本tensorRT (installing the C++ and Python versions of TensorRT on Win10).
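After installation, a quick way to confirm the Python bindings work is to import the package and print its version:

import tensorrt as trt
print(trt.__version__)                                   # should print an 8.x version
builder = trt.Builder(trt.Logger(trt.Logger.WARNING))    # fails if the CUDA/TensorRT libraries are missing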

# TensorRT only supports GPUs
Note that the common module comes with the official TensorRT samples; allocate_buffers is modified here to handle dynamic batch sizes, see the end of this post.
(III) Usage of the Python interface:
(1) Converting ONNX to a serialized TRT engine:
The central object here is the builder. The builder creates the network, the optimization profile, and the config; the config uses the profile to describe the dynamic input shapes; the parser fills the network from the ONNX file; finally, the builder produces the engine from the network and the config.

import tensorrt as trt
import common
def ONNX_build_engine(onnx_file_path, write_engine=True):
    # Build an engine by loading an onnx file
    # :param onnx_file_path: path to the onnx file
    # :return: serialized engine
    # Create the logger
    G_LOGGER = trt.Logger(trt.Logger.WARNING)
    # An explicit-batch network is required when parsing ONNX models (needed for dynamic inputs):
    explicit_batch = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    batch_size = 8  # maximum batch size supported during TRT inference
    with trt.Builder(G_LOGGER) as builder, builder.create_network(explicit_batch) as network, \
            trt.OnnxParser(network, G_LOGGER) as parser:
        builder.max_batch_size = batch_size
        config = builder.create_builder_config()
        config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE,common.GiB(2))
        config.set_flag(trt.BuilderFlag.FP16)
        print('Loading ONNX file from path {}...'.format(onnx_file_path))
        with open(onnx_file_path, 'rb') as model:
            print('Beginning ONNX file parsing')
            if not parser.parse(model.read()):
                # Report parsing errors instead of silently building an empty network.
                for i in range(parser.num_errors):
                    print(parser.get_error(i))
                return None
        print('Completed parsing of ONNX file')
        print('Building an engine from file {}; this may take a while...'.format(onnx_file_path))
        # Key part
        profile = builder.create_optimization_profile()  # needed for dynamic inputs: minimum, optimal and maximum shapes
        # One profile.set_shape call per input; the names must match those used when exporting to ONNX.
        # TensorRT 6 and later support dynamic inputs; each dynamic input must be bound to a profile that
        # specifies its minimum, optimal and maximum shapes. Inputs outside this range raise an error.
        profile.set_shape("input", (1, 3, 32, 32), (1, 3, 32, 32), (8, 3, 32, 32))
        config.add_optimization_profile(profile)
        engine = builder.build_serialized_network(network, config)
        print("Completed creating Engine")
        # Save the engine to a plan file
        if write_engine:
            engine_file_path = 'class10.trt'
            with open(engine_file_path, "wb") as f:
                f.write(engine)
        return engine

TRT_LOGGER = trt.Logger()
onnx_model_path = 'modelForTensorRT.onnx'
# Build an engine
engine = ONNX_build_engine(onnx_model_path, True)

The output is:

Loading ONNX file from path modelForTensorRT.onnx...
Beginning ONNX file parsing
[05/11/2022-23:08:06] [TRT] [W] onnx2trt_utils.cpp:365: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
Completed parsing of ONNX file
Building an engine from file modelForTensorRT.onnx; this may take a while...
[05/11/2022-23:08:07] [TRT] [W] TensorRT was linked against cuBLAS/cuBLAS LT 11.8.0 but loaded cuBLAS/cuBLAS LT 11.5.1
[05/11/2022-23:08:08] [TRT] [W] TensorRT was linked against cuDNN 8.3.2 but loaded cuDNN 8.2.0
Completed creating Engine
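Once the plan file is written, it can be deserialized right away to confirm that the dynamic batch axis made it into the engine; a short inspection sketch:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open('class10.trt', 'rb') as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
for i in range(engine.num_bindings):
    # The input binding should report (-1, 3, 32, 32): -1 marks the dynamic batch dimension.
    print(engine.get_binding_name(i), engine.get_binding_shape(i), engine.binding_is_input(i))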

(2) Deserializing the trt file and running inference with TensorRT:

import numpy as np
import torch
import cv2
import time
import tensorrt as trt
import common
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(TRT_LOGGER)
# Deserialize the engine
with open('class10.trt', 'rb') as f:
    engine = runtime.deserialize_cuda_engine(f.read())
    print("Completed creating Engine")
# Create the context for this engine
context = engine.create_execution_context()
# The 1 in the next two lines is the batch size of the actual input; it can vary up to the maximum of 8.
context.set_binding_shape(0, (1, 3, 32, 32))
# Allocate buffers for input and output
inputs, outputs, bindings, stream = common.allocate_buffers(engine,1)  # input, output: host # bindings
#
# Inference: prepare the model input
# The network input is in (n, c, h, w) format
imagePath = r"C:\Users\25360\Desktop\monodepth.jpeg"
img = cv2.imdecode(np.fromfile(imagePath, np.uint8), -1)
img = cv2.resize(img, (32, 32))
# BGR to RGB
img = img[:, :, ::-1].copy()
inputX = torch.FloatTensor(img)
inputX = inputX.permute(2, 0, 1).contiguous()
inputX = inputX.unsqueeze(0)
inputs[0].host = inputX.numpy()
# inputs[1].host = ... for multiple inputs
t1 = time.time()
# The input data must be numpy arrays
trt_outputs = common.do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
t2 = time.time()
print(t2-t1)
# Because the maximum batch is 8, the output can be reshaped to [8, 10]; since only one image was fed,
# only row [0] is meaningful and the remaining rows are all zeros.
print(np.reshape(trt_outputs[0],[-1,10])[0])

The output is:

0.0010251998901367188
[  71.5625      10.8828125  164.875      313.5       -148.125
  329.5        109.875     -266.        -171.25      -272.5      ]
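To sanity-check the engine, the same preprocessed image can also be pushed through the original PyTorch model and compared with the TensorRT result; a minimal sketch, assuming myNet from the export script is still available in the same session:

# Compare the TensorRT result with the original PyTorch model on the same input.
with torch.no_grad():
    torch_ref = myNet(inputX.cuda()).cpu().numpy()[0]
trt_ref = np.reshape(trt_outputs[0], [-1, 10])[0]
print(np.max(np.abs(torch_ref - trt_ref)))  # small differences are expected because FP16 was enabled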

For convenience, the source of the common module is attached below:

#
# Copyright (c) 1993-2022, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import argparse
import os

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt

try:
    # Sometimes python does not understand FileNotFoundError
    FileNotFoundError
except NameError:
    FileNotFoundError = IOError

EXPLICIT_BATCH = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)


def GiB(val):
    return val * 1 << 30


def add_help(description):
    parser = argparse.ArgumentParser(description=description, formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    args, _ = parser.parse_known_args()


def find_sample_data(description="Runs a TensorRT Python sample", subfolder="", find_files=[], err_msg=""):
    """
    Parses sample arguments.

    Args:
        description (str): Description of the sample.
        subfolder (str): The subfolder containing data relevant to this sample
        find_files (str): A list of filenames to find. Each filename will be replaced with an absolute path.

    Returns:
        str: Path of data directory.
    """

    # Standard command-line arguments for all samples.
    kDEFAULT_DATA_ROOT = os.path.join(os.sep, "usr", "src", "tensorrt", "data")
    parser = argparse.ArgumentParser(description=description, formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument(
        "-d",
        "--datadir",
        help="Location of the TensorRT sample data directory, and any additional data directories.",
        action="append",
        default=[kDEFAULT_DATA_ROOT],
    )
    args, _ = parser.parse_known_args()

    def get_data_path(data_dir):
        # If the subfolder exists, append it to the path, otherwise use the provided path as-is.
        data_path = os.path.join(data_dir, subfolder)
        if not os.path.exists(data_path):
            if data_dir != kDEFAULT_DATA_ROOT:
                print("WARNING: " + data_path + " does not exist. Trying " + data_dir + " instead.")
            data_path = data_dir
        # Make sure data directory exists.
        if not (os.path.exists(data_path)) and data_dir != kDEFAULT_DATA_ROOT:
            print(
                "WARNING: {:} does not exist. Please provide the correct data path with the -d option.".format(
                    data_path
                )
            )
        return data_path

    data_paths = [get_data_path(data_dir) for data_dir in args.datadir]
    return data_paths, locate_files(data_paths, find_files, err_msg)


def locate_files(data_paths, filenames, err_msg=""):
    """
    Locates the specified files in the specified data directories.
    If a file exists in multiple data directories, the first directory is used.

    Args:
        data_paths (List[str]): The data directories.
        filenames (List[str]): The names of the files to find.

    Returns:
        List[str]: The absolute paths of the files.

    Raises:
        FileNotFoundError if a file could not be located.
    """
    found_files = [None] * len(filenames)
    for data_path in data_paths:
        # Find all requested files.
        for index, (found, filename) in enumerate(zip(found_files, filenames)):
            if not found:
                file_path = os.path.abspath(os.path.join(data_path, filename))
                if os.path.exists(file_path):
                    found_files[index] = file_path

    # Check that all files were found
    for f, filename in zip(found_files, filenames):
        if not f or not os.path.exists(f):
            raise FileNotFoundError(
                "Could not find {:}. Searched in data paths: {:}\n{:}".format(filename, data_paths, err_msg)
            )
    return found_files


# Simple helper data class that's a little nicer to use than a 2-tuple.
class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()


# Allocates all buffers required for an engine, i.e. host/device inputs/outputs.

def allocate_buffers(engine,max_batch_size=16):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        dims = engine.get_binding_shape(binding)
        #print(dims)
        if dims[0] == -1:
            assert(max_batch_size is not None)
            dims[0] = max_batch_size  # adapt to a dynamic batch dimension
        #size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        size = trt.volume(dims) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        #print(dtype,size)
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)  # page-locked (pinned) host memory
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream


# This function is generalized for multiple inputs/outputs.
# inputs and outputs are expected to be lists of HostDeviceMem objects.
def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]


# This function is generalized for multiple inputs/outputs for full dimension networks.
# inputs and outputs are expected to be lists of HostDeviceMem objects.
def do_inference_v2(context, bindings, inputs, outputs, stream):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]
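Since the engine above was built with an explicit batch dimension, do_inference_v2 is the variant intended for such full-dims networks; the call is the same as do_inference except that no batch_size argument is passed. A minimal usage sketch, assuming the context and buffers from part (2):

# Same context and buffers as in part (2); only the inference call differs.
trt_outputs = common.do_inference_v2(context, bindings=bindings, inputs=inputs,
                                     outputs=outputs, stream=stream)
print(np.reshape(trt_outputs[0], [-1, 10])[0])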