【pytorch】Deploying Models to Production with TensorRT 8: Code Optimization and Deployment (Part 1): Python Interface Implementation


(I) Introduction to TensorRT:
TensorRT is a C++ library for high-performance inference on NVIDIA graphics processing units (GPUs), dedicated to running network inference quickly and efficiently on the GPU.

TensorRT can compress, optimize, and deploy a network at runtime without framework overhead, improving the network's latency, throughput, and efficiency.

TensorRT is usually used asynchronously: when input data arrives, the program calls the enqueue function with an input buffer and a buffer into which TensorRT places the result.

Below is the TensorRT architecture diagram:
[Figure: TensorRT architecture overview]

Network Definition: the network definition interface provides methods for specifying a network definition. You can specify the network's input and output tensors and add layers, although networks are generally not built by hand with TensorRT.

Builder Configuration: the builder configuration interface specifies the details used to create the engine. It lets the application set optimization profiles, the maximum workspace size, the minimum acceptable precision level, the timing iterations used for auto-tuning, and the interface for quantizing the network to run at 8-bit precision.

Builder: the builder interface creates an optimized engine from a network definition and a builder configuration.

Engine: the engine interface lets the application run inference. It supports synchronous and asynchronous execution, profiling, and enumerating and querying the engine's input and output bindings.
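As a quick illustration of the synchronous and asynchronous execution paths, here is a minimal sketch, assuming an execution context, bindings list and CUDA stream already prepared as in the common module at the end of this post:

# Sketch only: `context`, `bindings` and `stream` are assumed to exist
# (see common.allocate_buffers() at the end of this post).
context.execute_v2(bindings)                        # synchronous: blocks until inference finishes
context.execute_async_v2(bindings, stream.handle)   # asynchronous: returns immediately
stream.synchronize()                                # wait for the asynchronous call to complete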

TensorRT optimizes the network according to its definition (including platform-specific optimizations) and generates an inference engine. This process is called the build phase, so a typical application builds the engine only once and serializes it to a plan file for later use. [Note: the generated plan file is not portable across platforms or TensorRT versions. A plan file is specific to the exact GPU model it was built on, so to run it on a different GPU it must be rebuilt for that GPU.]
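Because a plan file is tied to a specific GPU and TensorRT version, one option is to encode both in the file name when serializing; a minimal sketch (the naming scheme is only an illustration):

import tensorrt as trt
import torch

# Encode the TensorRT version and GPU model into the plan file name so that
# a mismatch is obvious when the file is deserialized later.
gpu_name = torch.cuda.get_device_name(0).replace(' ', '_')
engine_file_path = f'class10_trt{trt.__version__}_{gpu_name}.trt'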

(II) Conversion approach and model preparation:
The conversion path is: pytorch -> onnx -> onnx2trt -> TensorRT
For the pytorch -> onnx step, the scripting method (torch.jit.script) is used and the model is exported to onnx; the code is as follows:

import numpy as np
import torch
import cv2
from torch import nn
# (The class skeleton below can be generated with PyCharm's Code > Generate feature.)
# Add explicit annotations (scripting) to the model:
class myCustomerNetWork(torch.jit.ScriptModule):
    def __init__(self):
        super().__init__()
        # Feature extractor: 3 input channels, growing to 256 channels:
        self.features=nn.Sequential(nn.Conv2d(3, 64, (3, 3)),nn.ReLU(),nn.Conv2d(64,128,(3,3)),
                                    nn.ReLU(),nn.Conv2d(128,256,(3,3)),nn.ReLU(),nn.AdaptiveAvgPool2d(1))

        self.classfired=nn.Sequential(nn.Flatten(),nn.Linear(256,80),nn.Dropout(),nn.Linear(80,10))

    @torch.jit.script_method
    def forward(self,x):
        return self.classfired(self.features(x))
# The network expects input of shape torch.Size([32, 3, 32, 32])
# The class subclasses torch.jit.ScriptModule, so instantiating it already yields a ScriptModule;
# no separate torch.jit.script() call is needed.
myNet=myCustomerNetWork()
pthfile = r'D:\flask_pytorch\saveTextOnlyParams.pth'
# With strict=False, parameters that match the file are loaded; the rest keep their default initialization.
myNet.load_state_dict(torch.load(pthfile),strict=False)
if torch.cuda.is_available():
    myNet=myNet.cuda()
myNet.eval()

if __name__ == '__main__':
    imagePath = r"C:\Users\25360\Desktop\monodepth.jpeg"
    img = cv2.imdecode(np.fromfile(imagePath, np.uint8), -1)
    img = cv2.resize(img, (32, 32))
    # BGR to RGB
    img = img[:, :, ::-1].copy()
    inputX = torch.FloatTensor(img).cuda()
    inputX = inputX.permute(2, 0, 1).contiguous()
    inputX = inputX.unsqueeze(0)
    #torch_out=myNet(inputX)
    # Serialize the model
    #myNet.save('jit_model2.pth')
    # At run time torch.onnx.export first checks whether the model is a ScriptModule; if not, it calls
    # torch.jit.trace, which is why export needs a sample input.
    # When a ScriptModule is passed, older versions additionally required example_outputs to obtain the
    # output shape and dtype without running the model.
    # The model above was obtained via scripting, so it does not need to be run, but the input and output
    # shapes must still be given; in general, tracing is used more often unless there is a special reason.
    dynamic_axes = {'input': {0: 'batch'}, 'output': {0: 'batch'}}  # configure a dynamic batch dimension
    # In recent PyTorch versions, inputX is still required for scripted models, but it is only used to
    # derive the output shape; the example_outputs argument has been removed.
    torch.onnx.export(myNet, inputX, r'./modelForTensorRT.onnx', input_names=['input'], output_names=['output'], dynamic_axes=dynamic_axes)
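Before moving on to TensorRT, the exported ONNX file can be checked and the dynamic batch axis verified; a small self-contained sketch using onnx and onnxruntime:

import onnx
import onnxruntime
import numpy as np

# Structural check of the exported graph.
onnx.checker.check_model(onnx.load(r'./modelForTensorRT.onnx'))

# Feed a dummy batch of 4 through onnxruntime to confirm the dynamic batch dimension works.
sess = onnxruntime.InferenceSession(r'./modelForTensorRT.onnx')
dummy = np.random.randn(4, 3, 32, 32).astype(np.float32)
out = sess.run(['output'], {'input': dummy})[0]
print(out.shape)  # expected: (4, 10)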

Installing the Python version of TensorRT:
Installation commands on Linux:

pip install tensorrt
pip install nvidia-pyindex
pip install nvidia-tensorrt

Installing the Python and C++ versions on Win10 is more involved; see the separate post 【pytorch】Win10安装C++版及python版本tensorRT (installing the C++ and Python versions of TensorRT on Win10).
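After installation, a quick way to confirm the Python bindings work is to import the package and print its version:

import tensorrt as trt
print(trt.__version__)                                   # should print an 8.x version
builder = trt.Builder(trt.Logger(trt.Logger.WARNING))    # fails if the CUDA/TensorRT libraries are missing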

# TensorRT only supports GPUs
Note that the common module comes with the official TensorRT samples; allocate_buffers is modified here to handle dynamic batch sizes, see the end of this post.
(III) Usage of the Python interface:
(1) Converting ONNX to a serialized TRT engine:
The central object here is the builder. The builder creates the network, the optimization profile, and the config; the config uses the profile to describe the dynamic input shapes; the parser fills the network from the ONNX file; finally, the builder produces the engine from the network and the config.

import tensorrt as trt
import common
def ONNX_build_engine(onnx_file_path, write_engine=True):
    # Build an engine by loading an onnx file
    # :param onnx_file_path: path to the onnx file
    # :return: serialized engine
    # Create the logger
    G_LOGGER = trt.Logger(trt.Logger.WARNING)
    # An explicit-batch network is required when parsing ONNX models (needed for dynamic inputs):
    explicit_batch = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    batch_size = 8  # maximum batch size supported during TRT inference
    with trt.Builder(G_LOGGER) as builder, builder.create_network(explicit_batch) as network, \
            trt.OnnxParser(network, G_LOGGER) as parser:
        builder.max_batch_size = batch_size
        config = builder.create_builder_config()
        config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE,common.GiB(2))
        config.set_flag(trt.BuilderFlag.FP16)
        print('Loading ONNX file from path {}...'.format(onnx_file_path))
        with open(onnx_file_path, 'rb') as model:
            print('Beginning ONNX file parsing')
            if not parser.parse(model.read()):
                # Report parsing errors instead of silently building an empty network.
                for i in range(parser.num_errors):
                    print(parser.get_error(i))
                return None
        print('Completed parsing of ONNX file')
        print('Building an engine from file {}; this may take a while...'.format(onnx_file_path))
        # Key part
        profile = builder.create_optimization_profile()  # needed for dynamic inputs: minimum, optimal and maximum shapes
        # One profile.set_shape call per input; the names must match those used when exporting to ONNX.
        # TensorRT 6 and later support dynamic inputs; each dynamic input must be bound to a profile that
        # specifies its minimum, optimal and maximum shapes. Inputs outside this range raise an error.
        profile.set_shape("input", (1, 3, 32, 32), (1, 3, 32, 32), (8, 3, 32, 32))
        config.add_optimization_profile(profile)
        engine = builder.build_serialized_network(network, config)
        print("Completed creating Engine")
        # Save the engine to a plan file
        if write_engine:
            engine_file_path = 'class10.trt'
            with open(engine_file_path, "wb") as f:
                f.write(engine)
        return engine

TRT_LOGGER = trt.Logger()
onnx_model_path = 'modelForTensorRT.onnx'
# Build an engine
engine = ONNX_build_engine(onnx_model_path, True)

The output is:

Loading ONNX file from path modelForTensorRT.onnx...
Beginning ONNX file parsing
[05/11/2022-23:08:06] [TRT] [W] onnx2trt_utils.cpp:365: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
Completed parsing of ONNX file
Building an engine from file modelForTensorRT.onnx; this may take a while...
[05/11/2022-23:08:07] [TRT] [W] TensorRT was linked against cuBLAS/cuBLAS LT 11.8.0 but loaded cuBLAS/cuBLAS LT 11.5.1
[05/11/2022-23:08:08] [TRT] [W] TensorRT was linked against cuDNN 8.3.2 but loaded cuDNN 8.2.0
Completed creating Engine
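Once the plan file is written, it can be deserialized right away to confirm that the dynamic batch axis made it into the engine; a short inspection sketch:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open('class10.trt', 'rb') as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
for i in range(engine.num_bindings):
    # The input binding should report (-1, 3, 32, 32): -1 marks the dynamic batch dimension.
    print(engine.get_binding_name(i), engine.get_binding_shape(i), engine.binding_is_input(i))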

(2) Deserializing the trt file and running inference with TensorRT:

import numpy as np
import torch
import cv2
import time
import tensorrt as trt
import common
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(TRT_LOGGER)
# Deserialize the engine
with open('class10.trt', 'rb') as f:
    engine = runtime.deserialize_cuda_engine(f.read())
    print("Completed creating Engine")
# Create the context for this engine
context = engine.create_execution_context()
# The 1 in the next two lines is the batch size of the actual input; it can vary up to the maximum of 8.
context.set_binding_shape(0, (1, 3, 32, 32))
# Allocate buffers for input and output
inputs, outputs, bindings, stream = common.allocate_buffers(engine,1)  # input, output: host # bindings
#
# Inference: prepare the model input
# The network input is in (n, c, h, w) format
imagePath = r"C:\Users\25360\Desktop\monodepth.jpeg"
img = cv2.imdecode(np.fromfile(imagePath, np.uint8), -1)
img = cv2.resize(img, (32, 32))
# BGR to RGB
img = img[:, :, ::-1].copy()
inputX = torch.FloatTensor(img)
inputX = inputX.permute(2, 0, 1).contiguous()
inputX = inputX.unsqueeze(0)
inputs[0].host = inputX.numpy()
# inputs[1].host = ... for multiple inputs
t1 = time.time()
# The input data must be numpy arrays
trt_outputs = common.do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
t2 = time.time()
print(t2-t1)
# Because the maximum batch is 8, the output can be reshaped to [8, 10]; since only one image was fed,
# only row [0] is meaningful and the remaining rows are all zeros.
print(np.reshape(trt_outputs[0],[-1,10])[0])

The output is:

0.0010251998901367188
[  71.5625      10.8828125  164.875      313.5       -148.125
  329.5        109.875     -266.        -171.25      -272.5      ]
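To sanity-check the engine, the same preprocessed image can also be pushed through the original PyTorch model and compared with the TensorRT result; a minimal sketch, assuming myNet from the export script is still available in the same session:

# Compare the TensorRT result with the original PyTorch model on the same input.
with torch.no_grad():
    torch_ref = myNet(inputX.cuda()).cpu().numpy()[0]
trt_ref = np.reshape(trt_outputs[0], [-1, 10])[0]
print(np.max(np.abs(torch_ref - trt_ref)))  # small differences are expected because FP16 was enabled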

For convenience, the source of the common module is attached below:

#
# Copyright (c) 1993-2022, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import argparse
import os

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt

try:
    # Sometimes python does not understand FileNotFoundError
    FileNotFoundError
except NameError:
    FileNotFoundError = IOError

EXPLICIT_BATCH = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)


def GiB(val):
    return val * 1 << 30


def add_help(description):
    parser = argparse.ArgumentParser(description=description, formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    args, _ = parser.parse_known_args()


def find_sample_data(description="Runs a TensorRT Python sample", subfolder="", find_files=[], err_msg=""):
    """
    Parses sample arguments.

    Args:
        description (str): Description of the sample.
        subfolder (str): The subfolder containing data relevant to this sample
        find_files (str): A list of filenames to find. Each filename will be replaced with an absolute path.

    Returns:
        str: Path of data directory.
    """

    # Standard command-line arguments for all samples.
    kDEFAULT_DATA_ROOT = os.path.join(os.sep, "usr", "src", "tensorrt", "data")
    parser = argparse.ArgumentParser(description=description, formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument(
        "-d",
        "--datadir",
        help="Location of the TensorRT sample data directory, and any additional data directories.",
        action="append",
        default=[kDEFAULT_DATA_ROOT],
    )
    args, _ = parser.parse_known_args()

    def get_data_path(data_dir):
        # If the subfolder exists, append it to the path, otherwise use the provided path as-is.
        data_path = os.path.join(data_dir, subfolder)
        if not os.path.exists(data_path):
            if data_dir != kDEFAULT_DATA_ROOT:
                print("WARNING: " + data_path + " does not exist. Trying " + data_dir + " instead.")
            data_path = data_dir
        # Make sure data directory exists.
        if not (os.path.exists(data_path)) and data_dir != kDEFAULT_DATA_ROOT:
            print(
                "WARNING: {:} does not exist. Please provide the correct data path with the -d option.".format(
                    data_path
                )
            )
        return data_path

    data_paths = [get_data_path(data_dir) for data_dir in args.datadir]
    return data_paths, locate_files(data_paths, find_files, err_msg)


def locate_files(data_paths, filenames, err_msg=""):
    """
    Locates the specified files in the specified data directories.
    If a file exists in multiple data directories, the first directory is used.

    Args:
        data_paths (List[str]): The data directories.
        filenames (List[str]): The names of the files to find.

    Returns:
        List[str]: The absolute paths of the files.

    Raises:
        FileNotFoundError if a file could not be located.
    """
    found_files = [None] * len(filenames)
    for data_path in data_paths:
        # Find all requested files.
        for index, (found, filename) in enumerate(zip(found_files, filenames)):
            if not found:
                file_path = os.path.abspath(os.path.join(data_path, filename))
                if os.path.exists(file_path):
                    found_files[index] = file_path

    # Check that all files were found
    for f, filename in zip(found_files, filenames):
        if not f or not os.path.exists(f):
            raise FileNotFoundError(
                "Could not find {:}. Searched in data paths: {:}\n{:}".format(filename, data_paths, err_msg)
            )
    return found_files


# Simple helper data class that's a little nicer to use than a 2-tuple.
class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()


# Allocates all buffers required for an engine, i.e. host/device inputs/outputs.

def allocate_buffers(engine,max_batch_size=16):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        dims = engine.get_binding_shape(binding)
        #print(dims)
        if dims[0] == -1:
            assert(max_batch_size is not None)
            dims[0] = max_batch_size  # adapt to a dynamic batch dimension
        #size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        size = trt.volume(dims) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        #print(dtype,size)
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)  # page-locked (pinned) host memory
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream


# This function is generalized for multiple inputs/outputs.
# inputs and outputs are expected to be lists of HostDeviceMem objects.
def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]


# This function is generalized for multiple inputs/outputs for full dimension networks.
# inputs and outputs are expected to be lists of HostDeviceMem objects.
def do_inference_v2(context, bindings, inputs, outputs, stream):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]
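Since the engine above was built with an explicit batch dimension, do_inference_v2 is the variant intended for such full-dims networks; the call is the same as do_inference except that no batch_size argument is passed. A minimal usage sketch, assuming the context and buffers from part (2):

# Same context and buffers as in part (2); only the inference call differs.
trt_outputs = common.do_inference_v2(context, bindings=bindings, inputs=inputs,
                                     outputs=outputs, stream=stream)
print(np.reshape(trt_outputs[0], [-1, 10])[0])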