TensorRT Python API Inference Pipeline

Note: TensorRT Python API inference pipeline, recorded for my own reference.


Preface

Note: the overall inference pipeline (skeleton) looks like this:

import os

import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
import tensorrt as trt
import cv2

# TODO: import your model-specific pre- and post-processing functions here
# from your_module import preprocess, postprocess

class MyLogger(trt.Logger):
    def log(self, severity, msg):
        # Lower Severity values are more severe; only print messages at least
        # as severe as the logger's configured minimum severity.
        if severity <= self.min_severity:
            print("[TRT Log]: {}".format(msg))

logger = MyLogger(trt.Logger.Severity.VERBOSE)

def load_engine(engine_file_path):
    assert os.path.exists(engine_file_path), "Engine file does not exist."
    with open(engine_file_path, "rb") as f, trt.Runtime(logger) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

def infer(engine, input_img):
    # TODO: apply your model-specific preprocessing here; the result must be a
    # contiguous float32 array matching the engine's input shape
    # processed_img = preprocess(input_img)
    processed_img = np.ascontiguousarray(input_img, dtype=np.float32)  # placeholder

    with engine.create_execution_context() as context:
        # Buffer shapes and dtype (adjust to your model)
        input_shape = (1, 3, 48, 160)
        output_shape = (1, 20, 78)
        dtype = np.float32

        # Allocate device memory for input and output (float32 = 4 bytes per element)
        input_size = int(np.prod(input_shape) * 4)
        output_size = int(np.prod(output_shape) * 4)
        input_memory = cuda.mem_alloc(input_size)
        output_memory = cuda.mem_alloc(output_size)

        # Copy the prepared input data to the device
        cuda.memcpy_htod(input_memory, processed_img.ravel())

        # Bindings: device addresses of the input and output buffers
        bindings = [int(input_memory), int(output_memory)]

        # Create a CUDA stream
        stream = cuda.Stream()

        # Run inference
        context.execute_async_v2(bindings, stream.handle)

        # Synchronize the CUDA stream
        stream.synchronize()

        # Copy the output data back to the host
        output_array = np.empty(output_shape, dtype)
        cuda.memcpy_dtoh(output_array, output_memory)

        # TODO: apply your model-specific postprocessing here
        # output_result = postprocess(output_array)
        output_result = output_array  # placeholder

        return output_result

# Main
if __name__ == "__main__":
    engine_file = "your_trt_engine_file.engine"
    input_img_path = "your_input_image.jpg"

    input_img = cv2.imread(input_img_path)

    with load_engine(engine_file) as engine:
        output_result = infer(engine, input_img)

    # TODO: use or save output_result
    print("Inference done, output:", output_result)


I. Overview of the Inference Workflow

1. Create an execution context

Imagine you are cooking a complicated dish: you need a kitchen (the execution context) in which to prepare all your ingredients and tools.
with engine.create_execution_context() as context:
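If the engine was built with dynamic input shapes, the context is also where the actual input shape is fixed before any buffers are sized. A minimal sketch, assuming the network's input binding is named "input" (the name is an assumption and depends on how the model was exported):

with engine.create_execution_context() as context:
    # Only needed for dynamic-shape engines: pin the runtime input shape
    # before querying binding shapes or allocating buffers.
    input_idx = engine.get_binding_index("input")  # "input" is an assumed binding name
    context.set_binding_shape(input_idx, (1, 3, 48, 160))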

2. Allocate memory and set up bindings

This is like laying out the ingredients and tools you need around the kitchen. The "bindings" tell the program which tool is used for which task.
input_memory = cuda.mem_alloc(input_size)
output_memory = cuda.mem_alloc(output_size)
bindings = [int(input_memory), int(output_memory)]
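Instead of hard-coding shapes and byte counts, the buffers can also be sized by querying the engine itself, which is the pattern the official sample in section III uses. A sketch, assuming it runs inside the same `with engine.create_execution_context() as context:` block and that the engine has one input and one output binding:

bindings = []
for name in engine:  # iterating an engine yields its binding names
    idx = engine.get_binding_index(name)
    shape = context.get_binding_shape(idx)
    np_dtype = trt.nptype(engine.get_binding_dtype(name))
    nbytes = trt.volume(shape) * np.dtype(np_dtype).itemsize
    device_mem = cuda.mem_alloc(nbytes)
    bindings.append(int(device_mem))
    if engine.binding_is_input(name):
        input_memory = device_mem
    else:
        output_memory, output_shape, output_dtype = device_mem, shape, np_dtype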

3. Prepare the input data and copy it to the device

This step puts the ingredients you prepared (the input data) onto the stove or the cutting board (GPU memory).
cuda.memcpy_htod(input_memory, processed_img.ravel())
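What "prepare" means is entirely model specific, which is why the skeleton leaves it as a TODO. As an illustration only, preprocessing for the (1, 3, 48, 160) float32 input used above might look like the sketch below; the resize size, normalization, and channel order are assumptions, not something the engine enforces:

def preprocess(img_bgr):
    # Hypothetical example: resize to the engine's input resolution,
    # convert BGR->RGB, scale to [0, 1], reorder HWC->CHW and add a batch dim.
    img = cv2.resize(img_bgr, (160, 48))  # (width, height)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = img.astype(np.float32) / 255.0
    img = np.transpose(img, (2, 0, 1))[np.newaxis, ...]
    return np.ascontiguousarray(img)

processed_img = preprocess(input_img)
cuda.memcpy_htod(input_memory, processed_img.ravel())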

4. Create a CUDA stream

You can think of the CUDA stream as your assistant: it carries out the steps in order for you, making sure everything happens at the right time and in the right place.
stream = cuda.Stream()

5. Run inference

This is the moment the ingredients go into the oven or the pan and the cooking begins. It is the core of the whole pipeline; all of the actual math happens here.
context.execute_async_v2(bindings, stream.handle)
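execute_async_v2 only enqueues the work on the stream and returns immediately. If asynchronous execution is not needed, the context also provides a blocking variant that takes just the bindings:

# Synchronous alternative: blocks until inference has finished,
# so no explicit stream synchronization is needed afterwards.
context.execute_v2(bindings)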

6. Synchronize the CUDA stream

This step makes sure every cooking step has finished, just like checking that the food in the oven is fully cooked.
stream.synchronize()

7. Retrieve the output data

Finally, you take the finished dish (the output data) out of the oven or the pan and get it ready to serve.
output_array = np.empty(output_shape, dtype)
cuda.memcpy_dtoh(output_array, output_memory)
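The skeleton uses a blocking memcpy_dtoh into an ordinary NumPy array. Following the page-locked-memory recommendation in the official steps of section III, the copy back can instead use pinned host memory and an asynchronous copy on the same stream; a sketch using the output_shape, dtype, and stream defined above:

# Page-locked (pinned) host buffer enables efficient async device-to-host copies.
output_buffer = cuda.pagelocked_empty(trt.volume(output_shape), dtype)
cuda.memcpy_dtoh_async(output_buffer, output_memory, stream)
stream.synchronize()  # wait for the copy to complete
output_array = output_buffer.reshape(output_shape)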

II. My Example Routine

# Inference function
def infer(engine, input_img):
    processed_img, r, left, top = detect_pre_precessing(input_img, (640, 640))

    with engine.create_execution_context() as context:
        # Buffer shapes and dtype
        input_shape = (1, 3, 640, 640)
        output_shape = (1, 25200, 15)
        dtype = np.float32

        # Allocate device memory for input and output:
        # product of all dimensions, times 4 bytes (float32 uses 4 bytes per element).
        input_size = int(np.prod(input_shape) * 4)
        output_size = int(np.prod(output_shape) * 4)
        input_memory = cuda.mem_alloc(input_size)
        output_memory = cuda.mem_alloc(output_size)

        # Copy the prepared input data to the device
        cuda.memcpy_htod(input_memory, processed_img.ravel())

        # Bindings: device addresses of the input and output buffers
        bindings = [int(input_memory), int(output_memory)]

        # Create a CUDA stream
        stream = cuda.Stream()

        # Run inference
        context.execute_async_v2(bindings, stream.handle)

        # Synchronize the CUDA stream
        stream.synchronize()

        # Copy the output data back to the host
        output_array = np.empty(output_shape, dtype)
        cuda.memcpy_dtoh(output_array, output_memory)

        # Postprocessing
        output_result = post_precessing(output_array, r, left, top)
        # print(output_result)

        return output_result
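detect_pre_precessing and post_precessing are my own helpers and are not shown here. For reference, a letterbox-style preprocessing that returns the scale factor and padding offsets the postprocessing needs could look roughly like the following; this is a sketch of the general technique, not the exact code used above:

def detect_pre_precessing(img_bgr, new_shape=(640, 640)):
    # Hypothetical letterbox preprocessing: scale the image to fit inside
    # new_shape while keeping the aspect ratio, pad the remainder, and return
    # the scale factor and padding offsets so boxes can be mapped back later.
    h, w = img_bgr.shape[:2]
    r = min(new_shape[0] / h, new_shape[1] / w)
    new_h, new_w = int(round(h * r)), int(round(w * r))
    top = (new_shape[0] - new_h) // 2
    left = (new_shape[1] - new_w) // 2
    canvas = np.full((new_shape[0], new_shape[1], 3), 114, dtype=np.uint8)
    canvas[top:top + new_h, left:left + new_w] = cv2.resize(img_bgr, (new_w, new_h))
    img = cv2.cvtColor(canvas, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    img = np.transpose(img, (2, 0, 1))[np.newaxis, ...]
    return np.ascontiguousarray(img), r, left, top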

III. Official Example

Official example: https://github.com/NVIDIA/TensorRT/blob/main/quickstart/SemanticSegmentation/tutorial-runtime.ipynb

1. Pipeline overview

Starting with a deserialized engine, the TensorRT inference pipeline consists of the following steps:

1. Create an execution context and specify the input shape (based on the image dimensions for inference).

2. Allocate CUDA device memory for input and output.

3. Allocate CUDA page-locked host memory to efficiently copy back the output.

4. Transfer the processed image data into input memory using an asynchronous host-to-device CUDA copy.

5. Kick off the TensorRT inference pipeline using the asynchronous execute API.

6. Transfer the segmentation output back into page-locked host memory using a device-to-host CUDA copy.

7. Synchronize the stream used for data transfers and inference execution to ensure all operations are complete.

8. Finally, write out the segmentation output to an image file for visualization.


2. Code example

def infer(engine, input_file, output_file):
    print("Reading input image from file {}".format(input_file))
    with Image.open(input_file) as img:
        input_image = preprocess(img)
        image_width = img.width
        image_height = img.height

    with engine.create_execution_context() as context:
        # Set input shape based on image dimensions for inference
        context.set_binding_shape(engine.get_binding_index("input"), (1, 3, image_height, image_width))
        # Allocate host and device buffers
        bindings = []
        for binding in engine:
            binding_idx = engine.get_binding_index(binding)
            size = trt.volume(context.get_binding_shape(binding_idx))
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            if engine.binding_is_input(binding):
                input_buffer = np.ascontiguousarray(input_image)
                input_memory = cuda.mem_alloc(input_image.nbytes)
                bindings.append(int(input_memory))
            else:
                output_buffer = cuda.pagelocked_empty(size, dtype)
                output_memory = cuda.mem_alloc(output_buffer.nbytes)
                bindings.append(int(output_memory))

        stream = cuda.Stream()
        # Transfer input data to the GPU.
        cuda.memcpy_htod_async(input_memory, input_buffer, stream)
        # Run inference
        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
        # Transfer prediction output from the GPU.
        cuda.memcpy_dtoh_async(output_buffer, output_memory, stream)
        # Synchronize the stream
        stream.synchronize()

    with postprocess(np.reshape(output_buffer, (image_height, image_width))) as img:
        print("Writing output image to file {}".format(output_file))
        img.convert('RGB').save(output_file, "PPM")
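The tutorial drives this function the same way as the skeleton at the top: deserialize the engine with a load_engine helper, then call infer. Something along these lines, with placeholder file names:

if __name__ == "__main__":
    engine_file = "fcn-resnet101.engine"   # placeholder path
    input_file = "input.ppm"               # placeholder input image
    output_file = "output.ppm"             # placeholder output image
    with load_engine(engine_file) as engine:
        infer(engine, input_file, output_file)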