TensorRT Python API Documentation Translation

Latest recommended article published on 2024-10-05 03:15:43

mathlxj

Category: TensorRT. Tags: Tensorrt, Python API, engine serialization, TRT Python, TF-TRT

Original link: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#python_topics

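Preface

I recently needed to learn how to use the TensorRT Python API, so I translated part of the documentation as personal notes. Feel free to share and discuss, and corrections are welcome.
Reference: Using The Python API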

Using The Python API

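1. Importing TensorRT

import tensorrt as trt
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

This creates a logging interface through which TensorRT reports errors, warnings, and informational messages. In the code above, informational messages are suppressed and only warnings and errors are reported.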
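To make the severity choice explicit, here is a minimal sketch that wraps logger creation in a helper. `make_logger` is a hypothetical name used only for illustration, not part of TensorRT, and the import is guarded so the snippet still loads on a machine where TensorRT is not installed.

```python
# Guarded import: TensorRT is only present on machines where it is installed.
try:
    import tensorrt as trt
except ImportError:
    trt = None


def make_logger(verbose=False):
    """Return a trt.Logger, or None when TensorRT is not installed.

    make_logger is a hypothetical helper for illustration only.
    """
    if trt is None:
        return None
    # INFO also prints informational messages; WARNING suppresses them.
    severity = trt.Logger.INFO if verbose else trt.Logger.WARNING
    return trt.Logger(severity)


TRT_LOGGER = make_logger()
```

Passing `verbose=True` can be useful while debugging a build, since informational messages are then printed as well.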

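2. Creating a Network Definition in Python

The first step in deploying inference with TensorRT is to create a TensorRT network from your model.

The simplest way is to use the TensorRT parser library, which supports models serialized in the following formats:

Caffe (both BVLC and NVCaffe)
ONNX (releases up to ONNX 1.6, and ONNX opsets 7 to 11)
UFF (used for TensorFlow)

An alternative is to define the model directly with the TensorRT Network API. This requires a small number of API calls to define each layer in the network graph, plus your own mechanism for importing the model's trained parameters.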

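2.1 Building a network layer by layer with the Python API

Not needed so far, so not translated.

2.2 Importing a model with a Python parser

There are three main steps:

Create the TensorRT builder and network
Create the TensorRT parser for the specific format
Use the parser to parse the imported model and populate the network

The builder must be created before the network because it acts as a factory for the network. Different parsers have different mechanisms for marking network outputs.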
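The three steps can be sketched with the ONNX parser, one of the formats listed earlier. This is a sketch under assumptions rather than a reference implementation: `count_layers_from_onnx` is a hypothetical helper, the model path is supplied by the caller, and the explicit-batch flag is assumed to be required by the ONNX parser (as in TensorRT 7).

```python
import os

try:
    import tensorrt as trt
except ImportError:  # TensorRT is not installed on this machine
    trt = None


def count_layers_from_onnx(onnx_path):
    """Parse an ONNX file into a TensorRT network and return its layer count.

    Hypothetical helper for illustration; onnx_path is chosen by the caller.
    """
    if not os.path.isfile(onnx_path):
        raise FileNotFoundError(onnx_path)
    if trt is None:
        raise RuntimeError("TensorRT is not installed")

    logger = trt.Logger(trt.Logger.WARNING)
    # Assumption: the ONNX parser needs an explicit-batch network.
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

    # Step 1: builder and network. Step 2: format-specific parser.
    with trt.Builder(logger) as builder, \
            builder.create_network(flags) as network, \
            trt.OnnxParser(network, logger) as parser:
        # Step 3: parse the model bytes to populate the network.
        with open(onnx_path, "rb") as f:
            ok = parser.parse(f.read())
        if not ok:
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise ValueError("ONNX parse failed: " + "; ".join(errors))
        return network.num_layers
```

The layer count is returned instead of the network itself because the network is released when the `with` block exits.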

2.3 Importing from Caffe

Not needed so far, so not translated.

2.4 Importing from TensorFlow

Step1. import tensorrt as trt
Step2. Create a frozen TensorFlow model from the TensorFlow model. For how to freeze a TensorFlow model into a stream, see Freezing A TensorFlow Graph.
Step3. Use the UFF converter to convert the frozen TensorFlow model to a UFF file:

convert-to-uff frozen_inference_graph.pb

When using the command above, you can find the location of the UFF package by running python -c "import uff; print(uff.__path__)".

Step4. Define the path to the model file. For this example:

model_file = '/data/mnist/mnist.uff'

Step5. Create a builder, network, and parser:

with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.UffParser() as parser:
    # Register the input tensor name and shape, and the output tensor name.
    parser.register_input("Placeholder", (1, 28, 28))
    parser.register_output("fc2/Relu")
    # Parse the UFF file and populate the network.
    parser.parse(model_file, network)
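As a side note on Step3, the package location can also be looked up without importing the package. The sketch below uses only the standard library; `find_package_dirs` is a hypothetical helper name, and it returns None when the package is not installed.

```python
import importlib.util


def find_package_dirs(name):
    """Return the directories of an installed package, or None if it is absent."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.submodule_search_locations is None:
        return None
    return list(spec.submodule_search_locations)


# Prints a list of directories if the uff package is installed, otherwise None.
print(find_package_dirs("uff"))
```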

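3. Building an Engine in Python

One of the builder's functions is to search its catalog of CUDA kernels for the fastest implementation available. The build therefore has to run on the same GPU that the optimized engine will later run on.
The builder has two important parameters: the maximum batch size and the maximum workspace size.

The maximum batch size specifies the batch size for which TensorRT will optimize. At runtime, a smaller batch size may be chosen, but not a larger one.
Layer algorithms often need temporary workspace. The maximum workspace size limits the workspace that any layer in the network can use. If the workspace provided is insufficient, TensorRT may be unable to find an implementation for a given layer.

Procedure:

Step1. Build the engine with the builder object: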
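Both parameters are plain integers, so a little arithmetic helps avoid mistakes. The sketch below is illustrative only: `workspace_bytes` and `check_batch_size` are hypothetical helpers, not TensorRT functions.

```python
def workspace_bytes(mib=0, gib=0):
    """Convert a size given in MiB and/or GiB to bytes."""
    return (mib << 20) + (gib << 30)


def check_batch_size(batch_size, max_batch_size):
    """Raise ValueError when a runtime batch size exceeds the build-time maximum."""
    if not 1 <= batch_size <= max_batch_size:
        raise ValueError(
            f"batch size {batch_size} is outside 1..{max_batch_size}")
    return batch_size


print(workspace_bytes(mib=1))  # 1048576, the same value as 1 << 20
print(workspace_bytes(gib=1))  # 1073741824
```

A value such as `1 << 20` is therefore only 1 MiB of workspace, which may be too small for larger networks.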

with trt.Builder(TRT_LOGGER) as builder, builder.create_builder_config() as config:
    # Memory available to the builder while building an optimized engine;
    # it should generally be set as high as possible.
    config.max_workspace_size = 1 << 20
    with builder.build_engine(network, config) as engine:
        # Do inference here.
        pass

When the engine is built, TensorRT makes a copy of the weights.

Step2. Run inference.

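4. Serializing a Model in Python

After step 3 you can either serialize the engine or use it directly for inference. Serializing and deserializing are optional.
Serializing converts the engine into a format that can be stored and used later for inference. To run inference from the serialized file, you only need to deserialize the engine.
Because building an engine from the network definition can be time consuming, serializing it once avoids rebuilding the engine every time the application runs.
Note: serialized engines are not portable across platforms or TensorRT versions. An engine is specific to the exact GPU model it was built on.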
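Since an engine is tied to a TensorRT version and a GPU model, one possible convention is to encode both in the engine file name so that a mismatched file is never loaded by accident. The sketch below is only an illustration: `engine_file_name` is a hypothetical helper, and the version string and GPU name are example values. In practice the version could come from `trt.__version__` and the GPU name from your device query of choice.

```python
import re


def engine_file_name(model_name, trt_version, gpu_name):
    """Build a cache file name that encodes the TensorRT version and GPU model."""
    def slug(text):
        # Keep letters, digits and dots; replace every other run with '-'.
        return re.sub(r"[^A-Za-z0-9.]+", "-", text).strip("-")
    return f"{slug(model_name)}_trt{slug(trt_version)}_{slug(gpu_name)}.engine"


print(engine_file_name("mnist", "7.0.0.11", "Tesla T4"))
# mnist_trt7.0.0.11_Tesla-T4.engine
```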

Step1. Serialize the model:

serialized_engine = engine.serialize()

Step2. Deserialize the engine to run inference. Deserialization requires a runtime object:

with trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(serialized_engine)

You can also serialize the engine to a file and read it back from the file.
Step1. Serialize the engine and write it to a file:

with open("sample.engine", "wb") as f:
    f.write(engine.serialize())

Step2. Read the engine from the file and deserialize it:

with open("sample.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
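The write and read steps combine naturally into a "load if cached, otherwise build and save" pattern. The sketch below shows only the file handling, with a stand-in build function so it runs without TensorRT. `load_or_build` and `fake_build` are hypothetical names; in real use the build function would return the result of `engine.serialize()`, and the returned data would be passed to `runtime.deserialize_cuda_engine`.

```python
import os
import tempfile


def load_or_build(path, build_fn):
    """Return serialized engine data from path, building and saving it if absent."""
    if os.path.isfile(path):
        with open(path, "rb") as f:
            return f.read()
    data = build_fn()  # expected to return a bytes-like serialized engine
    with open(path, "wb") as f:
        f.write(data)
    return data


# Demonstration with a stand-in builder; no TensorRT is involved here.
calls = []


def fake_build():
    calls.append(1)
    return b"serialized-engine-bytes"


with tempfile.TemporaryDirectory() as tmp:
    engine_path = os.path.join(tmp, "sample.engine")
    first = load_or_build(engine_path, fake_build)   # builds and writes the file
    second = load_or_build(engine_path, fake_build)  # reads the cached file
    print(first == second, len(calls))  # True 1
```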

5. Running Inference in Python

Step1. Allocate host and device buffers for the inputs and outputs:

import numpy as np
import pycuda.autoinit  # initializes CUDA and creates a context on import
import pycuda.driver as cuda

# Determine dimensions and create page-locked memory buffers (i.e. won't be swapped to disk) to hold host inputs/outputs.
h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=np.float32)
h_output = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(1)), dtype=np.float32)
# Allocate device memory for inputs and outputs.
d_input = cuda.mem_alloc(h_input.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)
# Create a stream in which to copy inputs/outputs and run inference.
stream = cuda.Stream()
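To see how large these buffers are, the element count and byte size can be worked out by hand. The sketch below uses only the standard library and assumes the (1, 28, 28) input shape registered for "Placeholder" earlier. The local `volume` function is a stand-in that multiplies the dimensions together, which is what the buffer-size calculation needs.

```python
import math
import struct


def volume(shape):
    """Number of elements in a tensor of the given shape (product of its dimensions)."""
    return math.prod(shape)


FLOAT32_BYTES = struct.calcsize("f")  # 4 bytes per float32 element

input_shape = (1, 28, 28)  # shape registered for "Placeholder"
n_elements = volume(input_shape)
print(n_elements, n_elements * FLOAT32_BYTES)  # 784 3136
```

The second number is the size that `h_input.nbytes` would report for a float32 buffer of that shape, and therefore the amount of device memory requested for the input.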

Step2. Create some space to store intermediate activation values. The engine holds the network definition and trained parameters, so this additional space is needed; it is held in an execution context:

with engine.create_execution_context() as context:
    # Transfer input data to the GPU.
    cuda.memcpy_htod_async(d_input, h_input, stream)
    # Run inference.
    context.execute_async(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    # Synchronize the stream.
    stream.synchronize()
    # Return the host output.