TensorRT C++ API Translation (with Original Text) [Study Notes]


Preface

This post merely paraphrases the C++ API content from https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-825/developer-guide/index.html#api, mainly for my own use. Please treat it only as a reference.

Part 1: TensorRT C++ API Translation

This chapter illustrates basic usage of the C++ API, assuming you are starting with an ONNX model. sampleOnnxMNIST illustrates this use case in more detail.

The C++ API can be accessed through the header NvInfer.h, and it lives in the nvinfer1 namespace. For example, a simple application might begin with:

#include "NvInfer.h"

using namespace nvinfer1;

Interface classes in the TensorRT C++ API begin with the prefix I, for example ILogger, IBuilder, and so on.

A CUDA context is created automatically the first time TensorRT makes a call to CUDA, if none exists before that point. It is generally preferable to create and configure the CUDA context yourself before the first call to TensorRT.
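
For example, a minimal sketch of doing this with the CUDA runtime API, assuming a single-GPU machine where device 0 is the target; the cudaFree(nullptr) call is a common idiom that simply forces the primary context to be created:

#include <cuda_runtime_api.h>

// Select the target GPU and force creation of its primary CUDA context
// before the first TensorRT call.
cudaSetDevice(0);
cudaFree(nullptr);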

To illustrate object lifetimes, the code in this chapter does not use smart pointers; however, using them together with the TensorRT interfaces is recommended.
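
As a minimal sketch, the raw pointers used below could be wrapped in std::unique_ptr; in TensorRT 8.x the interface destructors are public (objects are released with plain delete, as in the rest of this chapter), so the default deleter is sufficient:

#include <memory>
#include "NvInfer.h"

// The builder is released automatically when the unique_ptr goes out of scope.
std::unique_ptr<nvinfer1::IBuilder> builder{nvinfer1::createInferBuilder(logger)};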

1.1 The Build Phase

To create a builder, you first need to instantiate the ILogger interface. This example captures all warning messages but ignores informational messages:

class Logger : public ILogger           
{
    void log(Severity severity, const char* msg) noexcept override
    {
        // suppress info-level messages
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
} logger;

You can then create an instance of the builder:

IBuilder* builder = createInferBuilder(logger);

1.1.1 Creating a Network Definition

Once the builder has been created, the first step in optimizing a model is to create a network definition:

uint32_t flag = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);

INetworkDefinition* network = builder->createNetworkV2(flag);

The kEXPLICIT_BATCH flag is required in order to import models with the ONNX parser. Refer to the Explicit vs Implicit Batch section for more information.

1.1.2 Importing a Model Using the ONNX Parser

Now the network definition needs to be populated from the ONNX representation. The ONNX parser API is in the file NvOnnxParser.h, and the parser lives in the nvonnxparser C++ namespace.

#include "NvOnnxParser.h"

using namespace nvonnxparser;

You can create an ONNX parser to populate the network as follows:

IParser*  parser = createParser(*network, logger);

Then, read the model file and process any errors.

parser->parseFromFile(modelFile, static_cast<int32_t>(ILogger::Severity::kWARNING));
for (int32_t i = 0; i < parser->getNbErrors(); ++i)
{
    std::cout << parser->getError(i)->desc() << std::endl;
}

An important aspect of a TensorRT network definition is that it contains pointers to the model weights, which are copied into the optimized engine by the builder. Because the network was created through the parser, the parser owns the memory occupied by the weights, so the parser object should not be deleted until after the builder has run.

1.1.3 Building an Engine

The next step is to create a build configuration that specifies how TensorRT should optimize the model.

IBuilderConfig* config = builder->createBuilderConfig();

This interface has many properties that you can set to control how TensorRT optimizes the network. One important property is the maximum workspace size. Layer implementations often require a temporary workspace, and this parameter limits the maximum size that any layer in the network can use. If insufficient workspace is provided, TensorRT may not be able to find an implementation for a layer.

config->setMaxWorkspaceSize(1U << 20);

Once the configuration has been specified, the engine can be built.

IHostMemory*  serializedModel = builder->buildSerializedNetwork(*network, *config);

Since the serialized engine contains the necessary copies of the weights, the parser, network definition, builder configuration, and builder are no longer needed and can be safely deleted:

delete parser;
delete network;
delete config;
delete builder;

The engine can then be saved to disk, and the buffer into which it was serialized can be deleted.
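
A minimal sketch of writing the plan to disk, assuming the illustrative file name model.plan; this must happen before the buffer is deleted:

#include <fstream>

// Persist the serialized engine so it can be reloaded later.
std::ofstream planFile("model.plan", std::ios::binary);
planFile.write(static_cast<const char*>(serializedModel->data()), serializedModel->size());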

delete serializedModel;

Note: Serialized engines are not portable across platforms or TensorRT versions. Engines are specific to the exact GPU model on which they were built (in addition to the platform and the TensorRT version).

1.2 Deserializing a Plan

Assuming you have previously serialized an optimized model and want to perform inference, you need to create an instance of the Runtime interface. Like the builder, the runtime requires an instance of the logger:

IRuntime* runtime = createInferRuntime(logger);

Assuming you have read the model into a buffer, you can then deserialize it to obtain an engine:

ICudaEngine* engine = 
  runtime->deserializeCudaEngine(modelData, modelSize);
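
For completeness, here is one possible way to fill modelData and modelSize from a plan file on disk, assuming it was saved as model.plan as in the earlier sketch:

#include <fstream>
#include <iterator>
#include <vector>

// Read the serialized plan back into host memory.
std::ifstream planFile("model.plan", std::ios::binary);
std::vector<char> modelData((std::istreambuf_iterator<char>(planFile)),
                             std::istreambuf_iterator<char>());
std::size_t modelSize = modelData.size();
// modelData.data() and modelSize are then passed to deserializeCudaEngine as above.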

1.3 Performing Inference

The engine holds the optimized model, but to perform inference you need to manage additional state for the intermediate activations. This is done through the ExecutionContext interface:

IExecutionContext *context = engine->createExecutionContext();

An engine can have multiple execution contexts, allowing one set of weights to be used for multiple overlapping inference tasks. (A current exception is when using dynamic shapes, where each optimization profile can have only one execution context.)

To perform inference, you must pass TensorRT buffers for the inputs and outputs, which TensorRT requires you to specify in an array of pointers. You can query the engine with the names you gave the input and output tensors to find the right positions in the array:

int32_t inputIndex = engine->getBindingIndex(INPUT_NAME);
int32_t outputIndex = engine->getBindingIndex(OUTPUT_NAME);

Using these indices, set up a buffer array pointing to the input and output buffers on the GPU:

void* buffers[2];
buffers[inputIndex] = inputBuffer;
buffers[outputIndex] = outputBuffer;

You can then call TensorRT's enqueue method to start inference asynchronously using a CUDA stream:

context->enqueueV2(buffers, stream, nullptr);

It is common to enqueue cudaMemcpyAsync() calls before and after the kernels to move data to and from the GPU if it is not already there. The final argument to enqueueV2() is an optional CUDA event that is signaled when the input buffers have been consumed and their memory can be safely reused.

To determine when the kernels (and possibly the memcpy() calls) have completed, use standard CUDA synchronization mechanisms such as events or waiting on the stream.
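
Putting the pieces together, a minimal sketch of one inference pass might look as follows; hostInput, hostOutput, inputSize, and outputSize are illustrative names for host-side buffers and their byte sizes:

#include <cuda_runtime_api.h>

cudaStream_t stream;
cudaStreamCreate(&stream);

// Copy the input to the GPU, run inference, copy the result back, then wait.
cudaMemcpyAsync(inputBuffer, hostInput, inputSize, cudaMemcpyHostToDevice, stream);
context->enqueueV2(buffers, stream, nullptr);
cudaMemcpyAsync(hostOutput, outputBuffer, outputSize, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);  // block until the copies and kernels have finished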

If you prefer synchronous inference, use the executeV2 method instead of enqueueV2.
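
For example, with the same bindings array as above:

// executeV2 returns only after inference has completed on the GPU.
bool status = context->executeV2(buffers);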

Part 2: TensorRT C++ API Original Text

This chapter illustrates basic usage of the C++ API, assuming you are starting with an ONNX model. sampleOnnxMNIST illustrates this use case in more detail.

The C++ API can be accessed via the header NvInfer.h, and is in the nvinfer1 namespace. For example, a simple application might begin with:

#include "NvInfer.h"

using namespace nvinfer1;

Interface classes in the TensorRT C++ API begin with the prefix I, for example ILogger, IBuilder, etc.

A CUDA context is automatically created the first time TensorRT makes a call to CUDA, if none exists prior to that point. It is generally preferable to create and configure the CUDA context yourself before the first call to TensorRT.

In order to illustrate object lifetimes, code in this chapter does not use smart pointers; however, their use is recommended with TensorRT interfaces.

1.1 The Build Phase

To create a builder, you first need to instantiate the ILogger interface. This example captures all warning messages but ignores informational messages:

class Logger : public ILogger           
{
    void log(Severity severity, const char* msg) noexcept override
    {
        // suppress info-level messages
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
} logger;

You can then create an instance of the builder:

IBuilder* builder = createInferBuilder(logger);

1.1.1 Creating a Network Definition

Once the builder has been created, the first step in optimizing a model is to create a network definition:

uint32_t flag = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);

INetworkDefinition* network = builder->createNetworkV2(flag);

The kEXPLICIT_BATCH flag is required in order to import models using the ONNX parser. Refer to the Explicit vs Implicit Batch section for more information.

1.1.2 Importing a Model using the ONNX Parser

Now, the network definition needs to be populated from the ONNX representation. The ONNX parser API is in the file NvOnnxParser.h, and the parser is in the nvonnxparser C++ namespace.

#include "NvOnnxParser.h"

using namespace nvonnxparser;

You can create an ONNX parser to populate the network as follows:

IParser*  parser = createParser(*network, logger);

Then, read the model file and process any errors.

parser->parseFromFile(modelFile, static_cast<int32_t>(ILogger::Severity::kWARNING));
for (int32_t i = 0; i < parser->getNbErrors(); ++i)
{
    std::cout << parser->getError(i)->desc() << std::endl;
}

An important aspect of a TensorRT network definition is that it contains pointers to model weights, which are copied into the optimized engine by the builder. Since the network was created via the parser, the parser owns the memory occupied by the weights, and so the parser object should not be deleted until after the builder has run.

1.1.3 Building an Engine

The next step is to create a build configuration specifying how TensorRT should optimize the model.

IBuilderConfig* config = builder->createBuilderConfig();

This interface has many properties that you can set in order to control how TensorRT optimizes the network. One important property is the maximum workspace size. Layer implementations often require a temporary workspace, and this parameter limits the maximum size that any layer in the network can use. If insufficient workspace is provided, it is possible that TensorRT will not be able to find an implementation for a layer.

config->setMaxWorkspaceSize(1U << 20);

Once the configuration has been specified, the engine can be built.

IHostMemory*  serializedModel = builder->buildSerializedNetwork(*network, *config);

Since the serialized engine contains the necessary copies of the weights, the parser, network definition, builder configuration and builder are no longer necessary and may be safely deleted:

delete parser;
delete network;
delete config;
delete builder;

The engine can then be saved to disk, and the buffer into which it was serialized can be deleted.

delete serializedModel;

Note: Serialized engines are not portable across platforms or TensorRT versions. Engines are specific to the exact GPU model they were built on (in addition to the platform and the TensorRT version).

1.2 Deserializing a Plan

Assuming you have previously serialized an optimized model and wish to perform inference, you will need to create an instance of the Runtime interface. Like the builder, the runtime requires an instance of the logger:

IRuntime* runtime = createInferRuntime(logger);

Assuming you have read the model into a buffer, you can then deserialize it to obtain an engine:

ICudaEngine* engine = 
  runtime->deserializeCudaEngine(modelData, modelSize);

1.3 Performing Inference

The engine holds the optimized model, but to perform inference we will need to manage additional state for intermediate activations. This is done via the ExecutionContext interface:

IExecutionContext *context = engine->createExecutionContext();

An engine can have multiple execution contexts, allowing one set of weights to be used for multiple overlapping inference tasks. (A current exception to this is when using dynamic shapes, when each optimization profile can only have one execution context.)

To perform inference, you must pass TensorRT buffers for input and output, which TensorRT requires you to specify in an array of pointers. You can query the engine using the names you provided for input and output tensors to find the right positions in the array:

int32_t inputIndex = engine->getBindingIndex(INPUT_NAME);
int32_t outputIndex = engine->getBindingIndex(OUTPUT_NAME);

Using these indices, set up a buffer array pointing to the input and output buffers on the GPU:

void* buffers[2];
buffers[inputIndex] = inputBuffer;
buffers[outputIndex] = outputBuffer;

You can then call TensorRT’s enqueue method to start inference asynchronously using a CUDA stream:

context->enqueueV2(buffers, stream, nullptr);

It is common to enqueue cudaMemcpyAsync() before and after the kernels to move data from the GPU if it is not already there. The final argument to enqueueV2() is an optional CUDA event that is signaled when the input buffers have been consumed, and their memory can be safely reused.

To determine when the kernel (and possibly memcpy()) are complete, use standard CUDA synchronization mechanisms such as events or waiting on the stream.

If you prefer synchronous inference, use the executeV2 method instead of enqueueV2.


  • 3
    点赞
  • 4
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 0
    评论