[C++ | TensorRT] Reading and writing TensorRT models with the ofstream and ifstream classes, and running inference (v9_trtx)

Article link: https://blog.csdn.net/ycx_ccc/article/details/136829492

Reading and writing TensorRT models with the ofstream and ifstream classes, and running inference

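1. Background
2. Serializing the model
3. Deserializing the model
4. Preparing the buffers (input and output pointers)
5. Inference
6. Command-line argument parsing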

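1. Background

To run inference on a deep learning model with TensorRT, a model file saved in some other format must first be converted into a TensorRT engine file, usually given the extension .trt or .engine (either works). This post uses Wang Xinyu's yolov9-trtx project as the example.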

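2. Serializing the model

1. Instantiate gLogger, an object of a class that implements the nvinfer1::ILogger interface.
2. Use gLogger to create a builder object.
3. Use the builder to create a config object.
4. Build the network and serialize the engine. In the code below, build_engine_yolov9_e and build_engine_yolov9_c return the serialized engine directly as an IHostMemory*; calling engine->serialize() on an already built ICudaEngine returns the same kind of object.
5. Create a std::ofstream output stream.
6. Call ostream::write to write the serialized model to disk.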

void serialize_engine(unsigned int maxBatchSize, std::string& wts_name, std::string& sub_type, std::string& engine_name) {
    // Create builder
    IBuilder* builder = createInferBuilder(gLogger);
    IBuilderConfig* config = builder->createBuilderConfig();

    // Create model to populate the network, then set the outputs and create an engine
    IHostMemory* serialized_engine = nullptr;
    if (sub_type == "e") {
        serialized_engine = build_engine_yolov9_e(maxBatchSize, builder, config, DataType::kFLOAT, wts_name);
    }
    else if (sub_type == "c") {
        serialized_engine = build_engine_yolov9_c(maxBatchSize, builder, config, DataType::kFLOAT, wts_name);
    }
    else {
        return;
    }
    assert(serialized_engine != nullptr);

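    /* Create a std::ofstream output stream in binary mode (serialized models are normally written as binary) */
    std::ofstream p(engine_name, std::ios::binary);
    /* If the file could not be opened, report the error and abort */
    if (!p) {
        std::cerr << "could not open plan output file" << std::endl;
        assert(false);
    }
    /*
    * reinterpret_cast<target_type>(expr) converts between unrelated pointer types. write() expects a
    * const char*, so the pointer returned by data() is cast to const char* here.
    * ostream::write(const char* s, std::streamsize count) writes count bytes starting at s.
    */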
    p.write(reinterpret_cast<const char*>(serialized_engine->data()), serialized_engine->size());

    delete config;
    delete serialized_engine;
    delete builder;
}

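To write the file, we first create a std::ofstream output stream p. The first constructor argument is the output path (a const char* or, as here, a std::string); the second is the open mode, which for a serialized model should be std::ios::binary.

When calling write, the first argument is a const char* pointer to the data and the second is the number of bytes to write. nvinfer1::IHostMemory::data() returns a void* pointer to the serialized model data, which has to be converted to const char*. This is done with reinterpret_cast, an operator that converts between integers and pointers and between pointers of unrelated types. nvinfer1::IHostMemory::size() returns the size of the serialized data in bytes. Finally, remember to release the objects that are no longer needed (config, serialized_engine, and builder in the code above).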
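The write step can be tried without TensorRT. The following is a minimal sketch, assuming the serialized data is already available as a byte vector; write_blob and file_size are hypothetical helper names that are not part of the yolov9-trtx project.

```cpp
#include <cassert>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: writes all bytes in `bytes` to `path` in binary mode.
// Returns false if the file cannot be opened or the write fails.
bool write_blob(const std::string& path, const std::vector<unsigned char>& bytes) {
    std::ofstream out(path, std::ios::binary);
    if (!out) {
        return false;
    }
    // ostream::write expects a const char*, so the unsigned char* is reinterpreted.
    out.write(reinterpret_cast<const char*>(bytes.data()),
              static_cast<std::streamsize>(bytes.size()));
    return out.good();
}

// Hypothetical helper: returns the size of the file in bytes, or -1 if it cannot be opened.
long long file_size(const std::string& path) {
    // std::ios::ate places the read position at the end of the file on open.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) {
        return -1;
    }
    return static_cast<long long>(in.tellg());
}
```

Unlike the assert(false) in serialize_engine, which is compiled out when NDEBUG is defined, the boolean return value lets the caller handle a failed write in release builds as well.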

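3. Deserializing the model

1. Create a std::ifstream object to read the file, opened in the same mode that was used for writing (binary).
2. Check the stream state with good(): it returns true if no error flag is set and false otherwise.
3. Use istream::seekg to move the read position to the end of the file.
4. Use istream::tellg to get the current position; because the position is at the end, this is the file size in bytes.
5. Use istream::seekg to move the read position back to the beginning.
6. Allocate a block of memory of that size with new, held by a char* pointer, for the serialized model data.
7. Call istream::read() to read the serialized model data into that memory.
8. Create a runtime with createInferRuntime, then call nvinfer1::IRuntime::deserializeCudaEngine to deserialize the data into an engine.
9. Call nvinfer1::ICudaEngine::createExecutionContext to create the execution context used for inference.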

void deserialize_engine(std::string& engine_name, IRuntime** runtime, ICudaEngine** engine, IExecutionContext** context) {
    std::ifstream file(engine_name, std::ios::binary);
    if (!file.good()) {
        std::cerr << "read " << engine_name << " error!" << std::endl;
        assert(false);
    }
    size_t size = 0;
    file.seekg(0, file.end);
    size = file.tellg();
    file.seekg(0, file.beg);
    char* serialized_engine = new char[size];
    assert(serialized_engine);
    file.read(serialized_engine, size);
    file.close();

    *runtime = createInferRuntime(gLogger);
    assert(*runtime);
    *engine = (*runtime)->deserializeCudaEngine(serialized_engine, size);
    assert(*engine);
    *context = (*engine)->createExecutionContext();
    assert(*context);
    delete[] serialized_engine;
}
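The file-reading part of this function (seek to the end, take the size, seek back, read) can be sketched on its own. This version is an illustration under stated assumptions rather than the project's code: read_blob is a hypothetical helper, and it uses std::vector<char> instead of new/delete[] so the memory is released automatically.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: reads the whole file at `path` into a vector of bytes.
// Throws std::runtime_error if the file cannot be opened or fully read.
std::vector<char> read_blob(const std::string& path) {
    std::ifstream file(path, std::ios::binary);
    if (!file) {
        throw std::runtime_error("cannot open " + path);
    }
    file.seekg(0, std::ios::end);              // move to the end of the file
    const std::streamoff size = file.tellg();  // position at the end == size in bytes
    if (size < 0) {
        throw std::runtime_error("cannot determine size of " + path);
    }
    file.seekg(0, std::ios::beg);              // back to the beginning

    std::vector<char> buffer(static_cast<std::size_t>(size));
    if (size > 0 && !file.read(buffer.data(), size)) {
        throw std::runtime_error("short read on " + path);
    }
    return buffer;
}
```

The resulting buffer.data() and buffer.size() could then be passed to deserializeCudaEngine in place of serialized_engine and size.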

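4. Preparing the buffers (input and output pointers)

1. Prepare the buffers that will be passed to the execution context: look up the input and output tensors by name and allocate GPU memory for each.
2. Allocate a float array on the CPU with new to hold the output, which is later copied from the GPU to the host.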

void prepare_buffer(ICudaEngine* engine, float** input_buffer_device, float** output_buffer_device, float** output_buffer_host) {
    // getNbBindings() is superseded by getNbIOTensors() in 8.5
    assert(engine->getNbBindings() == 2);
    // In order to bind the buffers, we need to know the names of the input and output tensors.
    // Note that indices are guaranteed to be less than ICudaEngine::getNbBindings()
    const int inputIndex = engine->getBindingIndex(kInputTensorName);
    const int outputIndex = engine->getBindingIndex(kOutputTensorName);
    assert(inputIndex == 0);
    assert(outputIndex == 1);
    // Create GPU buffers on device
    CUDA_CHECK(cudaMalloc((void**)input_buffer_device, kBatchSize * 3 * kInputH * kInputW * sizeof(float)));
    CUDA_CHECK(cudaMalloc((void**)output_buffer_device, kBatchSize * kOutputSize * sizeof(float)));

    *output_buffer_host = new float[kBatchSize * kOutputSize];
}

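nvinfer1::ICudaEngine::getNbBindings returns the total number of inputs and outputs of the engine. From TensorRT 8.5 onward, nvinfer1::ICudaEngine::getNbIOTensors can be used instead to get this count.

Next, nvinfer1::ICudaEngine::getBindingIndex looks up the binding index of each input and output tensor by name, so that an index into the buffer array can later be mapped to the right input or output node when passing and inspecting data.

cudaMalloc then allocates GPU memory for the input and output nodes. The size of each allocation is computed from the dimensions of the corresponding tensor.

Finally, the inference result has to be copied from the GPU to the host, so memory is also allocated on the host; a plain new is enough.

Tip: this scenario has exactly one input and one output, so prepare_buffer takes only two device buffers. For a model with more inputs or outputs, reimplement the function: it only needs a few more device buffer parameters and the matching allocation calls.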
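The buffer sizes passed to cudaMalloc follow directly from the tensor dimensions. Here is a small sketch of that arithmetic, assuming float data; input_bytes and output_bytes are hypothetical helpers, and the 640x640 input and 1000-element output used in the usage note are illustrative values, not the project's actual constants.

```cpp
#include <cassert>
#include <cstddef>

// Bytes needed for a batch of float images stored as batch x channels x height x width.
constexpr std::size_t input_bytes(std::size_t batch, std::size_t channels,
                                  std::size_t height, std::size_t width) {
    return batch * channels * height * width * sizeof(float);
}

// Bytes needed for a batch of flat float outputs with output_size elements per image.
constexpr std::size_t output_bytes(std::size_t batch, std::size_t output_size) {
    return batch * output_size * sizeof(float);
}
```

With 4-byte floats, input_bytes(1, 3, 640, 640) is 4,915,200 bytes and output_bytes(1, 1000) is 4,000 bytes. A model with several outputs would call output_bytes once per output and allocate one device buffer for each.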

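5. Inference

1. Call IExecutionContext::enqueue() to launch inference asynchronously on the stream.
2. Copy the result from the output buffer on the GPU to the host with cudaMemcpyAsync(), an asynchronous copy queued on the same stream.
3. Call cudaStreamSynchronize to block the host until all work queued on the CUDA stream has finished, so that the output is not read before the copy completes.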

void infer(IExecutionContext& context, cudaStream_t& stream, void** buffers, float* output, int batchSize) {
    // infer on the batch asynchronously, and DMA output back to host
    context.enqueue(batchSize, buffers, stream, nullptr);
    CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchSize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost, stream));
    CUDA_CHECK(cudaStreamSynchronize(stream));
}

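Tip: there are three other inference APIs, enqueueV2, enqueueV3, and executeV2. Briefly, enqueueV2 is the asynchronous call without a batchSize argument, executeV2 is its synchronous counterpart, and enqueueV3 (TensorRT 8.5 and later) takes only a stream, with tensor addresses set beforehand through setTensorAddress. I am still working through the finer differences.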

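6. Command-line argument parsing

1. When -s is passed on the command line, the program is asked to serialize. Four arguments follow the program name (-s, the .wts path, the engine path, and the sub-type), so argc is 5. When -d is passed, the program is asked to deserialize. Three arguments follow the program name (-d, the engine path, and the image directory), so argc is 4.
2. Indexing into argv starts at 1 because argv[0] is the name the program was invoked with; the user's arguments come after it.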

bool parse_args(int argc, char** argv, std::string& wts, std::string& engine, std::string& img_dir, std::string& sub_type) {
    if (argc < 4) return false;
    if (std::string(argv[1]) == "-s" && argc == 5) {
        wts = std::string(argv[2]);
        engine = std::string(argv[3]);
        sub_type = std::string(argv[4]);
    }
    else if (std::string(argv[1]) == "-d" && argc == 4) {
        engine = std::string(argv[2]);
        img_dir = std::string(argv[3]);
    }
    else {
        return false;
    }
    return true;
}
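To see which argument lists are accepted, parse_args can be exercised without a real command line. The sketch below repeats parse_args from above so that it compiles on its own and adds parse_from_vector, a hypothetical wrapper that builds argv from a list of strings; the file names in the usage note are made up.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Same logic as parse_args above, repeated so this sketch is self-contained.
bool parse_args(int argc, char** argv, std::string& wts, std::string& engine,
                std::string& img_dir, std::string& sub_type) {
    if (argc < 4) return false;
    if (std::string(argv[1]) == "-s" && argc == 5) {
        wts = std::string(argv[2]);
        engine = std::string(argv[3]);
        sub_type = std::string(argv[4]);
    } else if (std::string(argv[1]) == "-d" && argc == 4) {
        engine = std::string(argv[2]);
        img_dir = std::string(argv[3]);
    } else {
        return false;
    }
    return true;
}

// Hypothetical wrapper: args[0] plays the role of the program name (argv[0]).
bool parse_from_vector(std::vector<std::string> args, std::string& wts, std::string& engine,
                       std::string& img_dir, std::string& sub_type) {
    std::vector<char*> argv;
    for (std::string& a : args) {
        argv.push_back(&a[0]);  // each std::string is null-terminated
    }
    return parse_args(static_cast<int>(argv.size()), argv.data(), wts, engine, img_dir, sub_type);
}
```

For example, {"yolov9", "-s", "yolov9-c.wts", "yolov9-c.engine", "c"} and {"yolov9", "-d", "yolov9-c.engine", "images"} are both accepted, while {"yolov9", "-s", "yolov9-c.wts", "yolov9-c.engine"} is rejected because the sub-type is missing.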