In a previous article I walked through a simple TensorRT demo, LeNet inference [tensorRT-lenet], covering the whole pipeline from a torch model to a wts file (which also exposes the detailed network weights), then to an engine, and finally inference. This article builds on that and analyzes the C++ implementation.
Starting from the main function, we will go through what each function, or each block of code, does. The full code is attached at the end.
My C++ is fairly weak, so please bear with the rough edges.
Here is a summary of the main steps covered in this article.
wts --> engine:
1. Use IHostMemory to create a modelStream into which the serialized engine will later be written;
2. Build the model with the API:
1) create an IBuilder;
2) use the builder from step 1) to create an IBuilderConfig;
3) build the network with the custom createLenetEngine function, loading the weights from the wts file, and return the engine;
4) serialize the engine and write it into the modelStream from step 1;
3. Write the modelStream obtained in step 2 into an engine file.
engine --> inference:
1. Get the size of the engine file;
2. Read the engine contents into a heap buffer trtModelStream (effectively obtaining a trt model);
3. Deserialize the model from step 2 with deserializeCudaEngine to obtain the deserialized engine;
4. Create an executable execution context;
5. Run inference.
1. Passing the arguments
The lines below check that the command-line argument is valid: -s means converting the wts file to an engine, while -d means running forward inference with the engine. A typical workflow is therefore to run ./lenet -s once to generate the engine and then ./lenet -d to run inference.
int main(int argc, char** argv)
{
if (argc != 2) {
std::cerr << "arguments not right!" << std::endl;
std::cerr << "./lenet -s // serialize model to plan file" << std::endl;
std::cerr << "./lenet -d // deserialize plan file and run inference" << std::endl;
return -1;
}
2. Building the model with the API -- wts to engine
The core C++ API is contained in NvInfer.h, so that header must be included.
The model is created directly through the API and then serialized into a data stream.
// create a model using the API directly and serialize it to a stream
char *trtModelStream{nullptr};
size_t size{0};
When the argument is "-s", the model is serialized.
IHostMemory
The IHostMemory class needs a short explanation first.
It is a class that owns host memory allocated by the library; it must not be inherited from, as that would break forward compatibility of the API and ABI.
It has a few member functions: data() (a pointer to the start of the data), size() (the size of the data in bytes), type() (the data type), and destroy() (to release the memory).
if (std::string(argv[1]) == "-s") {
IHostMemory* modelStream{nullptr}; // binary model stream
APIToModel(1, &modelStream);
assert(modelStream != nullptr);
std::ofstream p("lenet5.engine", std::ios::binary);
if (!p)
{
std::cerr << "could not open plan output file" << std::endl;
return -1;
}
p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());
modelStream->destroy();
return 1;
APIToModel
The model is created with the API. The full APIToModel function is shown below. It takes two parameters: the first is the batch size, and the second is the IHostMemory object we created above, a binary model stream that starts out as a null pointer.
The function involves objects of several types: IBuilder, IBuilderConfig, and ICudaEngine. Each is introduced below.
void APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream)
{
// Create builder
IBuilder* builder = createInferBuilder(gLogger);
IBuilderConfig* config = builder->createBuilderConfig();
// Create model to populate the network, then set the outputs and create an engine
ICudaEngine* engine = createLenetEngine(maxBatchSize, builder, config, DataType::kFLOAT);
assert(engine != nullptr);
// Serialize the engine
(*modelStream) = engine->serialize();
// Close everything down
engine->destroy();
builder->destroy();
}
IBuilder
Create the builder.
In the line below, gLogger is boilerplate used to report information while the code runs (these are the messages printed later while the engine is built).
gLogger is the static Logger gLogger defined earlier (it requires the logging.h header).
IBuilder* builder = createInferBuilder(gLogger);
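For reference, below is a minimal sketch of roughly what the Logger class in logging.h does; the real logging.h in the repo is more elaborate, and the exact signature of log() differs slightly between TensorRT versions, so treat this only as an illustration of the idea: a class derived from nvinfer1::ILogger that filters messages by severity.
class Logger : public nvinfer1::ILogger
{
    // Print warnings and errors only; suppress the verbose build messages.
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
};
static Logger gLogger;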
IBuilderConfig
The builder configuration. It holds the details used to create the engine. As the line below shows, config points to the IBuilderConfig object returned by calling the builder's createBuilderConfig() member function.
IBuilderConfig* config = builder->createBuilderConfig();
ICudaEngine
This API is declared in the NvInferRuntime.h header. It is an engine for executing inference on a built network, with functionally unsafe features, and it must not be inherited from.
The createLenetEngine function in the code below builds the engine used for inference.
// Create model to populate the network, then set the outputs and create an engine
ICudaEngine* engine = createLenetEngine(maxBatchSize, builder, config, DataType::kFLOAT);
The full createLenetEngine code is shown below; it is the core of the whole process.
The main steps for creating the engine are:
1. define an empty network;
2. create an input tensor;
3. load the wts weights;
4. build the convolution (and remaining) layers;
5. set the output tensor name and mark the network output.
// Create the engine using only the API and not any parser.
ICudaEngine* createLenetEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt)
{
INetworkDefinition* network = builder->createNetworkV2(0U);
// Create input tensor of shape { 1, 32, 32 } with name INPUT_BLOB_NAME
ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{1, INPUT_H, INPUT_W});
assert(data);
// Add convolution layer with 6 outputs and a 5x5 filter.
std::map<std::string, Weights> weightMap = loadWeights("../lenet5.wts");
IConvolutionLayer* conv1 = network->addConvolutionNd(*data, 6, DimsHW{5, 5}, weightMap["conv1.weight"], weightMap["conv1.bias"]);
assert(conv1);
conv1->setStrideNd(DimsHW{1, 1});
// Add activation layer using the ReLU algorithm.
IActivationLayer* relu1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);
assert(relu1);
// Add average pooling layer with stride of 2x2 and kernel size of 2x2.
IPoolingLayer* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kAVERAGE, DimsHW{2, 2});
assert(pool1);
pool1->setStrideNd(DimsHW{2, 2});
// Add second convolution layer with 16 outputs and a 5x5 filter.
IConvolutionLayer* conv2 = network->addConvolutionNd(*pool1->getOutput(0), 16, DimsHW{5, 5}, weightMap["conv2.weight"], weightMap["conv2.bias"]);
assert(conv2);
conv2->setStrideNd(DimsHW{1, 1});
// Add activation layer using the ReLU algorithm.
IActivationLayer* relu2 = network->addActivation(*conv2->getOutput(0), ActivationType::kRELU);
assert(relu2);
// Add second average pooling layer with stride of 2x2 and kernel size of 2x2.
IPoolingLayer* pool2 = network->addPoolingNd(*relu2->getOutput(0), PoolingType::kAVERAGE, DimsHW{2, 2});
assert(pool2);
pool2->setStrideNd(DimsHW{2, 2});
// Add fully connected layer
IFullyConnectedLayer* fc1 = network->addFullyConnected(*pool2->getOutput(0), 120, weightMap["fc1.weight"], weightMap["fc1.bias"]);
assert(fc1);
// Add activation layer using the ReLU algorithm.
IActivationLayer* relu3 = network->addActivation(*fc1->getOutput(0), ActivationType::kRELU);
assert(relu3);
// Add second fully connected layer
IFullyConnectedLayer* fc2 = network->addFullyConnected(*relu3->getOutput(0), 84, weightMap["fc2.weight"], weightMap["fc2.bias"]);
assert(fc2);
// Add activation layer using the ReLU algorithm.
IActivationLayer* relu4 = network->addActivation(*fc2->getOutput(0), ActivationType::kRELU);
assert(relu4);
// Add third fully connected layer
IFullyConnectedLayer* fc3 = network->addFullyConnected(*relu4->getOutput(0), OUTPUT_SIZE, weightMap["fc3.weight"], weightMap["fc3.bias"]);
assert(fc3);
// Add softmax layer to determine the probability.
ISoftMaxLayer* prob = network->addSoftMax(*fc3->getOutput(0));
assert(prob);
prob->getOutput(0)->setName(OUTPUT_BLOB_NAME);
network->markOutput(*prob->getOutput(0));
// Build engine
builder->setMaxBatchSize(maxBatchSize);
config->setMaxWorkspaceSize(1 << 20);
ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
// Don't need the network any more
network->destroy();
// Release host memory
for (auto& mem : weightMap)
{
free((void*) (mem.second.values));
}
return engine;
}
The function returns an ICudaEngine. It takes 4 parameters: the batch size, the builder, the builder config, and the DataType.
There are two ways to create a network: build it with the TensorRT API, or use a parser to convert an existing model into a Network. Here we choose the former.
1. Defining the network
INetworkDefinition is used to define the network. We call the builder's createNetworkV2 member function; createNetworkV2(0U) first creates an empty Network.
INetworkDefinition* network = builder->createNetworkV2(0U);
2. Creating the input tensor
addInput creates a tensor; it takes three arguments of types const char*, DataType, and Dims. Here addInput is called with ("data", dt, Dims3{1, 32, 32}).
// Create input tensor of shape { 1, 32, 32 } with name INPUT_BLOB_NAME
ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{1, INPUT_H, INPUT_W});
assert(data);
3. Loading the wts weights
// Add convolution layer with 6 outputs and a 5x5 filter.
std::map<std::string, Weights> weightMap = loadWeights("../lenet5.wts");
The full loadWeights code is shown below; file is the path to the weight file:
std::map<std::string, Weights> loadWeights(const std::string file)
{
std::cout << "Loading weights: " << file << std::endl;
std::map<std::string, Weights> weightMap;
// Open weights file
std::ifstream input(file);
assert(input.is_open() && "Unable to load weight file.");
// Read number of weight blobs
int32_t count;
input >> count;
assert(count > 0 && "Invalid weight map file.");
while (count--)
{
Weights wt{DataType::kFLOAT, nullptr, 0};
uint32_t size;
// Read name and type of blob
std::string name;
input >> name >> std::dec >> size;
wt.type = DataType::kFLOAT;
// Load blob
uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(uint32_t) * size));
for (uint32_t x = 0, y = size; x < y; ++x)
{
input >> std::hex >> val[x];
}
wt.values = val;
wt.count = size;
weightMap[name] = wt;
}
return weightMap;
}
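To make the parsing above concrete: the .wts file is plain text. The first line is the number of weight blobs, and each following line has the form "name count hex-values", with the count in decimal and each value being the hexadecimal encoding of a float (this matches the comment in the full code below). An excerpt could look roughly like the following; the count of 10 matches LeNet's ten weight/bias blobs, but the hex values here are made up purely for illustration:
10
conv1.bias 6 3f800000 3e4ccccd bd4ccccd 3f19999a 00000000 bf000000
conv1.weight 150 3c23d70a ...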
4. Building the convolution layer
Call addConvolutionNd: the first argument is the input tensor, the second is the number of output channels, the third is the kernel size (here 5 x 5), the fourth is the convolution weights, and the fifth is the convolution bias. The stride is set to 1 with setStrideNd, which sets the stride in both height and width.
IConvolutionLayer* conv1 = network->addConvolutionNd(*data, 6, DimsHW{5, 5}, weightMap["conv1.weight"], weightMap["conv1.bias"]);
assert(conv1);
conv1->setStrideNd(DimsHW{1, 1});
4.1 Adding the activation function
After defining the convolution we add the activation function; LeNet uses ReLU here. addActivation() takes two arguments: the first is the input tensor, where conv1->getOutput(0) returns the convolution layer's output tensor at index 0 (not the batch dimension), and the second is the activation type, kRELU for ReLU.
// Add activation layer using the ReLU algorithm.
IActivationLayer* relu1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);
assert(relu1);
4.2 Adding the average pooling layer
As above, addPoolingNd adds the pooling layer: the first argument is the output of the activation layer, used as this layer's input; the second is the pooling type; the third is the kernel size, here 2 x 2. The stride is likewise set to 2.
// Add average pooling layer with stride of 2x2 and kernel size of 2x2.
IPoolingLayer* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kAVERAGE, DimsHW{2, 2});
assert(pool1);
pool1->setStrideNd(DimsHW{2, 2});
4.3 Adding the fully connected layers
Call addFullyConnected to add a fully connected layer. The first argument is the input, the second is the number of output channels, and the third and fourth are the weight and bias.
// Add fully connected layer
IFullyConnectedLayer* fc1 = network->addFullyConnected(*pool2->getOutput(0), 120, weightMap["fc1.weight"], weightMap["fc1.bias"]);
assert(fc1);
4.4 Adding softmax
Softmax is added with addSoftMax.
// Add softmax layer to determine the probability.
ISoftMaxLayer* prob = network->addSoftMax(*fc3->getOutput(0));
assert(prob);
5. Setting the output name and marking the network output
prob->getOutput(0)->setName(OUTPUT_BLOB_NAME);
network->markOutput(*prob->getOutput(0));
The steps above define the network structure by hand (you can simply follow the forward of the PyTorch LeNet); the only extra work is assigning the weights of each layer from the wts file to the corresponding layer of the network you wrote in C++. A rough mapping is shown below.
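As a rough guide (the PyTorch side is an assumption about how the original LeNet-5 was defined, since this article only uses the exported wts file), the forward pass corresponds to the API calls above roughly as follows:
nn.Conv2d(1, 6, 5) -> addConvolutionNd(*data, 6, DimsHW{5, 5}, ...)
avg_pool2d(kernel 2, stride 2) -> addPoolingNd(..., PoolingType::kAVERAGE, DimsHW{2, 2})
nn.Conv2d(6, 16, 5) -> addConvolutionNd(..., 16, DimsHW{5, 5}, ...)
nn.Linear(16*5*5, 120) -> addFullyConnected(..., 120, ...)
nn.Linear(120, 84) -> addFullyConnected(..., 84, ...)
nn.Linear(84, 10) -> addFullyConnected(..., OUTPUT_SIZE, ...)
softmax -> addSoftMax(...)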
Building the engine:
builder is the builder we constructed earlier; call its setMaxBatchSize with the batch size.
config is the configuration created earlier; setMaxWorkspaceSize sets the maximum amount of temporary GPU memory the engine may use at execution time (1 << 20 is 1 MiB here).
Build the engine by calling buildEngineWithConfig with the network and configuration defined above.
Once it is built, the network can be destroyed. The engine is the optimized network model, which will be serialized in the next step.
// Build engine
builder->setMaxBatchSize(maxBatchSize);
config->setMaxWorkspaceSize(1 << 20);
ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
// Don't need the network any more
network->destroy();
After the steps above, release the host memory that was malloc'd for the weights in loadWeights.
// Release host memory
for (auto& mem : weightMap)
{
free((void*) (mem.second.values));
}
Serializing the engine
// Serialize the engine
(*modelStream) = engine->serialize();
// Close everything down
engine->destroy();
builder->destroy();
The APIToModel call above completes the engine build; back in main, the serialized stream is written out to lenet5.engine in the working directory. Next comes the inference stage.
3. Inference with the engine
Passing the -d argument selects inference mode.
Read the lenet5.engine file.
Get the size of its contents.
Allocate a heap buffer trtModelStream of the same size and read the data stream into it with file.read.
else if (std::string(argv[1]) == "-d") {
std::ifstream file("lenet5.engine", std::ios::binary);
if (file.good()) {
file.seekg(0, file.end);
size = file.tellg();
file.seekg(0, file.beg);
trtModelStream = new char[size];
assert(trtModelStream);
file.read(trtModelStream, size);
file.close();
}
Create a sample input image with every value set to 1.
// Prepare a dummy input (all pixels set to 1.0)
float data[INPUT_H * INPUT_W];
for (int i = 0; i < INPUT_H * INPUT_W; i++)
data[i] = 1.0;
Create the runtime; gLogger is passed in for logging.
IRuntime* runtime = createInferRuntime(gLogger);
assert(runtime != nullptr);
Deserialization
deserializeCudaEngine() performs the deserialization. It takes three arguments: the serialized model we placed on the heap, the model size, and an IPluginFactory (nullptr here, since no plugins are used).
ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);
Creating the execution context
This step creates an execution context from the deserialized engine; the context is what actually runs the inference.
IExecutionContext* context = engine->createExecutionContext();
Running the forward pass
OUTPUT_SIZE here is 10, because the LeNet network finally outputs 10 classes.
// Run inference
float prob[OUTPUT_SIZE];
for (int i = 0; i < 1000; i++) {
auto start = std::chrono::system_clock::now();
doInference(*context, data, prob, 1);
auto end = std::chrono::system_clock::now();
//std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << "ms" << std::endl;
}
// Destroy the engine
context->destroy();
engine->destroy();
runtime->destroy();
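The full code at the end simply prints the 10 values of prob. If you also want the predicted digit, a minimal sketch is shown below (it assumes prob holds the softmax output as above and that <algorithm> is included):
// Index of the largest probability is the predicted class.
int predicted = static_cast<int>(std::max_element(prob, prob + OUTPUT_SIZE) - prob);
std::cout << "predicted class: " << predicted << std::endl;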
The doInference function
The key function is doInference.
It takes 4 parameters: the executable context (obtained from the deserialized model), the input, the output, and the batch size.
Pointers to the input and output buffers are passed to the engine.
To bind the buffers we need the names of the input and output tensors; we defined them earlier as "data" for the input and "prob" for the output.
const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);
const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);
Creating the buffers
Two buffers are created, sized to match the input and the output respectively.
// Create GPU buffers on device
CHECK(cudaMalloc(&buffers[inputIndex], batchSize * INPUT_H * INPUT_W * sizeof(float)));
CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));
Creating the stream
// Create stream
cudaStream_t stream;
CHECK(cudaStreamCreate(&stream));
Inference
DMA the input batch to the device, run inference on the batch asynchronously, and DMA the output back to the host; cudaStreamSynchronize then waits for the asynchronous work on the stream to finish.
// DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host
CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));
context.enqueue(batchSize, buffers, stream, nullptr);
CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));
cudaStreamSynchronize(stream);
Releasing resources after inference
// Release stream and buffers
cudaStreamDestroy(stream);
CHECK(cudaFree(buffers[inputIndex]));
CHECK(cudaFree(buffers[outputIndex]));
Full code:
void doInference(IExecutionContext& context, float* input, float* output, int batchSize)
{
const ICudaEngine& engine = context.getEngine();
// Pointers to input and output device buffers to pass to engine.
// Engine requires exactly IEngine::getNbBindings() number of buffers.
assert(engine.getNbBindings() == 2);
void* buffers[2];
// In order to bind the buffers, we need to know the names of the input and output tensors.
// Note that indices are guaranteed to be less than IEngine::getNbBindings()
const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);
const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);
// Create GPU buffers on device
CHECK(cudaMalloc(&buffers[inputIndex], batchSize * INPUT_H * INPUT_W * sizeof(float)));
CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));
// Create stream
cudaStream_t stream;
CHECK(cudaStreamCreate(&stream));
// DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host
CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));
context.enqueue(batchSize, buffers, stream, nullptr);
CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));
cudaStreamSynchronize(stream);
// Release stream and buffers
cudaStreamDestroy(stream);
CHECK(cudaFree(buffers[inputIndex]));
CHECK(cudaFree(buffers[outputIndex]));
}
Full code
#include "NvInfer.h"
#include "cuda_runtime_api.h"
#include "logging.h"
#include <fstream>
#include <map>
#include <chrono>
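// CHECK wraps a CUDA runtime call and aborts the program if the call returns a non-zero error code.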
#define CHECK(status) \
do\
{\
auto ret = (status);\
if (ret != 0)\
{\
std::cerr << "Cuda failure: " << ret << std::endl;\
abort();\
}\
} while (0)
// stuff we know about the network and the input/output blobs
static const int INPUT_H = 32;
static const int INPUT_W = 32;
static const int OUTPUT_SIZE = 10;
const char* INPUT_BLOB_NAME = "data";
const char* OUTPUT_BLOB_NAME = "prob";
using namespace nvinfer1;
static Logger gLogger;
// Load weights from files shared with TensorRT samples.
// TensorRT weight files have a simple space delimited format:
// [type] [size] <data x size in hex>
std::map<std::string, Weights> loadWeights(const std::string file)
{
std::cout << "Loading weights: " << file << std::endl;
std::map<std::string, Weights> weightMap;
// Open weights file
std::ifstream input(file);
assert(input.is_open() && "Unable to load weight file.");
// Read number of weight blobs
int32_t count;
input >> count;
assert(count > 0 && "Invalid weight map file.");
while (count--)
{
Weights wt{DataType::kFLOAT, nullptr, 0};
uint32_t size;
// Read name and type of blob
std::string name;
input >> name >> std::dec >> size;
wt.type = DataType::kFLOAT;
// Load blob
uint32_t* val = reinterpret_cast<uint32_t*>(malloc(sizeof(uint32_t) * size));
for (uint32_t x = 0, y = size; x < y; ++x)
{
input >> std::hex >> val[x];
}
wt.values = val;
wt.count = size;
weightMap[name] = wt;
}
return weightMap;
}
// Create the engine using only the API and not any parser.
ICudaEngine* createLenetEngine(unsigned int maxBatchSize, IBuilder* builder, IBuilderConfig* config, DataType dt)
{
INetworkDefinition* network = builder->createNetworkV2(0U);
// Create input tensor of shape { 1, 32, 32 } with name INPUT_BLOB_NAME
ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{1, INPUT_H, INPUT_W});
assert(data);
// Add convolution layer with 6 outputs and a 5x5 filter.
std::map<std::string, Weights> weightMap = loadWeights("../lenet5.wts");
IConvolutionLayer* conv1 = network->addConvolutionNd(*data, 6, DimsHW{5, 5}, weightMap["conv1.weight"], weightMap["conv1.bias"]);
assert(conv1);
conv1->setStrideNd(DimsHW{1, 1});
// Add activation layer using the ReLU algorithm.
IActivationLayer* relu1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);
assert(relu1);
// Add average pooling layer with stride of 2x2 and kernel size of 2x2.
IPoolingLayer* pool1 = network->addPoolingNd(*relu1->getOutput(0), PoolingType::kAVERAGE, DimsHW{2, 2});
assert(pool1);
pool1->setStrideNd(DimsHW{2, 2});
// Add second convolution layer with 16 outputs and a 5x5 filter.
IConvolutionLayer* conv2 = network->addConvolutionNd(*pool1->getOutput(0), 16, DimsHW{5, 5}, weightMap["conv2.weight"], weightMap["conv2.bias"]);
assert(conv2);
conv2->setStrideNd(DimsHW{1, 1});
// Add activation layer using the ReLU algorithm.
IActivationLayer* relu2 = network->addActivation(*conv2->getOutput(0), ActivationType::kRELU);
assert(relu2);
// Add second average pooling layer with stride of 2x2 and kernel size of 2x2.
IPoolingLayer* pool2 = network->addPoolingNd(*relu2->getOutput(0), PoolingType::kAVERAGE, DimsHW{2, 2});
assert(pool2);
pool2->setStrideNd(DimsHW{2, 2});
// Add fully connected layer
IFullyConnectedLayer* fc1 = network->addFullyConnected(*pool2->getOutput(0), 120, weightMap["fc1.weight"], weightMap["fc1.bias"]);
assert(fc1);
// Add activation layer using the ReLU algorithm.
IActivationLayer* relu3 = network->addActivation(*fc1->getOutput(0), ActivationType::kRELU);
assert(relu3);
// Add second fully connected layer
IFullyConnectedLayer* fc2 = network->addFullyConnected(*relu3->getOutput(0), 84, weightMap["fc2.weight"], weightMap["fc2.bias"]);
assert(fc2);
// Add activation layer using the ReLU algorithm.
IActivationLayer* relu4 = network->addActivation(*fc2->getOutput(0), ActivationType::kRELU);
assert(relu4);
// Add third fully connected layer
IFullyConnectedLayer* fc3 = network->addFullyConnected(*relu4->getOutput(0), OUTPUT_SIZE, weightMap["fc3.weight"], weightMap["fc3.bias"]);
assert(fc3);
// Add softmax layer to determine the probability.
ISoftMaxLayer* prob = network->addSoftMax(*fc3->getOutput(0));
assert(prob);
prob->getOutput(0)->setName(OUTPUT_BLOB_NAME);
network->markOutput(*prob->getOutput(0));
// Build engine
builder->setMaxBatchSize(maxBatchSize);
config->setMaxWorkspaceSize(1 << 20);
ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
// Don't need the network any more
network->destroy();
// Release host memory
for (auto& mem : weightMap)
{
free((void*) (mem.second.values));
}
return engine;
}
void APIToModel(unsigned int maxBatchSize, IHostMemory** modelStream)
{
// Create builder
IBuilder* builder = createInferBuilder(gLogger);
IBuilderConfig* config = builder->createBuilderConfig();
// Create model to populate the network, then set the outputs and create an engine
ICudaEngine* engine = createLenetEngine(maxBatchSize, builder, config, DataType::kFLOAT);
assert(engine != nullptr);
// Serialize the engine
(*modelStream) = engine->serialize();
// Close everything down
engine->destroy();
builder->destroy();
}
void doInference(IExecutionContext& context, float* input, float* output, int batchSize)
{
const ICudaEngine& engine = context.getEngine();
// Pointers to input and output device buffers to pass to engine.
// Engine requires exactly IEngine::getNbBindings() number of buffers.
assert(engine.getNbBindings() == 2);
void* buffers[2];
// In order to bind the buffers, we need to know the names of the input and output tensors.
// Note that indices are guaranteed to be less than IEngine::getNbBindings()
const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);
const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);
// Create GPU buffers on device
CHECK(cudaMalloc(&buffers[inputIndex], batchSize * INPUT_H * INPUT_W * sizeof(float)));
CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));
// Create stream
cudaStream_t stream;
CHECK(cudaStreamCreate(&stream));
// DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host
CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));
context.enqueue(batchSize, buffers, stream, nullptr);
CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));
cudaStreamSynchronize(stream);
// Release stream and buffers
cudaStreamDestroy(stream);
CHECK(cudaFree(buffers[inputIndex]));
CHECK(cudaFree(buffers[outputIndex]));
}
int main(int argc, char** argv)
{
if (argc != 2) {
std::cerr << "arguments not right!" << std::endl;
std::cerr << "./lenet -s // serialize model to plan file" << std::endl;
std::cerr << "./lenet -d // deserialize plan file and run inference" << std::endl;
return -1;
}
// create a model using the API directly and serialize it to a stream
char *trtModelStream{nullptr};
size_t size{0};
if (std::string(argv[1]) == "-s") {
IHostMemory* modelStream{nullptr}; // binary model stream
APIToModel(1, &modelStream);
assert(modelStream != nullptr);
std::ofstream p("lenet5.engine", std::ios::binary);
if (!p)
{
std::cerr << "could not open plan output file" << std::endl;
return -1;
}
p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());
modelStream->destroy();
return 1;
} else if (std::string(argv[1]) == "-d") {
std::ifstream file("lenet5.engine", std::ios::binary);
if (file.good()) {
file.seekg(0, file.end);
size = file.tellg();
file.seekg(0, file.beg);
trtModelStream = new char[size];
assert(trtModelStream);
file.read(trtModelStream, size);
file.close();
}
} else {
return -1;
}
// Prepare a dummy input (all pixels set to 1.0)
float data[INPUT_H * INPUT_W];
for (int i = 0; i < INPUT_H * INPUT_W; i++)
data[i] = 1.0;
IRuntime* runtime = createInferRuntime(gLogger);
assert(runtime != nullptr);
ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);
assert(engine != nullptr);
IExecutionContext* context = engine->createExecutionContext();
assert(context != nullptr);
// Run inference
float prob[OUTPUT_SIZE];
for (int i = 0; i < 1000; i++) {
auto start = std::chrono::system_clock::now();
doInference(*context, data, prob, 1);
auto end = std::chrono::system_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << "ms" << std::endl;
}
// Destroy the engine
context->destroy();
engine->destroy();
runtime->destroy();
// Print histogram of the output distribution
std::cout << "\nOutput:\n\n";
for (unsigned int i = 0; i < 10; i++)
{
std::cout << prob[i] << ", ";
}
std::cout << std::endl;
return 0;
}