Using a Caffe classification model as an example, this post walks through the basic TensorRT (TRT) workflow. Quantization is not covered here; everything stays in FP32, and we simply verify that the TRT result matches the Python-side result.
For a Caffe model, two files are needed:
deploy.prototxt    the network definition file
weight.caffemodel  the trained weights file
Using TensorRT involves two phases: build and deployment.
The build phase converts the model (from Caffe, TensorFlow, PyTorch, etc.) into a TensorRT engine.
Loading the model and building the TRT engine breaks down into six steps:
- 1、Create a logger. It is mandatory, although it only collects log messages; you can subclass ILogger yourself.
- 2、Create a builder.
- 3、Create a network. At this point the network is just an empty shell.
- 4、Create a parser. Caffe, ONNX and TensorFlow models each have a corresponding parser; as the name suggests, it parses the model files into the network.
- 5、Build the engine. This is where layer fusion and precision calibration happen; the engine can run in FP32, FP16 or INT8.
- 6、Create a context, which is what actually runs inference: it connects to the engine above and to the input/output data below, hence "execution context" (the author's own mnemonic). A compact code sketch of these six calls follows below.
(PS: thanks to Hui-shen for providing the flowchart.)
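Putting the six steps together, the build phase boils down to roughly the following calls (a minimal sketch against the pre-TensorRT-8 C++ API used throughout this post; the file names and the output blob name "prob" are placeholders that match the example below):

// 1. logger: gLogger comes from the TensorRT samples' logger.h
IBuilder* builder = createInferBuilder(gLogger);                          // 2. builder
INetworkDefinition* network = builder->createNetworkV2(0U);               // 3. empty network
ICaffeParser* parser = createCaffeParser();                               // 4. parser fills the network
const IBlobNameToTensor* blobs = parser->parse("deploy.prototxt", "weight.caffemodel", *network, DataType::kFLOAT);
network->markOutput(*blobs->find("prob"));                                // tell TRT which blob is the output
IBuilderConfig* config = builder->createBuilderConfig();
builder->setMaxBatchSize(1);
config->setMaxWorkspaceSize(1 << 20);
ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);  // 5. engine: fusion + precision
IExecutionContext* context = engine->createExecutionContext();            // 6. context used for inference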
一、Load the Caffe model online, then serialize it and save it to disk
Main steps:
1、Read the image with OpenCV, subtract the mean, divide by the std, and convert NHWC to NCHW
2、Parse the Caffe model with TRT
3、Save the serialized engine
4、Build the engine, run inference, and produce the result
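A note on the headers: logger.h and common.h below come from the TensorRT samples tree and provide gLogger, gLogInfo and the CHECK macro. If you are building outside the samples, a minimal stand-in could look like the sketch below (gLogInfo can simply be replaced by std::cout); this is only an assumed drop-in, adapt it to your TensorRT version.

#include "NvInfer.h"
#include "cuda_runtime_api.h"
#include <iostream>
// Minimal logger: print warnings and errors, swallow info messages
class SimpleLogger : public nvinfer1::ILogger
{
void log(Severity severity, const char* msg) noexcept override
{
if (severity <= Severity::kWARNING)
std::cout << msg << std::endl;
}
} gLogger;
// Minimal CHECK macro for CUDA API return codes
#define CHECK(status) \
do { \
cudaError_t err = (status); \
if (err != cudaSuccess) \
std::cerr << "CUDA error: " << cudaGetErrorString(err) << std::endl; \
} while (0)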
/*====================================================================
File    : sampleCaffeClassf.cc
Purpose : TensorRT tutorial 1, walking the whole pipeline end to end
====================================================================*/
#include "NvCaffeParser.h"
#include "NvInfer.h"
#include "NvInferPlugin.h"
#include "logger.h"
#include "cuda_runtime_api.h"
#include "common.h"
#include <cstdlib>
#include <cstring>
#include <fstream>
#include <iostream>
#include <sstream>
#include <opencv2/opencv.hpp>
using namespace nvinfer1;
using namespace plugin;
using namespace nvcaffeparser1;
const int MODEL_HEIGHT = 256;
const int MODEL_WIDTH = 256;
const int MODEL_CHANNEL = 3;
const int MODEL_OUTPUT_SIZE = 5; // number of classes (5-way classification)
/**********************************
 * @brief Resize, then subtract the mean and divide by the std
 *
 * @param matSrc  input BGR image
 * @param matDst  normalized float image
 * @return
 *********************************/
void preData(cv::Mat &matSrc, cv::Mat &matDst)
{
cv::resize(matSrc, matSrc, cv::Size(MODEL_WIDTH, MODEL_HEIGHT));
cv::Mat matMean(MODEL_HEIGHT, MODEL_WIDTH, CV_32FC3, \
cv::Scalar(103.53f, 116.28f, 123.675f)); // per-channel mean (BGR order)
cv::Mat matStd(MODEL_HEIGHT, MODEL_WIDTH, CV_32FC3, \
cv::Scalar(1.0f, 1.0f, 1.0f)); // per-channel std
cv::Mat matF32Img;
matSrc.convertTo(matF32Img, CV_32FC3);
matDst = (matF32Img - matMean) / matStd;
}
int main()
{
std::string strTrtSavedPath = "./savedTrt.trt";
// gLogger is a logger object (here provided by the samples' logger.h); it is mandatory but not
// very interesting, and you can subclass ILogger yourself.
// 1. Create a builder
IBuilder* builder = createInferBuilder(gLogger);
// 2. Create a network with createNetworkV2 (recommended). At this point the network is just an
// empty shell. For the Caffe parser the flag must be 0U (implicit batch mode), as in the official samples.
INetworkDefinition* network = builder->createNetworkV2(0U);
// TensorRT provides a high-level API, ICaffeParser, to parse the Caffe model into the network
ICaffeParser *parser = createCaffeParser();
const IBlobNameToTensor *blobNameToTensor = parser->parse("./model/caffeProfile/deploy_vgg16_places365.prototxt",
"./model/caffeProfile/vgg_iter_100000.caffemodel",
*network,
DataType::kFLOAT);
// 3. Mark the output tensor by its blob name
network->markOutput(*blobNameToTensor->find("prob"));
// config holds the builder options used when building the engine
IBuilderConfig *config = builder->createBuilderConfig();
// Set the maximum batch size
builder->setMaxBatchSize(1);
// Set the maximum workspace size (1 MB here)
config->setMaxWorkspaceSize(1 << 20);
// 4. Build the engine; this is where layer fusion and precision calibration happen
ICudaEngine *engine = builder->buildEngineWithConfig(*network, *config);
if (1) // set to 1 to save the serialized engine to disk
{
IHostMemory* trtModelStream{ nullptr };
trtModelStream = engine->serialize();
std::ofstream modeStreamoutfile(strTrtSavedPath, std::ofstream::binary);
assert(!modeStreamoutfile.fail());
modeStreamoutfile.write((char*)trtModelStream->data(), trtModelStream->size());
gLogInfo<<"Saving TRT engine " << strTrtSavedPath << "." <<std::endl;
}
// 6. Create the execution context; the rest is the inference pass
IExecutionContext *context = engine->createExecutionContext();
int nInputIdx = engine->getBindingIndex("data");
int nOutputIndex = engine->getBindingIndex("prob");
std::cout << " nINputIdx = " << nInputIdx << std::endl;
std::cout << " nOutputIdx = " << nOutputIndex << std::endl;
//
std::cout << " n = " << engine->getNbBindings() << std::endl;
// Allocate GPU memory for the input / output buffers
void* buffers[2] = {NULL, NULL};
int nBatchSize = 1;
int nOutputSize = MODEL_OUTPUT_SIZE;
CHECK(cudaMalloc(&buffers[nInputIdx], nBatchSize * MODEL_CHANNEL * MODEL_HEIGHT * MODEL_WIDTH * sizeof(float)));
CHECK(cudaMalloc(&buffers[nOutputIndex], nBatchSize * nOutputSize * sizeof(float)));
// Create a CUDA stream
cudaStream_t stream;
CHECK(cudaStreamCreate(&stream));
cudaEvent_t start, end; //calculate run time
CHECK(cudaEventCreate(&start));
CHECK(cudaEventCreate(&end));
cv::Mat matBgrImg = cv::imread("./data/fram_25.jpg");
cv::Mat matNormImage;
preData(matBgrImg, matNormImage); // resize, subtract mean, divide by std
std::vector<std::vector<cv::Mat>> nChannels;
std::vector<cv::Mat> rgbChannels(3);
cv::split(matNormImage, rgbChannels);
nChannels.push_back(rgbChannels); // split channels: first step of the NHWC -> NCHW conversion
float *data = (float *)malloc(nBatchSize * MODEL_CHANNEL * MODEL_HEIGHT * MODEL_WIDTH * sizeof(float));
if (NULL == data)
{
printf("malloc error!\n");
return 0;
}
// Pack the three float planes contiguously (CHW): second step of the NHWC -> NCHW conversion
for (int c = 0; c < 3; ++c)
{
cv::Mat cur_imag_plane = nChannels[0][c];
memcpy(data + c * MODEL_HEIGHT * MODEL_WIDTH, cur_imag_plane.ptr<float>(0), MODEL_HEIGHT * MODEL_WIDTH * sizeof(float));
}
// DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host
CHECK(cudaMemcpyAsync(buffers[nInputIdx], data, \
nBatchSize * MODEL_CHANNEL * MODEL_WIDTH * MODEL_HEIGHT * sizeof(float), cudaMemcpyHostToDevice, stream));
float ms = 0.0f;
cudaEventRecord(start, stream);
// 5. Launch the CUDA kernels (run inference asynchronously on the stream)
context->enqueue(nBatchSize, buffers, stream, nullptr);
cudaEventRecord(end, stream);
cudaEventSynchronize(end);
cudaEventElapsedTime(&ms, start, end);
std::cout << " inference time = " << ms << " ms" << std::endl;
float prob[MODEL_OUTPUT_SIZE]; // host buffer for the class probabilities (batch size 1)
CHECK(cudaMemcpyAsync(prob, buffers[nOutputIndex], nBatchSize * nOutputSize * sizeof(float), cudaMemcpyDeviceToHost, stream));
cudaStreamSynchronize(stream);
cudaEventDestroy(start);
cudaEventDestroy(end);
// Release stream and buffers
cudaStreamDestroy(stream);
CHECK(cudaFree(buffers[nInputIdx]));
CHECK(cudaFree(buffers[nOutputIndex]));
for (int i = 0; i < MODEL_OUTPUT_SIZE; ++i)
{
std::cout << prob[i] << " ";
}
std::cout << std::endl;
// Release host memory and the TensorRT objects
free(data);
context->destroy();
engine->destroy();
parser->destroy();
network->destroy();
config->destroy();
builder->destroy();
printf("done\n");
return 0;
}
二、Deserialize and load the saved TRT engine directly
/*====================================================================
File    : sampleCaffeClassf.cc
Purpose : TensorRT tutorial 1, loading the serialized engine and running inference
====================================================================*/
#include "NvCaffeParser.h"
#include "NvInfer.h"
#include "NvInferPlugin.h"
#include "logger.h"
#include "cuda_runtime_api.h"
#include "common.h"
#include <cstdlib>
#include <cstring>
#include <fstream>
#include <iostream>
#include <sstream>
#include <opencv2/opencv.hpp>
using namespace nvinfer1;
using namespace plugin;
using namespace nvcaffeparser1;
const int MODEL_HEIGHT = 256;
const int MODEL_WIDTH = 256;
const int MODEL_CHANNEL = 3;
const int MODEL_OUTPUT_SIZE = 5; // model output: 5 classes
/**********************************
 * @brief Resize, then subtract the mean and divide by the std
 *
 * @param matSrc  input BGR image
 * @param matDst  normalized float image
 * @return
 *********************************/
void preData(cv::Mat &matSrc, cv::Mat &matDst)
{
cv::resize(matSrc, matSrc, cv::Size(MODEL_WIDTH, MODEL_HEIGHT));
cv::Mat matMean(MODEL_HEIGHT, MODEL_WIDTH, CV_32FC3, \
cv::Scalar(103.53f, 116.28f, 123.675f)); // per-channel mean (BGR order)
cv::Mat matStd(MODEL_HEIGHT, MODEL_WIDTH, CV_32FC3, \
cv::Scalar(1.0f, 1.0f, 1.0f)); // per-channel std
cv::Mat matF32Img;
matSrc.convertTo(matF32Img, CV_32FC3);
matDst = (matF32Img - matMean) / matStd;
}
int main()
{
std::string strTrtSavedPath = "./savedTrt.trt";
// gLogger is a logger object (here provided by the samples' logger.h); it is mandatory but not
// very interesting, and you can subclass ILogger yourself.
// Create the runtime, which will deserialize the engine
IRuntime* runtime = createInferRuntime(gLogger);
std::ifstream fin(strTrtSavedPath, std::ios::binary);
// 1. Read the whole serialized engine file into the modelData string
std::string modelData = "";
std::stringstream buffer;
buffer << fin.rdbuf();
modelData = buffer.str();
fin.close();
// 2. Deserialize the engine for the subsequent inference
ICudaEngine* engine = runtime->deserializeCudaEngine(modelData.data(), modelData.size(), nullptr);
// Create the execution context; the rest is the inference pass
IExecutionContext *context = engine->createExecutionContext();
int nInputIdx = engine->getBindingIndex("data"); // 输入节点名
int nOutputIndex = engine->getBindingIndex("prob"); // 输出节点名
std::cout << " nINputIdx = " << nInputIdx << std::endl;
std::cout << " nOutputIdx = " << nOutputIndex << std::endl;
//
std::cout << " n = " << engine->getNbBindings() << std::endl;
// Allocate GPU memory for the input / output buffers
void* buffers[2] = {NULL, NULL};
int nBatchSize = 1;
int nOutputSize = MODEL_OUTPUT_SIZE;
CHECK(cudaMalloc(&buffers[nInputIdx], nBatchSize * MODEL_CHANNEL * MODEL_HEIGHT * MODEL_WIDTH * sizeof(float)));
CHECK(cudaMalloc(&buffers[nOutputIndex], nBatchSize * nOutputSize * sizeof(float)));
// Create a CUDA stream
cudaStream_t stream;
CHECK(cudaStreamCreate(&stream));
cudaEvent_t start, end; //calculate run time
CHECK(cudaEventCreate(&start));
CHECK(cudaEventCreate(&end));
cv::Mat matBgrImg = cv::imread("./data/fram_25.jpg");
cv::Mat matNormImage;
preData(matBgrImg, matNormImage); // resize, subtract mean, divide by std
std::vector<std::vector<cv::Mat>> nChannels;
std::vector<cv::Mat> rgbChannels(3);
cv::split(matNormImage, rgbChannels);
nChannels.push_back(rgbChannels); // split channels: first step of the NHWC -> NCHW conversion
float *data = (float *)malloc(nBatchSize * MODEL_CHANNEL * MODEL_HEIGHT * MODEL_WIDTH * sizeof(float));
if (NULL == data)
{
printf("malloc error!\n");
return 0;
}
// Pack the three float planes contiguously (CHW): second step of the NHWC -> NCHW conversion
for (int c = 0; c < 3; ++c)
{
cv::Mat cur_imag_plane = nChannels[0][c];
memcpy(data + c * MODEL_HEIGHT * MODEL_WIDTH, cur_imag_plane.ptr<float>(0), MODEL_HEIGHT * MODEL_WIDTH * sizeof(float));
}
// DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host
CHECK(cudaMemcpyAsync(buffers[nInputIdx], data, \
nBatchSize * MODEL_CHANNEL * MODEL_WIDTH * MODEL_HEIGHT * sizeof(float), cudaMemcpyHostToDevice, stream));
float ms = 0.0f;
cudaEventRecord(start, stream);
// Launch the CUDA kernels (run inference asynchronously on the stream)
context->enqueue(nBatchSize, buffers, stream, nullptr);
cudaEventRecord(end, stream);
cudaEventSynchronize(end);
cudaEventElapsedTime(&ms, start, end);
std::cout << " inference time = " << ms << " ms" << std::endl;
float prob[MODEL_OUTPUT_SIZE]; // host buffer for the class probabilities (batch size 1)
CHECK(cudaMemcpyAsync(prob, buffers[nOutputIndex], nBatchSize * nOutputSize * sizeof(float), cudaMemcpyDeviceToHost, stream));
cudaStreamSynchronize(stream);
cudaEventDestroy(start);
cudaEventDestroy(end);
// Release stream and buffers
cudaStreamDestroy(stream);
CHECK(cudaFree(buffers[nInputIdx]));
CHECK(cudaFree(buffers[nOutputIndex]));
for (int i = 0; i < MODEL_OUTPUT_SIZE; ++i)
{
std::cout << prob[i] << " ";
}
std::cout << std::endl;
// Release host memory and the TensorRT objects
free(data);
context->destroy();
engine->destroy();
runtime->destroy();
printf("done\n");
return 0;
}
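Finally, to turn the five probabilities into an actual prediction, an argmax over prob is enough. The small sketch below can be dropped in right after the cudaStreamSynchronize call in either program; it assumes, as in the code above, that "prob" is already the softmax output of the network.

#include <algorithm> // std::max_element
// prob holds the MODEL_OUTPUT_SIZE class probabilities copied back from the GPU
int nBestClass = static_cast<int>(std::max_element(prob, prob + MODEL_OUTPUT_SIZE) - prob);
std::cout << "predicted class = " << nBestClass << ", probability = " << prob[nBestClass] << std::endl;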