Deploying yolov8-pose with TensorRT (*.engine)
1. Deployment environment
- NVIDIA AGX ORIN 32GB
- TensorRT 8.4
- CUDA 11.4
- cuDNN 8.4.1
- OpenCV 4.5.1 (dnn)
2. Build platform (problem notes)
Previously I ran TensorRT 8.6 on a PC, with the project built in Visual Studio. I planned to reuse that code directly and build it with QtCreator, but the model failed to load with the following error:
[TensorRT] ERROR: 1: [stdArchiveReader.cpp::StdArchiveReader::35] Error Code 1: Serialization (Serialization assertion safeVersionRead == safeSerializationVersion failed.Version tag does not match. Note: Current Version: 0, Serialized Engine Version: 89)
It turned out that the TensorRT version used to export the engine did not match the version loading it at runtime; a serialized engine is only valid for the exact TensorRT version that produced it, so the engine must be re-exported with the target version. I was stuck on this for a long time!
A second issue: the model eventually loaded successfully, but this message kept being printed:
[runtime.cpp::deserializeCudaEngine::37] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/runtime.cpp::deserializeCudaEngine::37, condition: (blob) != nullptr
The usual explanation is a wrong engine file path, but I was using an absolute path. Since the program still ran correctly in the end, I left this error alone.
3. Calling from C++
```cpp
// detect.h
#pragma once
#include <fstream>
#include <iostream>
#include <sstream>
#include <opencv2/opencv.hpp>
#include "NvInfer.h"
#include "NvInferPlugin.h"

using namespace nvinfer1;
using namespace cv;

// COCO 17-keypoint skeleton, stored as 18 index pairs (0-based)
const int connect_list[36] = { 0, 1, 0, 2, 1, 3, 2, 4, 3, 5, 4, 6, 5, 6, 5, 7, 7, 9, 6, 8, 8, 10, 5, 11, 6, 12, 11, 12, 11, 13, 13, 15, 12, 14, 14, 16 };

class Yolov8PoseTrt {
public:
    void initConfig(std::string enginefile, float conf_threshold, float score_threshold);
    void detect(cv::Mat& frame);
    ~Yolov8PoseTrt();

private:
    float conf_threshold = 0.25f;
    float score_threshold = 0.25f;
    int input_h = 640;
    int input_w = 640;
    int output_h;
    int output_w;
    int num_pts, count_num = 0;
    float tempTime;
    IRuntime* runtime{ nullptr };
    ICudaEngine* engine{ nullptr };
    IExecutionContext* context{ nullptr };
    void* buffers[2] = { nullptr, nullptr };
    std::vector<float> prob;
    cudaStream_t stream;
};
```
The inference flow is: create the input tensor, copy it to device memory and run inference, then fetch the result:
```cpp
// HWC BGR frame -> NCHW float tensor, scaled to [0,1], BGR->RGB swapped
cv::Mat tensor = cv::dnn::blobFromImage(image, 1.0f / 255.f, cv::Size(input_w, input_h), cv::Scalar(), true);
// host memory -> GPU memory
cudaMemcpyAsync(buffers[0], tensor.ptr<float>(), input_h * input_w * 3 * sizeof(float), cudaMemcpyHostToDevice, stream);
// inference
context->enqueueV2(buffers, stream, nullptr);
// GPU memory -> host memory
cudaMemcpyAsync(prob.data(), buffers[1], output_h * output_w * sizeof(float), cudaMemcpyDeviceToHost, stream);
// wait for the async copies and kernels to finish before reading prob
cudaStreamSynchronize(stream);
```
The full .cpp file is here.
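After the copy back, `prob` holds the raw network output. For the standard yolov8-pose head this is [1, 56, N] with N anchor columns: per column, 4 box values (cx, cy, w, h), 1 confidence score, and 17 keypoints as (x, y, visibility) triplets, all in input-image pixels. A CPU-only sketch of decoding one column follows; the layout is assumed as above, and `PoseCandidate`/`decodeColumn` are my own names, not part of the post's .cpp file:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Assumed yolov8-pose output layout: [1, 56, N], channel-major, so
// element (row, anchor) lives at out[row * numAnchors + anchor].
// Rows 0-3 = cx, cy, w, h; row 4 = confidence;
// rows 5..55 = 17 keypoints as (x, y, visibility) triplets.
struct PoseCandidate {
    float cx, cy, w, h;
    float conf;
    std::array<float, 17 * 3> kpts;  // x, y, visibility per keypoint
};

PoseCandidate decodeColumn(const std::vector<float>& out, size_t numAnchors, size_t a) {
    auto at = [&](size_t row) { return out[row * numAnchors + a]; };
    PoseCandidate c;
    c.cx = at(0); c.cy = at(1); c.w = at(2); c.h = at(3);
    c.conf = at(4);
    for (size_t k = 0; k < 17; ++k) {
        c.kpts[k * 3 + 0] = at(5 + k * 3 + 0);  // keypoint x
        c.kpts[k * 3 + 1] = at(5 + k * 3 + 1);  // keypoint y
        c.kpts[k * 3 + 2] = at(5 + k * 3 + 2);  // visibility
    }
    return c;
}
```

Columns whose `conf` falls below `conf_threshold` are skipped, boxes are converted from center form to corners, and NMS is applied before drawing the skeleton with `connect_list`.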
4. Summary (problem notes)
Overall, calling the engine from C++ is fairly straightforward; the main difficulties were the following: