模型部署流程
1. 模型准备
pytorch -> (onnx) -> trt engine
trtexec --onnx=output.onnx --saveEngine=outfp32.engine --workspace=2048 --minShapes=x:1x3x224x224 --optShapes=x:1x3x224x224 --maxShapes=x:1x3x224x224
trtexec --onnx=output.onnx --saveEngine=outfp16.engine --workspace=2048 --minShapes=x:1x3x224x224 --optShapes=x:1x3x224x224 --maxShapes=x:1x3x224x224 --fp16
trtexec --onnx=output.onnx --saveEngine=outint8.engine --workspace=2048 --minShapes=x:1x3x224x224 --optShapes=x:1x3x224x224 --maxShapes=x:1x3x224x224 --int8
trtexec --onnx=output.onnx --saveEngine=outfpbest.engine --workspace=2048 --minShapes=x:1x3x224x224 --optShapes=x:1x3x224x224 --maxShapes=x:1x3x224x224 --best
2. 准备图片输入
- 尺寸适配:图片固定长宽比 resize + padding 到模型要求的输入尺寸
- 归一化:减均值,除方差 -> float 格式
- 展开:按channel展开成一维float数组 (size = 3×w×h)
- 最终输入就是一维数组
3. 结果输出
- 根据网络结构有几个output head,就绑定几个buffer
// Runs one synchronous inference pass: H2D copy -> enqueue -> D2H copy -> sync.
//
// Params:
//   context     - TensorRT execution context (implicit-batch API; batch size 1 is enqueued).
//   input       - host pointer to 3 * input_shape.height * input_shape.width floats (CHW layout
//                 per the preprocessing notes above — TODO confirm layout matches the engine).
//   output      - host pointer with room for output_size floats.
//   output_size - number of output floats to copy back.
//   input_shape - spatial dimensions of the (single) input tensor.
//
// Precondition: the engine has exactly 2 bindings (one input, one output), both kFLOAT.
// NOTE(review): cudaMalloc/cudaFree and stream create/destroy on every call are expensive;
// for a hot loop, allocate the device buffers and stream once and reuse them.
void doInference(IExecutionContext& context, float* input, float* output, const int output_size, Size input_shape) {
    const ICudaEngine& engine = context.getEngine();

    // Engine requires exactly IEngine::getNbBindings() buffers — here 1 input + 1 output.
    assert(engine.getNbBindings() == 2);
    void* buffers[2];

    // Resolve binding slots by tensor name; indices are guaranteed < getNbBindings().
    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);
    assert(engine.getBindingDataType(inputIndex) == nvinfer1::DataType::kFLOAT);
    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);
    assert(engine.getBindingDataType(outputIndex) == nvinfer1::DataType::kFLOAT);

    // Byte sizes computed once; 3 channels * H * W floats for the input.
    const size_t inputBytes = 3 * input_shape.height * input_shape.width * sizeof(float);
    const size_t outputBytes = output_size * sizeof(float);

    // Create GPU buffers on device.
    CHECK(cudaMalloc(&buffers[inputIndex], inputBytes));
    CHECK(cudaMalloc(&buffers[outputIndex], outputBytes));

    // Create stream.
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // DMA input data to device, infer asynchronously (implicit-batch enqueue, batch=1),
    // and DMA the output back to host — all ordered on the same stream.
    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, inputBytes, cudaMemcpyHostToDevice, stream));
    context.enqueue(1, buffers, stream, nullptr);
    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], outputBytes, cudaMemcpyDeviceToHost, stream));
    // Was unchecked: cudaStreamSynchronize also surfaces asynchronous execution errors,
    // so dropping its return value silently hides kernel/copy failures.
    CHECK(cudaStreamSynchronize(stream));

    // Release stream and buffers (destroy return code now checked for consistency).
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(buffers[inputIndex]));
    CHECK(cudaFree(buffers[outputIndex]));
}
- decode输出:阈值处理、非极大值抑制、还原Box位置和尺寸、还原关键点位置
参考
https://blog.csdn.net/HaoZiHuang/article/details/125859167
https://blog.csdn.net/weixin_42492254/article/details/126028199
https://github.com/ifzhang/ByteTrack/blob/main/deploy/TensorRT/cpp/src/bytetrack.cpp