[Deployment] Accelerating preprocessing and postprocessing: CV-CUDA

Latest recommended article published on 2024-05-20 15:29:54

华农度假村村长


Tags: preprocessing, cv, c++

Article link: https://blog.csdn.net/weixin_50862344/article/details/134807011

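The options for accelerating preprocessing and postprocessing are roughly the following:

(1) CV-CUDA, open-sourced by NVIDIA

(2) The CUDA-accelerated modules of OpenCV 4

(3) Hand-written CUDA kernels

This chapter starts with CV-CUDA.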

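1. Basic requirements

1.1 When the CV-CUDA library can be used

It can be used in both the preprocessing and the postprocessing stage of a model.

1.2 System requirements & installation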

Ubuntu >= 20.04
CUDA driver >= 11.7

[Reference] Installation — CV-CUDA Beta documentation (cvcuda.github.io)

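2. Basic workflow

Create a stream
Compute the allocation requirements for the images
Reserve a buffer for the input
Load the images into memory
Wrap the buffer in CV-CUDA's standard format
Call CV-CUDA operators

We use the classification sample from CV-CUDA as the case study.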

2.1 Creating a stream

    cudaStream_t stream;
    CHECK_CUDA_ERROR(cudaStreamCreate(&stream));
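
CHECK_CUDA_ERROR is not defined in the snippet above, so it presumably comes from a helper header. The following is a minimal sketch of the usual error-checking pattern behind such a macro, not the sample's actual definition. To keep it compilable without the CUDA toolkit, `StatusCode`, `statusToString`, `CHECK_STATUS` and `createResource` are hypothetical stand-ins for `cudaError_t`, `cudaGetErrorString`, the real macro and a CUDA call.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for cudaError_t so the sketch builds without CUDA.
enum class StatusCode { Success = 0, InvalidValue = 1 };

// Hypothetical stand-in for cudaGetErrorString.
inline const char *statusToString(StatusCode code)
{
    return code == StatusCode::Success ? "success" : "invalid value";
}

// Throws with the failing expression, file and line if the call did not succeed.
inline void checkStatus(StatusCode code, const char *expr, const char *file, int line)
{
    if (code != StatusCode::Success)
    {
        std::ostringstream msg;
        msg << file << ":" << line << " '" << expr << "' failed: " << statusToString(code);
        throw std::runtime_error(msg.str());
    }
}

// Wraps a call so that the expression text and its location are reported on failure.
#define CHECK_STATUS(call) checkStatus((call), #call, __FILE__, __LINE__)

// Hypothetical API call, used only to exercise the macro.
inline StatusCode createResource(bool ok)
{
    return ok ? StatusCode::Success : StatusCode::InvalidValue;
}
```

With this pattern, `CHECK_STATUS(createResource(true));` is a no-op, while a failing call throws an exception whose message names the expression and its source location.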

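2.2 Requesting memory with CalcRequirements

CalcRequirements only computes the shape, strides and memory needs of the tensor; the buffer itself is allocated in the next step.

    nvcv::Tensor::Requirements inReqs
        = nvcv::Tensor::CalcRequirements(batchSize, {maxImageWidth, maxImageHeight}, nvcv::FMT_RGB8);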

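2.3 Reserving a buffer for the input

There are two basic cases: (1) the size of the images to be processed (H, W, C) is known; (2) the size is hard to compute by hand.

(1) The size of the images to be processed (H, W, C) is known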

    // Allocating memory for input image batch
    nvcv::TensorDataStridedCuda::Buffer inBuf;
    inBuf.strides[3] = sizeof(uint8_t);
    inBuf.strides[2] = maxChannels * inBuf.strides[3];
    inBuf.strides[1] = maxImageWidth * inBuf.strides[2];
    inBuf.strides[0] = maxImageHeight * inBuf.strides[1];
    CHECK_CUDA_ERROR(cudaMallocAsync(&inBuf.basePtr, batchSize * inBuf.strides[0], stream));
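
The stride arithmetic above can be checked on the host. The sketch below reproduces it in plain C++ for a densely packed NHWC layout; `packedNhwcStrides` and `batchBytes` are hypothetical helper names, not part of CV-CUDA.

```cpp
#include <array>
#include <cstdint>

// Byte strides for a densely packed NHWC tensor:
// index 0 = image (N), 1 = row (H), 2 = pixel (W), 3 = channel (C).
inline std::array<std::int64_t, 4> packedNhwcStrides(std::int64_t height, std::int64_t width,
                                                     std::int64_t channels, std::int64_t elemSize)
{
    std::array<std::int64_t, 4> strides{};
    strides[3] = elemSize;              // one channel value
    strides[2] = channels * strides[3]; // one pixel
    strides[1] = width * strides[2];    // one row
    strides[0] = height * strides[1];   // one image
    return strides;
}

// Total number of bytes needed for a batch of images with these strides.
inline std::int64_t batchBytes(std::int64_t batchSize, const std::array<std::int64_t, 4> &strides)
{
    return batchSize * strides[0];
}
```

For a batch of 4 RGB uint8 images of 1280 x 720, `packedNhwcStrides(720, 1280, 3, 1)` gives a row stride of 3840 bytes and an image stride of 2764800 bytes, so `batchBytes` returns 11059200.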

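(2) The size is hard to compute by hand

Taking the model input as an example, CalcTotalSizeBytes is called to compute the required memory from the tensor requirements.

    // reqsInputLayer is the nvcv::Tensor::Requirements of the model input,
    // obtained from CalcRequirements as in 2.2
    // Calculates the total buffer size needed based on the requirements
    int64_t inputLayerSize = CalcTotalSizeBytes(nvcv::Requirements{reqsInputLayer.mem}.cudaMem());
    nvcv::TensorDataStridedCuda::Buffer bufInputLayer;
    // Copy the strides from the requirements instead of computing them by hand
    std::copy(reqsInputLayer.strides, reqsInputLayer.strides + NVCV_TENSOR_MAX_RANK, bufInputLayer.strides);
    // Allocate the buffer for the tensor
    CHECK_CUDA_ERROR(cudaMalloc(&bufInputLayer.basePtr, inputLayerSize));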
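
One reason the size can be hard to compute by hand is that strides chosen by a library may include padding, so multiplying width, height and channels no longer gives the real footprint. The sketch below illustrates this for a planar NCHW float layout. The 64-byte row alignment and the names `alignUp` and `paddedNchwStrides` are assumptions for illustration only and do not describe CV-CUDA internals.

```cpp
#include <array>
#include <cstdint>

// Rounds value up to the next multiple of alignment (alignment > 0).
inline std::int64_t alignUp(std::int64_t value, std::int64_t alignment)
{
    return (value + alignment - 1) / alignment * alignment;
}

// Byte strides for a planar NCHW tensor whose rows are padded to rowAlign bytes:
// index 0 = image (N), 1 = plane (C), 2 = row (H), 3 = element (W).
inline std::array<std::int64_t, 4> paddedNchwStrides(std::int64_t channels, std::int64_t height,
                                                     std::int64_t width, std::int64_t elemSize,
                                                     std::int64_t rowAlign)
{
    std::array<std::int64_t, 4> strides{};
    strides[3] = elemSize;                            // one element
    strides[2] = alignUp(width * elemSize, rowAlign); // one padded row
    strides[1] = height * strides[2];                 // one plane
    strides[0] = channels * strides[1];               // one image
    return strides;
}
```

With 3 float planes of 225 x 224 and the assumed 64-byte alignment, each row takes 960 bytes instead of 900, and one image takes 645120 bytes instead of the 604800 a packed layout would need.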

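2.4 Loading the images into memory

As before, there are two options: (1) nvJPEG (2) OpenCV. The call below passes the GPU input buffer to the NvDecode helper, which fills it with the decoded images.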

    uint8_t *gpuInput = reinterpret_cast<uint8_t *>(inBuf.basePtr);
    NvDecode(imagePath, batchSize, totalImages, outputFormat, gpuInput);

Only the input side needs this step; the buffers reserved for intermediate results inside the model pipeline do not.
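
For the OpenCV option, one possible approach is to decode on the host and then copy each image into its slot of the batch buffer. The sketch below shows only the slot and row arithmetic, using host memory on both sides. `copyIntoBatchSlot` is a hypothetical helper, and it assumes that every slot is sized for maxHeight x maxWidth and that smaller images are placed in the top-left corner. In a real pipeline the destination would be device memory and the row copies would be replaced by a CUDA copy such as cudaMemcpy2DAsync.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copies one tightly packed HWC uint8 image into slot `index` of a batch buffer
// whose slots are sized for maxHeight x maxWidth x channels.
// Returns false if the image does not fit or a buffer is too small.
inline bool copyIntoBatchSlot(const std::vector<std::uint8_t> &image, int height, int width, int channels,
                              std::vector<std::uint8_t> &batch, int index, int maxHeight, int maxWidth)
{
    if (height > maxHeight || width > maxWidth || index < 0)
    {
        return false;
    }
    const std::size_t srcRowBytes = static_cast<std::size_t>(width) * channels;
    const std::size_t dstRowBytes = static_cast<std::size_t>(maxWidth) * channels;
    const std::size_t slotBytes   = dstRowBytes * maxHeight;
    if (image.size() < srcRowBytes * height || batch.size() < slotBytes * (index + 1))
    {
        return false;
    }
    std::uint8_t *dst = batch.data() + slotBytes * index;
    for (int row = 0; row < height; ++row)
    {
        // Source rows are packed, destination rows use the slot's row stride.
        std::memcpy(dst + row * dstRowBytes, image.data() + row * srcRowBytes, srcRowBytes);
    }
    return true;
}
```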

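2.5 Wrapping the buffer in CV-CUDA's standard format

Create an nvcv::TensorDataStridedCuda object that describes the buffer, in preparation for converting it to a tensor in the next step.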

    nvcv::TensorDataStridedCuda inData(nvcv::TensorShape{inReqs.shape, inReqs.rank, inReqs.layout},
                                       nvcv::DataType{inReqs.dtype}, inBuf);

The data must then be wrapped into an nvcv::Tensor with TensorWrapData before the NVIDIA operators can be called on it.

    nvcv::Tensor inTensor = TensorWrapData(inData);

2.6 Calling CV-CUDA operators

The operators provided by NVIDIA are listed in the CV-CUDA documentation.

The resize operator serves as an example:

    nvcv::Tensor   resizedTensor(batchSize, {inputLayerWidth, inputLayerHeight}, nvcv::FMT_RGB8);
    cvcuda::Resize resizeOp;
    resizeOp(stream, inTensor, resizedTensor, NVCV_INTERP_LINEAR);