Getting Started with CV-CUDA

1. Introduction to CV-CUDA

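NVIDIA CV-CUDA™ is an open-source project for building cloud-scale artificial intelligence (AI) imaging and computer vision (CV) applications. It uses graphics processing unit (GPU) acceleration to help developers build efficient pre- and postprocessing pipelines. According to the project, it can increase throughput by more than 10x while lowering cloud computing costs.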
Repository: https://github.com/CVCUDA/CV-CUDA/
Online documentation: https://cvcuda.github.io/

2. Installing CV-CUDA

Installing CV-CUDA places requirements on the operating system, the CUDA version, and the driver version:

Ubuntu >= 20.04 (22.04 recommended for building the documentation)
CUDA >= 11.7 (CUDA 12 required for samples)
NVIDIA driver r525 or later (r535 required for samples)

There are two ways to install CV-CUDA. Pick whichever suits your environment.

(1) Install from the tar archives

tar -xvf cvcuda-lib-<x.x.x>-<cu_ver>-<arch>-linux.tar.xz
tar -xvf cvcuda-dev-<x.x.x>-<cu_ver>-<arch>-linux.tar.xz

(2) Install from the deb packages

sudo apt install -y ./cvcuda-lib-<x.x.x>-<cu_ver>-<arch>-linux.deb
sudo apt install -y ./cvcuda-dev-<x.x.x>-<cu_ver>-<arch>-linux.deb

The default installation directory is /opt/nvidia/cvcuda0/.

3. Using CV-CUDA

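Running inference on an image generally involves three stages: preprocessing, inference, and postprocessing. Typically, pre- and postprocessing run on the CPU while inference runs on the GPU. That works fine for a handful of images, but when many processes push large numbers of images through inference, CPU utilization spikes. For inference, faster is always better (accuracy is usually a given), and the fewer resources used, the better. The inference stage itself can be accelerated with TensorRT. The pre- and postprocessing stages can likewise be moved onto the GPU, and that is where CV-CUDA comes in.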

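Here CV-CUDA is used mainly for image preprocessing (postprocessing is mostly numeric data manipulation, which can be accelerated with custom CUDA kernels). Below I show how to use CV-CUDA to preprocess images.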

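In CV-CUDA, data on the GPU is represented by nvcv::Tensor. Image preprocessing needs two tensors: one for the original input image and one for the model input data. Both can be constructed ahead of time from the original image dimensions and the model input dimensions: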

    cudaStream_t stream;
    CHECK_CUDA_ERROR(cudaStreamCreate(&stream));

    // Allocating memory for input image batch
    nvcv::TensorDataStridedCuda::Buffer inBuf;
    inBuf.strides[3] = sizeof(uint8_t);
    inBuf.strides[2] = maxChannels * inBuf.strides[3];
    inBuf.strides[1] = maxImageWidth * inBuf.strides[2];
    inBuf.strides[0] = maxImageHeight * inBuf.strides[1];
    CHECK_CUDA_ERROR(cudaMallocAsync(&inBuf.basePtr, batchSize * inBuf.strides[0], stream));

    // Describe a batch of batchSize images so the tensor matches the buffer allocated above
    nvcv::Tensor::Requirements inReqs
        = nvcv::Tensor::CalcRequirements(batchSize, {maxImageWidth, maxImageHeight}, nvcv::FMT_RGB8);

    nvcv::TensorDataStridedCuda inData(nvcv::TensorShape{inReqs.shape, inReqs.rank, inReqs.layout},
                                       nvcv::DataType{inReqs.dtype}, inBuf);
                                       
    nvcv::Tensor inTensor = TensorWrapData(inData);

    // Allocate the tensor that feeds the network input layer (planar float32 RGB)
    nvcv::Tensor::Requirements reqsInputLayer
        = nvcv::Tensor::CalcRequirements(batchSize, {inputDims.width, inputDims.height}, nvcv::FMT_RGBf32p);
    // Calculates the total buffer size needed based on the requirements
    int64_t inputLayerSize = CalcTotalSizeBytes(nvcv::Requirements{reqsInputLayer.mem}.cudaMem());
    nvcv::TensorDataStridedCuda::Buffer bufInputLayer;
    std::copy(reqsInputLayer.strides, reqsInputLayer.strides + NVCV_TENSOR_MAX_RANK, bufInputLayer.strides);
    // Allocate buffer size needed for the tensor
    CHECK_CUDA_ERROR(cudaMalloc(&bufInputLayer.basePtr, inputLayerSize));
    // Wrap the tensor as a CVCUDA tensor
    nvcv::TensorDataStridedCuda inputLayerTensorData(
        nvcv::TensorShape{reqsInputLayer.shape, reqsInputLayer.rank, reqsInputLayer.layout},
        nvcv::DataType{reqsInputLayer.dtype}, bufInputLayer);
    nvcv::Tensor inputLayerTensor = TensorWrapData(inputLayerTensorData);
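
The manual stride setup for the input buffer above is easier to follow as a standalone function. The following is a minimal CPU-only sketch of the same arithmetic for a densely packed NHWC tensor. packedNhwcStrides and elementOffset are hypothetical helper names of my own, not part of CV-CUDA.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Byte strides of a densely packed NHWC tensor, ordered {N, H, W, C}.
// Mirrors the manual inBuf.strides setup in the listing above.
// Hypothetical helper for illustration; not a CV-CUDA function.
std::array<std::int64_t, 4> packedNhwcStrides(std::int64_t height, std::int64_t width,
                                              std::int64_t channels, std::int64_t elemSize)
{
    assert(height > 0 && width > 0 && channels > 0 && elemSize > 0);
    std::array<std::int64_t, 4> strides{};
    strides[3] = elemSize;              // step between channels of one pixel
    strides[2] = channels * strides[3]; // step between neighbouring pixels in a row
    strides[1] = width * strides[2];    // step between rows
    strides[0] = height * strides[1];   // step between images in the batch
    return strides;
}

// Byte offset of element (n, y, x, c) from the base pointer.
std::int64_t elementOffset(const std::array<std::int64_t, 4> &s, std::int64_t n,
                           std::int64_t y, std::int64_t x, std::int64_t c)
{
    return n * s[0] + y * s[1] + x * s[2] + c * s[3];
}
```

For a 1920x1080 RGB8 image this gives a row stride of 5760 bytes and an image stride of 6220800 bytes, which is why the listing allocates batchSize * strides[0] bytes for the whole batch.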
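Once both tensors are defined, copy the image data from host memory into inTensor:

    // Copy one image from host memory into the first slot of the input tensor.
    // stride(0) is the byte size of one image slot, so this assumes input_image
    // is a contiguous RGB8 image of exactly maxImageWidth x maxImageHeight.
    auto image_data = inTensor.exportData<nvcv::TensorDataStridedCuda>();
    CHECK_CUDA_ERROR(cudaMemcpyAsync(image_data->basePtr(), input_image.data, image_data->stride(0),
                                     cudaMemcpyHostToDevice, stream));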

Resize
Resizing is a good example of how operators are used in CV-CUDA. The operator class for resizing is cvcuda::Resize. Before calling the operator, construct a tensor to hold its output:

nvcv::Tensor   resizedTensor(batchSize, {inputLayerWidth, inputLayerHeight}, nvcv::FMT_RGB8);

Then create a cvcuda::Resize object, resizeOp, and invoke its call operator:

    nvcv::Tensor   resizedTensor(batchSize, {inputLayerWidth, inputLayerHeight}, nvcv::FMT_RGB8);
    cvcuda::Resize resizeOp;
    resizeOp(stream, inTensor, resizedTensor, NVCV_INTERP_LINEAR);

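It really is that simple. The resized data ends up in resizedTensor. Readers interested in how it is implemented can read the source code; here I only cover basic usage.
Most other CV-CUDA operators follow the same design, so they are used in much the same way.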
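
To give a feel for what linear interpolation computes, here is a minimal CPU reference for one single-channel 8-bit image. It is a sketch under my own assumptions (half-pixel-center sampling with edge clamping); the exact sampling convention of cvcuda::Resize may differ, and resizeBilinear is a hypothetical helper, not a CV-CUDA function.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference for bilinear resizing of a single-channel 8-bit image.
// Assumption: half-pixel-center sampling with edge clamping.
std::vector<std::uint8_t> resizeBilinear(const std::vector<std::uint8_t> &src, int srcW, int srcH,
                                         int dstW, int dstH)
{
    assert(srcW > 0 && srcH > 0 && dstW > 0 && dstH > 0);
    assert(src.size() == static_cast<std::size_t>(srcW) * srcH);

    std::vector<std::uint8_t> dst(static_cast<std::size_t>(dstW) * dstH);
    const float scaleX = static_cast<float>(srcW) / dstW;
    const float scaleY = static_cast<float>(srcH) / dstH;

    for (int y = 0; y < dstH; ++y)
    {
        // Map the destination pixel center back into source coordinates
        const float fy = std::max(0.0f, (y + 0.5f) * scaleY - 0.5f);
        const int   y0 = std::min(static_cast<int>(fy), srcH - 1);
        const int   y1 = std::min(y0 + 1, srcH - 1);
        const float wy = fy - y0;
        for (int x = 0; x < dstW; ++x)
        {
            const float fx = std::max(0.0f, (x + 0.5f) * scaleX - 0.5f);
            const int   x0 = std::min(static_cast<int>(fx), srcW - 1);
            const int   x1 = std::min(x0 + 1, srcW - 1);
            const float wx = fx - x0;

            // Blend the four neighbours: first along x, then along y
            const float top = src[y0 * srcW + x0] * (1.0f - wx) + src[y0 * srcW + x1] * wx;
            const float bot = src[y1 * srcW + x0] * (1.0f - wx) + src[y1 * srcW + x1] * wx;
            const float v   = top * (1.0f - wy) + bot * wy;
            dst[static_cast<std::size_t>(y) * dstW + x] = static_cast<std::uint8_t>(std::lround(v));
        }
    }
    return dst;
}
```

Resizing an image to its own size returns it unchanged, and a constant image stays constant at any target size, which makes this handy as a sanity check when comparing against GPU output on small inputs.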
ConvertTo
Next is ConvertTo, an operator that converts the data type. First create a tensor:

nvcv::Tensor      floatTensor(batchSize, {inputLayerWidth, inputLayerHeight}, nvcv::FMT_RGBf32);

Then call convertOp():

    cvcuda::ConvertTo convertOp;
    // The last two arguments are the scale and offset applied to each value
    convertOp(stream, resizedTensor, floatTensor, 1.0f, 0.0f);
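
The arithmetic behind this conversion is small enough to write out on the CPU. The sketch below assumes the operator computes out = in * scale + offset for each element, which is how I read the two trailing arguments of convertOp above. convertToFloat is a hypothetical helper, not a CV-CUDA function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference for a uint8 -> float32 conversion with scaling.
// Assumption: out = in * scale + offset, applied element-wise.
std::vector<float> convertToFloat(const std::vector<std::uint8_t> &in, float scale, float offset)
{
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        out[i] = static_cast<float>(in[i]) * scale + offset;
    }
    return out;
}
```

With scale = 1.0f / 255.f and offset = 0.0f, pixel values in [0, 255] map to roughly [0, 1], which is the setting used when preparing network input.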

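I won't go through the remaining operators one by one. The full sequence, including normalization and the interleaved-to-planar layout conversion, looks like this: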

    // Resize to the dimensions of input layer of network
    nvcv::Tensor   resizedTensor(batchSize, {inputLayerWidth, inputLayerHeight}, nvcv::FMT_RGB8);
    cvcuda::Resize resizeOp;
    resizeOp(stream, inTensor, resizedTensor, NVCV_INTERP_LINEAR);

    // Convert to data format expected by network (F32). Apply scale 1/255f.
    nvcv::Tensor      floatTensor(batchSize, {inputLayerWidth, inputLayerHeight}, nvcv::FMT_RGBf32);
    cvcuda::ConvertTo convertOp;
    convertOp(stream, resizedTensor, floatTensor, 1.0f / 255.f, 0.0f);

    // The input to the network needs to be normalized based on the mean and std deviation values
    // to standardize the input data.

    // Create a Tensor to store the standard deviation values for R,G,B
    nvcv::Tensor::Requirements reqsScale       = nvcv::Tensor::CalcRequirements(1, {1, 1}, nvcv::FMT_RGBf32);
    int64_t                    scaleBufferSize = CalcTotalSizeBytes(nvcv::Requirements{reqsScale.mem}.cudaMem());
    nvcv::TensorDataStridedCuda::Buffer bufScale;
    std::copy(reqsScale.strides, reqsScale.strides + NVCV_TENSOR_MAX_RANK, bufScale.strides);
    CHECK_CUDA_ERROR(cudaMalloc(&bufScale.basePtr, scaleBufferSize));
    nvcv::TensorDataStridedCuda scaleIn(nvcv::TensorShape{reqsScale.shape, reqsScale.rank, reqsScale.layout},
                                        nvcv::DataType{reqsScale.dtype}, bufScale);

    nvcv::Tensor stddevTensor = TensorWrapData(scaleIn);

    // Create a Tensor to store the mean values for R,G,B
    nvcv::TensorDataStridedCuda::Buffer bufBase;
    nvcv::Tensor::Requirements          reqsBase       = nvcv::Tensor::CalcRequirements(1, {1, 1}, nvcv::FMT_RGBf32);
    int64_t                             baseBufferSize = CalcTotalSizeBytes(nvcv::Requirements{reqsBase.mem}.cudaMem());
    std::copy(reqsBase.strides, reqsBase.strides + NVCV_TENSOR_MAX_RANK, bufBase.strides);
    CHECK_CUDA_ERROR(cudaMalloc(&bufBase.basePtr, baseBufferSize));
    nvcv::TensorDataStridedCuda baseIn(nvcv::TensorShape{reqsBase.shape, reqsBase.rank, reqsBase.layout},
                                       nvcv::DataType{reqsBase.dtype}, bufBase);

    nvcv::Tensor meanTensor = TensorWrapData(baseIn);

    // Copy the values from Host to Device
    // The R,G,B scale and mean will be applied to all the pixels across the batch of input images
    float stddev[3]  = {0.229f, 0.224f, 0.225f};
    float mean[3]    = {0.485f, 0.456f, 0.406f};
    auto  meanData   = meanTensor.exportData<nvcv::TensorDataStridedCuda>();
    auto  stddevData = stddevTensor.exportData<nvcv::TensorDataStridedCuda>();

    // Flag to set the scale value as standard deviation i.e use 1/scale
    uint32_t flags = CVCUDA_NORMALIZE_SCALE_IS_STDDEV;
    CHECK_CUDA_ERROR(cudaMemcpyAsync(stddevData->basePtr(), stddev, 3 * sizeof(float), cudaMemcpyHostToDevice, stream));
    CHECK_CUDA_ERROR(cudaMemcpyAsync(meanData->basePtr(), mean, 3 * sizeof(float), cudaMemcpyHostToDevice, stream));

    nvcv::Tensor normTensor(batchSize, {inputLayerWidth, inputLayerHeight}, nvcv::FMT_RGBf32);

    // Normalize
    cvcuda::Normalize normOp;
    normOp(stream, floatTensor, meanTensor, stddevTensor, normTensor, 1.0f, 0.0f, 0.0f, flags);

    // Convert the data layout from interleaved to planar, writing into the
    // network input tensor (inputLayerTensor) allocated earlier
    cvcuda::Reformat reformatOp;
    reformatOp(stream, normTensor, inputLayerTensor);
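
To check GPU results on a small image, the last two steps of the listing can be mirrored on the CPU. This is a minimal sketch, assuming that with CVCUDA_NORMALIZE_SCALE_IS_STDDEV the operator computes out = (in - mean) / sqrt(stddev^2 + epsilon) * globalScale + shift per channel, and that the layout conversion simply moves interleaved (HWC) data to planar (CHW). normalizeToPlanar is a hypothetical helper of my own, not a CV-CUDA function.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference: per-channel normalization of one interleaved (HWC) float RGB
// image, written out in planar (CHW) layout.
// Assumption: out = (in - mean) / sqrt(stddev^2 + epsilon) * globalScale + shift
std::vector<float> normalizeToPlanar(const std::vector<float> &hwc, int width, int height,
                                     const std::array<float, 3> &mean,
                                     const std::array<float, 3> &stddev,
                                     float globalScale = 1.0f, float shift = 0.0f,
                                     float epsilon = 0.0f)
{
    assert(width > 0 && height > 0);
    const std::size_t pixels = static_cast<std::size_t>(width) * height;
    assert(hwc.size() == pixels * 3);

    std::vector<float> chw(pixels * 3);
    for (std::size_t c = 0; c < 3; ++c)
    {
        const float invStd = 1.0f / std::sqrt(stddev[c] * stddev[c] + epsilon);
        for (std::size_t p = 0; p < pixels; ++p)
        {
            // Read channel c of pixel p (interleaved), write into plane c (planar)
            chw[c * pixels + p] = (hwc[p * 3 + c] - mean[c]) * invStd * globalScale + shift;
        }
    }
    return chw;
}
```

Calling it with mean {0.485f, 0.456f, 0.406f} and stddev {0.229f, 0.224f, 0.225f} corresponds to the values in the listing. The output holds all R values first, then all G, then all B.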

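For the details of each function's parameters, see the documentation. That wraps up this brief introduction to CV-CUDA programming, shared as a starting point.
Have you written any code today?
See you next time!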
