Hardware Decoding of Video Streams with CUDA and FFMPEG

This article describes how to obtain an H.264 RTSP stream with FFMPEG (a local video file works just as well) and decode it in hardware with CUDA. Tutorials in other blogs either give only vague pseudocode or are built around D3D for display, which makes their source code so complex that the core CUDA decoding framework is hard to see. This article strips those non-essential parts away; playback is done simply by converting to an OpenCV cv::Mat and displaying that.

Of course this post also draws on other blogs. Genuinely original material is hard to find on CSDN, most of what is there being reposts, so please do keep contributing original work.

This article is organized in two parts:
1. The core principles and framework of CUDA hardware decoding;
2. The implementation of the core decoding functions.

1 Core Principles and Framework of CUDA Hardware Decoding
Anyone who has done FFMPEG decoding work will be familiar with avcodec_decode_video2(), which decodes a data packet (AVPacket) read from the video stream into an AVFrame that holds the decoded data. The core idea of CUDA hardware decoding is to replace that function: the CUDA decoding interfaces are driven through callback functions, so the decoding work moves from the CPU to the GPU.
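To make the idea concrete, here is a minimal sketch (my own simplification, not code from the SDK sample) of how packets demuxed by FFMPEG are pushed into an NVCUVID parser with cuvidParseVideoData(); the parser then drives the callbacks described in section 1.2. It assumes fmt_ctx and video_stream_index came from avformat_open_input()/avformat_find_stream_info() and that parser_ was created as in section 1.2; depending on the container, the packets may additionally need the h264_mp4toannexb bitstream filter before being handed to the parser.

extern "C" {
#include <libavformat/avformat.h>
}
#include <nvcuvid.h>
#include <cstring>

// Demux with FFMPEG and feed the raw H.264 packets to the CUVID parser.
void FeedParser(AVFormatContext *fmt_ctx, int video_stream_index, CUvideoparser parser_)
{
    AVPacket pkt;
    while (av_read_frame(fmt_ctx, &pkt) >= 0)
    {
        if (pkt.stream_index == video_stream_index)
        {
            CUVIDSOURCEDATAPACKET cupkt;
            memset(&cupkt, 0, sizeof(cupkt));
            cupkt.payload_size = pkt.size;
            cupkt.payload      = pkt.data;
            cupkt.flags        = CUVID_PKT_TIMESTAMP;
            cupkt.timestamp    = pkt.pts;
            // The parser invokes HandlePictureDecode()/HandlePictureDisplay() from inside this call.
            cuvidParseVideoData(parser_, &cupkt);
        }
        av_packet_unref(&pkt);
    }
    // Flush: tell the parser the stream has ended so the remaining frames are emitted.
    CUVIDSOURCEDATAPACKET eos;
    memset(&eos, 0, sizeof(eos));
    eos.flags = CUVID_PKT_ENDOFSTREAM;
    cuvidParseVideoData(parser_, &eos);
}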
One blog post provides a good basic framework, which this article also draws on: it modifies the original VideoSource so that the video stream is obtained through ffmpeg, while the CUDA decoding part is structured as shown in the figure below.
1.1 VideoSource
VideoSourceData contains a CUvideoparser and a FrameQueue. As the figure shows, CUvideoparser is an interface wrapper built on top of VideoDecoder, and VideoSource decodes through this CUvideoparser. FrameQueue is the queue that stores the hardware-decoded images; note that the decoded images live in GPU memory, and the VideoDecoder member function mapFrame exposes each decoded frame as a device pointer from which it can then be copied to host memory.
1.2 VideoParser
The most important parts of VideoParser are its three callbacks: static int CUDAAPI HandleVideoSequence(void *pUserData, CUVIDEOFORMAT *pFormat), HandlePictureDecode(void *pUserData, CUVIDPICPARAMS *pPicParams), and HandlePictureDisplay(void *pUserData, CUVIDPARSERDISPINFO *pPicParams), which handle video format changes, picture decoding, and post-decode display respectively. HandleVideoSequence mainly validates the video format and does little else; the decode callback HandlePictureDecode calls the VideoDecoder decode function (the CUDA interface); and the display callback HandlePictureDisplay pushes the decoded GPU frame into the FrameQueue.
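For concreteness, here is a minimal sketch of how such a parser might be created (the field names come from the NVCUVID header; the VideoSourceData type, the callback bodies and the numeric values are placeholders of my own rather than the sample's exact code):

#include <nvcuvid.h>
#include <cstring>

struct VideoSourceData;   // holds the FrameQueue and VideoDecoder; passed to every callback

static int CUDAAPI HandleVideoSequence(void *pUserData, CUVIDEOFORMAT *pFormat);
static int CUDAAPI HandlePictureDecode(void *pUserData, CUVIDPICPARAMS *pPicParams);
static int CUDAAPI HandlePictureDisplay(void *pUserData, CUVIDPARSERDISPINFO *pDispInfo);

CUvideoparser CreateParser(VideoSourceData *user_data)
{
    CUVIDPARSERPARAMS params;
    memset(&params, 0, sizeof(params));
    params.CodecType              = cudaVideoCodec_H264;   // H.264 elementary stream
    params.ulMaxNumDecodeSurfaces = 20;                    // decode surfaces the parser may reference
    params.ulMaxDisplayDelay      = 1;                     // keep display latency low
    params.pUserData              = user_data;             // handed back as pUserData in each callback
    params.pfnSequenceCallback    = HandleVideoSequence;   // format check / decoder (re)creation
    params.pfnDecodePicture       = HandlePictureDecode;   // calls the VideoDecoder (cuvidDecodePicture)
    params.pfnDisplayPicture      = HandlePictureDisplay;  // pushes the decoded frame into FrameQueue

    CUvideoparser parser = NULL;
    cuvidCreateVideoParser(&parser, &params);
    return parser;
}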
1.3 VideoDecoder
This class is the core hardware-decoding class. CUVIDDECODECREATEINFO oVideoDecodeCreateInfo_ is the structure describing the decoder to be created, and CUvideodecoder oDecoder_ is the innermost CUDA hardware decoder handle. The decoding done by VideoParser is ultimately implemented on top of this CUvideodecoder kernel (the layers of wrapping make the source a bit convoluted, so understanding the mechanism takes some patience).
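To illustrate, a minimal sketch of filling CUVIDDECODECREATEINFO and creating the decoder might look as follows (the surface counts and deinterlace mode are illustrative; the real class derives them from the CUVIDEOFORMAT reported by HandleVideoSequence):

#include <nvcuvid.h>
#include <cstring>

CUvideodecoder CreateDecoder(unsigned int coded_width, unsigned int coded_height,
                             CUvideoctxlock ctx_lock)
{
    CUVIDDECODECREATEINFO info;
    memset(&info, 0, sizeof(info));
    info.CodecType           = cudaVideoCodec_H264;
    info.ulWidth             = coded_width;
    info.ulHeight            = coded_height;
    info.ulNumDecodeSurfaces = 20;                          // ring of decode surfaces in GPU memory
    info.ChromaFormat        = cudaVideoChromaFormat_420;
    info.OutputFormat        = cudaVideoSurfaceFormat_NV12; // this is why the frames come out as NV12
    info.DeinterlaceMode     = cudaVideoDeinterlaceMode_Adaptive;
    info.ulTargetWidth       = coded_width;
    info.ulTargetHeight      = coded_height;
    info.ulNumOutputSurfaces = 2;                           // surfaces that can be mapped at one time
    info.ulCreationFlags     = cudaVideoCreate_PreferCUVID; // use the dedicated decode engine
    info.vidLock             = ctx_lock;                    // serializes access to the CUDA context

    CUvideodecoder decoder = NULL;
    cuvidCreateDecoder(&decoder, &info);
    return decoder;
}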

2 Implementation of the Core Decoding Module
In the SDK sample, NvDecodeD3D9.cpp creates the D3D environment and initializes the CUDA modules. Its function for fetching a video frame for display is shown below: it dequeues an image from the decoded-frame queue (in reality a device pointer), performs the format conversion (NV12 to ARGB), and finally maps the result to a D3D texture for display. I have added comments explaining the key parts.

bool copyDecodedFrameToTexture(unsigned int &nRepeats, int bUseInterop, int *pbIsProgressive)
{
    CUVIDPARSERDISPINFO oDisplayInfo;

    if (g_pFrameQueue->dequeue(&oDisplayInfo))
    {
    CCtxAutoLock lck(g_CtxLock);
    // Push the current CUDA context (only if we are using CUDA decoding path)
    CUresult result = cuCtxPushCurrent(g_oContext);
    // Device pointers to the decoded frames; note that the data is stored in NV12 format
    CUdeviceptr  pDecodedFrame[3] = { 0, 0, 0 };
    // Device pointers used for the post-decode format conversion
    CUdeviceptr  pInteropFrame[3] = { 0, 0, 0 };

    *pbIsProgressive = oDisplayInfo.progressive_frame;
    g_bIsProgressive = oDisplayInfo.progressive_frame ? true : false;

    int num_fields = 1;
    if (g_bUseVsync) {            
        num_fields = std::min(2 + oDisplayInfo.repeat_first_field, 3);            
    }
    nRepeats = num_fields;

    CUVIDPROCPARAMS oVideoProcessingParameters;
    memset(&oVideoProcessingParameters, 0, sizeof(CUVIDPROCPARAMS));

    oVideoProcessingParameters.progressive_frame = oDisplayInfo.progressive_frame;
    oVideoProcessingParameters.top_field_first = oDisplayInfo.top_field_first;
    oVideoProcessingParameters.unpaired_field = (oDisplayInfo.repeat_first_field < 0);

    for (int active_field = 0; active_field < num_fields; active_field++)
    {
        unsigned int nDecodedPitch = 0;
        unsigned int nWidth = 0;
        unsigned int nHeight = 0;

        oVideoProcessingParameters.second_field = active_field;

        // map decoded video frame to CUDA surface
        // Call the VideoDecoder mapping function to get the device address of the decoded
        // frame together with the crucial pitch parameter
        if (g_pVideoDecoder->mapFrame(oDisplayInfo.picture_index, &pDecodedFrame[active_field], &nDecodedPitch, &oVideoProcessingParameters) != CUDA_SUCCESS)
        {
            // release the frame, so it can be re-used in decoder
            g_pFrameQueue->releaseFrame(&oDisplayInfo);

            // Detach from the Current thread
            checkCudaErrors(cuCtxPopCurrent(NULL));

            return false;
        }
        nWidth  = g_pVideoDecoder->targetWidth();
        nHeight = g_pVideoDecoder->targetHeight();
        // map DirectX texture to CUDA surface
        size_t nTexturePitch = 0;

        // If we are Encoding and this is the 1st Frame, we make sure we allocate system memory for readbacks
        if (g_bReadback && g_bFirstFrame && g_ReadbackSID)
        {
            CUresult result;
            checkCudaErrors(result = cuMemAllocHost((void **)&g_pFrameYUV[0], (nDecodedPitch * nHeight + nDecodedPitch*nHeight/2)));
            checkCudaErrors(result = cuMemAllocHost((void **)&g_pFrameYUV[1], (nDecodedPitch * nHeight + nDecodedPitch*nHeight/2)));
            checkCudaErrors(result = cuMemAllocHost((void **)&g_pFrameYUV[2], (nDecodedPitch * nHeight + nDecodedPitch*nHeight/2)));
            checkCudaErrors(result = cuMemAllocHost((void **)&g_pFrameYUV[3], (nDecodedPitch * nHeight + nDecodedPitch*nHeight/2)));
            checkCudaErrors(result = cuMemAllocHost((void **)&g_pFrameYUV[4], (nDecodedPitch * nHeight + nDecodedPitch*nHeight / 2)));
            checkCudaErrors(result = cuMemAllocHost((void **)&g_pFrameYUV[5], (nDecodedPitch * nHeight + nDecodedPitch*nHeight / 2)));

            g_bFirstFrame = false;

            if (result != CUDA_SUCCESS)
            {
                printf("cuMemAllocHost returned %d\n", (int)result);
                checkCudaErrors(result);
            }
        }

        // If streams are enabled, we can perform the readback to the host while the kernel is executing
        if (g_bReadback && g_ReadbackSID)
        {
            CUresult result = cuMemcpyDtoHAsync(g_pFrameYUV[active_field], pDecodedFrame[active_field], (nDecodedPitch * nHeight * 3 / 2), g_ReadbackSID);

            if (result != CUDA_SUCCESS)
            {
                printf("cuMemAllocHost returned %d\n", (int)result);
                checkCudaErrors(result);
            }
        }

#if ENABLE_DEBUG_OUT
        printf("%s = %02d, PicIndex = %02d, OutputPTS = %08d\n",
               (oDisplayInfo.progressive_frame ? "Frame" : "Field"),
               g_DecodeFrameCount, oDisplayInfo.picture_index, oDisplayInfo.timestamp);
#endif

        if (g_pImageDX)
        {
            // map the texture surface
            g_pImageDX->map(&pInteropFrame[active_field], &nTexturePitch, active_field);
        }
        else
        {
            pInteropFrame[active_field] = g_pInteropFrame[active_field];
            nTexturePitch = g_pVideoDecoder->targetWidth() * 2;
        }

        // perform post processing on the CUDA surface (performs color space conversion and post processing)
        // comment this out if we include the line of code seen above
        // Call the CUDA post-processing module to convert NV12 to ARGB; this module is fairly
        // involved, and a simpler approach is given later in this article
        cudaPostProcessFrame(&pDecodedFrame[active_field], nDecodedPitch, &pInteropFrame[active_field], 
                             nTexturePitch, g_pCudaModule->getModule(), g_kernelNV12toARGB, g_KernelSID);

        if (g_pImageDX)
        {
            // unmap the texture surface
            g_pImageDX->unmap(active_field);
        }

        // unmap video frame
        // unmapFrame() synchronizes with the VideoDecode API (ensures the frame has finished decoding)
        g_pVideoDecoder->unmapFrame(pDecodedFrame[active_field]);                  
        g_DecodeFrameCount++;

        if (g_bWriteFile)
        {
            checkCudaErrors(cuStreamSynchronize(g_ReadbackSID));
            SaveFrameAsYUV(g_pFrameYUV[active_field + 3],
                g_pFrameYUV[active_field],
                nWidth, nHeight, nDecodedPitch);
        }
    }

    // Detach from the Current thread
    checkCudaErrors(cuCtxPopCurrent(NULL));
    // release the frame, so it can be re-used in decoder
    g_pFrameQueue->releaseFrame(&oDisplayInfo);
}
else
{
    // Frame Queue has no frames, we don't compute FPS until we start
    return false;
}

// check if decoding has come to an end.
// if yes, signal the app to shut down.
if (!g_pVideoSource->isStarted() && g_pFrameQueue->isEndOfDecode() && g_pFrameQueue->isEmpty())
{
    // Let's free the Frame Data
    if (g_ReadbackSID && g_pFrameYUV)
    {
        cuMemFreeHost((void *)g_pFrameYUV[0]);
        cuMemFreeHost((void *)g_pFrameYUV[1]);
        cuMemFreeHost((void *)g_pFrameYUV[2]);
        cuMemFreeHost((void *)g_pFrameYUV[3]);
        cuMemFreeHost((void *)g_pFrameYUV[4]);
        cuMemFreeHost((void *)g_pFrameYUV[5]);
        g_pFrameYUV[0] = NULL;
        g_pFrameYUV[1] = NULL;
        g_pFrameYUV[2] = NULL;
        g_pFrameYUV[3] = NULL;
        g_pFrameYUV[4] = NULL;
        g_pFrameYUV[5] = NULL;
    }

    // Let's just stop, and allow the user to quit, so they can at least see the results
    g_pVideoSource->stop();

    // If we want to loop reload the video file and restart
    if (g_bLoop && !g_bAutoQuit)
    {
        HRESULT hr = reinitCudaResources();
        if (SUCCEEDED(hr))
        {
            g_FrameCount = 0;
            g_DecodeFrameCount = 0;
            g_pVideoSource->start();
        }
    }

    if (g_bAutoQuit)
    {
        g_bDone = true;
    }
}
return true;
}

The module above is entangled with D3D, which makes it hard to pick out the core logic for retrieving the decoded image data. Below is my method for retrieving the image data into an OpenCV Mat:
bool GetGpuDecodeFrame(shared_ptr ptr_video_stream, Mat &frame)
{
CUVIDPARSERDISPINFO oDisplayInfo;

if (ptr_video_stream->p_cuda_frame_queue)
{
    if (ptr_video_stream->p_cuda_frame_queue->FrameNumInQueue()>0 && ptr_video_stream->p_cuda_frame_queue->dequeue(&oDisplayInfo))
    {
        CCtxAutoLock lck(cuvideo_ctx_lock_);
        CUresult result = cuCtxPushCurrent(cuda_context_);
        CUdeviceptr  pDecodedFrame = 0;
        int num_fields = 1;

        CUVIDPROCPARAMS oVideoProcessingParameters;
        memset(&oVideoProcessingParameters, 0, sizeof(CUVIDPROCPARAMS));

        oVideoProcessingParameters.progressive_frame = oDisplayInfo.progressive_frame;
        oVideoProcessingParameters.top_field_first = oDisplayInfo.top_field_first;
        oVideoProcessingParameters.unpaired_field = (oDisplayInfo.repeat_first_field < 0);
        oVideoProcessingParameters.second_field = 0;

        unsigned int nDecodedPitch = 0;
        unsigned int nWidth = 0;
        unsigned int nHeight = 0;
        // Look up the GPU device address of the decoded image data
        if (ptr_video_stream->p_cuda_video_decoder->mapFrame(oDisplayInfo.picture_index, &pDecodedFrame, &nDecodedPitch, &oVideoProcessingParameters) != CUDA_SUCCESS)
        {
            // release the frame, so it can be re-used in decoder
            ptr_video_stream->p_cuda_frame_queue->releaseFrame(&oDisplayInfo);
            // Detach from the Current thread
            checkCudaErrors(cuCtxPopCurrent(NULL));
            return false;
        }
        nWidth = ptr_video_stream->p_cuda_video_decoder->targetWidth();
        nHeight = ptr_video_stream->p_cuda_video_decoder->targetHeight();
        Mat raw_frame = Mat::zeros(cvSize(nWidth, nHeight), CV_8UC3);
        // Convert the NV12 data in device memory to RGB directly on the GPU and copy the result to host memory
        VasGpuBoost::ColorConvert::Get()->ConvertD2HYUV422pToRGB24((uchar*)pDecodedFrame, raw_frame.data, nWidth, nHeight, nDecodedPitch);
        resize(raw_frame, frame, cvSize(ptr_video_stream->decode_param_.dst_width(), ptr_video_stream->decode_param_.dst_height()));
        raw_frame.release();
        ptr_video_stream->p_cuda_video_decoder->unmapFrame(pDecodedFrame);
        checkCudaErrors(cuCtxPopCurrent(NULL));
        ptr_video_stream->p_cuda_frame_queue->releaseFrame(&oDisplayInfo);

        pts = 0;   // pts and No are not declared in this excerpt
        No = 0;
        return true;
    }
    else return false;
}
else return false;
}
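
A minimal usage sketch of the function above (my own; VideoStreamContext is a placeholder for whatever type ptr_video_stream actually carries in my interface) pulls frames in a loop and shows them with OpenCV:

#include <memory>
#include <opencv2/opencv.hpp>
using namespace cv;
using namespace std;

struct VideoStreamContext;   // placeholder for the real stream-context type
bool GetGpuDecodeFrame(shared_ptr<VideoStreamContext> ptr_video_stream, Mat &frame);

void PlayLoop(shared_ptr<VideoStreamContext> ptr_video_stream)
{
    Mat frame;
    for (;;)
    {
        // GetGpuDecodeFrame returns false when no decoded frame is waiting in the queue.
        if (GetGpuDecodeFrame(ptr_video_stream, frame))
            imshow("decoded", frame);
        if (waitKey(1) == 27)    // ESC quits
            break;
    }
}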

The CUDA kernel used for the conversion is shown below. One point worth stressing: the pitch is not the image width but the width (in bytes) of each row as the decoded image is actually stored, and it is usually larger than the image width. The conversion itself follows another blog post on common image format conversions (常见图像格式转换).

__global__ void DevYuv420iToRgb(const uchar* yuv_data, uchar *rgb_data, const int width, const int height, const int pitch, const uchar *table_r, const uchar *table_g, const uchar *table_b)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    int64 size = pitch*height;
    int64 compute_size = width*height;

    CUDA_KERNEL_LOOP(i, compute_size)
    {
        // Compute the pixel coordinates inside the loop so they stay correct
        // even if the grid-stride loop runs more than one iteration per thread.
        int x = i % width;
        int y = i / width;

        int y_offset = pitch * y + x;
        int nv_index = y / 2 * pitch + x - x % 2;


        int v_offset = size + nv_index;
        int u_offset = v_offset + 1;

        int Y = *(yuv_data + y_offset);
        int U = *(yuv_data + u_offset);
        int V = *(yuv_data + v_offset);

        *(rgb_data + 3 * i) = table_r[(Y << 8) + V];
        *(rgb_data + 3 * i + 1) = table_g[(Y << 16) + (U << 8) + V];
        *(rgb_data + 3 * i + 2) = table_b[(Y << 8) + U];
    }
}
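
For reference, here is a simpler, self-contained sketch of the device-to-host conversion (my own code, not the VasGpuBoost implementation): it computes the NV12-to-BGR conversion directly instead of using lookup tables, then copies the packed result into host memory, e.g. straight into a cv::Mat data buffer. The decoded frame must remain mapped (between mapFrame and unmapFrame) while this runs, exactly as in GetGpuDecodeFrame above.

#include <cuda_runtime.h>

// Direct NV12 -> BGR24 conversion, one thread per pixel (no lookup tables, BT.601 approximation).
__global__ void Nv12ToBgrKernel(const unsigned char *nv12, unsigned char *bgr,
                                int width, int height, int pitch)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int Y = nv12[y * pitch + x];
    // Interleaved UV plane starts after the Y plane (pitch * height bytes)
    const unsigned char *uv = nv12 + height * pitch + (y / 2) * pitch + (x & ~1);
    int U = uv[0] - 128;
    int V = uv[1] - 128;

    int r = Y + ((91881 * V) >> 16);
    int g = Y - ((22554 * U + 46802 * V) >> 16);
    int b = Y + ((116130 * U) >> 16);

    unsigned char *p = bgr + (y * width + x) * 3;
    p[0] = min(max(b, 0), 255);
    p[1] = min(max(g, 0), 255);
    p[2] = min(max(r, 0), 255);
}

// Host wrapper: converts the mapped decoded frame (device pointer + pitch)
// and copies the packed BGR image into host memory.
void ConvertNv12DeviceToBgrHost(const unsigned char *d_nv12, unsigned char *h_bgr,
                                int width, int height, int pitch)
{
    unsigned char *d_bgr = NULL;
    cudaMalloc((void **)&d_bgr, width * height * 3);

    dim3 block(32, 8);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    Nv12ToBgrKernel<<<grid, block>>>(d_nv12, d_bgr, width, height, pitch);

    cudaMemcpy(h_bgr, d_bgr, width * height * 3, cudaMemcpyDeviceToHost);
    cudaFree(d_bgr);
}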

Summary
The aim of this article has been to lay out the CUDA hardware decoding framework. Comparative experiments show that hardware decoding is indeed powerful: a GTX 970 reaches roughly 800 fps on 720p video. On top of this framework I have implemented a CUDA-based multi-channel hardware decoding C++ interface that outputs OpenCV Mat images for downstream video analysis.
This post was written in something of a hurry; if anything is wrong or unclear, please leave me a comment.
