CUDA Concurrency (Stream Concurrency and Host-Device Concurrency)

CUDA allows the following operations to execute concurrently with one another:
Computation on the host
Computation on the device
Memory transfers from the host to the device
Memory transfers from the device to the host
Memory transfers within the memory of a single device
Memory transfers between devices

Note that this concurrency holds between the six categories; operations within a single category generally do not overlap with each other (the exception is the second one, device computation, where kernels can run concurrently by using streams). Also, not every NVIDIA GPU supports all of these forms of concurrency, so query the device properties before relying on them.

Concurrency between host and device. The following device operations are asynchronous with respect to the host, so the host can keep working while they execute (a minimal overlap sketch follows this list):
Kernel launches
Memory copies within the memory of a single device
Memory copies from host to device of a memory block of 64 KB or less
Memory copies performed by functions with the Async suffix
Memory set function calls (e.g., setting a block of device memory to zero)
Finally, setting the environment variable CUDA_LAUNCH_BLOCKING to 1 makes kernel launches synchronous; this feature is intended for debugging only.
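
A minimal sketch of this host/device overlap, assuming pinned host memory and a hypothetical kernel named scale (error checking omitted): the three asynchronous calls return immediately, the host does its own work, and only the final synchronization blocks.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *h = 0, *d = 0;
    cudaMallocHost((void **)&h, n * sizeof(float));   // pinned memory, required for truly async copies
    cudaMalloc((void **)&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // These calls only enqueue work and return to the host immediately.
    cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, stream);

    // Host computation here overlaps with the GPU work enqueued above.
    double hostSum = 0.0;
    for (int i = 0; i < 1000000; ++i) hostSum += i;

    cudaStreamSynchronize(stream);                    // block only when the result is needed
    printf("h[0] = %f, hostSum = %f\n", h[0], hostSum);

    cudaStreamDestroy(stream);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}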

Concurrency between kernels:
Devices of compute capability 2.0 or higher can execute kernels concurrently. This can be checked by querying the device property concurrentKernels; a value of 1 means the device supports it. Two further points to keep in mind:
A kernel from one CUDA context cannot run concurrently with a kernel from another CUDA context.
Kernels that use many textures or a large amount of local memory are less likely to run concurrently with other kernels.

Overlap of data transfer and kernel execution:
Query the device property asyncEngineCount; a value greater than 0 means the device supports this overlap.

Concurrent data transfers (i.e., a host-to-device copy overlapping with a device-to-host copy):
Query asyncEngineCount as well; it should equal 2 when this is supported (a short query sketch follows).
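
A short sketch of these property queries, reading the standard cudaDeviceProp fields mentioned above (error checking omitted for brevity):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int dev = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);

    // 1 means the device can run kernels from different streams concurrently.
    printf("concurrentKernels = %d\n", prop.concurrentKernels);
    // >0 means copies can overlap with kernels;
    // 2 means H2D and D2H copies can also overlap with each other.
    printf("asyncEngineCount  = %d\n", prop.asyncEngineCount);
    return 0;
}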


All of the concurrency described above is managed through streams: different streams can execute concurrently with each other, while commands within a single stream execute in issue order (a minimal multi-stream sketch follows).
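
A minimal multi-stream sketch, assuming a hypothetical kernel named scale and pinned host memory; each stream works on its own chunk, so the copy of one chunk can overlap with the kernel of another on devices that support it (error checking omitted):

#include <cuda_runtime.h>

__global__ void scale(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int nstreams = 4;
    const int chunk = 1 << 20;
    float *h = 0, *d = 0;
    cudaMallocHost((void **)&h, nstreams * chunk * sizeof(float));  // pinned host memory
    cudaMalloc((void **)&d, nstreams * chunk * sizeof(float));

    cudaStream_t streams[nstreams];
    for (int i = 0; i < nstreams; ++i) cudaStreamCreate(&streams[i]);

    for (int i = 0; i < nstreams; ++i)
    {
        float *hp = h + i * chunk;
        float *dp = d + i * chunk;
        // Within one stream the three commands run in order;
        // across streams they are free to overlap.
        cudaMemcpyAsync(dp, hp, chunk * sizeof(float), cudaMemcpyHostToDevice, streams[i]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[i]>>>(dp, chunk);
        cudaMemcpyAsync(hp, dp, chunk * sizeof(float), cudaMemcpyDeviceToHost, streams[i]);
    }

    cudaDeviceSynchronize();                       // wait for every stream
    for (int i = 0; i < nstreams; ++i) cudaStreamDestroy(streams[i]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}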

Explicit stream synchronization (a small polling sketch follows this list):

//Wait until work in all streams has finished
cudaDeviceSynchronize()
//Wait until all tasks in the given stream have finished
cudaStreamSynchronize(cudaStream_t stream)
//Make a stream wait for an event to complete before executing further commands; the sample below uses this function. Note that the call itself is non-blocking for the host.
cudaStreamWaitEvent(cudaStream_t stream,cudaEvent_t event,unsigned int flags)
//Query whether all tasks in a stream have completed
cudaStreamQuery(cudaStream_t stream)
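
A small sketch of the query call: the host polls the stream with cudaStreamQuery instead of blocking in cudaStreamSynchronize. The busy-wait kernel spin is hypothetical, and error checking is omitted.

#include <cstdio>
#include <cuda_runtime.h>

// Busy-wait on the device for roughly the given number of clock cycles.
__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    spin<<<1, 1, 0, stream>>>(100000000LL);

    // cudaStreamQuery returns cudaErrorNotReady while work is still pending
    // and cudaSuccess once the stream has drained.
    int polls = 0;
    while (cudaStreamQuery(stream) == cudaErrorNotReady)
    {
        ++polls;   // the host could do useful work here instead of counting
    }
    printf("stream drained after %d polls\n", polls);

    cudaStreamDestroy(stream);
    return 0;
}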

The above is explicit synchronization; implicit synchronization works as follows:
Two commands from different streams cannot run concurrently if the host thread issues any of the following operations between them (a sketch of the practical consequence follows this list):
a page-locked host memory allocation
a device memory allocation
a device memory set
a memory copy between two addresses to the same device memory
any CUDA command issued to the default stream
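
A sketch of the practical consequence, with a hypothetical kernel named work: do all allocations before the concurrent region, because a cudaMalloc issued between the launches would act as an implicit synchronization point and prevent overlap (error checking omitted).

#include <cuda_runtime.h>

__global__ void work(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main()
{
    const int nstreams = 4;
    const int n = 1 << 20;
    cudaStream_t streams[nstreams];
    float *buffers[nstreams];

    // Allocate everything up front: a cudaMalloc issued between the launches
    // below would implicitly synchronize and keep the kernels from overlapping.
    for (int i = 0; i < nstreams; ++i)
    {
        cudaStreamCreate(&streams[i]);
        cudaMalloc((void **)&buffers[i], n * sizeof(float));
    }

    // These launches may overlap on devices with concurrentKernels == 1.
    for (int i = 0; i < nstreams; ++i)
        work<<<(n + 255) / 256, 256, 0, streams[i]>>>(buffers[i], n);

    cudaDeviceSynchronize();
    for (int i = 0; i < nstreams; ++i)
    {
        cudaFree(buffers[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}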

Adding a callback to a stream:
cudaStreamAddCallback(cudaStream_t stream,cudaStreamCallback_t callback,void *userData,unsigned int flags)
Note the third parameter: it is the user data handed through to the callback (see the sketch below). The definition of cudaStreamCallback_t is:
typedef void (CUDART_CB *cudaStreamCallback_t)(cudaStream_t stream, cudaError_t status, void *userData)
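
A minimal callback sketch matching the signature above; the user data is a plain string here, and note that a callback must not make CUDA calls (error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy() { }

// Runs on a host thread once everything queued before it in the stream has finished.
void CUDART_CB myCallback(cudaStream_t stream, cudaError_t status, void *userData)
{
    printf("callback fired, status = %d, userData = %s\n",
           (int)status, (const char *)userData);
}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    dummy<<<1, 1, 0, stream>>>();
    const char *msg = "batch 0";
    cudaStreamAddCallback(stream, myCallback, (void *)msg, 0);  // flags is reserved, must be 0

    cudaStreamSynchronize(stream);   // by this point the callback has run
    cudaStreamDestroy(stream);
    return 0;
}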

Stream priorities. A stream can be assigned a priority when it is created; the relevant functions are listed here, with a short sketch below:
cudaStreamCreateWithPriority()
cudaDeviceGetStreamPriorityRange()
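
A short sketch of the two calls; numerically lower values mean higher priority (error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int leastPriority = 0, greatestPriority = 0;
    cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);
    printf("stream priority range: least = %d, greatest = %d\n",
           leastPriority, greatestPriority);

    cudaStream_t highPrio, lowPrio;
    cudaStreamCreateWithPriority(&highPrio, cudaStreamNonBlocking, greatestPriority);
    cudaStreamCreateWithPriority(&lowPrio, cudaStreamNonBlocking, leastPriority);

    // ... launch latency-sensitive work on highPrio, background work on lowPrio ...

    cudaStreamDestroy(highPrio);
    cudaStreamDestroy(lowPrio);
    return 0;
}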

Events. Events serve two purposes:
The first is timing the operations in a stream. Create the event with cudaEventCreate(cudaEvent_t *event) and record a time point in a given stream with
cudaEventRecord(cudaEvent_t event, cudaStream_t stream). Because the elapsed time is reported on the host, the host has to wait for the event to complete,
so an event synchronization, cudaEventSynchronize(cudaEvent_t event), is also needed. A standard timing sequence:

cudaEvent_t start, end;
cudaEventCreate(&start);
cudaEventCreate(&end);
cudaEventRecord(start, stream);
//some stream operations
cudaEventRecord(end, stream);
cudaEventSynchronize(end);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, end);

The second is to act as a marker placed after an operation in a stream, flagging that this operation has completed.
In that case create the event with cudaEventCreateWithFlags(cudaEvent_t *event, unsigned int flags),
and flags should be set to cudaEventDisableTiming; flags has other possible values as well, see the manual.

Synchronous calls; the related function is:
cudaSetDeviceFlags(unsigned int flags): see the manual for the available flags (a brief sketch follows).
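
A brief sketch, assuming the flag is set before the runtime initializes the device; cudaDeviceScheduleBlockingSync makes the host thread block instead of spin while it waits (error checking omitted):

#include <cuda_runtime.h>

int main()
{
    // Choose how the host thread waits in synchronous calls: blocking sync
    // puts the thread to sleep instead of spinning; cudaDeviceScheduleSpin
    // and cudaDeviceScheduleYield are the other scheduling options.
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    cudaSetDevice(0);

    // ... queue work; cudaDeviceSynchronize() now blocks rather than spins ...
    cudaDeviceSynchronize();
    return 0;
}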

Concurrent kernel execution. In the full sample below, note that the streams are synchronized with each other through events, i.e., one stream waits until another stream has finished a given operation before moving on. In that case the events should be created with:

checkCudaErrors(cudaEventCreateWithFlags(&(kernelEvent[i]), cudaEventDisableTiming));

In use, the last stream is made to wait for the event recorded in each of the other streams before it runs its own work:

    for (int i=0; i<nkernels; ++i)
    {
        clock_block<<<1,1,0,streams[i]>>>(&d_a[i], time_clocks);
        total_clocks += time_clocks;
        checkCudaErrors(cudaEventRecord(kernelEvent[i], streams[i]));
        //make the last stream wait for the kernel event to be recorded
        checkCudaErrors(cudaStreamWaitEvent(streams[nstreams-1], kernelEvent[i],0));
    }
/*
 * Copyright 1993-2015 NVIDIA Corporation.  All rights reserved.
 *
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.
 *
 */

//
// This sample demonstrates the use of streams for concurrent execution. It also illustrates how to
// introduce dependencies between CUDA streams with the new cudaStreamWaitEvent function introduced
// in CUDA 3.2.
//
// Devices of compute capability 1.x will run the kernels one after another
// Devices of compute capability 2.0 or higher can overlap the kernels
//
#include <stdio.h>
#include <helper_functions.h>
#include <helper_cuda.h>

// This is a kernel that does no real work but runs at least for a specified number of clocks
__global__ void clock_block(clock_t *d_o, clock_t clock_count)
{
    unsigned int start_clock = (unsigned int) clock();

    clock_t clock_offset = 0;

    while (clock_offset < clock_count)
    {
        unsigned int end_clock = (unsigned int) clock();

        // The code below should work like
        // this (thanks to modular arithmetics):
        //
        // clock_offset = (clock_t) (end_clock > start_clock ?
        //                           end_clock - start_clock :
        //                           end_clock + (0xffffffffu - start_clock));
        //
        // Indeed, let m = 2^32 then
        // end - start = end + m - start (mod m).

        clock_offset = (clock_t)(end_clock - start_clock);
    }

    d_o[0] = clock_offset;
}


// Single warp reduction kernel
__global__ void sum(clock_t *d_clocks, int N)
{
    __shared__ clock_t s_clocks[32];

    clock_t my_sum = 0;

    for (int i = threadIdx.x; i < N; i+= blockDim.x)
    {
        my_sum += d_clocks[i];
    }

    s_clocks[threadIdx.x] = my_sum;
    __syncthreads();

    for (int i=16; i>0; i/=2)
    {
        if (threadIdx.x < i)
        {
            s_clocks[threadIdx.x] += s_clocks[threadIdx.x + i];
        }

        __syncthreads();
    }

    d_clocks[0] = s_clocks[0];
}

int main(int argc, char **argv)
{
    int nkernels = 8;               // number of concurrent kernels
    int nstreams = nkernels + 1;    // use one more stream than concurrent kernel
    int nbytes = nkernels * sizeof(clock_t);   // number of data bytes
    float kernel_time = 10; // time the kernel should run in ms
    float elapsed_time;   // timing variables
    int cuda_device = 0;

    printf("[%s] - Starting...\n", argv[0]);

    // get number of kernels if overridden on the command line
    if (checkCmdLineFlag(argc, (const char **)argv, "nkernels"))
    {
        nkernels = getCmdLineArgumentInt(argc, (const char **)argv, "nkernels");
        nstreams = nkernels + 1;
    }

    // use command-line specified CUDA device, otherwise use device with highest Gflops/s
    cuda_device = findCudaDevice(argc, (const char **)argv);

    cudaDeviceProp deviceProp;
    checkCudaErrors(cudaGetDevice(&cuda_device));

    checkCudaErrors(cudaGetDeviceProperties(&deviceProp, cuda_device));

    if ((deviceProp.concurrentKernels == 0))
    {
        printf("> GPU does not support concurrent kernel execution\n");
        printf("  CUDA kernel runs will be serialized\n");
    }

    printf("> Detected Compute SM %d.%d hardware with %d multi-processors\n",
           deviceProp.major, deviceProp.minor, deviceProp.multiProcessorCount);

    // allocate host memory
    clock_t *a = 0;                     // pointer to the array data in host memory
    checkCudaErrors(cudaMallocHost((void **)&a, nbytes));

    // allocate device memory
    clock_t *d_a = 0;             // pointers to data and init value in the device memory
    checkCudaErrors(cudaMalloc((void **)&d_a, nbytes));

    // allocate and initialize an array of stream handles
    cudaStream_t *streams = (cudaStream_t *) malloc(nstreams * sizeof(cudaStream_t));

    for (int i = 0; i < nstreams; i++)
    {
        checkCudaErrors(cudaStreamCreate(&(streams[i])));
    }

    // create CUDA event handles
    cudaEvent_t start_event, stop_event;
    checkCudaErrors(cudaEventCreate(&start_event));
    checkCudaErrors(cudaEventCreate(&stop_event));


    // the events are used for synchronization only and hence do not need to record timings
    // this also makes events not introduce global sync points when recorded which is critical to get overlap
    cudaEvent_t *kernelEvent;
    kernelEvent = (cudaEvent_t *) malloc(nkernels * sizeof(cudaEvent_t));

    for (int i = 0; i < nkernels; i++)
    {
        checkCudaErrors(cudaEventCreateWithFlags(&(kernelEvent[i]), cudaEventDisableTiming));
    }

    //////////////////////////////////////////////////////////////////////
    // time execution with nkernels streams
    clock_t total_clocks = 0;
#if defined(__arm__) || defined(__aarch64__)
    // the kernel takes more time than the channel reset time on arm archs, so to prevent hangs reduce time_clocks.
    clock_t time_clocks = (clock_t)(kernel_time * (deviceProp.clockRate / 1000));
#else
    clock_t time_clocks = (clock_t)(kernel_time * deviceProp.clockRate);
#endif

    cudaEventRecord(start_event, 0);

    // queue nkernels in separate streams and record when they are done
    for (int i=0; i<nkernels; ++i)
    {
        clock_block<<<1,1,0,streams[i]>>>(&d_a[i], time_clocks);
        total_clocks += time_clocks;
        checkCudaErrors(cudaEventRecord(kernelEvent[i], streams[i]));

        // make the last stream wait for the kernel event to be recorded
        checkCudaErrors(cudaStreamWaitEvent(streams[nstreams-1], kernelEvent[i],0));
    }

    // queue a sum kernel and a copy back to host in the last stream.
    // the commands in this stream get dispatched as soon as all the kernel events have been recorded
    sum<<<1,32,0,streams[nstreams-1]>>>(d_a, nkernels);
    checkCudaErrors(cudaMemcpyAsync(a, d_a, sizeof(clock_t), cudaMemcpyDeviceToHost, streams[nstreams-1]));

    // at this point the CPU has dispatched all work for the GPU and can continue processing other tasks in parallel

    // in this sample we just wait until the GPU is done
    checkCudaErrors(cudaEventRecord(stop_event, 0));
    checkCudaErrors(cudaEventSynchronize(stop_event));
    checkCudaErrors(cudaEventElapsedTime(&elapsed_time, start_event, stop_event));

    printf("Expected time for serial execution of %d kernels = %.3fs\n", nkernels, nkernels * kernel_time/1000.0f);
    printf("Expected time for concurrent execution of %d kernels = %.3fs\n", nkernels, kernel_time/1000.0f);
    printf("Measured time for sample = %.3fs\n", elapsed_time/1000.0f);

    bool bTestResult  = (a[0] > total_clocks);

    // release resources
    for (int i = 0; i < nkernels; i++)
    {
        cudaStreamDestroy(streams[i]);
        cudaEventDestroy(kernelEvent[i]);
    }

    free(streams);
    free(kernelEvent);

    cudaEventDestroy(start_event);
    cudaEventDestroy(stop_event);
    cudaFreeHost(a);
    cudaFree(d_a);

    if (!bTestResult)
    {
        printf("Test failed!\n");
        exit(EXIT_FAILURE);
    }

    printf("Test passed\n");
    exit(EXIT_SUCCESS);
}