CUDA Programming: CUDA Samples, 0_Introduction, asyncAPI

Author: 翊桐公羽

Last modified: 2024-06-14 16:39:08

First published: 2024-06-06 17:09:18

Article link: https://blog.csdn.net/zhujingwen_tongji/article/details/139503636

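The 0_Introduction directory of the CUDA samples contains the basic samples. asyncAPI shows how to use CUDA streams and CUDA events to overlap execution on the CPU and the GPU.

CUDA events are inserted into a CUDA stream.
Because calls issued to a CUDA stream are asynchronous, the CPU can keep computing while the GPU executes its work (including DMA memory copies between host and device).
The CPU can query a CUDA event to determine whether the GPU has finished its tasks.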


// includes, system
#include <stdio.h>

// includes CUDA Runtime
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

// includes, project
#include <helper_cuda.h>
#include <helper_functions.h>  // helper utility functions

__global__ void increment_kernel(int *g_data, int inc_value) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  g_data[idx] = g_data[idx] + inc_value;
}

bool correct_output(int *data, const int n, const int x) {
  for (int i = 0; i < n; i++)
    if (data[i] != x) {
      printf("Error! data[%d] = %d, ref = %d\n", i, data[i], x);
      return false;
    }

  return true;
}

int main(int argc, char *argv[]) {
  int devID;
  cudaDeviceProp deviceProps;

  printf("[%s] - Starting...\n", argv[0]);

  // This will pick the best possible CUDA capable device
  devID = findCudaDevice(argc, (const char **)argv);

  // get device name
  checkCudaErrors(cudaGetDeviceProperties(&deviceProps, devID));
  printf("CUDA device [%s]\n", deviceProps.name);

  int n = 16 * 1024 * 1024;
  int nbytes = n * sizeof(int);
  int value = 26;

  // allocate host memory
  int *a = 0;
  checkCudaErrors(cudaMallocHost((void **)&a, nbytes));
  memset(a, 0, nbytes);

  // allocate device memory
  int *d_a = 0;
  checkCudaErrors(cudaMalloc((void **)&d_a, nbytes));
  checkCudaErrors(cudaMemset(d_a, 255, nbytes));

  // set kernel launch configuration
  dim3 threads = dim3(512, 1);
  dim3 blocks = dim3(n / threads.x, 1);

  // create cuda event handles
  cudaEvent_t start, stop;
  checkCudaErrors(cudaEventCreate(&start));
  checkCudaErrors(cudaEventCreate(&stop));

  StopWatchInterface *timer = NULL;
  sdkCreateTimer(&timer);
  sdkResetTimer(&timer);

  checkCudaErrors(cudaDeviceSynchronize());
  float gpu_time = 0.0f;

  // asynchronously issue work to the GPU (all to stream 0)
  checkCudaErrors(cudaProfilerStart());
  sdkStartTimer(&timer);
  cudaEventRecord(start, 0);
  cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, 0);
  increment_kernel<<<blocks, threads, 0, 0>>>(d_a, value);
  cudaMemcpyAsync(a, d_a, nbytes, cudaMemcpyDeviceToHost, 0);
  cudaEventRecord(stop, 0);
  sdkStopTimer(&timer);
  checkCudaErrors(cudaProfilerStop());

  // have CPU do some work while waiting for stage 1 to finish
  unsigned long int counter = 0;

  while (cudaEventQuery(stop) == cudaErrorNotReady) {
    counter++;
  }

  checkCudaErrors(cudaEventElapsedTime(&gpu_time, start, stop));

  // print the cpu and gpu times
  printf("time spent executing by the GPU: %.2f\n", gpu_time);
  printf("time spent by CPU in CUDA calls: %.2f\n", sdkGetTimerValue(&timer));
  printf("CPU executed %lu iterations while waiting for GPU to finish\n",
         counter);

  // check the output for correctness
  bool bFinalResults = correct_output(a, n, value);

  // release resources
  checkCudaErrors(cudaEventDestroy(start));
  checkCudaErrors(cudaEventDestroy(stop));
  checkCudaErrors(cudaFreeHost(a));
  checkCudaErrors(cudaFree(d_a));

  exit(bFinalResults ? EXIT_SUCCESS : EXIT_FAILURE);
}

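Code overview:

Includes: the code includes the required system and CUDA headers.
Kernel: the code defines a CUDA kernel, increment_kernel(), which takes an integer array g_data and an increment inc_value and adds inc_value to every element of the array.
Correctness check: the code defines correct_output(), which verifies the result by comparing each value in the data array with the expected value.
Main function: main() is the program entry point and performs the following steps:
- Selects the best available CUDA device.
- Allocates host and device memory for an integer array of 16 × 1024 × 1024 (16,777,216) elements, i.e. 64 MiB.
- Sets the kernel launch configuration: 512 threads per block, with the block count computed as n / 512.
- Creates the CUDA event handles used to measure GPU execution time.
- Starts the CUDA profiler.
- Issues work to the GPU asynchronously, in this order:
  - Records the start event.
  - Copies the host array to the device array.
  - Launches increment_kernel() to increment the device array.
  - Copies the modified device array back to the host array.
  - Records the stop event.
- Busy-waits on the CPU, polling the stop event, until the GPU has finished.
- Reports the GPU time from the CUDA events and the CPU time spent issuing the calls from the host timer.
- Checks the output for correctness.
- Releases the allocated resources.
- Exits with a success or failure status depending on the correctness check.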

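CUDA API walkthrough

CUDA APIs used by this sample: cudaEventCreate, cudaEventRecord, cudaEventQuery, cudaEventDestroy, cudaEventElapsedTime, cudaMemcpyAsync

These all belong to the CUDA runtime API. We start with the concept of an event in the CUDA runtime API:

Events: the runtime provides a way to closely monitor the device's progress and to perform accurate timing, by letting the application asynchronously record events at any point in the program and query when those events have completed. An event has completed when all tasks preceding it (or, optionally, all commands in a given stream preceding it) have completed. Events in stream 0 complete only after all preceding tasks and commands in all streams have completed.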

Creating and destroying events

Create two events:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

Destroy the two events:

cudaEventDestroy(start);
cudaEventDestroy(stop);

Measuring elapsed time

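// Assumes stream[0..1], inputHost/inputDev, outputHost/outputDev and size
// are already set up, and that start/stop were created as shown above.
cudaEventRecord(start, 0);   // start marker in stream 0
for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(inputDev + i * size, inputHost + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
    MyKernel<<<100, 512, 0, stream[i]>>>
               (outputDev + i * size, inputDev + i * size, size);
    cudaMemcpyAsync(outputHost + i * size, outputDev + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);
}
cudaEventRecord(stop, 0);    // stop marker in stream 0
cudaEventSynchronize(stop);  // block the host until stop has completed
float elapsedTime;           // elapsed time in milliseconds
cudaEventElapsedTime(&elapsedTime, start, stop);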
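
The fragment above leaves its buffers, streams and kernel undeclared. The following is a minimal self-contained sketch of the same pattern, not the sample's own code: the CHECK macro, the add_one kernel, the all_equal helper and the buffer names are all hypothetical and introduced only for illustration. It assumes the default (legacy) default-stream behaviour, under which an event recorded in stream 0 completes only after the work previously issued to the other streams.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the sample): abort on any CUDA error.
#define CHECK(call)                                            \
  do {                                                         \
    cudaError_t err_ = (call);                                 \
    if (err_ != cudaSuccess) {                                 \
      fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,       \
              cudaGetErrorString(err_));                       \
      exit(EXIT_FAILURE);                                      \
    }                                                          \
  } while (0)

// Illustrative kernel: out[i] = in[i] + 1, guarded against overrun.
__global__ void add_one(int *out, const int *in, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i] + 1;
}

// Host-side check that every element equals the expected value.
bool all_equal(const int *v, int n, int expected) {
  for (int i = 0; i < n; ++i)
    if (v[i] != expected) return false;
  return true;
}

int main() {
  const int kStreams = 2;
  const int n = 1 << 20;                 // elements handled per stream
  const size_t bytes = n * sizeof(int);  // bytes handled per stream

  int *h_in, *h_out, *d_in, *d_out;
  CHECK(cudaMallocHost((void **)&h_in, kStreams * bytes));  // pinned host memory
  CHECK(cudaMallocHost((void **)&h_out, kStreams * bytes));
  CHECK(cudaMalloc((void **)&d_in, kStreams * bytes));
  CHECK(cudaMalloc((void **)&d_out, kStreams * bytes));
  for (int i = 0; i < kStreams * n; ++i) h_in[i] = 41;

  cudaStream_t stream[kStreams];
  for (int i = 0; i < kStreams; ++i) CHECK(cudaStreamCreate(&stream[i]));

  cudaEvent_t start, stop;
  CHECK(cudaEventCreate(&start));
  CHECK(cudaEventCreate(&stop));

  CHECK(cudaEventRecord(start, 0));
  for (int i = 0; i < kStreams; ++i) {
    // Offsets are in elements, copy sizes are in bytes.
    CHECK(cudaMemcpyAsync(d_in + i * n, h_in + i * n, bytes,
                          cudaMemcpyHostToDevice, stream[i]));
    add_one<<<(n + 511) / 512, 512, 0, stream[i]>>>(d_out + i * n,
                                                     d_in + i * n, n);
    CHECK(cudaGetLastError());  // catch launch-configuration errors
    CHECK(cudaMemcpyAsync(h_out + i * n, d_out + i * n, bytes,
                          cudaMemcpyDeviceToHost, stream[i]));
  }
  CHECK(cudaEventRecord(stop, 0));
  CHECK(cudaEventSynchronize(stop));  // block until stop has completed

  float ms = 0.0f;
  CHECK(cudaEventElapsedTime(&ms, start, stop));
  printf("elapsed: %.3f ms, result %s\n", ms,
         all_equal(h_out, kStreams * n, 42) ? "OK" : "WRONG");

  for (int i = 0; i < kStreams; ++i) CHECK(cudaStreamDestroy(stream[i]));
  CHECK(cudaEventDestroy(start));
  CHECK(cudaEventDestroy(stop));
  CHECK(cudaFreeHost(h_in));
  CHECK(cudaFreeHost(h_out));
  CHECK(cudaFree(d_in));
  CHECK(cudaFree(d_out));
  return 0;
}
```

Keeping the element count (n) separate from the byte count (bytes) avoids mixing pointer offsets with copy sizes, which the single size variable in the fragment above leaves ambiguous.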

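Kernel walkthrough

Each thread adds inc_value to one element of g_data: the thread with global index idx (blockIdx.x * blockDim.x + threadIdx.x) updates g_data[idx]. There is no bounds check, which is safe here because n is an exact multiple of the 512-thread block size.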
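
When the array length is not a multiple of the block size, the usual approach is to round the block count up and guard the access inside the kernel. The sketch below illustrates that idea; increment_kernel_guarded and grid_size are hypothetical names that do not appear in the sample, and n = 1000 is chosen only to exercise the guard.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Number of blocks needed to cover n elements (ceiling division).
int grid_size(int n, int threads_per_block) {
  return (n + threads_per_block - 1) / threads_per_block;
}

// Hypothetical variant of increment_kernel with a bounds check.
__global__ void increment_kernel_guarded(int *g_data, int inc_value, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) g_data[idx] += inc_value;  // threads past the end do nothing
}

int main() {
  const int n = 1000;  // deliberately not a multiple of 512
  const int threads = 512;
  int h[n];
  int *d = nullptr;

  if (cudaMalloc((void **)&d, n * sizeof(int)) != cudaSuccess) return 1;
  cudaMemset(d, 0, n * sizeof(int));

  // Two blocks are launched, so 24 threads of the second block are idle.
  increment_kernel_guarded<<<grid_size(n, threads), threads>>>(d, 26, n);

  // The blocking copy also waits for the kernel to finish.
  cudaError_t err = cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d);
  if (err != cudaSuccess) {
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }

  bool ok = true;
  for (int i = 0; i < n; ++i)
    if (h[i] != 26) ok = false;
  printf("%s\n", ok ? "all elements incremented" : "mismatch");
  return ok ? 0 : 1;
}
```

For the sample's own size, grid_size(16 * 1024 * 1024, 512) gives the same 32768 blocks as n / threads.x.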

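The main function

correct_output verifies on the CPU that every element written back by the kernel holds the expected value.

findCudaDevice is an inline helper function from helper_cuda.h. When a system has more than one CUDA device, a program usually wants to run on the most capable one. findCudaDevice() makes that choice automatically (argc and argv are passed in so that a specific device can also be requested on the command line), so the sample runs on the best available hardware.

cudaGetDeviceProperties is a runtime API call that returns information about a CUDA device, including its name and compute capability.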
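
As a small illustration of what cudaGetDeviceProperties reports, the sketch below lists every visible device without relying on the sample's helper headers. Which fields to print is a choice made here for illustration; the sample itself prints only the name.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int count = 0;
  cudaError_t err = cudaGetDeviceCount(&count);
  if (err != cudaSuccess) {
    fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
    return 1;
  }

  for (int dev = 0; dev < count; ++dev) {
    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, dev);
    if (err != cudaSuccess) {
      fprintf(stderr, "cudaGetDeviceProperties(%d): %s\n", dev,
              cudaGetErrorString(err));
      return 1;
    }
    // name, major/minor (compute capability), SM count and memory size
    // are all fields of cudaDeviceProp.
    printf("device %d: %s, compute capability %d.%d, %d SMs, %.1f GiB\n",
           dev, prop.name, prop.major, prop.minor, prop.multiProcessorCount,
           prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
  }
  return 0;
}
```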

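cudaMallocHost allocates page-locked (pinned) host memory, which cudaMemcpyAsync needs in order to run truly asynchronously with respect to the host; memset then initializes the buffer to 0.

cudaMalloc allocates device memory, and cudaMemset sets every byte of it to 255.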
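
Because cudaMemset works byte by byte, filling with 255 leaves each 4-byte int as 0xFFFFFFFF, which reads as -1 on two's-complement platforms. Presumably the sample chooses this pattern so that a host-to-device copy that never happened would show up in the final check instead of passing by accident. The sketch below only demonstrates the byte-fill behaviour; int_from_byte_fill is a hypothetical helper.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Host-side equivalent: fill an int byte by byte, as memset/cudaMemset do.
int int_from_byte_fill(int byte_value) {
  int v;
  memset(&v, byte_value, sizeof(v));
  return v;
}

int main() {
  int *d = nullptr;
  int h = 0;

  if (cudaMalloc((void **)&d, sizeof(int)) != cudaSuccess) return 1;
  cudaMemset(d, 255, sizeof(int));  // every byte becomes 0xFF

  cudaError_t err = cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d);
  if (err != cudaSuccess) {
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }

  printf("device int after cudaMemset(255): %d (0x%08X)\n", h, (unsigned)h);
  printf("host int after memset(255):       %d\n", int_from_byte_fill(255));
  return 0;
}
```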

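sdkCreateTimer and sdkResetTimer are inline helper functions that create and reset the host timer used to record CPU time.

cudaDeviceSynchronize blocks until all previously issued device work (here, the cudaMemset) has finished, so that it does not leak into the timed region.

cudaMemcpyAsync copies the data from host to device, the kernel is launched, and a second cudaMemcpyAsync copies the result from device back to host.

The code below applies the event-based timing described in the CUDA API section above to record the GPU time, while the host timer records the CPU time spent issuing the calls.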

 checkCudaErrors(cudaProfilerStart());
 sdkStartTimer(&timer);
 cudaEventRecord(start, 0);
 cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, 0);
 increment_kernel<<<blocks, threads, 0, 0>>>(d_a, value);
 cudaMemcpyAsync(a, d_a, nbytes, cudaMemcpyDeviceToHost, 0);
 cudaEventRecord(stop, 0);
 sdkStopTimer(&timer);
 checkCudaErrors(cudaProfilerStop());

 // have CPU do some work while waiting for stage 1 to finish
 unsigned long int counter = 0;

 while (cudaEventQuery(stop) == cudaErrorNotReady) {
   counter++;
 }

 checkCudaErrors(cudaEventElapsedTime(&gpu_time, start, stop));
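
The polling loop in the sample stops on any status other than cudaErrorNotReady, so a failure also ends the loop and is only reported by the later checkCudaErrors call. A slightly more defensive variant, sketched below under that reading, returns the final status so the caller can tell completion from failure. poll_event and the cut-down increment kernel are hypothetical names used only for this illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel; n must be a multiple of the block size here.
__global__ void increment(int *data, int inc_value) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  data[idx] += inc_value;
}

// Polls ev until it completes or an error occurs, counting iterations.
// Returns cudaSuccess only if the event actually completed.
cudaError_t poll_event(cudaEvent_t ev, unsigned long *iterations) {
  cudaError_t status;
  while ((status = cudaEventQuery(ev)) == cudaErrorNotReady) {
    ++(*iterations);  // stand-in for useful CPU work
  }
  return status;
}

int main() {
  const int n = 1 << 24;  // a multiple of 512
  int *d = nullptr;
  cudaEvent_t stop;

  if (cudaMalloc((void **)&d, n * sizeof(int)) != cudaSuccess) return 1;
  if (cudaEventCreate(&stop) != cudaSuccess) return 1;
  cudaMemset(d, 0, n * sizeof(int));

  increment<<<n / 512, 512>>>(d, 26);  // returns to the host immediately
  cudaEventRecord(stop, 0);            // completes after the kernel

  unsigned long counter = 0;
  cudaError_t status = poll_event(stop, &counter);
  if (status != cudaSuccess) {
    fprintf(stderr, "GPU work failed: %s\n", cudaGetErrorString(status));
    return 1;
  }
  printf("CPU executed %lu iterations while waiting for GPU to finish\n",
         counter);

  cudaEventDestroy(stop);
  cudaFree(d);
  return 0;
}
```

If the CPU has nothing useful to do in the meantime, calling cudaEventSynchronize(stop) blocks until the event completes and avoids the busy-wait altogether.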

The following releases the events and the memory:

  checkCudaErrors(cudaEventDestroy(start));
  checkCudaErrors(cudaEventDestroy(stop));
  checkCudaErrors(cudaFreeHost(a));
  checkCudaErrors(cudaFree(d_a));

Output:

Times are in milliseconds.

- Starting...
GPU Device 0: "Ada" with compute capability 8.9

CUDA device [NVIDIA GeForce RTX 4080]
time spent executing by the GPU: 8.74
time spent by CPU in CUDA calls: 4.30
CPU executed 69740 iterations while waiting for GPU to finish
