CUDA Study Notes (6): Timing Programs with Events

Article link: https://blog.csdn.net/chongbin007/article/details/123891517

https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc/
https://blog.csdn.net/litdaguang/article/details/77585011

CPU timers

clock

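#include <time.h>

clock_t start, finish;
start = clock();
// code under test
finish = clock();
// clock() reports processor time used by the program, not wall-clock time
double duration = (double)(finish - start) / CLOCKS_PER_SEC;  // seconds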

gettimeofday

#include <sys/time.h>
double cpuSecond()
{
  struct timeval tp;
  gettimeofday(&tp,NULL);
  return((double)tp.tv_sec+(double)tp.tv_usec*1e-6);
}

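  // time the kernel with the CPU timer
  double iStart, iElaps;
  iStart = cpuSecond();
  sumArraysGPU<<<grid, block>>>(a_d, b_d, res_d, nElem);
  cudaDeviceSynchronize();  // the launch is asynchronous: wait for the kernel before stopping the timer
  iElaps = cpuSecond() - iStart;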
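
The role of cudaDeviceSynchronize() in the snippet above is easier to see with a side-by-side measurement. The following is a minimal sketch, not a benchmark: the kernel `scale`, the array size, and the warm-up launch are assumptions made for illustration and do not come from the original code. It times the same launch twice with the CPU timer, once stopping the timer right after the launch and once after synchronizing.

```cuda
#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

// Wall-clock time in seconds, same idea as cpuSecond() above.
static double cpuSecond() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)tp.tv_sec + (double)tp.tv_usec * 1e-6;
}

// Hypothetical kernel, used only to give the GPU some work.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 24;  // arbitrary size chosen for illustration
    float *d = NULL;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d, 0, n * sizeof(float));

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    // Warm-up launch so one-time initialization is not part of the timing.
    scale<<<grid, block>>>(d, n, 1.0f);
    cudaDeviceSynchronize();

    // Timer stopped right after the launch: the kernel may still be running.
    double t0 = cpuSecond();
    scale<<<grid, block>>>(d, n, 2.0f);
    double launchOnly = cpuSecond() - t0;
    cudaDeviceSynchronize();  // drain the GPU before the next measurement

    // Timer stopped after synchronizing: launch plus kernel execution.
    t0 = cpuSecond();
    scale<<<grid, block>>>(d, n, 2.0f);
    cudaDeviceSynchronize();
    double withSync = cpuSecond() - t0;

    printf("without sync: %.6f s\n", launchOnly);
    printf("with sync:    %.6f s\n", withSync);

    cudaFree(d);
    return 0;
}
```

Even the synchronized figure includes launch overhead and the cost of the synchronization itself, so it is only an upper bound on the kernel's execution time.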

Timing with CUDA events

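A CPU timer is a poor fit for timing kernels: it needs a host-device synchronization point such as cudaDeviceSynchronize(), which stalls the GPU pipeline, and the measurement also includes launch overhead. Dedicated profiling tools exist for this job.
CUDA also offers a lighter-weight alternative to CPU timers, the event API.
Consider the following code: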

cudaEvent_t start, stop;
cudaEventCreate(&start); // create the events
cudaEventCreate(&stop);

cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

cudaEventRecord(start); // enqueue start in the default stream, since no stream was created or passed
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
cudaEventRecord(stop);

cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

cudaEventSynchronize(stop);
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);

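cudaEventRecord(start) places the start event in the default stream, because we created no stream and passed none. You can think of cudaEventRecord() as a "record the current time" command appended to the stream's queue of pending work: the call returns immediately on the host, and the device records a timestamp only when it reaches the event, that is, after all work submitted to that stream before the call has finished. Likewise, the stop event from cudaEventRecord(stop) is recorded only after the device has finished the kernel launched before it.
cudaEventElapsedTime therefore returns the time between the start and stop events, in milliseconds, which is the device execution time of the work enqueued between them in that stream.
cudaEventSynchronize(stop) blocks the CPU until the given event has been recorded; here, execution continues only once stop has been recorded in the stream. Kernel launches are asynchronous, so without a synchronization point the CPU could reach cudaEventElapsedTime before the kernel has finished and before stop has been recorded, and the call would return cudaErrorNotReady instead of a time. In this particular listing the blocking cudaMemcpy before it already waits for the kernel, but cudaEventSynchronize(stop) makes the dependency explicit.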
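
The record, synchronize, elapsed-time sequence described above can be put together as a complete program. This is a sketch under stated assumptions rather than the author's code: the `CHECK` macro and the helper `timeSaxpyMs` are hypothetical names introduced here, and the input values are arbitrary. It also destroys the events and frees the buffers, which the listing above leaves out.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Hypothetical helper: runs saxpy once and returns the kernel time in ms.
float timeSaxpyMs(int n) {
    size_t bytes = (size_t)n * sizeof(float);
    float *x = (float *)malloc(bytes);
    float *y = (float *)malloc(bytes);
    if (x == NULL || y == NULL) {
        fprintf(stderr, "host allocation failed\n");
        exit(EXIT_FAILURE);
    }
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    float *d_x = NULL, *d_y = NULL;
    CHECK(cudaMalloc(&d_x, bytes));
    CHECK(cudaMalloc(&d_y, bytes));
    CHECK(cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    CHECK(cudaEventRecord(start));      // enqueued before the kernel
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
    CHECK(cudaGetLastError());          // catch launch configuration errors
    CHECK(cudaEventRecord(stop));       // enqueued after the kernel
    CHECK(cudaEventSynchronize(stop));  // block until stop has been recorded

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d_x));
    CHECK(cudaFree(d_y));
    free(x);
    free(y);
    return ms;
}

int main() {
    printf("saxpy kernel time: %.3f ms\n", timeSaxpyMs(1 << 20));
    return 0;
}
```

Because both events sit in the same stream as the kernel, the measured interval covers only the work enqueued between them, not the host-side memory copies.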

Recording events in a specific stream

Events matter even more for two other tasks:

- synchronizing execution across streams
- controlling the pace at which the device runs

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
...

cudaEventRecord(start, 0);
for (int i = 0; i < 2; ++i) {
	cudaMemcpyAsync(inputDev + i * size, inputHost + i * size,
	size, cudaMemcpyHostToDevice, stream[i]);
	MyKernel<<<100, 512, 0, stream[i]>>>(outputDev + i * size, inputDev + i * size, size);
	cudaMemcpyAsync(outputHost + i * size, outputDev + i * size,
	size, cudaMemcpyDeviceToHost, stream[i]);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
...
cudaEventDestroy(start);
cudaEventDestroy(stop);

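The CUDA API provides functions to insert an event into a stream (cudaEventRecord) and to query whether that event has completed (cudaEventQuery). An event completes only once all operations that precede it in the stream have finished. An event recorded in the default stream applies to every stream: the argument 0 above selects the default stream, so the event completes only after the preceding operations in all streams have completed. This holds for the legacy default stream together with streams created by cudaStreamCreate; it does not hold for streams created with the cudaStreamNonBlocking flag or for code compiled with per-thread default streams.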
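
Querying an event and using it to order work across streams can be sketched as follows. Everything specific here is an assumption made for illustration: the kernels `fill` and `addOne`, the helper `runPipeline`, and the stream names are hypothetical, and error checking is omitted to keep the sketch short. A producer stream records an event after its kernel, a consumer stream waits on that event with cudaStreamWaitEvent, and the host polls the event with cudaEventQuery.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernels: one writes a value, the other increments it.
__global__ void fill(int *data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical helper: runs fill in one stream, addOne in another,
// ordered by an event, and returns the first element of the result.
int runPipeline(int n) {
    int *d = NULL;
    cudaMalloc(&d, n * sizeof(int));

    cudaStream_t producer, consumer;
    cudaStreamCreate(&producer);
    cudaStreamCreate(&consumer);

    // The event is used only for ordering, so timing is disabled.
    cudaEvent_t filled;
    cudaEventCreateWithFlags(&filled, cudaEventDisableTiming);

    int block = 256;
    int grid = (n + block - 1) / block;

    fill<<<grid, block, 0, producer>>>(d, n, 41);
    cudaEventRecord(filled, producer);         // marks the end of fill
    cudaStreamWaitEvent(consumer, filled, 0);  // consumer waits on the device
    addOne<<<grid, block, 0, consumer>>>(d, n);

    // Host-side polling: cudaErrorNotReady means the event is still pending.
    unsigned long polls = 0;
    while (cudaEventQuery(filled) == cudaErrorNotReady) ++polls;
    printf("event completed after %lu polls\n", polls);

    cudaStreamSynchronize(consumer);           // wait for addOne to finish
    int first = 0;
    cudaMemcpy(&first, d, sizeof(int), cudaMemcpyDeviceToHost);

    cudaEventDestroy(filled);
    cudaStreamDestroy(producer);
    cudaStreamDestroy(consumer);
    cudaFree(d);
    return first;
}

int main() {
    int first = runPipeline(1 << 20);
    printf("first element: %d (%s)\n", first, first == 42 ? "ok" : "mismatch");
    return 0;
}
```

cudaStreamWaitEvent makes the consumer stream wait on the device without blocking the host, while the cudaEventQuery loop shows the non-blocking host-side check; in real code the host would normally do useful work between polls.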

https://blog.csdn.net/qq_24990189/article/details/89602618