GPU High-Performance Computing with CUDA: Printing Results and Cleaning Up

阿里猫的编程小辈

Last modified 2024-09-09 00:35:17

First published 2024-09-07 23:56:05

Article link: https://blog.csdn.net/m0_51165837/article/details/142006779

Disclaimer: this article is not for commercial use.

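The earlier example code showed how to print the program's results. Some of that output comes from our queries to the GPU. For example, GPUprop.name is the name of the GPU, such as "GeForce GTX Titan Z"; I will cover the output in detail in a later article.
Just as every malloc() call must be matched by a free() call that releases the memory, every cudaMalloc() must be matched by a cudaFree(), which tells the Nvidia runtime that the GPU memory region is no longer needed. Likewise, every event created with cudaEventCreate() must be destroyed with cudaEventDestroy(). Once all of that is done, we call cudaDeviceReset(), which tells the Nvidia runtime that we are finished using the GPU. We then release the CPU memory regions with free(), and imflipG.cu is done!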
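
The pairing rule is easier to see in isolation, so here is a minimal sketch of it, separate from imflipG.cu. The helper cudaOk(), the buffer names, and the 1 MB size are my own assumptions for illustration; only the CUDA runtime calls themselves come from the text above.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not part of imflipG.cu): report a failed CUDA call.
static bool cudaOk(cudaError_t status, const char *what)
{
    if (status != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(status));
        return false;
    }
    return true;
}

int main(void)
{
    const size_t bytes = 1 << 20;             // illustrative 1 MB buffer (assumption)
    unsigned char *cpuBuf = (unsigned char *)malloc(bytes);   // needs free()
    unsigned char *gpuBuf = NULL;                             // needs cudaFree()
    cudaEvent_t marker = NULL;                                // needs cudaEventDestroy()

    bool ok = (cpuBuf != NULL);
    ok = ok && cudaOk(cudaMalloc((void **)&gpuBuf, bytes), "cudaMalloc");
    bool haveEvent = ok && cudaOk(cudaEventCreate(&marker), "cudaEventCreate");
    ok = ok && haveEvent;

    // ... copy data, launch kernels, and record events here ...

    // Release in pairs: events and GPU memory first, then reset the device.
    if (haveEvent) cudaEventDestroy(marker);
    cudaFree(gpuBuf);                         // cudaFree(NULL) performs no operation
    bool resetOk = cudaOk(cudaDeviceReset(), "cudaDeviceReset");

    free(cpuBuf);                             // free(NULL) is also safe
    return (ok && resetOk) ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

Tracking whether the event was actually created keeps the cleanup path valid even when an earlier step fails, which is why the sketch carries the haveEvent flag instead of destroying the event unconditionally.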

int main(int argc, char **argv)
{
    ...
    printf("\n\n--------------------------------------------------------------------------\n");
    printf("%s ComputeCapab=%d.%d [supports max %s blocks; %d thr/blk] \n", 
           GPUprop.name, GPUprop.major, GPUprop.minor, SupportedBlocks, MaxThrPerBlk);
    printf("--------------------------------------------------------------------------\n");
    printf("%s %s %s %c %u   [%u BLOCKS, %u BLOCKS/ROW]\n", ProgName, InputFileName, OutputFileName,
           Flip, ThrPerBlk, NumBlocks, BlkPerRow);
    printf("--------------------------------------------------------------------------\n");
    printf("CPU->GPU Transfer   =%7.2f ms  ...  %4d MB  ...  %6.2f GB/s\n", tfrCPUtoGPU, DATAMB(IMAGESIZE), DATABW(IMAGESIZE, tfrCPUtoGPU));
    printf("Kernel Execution    =%7.2f ms  ...  %4d MB  ...  %6.2f GB/s\n", kernelExecutionTime, DATAMB(GPUDataTransfer), DATABW(GPUDataTransfer, kernelExecutionTime));
    printf("GPU->CPU Transfer   =%7.2f ms  ...  %4d MB  ...  %6.2f GB/s\n", tfrGPUtoCPU, DATAMB(IMAGESIZE), DATABW(IMAGESIZE, tfrGPUtoCPU));
    printf("--------------------------------------------------------------------------\n");
    printf("Total time elapsed  =%7.2f ms       %4d MB  ...  %6.2f GB/s\n", totalTime, DATAMB((2 * IMAGESIZE + GPUDataTransfer)), DATABW((2 * IMAGESIZE + GPUDataTransfer), totalTime));
    printf("--------------------------------------------------------------------------\n\n");

    // Deallocate CPU, GPU memory and destroy events.
    cudaFree(GPUImg);
    cudaFree(GPUCopyImg);
    cudaEventDestroy(time1);
    cudaEventDestroy(time2);
    cudaEventDestroy(time3);
    cudaEventDestroy(time4);
    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Parallel Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!\n");
        free(TheImg);
        free(CopyImg);
        exit(EXIT_FAILURE);
    }
    free(TheImg);
    free(CopyImg);
    return(EXIT_SUCCESS);

}
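
The listing uses DATAMB and DATABW and the timing variables without showing where they come from. The sketch below is one plausible way to define them and to time a single transfer with CUDA events. The macro bodies are my assumption (1 MB = 2^20 bytes, 1 GB = 2^30 bytes) and may differ from the definitions in imflipG.cu; the 64 MB buffer and the variable names are likewise only illustrative.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Assumed definitions: the original macros are not shown in this article.
#define DATAMB(bytes)         ((int)((bytes) / (1024 * 1024)))
#define DATABW(bytes, timems) ((double)(bytes) / ((timems) * 1.0e-3) / (1024.0 * 1024.0 * 1024.0))

int main(void)
{
    const size_t bytes = 64u * 1024u * 1024u;     // illustrative 64 MB transfer
    unsigned char *cpuBuf = (unsigned char *)calloc(bytes, 1);
    unsigned char *gpuBuf = NULL;
    cudaEvent_t t1, t2;
    float ms = 0.0f;

    if (cpuBuf == NULL) return EXIT_FAILURE;
    if (cudaMalloc((void **)&gpuBuf, bytes) != cudaSuccess ||
        cudaEventCreate(&t1) != cudaSuccess ||
        cudaEventCreate(&t2) != cudaSuccess) {
        fprintf(stderr, "CUDA setup failed\n");
        cudaDeviceReset();                        // releases whatever was created so far
        free(cpuBuf);
        return EXIT_FAILURE;
    }

    cudaEventRecord(t1, 0);                       // time stamp before the copy
    cudaMemcpy(gpuBuf, cpuBuf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t2, 0);                       // time stamp after the copy
    cudaEventSynchronize(t2);                     // wait until t2 has really happened
    cudaEventElapsedTime(&ms, t1, t2);            // elapsed time in milliseconds

    printf("CPU->GPU Transfer   =%7.2f ms  ...  %4d MB  ...  %6.2f GB/s\n",
           ms, DATAMB(bytes), DATABW(bytes, ms));

    cudaFree(gpuBuf);
    cudaEventDestroy(t1);
    cudaEventDestroy(t2);
    cudaDeviceReset();
    free(cpuBuf);
    return EXIT_SUCCESS;
}
```

The cast to int inside DATAMB is there so that the value matches the %4d conversion used in the printf() calls.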

The general form of a kernel launch is as follows:

GPUKernelName <<< dimension, dimension >>> (arg1, arg2, ...);

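Here arg1, arg2, ... are the arguments passed from the CPU side to the GPU kernel. In Code 6.3 they are two pointers and IPH. The two pointers (GPUCopyImg and GPUImg) were returned by cudaMalloc() when the GPU memory regions that hold the images were allocated. IPH is a variable holding the number of pixels in the image's horizontal dimension (ipHpixels). The GPU kernel Hflip() needs these three arguments while it executes; if they are not passed in at launch, the kernel has no way to obtain them. Recall that the two dimensions in Analogy 6.2 were 166656 and 256, which corresponds to the following launch code: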

Hflip <<< 166656, 256 >>> (GPUCopyImg, GPUImg, IPH);

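This tells the Nvidia runtime engine to launch 166656 blocks for the Hflip() kernel and to pass these three arguments to every block. In other words, it launches block 0, block 1, block 2, ..., block 166655. Each block in turn launches 256 threads (tid=0, tid=1, ..., tid=255), just like the pthread examples we saw in Part I of this book. What this really means is that this one line of code launches a total of 166656 × 256 = 42,663,936 threads, or about 41 M if we count 1 M as 2^20.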
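
To make the block and thread numbering concrete, here is a small sketch of how a launch like this can be sized and how each thread can find its own pixel. It is not the Hflip() kernel from Code 6.3: HflipSketch() is a simplified stand-in that assumes one byte per pixel, and blocksPerRow() is a hypothetical helper. The 7918 × 5376 image size is also my assumption, chosen because it reproduces the 166656 blocks quoted above; the text here does not state the image size.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Blocks needed to cover one row of `width` pixels (ceiling division).
static unsigned int blocksPerRow(unsigned int width, unsigned int thrPerBlk)
{
    return (width + thrPerBlk - 1) / thrPerBlk;
}

// Simplified stand-in for Hflip(): one byte per pixel, one thread per pixel.
__global__ void HflipSketch(unsigned char *dst, const unsigned char *src, unsigned int width)
{
    unsigned int blkPerRow = (width + blockDim.x - 1) / blockDim.x;
    unsigned int row = blockIdx.x / blkPerRow;
    unsigned int col = (blockIdx.x % blkPerRow) * blockDim.x + threadIdx.x;
    if (col >= width) return;        // surplus threads in a row's last block do nothing
    dst[row * width + (width - 1 - col)] = src[row * width + col];
}

int main(void)
{
    // Launch geometry for the assumed 7918 x 5376 image with 256 threads per block.
    unsigned int bpr = blocksPerRow(7918, 256);
    unsigned int numBlocks = bpr * 5376;
    printf("%u blocks/row, %u blocks, %llu threads\n",
           bpr, numBlocks, (unsigned long long)numBlocks * 256ULL);

    // Small demo image so that the flip can be checked on the CPU.
    const unsigned int width = 1000, height = 4, thrPerBlk = 256;
    const size_t bytes = (size_t)width * height;
    unsigned char *src = (unsigned char *)malloc(bytes);
    unsigned char *dst = (unsigned char *)malloc(bytes);
    unsigned char *gpuSrc = NULL, *gpuDst = NULL;
    if (src == NULL || dst == NULL) { free(src); free(dst); return EXIT_FAILURE; }
    for (size_t i = 0; i < bytes; i++) src[i] = (unsigned char)(i % 251);

    bool ok = cudaMalloc((void **)&gpuSrc, bytes) == cudaSuccess
           && cudaMalloc((void **)&gpuDst, bytes) == cudaSuccess
           && cudaMemcpy(gpuSrc, src, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        HflipSketch<<<blocksPerRow(width, thrPerBlk) * height, thrPerBlk>>>(gpuDst, gpuSrc, width);
        ok = cudaMemcpy(dst, gpuDst, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    if (ok) ok = (dst[0] == src[width - 1]) && (dst[width - 1] == src[0]);
    printf("%s\n", ok ? "flip verified" : "flip failed");

    cudaFree(gpuSrc);
    cudaFree(gpuDst);
    cudaDeviceReset();
    free(src);
    free(dst);
    return ok ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

With the assumed image size, the first line printed is "31 blocks/row, 166656 blocks, 42663936 threads". Note that 31 × 256 = 7936 threads cover each 7918-pixel row, so the last block of every row carries 18 threads that simply return.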