CUDA Programming: CUDA Samples, 2_Concepts_and_Techniques: histogram

翊桐公羽

Published 2024-07-10 17:29:15

Article link: https://blog.csdn.net/zhujingwen_tongji/article/details/140328173

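The 2_Concepts_and_Techniques directory of the CUDA samples contains example programs that demonstrate CUDA programming concepts and techniques. The examples cover the following areas:

Memory management:
- simpleMemcpy: shows how to use the CUDA memory copy API to transfer data between device memory and host memory.
- simpleAssert: shows how to call assert() inside a CUDA kernel for error checking.
Thread management:
- simpleStreams: shows how to use CUDA streams to overlap kernels and memory operations and raise overall throughput.
- simpleMultiTask: shows how to queue multiple kernels in the same CUDA stream so that the device stays busy.
Convolution:
- convolutionSeparable: shows how a separable convolution improves the performance of a convolution filter.
Shared memory and warp-level data exchange:
- shfl: shows how to use the warp shuffle intrinsics (written __shfl() here; __shfl_sync() and its variants since CUDA 9) to exchange data efficiently between the threads of a warp, without going through shared memory.
- eigenvalues: shows how to use shared memory to speed up the computation of matrix eigenvalues.
Parallelization techniques:
- reduction: shows how to use a parallel reduction to sum the elements of an array efficiently.
- scan: shows how to use a parallel prefix sum to compute the running sums of an array efficiently.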

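The histogram sample shows how to compute a histogram efficiently on the GPU. A histogram is a common tool for data analysis and visualization that describes how the values of a data set are distributed.

The sample touches on the following points:

Parallelizing the histogram computation:
- The data set is split into chunks, and the threads of each thread block compute a local histogram for their chunk in parallel.
- Atomic operations update the local histogram in shared memory.
- Finally, all local histograms are merged into the global histogram.
Memory access optimization:
- Shared memory absorbs the counter updates, which reduces traffic to global memory.
- Aligned, coalesced reads of the input make the most of the available memory bandwidth.
Performance experiments:
- Try different bin counts and observe how performance changes.
- Compare the CPU and GPU versions of the histogram computation.
- Profile the GPU kernels to see where the time goes and locate the bottleneck.
Possible extensions (not part of the sample itself, which works on byte data with 64 or 256 bins):
- Histograms over other data types (integers, floating-point values).
- Multi-dimensional histograms.
- Reading the input through texture memory.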
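
The split, count and merge steps above can be sketched on the CPU. This is only an illustration of the data flow, not the sample's code: `partialHistogram` and `blockedHistogram` are hypothetical names, each loop iteration stands in for one thread block, and 256 bins (one per byte value) are assumed.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBins = 256;  // assumption: one bin per byte value
using Histogram = std::array<unsigned, kBins>;

// Hypothetical helper: count one chunk into a local histogram, standing in
// for what a single thread block would accumulate in shared memory.
Histogram partialHistogram(const std::uint8_t* data, std::size_t count) {
  Histogram local{};  // zero-initialised counters
  for (std::size_t i = 0; i < count; ++i) ++local[data[i]];
  return local;
}

// Split the input into numBlocks chunks, build one partial histogram per
// chunk, and sum the partial histograms into the global result.
Histogram blockedHistogram(const std::vector<std::uint8_t>& input,
                           std::size_t numBlocks) {
  Histogram global{};
  if (numBlocks == 0) return global;
  const std::size_t chunk = (input.size() + numBlocks - 1) / numBlocks;
  for (std::size_t b = 0; b < numBlocks; ++b) {
    const std::size_t begin = b * chunk;
    if (begin >= input.size()) break;  // no data left for this "block"
    const std::size_t count = std::min(chunk, input.size() - begin);
    const Histogram local = partialHistogram(input.data() + begin, count);
    for (std::size_t bin = 0; bin < kBins; ++bin) global[bin] += local[bin];
  }
  return global;
}
```

Because addition is associative, the result does not depend on how many chunks the input is split into, which is what lets the GPU version choose its grid size freely.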

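The sample demonstrates efficient implementations of a 64-bin and a 256-bin histogram.

In more detail:

64-bin histogram:
- Every input byte is assigned to one of 64 bins according to its top 6 bits, so each bin covers a range of 4 consecutive byte values.
- The number of bytes falling into each bin is counted in parallel on the GPU.
- The memory access pattern is tuned to make full use of the GPU's memory bandwidth.
256-bin histogram:
- Every input byte is assigned to one of 256 bins, one per byte value, which gives a finer resolution than 64 bins.
- It uses parallelization and memory optimization strategies similar to those of the 64-bin version.
- Comparing the two versions shows how the bin count affects performance and helps weigh speed against resolution.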
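
As a small illustration of the two bin mappings, the functions below compute the bin index of a single input byte. The names `bin64` and `bin256` are made up for this sketch; the 64-bin rule (keep the top 6 bits) follows the shift-and-mask description given later in this article.

```cpp
#include <cstdint>

// 64 bins: drop the two low bits and keep the top six,
// so byte values 4k .. 4k+3 all land in bin k.
inline unsigned bin64(std::uint8_t byte) { return byte >> 2; }

// 256 bins: every byte value has a bin of its own.
inline unsigned bin256(std::uint8_t byte) { return byte; }
```

For instance, the byte values 4, 5, 6 and 7 share bin 1 in the 64-bin histogram but occupy four separate bins in the 256-bin one.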

#include <cuda_runtime.h>
#include <helper_cuda.h>

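#include <cstdio>   // printf, fprintf, snprintf
#include <cstdlib>  // exit, EXIT_SUCCESS, EXIT_FAILURE
#include <iostream>
#include <memory>
#include <string>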

int *pArgc = NULL;
char **pArgv = NULL;

#if CUDART_VERSION < 5000

// CUDA-C includes
#include <cuda.h>

// This function wraps the CUDA Driver API into a template function
template <class T>
inline void getCudaAttribute(T *attribute, CUdevice_attribute device_attribute,
                             int device) {
  CUresult error = cuDeviceGetAttribute(attribute, device_attribute, device);

  if (CUDA_SUCCESS != error) {
    fprintf(
        stderr,
        "cuSafeCallNoSync() Driver API error = %04d from file <%s>, line %i.\n",
        error, __FILE__, __LINE__);

    exit(EXIT_FAILURE);
  }
}

#endif /* CUDART_VERSION < 5000 */


// Program main

int main(int argc, char **argv) {
  pArgc = &argc;
  pArgv = argv;

  printf("%s Starting...\n\n", argv[0]);
  printf(
      " CUDA Device Query (Runtime API) version (CUDART static linking)\n\n");

  int deviceCount = 0;
  cudaError_t error_id = cudaGetDeviceCount(&deviceCount);

  if (error_id != cudaSuccess) {
    printf("cudaGetDeviceCount returned %d\n-> %s\n",
           static_cast<int>(error_id), cudaGetErrorString(error_id));
    printf("Result = FAIL\n");
    exit(EXIT_FAILURE);
  }

  // This function call returns 0 if there are no CUDA capable devices.
  if (deviceCount == 0) {
    printf("There are no available device(s) that support CUDA\n");
  } else {
    printf("Detected %d CUDA Capable device(s)\n", deviceCount);
  }

  int dev, driverVersion = 0, runtimeVersion = 0;

  for (dev = 0; dev < deviceCount; ++dev) {
    cudaSetDevice(dev);
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, dev);

    printf("\nDevice %d: \"%s\"\n", dev, deviceProp.name);

    // Console log
    cudaDriverGetVersion(&driverVersion);
    cudaRuntimeGetVersion(&runtimeVersion);
    printf("  CUDA Driver Version / Runtime Version          %d.%d / %d.%d\n",
           driverVersion / 1000, (driverVersion % 100) / 10,
           runtimeVersion / 1000, (runtimeVersion % 100) / 10);
    printf("  CUDA Capability Major/Minor version number:    %d.%d\n",
           deviceProp.major, deviceProp.minor);

    char msg[256];
#if defined(WIN32) || defined(_WIN32) || defined(WIN64) || defined(_WIN64)
    sprintf_s(msg, sizeof(msg),
              "  Total amount of global memory:                 %.0f MBytes "
              "(%llu bytes)\n",
              static_cast<float>(deviceProp.totalGlobalMem / 1048576.0f),
              (unsigned long long)deviceProp.totalGlobalMem);
#else
    snprintf(msg, sizeof(msg),
             "  Total amount of global memory:                 %.0f MBytes "
             "(%llu bytes)\n",
             static_cast<float>(deviceProp.totalGlobalMem / 1048576.0f),
             (unsigned long long)deviceProp.totalGlobalMem);
#endif
    printf("%s", msg);

    printf("  (%03d) Multiprocessors, (%03d) CUDA Cores/MP:    %d CUDA Cores\n",
           deviceProp.multiProcessorCount,
           _ConvertSMVer2Cores(deviceProp.major, deviceProp.minor),
           _ConvertSMVer2Cores(deviceProp.major, deviceProp.minor) *
               deviceProp.multiProcessorCount);
    printf(
        "  GPU Max Clock rate:                            %.0f MHz (%0.2f "
        "GHz)\n",
        deviceProp.clockRate * 1e-3f, deviceProp.clockRate * 1e-6f);

#if CUDART_VERSION >= 5000
    // This is supported in CUDA 5.0 (runtime API device properties)
    printf("  Memory Clock rate:                             %.0f Mhz\n",
           deviceProp.memoryClockRate * 1e-3f);
    printf("  Memory Bus Width:                              %d-bit\n",
           deviceProp.memoryBusWidth);

    if (deviceProp.l2CacheSize) {
      printf("  L2 Cache Size:                                 %d bytes\n",
             deviceProp.l2CacheSize);
    }

#else
    // This only available in CUDA 4.0-4.2 (but these were only exposed in the
    // CUDA Driver API)
    int memoryClock;
    getCudaAttribute<int>(&memoryClock, CU_DEVICE_ATTRIBUTE_MEMORY_CLOCK_RATE,
                          dev);
    printf("  Memory Clock rate:                             %.0f Mhz\n",
           memoryClock * 1e-3f);
    int memBusWidth;
    getCudaAttribute<int>(&memBusWidth,
                          CU_DEVICE_ATTRIBUTE_GLOBAL_MEMORY_BUS_WIDTH, dev);
    printf("  Memory Bus Width:                              %d-bit\n",
           memBusWidth);
    int L2CacheSize;
    getCudaAttribute<int>(&L2CacheSize, CU_DEVICE_ATTRIBUTE_L2_CACHE_SIZE, dev);

    if (L2CacheSize) {
      printf("  L2 Cache Size:                                 %d bytes\n",
             L2CacheSize);
    }

#endif

    printf(
        "  Maximum Texture Dimension Size (x,y,z)         1D=(%d), 2D=(%d, "
        "%d), 3D=(%d, %d, %d)\n",
        deviceProp.maxTexture1D, deviceProp.maxTexture2D[0],
        deviceProp.maxTexture2D[1], deviceProp.maxTexture3D[0],
        deviceProp.maxTexture3D[1], deviceProp.maxTexture3D[2]);
    printf(
        "  Maximum Layered 1D Texture Size, (num) layers  1D=(%d), %d layers\n",
        deviceProp.maxTexture1DLayered[0], deviceProp.maxTexture1DLayered[1]);
    printf(
        "  Maximum Layered 2D Texture Size, (num) layers  2D=(%d, %d), %d "
        "layers\n",
        deviceProp.maxTexture2DLayered[0], deviceProp.maxTexture2DLayered[1],
        deviceProp.maxTexture2DLayered[2]);

    printf("  Total amount of constant memory:               %zu bytes\n",
           deviceProp.totalConstMem);
    printf("  Total amount of shared memory per block:       %zu bytes\n",
           deviceProp.sharedMemPerBlock);
    printf("  Total shared memory per multiprocessor:        %zu bytes\n",
           deviceProp.sharedMemPerMultiprocessor);
    printf("  Total number of registers available per block: %d\n",
           deviceProp.regsPerBlock);
    printf("  Warp size:                                     %d\n",
           deviceProp.warpSize);
    printf("  Maximum number of threads per multiprocessor:  %d\n",
           deviceProp.maxThreadsPerMultiProcessor);
    printf("  Maximum number of threads per block:           %d\n",
           deviceProp.maxThreadsPerBlock);
    printf("  Max dimension size of a thread block (x,y,z): (%d, %d, %d)\n",
           deviceProp.maxThreadsDim[0], deviceProp.maxThreadsDim[1],
           deviceProp.maxThreadsDim[2]);
    printf("  Max dimension size of a grid size    (x,y,z): (%d, %d, %d)\n",
           deviceProp.maxGridSize[0], deviceProp.maxGridSize[1],
           deviceProp.maxGridSize[2]);
    printf("  Maximum memory pitch:                          %zu bytes\n",
           deviceProp.memPitch);
    printf("  Texture alignment:                             %zu bytes\n",
           deviceProp.textureAlignment);
    printf(
        "  Concurrent copy and kernel execution:          %s with %d copy "
        "engine(s)\n",
        (deviceProp.deviceOverlap ? "Yes" : "No"), deviceProp.asyncEngineCount);
    printf("  Run time limit on kernels:                     %s\n",
           deviceProp.kernelExecTimeoutEnabled ? "Yes" : "No");
    printf("  Integrated GPU sharing Host Memory:            %s\n",
           deviceProp.integrated ? "Yes" : "No");
    printf("  Support host page-locked memory mapping:       %s\n",
           deviceProp.canMapHostMemory ? "Yes" : "No");
    printf("  Alignment requirement for Surfaces:            %s\n",
           deviceProp.surfaceAlignment ? "Yes" : "No");
    printf("  Device has ECC support:                        %s\n",
           deviceProp.ECCEnabled ? "Enabled" : "Disabled");
#if defined(WIN32) || defined(_WIN32) || defined(WIN64) || defined(_WIN64)
    printf("  CUDA Device Driver Mode (TCC or WDDM):         %s\n",
           deviceProp.tccDriver ? "TCC (Tesla Compute Cluster Driver)"
                                : "WDDM (Windows Display Driver Model)");
#endif
    printf("  Device supports Unified Addressing (UVA):      %s\n",
           deviceProp.unifiedAddressing ? "Yes" : "No");
    printf("  Device supports Managed Memory:                %s\n",
           deviceProp.managedMemory ? "Yes" : "No");
    printf("  Device supports Compute Preemption:            %s\n",
           deviceProp.computePreemptionSupported ? "Yes" : "No");
    printf("  Supports Cooperative Kernel Launch:            %s\n",
           deviceProp.cooperativeLaunch ? "Yes" : "No");
    printf("  Supports MultiDevice Co-op Kernel Launch:      %s\n",
           deviceProp.cooperativeMultiDeviceLaunch ? "Yes" : "No");
    printf("  Device PCI Domain ID / Bus ID / location ID:   %d / %d / %d\n",
           deviceProp.pciDomainID, deviceProp.pciBusID, deviceProp.pciDeviceID);

    const char *sComputeMode[] = {
        "Default (multiple host threads can use ::cudaSetDevice() with device "
        "simultaneously)",
        "Exclusive (only one host thread in one process is able to use "
        "::cudaSetDevice() with this device)",
        "Prohibited (no host thread can use ::cudaSetDevice() with this "
        "device)",
        "Exclusive Process (many threads in one process is able to use "
        "::cudaSetDevice() with this device)",
        "Unknown", NULL};
    printf("  Compute Mode:\n");
    printf("     < %s >\n", sComputeMode[deviceProp.computeMode]);
  }

  // If there are 2 or more GPUs, query to determine whether RDMA is supported
  if (deviceCount >= 2) {
    cudaDeviceProp prop[64];
    int gpuid[64];  // we want to find the first two GPUs that can support P2P
    int gpu_p2p_count = 0;

    for (int i = 0; i < deviceCount; i++) {
      checkCudaErrors(cudaGetDeviceProperties(&prop[i], i));

      // Only boards based on Fermi or later can support P2P
      if ((prop[i].major >= 2)
#if defined(WIN32) || defined(_WIN32) || defined(WIN64) || defined(_WIN64)
          // on Windows (64-bit), the Tesla Compute Cluster driver for windows
          // must be enabled to support this
          && prop[i].tccDriver
#endif
          ) {
        // This is an array of P2P capable GPUs
        gpuid[gpu_p2p_count++] = i;
      }
    }

    // Show all the combinations of support P2P GPUs
    int can_access_peer;

    if (gpu_p2p_count >= 2) {
      for (int i = 0; i < gpu_p2p_count; i++) {
        for (int j = 0; j < gpu_p2p_count; j++) {
          if (gpuid[i] == gpuid[j]) {
            continue;
          }
          checkCudaErrors(
              cudaDeviceCanAccessPeer(&can_access_peer, gpuid[i], gpuid[j]));
          printf("> Peer access from %s (GPU%d) -> %s (GPU%d) : %s\n",
                 prop[gpuid[i]].name, gpuid[i], prop[gpuid[j]].name, gpuid[j],
                 can_access_peer ? "Yes" : "No");
        }
      }
    }
  }

  // csv masterlog info
  // *****************************
  // exe and CUDA driver name
  printf("\n");
  std::string sProfileString = "deviceQuery, CUDA Driver = CUDART";
  char cTemp[16];

  // driver version
  sProfileString += ", CUDA Driver Version = ";
#if defined(WIN32) || defined(_WIN32) || defined(WIN64) || defined(_WIN64)
  sprintf_s(cTemp, 10, "%d.%d", driverVersion / 1000,
            (driverVersion % 100) / 10);
#else
  snprintf(cTemp, sizeof(cTemp), "%d.%d", driverVersion / 1000,
           (driverVersion % 100) / 10);
#endif
  sProfileString += cTemp;

  // Runtime version
  sProfileString += ", CUDA Runtime Version = ";
#if defined(WIN32) || defined(_WIN32) || defined(WIN64) || defined(_WIN64)
  sprintf_s(cTemp, 10, "%d.%d", runtimeVersion / 1000,
            (runtimeVersion % 100) / 10);
#else
  snprintf(cTemp, sizeof(cTemp), "%d.%d", runtimeVersion / 1000,
           (runtimeVersion % 100) / 10);
#endif
  sProfileString += cTemp;

  // Device count
  sProfileString += ", NumDevs = ";
#if defined(WIN32) || defined(_WIN32) || defined(WIN64) || defined(_WIN64)
  sprintf_s(cTemp, 10, "%d", deviceCount);
#else
  snprintf(cTemp, sizeof(cTemp), "%d", deviceCount);
#endif
  sProfileString += cTemp;
  sProfileString += "\n";
  printf("%s", sProfileString.c_str());

  printf("Result = PASS\n");

  // finish
  exit(EXIT_SUCCESS);
}

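Code walkthrough (the listing above is the deviceQuery host program, which only enumerates the CUDA devices; the walkthrough below describes the flow of the histogram sample):

Sorting the input into 64 or 256 bins:
- Each input value is assigned to one of 64 or 256 bins according to its value. Each bin represents one value range.
Computing each bin on the CPU:
- The CPU version walks the input and counts how many values fall into each bin.
Computing each bin on the GPU and recording the time:
- The GPU version computes the same histogram using the GPU's parallelism.
- The elapsed time of the GPU histogram is recorded and reported as throughput.
Comparing the CPU and GPU results:
- The per-bin counts from the CPU and the GPU are compared to verify that the GPU result is correct.
- The GPU timing can then be set against a CPU timing to assess speed, throughput and the benefit of GPU acceleration.

An experiment of this kind shows how the CPU and GPU implementations of a histogram differ and yields performance data for tuning the histogram algorithm further.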

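Computing the 64-bin histogram on the CPU, step by step:

First, the code initializes the 64 bin counters to 0.
Next, it walks the data and counts:
- The input is treated as 4-byte unsigned integers (uint), and the byte count must be a multiple of 4.
- For each 4-byte word:
  - Shift the word right by 2 bits and keep the low 6 bits to get the first bin index.
  - Shift the word right by 10 bits and keep the low 6 bits to get the second bin index.
  - Shift the word right by 18 bits and keep the low 6 bits to get the third bin index.
  - Shift the word right by 26 bits and keep the low 6 bits to get the fourth bin index.
- The counter of each of these 4 bins is incremented by 1.
A single pass over the input therefore completes the 64-bin histogram.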
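
The steps above translate almost directly into code. The following is a sketch written from that description rather than a copy of the sample's source; the name `histogram64Reference` and the use of `std::array` are choices made for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBinCount64 = 64;

// CPU reference sketch: every 32-bit word packs four input bytes, and the
// top six bits of each byte select one of the 64 bins.
std::array<unsigned, kBinCount64> histogram64Reference(
    const std::uint32_t* words, std::size_t byteCount) {
  assert(byteCount % 4 == 0);                // input must be whole words
  std::array<unsigned, kBinCount64> bins{};  // all counters start at 0
  for (std::size_t i = 0; i < byteCount / 4; ++i) {
    const std::uint32_t w = words[i];
    ++bins[(w >> 2) & 0x3Fu];   // top 6 bits of byte 0
    ++bins[(w >> 10) & 0x3Fu];  // top 6 bits of byte 1
    ++bins[(w >> 18) & 0x3Fu];  // top 6 bits of byte 2
    ++bins[(w >> 26) & 0x3Fu];  // top 6 bits of byte 3
  }
  return bins;
}
```

The shift amounts are 2 plus a multiple of 8: the multiple of 8 selects the byte, and the extra 2 discards that byte's two low bits.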

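Kernel functions of the 64-bin histogram on the GPU:

addWord function:
- It splits one 32-bit word into its 4 bytes and adds each of them to the histogram counters held in shared memory.
- It uses bit operations to extract the top 6 bits of each byte, because this histogram needs only 64 bins.
- Every thread calls this function for the data it is responsible for.
histogram64Kernel function:
- This is the CUDA __global__ kernel that computes the 64-bin histogram.
- It first sets the histogram counters in shared memory to 0.
- It then walks the data assigned to the current thread block and calls addWord to update the counters.
- Finally, the threads write their block's histogram into the partial-histogram array in global memory.
mergeHistogram64Kernel function:
- This kernel merges the partial histograms computed before into the final 64-bin histogram.
- Each thread block is responsible for merging one bin.
- It works as a reduction: each thread first accumulates a strided subset of the partial results for that bin, and the per-thread sums are then combined inside the block with the help of block-level synchronization.
- At the end, thread 0 of each block writes the merged value into the final histogram array in global memory.

In short, the code implements an efficient parallel histogram that relies on CUDA thread blocks and shared memory for performance. It first computes partial histograms inside the individual thread blocks and then obtains the final histogram through a separate merge kernel.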
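
To make the merge step concrete, here is a sequential C++ emulation of what one thread block of the merge kernel would do for a single bin. It is a sketch under assumptions: the block size of 256, the back-to-back layout of the partial histograms, and the name `mergeBin` are all chosen for this illustration, and the loops over `t` stand in for threads that run concurrently on the GPU.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kBins64 = 64;
constexpr std::size_t kMergeThreads = 256;  // assumed block size, power of two

// Merge one bin across histogramCount partial histograms, where partial
// histogram h is assumed to occupy partial[h * 64 .. h * 64 + 63].
unsigned mergeBin(const std::vector<unsigned>& partial,
                  std::size_t histogramCount, std::size_t bin) {
  // Step 1: "thread" t sums the partial histograms t, t + T, t + 2T, ...
  std::vector<unsigned> shared(kMergeThreads, 0u);
  for (std::size_t t = 0; t < kMergeThreads; ++t)
    for (std::size_t h = t; h < histogramCount; h += kMergeThreads)
      shared[t] += partial[h * kBins64 + bin];

  // Step 2: tree reduction over the per-thread sums. On the GPU a
  // block-wide barrier would separate the iterations of the outer loop.
  for (std::size_t stride = kMergeThreads / 2; stride > 0; stride /= 2)
    for (std::size_t t = 0; t < stride; ++t)
      shared[t] += shared[t + stride];

  return shared[0];  // "thread 0" would store this in the final histogram
}
```

The strided first step keeps every thread busy regardless of how many partial histograms exist, and the tree reduction then needs only log2(256) = 8 synchronized rounds.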

Output:

[[histogram]] - Starting...
GPU Device 0: "Ada" with compute capability 8.9

CUDA device [NVIDIA GeForce RTX 4080] has 76 Multi-Processors, Compute 8.9
Initializing data...
...allocating CPU memory.
...generating input data
...allocating GPU memory and copying input data

Starting up 64-bin histogram...

Running 64-bin GPU histogram for 67108864 bytes (16 runs)...

histogram64() time (average) : 0.00695 sec, 9657.2546 MB/sec

histogram64, Throughput = 9657.2546 MB/s, Time = 0.00695 s, Size = 67108864 Bytes, NumDevsUsed = 1, Workgroup = 64

Validating GPU results...
 ...reading back GPU results
 ...histogram64CPU()
 ...comparing the results...
 ...64-bin histograms match

Shutting down 64-bin histogram...


Initializing 256-bin histogram...
Running 256-bin GPU histogram for 67108864 bytes (16 runs)...

histogram256() time (average) : 0.00541 sec, 12398.0069 MB/sec

histogram256, Throughput = 12398.0069 MB/s, Time = 0.00541 s, Size = 67108864 Bytes, NumDevsUsed = 1, Workgroup = 192

Validating GPU results...
 ...reading back GPU results
 ...histogram256CPU()
 ...comparing the results
 ...256-bin histograms match

Shutting down 256-bin histogram...


Shutting down...

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

[histogram] - Test Summary
Test passed
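
The throughput figures in the log can be reproduced from the byte count and the average time. The helper below is a hypothetical one-liner for that arithmetic; it assumes 1 MB = 10^6 bytes, which is the convention that agrees with the numbers printed above.

```cpp
// Throughput in MB/s, taking 1 MB as 10^6 bytes (assumption consistent
// with the log: 67108864 bytes in about 0.00695 s).
inline double throughputMBps(double byteCount, double seconds) {
  return 1.0e-6 * byteCount / seconds;
}
```

With the rounded time of 0.00695 s this gives roughly 9656 MB/s for the 64-bin run; the log's 9657.2546 MB/s differs slightly because it is computed from the unrounded average time.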
