《GPU高性能编程CUDA实战》学习笔记（九）

最新推荐文章于 2024-06-17 20:39:03 发布

yongwuzhijing800

最新推荐文章于 2024-06-17 20:39:03 发布

阅读量4.1k

点赞数 3

分类专栏： CUDA 文章标签： nvidia 高性能 cuda 显卡 gpu

本文链接：https://blog.csdn.net/w09103419/article/details/52550723

版权

CUDA 专栏收录该内容

24 篇文章 4 订阅

订阅专栏

第9章原子性

在某些情况下，对于单线程应用程序来说非常简单的任务，或许使用大规模的并行架构实现却会变成一个复杂的问题。这里我们将在这些情况中使用特殊的原语从而确保安全地完成传统单线程应用程序中的简单任务。

9.1 本章目标

了解不同NVIDIA GPU的计算功能集。
了解原子操作以及为什么需要使用它们。
了解如何在CUDA C核函数中执行带有原子操作的运算。

9.2 计算功能集

1-8章，介绍的是所以支持CUDA 的GPU的通用功能。如，启动核函数、访问全局内存、读取常量内存和纹理内存。

但不同架构有不同功能，NVIDIA 将GPU支持的各种功能统称为计算功能集（Compute Capability）。

9.2.1 NVIDIA GPU的计算功能集

目前支持1.0、1.1、1.2、1.3以及2.0，以后可能支持更高版本，需要从官网来了解。高版本计算功能集是低版本计算功能集的超集。如支持1.2版本的GPU，同样会支持1.0和1.1版本的所有功能。NVIDIA CUDA编程指南中包含所以最新列表及计算功能集。

9.2.2 基于最小计算功能集的编译

假设在编写的代码中要求计算功能集的版本最低不能低于某个版本。例如，假设你阅读完本章，并开始编写一个需要使用全局内存原子操作的应用程序。你知道要支持全局内存原子操作，计算功能集的最低版本为1.1。当编译代码时，你需要告诉编译器，如果硬件支持的计算功能集低于1.1，那么将无法运行这个核函数。而且，当告诉编译器这个要求时，还可以指定一些只在1.1或者更高版本的计算功能集中才支持的编译优化。要将这个信息告诉编译器，只需在调用 nvcc 时增加一个命令选项：

nvcc -arch = sm_11

同样的，在编译需要使用共享内存原子操作的核函数时，你要告诉编译器代码需要1.2版本或者更高的计算功能集。

nvcc -arch = sm_12

9.2.3 原子操作简介

在编写传统的单线程应用程序时，程序员通常不需要使用原子操作。但有时也是需要的。

C++ 递增运算符：

x++;

步骤：1）读取x值；2）增加1；3）递增后结果写回x。读取-修改-写入（Read-Modify-Write）

多线程时会出现混乱，结果不可预测。所以我们需要将这3步变成1个不可分割为更小的操作，满足这种条件限制的操作称为原子操作。

CUDA C支持多种原子操作，当有数千个线程在内存访问上发生竞争时，这些操作能够确保在内存上实现安全的操作。

现在我们已经看到一个只有使用原子操作才能计算出正确结果的示例。

9.4 计算直方图

直方图(Histogram)又称质量分布图。是一种统计报告图，由一系列高度不等的纵向条纹或线段表示数据分布的情况。一般用横轴表示数据类型，纵轴表示分布情况。

给定一个包含一组元素的数据集，直方图表示每个元素的出现频率。例如，“Programming with CUDA C”中字符频率的直方图，结果如下

2 2 1 2 1 2 2 1 1 1 2 1 1 1

a c d g h i m n o p r t u w

直方图定义简单，但应用广泛，包括图像处理、数据压缩、计算机视觉、机器学习、音频编码等等。

9.4.1 在 CPU 上计算直方图

#include "../common/book.h"

#define SIZE    (100*1024*1024)

int main( void ) {
    unsigned char *buffer =
                     (unsigned char*)big_random_block( SIZE );

    // capture the start time
    clock_t         start, stop;
    start = clock();

    unsigned int    histo[256];
    for (int i=0; i<256; i++)
        histo[i] = 0;

    for (int i=0; i<SIZE; i++)
        histo[buffer[i]]++;

    stop = clock();
    float   elapsedTime = (float)(stop - start) /
                          (float)CLOCKS_PER_SEC * 1000.0f;
    printf( "Time to generate:  %3.1f ms\n", elapsedTime );

    long histoCount = 0;
    for (int i=0; i<256; i++) {
        histoCount += histo[i];
    }
    printf( "Histogram Sum:  %ld\n", histoCount );

    free( buffer );
    return 0;
}

big_random_block() 生成随机的字节流。

1字节 = 8比特

思想：每当buffer中出现某个z时，就递增直方图数组中索引为z的元素。这样就计算出z的出现次数。

后边的代码是验证个数是否正确。

9.4.2 在GPU上计算直方图

#include "../common/book.h"

#define SIZE    (100*1024*1024)


__global__ void histo_kernel( unsigned char *buffer,
                              long size,
                              unsigned int *histo ) {
    // calculate the starting index and the offset to the next
    // block that each thread will be processing
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd( &histo[buffer[i]], 1 );
        i += stride;
    }
}

int main( void ) {
    unsigned char *buffer =
                     (unsigned char*)big_random_block( SIZE );

    // capture the start time
    // starting the timer here so that we include the cost of
    // all of the operations on the GPU.
    cudaEvent_t     start, stop;
    HANDLE_ERROR( cudaEventCreate( &start ) );
    HANDLE_ERROR( cudaEventCreate( &stop ) );
    HANDLE_ERROR( cudaEventRecord( start, 0 ) );

    // allocate memory on the GPU for the file's data
    unsigned char *dev_buffer;
    unsigned int *dev_histo;
    HANDLE_ERROR( cudaMalloc( (void**)&dev_buffer, SIZE ) );
    HANDLE_ERROR( cudaMemcpy( dev_buffer, buffer, SIZE,
                              cudaMemcpyHostToDevice ) );

    HANDLE_ERROR( cudaMalloc( (void**)&dev_histo,
                              256 * sizeof( int ) ) );
    HANDLE_ERROR( cudaMemset( dev_histo, 0,
                              256 * sizeof( int ) ) );

    // kernel launch - 2x the number of mps gave best timing
    cudaDeviceProp  prop;
    HANDLE_ERROR( cudaGetDeviceProperties( &prop, 0 ) );
    int blocks = prop.multiProcessorCount;
    histo_kernel<<<blocks*2,256>>>( dev_buffer, SIZE, dev_histo );
    
    unsigned int    histo[256];
    HANDLE_ERROR( cudaMemcpy( histo, dev_histo,
                              256 * sizeof( int ),
                              cudaMemcpyDeviceToHost ) );

    // get stop time, and display the timing results
    HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
    HANDLE_ERROR( cudaEventSynchronize( stop ) );
    float   elapsedTime;
    HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
                                        start, stop ) );
    printf( "Time to generate:  %3.1f ms\n", elapsedTime );

    long histoCount = 0;
    for (int i=0; i<256; i++) {
        histoCount += histo[i];
    }
    printf( "Histogram Sum:  %ld\n", histoCount );

    // verify that we have the same counts via CPU
    for (int i=0; i<SIZE; i++)
        histo[buffer[i]]--;
    for (int i=0; i<256; i++) {
        if (histo[i] != 0)
            printf( "Failure at %d!  Off by %d\n", i, histo[i] );
    }

    HANDLE_ERROR( cudaEventDestroy( start ) );
    HANDLE_ERROR( cudaEventDestroy( stop ) );
    cudaFree( dev_histo );
    cudaFree( dev_buffer );
    free( buffer );
    return 0;
}

经验告诉我们：当线程块数量为GPU中处理器数量的2倍时，将达到最优性能。

1. 使用全局内存原子操作的直方图核函数

atomicAdd(addr, y)将生成一个原子的操作系列。包括，读取地址addr处的值，将y增加到这个值，以及将结果保存回地址addr。底层硬件确保addr安全。

不过此时的效率比cpu更低，因为对相同内存位置的操作都将被硬件串行化，这导致保存未完成操作的队列非常长，抵消了并行效果。

2. 使用共享内存原子操作和全局内存原子操作的直方图核函数

解决上述问题，全部代码如下：

#include "../common/book.h"

#define SIZE    (100*1024*1024)


__global__ void histo_kernel( unsigned char *buffer,
                              long size,
                              unsigned int *histo ) {

    // clear out the accumulation buffer called temp
    // since we are launched with 256 threads, it is easy
    // to clear that memory with one write per thread
    __shared__  unsigned int temp[256];
    temp[threadIdx.x] = 0;
    __syncthreads();

    // calculate the starting index and the offset to the next
    // block that each thread will be processing
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd( &temp[buffer[i]], 1 );
        i += stride;
    }
    // sync the data from the above writes to shared memory
    // then add the shared memory values to the values from
    // the other thread blocks using global memory
    // atomic adds
    // same as before, since we have 256 threads, updating the
    // global histogram is just one write per thread!
    __syncthreads();
    atomicAdd( &(histo[threadIdx.x]), temp[threadIdx.x] );
}

int main( void ) {
    unsigned char *buffer =
                     (unsigned char*)big_random_block( SIZE );

    // capture the start time
    // starting the timer here so that we include the cost of
    // all of the operations on the GPU.  if the data were
    // already on the GPU and we just timed the kernel
    // the timing would drop from 74 ms to 15 ms.  Very fast.
    cudaEvent_t     start, stop;
    HANDLE_ERROR( cudaEventCreate( &start ) );
    HANDLE_ERROR( cudaEventCreate( &stop ) );
    HANDLE_ERROR( cudaEventRecord( start, 0 ) );

    // allocate memory on the GPU for the file's data
    unsigned char *dev_buffer;
    unsigned int *dev_histo;
    HANDLE_ERROR( cudaMalloc( (void**)&dev_buffer, SIZE ) );
    HANDLE_ERROR( cudaMemcpy( dev_buffer, buffer, SIZE,
                              cudaMemcpyHostToDevice ) );

    HANDLE_ERROR( cudaMalloc( (void**)&dev_histo,
                              256 * sizeof( int ) ) );
    HANDLE_ERROR( cudaMemset( dev_histo, 0,
                              256 * sizeof( int ) ) );

    // kernel launch - 2x the number of mps gave best timing
    cudaDeviceProp  prop;
    HANDLE_ERROR( cudaGetDeviceProperties( &prop, 0 ) );
    int blocks = prop.multiProcessorCount;
    histo_kernel<<<blocks*2,256>>>( dev_buffer,
                                    SIZE, dev_histo );
    
    unsigned int    histo[256];
    HANDLE_ERROR( cudaMemcpy( histo, dev_histo,
                              256 * sizeof( int ),
                              cudaMemcpyDeviceToHost ) );

    // get stop time, and display the timing results
    HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
    HANDLE_ERROR( cudaEventSynchronize( stop ) );
    float   elapsedTime;
    HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
                                        start, stop ) );
    printf( "Time to generate:  %3.1f ms\n", elapsedTime );

    long histoCount = 0;
    for (int i=0; i<256; i++) {
        histoCount += histo[i];
    }
    printf( "Histogram Sum:  %ld\n", histoCount );

    // verify that we have the same counts via CPU
    for (int i=0; i<SIZE; i++)
        histo[buffer[i]]--;
    for (int i=0; i<256; i++) {
        if (histo[i] != 0)
            printf( "Failure at %d!\n", i );
    }

    HANDLE_ERROR( cudaEventDestroy( start ) );
    HANDLE_ERROR( cudaEventDestroy( stop ) );
    cudaFree( dev_histo );
    cudaFree( dev_buffer );
    free( buffer );
    return 0;
}

使用共享内存原子操作和全局原子性，性能将数量级的提升。

这里主要使用了重构算法，分成两个阶段计算，降低了内存访问上的竞争程度，带来了不错的效果。以后用得到哦，要记住这种策略。