CUDA GPU Programming Introduction (1)

Basic positioning:
CPU parallelism means running multiple tasks at the same time (task parallelism) and strives to minimize latency; GPU parallelism is data parallelism within a single task and strives to maximize throughput. A substantial part of a CPU is devoted to control and scheduling, while a GPU is essentially a large accumulation of compute units: large-scale SIMD (Single Instruction, Multiple Data).

Traditional GPUs served graphics processing, and their defining feature is stream computing: thanks to the large number of compute units, they can process large amounts of mutually independent data simultaneously. Modern GPUs lean toward general-purpose computing:

• Shared memory and thread synchronization primitives eliminate the need for data independence
• Gather and scatter operations allow kernels to read and write data at arbitrary locations
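
Both points can be seen in a tiny kernel that reverses an array within one block: threads stage data in shared memory, synchronize, and then scatter it back to different locations (a minimal sketch; the kernel name and launch shape are ours, not from the original post):

/* reverse an n-element array in place within a single block;
   launch as: reverse_block<<<1, n, n * sizeof(float)>>>(d_data, n); */
__global__ void reverse_block(float *d, int n)
{
    extern __shared__ float tile[]; /* shared by all threads of the block */
    int t = threadIdx.x;
    tile[t] = d[t];                 /* gather into shared memory */
    __syncthreads();                /* wait until every element is staged */
    d[t] = tile[n - 1 - t];         /* scatter back in reverse order */
}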

CUDA programming model:
The CPU, as the main processor, is called the host; the GPU, as a coprocessor, is called the device. By invoking a kernel, the host hands large parallel computations off to the device. Host and device each have their own memory, which the other cannot access directly; data must be explicitly transferred between them. The host is responsible for allocating its own and the device's memory, for the data transfers in both directions, and for kernel invocation.
Schematic:
[figure: host/device (GPU) architecture diagram]

Basic hardware notes:
A GPU contains multiple multiprocessors, and each multiprocessor contains multiple stream processors (SPs), also called CUDA cores. These basic specifications can be inspected with CUDA's deviceQuery utility; for example, my PC has a rather ordinary NVIDIA GT630:
[figure: deviceQuery output for the GT630]

During execution, each block configured in a CUDA launch is processed by one multiprocessor, and each thread roughly corresponds to one core; that is an adequate first approximation. All actual execution, however, proceeds in units of warps. The warp size is 32 threads on all CUDA-capable NVIDIA GPUs (early hardware handled some memory operations per half-warp of 16 threads). Physically, the number of threads a multiprocessor executes truly simultaneously is the warp size. All threads of a block share that multiprocessor's resources (registers and shared memory).
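
These figures can also be queried programmatically through the runtime API; a minimal sketch (the printed fields are standard members of cudaDeviceProp):

#include <cuda_runtime.h>
#include <cstdio>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); /* properties of device 0 */
    printf("name:              %s\n", prop.name);
    printf("multiprocessors:   %d\n", prop.multiProcessorCount);
    printf("warp size:         %d\n", prop.warpSize);
    printf("shared mem/block:  %zu bytes\n", prop.sharedMemPerBlock);
    printf("registers/block:   %d\n", prop.regsPerBlock);
    return 0;
}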

At runtime, a thread can determine the block that it belongs to, the block dimensions, and the thread index within the block.

We will return to thread and block indexing shortly.

CUDA programming:

CUDA provides a set of extensions to the C programming language:
– new storage qualifiers, kernel invocation syntax, intrinsics, vector types, etc.
• CUDA source code is saved in .cu files
– host and device code can coexist in the same file
– storage qualifiers determine the type of code
• Compiled to object files using the nvcc compiler
– object files contain executable host and device code
• Can be linked with object files generated by other C/C++ compilers
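
To give a flavor of these extensions, a small fragment (a sketch of ours: the qualifiers, the built-in float4 vector type, and the __fmaf_rn intrinsic are standard CUDA, but the functions themselves are made up for illustration):

__device__ float scale(float v)        /* device code, callable only from the GPU */
{
    return __fmaf_rn(v, 2.0f, 1.0f);   /* intrinsic: fused multiply-add, v*2+1 */
}

__global__ void demo(float4 *data)     /* kernel, launched from the host with <<<...>>> */
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i].x = scale(data[i].x);      /* float4 is a built-in vector type */
}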

Example:

__global__ void saxpy_gpu(float *vecY, float *vecX, float alpha, int n)
{
    /* global index: one thread per array element */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        vecY[i] = alpha * vecX[i] + vecY[i];
}

key points:
1. The __global__ qualifier identifies this function as a kernel that executes on the device.
2. blockIdx, blockDim and threadIdx are built-in variables that uniquely identify a thread's position in the execution environment
– they are used to compute an offset into the data array
3. The host specifies the number of blocks and the block size during kernel invocation:

saxpy_gpu<<<numBlocks, blockSize>>>(y_d, x_d, alpha, n);

Basic indexing diagram:
[figure: block/thread index layout across the grid]
key differences:
• No need to explicitly loop over array elements – each element is processed in a separate thread (contrast with the serial version sketched below)
• The element index is computed from the block index, the block width, and the thread index within the block
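
For contrast, the corresponding serial loop on the host. The host example further down calls a reference implementation saxpy_cpu whose body the original post does not show; a minimal sketch of what it would look like:

void saxpy_cpu(float *vecY, float *vecX, float alpha, int n)
{
    /* one explicit loop replaces the grid of GPU threads */
    for (int i = 0; i < n; i++)
        vecY[i] = alpha * vecX[i] + vecY[i];
}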

basic scheme on host:

The host performs the following operations:
1. initialize device
2. allocate and initialize input arrays in host DRAM
3. allocate memory on device
4. upload input data to device
5. execute kernel on device
6. download results
7. check results
8. clean-up

example code:

#include <cuda.h> /* CUDA runtime API */
#include <cstdio>
#include <cstdlib> /* malloc, free, rand */
int main(int argc, char *argv[])
{
    float *x_host, *y_host; /* arrays for computation on host*/
    float *x_dev, *y_dev; /* arrays for computation on device */
    float *y_shadow; /* host-side copy of device results */
    int n = 32*1024;
    float alpha = 0.5f;
    int nerror;
    size_t memsize;
    int i, blockSize, nBlocks;
    /* here could add some code to check if GPU device is present */

    memsize = n * sizeof(float);
    /* allocate arrays on host */
    x_host = (float *)malloc(memsize);
    y_host = (float *)malloc(memsize);
    y_shadow = (float *)malloc(memsize);
    /* allocate arrays on device */
    cudaMalloc((void **) &x_dev, memsize);
    cudaMalloc((void **) &y_dev, memsize);

    /* add checks to catch any errors */

    /* initialize arrays on host */
    for (i = 0; i < n; i++)
    {
        x_host[i] = rand() / (float)RAND_MAX;
        y_host[i] = rand() / (float)RAND_MAX;
    }
    /* copy arrays to device memory (synchronous) */
    cudaMemcpy(x_dev, x_host, memsize, cudaMemcpyHostToDevice);
    cudaMemcpy(y_dev, y_host, memsize, cudaMemcpyHostToDevice);

    /* set up device execution configuration */
    blockSize = 512;
    nBlocks = n / blockSize + (n % blockSize > 0); /* round up so every element is covered */
    /* execute kernel (asynchronous!) */
    saxpy_gpu<<<nBlocks, blockSize>>>(y_dev, x_dev, alpha, n);
    /* could add check if this succeeded */
    /* execute host version (i.e. baseline reference results) */
    saxpy_cpu(y_host, x_host, alpha, n);

    /* retrieve results from device (synchronous) */
    cudaMemcpy(y_shadow, y_dev, memsize, cudaMemcpyDeviceToHost);
    /* ensure synchronization (cudaMemcpy is synchronous in most cases, but not all) */
    cudaDeviceSynchronize();
    /* check results; exact floating-point comparison works here only if the
       CPU and GPU perform the identical sequence of operations. In general
       (e.g. when the compiler contracts to fused multiply-add), compare
       against a small tolerance instead */
    nerror = 0;
    for (i = 0; i < n; i++)
    {
        if (y_shadow[i] != y_host[i]) nerror = nerror + 1;
    }
    printf("test comparison shows %d errors\n", nerror);

    /* free memory on device*/
    cudaFree(x_dev);
    cudaFree(y_dev);
    /* free memory on host */
    free(x_host);
    free(y_host);
    free(y_shadow);
    return 0;
} /* main */
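
The "add checks" comments in the listing can be filled in with a small error-checking helper; a minimal sketch (the CHECK macro is ours, not part of the CUDA API):

/* wrap any runtime call that returns cudaError_t */
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error %s at %s:%d\n",           \
                    cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

/* usage:
   CHECK(cudaMalloc((void **)&x_dev, memsize));
   saxpy_gpu<<<nBlocks, blockSize>>>(y_dev, x_dev, alpha, n);
   CHECK(cudaGetLastError());   kernel launches return no error code,
                                so query the last error instead      */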

Compiling:

• nvcc -arch=sm_20 -O2 program.cu -o program.x
• -arch=sm_20 means code is targeted at Compute Capability 2.0 architecture
• -O2 optimizes the CPU portion of the program
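
As noted above, nvcc-produced object files can be linked with objects from other C/C++ compilers; a sketch of such a split build (the file names are ours, and the CUDA library path varies by installation):

nvcc -arch=sm_20 -O2 -c kernels.cu -o kernels.o
g++ -O2 -c main.cpp -o main.o
g++ main.o kernels.o -o program.x -L/usr/local/cuda/lib64 -lcudart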

Be aware of memory bandwidth bottlenecks:
[figure: CPU/GPU memory bandwidth diagram]

• The connection between the CPU and the GPU has low bandwidth
– need to minimize data transfers
– important to use asynchronous transfers when possible, overlapping computation and transfer (a sketch follows)
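
A minimal sketch of such an overlap using a CUDA stream, assuming the saxpy_gpu kernel from above and pinned host buffers x_pinned/y_pinned (asynchronous copies require pinned memory; the variable names here are ours):

cudaStream_t stream;
cudaStreamCreate(&stream);

/* asynchronous copies: they return immediately and execute in 'stream' */
cudaMemcpyAsync(x_dev, x_pinned, memsize, cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(y_dev, y_pinned, memsize, cudaMemcpyHostToDevice, stream);

/* a kernel launched in the same stream runs after the copies complete */
saxpy_gpu<<<nBlocks, blockSize, 0, stream>>>(y_dev, x_dev, alpha, n);

/* ... the CPU is free to do other work here, overlapping with the GPU ... */

cudaMemcpyAsync(y_pinned, y_dev, memsize, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream); /* wait for all work queued in the stream */
cudaStreamDestroy(stream);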

Using pinned memory:
• Transfers between host and device are very slow compared to memory access within either the CPU or the GPU
• One way to speed transfers up, by a factor of 2 or so, is to use pinned memory on the host when allocating arrays that will be transferred to the GPU

int main(int argc, char *argv[])
{
    float *a_host;
    cudaMallocHost((void **) &a_host, memsize_input); /* pinned (page-locked) allocation */
    ...
    cudaFreeHost(a_host); /* pinned memory must be released with cudaFreeHost, not cudaFree */
}

Timing GPU accelerated codes:
• Presents specific difficulties because the CPU and GPU can be computing independently in parallel, i.e. asynchronously
• On the CPU one can use the standard function gettimeofday(...) (microsecond precision) and process the result
• If trying to time events on the GPU with this function, one must ensure synchronization
• This can be done with a call to cudaDeviceSynchronize() (sketched below)
• Memory copies to/from the device are synchronous in most cases, so they can also serve as synchronization points for timing
• Timing GPU kernels from the CPU may be insufficiently accurate
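
A minimal sketch of this CPU-side timing around a kernel launch (the helper timeval_to_sec is ours, not from the original post):

#include <sys/time.h>

/* convert a struct timeval to seconds as a double */
static double timeval_to_sec(struct timeval t)
{
    return t.tv_sec + t.tv_usec * 1e-6;
}

    ...
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    saxpy_gpu<<<nBlocks, blockSize>>>(y_dev, x_dev, alpha, n);
    cudaDeviceSynchronize();  /* without this, only the asynchronous
                                 launch itself would be timed */
    gettimeofday(&t1, NULL);
    printf("kernel took %f s\n", timeval_to_sec(t1) - timeval_to_sec(t0));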

Using timing mechanisms on the GPU:
• CUDA events are highly accurate on the GPU side and very useful for optimizing kernels

sample code:

    ...
    cudaEvent_t start, stop;
    float kernel_timer;
    ...
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);   /* record start event in stream 0 */
    saxpy_gpu<<<nBlocks, blockSize>>>(y_dev, x_dev, alpha, n);
    cudaEventRecord(stop, 0);    /* record stop event after the kernel */
    cudaEventSynchronize(stop);  /* wait until the stop event has occurred */
    cudaEventElapsedTime(&kernel_timer, start, stop); /* elapsed time in ms */
    printf("Test Kernel took %f ms\n", kernel_timer);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);