CUDA GPU Programming Introduction (1)

Basic positioning:
CPU parallelism means running multiple tasks at the same time (task parallelism) and strives to minimize latency; GPU parallelism is data parallelism within a single task and strives to maximize throughput. A substantial part of a CPU is devoted to control and scheduling, while a GPU is essentially a large accumulation of compute units: large-scale SIMD (Single Instruction, Multiple Data).

Traditional GPUs served graphics processing, and their defining feature is stream computing: thanks to the large number of compute units, they can process large amounts of mutually independent data simultaneously. Modern GPUs lean toward general-purpose computing:

• Shared memory and thread synchronization primitives eliminate the need for data independence
• Gather and scatter operations allow kernels to read and write data at arbitrary locations
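
Both points can be seen in a tiny kernel that reverses an array within one block: threads stage data in shared memory, synchronize, and then scatter it back to different locations (a minimal sketch; the kernel name and launch shape are ours, not from the original post):

/* reverse an n-element array in place within a single block;
   launch as: reverse_block<<<1, n, n * sizeof(float)>>>(d_data, n); */
__global__ void reverse_block(float *d, int n)
{
    extern __shared__ float tile[]; /* shared by all threads of the block */
    int t = threadIdx.x;
    tile[t] = d[t];                 /* gather into shared memory */
    __syncthreads();                /* wait until every element is staged */
    d[t] = tile[n - 1 - t];         /* scatter back in reverse order */
}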

CUDA programming model:
The CPU, as the main processor, is called the host; the GPU, as a coprocessor, is called the device. By invoking a kernel, the host hands large parallel computations off to the device. Host and device each have their own memory, which the other cannot access directly; data must be explicitly transferred between them. The host is responsible for allocating its own and the device's memory, for the data transfers in both directions, and for kernel invocation.
Schematic:
[figure: host/device (GPU) architecture diagram]

Basic hardware notes:
A GPU contains multiple multiprocessors, and each multiprocessor contains multiple stream processors (SPs), also called CUDA cores. These basic specifications can be inspected with CUDA's deviceQuery utility; for example, my PC has a rather ordinary NVIDIA GT630:
[figure: deviceQuery output for the GT630]

During execution, each block configured in a CUDA launch is processed by one multiprocessor, and each thread roughly corresponds to one core; that is an adequate first approximation. All actual execution, however, proceeds in units of warps. The warp size is 32 threads on all CUDA-capable NVIDIA GPUs (early hardware handled some memory operations per half-warp of 16 threads). Physically, the number of threads a multiprocessor executes truly simultaneously is the warp size. All threads of a block share that multiprocessor's resources (registers and shared memory).
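
These figures can also be queried programmatically through the runtime API; a minimal sketch (the printed fields are standard members of cudaDeviceProp):

#include <cuda_runtime.h>
#include <cstdio>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); /* properties of device 0 */
    printf("name:              %s\n", prop.name);
    printf("multiprocessors:   %d\n", prop.multiProcessorCount);
    printf("warp size:         %d\n", prop.warpSize);
    printf("shared mem/block:  %zu bytes\n", prop.sharedMemPerBlock);
    printf("registers/block:   %d\n", prop.regsPerBlock);
    return 0;
}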

At runtime, a thread can determine the block that it belongs to, the block dimensions, and the thread index within the block.

We will return to thread and block indexing shortly.

CUDA programming:

CUDA provides a set of extensions to the C programming language:
– new storage qualifiers, kernel invocation syntax, intrinsics, vector types, etc.
• CUDA source code is saved in .cu files
– host and device code can coexist in the same file
– storage qualifiers determine the type of code
• Compiled to object files using the nvcc compiler
– object files contain executable host and device code
• Can be linked with object files generated by other C/C++ compilers
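
To give a flavor of these extensions, a small fragment (a sketch of ours: the qualifiers, the built-in float4 vector type, and the __fmaf_rn intrinsic are standard CUDA, but the functions themselves are made up for illustration):

__device__ float scale(float v)        /* device code, callable only from the GPU */
{
    return __fmaf_rn(v, 2.0f, 1.0f);   /* intrinsic: fused multiply-add, v*2+1 */
}

__global__ void demo(float4 *data)     /* kernel, launched from the host with <<<...>>> */
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i].x = scale(data[i].x);      /* float4 is a built-in vector type */
}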

Example:

__global__ void saxpy_gpu(float *vecY, float *vecX, float alpha, int n)
{
    /* global index: one thread per array element */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        vecY[i] = alpha * vecX[i] + vecY[i];
}

key points:
1. The __global__ qualifier identifies this function as a kernel that executes on the device.
2. blockIdx, blockDim and threadIdx are built-in variables that uniquely identify a thread's position in the execution environment
– they are used to compute an offset into the data array
3. The host specifies the number of blocks and the block size during kernel invocation:

saxpy_gpu<<<numBlocks, blockSize>>>(y_d, x_d, alpha, n);

Basic indexing diagram:
[figure: block/thread index layout across the grid]
key differences:
• No need to explicitly loop over array elements – each element is processed in a separate thread (contrast with the serial version sketched below)
• The element index is computed from the block index, the block width, and the thread index within the block
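
For contrast, the corresponding serial loop on the host. The host example further down calls a reference implementation saxpy_cpu whose body the original post does not show; a minimal sketch of what it would look like:

void saxpy_cpu(float *vecY, float *vecX, float alpha, int n)
{
    /* one explicit loop replaces the grid of GPU threads */
    for (int i = 0; i < n; i++)
        vecY[i] = alpha * vecX[i] + vecY[i];
}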

basic scheme on host:

The host performs the following operations:
1. initialize device
2. allocate and initialize input arrays in host DRAM
3. allocate memory on device
4. upload input data to device
5. execute kernel on device
6. download results
7. check results
8. clean-up

example code:

#include <cuda.h> /* CUDA runtime API */
#include <cstdio>
#include <cstdlib> /* malloc, free, rand */
int main(int argc, char *argv[])
{
    float *x_host, *y_host; /* arrays for computation on host*/
    float *x_dev, *y_dev; /* arrays for computation on device */
    float *y_shadow; /* host-side copy of device results */
    int n = 32*1024;
    float alpha = 0.5f;
    int nerror;
    size_t memsize;
    int i, blockSize, nBlocks;
    /* here could add some code to check if GPU device is present */

    memsize = n * sizeof(float);
    /* allocate arrays on host */
    x_host = (float *)malloc(memsize);
    y_host = (float *)malloc(memsize);
    y_shadow = (float *)malloc(memsize);
    /* allocate arrays on device */
    cudaMalloc((void **) &x_dev, memsize);
    cudaMalloc((void **) &y_dev, memsize);

    /* add checks to catch any errors */

    /* initialize arrays on host */
    for (i = 0; i < n; i++)
    {
        x_host[i] = rand() / (float)RAND_MAX;
        y_host[i] = rand() / (float)RAND_MAX;
    }
    /* copy arrays to device memory (synchronous) */
    cudaMemcpy(x_dev, x_host, memsize, cudaMemcpyHostToDevice);
    cudaMemcpy(y_dev, y_host, memsize, cudaMemcpyHostToDevice);

    /* set up device execution configuration */
    blockSize = 512;
    nBlocks = n / blockSize + (n % blockSize > 0); /* round up so every element is covered */
    /* execute kernel (asynchronous!) */
    saxpy_gpu<<<nBlocks, blockSize>>>(y_dev, x_dev, alpha, n);
    /* could add check if this succeeded */
    /* execute host version (i.e. baseline reference results) */
    saxpy_cpu(y_host, x_host, alpha, n);

    /* retrieve results from device (synchronous) */
    cudaMemcpy(y_shadow, y_dev, memsize, cudaMemcpyDeviceToHost);
    /* ensure synchronization (cudaMemcpy is synchronous in most cases, but not all) */
    cudaDeviceSynchronize();
    /* check results; exact floating-point comparison works here only if the
       CPU and GPU perform the identical sequence of operations. In general
       (e.g. when the compiler contracts to fused multiply-add), compare
       against a small tolerance instead */
    nerror = 0;
    for (i = 0; i < n; i++)
    {
        if (y_shadow[i] != y_host[i]) nerror = nerror + 1;
    }
    printf("test comparison shows %d errors\n", nerror);

    /* free memory on device*/
    cudaFree(x_dev);
    cudaFree(y_dev);
    /* free memory on host */
    free(x_host);
    free(y_host);
    free(y_shadow);
    return 0;
} /* main */
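
The "add checks" comments in the listing can be filled in with a small error-checking helper; a minimal sketch (the CHECK macro is ours, not part of the CUDA API):

/* wrap any runtime call that returns cudaError_t */
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error %s at %s:%d\n",           \
                    cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

/* usage:
   CHECK(cudaMalloc((void **)&x_dev, memsize));
   saxpy_gpu<<<nBlocks, blockSize>>>(y_dev, x_dev, alpha, n);
   CHECK(cudaGetLastError());   kernel launches return no error code,
                                so query the last error instead      */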

Compiling:

• nvcc -arch=sm_20 -O2 program.cu -o program.x
• -arch=sm_20 means code is targeted at Compute Capability 2.0 architecture
• -O2 optimizes the CPU portion of the program
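
As noted above, nvcc-produced object files can be linked with objects from other C/C++ compilers; a sketch of such a split build (the file names are ours, and the CUDA library path varies by installation):

nvcc -arch=sm_20 -O2 -c kernels.cu -o kernels.o
g++ -O2 -c main.cpp -o main.o
g++ main.o kernels.o -o program.x -L/usr/local/cuda/lib64 -lcudart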

Be aware of memory bandwidth bottlenecks:
[figure: CPU/GPU memory bandwidth diagram]

• The connection between the CPU and the GPU has low bandwidth
– need to minimize data transfers
– important to use asynchronous transfers when possible, overlapping computation and transfer (a sketch follows)
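
A minimal sketch of such an overlap using a CUDA stream, assuming the saxpy_gpu kernel from above and pinned host buffers x_pinned/y_pinned (asynchronous copies require pinned memory; the variable names here are ours):

cudaStream_t stream;
cudaStreamCreate(&stream);

/* asynchronous copies: they return immediately and execute in 'stream' */
cudaMemcpyAsync(x_dev, x_pinned, memsize, cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(y_dev, y_pinned, memsize, cudaMemcpyHostToDevice, stream);

/* a kernel launched in the same stream runs after the copies complete */
saxpy_gpu<<<nBlocks, blockSize, 0, stream>>>(y_dev, x_dev, alpha, n);

/* ... the CPU is free to do other work here, overlapping with the GPU ... */

cudaMemcpyAsync(y_pinned, y_dev, memsize, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream); /* wait for all work queued in the stream */
cudaStreamDestroy(stream);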

Using pinned memory:
• Transfers between host and device are very slow compared to memory access within either the CPU or the GPU
• One way to speed transfers up, by a factor of 2 or so, is to use pinned memory on the host when allocating arrays that will be transferred to the GPU

int main(int argc, char *argv[])
{
    float *a_host;
    cudaMallocHost((void **) &a_host, memsize_input); /* pinned (page-locked) allocation */
    ...
    cudaFreeHost(a_host); /* pinned memory must be released with cudaFreeHost, not cudaFree */
}

Timing GPU accelerated codes:
• Presents specific difficulties because the CPU and GPU can be computing independently in parallel, i.e. asynchronously
• On the CPU one can use the standard function gettimeofday(...) (microsecond precision) and process the result
• If trying to time events on the GPU with this function, one must ensure synchronization
• This can be done with a call to cudaDeviceSynchronize() (sketched below)
• Memory copies to/from the device are synchronous in most cases, so they can also serve as synchronization points for timing
• Timing GPU kernels from the CPU may be insufficiently accurate
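
A minimal sketch of this CPU-side timing around a kernel launch (the helper timeval_to_sec is ours, not from the original post):

#include <sys/time.h>

/* convert a struct timeval to seconds as a double */
static double timeval_to_sec(struct timeval t)
{
    return t.tv_sec + t.tv_usec * 1e-6;
}

    ...
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    saxpy_gpu<<<nBlocks, blockSize>>>(y_dev, x_dev, alpha, n);
    cudaDeviceSynchronize();  /* without this, only the asynchronous
                                 launch itself would be timed */
    gettimeofday(&t1, NULL);
    printf("kernel took %f s\n", timeval_to_sec(t1) - timeval_to_sec(t0));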

Using timing mechanisms on the GPU:
• CUDA events are highly accurate on the GPU side and very useful for optimizing kernels

sample code:

    ...
    cudaEvent_t start, stop;
    float kernel_timer;
    ...
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);   /* record start event in stream 0 */
    saxpy_gpu<<<nBlocks, blockSize>>>(y_dev, x_dev, alpha, n);
    cudaEventRecord(stop, 0);    /* record stop event after the kernel */
    cudaEventSynchronize(stop);  /* wait until the stop event has occurred */
    cudaEventElapsedTime(&kernel_timer, start, stop); /* elapsed time in ms */
    printf("Test Kernel took %f ms\n", kernel_timer);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);