This article introduces the basic GPU programming model, CUDA kernels, a simple CUDA program that adds two vectors, and how to compile the code under Linux.
SIMD model: Single Instruction, Multiple Data.
SIMT model: Single Instruction, Multiple Threads.
SIMT is the adaptation of the SIMD model to GPUs.
Matrix operations, Monte Carlo calculations, and many kinds of data manipulation can all be programmed with the SIMT model.
CUDA kernel routine:
To write a SIMT program, you write a code sequence that every thread on the GPU executes. In CUDA, this code sequence is called a kernel routine.
The kernel code is written in (extended) C. If a thread needs its own ID, it obtains it along the following lines:

```c
index = ThreadID;
A[index] = B[index] + C[index];
```

and every thread executes these same statements.
Basic CUDA program structure:

```c
int main(int argc, char **argv) {
    // 1. Allocate memory space in device (GPU) for data
    // 2. Allocate memory space in host (CPU) for data
    // 3. Copy data to GPU
    // 4. Call "kernel" routine to execute on GPU
    //    (with CUDA syntax defining the number of threads and their structure)
    // 5. Transfer results from GPU to CPU
    // 6. Free memory space in device (GPU)
    // 7. Free memory space in host (CPU)
    return 0;
}
```
1. Allocate device (GPU) memory with the CUDA routine cudaMalloc:

```c
int size = N * sizeof(int);      // space for N integers
int *devA, *devB, *devC;         // device pointers
cudaMalloc((void**)&devA, size);
cudaMalloc((void**)&devB, size);
cudaMalloc((void**)&devC, size);
```

2. Allocate host (CPU) memory with the regular C malloc routines:

```c
int *a, *b, *c;
...
a = (int*)malloc(size);
b = (int*)malloc(size);
c = (int*)malloc(size);
```

or statically declare the arrays:

```c
#define N 256
...
int a[N], b[N], c[N];
```
3. Copy the data from the host (CPU) to the device (GPU) with the CUDA routine cudaMemcpy:

```c
cudaMemcpy(devA, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(devB, B, size, cudaMemcpyHostToDevice);
```

where devA and devB are pointers to the destinations in device memory, and A and B are pointers to the data on the host.
4. Call the kernel routine that executes on the GPU (device):

CUDA extends C with a syntax for this: the call from host to device uses triple angle brackets, which enclose two parameters describing the organization and number of threads:

```c
myKernel<<<n, m>>>(arg1, ...);
```

n and m define the organization of thread blocks and of the threads within a block. For now we set n = 1, which means one block, and m = N, which means N threads in that block. arg1, ... are the arguments to myKernel, typically pointers to device memory obtained previously from cudaMalloc.
A kernel is declared by prefixing it with the __global__ qualifier:

```c
#define N 256

__global__ void vecAdd(int *A, int *B, int *C) {  // kernel definition
    int i = threadIdx.x;  // built-in CUDA variable giving the thread ID within the block
    C[i] = A[i] + B[i];
}

int main() {
    // allocate device memory and copy data to the device,
    // obtaining the device memory pointers devA, devB, devC
    vecAdd<<<1, N>>>(devA, devB, devC);  // grid of one block, N threads in the block
    ...
}
```
Each of the N threads performs one pair-wise addition:

```
Thread 0:   devC[0]   = devA[0]   + devB[0];
Thread 1:   devC[1]   = devA[1]   + devB[1];
...
Thread N-1: devC[N-1] = devA[N-1] + devB[N-1];
```

5. When the kernel has finished, copy the results from the device (GPU) back to the host (CPU):
Again this uses the CUDA routine cudaMemcpy:

```c
cudaMemcpy(C, devC, size, cudaMemcpyDeviceToHost);
```

where devC is a pointer in device memory and C is a pointer in host memory.

6. Free the memory on the device with cudaFree:

```c
cudaFree(devA);
cudaFree(devB);
cudaFree(devC);
```

7. Free the memory on the host: if it was allocated dynamically with malloc, release it with free:

```c
free(a);
free(b);
free(c);
```
The complete program then looks like this (the host arrays are filled with some values so the result is meaningful):

```c
#define N 256

__global__ void vecAdd(int *A, int *B, int *C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main(int argc, char **argv) {
    int size = N * sizeof(int);
    int a[N], b[N], c[N], *devA, *devB, *devC;
    for (int i = 0; i < N; i++) {  // load the host arrays
        a[i] = i;
        b[i] = i;
    }
    cudaMalloc((void**)&devA, size);
    cudaMalloc((void**)&devB, size);
    cudaMalloc((void**)&devC, size);
    cudaMemcpy(devA, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(devB, b, size, cudaMemcpyHostToDevice);
    vecAdd<<<1, N>>>(devA, devB, devC);
    cudaMemcpy(c, devC, size, cudaMemcpyDeviceToHost);
    cudaFree(devA);
    cudaFree(devB);
    cudaFree(devC);
    return 0;
}
```

This program adds the two vectors.
Polishing the program a bit further, we arrive at the following version:
```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda.h>

#define N 4096  // size of array

__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) {
        c[tid] = a[tid] + b[tid];
    }
}
```
```c
int main(int argc, char *argv[]) {
    int T = 10, B = 1;  // threads per block / blocks per grid
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    printf("Size of array = %d\n", N);
    do {
        printf("Enter number of threads per block: ");
        scanf("%d", &T);
        printf("\nEnter number of blocks per grid: ");
        scanf("%d", &B);
        if (T * B < N) printf("Error: T x B < N, try again\n");
    } while (T * B < N);

    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));

    for (int i = 0; i < N; i++) {  // load arrays with some numbers
        a[i] = i;
        b[i] = i * 1;
    }

    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<B, T>>>(dev_a, dev_b, dev_c);

    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++) {
        printf("%d+%d=%d\n", a[i], b[i], c[i]);
    }

    cudaFree(dev_a);  // clean up
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}
```
All that remains is compiling the program on Linux, which will be covered in the next installment!
Of course, you can also try it yourself in Visual Studio on Windows!