The CUDA Programming Model

This article introduces the basic GPU programming model, the CUDA kernel, a simple CUDA program that adds two vectors, and how to compile such code under Linux.

SIMD model: Single Instruction, Multiple Data

SIMT model: Single Instruction, Multiple Threads

SIMT is the adaptation of SIMD used on GPUs.

Matrix operations, Monte Carlo calculations, and many other data manipulations can be programmed with the SIMT model.

CUDA kernel routine:

To write an SIMT program, you write a single code sequence that all threads on the GPU will execute. In CUDA, this code sequence is called a kernel routine.

The kernel code must conform to C. If a thread needs its thread ID, the code takes the following form:

index = ThreadID;
A[index] = B[index] + C[index];
 
All threads execute this same sequence; only the thread ID differs from thread to thread.
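To see the contrast with ordinary serial C: there the same work is a loop over all N elements, while under SIMT the loop body becomes the kernel and the loop index becomes the thread ID. A minimal sketch (the names vecAddSerial and vecAddSketch are hypothetical):

// Serial C: a single thread walks all N elements
void vecAddSerial(int *A, int *B, int *C, int N) {
    for (int index = 0; index < N; index++)
        A[index] = B[index] + C[index];
}

// SIMT: N threads each run the body once; the loop index
// is replaced by the thread ID supplied by the hardware
__global__ void vecAddSketch(int *A, int *B, int *C) {
    int index = threadIdx.x;   // this thread's ID within its block
    A[index] = B[index] + C[index];
}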
CPU and GPU memory

(The original post shows a figure here illustrating the separate CPU and GPU memory spaces.)

Basic structure of a CUDA program:

int main(int argc, char **argv) {
1. Allocate memory space in device (GPU) for data
2. Allocate memory space in host (CPU) for data
3. Copy data to GPU
4. Call "kernel" routine to execute on GPU
   (with CUDA syntax that defines the number of threads and their physical structure)
5. Transfer results from GPU to CPU
6. Free memory space in device (GPU)
7. Free memory space in host (CPU)
return 0;
}

 

1. Allocate device (GPU) memory, using the CUDA cudaMalloc routine:

int size = N * sizeof(int);   // space for N integers
int *devA, *devB, *devC;      // device pointers devA, devB, devC
cudaMalloc( (void**)&devA, size );
cudaMalloc( (void**)&devB, size );
cudaMalloc( (void**)&devC, size );
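The calls above are shown bare for readability, but every CUDA runtime call returns a cudaError_t that can be inspected. A minimal sketch of checking one of them:

cudaError_t err = cudaMalloc( (void**)&devA, size );
if (err != cudaSuccess)
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));

The same pattern applies to the cudaMemcpy and cudaFree calls below.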
 
2. Allocate host (CPU) memory, using the regular C malloc routines:
int *a, *b, *c;
a = (int*)malloc(size);
b = (int*)malloc(size);
c = (int*)malloc(size);
or statically declare variables:
#define N 256

int  a[N], b[N], c[N];

 

3. Copy the data from the CPU (host) to the GPU (device), using the CUDA routine cudaMemcpy:

cudaMemcpy( devA, A, size, cudaMemcpyHostToDevice);
cudaMemcpy( devB, B, size, cudaMemcpyHostToDevice);
where:
devA and devB are pointers to the destinations in the device
A and B are pointers to the host data
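The last argument selects the copy direction; besides cudaMemcpyHostToDevice, the runtime also defines cudaMemcpyDeviceToHost (used in step 5 below), cudaMemcpyDeviceToDevice, and cudaMemcpyHostToHost.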

 

4. Calling the "kernel" routine that executes on the GPU (device):

CUDA extends C with syntax for this: the call from the host to the device is written with triple angle brackets, which carry two parameters giving the organization and number of threads.

                 myKernel<<< n, m >>>(arg1, … );

n and m define the organization of thread blocks and of the threads within a block. For now, we will set n = 1, which says one block, and m = N, which says N threads in this block. arg1, … are the arguments to the routine myKernel, typically pointers to device memory obtained previously from cudaMalloc.
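A block cannot hold arbitrarily many threads (the limit is 1024 on current devices), so for larger N the block count n has to grow instead. A common sketch for choosing the two values, rounding up so that n * m >= N:

int m = 256;               // threads per block
int n = (N + m - 1) / m;   // blocks per grid, rounded up
myKernel<<< n, m >>>(arg1, … );

This is exactly the situation the refined example at the end of this article handles with its T and B inputs.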

 

A kernel is declared as follows, marked with the __global__ qualifier:

#define N 256
__global__ void vecAdd(int *A, int *B, int *C) { // Kernel definition
 
int i = threadIdx.x;     // built-in CUDA variable giving the thread ID within the block
C[i] = A[i] + B[i];
}
int main() {
// allocate device memory &
  // copy data to device
  // device mem. ptrs devA,devB,devC
vecAdd<<<1, N>>>(devA,devB,devC); // Grid of one block, N threads in block
}

 

 

Each of the N threads performs one pair-wise addition:
 
Thread 0:    devC[0] = devA[0] + devB[0];
Thread 1:    devC[1] = devA[1] + devB[1];
...
Thread N-1:  devC[N-1] = devA[N-1] + devB[N-1];
 

5. After the kernel has run, transfer the results back from the GPU (device) to the CPU (host):

This again uses the CUDA routine cudaMemcpy:

cudaMemcpy( C, devC, size, cudaMemcpyDeviceToHost);
where:
devC is a pointer in the device and C is a pointer in the host.
 
6. Free the memory on the device, using cudaFree:
cudaFree( devA );
cudaFree( devB );
cudaFree( devC );
 
7. Free the memory on the host: if it was allocated dynamically with malloc, release it with free:
free( a );
free( b );
free( c );
Putting it all together, a complete CUDA program looks like this:
#define N 256
__global__ void vecAdd(int *A, int *B, int *C) { 
int i = threadIdx.x;
C[i] = A[i] + B[i];
}
 
int main(int argc, char **argv) {
    int size = N * sizeof(int);
    int a[N], b[N], c[N], *devA, *devB, *devC;
    for (int i = 0; i < N; i++) {   // fill the inputs (left implicit in the outline above)
        a[i] = i;
        b[i] = i;
    }
    cudaMalloc( (void**)&devA, size );
    cudaMalloc( (void**)&devB, size );
    cudaMalloc( (void**)&devC, size );
    cudaMemcpy( devA, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy( devB, b, size, cudaMemcpyHostToDevice);
    vecAdd<<<1, N>>>(devA, devB, devC);
    cudaMemcpy( c, devC, size, cudaMemcpyDeviceToHost);
    cudaFree( devA );
    cudaFree( devB );
    cudaFree( devC );
    return 0;
}
 
 
This program adds the two vectors.
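To make sure the device really produced the right answer, the result can be checked on the host after the copy back into c. A minimal sketch:

for (int i = 0; i < N; i++)
    if (c[i] != a[i] + b[i])
        printf("mismatch at %d: %d != %d + %d\n", i, c[i], a[i], b[i]);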

With a little more polish, the program can be written as follows:

#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>
#include <time.h>
#define N 4096        // size of array
__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (tid < N) {                                     // guard: T*B may exceed N
        c[tid] = a[tid] + b[tid];
    }
}

 

int main(int argc, char *argv[]) {
    int T = 10, B = 1;            // threads per block, blocks per grid
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;
    printf("Size of array = %d\n", N);
    do {
        printf("Enter number of threads per block: ");
        scanf("%d", &T);
        printf("\nEnter number of blocks per grid: ");
        scanf("%d", &B);
        if (T * B < N) printf("Error: T x B < N, try again\n");
    } while (T * B < N);
    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));
    for (int i = 0; i < N; i++) {    // load arrays with some numbers
        a[i] = i;
        b[i] = i;
    }
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
    add<<<B, T>>>(dev_a, dev_b, dev_c);
    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) {
        printf("%d+%d=%d\n", a[i], b[i], c[i]);
    }
    cudaFree(dev_a);    // clean up
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}
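Neither version checks whether the kernel actually launched and completed. A sketch of the usual pattern, placed immediately after the <<< >>> call:

add<<<B, T>>>(dev_a, dev_b, dev_c);
cudaError_t err = cudaGetLastError();   // reports launch errors
if (err != cudaSuccess)
    fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();          // waits for the kernel, reports execution errors
if (err != cudaSuccess)
    fprintf(stderr, "kernel execution failed: %s\n", cudaGetErrorString(err));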

 

What remains is compiling this on Linux, which will be explained in the next installment!

Of course, you can also try it yourself in Visual Studio on Windows!