This article introduces the basic GPU programming model, CUDA kernels, a simple CUDA program that adds two vectors, and how to compile the code under Linux.
SIMD model: Single Instruction, Multiple Data.
SIMT model: Single Instruction, Multiple Threads.
SIMT is the adaptation of the SIMD model to GPUs.
Matrix operations, Monte Carlo calculations, and many kinds of data manipulation can all be programmed with the SIMT model.
CUDA kernel routine:
To write a SIMT program, you write a code sequence that every thread on the GPU executes. In CUDA, this code sequence is called a kernel routine.
The kernel code is written in (extended) C. If a thread needs its own ID, it obtains it along the following lines:

```c
index = ThreadID;
A[index] = B[index] + C[index];
```

and every thread executes these same statements.
Basic CUDA program structure:

```c
int main(int argc, char **argv) {
    // 1. Allocate memory space in device (GPU) for data
    // 2. Allocate memory space in host (CPU) for data
    // 3. Copy data to GPU
    // 4. Call "kernel" routine to execute on GPU
    //    (with CUDA syntax defining the number of threads and their structure)
    // 5. Transfer results from GPU to CPU
    // 6. Free memory space in device (GPU)
    // 7. Free memory space in host (CPU)
    return 0;
}
```
1. Allocate device (GPU) memory with the CUDA routine cudaMalloc:

```c
int size = N * sizeof(int);      // space for N integers
int *devA, *devB, *devC;         // device pointers
cudaMalloc((void**)&devA, size);
cudaMalloc((void**)&devB, size);
cudaMalloc((void**)&devC, size);
```

2. Allocate host (CPU) memory with the regular C malloc routines:

```c
int *a, *b, *c;
...
a = (int*)malloc(size);
b = (int*)malloc(size);
c = (int*)malloc(size);
```

or statically declare the arrays:

```c
#define N 256
...
int a[N], b[N], c[N];
```
3. Copy the data from the host (CPU) to the device (GPU) with the CUDA routine cudaMemcpy:

```c
cudaMemcpy(devA, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(devB, B, size, cudaMemcpyHostToDevice);
```

where devA and devB are pointers to the destinations in device memory, and A and B are pointers to the data on the host.
4. Call the kernel routine that executes on the GPU (device):

CUDA extends C with a syntax for this: the call from host to device uses triple angle brackets, which enclose two parameters describing the organization and number of threads:

```c
myKernel<<<n, m>>>(arg1, ...);
```

n and m define the organization of thread blocks and of the threads within a block. For now we set n = 1, which means one block, and m = N, which means N threads in that block. arg1, ... are the arguments to myKernel, typically pointers to device memory obtained previously from cudaMalloc.
A kernel is declared by prefixing it with the __global__ qualifier:

```c
#define N 256

__global__ void vecAdd(int *A, int *B, int *C) {  // kernel definition
    int i = threadIdx.x;  // built-in CUDA variable giving the thread ID within the block
    C[i] = A[i] + B[i];
}

int main() {
    // allocate device memory and copy data to the device,
    // obtaining the device memory pointers devA, devB, devC
    vecAdd<<<1, N>>>(devA, devB, devC);  // grid of one block, N threads in the block
    ...
}
```
Each of the N threads performs one pair-wise addition:

```
Thread 0:   devC[0]   = devA[0]   + devB[0];
Thread 1:   devC[1]   = devA[1]   + devB[1];
...
Thread N-1: devC[N-1] = devA[N-1] + devB[N-1];
```

5. When the kernel has finished, copy the results from the device (GPU) back to the host (CPU):
Again this uses the CUDA routine cudaMemcpy:

```c
cudaMemcpy(C, devC, size, cudaMemcpyDeviceToHost);
```

where devC is a pointer in device memory and C is a pointer in host memory.

6. Free the memory on the device with cudaFree:

```c
cudaFree(devA);
cudaFree(devB);
cudaFree(devC);
```

7. Free the memory on the host: if it was allocated dynamically with malloc, release it with free:

```c
free(a);
free(b);
free(c);
```
The complete program then looks like this (the host arrays are filled with some values so the result is meaningful):

```c
#define N 256

__global__ void vecAdd(int *A, int *B, int *C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main(int argc, char **argv) {
    int size = N * sizeof(int);
    int a[N], b[N], c[N], *devA, *devB, *devC;
    for (int i = 0; i < N; i++) {  // load the host arrays
        a[i] = i;
        b[i] = i;
    }
    cudaMalloc((void**)&devA, size);
    cudaMalloc((void**)&devB, size);
    cudaMalloc((void**)&devC, size);
    cudaMemcpy(devA, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(devB, b, size, cudaMemcpyHostToDevice);
    vecAdd<<<1, N>>>(devA, devB, devC);
    cudaMemcpy(c, devC, size, cudaMemcpyDeviceToHost);
    cudaFree(devA);
    cudaFree(devB);
    cudaFree(devC);
    return 0;
}
```

This program adds the two vectors.
Polishing the program a bit further, we arrive at the following version:
```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda.h>

#define N 4096  // size of array

__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) {
        c[tid] = a[tid] + b[tid];
    }
}
```
```c
int main(int argc, char *argv[]) {
    int T = 10, B = 1;  // threads per block / blocks per grid
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    printf("Size of array = %d\n", N);
    do {
        printf("Enter number of threads per block: ");
        scanf("%d", &T);
        printf("\nEnter number of blocks per grid: ");
        scanf("%d", &B);
        if (T * B < N) printf("Error: T x B < N, try again\n");
    } while (T * B < N);

    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));

    for (int i = 0; i < N; i++) {  // load arrays with some numbers
        a[i] = i;
        b[i] = i * 1;
    }

    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<B, T>>>(dev_a, dev_b, dev_c);

    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++) {
        printf("%d+%d=%d\n", a[i], b[i], c[i]);
    }

    cudaFree(dev_a);  // clean up
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}
```
All that remains is compiling the program on Linux, which will be covered in the next installment!
Of course, you can also try it yourself in Visual Studio on Windows!