Second Chapter: Parallel Programming Using CUDA C


  This chapter starts with a simple variable addition program and then incrementally builds towards more complex vector manipulation examples in CUDA C. It discusses how vectors are operated upon in CUDA programs and how CUDA can accelerate vector operations compared to CPU processing. The following topics are covered:

  • The concept of the kernel call;
  • Creating kernel functions and passing parameters to them in CUDA;
  • Configuring kernel parameters and memory allocation for CUDA programs;
  • Thread execution in CUDA programs;
  • Accessing GPU device properties from CUDA programs;
  • Working with vectors in CUDA programs;
  • Parallel communication patterns.

CUDA program structure

  In this section, a simple two-variable addition program is used to explain important concepts related to CUDA programming, such as kernel calls, passing parameters to kernel functions from host to device, the configuration of kernel parameters, the CUDA APIs needed to exploit data parallelism, and how memory allocation takes place on the host and the device.

Two-variable addition program in CUDA C

  [Note]:
    The project type to create in Microsoft Visual Studio is NVIDIA | CUDA x.x | CUDA x.x Runtime.
    It is very important to explicitly free all the memory allocated on the device; otherwise, you might run out of memory at some point.

#include <iostream>
#include <cuda_runtime.h>
using namespace std;

// Kernel: adds two integers passed by value and writes the result to device memory
__global__ void kernel(int g_a, int g_b, int * p_g_c)
{
       *p_g_c = g_a + g_b;
}

int main()
{
       int h_c;          // result on the host
       int * p_d_c;      // pointer to the result in device memory

       // Allocate memory for one int on the device
       cudaMalloc((void **)&p_d_c, sizeof(int));

       // Launch the kernel with one block of one thread
       kernel<<<1, 1>>>(1, 4, p_d_c);

       // Copy the result back from device to host memory
       cudaMemcpy(&h_c, p_d_c, sizeof(int), cudaMemcpyDeviceToHost);

       cout << "1 + 4 = " << h_c << endl;

       // Free the device memory explicitly
       cudaFree(p_d_c);

       return 0;
}

A kernel call

  Basically, a kernel call means launching device code from host code. A kernel call typically generates a large number of blocks and threads to exploit data parallelism on the GPU.
  Kernel call syntax: kernel<<<numberOfBlocks, threadsPerBlock, sharedMemSize>>>(parameters...);
  <<< >>> is the kernel launch operator; it carries the configuration parameters for the kernel launch.
  [Note]: Pointers passed as parameters to a kernel should only point to device memory.

Configuring kernel parameters

  Normally, there is a limit of 512 or 1024 threads per block. Each block runs on a streaming multiprocessor, and the threads within a block can communicate with one another via shared memory.
  [Note]: The programmer can't choose which multiprocessor will execute a particular block, or in which order blocks or threads will execute.
  The GPU supports a 3D grid of blocks and 3D blocks of threads, with the following launch syntax (a small sketch follows):
  kernel<<<dim3(blockNumX, blockNumY, blockNumZ), dim3(threadNumX, threadNumY, threadNumZ)>>>(parameters);
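  As a minimal, runnable sketch of this syntax (the kernel name indexKernel and the grid/block sizes are illustrative assumptions), each thread computes its 2D coordinates from its block and thread indices:

#include <cstdio>
#include <cuda_runtime.h>

// Each thread of a 2D launch prints its row and column coordinates
__global__ void indexKernel()
{
       int col = blockIdx.x * blockDim.x + threadIdx.x;
       int row = blockIdx.y * blockDim.y + threadIdx.y;
       printf("row: %d, col: %d\n", row, col);
}

int main()
{
       // a 2 x 2 grid of blocks, each block containing 4 x 4 threads
       indexKernel<<<dim3(2, 2, 1), dim3(4, 4, 1)>>>();
       cudaDeviceSynchronize();
       return 0;
}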

CUDA API functions

  1. __global__ (keyword):
      It's one of three qualifier keywords, along with __device__ and __host__. This keyword indicates that a function is a kernel: it executes on the device and is called from the host.
    [Note]:
      A function declared with this qualifier can only be called from the host and executes on the device. If you want a function to execute on the device and be callable from device code, use the __device__ keyword, as in the sketch below.
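  A minimal sketch illustrating the difference (the function names addOnDevice and addKernel are illustrative): the __device__ function is callable only from other device code, while the __global__ kernel is launched from the host.

#include <cstdio>
#include <cuda_runtime.h>

// __device__ function: callable only from other device code
__device__ int addOnDevice(int a, int b)
{
       return a + b;
}

// __global__ kernel: launched from the host, runs on the device
__global__ void addKernel()
{
       printf("2 + 3 = %d\n", addOnDevice(2, 3));
}

int main()
{
       addKernel<<<1, 1>>>();
       cudaDeviceSynchronize();   // wait for the kernel so its output is flushed
       return 0;
}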

  2. cudaMalloc (function):
      It's similar to the malloc function used in C for dynamic memory allocation. This function allocates a memory block of a specific size on the device. The syntax of cudaMalloc with an example is as follows:

cudaMalloc(void ** dPointer, size_t size);
// Example:
cudaMalloc((void **)&dPointer, sizeof(int));
  3. cudaMemcpy (function):
      This function copies a block of memory from one location to another, on the host or the device. The syntax with an example is as follows:
cudaMemcpy(void * dstPtr, const void * srcPtr, size_t size, enum cudaMemcpyKind kind);
// Example:
cudaMemcpy(&h_c, d_c, sizeof(int), cudaMemcpyDeviceToHost);
  4. cudaFree (function):
      It frees memory allocated on the device. The syntax of cudaFree with an example is as follows:
cudaFree(void * dPtr);
// Example:
cudaFree(d_c);

Passing parameters to CUDA functions

Passing parameters by value

  Scalar arguments are copied to the device automatically when the kernel is launched, as in the first example above:

kernel<<<1, 1>>>(a, b, gP);   // a and b are plain integers passed by value; gP is a device pointer

Passing parameters by reference

#include <iostream>
#include <cuda_runtime.h>
using namespace std;

// Kernel: swaps the two values stored in device memory
__global__ void kernel(int * ga, int * gb)
{
       int t;
       t = *ga;
       *ga = *gb;
       *gb = t;
}

int main()
{
       int a = 1;
       int b = 2;
       int * ga;
       int * gb;
       // Note that the pointers ga and gb themselves reside in host memory,
       // but they point to device memory, so they are called device pointers.
       // When passing parameters by reference, the data must first be copied from host to device.
       cudaMalloc((void **)&ga, sizeof(int));
       cudaMalloc((void **)&gb, sizeof(int));

       cudaMemcpy(ga, &a, sizeof(int), cudaMemcpyHostToDevice);
       cudaMemcpy(gb, &b, sizeof(int), cudaMemcpyHostToDevice);

       cout << "a = " << a << "\t b = " << b << endl;

       kernel<<<1, 1>>>(ga, gb);

       cudaMemcpy(&a, ga, sizeof(int), cudaMemcpyDeviceToHost);
       cudaMemcpy(&b, gb, sizeof(int), cudaMemcpyDeviceToHost);

       cout << "a = " << a << "\t b = " << b << endl;

       cudaFree(ga);
       cudaFree(gb);
       return 0;
}

Executing threads on a device

  The blocks execute in a random order, while the threads of each block print in sequence in this example.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel()
{
       // Each thread prints its block index and its thread index within the block
       printf("Blocks: %d, \tThreads: %d.\n", blockIdx.x, threadIdx.x);
}

int main()
{
       kernel<<<5, 5>>>();
       // cudaDeviceSynchronize();   // uncomment to wait for the kernel to finish before continuing
       printf("all threads are finished!\n");
       return 0;
}

The output of the program above, without cudaDeviceSynchronize(): [screenshot]

The output of the program with cudaDeviceSynchronize(): [screenshot]

  Comparing the two results, we can see that if cudaDeviceSynchronize() is not added to the program, control returns to the CPU thread immediately after the kernel launch, even though the kernel has not finished.

Accessing GPU device properties from CUDA programs

cudaGetDeviceCount();

  This returns the number of CUDA-enabled devices present on the system. The syntax of cudaGetDeviceCount() is as follows:

int deviceCount = 0;
cudaGetDeviceCount(&deviceCount);

General device properties

The cudaDeviceProp structure provides several members that can be used to identify the device and the versions being used:
cudaDeviceProp->name;                // device name
cudaDeviceProp->multiProcessorCount; // number of streaming multiprocessors
cudaDeviceProp->clockRate;           // GPU clock rate (in kHz)

cudaDeviceProp deviceProperties;
// Query the properties of the last detected device (ID = deviceCount - 1)
cudaGetDeviceProperties(&deviceProperties, deviceCount - 1);

cout << "Device " << deviceCount - 1 << " : " << deviceProperties.name << endl;

// The CUDA driver and runtime versions can also be queried
int driverVersion, runtimeVersion;
cudaDriverGetVersion(&driverVersion);
cudaRuntimeGetVersion(&runtimeVersion);

cout << deviceProperties.multiProcessorCount << " Multiprocessors" << endl;
cout << "GPU Max Clock rate: " << deviceProperties.clockRate * 1e-3f << " MHz" <<
      " (" << deviceProperties.clockRate * 1e-6f << " GHz)" << endl;

Memory-related properties

  Memory on the GPU has a hierarchical architecture. It can be divided into L1 cache, L2 cache, global memory, texture memory, and shared memory.
cudaDeviceProp->memoryClockRate;   // memory clock rate (in kHz)
cudaDeviceProp->memoryBusWidth;    // memory bus width (in bits)

These two members of cudaDeviceProp affect the overall speed of your program.
cudaDeviceProp->totalGlobalMem;    // the size of global memory
cudaDeviceProp->totalConstMem;     // total constant memory
cudaDeviceProp->sharedMemPerBlock; // total shared memory per block
cudaDeviceProp->l2CacheSize;       // the size of the L2 cache
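  These members can be printed in the same way as the general properties above; a short sketch, assuming the deviceProperties structure has already been filled in by cudaGetDeviceProperties:

cout << "Memory Clock Rate: " << deviceProperties.memoryClockRate * 1e-3f << " MHz" << endl;
cout << "Memory Bus Width: " << deviceProperties.memoryBusWidth << " bits" << endl;
cout << "Total Global Memory: " << deviceProperties.totalGlobalMem / (1024.0 * 1024.0) << " MB" << endl;
cout << "Total Constant Memory: " << deviceProperties.totalConstMem / 1024.0 << " KB" << endl;
cout << "Shared Memory per Block: " << deviceProperties.sharedMemPerBlock / 1024.0 << " KB" << endl;
cout << "L2 Cache Size: " << deviceProperties.l2CacheSize / 1024.0 << " KB" << endl;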

Thread-related properties

cudaDeviceProp->maxThreadsPerMultiProcessor; // the maximum number of threads per multiprocessor
cudaDeviceProp->maxThreadsPerBlock;          // the maximum number of threads per block
cudaDeviceProp->maxThreadsDim[3];            // the maximum number of threads per block in each dimension
cudaDeviceProp->maxGridSize[3];              // the maximum number of blocks per grid in each dimension

       cout << "the max number of threads per multiprocessors: " << deviceProperties.maxThreadsPerMultiProcessor << endl;
       cout << "the max number of threads per block: " << deviceProperties.maxThreadsPerBlock << endl;
       cout << "the max threads per block in each dimension: ("
              << deviceProperties.maxThreadsDim[0] << ", "
              << deviceProperties.maxThreadsDim[1] << ", "
              << deviceProperties.maxThreadsDim[2] << ")" << endl;
       cout << "the max blocks per grid in each dimension: ("
              << deviceProperties.maxGridSize[0] << ", "
              << deviceProperties.maxGridSize[1] << ", "
              << deviceProperties.maxGridSize[2] << ")" << endl;

The output of the program above: [screenshot]

To find out whether the device supports double-precision floating-point operations and to select that device for your application:

cudaDeviceProp deviceProperties;
int device;
cudaGetDevice(&device);
// ---------------------------------------------------------------------------------
// To find a device that supports double-precision floating-point operations:
// first zero out a cudaDeviceProp structure,
// then set its major and minor members to the required compute capability,
// then let cudaChooseDevice() search for a device matching these properties,
// and finally select that device with cudaSetDevice().
// ---------------------------------------------------------------------------------
memset(&deviceProperties, 0, sizeof(cudaDeviceProp));
deviceProperties.minor = 3;
deviceProperties.major = 1;
cudaChooseDevice(&device, &deviceProperties);
std::cout << "ID of device which supports double precision is: " << device << std::endl;
cudaSetDevice(device);

The CUDA documentation says that devices with compute capability 1.3 or higher (that is, major greater than 1, or major equal to 1 and minor at least 3) support double-precision operations.
cudaChooseDevice is the API that helps in choosing a device with particular properties.

Vector operations in CUDA

Two-vector addition program

// ----------------------------------------------------------------------
// Two-vector addition program (GPU version)
// ----------------------------------------------------------------------
#include <iostream>
#include <ctime>
#include <cuda_runtime.h>
using namespace std;

#define N 10   // vector length (illustrative value; N is assumed to be defined)

// forward declarations of the two kernel functions defined below
__global__ void gpuVecInit(int * vec);
__global__ void gpuAddVec(int * gpuVec1, int * gpuVec2, int * gpuResult);

int main()
{
      int hostVec1[N];
      int hostVec2[N];
      int hostResult[N];

      size_t size = N * sizeof(int);

      // vector initialization on the host
      for (int i = 0; i < N; i++)
      {
            hostVec1[i] = 2 * i;
            hostVec2[i] = 3 * i;
      }

      // Kernel parameters must be device pointers;
      // a host array such as int vec[N] or a std::vector cannot be passed directly.
      int * gpuVec1;
      int * gpuVec2;
      int * gpuResult;

      cudaMalloc((void **)&gpuVec1, size);
      cudaMalloc((void **)&gpuVec2, size);
      cudaMalloc((void **)&gpuResult, size);

      // Incorrect attempt at passing a host array by reference:
      // int gpuVector[N];
      // cudaMalloc((void **)&gpuVector, size);   // must not be called on the address of a host array
      // gpuVecInit<<<1, N>>>(gpuVector);

      // vector initialization on the GPU
      clock_t startTime = clock();
      cout << "gpuVec1: " << endl;
      gpuVecInit<<<1, N>>>(gpuVec1);
      cudaDeviceSynchronize();

      cout << "gpuVec2: " << endl;
      gpuVecInit<<<1, N>>>(gpuVec2);
      cudaDeviceSynchronize();

      gpuAddVec<<<N, 1>>>(gpuVec1, gpuVec2, gpuResult);
      cudaDeviceSynchronize();   // wait for the addition kernel before stopping the timer
      clock_t endTime = clock();
      cout << "the running time of vector addition: " << (double)(endTime - startTime) / CLOCKS_PER_SEC << endl;

      cudaMemcpy(hostResult, gpuResult, sizeof(int) * N, cudaMemcpyDeviceToHost);
      cout << "\nhostResult: " << endl;

      for (int i = 0; i < N; i++)
      {
            cout << hostResult[i] << " ";
      }

      cudaFree(gpuVec1);
      cudaFree(gpuVec2);
      cudaFree(gpuResult);
      return 0;
}
// The two kernel functions
__global__ void gpuAddVec(int * gpuVec1, int * gpuVec2, int * gpuResult)
{
       // one block per element: blockIdx.x is the element index
       int x = blockIdx.x;
       if (x < N)
              gpuResult[x] = gpuVec1[x] + gpuVec2[x];
}

__global__ void gpuVecInit(int * vec)
{
       // one thread per element: threadIdx.x is the element index
       int i = threadIdx.x;
       if (i < N)
       {
              vec[i] = 2 * i * i;
       }
}

  [Note] clock() returns the current clock tick count, and CLOCKS_PER_SEC is the number of clock ticks per second.

Parallel communication patterns

  When several threads are executed in parallel, they follow a certain communication pattern that indicates where each thread takes its input from and where it writes its output in memory.

Map

  In this communication pattern, each thread or task takes a single input and produces a single output. The code of the map pattern looks as follows:
output[i] = input[i] * 2;
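  In CUDA terms, a map is simply a one-thread-per-element kernel; a minimal sketch (the kernel name and parameters are illustrative):

// Map: every thread reads one input element and writes one output element at the same index
__global__ void mapKernel(const int * input, int * output, int n)
{
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n)
              output[i] = input[i] * 2;
}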

Gather

   In this pattern, each thread or task has multiple inputs, and it produces a single output to be written at a single location in memory.
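  A minimal gather sketch (illustrative names): each thread reads several neighboring input elements and writes a single combined output element:

// Gather: multiple reads, one write per thread
__global__ void gatherKernel(const int * input, int * output, int n)
{
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i > 0 && i < n - 1)
              output[i] = input[i - 1] + input[i] + input[i + 1];
}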

Scatter

  In a scatter pattern, a thread or task takes a single input and computes where in memory it should write its output. It can also be a one-to-many operation.
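  A minimal scatter sketch (illustrative names): the write location is computed from the input data, so several threads may target the same address, in which case atomic operations would be needed to avoid a race:

// Scatter: one read per thread, the write location depends on the data
// (assumes non-negative input values so the index stays in range)
__global__ void scatterKernel(const int * input, int * output, int n)
{
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n)
              output[input[i] % n] = i;
}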

Stencil

  When threads or tasks read input from a fixed neighborhood of an array, this is called a stencil communication pattern. It is very useful in image-processing examples where we work on 3x3 or 5x5 neighborhood windows. It is similar to the gather pattern.
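  A minimal one-dimensional stencil sketch (illustrative names): each thread averages a fixed neighborhood of the input array around its own index:

// Stencil: each output element is computed from a fixed neighborhood of the input
__global__ void stencilKernel(const float * input, float * output, int n)
{
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i > 0 && i < n - 1)
              output[i] = (input[i - 1] + input[i] + input[i + 1]) / 3.0f;
}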

Transpose

   The transpose pattern is used when the input is in row-major form and we want the output in column-major form. It is particularly useful if you have a structure of arrays and you want to convert it into an array of structures.
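  A minimal transpose sketch (illustrative names): the input matrix is read in row-major order and written out in column-major order:

// Transpose: element (row, col) of the row-major input becomes element (col, row) of the output
__global__ void transposeKernel(const int * input, int * output, int rows, int cols)
{
       int col = blockIdx.x * blockDim.x + threadIdx.x;
       int row = blockIdx.y * blockDim.y + threadIdx.y;
       if (row < rows && col < cols)
              output[col * rows + row] = input[row * cols + col];
}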
