相关概念
GPU 硬件架构具有三层的存储结构,分别是:线程私有存储、线程块共享存储、全局存储。访问速度自然是由快到慢。CUDA 程序编写的一个重要理念也是尽可能少的访问全局缓存。如下图所示:
下面我们对这三种访存模式进行一一的代码测试。
代码测试
下面的代码分别对三种访存模式进行相应的简单测试。
// Using different memory spaces in CUDA
#include <stdio.h>
/**********************
* using local memory *
**********************/
// a __device__ or __global__ function runs on the GPU
__global__ void use_local_memory_GPU(float in)
{
float f; // variable "f" is in local memory and private to each thread
f = in; // parameter "in" is in local memory and private to each thread
// ... real code would presumably do other stuff here ...
}
/**********************
* using global memory *
**********************/
// a __global__ function runs on the GPU & can be called from host
__global__ void use_global_memory_GPU(float *array)
{
// "array" is a pointer into global memory on the device
array[threadIdx.x] = 2.0f * (float) threadIdx.x;
}
/**********************
* using shared memory *
**********************/
// (for clarity, hardcoding 128 threads/elements and omitting out-of-bounds checks)
__global__ void use_shared_memory_GPU(float *array)
{
// local variables, private to each thread
int i, index = threadIdx.x;
float average, sum = 0.0f;
// __shared__ variables are visible to all threads in the thread block
// and have the same lifetime as the thread block
__shared__ float sh_arr[128];
// copy data from "array" in global memory to sh_arr in shared memory.
// here, each thread is responsible for copying a single element.
sh_arr[index] = array[index];
__syncthreads(); // ensure all the writes to shared memory have completed
// now, sh_arr is fully populated. Let's find the average of all previous elements
for (i=0; i<index; i++) { sum += sh_arr[i]; }
average = sum / (index + 1.0f);
// if array[index] is greater than the average of array[0..index-1], replace with average.
// since array[] is in global memory, this change will be seen by the host (and potentially
// other thread blocks, if any)
if (array[index] > average) { array[index] = average; }
// the following code has NO EFFECT: it modifies shared memory, but
// the resulting modified data is never copied back to global memory
// and vanishes when the thread block completes
sh_arr[index] = 3.14;
}
int main(int argc, char **argv)
{
/*
* First, call a kernel that shows using local memory
*/
use_local_memory_GPU<<<1, 128>>>(2.0f);
/*
* Next, call a kernel that shows using global memory
*/
float h_arr[128]; // convention: h_ variables live on host
float *d_arr; // convention: d_ variables live on device (GPU global mem)
// allocate global memory on the device, place result in "d_arr"
cudaMalloc((void **) &d_arr, sizeof(float) * 128);
// now copy data from host memory "h_arr" to device memory "d_arr"
cudaMemcpy((void *)d_arr, (void *)h_arr, sizeof(float) * 128, cudaMemcpyHostToDevice);
// launch the kernel (1 block of 128 threads)
use_global_memory_GPU<<<1, 128>>>(d_arr); // modifies the contents of array at d_arr
// copy the modified array back to the host, overwriting contents of h_arr
cudaMemcpy((void *)h_arr, (void *)d_arr, sizeof(float) * 128, cudaMemcpyDeviceToHost);
// ... do other stuff ...
/*
* Next, call a kernel that shows using shared memory
*/
// as before, pass in a pointer to data in global memory
use_shared_memory_GPU<<<1, 128>>>(d_arr);
// copy the modified array back to the host
cudaMemcpy((void *)h_arr, (void *)d_arr, sizeof(float) * 128, cudaMemcpyHostToDevice);
// ... do other stuff ...
return 0;
}
其中对共享存储的访问是最复杂的。
可以发现在共享存储的使用中调用了 __syncthreads() 函数,该函数的作用便是令所有进程都写完之后再进行后面的操作,类似线程挂起。