The goal of MP1: get familiar with the CUDA API and its related setup by implementing vector addition.
The assignment mainly covers the following steps; you add your own code below each "//@@" comment.
Allocate device memory
Copy host memory to device
Initialize thread block and kernel grid dimensions
Invoke CUDA kernel
Copy results from device to host
Free device memory
Write the CUDA kernel
Vector addition itself needs no further explanation; my source code is below:
// MP 1
#include <wb.h>

__global__ void vecAdd(float *in1, float *in2, float *out, int len) {
    //@@ Insert code to implement vector addition here
    // Get our global thread ID
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    // Make sure we do not go out of bounds
    if (id < len)
        out[id] = in1[id] + in2[id];
}

int main(int argc, char **argv) {
    wbArg_t args;
    int inputLength;
    float *hostInput1;
    float *hostInput2;
    float *hostOutput;
    float *deviceInput1;
    float *deviceInput2;
    float *deviceOutput;

    args = wbArg_read(argc, argv);

    wbTime_start(Generic, "Importing data and creating memory on host");
    hostInput1 = (float *) wbImport(wbArg_getInputFile(args, 0), &inputLength);
    hostInput2 = (float *) wbImport(wbArg_getInputFile(args, 1), &inputLength);
    hostOutput = (float *) malloc(inputLength * sizeof(float));
    wbTime_stop(Generic, "Importing data and creating memory on host");

    wbLog(TRACE, "The input length is ", inputLength);

    wbTime_start(GPU, "Allocating GPU memory.");
    //@@ Allocate GPU memory here
    size_t bytes = inputLength * sizeof(float);
    cudaMalloc(&deviceInput1, bytes);
    cudaMalloc(&deviceInput2, bytes);
    cudaMalloc(&deviceOutput, bytes);
    wbTime_stop(GPU, "Allocating GPU memory.");

    wbTime_start(GPU, "Copying input memory to the GPU.");
    //@@ Copy memory to the GPU here
    cudaMemcpy(deviceInput1, hostInput1, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(deviceInput2, hostInput2, bytes, cudaMemcpyHostToDevice);
    wbTime_stop(GPU, "Copying input memory to the GPU.");

    //@@ Initialize the grid and block dimensions here
    // Number of threads in each thread block
    int blockSize = 1024;
    // Number of thread blocks in grid, rounded up so every element is covered
    int gridSize = (int)ceil((float)inputLength / blockSize);

    wbTime_start(Compute, "Performing CUDA computation");
    //@@ Launch the GPU Kernel here
    vecAdd<<<gridSize, blockSize>>>(deviceInput1, deviceInput2, deviceOutput, inputLength);
    cudaDeviceSynchronize();  // cudaThreadSynchronize() is deprecated
    wbTime_stop(Compute, "Performing CUDA computation");

    wbTime_start(Copy, "Copying output memory to the CPU");
    //@@ Copy the GPU memory back to the CPU here
    cudaMemcpy(hostOutput, deviceOutput, bytes, cudaMemcpyDeviceToHost);
    wbTime_stop(Copy, "Copying output memory to the CPU");

    wbTime_start(GPU, "Freeing GPU Memory");
    //@@ Free the GPU memory here
    cudaFree(deviceInput1);
    cudaFree(deviceInput2);
    cudaFree(deviceOutput);
    wbTime_stop(GPU, "Freeing GPU Memory");

    wbSolution(args, hostOutput, inputLength);

    free(hostInput1);
    free(hostInput2);
    free(hostOutput);

    return 0;
}
The required source files were introduced in my earlier post "Heterogeneous Parallel Programming assignment MP0: Initial Lab Tour with Device Query": https://github.com/ashwin/coursera-heterogeneous. Build the GenDataMP1 executable from src/GenDataMP1.cpp, then use it to generate vectors of length 10000. It produces three vector files at once, A, B, and C, where C = A + B.
Then run the MP1 program:
$ optirun ./MP1 vecA.txt vecB.txt vecC.txt
The output:
[Generic] 0.0193745920 Importing data and creating memory on host
Trace main::32 The input length is 10000
[GPU ] 0.0597575680 Allocating GPU memory.
[GPU ] 0.0000576000 Copying input memory to the GPU.
[Compute] 0.0000394240 Performing CUDA computation
[Copy ] 0.0000514560 Copying output memory to the CPU
[GPU ] 0.0001185280 Freeing GPU Memory
All tests passed!
Once local testing passes, copy the code into the online submission system, which runs six datasets in total. All tests pass, and the assignment is complete.