http://http.download.nvidia.com/developer/cuda/podcasts/CUDA_Programming_Basics_-_Part_I.m4v
This is the first of two modules discussing CUDA programming basics. It assumes some familiarity with basic CUDA concepts, such as CUDA threads, thread blocks, and CUDA kernels, which were described in the previous module, CUDA Programming Model Overview.
This module discusses the CUDA software stack and some aspects of compiling CUDA code, along with basic GPU memory management. Only the basics of CUDA programming are presented in these two modules; for additional features of the CUDA API, please consult the CUDA Programming Guide.
A CUDA installation consists of three components: a driver, a toolkit which contains everything needed to compile CUDA code, and an SDK consisting of code examples. Detailed installation instructions for different platforms are contained in the Quick Start Guide located at the CUDA Zone web site, along with downloads and other CUDA resources.
CUDA is a heterogeneous programming environment where both the CPU and the GPU are used by an application. The developer can write a single source file containing both CPU and GPU code, and the CPU code can leverage CUDA-optimized routines such as those found in the BLAS and FFT libraries. The NVIDIA C compiler, nvcc, is the compiler driver that splits the GPU code from the CPU host code. The CPU code is compiled by the native host compiler, and the GPU code is compiled into PTX, or Parallel Thread eXecution, object code.
This slide represents the compilation path of CUDA code. Any source file containing CUDA language extensions must have a .cu extension and be compiled with nvcc. nvcc is actually a compiler driver that invokes all the tools necessary to compile both the CPU and the GPU code. Although not the default case, one can specify via options to nvcc that CPU C code or device PTX object code be generated. PTX object code is an intermediate, device-independent representation that is then compiled for a particular device. A CUDA executable contains the PTX code in addition to the target code, so new target code can be generated on the fly when the executable runs on a device different from the one it was compiled for.
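As a hedged illustration of this compilation flow (not code from the slides), a minimal .cu source file mixing host and device code might look like the following sketch; the file name, kernel name, and compile commands shown in the comments are assumptions for this example.

    // minimal.cu -- illustrative sketch only
    // whole-program build:  nvcc -o minimal minimal.cu
    // PTX only:             nvcc -ptx minimal.cu
    #include <stdio.h>

    // Device code: nvcc compiles this into PTX object code.
    __global__ void kernel(void) { }

    // Host code: nvcc hands this off to the native host compiler.
    int main(void)
    {
        kernel<<<1, 1>>>();       // launch one block of one thread
        cudaThreadSynchronize();  // wait for the kernel to complete
        printf("kernel completed\n");
        return 0;
    }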
There are four different build configurations one can use when developing CUDA code, as shown below. Using nvcc with no flags builds release mode. The -g flag can be used to build debug mode. The -deviceemu flag builds device emulation mode, where all code runs on the CPU, but the executable contains no debugging symbols. The -g and -deviceemu flags can be combined to build debug device emulation mode, where all code runs on the CPU and the executable contains debug symbols.
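For concreteness, these four configurations correspond to nvcc command lines like the following; the source file name is made up for the example.

    nvcc kernel.cu                  (release mode)
    nvcc -g kernel.cu               (debug mode)
    nvcc -deviceemu kernel.cu       (device emulation mode)
    nvcc -g -deviceemu kernel.cu    (debug device emulation mode)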
Having briefly discussed the CUDA compilation process, we now turn our attention to writing CUDA code. The first topic we discuss is managing device memory. The CPU and GPU have separate memory spaces, and both memory spaces are managed from the host, or CPU. Device memory management from host code includes allocation and deallocation of memory, as well as copying data between host and device. The device memory we discuss here is the global memory discussed in the Programming Model Overview module.
The device memory allocation and initialization functions are similar to C's functions. Where C has malloc, memset, and free, CUDA has cudaMalloc, cudaMemset, and cudaFree. The prototypes of the C and CUDA functions are similar, except that cudaMalloc returns the pointer to the allocated device memory through its first argument. This is because all CUDA API functions return an error code that can be used to verify successful completion of the call; CUDA error reporting is discussed at the end of Programming Basics. The segment of host code on this slide demonstrates the use of these functions: it allocates, initializes, and deallocates a 1024-element integer array. Note that the declaration of a device memory pointer is the same as that of a host memory pointer. In order to distinguish between pointers to device and host memory spaces, programmers often choose variable names with a "d" or "h" prefix or suffix. To initialize the array from host code, cudaMemset can be used. Because a_d points to an address in device memory, it must be accessed from the host through CUDA API functions, not by assignment.
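A sketch of the host-code segment described above, reconstructed from the narration rather than copied from the slide, might look like this:

    int n = 1024;
    int nbytes = n * sizeof(int);  // size of a 1024-element integer array
    int *a_d = 0;                  // device pointer, declared like a host pointer

    cudaMalloc((void **)&a_d, nbytes);  // allocate device memory
    cudaMemset(a_d, 0, nbytes);         // initialize it from host code
    cudaFree(a_d);                      // deallocate device memory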
Once memory is allocated on the device, data can be transferred to or from the device using cudaMemcpy. The first three arguments of cudaMemcpy follow the format of C's memcpy: the destination memory address, the source memory address, and the number of bytes to be transferred. The last argument is an enumeration that indicates the direction of the transfer in terms of host and device, and can take the values cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, and cudaMemcpyDeviceToDevice. This is needed because pointers to host and device memory are declared the same way; the source and destination pointer declarations provide no information about which memory spaces they point to. As far as synchronization between host and device is concerned, cudaMemcpy is a blocking call: it blocks the CPU thread and returns after the copy is complete, and it initiates the copy only after all previous CUDA calls have completed. There is an asynchronous memory transfer function that can be used as well, and it will be discussed in the memory optimization module.
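For reference, the prototype of cudaMemcpy and a typical call look like this (a_d, a_h, and nbytes are carried over from the previous sketch):

    cudaError_t cudaMemcpy(void *dst, const void *src, size_t count,
                           enum cudaMemcpyKind kind);

    // copy nbytes from host array a_h to device array a_d
    cudaMemcpy(a_d, a_h, nbytes, cudaMemcpyHostToDevice);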
To demonstrate the use of CUDA data management, we step through a sample code that allocates memory for four arrays: two on the host, a_h and b_h, and two on the device, a_d and b_d. The code initializes a_h, transfers its data to a_d, then performs a device-to-device transfer to b_d, and finally transfers the data back to the host in b_h. Remember that all of this is done in host code; there is no device code in this example. In the highlighted section of code, note that the pointers to host and device memory are declared in the same manner; the h and d suffixes are simply a convention used to help programmers distinguish between the two memory spaces. The next highlighted section of code calculates the number of bytes to be allocated for each array and then uses C's malloc function to allocate the two arrays on the host; likewise, cudaMalloc is used to allocate the two arrays on the device. Next, the host array a_h is initialized, and its data is transferred from a_h on the host to a_d on the device using cudaMemcpy. Note that the enumeration cudaMemcpyHostToDevice is used, and the order of the first two arguments is destination followed by source: a_d and then a_h. cudaMemcpy can also be used to transfer data between two arrays on the device, as is done here from a_d to b_d. The final transfer is from b_d on the device to b_h on the host; once again, notice the order of the first two arguments in relation to the last argument, cudaMemcpyDeviceToHost. The two host arrays are then compared element by element. Finally, the host arrays are deallocated using free, and the device arrays are deallocated using cudaFree. A sketch of the full example follows below.
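Putting these steps together, the sample code narrated above might look like the following sketch; the array size, initialization values, and the use of assert for the element-by-element comparison are assumptions, not details from the slides. Compile it as a .cu file with nvcc.

    #include <stdio.h>
    #include <stdlib.h>
    #include <assert.h>

    int main(void)
    {
        float *a_h, *b_h;   // host pointers
        float *a_d, *b_d;   // device pointers, declared the same way
        int i, N = 1024;
        size_t nbytes = N * sizeof(float);  // bytes per array

        // allocate the two host arrays and the two device arrays
        a_h = (float *)malloc(nbytes);
        b_h = (float *)malloc(nbytes);
        cudaMalloc((void **)&a_d, nbytes);
        cudaMalloc((void **)&b_d, nbytes);

        // initialize the host array a_h
        for (i = 0; i < N; i++) a_h[i] = (float)i;

        // host-to-device transfer: a_h to a_d
        cudaMemcpy(a_d, a_h, nbytes, cudaMemcpyHostToDevice);

        // device-to-device transfer: a_d to b_d
        cudaMemcpy(b_d, a_d, nbytes, cudaMemcpyDeviceToDevice);

        // device-to-host transfer: b_d to b_h
        cudaMemcpy(b_h, b_d, nbytes, cudaMemcpyDeviceToHost);

        // compare the two host arrays element by element
        for (i = 0; i < N; i++) assert(a_h[i] == b_h[i]);

        // deallocate host arrays with free, device arrays with cudaFree
        free(a_h); free(b_h);
        cudaFree(a_d); cudaFree(b_d);
        return 0;
    }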
The previous example of GPU data management concludes Part I of CUDA Programming Basics. All of the CUDA API functions discussed to this point execute on the host. Part II of Programming Basics focuses on GPU code.