CUDA Programming Basics Part I

http://http.download.nvidia.com/developer/cuda/podcasts/CUDA_Programming_Basics_-_Part_I.m4v

 

This is the first of two modules discussing CUDA programming basics. It assumes some familiarity with basic CUDA concepts, such as CUDA threads, thread blocks, and CUDA kernels, which were described in the previous module, CUDA Programming Model Overview.

This module discusses the CUDA software stack and some aspects of compiling CUDA code, along with basic GPU memory management. Only the basics of CUDA programming are presented in these two modules; for additional features of the CUDA API, please consult the CUDA Programming Guide.

A CUDA installation consists of three components: a driver; a toolkit, which contains everything needed to compile CUDA code; and an SDK containing code examples. Detailed installation instructions for the different platforms are contained in the Quick Start Guide located on the CUDA Zone website, along with downloads and other CUDA resources.

CUDA is a heterogeneous programming environment in which both the CPU and the GPU are used by an application. The developer can write a single source file containing both CPU and GPU code, and the CPU code can leverage CUDA-optimized routines such as those found in the BLAS and FFT libraries. The NVIDIA C compiler is the compiler driver that splits the GPU code from the CPU host code. The CPU code is compiled by the native host compiler, and the GPU code is compiled into PTX, or Parallel Thread eXecution, object code.

This slide represents the compilation path of CUDA code. Any source file containing CUDA language extensions must have a .cu extension and be compiled with NVCC. NVCC is actually a compiler driver that invokes all the tools and compilers necessary for both the CPU and GPU code. Although it is not the default case, one can specify via options to NVCC that CPU C code or device PTX object code be generated. PTX object code is an intermediate, device-independent representation that is then compiled for a particular device. A CUDA executable contains the PTX code in addition to the target code, so new target code can be generated on the fly when the executable is run on a device different from the one it was compiled for.

There are four different build configurations one can use when developing CUDA code. Using NVCC with no flags builds release mode. The -g flag can be used to build debug mode. The -deviceemu flag builds device emulation mode, where all code runs on the CPU, but the executable contains no debugging symbols. The -g and -deviceemu flags can be combined to build debug device emulation mode, where all code runs on the CPU and the executable contains debugging symbols.

Having briefly discussed the CUDA compilation process, we now turn our attention to writing CUDA code. The first topic we discuss is managing device memory. The CPU and GPU have separate memory spaces, and both memory spaces are managed from the host, or CPU. Device memory management from host code includes allocation and deallocation of memory, as well as copying data between host and device. The device memory we discuss here is the global memory described in the Programming Model Overview module.

The device memory allocation, assignment, and release functions are analogous to C's functions: where C has malloc, memset, and free, CUDA has cudaMalloc, cudaMemset, and cudaFree. The prototypes of the C and CUDA functions are similar, except that cudaMalloc returns the pointer to the allocated device memory through its first argument. This is because all CUDA API functions return an error code that can be used to verify successful completion of the call; CUDA error reporting is discussed at the end of Programming Basics. The segment of host code on this slide demonstrates the use of these functions: it allocates, initializes, and deallocates a 1024-element integer array. Note that a device memory pointer is declared the same way as a host memory pointer. In order to distinguish between pointers to the device and host memory spaces, programmers often choose variable names with a "d" or "h" prefix or suffix. To initialize the array from host code, cudaMemset can be used. Since a_d points to an address in device memory, access to it from the host must be performed through CUDA API functions, not by assignment.
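A minimal sketch of the allocate, initialize, and deallocate sequence described above, following the "d" suffix convention for the device pointer (this is a reconstruction, not the exact code from the slide; production code should also check the error codes these calls return):

```cuda
#include <cuda_runtime.h>

int main(void)
{
    int *a_d;                               /* device pointer, declared like a host pointer */
    size_t nbytes = 1024 * sizeof(int);     /* 1024-element integer array */

    cudaMalloc((void **)&a_d, nbytes);      /* address of allocation returned via first argument */
    cudaMemset(a_d, 0, nbytes);             /* initialize device memory from host code */
    cudaFree(a_d);                          /* release the device allocation */
    return 0;
}
```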

Once memory is allocated on the device, data can be transferred to or from the device using cudaMemcpy. The first three arguments of cudaMemcpy follow the format of C's memcpy: the destination memory address, the source memory address, and the number of bytes to be transferred. The last argument is an enumeration that indicates the direction of the transfer in terms of host and device, and can take the values cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, and cudaMemcpyDeviceToDevice. This is needed because pointers to host and device memory are declared the same way; the source and destination pointer declarations provide no information about which memory space they point to. As far as synchronization between host and device is concerned, cudaMemcpy is a safe call: it blocks the CPU thread and returns only after the copy is complete, and it initiates the copy only after all previous CUDA calls have completed. There is also an asynchronous memory transfer function, which will be discussed in the Memory Optimization module.
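A single host-to-device transfer can be sketched as follows (a hedged illustration, assuming a 1024-element float array; note the memcpy-style argument order plus the direction enumeration):

```cuda
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t nbytes = 1024 * sizeof(float);
    float *a_h = (float *)malloc(nbytes);   /* host array */
    float *a_d;                             /* device array */
    cudaMalloc((void **)&a_d, nbytes);

    for (int i = 0; i < 1024; i++)
        a_h[i] = (float)i;                  /* initialize on the host */

    /* destination, source, byte count, then the direction enumeration */
    cudaMemcpy(a_d, a_h, nbytes, cudaMemcpyHostToDevice);

    cudaFree(a_d);
    free(a_h);
    return 0;
}
```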

To demonstrate the use of CUDA data management, we step through a sample code that allocates memory for four arrays: two on the host, a_h and b_h, and two on the device, a_d and b_d. The code initializes a_h, transfers its data to a_d, then performs a device-to-device transfer to b_d, and then transfers back to the host into b_h. Remember that all of this is done in host code; there is no device code in this example. In the highlighted section of code, note that the pointers to host and device memory are declared in the same manner; the h and d suffixes are simply a convention used to aid programmers in distinguishing between the two memory spaces. The highlighted section of code on this slide calculates the number of bytes to be allocated for each array and then uses C's malloc function to allocate the two arrays on the host. Likewise, cudaMalloc is used to allocate the two arrays on the device. Here the host array a_h is initialized, and its data is transferred from a_h on the host to a_d on the device using cudaMemcpy. Note that the enumeration cudaMemcpyHostToDevice is used, and that the order of the first two arguments is destination followed by source: a_d and then a_h. cudaMemcpy can also be used to transfer data between two arrays on the device, as is done here from a_d to b_d. The final transfer is from b_d on the device to b_h on the host. Once again, notice the order of the first two arguments in relation to the last argument, cudaMemcpyDeviceToHost. The two host arrays are then compared element by element. Finally, the host arrays are deallocated using free, and the device arrays are deallocated using cudaFree.
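The walkthrough above can be sketched as a complete host-only program (a reconstruction of the sample, not the exact code from the slides; the array length and initialization values are assumptions):

```cuda
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    float *a_h, *b_h;   /* host pointers */
    float *a_d, *b_d;   /* device pointers -- declared the same way */
    int N = 1024;
    size_t nbytes = N * sizeof(float);

    /* allocate the two host arrays and the two device arrays */
    a_h = (float *)malloc(nbytes);
    b_h = (float *)malloc(nbytes);
    cudaMalloc((void **)&a_d, nbytes);
    cudaMalloc((void **)&b_d, nbytes);

    /* initialize the host array a_h */
    for (int i = 0; i < N; i++)
        a_h[i] = 100.0f + i;

    /* host -> device, device -> device, device -> host */
    cudaMemcpy(a_d, a_h, nbytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, a_d, nbytes, cudaMemcpyDeviceToDevice);
    cudaMemcpy(b_h, b_d, nbytes, cudaMemcpyDeviceToHost);

    /* compare the two host arrays element by element */
    for (int i = 0; i < N; i++)
        assert(a_h[i] == b_h[i]);

    /* host arrays freed with free, device arrays with cudaFree */
    free(a_h); free(b_h);
    cudaFree(a_d); cudaFree(b_d);
    printf("round trip complete\n");
    return 0;
}
```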

The previous example of GPU data management concludes Part I of CUDA Programming Basics. All of the CUDA API calls discussed to this point execute on the host. Part II of Programming Basics focuses on GPU code.
