CUDA Programming Basics Part II

http://http.download.nvidia.com/developer/cuda/podcasts/CUDA_Programming_Basics_-_Part_II.m4v


CUDA Programming Basics Part II: Kernels.


In part I of CUDA programming basics, we discussed device memory allocation and how to move data between host and device. All of this memory management is performed from the host code. In part II, we focus on the code that runs on the GPU. We discuss kernel launches and some specifics of writing GPU code.


Kernels, or functions launched from the host and run on the device, are basically C functions with some added restrictions. Kernels cannot access host memory and must have a void return type. Kernels also cannot be called with a variable number of arguments, cannot be recursive, and cannot contain static variables. Kernel function arguments are automatically passed to the device and do not need to be transferred using cudaMemcpy. If kernel arguments are pointers, then the pointer values are automatically passed, but the memory they point to is not copied. For example, if an array in host memory is to be used by a kernel, the array is first copied to device memory through cudaMemcpy, and then the pointer to the array in device memory can be passed automatically as a kernel argument, as in the sketch below.
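A minimal sketch of this pattern (the kernel and variable names here are hypothetical, not from the slides): the host copies an array to device memory with cudaMemcpy, then passes the device pointer and a scalar directly as kernel arguments.

```c
#include <cuda_runtime.h>

// Hypothetical kernel: void return type, operates only on device memory.
__global__ void scale_first(int *d_a, int factor)
{
    d_a[0] *= factor;
}

int main(void)
{
    int h_a[16] = {0};
    int *d_a;
    size_t bytes = 16 * sizeof(int);

    cudaMalloc((void **)&d_a, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice); // array data copied explicitly
    scale_first<<<1, 1>>>(d_a, 2);  // pointer value and scalar passed automatically
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    return 0;
}
```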


Because CUDA is a heterogeneous programming environment where both host and device code can reside in the same file, function qualifiers are used to indicate where a function is called from and where it runs. Kernels are designated by the global qualifier. Other function qualifiers are device and host. Functions with the device qualifier are both called from and executed on the device; they cannot be called from the host. These are typically kernel helper functions. The host qualifier denotes functions that are called from and run on the host, which is the default behavior. The host qualifier can be combined with the device qualifier to generate both CPU and GPU code for a function.
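A sketch of these qualifiers (the function names are illustrative):

```c
__device__ float square(float x)          // callable only from device code
{
    return x * x;
}

__host__ __device__ float cube(float x)   // compiled for both the CPU and the GPU
{
    return x * x * x;
}

__global__ void fill(float *d_out)        // a kernel: launched from the host
{
    d_out[threadIdx.x] = square(2.0f) + cube(3.0f);
}
```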


Kernels, or functions declared with the global qualifier, are called from the host using a modified C function call syntax. The syntax modification is an execution configuration, delimited by triple angle brackets and placed between the function name and the function arguments. The basic execution configuration defines the dimensions of the grid and block used to execute the function. Both of these execution configuration arguments are of CUDA's built-in vector type dim3. The grid of blocks is at most two-dimensional, and the size in each dimension is specified in the x and y fields of the first execution configuration argument. The arrangement of threads within a block can be one-, two-, or three-dimensional, and the size in each dimension is specified in the x, y, and z fields of the second execution configuration argument. Multi-dimensional grids and blocks offer a convenience to the programmer when dealing with multi-dimensional data. All unspecified dim3 fields are initialized to 1.

Several examples of specifying execution configuration parameters and launching kernels are given on this slide. In the first code segment, the variables grid and block are defined and assigned values used to launch a two-dimensional grid of 2 by 4 blocks, where each block contains an 8 by 16 arrangement of threads. The second code segment performs the same assignment to the grid and block variables in a more succinct manner: in its first line, constructor functions for the built-in vector type are used, assigning the arguments to the respective fields. The kernel launch is the same as in the previous case. Finally, for one-dimensional grids and thread blocks, integers can be used as execution configuration arguments.
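A sketch of these three forms, assuming a kernel named my_kernel that takes a device pointer d_a:

```c
// Two-dimensional grid and block via field assignment.
dim3 grid, block;
grid.x = 2;  grid.y = 4;       // 2 x 4 grid of blocks
block.x = 8; block.y = 16;     // 8 x 16 threads per block
my_kernel<<<grid, block>>>(d_a);

// The same configuration using the dim3 constructors.
dim3 grid2(2, 4), block2(8, 16);
my_kernel<<<grid2, block2>>>(d_a);

// One-dimensional grid and blocks: plain integers are accepted.
my_kernel<<<32, 512>>>(d_a);
```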

The previous two slides discussed how kernels are launched from host code, in particular how the dimensions and sizes of the grid and block are specified in the execution configuration. In the next few slides, we discuss how device code gains access to this information and how it is typically used. All code that runs on the device, namely functions declared with the global and device qualifiers, has access to four automatically defined variables of type dim3. Two of the variables, gridDim and blockDim, contain the grid and block dimensions supplied by the host in the two arguments of the execution configuration. Recall that all threads run the same kernel code, and in order for each thread to perform independent work, a unique thread ID is used to access different data. This unique thread ID is created from the other two automatically defined variables, blockIdx and threadIdx. blockIdx contains the block index within the grid, and threadIdx contains the thread index within the block.

Unique thread IDs are generated from the built-in variables by mapping the local thread index to a global index, which in turn is typically used to access different array elements. The mapping makes use of three of the four automatically defined variables: blockDim, blockIdx, and threadIdx. The one-dimensional example demonstrates how this is done for a grid of three blocks with 5 threads per block. The product of the block index and the block dimension provides an offset to the local thread index, and the sum of this offset and the local thread index results in the unique thread ID. This is only one way in which a unique thread ID can be created; one can use other mappings as well.
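In code, this one-dimensional mapping is a single line; for the slide's example of 3 blocks of 5 threads, the thread with blockIdx.x = 2 and threadIdx.x = 3 gets ID 2 * 5 + 3 = 13.

```c
int id = blockIdx.x * blockDim.x + threadIdx.x;  // block offset plus local index
```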

This slide shows two small kernels, both of which are passed a pointer to an integer array as well as an integer value to be assigned to each element of the array. The first kernel is not a particularly useful kernel, in that all the threads executing the kernel assign the value to only the first element of the array. The second kernel represents the more typical situation, where a unique thread ID is calculated in the first expression using the automatically defined variables blockDim, blockIdx, and threadIdx. Here each thread uses its unique ID to assign the value to a different element of the array. There is an inherent assumption here that the execution configuration has been chosen to launch at most as many threads as there are elements in the array.
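A sketch of the two kernels as described (the names are illustrative, not from the slides):

```c
// Every thread writes the same value to element 0 -- not very useful.
__global__ void assign_first(int *d_a, int value)
{
    d_a[0] = value;
}

// Each thread computes a unique ID and writes a different element,
// assuming no more threads are launched than there are array elements.
__global__ void assign_all(int *d_a, int value)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    d_a[idx] = value;
}
```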

Having given some examples of kernels, we now look at a CUDA example containing both host and device code, and compare it to a traditional CPU implementation. Both the CUDA and CPU codes increment each element of an integer array. In the CPU main code, a function inc_cpu is called with two arguments: a pointer to the array, and N, an integer containing the number of array elements. The CPU function contains a loop that increments each element of the array. In the CUDA host code, the variable blocksize, representing the number of threads in a block, is used to initialize dimBlock and, in the following line, to determine dimGrid, the number of blocks in the grid. The ceiling function is used for the case where N is not evenly divisible by blocksize. In this case, more threads than array elements will be launched by the execution configuration, since all blocks must contain the same number of threads. The kernel code will be responsible for avoiding out-of-bounds addressing by the extra threads. The inc_gpu kernel is launched using dimGrid and dimBlock in the execution configuration. The array pointer and the array size N are passed just as is done in the CPU function call. In the kernel, the first operation performed is the calculation of the unique thread ID for all threads executing the kernel. The kernel then checks whether this thread ID, used as the array index, would perform an in-bounds memory reference. If so, the corresponding array element is incremented; if not, no operation is performed by that thread. When comparing the CUDA and CPU versions, one should note that the actual code that increments the array is the same. The difference is that the loop in the CPU code does not exist in the CUDA kernel. In essence, the loop body of the CPU code is replaced by many threads executing the CUDA kernel in parallel; the threads are launched by the execution configuration in the host CUDA code.
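A reconstruction of this example as described above (the exact slide code may differ slightly):

```c
// CPU version: a loop increments each element.
void inc_cpu(int *a, int N)
{
    for (int idx = 0; idx < N; idx++)
        a[idx] = a[idx] + 1;
}

// CUDA version: the loop body becomes the kernel; extra threads do nothing.
__global__ void inc_gpu(int *d_a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)                    // guard against out-of-bounds accesses
        d_a[idx] = d_a[idx] + 1;
}

// Host launch: the ceiling division covers the case where N is not
// evenly divisible by blocksize.
//   dim3 dimBlock(blocksize);
//   dim3 dimGrid((N + blocksize - 1) / blocksize);
//   inc_gpu<<<dimGrid, dimBlock>>>(d_a, N);
```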

Data movement and kernel launches in host code have different synchronization characteristics. All kernel launches are asynchronous, in that control returns to the host code immediately. On the device, the kernel will begin execution only after all previous CUDA calls have completed. Data transfers via cudaMemcpy are synchronous: transfers using cudaMemcpy are initiated only after all existing CUDA calls have finished, and the call returns only after the data transfer has completed. One can manually block host execution until all previous CUDA calls complete using cudaThreadSynchronize. This is useful when using asynchronous memory transfers, which will be discussed in the optimization module, or when timing kernel executions. Aside from these two situations, cudaThreadSynchronize is typically not needed, due to the synchronous nature of data transfers via cudaMemcpy.
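For example, when timing a kernel (host_timer here is a hypothetical CPU timing routine, and the variables are carried over from the increment sketch above):

```c
double t0 = host_timer();                 // hypothetical host-side timer
inc_gpu<<<dimGrid, dimBlock>>>(d_a, N);   // returns to the host immediately
cudaThreadSynchronize();                  // block until the kernel has finished
double elapsed = host_timer() - t0;       // now the measurement includes the kernel
```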

The asynchronous nature of kernel launches allows simultaneous execution of host and device code, as demonstrated in the example on this slide. In the host code segment, data is sent to the device using cudaMemcpy, which returns control to the CPU after the transfer has completed. The kernel is then launched, which returns control to the CPU immediately. In this case, the function run_cpu_stuff is executed on the CPU, overlapping its execution with that of the GPU kernel on the device. After run_cpu_stuff completes, cudaMemcpy is called; if the kernel has completed, cudaMemcpy will begin transferring data to the host. Otherwise, cudaMemcpy will block until the kernel has completed.
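A sketch of this overlap pattern (the buffer names and run_cpu_stuff are placeholders):

```c
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice); // returns after the copy completes
my_kernel<<<grid, block>>>(d_a);                     // returns to the CPU immediately
run_cpu_stuff();                                     // CPU work overlaps the GPU kernel
cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost); // waits for the kernel, then copies
```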

Just as CUDA has function qualifiers for indicating where functions are invoked and where they are executed, CUDA has variable qualifiers used to indicate in which memory space a variable resides. The device variable qualifier designates that a variable resides in global memory, which is large, has high latency, and is not cached. Memory allocated with cudaMalloc is an example of device memory. It is accessible by all threads and has the lifetime of the application. The shared variable qualifier indicates that the variable is to be stored in on-chip shared memory. The amount of shared memory can be specified either at compile time or at kernel launch, and it is accessible by all threads in the same thread block. The lifetime of shared memory is that of the thread block. Unqualified variables in device code are placed into registers, space permitting, or are otherwise stored in local memory. Local here refers to scope: the data is local to each thread. Physically, local memory is off-chip, in DRAM, and has a high latency like global memory. Both registers and local memory have the lifetime of the thread.
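An illustrative sketch of these variable qualifiers (names are hypothetical, and it assumes blocks of at most 64 threads):

```c
__device__ int d_counter;                 // global memory, lifetime of the application

__global__ void stage(float *d_out)
{
    __shared__ float tile[64];            // on-chip, shared by the thread block
    int local = threadIdx.x;              // unqualified: a register (or local memory)

    tile[local] = (float)local;
    d_out[blockIdx.x * blockDim.x + local] = tile[local];
}
```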

As mentioned on the previous slide, the amount of shared memory can be specified either at compile time or at kernel launch. On this slide, we demonstrate how this is accomplished. If the size of the shared memory array is known at compile time, then the array is declared with that size in the kernel, and the kernel is launched with an execution configuration that specifies only the grid size and block size. If the size of the shared memory array is not known at compile time, but is determined by the host code leading up to the kernel launch, the amount of shared memory can be specified dynamically by a third argument to the execution configuration. In the example on the right, the shared memory size specified is equal to the space required for blocksize floats. In the kernel, the extern keyword is used along with empty square brackets. Both of these methods can be used simultaneously, so that a kernel can use statically as well as dynamically specified shared memory.
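A sketch of the two approaches (BLOCKSIZE and the kernel names are illustrative):

```c
#define BLOCKSIZE 128

// Size known at compile time: declare the shared array with that size.
__global__ void kernel_static(float *d_a)
{
    __shared__ float s_data[BLOCKSIZE];
    s_data[threadIdx.x] = d_a[threadIdx.x];
}

// Size chosen at launch: extern declaration with empty brackets; the byte
// count is passed as the third execution configuration argument.
__global__ void kernel_dynamic(float *d_a)
{
    extern __shared__ float s_data[];
    s_data[threadIdx.x] = d_a[threadIdx.x];
}

// Host side:
//   kernel_static<<<grid, BLOCKSIZE>>>(d_a);
//   kernel_dynamic<<<grid, blocksize, blocksize * sizeof(float)>>>(d_a);
```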

We previously discussed host synchronization with respect to data transfers and kernel launches, and how cudaThreadSynchronize can be called to block the CPU code until all CUDA calls on the device have completed. The GPU also has a synchronization function, double-underscore syncthreads, which is used to synchronize threads within a block. It is a barrier synchronization, meaning that no thread can pass until all threads in the block reach it. It is needed because the order in which the threads of a block execute is unspecified, and it is used to avoid read-after-write, write-after-read, and write-after-write hazards when different threads access the same address in shared memory. syncthreads is allowed in conditional code, but only if the condition evaluates identically across the entire thread block. For example, in the code block on this slide, the syncthreads call is allowed since the condition is based on the block index, and is therefore evaluated identically across all threads within any block.
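A sketch of a syncthreads call inside such a conditional (the kernel and the block-0 reversal are illustrative, assuming blockDim.x is at most 128):

```c
__global__ void reverse_block0(float *d_a)
{
    __shared__ float s_data[128];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (blockIdx.x == 0)                 // same result for every thread in the block
    {
        s_data[threadIdx.x] = d_a[idx];
        __syncthreads();                 // safe: all threads of block 0 reach it
        d_a[idx] = s_data[blockDim.x - 1 - threadIdx.x];
    }
}
```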

GPU atomic operations allow different threads to safely modify the same address, for associative operations or operations where the order of execution does not affect the result. These operations include add, subtract, increment, decrement, min and max, logical and, or, and exclusive or, as well as exchange and compare-and-swap. Atomic operations are available on certain NVIDIA GPUs depending on the GPU's compute capability. The compute capability is used to keep track of certain features available on a GPU, such as which atomic operations are supported. Atomic operations on 32-bit words in global memory require a device of compute capability 1.1 or higher, such as the G84, G86, or G92. Atomic operations on 32-bit words in shared memory and 64-bit words in global memory require devices with compute capability 1.2 or higher, such as the 10-series architecture.
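A minimal sketch of an atomic operation (an illustrative kernel, assuming a device of compute capability 1.1 or higher):

```c
// Threads safely accumulate into a single counter in global memory.
__global__ void count_positive(const int *d_in, int *d_count, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N && d_in[idx] > 0)
        atomicAdd(d_count, 1);   // correct even when many threads hit the same address
}
```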

We have already discussed one specific built-in vector type, dim3, which is used in host code to specify the execution configuration, and by the automatically defined variables in device code used to calculate the unique thread ID. The programmer has access to other built-in vector types from both host and device code. These types are derived from the base integer and floating-point types and contain from one to four components, which are accessed through the fields x, y, z, and w.
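For example (an illustrative fragment):

```c
int2   pair = make_int2(1, 2);                      // fields .x and .y
float4 quad = make_float4(1.0f, 2.0f, 3.0f, 4.0f);  // fields .x, .y, .z, .w
float  sum  = quad.x + quad.y + quad.z + quad.w;
```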

None of the sample codes given so far have utilized CUDA's error reporting capability; this was done largely for simplicity. Here we discuss how error reporting is used in CUDA. All CUDA calls return an error code of type cudaError_t, with the exception of kernel launches, which must have a return type of void. It is possible to get the error code for kernel launches using the function cudaGetLastError, which returns the code of the last error. For detecting kernel errors, one should call this after cudaThreadSynchronize to ensure the kernel has completed. Once the error code has been obtained, it can be supplied as an argument to the cudaGetErrorString function, which returns a null-terminated character string describing the error.
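A sketch of this error-checking pattern (my_kernel, d_a, h_a, bytes, grid, and block are placeholders from the earlier sketches):

```c
cudaError_t err = cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
if (err != cudaSuccess)
    printf("cudaMemcpy failed: %s\n", cudaGetErrorString(err));

my_kernel<<<grid, block>>>(d_a);   // a kernel launch itself returns void
cudaThreadSynchronize();           // ensure the kernel has completed
err = cudaGetLastError();          // then retrieve any launch or execution error
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));
```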

This finishes the CUDA Programming Basics module. Only the basic features of the CUDA API have been covered in this module. For more API functions, see the Programming Guide, and check out the advanced CUDA programming modules.
