Chapter 3 Programming Interface
3.1 Compilation with NVCC
// omitted
3.2 CUDA C Runtime
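The runtime is implemented in the cudart library.
It is available as a static library (cudart.lib) or a dynamic library (cudart.dll); these are the Windows file names.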
- Device Memory gives an overview of the runtime functions used to manage device memory.
- Shared Memory illustrates the use of shared memory, which is used to maximize performance.
- Page-Locked Host Memory introduces page-locked host memory that is required to overlap kernel execution with data transfers between host and device memory.
- Asynchronous Concurrent Execution describes the concepts and API used to enable asynchronous concurrent execution at various levels in the system.
- Multi-Device System shows how the programming model extends to a system with multiple devices attached to the same host.
- Error Checking describes how to properly check the errors generated by the runtime.
- Call Stack mentions the runtime functions used to manage the CUDA C call stack.
- Texture and Surface Memory presents the texture and surface memory spaces that provide another way to access device memory; they also expose a subset of the GPU texturing hardware.
- Graphics Interoperability introduces the various functions the runtime provides to interoperate with the two main graphics APIs, OpenGL and Direct3D.
3.2.1 Initialization
- During initialization, the runtime creates a CUDA context for each device in the system.
- This context is the primary context for this device and it is shared among all the host threads of the application.
// CUDA context: cudaDeviceReset() destroys the primary context of the device the host thread currently operates on.
3.2.2 Device Memory
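A minimal sketch of the context lifecycle described above. The helper name touchAndResetAllDevices is hypothetical, and calling cudaFree(nullptr) to force runtime initialization is a common idiom assumed here, not something these notes prescribe.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: for every device, make sure the runtime is initialized
// on it, then destroy its primary context with cudaDeviceReset().
// Returns the number of devices handled, or -1 on any runtime error.
int touchAndResetAllDevices() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;
    for (int dev = 0; dev < count; ++dev) {
        if (cudaSetDevice(dev) != cudaSuccess) return -1;
        // Freeing a null pointer does nothing, but the call goes through the
        // runtime and therefore triggers initialization (assumed idiom).
        if (cudaFree(nullptr) != cudaSuccess) return -1;
        // Destroys the primary context of the current device.
        if (cudaDeviceReset() != cudaSuccess) return -1;
    }
    return count;
}

int main() {
    int handled = touchAndResetAllDevices();
    printf("devices initialized and reset: %d\n", handled);
    return handled >= 0 ? 0 : 1;
}
```

After cudaDeviceReset(), the next runtime call on that device creates a fresh primary context.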
- Device memory can be allocated either as linear memory or as CUDA arrays.
- CUDA arrays are opaque memory layouts optimized for texture fetching.
- Linear device memory is allocated with cudaMalloc() and freed with cudaFree().
- Data is transferred between host memory and device memory with cudaMemcpy().
- Linear memory can also be allocated through cudaMallocPitch() and cudaMalloc3D(). These functions are recommended for allocating 2D or 3D arrays, because they make sure that the allocation is appropriately padded to meet the alignment requirements.
- In that case, copies should use cudaMemcpy2D() and cudaMemcpy3D(). The returned pitch (or stride) must be used to access array elements.
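The pitched allocation path above can be sketched as follows. The kernel fillPitched and the helper pitchedRoundTrip are hypothetical names chosen for illustration; the point is that each row is located through the pitch in bytes, not through width * sizeof(float).

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread writes r * width + c into element (r, c).
// The row start is computed from the pitch, which is expressed in bytes.
__global__ void fillPitched(float* devPtr, size_t pitch, int width, int height) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (c < width && r < height) {
        float* row = (float*)((char*)devPtr + r * pitch);
        row[c] = (float)(r * width + c);
    }
}

// Hypothetical helper: allocate a padded 2D array, fill it on the device,
// copy it back into a tightly packed host buffer, and verify it.
// Returns 0 on success, non-zero otherwise.
int pitchedRoundTrip(int width, int height) {
    float* devPtr = nullptr;
    size_t pitch = 0;
    if (cudaMallocPitch((void**)&devPtr, &pitch,
                        width * sizeof(float), height) != cudaSuccess) return 1;

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    fillPitched<<<grid, block>>>(devPtr, pitch, width, height);

    // The host buffer has no padding, so its pitch is width * sizeof(float).
    std::vector<float> host(width * height);
    cudaError_t err = cudaMemcpy2D(host.data(), width * sizeof(float),
                                   devPtr, pitch,
                                   width * sizeof(float), height,
                                   cudaMemcpyDeviceToHost);
    cudaFree(devPtr);
    if (err != cudaSuccess) return 2;

    for (int i = 0; i < width * height; ++i)
        if (host[i] != (float)i) return 3;
    return 0;
}

int main() {
    int status = pitchedRoundTrip(64, 48);
    printf("pitched round trip status: %d\n", status);
    return status;
}
```

Indexing with devPtr[r * width + c] would be wrong whenever the runtime pads the rows, which is why the pitch has to travel with the pointer.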
3.2.3 Shared Memory
- Shared memory is allocated using the __shared__ memory space specifier.
- Shared memory is expected to be much faster than global memory.
// The guide gives an example computation that uses shared memory (matrix multiplication).
3.2.4 Page-Locked Host Memory
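As a smaller illustration than the guide's matrix multiplication, here is a sketch of a per-block sum that stages data in shared memory. It is a different, simpler technique (a tree reduction), and the names blockSum, sumOnDevice and BLOCK are assumptions made for this example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 256  // threads per block; must be a power of two for this reduction

// Each block loads BLOCK elements into shared memory, reduces them there,
// and writes one partial sum to global memory.
__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float tile[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // all loads must finish before any thread reads the tile

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
}

// Hypothetical helper: sums the host vector on the device and returns the
// total, or -1.0f on a runtime error.
float sumOnDevice(const std::vector<float>& data) {
    int n = (int)data.size();
    int blocks = (n + BLOCK - 1) / BLOCK;
    float *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc((void**)&dIn, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc((void**)&dOut, blocks * sizeof(float)) != cudaSuccess) {
        cudaFree(dIn);
        return -1.0f;
    }
    cudaMemcpy(dIn, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, BLOCK>>>(dIn, dOut, n);

    std::vector<float> partial(blocks);
    cudaError_t err = cudaMemcpy(partial.data(), dOut, blocks * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return -1.0f;

    float total = 0.0f;  // the few per-block partial sums are added on the host
    for (float p : partial) total += p;
    return total;
}

int main() {
    std::vector<float> ones(1000, 1.0f);
    printf("sum = %.1f\n", sumOnDevice(ones));
    return 0;
}
```

Each input element is read from global memory once; all intermediate additions then happen in shared memory.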
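- cudaHostAlloc() and cudaFreeHost() allocate and free page-locked host memory; cudaHostRegister() page-locks a range of memory allocated by malloc().
- Using page-locked host memory has several benefits:
- Copies between page-locked host memory and device memory can be performed concurrently with kernel execution for some devices.
- On some devices, page-locked host memory can be mapped into the address space of the device, eliminating the need to copy it to or from device memory (zero-copy).
- On systems with a front-side bus, bandwidth between host memory and device memory is higher if host memory is allocated as page-locked, and even higher if it is also allocated as write-combining.
- However, page-locked host memory is a scarce resource: using too much of it degrades overall system performance.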
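The overlap of copies and kernel execution mentioned above can be sketched with two streams working on separate halves of one pinned buffer. The names scale and pinnedAsyncDemo are hypothetical, the choice of two streams is arbitrary, and whether the copies really overlap with the kernels depends on the device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper. Assumes n is divisible by kStreams.
// Returns 0 on success, non-zero otherwise.
int pinnedAsyncDemo(int n) {
    const int kStreams = 2;
    const int chunk = n / kStreams;
    const size_t chunkBytes = chunk * sizeof(float);

    float* host = nullptr;
    float* dev = nullptr;
    // Page-locked allocation: needed for cudaMemcpyAsync to be truly asynchronous.
    if (cudaHostAlloc((void**)&host, n * sizeof(float),
                      cudaHostAllocDefault) != cudaSuccess) return 1;
    if (cudaMalloc((void**)&dev, n * sizeof(float)) != cudaSuccess) {
        cudaFreeHost(host);
        return 2;
    }
    for (int i = 0; i < n; ++i) host[i] = (float)i;

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    // Each stream copies its chunk in, scales it, and copies it back.
    // Work queued in different streams may overlap on capable devices.
    for (int s = 0; s < kStreams; ++s) {
        int offset = s * chunk;
        cudaMemcpyAsync(dev + offset, host + offset, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(dev + offset, chunk, 2.0f);
        cudaMemcpyAsync(host + offset, dev + offset, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaError_t err = cudaDeviceSynchronize();  // wait for both streams

    int status = (err == cudaSuccess) ? 0 : 3;
    for (int i = 0; status == 0 && i < n; ++i)
        if (host[i] != 2.0f * (float)i) status = 4;

    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(dev);
    cudaFreeHost(host);
    return status;
}

int main() {
    int status = pinnedAsyncDemo(1 << 20);
    printf("pinned async demo status: %d\n", status);
    return status;
}
```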
Portable Memory
// pass
Write-Combining Memory
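- Allocated by passing the cudaHostAllocWriteCombined flag to cudaHostAlloc().
- Write-combining memory frees up the host's L1 and L2 cache resources, leaving more cache available to the rest of the application.
- Reading it from the host is very slow, so it should generally be used only for memory that the host writes to.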
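A minimal sketch of the write-only usage pattern, assuming a hypothetical helper named writeCombinedUpload. The host only ever writes to the write-combining buffer; verification reads from an ordinary buffer instead.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: stage data in write-combining pinned memory, upload it,
// then download into a regular host buffer for checking.
// Returns 0 on success, non-zero otherwise.
int writeCombinedUpload(int n) {
    const size_t bytes = n * sizeof(float);
    float* wc = nullptr;
    float* dev = nullptr;
    if (cudaHostAlloc((void**)&wc, bytes,
                      cudaHostAllocWriteCombined) != cudaSuccess) return 1;
    if (cudaMalloc((void**)&dev, bytes) != cudaSuccess) {
        cudaFreeHost(wc);
        return 2;
    }

    // Sequential writes only: the host never reads wc[] back.
    for (int i = 0; i < n; ++i) wc[i] = (float)i;

    cudaError_t err = cudaMemcpy(dev, wc, bytes, cudaMemcpyHostToDevice);

    // Read the result through an ordinary (cached) host buffer.
    std::vector<float> check(n);
    if (err == cudaSuccess)
        err = cudaMemcpy(check.data(), dev, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev);
    cudaFreeHost(wc);
    if (err != cudaSuccess) return 3;

    for (int i = 0; i < n; ++i)
        if (check[i] != (float)i) return 4;
    return 0;
}

int main() {
    int status = writeCombinedUpload(1 << 16);
    printf("write-combined upload status: %d\n", status);
    return status;
}
```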
Mapped Memory
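- A block of page-locked host memory is mapped into the device address space by passing flag cudaHostAllocMapped to cudaHostAlloc()
- or by passing flag cudaHostRegisterMapped to cudaHostRegister().
- Such a block has two addresses (one in host memory, one in device memory):
- the host pointer is the value returned by cudaHostAlloc() or malloc();
- the device pointer is retrieved with cudaHostGetDevicePointer(). [The only exception is for pointers allocated with cudaHostAlloc() when a unified address space is used for the host and the device.]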
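Advantages of accessing host memory directly from within a kernel:
- There is no need to allocate a block in device memory and copy data to or from it; data transfers are performed implicitly as the kernel needs them.
- There is no need to use streams to overlap data transfers with kernel execution.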
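The two-pointer scheme for mapped memory can be sketched as below. The names addOne and mappedDemo are hypothetical. The sketch assumes device 0 and calls cudaSetDeviceFlags(cudaDeviceMapHost) first, which older toolkits require before any context exists; its result is deliberately not treated as fatal here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void addOne(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical helper: the kernel updates host memory in place through a
// device pointer, with no cudaMalloc() and no cudaMemcpy().
// Returns 0 on success, non-zero otherwise.
int mappedDemo(int n) {
    // Request host-memory mapping; may report an error if a context is already
    // active, so the result is ignored and the error state cleared (assumption).
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaGetLastError();

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;
    if (!prop.canMapHostMemory) return 2;  // device cannot map host memory

    int* hostPtr = nullptr;
    if (cudaHostAlloc((void**)&hostPtr, n * sizeof(int),
                      cudaHostAllocMapped) != cudaSuccess) return 3;
    for (int i = 0; i < n; ++i) hostPtr[i] = i;

    // Second address of the same block, valid in device code.
    int* devPtr = nullptr;
    if (cudaHostGetDevicePointer((void**)&devPtr, hostPtr, 0) != cudaSuccess) {
        cudaFreeHost(hostPtr);
        return 4;
    }

    addOne<<<(n + 255) / 256, 256>>>(devPtr, n);
    // The host must synchronize before reading what the kernel wrote.
    cudaError_t err = cudaDeviceSynchronize();

    int status = (err == cudaSuccess) ? 0 : 5;
    for (int i = 0; status == 0 && i < n; ++i)
        if (hostPtr[i] != i + 1) status = 6;

    cudaFreeHost(hostPtr);
    return status;
}

int main() {
    int status = mappedDemo(1 << 16);
    printf("mapped memory demo status: %d\n", status);
    return status;
}
```

Because host and device share the block, any access ordering between them has to be enforced explicitly, here with cudaDeviceSynchronize().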