CUDA C Programming Guide | Programming Interface

Chapter 3 Programming Interface

3.1 Compilation with NVCC

// omitted

3.2 CUDA C Runtime

The runtime is implemented in the cudart library.
It is available as a static library (cudart.lib) and as a dynamic library (cudart.dll).

  • Device Memory gives an overview of the runtime functions used to manage device memory.
  • Shared Memory illustrates the use of shared memory, which is used to maximize performance.
  • Page-Locked Host Memory introduces page-locked host memory that is required to overlap kernel execution with data transfers between host and device memory.
  • Asynchronous Concurrent Execution describes the concepts and API used to enable asynchronous concurrent execution at various levels in the system.
  • Multi-Device System shows how the programming model extends to a system with multiple devices attached to the same host.
  • Error Checking describes how to properly check the errors generated by the runtime.
  • Call Stack mentions the runtime functions used to manage the CUDA C call stack.
  • Texture and Surface Memory presents the texture and surface memory spaces that provide another way to access device memory; they also expose a subset of the GPU texturing hardware.
  • Graphics Interoperability introduces the various functions the runtime provides to interoperate with the two main graphics APIs, OpenGL and Direct3D.

3.2.1 Initialization

  • During initialization, the runtime creates a CUDA context for each device in the system.
  • This context is the primary context for this device and it is shared among all the host threads of the application.
    // "context" here means the CUDA context
  • cudaDeviceReset() destroys the primary context of the device the host thread currently operates on.
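
The context lifecycle described above can be sketched as follows. This is only an illustration under stated assumptions: the function name initAndResetDevice is hypothetical, and calling cudaFree(0) to force context creation is a common idiom rather than something these notes prescribe (newer toolkits may already create the context inside cudaSetDevice()).

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: returns 0 on success, -1 if no CUDA device is present,
// and a positive step number if a runtime call failed.
int initAndResetDevice(int device) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) return -1;

    // There is no explicit init function; selecting a device is enough to start.
    if (cudaSetDevice(device) != cudaSuccess) return 1;

    // Assumption: cudaFree(0) is used here only as a harmless call that makes
    // sure the primary context exists before it is reset.
    if (cudaFree(0) != cudaSuccess) return 2;

    // Destroys the primary context of the current device. The next runtime
    // call from any host thread using this device creates a new one.
    if (cudaDeviceReset() != cudaSuccess) return 3;
    return 0;
}
```

Calling initAndResetDevice(0) from main() at the end of a program is one way to make sure profiling tools see all device activity flushed.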

3.2.2 Device Memory【设备内存】

  • Device memory can be allocated either as linear memory or as CUDA arrays.
  • CUDA arrays are opaque memory layouts optimized for texture fetching.
  • Linear device memory is allocated with cudaMalloc() and freed with cudaFree().
  • Data is transferred between host memory and device memory with cudaMemcpy().
  • Linear memory can also be allocated through cudaMallocPitch() and cudaMalloc3D(). These functions are recommended for allocating 2D or 3D arrays, because they make sure that the allocation is appropriately padded to meet the alignment requirements.
  • In that case, copies are done with cudaMemcpy2D() and cudaMemcpy3D(). The returned pitch (or stride) must be used to access array elements.
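
A minimal sketch of the pitched-allocation workflow, assuming a small 100 x 64 float array. The kernel addOne2D, the function runPitchedExample, and the sizes are made up for illustration. The point to notice is that the pitch is a byte count, so rows are stepped through a char pointer.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Illustrative kernel: adds 1.0f to every element of a pitched 2D array.
__global__ void addOne2D(float* base, size_t pitch, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        // pitch is in bytes, so compute the row start through a char pointer.
        float* rowPtr = (float*)((char*)base + row * pitch);
        rowPtr[col] += 1.0f;
    }
}

// Returns 0 on success, -1 if no CUDA device is present, >0 on failure.
int runPitchedExample() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) return -1;

    const int width = 100, height = 64;            // illustrative sizes
    const size_t rowBytes = width * sizeof(float); // tightly packed host row
    std::vector<float> host(width * height, 1.0f);

    float* dev = nullptr;
    size_t pitch = 0;
    if (cudaMallocPitch((void**)&dev, &pitch, rowBytes, height) != cudaSuccess)
        return 1;

    // Host rows are rowBytes apart, device rows are pitch bytes apart.
    cudaMemcpy2D(dev, pitch, host.data(), rowBytes, rowBytes, height,
                 cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    addOne2D<<<grid, block>>>(dev, pitch, width, height);

    // This blocking copy also reports errors from the kernel launch above.
    cudaError_t err = cudaMemcpy2D(host.data(), rowBytes, dev, pitch, rowBytes,
                                   height, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return 2;

    for (float v : host)
        if (v != 2.0f) return 3;  // every element should now be 1 + 1
    return 0;
}
```

For a plain 1D buffer the same pattern applies with cudaMalloc() and cudaMemcpy(), and the pitch bookkeeping disappears.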

3.2.3 Shared Memory

  • Shared memory is allocated using the __shared__ memory space specifier.
  • Shared memory is expected to be much faster than global memory.
    // An example of computing with shared memory.
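
As a stand-in for the example mentioned above, here is a small sketch of a per-block sum that stages data in shared memory. It is not the matrix-multiplication example from the guide. The names blockSum, runSharedSum, and BLOCK are hypothetical, and the kernel assumes it is launched with exactly BLOCK threads per block, where BLOCK is a power of two.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 256  // threads per block; must be a power of two for this sketch

// Each block loads its slice into shared memory once, then reduces it there
// instead of repeatedly reading and writing global memory.
__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float tile[BLOCK];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // all loads must finish before any thread reads tile

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = tile[0];
}

// Returns 0 on success, -1 if no CUDA device is present, >0 on failure.
int runSharedSum() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) return -1;

    const int n = 1000;
    const int blocks = (n + BLOCK - 1) / BLOCK;
    std::vector<float> hIn(n, 1.0f), hOut(blocks, 0.0f);

    float *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc((void**)&dIn, n * sizeof(float)) != cudaSuccess) return 1;
    if (cudaMalloc((void**)&dOut, blocks * sizeof(float)) != cudaSuccess) {
        cudaFree(dIn);
        return 1;
    }

    cudaMemcpy(dIn, hIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    blockSum<<<blocks, BLOCK>>>(dIn, dOut, n);
    cudaError_t err = cudaMemcpy(hOut.data(), dOut, blocks * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return 2;

    float total = 0.0f;
    for (float v : hOut) total += v;  // finish the reduction on the host
    return (total == 1000.0f) ? 0 : 3;
}
```

The partial sums are combined on the host here to keep the sketch short; a second kernel launch over the partial results would be the usual alternative.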

3.2.4 Page-Locked Host Memory

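  • cudaHostAlloc() and cudaFreeHost() allocate and free page-locked host memory; cudaHostRegister() page-locks a range of memory allocated by malloc().
  • Using page-locked host memory has several benefits:
  1. Copies between page-locked host memory and device memory can be performed concurrently with kernel execution for some devices.
  2. On some devices, page-locked host memory can be mapped into the address space of the device, eliminating the need to copy it to or from device memory (zero-copy).
  3. On systems with a front-side bus, bandwidth between host memory and device memory is higher if host memory is allocated as page-locked, and even higher if it is also allocated as write-combining.
  • However, consuming too much page-locked host memory reduces overall system performance, because it reduces the physical memory available to the operating system for paging.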
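
The first benefit can be sketched with two streams, each copying its half of a pinned buffer and then running a kernel on it. The names scale and runPinnedAsync, the buffer size, and the stream count are assumptions made for illustration. Whether copies and kernels actually overlap depends on the device.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: multiplies each element by a factor.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns 0 on success, -1 if no CUDA device is present, >0 on failure.
int runPinnedAsync() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) return -1;

    const int n = 1 << 20, nStreams = 2, chunk = n / nStreams;
    const size_t chunkBytes = chunk * sizeof(float);

    // Page-locked host buffer: required for truly asynchronous copies.
    float* h = nullptr;
    if (cudaHostAlloc((void**)&h, n * sizeof(float), cudaHostAllocDefault)
        != cudaSuccess) return 1;
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float* d = nullptr;
    if (cudaMalloc((void**)&d, n * sizeof(float)) != cudaSuccess) {
        cudaFreeHost(h);
        return 1;
    }

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Work in one stream may overlap with copies in the other stream.
    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d + off, h + off, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + off, chunk, 2.0f);
        cudaMemcpyAsync(h + off, d + off, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaError_t err = cudaDeviceSynchronize();  // wait for both streams

    int result = (err == cudaSuccess) ? 0 : 2;
    for (int i = 0; result == 0 && i < n; ++i)
        if (h[i] != 2.0f) result = 3;

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d);
    cudaFreeHost(h);
    return result;
}
```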
Portable Memory

// skipped

Write-Combining Memory
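  • Allocated by passing flag cudaHostAllocWriteCombined to cudaHostAlloc().
  • Write-combining memory frees up the host's L1 and L2 cache resources.
  • Reading it from the host is very slow, so it should generally be used for memory that the host only writes to.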
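
A short sketch of the write-only usage pattern, assuming a 1024-element buffer. The function name runWriteCombined is hypothetical. The host fills the write-combining buffer sequentially and never reads it back; verification goes through an ordinary pageable buffer instead.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Returns 0 on success, -1 if no CUDA device is present, >0 on failure.
int runWriteCombined() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) return -1;

    const int n = 1024;
    const size_t bytes = n * sizeof(float);

    float* hWc = nullptr;
    if (cudaHostAlloc((void**)&hWc, bytes, cudaHostAllocWriteCombined)
        != cudaSuccess) return 1;

    // The host only writes to this buffer, in order.
    for (int i = 0; i < n; ++i) hWc[i] = (float)i;

    float* d = nullptr;
    if (cudaMalloc((void**)&d, bytes) != cudaSuccess) {
        cudaFreeHost(hWc);
        return 1;
    }
    cudaMemcpy(d, hWc, bytes, cudaMemcpyHostToDevice);

    // Read results back into regular memory rather than reading hWc.
    std::vector<float> check(n, 0.0f);
    cudaError_t err = cudaMemcpy(check.data(), d, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d);
    cudaFreeHost(hWc);
    if (err != cudaSuccess) return 2;
    return (check[0] == 0.0f && check[n - 1] == (float)(n - 1)) ? 0 : 3;
}
```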
Mapped Memory
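  • A block of page-locked host memory can be mapped into the address space of the device by passing flag cudaHostAllocMapped to cudaHostAlloc()
  • or by passing flag cudaHostRegisterMapped to cudaHostRegister().
  • Such a block therefore has two addresses (one in host memory and one in device memory):
  1. The host pointer is the return value of cudaHostAlloc() or malloc().
  2. The device pointer is retrieved with cudaHostGetDevicePointer(). [The only exception is for pointers allocated with cudaHostAlloc() when a unified address space is used for the host and the device.]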


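  • Accessing host memory directly from within a kernel has several advantages:
  1. There is no need to allocate device memory and copy data to or from it; data transfers are performed implicitly as the kernel needs them.
  2. There is no need to use streams to overlap data transfers with kernel execution.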
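
A zero-copy sketch along these lines, assuming the device supports mapped memory. The names increment and runMappedMemory are hypothetical. The call to cudaSetDeviceFlags(cudaDeviceMapHost) follows the guide's requirement for older toolkits; it is not mentioned in these notes, and its result is deliberately ignored here because it fails harmlessly if a context is already active.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: operates directly on mapped host memory.
__global__ void increment(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Returns 0 on success, -1 if no suitable CUDA device is present, >0 on failure.
int runMappedMemory() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) return -1;

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;
    if (!prop.canMapHostMemory) return -1;  // device cannot map host memory

    // Assumption: needed on older toolkits before the context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaGetLastError();  // clear the error if the context already existed

    const int n = 1024;
    int* h = nullptr;  // address 1: host pointer
    if (cudaHostAlloc((void**)&h, n * sizeof(int), cudaHostAllocMapped)
        != cudaSuccess) return 2;
    for (int i = 0; i < n; ++i) h[i] = i;

    int* d = nullptr;  // address 2: device pointer to the same block
    if (cudaHostGetDevicePointer((void**)&d, h, 0) != cudaSuccess) {
        cudaFreeHost(h);
        return 3;
    }

    // No cudaMalloc and no cudaMemcpy: the kernel reads and writes host memory.
    increment<<<(n + 255) / 256, 256>>>(d, n);

    // The host must synchronize before reading what the kernel wrote.
    int result = (cudaDeviceSynchronize() == cudaSuccess) ? 0 : 4;
    for (int i = 0; result == 0 && i < n; ++i)
        if (h[i] != i + 1) result = 5;

    cudaFreeHost(h);
    return result;
}
```

Because the block is shared between host and device, the synchronization before the host-side check is what keeps the read from racing with the kernel.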
