CUDA C Programming Guide | Programming Interface

桑来93

Category: [CUDA Notes]

Article link: https://blog.csdn.net/qjh5606/article/details/82820194

Chapter 3: Programming Interface

// (omitted)

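The runtime is implemented in the cudart library.
It is available as a static library (cudart.lib) or a dynamic library (cudart.dll); these are the Windows names, and on Linux the equivalents are libcudart.a and libcudart.so.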

Device Memory gives an overview of the runtime functions used to manage device memory.
Shared Memory illustrates the use of shared memory, which is used to maximize performance.
Page-Locked Host Memory introduces page-locked host memory, which is required to overlap kernel execution with data transfers between host and device memory.
Asynchronous Concurrent Execution describes the concepts and API used to enable asynchronous concurrent execution at various levels in the system.
Multi-Device System shows how the programming model extends to a system with multiple devices attached to the same host.
Error Checking describes how to properly check the errors generated by the runtime.
Call Stack mentions the runtime functions used to manage the CUDA C call stack.
Texture and Surface Memory presents the texture and surface memory spaces that provide another way to access device memory; they also expose a subset of the GPU texturing hardware.
Graphics Interoperability introduces the various functions the runtime provides to interoperate with the two main graphics APIs, OpenGL and Direct3D.

During initialization, the runtime creates a CUDA context for each device in the system.
This context is the primary context for this device and it is shared among all the host threads of the application.
// CUDA context
cudaDeviceReset() destroys the primary context of the device the calling host thread currently operates on.
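
The lifetime of the primary context can be sketched as follows. This is a minimal illustration, not code from the guide: the helper name resetPrimaryContext is hypothetical, device 0 is assumed to exist, and cudaFree(0) is used only as a common idiom for forcing the runtime to create the context.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: returns true if the primary context of device 0
// could be created, destroyed, and created again.
bool resetPrimaryContext() {
    if (cudaSetDevice(0) != cudaSuccess) return false;

    // A harmless runtime call that forces the primary context to exist.
    if (cudaFree(0) != cudaSuccess) return false;

    void* d = nullptr;
    if (cudaMalloc(&d, 256) != cudaSuccess) return false;

    // Destroys the primary context; the allocation behind d goes with it,
    // so d must not be used or freed afterwards.
    if (cudaDeviceReset() != cudaSuccess) return false;

    // The next runtime call creates a fresh primary context.
    if (cudaMalloc(&d, 256) != cudaSuccess) return false;
    return cudaFree(d) == cudaSuccess;
}
```

Calling resetPrimaryContext() from main() and checking the result is enough to try it; any pointer obtained before the reset should be treated as invalid.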

Device memory can be allocated either as linear memory or as CUDA arrays.
CUDA arrays are opaque memory layouts optimized for texture fetching.
Linear device memory is allocated with cudaMalloc() and freed with cudaFree().
Data is transferred between host memory and device memory with cudaMemcpy().
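
The allocate, copy, launch, copy back, free sequence can be sketched with a vector addition. The names vecAdd and addOnDevice are hypothetical, and the sketch assumes both inputs are non-empty and of equal length.

```cuda
#include <vector>
#include <cuda_runtime.h>

// One thread per element.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical helper: adds two host vectors on the device.
// Returns an empty vector if any runtime call fails.
std::vector<float> addOnDevice(const std::vector<float>& a,
                               const std::vector<float>& b) {
    int n = static_cast<int>(a.size());
    size_t bytes = n * sizeof(float);
    std::vector<float> c(n);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;

    // Allocate linear device memory and copy the inputs to it.
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess &&
              cudaMalloc(&dB, bytes) == cudaSuccess &&
              cudaMalloc(&dC, bytes) == cudaSuccess &&
              cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        vecAdd<<<blocks, threads>>>(dA, dB, dC, n);
        // cudaMemcpy waits for the kernel before copying the result back.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // cudaFree accepts a null pointer, so cleanup is unconditional.
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    if (!ok) c.clear();
    return c;
}
```

For example, addOnDevice({1, 2, 3}, {10, 20, 30}) should yield {11, 22, 33} on a machine with a working CUDA device.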
Linear memory can also be allocated through cudaMallocPitch() and cudaMalloc3D(). These functions are recommended for allocating 2D or 3D arrays, because they make sure that the allocation is appropriately padded to meet the alignment requirements.
In that case, copies are performed with cudaMemcpy2D() and cudaMemcpy3D(). The returned pitch (or stride) must be used to access array elements.
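
How the pitch enters both the copy and the element access can be sketched as below. The names scale2D and scaleOnDevice are hypothetical, and the host matrix is assumed to be tightly packed in row-major order.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each thread handles one element; the start of a row is found by
// advancing the base pointer by y * pitch BYTES, not y * width elements.
__global__ void scale2D(float* data, size_t pitch, int width, int height, float k) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float* row = (float*)((char*)data + y * pitch);
        row[x] *= k;
    }
}

// Hypothetical helper: multiplies a width x height host matrix by k in place.
bool scaleOnDevice(std::vector<float>& m, int width, int height, float k) {
    float* d = nullptr;
    size_t pitch = 0;                         // padded row size in bytes
    size_t rowBytes = width * sizeof(float);  // unpadded row size in bytes
    if (cudaMallocPitch(&d, &pitch, rowBytes, height) != cudaSuccess) return false;

    // Host rows are rowBytes apart; device rows are pitch apart.
    bool ok = cudaMemcpy2D(d, pitch, m.data(), rowBytes, rowBytes, height,
                           cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        dim3 block(16, 16);
        dim3 grid((width + 15) / 16, (height + 15) / 16);
        scale2D<<<grid, block>>>(d, pitch, width, height, k);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy2D(m.data(), rowBytes, d, pitch, rowBytes, height,
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d);
    return ok;
}
```

Indexing the device buffer as data[y * width + x] would ignore the padding and read the wrong elements whenever pitch differs from width * sizeof(float).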

cudaHostAlloc() and cudaFreeHost() allocate and free page-locked host memory; cudaHostRegister() page-locks a range of memory allocated by malloc().
Page-locked memory has several benefits:

Copies between page-locked host memory and device memory can be performed concurrently with kernel execution for some devices.
On some devices, page-locked host memory can be mapped into the address space of the device, eliminating the need to copy it to or from device memory [zero-copy].
On systems with a front-side bus, bandwidth between host memory and device memory is higher if host memory is allocated as page-locked. [It is higher still if the memory is also allocated as write-combining.]
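
A page-locked buffer used with asynchronous copies in a stream might look like the sketch below. The names addOne and pinnedRoundTrip are hypothetical. With a single stream the three operations still run in order; overlapping copies with kernels requires issuing work in more than one stream, which this sketch leaves out.

```cuda
#include <cuda_runtime.h>

__global__ void addOne(int* v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] += 1;
}

// Hypothetical helper: sends n integers to the device through a
// page-locked buffer, increments them, and checks the values that return.
bool pinnedRoundTrip(int n) {
    int *h = nullptr, *d = nullptr;
    cudaStream_t s;
    size_t bytes = n * sizeof(int);

    // Page-locked host allocation, required for truly asynchronous copies.
    if (cudaHostAlloc((void**)&h, bytes, cudaHostAllocDefault) != cudaSuccess)
        return false;
    if (cudaMalloc(&d, bytes) != cudaSuccess) {
        cudaFreeHost(h);
        return false;
    }
    if (cudaStreamCreate(&s) != cudaSuccess) {
        cudaFree(d);
        cudaFreeHost(h);
        return false;
    }

    for (int i = 0; i < n; ++i) h[i] = i;

    // All three operations are queued on stream s and return immediately.
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, s);
    addOne<<<(n + 255) / 256, 256, 0, s>>>(d, n);
    cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, s);

    // Wait for the stream before touching h on the host.
    bool ok = cudaStreamSynchronize(s) == cudaSuccess;
    for (int i = 0; ok && i < n; ++i) ok = (h[i] == i + 1);

    cudaStreamDestroy(s);
    cudaFree(d);
    cudaFreeHost(h);  // page-locked memory is released with cudaFreeHost
    return ok;
}
```

Page-locked memory is a limited resource, so a buffer like this is usually allocated once and reused rather than created per transfer.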

// pass

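A mapped page-locked block has two addresses. The host address is the return value of cudaHostAlloc() or malloc().
The device address is retrieved with cudaHostGetDevicePointer(). [The only exception is for pointers allocated with cudaHostAlloc() when a unified address space is used for the host and the device.]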
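
The two addresses of a mapped block can be sketched as follows. The names fill and zeroCopyFill are hypothetical, device 0 is assumed, and the return value of cudaSetDeviceFlags() is deliberately ignored because the call can fail if a context is already active in the process.

```cuda
#include <cuda_runtime.h>

__global__ void fill(int* v, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = value;
}

// Hypothetical helper: a kernel writes directly into mapped page-locked
// host memory, so no cudaMemcpy is needed to see the result on the host.
bool zeroCopyFill(int n, int value) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess || !prop.canMapHostMemory)
        return false;

    // Request host-memory mapping. Ignored on failure (assumption: the
    // context may already exist); the error state is cleared afterwards.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    (void)cudaGetLastError();

    int *h = nullptr, *d = nullptr;
    size_t bytes = n * sizeof(int);

    // Host address: returned by cudaHostAlloc().
    if (cudaHostAlloc((void**)&h, bytes, cudaHostAllocMapped) != cudaSuccess)
        return false;

    // Device address: retrieved for the same block.
    bool ok = cudaHostGetDevicePointer((void**)&d, h, 0) == cudaSuccess;
    if (ok) {
        fill<<<(n + 255) / 256, 256>>>(d, n, value);
        // The host must synchronize before reading what the kernel wrote.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaDeviceSynchronize() == cudaSuccess;
    }
    for (int i = 0; ok && i < n; ++i) ok = (h[i] == value);

    cudaFreeHost(h);
    return ok;
}
```

Since every access from the kernel crosses the host-device interconnect, this pattern tends to suit data that is read or written once rather than data that is reused many times.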

Benefits of accessing host memory directly from a kernel:
