Kernels
- A kernel is defined using the __global__ declaration specifier.
- The number of CUDA threads that execute that kernel for a given kernel call is specified using a new <<<…>>> execution configuration syntax.
- Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.
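A minimal sketch of these three points (the kernel name VecAdd and size N are illustrative; host-side setup is elided):

```cuda
// Kernel definition: __global__ marks a function that runs on the device.
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;   // built-in thread ID within the block
    C[i] = A[i] + B[i];
}

int main()
{
    // ... allocate and initialize A, B, C on the device ...
    // Execution configuration: launch 1 block of N threads.
    VecAdd<<<1, N>>>(A, B, C);
    // ...
}
```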
Thread Hierarchy
- threadIdx is a 3-component vector, so threads can be identified using a one-, two-, or three-dimensional thread index, forming a one-, two-, or three-dimensional block of threads, called a thread block.
- Thread blocks are required to execute independently: it must be possible to execute them in any order.
- Threads within a block:
Threads within a block can cooperate by sharing data through some shared memory and by synchronizing their execution to coordinate memory accesses. More precisely, one can specify synchronization points in the kernel by calling the __syncthreads() intrinsic function; __syncthreads() acts as a barrier at which all threads in the block must wait before any is allowed to proceed. Shared Memory gives an example of using shared memory. In addition to __syncthreads(), the Cooperative Groups API provides a rich set of thread-synchronization primitives.
For efficient cooperation, shared memory is expected to be a low-latency memory near each processor core (much like an L1 cache).
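A small sketch of this cooperation pattern: one block stages data in shared memory, synchronizes with __syncthreads(), then reads it back in reversed order. (The kernel name and block size are illustrative; launch it as reverseBlock<<<1, BLOCK_SIZE>>>(d).)

```cuda
#define BLOCK_SIZE 64

// Reverse a BLOCK_SIZE-element array in place within one thread block,
// using shared memory as a scratch buffer.
__global__ void reverseBlock(int* d)
{
    __shared__ int s[BLOCK_SIZE];   // visible to all threads of the block
    int t = threadIdx.x;

    s[t] = d[t];                    // each thread stages one element
    __syncthreads();                // barrier: all writes to s are complete

    d[t] = s[BLOCK_SIZE - 1 - t];   // now safe to read any element of s
}
```

Without the barrier, a thread could read s[BLOCK_SIZE - 1 - t] before the thread responsible for that element had written it.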
Memory Hierarchy
- Each thread has private local memory.
- Each thread block has shared memory visible to all threads of the block and with the same lifetime as the block.
- All threads have access to the same global memory.
- There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces.
- The global, constant, and texture memory spaces are persistent across kernel launches by the same application.
- The constant and texture memory spaces are optimized for different memory usages. Texture memory also offers different addressing modes, as well as data filtering for some specific data formats.
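As a sketch of the constant memory space: it is declared with __constant__, is read-only from kernels, and is written from the host via cudaMemcpyToSymbol. (The kernel and variable names here are illustrative.)

```cuda
// Coefficients of a cubic polynomial, stored in constant memory.
__constant__ float coeffs[4];

// Evaluate the polynomial at each x[i] using Horner's rule.
__global__ void poly(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = coeffs[0] + x[i] * (coeffs[1] + x[i] * (coeffs[2] + x[i] * coeffs[3]));
}

// Host side:
//   float h[4] = {1.f, 2.f, 3.f, 4.f};
//   cudaMemcpyToSymbol(coeffs, h, sizeof(h));
```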
Heterogeneous Programming
- The CUDA programming model also assumes that both the host and the device maintain their own separate memory spaces in DRAM, referred to as host memory and device memory, respectively.
- CUDA runtime: a program manages the global, constant, and texture memory spaces visible to kernels through calls to the CUDA runtime. This includes device memory allocation and deallocation as well as data transfer between host and device memory.
- Unified Memory provides managed memory to bridge the host and device memory spaces. Managed memory is accessible from all CPUs and GPUs in the system as a single, coherent memory image with a common address space. This capability enables oversubscription of device memory and greatly simplifies the task of porting applications by removing the need to explicitly mirror data on the host and device.
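A sketch contrasting the two approaches above: explicit host/device copies via the runtime versus a single managed allocation. (someKernel, blocks, and threads are placeholders.)

```cuda
#include <cuda_runtime.h>

// Explicit path: separate host and device memory, copied via the runtime.
void explicitPath(const float* h_in, float* h_out, int n)
{
    float* d;
    cudaMalloc((void**)&d, n * sizeof(float));                       // allocate device memory
    cudaMemcpy(d, h_in, n * sizeof(float), cudaMemcpyHostToDevice);  // host -> device
    // someKernel<<<blocks, threads>>>(d, n);
    cudaMemcpy(h_out, d, n * sizeof(float), cudaMemcpyDeviceToHost); // device -> host
    cudaFree(d);
}

// Unified Memory path: one pointer usable from both CPU and GPU.
void managedPath(int n)
{
    float* m;
    cudaMallocManaged((void**)&m, n * sizeof(float)); // managed memory
    // ... fill m on the host, launch kernels on m directly ...
    cudaDeviceSynchronize();   // wait for the GPU before touching m on the CPU again
    cudaFree(m);
}
```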
2.5 Compute Capability
- The compute capability of a device is represented by a version number X.Y, also sometimes called its "SM version".
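The X.Y version can be queried at runtime from the device properties:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print the compute capability (X.Y) of device 0.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}
```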
https://blog.csdn.net/touch_dream/article/details/73674780