http://ju.outofmemory.cn/entry/292387
http://www.hds.bme.hu/~fhegedus/C++/Professional%20CUDA%20C%20Programming.pdf
- A GPU contains multiple Streaming Multiprocessors (SMs), and each SM in turn contains multiple cores. An SM supports concurrent execution of up to several hundred threads.
- A thread block is scheduled onto exactly one SM and stays there until the block has finished running. One SM can run several thread blocks at the same time.
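There are two ways to partition the data: block and cyclic. Block partitioning divides the data evenly by the number of threads: with 10 threads the data is split into 10 chunks, and each thread processes one chunk. Cyclic partitioning uses more chunks than threads: for example, 10 threads split the data into 20 chunks, the first thread processes chunks 1 and 11, the second thread processes chunks 2 and 12, and so on, so each thread cycles through the data several times.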
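The two partitioning schemes differ only in how a thread maps its k-th pass to a chunk index. The sketch below is illustrative: the helper names `block_chunk` and `cyclic_chunk` and the kernel `double_cyclic` are hypothetical, and indices are 0-based, so the "1st and 11th chunk" of the example become chunks 0 and 10.

```cuda
#include <cuda_runtime.h>

// Hypothetical helpers (not from the source): which chunk does thread `tid`
// handle on its k-th pass? Indices are 0-based.

// Block partitioning: each thread owns `per_thread` consecutive chunks.
__host__ __device__ inline int block_chunk(int tid, int k, int per_thread) {
    return tid * per_thread + k;
}

// Cyclic partitioning: each thread strides over the chunks by the thread count.
__host__ __device__ inline int cyclic_chunk(int tid, int k, int num_threads) {
    return tid + k * num_threads;
}

// Example kernel using the cyclic scheme: every thread doubles the elements
// tid, tid + T, tid + 2T, ... where T is the total number of threads.
__global__ void double_cyclic(float *data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int num_threads = gridDim.x * blockDim.x;
    for (int k = 0; cyclic_chunk(tid, k, num_threads) < n; ++k) {
        data[cyclic_chunk(tid, k, num_threads)] *= 2.0f;
    }
}
```

With 10 threads and 20 chunks, `cyclic_chunk(0, 1, 10)` gives 10 (the 11th chunk), while under block partitioning with two chunks per thread, `block_chunk(0, 1, 2)` gives 1 (the 2nd chunk).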
- Even though "many-core" and "multicore" are used to label GPU and CPU architectures, a GPU core is quite different from a CPU core.
- A CPU core is relatively heavyweight. It is designed for very complex control logic and seeks to optimize the execution of sequential programs.
- A GPU core is relatively lightweight. It is optimized for data-parallel tasks with simpler control logic and focuses on the throughput of parallel programs.
A typical CUDA program structure consists of five main steps:
1. Allocate GPU memory.
2. Copy data from CPU memory to GPU memory.
3. Invoke the CUDA kernel to perform the program-specific computation.
4. Copy data back from GPU memory to CPU memory.
5. Free the GPU memory.
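The five steps can be sketched with a vector addition. This is a minimal illustration, not code from the book: the names `add_kernel` and `add_on_gpu` are hypothetical, the block size of 256 is an arbitrary choice, and error checking on the CUDA runtime calls is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread adds one pair of elements.
__global__ void add_kernel(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical host wrapper; assumes a and b have the same length.
std::vector<float> add_on_gpu(const std::vector<float> &a,
                              const std::vector<float> &b) {
    int n = static_cast<int>(a.size());
    std::vector<float> c(n);
    if (n == 0) return c;
    size_t bytes = n * sizeof(float);

    // 1. Allocate GPU memory.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // 2. Copy data from CPU memory to GPU memory.
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    // 3. Invoke the kernel (enough blocks to cover all n elements).
    int threads = 256;  // assumed block size
    int blocks = (n + threads - 1) / threads;
    add_kernel<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // 4. Copy the result back; this cudaMemcpy waits for the kernel to finish.
    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);

    // 5. Free the GPU memory.
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return c;
}
```

Calling `add_on_gpu({1, 2, 3}, {10, 20, 30})` on a machine with a CUDA-capable GPU should return the element-wise sums.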
When a kernel function is launched from the host side (the CPU), execution moves to the device (the GPU), where a large number of threads are generated and each thread executes the statements specified by the kernel function. CUDA exposes a thread hierarchy abstraction that lets you organize your threads. It is a two-level hierarchy: threads are grouped into blocks, and blocks are grouped into a grid.
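To make the two-level hierarchy concrete, the following sketch has every thread record which block it belongs to. The names `record_block` and `block_of_each_thread` are hypothetical, a one-dimensional grid is assumed, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Every thread computes its global index from the built-in variables
// blockIdx, blockDim, and threadIdx, then stores its block number there.
__global__ void record_block(int *out) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    out[gid] = blockIdx.x;
}

// Hypothetical host wrapper: launches `blocks` blocks of `threads` threads
// and returns, for each global thread index, the block that thread ran in.
std::vector<int> block_of_each_thread(int blocks, int threads) {
    int total = blocks * threads;
    std::vector<int> result(total);
    if (total <= 0) return result;

    int *d_out;
    cudaMalloc(&d_out, total * sizeof(int));

    dim3 grid(blocks);    // grid: a collection of blocks
    dim3 block(threads);  // block: a collection of threads
    record_block<<<grid, block>>>(d_out);

    cudaMemcpy(result.data(), d_out, total * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return result;
}
```

For a launch with 2 blocks of 3 threads, the first three entries of the result are 0 and the last three are 1.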
All threads spawned by a single kernel launch are collectively called a grid. All threads in a grid share the same global memory space. A grid is made up of many thread blocks. A thread block is a group of threads that can cooperate with each other using:
➤ Block-local synchronization
➤ Block-local shared memory
Threads from different blocks cannot cooperate.
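Both cooperation mechanisms show up in a per-block sum: threads of one block stage values in `__shared__` memory and use `__syncthreads()` between reduction steps. This is a sketch under stated assumptions: the names `block_sum_kernel` and `block_sums_on_gpu` are hypothetical, the block size is fixed at 256 (a power of two, which the tree reduction relies on), and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockSize = 256;  // assumed; must be a power of two

// Each block sums its own slice of `in` and writes one partial sum.
// Blocks never exchange data with each other.
__global__ void block_sum_kernel(const int *in, int *block_sums, int n) {
    __shared__ int tile[kBlockSize];  // block-local shared memory
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    tile[tid] = (i < n) ? in[i] : 0;
    __syncthreads();  // block-local synchronization: tile is fully loaded

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();  // all threads reach this barrier every iteration
    }
    if (tid == 0) block_sums[blockIdx.x] = tile[0];
}

// Hypothetical host wrapper returning one partial sum per block.
std::vector<int> block_sums_on_gpu(const std::vector<int> &in) {
    int n = static_cast<int>(in.size());
    int blocks = (n + kBlockSize - 1) / kBlockSize;
    std::vector<int> sums(blocks);
    if (n == 0) return sums;

    int *d_in, *d_sums;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_sums, blocks * sizeof(int));
    cudaMemcpy(d_in, in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    block_sum_kernel<<<blocks, kBlockSize>>>(d_in, d_sums, n);

    cudaMemcpy(sums.data(), d_sums, blocks * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_sums);
    return sums;
}
```

Because blocks cannot cooperate, the kernel stops at one partial sum per block. Combining those partial sums is left to the host or to a second kernel launch.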
About kernels
A kernel function is the code to be executed on the device side. In a kernel function, you define the computation for a single thread and the data access for that thread. When the kernel is called, many different CUDA threads perform the same computation in parallel.
The following restrictions apply to all kernels:
➤ Access to device memory only
➤ Must have void return type
➤ No support for a variable number of arguments
➤ No support for static variables
➤ No support for function pointers
➤ Exhibit asynchronous behavior
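Two of these restrictions shape how results come back to the host: because a kernel returns void, output travels through device memory, and because a launch is asynchronous, the host has to wait before it relies on the result. A minimal sketch, with hypothetical names `square_kernel` and `square_on_gpu` and no error checking:

```cuda
#include <cuda_runtime.h>

// The kernel returns void, so the result is written through a pointer
// to device memory.
__global__ void square_kernel(int x, int *out) {
    *out = x * x;
}

// Hypothetical host wrapper.
int square_on_gpu(int x) {
    int *d_out;
    cudaMalloc(&d_out, sizeof(int));

    // The launch returns control to the host right away.
    square_kernel<<<1, 1>>>(x, d_out);

    // Explicitly wait for the kernel to finish. The blocking cudaMemcpy
    // below would also wait, so this call is here to make the point visible.
    cudaDeviceSynchronize();

    int result = 0;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return result;
}
```

The single-thread launch `<<<1, 1>>>` is only for illustration. A real kernel would use many threads.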
Variable types
About SMs