Understanding CUDA Programming and Hardware Architecture

The CUDA Programming Model on NVIDIA GPUs

Preview version; content still to be completed.

1. Kernel

A kernel is the function executed by each individual CUDA thread. It is launched as follows (a minimal launch example is given after the parameter list):

kernel_name<<< Dg, Db, Ns, S >>>([kernel arguments]);
  • Dg is of type dim3 and specifies the dimensions and size of the grid
  • Db is of type dim3 and specifies the dimensions and size of each thread block
  • Ns is of type size_t and specifies the number of bytes of shared memory that is dynamically allocated per thread block for this call, in addition to the statically allocated shared memory. Ns is an optional argument that defaults to 0.
  • S is of type cudaStream_t and specifies the stream associated with this call. S is an optional argument that defaults to 0 (the default stream).
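
For concreteness, here is a minimal sketch of a launch that uses all four parameters; the kernel scale_kernel, the device buffer d_data, and the element count n are hypothetical names, not part of any CUDA API.

#include <cuda_runtime.h>

// Hypothetical kernel: scales n floats, staging them through dynamically
// allocated shared memory (whose per-block size is given by the Ns launch parameter).
__global__ void scale_kernel(float *data, float alpha, int n)
{
    extern __shared__ float tile[];                // Ns bytes of shared memory per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = data[i] * alpha;
        data[i] = tile[threadIdx.x];
    }
}

void launch_scale(float *d_data, int n)
{
    cudaStream_t s;
    cudaStreamCreate(&s);                          // S: the stream used for this launch
    dim3 Db(256);                                  // Db: threads per block
    dim3 Dg((n + Db.x - 1) / Db.x);                // Dg: blocks in the grid (rounded up)
    size_t Ns = Db.x * sizeof(float);              // Ns: dynamic shared memory per block
    scale_kernel<<<Dg, Db, Ns, s>>>(d_data, 2.0f, n);
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
}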

2. Warp, SM & Thread Block

From the hardware perspective, a thread block (also called a CTA) is organized as warps. A warp consists of 32 consecutive threads from the same thread block (the partitioning is done by the SM; each thread is also called a lane and has its own registers). All threads of a warp execute the same instruction.
The execution relationship between a **Streaming Multiprocessor (SM)** and thread blocks is shown in the figure below.

[Figure: execution relationship between thread blocks and SMs]
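
For reference, a thread can recover which warp of its (1-D) block it belongs to and its lane within that warp as in the short sketch below; the output buffers are hypothetical.

__global__ void warp_lane_kernel(int *warp_id, int *lane_id)
{
    unsigned tid = threadIdx.x;
    warp_id[tid] = tid / warpSize;    // which warp of the block this thread belongs to
    lane_id[tid] = tid % warpSize;    // the thread's lane (0..31) within that warp
}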

2.1 Thread Block

A block is a programming-model concept that usually appears together with a grid. When launching a CUDA kernel you must specify the blocksize and gridsize (both are derived from how the task is partitioned). A block consists of a number of threads, and its size blocksize has three dimensions (x, y, z); a grid consists of blocks and likewise has three dimensions (x, y, z). Blocks (software) are continually scheduled onto SMs (hardware) to execute in parallel.
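
For example, below is a sketch of deriving blocksize and gridsize for a 2-D task; the image-inversion kernel and its arguments are hypothetical.

#include <cuda_runtime.h>

__global__ void invert_kernel(unsigned char *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column handled by this thread
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row handled by this thread
    if (x < width && y < height)
        img[y * width + x] = 255 - img[y * width + x];
}

void launch_invert(unsigned char *d_img, int width, int height)
{
    dim3 block(16, 16);                              // blocksize: 16 x 16 = 256 threads
    dim3 grid((width  + block.x - 1) / block.x,      // gridsize: enough blocks to cover
              (height + block.y - 1) / block.y);     // the whole image
    invert_kernel<<<grid, block>>>(d_img, width, height);
}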

2.2 Warp Shuffle Functions

Warp shuffle functions exchange the value of a variable var between the threads of a warp.
Their prototypes are shown in the code block below:

T __shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize);
T __shfl_up_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
T __shfl_down_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
T __shfl_xor_sync(unsigned mask, T var, int laneMask, int width=warpSize);

All of the __shfl_sync() intrinsics take an optional width parameter which alters the behavior of the intrinsic. width must have a value which is a power of 2; results are undefined if width is not a power of 2, or is a number greater than warpSize.

__shfl_sync() returns the value of var held by the thread whose ID is given by srcLane. If width is less than warpSize then each subsection of the warp behaves as a separate entity with a starting logical lane ID of 0. If srcLane is outside the range [0:width-1], the value returned corresponds to the value of var held by the srcLane modulo width (i.e. within the same subsection).

__shfl_up_sync() calculates a source lane ID by subtracting delta from the caller’s lane ID. The value of var held by the resulting lane ID is returned: in effect, var is shifted up the warp by delta lanes. If width is less than warpSize then each subsection of the warp behaves as a separate entity with a starting logical lane ID of 0. The source lane index will not wrap around the value of width, so effectively the lower delta lanes will be unchanged.

__shfl_down_sync() calculates a source lane ID by adding delta to the caller’s lane ID. The value of var held by the resulting lane ID is returned: this has the effect of shifting var down the warp by delta lanes. If width is less than warpSize then each subsection of the warp behaves as a separate entity with a starting logical lane ID of 0. As for __shfl_up_sync(), the ID number of the source lane will not wrap around the value of width and so the upper delta lanes will remain unchanged.

__shfl_xor_sync() calculates a source lane ID by performing a bitwise XOR of the caller’s lane ID with laneMask: the value of var held by the resulting lane ID is returned. If width is less than warpSize then each group of width consecutive threads are able to access elements from earlier groups of threads, however if they attempt to access elements from later groups of threads their own value of var will be returned. This mode implements a butterfly addressing pattern such as is used in tree reduction and broadcast.

The new *_sync shfl intrinsics take in a mask indicating the threads participating in the call. A bit, representing the thread’s lane id, must be set for each participating thread to ensure they are properly converged before the intrinsic is executed by the hardware. All non-exited threads named in mask must execute the same intrinsic with the same mask, or the result is undefined.
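
As an illustration of these intrinsics, below is a sketch of a warp-level sum reduction built on __shfl_down_sync(). It assumes one warp per block (blockDim.x == 32); the kernel and buffer names are hypothetical.

__global__ void warp_sum_kernel(const float *in, float *out)
{
    // Each block is a single warp; every lane loads one element.
    float v = in[blockIdx.x * 32 + threadIdx.x];

    // Tree reduction: at each step a lane adds the value held delta lanes above it.
    // All 32 lanes participate, so the full mask 0xffffffff is used.
    for (int delta = 16; delta > 0; delta >>= 1)
        v += __shfl_down_sync(0xffffffff, v, delta);

    if (threadIdx.x == 0)        // lane 0 now holds the sum of the warp's 32 elements
        out[blockIdx.x] = v;
}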

2.3 How an SM Schedules Multiple Warps

Consider a warp that is about to execute an instruction. If one of the instruction's operands is not yet ready (for example, it has not yet been fetched from global memory), a context switch takes place and control passes to another warp. When warps are switched, all of the current warp's state remains in the register file, so it can resume quickly once its operands become ready (a warp is considered ready to execute when all operands of its next instruction are ready). If several warps are ready to execute, the SM uses a warp scheduling policy to decide which warp is issued the next instruction.

2.4 How an SM Schedules Multiple Thread Blocks

If an SM has enough resources, the hardware assigns it as many thread blocks as possible to execute concurrently. The way thread blocks are executed is shown in the figure below:
[Figure: thread blocks distributed across SMs (source: Wikipedia)]

Once a thread block has been launched on an SM, all of its warps stay resident until they finish executing.
Consequently, the SM will not launch a new thread block until there are enough free registers for all of that block's warps and enough free shared memory for the block.
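
One way to check how many blocks of a given kernel can actually be resident on an SM, given the kernel's register and shared-memory usage, is the runtime occupancy API; below is a minimal sketch with a hypothetical kernel.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *data) { /* ... */ }

int main()
{
    int blocksPerSM = 0;
    int blockSize = 256;           // threads per block used for the query
    size_t dynamicSmem = 0;        // dynamic shared memory per block
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, my_kernel,
                                                  blockSize, dynamicSmem);
    printf("Resident blocks per SM at blockSize=%d: %d\n", blockSize, blocksPerSM);
    return 0;
}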

3. SIMT (Single-Instruction, Multiple-Thread) in CUDA

On architectures after Pascal, SIMT within a warp requires explicit synchronization. Earlier architectures assumed warp-synchronicity: every instruction of a warp executed in lockstep, so synchronization was implicit. Starting with the architectures after Pascal (i.e., Volta), Independent Thread Scheduling (ITS) was introduced and this implicit synchronization was removed. For example, in the code block below there is a shared-memory read and write between the __syncwarp() calls. On pre-Volta architectures, where a warp executed in lockstep, the reads and writes of each step were strictly separated and the code worked; with ITS, however, reads and writes are no longer guaranteed to happen in that order: within a warp some threads may be writing while others are still reading, which produces a race condition.

unsigned tid = threadIdx.x;

// Incorrect use of __syncwarp()
shmem[tid] += shmem[tid+16]; __syncwarp();
shmem[tid] += shmem[tid+8];  __syncwarp();
shmem[tid] += shmem[tid+4];  __syncwarp();
shmem[tid] += shmem[tid+2];  __syncwarp();
shmem[tid] += shmem[tid+1];  __syncwarp();

The correct approach is therefore to insert __syncwarp() after every read/write statement to enforce ordering, as shown in the code below.

unsigned tid = threadIdx.x;
int v = 0;

v += shmem[tid+16]; __syncwarp();
shmem[tid] = v;     __syncwarp();
v += shmem[tid+8];  __syncwarp();
shmem[tid] = v;     __syncwarp();
v += shmem[tid+4];  __syncwarp();
shmem[tid] = v;     __syncwarp();
v += shmem[tid+2];  __syncwarp();
shmem[tid] = v;     __syncwarp();
v += shmem[tid+1];  __syncwarp();
shmem[tid] = v;

What does this mean? Translated from the original description of Volta:

Compared with earlier NVIDIA GPUs, Volta significantly lowers the programming effort, letting users focus on bringing a wide variety of applications to production. Volta GV100 is the first GPU to support independent thread scheduling, meaning that the threads of a program can synchronize and cooperate at a finer granularity. A major design goal of Volta was to reduce the development effort needed to run programs on the GPU, together with flexible data sharing between threads, ultimately making parallel computing more efficient.

3.1 SIMT Before the Volta Architecture

Pascal and earlier GPUs execute groups of 32 threads, known in SIMT terminology as warps. In a Pascal warp, the 32 threads share a single program counter, and an active mask indicates which threads of the warp are currently active. This means that on divergent execution paths some threads are in an "inactive" state.

A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path.

The threads of a warp that are participating in the current instruction are called the active threads, whereas threads not on the current instruction are inactive (disabled).

The figure below shows how the different branches within a warp execute sequentially. In the program, the original mask is saved first; once the divergent sections finish and the threads reconverge, the mask is restored and execution continues.

[Figure: divergent branches within a warp executed sequentially on pre-Volta GPUs]

In essence, Pascal's SIMT model maximizes parallel efficiency by minimizing the resources needed to track thread state and by aggressively reconverging threads. Tracking thread state at whole-warp granularity, however, means that when the program branches, the divergent paths within a warp actually execute sequentially until the branches end, losing the parallelism inside the warp. In other words, threads in different warps do execute in parallel, but divergent threads within the same warp execute sequentially until they reconverge, and in the meantime they cannot exchange information or share data with each other.

For example, algorithms that require fine-grained data sharing can easily deadlock when different threads access data protected by locks or mutexes, because one cannot be sure which warp a contending thread belongs to. As a result, on Pascal and earlier GPUs, developers had to avoid fine-grained synchronization, or use algorithms that do not rely on locks or that explicitly distinguish between warps.

3.2 SIMT on Volta and Later Architectures

Volta introduces independent thread scheduling, which solves this problem by providing the same level of concurrency across all threads, regardless of which warp they belong to: Volta maintains per-thread execution state, including a program counter and call stack for each thread, as shown in the figure below.
[Figure: per-thread program counter and call stack under Volta independent thread scheduling]

Volta's independent thread scheduling allows the GPU to yield execution to any thread. This makes thread execution more efficient and also makes data sharing between threads more reasonable. To maximize parallel efficiency, Volta has a schedule optimizer that decides how to group the active threads of the same warp and issue them together to the SIMT units. This not only preserves the high SIMT throughput of previous NVIDIA GPUs but also adds flexibility: threads can now diverge and reconverge at sub-warp granularity, and Volta still groups the threads that execute the same code together so that they run in parallel.

The figure below shows an example of Volta's multithreaded execution model. The statements of the if/else branches in this program can now be interleaved in time. Execution is still SIMT: at any given clock cycle, as before, the CUDA cores execute the same instruction for all active threads of a warp, which preserves the execution efficiency of previous architectures. The key point is that Volta's independent scheduling gives programmers the opportunity to develop complex, fine-grained algorithms and data structures in a more natural way. Although the scheduler supports independent thread execution, it still optimizes the non-synchronizing sections of code, keeping threads converged as much as possible to maximize SIMT efficiency.

[Figure: interleaved execution of divergent branches under Volta independent thread scheduling]

The figure above also shows an interesting detail: Z is not executed at the same time by all threads. This is because Z might produce data needed by the other divergent branch, in which case forcing reconvergence would not be safe. On earlier architectures, by contrast, A, B, X, and Y were assumed to contain no synchronizing operations, so the scheduler would have considered it safe to reconverge at Z.

In such cases, the program can call the new warp synchronization function __syncwarp() to force the threads to reconverge, as shown in the figure below. The divergent threads may then still not execute Z at the same time, but with the __syncwarp() call, all execution paths of the threads in the warp will have completed before any thread moves past Z. **Similarly, placing the __syncwarp() call before Z forces the warp to reconverge before Z is executed.** If the developer can determine in advance that this is safe, it can improve SIMT execution efficiency.
[Figure: forcing warp reconvergence with __syncwarp()]
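
To make the A/B/X/Y/Z picture concrete, here is a small hypothetical sketch: two divergent paths followed by a common statement, with __syncwarp() forcing the warp to reconverge before it.

__device__ int f(int x) { return x + 1; }   // stands in for the "A; B;" path
__device__ int g(int x) { return x * 2; }   // stands in for the "X; Y;" path

__global__ void divergent_kernel(int *data)
{
    unsigned tid = threadIdx.x;
    int v = data[tid];
    if (tid % 2 == 0)
        v = f(v);            // even lanes take one path
    else
        v = g(v);            // odd lanes take the other
    __syncwarp();            // force the warp to reconverge before the common step
    data[tid] = v;           // "Z": executed by a reconverged warp
}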

3.3 Demo: Am I understanding __syncthreads() correctly?

Source link
Q: I understand that __syncthreads() is needed to make sure data dependencies are maintained, e.g. if you want to read from something, wait until it has been written. My understanding is that in the code below the second __syncthreads() might not be needed, since the write to the array occurs after it; but if that write happens while a thread that depends on it is still at its first __syncthreads(), a race condition is created, right? Or do we need both?
In the code below, is the second synchronization unnecessary? Which race conditions do these two synchronizations prevent?

#define BLOCK_SIZE 256
__global__ void test_kernel(int *dev_A)
{
    __shared__ int array[BLOCK_SIZE];
    int tx = threadIdx.x;
    array[tx] = threadIdx.x;

    __syncthreads();
    int temp = array[tx];
    if (tx > 0)
    {
        temp = (array[tx-1] + temp) / 2;
    }
    __syncthreads();
    array[tx] = temp;
}

A: If I interpreted the code correctly, there is a race condition if you get rid of the second __syncthreads(), however, it does not impact the result because of the integer division. The race condition occurs between the threads that go into the if-statement, and read from tx - 1. From the CUDA programming guide:

“If applications have warp-synchronous codes, they will need to insert the new __syncwarp() warp-wide barrier synchronization instruction between any steps where data is exchanged between threads via global or shared memory.”

This is because of a thing called independent thread scheduling, so threads inside of a warp are not guaranteed to execute in lockstep. This is further backed up by another comment in the guide that says:

“Note that threads within a warp can diverge even within a single code path”

Here is an example of an intra-warp race condition when the second __syncthreads() is missing:

  1. Possibility 1
    a. tx = 1 enters the if-statement, calculates a new value of temp (0, because (0 + 1)/2 = 0), and writes 0 to array[1].
    b. tx = 2 then enters the if-statement, reads array[1] = 0, calculates a new value of temp (1, because (0 + 2)/2 = 1), and writes 1 to array[2].
  2. Possibility 2
    a. tx = 2 enters the if-statement first, reads the original array[1] = 1, calculates a new value of temp (1, because (1 + 2)/2 = 1), and writes 1 to array[2].

Notice that the value read by tx = 2 differs between these two cases. Here the rounding from integer division hides the difference, so the final result happens to be unaffected (code containing such a race is still not a good idea, though).
On older architectures without independent thread scheduling, there is still a race condition between warps. The only difference from the intra-warp case is that the race is between thread IDs such as tx = 31 and tx = 32: the warp containing tx = 31 could be scheduled first and run all the way through the write back to array[tx] before the warp containing tx = 32 is scheduled and reads array[tx - 1].
With a __syncthreads(), the final write to array[tx] will not occur until all threads in the CTA have read array[tx - 1].

Long story short, there is a race condition. (Answer End)

