GPGPU Hazard & Instruction Replay

一只纯洁的树袋熊

Last modified 2024-03-10 17:54:16


Tags: learning, hardware architecture

First published 2024-03-10 17:48:08

Article link: https://blog.csdn.net/weixin_45669724/article/details/136605501


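As a beginner in GPU architecture, I found reading General-Purpose Graphics Processor Architectures [1] (2018, part of the Synthesis Lectures on Computer Architecture series) quite painful. I could follow the overall architecture, but many of the more modern techniques are described in prose only, which left me lost. The worst was the replay technique in Section 3.3.2, "Instruction Replay: Handling Structural Hazards": the name seems to tell you what it does, yet on closer reading I understood none of it. A web search turned up no blog posts covering this material, so I had to work it out myself. Below are the passages from the book that describe replay.

PS: I recommend reading these original passages carefully first.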

3.3.2 Instruction Replay: Handling Structural Hazards

To avoid these issues [note: this refers to structural hazards] GPUs implement a form of instruction replay. Instruction replay is found in some CPU designs where it is used as a recovery mechanism when speculatively scheduling a dependent instruction upon a earlier instruction that has variable latency. For example, loads may either hit or miss in a first-level cache but CPU designs that are clocked at a high frequency may pipeline first-level cache access over as many as four clock cycles. Some CPUs speculative wake up instructions depending upon a load so as to improve single threaded performance. In contrast, GPUs avoid speculation as it tends to waste energy and reduce throughput. Instead, instruction replay is used in GPUs to avoid clogging the pipeline and the circuit area and/or timing overheads resulting from stalling.

To implement instruction replay a GPU can hold instructions in the instruction buffer either until it is known that they have completed or all individual portions of the instruction have executed.

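The book returns to replay later, in the memory chapter, in the context of shared memory. In my opinion this later passage gives a noticeably more intuitive picture.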

4.1.1 SCRATCHPAD MEMORY AND L1 DATA CACHE (Shared Memory Access Operations)

For a shared memory accesses the arbiter determines whether the requested addresses within the warp will cause bank conflicts. If the requested addresses would cause one or more bank conflicts, the arbiter splits the request into two parts. The first part includes addresses for a subset of threads in the warp which do not have bank conflicts. This part of the original request is accepted by the arbiter for further processing by the cache. The second part contains those addresses that cause bank conflicts with addresses in the first part. This part of the original request is returned to the instruction pipeline and must be executed again. This subsequent execution is known as a “replay.” There is a tradeoff in where the replay part of the original shared memory request is stored. While area can be saved by replaying the memory access instruction from the instruction buffer this consumes energy in accessing the large register file. A better alternative for energy efficiency may be to provide limited buffering for replaying memory access instructions in the load/store unit and avoiding scheduling memory access operations from the instruction buffer when free space in this buffer beings to run out. Before considering what happens to the replay request, let us consider how the accepted portion of the memory request is processed.
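
To make the split described in this passage concrete, here is a minimal Python sketch of the arbiter's decision. The bank count (32), the 4-byte bank width, the function names, and the rule that lanes reading the same word do not conflict are all assumptions made for illustration; the book does not specify them.

```python
from typing import Dict, List, Optional, Tuple

NUM_BANKS = 32   # assumed number of shared-memory banks
WORD_BYTES = 4   # assumed bank width in bytes


def split_request(addrs: List[Optional[int]]) -> Tuple[List[int], List[int]]:
    """Split one warp's shared-memory request into (accepted, replay) lanes.

    addrs[i] is the byte address requested by lane i, or None if the lane
    is inactive. A lane is accepted if its bank is still free, or if it
    reads the same word as the lane that already claimed the bank
    (assumed broadcast). Every other lane goes to the replay part.
    """
    claimed: Dict[int, int] = {}  # bank -> word index that claimed it
    accepted: List[int] = []
    replay: List[int] = []
    for lane, addr in enumerate(addrs):
        if addr is None:
            continue
        word = addr // WORD_BYTES
        bank = word % NUM_BANKS
        if bank not in claimed or claimed[bank] == word:
            claimed[bank] = word
            accepted.append(lane)
        else:
            replay.append(lane)
    return accepted, replay


# Lane 2 maps to the same bank as lane 0 but a different word, so it is replayed.
first, again = split_request([0, 4, 128, 8])
print(first, again)
```

Calling `split_request` a second time with only the replayed lanes active models the subsequent execution that the book calls a replay.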

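As these passages show, the book's treatment is not very detailed; it is mainly meant to point the reader in the right direction. The rest of this post therefore explains instruction replay mainly on the basis of a UBC master's thesis [2] that I found.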

1、Brief Architecture of GPU/GPGPU

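For the GPU architecture, the authors of [1] use the following figure.

The author of [2] first presents the full chip architecture and then briefly expands the right half of the figure above, which makes it easier to follow. (PS: the figure contains a bad cache → cash pun.)

Different authors draw the figure in different styles, but a GPU core is generally divided by function into two parts. The roles of some of the modules are described below: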

Front-End: handles warp selection and warp issue
- Warp Buffer: Consists of a table of storage for warp instruction streams, each entry consists of a number of instruction storage areas and tracking information, such as instruction addresses, whether operand values are ready, and execution state.
- Scheduler: Instructions are issued down the pipeline from the Warp Buffer by the Scheduler; also supports multiple issue.
  - Mechanism (stall-on-use): to handle RAW (read-after-write) dependences, the processor may keep issuing later instructions while earlier ones are still executing, until a later instruction is itself blocked by a RAW dependence or all instructions have finished executing.
  - Structure (two-level warp scheduler): Ensure a constant supply of both memory and ALU instructions.
- Mask Stage: Mark out threads in the warp that are not to be executed with this instruction. The mask is used mainly for branch instructions, to make sure only the correct threads are executed on each side of branches, but as discussed later, it also plays a role in replay.
Back-End: handles the data computation for threads
- Operand Collector: It is where instructions receive data for their operands from the register file. Supports multiple issue.
- ALU & MEM
- Write Back

2、Hazard for GPU

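Anyone who has studied basic CPU architecture knows that the hazards in a CPU (the classic five-stage pipeline) are mainly structural hazards and data hazards, along with control hazards from branches. Modern CPUs reduce their impact with out-of-order issue and branch prediction.

A GPU instead offsets the impact of these hazards with massive thread-level parallelism: a thread that encounters a hazard is not issued, and other threads that are ready execute in its place.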
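
The idea of skipping a blocked warp in favor of a ready one can be sketched in a few lines of Python. The round-robin policy and the name `pick_ready_warp` are assumptions for illustration only; neither [1] nor [2] is being quoted here, and real schedulers use more elaborate policies.

```python
from typing import List, Optional


def pick_ready_warp(ready: List[bool], last: int) -> Optional[int]:
    """Return the next ready warp after `last` in round-robin order.

    ready[w] is True when warp w has no pending hazard. Warps that are
    blocked are simply skipped; None means every warp is blocked.
    """
    n = len(ready)
    for step in range(1, n + 1):
        warp = (last + step) % n
        if ready[warp]:
            return warp
    return None


# Warp 1 is blocked by a hazard, so the scheduler moves on to warp 2.
print(pick_ready_warp([True, False, True], last=0))
```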

The author of [2] divides hazards into structural hazards and dynamic hazards and focuses on the latter, defining them as "unpredictable and occur due to interactions with other warps in the core or uncertainties inherent with a particular instruction." Dynamic hazards can be further divided into two kinds:

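Stall Hazard: Stalls occur when an instruction cannot proceed to the next pipeline stage because that pipeline stage is not empty. A stall propagates along the pipeline: a stall in a later stage can block the instructions behind it so that they cannot issue, as shown in the figure below. In the figure, blue marks MEM instructions and green marks ALU instructions. Once a MEM instruction stalls, the stall can propagate back toward the front of the pipeline and eventually lock up the entire parallel compute unit.
MEM Hazard (my own name for it): Memory instructions can encounter a number of different conditions which prevent them from completing. The figure below shows the memory unit of the baseline GPU.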

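The normal flow is address calculation → access analysis → data fetch (cache) → data return (dispatch). On a cache miss, however, the right half of the figure comes into play. To introduce the four kinds of hazard that arise in this process, the author uses the figure below. (Bold text in this part marks the specific hazard types.)
- DIV and BANK: A DIV hazard is one caused by accesses to constant, texture, or global memory. These spaces can only be accessed in coalesced fashion, that is, each access can touch only a single cache line. Address calculation yields the set of addresses needed by the threads of a warp (they all differ), and by design these requests are merged into one wide request that fetches a complete block of memory at once. Whenever some threads' addresses fall outside that block, those stray threads hit a hazard; this is the DIV (divergence) hazard. BANK works on a similar principle and is defined as a hazard caused by shared memory operations: because shared memory is banked, accesses to it are not limited to one cache line at a time, so the DIV problem does not arise, but a hazard still occurs when several threads access the same bank at the same time.
- MSHR (Miss Status Holding Register): On every cache miss, the details of the miss must be written into an entry of the MSHR table; if the table is full, a hazard occurs.
- COMQ (Communication Queue): a hazard caused by the interconnect being full.
- RSV (reservation): RSV hazards occur when all the valid lines that a request could be written in a cache are already reserved by other requests that have not yet returned. In other words, there is nowhere to put the data when it comes back.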
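
As a rough illustration of the DIV and MSHR cases, the Python sketch below groups a warp's addresses by cache line and models a fixed-size MSHR table. The 128-byte line size, the choice to serve the line of the lowest active lane first, and the names `first_pass` and `MshrTable` are assumptions for illustration and are not taken from [2].

```python
from typing import Dict, List, Optional, Tuple

LINE_BYTES = 128  # assumed cache-line size


def first_pass(addrs: List[Optional[int]]) -> Tuple[List[int], List[int]]:
    """Return (lanes served by one cache line, lanes that hit a DIV hazard)."""
    by_line: Dict[int, List[int]] = {}
    for lane, addr in enumerate(addrs):
        if addr is not None:  # None marks an inactive lane
            by_line.setdefault(addr // LINE_BYTES, []).append(lane)
    groups = list(by_line.values())  # dicts keep insertion order
    if not groups:
        return [], []
    served = groups[0]
    pending = [lane for group in groups[1:] for lane in group]
    return served, pending


class MshrTable:
    """Fixed-capacity table of outstanding misses (no merging modeled)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.lines: List[int] = []

    def allocate(self, line: int) -> bool:
        """Record a miss; False means the table is full (MSHR hazard)."""
        if len(self.lines) >= self.capacity:
            return False
        self.lines.append(line)
        return True

    def release(self, line: int) -> None:
        """Free the entry once the missing line has returned."""
        self.lines.remove(line)


served, pending = first_pass([0, 4, 130, 256])
print(served, pending)
```

Lanes 0 and 1 share the first cache line and are served together; lanes 2 and 3 fall in other lines and would have to be handled by later accesses.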

3、Instruction Replay

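In one sentence: the main purpose of replay is to deal with the hazards a GPU faces, which is why I spent so much space on hazards above.

In this part I first describe how the replay mechanism is implemented in hardware and then use an example to show how replay works.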

3.1 Hardware Implementation

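Replay mitigates the problems caused by hazards without incurring much hardware cost: it essentially requires only some extensions to the data stored in the Warp Buffer and to the logic of the mask stage. The figure below shows the data structure of a Warp Buffer that uses the replay mechanism.

In the Warp Buffer, each instruction (blue box) contains:

Instruction Label: a label that distinguishes instructions; here it is mainly for illustration.
Valid Bit: the issue flag. V means the instruction's operands are available and the instruction can be issued.
Replay Bit: the replay flag. R means the instruction must not be cleared from the Warp Buffer and needs to be issued again later.
PAM (Private Active Mask): marks the threads that need to be issued during replay. It is initialized from the returned signals after the instruction's first issue. As described above, usually only some of the threads in a warp hit a hazard, so even when an instruction must be replayed, the threads that have already completed it need not execute it again. The PAM indicates which threads must re-issue the instruction during replay (PAM bit 1) and which need not (PAM bit 0).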
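
The per-instruction fields listed above can be pictured as a small record. The following Python sketch is only an illustration: the class name `Entry`, the method `on_feedback`, and the convention that thread i corresponds to bit i of the PAM are assumptions, not details taken from [2].

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One instruction slot in the Warp Buffer (illustrative fields only)."""
    label: str            # instruction label, e.g. "A"
    valid: bool = False   # V: operands are available, may be issued
    replay: bool = False  # R: must stay in the buffer and be issued again
    pam: int = 0          # Private Active Mask, bit i = thread i must re-issue

    def on_feedback(self, unfinished: int) -> None:
        """Record which threads have still not completed this instruction.

        `unfinished` is a bit mask reported back by the pipeline. A zero
        mask means every thread is done, so the replay bit is cleared.
        """
        self.pam = unfinished
        self.replay = unfinished != 0


entry = Entry("A", valid=True)
entry.on_feedback(0b0101)   # threads 0 and 2 still need to execute
print(entry.replay, bin(entry.pam))
```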

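At the warp slot level, the replay mechanism adds three pointers:

IP (issue pointer): points to the instruction to be issued next and increments automatically after issue.
ITP (issue tail pointer): points to the oldest replayable instruction that has not yet completed.
FP (fill pointer): points to the instruction slot that can be filled next; the FP may not pass the IP or the ITP when filling instructions.

The mask stage only needs to combine the original warp mask with the PAM to mask the threads in the warp.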
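
A minimal sketch of that combination, assuming the two masks are simply ANDed together when the instruction is a replay (the function names, the 4-thread warp, and the bit-per-thread encoding are illustrative assumptions):

```python
from typing import List

WARP_SIZE = 4  # matches the 4-thread warp used in the example of this post


def issue_mask(warp_mask: int, pam: int, replay_bit: bool) -> int:
    """Threads that actually execute on this issue.

    On a first issue only the warp mask (e.g. from branch divergence)
    applies. On a replay the PAM further restricts execution to the
    threads that have not completed the instruction yet.
    """
    return warp_mask & pam if replay_bit else warp_mask


def active_lanes(mask: int) -> List[int]:
    """Decode a bit mask into a list of thread indices."""
    return [lane for lane in range(WARP_SIZE) if (mask >> lane) & 1]


# All four threads are active in the warp, but only threads 0 and 2 replay.
print(active_lanes(issue_mask(0b1111, 0b0101, replay_bit=True)))
```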

3.2 Replay Example

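A simple example is shown in the figure below.

PS: the Warp Buffer holds 4 instructions in total, the FP fills two instructions at a time, and all pointers wrap around.

Following the order of the panels in the figure, the execution proceeds as follows:

Instruction A is issued but is flagged as needing replay, so its PAM is set to 1111, meaning all four threads in the warp still need to execute it;
Instruction B is issued, completes directly, and is removed from the buffer; instruction C is issued but is flagged as needing replay, with PAM 1100;
According to the feedback, the condition that kept A from completing has been resolved for threads 1 and 3, so A's PAM is updated to 0101; the condition blocking C has not been resolved;
According to the feedback, the condition blocking A has been resolved for threads 0 and 2 (the figure is wrong here), so A is issued successfully and cleared from the buffer; the FP's fill condition is now met, so new instructions D and E are added to the buffer;
A replay of instruction C is attempted.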
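
The steps above can be traced with a small Python model. This is a simplified sketch under several assumptions that are not taken from [2]: the buffer is assumed to start with A, B, C in slots 0 to 2 and slot 3 empty, the fill rule is reduced to "the next two slots in circular order must both be free", an instruction retires as soon as its PAM reaches zero, and thread i corresponds to bit i of the PAM (so 0101 means threads 0 and 2). The class and method names are hypothetical.

```python
from typing import Dict, Iterable, List, Optional

SLOTS = 4        # Warp Buffer capacity in the example
FILL_WIDTH = 2   # instructions added per fill


def mask(lanes: Iterable[int]) -> int:
    """Encode a list of thread indices as a bit mask."""
    return sum(1 << lane for lane in lanes)


class WarpSlot:
    def __init__(self, labels: List[str]) -> None:
        self.labels: List[Optional[str]] = labels + [None] * (SLOTS - len(labels))
        self.pam: Dict[str, int] = {}   # label -> PAM of held instructions
        self.fp = len(labels) % SLOTS   # fill pointer

    def issue(self, label: str, blocked: List[int]) -> None:
        """First issue; `blocked` lists the threads that hit a hazard."""
        self._update(label, mask(blocked))

    def feedback(self, label: str, resolved: List[int]) -> None:
        """Clear the PAM bits of threads whose hazard has been resolved."""
        self._update(label, self.pam[label] & ~mask(resolved))

    def _update(self, label: str, pam: int) -> None:
        if pam:
            self.pam[label] = pam        # keep the entry for replay
        else:
            self.pam.pop(label, None)    # done: free the slot
            self.labels[self.labels.index(label)] = None

    def try_fill(self, new_labels: List[str]) -> bool:
        """Fill the next FILL_WIDTH slots if all of them are free."""
        targets = [(self.fp + i) % SLOTS for i in range(FILL_WIDTH)]
        if any(self.labels[t] is not None for t in targets):
            return False
        for t, label in zip(targets, new_labels):
            self.labels[t] = label
        self.fp = (self.fp + FILL_WIDTH) % SLOTS
        return True


w = WarpSlot(["A", "B", "C"])
w.issue("A", [0, 1, 2, 3])           # step 1: A held, PAM 1111
w.issue("B", [])                     # step 2: B completes and is removed
w.issue("C", [2, 3])                 #         C held, PAM 1100
blocked = w.try_fill(["D", "E"])     # A still occupies a slot the fill needs
w.feedback("A", [1, 3])              # step 3: A's PAM becomes 0101
w.feedback("A", [0, 2])              # step 4: A completes and is cleared
filled = w.try_fill(["D", "E"])      #         D and E enter the buffer
print(w.labels, w.pam)               # step 5: only C is left to replay
```

In this model the first fill attempt is rejected because A still sits in a slot the wrap-around fill would overwrite, which is one way to read the rule that the FP may not pass the ITP.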

Building on this, and on the GPU hazard types introduced earlier, we can see that: