Instruction-Level Parallelism and Its Exploitation

  • Reference: *Computer Architecture: A Quantitative Approach* (6th Edition)

What Is Instruction-Level Parallelism?

  • Instruction-Level Parallelism (ILP): overlap the execution of instructions to improve performance
  • 2 approaches to exploit ILP:
    • (1) Rely on software technology to find parallelism, statically at compile time (static scheduling by the compiler)
      • successful only in domain-specific environments or in well-structured scientific applications with significant data-level parallelism
    • (2) Rely on hardware to help discover and exploit the parallelism dynamically
      • this approach, used by all recent Intel and many ARM processors, dominates the desktop and server markets

Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls

  • The preceding equation allows us to characterize various techniques by what component of the overall CPI a technique reduces.
    [figure: table mapping the major ILP techniques to the CPI component each one reduces]

Basic Block (BB)

  • a straight-line code sequence with no branches in except to the entry and no branches out except at the exit
  • The amount of parallelism available within a basic block is quite small
    • average dynamic branch frequency of 15% to 25% ⇒ only 4 to 7 instructions execute between a pair of branches
    • Plus instructions in BB likely to depend on each other: the amount of overlap we can exploit within a basic block is likely to be less than the average basic block size.
  • To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks.

Loop-Level Parallelism

  • Exploit parallelism among iterations of a loop (the simplest approach).
    • Here is a simple example of a loop that adds two 1000-element arrays and is completely parallel (see the C sketch after this list):
      [code: for (i=0; i<1000; i++) x[i] = x[i] + y[i];]
    • Converting such loop-level parallelism into instruction-level parallelism:
      • (1) unrolling the loop, either statically by the compiler or dynamically by the hardware (unrolling leaves fewer branch instructions)
      • (2) An important alternative is the use of vector instructions or a GPU (SIMD). A SIMD instruction exploits data-level parallelism by operating on a small to moderate number of data items in parallel (typically two to eight).
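
As a point of reference, such a loop looks like this in C (a minimal sketch; the function name and the fixed bound are illustrative):

```c
/* Adds two 1000-element arrays. Every iteration reads and writes
   disjoint elements, so all iterations are independent and the
   loop is completely parallel at the iteration level. */
void vadd(double x[1000], const double y[1000]) {
    for (int i = 0; i < 1000; i++)
        x[i] = x[i] + y[i];
}
```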

Dependences

  • Determining how one instruction depends on another is critical to determining how much parallelism exists in a program and how that parallelism can be exploited.
    • If two instructions are parallel, they can execute simultaneously in a pipeline of arbitrary depth without causing any stalls, assuming the pipeline has sufficient resources (and hence no structural hazards exist).
    • If two instructions are dependent, they are not parallel and must be executed in order, although they may often be partially overlapped.

3 types of dependences:

  • (1) data dependences
  • (2) name dependences
  • (3) control dependences

Dependences are a property of programs:

  • A data dependence conveys three things: (1) the possibility of a hazard, (2) the order in which results must be calculated, and (3) an upper bound on how much parallelism can possibly be exploited.
  • Dependences are a property of programs; hazards are a property of the pipeline organization. The presence of a dependence only indicates the potential for a hazard; whether an actual hazard occurs, and the length of any stall, is determined by the pipeline.

HW/SW goal:

  • exploit parallelism by preserving program order only where it affects the outcome of the program

Detection of a dependence

  • A data value may flow between instructions either through registers or through memory locations.
    • When the data flow occurs through a register, detecting the dependence is straightforward because the register names are fixed in the instructions.
    • Dependences that flow through memory locations are more difficult to detect because two addresses may refer to the same location but look different
      • For example, 100(x4) and 20(x6) may be identical memory addresses.
      • In addition, the effective address of a load or store may change from one execution of the instruction to another (so that 20(x4) and 20(x4) may be different), further complicating the detection of a dependence.

Data Dependences

  • $Instr_J$ is data dependent on $Instr_I$ if either of the following holds:
    • $Instr_J$ tries to read an operand before $Instr_I$ writes it (may cause a Read After Write (RAW) hazard)
      [code: RAW example]
    • or $Instr_J$ is data dependent on $Instr_K$, which in turn is dependent on $Instr_I$
  • If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped

Name Dependences

  • Name dependence: two instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name;
  • 2 versions of name dependence:
    • Anti-dependence: $Instr_J$ writes an operand before $Instr_I$ reads it (may cause a Write After Read (WAR) hazard)
      [code: WAR example] This results from reuse of the name "r1".
    • Output dependence: $Instr_J$ writes an operand before $Instr_I$ writes it (may cause a Write After Write (WAW) hazard)
      [code: WAW example] This also results from reuse of the name "r1".
  • Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so that the instructions do not conflict (see the C sketch after this list)
    • Register renaming resolves name dependences for registers (done either by the compiler or by hardware)
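
A hypothetical C sketch of the three cases, with variables standing in for registers (`r5` below is the renamed destination, an illustrative name):

```c
/* Dependences on the name "r1": RAW is a true data dependence;
   WAR and WAW are name dependences that renaming can remove. */
int r1, r2, r3, r4;

void with_name_dependences(void) {
    r1 = r2 + r3;   /* writes r1                                   */
    r4 = r1 + r3;   /* RAW: reads the r1 written above             */
    r1 = r2 - r3;   /* WAR with the read above, WAW with the first
                       write: both come from reusing the name r1   */
}

void renamed(void) {    /* renaming the second write removes both  */
    int r5;             /* name dependences; the RAW remains       */
    r1 = r2 + r3;
    r4 = r1 + r3;
    r5 = r2 - r3;
    (void)r5;
}
```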

Control Dependences

  • A control dependence determines the ordering of an instruction with respect to a branch instruction, so that the instruction executes in correct program order and only when it should be executed
  • Every instruction is control dependent on some set of branches (except for instructions in the first basic block of a program), and, in general, these control dependencies must be preserved to preserve program order.
    • For example, in the code segment
      if p1 { S1; }; if p2 { S2; }: here S1 is control-dependent on p1, and S2 is control-dependent on p2 but not on p1.
  • In general, two constraints are imposed by control dependences:
    • (1) An instruction that is control-dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch.
      • For example, we cannot take an instruction from the then portion of an if statement and move it before the if statement.
    • (2) An instruction that is not control-dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch.
      • For example, we cannot take a statement before the if statement and move it into the then portion.

  • Control dependence is not in itself the critical property that must be preserved
    • e.g., we may be willing to execute instructions that should not have been executed, thereby violating the control dependences, if we can do so without affecting the correctness of the program.
  • Instead, 2 properties critical to program correctness are
    • exception behavior
    • data flow

Both properties can be preserved by maintaining the data dependences and control dependences


Exception Behavior

  • Preserving exception behavior means that any changes in instruction execution order must not change how exceptions are raised in the program. This is often relaxed to mean that reordering instruction execution must not cause any new exceptions in the program.
  • A simple example shows how maintaining the control and data dependences can prevent such situations:
    [code: beq x2,x0,L1 followed by ld x1,0(x2)] If we ignore the control dependence and move the load instruction before the branch, the load may cause a memory protection exception. (In the original order, the branch is taken when x2 is 0, so address 0 is never accessed; if the load is hoisted above the branch, it accesses address 0 whenever x2 is 0, raising the exception.)

To allow us to reorder these instructions (and still preserve the data dependence), we want to just ignore the exception when the branch is taken. Later, we will look at a hardware technique, speculation, which allows us to overcome this exception problem.
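
In C terms, the problem looks roughly like this (a hypothetical sketch; `guarded` and `hoisted` are illustrative names):

```c
#include <stddef.h>

/* The load is control-dependent on the NULL test. */
long guarded(const long *x2) {
    long x1 = 0;
    if (x2 != NULL)
        x1 = *x2;      /* executes only when the guard passes */
    return x1;
}

/* Reordered version: the load no longer depends on the branch,
   so it faults exactly when x2 is NULL, raising an exception
   the original program could never raise. */
long hoisted(const long *x2) {
    long x1 = *x2;     /* speculative load: may trap */
    return (x2 != NULL) ? x1 : 0;
}
```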


Data Flow

  • Data flow: actual flow of data values among instructions that produce results and those that consume them.
    • branches make the data flow dynamic, because they determine which instruction actually supplies a data value
  • e.g.
    [code: DADDU R1,R2,R3; BEQZ R4,L; DSUBU R1,R5,R6; L: … OR R7,R1,R8] Does OR depend on DADDU or on DSUBU? The value of R1 seen by OR depends on the branch outcome, so the data flow must be preserved during execution.

Violating the control dependence may not affect either the exception behavior or the data flow

  • Consider the following code sequence:
    [code: DADDU R1,R2,R3; BEQZ R12,skip; DSUBU R4,R5,R6; DADDU R5,R4,R9; skip: OR R7,R8,R9]
  • Suppose we knew that the register destination of the sub instruction (x4) was unused after the instruction labeled skip and the existing sub instruction could not generate an exception. Then we could move the sub instruction before the branch because the data flow could not be affected by this change:
    • If the branch is taken, the sub instruction will execute and will be useless, but it will not affect the program results.
    • This type of code scheduling is also a form of speculation, often called software speculation, because the compiler is betting on the branch outcome; in this case, the bet is that the branch is usually not taken.

Basic Compiler Techniques for Exposing ILP

Basic Pipeline Scheduling and Loop Unrolling

Static Scheduling

  • If there is an unavoidable hazard, then the hazard detection hardware stalls the pipeline (starting with the instruction that uses the result). No new instructions are fetched or issued until the dependence is cleared.
  • To overcome these performance losses, the compiler can attempt to schedule instructions to avoid the hazard; this approach is called compiler or static scheduling.

  • To avoid a pipeline stall, the execution of a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction.
  • A compiler’s ability to perform this scheduling depends both on the amount of ILP available in the program and on the latencies of the functional units in the pipeline.

  • Assume the following latencies for all examples (ignore delayed branches in these examples):

    | Instruction producing result | Instruction using result | Latency in clock cycles |
    | --- | --- | --- |
    | FP ALU op | Another FP ALU op | 3 |
    | FP ALU op | Store double | 2 |
    | Load double | FP ALU op | 1 |
    | Load double | Store double | 0 |

The last column is the number of intervening clock cycles needed to avoid a stall.

We assume that the functional units are fully pipelined or replicated (as many times as the pipeline depth) so that an operation of any type can be issued on every clock cycle and there are no structural hazards.


  • In this section, we look at how the compiler can increase the amount of available ILP by transforming loops. We will rely on the following code segment, which adds a scalar to a vector:
    [code: for (i=999; i>=0; i=i-1) x[i] = x[i] + s;] We can see that this loop is parallel by noticing that the body of each iteration is independent.
  • First translate into MIPS code:
    [code: Loop: L.D F0,0(R1); ADD.D F4,F0,F2; S.D F4,0(R1); DADDUI R1,R1,#-8; BNE R1,R2,Loop]
    • Integer register R1: the loop counter, initialized to the address of the highest-address element of the vector x
    • FP register F2: holds the scalar constant s
    • Assume the lowest-address element of the array is at address 8
  • Without any scheduling, the loop will execute as follows, taking nine cycles:
    [figure: clock-cycle breakdown of the unscheduled loop, showing the stalls]
  • We can schedule the loop to obtain only two stalls and reduce the time to seven cycles:
    [figure: the scheduled loop, with the stalls reduced to two]
    • 7 clock cycles per iteration, but only 3 do actual array work (L.D, ADD.D, S.D); the other 4 are loop overhead (two stalls, DADDUI, BNE)
    • How can we make it faster? A simple scheme for increasing the number of instructions relative to the branch and overhead instructions is loop unrolling. Unrolling simply replicates the loop body multiple times, adjusting the loop termination code.
  • Try to Unroll Loop Four Times
    [figure: the loop body replicated four times, before register renaming]
  • Rename the Register: If we simply replicated the instructions when we unrolled the loop, the resulting use of the same registers could prevent us from effectively scheduling the loop. Thus we will want to use different registers for each iteration, increasing the required number of registers.
    [figure: the unrolled loop with a different register set used for each iteration]
  • 27 clock cycles, or 6.75 per iteration [figure: cycle count of the unrolled, unscheduled loop]
  • Loop unrolling can also be used to improve scheduling. Because it eliminates the branch, it allows instructions from different iterations to be scheduled together. In this case, we can eliminate the data use stalls by creating additional independent instructions within the loop body.
    • Scheduling the loop in this fashion necessitates realizing that the loads and stores are independent and can be interchanged.
      [figure: the unrolled loop after scheduling]
    • 14 clock cycles, or 3.5 per iteration. Unrolling improves the performance of this loop by eliminating overhead instructions, although it increases code size substantially.

Unrolled Loop Detail

  • We do not usually know the upper bound of a loop. Suppose it is $n$, and we would like to unroll the loop to make $k$ copies of the body (e.g., $n=30$, $k=4$). Instead of a single unrolled loop, we generate a pair of consecutive loops (see the C sketch below):
    • the 1st executes $n \bmod k$ times and has a body that is the original loop ($30 \bmod 4 = 2$)
    • the 2nd is the unrolled body surrounded by an outer loop that iterates $n/k$ times ($30/4 = 7$)
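
A minimal C sketch of this two-loop arrangement, using the running example x[i] = x[i] + s with k = 4 (function and parameter names are illustrative):

```c
/* Unrolling with an unknown trip count n: a scalar remainder loop
   runs (n % k) iterations, then the unrolled loop runs n/k times. */
void add_scalar(double *x, long n, double s) {
    long i = 0;
    long rem = n % 4;
    for (; i < rem; i++)          /* 1st loop: n mod k iterations  */
        x[i] = x[i] + s;
    for (; i < n; i += 4) {       /* 2nd loop: body unrolled k = 4 */
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
}
```

With n = 30 and k = 4 this executes the remainder loop twice and the unrolled loop seven times, matching the numbers above.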

Loop Unrolling Decisions

  • Requires understanding how one instruction depends on another and how the instructions can be changed or reordered given the dependences:
    • (1) Determine that unrolling is useful by finding that the loop iterations are independent (except for the loop maintenance code)
    • (2) Use different registers to avoid unnecessary constraints forced by using same registers for different computations
    • (3) Eliminate the extra test and branch instructions and adjust the loop termination and iteration code
    • (4) Determine that loads and stores in unrolled loop can be interchanged by observing that loads and stores from different iterations are independent
      • Transformation requires analyzing memory addresses and finding that they do not refer to the same address
    • (5) Schedule the code, preserving any dependences needed to yield the same result as the original code

Limits to Loop Unrolling

  • (1) Diminishing returns: each extra unrolling amortizes less of the overhead (the unroll factor should be chosen sensibly)
  • (2) Growth in code size (For larger loops, concern it increases the instruction cache miss rate)
  • (3) Register pressure: potential shortfall in registers created by aggressive unrolling and scheduling
    • If it is not possible to allocate all live values to registers, the code may lose some or all of the advantage of unrolling

Dynamic Branch Prediction

Why does prediction work?

  • Underlying algorithm has regularities
  • Data that is being operated on has regularities
  • Instruction sequences have redundancies that are artifacts of the way humans and compilers think about problems

Branch-Prediction Buffers

  • The simplest dynamic branch-prediction scheme is a branch-prediction buffer or branch history table (BHT).
    • A branch-prediction buffer is a small memory indexed by the lower portion of the address of the branch instruction. (With 32-bit instruction addresses, indexing by the full address would need a $2^{32}$-entry table, roughly 4 Gbit even at one bit per entry, which is far too costly; using only the low-order bits also exploits locality.)
    • The memory contains a bit that says whether the branch was recently taken or not.
  • With such a buffer, we don't know, in fact, if the prediction is correct; it may have been put there by another branch that has the same low-order address bits. A wrong prediction can therefore have two causes: (1) a genuine misprediction, or (2) a miss, where the entry belongs to a different branch.
    • But this doesn’t matter. The prediction is a hint that is assumed to be correct, and fetching begins in the predicted direction. If the hint turns out to be wrong, the prediction bit is inverted and stored back. (hit rate of the buffer is not the major limiting factor. Instead, we need to look at how we might increase the accuracy of each predictor)
      [figure: a 1-bit branch-prediction buffer]
  • Problem: in a loop, a 1-bit BHT causes two mispredictions per loop execution (even if the branch is almost always taken):
    • at the end of the loop, when it exits instead of looping as before (this misprediction is unavoidable, since the branch was taken on every earlier iteration)
    • on the first iteration of the next execution of the loop, when it predicts exit instead of looping (this one is inherited from the previous execution, whose final not-taken branch flipped the bit)

2-bit prediction schemes

  • In a 2-bit scheme, a prediction must miss twice before it is changed (this adds hysteresis to the decision-making process).
    [figure: 2-bit predictor state-transition diagram]

The 2-bit scheme is actually a specialization of a more general scheme that has an $n$-bit saturating counter for each entry in the prediction buffer. With an $n$-bit counter, the counter can take on values between $0$ and $2^n-1$: when the counter is greater than or equal to one-half of its maximum value, the branch is predicted as taken; otherwise, it is predicted as untaken. Studies of $n$-bit predictors have shown that 2-bit predictors do almost as well, so most systems rely on 2-bit branch predictors rather than the more general $n$-bit predictors.
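
A minimal C sketch of such a 2-bit saturating counter (names are illustrative; the $n$-bit generalization just widens the saturation bounds and the taken threshold):

```c
#include <stdbool.h>
#include <stdint.h>

/* 2-bit saturating counter: values 0,1 predict not taken; 2,3
   predict taken. A prediction must be wrong twice in a row to flip. */
typedef uint8_t counter2;

static bool predict_taken(counter2 c) { return c >= 2; }

static counter2 update(counter2 c, bool taken) {
    if (taken)  return c < 3 ? c + 1 : 3;   /* saturate at 3 */
    else        return c > 0 ? c - 1 : 0;   /* saturate at 0 */
}
```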


Implementation

  • A branch-prediction buffer can be implemented as a small, special “cache” accessed with the instruction address during the IF pipe stage or as a pair of bits attached to each block in the instruction cache and fetched with the instruction.
  • If the instruction is decoded as a branch and if the branch is predicted as taken, fetching begins from the target as soon as the PC is known. Otherwise, sequential fetching and executing continue.

  • As noted above, simply enlarging the BHT or widening each entry does not substantially improve prediction accuracy; the techniques below improve dynamic prediction accuracy further.

Correlating Branch Predictors


  • The 2-bit predictor schemes use only the recent behavior of a single branch to predict the future behavior of that branch.
  • It may be possible to improve the prediction accuracy if we also look at the recent behavior of other branches rather than just the branch we are trying to predict.

Consider a small code fragment

[code: if (aa==2) aa=0; if (bb==2) bb=0; if (aa != bb) { … }]

  • Translate to MIPS code (assuming that aa and bb are assigned to registers R1 and R2)
    [code: the MIPS translation, with branches b1, b2, and b3]
  • The key observation is that the behavior of branch b3 is correlated with the behavior of branches b1 and b2. Clearly, if branches b1 and b2 are both not taken (i.e., if both conditions evaluate to true and aa and bb are both assigned 0), then b3 will be taken, since aa and bb are then equal.

Correlating predictors / two-level predictors

  • Idea: record the $m$ most recently executed branches as taken or not taken, and use that pattern to select the proper $n$-bit branch history table. In the general case, an $(m,n)$ predictor uses the behavior of the last $m$ branches to choose from $2^m$ history tables, each of which is an $n$-bit predictor for a single branch.
    • e.g., a $(1,2)$ predictor uses the behavior of the last branch to choose from among a pair of 2-bit branch predictors in predicting a particular branch.
    • the old 2-bit BHT is a $(0,2)$ predictor

Implementation (an $(m,n)$ predictor)

  • Global branch history: an $m$-bit shift register keeps the taken/not-taken status of the last $m$ branches.
  • Each entry in the BHT has $2^m$ $n$-bit predictors.
    • The branch-prediction buffer can then be indexed using a concatenation of the low-order bits of the branch address with the $m$-bit global history.
    • The behavior of the recent branches selects among the $2^m$ predictions for the next branch, and only the selected prediction is updated
      [figure: organization of an (m,n) predictor]
  • e.g., using a (2,2) predictor and assuming the branch address is 0x48 (see the C sketch below)
    [figure: indexing a (2,2) predictor with branch address 0x48]
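
A sketch of how such a predictor could be indexed and trained in C; the table size, the history width, and all names here are illustrative assumptions, not the figure's exact configuration:

```c
#include <stdbool.h>
#include <stdint.h>

#define M 2                 /* global history bits (the "m")        */
#define ADDR_BITS 10        /* low-order branch-address bits used   */

static uint8_t bht[1 << (ADDR_BITS + M)];  /* 2-bit counters        */
static uint8_t ghist;                      /* last M branch outcomes */

/* Index = low-order address bits concatenated with the m-bit
   global history, as described above. */
static unsigned index_of(uint32_t pc) {
    unsigned addr = (pc >> 2) & ((1u << ADDR_BITS) - 1);
    return (addr << M) | (ghist & ((1u << M) - 1));
}

static bool predict(uint32_t pc) { return bht[index_of(pc)] >= 2; }

static void train(uint32_t pc, bool taken) {
    unsigned i = index_of(pc);
    if (taken)  { if (bht[i] < 3) bht[i]++; }   /* saturating update */
    else        { if (bht[i] > 0) bht[i]--; }
    ghist = (uint8_t)((ghist << 1) | taken);    /* shift in outcome  */
}
```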

Correlating branch predictors VS Standard 2-bit scheme

  • To compare them fairly, we must compare predictors that use the same number of state bits. The number of bits in an $(m,n)$ predictor is $2^m \times n \times (\text{number of prediction entries selected by the branch address})$.
  • The Figure below compares the misprediction rates of the earlier (0,2) predictor with 4K entries and a (2,2) predictor with 1K entries.
    [figure: misprediction rates of the (0,2) predictor with 4K entries versus the (2,2) predictor with 1K entries]

Tournament Predictors: Adaptively Combining Local and Global Predictors

  • Tournament predictors use multiple predictors, usually a global predictor and a local predictor, and choose between them with a selector.
    • A global predictor uses the most recent m m m branch history to index the predictor
    • A local predictor uses the address of the branch as the index.
    • The selector acts like a 2-bit predictor, changing the preferred predictor for a branch address when two mispredicts occur in a row.
      [figures: tournament predictor selector state diagram and prediction accuracy]

Notes

  • The number of bits of the branch address used to index the selector table and the local predictor table is equal to the length of the global branch history used to index the global prediction table.
  • Note that misprediction is a bit tricky because we need to change both the selector table and either the global or local predictor.

Alpha 21264 - Tournament Predictors

  • The most advanced of these predictors was on the Alpha 21264: a tournament predictor using 4K 2-bit selection counters indexed by the local branch address, choosing between:
    • Global predictor: 4K entries indexed by the history of the last 12 branches ($m=12$, $2^{12}=4K$); each entry is a standard 2-bit predictor
    • Local predictor: consists of a two-level predictor.
      • Local history table: 1024 10-bit entries recording last 10 branches, indexed by branch address. Each 10-bit entry corresponds to the most recent 10 branch outcomes for the entry.
      • The pattern of the last 10 occurrences of that particular branch used to index table of 1K entries with 3-bit saturating counters, which provide the local prediction
  • This combination, which uses a total of 29K bits, leads to high accuracy in branch prediction while requiring fewer bits than a single level table with the same prediction accuracy.

Total size of the predictor = 8K (selector) + 8K (global predictor) + 10K (local history table) + 3K (local predictors) = 29K bits


Comparing Predictors

  • The advantage of a tournament predictor is its ability to select the right predictor for a particular branch
    • Particularly crucial for the integer benchmarks. A typical tournament predictor will select the global predictor almost 40% of the time for the SPEC integer benchmarks and less than 15% of the time for the SPEC FP benchmarks.
      [figure: misprediction rates of the predictors compared]

Advanced Techniques for Instruction Delivery and Speculation

Branch-Target Buffers (BTB)


  • Branch target calculation is costly and stalls the instruction fetch. If the instruction is a branch and we know what the next PC should be, we can have a branch penalty of zero.

Branch-target buffer or Branch-target cache

  • A branch-prediction cache that stores the predicted address of the next instruction after a branch is called a branch-target buffer (BTB) or branch-target cache
  • The PC of a branch is sent to the BTB. When a match is found the corresponding Predicted PC is returned.
    [figure: a branch-target buffer; the fetch PC is matched against the stored branch addresses]
  • Because a branch-target buffer predicts the next instruction address and will send it out before decoding the instruction, if a matching entry is found in the branch-target buffer, we must know whether the fetched instruction is predicted as a taken branch. If the branch was predicted taken, instruction fetch continues at the returned predicted PC
    [figure: the steps in handling an instruction with a branch-target buffer]

Example

  • Determine the total branch penalty for a branch-target buffer assuming the penalty cycles for individual mispredictions in Figure 3.27. Make the following assumptions about the prediction accuracy and hit rate:
    • Prediction accuracy is 90% (for instructions in the buffer).
    • Hit rate in the buffer is 90% (for branches predicted taken).
      [figure: penalty cycles for the possible BTB cases (Figure 3.27)]

Answer

[figure: the worked calculation of the total branch penalty]
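
For reference, the calculation goes roughly as follows, assuming (per the figure, an assumption here) a 2-cycle penalty both for a mispredicted hit and for a taken branch that misses in the buffer:

  • Probability(branch in buffer, but mispredicted) = 90% hit rate × 10% mispredictions = 0.09
  • Probability(branch not in buffer, but actually taken) ≈ 10%
  • Branch penalty ≈ (0.09 + 0.10) × 2 = 0.38 clock cycles per branch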

Dynamic Scheduling


Dynamic Scheduling

  • hardware rearranges the instruction execution to reduce the stalls while maintaining data flow and exception behavior
    • It handles cases in which dependences are unknown at compile time, and it allows the processor to tolerate unpredictable delays, such as cache misses, by executing other code while waiting for the miss to resolve
    • It allows code that compiled for one pipeline to run efficiently on a different pipeline
    • It simplifies the compiler

In-order instruction issue

  • If an instruction is stalled in the pipeline, no later instructions can proceed.
  • With in-order issue, if two instructions have a hazard between them, the pipeline will stall, even if there are later instructions that are independent and would not stall.

All the techniques discussed so far use in-order instruction issue


Out-of-order execution

  • In the RISC V pipeline developed earlier, both structural and data hazards were checked during instruction decode (ID): when an instruction could execute properly, it was issued from ID.
  • Key idea: Allow an instruction to begin execution as soon as its operands are available, even if a predecessor is stalled
    • We must split the ID pipe stage into two stages:
      • (1) Issue—Decode instructions, check for structural hazards. (decode and issue instructions in order – in-order issue)
      • (2) Read operandsWait until no data hazards, then read operands. (Instructions begin execution as soon as their data operands are available – out-of-order execution and out-of-order completion)
        [figure: the ID stage split into Issue and Read operands]

An instruction fetch stage precedes the issue stage and may fetch either to an instruction register or into a queue of pending instructions; instructions are then issued from the register or queue.

Note: Dynamic execution creates WAR and WAW hazards and makes exceptions harder

Scoreboard Algorithm

Named after the scoreboard of the CDC 6600, which first developed this capability.

  • Idea: when stalled, other instructions can be issued and executed if they do not depend on any active or stalled instruction

The Basic Structure of a MIPS Processor with a Scoreboard

  • Taking advantage of out-of-order execution requires multiple instructions to be in their EX stage simultaneously. This can be achieved with multiple functional units, with pipelined functional units, or with both. (here, we will assume the processor has multiple functional units)
  • On a processor for the RISC V architecture, scoreboards make sense primarily on the floating-point unit because the latency of the other functional units is very small.
    • Let’s assume that there are two multipliers, one adder, one divide unit, and a single integer unit for all memory references, branches, and integer operations.
      [figure: the basic structure of a MIPS processor with a scoreboard]

4 steps

  • The scoreboard takes full responsibility for instruction issue and execution, including hazard detection (centralized control)
  • Each instruction undergoes four steps in executing. (Since we are concentrating on the FP operations, we will not consider a step for memory access.) The four steps, which replace the ID, EX, and WB steps in the standard MIPS pipeline, are as follows:
    Issue → Read operands → Execution → Write result
    • (1) Issue—If a functional unit for the instruction is free (no structural hazard) and no other active instruction has the same destination register (avoiding WAW hazards), the scoreboard issues the instruction to the functional unit and updates its internal data structure. (The instruction waits until no structural or WAW hazard exists; no further instructions will issue until these hazards are cleared.)
    • (2) Read operandsWhen the source operands are available (no earlier issued active instruction is going to write it), the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.
    • (3) Execution—The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution. This step takes multiple cycles in the RISC V FP pipeline.
      [figure: scoreboard pipeline timing]
    • (4) Write result—Once the scoreboard is aware that the functional unit has completed execution, the scoreboard checks for WAR hazards and stalls the completing instruction, if necessary. If this WAR hazard does not exist, or when it clears, the scoreboard tells the functional unit to store its result to the destination register.
      • e.g.
        [code: fdiv.d f0,f2,f4; fadd.d f10,f0,f8; fsub.d f8,f8,f14] fadd.d has a source operand f8, which is the same register as the destination of fsub.d. But fadd.d actually depends on an earlier instruction (through f0). The scoreboard therefore stalls fsub.d in its Write Result stage until fadd.d reads its operands.
      • In general, then, a completing instruction cannot be allowed to write its results when:
        • There is an instruction that has not read its operands that precedes (i.e., in order of issue) the completing instruction, and
        • One of the operands is the same register as the result of the completing instruction.
          [figure: the checks performed before Write Result]
  • Because the operands for an instruction are read only when both operands are available in the register file, this scoreboard does not take advantage of forwarding. This is not as large a penalty as you might initially think. Unlike our earlier simple pipeline, instructions write their results into the register file as soon as they complete execution (assuming no WAR hazards), rather than waiting for a statically assigned write slot that may be several cycles away; this reduces the pipeline latency and the benefit forwarding would add. There is still one additional cycle of latency, because the Write Result and Read Operands stages cannot overlap; we would need additional buffering to eliminate this overhead.

3 Parts to the Scoreboard

  • Instruction status: indicating the pipeline stage of the instruction
  • Functional Unit Status: 9 fields
    • Busy: whether the unit is busy
    • Op: operation being performed (e.g., add, sub)
    • $F_i$: destination register (the scoreboard records operand specifier information, such as register numbers)
    • $F_j$, $F_k$: source registers
    • $Q_j$, $Q_k$: the functional units producing $F_j$, $F_k$
    • $R_j$, $R_k$: flags for $F_j$, $F_k$ indicating whether each operand has been read (set to No after the operands are read):
      • ready and not yet read: Yes
      • not ready: No, with the corresponding $Q$ field non-empty
      • already read: No, with the corresponding $Q$ field empty
  • Register result status: which FU will write the result to each register (see the C sketch below)
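
A C sketch of this bookkeeping, following the field names above (the functional-unit set and table sizes are illustrative):

```c
#include <stdbool.h>

/* Functional units per the configuration above; FU_NONE marks an
   empty Q field or a register with no pending writer. */
typedef enum { FU_NONE = 0, FU_INT, FU_MULT1, FU_MULT2, FU_ADD, FU_DIV } FU;

typedef struct {
    bool busy;          /* Busy: unit in use                        */
    int  op;            /* Op: operation being performed            */
    int  Fi;            /* destination register number              */
    int  Fj, Fk;        /* source register numbers                  */
    FU   Qj, Qk;        /* units producing Fj/Fk (FU_NONE = ready)  */
    bool Rj, Rk;        /* source ready and not yet read            */
} FUStatus;

FUStatus fu_status[6];  /* indexed by FU id; entry 0 unused         */
FU reg_result[32];      /* register result status: which unit will
                           write each register (FU_NONE = none)     */
```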

Scoreboard Example

The details of the example are not essential; understanding the flow is enough.

[figures: cycle-by-cycle walkthrough of the scoreboard example]


Costs and benefits of scoreboard

  • The amount of parallelism available among the instructions: can independent instructions be found?
  • The number of scoreboard entries: the size of the window of instructions the pipeline can examine for independent instructions
  • The number and types of FUs
  • The presence of antidependences and output dependences

Tomasulo’s Approach

  • The primary difference between scoreboard and Tomasulo’s algorithm is that Tomasulo’s algorithm handles anti-dependences and output dependences by effectively renaming the registers dynamically.
    • Background: the IBM 360/91 had only 4 floating-point registers, and this small number prevented interesting compiler scheduling of operations
    • This led Tomasulo to figure out how to get more effective registers: renaming in hardware!

  • RAW hazards are avoided by executing an instruction only when its operands are available (same as the scoreboard)
  • WAR and WAW hazards are eliminated by register renaming.
    [figure: renaming example removing the WAR and WAW hazards]

Reservation Stations (RS)

  • Reservation stations buffer the operands of instructions waiting to issue and are associated with the functional units (loads and stores are treated as functional units with RSs as well).
    • Control & buffers distributed with Function Units (FU): the information held in the reservation stations at each functional unit determines when an instruction can begin execution at that unit
  • Registers in instructions replaced by values or pointers to RS; called register renaming;
    • More reservation stations than registers, so can do optimizations compilers can’t
  • Results pass to the FUs from the RSs, not through the registers, over a Common Data Bus (CDB) that broadcasts results to all FUs
    • In pipelines that issue multiple instructions per clock and also have multiple execution units, more than one result bus will be needed (multiple instructions cannot write a single CDB simultaneously)

Three Stages of Tomasulo Algorithm

  • (1) Issue—get instruction from FP Op Queue.
    • If reservation station free (no structural hazard), control issues instr & sends operands (renames registers).
  • (2) Execute—operate on operands (EX) (operands arrive automatically by CDB broadcast once they are ready, so no separate operand-read stage is needed)
    • When both operands ready then execute; if not ready, watch Common Data Bus for result
    • Loads and stores require a two-step execution process.
      • (i) compute the effective address, which is then placed in the load or store buffer
      • (ii) Loads in the load buffer execute as soon as the memory unit is available. Stores in the store buffer wait for the value to be stored before being sent to the memory unit.
      • Loads and stores are maintained in program order through the effective address calculation, which will help to prevent hazards through memory.
  • (3) Write result (WB)
    • Write on Common Data Bus to all awaiting units (broadcast); mark reservation station available
    • Normal data bus: data + destination (“go to” bus); Common data bus: data + source
      • 64 bits of data + 4 bits of Functional Unit source address
      • Write if matches expected Functional Unit (produces result)

introduces one cycle of latency between source and result because the matching of a result and its use cannot be done until the end of the Write Result stage

[figure: basic structure of a floating-point unit using Tomasulo's algorithm]


Reservation Station Components

  • Busy: Indicates reservation station or FU is busy
  • Op: Operation to perform in the unit (e.g., + or –)
  • Vj, Vk: values of the source operands (the actual values, obtained from the CDB broadcast)
  • Qj, Qk: the reservation stations producing the source values (i.e., the values still to be written)
    • Note: Qj,Qk=0 => ready
    • Store buffers only have Qi for RS producing result
  • A: Used to hold information for the memory address calculation for a load or store. Initially, the immediate field of the instruction is stored here; after the address calculation, the effective address is stored here.

Register result status

  • Indicates which FU will write each register, if one exists. Blank when no pending instruction will write that register. (same as the scoreboard; see the C sketch below)
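
A C sketch of one reservation-station entry plus the register status table, using the field names above (entry counts are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool    busy;
    int     op;        /* operation to perform                      */
    double  Vj, Vk;    /* operand values, filled by CDB broadcast   */
    int     Qj, Qk;    /* RS numbers producing them; 0 means ready  */
    int32_t A;         /* immediate, then effective address (ld/st) */
} RS;

RS  rs[11];            /* e.g., adder, multiplier, load/store RSs   */
int regstat[32];       /* RS number that will write each register;
                          0 = no pending write, register file holds
                          the value                                 */
```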

Tomasulo Example

[figures: cycle-by-cycle walkthrough of the Tomasulo example]


Tomasulo’s scheme offers 2 major advantages

  • (1) Distribution of the hazard detection logic
    • distributed reservation stations and the CDB
    • If multiple instructions are waiting on a single result, and each already has its other operand, they can all be released simultaneously by the broadcast on the CDB (they pick up the needed operand at the same time)
    • If a centralized register file were used, the units would have to read their results from the registers when register buses are available
  • (2) Elimination of stalls for WAW and WAR hazards

Tomasulo Drawbacks

  • Complexity
  • Many associative lookups must run at high speed (a large amount of fast associative buffering is needed to hold in-flight results)
  • Performance limited by Common Data Bus
    • Each CDB must go to multiple functional units → high capacitance, high wiring density
    • The number of functional units that can complete per cycle is limited to one! (the single bus limits how many instructions can complete simultaneously)
  • Imprecise interrupts! (out-of-order completion makes the interrupted position imprecise)
    • We will address this later

Hardware-Based Speculation


  • Speculation allows instructions to execute before the processor has determined whether they should execute, overcoming control dependence by speculating on the outcome of branches and executing the program as if our guesses are correct.
    • Speculation ⇒ fetch, issue, and execute instructions as if branch predictions were always correct
    • branch prediction with dynamic scheduling ⇒ only fetches and issues instructions
  • Essentially a data-flow execution model: operations execute as soon as their operands are available

three key ideas of HW-based speculation

  • (1) Dynamic branch prediction to choose which instructions to execute
  • (2) Speculation to allow the execution of instructions before control dependences are resolved, plus the ability to undo the effects of an incorrectly speculated sequence
    • unlike loop unrolling, registers need not be renamed by software, because speculated effects can be undone
  • (3) Dynamic scheduling to deal with the scheduling of different combinations of basic blocks
    • In comparison, dynamic scheduling without speculation only partially overlaps basic blocks because it requires that a branch be resolved before actually executing any instructions in the successor basic block.

A Loop-Based Example with Branch

[code: a loop whose body contains L.D, MUL.D, and S.D, issued for two iterations]

  • Assume that we have issued all the instructions in the loop twice (dynamic branch prediction predicts a second iteration, so speculation lets the second iteration's code execute early). Let's also assume that the L.D and MUL.D from the first iteration have committed and all other instructions have completed execution.
  • Note that, unlike with loop unrolling, both iterations use the same register names; the ROB entries distinguish them. If the second iteration turns out not to be needed, flushing every ROB entry belonging to it undoes the speculation.
    [figure: ROB contents for the two speculated iterations]

Adding Speculation to Tomasulo

  • Key idea: allow instructions to execute out of order, but force them to commit in order, so that precise interrupts can be implemented

ReOrder Buffer (ROB)

  • We must separate the completion of execution from allowing an instruction to finish architecturally, or "commit"; this additional step is called instruction commit
    • When an instruction is no longer speculative, allow it to update the register file or memory
  • Requires an additional set of buffers, the reorder buffer (ROB), to hold the results of instructions that have finished execution but have not committed (in FIFO order, exactly as issued).
    • Instructions commit ⇒ values at the head of the ROB are placed in registers
    • As a result, it is easy to undo speculated instructions on mispredicted branches or on exceptions
  • This buffer also passes results among instructions that may be speculated. When instructions complete, results are placed into the ROB
    • It supplies operands to other instructions between execution complete and commit ⇒ it effectively extends the register set, just as reservation stations (RS) provide operands in Tomasulo's algorithm
    • Results are tagged with the ROB entry number instead of the reservation-station number
  • The RS and the ROB are both core components of modern CPUs, with opposite roles: the RS turns an in-order instruction stream into out-of-order execution, while the ROB turns out-of-order completion back into in-order commit

Reorder Buffer Entry

  • Each entry in the ROB contains four fields:
    • (1) Instruction type: a branch (has no destination result), a store (has a memory address destination), or a register operation (ALU operation or load, which has register destinations)
    • (2) Destination: Register number (for loads and ALU operations) or memory address (for stores) where the instruction result should be written
    • (3) Value of instruction result until the instruction commits
    • (4) Ready: Indicates that instruction has completed execution, and the value is ready
      [figure: the FP unit structure extended with a reorder buffer]
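
A C sketch of such an entry (type names and the buffer size are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

/* One reorder-buffer entry, with the four fields listed above. */
typedef enum { ROB_BRANCH, ROB_STORE, ROB_REGOP } InstrType;

typedef struct {
    InstrType type;     /* branch, store, or register operation  */
    int64_t   dest;     /* register number or memory address     */
    int64_t   value;    /* result, held until commit             */
    bool      ready;    /* execution finished, value is valid    */
} ROBEntry;

/* The ROB is a FIFO: entries are allocated at the tail on issue
   and retired from the head at commit. */
ROBEntry rob[64];
int rob_head, rob_tail;
```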

4 Steps of Speculative Tomasulo Algorithm

  • (1) Issue—get instruction from FP Op Queue
    • If reservation station & reorder buffer slot free, issue instr & send the operands to the reservation station if they are available in either the registers or the ROB & reorder buffer number for destination
  • (2) Execution—operate on operands (EX)
    • When both operands are ready, execute; if not ready, watch the CDB for the result (this step resolves RAW hazards)
  • (3) Write result—finish execution (WB)
    • Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.
  • (4) Commit—update the register with the reorder-buffer result (this step is sometimes called graduation)
    • Normal case: when an instruction reaches the head of the reorder buffer and its result is present, update the register with the result (or perform the store to memory) and remove the instruction from the reorder buffer. Once an instruction commits, its ROB entry is reclaimed and the register or memory destination is updated, eliminating the need for the entry.
    • Mispredicted branch reaches the head of the ROB: flush the reorder buffer and restart execution at the correct successor of the branch

Speculation Example

[figures: cycle-by-cycle walkthrough of the speculation example]

  • Summary: In-order Issue/Commit, Out-of-Order Execution/Writeback
    [figure: in-order issue/commit, out-of-order execution/writeback]

Memory Disambiguation

  • Question: given an instruction sequence, are a store and a later load related?
    • i.e., does the following code have a dependence?
      [code: a store to 0(R3) followed by a load from 32(R2)] We cannot tell whether R3+0 and R2+32 name the same address.

Avoiding Memory Hazards

  • WAW and WAR hazards through memory are eliminated with speculation, because the actual updating of memory occurs in order, when a store is at the head of the ROB; hence, no earlier loads or stores can still be pending (everything that really touches memory commits in order)
  • RAW hazards through memory (a load after a store) are maintained by two restrictions (no speculation: do not perform the load until we are sure that address $0(R3) \neq 32(R2)$); see the C sketch after this list:
    • (1) not allowing a load to initiate the second step of its execution (the first step computes the effective address; the second performs the memory access) if any active ROB entry occupied by a store has a Destination field that matches the value of the $A$ field (address) of the load
    • (2) maintaining the program order for the computation of the effective address of a load with respect to all earlier stores
    • these restrictions ensure that any load that accesses a memory location written by an earlier store cannot perform the memory access until the store has written the data
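
Restriction (1) can be pictured as a predicate over the earlier, still-active memory instructions; a hypothetical C sketch (restriction (2) guarantees earlier stores compute their addresses first, so the `addr_known` test is the conservative case):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool    is_store;
    bool    addr_known;   /* effective address computed yet?       */
    int64_t addr;         /* effective address, once known         */
} MemEntry;

/* A load must wait if any earlier active store has an unknown or
   matching address; otherwise it may access memory. */
bool load_may_proceed(const MemEntry *earlier, int n, int64_t load_addr) {
    for (int i = 0; i < n; i++) {
        if (!earlier[i].is_store) continue;
        if (!earlier[i].addr_known || earlier[i].addr == load_addr)
            return false;   /* possible store-to-load dependence */
    }
    return true;
}
```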

Load Speculation

  • We can speculate on whether a load and an earlier store conflict (called "dependence speculation") and repair through the ROB if the guess is wrong
    • A load that hits in the cache must still wait until the addresses of all earlier stores are known before it can write its value back to a register and forward it to later operations. Loads delayed by earlier stores are very common (30%-40%), and a delayed load needs at least 5 cycles after issue to write back, so a 30% delay rate stretches the average load latency to $3\times0.7 + 5\times0.3 = 3.6$ cycles, worsening the already long load latency
    • Countermeasure: let a ready load write back immediately, without waiting for unresolved earlier stores, and cancel the load and everything after it when a memory dependence is actually detected (needed in well under 1% of cases)
  • For example, for array-copy code, with load speculation the hardware behaves as if it had automatically unrolled the loop into the following form:
    [figure: the copy loop as effectively unrolled by load speculation]

Exceptions and Interrupts

  • Technique for both precise interrupts/exceptions and speculation: in-order completion and in-order commit
    • If we speculate and are wrong, need to back up and restart execution to point at which we predicted incorrectly. This is exactly same as need to do with precise exceptions
  • Exceptions are handled by not recognizing the exception until the instruction that caused it is ready to commit in the ROB
    • If a speculated instruction raises an exception, the exception is recorded in the ROB
    • This is why all new processors have reorder buffers

Multiple-issue


  • Goal: CPI < 1, i.e., instructions per clock cycle (IPC) > 1
    • Vector Processing (SIMD): Explicit coding of independent loops as operations on large vectors of numbers
    • Superscalar: a varying number of instructions per cycle (1 to 8), scheduled by the compiler or by HW
    • (Very) Long Instruction Word ((V)LIW): a fixed number of instructions (4 to 16) scheduled by the compiler, with operations packed into wide templates

The latter two are the multiple-issue techniques discussed below.

The major challenge for all multiple-issue processors is to try to exploit large amounts of ILP. When the parallelism comes from unrolling simple loops in FP programs, the original loop probably could have been run efficiently on a vector processor. The potential advantages of a multiple-issue processor versus a vector processor are the former’s ability to extract some parallelism from less structured code and to easily cache all forms of data. For these reasons, multiple-issue approaches have become the primary method for taking advantage of instruction-level parallelism, and vectors have become primarily an extension to these processors.


Issuing Multiple Instructions/Cycle

  • Multiple-issue processors come in 3 flavors:
    • (1) statically-scheduled superscalar processors
    • (2) dynamically-scheduled superscalar processors
    • (3) VLIW (very long instruction word) processors
  • 2 types of superscalar processors issue varying numbers of instructions per clock
    • use in-order execution if they are statically scheduled, or
    • out-of-order execution if they are dynamically scheduled
      [figure: the primary approaches to multiple issue compared]

Because of the diminishing advantages of a statically scheduled superscalar as the issue width grows, statically scheduled superscalars are used primarily for narrow issue widths, normally just two instructions. Beyond that width, most designers choose to implement either a VLIW or a dynamically scheduled superscalar.

Superscalar

  • Superscalar MIPS: 2 instructions: 1 FP & 1 anything (load, store, branch or ALU)
    • Fetch 64-bits/clock cycle; Int on left, FP on right
    • Can only issue the 2nd instruction if the 1st instruction issues (the FP instruction cannot issue on its own)
    • More ports are needed on the FP register file to do an FP load and an FP op as a pair (the pair both loads/stores and performs FP arithmetic, so the FP registers need extra access paths)
      [figure: instruction pairing in the superscalar MIPS]
    • The 1-cycle load delay expands to 3 instructions in the superscalar (the loaded value cannot be used in the same cycle or the next cycle, i.e., by the following 3 instructions)

Recall: Unrolled Loop that Minimizes Stalls for Scalar

[code: the scheduled unrolled loop from earlier]


Loop Unrolling in Superscalar

[figures: the unrolled loop scheduled for the two-issue superscalar]


Multiple Issue Issues

  • issue packet: the group of instructions from the fetch unit that could potentially issue in one clock
    • 0 to $N$ instructions issue per clock cycle for an $N$-issue processor
    • If an instruction would cause a structural hazard or a data hazard, either with an earlier instruction in execution or with an earlier instruction in the same issue packet, the instruction does not issue
  • Performing the issue checks in one cycle could limit the clock cycle time: $O(n(n-1))$ comparisons. Because this is too expensive, the issue stage is usually split and pipelined (see the C sketch after this list):
    • the 1st stage decides how many instructions from within this packet can issue (checking structural hazards)
    • the 2nd stage examines hazards between the selected instructions and those already issued (checking data hazards)
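
A C sketch of the intra-packet check, showing where the $O(n(n-1))$ pairwise comparisons come from (simplified: only RAW and WAW within the packet are checked; structural checks and checks against already-issued instructions are omitted):

```c
/* Simplified instruction encoding: one destination, two sources. */
typedef struct { int rd, rs1, rs2; } Instr;

/* Returns how many leading instructions of the packet can issue
   together: instruction j is cut off at the first dependence on
   an earlier instruction i in the same packet. */
int issuable_prefix(const Instr *pkt, int n) {
    for (int j = 1; j < n; j++)
        for (int i = 0; i < j; i++)
            if (pkt[j].rs1 == pkt[i].rd ||   /* RAW within packet */
                pkt[j].rs2 == pkt[i].rd ||
                pkt[j].rd  == pkt[i].rd)     /* WAW within packet */
                return j;   /* issue only instructions 0..j-1     */
    return n;
}
```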

Multiple Issue Challenges

  • While the integer/FP split is simple for the HW, we get a CPI of 0.5 only for programs with:
    • Exactly 50% FP operations AND No hazards
  • If more instructions issue at same time, greater difficulty of decode and issue:
    • Even a 2-scalar machine ⇒ examine 2 opcodes and 6 register specifiers, and decide whether 1 or 2 instructions can issue ($N$-issue needs roughly $O(N^2-N)$ comparisons)
    • Register file: the number of reads and writes per cycle must double
  • Rename logic: must be able to rename the same register multiple times in one cycle! For instance, consider 4-way issue:
    [figure: renaming four dependent instructions at once] Imagine doing this transformation in a single cycle!
  • Result buses: Need to complete multiple instructions/cycle
    • So, need multiple buses with associated matching logic at every reservation station.
    • Or, need multiple forwarding paths

Dynamic Scheduling in Superscalar: The easy way

  • How to issue two instructions and keep in-order instruction issue for Tomasulo?
    • Assume 1 integer + 1 floating point
    • 1 Tomasulo control unit for the integer operations, 1 for the floating-point operations
  • Run the issue stage at 2× the clock rate, so that issue remains in order

VLIW

  • VLIW processors issue a fixed number of instructions formatted either as one large instruction or as a fixed instruction packet with the parallelism among instructions explicitly indicated by the instruction. Each “instruction” has explicit coding for multiple operations
    • By definition, all the operations the compiler puts in the long instruction word are independent ⇒ they can execute in parallel
      • VLIW processors are statically scheduled across several branches by the compiler
    • Multiple independent functional units are used, and the operations of several instructions are packed into a fixed-format instruction packet, forming one very long instruction. Because the format is fixed, processing is simple, and a VLIW processor needs less hardware than a superscalar

The advantage of a VLIW increases as the maximum issue rate grows. Indeed, for simple two-issue processors, the overhead of a superscalar is probably minimal.


Tradeoff

  • Trade instruction space for simple decoding
    • The long instruction word has room for many operations

Loop Unrolling in VLIW

  • To keep the functional units busy, there must be enough parallelism in a code sequence to fill the available operation slots. This parallelism is uncovered by unrolling loops and scheduling the code within the single larger loop body.

Example

  • Suppose we have a VLIW that could issue two memory references, two FP operations, and one integer operation or branch in every clock cycle. Show an unrolled version of the loop x[i] = x[i] + s for such a processor. Unroll as many times as necessary to eliminate any stalls. (i.e., completely empty issue cycles)

Answer

[figure: the unrolled loop scheduled for the VLIW]

  • Unrolled 7 times to avoid delays
    • 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X).
    • Average: 2.5 ops per clock, 50% efficiency
  • Note: Need more registers in VLIW (15 vs. 6 in SuperScalar)

Technical problems

  • increase in code size
    • (1) generating enough operations in a straight-line code fragment requires ambitiously unrolling loops (as in earlier examples), thereby increasing code size.
    • (2) whenever instructions are not full, the unused functional units translate to wasted bits in the instruction encoding.

Logistical problem

  • Binary code compatibility: In a strict VLIW approach, the code sequence makes use of both the instruction set definition and the detailed pipeline structure, including both functional units and their latencies. Thus different numbers of functional units and unit latencies require different versions of the code. This requirement makes migrating between successive implementations, or between implementations with different issue widths, more difficult than it is for a superscalar design.

Superscalar v. VLIW

[figure: superscalar vs. VLIW comparison]

software pipelining & trace scheduling

  • There are two other important techniques that have been developed specifically for VLIWs (other than loop unrolling):
    • software pipelining and trace scheduling

software pipelining

Loop-Carried Dependence

  • A loop-carried dependence is a data dependence in which an instruction in one iteration of a loop depends on an instruction in another iteration; it is an important factor limiting the exploitation of loop-level parallelism
    [code: a loop with a loop-carried dependence]
  • Observation: if iterations from loops are independent, then can get more ILP by taking instructions from different iterations

Software Pipelining: Symbolic Loop Unrolling

  • Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop
    • Core idea: build each new iteration by picking independent instructions from different iterations of the original loop (loop control instructions aside). A software-pipelined loop interleaves instructions from different iterations without unrolling the loop
    • This technique is the software counterpart to what Tomasulo's algorithm does in hardware
      [figure: software pipelining viewed as symbolic loop unrolling]
  • Maximize the result-use distance (the distance between two dependent instructions); less code space than unrolling; the pipe is filled and drained only once per loop, versus once per unrolled iteration with loop unrolling
    [figure: overhead comparison of software pipelining and loop unrolling]

Example

  • Software pipelining symbolically unrolls the loop and then selects instructions from each iteration (a C-level sketch follows):
    [figure: selecting the S.D, ADD.D, and L.D from three consecutive iterations]
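
A C-level sketch of the resulting schedule for x[i] = x[i] + s (illustrative; a real compiler does this on the assembly, with a prologue to fill and an epilogue to drain the pipeline):

```c
/* Kernel of a software-pipelined loop: each trip stores the result
   of iteration i-2, holds the add for i-1 in flight, and issues the
   load+add for iteration i. */
void add_scalar_swp(double *x, long n, double s) {
    if (n < 3) {                       /* too short to pipeline */
        for (long i = 0; i < n; i++) x[i] += s;
        return;
    }
    double t0 = x[0] + s;              /* prologue: fill the pipe */
    double t1 = x[1] + s;
    for (long i = 2; i < n; i++) {     /* kernel: one store, one
                                          add, one load per trip */
        x[i - 2] = t0;                 /* store (iteration i-2)  */
        t0 = t1;
        t1 = x[i] + s;                 /* load + add (iteration i) */
    }
    x[n - 2] = t0;                     /* epilogue: drain the pipe */
    x[n - 1] = t1;
}
```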

Software Pipelining with Loop Unrolling in VLIW

[code: software-pipelined loop combined with unrolling for the VLIW]

  • Software pipelined across 9 iterations of the original loop. In each iteration of the above loop, we:
    • Store to $m$, $m-8$, $m-16$ (iterations $I-3$, $I-2$, $I-1$)
    • Compute for $m-24$, $m-32$, $m-40$ (iterations $I$, $I+1$, $I+2$)
    • Load from $m-48$, $m-56$, $m-64$ (iterations $I+3$, $I+4$, $I+5$)
  • 9 results in 9 cycles, or 1 clock per iteration. Average: 3.3 ops per clock, 66% efficiency
  • Note: software pipelining needs fewer registers (only 7 registers here, versus 15 before)

Making a compiler perform this scheduling is daunting even to think about.

trace scheduling

  • trace scheduling: optimize the traces that execute most frequently, reducing their execution cost
  • Two steps:
    • Trace Selection (pick the frequently executed paths): find a likely sequence of basic blocks (a trace) forming a long sequence of straight-line code, using static or profile-based branch prediction
    • Trace Compaction (schedule and optimize to shorten execution time): squeeze the trace into a few VLIW instructions; bookkeeping code is needed in case the prediction is wrong
      [figure: trace selection and compaction]

  • This is a form of compiler-generated speculation
    • The compiler must generate "fixup" (compensation) code to handle cases in which the trace is not the path taken
    • Needs extra registers: a bad guess is undone by discarding the speculated values