Out of Order (OoO) and Speculative Execution

Speculative Execution

What Is Speculative Execution? - ExtremeTech
https://www.extremetech.com/computing/261792-what-is-speculative-execution

With an AMD-centric potential security flaw in the news, it’s a good time to revisit the question of what speculative execution is and how it works. This topic received a great deal of discussion a few years ago when Spectre and Meltdown were frequently in the news and new side-channel attacks were popping up every few months.

Speculative execution is a technique used to increase the performance of all modern microprocessors to one degree or another, including chips built or designed by AMD, ARM, IBM, and Intel. The modern CPU cores that don’t use speculative execution are all intended for ultra-low power environments or minimal processing tasks. Various security flaws like Spectre, Meltdown, Foreshadow, and MDS all targeted speculative execution a few years ago, typically on Intel CPUs.

What Is Speculative Execution?

Speculative execution is one of three components of out-of-order execution, also known as dynamic execution ==> not exactly true; see part 3. In short, this statement holds for most PC, server, and other high-performance chips. Along with multiple branch prediction (used to predict the instructions most likely to be needed in the near future) and dataflow analysis (used to align instructions for optimal execution, as opposed to executing them in the order they came in), speculative execution delivered a dramatic performance improvement over previous Intel processors when first introduced in the mid-1990s. Because these techniques worked so well, they were quickly adopted by AMD, which used out-of-order processing beginning with the K5.

ARM’s focus on low-power mobile processors initially kept it out of the OoOE playing field, but the company adopted out-of-order execution when it built the Cortex A9 and has continued to expand its use of the technique with later, more powerful Cortex-branded CPUs.

Here’s how it works. Modern CPUs are all pipelined, which means they’re capable of executing multiple instructions in parallel, as shown in the diagram below.

Image by Wikipedia. This is a general diagram of a pipelined CPU, showing how instructions move through the processor from clock cycle to clock cycle.

Imagine that the green block represents an if-then-else branch. The branch predictor calculates which branch is more likely to be taken, fetches the next set of instructions associated with that branch, and begins speculatively executing them before it knows which of the two code branches it’ll be using. In the diagram above, these speculative instructions are represented as the purple box. If the branch predictor guessed correctly, then the next set of instructions the CPU needed are lined up and ready to go, with no pipeline stall or execution delay.

Without branch prediction and speculative execution, the CPU doesn’t know which branch it will take until the first instruction in the pipeline (the green box) finishes executing and moves to Stage 4. Instead of moving straight from one set of instructions to the next, the CPU has to wait for the appropriate instructions to arrive. This hurts system performance since it’s time the CPU could be performing useful work.

The reason it’s “speculative” execution is that the CPU might be wrong. If it is, the system loads the appropriate data and executes those instructions instead. But branch predictors aren’t wrong very often; accuracy rates are typically above 95 percent.
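To see why accuracy stays high, here is a minimal sketch of the classic 2-bit saturating-counter predictor, a textbook baseline (real predictors are far more sophisticated, and the loop-branch pattern below is invented for illustration):

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken."""
    def __init__(self):
        self.state = 2                      # start in "weakly taken"

    def predict(self):
        return self.state >= 2              # True means "predict taken"

    def update(self, taken):
        # Saturating update: a single surprise never flips a strong prediction.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

# A typical loop branch: taken 9 times, then not taken once at loop exit.
predictor = TwoBitPredictor()
outcomes = ([True] * 9 + [False]) * 10
correct = 0
for taken in outcomes:
    if predictor.predict() == taken:
        correct += 1
    predictor.update(taken)
print(f"accuracy: {correct}/{len(outcomes)}")   # 90/100 for this pattern
```

Only the loop-exit branch mispredicts each time around, which is why a simple counter already clears 90 percent on loop-heavy code.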

Why Use Speculative Execution?

Decades ago, before out-of-order execution was invented, CPUs were what we today call “in order” designs. Instructions executed in the order they were received, with no attempt to reorder them or execute them more efficiently. One of the major problems with in-order execution is that a pipeline stall stops the entire CPU until the issue is resolved.

The other problem that drove the development of speculative execution was the gap between CPU and main memory speeds. The graph below shows the gap between CPU and memory clocks. As the gap grew, the amount of time the CPU spent waiting on main memory to deliver information grew as well. Features like L1, L2, and L3 caches and speculative execution were designed to keep the CPU busy and minimize the time it spent idling.

If memory could match the performance of the CPU there would be no need for caches.

It worked. The combination of large off-die caches and out-of-order execution gave Intel’s Pentium Pro and Pentium II opportunities to stretch their legs in ways previous chips couldn’t match. This graph from a 1997 Anandtech article shows the advantage clearly.

...

Intel has been vulnerable to more of the side-channel attacks that came to market over the past three years than AMD or ARM because it opted to speculate more aggressively and wound up exposing certain types of data in the process. Several rounds of patches have reduced those vulnerabilities in previous chips and newer CPUs are designed with security fixes for some of these problems in hardware. It must also be noted that the risk of these kinds of side-channel attacks remains theoretical. In the years since they surfaced, no attack using these methods has been reported.

...

The State of Side-Channel Vulnerabilities in 2021

From 2018 to 2020, we saw a number of side-channel vulnerabilities discussed, including Spectre, Meltdown, Foreshadow, RIDL, MDS, ZombieLoad, and others. It became a bit trendy for security researchers to issue a serious report, a market-friendly name, and occasional hair-raising PR blasts that raised the specter (no pun intended) of devastating security issues that, to date, have not emerged.

Side-channel research continues — a new potential vulnerability was found in Intel CPUs in March — but part of the reason side-channel attacks work is because physics allows us to snoop on information using channels not intended to convey it. (Side-channel attacks are attacks that focus on weaknesses of implementation to leak data, rather than focusing on a specific algorithm to crack it).
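As a toy illustration of the "leaky implementation" idea, the sketch below uses an early-exit comparison whose running time (modeled here as a loop-iteration count rather than wall-clock time) depends on secret data. The PIN, the padding character, and the attack loop are all invented for illustration:

```python
# Toy side channel: an early-exit comparison leaks, via its running time
# (modeled as a loop-iteration count), how long the matching prefix is.
def leaky_compare(secret, guess):
    steps = 0
    for s, g in zip(secret, guess):
        steps += 1
        if s != g:
            return False, steps             # early exit leaks the mismatch position
    return secret == guess, steps

secret = "7413"                             # hypothetical 4-digit PIN
recovered = ""
for _ in range(4):
    # Pick the digit whose guess "runs longest"; a full match wins outright.
    recovered += max("0123456789",
                     key=lambda d: leaky_compare(secret, (recovered + d).ljust(4, "?")))
print(recovered)                            # prints "7413"
```

The attacker recovers the PIN digit by digit in 40 comparisons instead of up to 10,000 brute-force attempts; no flaw in the comparison's logic is exploited, only its timing behavior.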

We learn things about outer space on a regular basis by observing it in spectrums of energy that humans cannot naturally perceive. We watch for neutrinos using detectors submerged deep in places like Lake Baikal, precisely because the characteristics of these locations help us discern the faint signal we’re looking for from the noise of the universe going about its business. A lot of what we know about geology, astronomy, seismology, and any field where direct observation of the data is either impossible or impractical conceptually relates to the idea of “leaky” side channels. Humans are very good at teasing out data by measuring indirectly. There are ongoing efforts to design chips that make side-channel exploits more difficult, but it’s going to be very difficult to lock them out entirely.

This is not meant to imply that these security problems are not serious or that CPU firms should throw up their hands and refuse to fix them because the universe is inconvenient, but it’s a giant game of whack-a-mole for now, and it may not be possible to secure a chip against all such attacks. As new security methods are invented, new snooping methods that rely on other side channels may appear as well. Some fixes, like disabling Hyper-Threading, can improve security but come with substantial performance hits in certain applications.

Luckily, for now, all of this back-and-forth is theoretical. Intel has been the company affected the most by these disclosures, but none of the side-channel disclosures that have dropped since Spectre and Meltdown have been used in a public attack. AMD, similarly, is aware of no group or organization exploiting the Zen 3 issue covered in its recent disclosure. Issues like ransomware have become far worse in the past two years, with no need for help from side-channel vulnerabilities.

In the long run, we expect AMD, Intel, and other vendors to continue patching these issues as they arise, with a combination of hardware, software, and firmware updates. Conceptually, side-channel attacks like these are extremely difficult, if not impossible, to prevent. Specific issues can be mitigated or worked around, but the nature of speculative execution means that a certain amount of data is going to leak under specific circumstances. It may not be possible to prevent it without giving up far more performance than most users would ever want to trade.

OoO

from cse.tamu Lecture 5: Out-of-order Execution

Review of the Model

The following diagram roughly represents the general model you should have in your mind of the main components of the computer:

Instruction Scheduling

Before we talk about out-of-order execution, let's remember how execution proceeds in our standard pipeline. Instructions are fetched, decoded, executed, etc. The decode stage is where we find out about structural, data, and control hazards. The hardware does what it can to minimize the impact of these hazards, but it is really up to the compiler to schedule dependent instructions far away from each other to avoid hazards. For instance, the compiler should "hoist" loads as early as possible, so that their latency is hidden before dependent instructions need the result and RAW hazards don't cause stalls. Also, the compiler should avoid scheduling instructions close together that will compete for some limited resource such as a multiplier unit, to avoid a structural hazard. Unfortunately, the compiler is limited by several factors:

  • It can't affect the microarchitecture; it must deal only with the ISA. It can use what it knows about the microarchitecture to schedule intelligently, but can only indirectly affect what happens at run-time.
  • When the microarchitecture changes, the scheduler must be rewritten to deal with the new details. Old programs should theoretically be recompiled, but in practice an OS and other software are often distributed with binaries optimized for the older version of the architecture to avoid incompatibilities.
  • The compiler has to deal with non-uniform latencies and aliasing ambiguities brought on by having a memory system. The compiler can only guess at what will be happening at run-time, but can't actually observe run-time behavior at a granularity that allows for better scheduling.

So, it might be better to have the job of instruction scheduling done in the microarchitecture, or shared between the microarchitecture and the compiler.
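To make the scheduling point concrete, here is a toy in-order timing model (the latencies and instruction encoding are invented) showing that hoisting an independent instruction between a load and its consumer hides part of the load latency:

```python
# Toy in-order model: one instruction issues per cycle, but an instruction
# stalls until every register it reads is ready.
# Invented latencies: a load takes 3 cycles, an add takes 1.
def cycles(schedule):
    latency = {"ld": 3, "add": 1}
    ready = {}                              # register -> cycle its value is ready
    t = 0
    for op, dst, srcs in schedule:
        t = max([t] + [ready.get(r, 0) for r in srcs])   # stall on RAW hazards
        ready[dst] = t + latency[op]
        t += 1                              # next issue slot
    return max(ready.values())              # cycle when the last result is ready

# Naive order: the dependent add sits right behind the load and stalls.
naive   = [("ld", "r1", []), ("add", "r2", ["r1", "r3"]), ("add", "r4", ["r3", "r5"])]
# Compiler-scheduled order: the independent add fills one stall cycle.
hoisted = [("ld", "r1", []), ("add", "r4", ["r3", "r5"]), ("add", "r2", ["r1", "r3"])]
print(cycles(naive), cycles(hoisted))       # prints: 5 4
```

The static schedule saves one cycle here, but the compiler had to know the load latency at compile time; a hardware scheduler gets the same effect from observed run-time behavior.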

Out-of-order Execution

The pipelines we have studied so far have been statically scheduled and inorder pipelines. That is, instructions are executed in program order. If a hazard causes stall cycles, then all instructions up to the offending instruction are stalled until the hazard is gone. As we have seen, forwarding, branch prediction, and other techniques can reduce the number of stall cycles we need, but sometimes a stall is unavoidable. For instance, consider the following code:

	ld	r1, 0(r2)	// load r1 from memory at r2
	add	r2, r1, r3	// r2 := r1 + r3
	add	r4, r3, r5	// r4 := r3 + r5

Suppose that r3 and r5 are ready in the register file. Suppose also that the load instruction misses in the L1 data cache, so the load unit takes about 20 cycles to bring the data from the L2 cache. During the time the load unit is working, the pipeline is stalled. Notice, however, that the second add instruction doesn't depend on the value of r1; it could issue and execute, but the details of our inorder pipeline prevent that because of the stall. The functional unit that should be adding r3 and r5 together is instead sitting idle, waiting for the load to complete so the add instruction can be decoded and issued.

Out-of-order execution, or dynamic scheduling, is a technique used to get back some of that wasted execution bandwidth. With out-of-order execution (OoO for short), the processor would issue each of the instructions in program order, and then enter a new pipeline stage called "read operands" during which instructions whose operands are available would move to the execution stage, regardless of their order in the program. The term issue could be redefined at this point to mean "issue and read operands."
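The difference can be sketched under invented assumptions (single issue per cycle, load latency of 20 cycles as in the text): with dynamic scheduling, the independent add's result is available almost immediately instead of after the load completes.

```python
# Toy single-issue scheduler. With in_order=True, only the oldest waiting
# instruction may issue; with in_order=False, any instruction whose source
# registers are ready may issue. Invented latencies: load 20 cycles, add 1.
def run(program, in_order):
    latency = {"ld": 20, "add": 1}
    ready_at = {}                           # register -> cycle its value is ready
    waiting = list(program)                 # instructions in program order
    t = 0
    while waiting:
        window = waiting[:1] if in_order else waiting
        for inst in window:
            op, dst, srcs = inst
            if all(ready_at.get(r, 0) <= t for r in srcs):
                ready_at[dst] = t + latency[op]
                waiting.remove(inst)
                break                       # at most one issue per cycle
        t += 1
    return ready_at

prog = [("ld", "r1", []),                   # r1 := memory (misses, 20 cycles)
        ("add", "r2", ["r1", "r3"]),        # depends on the load
        ("add", "r4", ["r3", "r5"])]        # independent of the load
```

With `in_order=True`, r4 is not ready until cycle 22, because its add is stuck behind the stalled pipeline; with `in_order=False`, r4 is ready at cycle 2, while the load is still in flight.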

Implementation of Out-of-order Execution

To implement an OoO processor, the pipeline has to be enhanced to keep track of the extra complexity. For instance, now that we can reorder instructions, we can have WAR and WAW hazards that we didn't have to worry about with an inorder pipeline. This diagram illustrates the basic idea:

The following tricks are used:

  • Register renaming. Registers that are the destinations of instruction results are renamed, i.e., more than one version of that register name may be used in the hardware. (This can be done in the compiler, but only with architecturally visible registers; it is a much more powerful technique when implemented in hardware.) This can be done in a new "rename registers" pipeline stage that allocates physical (i.e. real) registers to instances of logical (i.e. ISA) registers using a Register Alias Table, that also keeps track of a "free list" of available physical registers. Or, the renamed registers can be provided implicitly by using reservation stations (or both). We'll talk about reservation stations for now.
  • Instruction window. This buffer holds instructions that have been fetched and decoded and are waiting to be executed. Note: Often, the instruction window doesn't actually exist as a single buffer, but is distributed among reservation stations (see below).
  • Enhanced issue logic. The issue logic must be enhanced to issue instructions out of order depending on their readiness to execute.
  • Reservation stations. Each functional unit has a set of reservation stations associated with it. Each station contains information about instructions waiting or ready to issue. The reservation stations can also be used as the physical mechanism behind register renaming.
  • Load/store queue. This is like having reservation stations for the memory unit, but with special properties to avoid data hazards through memory.
  • Scoreboarding or Tomasulo's Algorithm. These are algorithms that keep track of the details of the pipeline, deciding when and what to execute. The scoreboard knows (or predicts) when results will be available from instructions, so it knows when dependent instructions are able to be executed, and when they can write their results into destination registers. In Tomasulo's algorithm, reservation stations are used to implicitly implement register renaming. (Other schemes add an actual physical register file and Register Alias Table for doing renaming, allowing the new scheme to eliminate more data hazards.)
  • Common data bus (CDB). The common data bus is a network among the functional units used to communicate things like operands and reservation station tags.
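The rename-stage bookkeeping from the first bullet can be sketched as follows (the register counts are invented, and a real renamer also reclaims physical registers when instructions retire, which this sketch omits):

```python
# Minimal sketch of a rename stage with a Register Alias Table (RAT) and a
# free list of physical registers. 8 logical / 16 physical is an invented ratio.
class Renamer:
    def __init__(self, logical=8, physical=16):
        self.rat = {f"r{i}": f"p{i}" for i in range(logical)}   # logical -> physical
        self.free = [f"p{i}" for i in range(logical, physical)]

    def rename(self, dst, srcs):
        mapped = [self.rat[r] for r in srcs]   # read current mappings first
        fresh = self.free.pop(0)               # allocate a new physical register
        self.rat[dst] = fresh                  # later readers of dst see 'fresh'
        return fresh, mapped

ren = Renamer()
w1, _ = ren.rename("r1", [])         # first write to r1 gets a fresh register
_, s2 = ren.rename("r2", ["r1"])     # reads the first write's physical register
w2, _ = ren.rename("r1", [])         # second write to r1: different register
```

Because the two writes to r1 land in different physical registers (`w1 != w2`), the WAW hazard between them disappears, and the reader in between is bound to the correct producer.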

The new pipeline is divided into three phases, each of which could take a number of clock cycles (this material is from Chapter 2):

  1. Issue:
    • Fetch. The fetch unit keeps instructions in an instruction queue, in program order (i.e., first-in-first-out). These instructions are fetched with the assistance of branch prediction. The issue phase dequeues an instruction from this queue.
    • Decode. The instruction is decoded to determine what functional units it will need.
    • Allocate reservation station. If there is a reservation station available at the functional unit this instruction needs, send it there; otherwise, stall this instruction because of the structural hazard.
    • Read operands. If the operands for the instruction are available, send them to the reservation station for that instruction. Otherwise, send information about the source of those operands to the reservation station, which will wait for the operands. This information takes the form of tags that name functional units and other reservation stations.
    • Rename registers. By sending tags instead of register names to the reservation stations, the issue phase implicitly renames registers in a virtual set of registers. For example, WAW hazards are no longer possible, since the same register in two different instructions corresponds to two different reservation stations.
  2. Execute. At the reservation station for this instruction, the following actions may be taken:
    • Wait for operands. If there are operands that haven't been computed yet, wait for them to arrive before using the functional unit. At this point, the instruction has been "issued" with references to where the operands will come from, but without the values.
    • Receive operands. When a value becomes available from a producing instruction, place it in the reservation station.
    • Compute. When all operands are present in the reservation station, use the functional unit to compute the result of this instruction. If more than one reservation station has all of its operands available, the functional unit uses some algorithm to choose which reservation station to serve first ==> presumably one that minimizes the waiting time for operands required by later instructions. Note that we are exploiting ILP here; in the same clock cycle, each functional unit can be independently executing an instruction from its own set of reservation stations.
    • Load/store. It doesn't really matter which reservation station "fires" first unless the functional unit is the memory unit, in which case loads and stores are executed in program order. Loads and stores execute in two steps: compute the effective address, then use the memory unit. Loads can go as soon as the memory unit becomes available. Stores, like other instructions with operand values, wait for the value to become available before trying to acquire the memory unit.
  3. Write result. Once the result of an executed instruction becomes available, broadcast it over the CDB. Reservation stations that are waiting for the result of this instruction may make forward progress. During this phase, stores to memory are also executed.
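The write-result broadcast can be sketched as follows (the station contents are invented; the tags and field names follow the Qj/Qk/Vj/Vk notation used in this lecture, and a tag of zero means the value has already arrived):

```python
# Sketch of the CDB broadcast: every waiting reservation station compares the
# broadcast tag against its Qj/Qk and captures the value on a match.
def broadcast(stations, tag, value):
    for rs in stations.values():
        if rs.get("Qj") == tag:
            rs["Vj"], rs["Qj"] = value, 0
        if rs.get("Qk") == tag:
            rs["Vk"], rs["Qk"] = value, 0

def ready(rs):
    return rs["Qj"] == 0 and rs["Qk"] == 0   # all operands have arrived

# Invented state: Add1 waits only on Load2; Add2 waits on Add1 and Load2.
stations = {
    "Add1": {"Op": "Sub", "Qj": "Load2", "Qk": 0, "Vk": 7.0},
    "Add2": {"Op": "Add", "Qj": "Add1", "Qk": "Load2"},
}
broadcast(stations, "Load2", 3.5)            # Load2's result appears on the CDB
```

After the broadcast, Add1 is ready to use its functional unit, while Add2 has captured one operand but still waits for Add1's tag on a future broadcast.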

Note that this scheme is non-speculative. Branch prediction is used to fetch instructions, but instructions are not executed until all of their dependences are satisfied, including control dependences. So, there is no problem with instructions fetched down the wrong path; they are never executed because they are discarded once their dependent branches are executed.

Reservation Stations

You can think of the reservation stations as structures or records in some program. Each functional unit might have several reservation stations forming a sort of queue where instructions sit and wait for their operands to become available and for the functional unit to become available. The components of a reservation station for an instruction whose source inputs are Sj and Sk are:

  • Op. The operation to perform. This code is specific to the functional unit with which this reservation station is associated. For example, the set of values of Op for an arithmetic/logic functional unit might be { Add, Subtract, Negate, And, Or, Not }. For a memory unit, the set of values might be { Load, Store }.
  • Qj, Qk. These are the tags for the reservation stations that will produce Sj and Sk, respectively. A value of zero indicates that the corresponding source has already received its value from a reservation station.
  • Vj, Vk. These are the actual values of the source operands. A value here is only valid if the corresponding Q entry is zero, indicating that the source value has arrived.
  • A. Holds the effective address for a load or store. Initially, it might hold only the immediate field of the instruction, until the effective address computation has occurred (recall that loads and stores execute in two steps: EA computation and using the memory unit).
  • Busy. This boolean condition is True if the reservation station is occupied, False if it is free.

In addition, each register in the physical register file has an entry, Qi, that gives the tag of the reservation station holding the instruction whose result should be stored into that register. If Qi is zero, then the value in the register file is the actual value of that register, i.e., the register is not renamed at that point.
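The field list above, transcribed as a record (a sketch only; the field types and defaults are guesses, and a tag of zero means the value has arrived, as described above):

```python
from dataclasses import dataclass

@dataclass
class ReservationStation:
    Busy: bool = False     # True if the station is occupied
    Op: str = ""           # operation code, e.g. "Add", "Sub", "Load"
    Qj: object = 0         # tag of the station producing Sj (0 = value arrived)
    Qk: object = 0         # tag of the station producing Sk (0 = value arrived)
    Vj: float = 0.0        # value of Sj, valid only when Qj == 0
    Vk: float = 0.0        # value of Sk, valid only when Qk == 0
    A: int = 0             # effective address field for loads/stores

    def ready(self):
        # The instruction may use its functional unit once both tags are clear.
        return self.Busy and self.Qj == 0 and self.Qk == 0

rs = ReservationStation(Busy=True, Op="Sub", Qj="Load2", Vk=7.0)
was_ready = rs.ready()     # False: still waiting on Load2's result
rs.Vj, rs.Qj = 3.5, 0      # Load2's value arrives over the CDB
now_ready = rs.ready()     # True: both operands present
```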

Example

Let's look at an example from the book, on page 99. Consider the following code:

	ld	f6,34(r2)	// f6 := memory at r2 + 34
	ld	f2,45(r3)	// f2 := memory at r3 + 45
	mul	f0,f2,f4	// f0 := f2 * f4
	sub	f8,f2,f6	// f8 := f2 - f6
	div	f10,f0,f6	// f10 := f0 / f6
	add	f6,f8,f2	// f6 := f8 + f2

Let's look at what the reservation stations will look like once the first load has completed. The second load has done its effective address computation, but is still waiting to use the memory unit. Rather than using numbers for the reservation station tags, we'll use a combination of names and numbers, e.g., Add3.

Here is how the reservation stations would look:

Name    Busy   Op     Vj   Vk                  Qj      Qk      A
Load1   no
Load2   yes    Load                                            45 + Regs[r3]
Add1    yes    Sub         Mem[34+Regs[r2]]    Load2   0
Add2    yes    Add                             Add1    Load2
Add3    no
Mult1   yes    Mul         Regs[f4]            Load2
Mult2   yes    Div         Mem[34+Regs[r2]]    Mult1

And here is how the Qi field for the floating point register file would look:

Field   F0      F2      F4      F6      F8      F10     ...
Qi      Mult1   Load2           Add2    Add1    Mult2

OoO and Speculative Execution

cpu architecture - Out-of-order execution vs. speculative execution - Stack Overflow
https://stackoverflow.com/questions/49601910/out-of-order-execution-vs-speculative-execution

Speculative execution and out-of-order execution are orthogonal. One could design a processor that is OoO but not speculative, or speculative but in-order. OoO execution is an execution model in which instructions can be dispatched to execution units in an order that is potentially different from the program order. However, the instructions are still retired in program order so that the program's observed behavior is the same as the one intuitively expected by the programmer. (Although it's possible to design an OoO processor that retires instructions in some unnatural order with certain constraints. See the simulation-based study on this idea: Maximizing Limited Resources: a Limit-Based Study and Taxonomy of Out-of-Order Commit).

Speculative execution is an execution model in which instructions can be fetched and enter the pipeline and begin execution without knowing for sure that they will indeed be required to execute (according to the control flow of the program). The term is often used to specifically refer to speculative execution in the execution stage of the pipeline. The Meltdown paper does define these terms on page 3:

In this paper, we refer to speculative execution in a more restricted meaning, where it refers to an instruction sequence following a branch, and use the term out-of-order execution to refer to any way of getting an operation executed before the processor has committed the results of all prior instructions.

The authors here specifically refer to having branch prediction with executing instructions past predicted branches in the execution units. This is commonly the intended meaning of the term. Although it's possible to design a processor that executes instructions speculatively without any branch prediction by using other techniques such as value prediction and speculative memory disambiguation. This would be speculation on data or memory dependencies rather than on control. An instruction could be dispatched to an execution unit with an incorrect operand or that loads the wrong value. Speculation can also occur on the availability of execution resources, on the latency of an earlier instruction, or on the presence of a needed value in a particular unit in the memory hierarchy.

Note that instructions can be executed speculatively, yet in-order ==> and, as shown in the previous section, out-of-order yet non-speculative. When the decoding stage of the pipeline identifies a conditional branch instruction, it can speculate on the branch and its target and fetch instructions from the predicted target location. But still, instructions can also be executed in-order. However, note that once the speculated conditional branch instruction and the instructions fetched from the predicted path (or both paths) reach the issue stage, none of them will be issued until all earlier instructions are issued. The Intel Bonnell microarchitecture is an example of a real processor that is in-order and supports branch prediction.

Processors designed to carry out simple tasks and used in embedded systems or IoT devices are typically neither speculative nor OoO. Desktop and server processors are both speculative and OoO. Speculative execution is particularly beneficial when used with OoO.

The confusion came when I read the papers of Meltdown and Spectre and did additional research. It is stated in the Meltdown paper that Meltdown is based on out-of-order execution, while some other resources, including the wiki page about speculative execution, state that Meltdown is based on speculative execution.

The Meltdown vulnerability as described in the paper requires both speculative and out-of-order execution. However, this is a somewhat vague statement since there are many different speculative and out-of-order execution implementations. Meltdown doesn't work with just any type of OoO or speculative execution. For example, ARM11 (used in Raspberry Pis) supports some limited OoO and speculative execution, but it's not vulnerable.

See Peter's answers on that question for more details on Meltdown.

Related: What is the difference between Superscalar and OoO execution?.
