Chapter 5-02

Please indicate the source if you want to reprint: http://blog.csdn.net/gaoxiangnumber1.
5.7 Understanding Modern Processors
Modern microprocessors’ actual operation is far different from the view suggested by machine-level programs. At the code level, it appears as if instructions are executed one at a time, where each instruction involves fetching values from registers or memory, performing an operation, and storing results back to a register or memory location. In the actual processor, a number of instructions are evaluated simultaneously, a phenomenon referred to as instruction-level parallelism. Complex mechanisms are employed to make sure the behavior of this parallel execution exactly captures the sequential semantic model required by the machine-level program. This is one of the remarkable feats of modern microprocessors: they employ complex and exotic microarchitectures, in which multiple instructions can be executed in parallel, while presenting an operational view of simple sequential instruction execution.
Two different lower bounds characterize the maximum performance of a program. The latency bound is encountered when a series of operations must be performed in strict sequence, because the result of one operation is required before the next one can begin. This bound limits program performance when the data dependencies in the code prevent the processor from exploiting instruction-level parallelism. The throughput bound characterizes the raw computing capacity of the processor’s functional units. This bound becomes the ultimate limit on program performance.
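To make the two bounds concrete, here is a minimal sketch in C (hypothetical function names, assuming a pipelined floating-point multiplier). The first loop is latency-bound: each multiplication must wait for the previous result. The second splits the work into two independent dependency chains, letting the pipelined multiplier overlap work on both and moving performance toward the throughput bound.

    #include <stddef.h>

    /* Latency-bound: every multiply consumes the previous result,
       so the multiplies form one sequential dependency chain. */
    double product_single_chain(const double *a, size_t n)
    {
        double acc = 1.0;
        for (size_t i = 0; i < n; i++)
            acc = acc * a[i];
        return acc;
    }

    /* Two independent chains: the multiplier can overlap work on both,
       approaching the throughput bound instead of the latency bound. */
    double product_two_chains(const double *a, size_t n)
    {
        double acc0 = 1.0, acc1 = 1.0;
        size_t i;
        for (i = 0; i + 1 < n; i += 2) {
            acc0 = acc0 * a[i];
            acc1 = acc1 * a[i + 1];
        }
        if (i < n)                      /* odd element count */
            acc0 = acc0 * a[i];
        return acc0 * acc1;
    }

Note that reassociating floating-point multiplication can change rounding, which is why compilers do not apply this transformation automatically.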
5.7.1 Overall Operation
Our processor design is based on the structure of the Intel Core i7. It is described as being superscalar, which means it can perform multiple operations on every clock cycle, and out-of-order, meaning that the order in which instructions execute need not correspond to their ordering in the machine-level program.
The overall design has two main parts: the instruction control unit (ICU), which is responsible for reading a sequence of instructions from memory and generating from these a set of primitive operations to perform on program data, and the execution unit (EU), which then executes these operations.
The ICU reads the instructions from an instruction cache—a special high-speed memory containing the most recently accessed instructions. In general, the ICU fetches well ahead of the currently executing instructions, so that it has enough time to decode these and send operations down to the EU.
One problem is that when a program hits a branch, there are two possible directions it might go. Modern processors employ a technique called branch prediction, in which they guess whether or not a branch will be taken and also predict the target address of the branch. The processor then begins fetching and decoding instructions where it predicts the branch will go, and even begins executing these operations, before it has been determined whether or not the branch prediction was correct.
If it later determines that the branch was predicted incorrectly, it resets the state to that at the branch point and begins fetching and executing instructions in the other direction. The block labeled “Fetch control” incorporates branch prediction to perform the task of determining which instructions to fetch.
The instruction decoding logic takes the actual program instructions and converts them into a set of primitive operations, each of which performs some simple computational task such as adding two numbers, reading data from memory, or writing data to memory. For machines with complex instructions, such as x86, an instruction can be decoded into a variable number of operations. This decoding splits instructions to allow a division of labor among a set of dedicated hardware units, which can then execute the different parts of multiple instructions in parallel.
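For example (picking an illustrative x86 instruction), addq %rax, 8(%rbx), which adds register %rax to a value in memory, would be decoded into three operations: one to load the value at address 8(%rbx) from memory, one to add %rax to the loaded value, and one to store the result back to memory.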
The EU receives operations from the instruction fetch unit. Typically, it can receive a number of them on each clock cycle. These operations are dispatched to a set of functional units that perform the actual operations. These functional units are specialized to handle specific types of operations. Our figure illustrates a typical set of functional units, based on those of the Intel Core i7. Three functional units are dedicated to computation, while the remaining two are for reading (load) and writing (store) memory. Each computational unit can perform multiple different operations: all can perform at least basic integer operations, such as addition and bit-wise logical operations. Floating-point operations and integer multiplication require more complex hardware, and so these can only be handled by specific functional units.
Reading and writing memory is implemented by the load and store units. The load unit handles operations that read data from the memory into the processor and the store unit handles operations that write data from the processor to the memory. Both have an adder to perform address computations and both units access memory via a data cache, a high-speed memory containing the most recently accessed data values.
With speculative execution, the operations are evaluated, but the final results are not stored in the program registers or data memory until the processor can be certain that these instructions should actually have been executed. Branch operations are sent to the EU to determine whether or not they were predicted correctly. If the prediction was incorrect, the EU will discard the results that have been computed beyond the branch point. It will also signal the branch unit that the prediction was incorrect and indicate the correct branch destination. Then the branch unit begins fetching at the new location. Such a misprediction incurs a significant cost in performance. It takes a while before the new instructions can be fetched, decoded, and sent to the execution units.
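A hedged sketch of this cost in C (hypothetical function names): when the data is random, the branch in the first loop below is essentially unpredictable, so the processor repeatedly speculates down the wrong path and pays the misprediction penalty. The second variant turns the comparison into data, leaving nothing for the predictor to guess.

    #include <stddef.h>

    /* Branchy version: if a[] is random, the predictor guesses wrong
       about half the time, and each miss flushes speculative work. */
    long count_above(const int *a, size_t n, int threshold)
    {
        long count = 0;
        for (size_t i = 0; i < n; i++) {
            if (a[i] > threshold)
                count++;
        }
        return count;
    }

    /* Branch-free version: the comparison result is used as data;
       compilers typically emit a setcc or conditional-move sequence. */
    long count_above_branchfree(const int *a, size_t n, int threshold)
    {
        long count = 0;
        for (size_t i = 0; i < n; i++)
            count += (a[i] > threshold);
        return count;
    }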
Within the ICU, the retirement unit keeps track of the ongoing processing and makes sure that it obeys the sequential semantics of the machine-level program. Our figure shows a register file containing the integer, floating-point, and (more recently) SSE registers as part of the retirement unit, because this unit controls the updating of these registers. As an instruction is decoded, information about it is placed into a first-in, first-out (FIFO) queue. This information remains in the queue until one of two outcomes occurs:
First, once the operations for the instruction have completed and any branch points leading to this instruction are confirmed as having been correctly predicted, the instruction can be retired, with any updates to the program registers being made.
Second, if some branch point leading to this instruction was mispredicted, the instruction will be flushed, discarding any results that may have been computed. So mispredictions will not alter the program state.
Any updates to the program registers occur only as instructions are being retired, and this takes place only after the processor can be certain that any branches leading to this instruction have been correctly predicted.
To expedite the communication of results from one instruction to another, much of this information is exchanged among the execution units, shown in the figure as “Operation results.” The execution units can send results directly to each other.
5.7.2 Functional Unit Performance
The throughput of a functional unit is the reciprocal of its issue time, the minimum number of clock cycles between the start of successive independent operations:
Throughput = 1 / Issue time
The latencies increase as the word sizes increase (e.g., from 4 to 8 bytes), for more complex data types (from integer to floating point), and for more complex operations (from addition to multiplication to division).
Most forms of addition and multiplication operations have issue times of 1, meaning that on each clock cycle, the processor can start a new one of these operations. This short issue time is achieved through the use of pipelining. Functional units with issue times of 1 cycle are said to be fully pipelined: they can start a new operation every clock cycle. The issue time of 0.33 given for integer addition is due to the fact that the hardware has three fully pipelined functional units capable of performing integer addition. The processor has the potential to perform three additions every clock cycle.
A pipelined functional unit is implemented as a series of stages, each of which performs part of the operation. For example, a typical floating-point adder contains three stages (hence the three-cycle latency): one to process the exponent values, one to add the fractions, and one to round the result. Arithmetic operations can proceed through the stages in close succession rather than waiting for one operation to complete before the next begins. This capability can be exploited only if there are successive, logically independent operations to be performed.
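For example, three logically independent additions A, B, and C can enter a three-stage adder on consecutive cycles: A occupies the exponent stage on cycle 1, the fraction stage on cycle 2, and the rounding stage on cycle 3, while B starts on cycle 2 and C on cycle 3. Each addition still takes three cycles from start to finish, but once the pipeline is full, one result emerges every cycle.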
The divider (used for integer and floating-point division) is not fully pipelined—its issue time is just a few cycles less than its latency. This means that the divider must complete all but the last few steps of a division before it can begin a new one.
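With hypothetical numbers, a divider with a latency of 30 cycles and an issue time of 27 cycles can start a new division only during the last 3 cycles of the current one; a sequence of n divisions therefore takes roughly 27n cycles, barely better than the 30n of a completely unpipelined unit.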
5.7.3 An Abstract Model of Processor Operation
Consider the CPE (cycles per element) measurements obtained for function combine4, our fastest code up to this point.
These measurements match the latency bound for the processor, except for the case of integer addition. This indicates that the performance of these functions is dictated by the latency of the sum or product computation being performed. Computing the product or sum of n elements requires around L * n + K clock cycles, where L is the latency of the combining operation and K represents the overhead of calling the function and of initiating and terminating the loop. As n grows, the CPE therefore approaches the latency bound L.
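For example, with single-precision multiplication as the combining operation (L = 4, as discussed below) and n = 1000, the loop needs roughly 4 * 1000 + K = 4000 + K cycles; as n grows, the constant K becomes negligible and the CPE approaches 4.00.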
We present the data-flow notation by working with combine4 (Figure 5.10, page 493) as an example. We focus just on the computation performed by the loop. We consider the case of floating-point data with multiplication as the combining operation. The compiled code for this loop consists of four instructions, with registers %rdx holding loop index i, %rax holding array address data, %rbp holding loop bound limit, and %xmm0 holding accumulator value acc.
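Here is a self-contained sketch of the combine4 pattern, specialized to single-precision multiplication. The book's version is generic over a data type data_t, an operation OP, and an identity element IDENT, and operates on its vector abstraction; the simplified signature and the reconstructed assembly in the comment are assumptions, not an exact reproduction.

    typedef float data_t;

    /* combine4 pattern: accumulate in a local variable acc and write
       the result to memory once, after the loop finishes. */
    void combine4(const data_t *data, long limit, data_t *dest)
    {
        data_t acc = 1;                  /* IDENT for multiplication */
        for (long i = 0; i < limit; i++)
            acc = acc * data[i];         /* acc = acc OP data[i] */
        *dest = acc;
    }

    /* With the register assignments given in the text (%rdx = i,
       %rax = data, %rbp = limit, %xmm0 = acc), the compiled inner
       loop is four instructions, roughly:

       loop:
         mulss (%rax,%rdx,4), %xmm0    acc *= data[i] (load + mul)
         addq  $1, %rdx                i++
         cmpq  %rdx, %rbp              compare limit against i
         jg    loop                    if limit > i, repeat
    */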
As Figure 5.13 indicates, the four instructions are expanded by the instruction decoder into a series of five operations, with the initial multiplication instruction being expanded into a load operation to read the source operand from memory, and a mul operation to perform the multiplication.
The top boxes represent the register values at the beginning of the loop, and the bottom boxes represent the values at the end. Some of the operations produce values that do not correspond to registers; we show these as arcs between operations on the right-hand side. For example, the load operation reads a value from memory and passes it directly to the mul operation.
There are two data dependencies from one iteration to the next: program value acc and loop index i.
The program has two chains of data dependencies, corresponding to the updating of program values acc and i with operations mul and add, respectively. Given that single-precision multiplication has a latency of 4 cycles while integer addition has a latency of 1 cycle, the chain on the left forms a critical path, requiring 4n cycles to execute. The chain on the right would require only n cycles, and so it does not limit the program performance.
Figure 5.15 demonstrates why we achieved a CPE equal to the latency bound of 4 cycles for combine4, when performing single-precision floating-point multiplication. When executing the function, the floating-point multiplier becomes the limiting resource. The other operations required during the loop—manipulating and testing loop index i, computing the address of the next data elements, and reading data from memory—proceed in parallel with the multiplier.
For all of the cases where the operation has a latency L greater than 1, the measured CPE is simply L, indicating that this chain forms the performance-limiting critical path.
For the case of integer addition, our measurements of combine4 show a CPE of 2.00, slower than the CPE of 1.00 we would predict based on the chains of dependencies formed along either the left- or the right-hand side of the graph of Figure 5.15. This illustrates the principle that the critical paths in a data-flow representation provide only a lower bound on how many cycles a program will require. Other factors can also limit performance, including the total number of functional units available and the number of data values that can be passed among the functional units on any given step.
Please indicate the source if you want to reprint: http://blog.csdn.net/gaoxiangnumber1.
