The performance improvement available from superscalar techniques is limited by two key factors:
- The degree of intrinsic parallelism in the instruction stream, i.e. the limited amount of instruction-level parallelism, and
- The complexity and time cost of the dispatcher and associated dependency checking logic.
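(Note: this is essentially the same as the co-issue technique, which likewise has to check dependencies between instructions and is limited by the nature of the instruction stream.)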
Existing binary executable programs have varying degrees of intrinsic parallelism. In some cases instructions are not dependent on each other and can be executed simultaneously. In other cases they are inter-dependent: one instruction affects either the resources or the results of the other. The instructions a = b + c; d = e + f can be run in parallel because neither result depends on the other calculation. However, the instructions a = b + c; b = e + f might not be runnable in parallel, depending on the order in which the instructions complete as they move through the execution units: the second instruction overwrites b, which the first instruction still needs to read.
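The dependency check described above can be sketched in a few lines of Python. This is only an illustrative model, not how any real dispatcher is built: the `Instr` type and the helper names `hazards` and `can_issue_together` are hypothetical, and each instruction is reduced to one destination and a tuple of sources. The hazard names (RAW, WAR, WAW) are the standard ones for read-after-write, write-after-read, and write-after-write.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction model: one destination, several sources."""
    dest: str
    srcs: Tuple[str, ...]


def hazards(first: Instr, second: Instr) -> Set[str]:
    """Return the hazards `second` has against the earlier `first`."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")  # second reads what first writes
    if second.dest in first.srcs:
        found.add("WAR")  # second overwrites what first still reads
    if first.dest == second.dest:
        found.add("WAW")  # both write the same location
    return found


def can_issue_together(first: Instr, second: Instr) -> bool:
    """Conservative rule: issue in parallel only if no hazard exists."""
    return not hazards(first, second)


# The two pairs from the text.
i1 = Instr("a", ("b", "c"))  # a = b + c
i2 = Instr("d", ("e", "f"))  # d = e + f
i3 = Instr("b", ("e", "f"))  # b = e + f

independent = can_issue_together(i1, i2)
conflicting = hazards(i1, i3)
```

Under this model the first pair has no hazard, while the second pair has a write-after-read hazard on b. Refusing to co-issue on any hazard is a deliberately conservative choice for the sketch; register renaming, for example, can remove WAR and WAW hazards.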
When the number of simultaneously issued instructions increases, the cost of dependency checking increases extremely rapidly. This is exacerbated by the need to check dependencies at run time and at the CPU's clock rate. The cost includes the additional logic gates required to implement the checks, and the time delays through those gates. Research shows the gate cost in some cases may be n^k gates, and the delay cost k^2 log n, where n is the number of instructions in the processor's instruction set, and k is the number of simultaneously dispatched instructions. In mathematics, this is called a combinatorial problem involving permutations.
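To get a feel for how quickly these costs grow, the two expressions can be evaluated for a few issue widths. This is a rough sketch under stated assumptions: n = 64 is a hypothetical instruction-set size, the logarithm is taken base 2 (the text does not specify a base), and the expressions are orders of growth rather than exact gate or delay counts. The function names and `pairwise_checks`, which counts the instruction pairs in a group of k, are illustrative additions and not taken from the cited research.

```python
import math


def pairwise_checks(k: int) -> int:
    """Number of distinct instruction pairs in an issue group of size k."""
    return k * (k - 1) // 2


def gate_cost(n: int, k: int) -> int:
    """Gate-count growth term n^k from the text."""
    return n ** k


def delay_cost(n: int, k: int) -> float:
    """Delay growth term k^2 log n from the text (base 2 assumed)."""
    return k ** 2 * math.log2(n)


N = 64  # hypothetical instruction-set size
table = [(k, pairwise_checks(k), gate_cost(N, k), delay_cost(N, k))
         for k in (1, 2, 4, 8)]

for k, pairs, gates, delay in table:
    print(f"k={k}: pairs={pairs}, n^k={gates}, k^2 log n={delay}")
```

Doubling the issue width from 2 to 4 quadruples the delay term but multiplies the gate term by n^2, which is why wide dispatchers become expensive so quickly.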
See also:
Tomasulo algorithm
Out-of-order execution
Scoreboarding