Predication in Computer Architecture

EverNoob

Category: Computer_Architecture. Tags: C language, performance optimization, hardware architecture

Original links: https://www.cs.nmsu.edu/~rvinyard/itanium/predication.htm; https://en.wikipedia.org/wiki/Predication_(computer_architecture)

Overview

from: Predication

Predication is the conditional execution of instructions. Traditional architectures implement conditional execution with branches; predication removes those branches. By converting a control dependency into a data dependency, predicated execution also simplifies compiler optimization.

The following is an example presented in [1]:

Given the following source code:

    if (emp_status == ACTIVE) {
        n_active_emps++;
        total_payroll += emp_pay;
    }
    else {
        n_inactive_emps++;
    }

This code is typically compiled into a sequence such as:

    {
        cmp.ne p1 = rs, ACTIVE   // compare emp_status
        (p1) br else             // jump to else code if condition fails
    }
    .label then
    {
        add rt = rt, rp         // sum total_payroll + emp_pay
        add ra = ra, 1          // increment n_active_emps
        br join
    }
    .label else
    {
        add ri = ri, 1          // increment n_inactive_emps
    }
    .label join

The generated predicated code would look like:

    {
        cmp.eq p1, p2 = rs, ACTIVE // compare emp_status
    } {
        (p1) add rt = rt, rp        // sum total_payroll + emp_pay
        (p1) add ra = ra, 1         // increment n_active_emps
        (p2) add ri = ri, 1         // increment n_inactive_emps
    }

Predication offers several advantages over branch prediction. In the example above, all three add instructions are predicated and can be executed in parallel. If an instruction's predicate evaluates to 1, the instruction is executed; otherwise, it behaves as a NOP. On a machine with three or more add units, the predicated code uses the same number of cycles as on a non-predicated machine, except that there is no possibility of a branch penalty.

Note: the price of execution is still paid for every instruction, including those whose predicate is false. The variable runtime penalty of a wrong branch prediction is traded for a relatively constant waste of computation and power.
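The same idea can be imitated in C. The following is only a minimal sketch, assuming hypothetical names (`payroll_t`, `account_employee`, and the `ACTIVE`/`INACTIVE` values are not from the original example); it mirrors the three predicated adds by computing both predicates as data and letting them decide whether each update has any effect. A compiler is free to translate this differently, so it illustrates the control-to-data conversion rather than guaranteeing branch-free machine code.

```c
#include <stdint.h>

/* Hypothetical status codes and counters, named after the example above. */
enum { INACTIVE = 0, ACTIVE = 1 };

typedef struct {
    uint32_t n_active_emps;
    uint32_t n_inactive_emps;
    uint32_t total_payroll;
} payroll_t;

/* Branch-free update: both "paths" are always computed, and the
 * predicates p1/p2 decide whether each one changes anything. */
static void account_employee(payroll_t *t, int emp_status, uint32_t emp_pay)
{
    uint32_t p1 = (uint32_t)(emp_status == ACTIVE); /* 1 if active, else 0 */
    uint32_t p2 = p1 ^ 1u;                          /* complement of p1    */

    t->total_payroll   += emp_pay & (0u - p1); /* like (p1) add rt = rt, rp */
    t->n_active_emps   += p1;                  /* like (p1) add ra = ra, 1  */
    t->n_inactive_emps += p2;                  /* like (p2) add ri = ri, 1  */
}
```

Calling `account_employee` once per employee record updates exactly one of the two counters, and every call performs all three additions regardless of the status, which is the constant cost mentioned above.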

Important concept:

"Predicated execution avoids branches, and simplifies compiler optimization by converting a control dependence to a data dependence." [2]

[Figure: diagram from Byte.com illustrating predication]

Advantages and Disadvantages

https://en.wikipedia.org/wiki/Predication_(computer_architecture)

In computer science, predication is an architectural feature that provides an alternative to conditional transfer of control, as implemented by conditional branch machine instructions. Predication works by having conditional (predicated) non-branch instructions associated with a predicate, a Boolean value used by the instruction to control whether the instruction is allowed to modify the architectural state or not. If the predicate specified in the instruction is true, the instruction modifies the architectural state; otherwise, the architectural state is unchanged. For example, a predicated move instruction (a conditional move) will only modify the destination if the predicate is true. Thus, instead of using a conditional branch to select an instruction or a sequence of instructions to execute based on the predicate that controls whether the branch occurs, the instructions to be executed are associated with that predicate, so that they will be executed, or not executed, based on whether that predicate is true or false.[1]
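The predicated move described above can be emulated in portable C. This is a rough sketch with a hypothetical helper name (`predicated_move`); unlike a real predicated instruction it always performs a store, but the stored value equals the old one when the predicate is false, so the visible state does not change.

```c
#include <stdint.h>

/* Emulates a predicated move: *dst takes the value of src only when
 * pred is nonzero; otherwise *dst keeps its previous value. */
static void predicated_move(uint32_t *dst, uint32_t src, int pred)
{
    uint32_t mask = 0u - (uint32_t)(pred != 0); /* all ones or all zeros */
    *dst = (src & mask) | (*dst & ~mask);       /* select without a branch */
}
```

On hardware with a conditional move instruction, a compiler may well emit that instruction for a plain `pred ? src : *dst` expression; the mask form is shown only to make the data dependence on the predicate explicit.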

Vector processors, some SIMD ISAs (such as AVX2 and AVX-512) and GPUs in general make heavy use of predication, applying one bit of a conditional mask vector to the corresponding element of the vector registers being processed, whereas scalar predication in scalar instruction sets needs only a single predicate bit. Predicate masks become particularly powerful in vector processing when an array of condition codes, one per vector element, can feed back into predicate masks that are then applied to subsequent vector instructions.
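The mask feedback can be sketched lane by lane in plain C. The vector width `LANES` and the helpers `vec_cmp_lt` and `vec_add_masked` are hypothetical names chosen for illustration and do not correspond to any real ISA; real hardware would perform each of these loops as a single instruction.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 8 /* hypothetical vector width */

/* Per-element compare: bit i of the result is set when v[i] < limit. */
static uint8_t vec_cmp_lt(const int32_t v[LANES], int32_t limit)
{
    uint8_t mask = 0;
    for (size_t i = 0; i < LANES; i++)
        mask = (uint8_t)(mask | ((v[i] < limit) << i));
    return mask;
}

/* Masked add: lane i is updated only when bit i of mask is set.
 * The `if` stands in for the per-lane write enable of real hardware. */
static void vec_add_masked(int32_t dst[LANES], const int32_t src[LANES],
                           uint8_t mask)
{
    for (size_t i = 0; i < LANES; i++)
        if ((mask >> i) & 1u)
            dst[i] += src[i];
}
```

The mask returned by `vec_cmp_lt` is passed straight to `vec_add_masked`, so the result of one vector operation controls which elements the next one is allowed to modify.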

Advantages

The main purpose of predication is to avoid jumps over very small sections of program code, increasing the effectiveness of pipelined execution and avoiding problems with the cache. It also has a number of more subtle benefits:

Functions that are traditionally computed using simple arithmetic and bitwise operations may be quicker to compute using predicated instructions.
Predicated instructions with different predicates can be mixed with each other and with unconditional code, allowing better instruction scheduling and so even better performance.
Elimination of unnecessary branch instructions can make the execution of necessary branches, such as those that make up loops, faster by lessening the load on branch prediction mechanisms.
Elimination of the cost of a branch misprediction, which can be high on deeply pipelined architectures.
Instruction sets that have comprehensive Condition Codes generated by instructions may reduce code size further by directly using the Condition Registers in or as predication.
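As an illustration of the first point in the list above, here is absolute value written two ways. This is a sketch with hypothetical function names; the bitwise version assumes that right-shifting a negative `int32_t` is an arithmetic shift (implementation-defined in C, though common), and both versions are undefined for `INT32_MIN`.

```c
#include <stdint.h>

/* Classic arithmetic/bitwise formulation: three dependent operations.
 * Assumes an arithmetic right shift for negative values. */
static int32_t abs_bitwise(int32_t x)
{
    int32_t sign = x >> 31;   /* 0 for x >= 0, -1 for x < 0 */
    return (x ^ sign) - sign; /* flips and increments only when negative */
}

/* Conditional formulation: on a predicated ISA a compiler may lower this
 * to a compare followed by a predicated negate or conditional move. */
static int32_t abs_select(int32_t x)
{
    return (x < 0) ? -x : x;
}
```

Whether the second form is actually faster depends on the target and the compiler; the point is only that predication gives the compiler a branch-free option without resorting to bit tricks.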

Disadvantages

Predication's primary drawback is increased encoding space. In typical implementations, every instruction reserves a bitfield for the predicate specifying under what conditions that instruction should have an effect. When available memory is limited, as on embedded devices, this space cost can be prohibitive. However, some architectures such as Thumb-2 avoid this issue by supplying the predicates through a separate instruction (the IT, or If-Then, instruction) rather than reserving predicate bits in every instruction. Other detriments are the following:[3]

Predication complicates the hardware by adding levels of logic to critical paths and potentially degrades clock speed.
A predicated block includes cycles for all operations, so shorter paths may take longer and be penalized. (Note: this is because the instructions of both paths are now mixed into a single instruction stream.)
Predication is not usually speculated and causes a longer dependency chain. For ordered data this translates to a performance loss compared to a predictable branch.[4]

Predication is most effective when paths are balanced or when the longest path is the most frequently executed,[3] but determining such a path is very difficult at compile time, even in the presence of profiling information.
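The trade-off can be made concrete with a toy cycle model. Everything here is an assumption for illustration only: the function names, the parameters, and the simplification that a predicated block costs the sum of both paths (that is, no spare execution units overlap them). It is not a measurement of any real processor.

```c
/* Toy cycle model; every parameter is a hypothetical input. */

/* Expected cycles when the if/else is compiled as a branch:
 * weighted path lengths plus the expected misprediction cost. */
static double branch_cost(double then_cycles, double else_cycles,
                          double p_then, double mispredict_rate,
                          double mispredict_penalty)
{
    return p_then * then_cycles
         + (1.0 - p_then) * else_cycles
         + mispredict_rate * mispredict_penalty;
}

/* Cycles when both paths are predicated and issued back to back. */
static double predicated_cost(double then_cycles, double else_cycles)
{
    return then_cycles + else_cycles;
}
```

With paths of 4 and 2 cycles taken equally often, a 25% misprediction rate and a 16-cycle penalty, the model gives 7 cycles for the branch against 6 for predication. With a perfectly predictable branch (rate 0) it gives 3 against 6, which matches the observation that predication loses to a predictable branch on ordered data.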

SIMD, SIMT and Vector Predication

Some SIMD instruction sets, like AVX2, have the ability to use a logical mask to conditionally load/store values to memory, a parallel form of the conditional move, and may also apply individual mask bits to individual arithmetic units executing a parallel operation. The technique is known as "Associative Processing" in Flynn's Taxonomy.
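A masked load and store can be emulated in scalar C as follows. This is a sketch under stated assumptions: `masked_load` and `masked_store` are hypothetical helpers, the mask is a plain bitmask with one bit per lane (AVX2 itself takes the mask from the high bit of each element of a vector register), `lanes` is at most 32, and unselected lanes of a load are zeroed, as the AVX2 masked loads do.

```c
#include <stddef.h>
#include <stdint.h>

/* Masked load: lane i receives mem[i] when bit i of mask is set, else 0.
 * Memory for unselected lanes is never read. */
static void masked_load(int32_t *dst, const int32_t *mem,
                        uint32_t mask, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++)
        dst[i] = ((mask >> i) & 1u) ? mem[i] : 0;
}

/* Masked store: mem[i] is written only when bit i of mask is set;
 * all other memory locations are left untouched. */
static void masked_store(int32_t *mem, const int32_t *src,
                         uint32_t mask, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++)
        if ((mask >> i) & 1u)
            mem[i] = src[i];
}
```

Each loop iteration corresponds to one lane of the parallel operation, so the per-lane mask bit plays the same role as the single predicate bit of a scalar conditional move.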

This form of predication is also used in vector processors and in single instruction, multiple threads (SIMT) GPU computing. All the techniques, advantages and disadvantages of scalar predication apply just as well to the parallel processing case.