Further Optimizations Exploiting the Microarchitecture of the Processor

In the prior essay Transforming an Abstract Program into More Efficient Code Systematically, we applied optimizations that did not rely on any features of the target machine. They simply reduced the overhead of procedure calls and eliminated some of the critical "optimization blockers" that cause difficulties for optimizing compilers.

      As we seek to push the performance further, we must consider optimizations that exploit the microarchitecture of the processor, that is, the underlying system design by which a processor executes instructions[1].

      Firstly, loop unrolling is a program transformation that reduces the number of iterations for a loop by increasing the number of elements computed on each iteration[1]. 

      Secondly, for a combining operation that is associative and commutative, such as integer addition or multiplication, we can improve performance by splitting the set of combining operations into two or more parts and combining the results at the end.

      Thirdly, the reassociation transformation is a way to break sequential dependencies and thereby improve performance beyond the latency bound.

      Three versions of the combining code, combine5, combine6, and combine7, which use two-way loop unrolling, two-way parallelism, and the reassociation transformation respectively, are presented below:

// Unroll loop by 2                          // Unroll loop by 2, 2-way parallelism       // Change associativity of combining operation
void combine5(vec_ptr v, data_t *dest)       void combine6(vec_ptr v, data_t *dest)       void combine7(vec_ptr v, data_t *dest)
{                                            {                                            {
    long int length = vec_length(v);             long int length = vec_length(v);             long int length = vec_length(v);
    long int limit = length-1;                   long int limit = length-1;                   long int limit = length-1;
    data_t *data = get_vec_start(v);             data_t *data = get_vec_start(v);             data_t *data = get_vec_start(v);
    data_t acc = IDENT;                          data_t acc0 = IDENT;                         data_t acc = IDENT;
                                                 data_t acc1 = IDENT;
    // Combine 2 elements at a time              // Combine 2 elements at a time              // Combine 2 elements at a time
    long int i = 0;                              long int i = 0;                              long int i = 0;
    for (i=0; i<limit; i+=2) {                   for (i=0; i<limit; i+=2) {                   for (i=0; i<limit; i+=2) {
        acc = (acc OP data[i]) OP data[i+1];         acc0 = acc0 OP data[i];                      acc = acc OP (data[i] OP data[i+1]);
                                                     acc1 = acc1 OP data[i+1];
    }                                            }                                            }

    // Finish any remaining elements             // Finish any remaining elements             // Finish any remaining elements
    for (; i<length; ++i) {                      for (; i<length; ++i) {                      for (; i<length; ++i) {
        acc = acc OP data[i];                        acc0 = acc0 OP data[i];                      acc = acc OP data[i];
    }                                            }                                            }
    *dest = acc;                                 *dest = acc0 OP acc1;                        *dest = acc;
}                                            }                                            }
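
      The listing above relies on declarations from the prior essay (vec_ptr, data_t, OP, IDENT, vec_length, get_vec_start), so it does not compile on its own. The following is a minimal sketch of one way to fill them in, so that combine6 can be checked against a plain one-element-per-iteration loop. The struct layout is modeled on the one used in [1], the choice of long with multiplication is an assumption made for the example, and combine_simple is a hypothetical name introduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed configuration: integer product. Edit these three lines to try other cases. */
typedef long data_t;
#define IDENT 1
#define OP *

/* Assumed vector representation, modeled on the one in [1]. */
typedef struct {
    long int len;
    data_t *data;
} vec_rec, *vec_ptr;

long int vec_length(vec_ptr v) { return v->len; }
data_t *get_vec_start(vec_ptr v) { return v->data; }

/* Baseline (hypothetical name): one element per iteration, one accumulator. */
void combine_simple(vec_ptr v, data_t *dest)
{
    long int length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;
    long int i;

    for (i = 0; i < length; i++) {
        acc = acc OP data[i];
    }
    *dest = acc;
}

/* combine6 from the listing above, written out as a single column. */
void combine6(vec_ptr v, data_t *dest)
{
    assert(v != NULL && dest != NULL);
    long int length = vec_length(v);
    long int limit = length - 1;
    data_t *data = get_vec_start(v);
    data_t acc0 = IDENT;
    data_t acc1 = IDENT;
    long int i;

    /* Combine 2 elements at a time, one per accumulator */
    for (i = 0; i < limit; i += 2) {
        acc0 = acc0 OP data[i];
        acc1 = acc1 OP data[i + 1];
    }

    /* Finish any remaining elements */
    for (; i < length; ++i) {
        acc0 = acc0 OP data[i];
    }
    *dest = acc0 OP acc1;
}
```

      Calling both functions on the same vector should give the same result for integer data, including vectors of odd length, where the second loop picks up the last element.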

     Loop unrolling can improve performance in two ways. First, it reduces the number of operations that do not contribute directly to the program results, such as loop indexing and conditional branching. Second, it exposes ways in which we can further transform the code to reduce the number of operations in the critical paths of the overall computation.

      In the loop of combine6, there are two critical paths: one computing the product of the even-numbered elements and one computing the product of the odd-numbered elements. Each of these critical paths contains only n/2 operations, leading to a CPE (cycles per element) of about L/2, where L is the latency of the combining operation.
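
      The same idea can be pushed further by using more accumulators. The following is a sketch with four accumulators and four-way unrolling, under a few assumptions: it works on a plain array and length instead of vec_ptr to stay self-contained, it uses long with multiplication, and combine_4x4 is a hypothetical name. Whether four chains actually run faster than two depends on the latency of the operation and on how many functional units and registers the processor has, so it is worth measuring rather than assuming.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed configuration: integer product. */
typedef long data_t;
#define IDENT 1
#define OP *

/* Hypothetical helper: 4-way unrolling with 4 independent accumulators. */
data_t combine_4x4(const data_t *data, long int length)
{
    assert(data != NULL || length == 0);
    long int limit = length - 3;
    data_t acc0 = IDENT;
    data_t acc1 = IDENT;
    data_t acc2 = IDENT;
    data_t acc3 = IDENT;
    long int i;

    /* Combine 4 elements at a time; each accumulator forms its own chain */
    for (i = 0; i < limit; i += 4) {
        acc0 = acc0 OP data[i];
        acc1 = acc1 OP data[i + 1];
        acc2 = acc2 OP data[i + 2];
        acc3 = acc3 OP data[i + 3];
    }

    /* Finish any remaining elements (at most 3) */
    for (; i < length; ++i) {
        acc0 = acc0 OP data[i];
    }

    /* Merge the four partial results at the end */
    return (acc0 OP acc1) OP (acc2 OP acc3);
}
```

      With four accumulators, each dependency chain contains roughly n/4 operations instead of n/2.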

      As in combine5, each iteration of combine7 has two load and two mul operations, but only one of the mul operations forms a data-dependency chain between loop registers. As a result, only n/2 operations lie along the critical path. As we increase the unrolling factor k, we continue to have only one operation per iteration along the critical path.
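
      To illustrate the last point, here is a sketch of the reassociation idea with an unrolling factor of k = 4. It is written under assumptions: the function takes a plain array and length instead of vec_ptr, data_t is double with multiplication, and combine_reassoc4 is a hypothetical name. The three multiplications inside the parentheses do not depend on acc, so only the final multiplication per iteration sits on the chain that passes through acc.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed configuration: floating-point product. */
typedef double data_t;
#define IDENT 1.0
#define OP *

/* Hypothetical helper: 4-way unrolling with reassociation, single accumulator. */
data_t combine_reassoc4(const data_t *data, long int length)
{
    assert(data != NULL || length == 0);
    long int limit = length - 3;
    data_t acc = IDENT;
    long int i;

    /* Combine 4 elements at a time */
    for (i = 0; i < limit; i += 4) {
        /* The inner products are independent of acc; only the
           outermost OP extends the dependency chain through acc. */
        acc = acc OP ((data[i] OP data[i + 1]) OP (data[i + 2] OP data[i + 3]));
    }

    /* Finish any remaining elements (at most 3) */
    for (; i < length; ++i) {
        acc = acc OP data[i];
    }
    return acc;
}
```

      One caveat: floating-point addition and multiplication are not strictly associative, so for double or float data the reassociated version may round differently from the left-to-right order. For integer data the results are identical.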

References

[1] Randal E. Bryant, David R. O'Hallaron (2011). Computer Systems: A Programmer's Perspective (Second Edition). Beijing: China Machine Press.

