Further Optimizations Exploiting the Microarchitecture of the Processor

In the prior essay, Transforming an Abstract Program into More Efficient Code Systematically, we applied optimizations that did not rely on any features of the target machine. They simply reduced the overhead of procedure calls and eliminated some of the critical "optimization blockers" that cause difficulties for optimizing compilers.

      As we seek to push performance further, we must consider optimizations that exploit the microarchitecture of the processor, that is, the underlying system design by which a processor executes instructions [1].

      First, loop unrolling is a program transformation that reduces the number of iterations for a loop by increasing the number of elements computed on each iteration [1].

      Second, for a combining operation that is associative and commutative, such as integer addition or multiplication, we can improve performance by splitting the set of combining operations into two or more parts, accumulating each part in its own variable, and combining the partial results at the end.

      Third, the reassociation transformation changes how the elements are combined, breaking the sequential dependency on the accumulator and thereby improving performance beyond the latency bound.

      Three versions of the combining code, combine5, combine6, and combine7, which apply two-way loop unrolling, two-way parallelism, and the reassociation transformation respectively, are presented below:

// Unroll loop by 2
void combine5(vec_ptr v, data_t *dest)
{
    long int length = vec_length(v);
    long int limit = length - 1;
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;

    // Combine 2 elements at a time
    long int i;
    for (i = 0; i < limit; i += 2) {
        acc = (acc OP data[i]) OP data[i+1];
    }

    // Finish any remaining elements
    for (; i < length; ++i) {
        acc = acc OP data[i];
    }
    *dest = acc;
}

// Unroll loop by 2, 2-way parallelism
void combine6(vec_ptr v, data_t *dest)
{
    long int length = vec_length(v);
    long int limit = length - 1;
    data_t *data = get_vec_start(v);
    data_t acc0 = IDENT;
    data_t acc1 = IDENT;

    // Combine 2 elements at a time
    long int i;
    for (i = 0; i < limit; i += 2) {
        acc0 = acc0 OP data[i];
        acc1 = acc1 OP data[i+1];
    }

    // Finish any remaining elements
    for (; i < length; ++i) {
        acc0 = acc0 OP data[i];
    }
    *dest = acc0 OP acc1;
}

// Change associativity of combining operation
void combine7(vec_ptr v, data_t *dest)
{
    long int length = vec_length(v);
    long int limit = length - 1;
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;

    // Combine 2 elements at a time
    long int i;
    for (i = 0; i < limit; i += 2) {
        acc = acc OP (data[i] OP data[i+1]);
    }

    // Finish any remaining elements
    for (; i < length; ++i) {
        acc = acc OP data[i];
    }
    *dest = acc;
}
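      The three routines above rely on the vector abstraction (vec_ptr, data_t, vec_length, get_vec_start) and the OP/IDENT macros from the previous essay. A minimal sketch of those supporting definitions, sufficient to compile and experiment with the code, might look as follows; the struct layout and the choice of double-precision multiplication are assumptions for illustration, not a fixed interface.

// Supporting definitions (sketch): element type, identity value, and
// combining operation. Here the combination is double-precision multiply.
typedef double data_t;
#define IDENT 1
#define OP *

// A vector is a length plus a pointer to a contiguous array of elements
typedef struct {
    long int len;
    data_t *data;
} vec_rec, *vec_ptr;

// Return the number of elements in the vector
long int vec_length(vec_ptr v)
{
    return v->len;
}

// Return a pointer to the start of the underlying array so the
// combining loops can index the elements directly
data_t *get_vec_start(vec_ptr v)
{
    return v->data;
}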

     Loop unrolling can improve performance in two ways. First, it reduces the number of operations that do not contribute directly to the program results, such as loop indexing and conditional branching. Second, it exposes ways in which we can further transform the code to reduce the number of operations in the critical paths of the overall computation.
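      For comparison, the single-element baseline from the previous essay (combine4 in the CS:APP numbering) performs one index update, one test, one branch, and one combining operation per element. The sketch below is a reconstruction under the same assumptions as the code above.

// Baseline: accumulate result in local variable, one element per iteration
void combine4(vec_ptr v, data_t *dest)
{
    long int length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;

    long int i;
    for (i = 0; i < length; i++) {
        acc = acc OP data[i];
    }
    *dest = acc;
}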

      In the loop of combine6, we have two critical paths, one corresponding to computing the product of the even-numbered elements and one for the odd-numbered elements. Each of these critical paths contains only n/2 operations, leading to a CPE (cycles per element) of L/2, where L is the latency of the combining operation. For example, if the multiplication latency is 5 cycles, the CPE drops from 5.0 for combine5 to roughly 2.5 for combine6.

      As for combine7, each iteration still performs two load and two mul operations, but only one of the mul operations forms a data-dependency chain between loop registers, so only n/2 operations lie along the critical path. As we increase the unrolling factor k, we continue to have only one operation per iteration along the critical path.
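      To illustrate a larger unrolling factor, the sketch below extends the reassociation idea of combine7 to k = 4 (the name combine7_k4 is hypothetical, not code from the book): four elements are grouped per iteration, yet only one combining operation is applied to the accumulator per iteration, so the accumulator chain stays short while the grouped products can overlap across iterations.

// Unroll loop by 4, reassociated: one operation on the acc chain per iteration
void combine7_k4(vec_ptr v, data_t *dest)
{
    long int length = vec_length(v);
    long int limit = length - 3;
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;

    // Combine 4 elements at a time; the grouped products are
    // independent of acc and can execute ahead of the accumulator chain
    long int i;
    for (i = 0; i < limit; i += 4) {
        acc = acc OP ((data[i] OP data[i+1]) OP (data[i+2] OP data[i+3]));
    }

    // Finish any remaining elements
    for (; i < length; ++i) {
        acc = acc OP data[i];
    }
    *dest = acc;
}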

References

[1] Randal E. Bryant and David R. O'Hallaron (2011). Computer Systems: A Programmer's Perspective (2nd ed.). Beijing: China Machine Press.

