Chapter 5: Program Optimization 2 (CMU 15-213)

Exploiting Instruction-level Parallelism

  1. Need general understanding of modern processor design
    1. Hardware can execute multiple instructions in parallel
  2. Performance limited by data dependencies
  3. Simple transformations can yield dramatic performance improvement
    1. Compilers often cannot make these transformations
    2. Lack of associativity and distributivity in floating-point arithmetic

Cycles Per Element (CPE)

  • Convenient way to express performance of program that operates on vectors or lists
  • Length = n
  • In our case: CPE = cycles per OP
  • T = CPE*n + Overhead
  • CPE is slope of line
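Since T = CPE*n + Overhead is a straight line in n, CPE can be estimated by fitting a line to measured (n, cycles) pairs. The following is a minimal sketch using an ordinary least-squares fit; the function names `estimate_cpe` and `estimate_overhead` are hypothetical and not from the lecture.

```c
#include <stddef.h>

/* Least-squares slope of cycles versus n: an estimate of CPE.
 * n[] holds element counts, cycles[] the measured cycle counts, and m is
 * the number of measurements (at least two with distinct n are needed). */
double estimate_cpe(const double *n, const double *cycles, size_t m)
{
    double sum_n = 0.0, sum_c = 0.0, sum_nn = 0.0, sum_nc = 0.0;
    for (size_t i = 0; i < m; i++) {
        sum_n  += n[i];
        sum_c  += cycles[i];
        sum_nn += n[i] * n[i];
        sum_nc += n[i] * cycles[i];
    }
    double denom = (double)m * sum_nn - sum_n * sum_n;
    return ((double)m * sum_nc - sum_n * sum_c) / denom;
}

/* Intercept of the same fit: an estimate of the fixed overhead. */
double estimate_overhead(const double *n, const double *cycles, size_t m)
{
    double sum_n = 0.0, sum_c = 0.0;
    for (size_t i = 0; i < m; i++) {
        sum_n += n[i];
        sum_c += cycles[i];
    }
    return (sum_c - estimate_cpe(n, cycles, m) * sum_n) / (double)m;
}
```

Fitting over several lengths, rather than dividing one measurement by n, keeps the fixed overhead from being counted as per-element cost.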

Basic Optimizations

void combine4(vec_ptr v, data_t *dest)
{
  long i;
  long length = vec_length(v);
  data_t *d = get_vec_start(v);
  data_t t = IDENT;
  for (i = 0; i < length; i++)
    t = t OP d[i];
  *dest = t;
}
  1. Move vec_length out of loop
  2. Avoid bounds check on each iteration
  3. Accumulate in temporary
    That is, call vec_length(v) once before the loop and keep its value in a local variable, so the loop condition does not recompute vec_length(v) on every iteration.
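To make the three optimizations concrete, here is a sketch that puts an unoptimized baseline next to combine4. The scaffolding (the `vec_rec` struct, `data_t` as `long`, `OP` as `+`, `IDENT` as 0, and the helper bodies) is assumed for illustration; the notes do not give these definitions.

```c
#include <stddef.h>

/* Assumed scaffolding (not from the notes): element type, operation, identity. */
typedef long data_t;
#define IDENT 0
#define OP +

typedef struct {
    long len;
    data_t *data;
} vec_rec, *vec_ptr;

long vec_length(vec_ptr v) { return v->len; }
data_t *get_vec_start(vec_ptr v) { return v->data; }

/* Bounds-checked element access: returns 0 on an out-of-range index. */
int get_vec_element(vec_ptr v, long index, data_t *dest)
{
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* Baseline: calls vec_length on every iteration, bounds-checks every
 * access, and reads and writes *dest in memory at every step. */
void combine1(vec_ptr v, data_t *dest)
{
    long i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}

/* Optimized: length hoisted, direct array access, local accumulator. */
void combine4(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);
    data_t *d = get_vec_start(v);
    data_t t = IDENT;
    for (i = 0; i < length; i++)
        t = t OP d[i];
    *dest = t;
}
```

Both functions compute the same result; combine4 simply does less work per iteration.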
image-20211224095842575

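Although the program is written in order, as a linear sequence of instructions, it can be broken into parts. Some parts depend on each other and some are independent, so the processor can start later parts without waiting for earlier ones to finish → instruction-level parallelism.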

image-20211224101628882
  1. Break multiplication up into smaller steps that can be done one after another,
  2. and have separate dedicated hardware for each of those stages.
  3. Do pipelining: when one operation moves from stage 1 to stage 2, the next operation can enter stage 1.

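As the figure above shows, a multiplication is split into three stages. a*b and a*c are independent of each other: at time 1, a*b goes through stage 1; at time 2 it moves on to stage 2, which frees stage 1, so a*c can start at time 2.

However, p1*p2 can only start once both a*b and a*c have finished. Unpipelined, the three multiplications would take 3*3 = 9 time steps; pipelined, they finish in 7.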
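The 7-versus-9 count can be checked with a small scheduling sketch. It assumes a single multiplier, 1-based time steps, and an operation that occupies `latency` consecutive steps; the function names are hypothetical.

```c
/* Returns the later of two time steps. */
static long later(long a, long b) { return a > b ? a : b; }

/* Time step at which p3 finishes for: p1 = a*b; p2 = a*c; p3 = p1*p2.
 * `issue` is the gap between the starts of two operations on the unit:
 * 1 for a fully pipelined multiplier, `latency` for an unpipelined one. */
long three_multiplies(long latency, long issue)
{
    long p1_start = 1;
    long p1_done  = p1_start + latency - 1;
    long p2_start = p1_start + issue;          /* independent of p1 */
    long p2_done  = p2_start + latency - 1;
    /* p3 needs both inputs, and the unit must be free to accept it. */
    long p3_start = later(later(p1_done, p2_done) + 1, p2_start + issue);
    return p3_start + latency - 1;
}
```

With a 3-stage multiplier, `three_multiplies(3, 1)` models the pipelined case and `three_multiplies(3, 3)` the unpipelined one.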

image-20211224103054675
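  • Latency is the number of cycles from the start of an instruction to its completion. Cycles/issue is the minimum number of cycles between the starts of two successive operations, which pipelining makes smaller than the latency.

  • Division is not pipelined.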
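The difference between the two numbers shows up when counting cycles for a chain of operations. The sketch below is a simplified model with one functional unit and hypothetical function names; it is not a description of any particular processor.

```c
/* n operations where each one needs the previous result:
 * nothing can overlap, so the cost is set by latency alone. */
long cycles_dependent(long n, long latency)
{
    return n * latency;
}

/* n independent operations on one unit: a new one can start every
 * `issue` cycles, and the last one still needs `latency` cycles. */
long cycles_independent(long n, long latency, long issue)
{
    if (n <= 0)
        return 0;
    return latency + (n - 1) * issue;
}
```

For an unpipelined unit such as the divider, issue equals latency, so independent operations gain nothing over dependent ones in this model.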

Loop Unrolling

Rather than processing one element per loop iteration, we process several.

The original one:

image-20211224104639295

1. Loop Unrolling (2x1)

image-20211224103941468
  • Process two elements per iteration (i += 2)

[image]

  • With i += 2, only the addition time improves
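A sketch of 2x1 unrolling on a plain array follows. `data_t`, `OP`, and `IDENT` are assumed here to be `long`, `*`, and 1, and the function takes a pointer and length instead of a `vec_ptr` to stay self-contained.

```c
/* Assumed for illustration: integer product. */
typedef long data_t;
#define IDENT 1
#define OP *

/* Unroll by 2 with a single accumulator. */
data_t unroll2x1(const data_t *d, long length)
{
    long i;
    long limit = length - 1;
    data_t x = IDENT;

    /* Combine two elements per iteration. */
    for (i = 0; i < limit; i += 2)
        x = (x OP d[i]) OP d[i + 1];

    /* Finish any remaining element when length is odd. */
    for (; i < length; i++)
        x = x OP d[i];

    return x;
}
```

Each iteration still performs two OPs in sequence on x, so unrolling alone cuts loop overhead but leaves the dependency chain through the accumulator intact.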

2. Loop Unrolling with Reassociation (2x1a)

image-20211224104345525

Result:
image-20211224104527975
Reason:
image-20211224104732447
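  • The earlier version must finish the first part of the computation before it can continue; in this version the work runs in parallel, and the data loads can overlap as well
  • Note: this is integer arithmetic. With floating-point arithmetic the result may change because of:
    1. rounding
    2. potential overflow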
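The rounding issue can be seen by running the same input through both groupings. This sketch assumes `data_t` is `double` with IEEE 754 arithmetic, `OP` is `+`, and that the compiler is not run with flags that permit reassociation (such as `-ffast-math`); the function names are illustrative.

```c
/* Assumed for illustration: double-precision sum. */
typedef double data_t;
#define IDENT 0.0
#define OP +

/* Left-to-right grouping: every OP waits on the accumulator. */
data_t unroll2x1(const data_t *d, long length)
{
    long i;
    data_t x = IDENT;
    for (i = 0; i < length - 1; i += 2)
        x = (x OP d[i]) OP d[i + 1];
    for (; i < length; i++)
        x = x OP d[i];
    return x;
}

/* Reassociated grouping: d[i] OP d[i+1] does not involve x,
 * so it can be computed before the previous x is ready. */
data_t unroll2x1a(const data_t *d, long length)
{
    long i;
    data_t x = IDENT;
    for (i = 0; i < length - 1; i += 2)
        x = x OP (d[i] OP d[i + 1]);
    for (; i < length; i++)
        x = x OP d[i];
    return x;
}
```

For the input {1e20, 0.0, -1e20, 3.14}, the left-to-right version cancels the two large values first and returns 3.14, while the reassociated version adds 3.14 to -1e20 first, where it is lost to rounding, and returns 0.0. This is why a compiler will not apply the transformation to floating-point code on its own.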

3. Loop Unrolling with Separate Accumulators (2x2)

image-20211224112407344

Get more parallelism going with multiple accumulators.

The array has odd-numbered and even-numbered elements → we can compute separate sums or products over these two sets of elements → at the very end, combine them.
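The even/odd split can be sketched as follows, again on a plain array with `data_t`, `OP`, and `IDENT` assumed to be `long`, `*`, and 1 for illustration.

```c
/* Assumed for illustration: integer product. */
typedef long data_t;
#define IDENT 1
#define OP *

/* Unroll by 2 with two independent accumulators. */
data_t unroll2x2(const data_t *d, long length)
{
    long i;
    long limit = length - 1;
    data_t x0 = IDENT;   /* combines even-indexed elements */
    data_t x1 = IDENT;   /* combines odd-indexed elements  */

    for (i = 0; i < limit; i += 2) {
        x0 = x0 OP d[i];
        x1 = x1 OP d[i + 1];
    }

    /* Finish any remaining element when length is odd. */
    for (; i < length; i++)
        x0 = x0 OP d[i];

    /* Combine the two partial results at the very end. */
    return x0 OP x1;
}
```

The two update statements in the loop body do not depend on each other, so they form two separate dependency chains that the pipelined unit can interleave.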

image-20211224113051616 image-20211224113155365

Unrolling & Accumulating

image-20211224114106901 image-20211224114503508 image-20211224114726423

By picking the best parameters (unrolling factor and number of accumulators), we can get very close to the throughput bound of the processor.

The original CPEs were 20 and 10 clock cycles → now: 1 and 0.5
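As one point in that parameter space, here is a sketch with unrolling factor 4 and 4 accumulators. The choice of 4 is only an example and is not claimed to be the best setting for any particular processor; `data_t`, `OP`, and `IDENT` are assumed to be `long`, `+`, and 0.

```c
/* Assumed for illustration: integer sum. */
typedef long data_t;
#define IDENT 0
#define OP +

/* Unroll by 4 with four independent accumulators. */
data_t unroll4x4(const data_t *d, long length)
{
    long i;
    long limit = length - 3;
    data_t x0 = IDENT, x1 = IDENT, x2 = IDENT, x3 = IDENT;

    /* Four independent dependency chains per iteration. */
    for (i = 0; i < limit; i += 4) {
        x0 = x0 OP d[i];
        x1 = x1 OP d[i + 1];
        x2 = x2 OP d[i + 2];
        x3 = x3 OP d[i + 3];
    }

    /* Finish the up to three remaining elements. */
    for (; i < length; i++)
        x0 = x0 OP d[i];

    return (x0 OP x1) OP (x2 OP x3);
}
```

Adding accumulators helps only until the functional units are saturated; past that point extra accumulators just increase register pressure.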

image-20211224114956310

  • Limited only by throughput of functional units
  • Up to 42X improvement over original, unoptimized code
image-20211224115511699

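At bottom, all of this is about exploiting the pipeline. The pipeline matters because the CPU is so fast that reading data, from wherever it lives, cannot keep up with the speed of computation. Keeping more values in registers works like a cache: it cuts access time and keeps the pipeline as busy as possible.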

image-20211224120432143 image-20211224120812021

These instructions only modify registers. The processor keeps multiple copies of the registers, and speculative values are held as pending updates to them (ordered so that the correct speculative values sit ahead of the wrong ones).

→ So when it comes time to cancel a speculation, the processor just cancels all those pending updates.

Register renaming unit: the CPU contains a large block called the register renaming unit, which holds multiple copies of the registers as results accumulate. There are several hundred register copies (virtual registers) that keep pending updates to the actual registers.
