CMU Computer Systems: Program Optimization

Optimization

  • Overview
  • Generally Useful Optimizations
    • Code motion/precomputation
    • Strength reduction
    • Sharing of common subexpressions
    • Removing unnecessary procedure calls
  • Optimization Blockers
    • Procedure calls
    • Memory aliasing
  • Exploiting Instruction-Level Parallelism
  • Dealing with Conditionals
Performance Realities
  • There’s more to performance than asymptotic complexity
  • Constant factors matter too!
    • Easily see 10:1 performance range depending on how code is written
    • Must optimize at multiple levels:
      • algorithm, data representations, procedures, and loops
  • Must understand system to optimize performance
    • How programs are compiled and executed
    • How modern processors + memory systems operate
    • How to measure program performance and identify bottlenecks
      • How to improve performance without destroying code modularity and generality
Optimizing Compilers
  • Provide efficient mapping of program to machine
  • Don’t (usually) improve asymptotic efficiency
  • Have difficulty overcoming “optimization blockers”
Limitations of Optimizing Compilers
  • Operate under fundamental constraint
    • Must not cause any change in program behavior
    • Often prevents the compiler from making optimizations that would affect behavior only under pathological conditions
  • Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
  • Most analysis is performed only within procedures
    • Whole-program analysis is too expensive in most cases
    • Newer versions of GCC do interprocedural analysis within individual files
  • Most analysis is based on static information
  • When in doubt, the compiler must be conservative
Generally Useful Optimizations
  • Optimizations that you or the compiler should do regardless of processor / compiler (a short C sketch of all three follows this list)
  • Code Motion
    • Reduce frequency with which computation performed
      • If it will always produce same result
      • Especially moving code out of loop
  • Reduction in Strength
    • Replace costly operation with simpler one
    • Shift, add instead of multiply or divide
    • Recognize sequence of products
  • Share Common Subexpressions
    • Reuse portions of expressions
    • GCC will do this with -O1
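
A minimal C sketch of all three transformations, assuming a hypothetical n×n row-major array (function and variable names are illustrative):

    /* Code motion: hoist the loop-invariant product n*i out of the loop. */
    void set_row(double *dest, double *src, long i, long n) {
        long base = n * i;                 /* computed once instead of n times */
        for (long j = 0; j < n; j++)
            dest[base + j] = src[j];
    }

    /* Strength reduction: replace the per-iteration multiply n*i with a
       running sum, and a multiply by 16 with a shift. */
    long times16(long x) { return x << 4; }    /* x * 16 */
    void clear_matrix(int *a, long n) {
        long ni = 0;                       /* tracks n*i without multiplying */
        for (long i = 0; i < n; i++) {
            for (long j = 0; j < n; j++)
                a[ni + j] = 0;
            ni += n;                       /* recognize sequence of products */
        }
    }

    /* Common subexpression: compute i*n + j once, reuse for all four
       neighbors of element (i, j). */
    double neighbor_sum(const double *val, long i, long j, long n) {
        long inj = i * n + j;              /* shared subexpression */
        return val[inj - n] + val[inj + n] /* up, down */
             + val[inj - 1] + val[inj + 1];/* left, right */
    }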
Optimization Blocker #1: Procedure Calls
  • Why couldn’t the compiler move strlen out of the inner loop? (see the sketch after this list)
    • Procedure may have side effects
      • Alters global state each time called
    • Function may not return same value for given arguments
      • Depends on other parts of global state
      • Procedure lower could interact with strlen
  • Warning
    • Compiler treats procedure call as a black box
    • Weak optimizations near them
  • Remedies
    • Use of inline functions
    • Do your own code motion
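
A minimal sketch of the blocker and the remedy, using the lower-case conversion example the bullets refer to (strlen is the standard C library function; the rest is illustrative):

    #include <string.h>

    /* Blocked: the compiler cannot prove strlen(s) is loop-invariant,
       because lower1 writes to s and strlen is a black box to it, so
       strlen runs on every iteration -- quadratic overall. */
    void lower1(char *s) {
        for (size_t i = 0; i < strlen(s); i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }

    /* Remedy: do your own code motion -- call strlen once. */
    void lower2(char *s) {
        size_t len = strlen(s);            /* hoisted out of the loop */
        for (size_t i = 0; i < len; i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }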
Optimization Blocker #2: Memory Aliasing
  • Aliasing
    • Two different memory references specify single location
    • Easy to have happen in C
      • Since allowed to do address arithmetic
      • Direct access to storage structures
    • Get in habit of introducing local variables
      • Accumulating within loops
      • Your way of telling compiler not to check for aliasing (see the sketch after this list)
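
A minimal sketch, assuming a routine that sums each row of an n×n matrix a into vector b. If b happens to point into a, the two versions can produce different results, which is exactly why the compiler must be conservative:

    /* Aliased: b might overlap a, so the compiler must store to b[i]
       and reload it on every inner iteration. */
    void sum_rows1(double *a, double *b, long n) {
        for (long i = 0; i < n; i++) {
            b[i] = 0;
            for (long j = 0; j < n; j++)
                b[i] += a[i*n + j];        /* memory read + write each time */
        }
    }

    /* Local accumulator: the sum stays in a register; one store per row. */
    void sum_rows2(double *a, double *b, long n) {
        for (long i = 0; i < n; i++) {
            double val = 0;                /* accumulate in a local variable */
            for (long j = 0; j < n; j++)
                val += a[i*n + j];
            b[i] = val;
        }
    }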
Exploiting Instruction-Level Parallelism
  • Need general understanding of modern processor design
    • Hardware can execute multiple instructions in parallel
  • Performance limited by data dependencies
  • Simple transformations can yield dramatic performance improvement
    • Compilers often cannot make these transformations
    • Lack of associativity and distributivity in floating-point arithmetic
Cycles Per Element (CPE)
  • Convenient way to express performance of program that operates on vectors or lists
  • Length = n
  • In our case: CPE = cycles per OP
  • T = CPE*n + Overhead
    • CPE is slope of line
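    • Example with illustrative numbers: CPE = 3.0 and Overhead = 80 give T = 3.0·1000 + 80 = 3080 cycles for n = 1000 elements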
Superscalar Processor
  • Definition
    • A superscalar processor can issue and execute multiple instructions in one cycle. The instructions are retrieved from a sequential instruction stream and are usually scheduled dynamically.
  • Benefit
    • Without programming effort, a superscalar processor can take advantage of the instruction-level parallelism that most programs have
Pipelined Functional Units
  • Divide computation into stages
  • Pass partial computations from stage to stage
  • Stage i can start on new computation once values passed to i+1
Unrolling & Accumulating
  • Idea
    • Can unroll to any degree L
    • Can accumulate K results in parallel
    • L must be multiple of K (a 2×2 sketch with L = K = 2 follows this list)
  • Limitations
    • Diminishing returns
      • Cannot go beyond throughput limitations of execution units
    • Large overhead for short lengths
      • Finish off iterations sequentially
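
A minimal sketch of L = 2, K = 2 ("2×2") unrolling for a floating-point product reduction. The two accumulators form independent dependency chains so the multiplies can overlap; because reassociating floating-point arithmetic changes rounding, the compiler will not do this on its own (names are illustrative):

    double product_2x2(const double *d, long n) {
        double acc0 = 1.0, acc1 = 1.0;     /* K = 2 parallel accumulators */
        long i;
        for (i = 0; i + 1 < n; i += 2) {   /* unrolled by L = 2 */
            acc0 *= d[i];                  /* even elements */
            acc1 *= d[i + 1];              /* odd elements  */
        }
        for (; i < n; i++)                 /* finish off iterations sequentially */
            acc0 *= d[i];
        return acc0 * acc1;                /* combine the partial results */
    }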
Using Vector Instructions
  • Make use of AVX Instructions
    • Parallel operations on multiple data elements (see the sketch after this list)
    • See Web Aside OPT: SIMD on CS: APP web page
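
A minimal sketch using AVX intrinsics, assuming an x86-64 target built with -mavx; it multiplies two float arrays eight elements at a time, with a scalar tail (the function name is illustrative; see the Web Aside for the full treatment):

    #include <immintrin.h>

    void vec_mul(const float *a, const float *b, float *c, long n) {
        long i;
        for (i = 0; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);    /* load 8 floats */
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(c + i, _mm256_mul_ps(va, vb));
        }
        for (; i < n; i++)                         /* scalar tail */
            c[i] = a[i] * b[i];
    }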
Branch Prediction
  • Idea
    • Guess which way branch will go
    • Begin executing instructions at predicted position
      • But don’t actually modify register or memory data
Branch Misprediction Recovery
  • Performance Cost
    • Multiple clock cycles on modern processor
    • Can be a major performance limiter (a branch-free sketch follows)
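
One way to sidestep the misprediction penalty is to write a data-dependent choice so the compiler can emit a conditional move (cmov) instead of a branch; a minimal, illustrative sketch:

    /* Branchy: on random inputs the comparison mispredicts about half
       the time, paying the recovery cost each time. */
    long max_branch(long x, long y) {
        if (x < y) return y;
        return x;
    }

    /* Branch-free style: compilers typically turn the conditional
       expression into a cmov, which has nothing to mispredict. */
    long max_cmov(long x, long y) {
        return (x < y) ? y : x;
    }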
Getting High Performance
  • Good compiler and flags
  • Don’t do anything stupid
    • Watch out for hidden algorithmic inefficiencies
    • Write compiler-friendly code
      • Watch out for optimization blockers
    • Look carefully at innermost loops
  • Tune code for machine
    • Exploit instruction-level parallelism
    • Avoid unpredictable branches
    • Make code cache friendly (Covered later in course)