Computer Architecture Optimization: Computer Architecture and Program Optimization.ppt

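"Computer Architecture and Program Optimization.ppt" was shared by a member and can be read online. For more related material, search 人人文库网 for "Computer Architecture and Program Optimization.ppt" (116 pages).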

Computer Architecture and Program Optimization

Introduction to Intel 64 Architectures Optimization

Main Purpose: introduction to processor architecture; introduction to SIMD instructions (SSE …

max(A,B)

    cmp A, B        ; Condition
    jbe L30         ; Conditional branch
    mov ebx, A      ; ebx holds X
    jmp L31         ; Unconditional branch
    L30: mov ebx, B
    L31:

    xor ebx, ebx    ; Clear ebx
    cmp A, B
    setle bl        ; ebx = 0 or 1 (or use the complement condition)
    sub ebx, 1      ; ebx = 11...11 or 00...00
    and ebx, A      ; ebx = A-B or 0 (the operand must hold A-B for this to be true)
    add ebx, B      ; ebx = A or B

Branch Prediction
    Spin-Wait and Idle Loops
    All branch targets should be 16-byte aligned.
    Unroll small loops until the overhead of the branch and induction variable accounts (generally) for less than 10% of the loop's execution time.

Fetch: … i<BUFF_SIZE; i++) sum += buff[i];

Sandy Bridge only

Traversing through pointers

L1D Cache Bank Conflict

L1D Cache Bank Conflict (continued)

Minimize Register Spills

Data Layout Optimizations: Pad data structures defined in the source code so that every data element is aligned to a natural operand size address boundary.

Decomposing an Array

Locality Enhancement: Optimization techniques such as blocking, loop interchange, loop skewing, and packing are best done by the compiler. Optimize data structures either to fit in one-half of the first-level cache or in the second-level cache; turn on loop optimizations in the compiler to enhance locality for nested loops.

Minimizing Bus Latency: If there is a blend of reads and writes on the bus, changing the code to separate these bus transactions into read phases and write phases can help performance. Software should favor data access patterns that result in higher concentrations of cache miss patterns.

Non-Temporal Store Bus Traffic: The data transfer rate for bus write transactions is higher if 64 bytes are written out to the bus at a time.

Prefetching: First-Level Data Cache Prefetching; Avoid Fetching Unneeded Lines; Prefetching for 2-Level Cache

1st-Level DCache Prefetching

Avoid Fetching Unneeded Lines

For L1 Hardware Prefetch
    Method 1: Organize the data so consecutive accesses can usually be found in the same 4-KByte page. Access the data in constant strides, forward or backward (IP Prefetcher).
    Method 2: Organize the data in consecutive lines. Access the data in increasing addresses, in sequential cache lines.

Prefetching for 2-Level Cache
    Streamer: loads data or instructions from memory to the second-level cache. To use the streamer, organize the data in blocks of 128 bytes, aligned on 128 bytes.

Example of Latency Hiding: Memory Access Latency and Execution Without Prefetch

Example of Latency Hiding: Memory Access Latency and Execution With Prefetch

Spread Prefetch Instructions: Rearranging PREFETCH instructions may yield a noticeable speedup for code that stresses the cache resource.

Multi-core … 2950 Tick 48 bit; max Latency 15000 tick

Using bit wizardry
    References: Matters Computational: Ideas, Algorithms, Source Code, Jörg Arndt; Hacker's Delight, Henry S. Warren, Jr.; HAKMEM (AIM-239), MIT
    QuadCore Intel Core 2 Quad Q9550, 2833 MHz; Throughput 3.12 Gbit/s; Break out throughput 1090 Tick 288 bit; 212 Tick 48 bit; max Latency 1200 tick
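
The branch-free max(A,B) sequence from the branch slides is a small instance of this kind of bit manipulation. The following is a minimal C sketch of the same idea, not the slides' own code: it assumes unsigned 32-bit operands, and the function name max_u32_branchless is hypothetical.

```c
#include <stdint.h>

/* Branch-free max of two unsigned 32-bit values.
   Mirrors the setcc / sub / and / add sequence:
   build an all-ones or all-zeros mask, then select with it. */
static uint32_t max_u32_branchless(uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)(a <= b); /* 1 when a <= b, else 0 (the setcc step) */
    mask -= 1u;                         /* 0 when a <= b, else 0xFFFFFFFF */
    /* When a > b the mask keeps (a - b), and adding b yields a.
       When a <= b the mask clears the difference, leaving b.
       Unsigned wrap-around in (a - b) is harmless because it is masked off. */
    return ((a - b) & mask) + b;
}
```

Calling max_u32_branchless(3, 7) and max_u32_branchless(7, 3) both return 7. Compilers often emit a conditional move for the plain `a > b ? a : b` form, so a hand-written version like this is worth keeping only if measurement shows a gain.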

Look up table
    QuadCore Intel Core 2 Quad Q9550, 2833 MHz; Throughput 19.1 Gbit/s; Break out throughput 280 Tick 288 bit; 68 Tick 48 bit; max Latency 500 tick
    Reference: A Painless Guide to CRC Error Detection Algorithms, Index V3.00, Ross N. Williams

Decoder

Viterbi Algorithm: Original Program; C Optimization; SIMD Optimization

Viterbi Algorithm

Original Program
    QuadCore Intel Core 2 Quad Q9550, 2833 MHz; Throughput 11.1 Mbit/s; Break out throughput 280K Tick 288 bit; 68K Tick 48 bit; max Latency 300K tick

SIMD Optimization

SIMD Optimization (continued)

The End

Thank you
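
The "Look up table" slide cites Ross Williams's CRC guide, which suggests the benchmarked routine is a table-driven CRC. The slides do not give the polynomial or CRC width, so the sketch below assumes the common reflected CRC-32 (polynomial 0xEDB88320); all function names are hypothetical. It contrasts a bit-at-a-time loop with a one-lookup-per-byte loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed polynomial: reflected CRC-32. The slides do not state one. */
#define CRC32_POLY 0xEDB88320u

static uint32_t crc_table[256];

/* Entry i holds the result of pushing byte value i through 8 shift steps. */
static void crc32_init_table(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1u) ? (c >> 1) ^ CRC32_POLY : (c >> 1);
        crc_table[i] = c;
    }
}

/* Reference version: eight shift/xor steps per input byte. */
static uint32_t crc32_bitwise(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t n = 0; n < len; n++) {
        crc ^= buf[n];
        for (int k = 0; k < 8; k++)
            crc = (crc & 1u) ? (crc >> 1) ^ CRC32_POLY : (crc >> 1);
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Table-driven version: one table lookup replaces the inner 8-step loop.
   crc32_init_table() must be called once before using this function. */
static uint32_t crc32_table_driven(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t n = 0; n < len; n++)
        crc = crc_table[(crc ^ buf[n]) & 0xFFu] ^ (crc >> 8);
    return crc ^ 0xFFFFFFFFu;
}
```

After crc32_init_table(), both functions return 0xCBF43926 for the nine ASCII bytes "123456789", the standard check value for this CRC-32 variant. The table costs 1 KByte of data, so it fits comfortably in the first-level cache.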
