AI Compute Fundamentals -- The Roofline Model

Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures


Background 1: Amdahl's Law. Gene Amdahl made an insightful observation about how much speeding up one part of a system improves the performance of the system as a whole. This observation became known as Amdahl's Law.
Background 2: David Patterson, 2017 Turing Award winner, professor of computer science at UC Berkeley, and Distinguished Engineer at Google. A towering figure in computer architecture, Patterson led the Berkeley team that designed RISC-I, laying the foundation for the RISC (reduced instruction set computer) architecture, which Sun Microsystems (later acquired by Oracle) chose for its SPARC processors. Together with John Hennessy, former president of Stanford University and current chairman of Alphabet (Google's parent company), he wrote Computer Architecture: A Quantitative Approach, which pioneered an analytical, scientific framework for the field and remains a classic textbook to this day. After retiring from Berkeley in 2016, Patterson joined the Google Brain team as a Distinguished Engineer and made major contributions to two generations of TPUs.

In March 2018, David Patterson and John Hennessy jointly received the 2017 ACM Turing Award for pioneering a systematic, quantitative approach to the design and evaluation of computer architectures, with a lasting impact on the microprocessor industry.

Amdahl's Law

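Amdahl's Law can be written as S = 1 / ((1 - p) + p / s), where p is the fraction of execution time that benefits from an improvement and s is the speedup of that fraction. A minimal sketch:

```python
def amdahl_speedup(parallel_fraction, speedup_factor):
    """Overall speedup when only a fraction of the work is sped up.

    parallel_fraction: fraction of execution time that benefits (0..1)
    speedup_factor: speedup applied to that fraction
    """
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / speedup_factor)

# Speeding up 95% of a program by 20x yields only ~10.3x overall,
# because the remaining 5% serial part dominates.
print(round(amdahl_speedup(0.95, 20), 1))
```

The takeaway is the one the paper builds on: the unimproved part of a system bounds the achievable gain, no matter how large the local speedup.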

1. Abstract

We propose an easy-to-understand, visual performance model that offers insights to programmers and architects on improving parallel software and hardware for floating point computations.

2. The roofline model

We believe that for the recent past and foreseeable future, off-chip memory bandwidth will often be the constraining resource [23]. Hence, we want a model that relates processor performance to off-chip memory traffic.

**Operational intensity**: operations per byte of DRAM traffic. Operational intensity suggests the DRAM bandwidth needed by a kernel on a particular computer.
The Y-axis is attainable floating-point performance; the X-axis is operational intensity.

Attainable GFLOP/s = min(Peak Floating-Point Performance, Peak Memory Bandwidth × Operational Intensity)

At first, attainable performance rises with operational intensity; past the ridge point, it stays flat at peak performance.
Depending on which side of the ridge point a kernel falls, it is either memory-bound or compute-bound.
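The min formula above is easy to evaluate directly. A small sketch (the 100 GFLOP/s peak and 25 GB/s bandwidth below are hypothetical numbers, not taken from the paper):

```python
def attainable_gflops(peak_gflops, peak_bw_gb_s, operational_intensity):
    """Roofline bound: min(peak compute, bandwidth * operational intensity)."""
    return min(peak_gflops, peak_bw_gb_s * operational_intensity)

def ridge_point(peak_gflops, peak_bw_gb_s):
    """Minimum operational intensity (flops/byte) needed to reach peak."""
    return peak_gflops / peak_bw_gb_s

# Hypothetical machine: 100 GFLOP/s peak, 25 GB/s DRAM bandwidth.
peak, bw = 100.0, 25.0
print(ridge_point(peak, bw))  # 4.0 flops/byte

for oi in (0.5, 4.0, 16.0):
    bound = attainable_gflops(peak, bw, oi)
    kind = "memory-bound" if oi < ridge_point(peak, bw) else "compute-bound"
    print(oi, bound, kind)
```

Kernels left of the ridge point sit under the sloped bandwidth line (memory-bound); kernels at or right of it sit under the flat compute roof (compute-bound).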

The right-hand figure shows that once a roofline is drawn for a machine, it can be reused across different kernels, since the roofline itself does not change.
Figure 1b compares the roofline models of two systems. As expected, the ridge point shifts right, from 1.0 for the Opteron X2 to 4.4 for the Opteron X4; hence, to reach peak performance on the X4, a kernel needs an operational intensity greater than 4.4.

The x-coordinate of the ridge point is the minimum operational intensity required to achieve peak performance. If the ridge point is far to the right, only kernels with very high operational intensity can reach the machine's peak; if it is far to the left, almost any kernel can.

4. Adding ceilings to the roofline model

The roofline model gives an upper bound on performance. Suppose your program is running far below its roofline: which optimizations should you perform, and in what order?

To reduce computational bottlenecks:

  1. Improve instruction-level parallelism (ILP) and apply SIMD.
  2. Balance floating-point operation mix.

To reduce memory bottlenecks:

  1. Restructure loops for unit-stride accesses. Optimizing for unit-stride memory accesses engages hardware prefetching, which significantly increases memory bandwidth.
  2. Ensure memory affinity. Most microprocessors today include a memory controller on the same chip as the processors. If the system has two multicore chips, some addresses go to the DRAM local to one multicore chip, while the rest must cross a chip interconnect to access the DRAM local to the other chip.
  3. Use software prefetching
    The first figure adds compute ceilings, the second adds memory ceilings, and the third plots both together.
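The ceilings idea can be sketched numerically: each optimization that is *not* applied lowers the relevant roof. The reduction factors below are invented for illustration, not measured on any real machine:

```python
# Hypothetical machine and illustrative ceiling factors (assumptions).
peak_gflops, peak_bw = 100.0, 25.0

compute_ceilings = {
    "peak (SIMD + ILP)": peak_gflops,
    "without SIMD": peak_gflops / 4,   # e.g. 4-wide vector units left unused
    "without ILP": peak_gflops / 8,    # roughly one operation in flight at a time
}
memory_ceilings = {
    "peak bandwidth": peak_bw,
    "without NUMA affinity": peak_bw / 2,
    "without unit-stride/prefetch": peak_bw / 4,
}

def bound(gflops_ceiling, bw_ceiling, oi):
    """Attainable performance under a given pair of ceilings."""
    return min(gflops_ceiling, bw_ceiling * oi)

# A kernel with OI = 2 flops/byte, stuck under the lowest ceilings:
print(bound(compute_ceilings["without ILP"],
            memory_ceilings["without unit-stride/prefetch"], 2.0))
```

The ordering of ceilings suggests an optimization order: start with the ceiling your kernel is currently pinned beneath.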

The figure above shows four FP kernels; each has its own operational intensity.
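A kernel's operational intensity can be counted by hand. As an illustration (not one of the paper's four kernels), take the STREAM triad a[i] = b[i] + s * c[i], assuming 8-byte doubles and ignoring write-allocate traffic:

```python
def operational_intensity(flops, bytes_moved):
    """Flops per byte of DRAM traffic for a kernel."""
    return flops / bytes_moved

# STREAM triad a[i] = b[i] + s * c[i] on doubles (8 bytes each):
# 2 flops per element; 2 reads + 1 write = 24 bytes (write-allocate ignored).
n = 1_000_000
oi_triad = operational_intensity(2 * n, 24 * n)
print(round(oi_triad, 3))  # 0.083

# Such a low-intensity kernel is memory-bound on almost any machine
# (hypothetical 100 GFLOP/s peak, 25 GB/s bandwidth):
peak, bw = 100.0, 25.0
print(min(peak, bw * oi_triad))  # ~2.08 GFLOP/s, far below peak
```

Note that operational intensity depends on DRAM traffic, not raw loads and stores: caching can raise a kernel's effective intensity well above this kind of naive count.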

  1. Intel. Intel includes a snoop filter to prevent unnecessary coherency traffic on the bus. If the working set is small enough for the hardware to filter, the snoop filter nearly doubles the delivered memory bandwidth.
  2. AMD
  3. IBM

Fallacy: The model does not take into account all features of modern processors, such as caches or prefetching.
Fallacy: Doubling cache size will increase operational intensity
Fallacy: The model doesn’t account for the long memory latency
Fallacy: The model ignores integer units in floating-point programs, which can limit performance
Fallacy: The model is limited to easily optimized kernels that never hit in the cache
Fallacy: The model is limited to floating-point programs

Conclusions

This paper describes a simple and visual model to help see which systems would be a good match to important kernels, or conversely, to see how to change kernel code or hardware to run desired kernels well.
