PAPER NOTES: Roofline: an insightful visual performance model for multicore architectures

• GATHER: (30MIN)20180327
• PAPER INFO:
Roofline: an insightful visual performance model for multicore architectures -Y2009 -C1059
S Williams
A Waterman
D Patterson
Communications of the ACM, 2009

• CITATION
Considering the facts that the intensive memory access is one of the prominent features of hash tables and the memory bandwidth is easy to be a bottleneck on such systems [18], in this part, we investigate the impact of the memory hierarchy on performance. The histograms in Fig. 8 illustrate how the throughput varies with the different number of elements initialized.
---- Concurrent hash tables on multicore machines: Comparison, evaluation and implications Y2018


5.1 Revised Roofline Model for Caffeine
The roofline model [31] is initially proposed in multicore systems to provide insight analysis of attainable performance by relating processors’ peak computation performance and the off-chip memory traffic. 
---- Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks -Y2017


We do quantitative comparison between previous parallel local search operators, by using the Roofline performance model [41].
Roofline model is a performance analysis model for modern processors, especially focusing on the floating point performance of parallel processors [41]. This model is based on the relation of processor performance to off-chip memory traffic. The terminology operational intensity is used to mean operations per byte of DRAM traffic, for example, the floating point operations per byte (FLOP/byte). The theoretical performance of a kernel on a device could be calculated by multiplying operational intensity by the main memory bandwidth (byte/s). The performance is bounded by two types of ceilings: computational ceilings and bandwidth ceilings. We could assume that the operational intensity is a column. If it hits the computational ceiling, which represents the theoretical peak performance of the device, the kernel is compute-bound; otherwise, it is memory-bound. This model offers us with insights for algorithm optimization on parallel processors.
To achieve more accurate performance prediction, we could add memory bandwidth and computational ceilings to the Roofline model [41]. In this paper, we focus on the performance optimization of the GPU-based ILS algorithm. So, the performance prediction is left for the future work.
---- Optimization of parallel iterated local search algorithms on graphics processing unit -Y201605 -C29
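
The bound described in the quoted passage above can be sketched in a few lines of Python. This is only an illustration, not code from either paper: the function names and the machine figures (100 GFLOP/s peak, 25 GB/s DRAM bandwidth) are hypothetical.

```python
def attainable_gflops(operational_intensity, peak_gflops, bandwidth_gb_s):
    """Roofline bound: the lower of the compute ceiling and the bandwidth ceiling."""
    return min(peak_gflops, operational_intensity * bandwidth_gb_s)


def classify(operational_intensity, peak_gflops, bandwidth_gb_s):
    """Label a kernel by the ceiling its operational intensity runs into."""
    if operational_intensity * bandwidth_gb_s >= peak_gflops:
        return "compute-bound"
    return "memory-bound"


# Hypothetical machine: 100 GFLOP/s peak, 25 GB/s DRAM bandwidth.
PEAK, BW = 100.0, 25.0
for oi in (0.5, 2.0, 8.0):  # operational intensity in FLOP/byte
    print(oi, attainable_gflops(oi, PEAK, BW), classify(oi, PEAK, BW))
```

With these assumed numbers, the kernels at 0.5 and 2.0 FLOP/byte sit under the slanted (bandwidth) part of the roof, while the kernel at 8.0 FLOP/byte reaches the flat (compute) part.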


4 PERFORMANCE: ROOFLINES, RESPONSE TIME, AND THROUGHPUT
To illustrate the performance of the six apps on the three processors, we adapt the Roofline Performance model from high-performance computing (HPC) [58]. This simple visual model is not perfect, yet it offers insights into the causes of performance bottlenecks. The assumption behind the model is that applications don’t fit in on-chip caches, so they are either computation-limited or memory bandwidth-limited. For HPC, the Y-axis is performance in floating-point operations per second, thus the peak computation rate forms the “flat” part of the roofline. The X-axis is operational intensity, measured as floating-point operations per DRAM byte accessed. Memory bandwidth is bytes per second, which turns into the “slanted” part of the roofline since (FLOPS/sec)/ (FLOPS/Byte) = Bytes/sec. Without sufficient operational intensity, a program is memory bandwidth-bound and lives under the slanted part of the roofline.
---- In-Datacenter Performance Analysis of a Tensor Processing Unit -Y201706 -C177 -ISCA '17
• ABSTRACT INTRO AND CONCLUSION NOTES
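Roughly, this is a model for evaluating floating-point performance on multicore machines.
A model that provides performance evaluation.

intro
There are many heterogeneous designs; the instruction sets differ, but the basic ideas are the same.
Multicore means design diversity, e.g. many simple cores, multithreading, explicitly addressed local memory. In short, it is complicated.
Diversity means uncertainty, so the paper proposes an easy-to-understand model that offers performance guidance.
The model does not need to be perfect, only insightful, even if it has a few quirks.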
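conclusion
Moving from sequential to parallel computing increases design diversity.
The paper presents a simple, visual model that shows how to tune a kernel.
It targets floating-point kernels that do not fit in the on-chip caches.
It shows that operational intensity, i.e. floating-point operations per byte of DRAM traffic, is an important parameter for both kernels and multicore computers.
The model is applied to 7 older CPUs and four newer architectures.
ridge point: the minimum operational intensity needed to reach peak performance; a better performance predictor than clock rate or peak performance.
In short, Roofline can be used together with performance counters, and can also be applied to GPUs and vector processors, as well as to L3 cache, I/O, and so on.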
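
The ridge point mentioned above can be computed directly as peak performance divided by memory bandwidth. The sketch below assumes two made-up machines (`machine_a`, `machine_b`); the figures are illustrative and are not taken from the paper.

```python
def ridge_point(peak_gflops, bandwidth_gb_s):
    """Minimum operational intensity (FLOP/byte) at which peak performance is reachable."""
    return peak_gflops / bandwidth_gb_s


# Hypothetical machines: (peak GFLOP/s, DRAM bandwidth GB/s).
machines = {
    "machine_a": (75.0, 15.0),
    "machine_b": (100.0, 10.0),
}

for name, (peak, bw) in machines.items():
    print(name, "ridge point:", ridge_point(peak, bw), "FLOP/byte")
```

Under these assumptions `machine_b` has the higher peak, but a kernel needs twice the operational intensity to reach it, which is why the ridge point can say more about achievable performance than the peak figure alone.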


• QUICK: (30MIN)20180327
• CONTENT, ARCHITECTURE AND INTERNAL-RELATIONSHIP
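2. Existing performance models
stochastic analytical model
statistical performance model. These two can only predict program performance; they give no guidance on how to improve it.
bound and bottleneck analysis:
Amdahl's Law: but the serial portion of the parallel program has to be known in advance.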
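
As a reminder of why Amdahl's Law needs the serial portion up front, here is a minimal sketch of the standard formula, speedup = 1 / (s + (1 - s) / n). The function name and the sample values are illustrative assumptions.

```python
def amdahl_speedup(serial_fraction, n_processors):
    """Upper bound on speedup when a fixed fraction of the work stays serial."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


# The bound cannot be evaluated without knowing serial_fraction beforehand.
for s in (0.0, 0.1, 0.5):
    print(s, round(amdahl_speedup(s, 8), 3))
```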
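3. Roofline
off-chip memory bandwidth is the important constraint
uses operational intensity in place of arithmetic intensity and machine balance
Reasons:
1. The traffic being measured changes from processor-cache traffic to cache-DRAM traffic.
2. The model will later be applied to kernels whose operations may not be arithmetic, so a term more general than "arithmetic" is needed.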
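
To make the cache-DRAM definition concrete, operational intensity can be estimated by counting operations and DRAM bytes for a kernel. The sketch below uses a hypothetical double-precision vector update `y[i] = a * x[i] + y[i]` and assumes the vectors are too large for any cache reuse, so every element of `x` and `y` is read from DRAM once and `y` is written back once.

```python
def operational_intensity(flops, dram_bytes):
    """Operations per byte of traffic between the caches and DRAM."""
    if dram_bytes <= 0:
        raise ValueError("dram_bytes must be positive")
    return flops / dram_bytes


n = 1_000_000               # vector length (assumed)
flops = 2 * n               # one multiply and one add per element
dram_bytes = 3 * 8 * n      # read x, read y, write y; 8 bytes per double

print(operational_intensity(flops, dram_bytes))  # 2/24 FLOP/byte
```

Counting processor-cache traffic instead would give a different number for kernels that do reuse cached data, which is the distinction the first reason above points at.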

• HIGHLIGHT
• NOTES
• READ: 
• NOTES
• QUESTIONS
• POTENTIAL ISSUES
• NOTES
• READ FURTHER: REFERENCE 