PAPER NOTES: Roofline: an insightful visual performance model for multicore architectures

tyskink

Category: PERFORMANCE EVALUATION. Tags: CPU, PERFORMANCE EVALUATION, PARALLEL COMPUTING, MULTI-CORE, MICRO-ARCHITECTURE

Article link: https://blog.csdn.net/u014022256/article/details/79719234


• GATHER: (30MIN)20180327
• PAPER INFO:
Roofline: an insightful visual performance model for multicore architectures -Y2009 -C1059
S Williams
A Waterman
D Patterson
Communications of the ACM, 2009

• CITATION
Considering the facts that the intensive memory access is one of the prominent features of hash tables and the memory bandwidth is easy to be a bottleneck on such systems [18], in this part, we investigate the impact of the memory hierarchy on performance. The histograms in Fig. 8 illustrate how the throughput varies with the different number of elements initialized.
---- Concurrent hash tables on multicore machines: Comparison, evaluation and implications Y2018

5.1 Revised Roofline Model for Caffeine
The roofline model [31] is initially proposed in multicore systems to provide insight analysis of attainable performance by relating processors’ peak computation performance and the off-chip memory traffic.
---- Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks -Y2017

We do quantitative comparison between previous parallel local search operators, by using the Roofline performance model [41].
Roofline model is a performance analysis model for modern processors, especially focusing on the floating point performance of parallel processors [41]. This model is based on the relation of processor performance to off-chip memory traffic. The terminology operational intensity is used to mean operations per byte of DRAM traffic, for example, the floating point operations per byte (FLOP/byte). The theoretical performance of a kernel on a device could be calculated by multiplying operational intensity by the main memory bandwidth (byte/s). The performance is bounded by two types of ceilings: computational ceilings and bandwidth ceilings. We could assume that the operational intensity is a column. If it hits the computational ceiling, which represents the theoretical peak performance of the device, the kernel is compute-bound; otherwise, it is memory-bound. This model offers us with insights for algorithm optimization on parallel processors.
To achieve more accurate performance prediction, we could add memory bandwidth and computational ceilings to the Roofline model [41]. In this paper, we focus on the performance optimization of the GPU-based ILS algorithm. So, the performance prediction is left for the future work.
---- Optimization of parallel iterated local search algorithms on graphics processing unit -Y201605 -C29
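
The bound described in the citation above can be sketched in a few lines of Python. This is a minimal illustration, not code from any of the cited papers: the function names and the machine numbers (100 GFLOP/s peak, 20 GB/s DRAM bandwidth) are hypothetical.

```python
def attainable_gflops(peak_gflops, peak_bw_gbs, intensity):
    """Roofline bound: the lower of the compute roof and bandwidth * intensity."""
    return min(peak_gflops, peak_bw_gbs * intensity)


def bound_type(peak_gflops, peak_bw_gbs, intensity):
    """Classify a kernel by which roof its operational intensity hits."""
    if peak_bw_gbs * intensity >= peak_gflops:
        return "compute-bound"
    return "memory-bound"


# Hypothetical machine, chosen only for illustration.
PEAK_GFLOPS = 100.0   # peak floating-point rate, GFLOP/s
PEAK_BW_GBS = 20.0    # peak DRAM bandwidth, GB/s

for oi in (0.5, 2.0, 8.0):  # operational intensity in FLOP/byte
    perf = attainable_gflops(PEAK_GFLOPS, PEAK_BW_GBS, oi)
    print(f"OI={oi:4.1f} FLOP/byte -> {perf:6.1f} GFLOP/s "
          f"({bound_type(PEAK_GFLOPS, PEAK_BW_GBS, oi)})")
```

On this made-up machine, the kernels with intensity 0.5 and 2.0 sit under the slanted roof, while the kernel with intensity 8.0 reaches the flat roof.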

4 PERFORMANCE: ROOFLINES, RESPONSE-TIME, AND THROUGHPUT
To illustrate the performance of the six apps on the three processors, we adapt the Roofline Performance model from high-performance computing (HPC) [58]. This simple visual model is not perfect, yet it offers insights into the causes of performance bottlenecks. The assumption behind the model is that applications don’t fit in on-chip caches, so they are either computation-limited or memory bandwidth-limited. For HPC, the Y-axis is performance in floating-point operations per second, thus the peak computation rate forms the “flat” part of the roofline. The X-axis is operational intensity, measured as floating-point operations per DRAM byte accessed. Memory bandwidth is bytes per second, which turns into the “slanted” part of the roofline since (FLOPS/sec)/(FLOPS/Byte) = Bytes/sec. Without sufficient operational intensity, a program is memory bandwidth-bound and lives under the slanted part of the roofline.
---- In-Datacenter Performance Analysis of a Tensor Processing Unit -Y201706 -C177 -ISCA '17
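
The unit argument in the quote above also explains why the slanted part is a straight line on the log-log axes that Roofline plots use. The symbols below ($P$ for attainable performance, $B$ for memory bandwidth, $I$ for operational intensity, $P_{\text{peak}}$ for peak compute rate) are notation chosen for these notes, not taken from the cited papers.

```latex
\[
\underbrace{\frac{\text{FLOP/s}}{\text{FLOP/byte}}}_{P / I}
  = \underbrace{\frac{\text{byte}}{\text{s}}}_{B}
\quad\Longrightarrow\quad
P = B \cdot I
\quad\Longrightarrow\quad
\log P = \log I + \log B
\]
\[
\text{flat part: } \log P = \log P_{\text{peak}} \quad (\text{independent of } I)
\]
```

On log-log axes the memory-bound region is therefore a line of slope 1 whose offset is set by the bandwidth, and the compute-bound region is a horizontal line.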
• ABSTRACT INTRO AND CONCLUSION NOTES
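Roughly speaking, this is a model for evaluating the floating-point performance of multicore machines.
It is a model that supports performance evaluation.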

intro
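Earlier processors came in many designs; although the instruction sets differed, the basic design ideas were the same.
Multicore brings design diversity: many simple cores, multithreading, explicitly addressed local memory. In short, it is complicated.
Diversity means uncertainty, so an easy-to-understand model is needed to give performance guidance.
The model does not have to be perfect, only insightful, even if it has a few quirks.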
conclusion
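Moving from sequential computing to parallel computing increases diversity.
The paper presents a simple, visual model that shows how to tune a kernel.
It targets floating-point kernels that do not fit in cache.
It shows that operational intensity, the number of floating-point operations per byte transferred from DRAM, is an important parameter for both kernels and multicore computers.
The model was applied to seven older CPUs and four newer architectures.
Ridge point: the minimum operational intensity needed to reach maximum performance; a better predictor of performance than clock rate or peak performance.
In short, Roofline can be used together with performance counters, and it can also be applied to GPUs and vector processors, as well as to L3 cache, I/O, and so on.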
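
To make the ridge point concrete, here is a small sketch. The two machines and their numbers are hypothetical and chosen only to show that equal peak performance does not imply an equal ridge point.

```python
def ridge_point(peak_gflops, peak_bw_gbs):
    """Operational intensity (FLOP/byte) where the slanted and flat roofs meet."""
    return peak_gflops / peak_bw_gbs


# Hypothetical machines: (peak GFLOP/s, peak DRAM bandwidth in GB/s).
machines = {
    "machine_a": (100.0, 20.0),
    "machine_b": (100.0, 50.0),
}

for name, (peak, bw) in machines.items():
    print(f"{name}: ridge point = {ridge_point(peak, bw):.1f} FLOP/byte")
```

Both machines have the same peak rate, but machine_b reaches it at a lower operational intensity, so more kernels can hit peak on it. This is the sense in which the ridge point says more than clock rate or peak performance alone.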

• QUICK: (30MIN)20180327
• CONTENT, ARCHITECTURE AND INTERNAL-RELATIONSHIP
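2. Existing performance models
stochastic analytical model
statistical performance model
These two can only assess program performance; they give no guidance on how to improve it.
bound and bottleneck analysis:
Amdahl's Law: but it requires knowing the serial fraction of the parallel program in advance.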
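
As a reminder of why Amdahl's Law needs the serial fraction up front, here is the standard formula as a short sketch. The function name and the example values (10% serial, 8 cores) are illustrative assumptions, not figures from the paper.

```python
def amdahl_speedup(serial_fraction, n_cores):
    """Upper bound on speedup when serial_fraction of the work cannot be parallelized."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)


# The serial fraction is an input: it must be known (or estimated) beforehand.
print(f"{amdahl_speedup(0.10, 8):.2f}x")
```

The formula gives a bound only once the serial fraction has been supplied, which is the limitation noted above.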
3. Roofline
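Off-chip bandwidth is the important constraint.
The paper uses operational intensity in place of arithmetic intensity and machine balance.
Reasons:
1. The traffic being measured shifts from processor-cache traffic to cache-DRAM traffic.
2. The term will later be applied to kernels whose operations need not be arithmetic, so it has to be more general than "arithmetic".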
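
A rough sketch of how operational intensity could be estimated for a simple kernel, counting cache-DRAM traffic rather than processor-cache traffic. The kernel (y[i] = a * x[i] + y[i] on doubles) and the traffic assumptions are illustrative choices, not examples from the paper: the arrays are assumed far too large for cache, and each element of x and y is assumed to be read from DRAM once and each element of y written back once.

```python
def operational_intensity(flops, dram_bytes):
    """Operations per byte of DRAM traffic (FLOP/byte)."""
    if dram_bytes <= 0:
        raise ValueError("dram_bytes must be positive")
    return flops / dram_bytes


n = 1_000_000                 # number of elements (hypothetical)
bytes_per_double = 8

flops = 2 * n                              # one multiply and one add per element
dram_bytes = 3 * bytes_per_double * n      # read x, read y, write y

oi = operational_intensity(flops, dram_bytes)
print(f"operational intensity = {oi:.4f} FLOP/byte")
```

Under these assumptions the kernel has an intensity of 1/12 FLOP/byte, which places it far to the left on the X-axis, in the memory-bound region of most machines.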

• HIGHLIGHT
• NOTES
• READ:
• NOTES
• QUESTIONS
• POTENTIAL ISSUES
• NOTES
• READ FURTHER: REFERENCE
