Get Started with Intel VTune Profiler（summary）

chen_ ：)

Last modified: 2022-10-17 21:34:51

Categories: IPCC, ASC Study, High Performance Computing. Tags: linux, intel, vtune, hpc

First published: 2022-10-16 22:22:36

Article link: https://blog.csdn.net/weixin_51942493/article/details/127353350

https://www.intel.cn/content/www/cn/zh/develop/documentation/get-started-with-vtune/top.html

（1）Before You Begin

Platform: Linux* OS

Sample application: matrix

<install_directory>\sample\matrix

Set the environment variables:

source <install-dir>/setvars.sh

（2）Start VTune Profiler

Run

vtune-gui

（3）View and Analyze Performance Data

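Analysis tree: Performance Snapshot suggests other analysis types that may help you investigate the performance issues found in the application in more depth. Analysis types related to the detected issues are highlighted in red.

Key metrics that point to performance bottlenecks:

The Elapsed Time of this application is very high.
The Memory Bound metric is high, which indicates a memory access problem. Performance Snapshot therefore highlights the Memory Access analysis as a potential starting point and indicates that this bottleneck is the most severe and contributes the most to the Elapsed Time.
The IPC (Instructions Per Cycle) value is very low for a modern superscalar processor, which indicates that the processor is stalled most of the time.
Performance Snapshot also highlights the Hotspots analysis as a good starting point. Hotspots is usually a good candidate for the first in-depth analysis. It highlights hotspots, the code regions that contribute the most to the elapsed time.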

https://www.intel.com/content/www/us/en/develop/documentation/get-started-with-vtune/top/linux-os.html

（4）Run Hotspots Analysis

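Run the Hotspots analysis to find hotspots, the code sections that contribute the most to the application's total elapsed time.

When the analysis finishes, VTune Profiler opens the Summary viewpoint.

The application's total CPU Time is approximately 644 seconds. This is the sum of the CPU time of all threads in the application. The Total Thread Count is 9, so the application is multithreaded.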

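Top Hotspots section:
Lists the most time-consuming functions (hotspot functions), sorted by the CPU time spent executing them.

Effective CPU Utilization Histogram:
Shows the Elapsed Time broken down by utilization level of the available logical processors, which gives a graphical view of how many logical processors were in use while the application ran. Ideally, the highest bar of the chart matches the Target Utilization level.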

（5）Identify Most Time-Consuming Code Areas

Switch to the Bottom-up tab.
Change the grouping level using the Grouping menu.

To get detailed CPU utilization information for each function, use the Expand ( >> ) button in the Bottom-up pane to expand the Effective Time by Utilization column.

Double-click the multiply1 function in the Bottom-up grid to open the Source window.

The most time-consuming lines belong to the loop that performs the matrix multiplication in the multiply1 function.

https://www.intel.com/content/www/us/en/develop/documentation/vtune-tutorial-common-bottlenecks-linux/top/run-and-interpret-hotspots-analysis.html

（6）Analyze Memory Access

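To understand the exact mechanism behind the memory access problem in the multiply1 loop, run the Memory Access analysis.

If the application uses OpenMP, you can also enable Analyze OpenMP regions.

When the analysis finishes, VTune Profiler opens the Summary viewpoint.

The application is heavily limited by memory access. The fact that the system is not limited by DRAM Bandwidth alone suggests that the application is held back by frequent but small memory requests rather than by saturated physical DRAM bandwidth.

Switch to the Bottom-up tab.

The multiply1 function is at the top of the grid, with the highest CPU Time and a high Memory Bound value.
Note that the LLC Miss Count metric is very high. This indicates that the application uses a cache-unfriendly memory access pattern, which makes the processor miss in the LLC frequently and request data from DRAM, which is expensive in terms of latency.
A good way to address this is the loop interchange technique, which in this example changes how the rows and columns of the matrices are addressed in the main loop. This removes the inefficient memory access pattern and lets the processor make better use of the LLC.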

https://www.intel.com/content/www/us/en/develop/documentation/vtune-tutorial-common-bottlenecks-linux/top/analyze-memory-access.html

（7）Analyze Performance After Optimization

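Run the Performance Snapshot analysis again.

The application's Elapsed Time has decreased significantly.
The Vectorization metric equals 0.0%, which means that the code was not vectorized. Performance Snapshot therefore highlights the HPC Performance Characterization analysis as a potential next step.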

（8）Analyze Vectorization Efficiency

With the -O2 optimization level enabled:

Observe these main indicators:

The overall Vectorization metric is equal to 99.9%, which indicates that the code was vectorized.
However, there are red flags next to the 128-bit Packed FLOPs metrics. Hover over the red flag icon or the metric value to get a description of the issue.

In this case, Intel® VTune™ Profiler indicates that a large share of the floating-point instructions were executed with a partial vector load.
The analysis was run on a machine with an Intel processor that supports the AVX2 instruction set, so the fact that all instructions used only 128-bit registers means that the 256-bit wide AVX2 registers were not used at all. VTune Profiler therefore flags the 100.0% utilization of 128-bit vector registers as an issue.
(In other words, the code could be vectorized with 256-bit vector instructions.)

To find out which vector instruction set is actually used, run the HPC Performance Characterization analysis.

To run the analysis:

Click the HPC Performance Characterization analysis icon from the analysis tree.
Disable the Collect stacks, Analyze Memory bandwidth, and Analyze OpenMP regions options, as they are not required for vectorization analysis.
Click the Start button to run the analysis.
Once the data collection is complete, VTune Profiler opens the default Summary window of the HPC Performance Characterization analysis.

Focus on the Vectorization section.

Note that the main loop of the multiply2 function was vectorized with the older SSE2 instruction set, even though the code was compiled and analyzed on a processor that supports AVX2. Part of the hardware resources therefore remains underutilized.

OPTFLAGS = -xHost

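This option tells the compiler to use the highest instruction set extensions supported by the processor that performs the compilation.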

https://www.intel.com/content/www/us/en/develop/documentation/vtune-tutorial-common-bottlenecks-linux/top/analyze-vectorization-efficiency.html

https://www.intel.com/content/www/us/en/develop/documentation/vtune-tutorial-common-bottlenecks-linux/top/enable-platform-appropriate-vectorization.html

（9）Analyze Microarchitecture Usage

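The previous optimizations greatly reduced the application's elapsed time, but there is still room for improvement. Performance Snapshot highlights that the microarchitecture is not well utilized.
Run the Microarchitecture Exploration analysis to identify opportunities for improvement.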

Run Microarchitecture Exploration Analysis

In the HOW pane, enable all extra options.

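Summary window:

The window has three parts:

Elapsed Time section: shows metrics related to the utilization level of the hardware.
µPipe Diagram: the µPipe, or microarchitecture pipe, is a graphical representation of CPU microarchitecture metrics that shows where the hardware is used inefficiently. The µPipe is based on CPU pipeline slots, which represent the hardware resources needed to process one micro-operation.
Effective CPU Utilization Histogram: shows the Elapsed Time broken down by utilization level of the available logical processors, which gives a graphical view of how many logical processors were in use while the application ran. Ideally, the highest bar of the chart matches the Target Utilization level.

Issues in this example:

The Memory Bound metric is high, so the application is limited by memory access.
The Memory Bandwidth and Memory Latency metrics are high.
Taken together, these metrics point to a memory access problem in the application.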

Microarchitecture Pipe（*）

Pipe efficiency = Actual Instructions Retired / Possible Maximum Instructions Retired

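Red indicates a potential performance problem. The size of the green portion of the diagram helps you estimate how efficient the execution is. The shape of the pipe therefore gives a clear picture of existing CPU microarchitecture issues and lets you recognize the following common patterns:

[Figure: µPipe patterns A to F]

A: No significant issues
B: Memory bound execution
C: Core bound execution
D: Front End bound execution
E: Bad Speculation issues (for example, branch misprediction)
F: A combination of Memory and Bad Speculation issues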

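Example 1
The following pipe example shows significant Front-End Bound and Core Bound issues, which limit the overall efficiency to 24.4%:

[Figure: µPipe diagram for Example 1]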