linux性能分析工具：perf入门一页纸

磨刀砍柴Debug

已于 2023-04-11 10:35:48 修改

阅读量979

点赞数 3

分类专栏：调试和性能工具文章标签： linux 运维服务器 c++

于 2022-07-13 21:30:00 首次发布

本文链接：https://blog.csdn.net/weixin_44531336/article/details/125767551

版权

调试和性能工具专栏收录该内容

4 篇文章 5 订阅

订阅专栏

软件开发中程序运行一段时间出现2类问题最头疼：

1.突然崩溃（调试问题）

Windows：可以用WinDbg分析core dump文件，使用应用程序验证器（appverif.exe）对程序进行全面检查
Linux：可以用gdb分析core dump文件，使用valgrind的相关工具（里面最常用的是memleak检查内存泄露）对程序进行全面检查

2.运行变慢或CPU等使用率居高不下（性能问题）

静态分析代码：效率太低，除非是某一个git分支的合入导致出现问题，这种小范围的改动还可以用静态分析

采样分析：windows下可以使用Xperf进行分析，linux下可以使用perf+火焰图进行分析

说明：下面的笔记是对perf的总结，方便自己后续复习，因此里面列举了很多常用的命令，显得篇幅有点长

重要的事情说3遍：

笔记更新换位置了！
笔记更新换位置了！
笔记更新换位置了！

新位置：我写的新系列，刚开始写没几天，后续文章主要在新地址更新，欢迎支持；写作不易，且看且珍惜（点击跳转，欢迎收藏）

1.perf-Overview

简介

perf是linux（2.6+）官方的分析器（profiler），是一个轻量化的采用和分析的内核级工具，位于tools/perf下的linux内核源码中，并且是基于内核perf_events的；是一个具有分析（profiling）、跟踪（ tracing）和脚本（scripting）功能的多工具集合

安装

使用lsb_release -a列举出版本信息，然后使用下面对应的命令进行安装

Cent OS/RHEL：yum install perf
Fedora：dnf install perf
SUSE：zypper install perf
Ubuntu：apt install linux-tools-common

常见使用场景

perf特别适合CPU分析（perf可以被用来剖析CPU的调用路径）：分析/采样CPU的堆栈跟踪、跟踪CPU调度器的行为、磁盘I/O等；通过对程序进行几次采样，通常就可以找到影响性能的线索

提示：perf的子命令跟git很像，学习perf就是学习子命令的使用方式，最常用的是stat、record、report和script；先对perf支持的常用子命令有一个概括的了解吧…

2.子命令-Overview

part 1.子命令框架

下面展示了最常用的perf子命令，包含子命令的输入来源和输出格式，其中还展示了配合stackcollapse-perf.pl和flamegraph.pl生成火焰图的流程

提示：学会这个图的每个细节，基本上就学会了perf了

在这里插入图片描述

part 2.支持的命令列表

下面直接列出了perf支持的常用子命令和基本说明，注意：perf list列举出的是支持的events列表

# perf
 usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]

 The most commonly used perf commands are:
   annotate        Read perf.data (created by perf record) and display annotated code	
   archive         Create archive with object files with build-ids found in perf.data file
   bench           General framework for benchmark suites
   buildid-cache   Manage build-id cache.
   buildid-list    List the buildids in a perf.data file
   c2c             Shared Data C2C/HITM Analyzer.
   config          Get and set variables in a configuration file.
   data            Data file related processing
   diff            Read perf.data files and display the differential profile
   evlist          List the event names in a perf.data file
   ftrace          simple wrapper for kernel's ftrace functionality
   inject          Filter to augment the events stream with additional information
   kallsyms        Searches running kernel for symbols
   kmem            Tool to trace/measure kernel memory properties
   kvm             Tool to trace/measure kvm guest os
   list            List all symbolic event types
   lock            Analyze lock events
   mem             Profile memory accesses
   record          Run a command and record its profile into perf.data
   report          Read perf.data (created by perf record) and display the profile
   sched           Tool to trace/measure scheduler properties (latencies)
   script          Read perf.data (created by perf record) and display trace output
   stat            Run a command and gather performance counter statistics
   test            Runs sanity tests.
   timechart       Tool to visualize total system behavior during a workload
   top             System profiling tool.
   version         display the version of perf binary
   probe           Define new dynamic tracepoints
   trace           strace inspired tool

 See 'perf help COMMAND' for more information on a specific command.

part 3.Option参数

下面使用-h列举出了perf stat子命令的Option参数，其他命令也类似

提示：不用全部记住，也不现实，记住常用的，不明白查询一下就可以

# perf stat -h

 Usage: perf stat [<options>] [<command>]

    -a, --all-cpus        system-wide collection from all CPUs
    -A, --no-aggr         disable CPU count aggregation
    -B, --big-num         print large numbers with thousands' separators
    -C, --cpu <cpu>       list of cpus to monitor in system-wide
    -c, --scale           scale/normalize counters
    -D, --delay <n>       ms to wait before starting measurement after program start
    -d, --detailed        detailed run - start a lot of events
    -e, --event <event>   event selector. use 'perf list' to list available events
    -G, --cgroup <name>   monitor event in cgroup name only
    -g, --group           put the counters into a counter group
    -I, --interval-print <n>
                          print counts at regular interval in ms (overhead is possible for values <= 100ms)
    -i, --no-inherit      child tasks do not inherit counters
    -M, --metrics <metric/metric group list>
                          monitor specified metrics or metric groups (separated by ,)
    -n, --null            null run - dont start any counters
    -o, --output <file>   output file name
    -p, --pid <pid>       stat events on existing process id
    -r, --repeat <n>      repeat command and print average + stddev (max: 100, forever: 0)
    -S, --sync            call sync() before starting a run
    -t, --tid <tid>       stat events on existing thread id
    -T, --transaction     hardware transaction statistics
    -v, --verbose         be more verbose (show counter open errors, etc)
    -x, --field-separator <separator>

part 4.支持版本

下面列出了常用命令引入的版本信息，这也说明perf能支持的子命令是与linux内核版本有关系的；可以使用cat /proc/version查看linux的内核版本

在这里插入图片描述

part 5.子命令快速预览

下面是最常用命令的快速预览，基本上能涵盖80%的perf使用场景

#1.Listing Events
#列举名字中包含字符串 "block"的事件（events）
perf list block

#2.Counting Events
#按类型 计数 指定PID的系统调用
perf stat -e 'syscalls:sys_enter_*' -p PID
#计数整个系统的阻塞设备I/O事件，持续10秒
perf stat -e 'block:*' -a sleep 10

#3.剖析（Profiling）
#Sample on-CPU functions for the specified command, at 99 Hertz:
perf record -F 99 command
#Sample CPU stack traces (via frame pointers) system-wide ，at 99 Hertz，for 10 seconds:
perf record -F 99 -a -g sleep 10
#Sample CPU stack traces for the PID, using dwarf (debuginfo) to unwind stacks:
perf record -F 99 -p PID --call-graph dwarf sleep 10
#Record new process events via exec:
perf record -e sched:sched_process_exec -a

#4.Static Tracing
#Trace all context switches with stack traces for 1 second:
perf record -e sched:sched_switch -a -g sleep 1
#Trace all block requests, of size at least 64 Kbytes, until Ctrl-C:
perf record -e block:block_rq_issue --filter 'bytes >= 65536'

#5.Dynamic Tracing
#Add a probe for the kernel tcp_sendmsg() function entry (--add optional):
perf probe --add tcp_sendmsg
#Remove the tcp_sendmsg() tracepoint (or -d):
perf probe --del tcp_sendmsg

#6.Reporting
#Show perf.data as a text report, with data coalesced（合并） and counts and percentages:
perf report -n --stdio
#List all perf.data events, with data header (recommended):
perf script --header
#List all perf.data events, with my recommended fields (needs record -a; Linux < 4.1 used -f
#instead of -F):
perf script --header -F comm,pid,tid,cpu,time,event,ip,sym,dso 
#Generate a flame graph visualization (Linux 5.8+):
perf script report flamegraph

part 6.示例

perf的功能可以通过子命令进行调用实现，下面展示了一个常用示例

# 1.记录事件
# perf record -F 99 -a -- sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.491 MB perf.data (7065 samples) ]

# 2.总结输出内容
# perf report --stdio
# To display the perf.data header info, please use --header/--header-only options.
# Overhead  Command        Shared Object          Symbol
# ........  .............  ..........................................  ...............................
#
    83.03%  python         libgomp.so.1.0.0       [.] 0x0000000000018f22
     3.67%  python         libxilinxpcie.so       [.] 0x0000000000001a4f
     1.07%  python         libgomp.so.1.0.0       [.] 0x0000000000018f2d
     0.70%  sleep          ld-2.17.so             [.] do_lookup_x
...

perf record：记录事件，并保存在一个文件中（默认是perf.data）

参数：-F 99表示的采样频率是每秒99赫兹，-a表示对任何CPU的所有进程（可以使用-C指定特定的CPU），sleep 10表示采样时间是10s

注意：对于所有的perf子命令，-a都表示所有CPU，在linux 4.11以及之后这个变成默认，且默认省略-a

perf report --stdio：对perf record的输出文件内容进行总结，展示最频繁采样到的函数

3.事件（events）

perf工具支持一系列测量事件，工具和底层的内核接口能测量不同来的源事件，可以使用perf list列举出事件；支持的事件列表如下

在这里插入图片描述

其中：PMU是Performance Monitoring Unit，只要记住-e可以指定事件就可以了，这部分除非特别需要，否则不需要学习这部分

方法：额外消耗符合从低到高，简单理解就是剖析（profile）就是定期采样，追踪（trace）就是持续不简单的采样

statistics/count: increment an integer counter on events
sample: collect details (eg, instruction pointer or stack) from a subset of events (once every …)
trace: collect details from every event

下面这张图是一张经典图片（总结了支持事件的来源）

在这里插入图片描述

示例：计数硬件选择事件

# perf stat -e instructions,cycles,L1-dcache-load-misses,LLC-load-misses gzip example-perf-stacks.txt.gz
gzip: example-perf-stacks.txt.gz already has .gz suffix -- unchanged

 Performance counter stats for 'gzip example-perf-stacks.txt.gz':

           120,088      instructions:u            #    0.38  insn per cycle
           319,508      cycles:u
   <not supported>      L1-dcache-load-misses:u
   <not supported>      LLC-load-misses:u

       0.007603637 seconds time elapsed

       0.000274000 seconds user
       0.000274000 seconds sys

L1-dcache-load-misses：一级数据缓存负载未命中，大概可以看出应用程序的内存负载，其中有一部分负载从一级缓存返回
LLC-load-misses：末级缓存未命中；在末级缓存之后，就会直接读取主存，因此这实际是测量主存负载

下面对常用的子命令重点介绍一下，只要记住stat主要用于计数events，reord主要用于采样，report和script主要用于分析采样，annotate主要用于源码层面分析这些主要部分就可以，有什么需求，重点学习一下相关命令即可

4.perf stat计数（count）

perf stat子命令会对事件计数（只会在程序执行期间，即on-CPU）将这些采样信息汇总，且不将事件摘要记录到perf.data中

做性能分析时，可以先看一下程序是否有性能问题；这可以用来度量事件的速率（rate）或者检查一个事件是否正在发生

它主要使用内核上下文（kernel context）统计软件事件，使用PMC寄存器统计硬件事件，在使用消耗更大的perf record前，特别适合使用perf stat来检查事件的速率

part 1.基本介绍

查看帮助信息

在这里插入图片描述

常用Options参数

在这里插入图片描述

学习命令最有效的方式就是查看使用示例，下面用2个示例介绍perf stat的使用

part 2.示例

示例1：计数公共事件（common events）

#perf stat -B dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB) copied, 0.298623 s, 1.7 GB/s

 Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':

            297.31 msec task-clock:u              #    0.147 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               211      page-faults:u             #    0.710 K/sec
       232,984,591      cycles:u                  #    0.784 GHz
       331,300,766      instructions:u            #    1.42  insn per cycle
        70,063,843      branches:u                #  235.657 M/sec
             6,299      branch-misses:u           #    0.01% of all branches

       2.026438773 seconds time elapsed

       0.077115000 seconds user
       0.221331000 seconds sys

参数说明：

-B：千位数之间用逗号分隔符
dd：可从标准输入或文件中读取数据，根据指定的格式来转换数据，再输出到文件、设备或标准输出；if=输入文件名；of=输出文件名；count=blocks，仅拷贝blocks个块，块大小等于ibs指定的字节数；ibs=bytes：一次读入bytes个字节，即指定一个块大小为bytes个字节

说明：没有events被指定的话，perf stat收集perf list列举的common event，即默认会对PMC架构进行检测

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-NwgAavy0-1657702055364)(…/…/图片/image-20220711144945809.png)]

输出说明：

简单判断

输出最下面会显示消耗系统和内核的消耗时间（0.3s=0.077115000 + 0.221331000），task-clock表示在处理器上消耗的时间（297.31），通过两者的比较，可以简单判断一下处理器占用率（297.31ms/0.3s，基本接近100%），根据自己的业务判断是否真的这里有问题

IPC（instructions per cycle）

这个指标表明CPU执行指令的效率，越高越好；也可用倒数CPI表示，其中perf通常用IPC，intel通常用CPI

IPC较低：代表CPU经常处于停滞状态，通常是在访问内存；这种场景适合更块的DRAM，或者提高内存本地性，或者较少内存I/O的数量，而不是更换更高频率的CPU

IPC较高：则代表CPU基本没有停滞，指令的吞吐量较高

注意：IPC代表了指令的处理效率，并不代表指令本身的效率

示例2：计数特定事件

当prev_pid是4794时(--filter 'prev_pid == 4794')，事件sched:sched_switch开始计数（-e），展示每个CPU计数（-a），打印输出的时间间隔是1s（-I 1000）

#sudo perf stat -e sched:sched_switch --filter 'prev_pid == 4794' -A -I 1000
#           time             counts unit events
#           time CPU                    counts unit events
     1.001560136 CPU0                       0      sched:sched_switch
     1.001560136 CPU1                       0      sched:sched_switch
     1.001560136 CPU2                       0      sched:sched_switch
     1.001560136 CPU3                       0      sched:sched_switch
     2.006458677 CPU0                       0      sched:sched_switch
     2.006458677 CPU1                       0      sched:sched_switch
     2.006458677 CPU2                       0      sched:sched_switch
     2.006458677 CPU3                     121      sched:sched_switch
^C   2.363133921 CPU0                       0      sched:sched_switch

说明：

1.测量多个events：使用没有空格的逗号分隔的表达式
# perf stat -e cycles,instructions,cache-misses [...]
2.事件数量：理论上，events的个数没有限制（the number of hardware and software events）；但是测量一个文件描述符的话，数量可能超过了内核提供给每个进程的最大数量，此时perf会报错too many open files，可以使用下面的解决方案
1.用ulimit -n增加每个进程打开的文件描述符；注意：需要有root权限
2.减少测量的事件，减少测量的CPU

part 3.修饰器（Modifiers）

默认用户和内核层面的events都会被测量，可以在events后面加冒号和修饰器来决定要计数哪个层面；下面是一个Intel和AMD的x86处理器修饰器简单总结

在这里插入图片描述

提示：修饰器可以指定一个或多个，默认用户层面和内核层面的events都会测量，因此-e cycles和-e cycles:uk效果是一样的

示例：只测量用户层面的计数

#perf stat -e cycles:u dd if=/dev/zero of=/dev/null count=100000
100000+0 records in
100000+0 records out
51200000 bytes (51 MB) copied, 0.0265741 s, 1.9 GB/s

 Performance counter stats for 'dd if=/dev/zero of=/dev/null count=100000':

        22,033,684      cycles:u

       0.027025273 seconds time elapsed

       0.006997000 seconds user
       0.019992000 seconds sys

part 4.范围和继承

perf工具是可以细分到对线程，进程，CPU或者所有CPU进行测量的

线程模式（per-thread mode）：指定的线程被执行时才会计数，当一个线程在进程间迁移时，计数器在当前进程存储，并在新进程恢复
进程模式（per-process mode）：计数（count）和采样（sample）被汇总到进程级别；perf_events接口允许自动继承fork()和pthread_create()，因此默认perf是激活继承的
CPU模式（per-cpu mode）：计数（count）和采样（sample）被汇总到CPU级别；一个event一次只能在一个CPU上测量，要监控多个CPU，就要创建多个事件

默认1：perf stat计数一个进程后续创建的所有进程和线程

默认2：默认是线程模式（per-thread mode），可以使用-a改成计数汇总到CPU模式（per-cpu mode）
# perf stat -B -e cycles:u,instructions:u -a dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB) copied, 0.259315 s, 2.0 GB/s

Performance counter stats for 'system wide':

    218,451,420      cycles:u
    331,671,063      instructions:u            #    1.52  insn per cycle

    0.260349779 seconds time elapsed
提示：可以使用-C选项限制CPU的子集，比如为了监控CPU0，CPU3和CPU4，可以使用下面命令
# perf stat -B -e cycles:u,instructions:u -a -C 0,3-4 sleep 5

5.perf record采样（sample）

以一定的时间进行间隔采样，记录事件（-e：指定事件，不执行会执行默认）到一个文件（默认的输出文件是perf.data），方便后续分析；这部分对应的最多的一个英文单词就是profiling（剖析）

在这里插入图片描述

part 1.采样原理

简单理解成每个固定时间采样一次，即在每个CPU上产生一个中断，看一下此时是哪个PID，哪个函数正在运行，然后记录下来；正常情况下，运行时间越多的函数，采样的概率就越大，通过采样个数就可以推断出哪个函数CPU占用率更高

提示：一定要记住，计数是基于采样的，不是时间，这在用火焰图分析时一定要牢记

默认：perf record使用cycles事件作为采样事件，这个事件是一个通用的硬件事件

parr 2.周期和速率

perf_events接口允许使用2种方式设置采样周期和速率

周期：事件发生多少次采样一下（-I）
频率：每秒采样的采样频率是多少赫兹（-F），尽量不要设置成1000等整数形式，因为遇到一些周期性行为的话，会破坏掉采样的随机性，导致采样结果失真；采样频率越高，带来的额外负荷越高，要根据问题设置恰当的采样频率

part 3.Options参数

在这里插入图片描述

提示1：不使用-g，可以使用max-stack指定堆栈跟踪的深度，且针对不同事件可以设置不同的深度
# perf record -e sched:sched_switch/max-stack=5/,sched:sched_wakeup/max-stack=1/ -a -- sleep 1
提示2：shell分隔符--在perf命令选项和虚拟命令选项之间不是必须的，主要是为了停止perf对命令行选项的处理

提示3：如果没有堆栈信息，使用 -fno-omit-framepointer重新编译软件，然后重新使用perf record记录堆栈；或者perf record直接使用--call-graph选项，主要有下面3个选项：

part 4.采样完成？

#采样周期结束，或者ctrl + C，出现类似下面的打印，就说明采样结束了
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.491 MB perf.data (7065 samples) ]

part 5.示例

记录缺页采样，由于缺页是随着进程常驻集合大小（resident set size, RSS）的增长而发生的，因此分析他们可以解释为什么进程的主存在增长

# perf record -e page-faults -a -g -- sleep 5
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.899 MB perf.data (82 samples) ]
[syshaps1@localhost FlameGraph]$ perf script
perf 20797 [007] 2225731.904001:          1 page-faults:
                  57b0b9 perf_hooks__invoke+0x29 (/usr/bin/perf)
                  440013 cmd_record+0xc53 (/usr/bin/perf)
                  4a26af [unknown] (/usr/bin/perf)
                  42af75 main+0x605 (/usr/bin/perf)
            7f2a205d2555 __libc_start_main+0xf5 (/usr/lib64/libc-2.17.so)

perf 20798 [008] 2225731.904004:          1 page-faults:
            7f2a22693d4d _dl_fixup+0x2d (/usr/lib64/ld-2.17.so)
                  44032b cmd_record+0xf6b (/usr/bin/perf)
                  4a26af [unknown] (/usr/bin/perf)
                  42af75 main+0x605 (/usr/bin/perf)
            7f2a205d2555 __libc_start_main+0xf5 (/usr/lib64/libc-2.17.so)

part 6.符号问题

应用程序性能分析中常见2个问题：缺少符号表（missing symbol）和栈踪迹（stack trace），这两个问题会导致perf record抓取不到想要信息

缺少符号表：表现是不能将指令地址解析为函数名称

下面展示了同一个进程的采样结果是否有符号表的对比

在这里插入图片描述

对于ELF二进制，符号有可能会被strip从二进制中剥离出去；解决办法一种是调整构建过程，防止剥离符号表；另一种是使用其他符号信息来源，如debuginfo或BPF类型格式

# file test_flamegrah
test_flamegrah: ELF 64-bit LSB executable, x86-64, ..., not stripped

# strip test_flamegrah
# file test_flamegrah
test_flamegrah: ELF 64-bit LSB executable, x86-64, ..., stripped

缺失栈踪迹：全部缺失或者缺失一部分，或者是有一些异常的[unkown]

原因：栈踪迹一般是基于帧指针（frame pointer），这要求有一个用来记录帧基址的rbp寄存器，但有时编译器开启优化，会用通用寄存器优化掉rbp寄存器，此时观测工具基于帧指针解析就会异常

解决办法：

1.最常用的就是修复帧指针寄存器，对于C/C++软件，使用-fno-omit-frame-pointer重新编译软件，这种情况性能损失通常低于1%
2.改用不基于帧指针的栈遍历技术，如perf支持基于DWARF的栈遍历
 # perf record -F 99 -a --call-graph dwarf
参考：《性能之巅》5.6章节

记录完事件后，可以使用perf report和perf script进一步进行分析

6.perf report（sample analysis）

总结perf.data文件的内容，查看帮助信息，这个子命令也可以处理perf script的输出；如果不使用火焰图的话，可以直接用这个子命令分析系统性能瓶颈

在这里插入图片描述

Options参数如下，其中的-g表示收集堆栈信息，对火焰图的绘制很关键

在这里插入图片描述

使用的源文件test_perf.cpp如下，软件运行后可以看到函数A、B、C的预估占比：A（20%）、B（50%）、C（30%）；性能问题主要在函数B，下面介绍report中的2个常用模式，看看perf分析结果是否符合预期

#include <stdio.h>
#include <unistd.h>

void C() {
    int i = 30000000;
    while(i--);
    return;
}
void B() {
    int i = 50000000;
    while(i--);
    return;
}
void A() {
    int i = 20000000;
    while(i--);
    B();
    return;
}

int main() {
    printf("[main]\n");
    while(1) {
        A();
        C();
        sleep(5);
    }
    return 0;
}

编译：源码带-g进行编译gcc -g test_perf.cpp -o test_perf
采样：假设使用perf record -F 999 -g ./test_perf进行采样（注意：没有使用-g采样堆栈信息）
注意：如果是为了生成火焰图，一定要使用-g；如果是只为perf简单分析，可以不使用-g，这样不会影响运行效果；下图可以先提前看一下perf report的分析区别：不带-g直接定位到函数B是性能瓶颈，如果带-g就要自己找self列的哪个函数占比大

在这里插入图片描述

TUI模式

直接perf report，交互式模式下，你可以导航到关注的函数，查看更详细的信息(下图是perf record带-g采样的截图，因为含有的信息多，所以暂时使用)

在这里插入图片描述

列说明：

Children：将所有函数进行统计，本函数的采样占比（里面子函数的消耗/采样计数也算自己的）：取消可以在参数加--no-children

Self：本函数的采样占比（不包含子函数），所有self的消耗总和应该是100%

Command：采样来自哪一个进程

Shared Object：采样来自的ELF映像的名字，[kernel.kallsyms]表示采样来自内核

Symbol：抽取样本的特权级别

[.]：user level

[k]：kernel level

[g]： guest kernel level (virtualization)

[u]：guest kernel level (virtualization)

[H]：hypervisor（管理程序）

最后一列：符号名字，也可以简单理解生是函数名

问题1：既然是交互模式，交互体现在哪里？

1.通过键盘上下键将光标移动到感兴趣的函数（此时是函数B），直接回车
2.鼠标此时停在“Annotate B”，再一次回车可以看到函数B的反汇编代码

提示1：annotate是perf的另一个子命令，可用于分析执行函数内代码上的CPU消耗；除了"Annotate 函数名"，第一次接触这里，都可以试试，看看功能是做什么的

提示2：下面是因为有符号信息（编译时使用了-g）的截图，如果没有符号信息的话，不会显示源码

在这里插入图片描述

问题2：怎么分析性能瓶颈？

除了系统提供的函数，可以看到函数A，B，C是我自己写的函数，函数B的被采样到的次数最多，因此可能是引起性能瓶颈的地方，要重点分析；函数B的反汇编代码中，可以看到test %eax,%eax占用百分比最大，因此是性能瓶颈要重点关注的地方，详细分析看perf annotate部分

3.直接按h，调出帮助信息；按相应的快捷键，做你需要的事…

在这里插入图片描述

stdio模式

使用参数--stdio非交互模式（perf report --stdio），特别适合将输入文件重定位到text文件

在这里插入图片描述

说明：

排序：按照采样计数逆序排列，采样计数以百分比的形式输出，说明了CPU时间花费在哪里

父子关系：左上是父函数，右下是子函数，最右下角的函数就是on-CPU函数，即on-CPU火焰图的堆栈中最上面的函数

符号：只有调试信息完整，才会显示出函数名，否则只显示16进制地址

顺序：从左到右的函数调用顺序在perf中被称为caller；可以使用-g callee更改顺序：事件函数在左面，祖先函数在右下面
# perf report --stdio -g callee
# Children      Self  Command    Shared Object     Symbol
# ........  ........  .........  ................  .........................
...
    69.95%    20.03%  test_perf  test_perf         [.] A
            |
            ---A
               |
                --69.90%--main
                          __libc_start_main
...                          

7.perf script

默认打印perf.data文件的采样数据（例如栈样本），它的输出对于火焰图非常有用

在这里插入图片描述

perf script的输出格式如下，简单理解就是gdb中使用bt命令查看的栈帧信息

# perf script 
test_flamegrah 20525 1728126.051969:          1 cycles:
        ffffffff8a66d5fa native_write_msr_safe+0xa ([kernel.kallsyms])
        ffffffff8a609924 __intel_pmu_enable_all.isra.23+0x54 ([kernel.kallsyms])
        ffffffff8a609990 intel_pmu_enable_all+0x10 ([kernel.kallsyms])
        ffffffff8a6058ba x86_pmu_enable+0x28a ([kernel.kallsyms])
        ffffffff8a7aca4d perf_pmu_enable+0x2d ([kernel.kallsyms])
        ffffffff8a7ae595 __perf_event_task_sched_in+0x195 ([kernel.kallsyms])
        ffffffff8a6d4c80 finish_task_switch+0xe0 ([kernel.kallsyms])
        ffffffff8ad8c050 __schedule+0x320 ([kernel.kallsyms])
        ffffffff8ad8c3d9 schedule+0x29 ([kernel.kallsyms])
        ffffffff8aa6cbc0 n_tty_write+0x260 ([kernel.kallsyms])
        ffffffff8aa6a0db tty_write+0x15b ([kernel.kallsyms])
        ffffffff8a84e4a0 vfs_write+0xc0 ([kernel.kallsyms])
        ffffffff8a84f225 sys_write+0x55 ([kernel.kallsyms])
        ffffffff8ad99f92 system_call_fastpath+0x25 ([kernel.kallsyms])
            7f3b5000dba0 __GI___libc_write+0x10 (/usr/lib64/libc-2.17.so)
                       0 [unknown] ([unknown])

提示：perf script的输出有非常多的行，不适合手动分析；可以使用火焰图将perf script的堆栈信息进行可视化处理

8.perf annotate（source level analysis）

上面的perf report的tui模式中已经看到了annotate使用的身影

正常分析的程序是release版本，不带调试信息，比如perf report的tui模式中示例函数B在没有调试信息的情况下，显示如下

Percent│                                                                        
       │     Disassembly of section .text:                                       
       │     0000000000400559 <B()>:          
       │     _Z1Bv():                                             
       │400559:   push   %rbp                 
       │40055a:   mov    %rsp,%rbp                     
       │40055d:   movl   $0x2faf080,-0x4(%rbp)              
  9.81 │400564:   mov    -0x4(%rbp),%eax      
  7.14 │400567:   lea    -0x1(%rax),%edx      
 28.87 │40056a:   mov    %edx,-0x4(%rbp)      
 43.10 │40056d:   test   %eax,%eax            
       │40056f:   setne  %al                  
       │400572:   test   %al,%al              
       │400574: ↓ je     400578 <B()+0x1f>    
 11.08 │400576: ↑ jmp    400564 <B()+0xb>     
       │         return;                      
       │400578:   nop                                                     
       │400579:   pop    %rbp                 
       │40057a: ← retq

可以定位到11行的test %eax,%eax采样占比最大，但是我们最想关心的是哪一行源码出问题，可以直接使用-g再次编译程序；重新执行perf相关操作，步骤如下

#1.通过 -g 选项重新编译test_perf.c, 输出的二进制程序为test_perf_with_symbol
gcc -g test_perf.c -o test_perf_with_symbol

#2.通过perf record命令对 test_perf_with_symbol 程序采样
perf record -F 999 ./test_perf_with_symbol 3000

#3.使用perf annotate单独对 函数B 进行分析
perf annotate --stdio --symbol=B

下面是perf annotate分析的截图，可以找到test %eax,%eax实际是在一个while循环中，这个while循环要重点优化的代码，剩下的任务就是撸相关代码了…

在这里插入图片描述

9.perf trace

默认跟踪系统调用，相当于strace的perf版本

# sudo perf trace -p $(pgrep test_flamegrah)
         ? (         ):  ... [continued]: write()) = 4
     0.022 ( 0.026 ms): write(fd: 1</dev/pts/3>, buf: 0x7f598538f000, count: 4   ) = 4
     0.053 ( 0.019 ms): write(fd: 1</dev/pts/3>, buf: 0x7f598538f000, count: 4   ) = 4
     0.077 ( 0.023 ms): write(fd: 1</dev/pts/3>, buf: 0x7f598538f000, count: 4   ) = 4
     0.105 ( 0.020 ms): write(fd: 1</dev/pts/3>, buf: 0x7f598538f000, count: 4   ) = 4
     
# 参数-s输出每个线程的系统调用汇总
# sudo perf trace -s -p $(pgrep test_flamegrah)
 Summary of events:
 test_flamegrah (18403), 1447677 events, 100.0%

   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   write             723828 12468.323     0.000     0.017  4546.880     44.49%

与starce对比：

开销：perf使用每个CPU的缓冲区来减小开销，比strace更安全
范围：perf可以跟踪整个系统，strace只能跟踪一组进程（一般是一个单进程）；perf还可以跟踪系统调用意外的事件
参数：strace支持更多的系统调用参数翻译

扩展：pgrep是一个根据名称查找进程ID的命令，返回的是进程ID

默认只显示PID

# pgrep sshd
4238
4241

pgrep相当于 ps –eo pid,cmd | awk ‘{print $1,$2}’ | grep KeyWord

# ps -eo pid,cmd | awk '{print $1,$2}'| grep sshd
4238 sshd:
4241 sshd:

10.剖析的原理

perf的子命令简单理解就是采样，然后对采样数据进行分析；不知道你是否对perf是怎么实现采样数据分析感兴趣呢，即perf是怎么通过采样数据找到是哪个进程，哪个函数，哪行代码的呢？

剖析：简答理解就是通过固定周期进行采样，对采样结果进行分析的过程
例子：使用上面perf report部分的例子，性能主要出现在汇编代码test %eax,%eax对应的行，地址是0x40056d，属于函数B（起始地址是0x400559）
假设：每次采样都会将当前正在执行的指令、进程编号、事件计数器信息进行记录，假设某次采样到的地址地址是0x40056d，进程编号是11493
期望效果：剖析的作用是将0x40056d转化成源码中函数的哪一行，进程编号11493对应哪一个进程
剖析需要文件：1.剖析的采样，2.进程列表，3.进程地址空间映射，4.二进制文件

下面根据前面的源码和假设，展示一下perf剖析的完整过程

1.进程编号找到相应进程

linux中，每个运行的进程在/proc下都对应一个数字目录，这个数字就是perf采样对应的进程编号，直接用下面的命令就可以查看到是哪个进程在运行

# cat /proc/11493/cmdline
./test_perf

2.找到0x40056d对应的模块

在11493进程的内存映射文件中，找到0x40056d属于test_perf模块（起始地址：00400000，结束地址：00401000）

[syshaps1@localhost 11493]$ cat /proc/11493/maps
00400000-00401000 r-xp 00000000 fd:02 1148346285        /home/test_perf
00600000-00601000 r--p 00000000 fd:02 1148346285        /home/test_perf
00601000-00602000 rw-p 00001000 fd:02 1148346285        /home/test_perf
7ff0ce014000-7ff0ce1d8000 r-xp 00000000 fd:00 33555970  /usr/lib64/libc-2.17.so
...

其中：p=私有，s=共享

3.找到0x40056d对应的函数

查看test_perf模块中函数的起始地址，可以看到0x40056d对应的是函数B

#查看函数的起始地址
# nm -C test_perf
0000000000601040 B __bss_start
....
00000000004005a6 T main
0000000000400460 T _start
0000000000601040 D __TMC_END__
000000000040057b T A()
0000000000400559 T B()
0000000000400537 T C()

4.通过debug信息，找到0x40056d对应的源码

0x40056d对应的是函数B()，起始地址400559，直接使用objdump -S -l --start-address=0x400559

# objdump -S -l --start-address=0x400559 test_perf
test_perf:     file format elf64-x86-64
Disassembly of section .text:
0000000000400559 <_Z1Bv>:
_Z1Bv():
void B() {
  400559:       55                      push   %rbp
  40055a:       48 89 e5                mov    %rsp,%rbp
/home/test_perf.cpp:10
    int i = 50000000;
  40055d:       c7 45 fc 80 f0 fa 02    movl   $0x2faf080,-0x4(%rbp)
/home/test_perf.cpp:11
    while(i--);
  400564:       8b 45 fc                mov    -0x4(%rbp),%eax
  400567:       8d 50 ff                lea    -0x1(%rax),%edx
  40056a:       89 55 fc                mov    %edx,-0x4(%rbp)
  40056d:       85 c0                   test   %eax,%eax				#要找的位置
  40056f:       0f 95 c0                setne  %al
  400572:       84 c0                   test   %al,%al
  400574:       74 02                   je     400578 <_Z1Bv+0x1f>
  400576:       eb ec                   jmp    400564 <_Z1Bv+0xb>
/home/test_perf.cpp:12
    return;
  400578:       90                      nop
/home/test_perf.cpp:13
}
  400579:       5d                      pop    %rbp
  40057a:       c3                      retq

扩展：也可是使用gdb查看源码

#(gdb) disassemble /m 0x400559
Dump of assembler code for function B():
9       void B() {
   0x0000000000400559 <+0>:     push   %rbp
   0x000000000040055a <+1>:     mov    %rsp,%rbp

10          int i = 50000000;
   0x000000000040055d <+4>:     movl   $0x2faf080,-0x4(%rbp)

11          while(i--);
   0x0000000000400564 <+11>:    mov    -0x4(%rbp),%eax
   0x0000000000400567 <+14>:    lea    -0x1(%rax),%edx
   0x000000000040056a <+17>:    mov    %edx,-0x4(%rbp)
   0x000000000040056d <+20>:    test   %eax,%eax				#要找的位置

至此，perf每次处理一条采样记录，最后统计到进程 -> 线程 -> 模块 -> 函数 -> 代码行 -> 汇编指令行

11.应用示例：perf与火焰图

perf采样数据有时太多，不好剖析，可以使用火焰图对perf采样数据进一步分析

使用的示例是perf report中的示例（即 10.剖析的原理 中的示例）

1.火焰图的生成流程

在这里插入图片描述

2.命令步骤如下

#1.以99Hz（-F 99）对 test_perf 进程 栈跟踪（-g）采样，也可使用 -p PID 指定进程，ctrl+C结束采样
perf record -F 99 -g ./test_perf

#2.生成折叠后的调用栈和火焰图
perf script --header > out.stacks
git clone https://github.com/brendangregg/FlameGraph; cd FlameGraph
./stackcollapse-perf.pl < out.stacks | ./flamegraph.pl --hash > out.svg

3.火焰图效果

下面是一张on-CPU火焰图（浏览器打开可以交互操作），主要分析最上面的函数（B和C），很明显函数B被采样到的次数更多，是要重点优化的函数

在这里插入图片描述

总结：本篇笔记主要是围绕着perf的子命令使用方法进行介绍的，其中的probe部分没有介绍，想学的话可以看参考资料；perf就先介绍这么多了…

参考

1.官网文章，讲解的很详细，Tutorial - Perf Wiki (kernel.org)
2.Linux perf Examples (brendangregg.com)，《性能之巅》作者的blog
3.《Systems Performance》 Second Edition, Chapter 13 perf 和相关perf的介绍；这本书中文版叫《性能之巅》，很厚也很贵，强烈推荐
4.用Perf寻找程序中的性能热点 - 知乎 (zhihu.com)，用一个例子贯穿讲解perf，讲解的很好

磨刀砍柴Debug

关注

3
点赞
踩
6

收藏

觉得还不错? 一键收藏
1
评论
linux性能分析工具：perf入门一页纸

是linux（2.6+）官方的分析器（），是一个轻量化的采用和分析的内核级工具，位于下的linux内核源码中，并且是基于内核的；是一个具有分析（profiling）、跟踪（ tracing）和脚本（scripting）功能的多工具集合使用列举出版本信息，然后使用下面对应的命令进行安装常见使用场景特别适合CPU分析（perf可以被用来剖析CPU的调用路径）：分析/采样CPU的堆栈跟踪、跟踪CPU调度器的行为、磁盘I/O等；通过对程序进行几次采样，通常就可以找到影响性能的线索下面展示了最常用的子命令，包
复制链接

扫一扫