Linux Performance Analysis Tools Series: perf (event sampling, comprehensive performance analysis)

runafterhit

Article link: https://blog.csdn.net/runafterhit/article/details/107801860

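This article takes an in-depth look at perf, a powerful Linux performance analysis tool that counts hardware events such as executed instructions and cache misses and helps locate application hotspots. It covers perf concepts, the perf toolset, the available events, and how to use the most common subcommands.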

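Note: this article was tested in two environments. Most of the output shown comes from an Ubuntu virtual machine, supplemented by a Raspberry Pi. Many events turned out to be unsupported in the VM; my guess is that a virtualized system has limited emulation of hardware events, so Raspberry Pi results were added later.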

perf version 4.4.177 (test environment, 64-bit Ubuntu VM: Linux ubuntu 4.4.0-148-generic #174~14.04.1-Ubuntu SMP Thu May 9 08:17:37 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux)
perf version 4.9.82 (test environment, Raspberry Pi 4B: Linux raspberrypi 4.19.127-v7l #1 SMP Sun Aug 9 00:56:42 PDT 2020 armv7l GNU/Linux)

Overview

perf concepts

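perf is a powerful Linux performance analysis tool built on the perf_event interface provided by the Linux kernel. It can count hardware events such as executed instructions, cache misses, and branch mispredictions, and is used to find hotspots in applications. These hardware events, along with software events, can be measured per task, per CPU, or per workload.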

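perf's statistics are event-based. Profiling subcommands such as perf top and perf record collect these events by sampling rather than by tracking every clock cycle, and each perf subcommand reports on the type of event it is asked to measure.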

The perf toolset

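perf provides a large set of subcommands for collecting and analyzing performance and trace data. Each one has its own help text, e.g. perf stat -h.
The most commonly used are:
perf stat: counts events while a command runs.
perf top: ranks functions (or instructions) by a chosen event, live.
perf record/report/annotate: a group of commands for detailed analysis. record writes a profile to a file, report prints a basic report from that file, and annotate maps the results onto the code.
There are also more specialized tools, such as lock for locks, sched for the scheduler, kmem for slab allocator performance, and probe for user-defined probe points.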

The full list of commands:

goodboy@ubuntu:~$ perf -h
 usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]
 The most commonly used perf commands are:
   annotate        Read perf.data (created by perf record) and display annotated code
   archive         Create archive with object files with build-ids found in perf.data file
   bench           General framework for benchmark suites
   buildid-cache   Manage build-id cache.
   buildid-list    List the buildids in a perf.data file
   data            Data file related processing
   diff            Read perf.data files and display the differential profile
   evlist          List the event names in a perf.data file
   inject          Filter to augment the events stream with additional information
   kmem            Tool to trace/measure kernel memory properties
   kvm             Tool to trace/measure kvm guest os
   list            List all symbolic event types
   lock            Analyze lock events
   mem             Profile memory accesses
   record          Run a command and record its profile into perf.data
   report          Read perf.data (created by perf record) and display the profile
   sched           Tool to trace/measure scheduler properties (latencies)
   script          Read perf.data (created by perf record) and display trace output
   stat            Run a command and gather performance counter statistics
   test            Runs sanity tests.
   timechart       Tool to visualize total system behavior during a workload
   top             System profiling tool.
   trace           strace inspired tool
   probe           Define new dynamic tracepoints
 See 'perf help COMMAND' for more information on a specific command.

perf events: listing them with perf list

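As mentioned above, perf builds on the kernel's event-counting facilities. The perf list command shows the available events, which fall into three main groups:
1. Software events (kernel-maintained counters for OS-level activity, such as system calls, context switches, task migrations, and page faults)
2. Performance Monitoring Unit (PMU) hardware events (such as cycle counts)
3. Hardware cache events (such as L1 cache misses)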

List of pre-defined events (to be used in -e):

  alignment-faults                                   [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  page-faults OR faults                              [Software event]
  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  branch-load-misses                                 [Hardware cache event]
  branch-loads                                       [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  dTLB-loads                                         [Hardware cache event]
  cycles-ct OR cpu/cycles-ct/                        [Kernel PMU event]
  ...
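Since the listing is plain text, the three groups can be pulled apart with a few lines of Python. This is only a sketch: it assumes each line has the shape "name [OR alias]   [category]" as in the excerpt above, and the names SAMPLE and group_events are made up for this illustration.

```python
import re
from collections import defaultdict

# A few lines copied from the perf list excerpt above.
SAMPLE = """\
  alignment-faults                                   [Software event]
  context-switches OR cs                             [Software event]
  L1-dcache-load-misses                              [Hardware cache event]
  cycles-ct OR cpu/cycles-ct/                        [Kernel PMU event]
"""

# Assumed line shape: "<name> [OR <alias>]   [<category>]"
LINE_RE = re.compile(r"^\s*(\S+)(?:\s+OR\s+(\S+))?\s+\[(.+)\]\s*$")


def group_events(listing):
    """Map each category (the text in brackets) to its event names."""
    groups = defaultdict(list)
    for line in listing.splitlines():
        match = LINE_RE.match(line)
        if match:
            name, _alias, category = match.groups()
            groups[category].append(name)
    return dict(groups)


if __name__ == "__main__":
    for category, names in group_events(SAMPLE).items():
        print(f"{category}: {', '.join(names)}")
```

Lines that do not match the pattern, such as the header or the trailing "...", are skipped.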

Using the common perf tools

perf stat: run a command and count events along the way

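perf stat counts the supported events while a program runs and prints a simple summary to the screen. Run it as perf stat cmd: cmd is executed, and the event counts are printed when it finishes.
For example, reading from /dev/zero and writing to the null device, 1,000,000 blocks in a row: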

root@ubuntu:/sys/kernel/debug/tracing# perf stat -B dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB) copied, 1.03206 s, 496 MB/s

 Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':

       1020.507771      task-clock (msec)         #    0.987 CPUs utilized          
                48      context-switches          #    0.047 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                67      page-faults               #    0.066 K/sec                  
   <not supported>      cycles                   
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
   <not supported>      instructions             
   <not supported>      branches                 
   <not supported>      branch-misses            

       1.033620037 seconds time elapsed
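The report above is plain text, so its counters can be extracted by a script. The sketch below assumes the layout shown in this article (perf 4.4: a value or "<not supported>", then the event name); other perf versions may format the report differently. parse_perf_stat and SAMPLE are names invented for this example.

```python
import re

# Excerpt of the report above; the layout is assumed to match perf 4.4.
SAMPLE = """\
 Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':

       1020.507771      task-clock (msec)         #    0.987 CPUs utilized
                48      context-switches          #    0.047 K/sec
                67      page-faults               #    0.066 K/sec
   <not supported>      cycles

       1.033620037 seconds time elapsed
"""

COUNTER_RE = re.compile(r"^\s*(<not supported>|[\d.,]+)\s+([\w-]+)")
ELAPSED_RE = re.compile(r"^\s*([\d.]+)\s+seconds time elapsed")


def parse_perf_stat(text):
    """Return ({event: value or None}, elapsed seconds or None)."""
    counters, elapsed = {}, None
    in_report = False
    for line in text.splitlines():
        if "Performance counter stats" in line:
            in_report = True  # ignore the profiled command's own output
            continue
        if not in_report:
            continue
        match = ELAPSED_RE.match(line)
        if match:
            elapsed = float(match.group(1))
            continue
        match = COUNTER_RE.match(line)
        if match:
            raw, event = match.groups()
            # Unsupported events carry no number; store None for them.
            counters[event] = None if raw.startswith("<") else float(raw.replace(",", ""))
    return counters, elapsed


if __name__ == "__main__":
    counters, elapsed = parse_perf_stat(SAMPLE)
    print(counters, elapsed)
```

Storing None for unsupported events keeps "not measured" distinct from a real count of zero, which matters on the VM where most hardware events are unavailable.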

On the Raspberry Pi:
[Screenshot: perf stat output on the Raspberry Pi]
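The default events are:
(1) task-clock: CPU time actually consumed by the task, in ms. (CPU utilization = task-clock / time elapsed)
(2) context-switches: number of context switches.
(3) cpu-migrations: number of times the task moved between CPUs. To keep a multiprocessor system load-balanced, the scheduler migrates a task to another CPU under certain conditions.
(4) page-faults: number of page faults. A page fault is raised when the requested page has not been allocated yet, is not resident in memory, or is resident but has no virtual-to-physical mapping yet. Access-permission mismatches raise a page fault as well. A TLB miss by itself is normally handled by a page-table walk and only ends in a page fault if the walk finds no valid mapping.
(5) cycles: number of CPU cycles consumed.
(6) instructions: number of instructions executed. IPC (instructions per cycle) is the average number of instructions executed per CPU cycle.
(7) branches: number of branch instructions encountered. branch-misses is the number of mispredicted branches.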
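The derived columns in the perf stat report can be reproduced by hand. The sketch below recomputes them from the dd run shown earlier. The "K/sec" figures appear to be computed against task-clock time rather than wall-clock time; that is inferred from the numbers in this article, not from perf documentation. The IPC inputs are made-up values, because cycles and instructions were not supported in the VM.

```python
def cpu_utilization(task_clock_ms, elapsed_s):
    """CPUs utilized = task-clock / time elapsed."""
    return (task_clock_ms / 1000.0) / elapsed_s


def rate_k_per_sec(count, task_clock_ms):
    """Events per second of task-clock time, in thousands (K/sec)."""
    return count / (task_clock_ms / 1000.0) / 1000.0


def ipc(instructions, cycles):
    """Average number of instructions executed per CPU cycle."""
    return instructions / cycles


if __name__ == "__main__":
    # Values taken from the dd example in this article.
    print(f"{cpu_utilization(1020.507771, 1.033620037):.3f} CPUs utilized")
    print(f"{rate_k_per_sec(48, 1020.507771):.3f} K/sec context-switches")
    print(f"{rate_k_per_sec(67, 1020.507771):.3f} K/sec page-faults")
    # Hypothetical counts, for illustration only.
    print(f"IPC = {ipc(2_000_000, 1_000_000):.2f}")
```

Rounded to three decimals, the first three results match the 0.987, 0.047 and 0.066 printed by perf stat.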

Common options (run perf stat -h for the full list)

    -a, --all-cpus        system-wide collection from all CPUs // count on all CPUs
    -C, --cpu <cpu>       list of cpus to monitor in system-wide // restrict to the given CPUs
    -d, --detailed        detailed run - start a lot of events // print more detailed statistics
    -e, --event <event>   event selector. use 'perf list' to list available events // choose events; separate multiple events with commas
    -I, --interval-print <n> print counts at regular interval in ms (>= 10) // print counts every n ms
    -p, --pid <pid>       stat events on existing process id // attach to the process with this pid
    -r, --repeat <n>      repeat command and print average + stddev (max: 100, forever: 0) // run the command repeatedly
    -t, --tid <tid>       stat events on existing thread id // attach to the thread with this tid
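When perf stat is driven from a script, it helps to assemble the argument list in one place. The helper below is a hypothetical convenience wrapper, not part of perf. It only uses the -a, -e, -r and -p options listed above, and it builds the command without running it.

```python
import shlex


def build_perf_stat_cmd(command, events=None, repeat=None, pid=None, all_cpus=False):
    """Build an argv list for perf stat from the options listed above."""
    argv = ["perf", "stat"]
    if all_cpus:
        argv.append("-a")
    if events:
        # Multiple events go into one comma-separated -e argument.
        argv += ["-e", ",".join(events)]
    if repeat is not None:
        argv += ["-r", str(repeat)]
    if pid is not None:
        argv += ["-p", str(pid)]
    argv += list(command)
    return argv


if __name__ == "__main__":
    argv = build_perf_stat_cmd(
        ["dd", "if=/dev/zero", "of=/dev/null", "count=1000000"],
        events=["task-clock", "page-faults"],
        repeat=3,
    )
    print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run on a machine where perf is installed. Passing a list rather than a shell string avoids quoting problems.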

perf top: rank functions or instructions system-wide by an event

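perf top works much like the Linux top command: it shows, in real time, functions ranked by samples of a chosen event, in descending order. The default event is cycles (CPU cycles consumed); the Ubuntu VM output below shows cpu-clock instead, since cycles is not supported there. perf top covers both user-space and kernel functions on all CPUs by default, and can be restricted to specific CPUs.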

// Output on Ubuntu
Samples: 1K of event 'cpu-clock', Event count (approx.): 301398414
Overhead  Shared Object                  Symbol                                 
   8.80%  [kernel]                       [k] _raw_spin_unlock_irqrestore
   6.32%  [kernel]                       [k] clear_page_orig
   5.91%  [kernel]                       [k] kallsyms_expand_symbol.constprop.1
   4.58%  [kernel]                       [k] mpt_put_msg_frame
   4.00%  [kernel]                       [k] update_iter
   3.75%  [kernel]                       [k] trace_graph_entry
   2.83%  [kernel]                       [k] ftrace_graph_caller
   1.77%  [kernel]                       [k] prepare_ftrace_return
   1.75%  [kernel]                       [k] finish_task_switch
   1.75%  libelf-0.158.so                [.] gelf_getsym
   1.67%  perf                           [.] d_demangle_callback
   1.30%  libc-2.19.so                   [.] _int_malloc
   1.02%  perf                           [.] rb_next

On the Raspberry Pi:
[Screenshot: perf top output on the Raspberry Pi]
Common options (run perf top -h for the full list)

    -a, --all-cpus        system-wide collection from all CPUs // sample on all CPUs
    -c, --count <n>       event period to sample // take one sample every n events
    -C, --cpu <cpu>       list of cpus to monitor // restrict to the given CPUs
    -e, --event <event>   event selector. use 'perf list' to list available events // choose the event
    -K, --hide_kernel_symbols hide kernel symbols // hide kernel functions
    -U, --hide_user_symbols hide user symbols // hide user-space functions
    -p, --pid <pid>       profile events on existing process id // profile only the target process and the threads it creates
    -t, --tid <tid>       profile events on existing thread id // profile only the target thread
    -g              enables call-graph recording and display // show call chains (move with the arrow keys, press Enter to expand)

perf record/report: record collects performance data into a file, report displays it

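perf record cmd profiles the command cmd: it collects performance events over a period of time into a file (perf.data by default), which is then analyzed with perf report. It can profile a single thread, a process, or a CPU. The default event is again cycles (CPU cycles consumed). The default sampling frequency depends on the perf version and is not always 1000 samples per second (1000 Hz): the ./a.out run later in this article, recorded without -F, reports 18250000 ns of cpu-clock over 73 samples, which works out to 4000 Hz. Use -F to set the frequency explicitly.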

Example: sample at 1000 Hz on all CPUs for the duration of a 5-second sleep:

root@ubuntu:/home/wy/study/str# perf record -a -F 1000 sleep 5
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.898 MB perf.data (4997 samples) ]
root@ubuntu:/home/uy/study/str# ls
perf.data
root@ubuntu:/home/wuya/study/str# perf report // partial output after running the command:
Samples: 4K of event 'cpu-clock', Event count (approx.): 4997000000                                           
Overhead  Command         Shared Object               Symbol                                                  
  98.96%  swapper         [kernel.kallsyms]           [k] native_safe_halt
   0.14%  swapper         [kernel.kallsyms]           [k] _raw_spin_unlock_irqrestore
   0.08%  tpvmlp          [kernel.kallsyms]           [k] _raw_spin_unlock_irqrestore
   0.06%  Xorg            [kernel.kallsyms]           [k] prepare_ftrace_return
   0.04%  Xorg            [kernel.kallsyms]           [k] _raw_spin_unlock_irqrestore
   0.04%  Xorg            [kernel.kallsyms]           [k] trace_graph_entry
   0.04%  compiz          [kernel.kallsyms]           [k] trace_graph_entry
   0.04%  compiz          [vdso]                      [.] __vdso_clock_gettime
   0.04%  compiz          libcompiz_core.so.0.9.11.3  [.] CompScreen::handleEvent
   0.04%  kworker/0:2     [kernel.kallsyms]           [k] _raw_spin_unlock_irqrestore
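The header of this report can be checked with a little arithmetic. The sketch below assumes that cpu-clock is counted in nanoseconds and that "Event count" is roughly samples times the sampling period. Both assumptions are consistent with the numbers in this article. The function names are invented for the example.

```python
NS_PER_SEC = 1_000_000_000


def period_ns(freq_hz):
    """Nominal time between two samples of a clock event, in nanoseconds."""
    return NS_PER_SEC // freq_hz


def event_count(samples, freq_hz):
    """Approximate event count of a clock event: samples * period."""
    return samples * period_ns(freq_hz)


def implied_freq_hz(event_count_ns, samples):
    """Sampling frequency implied by a report header."""
    return NS_PER_SEC / (event_count_ns / samples)


if __name__ == "__main__":
    # perf record -a -F 1000 sleep 5: about 1000 * 5 samples per CPU were expected,
    # and 4997 were captured.
    print(event_count(4997, 1000))
    # The ./a.out run later in this article, recorded without -F.
    print(implied_freq_hz(18250000, 73))
```

The first result reproduces the 4997000000 shown in the header above. The second shows that the run without -F sampled at 4000 Hz.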

Common perf record options (run perf record -h for the full list)

    -a, --all-cpus        system-wide collection from all CPUs  // sample on all CPUs
    -c, --count <n>       event period to sample // take one sample every n events
    -C, --cpu <cpu>       list of cpus to monitor // restrict to the given CPUs
    -e, --event <event>   event selector. use 'perf list' to list available events // choose the event
    -F, --freq <n>        profile at this frequency // sampling frequency, n samples per second
    -g                    enables call-graph recording // record call graphs (call stacks), so per-callee costs become visible
    -o, --output <file>   output file name // name of the output file
    -P, --period          Record the sample period // store the sample period with each sample
    -p, --pid <pid>       record events on existing process id // record the process with this pid

perf annotate: instruction-level analysis of a recorded profile for precise pinpointing

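perf annotate maps a recorded profile down to individual instructions. For binaries compiled with debug information (-g), it shows the source code alongside the assembly.
Note that annotate cannot resolve symbols from a compressed kernel image; to resolve kernel symbols properly it must be given the uncompressed kernel image, for example: perf annotate -k /tmp/vmlinux -d symbol
Example: main.c contains: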

#include <stdio.h>
#include <time.h>
void func_a() {
   unsigned int num = 1;
   int i;
   for (i = 0;i < 10000000; i++) {
      num *= 2;
      num = 1;
   }
}
void func_b() {
   unsigned int num = 1;
   int i;
   for (i = 0;i < 10000000; i++) {
      num <<= 1;
      num = 1;
   }
}
int main() {
   func_a();
   func_b();
   return 0;
}

Build command: gcc -g -O0 main.c (-g keeps debug information such as the symbol table; -O0 disables optimization)
Profiling command: perf record -a -g ./a.out
perf report output:

Samples: 73  of event 'cpu-clock', Event count (approx.): 18250000       
  Children      Self  Command  Shared Object      Symbol    
+   97.26%     0.00%  a.out    a.out              [.] main 
+   97.26%     0.00%  a.out    libc-2.19.so       [.] __libc_start_main 
+   49.32%    49.32%  a.out    a.out              [.] func_a 
+   47.95%    47.95%  a.out    a.out              [.] func_b 
+    1.37%     1.37%  perf     [kernel.kallsyms]  [k] finish_task_switch  
+    1.37%     0.00%  a.out    ld-2.19.so         [.] dl_main

perf annotate output:

func_a  /home/goodboy/tmp/a.out           
       │    void func_a() {
       │      push   %rbp
       │      mov    %rsp,%rbp
       │       unsigned int num = 1;
       │      movl   $0x1,-0x8(%rbp)
       │       int i;
       │       for (i = 0;i < 10000000; i++) {
       │      movl   $0x0,-0x4(%rbp)
       │    ↓ jmp    22
       │          num *= 2;
 11.11 │14:┌─→shll   -0x8(%rbp)
       │   │      num = 1;
       │   │  movl   $0x1,-0x8(%rbp)
       │   │#include <stdio.h>
       │   │#include <time.h>
       │   │void func_a() {
       │   │   unsigned int num = 1;
       │   │   int i;
       │   │   for (i = 0;i < 10000000; i++) {
  5.56 │   │  addl   $0x1,-0x4(%rbp)
 33.33 │22:│  cmpl   $0x98967f,-0x4(%rbp)
 50.00 │   └──jle    14
       │          num *= 2;
       │          num = 1;
       │       }
       │    }
       │      pop    %rbp
       │    ← retq
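To pick out the hottest instructions from such a listing without scanning it by eye, the annotated lines can be filtered on the percentage column. The sketch below assumes the text layout shown above (a percentage, a "│" separator, then optional jump labels and arrows before the instruction). hot_lines is a hypothetical helper, not a perf feature.

```python
import re

# Lines copied from the loop body in the annotate output above.
ANNOTATE = """\
 11.11 │14:┌─→shll   -0x8(%rbp)
       │   │  movl   $0x1,-0x8(%rbp)
  5.56 │   │  addl   $0x1,-0x4(%rbp)
 33.33 │22:│  cmpl   $0x98967f,-0x4(%rbp)
 50.00 │   └──jle    14
"""

LINE_RE = re.compile(r"^\s*(\d+\.\d+)\s*│(.*)$")


def hot_lines(text):
    """Return (percent, instruction) pairs, hottest first."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # line carries no samples
        percent = float(match.group(1))
        # Drop jump labels ("14:") and the box-drawing arrows.
        instruction = re.sub(r"^[\s\d:│┌└─→]*", "", match.group(2))
        rows.append((percent, " ".join(instruction.split())))
    return sorted(rows, reverse=True)


if __name__ == "__main__":
    for percent, instruction in hot_lines(ANNOTATE):
        print(f"{percent:6.2f}  {instruction}")
```

For func_a this puts the loop's compare and jump (cmpl and jle) at the top, ahead of the shll that implements num *= 2.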

Common options (run perf annotate -h for the full list)

    -C, --cpu <cpu>       list of cpus to profile  // restrict to the given CPUs
    -d, --dsos <dso[,dso...]> only consider symbols in these dsos // only resolve symbols from the given files
    -k, --vmlinux <file>  vmlinux pathname // path to the kernel image
    -m, --modules         load module symbols - WARNING: use only with -k and LIVE kernel
    -P, --full-paths      Don't shorten the displayed pathnames
    -s, --symbol <symbol> symbol to annotate // annotate the given symbol

perf bench: a tool for generating load (to be updated)

Debugging tips

References

perf wiki: https://perf.wiki.kernel.org/index.php/Main_Page
Brendan Gregg's Blog: http://www.brendangregg.com/blog/index.html
A good blog post: https://www.cnblogs.com/arnoldlu/p/6241297.html