perf与火焰图使用介绍

cocoti

已于 2022-07-22 18:04:12 修改

阅读量5.8k

点赞数 6

分类专栏： linux调试文章标签： linux

于 2022-04-11 13:45:02 首次发布

本文链接：https://blog.csdn.net/xuhaitao23/article/details/124016932

版权

linux调试专栏收录该内容

5 篇文章 4 订阅

订阅专栏

一、介绍

perf是linux上的性能分析工具，perf可以对event进行统计得到event的发生次数，或者对event进行采样，得到每次event发生时的相关数据(cpu、进程id、运行栈等)，利用这些数据来对程序性能进行分析。

perf可以统计或采样的event有很多，如果我们要分析cpu，那么我们可以使用cpu-cycles、cpu-clock来衡量占用cpu的程序的分布情况，还可以通过cache-misses、page-faults、branch-misses等event来分析造成cpu占用高的底层原因，确定原因后方便优化。如果我们要分析内存、io、网络等，也可以通过其他event来进行分析，perf可以使用的event非常多，如果要使用perf来分析问题，就需要了解问题相关的event有哪些，作用是什么，这是使用perf的一个门槛。

perf工作大致可以分成三种模式：

counter 计数模式，记录perf执行过程中，统计的目标程序或者整个系统范围内，event的出现次数。
sampling 采样模式，按照指定频率去采样event，记录每次采样时，采样事件输出的信息(cpu、进程id、运行栈等)。这种方式由于每次都记录信息，所以额外的资源消耗是比较大的，需要权衡一下采样频率。同时产生的数据量也容易很大，可能需要大量的硬盘空间。
bfp 可以对对应的event执行用户自己设计的代码，也就是说记录的信息、执行的操作可以由用户定制

perf可以使用的event非常多，上图是Brendan Gregg的文章中找到的一张图，画出了perf可以使用的event的结构图，大致可以分为以下几类：
Hardware Events: CPU的PMU(performance monitoring unit)触发的事件，也叫performance monitoring counters (PMCs)，例如cpu-cycles、cache miss
Software Events: 一些比较底层的软件event，例如缺页、timer(定时)
Kernel Tracepoint Events: 内核中的tracepoint
User Statically-Defined Tracing (USDT): 用户态的tracepoint
Dynamic Tracing: 动态设置的event，例如使用内核的kprobe，可以在大部分函数动态增加event
Timed Profiling: 定时event

二、perf使用示例

在这里插入图片描述

上图整理了perf的子命令之间的关系，常用的有：

perf record —— 采样，生成perf.data二进制文件
perf annotate/perf report/perf script —— 分析perf.data文件，annotate可以查看代码，report可以统计分析，script是直接转化成文本格式
perf stat —— counter，统计event的出现次数
perf top —— 整个系统的分析，类似于top命令，但可以具体到函数，可以指定event

下面我们介绍一些常用的使用方法。

1. perf list

perf list可以查看当前系统支持哪些event：

root:/# perf list

List of pre-defined events (to be used in -e):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  ref-cycles                                         [Hardware event]

  alignment-faults                                   [Software event]
  bpf-output                                         [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy                                              [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  minor-faults                                       [Software event]
  page-faults OR faults                              [Software event]
  task-clock                                         [Software event]

  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
...
  branch-instructions OR cpu/branch-instructions/    [Kernel PMU event]
  branch-misses OR cpu/branch-misses/                [Kernel PMU event]
  bus-cycles OR cpu/bus-cycles/                      [Kernel PMU event]
...
cache:
  l1d.replacement
       [L1D data line replacements]
  ...
floating point:
  avx_insts.all
  ...
frontend:
  dsb2mite_switches.penalty_cycles
  ...
memory:
  hle_retired.aborted
  ...
...
  mem:<addr>[/len][:access]                          [Hardware breakpoint]

  alarmtimer:alarmtimer_cancel                       [Tracepoint event]
  alarmtimer:alarmtimer_fired                        [Tracepoint event]
  alarmtimer:alarmtimer_start                        [Tracepoint event]
  alarmtimer:alarmtimer_suspend                      [Tracepoint event]
  block:block_bio_backmerge                          [Tracepoint event]
  block:block_bio_bounce                             [Tracepoint event]
  block:block_bio_complete                           [Tracepoint event]
  block:block_bio_frontmerge                         [Tracepoint event]
...
  sdt_libc:lll_lock_wait_private                     [SDT event]
  sdt_libc:longjmp                                   [SDT event]
  sdt_libc:longjmp_target                            [SDT event]
  sdt_libc:memory_arena_new                          [SDT event]
...
Metric Groups:

Branch_Mispredicts:
  IpMispredict
       [Number of Instructions per non-speculative Branch Misprediction (JEClear)]
Branches:
  BpTB
       [Branch instructions per taken branch]
  IpB
       [Instructions per Branch]
  IpCall
       [Instruction per (near) call]
  IpTB
       [Instruction per taken branch]
Cache_Misses:
...

可以看到event非常多，要了解所有的event很困难，一般使用时就使用几个常用的event。

2. perf top

我写了一个4线程的简单例子，func1比较特殊是为了后面做branch-misses测试：

#include <thread>
#include <stdlib.h>
using namespace std;

#define ARRAY_SIZE(a) (sizeof(a)/sizeof(a[0]))
void func1()
{
	int i = 0;
	int j = 0;
	int a[8096];
	for (i = 0; i < ARRAY_SIZE(a); ++i)
		a[i] = rand()%256;
	while (1) {
		for (i = 0; i < ARRAY_SIZE(a); ++i)
		{
			if (a[i] > 128)
				++j;
		}
	}
}
void func2()
{
	int i = 0;
	while (1) {};
}
void func3()
{
	int i = 0;
	while (1) {};
}
void func4()
{
	int i = 0;
	while (1) {};
}

int main()
{
	thread t1(func1);
	thread t2(func2);
	thread t3(func3);
	thread t4(func4);
	t1.join();
	t2.join();
	t3.join();
	t4.join();
	return 0;
}

在一个4core的cpu上运行后，运行perf top查看：

root:/# perf top
   PerfTop:   15736 irqs/sec  kernel: 0.3%  exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles],  (all, 4 CPUs)
-------------------------------------------------------------------------------

    24.85%  test              [.] func4
    24.75%  test              [.] func2
    24.74%  test              [.] func3
    24.50%  test              [.] func1
     0.07%  [kernel]          [k] native_read_msr
     0.05%  [kernel]          [k] kallsyms_expand_symbol.constprop.6
     0.04%  [kernel]          [k] number
     0.04%  libc-2.27.so      [.] _int_malloc
     0.03%  [kernel]          [k] vsnprintf
     0.03%  perf              [.] 0x00000000000ca993
     0.03%  [kernel]          [k] native_write_msr_safe
     0.02%  [kernel]          [k] format_decode
     0.02%  libc-2.27.so      [.] _int_free
     0.02%  [kernel]          [k] native_irq_return_iret
     0.02%  chm               [.] get_listen_fd_set
     0.02%  perf              [.] 0x00000000000b8f9f
     0.02%  [kernel]          [k] memcpy_erms
     0.01%  libc-2.27.so      [.] __strchr_avx2
     0.01%  [kernel]          [k] string_nocheck
     0.01%  libc-2.27.so      [.] __strcmp_sse2_unaligned

可以看到占用最高的就是测试代码中的4个程序。perf top默认使用的event应该是cpu-cycles，也就是统计cpu的占用，也可以通过-e参数来指定其他event，我们测试一下branch-misses：

root:/# perf top -e branch-misses

   PerfTop:    2494 irqs/sec  kernel: 2.1%  exact:  0.0% lost: 0/0 drop: 0/0 [4000Hz branch-misses],  (all, 4 CPUs)
-------------------------------------------------------------------------------

    88.36%  test              [.] func1
     3.38%  [kernel]          [k] kallsyms_expand_symbol.constprop.6
     0.63%  [kernel]          [k] widen_string
     0.31%  [kernel]          [k] update_iter
     0.26%  perf              [.] 0x00000000000d65f4
     0.18%  libc-2.27.so      [.] __strchr_avx2
     0.17%  libc-2.27.so      [.] __strcmp_sse2_unaligned
     0.16%  perf              [.] 0x00000000000d65f0
     0.15%  libc-2.27.so      [.] __strlen_avx2
     0.15%  libc-2.27.so      [.] __strncmp_sse42
     0.14%  perf              [.] 0x00000000000bcc82
     0.13%  perf              [.] 0x00000000000ca997
     0.13%  perf              [.] 0x00000000000ca9ab
     0.11%  [kernel]          [k] strlen
     0.11%  libc-2.27.so      [.] __memmove_avx_unaligned_erms
     0.11%  perf              [.] 0x00000000000d5bd1
     0.10%  perf              [.] 0x00000000000bcc53
     0.10%  perf              [.] 0x00000000000bce2f
     0.10%  perf              [.] 0x00000000000a9de1
     0.09%  perf              [.] 0x00000000000cd380

可以看到都集中到func1了。

3. perf stat

perf stat用于统计event出现的次数，我们还是先运行前面的测试代码，再简单运行一下perf stat看看：

root@NVR:/# perf stat
^C
 Performance counter stats for 'system wide':

          26554.66 msec cpu-clock                 #    3.989 CPUs utilized
              1123      context-switches          #    0.042 K/sec
                63      cpu-migrations            #    0.002 K/sec
              1792      page-faults               #    0.067 K/sec
       84466426364      cycles                    #    3.181 GHz
       87751300851      instructions              #    1.04  insn per cycle
       67109814507      branches                  # 2527.233 M/sec
          14199772      branch-misses             #    0.02% of all branches
      211149758565      slots                     # 7951.514 M/sec
       86127311435      topdown-retiring          #     40.8% retiring
        2783995588      topdown-bad-spec          #      1.3% bad speculation
      113766529032      topdown-fe-bound          #     53.9% frontend bound
        8471922503      topdown-be-bound          #      4.0% backend bound

       6.656672156 seconds time elapsed

可以看到，打印了一些默认event的出现次数，主要都是用于分析cpu的。加上-d参数，也就是detail，可以打印更多的event：

root@NVR:/# perf stat -d
^C
 Performance counter stats for 'system wide':

          12109.05 msec cpu-clock                 #    3.982 CPUs utilized
               909      context-switches          #    0.075 K/sec
                24      cpu-migrations            #    0.002 K/sec
               605      page-faults               #    0.050 K/sec
       38096645022      cycles                    #    3.146 GHz
       39579623649      instructions              #    1.04  insn per cycle
       30206324853      branches                  # 2494.525 M/sec
           6122035      branch-misses             #    0.02% of all branches
       95258121530      slots                     # 7866.689 M/sec
       38858931263      topdown-retiring          #     40.8% retiring
        1373476991      topdown-bad-spec          #      1.4% bad speculation
       51203585282      topdown-fe-bound          #     53.8% frontend bound
        3822129305      topdown-be-bound          #      4.0% backend bound
        4950231287      L1-dcache-loads           #  408.804 M/sec
           4890442      L1-dcache-load-misses     #    0.10% of all L1-dcache accesses
            180721      LLC-loads                 #    0.015 M/sec
             21947      LLC-load-misses           #   12.14% of all LL-cache accesses

       3.041273402 seconds time elapsed

perf stat可以对代码各个事件的发生情况有个初步的判断，例如3.982 CPUs utilized就表示3.982个cpu在使用中，总共就4个cpu，那么就表示cpu的负载是很高的，这个是通过12109.05 msec cpu-clock，除3.041273402 seconds time elapsed得到的。

4. perf record

perf record是采样模式，采样hardware event时，可以设置采样频率(-F或者-c)，但是tracepoint貌似使用-F或者-c设置频率并没有作用，但在help中也没有看到相关说明，不清楚具体机制是怎么样的。

perf record可以对perf list中列出的event进行采样，我们这里对cycles和cpu-clock着两个event进行采样分析。

cycles

perf record可以加-e参数指定event，如果没有加-e，那么默认使用cycles这个event。cycles越多，就以为着使用的cpu越多，我的测试平台的cpu是4核约3.219 GHz。

先运行之前的测试程序，再简单试一下perf record：

root:/# perf record
^C[ perf record: Woken up 0 times to write data ]
[ perf record: Captured and wrote 0.344 MB perf.data (4801 samples) ]

root:/# ls perf.data -al
-rw-------    1 root     root        163390 Apr  8 01:02 perf.data

没有指定命令，也没有指定运行时间，那么就需要ctrl+c手动退出，退出后会生成perf.data文件，内容是二进制数据，无法直接查看。

我们这里使用最简单的perf script来查看，作用很简单就是将其转换成文本形式：

root:/# perf script | head -n 30
            perf 24227 [000] 195303.278925:          1   cycles:  ffffffff97a77d68 native_write_msr+0x8 ([kernel.kallsyms])
            perf 24227 [000] 195303.278930:          1   cycles:  ffffffff97a77d68 native_write_msr+0x8 ([kernel.kallsyms])
            perf 24227 [000] 195303.278931:          7   cycles:  ffffffff97a77d68 native_write_msr+0x8 ([kernel.kallsyms])
            perf 24227 [000] 195303.278933:        142   cycles:  ffffffff97a77d68 native_write_msr+0x8 ([kernel.kallsyms])
            perf 24227 [000] 195303.278934:       3290   cycles:  ffffffff97a77d68 native_write_msr+0x8 ([kernel.kallsyms])
            perf 24227 [000] 195303.278936:      75079   cycles:  ffffffff97b52282 generic_exec_single+0x62 ([kernel.kallsyms])
            perf 24227 [001] 195303.278959:          1   cycles:  ffffffff97a77d68 native_write_msr+0x8 ([kernel.kallsyms])
            test 23860 [000] 195303.278960:    1553468   cycles:      562cf2400ce4 func1+0xaa (/root/test)
            perf 24227 [001] 195303.278962:          1   cycles:  ffffffff97a77d68 native_write_msr+0x8 ([kernel.kallsyms])
            perf 24227 [001] 195303.278964:         10   cycles:  ffffffff97a77d68 native_write_msr+0x8 ([kernel.kallsyms])
            perf 24227 [001] 195303.278965:        187   cycles:  ffffffff97a77d68 native_write_msr+0x8 ([kernel.kallsyms])
            perf 24227 [001] 195303.278967:       4338   cycles:  ffffffff97a77d68 native_write_msr+0x8 ([kernel.kallsyms])
            perf 24227 [001] 195303.278969:      93276   cycles:  ffffffff97a41d70 fpregs_assert_state_consistent+0x0 ([kernel.kallsyms])
            perf 24227 [002] 195303.278986:          1   cycles:  ffffffff97a77d68 native_write_msr+0x8 ([kernel.kallsyms])
            perf 24227 [002] 195303.278989:          1   cycles:  ffffffff97a77d68 native_write_msr+0x8 ([kernel.kallsyms])
            perf 24227 [002] 195303.278990:         13   cycles:  ffffffff97a77d68 native_write_msr+0x8 ([kernel.kallsyms])
            perf 24227 [002] 195303.278992:        252   cycles:  ffffffff97a77d68 native_write_msr+0x8 ([kernel.kallsyms])
            perf 24227 [002] 195303.278994:       4934   cycles:  ffffffff97a77d68 native_write_msr+0x8 ([kernel.kallsyms])
            perf 24227 [002] 195303.278996:      97258   cycles:            4b7d14 [unknown] (/bin/perf)
            test 23863 [001] 195303.278999:    1556756   cycles:      562cf2400d31 func4+0x19 (/root/test)
            perf 24227 [003] 195303.279010:          1   cycles:  ffffffff97a77d68 native_write_msr+0x8 ([kernel.kallsyms])
            perf 24227 [003] 195303.279012:          1   cycles:  ffffffff97a77d68 native_write_msr+0x8 ([kernel.kallsyms])
            perf 24227 [003] 195303.279014:         14   cycles:  ffffffff97a77d68 native_write_msr+0x8 ([kernel.kallsyms])
            perf 24227 [003] 195303.279015:        312   cycles:  ffffffff97a77d68 native_write_msr+0x8 ([kernel.kallsyms])
            perf 24227 [003] 195303.279016:       7432   cycles:  ffffffff97a3eeae native_sched_clock+0x3e ([kernel.kallsyms])
            perf 24227 [003] 195303.279019:     169030   cycles:  ffffffff97fe4c84 alloc_cpumask_var_node+0x4 ([kernel.kallsyms])
            test 23861 [002] 195303.279028:    1409420   cycles:      562cf2400cff func2+0xb (/root/test)
            test 23862 [003] 195303.279073:    2007827   cycles:      562cf2400d0c func3+0xb (/root/test)
            test 23860 [000] 195303.279448:    3321369   cycles:      562cf2400ceb func1+0xb1 (/root/test)
            test 23861 [002] 195303.279471:    2604359   cycles:      562cf2400cff func2+0xb (/root/test)

以最后一行为例，打印出来的结果，分别是：

test									程序名
23861									线程tid
[002]									cpu号
195303.279471							时间
2604359   cycles						事件的发生次数
562cf2400cff func2+0xb (/root/test)		代码地址，以及所在的函数、可执行文件，如果没有相应的符号信息则无法解析出来

我们再加一些参数：

-g				使用frame-pointer打印栈
-p 23859		只对进程23859进行统计，这个是运行的测试代码的进程id
-- sleep 5		运行5秒的perf
-c 10000		每10000个event进行一次采样

运行一下：

root@NVR:/# perf record -g -p 23859 -c 10000 -- sleep 5
[ perf record: Woken up 0 times to write data ]
[ perf record: Captured and wrote 149.799 MB perf.data (1360311 samples) ]
root@NVR:/# perf script | head -n 30
test 23863 196133.609777:      10000 cycles:
            562cf2400d31 func4+0x19 (/root/test)
            562cf2401155 std::__invoke_impl<void, void (*)()>+0x1d (/root/test)
            562cf2400f62 std::__invoke<void (*)()>+0x35 (/root/test)
            562cf2401630 std::thread::_Invoker<std::tuple<void (*)()> >::_M_invoke<0ul>+0x28 (/root/test)
            562cf24015ec std::thread::_Invoker<std::tuple<void (*)()> >::operator()+0x2c (/root/test)
            562cf24015bc std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)()> > >::_M_run+0x1c (/root/test)
            7fdc031ae37d [unknown] (/lib/libstdc++.so)
            7fdc02e21d3f __clone+0x3f (/lib/libc.so)

test 23863 196133.609784:      10000 cycles:
        ffffffff98800dd3 asm_sysvec_call_function_single+0x3 ([kernel.kallsyms])
            562cf2400d31 func4+0x19 (/root/test)
            562cf2401155 std::__invoke_impl<void, void (*)()>+0x1d (/root/test)
            562cf2400f62 std::__invoke<void (*)()>+0x35 (/root/test)
            562cf2401630 std::thread::_Invoker<std::tuple<void (*)()> >::_M_invoke<0ul>+0x28 (/root/test)
            562cf24015ec std::thread::_Invoker<std::tuple<void (*)()> >::operator()+0x2c (/root/test)
            562cf24015bc std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)()> > >::_M_run+0x1c (/root/test)
            7fdc031ae37d [unknown] (/lib/libstdc++.so.6.0.24)
            7fdc02e21d3f __clone+0x3f (/lib/libc-2.27.so)

test 23863 196133.609789:      10000 cycles:
            562cf2400d31 func4+0x19 (/root/test)
            562cf2401155 std::__invoke_impl<void, void (*)()>+0x1d (/root/test)
            562cf2400f62 std::__invoke<void (*)()>+0x35 (/root/test)
            562cf2401630 std::thread::_Invoker<std::tuple<void (*)()> >::_M_invoke<0ul>+0x28 (/root/test)
            562cf24015ec std::thread::_Invoker<std::tuple<void (*)()> >::operator()+0x2c (/root/test)
            562cf24015bc std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)()> > >::_M_run+0x1c (/root/test)
            7fdc031ae37d [unknown] (/lib/libstdc++.so)
            7fdc02e21d3f __clone+0x3f (/lib/libc.so)

有几个变化：

只有23863进程的事件了
不需要ctrl+c退出，到时间后自动结束
打印出了栈信息，这个需要在编译时带上-fno-omit-frame-pointer才能打印出来。如果使用dwarf或者lbr则需要其他选项
每次的cycles数据都是精确的10000。不加-c参数时，默认好像是使用-F 4000，也就是4000Hz的采样频率，对于cpu-clock这个是精准的，但如果是其他事件，事件本身与时间并没有关系，perf用了一些手段尝试接近这个频率，但不一定精准。

cpu-clock

cpu-clock是时间事件，也就是说定时去进行采样。使用cpu-clock的好处是不同的cpu负载下，得到的样本数是差不多的，频率固定时间固定，得到的样本数就固定。

cpu-clock的采样频率也可以用-c或者-F指定，都是比较精准的，-c指定的应该就是微秒数，-F就是频率，不指定-c或者-F那么就是4000Hz的频率。

我们就简单看一下使用：

root:/# perf record -e cpu-clock -c 1000000
^C[ perf record: Woken up 0 times to write data ]
[ perf record: Captured and wrote 0.710 MB perf.data (13300 samples) ]

root:/# perf script | head -n 10
            test 23861 [000] 196936.405540:    1000000 cpu-clock:      562cf2400cff func2+0xb (/root/test)
            test 23860 [001] 196936.405559:    1000000 cpu-clock:      562cf2400ce4 func1+0xaa (/root/test)
            test 23862 [002] 196936.405570:    1000000 cpu-clock:      562cf2400d0c func3+0xb (/root/test)
            test 23863 [003] 196936.405582:    1000000 cpu-clock:      562cf2400d31 func4+0x19 (/root/test)
            test 23861 [000] 196936.406538:    1000000 cpu-clock:      562cf2400cff func2+0xb (/root/test)
            test 23860 [001] 196936.406558:    1000000 cpu-clock:      562cf2400cdd func1+0xa3 (/root/test)
            test 23862 [002] 196936.406570:    1000000 cpu-clock:      562cf2400d0c func3+0xb (/root/test)
            test 23863 [003] 196936.406582:    1000000 cpu-clock:      562cf2400d31 func4+0x19 (/root/test)
            test 23861 [000] 196936.407538:    1000000 cpu-clock:      562cf2400cff func2+0xb (/root/test)
            test 23860 [001] 196936.407558:    1000000 cpu-clock:      562cf2400cc6 func1+0x8c (/root/test)

我们使用cpu-clock时，最好是不要使用100、1000这样比较整的频率，参考"Linux perf: 为什么采样频率设置为99Hz而不是100Hz？"

5. perf probe

perf probe是对kprobe和uprobe的封装，可以动态添加event。

一般人应该不太需要对kernel增加event，我就不以kprobe距离了，我们看uprobe。

由于我测试机子上的perf版本可能功能不够完整，测试uprobe时有一点问题，重新编译比较麻烦，我们这里就到ubuntu上去测试了。原本是想以malloc函数为例进行测试的，但是由于系统的libc貌似没有带符号表，导致无法直接打印参数，搭建环境真的是困难重重。

我们先通过使用寄存器打印参数的方式测试malloc，再另外写一个测试代码直接使用参数名打印参数。

malloc

首先写一个调用malloc的测试代码：

#include <stdlib.h>

int main()
{
        int i = 0;
        void *p;
        for (i = 0; i < 10000; ++i)
        {
                p = malloc((i%10+1)*1024);
                free(p);
        }
}

malloc函数一般是在libc当中，查一下libc的路径：

root:/# gcc -print-file-name=libc.so
/usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu/libc.so

root:/# file /usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu/libc.so
/usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu/libc.so: ASCII text

#不是可执行文件，cat看看
root:/# cat /usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu/libc.so
/* GNU ld script
   Use the shared library, but some functions are only in
   the static library, so try that secondarily.  */
OUTPUT_FORMAT(elf64-x86-64)
GROUP ( /lib/x86_64-linux-gnu/libc.so.6 /usr/lib/x86_64-linux-gnu/libc_nonshared.a  AS_NEEDED ( /lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 ) )

#指向/lib/x86_64-linux-gnu/libc.so.6
root:/# file /lib/x86_64-linux-gnu/libc.so.6
/lib/x86_64-linux-gnu/libc.so.6: symbolic link to libc-2.27.so

#最终的动态库文件是/lib/x86_64-linux-gnu/libc-2.27.so，已经是stripped，没有符号表了无法直接打印参数
root:/# file /lib/x86_64-linux-gnu/libc-2.27.so
/lib/x86_64-linux-gnu/libc-2.27.so: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=71f0f3074a929e519e85f6a5c03a7d1fd976bfe4, for GNU/Linux 3.2.0, stripped

知道了/lib/x86_64-linux-gnu/libc-2.27.so这个文件之后，我们使用perf probe来创建event：

root:/# perf probe -x /lib/x86_64-linux-gnu/libc-2.27.so malloc %di
Added new events:
  probe_libc:malloc    (on malloc in /lib/x86_64-linux-gnu/libc-2.27.so with %di)
  probe_libc:malloc_1  (on malloc in /lib/x86_64-linux-gnu/libc-2.27.so with %di)

You can now use it in all perf tools, such as:

        perf record -e probe_libc:malloc_1 -aR sleep 1
# 创建了两个event，malloc和malloc_1，可能是有两个同名函数。
# x86系统上，参数少的时候是通过寄存器来传递参数的，%di就是di寄存器，是第一个参数，也就是malloc的大小

# 查看event，已经生成了，但没有enable
root:/# ls /sys/kernel/debug/tracing/events/probe_libc/
enable  filter  malloc  malloc_1

# perf list也可以看到
root:/# perf list | grep malloc
  kmem:kmalloc                                       [Tracepoint event]
  kmem:kmalloc_node                                  [Tracepoint event]
  probe_libc:malloc                                  [Tracepoint event]
  probe_libc:malloc_1                                [Tracepoint event]
  sdt_libc:memory_malloc_retry                       [SDT event]

# 如果要删除event，执行：
root:/# perf probe -d malloc*
Removed event: probe_libc:malloc
Removed event: probe_libc:malloc_1

然后我们通过perf record和测试程序来测试一下这个event：

root:/# perf record -e probe_libc:malloc* ./test
[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 0.856 MB perf.data (10000 samples) ]
root:/# perf script | head -n 20
            test 11783 [000] 4406175.631329: probe_libc:malloc_1: (7f1806d55020) arg1=0x400
            test 11783 [000] 4406175.631372: probe_libc:malloc_1: (7f1806d55020) arg1=0x800
            test 11783 [000] 4406175.631374: probe_libc:malloc_1: (7f1806d55020) arg1=0xc00
            test 11783 [000] 4406175.631377: probe_libc:malloc_1: (7f1806d55020) arg1=0x1000
            test 11783 [000] 4406175.631378: probe_libc:malloc_1: (7f1806d55020) arg1=0x1400
            test 11783 [000] 4406175.631379: probe_libc:malloc_1: (7f1806d55020) arg1=0x1800
            test 11783 [000] 4406175.631381: probe_libc:malloc_1: (7f1806d55020) arg1=0x1c00
            test 11783 [000] 4406175.631384: probe_libc:malloc_1: (7f1806d55020) arg1=0x2000
            test 11783 [000] 4406175.631385: probe_libc:malloc_1: (7f1806d55020) arg1=0x2400
            test 11783 [000] 4406175.631386: probe_libc:malloc_1: (7f1806d55020) arg1=0x2800
            test 11783 [000] 4406175.631388: probe_libc:malloc_1: (7f1806d55020) arg1=0x400
            test 11783 [000] 4406175.631389: probe_libc:malloc_1: (7f1806d55020) arg1=0x800
            test 11783 [000] 4406175.631390: probe_libc:malloc_1: (7f1806d55020) arg1=0xc00
            test 11783 [000] 4406175.631392: probe_libc:malloc_1: (7f1806d55020) arg1=0x1000
            test 11783 [000] 4406175.631393: probe_libc:malloc_1: (7f1806d55020) arg1=0x1400
            test 11783 [000] 4406175.631394: probe_libc:malloc_1: (7f1806d55020) arg1=0x1800
            test 11783 [000] 4406175.631396: probe_libc:malloc_1: (7f1806d55020) arg1=0x1c00
            test 11783 [000] 4406175.631397: probe_libc:malloc_1: (7f1806d55020) arg1=0x2000
            test 11783 [000] 4406175.631398: probe_libc:malloc_1: (7f1806d55020) arg1=0x2400
            test 11783 [000] 4406175.631400: probe_libc:malloc_1: (7f1806d55020) arg1=0x2800

可以看到，得到了malloc的调用情况，arg1就是malloc的大小

测试代码——通过名字打印参数

先写一个测试程序：

#include <stdlib.h>

int func1(int a)
{
        return 0;
}

int main()
{
        int i = 0;
        for (i = 0; i < 10000; ++i)
        {
                func1(i);
        }
}

后面对func1函数进行采样。

先编译测试程序，需要加上-g参数。

然后perf probe添加event，直接使用a来输出参数：

root:/# perf probe -x ./test func1 a
Target program is compiled without optimization. Skipping prologue.
Probe on address 0x5fa to force probing at the function entry.

Added new event:
  probe_test:func1     (on func1 in /test with a)

You can now use it in all perf tools, such as:

        perf record -e probe_test:func1 -aR sleep 1

再通过perf record进行测试：

root:/# perf record -e probe_test:func1 ./test
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 0.703 MB perf.data (10000 samples) ]
root:/# perf script | head -n 10
            test 12086 [000] 4406546.095893: probe_test:func1: (55db1c1f3601) a=0
            test 12086 [000] 4406546.095905: probe_test:func1: (55db1c1f3601) a=1
            test 12086 [000] 4406546.095908: probe_test:func1: (55db1c1f3601) a=2
            test 12086 [000] 4406546.095910: probe_test:func1: (55db1c1f3601) a=3
            test 12086 [000] 4406546.095913: probe_test:func1: (55db1c1f3601) a=4
            test 12086 [000] 4406546.095915: probe_test:func1: (55db1c1f3601) a=5
            test 12086 [000] 4406546.095918: probe_test:func1: (55db1c1f3601) a=6
            test 12086 [000] 4406546.095920: probe_test:func1: (55db1c1f3601) a=7
            test 12086 [000] 4406546.095923: probe_test:func1: (55db1c1f3601) a=8
            test 12086 [000] 4406546.095925: probe_test:func1: (55db1c1f3601) a=9

perf probe使用起来还是比较方便的，我们也可以通过echo、cat来操作uprobe，但是需要通过objdump等方式获取到代码地址，才能够添加event，也不能直接通过参数名称来打印参数，很不方便，perf probe就是加了一层封装便于使用。

filter

sudo perf probe -x /lib/x86_64-linux-gnu/libpthread-2.33.so "open64 file oflag"
sudo perf record -e probe_libpthread:open64 -aR --filter 'oflag = 524865' -p 922374

6. perf annotate

perf annotate用于直接查看代码，应该有一些其他依赖，在测试机子上跑有一些问题，我们还是在ubuntu上测试。

随便写个测试代码：

#include <thread>
#include <stdlib.h>
#include <unistd.h>
using namespace std;

void func1(int usec)
{
        int k = 0;
        while (1) {
                ++k;
                if (k % 1000000 == 0)
                        usleep(usec);
        }
}

int main()
{
        thread t1(func1, 1000);
        thread t2(func1, 2000);
        thread t3(func1, 3000);
        thread t4(func1, 4000);
        t1.join();
        t2.join();
        t3.join();
        t4.join();
        return 0;
}

perf record得到采样数据：

root:/# perf record ./test
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.283 MB perf.data (7000 samples) ]

perf annotate查看：

root:/# perf annotate
Samples: 7K of event 'cycles', 4000 Hz, Event count (approx.): 5390771294
func1  /home/xuhaitao/tftp/test [Percent: local period]
Percent│
       │    /home/xuhaitao/.debug/.build-id/0f/e8c3ceff6e01e84da3d086ef10f8150a8fc29f/elf：     文件格式 elf64-x86-64
       │
       │
       │    Disassembly of section .text:
       │
       │    0000000000000c3a <func1(int)>:
       │    _Z5func1i():
       │    #include <stdlib.h>
       │    #include <unistd.h>
       │    using namespace std;
       │
       │    void func1(int usec)
       │    {
       │      push  %rbp
       │      mov   %rsp,%rbp
       │      sub   $0x20,%rsp
       │      mov   %edi,-0x14(%rbp)
       │    int k = 0;
       │      movl  $0x0,-0x4(%rbp)
       │    while (1) {
       │    ++k;
  0.33 │12:   addl  $0x1,-0x4(%rbp)
       │    if (k % 1000000 == 0)
  3.74 │      mov   -0x4(%rbp),%ecx
 11.86 │      mov   $0x431bde83,%edx
  0.20 │      mov   %ecx,%eax
 27.34 │      imul  %edx
  1.00 │      sar   $0x12,%edx
  0.06 │      mov   %ecx,%eax
  1.83 │      sar   $0x1f,%eax
 11.09 │      sub   %eax,%edx
  0.23 │      mov   %edx,%eax
 23.63 │      imul  $0xf4240,%eax,%eax
  8.30 │      sub   %eax,%ecx
  3.52 │      mov   %ecx,%eax
       │      test  %eax,%eax
  6.88 │    ↑ jne   12
       │    usleep(usec);
       │      mov   -0x14(%rbp),%eax
       │      mov   %eax,%edi
       │    → callq usleep@plt
       │    ++k;
       │    ↑ jmp   12

可以看到占用比例高的在代码的哪个位置。

7. perf report

我们就直接使用上面perf annotate测试过程中生成的数据来测试perf report，可以统计占用比例。

Samples: 7K of event 'cycles', Event count (approx.): 5390771294
Overhead  Command  Shared Object      Symbol
  99.55%  test     test               [.] func1
   0.06%  test     [kernel.kallsyms]  [k] prepare_exit_to_usermode
   0.04%  test     [kernel.kallsyms]  [k] kmem_cache_alloc
   0.04%  test     ld-2.27.so         [.] do_lookup_x
   0.03%  test     [kernel.kallsyms]  [k] update_curr
   0.02%  test     [kernel.kallsyms]  [k] do_nanosleep
   0.01%  test     [kernel.kallsyms]  [k] try_to_wake_up
   0.01%  test     [kernel.kallsyms]  [k] rtl8169_interrupt
   0.01%  test     [kernel.kallsyms]  [k] task_tick_fair
   0.01%  test     [kernel.kallsyms]  [k] finish_task_switch
   0.01%  test     [kernel.kallsyms]  [k] native_write_msr
   0.01%  test     [kernel.kallsyms]  [k] enqueue_entity
   0.01%  test     [kernel.kallsyms]  [k] psi_task_change
   0.01%  test     [kernel.kallsyms]  [k] ip_sublist_rcv
   0.01%  test     [kernel.kallsyms]  [k] get_task_policy.part.33
   0.01%  test     [kernel.kallsyms]  [k] __calc_delta
   0.01%  test     [kernel.kallsyms]  [k] __update_load_avg_se
   0.01%  test     [kernel.kallsyms]  [k] rb_erase
   0.01%  test     [kernel.kallsyms]  [k] update_rq_clock.part.82
   0.01%  test     [kernel.kallsyms]  [k] __task_rq_unlock
   0.01%  test     [kernel.kallsyms]  [k] rcu_segcblist_ready_cbs
   0.01%  test     [kernel.kallsyms]  [k] hugepage_vma_check
   0.01%  test     [kernel.kallsyms]  [k] newidle_balance
   0.01%  test     [kernel.kallsyms]  [k] __hrtimer_run_queues
   0.00%  test     [kernel.kallsyms]  [k] do_syscall_64
   0.00%  test     [kernel.kallsyms]  [k] nmi_restore
   0.00%  perf     [kernel.kallsyms]  [k] native_sched_clock
   0.00%  test     [kernel.kallsyms]  [k] native_sched_clock
   0.00%  test     [kernel.kallsyms]  [k] sched_clock
   0.00%  perf     [kernel.kallsyms]  [k] native_apic_mem_write
   0.00%  perf     [kernel.kallsyms]  [k] native_write_msr

可以看到各个函数的比例，如果record时加上了-g统计调用栈，还可以看到调用关系，例如：
在这里插入图片描述

这个界面是可以交互的，可以进行选择event、展开函数等操作。

8. perf timechart

perf timechart用于画时序图。

应该是使用sched:xxx这些事件，来判断哪些线程在哪个cpu上运行，然后画出时序图。

可以通过perf record得到perf.data数据，也可以直接使用perf timechart record来获取数据，这样就不需要手动指定event。

我们还是用上面的测试程序进行测试。

# 运行测试程序
root:/# ./test &
# 采样
root:/# perf timechart record
^C[ perf record: Woken up 0 times to write data ]
[ perf record: Captured and wrote 3.140 MB perf.data (30458 samples) ]
# 生成svg文件
root:/# perf timechart
Written 0.6 seconds of trace to output.svg.

用浏览器打开output.svg：
在这里插入图片描述

可以看到cpu在不同时间的工作情况，以及不同线程在不同时间的工作情况。
perf timechart比较粗糙，也可以试试使用kernelshark进行时序分析。

使用perf evlist可以查看perf.data中有哪些事件：

root:/# perf evlist
sched:sched_wakeup
sched:sched_switch
power:cpu_frequency
power:cpu_idle
dummy:HG
# Tip: use 'perf evlist --trace-fields' to show fields for tracepoint events

三、火焰图使用示例

1. on-cpu火焰图

on-cpu火焰图可以用于分析cpu是被哪些线程、哪些函数占用的，可以方便的找到热点代码便于后续分析优化。

先随便写个测试代码：

#include <thread>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/prctl.h>
using namespace std;

void func2(int n) {
        char *p = (char*)malloc((n+1)*1024);
        for (int j = 0; j < (n+1)*1024; ++j) p[j] = 0;
        free(p);
}

void func1(int n, char *name)
{
        prctl(PR_SET_NAME, name);
        int k = 0;
        for (int i = 0; i < n; ++i) {
                char *p = (char*)malloc((i+1)*1024);
                for (int j = 0; j < (i+1)*1024; ++j) p[j] = 0;
                free(p);
                func2(i);
        }
}

int main()
{
        thread t1(func1, 100, (char*)"t1");
        thread t2(func1, 200, (char*)"t2");
        thread t3(func1, 300, (char*)"t3");
        thread t4(func1, 400, (char*)"t4");
        t1.join();
        t2.join();
        t3.join();
        t4.join();
        return 0;
}

4个线程，t1的任务最轻，t4的任务最重。
运行test，perf record采样，再生成火焰图：

roo:/# perf record -g ./test
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.316 MB perf.data (2591 samples) ]
root:/# Failed to connect to ubus
root:/# perf script > perf.unfold
root:/# FlameGraph/stackcollapse-perf.pl perf.unfold &> perf.folded
root:/# FlameGraph/flamegraph.pl perf.folded > perf.svg

打开perf.svg，得到：
在这里插入图片描述
就是将perf.data中数据可视化了，可以清楚的看到cpu的分布情况。

record时，也可以指定其他event进行采样，可以修改采样频率，可以指定进程等等，根据实际情况灵活调整。

2. 内存火焰图

内存火焰图可以用于分析内存的变化情况，可以进一步帮助分析内存被哪些模块占用、内存泄露问题的原因等问题。

参考：
https://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html

内存的数据统计由几种方法：

通过uprobe统计malloc、calloc、free等函数，优点是可以分析出malloc的内存是否被释放掉了，便于分析内存泄露问题，缺点是开销可能会比较大。
统计内核的brk()和mmap()，开销很小，但是只能统计分配，看不到是否被释放了。
统计page fault，可以得到物理内存的情况，上面的都是虚拟内存

我本地测试memleak.py(uprobe方法)的时候有有一些报错，环境还是有点问题，就不详细看了，参考链接中有使用方法与结果。

3. bio火焰图

bio可以用于统计io的耗时由哪些函数占用，便于分析优化io性能。

参考链接：https://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html#IO

如果通过tracepoint去统计，只能统计到调用到的次数。
使用bcc+ebpf可以统计到每次io的耗时。bcc和ebpf我就不介绍了，展开讲内容会很多，以后有机会再研究介绍。

brendang regg写的python脚本是使用blk_account_io_start和blk_account_io_completion这两个函数上添加kprobe来进行数据获取的，分别是block io的开始与结束，这个函数名可能在不同版本的内核中会发生变化，因此可能需要对脚本做一些修改，例如我使用的版本上结束的函数名是blk_account_io_done。

使用方法参考链接，结果是这样的：
在这里插入图片描述

4. 其他

还有几个火焰图，时间原因就不介绍了，可以去看brendang regg的网站，简单说一下：

off-cpu相关：

off-cpu flame graphs —— 与on-cpu互补，on-cpu只能看到运行情况，但如果某个请求运行慢，可能是被阻塞导致，那么就需要分析阻塞点在代码的哪个位置，off-cpu就是画出每个阻塞点的阻塞时间，用于分析这个问题。
Wakeup flame graphs —— off-cpu的进一步，off-cpu画出了阻塞点，但不知道阻塞是被谁唤醒的，wakeup通过分析唤醒阻塞点的线程栈，就可以知道是在哪里进行的唤醒，从而分析唤醒慢的原因。
Chain graphs —— off-cpu和wakeup火焰图画出了阻塞点、唤醒点，但两者之间的关系并没有，也就是不知道唤醒点是唤醒哪个阻塞点，chain graph就是解决这个问题

其他

Hot/Cold Flame Graphs —— 就是讲on-cpu与off-cpu结合，在一张图上显示，这样可以清晰的看到on和off各自的比例
Differential Flame Graphs —— 对比两个数据，画出来的图上显示变化情况，也就是相对之前的数据，每个部分占用是变高还是变低

参考资料

PERF tutorial —— by P.J. Drongowski
Brendan Gregg’s Homepage
“What exactly does -F do?” —— https://www.spinics.net/lists/linux-perf-users/msg17665.html
“Linux perf: 为什么采样频率设置为99Hz而不是100Hz？”
https://blog.openresty.com.cn/cn/dynamic-tracing/#dtrace-%E4%B8%8E-systemtap

cocoti

关注

6
点赞
踩
45

收藏

觉得还不错? 一键收藏
1
评论
perf与火焰图使用介绍

一、介绍perf是linux上的性能分析工具，perf可以对event进行统计得到event的发生次数，或者对event进行采样，得到每次event发生时的相关数据(cpu、进程id、运行栈等)，利用这些数据来对程序性能进行分析。perf可以统计或采样的event有很多，如果我们要分析cpu，那么我们可以使用cpu-cycles、cpu-clock来衡量占用cpu的程序的分布情况，还可以通过cache-misses、page-faults、branch-misses等event来分析造成cpu占用高的底
复制链接

扫一扫

专栏目录