学会使用perf性能分析工具--这一篇就够了

漫游嵌入式

已于 2022-09-04 15:02:30 修改

阅读量1.7w

点赞数 36

文章标签： linux 运维服务器

于 2022-02-28 14:26:41 首次发布

本文链接：https://blog.csdn.net/qq_38131812/article/details/123048333

版权

Linux性能分析与优化实战教程专栏收录该内容

4 篇文章

订阅专栏

在功能上，perf很强大，可以对众多的软硬件事件采样，还能采集出跟踪点（trace points）的信息（比如系统调用、TCP/IP事件和文件系统操作。perf的代码和Linux内核代码放在一起，是内核级的工具。perf是在Linux上做剖析分析的首选工具。

perf命令介绍

perf 工具提供了一组丰富的命令来收集和分析性能和跟踪数据。perf支持的命令如下：

 usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]

 The most commonly used perf commands are:
   annotate        Read perf.data (created by perf record) and display annotated code
   archive         Create archive with object files with build-ids found in perf.data file
   bench           General framework for benchmark suites
   buildid-cache   Manage build-id cache.
   buildid-list    List the buildids in a perf.data file
   c2c             Shared Data C2C/HITM Analyzer.
   config          Get and set variables in a configuration file.
   data            Data file related processing
   diff            Read perf.data files and display the differential profile
   evlist          List the event names in a perf.data file
   ftrace          simple wrapper for kernel's ftrace functionality
   inject          Filter to augment the events stream with additional information
   kallsyms        Searches running kernel for symbols
   kmem            Tool to trace/measure kernel memory properties
   kvm             Tool to trace/measure kvm guest os
   list            List all symbolic event types
   lock            Analyze lock events
   mem             Profile memory accesses
   record          Run a command and record its profile into perf.data
   report          Read perf.data (created by perf record) and display the profile
   sched           Tool to trace/measure scheduler properties (latencies)
   script          Read perf.data (created by perf record) and display trace output
   stat            Run a command and gather performance counter statistics
   test            Runs sanity tests.
   timechart       Tool to visualize total system behavior during a workload
   top             System profiling tool.
   version         display the version of perf binary
   probe           Define new dynamic tracepoints
   trace           strace inspired tool

annotate：读取 perf.data（由 perf record记录）并显示带注释的代码，需要在编译应用程序时加入-g选项

archive：用perf.data文件中找到的build-ids的对象文件创建档案。

bench：对系统调度、内存访问、epoll、Futex等进行压力测试。

buildid-cache：管理build-id缓存

buildid-list：列出perf.data文件中的buildids。

c2c：共享数据C2C/HITM分析仪。

config：读取或设置配置文件中的变量

data：数据文件相关处理

diff: 读取perf.data文件并显示差分曲线

ftrace：内核的ftrace功能的简单封装器

inject：用额外的信息来增加事件流的过滤器

kallsyms：搜索运行中的内核中的符号

kmem：追踪/测量内核内存属性的工具

kvm：追踪/测量kvm客户操作系统的工具

list：列出所有象征性的事件类型

lock：分析锁事件

mem：分析内存访问

record：将所有的分析记录进perf.data

report：读取perf.data（由perf记录创建）并显示概况

sched：跟踪/测量调度器属性（延迟）的工具

script：读取perf.data（由perf记录创建）并显示跟踪输出

stat：运行一个命令并收集性能计数器的统计数据

test：测试系统内核支持的功能

timechart：在工作负载期间可视化整个系统行为的工具

top：系统分析工具

probe：定义新的动态跟踪点

trace：strace启发的工具

测试程序：

测试程序会一直循环打印a的值，打印一次睡眠一次。我们使用gcc test.c -g -o test将其编译成可执行文件。下面我们将结合此测试程序来使用perf工具进行分析。

#include <stdio.h>

void print(void)
{
        int i = 0;
        while(1){
        i++;
        }

}
int main ()
{

   print();

   return 0;
}

list

list命令会列举出perf支持监测的所有事件。

List of pre-defined events (to be used in -e):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]

  alignment-faults                                   [Software event]
  bpf-output                                         [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy                                              [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  minor-faults                                       [Software event]
  page-faults OR faults                              [Software event]
  task-clock                                         [Software event]

  duration_time                                      [Tool event]

  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-loads                                    [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  L1-icache-loads                                    [Hardware cache event]
  branch-load-misses                                 [Hardware cache event]
  branch-loads                                       [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  iTLB-load-misses                                   [Hardware cache event]

  br_immed_retired OR armv8_pmuv3/br_immed_retired/  [Kernel PMU event]
  br_mis_pred OR armv8_pmuv3/br_mis_pred/            [Kernel PMU event]
  br_pred OR armv8_pmuv3/br_pred/                    [Kernel PMU event]
  bus_access OR armv8_pmuv3/bus_access/              [Kernel PMU event]
  bus_cycles OR armv8_pmuv3/bus_cycles/              [Kernel PMU event]
  cid_write_retired OR armv8_pmuv3/cid_write_retired/ [Kernel PMU event]
  cpu_cycles OR armv8_pmuv3/cpu_cycles/              [Kernel PMU event]
  exc_return OR armv8_pmuv3/exc_return/              [Kernel PMU event]
  exc_taken OR armv8_pmuv3/exc_taken/                [Kernel PMU event]
  inst_retired OR armv8_pmuv3/inst_retired/          [Kernel PMU event]
  l1d_cache OR armv8_pmuv3/l1d_cache/                [Kernel PMU event]
  l1d_cache_refill OR armv8_pmuv3/l1d_cache_refill/  [Kernel PMU event]
  l1d_cache_wb OR armv8_pmuv3/l1d_cache_wb/          [Kernel PMU event]
  l1d_tlb_refill OR armv8_pmuv3/l1d_tlb_refill/      [Kernel PMU event]
  l1i_cache OR armv8_pmuv3/l1i_cache/                [Kernel PMU event]
  l1i_cache_refill OR armv8_pmuv3/l1i_cache_refill/  [Kernel PMU event]
  l1i_tlb_refill OR armv8_pmuv3/l1i_tlb_refill/      [Kernel PMU event]
  l2d_cache OR armv8_pmuv3/l2d_cache/                [Kernel PMU event]
  l2d_cache_refill OR armv8_pmuv3/l2d_cache_refill/  [Kernel PMU event]
  l2d_cache_wb OR armv8_pmuv3/l2d_cache_wb/          [Kernel PMU event]
  ld_retired OR armv8_pmuv3/ld_retired/              [Kernel PMU event]
  mem_access OR armv8_pmuv3/mem_access/              [Kernel PMU event]
  memory_error OR armv8_pmuv3/memory_error/          [Kernel PMU event]
  pc_write_retired OR armv8_pmuv3/pc_write_retired/  [Kernel PMU event]
  st_retired OR armv8_pmuv3/st_retired/              [Kernel PMU event]
  sw_incr OR armv8_pmuv3/sw_incr/                    [Kernel PMU event]
  unaligned_ldst_retired OR armv8_pmuv3/unaligned_ldst_retired/ [Kernel PMU event]
  cs_etm//                                           [Kernel PMU event]
  imx8_ddr0/activate/                                [Kernel PMU event]
  imx8_ddr0/axid-read/                               [Kernel PMU event]
  imx8_ddr0/axid-write/                              [Kernel PMU event]
  imx8_ddr0/cycles/                                  [Kernel PMU event]
  imx8_ddr0/hp-read-credit-cnt/                      [Kernel PMU event]
  imx8_ddr0/hp-read/                                 [Kernel PMU event]
  imx8_ddr0/hp-req-nocredit/                         [Kernel PMU event]
  imx8_ddr0/hp-xact-credit/                          [Kernel PMU event]
  imx8_ddr0/load-mode/                               [Kernel PMU event]
  imx8_ddr0/lp-read-credit-cnt/                      [Kernel PMU event]
  imx8_ddr0/lp-req-nocredit/                         [Kernel PMU event]
  imx8_ddr0/lp-xact-credit/                          [Kernel PMU event]
  imx8_ddr0/perf-mwr/                                [Kernel PMU event]
  imx8_ddr0/precharge/                               [Kernel PMU event]
  imx8_ddr0/raw-hazard/                              [Kernel PMU event]
  imx8_ddr0/read-accesses/                           [Kernel PMU event]
  imx8_ddr0/read-activate/                           [Kernel PMU event]
  imx8_ddr0/read-command/                            [Kernel PMU event]
  imx8_ddr0/read-cycles/                             [Kernel PMU event]
  imx8_ddr0/read-modify-write-command/               [Kernel PMU event]
  imx8_ddr0/read-queue-depth/                        [Kernel PMU event]
  imx8_ddr0/read-write-transition/                   [Kernel PMU event]
  imx8_ddr0/read/                                    [Kernel PMU event]
  imx8_ddr0/refresh/                                 [Kernel PMU event]
  imx8_ddr0/selfresh/                                [Kernel PMU event]
  imx8_ddr0/wr-xact-credit/                          [Kernel PMU event]
  imx8_ddr0/write-accesses/                          [Kernel PMU event]
  imx8_ddr0/write-command/                           [Kernel PMU event]
  imx8_ddr0/write-credit-cnt/                        [Kernel PMU event]
  imx8_ddr0/write-cycles/                            [Kernel PMU event]
  imx8_ddr0/write-queue-depth/                       [Kernel PMU event]
  imx8_ddr0/write/                                   [Kernel PMU event]

branch:
  br_cond
       [Conditional branch executed]
  br_cond_mispred
       [Conditional branch mispredicted]
  br_indirect_mispred
       [Indirect branch mispredicted]
  br_indirect_mispred_addr
       [Indirect branch mispredicted because of address miscompare]
  br_indirect_spec
       [Branch speculatively executed, indirect branch]

bus:
  bus_access_rd
       [Bus access read]
  bus_access_wr
       [Bus access write]

cache:
  ext_snoop
       [SCU Snooped data from another CPU for this CPU]
  prefetch_linefill
       [Linefill because of prefetch]
  prefetch_linefill_drop
       [Instruction Cache Throttle occurred]
  read_alloc
       [Read allocate mode]
  read_alloc_enter
       [Entering read allocate mode]

memory:
  ext_mem_req
       [External memory request]
  ext_mem_req_nc
       [Non-cacheable external memory request]

other:
  exc_fiq
       [Exception taken, FIQ]
  exc_irq
       [Exception taken, IRQ]
  l1d_cache_err
       [L1 Data Cache (data, tag or dirty) memory error, correctable or non-correctable]
  l1i_cache_err
       [L1 Instruction Cache (data or tag) memory error]
  pre_decode_err
       [Pre-decode error]
  tlb_err
       [TLB memory error]

pipeline:
  agu_dep_stall
       [Cycles there is an interlock for a load/store instruction waiting for data to calculate the address in the
        AGU]
  decode_dep_stall
       [Cycles the DPU IQ is empty and there is a pre-decode error being processed]
  ic_dep_stall
       [Cycles the DPU IQ is empty and there is an instruction cache miss being processed]
  iutlb_dep_stall
       [Cycles the DPU IQ is empty and there is an instruction micro-TLB miss being processed]
  ld_dep_stall
       [Cycles there is a stall in the Wr stage because of a load miss]
  other_interlock_stall
       [Cycles there is an interlock other than Advanced SIMD/Floating-point instructions or load/store instruction]
  other_iq_dep_stall
       [Cycles that the DPU IQ is empty and that is not because of a recent micro-TLB miss, instruction cache miss or
        pre-decode error]
  simd_dep_stall
       [Cycles there is an interlock for an Advanced SIMD/Floating-point operation]
  st_dep_stall
       [Cycles there is a stall in the Wr stage because of a store]
  stall_sb_full
       [Data Write operation that stalls the pipeline because the store buffer is full]

  rNNN                                               [Raw hardware event descriptor]
  cpu/t1=v1[,t2=v2,t3 ...]/modifier                  [Raw hardware event descriptor]
   (see 'man perf-list' on how to encode it)

  mem:<addr>[/len][:access]                          [Hardware breakpoint]


Metric Groups:

No_group:
  imx8mp_bandwidth_usage.lpddr4
       [bandwidth usage for lpddr4 evk board. Unit: imx8_ddr ]
  imx8mp_ddr_read.2d
       [bytes of gpu 2d read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.3d
       [bytes of gpu 3d read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.a53
       [bytes of a53 core read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.all
       [bytes of all masters read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.audio_dsp
       [bytes of audio dsp read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.audio_sdma2_burst
       [bytes of audio sdma2_burst read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.audio_sdma2_per
       [bytes of audio sdma2_per read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.audio_sdma3_burst
       [bytes of audio sdma3_burst read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.audio_sdma3_per
       [bytes of audio sdma3_per read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.audio_sdma_pif
       [bytes of audio sdma_pif read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.dewarp
       [bytes of display dewarp read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.hdmi_hdcp
       [bytes of hdmi_tx tx_hdcp read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.hdmi_hrv_mwr
       [bytes of hdmi_tx hrv_mwr read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.hdmi_lcdif
       [bytes of hdmi_tx lcdif read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.isi1
       [bytes of display isi1 read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.isi2
       [bytes of display isi2 read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.isi3
       [bytes of display isi3 read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.isp1
       [bytes of display isp1 read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.isp2
       [bytes of display isp2 read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.lcdif1
       [bytes of display lcdif1 read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.lcdif2
       [bytes of display lcdif2 read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.npu
       [bytes of npu read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.pci
       [bytes of hsio pci read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.supermix
       [bytes of supermix(m7) core read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.usb1
       [bytes of hsio usb1 read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.usb2
       [bytes of hsio usb2 read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.vpu1
       [bytes of vpu1 read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.vpu2
       [bytes of vpu2 read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_read.vpu3
       [bytes of vpu3 read from ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.2d
       [bytes of gpu 2d write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.3d
       [bytes of gpu 3d write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.a53
       [bytes of a53 core write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.all
       [bytes of all masters write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.audio_dsp
       [bytes of audio dsp write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.audio_sdma2_burst
       [bytes of audio sdma2_burst write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.audio_sdma2_per
       [bytes of audio sdma2_per write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.audio_sdma3_burst
       [bytes of audio sdma3_burst write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.audio_sdma3_per
       [bytes of audio sdma3_per write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.audio_sdma_pif
       [bytes of audio sdma_pif write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.dewarp
       [bytes of display dewarp write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.hdmi_hdcp
       [bytes of hdmi_tx tx_hdcp write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.hdmi_hrv_mwr
       [bytes of hdmi_tx hrv_mwr write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.hdmi_lcdif
       [bytes of hdmi_tx lcdif write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.isi1
       [bytes of display isi1 write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.isi2
       [bytes of display isi2 write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.isi3
       [bytes of display isi3 write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.isp2
       [bytes of display isp2 write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.lcdif1
       [bytes of display lcdif1 write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.lcdif2
       [bytes of display lcdif2 write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.npu
       [bytes of npu write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.pci
       [bytes of hsio pci write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.supermix
       [bytes of supermix(m7) write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.usb1
       [bytes of hsio usb1 write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.usb2
       [bytes of hsio usb2 write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.vpu1
       [bytes of vpu1 write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.vpu2
       [bytes of vpu2 write to ddr. Unit: imx8_ddr ]
  imx8mp_ddr_write.vpu3
       [bytes of vpu3 write to ddr. Unit: imx8_ddr ]
imx8_ddr_DDR_MON:
  imx8mp_ddr_write.isp1
       [bytes of display isp1 write to ddr. Unit: imx8_ddr ]

stat

我们可以使用stat来采集程序的运行时间和CPU开销，perf stat所支持的主要参数如下：

-a, --all-cpus system-wide collection from all CPUs
-A, --no-aggr disable CPU count aggregation
-B, --big-num print large numbers with thousands' separators
-C, --cpu <cpu> list of cpus to monitor in system-wide
-D, --delay <n> ms to wait before starting measurement after program start (-1: start with events disabled)
-d, --detailed detailed run - start a lot of events
-e, --event <event> event selector. use 'perf list' to list available events
-G, --cgroup <name> monitor event in cgroup name only
-g, --group put the counters into a counter group
-I, --interval-print <n>
print counts at regular interval in ms (overhead is possible for values <= 100ms)
-i, --no-inherit child tasks do not inherit counters
-M, --metrics <metric/metric group list>
monitor specified metrics or metric groups (separated by ,)
-n, --null null run - dont start any counters
-o, --output <file> output file name
-p, --pid <pid> stat events on existing process id
-r, --repeat <n> repeat command and print average + stddev (max: 100, forever: 0)
-S, --sync call sync() before starting a run
-t, --tid <tid> stat events on existing thread id
-T, --transaction hardware transaction statistics
-v, --verbose be more verbose (show counter open errors, etc)

先运行测试程序，然后使用top命令查看应用程序对应的pid，例如这里的pid是997。

那么我们就来采集这个应用程序的运行信息：

perf stat -p 997

由于测试程序是死循环，所以最后打印的时间是执行stat命令的总时间。输出中还显示了task-clock (msec)是22624.22毫秒，也就是22s。

 Performance counter stats for process id '997':

          22624.22 msec task-clock                #    0.188 CPUs utilized
              1225      context-switches          #    0.054 K/sec
                 1      cpu-migrations            #    0.000 K/sec
                 0      page-faults               #    0.000 K/sec
       39516466339      cycles                    #    1.747 GHz
       23012315521      instructions              #    0.58  insn per cycle
        3381064757      branches                  #  149.444 M/sec
         256850857      branch-misses             #    7.60% of all branches

     120.484878500 seconds time elapsed

record

剖析采样可以帮助我们采集到程序运行的特征，而且剖析精度非常高，可以定位到具体的代码行和指令块。

-a, --all-cpus system-wide collection from all CPUs
-b, --branch-any sample any taken branches
-B, --no-buildid do not collect buildids in perf.data
-c, --count <n> event period to sample
-C, --cpu <cpu> list of cpus to monitor
-d, --data Record the sample addresses
-D, --delay <n> ms to wait before starting measurement after program start (-1: start with events disabled)
-e, --event <event> event selector. use 'perf list' to list available events
-F, --freq <freq or 'max'>
profile at this frequency
-g enables call-graph recording
-G, --cgroup <name> monitor event in cgroup name only
-I, --intr-regs[=<any register>]
sample selected machine registers on interrupt, use '-I?' to list register names
-i, --no-inherit child tasks do not inherit counters
-j, --branch-filter <branch filter mask>
branch stack filter modes
-k, --clockid <clockid>
clockid to use for events, see clock_gettime()
-m, --mmap-pages <pages[,pages]>
number of mmap data pages and AUX area tracing mmap pages
-N, --no-buildid-cache
do not update the buildid cache
-n, --no-samples don't sample
-o, --output <file> output file name
-P, --period Record the sample period
-p, --pid <pid> record events on existing process id
-q, --quiet don't print any message
-R, --raw-samples collect raw sample records from all opened counters
-r, --realtime <n> collect data with this RT SCHED_FIFO priority
-S, --snapshot[=<opts>]
AUX area tracing Snapshot Mode
-s, --stat per thread counts
-t, --tid <tid> record events on existing thread id
-T, --timestamp Record the sample timestamps
-u, --uid <user> user to profile
-v, --verbose be more verbose (show counter open errors, etc)

我通过“-F 999”选项，我把采样频率设置为999Hz，每秒采样999次。

测试命令：

perf record -F 999 -p 997

然后perf会将记录的数据存储在perf.data中。

report


 Usage: perf report [<options>]

    -b, --branch-stack    use branch records for per branch histogram filling
    -c, --comms <comm[,comm...]>
                          only consider symbols in these comms
    -C, --cpu <cpu>       list of cpus to profile
    -d, --dsos <dso[,dso...]>
                          only consider symbols in these dsos
    -D, --dump-raw-trace  dump raw trace in ASCII
    -F, --fields <key[,keys...]>
                          output field(s): overhead period sample  overhead overhead_sys
                          overhead_us overhead_guest_sys overhead_guest_us overhead_children
                          sample period pid comm dso symbol parent cpu socket
                          srcline srcfile local_weight weight transaction trace
                          symbol_size dso_size cgroup cgroup_id ipc_null time
                          dso_from dso_to symbol_from symbol_to mispredict abort
                          in_tx cycles srcline_from srcline_to ipc_lbr symbol_daddr
                          dso_daddr locked tlb mem snoop dcacheline symbol_iaddr
                          phys_daddr
    -f, --force           don't complain, do it
    -g, --call-graph <print_type,threshold[,print_limit],order,sort_key[,branch],value>
                          Display call graph (stack chain/backtrace):

                                print_type:     call graph printing style (graph|flat|fractal|folded|none)
                                threshold:      minimum call graph inclusion threshold (<percent>)
                                print_limit:    maximum number of call graph entry (<number>)
                                order:          call graph order (caller|callee)
                                sort_key:       call graph sort key (function|address)
                                branch:         include last branch info to call graph (branch)
                                value:          call graph value (percent|period|count)

                                Default: graph,0.5,caller,function,percent
    -G, --inverted        alias for inverted call graph
    -i, --input <file>    input file name
    -I, --show-info       Display extended information about perf.data file
    -k, --vmlinux <file>  vmlinux pathname
    -M, --disassembler-style <disassembler style>
                          Specify disassembler style (e.g. -M intel for intel syntax)
    -m, --modules         load module symbols - WARNING: use only with -k and LIVE kernel
    -n, --show-nr-samples
                          Show a column with the number of samples
    -p, --parent <regex>  regex filter to identify parent, see: '--sort parent'
    -q, --quiet           Do not show any message
    -s, --sort <key[,key2...]>
                          sort by key(s): overhead overhead_sys overhead_us overhead_guest_sys
                          overhead_guest_us overhead_children sample period
                          pid comm dso symbol parent cpu socket srcline srcfile
                          local_weight weight transaction trace symbol_size
                          dso_size cgroup cgroup_id ipc_null time dso_from dso_to
                          symbol_from symbol_to mispredict abort in_tx cycles
                          srcline_from srcline_to ipc_lbr symbol_daddr dso_daddr
                          locked tlb mem snoop dcacheline symbol_iaddr phys_daddr
    -S, --symbols <symbol[,symbol...]>
                          only consider these symbols
    -t, --field-separator <separator>
                          separator for columns, no spaces will be added between columns '.' is reserved.
    -T, --threads         Show per-thread event counters
    -U, --hide-unresolved
                          Only display entries resolved to a symbol
    -v, --verbose         be more verbose (show symbol address, etc)
    -w, --column-widths <width[,width...]>
                          don't try to adjust column width, use these fixed values
    -x, --exclude-other   Only display entries with parent-match
        --asm-raw         Display raw encoding of assembly instructions (default)
        --branch-history  add last branch records to call history
        --children        Accumulate callchains of children and show total overhead as well. Enabled by default, use --no-children to disable.
        --demangle        Disable symbol demangling
        --demangle-kernel
                          Enable kernel symbol demangling
        --full-source-path
                          Show full source file name path for source lines
        --group           Show event group information together
        --group-sort-idx <n>
                          Sort the output by the event at the index n in group. If n is invalid, sort by the first event. WARNING: should be used on grouped events.
        --gtk             Use the GTK2 interface
        --header          Show data header.
        --header-only     Show only data header.
        --hierarchy       Show entries in a hierarchy
        --ignore-callees <regex>
                          ignore callees of these functions in call graphs
        --ignore-vmlinux  don't load vmlinux even if found
        --inline          Show inline function
        --itrace[=<opts>]
                          Instruction Tracing options
                                i[period]:              synthesize instructions events
                                b:                      synthesize branches events (branch misses for Arm SPE)
                                c:                      synthesize branches events (calls only)
                                r:                      synthesize branches events (returns only)
                                x:                      synthesize transactions events
                                w:                      synthesize ptwrite events
                                p:                      synthesize power events
                                o:                      synthesize other events recorded due to the use
                                                        of aux-output (refer to perf record)
                                e[flags]:               synthesize error events
                                                        each flag must be preceded by + or -
                                                        error flags are: o (overflow)
                                                                         l (data lost)
                                d[flags]:               create a debug log
                                                        each flag must be preceded by + or -
                                                        log flags are: a (all perf events)
                                f:                      synthesize first level cache events
                                m:                      synthesize last level cache events
                                t:                      synthesize TLB events
                                a:                      synthesize remote access events
                                g[len]:                 synthesize a call chain (use with i or x)
                                G[len]:                 synthesize a call chain on existing event records
                                l[len]:                 synthesize last branch entries (use with i or x)
                                L[len]:                 synthesize last branch entries on existing event records
                                sNUMBER:                skip initial number of events
                                q:                      quicker (less detailed) decoding
                                PERIOD[ns|us|ms|i|t]:   specify period to sample stream
                                concatenate multiple options. Default is ibxwpe or cewp

        --kallsyms <file>
                          kallsyms pathname
        --max-stack <n>   Set the maximum stack depth when parsing the callchain, anything beyond the specified depth will be ignored. Default: kernel.perf_event_max_stack or 127
        --mem-mode        mem access profile
        --mmaps           Display recorded tasks memory maps
        --ns              Show times in nanosecs
        --objdump <path>  objdump binary to use for disassembly and annotations
        --percent-limit <percent>
                          Don't show entries under that percent
        --percent-type <local-period>
                          Set percent type local/global-period/hits
        --percentage <relative|absolute>
                          how to display percentage of filtered entries
        --pid <pid[,pid...]>
                          only consider symbols in these pids
        --prefix <prefix>
                          Add prefix to source file path names in programs (with --prefix-strip)
        --prefix-strip <N>
                          Strip first N entries of source file path name in programs (with --prefix)
        --pretty <key>    pretty printing style key: normal raw
        --raw-trace       Show raw trace event output (do not use print fmt or plugins)
        --samples <n>     Number of samples to save per histogram entry for individual browsing
        --show-cpu-utilization
                          Show sample percentage for different cpu modes
        --show-on-off-events
                          Show the on/off switch events, used with --switch-on and --switch-off
        --show-ref-call-graph
                          Show callgraph from reference event
        --show-total-period
                          Show a column with the sum of periods
        --socket-filter <n>
                          only show processor socket that match with this filter
        --source          Interleave source code with assembly code (default)
        --stats           Display event stats
        --stdio           Use the stdio interface
        --stdio-color <mode>
                          'always' (default), 'never' or 'auto' only applicable to --stdio mode
        --stitch-lbr      Enable LBR callgraph stitching approach
        --switch-off <event>
                          Stop considering events after the ocurrence of this event
        --switch-on <event>
                          Consider events after the ocurrence of this event
        --symbol-filter <filter>
                          only show symbols that (partially) match with this filter
        --symfs <directory>
                          Look for files with symbols relative to this directory
        --tasks           Display recorded tasks
        --tid <tid[,tid...]>
                          only consider symbols in these tids
        --time <str>      Time span of interest (start,stop)
        --time-quantum <time (ms|us|ns|s)>
                          Set time quantum for time sort key (default 100ms)
        --total-cycles    Sort all blocks by 'Sampled Cycles%'
        --tui             Use the TUI interface

采集完数据，我们就可以通过perf report命令寻找采样中的性能瓶颈了。

perf report

Samples: 21K of event 'cycles', Event count (approx.): 38100133435
Overhead  Command  Shared Object      Symbol
  99.99%  test     test               [.] print                                                                                                                                                     •
   0.00%  test     [kernel.kallsyms]  [k] update_sd_lb_stats.constprop.0                                                                                                                            ▒
   0.00%  test     [kernel.kallsyms]  [k] _raw_spin_unlock_irq                                                                                                                                      ▒
   0.00%  test     [kernel.kallsyms]  [k] shift_arg_pages                                                                                                                                           ▒
   0.00%  perf     [kernel.kallsyms]  [k] perf_event_exec

Overhead：指出了该Symbol采样在总采样中所占的百分比。在当前场景下，表示了该Symbol消耗的CPU时间占总CPU时间的百分比
Command：进程名
Shared Object：模块名，比如具体哪个共享库，哪个可执行程序。
Symbol：二进制模块中的符号名，如果是高级语言，比如C语言编写的程序，等价于函数名。

只定位到函数还不够好，perf工具还能帮我们定位到更细的粒度，这样我们就不用去猜函数中哪一段代码出了问题。如果我们通过键盘上下键把光标移动到print函数上，然后敲击Enter键，perf给出了一些选项。通过这些选项，我们可以进一步分析这个函数。

我们选中第一个选项“Annotate wasteTime”，我们敲击Enter键就可以对函数做进一步分析了。

Annotate print                   --- 分析print函数中指令或者代码的性能
Zoom into test thread             --- 聚焦到线程 test
Zoom into test DSO              --- 聚焦到动态共享对象test
Browse map details                --- 查看map
Run scripts for samples of thread [test]--- 针对test线程的采样运行脚本
Run scripts for samples of symbol [test] --- 针对函数的采样运行脚本   
Run scripts for all samples       --- 针对所有采样运行脚步
Switch to another data file in PWD --- 切换到当前目录中另一个数据文件
Exit

annotate


 Usage: perf annotate [<options>]

    -C, --cpu <cpu>       list of cpus to profile
    -d, --dsos <dso[,dso...]>
                          only consider symbols in these dsos
    -D, --dump-raw-trace  dump raw trace in ASCII
    -f, --force           don't complain, do it
    -i, --input <file>    input file name
    -k, --vmlinux <file>  vmlinux pathname
    -l, --print-line      print matching source lines (may be slow)
    -M, --disassembler-style <disassembler style>
                          Specify disassembler style (e.g. -M intel for intel syntax)
    -m, --modules         load module symbols - WARNING: use only with -k and LIVE kernel
    -n, --show-nr-samples
                          Show a column with the number of samples
    -P, --full-paths      Don't shorten the displayed pathnames
    -q, --quiet           do now show any message
    -s, --symbol <symbol>
                          symbol to annotate

我们可以使用annotate来单独分析print函数的信息，效果和report中进入annotate一样。

perf annotate -l -s print

top

 Usage: perf top [<options>]

    -a, --all-cpus        system-wide collection from all CPUs
    -b, --branch-any      sample any taken branches
    -c, --count <n>       event period to sample
    -C, --cpu <cpu>       list of cpus to monitor
    -d, --delay <n>       number of seconds to delay between refreshes
    -D, --dump-symtab     dump the symbol table used for profiling
    -E, --entries <n>     display this many functions
    -e, --event <event>   event selector. use 'perf list' to list available events
    -f, --count-filter <n>
                          only display functions with more events than this
    -F, --freq <freq or 'max'>
                          profile at this frequency
    -g                    enables call-graph recording and display
    -i, --no-inherit      child tasks do not inherit counters
    -j, --branch-filter <branch filter mask>
                          branch stack filter modes
    -K, --hide_kernel_symbols
                          hide kernel symbols
    -k, --vmlinux <file>  vmlinux pathname
    -M, --disassembler-style <disassembler style>
                          Specify disassembler style (e.g. -M intel for intel syntax)
    -m, --mmap-pages <pages>
                          number of mmap data pages
    -n, --show-nr-samples
                          Show a column with the number of samples
    -p, --pid <pid>       profile events on existing process id
    -r, --realtime <n>    collect data with this RT SCHED_FIFO priority
    -s, --sort <key[,key2...]>
                          sort by key(s): pid, comm, dso, symbol, parent, cpu, srcline, ... Please refer the man page for the complete list.
    -t, --tid <tid>       profile events on existing thread id
    -U, --hide_user_symbols
                          hide user symbols
    -u, --uid <user>      user to profile
    -v, --verbose         be more verbose (show counter open errors, etc)
    -w, --column-widths <width[,width...]>

perf top命令和linux下的top命令有点相似，实时打印出系统中被采样事件的状态和统计数据。perf top主要用于实时剖析各个函数在某个性能事件(event)上的热度，默认的event是cycles(cpu周期数)，这样可以检测系统中所有应用层和内核层函数的热度。

perf top支持两种输出界面，tui和tty，默认是tui，因为tui需要更多的环境和库支持，所以经常出现乱码问题，所以本文都是基于tty界面分析(–stdio)。

直接执行perf top监控的是整个系统中所有进程的状态，多数情况我们只关心某个进程，或者想定位某个线程的性能问题，perf top都是支持的(-p / -t)。

需要进入函数内部一探究竟，有时对于像上面的DH_SSM_BLKBUF_ALLOC这样的函数的调用堆栈，以定位到是哪里在频繁调用。这时候可以执行：

 perf top -t 4010 --stdio -g -K

上面的-g参数就是现实函数的调用堆栈，-k是为了只输出应用层函数

bench

bench可以来对系统性能进行评测，支持调度、系统调用、内存、epoll等各项功能测试。

Usage:
        perf bench [<common options>] <collection> <benchmark> [<options>]

        # List of all available benchmark collections:

         sched: Scheduler and IPC benchmarks
       syscall: System call benchmarks
           mem: Memory access benchmarks
         futex: Futex stressing benchmarks
         epoll: Epoll stressing benchmarks
     internals: Perf-internals benchmarks
           all: All benchmarks

如果我们使用perf bench all，会测试所有支持的测试项目。

# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run

     Total time: 0.900 [sec]

# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two processes

     Total time: 15.180 [sec]

      15.180503 usecs/op
          65873 ops/sec

# Running syscall/basic benchmark...
# Executed 10000000 getppid() calls
     Total time: 3.972 [sec]

       0.397209 usecs/op
        2517568 ops/sec

# Running mem/memcpy benchmark...
# function 'default' (Default memcpy() provided by glibc)
# Copying 1MB bytes ...

       1.698370 GB/sec

# Running mem/memset benchmark...
# function 'default' (Default memset() provided by glibc)
# Copying 1MB bytes ...

      12.207031 GB/sec

# Running mem/find_bit benchmark...
100000 operations 1 bits set of 1 bits
  Average for_each_set_bit took: 4638.600 usec (+- 13.761 usec)
  Average test_bit loop took:    1894.200 usec (+- 2.672 usec)