Using the perf tool

jerry_ms

Article link: https://blog.csdn.net/u014089131/article/details/76410327

perf usage:

1. Building perf:
In an Android source tree, run make perf directly.

2. perf supports the following commands:
root:# perf

usage: perf [--version] [--help] COMMAND [ARGS]

The most commonly used perf commands are:
annotate Read perf.data (created by perf record) and display annotated code
archive Create archive with object files with build-ids found in perf.data file
bench General framework for benchmark suites
buildid-cache Manage build-id cache.
buildid-list List the buildids in a perf.data file
diff Read perf.data files and display the differential profile
evlist List the event names in a perf.data file
inject Filter to augment the events stream with additional information
kmem Tool to trace/measure kernel memory(slab) properties
kvm Tool to trace/measure kvm guest os
list List all symbolic event types
lock Analyze lock events
mem Profile memory accesses
record Run a command and record its profile into perf.data
report Read perf.data (created by perf record) and display the profile
sched Tool to trace/measure scheduler properties (latencies)
script Read perf.data (created by perf record) and display trace output
stat Run a command and gather performance counter statistics
test Runs sanity tests.
timechart Tool to visualize total system behavior during a workload
top System profiling tool.
trace strace inspired tool
probe Define new dynamic tracepoints

See 'perf help COMMAND' for more information on a specific command.

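3. perf list shows the events perf can sample: hardware (HW) events, software (SW) events, and tracepoint events.

-e selects one of the events listed by perf list. A modifier appended after a colon restricts where the event is counted:
-e <event>:u // user space
-e <event>:k // kernel
-e <event>:h // hypervisor
-e <event>:G // guest counting (in KVM guests)
-e <event>:H // host counting (not in KVM guests)
For example:
Show the kernel functions that consume the most CPU cycles:
perf top -e cycles:k
Show the functions that allocate the most slab cache objects:
perf top -e kmem:kmem_cache_alloc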

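4. For a given event (CPU cycles by default), perf top shows the functions or instructions that consume the most of it.
perf top [-e <EVENT> | --event=EVENT] [<options>]
perf top -G [fractal]: path percentages are relative and add up to 100%; the call order reads from bottom to top.
perf top -G graph: path percentages are absolute and add up to the overhead of that function.
perf top -k /data/perf/vmlinux // use a vmlinux with a symbol table so kernel symbols can be fully resolved
perf top --help lists all options.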

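5. Common command-line options (as used by perf top):
-e <event>: the performance event to analyze.
-p <pid>: Profile events on existing Process ID (comma separated list). Only the target process and the threads it creates are analyzed.
-k <path>: Path to vmlinux. Required for annotation functionality. This is the kernel image that carries the symbol table.
-K: hide symbols that belong to the kernel or kernel modules.
-U: hide symbols that belong to user-space programs.
-d <n>: display refresh interval, 2 s by default; perf top reads the performance data from its mmap'd memory region once per interval.
-G: show the call graph of functions.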

6. perf stat gathers performance counter statistics for a given program.

perf stat [-e <EVENT> | --event=EVENT] [-a] -- <command> [<options>]
perf stat ls collects statistics for one run of ls. The output fields mean:

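task-clock: the processor time the task actually used, in ms. CPUs utilized = task-clock / time elapsed, i.e., the CPU utilization.
context-switches: the number of context switches.
CPU-migrations: the number of processor migrations. To keep the load balanced across processors, Linux may, under certain conditions, move a task from one CPU to another.
page-faults: the number of page faults. A page fault is raised when the requested page has not been set up yet, is not resident in memory, or is resident but its virtual-to-physical mapping has not been established yet. Page access-permission mismatches also raise page faults. A plain TLB miss, by contrast, is normally resolved by the hardware page-table walk (on x86 and ARM, for example) and does not by itself cause a page fault.
cycles: the number of processor cycles consumed. If all the cycles ls used are treated as coming from a single processor, its effective clock rate was 2.486 GHz in the author's run, computed as cycles / task-clock.
stalled-cycles-frontend: skipped here.
stalled-cycles-backend: skipped here.
instructions: the number of instructions executed. IPC (instructions per cycle) is the average number of instructions executed per CPU cycle, i.e., instructions / cycles.
branches: the number of branch instructions encountered. branch-misses is the number of mispredicted branches.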
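The derived figures above are plain ratios of the raw counters, so they are easy to recompute by hand from a perf stat report. A minimal sketch, assuming a POSIX shell with awk; the function names are hypothetical helpers, not part of perf. Pass task-clock and elapsed time in the same unit (perf prints elapsed time in seconds, so convert it to ms first).

```shell
# Hypothetical helpers (not part of perf): recompute perf stat's derived
# metrics from raw counter values copied out of a report.

# CPUs utilized = task-clock / elapsed time (both in ms)
cpus_utilized() {
    awk -v tc="$1" -v el="$2" 'BEGIN { printf "%.3f\n", tc / el }'
}

# Effective clock rate in GHz = cycles / task-clock.
# task-clock is in ms; ms * 1e6 = ns, and cycles per ns equals GHz.
effective_ghz() {
    awk -v cyc="$1" -v tc="$2" 'BEGIN { printf "%.3f\n", cyc / (tc * 1e6) }'
}

# IPC = instructions / cycles
ipc() {
    awk -v ins="$1" -v cyc="$2" 'BEGIN { printf "%.2f\n", ins / cyc }'
}

# Branch miss rate in percent = branch-misses / branches * 100
branch_miss_pct() {
    awk -v miss="$1" -v br="$2" 'BEGIN { printf "%.2f\n", miss / br * 100 }'
}
```

For example, effective_ghz 2486000 1.0 prints 2.486: 2,486,000 cycles spent in 1 ms of task-clock corresponds to a 2.486 GHz effective clock rate.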

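Examples:
Run the program 10 times and report the ratio of the standard deviation to the mean:
# perf stat -r 10 ls > /dev/null
Show more detailed information:
# perf stat -v ls > /dev/null
Show only the task's execution time, without performance counters:
# perf stat -n ls > /dev/null
Report counts separately for each CPU (with -a, counting is system-wide while ls runs):
# perf stat -a -A ls > /dev/null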
Count how many system calls ls makes (use -e 'syscalls:sys_enter_*' instead for a per-syscall breakdown):
# perf stat -e raw_syscalls:sys_enter ls
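To make the "standard deviation to mean" ratio of -r concrete, the sketch below computes it from a list of per-run measurements. It is a minimal illustration, assuming a POSIX shell with awk; rel_stddev_pct is a hypothetical helper, and perf's own ± figure may be normalized differently (for example, relative to the standard error of the mean), so treat this as an explanation of the ratio rather than a reproduction of perf's exact number.

```shell
# Hypothetical helper (not part of perf): print the sample standard
# deviation of the given values as a percentage of their mean.
rel_stddev_pct() {
    printf '%s\n' "$@" | awk '
        { x[++n] = $1; sum += $1 }
        END {
            if (n < 2) { print "0.00"; exit }   # no spread with fewer than 2 runs
            mean = sum / n
            for (i = 1; i <= n; i++) ss += (x[i] - mean) ^ 2
            # sample variance uses n - 1 in the denominator
            printf "%.2f\n", sqrt(ss / (n - 1)) / mean * 100
        }'
}
```

For three runs measuring 10, 12 and 14 ms, rel_stddev_pct 10 12 14 prints 16.67: the mean is 12 and the sample standard deviation is 2.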

7. perf record collects sampling data and records it to a file (perf.data by default).
Common options:
-e: Select the PMU event.
-a: System-wide collection from all CPUs.
-p: Record events on existing process ID (comma separated list).
-A: Append to the output file to do incremental profiling.
-f: Overwrite existing data file.
-o: Output file name.
-g: Do call-graph (stack chain/backtrace) recording.
-C: Collect samples only on the list of CPUs provided.

8. perf report reads the file produced by perf record and presents the analysis results.

9. perf lock analyzes kernel locks. It requires the kernel configuration options CONFIG_LOCKDEP and CONFIG_LOCK_STAT to be enabled.
For example:
# perf lock record ls // record
# perf lock report // report
Name                 acquired  contended  total wait (ns)  max wait (ns)  min wait (ns)

&mm->page_table_...       382          0                0              0              0
&mm->page_table_...        72          0                0              0              0
&fs->lock                  64          0                0              0              0
dcache_lock                62          0                0              0              0
vfsmount_lock              43          0                0              0              0
&newf->file_lock...        41          0                0              0              0

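Name: the name of the kernel lock.
acquired: the number of times the lock was acquired directly, without waiting, because no other kernel path held it.
contended: the number of times the lock was acquired only after waiting, because another kernel path held it.
total wait: the total time spent waiting to acquire the lock.
max wait: the longest single wait to acquire the lock.
min wait: the shortest single wait to acquire the lock.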
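Because perf lock only works when CONFIG_LOCKDEP and CONFIG_LOCK_STAT are enabled, it can help to check the kernel configuration before recording. A minimal sketch, assuming a POSIX shell; check_lock_config is a hypothetical helper, and the config file location is an assumption that varies by system (for example /proc/config.gz, which requires CONFIG_IKCONFIG_PROC, or /boot/config-$(uname -r) on many distributions), so pass whichever file your system provides.

```shell
# Hypothetical helper (not part of perf): report whether a kernel config
# file enables both options perf lock depends on.
# Returns 0 if both are =y, 1 otherwise. Accepts plain or gzip'd configs.
check_lock_config() {
    cfg=$1
    case $cfg in
        *.gz) reader="gzip -dc" ;;   # e.g. /proc/config.gz (assumed path)
        *)    reader="cat" ;;        # e.g. /boot/config-<version> (assumed path)
    esac
    missing=0
    for opt in CONFIG_LOCKDEP CONFIG_LOCK_STAT; do
        # anchor on "=y" so CONFIG_LOCKDEP_SUPPORT does not match
        if $reader "$cfg" | grep -q "^$opt=y$"; then
            echo "$opt=y"
        else
            echo "$opt missing"
            missing=1
        fi
    done
    return $missing
}
```

Usage: check_lock_config /proc/config.gz prints one line per option and exits non-zero if either option is not built in.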

10. perf kmem analyzes the slab allocator.
perf kmem {record | stat} [<options>]

# perf kmem record ls // record
# perf kmem stat --caller --alloc -l 20 // report

Callsite | Total_alloc/Per | Total_req/Per | Hit | Ping-pong | Frag
------------------------------------------------------------------------------------------------------
perf_event_mmap+ec | 311296/8192 | 155952/4104 | 38 | 0 | 49.902%
proc_reg_open+41 | 64/64 | 40/40 | 1 | 0 | 37.500%
__kmalloc_node+4d | 1024/1024 | 664/664 | 1 | 0 | 35.156%
ext3_readdir+5bd | 64/64 | 48/48 | 1 | 0 | 25.000%
load_elf_binary+8ec | 512/512 | 392/392 | 1 | 0 | 23.438%

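Callsite: the place in kernel code that calls kmalloc and kfree.
Total_alloc/Per: the total memory allocated, and the average size per allocation.
Total_req/Per: the total memory requested, and the average size per request.
Hit: the number of calls.
Ping-pong: the number of times kmalloc and kfree were not executed on the same CPU, which lowers cache efficiency.
Frag: the percentage of fragmentation, i.e., (Total_alloc - Total_req) / Total_alloc. Fragmentation = allocated memory - requested memory, and this part is wasted.
When the --alloc option is used, you will also see Alloc Ptr, the address of the allocated memory.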
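The Frag column can be recomputed from the two size columns, which is a handy way to confirm how it is defined. A minimal sketch, assuming a POSIX shell with awk and a report laid out like the table above (pipe-separated columns, Total_alloc/Per in the second column and Total_req/Per in the third); kmem_frag is a hypothetical helper, not part of perf.

```shell
# Hypothetical helper (not part of perf): read rows of a
# "perf kmem stat --caller" table on stdin and print each callsite with
# its fragmentation: (Total_alloc - Total_req) / Total_alloc * 100.
kmem_frag() {
    awk -F'|' '
        # only data rows: header and dashed separator lines are skipped
        NF >= 6 && $2 ~ /[0-9]+\/[0-9]+/ {
            split($2, alloc, "/")   # alloc[1] = total bytes allocated
            split($3, req, "/")     # req[1]   = total bytes requested
            site = $1
            gsub(/^ +| +$/, "", site)
            printf "%s %.3f%%\n", site, (alloc[1] - req[1]) / alloc[1] * 100
        }'
}
```

Feeding it the first row of the table above prints perf_event_mmap+ec 49.902%, matching the Frag value perf reported: (311296 - 155952) / 311296 is about 49.902%.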

11. perf sched analyzes scheduler performance.
perf sched [<options>] {record|latency|map|replay|script}

Example:
# perf sched record sleep 10 // perf sched record <command>
# perf sched latency --sort max

12. perf probe defines custom (dynamic) tracepoints.

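13. Analyzing CPU utilization with perf:
You can sample the cycles event with perf record (or count it with perf stat) to analyze CPU usage.
CPU cycles (cpu-cycles) are the default performance event. A CPU cycle, also called a clock tick, is the smallest unit of time the CPU recognizes, roughly the time needed to execute the simplest instruction, such as reading the contents of a register. On a GHz-class processor one cycle lasts less than a nanosecond.
perf record -a -e cycles -o cycle.perf -g sleep 10 // sample all CPUs for 10 s based on CPU cycles, with call graphs
perf report -i cycle.perf // view the generated report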
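The length of one cycle is simply the reciprocal of the clock frequency, which makes it easy to sanity-check how short a clock tick is. A minimal sketch, assuming a POSIX shell with awk; cycle_ns is a hypothetical helper, not part of perf.

```shell
# Hypothetical helper (not part of perf): convert a clock frequency in GHz
# into the duration of one cycle in nanoseconds (period = 1 / frequency).
cycle_ns() {
    awk -v ghz="$1" 'BEGIN { printf "%.3f\n", 1 / ghz }'
}
```

At the 2.486 GHz effective clock rate quoted in the perf stat section, cycle_ns 2.486 prints 0.402 (about 0.4 ns per cycle), while an older 100 MHz processor (cycle_ns 0.1) takes 10 ns per cycle.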