Preface
This article is my working notes from day-to-day CPU performance optimization. The structure is somewhat disorganized and will be reworked in the future.
In my view, the most effective ways to learn performance optimization are, first, reading the articles of the grandmaster Brendan Gregg, and second, hands-on practice.
Load
- What is a reasonable load average?
  - Once the load average exceeds 70% of the CPU count, you should start investigating why it is high.
- Three situations drive the load average up (the commands after this list show how to observe them):
  - CPU-intensive processes: heavy CPU use raises the load average.
  - I/O-intensive processes: waiting on I/O also raises the load average, though CPU usage is not necessarily high.
  - Large numbers of processes waiting for the CPU: scheduling pressure raises the load average, and CPU usage will also be fairly high.
- Context switches:
  - cswch: voluntary context switches per second (the other counter is nvcswch).
    - A voluntary context switch happens when a process cannot get a resource it needs, for example when I/O or memory is insufficient.
  - nvcswch: non-voluntary context switches per second.
    - A non-voluntary context switch happens when the process is forcibly rescheduled by the system, for example because its time slice expired; it is common when many processes are competing for the CPU.
  - Cost of a context switch: on a smoothly running system it is roughly 3-5 microseconds, which is still slower than grabbing a lock or synchronizing a cache line.
  - Steps in Context Switching: explains what actually happens during a context switch.
  - How many Context Switches is “normal” (as a function of CPU cores (or other))? System calls Multi-Tasking
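A quick way to observe load and context switches in practice (a minimal sketch; the 70% figure above is a rule of thumb, not a hard limit):
# Compare the 1/5/15-minute load averages against the CPU count
nproc
uptime
# Per-process voluntary (cswch/s) and non-voluntary (nvcswch/s) context switches, once per second
pidstat -w 1
# System-wide context switches per second ("cs") and runnable/blocked tasks ("r"/"b")
vmstat 1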
CPU utilization
- Through the /proc virtual file system, Linux exposes internal kernel state to user space; /proc/stat provides the system's CPU and task statistics.
  - user (usually abbreviated us): user-mode CPU time. Note that it does not include the nice time below, but it does include guest time.
  - nice (usually abbreviated ni): low-priority user-mode CPU time, i.e. time of processes whose nice value is between 1 and 19. Note that nice ranges from -20 to 19, and a larger value means a lower priority.
  - system (usually abbreviated sys): kernel-mode CPU time.
  - idle (usually abbreviated id): idle time. Note that it does not include time spent waiting for I/O (iowait).
  - irq (usually abbreviated hi): CPU time spent handling hard interrupts.
  - softirq (usually abbreviated si): CPU time spent handling soft interrupts.
  - steal (usually abbreviated st): when the system runs in a virtual machine, the CPU time taken by other virtual machines.
  - guest (usually abbreviated guest): time spent running other operating systems through virtualization, i.e. CPU time spent running virtual machines.
  - guest_nice (usually abbreviated gnice): time spent running virtual machines at low priority.
- Performance tools report the average CPU utilization over a sampling interval, so pay attention to the interval you choose; the sketch below shows how such an average comes out of /proc/stat.
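A minimal sketch, assuming the standard /proc/stat field order (user nice system idle iowait irq softirq steal ...), that computes overall CPU usage between two samples one second apart:
# Take two samples of the aggregate "cpu" line
prev=($(grep '^cpu ' /proc/stat)); sleep 1; curr=($(grep '^cpu ' /proc/stat))
# Idle time = idle + iowait (numeric fields 4 and 5)
prev_idle=$(( prev[4] + prev[5] )); curr_idle=$(( curr[4] + curr[5] ))
prev_total=0; for v in "${prev[@]:1}"; do prev_total=$(( prev_total + v )); done
curr_total=0; for v in "${curr[@]:1}"; do curr_total=$(( curr_total + v )); done
# Usage = (busy time delta) / (total time delta) over the interval
echo "CPU usage: $(( 100 * ( (curr_total - prev_total) - (curr_idle - prev_idle) ) / (curr_total - prev_total) ))%"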
proc
- /proc/[pid]/stack shows the kernel-mode stack of the process (i.e. its main thread).
- /proc/[PID]/task/ contains one directory per thread of the process, and /proc/[PID]/task/[TID]/stack shows that thread's kernel-mode stack (see the sketch below).
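For example, a minimal sketch (with a hypothetical PID) that dumps the kernel stack of every thread in a process:
# Walk every thread directory of the process and print its kernel-mode stack
PID=12345   # hypothetical PID
for tid_dir in /proc/$PID/task/*/; do
    echo "=== TID $(basename "$tid_dir") ==="
    sudo cat "${tid_dir}stack"
done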
# -d highlights the regions that change between refreshes
$ watch -d cat /proc/interrupts
perf
Useful links
- linux 性能分析工具——perf(包括处理器特性介绍)
- The reference links in that article are good
- https://www.ibm.com/developerworks/cn/linux/l-cn-perf1/
- 如何用Perf解开服务器消耗的困境(Perf:跟踪问题三板斧)
- man page
When facing a problem program, it is best to take a top-down strategy: first get an overall picture of the statistical events while the program runs, then dig into details in a specific direction. Don't dive straight into the minutiae, or you won't see the forest for the trees.
Some programs are slow because they compute too much; most of their time should be spent on the CPU, and they are called CPU bound. Others are slow because of excessive I/O; their CPU utilization should be low, and they are called IO bound. Tuning a CPU-bound program is different from tuning an IO-bound one (a quick way to tell them apart is sketched below).
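A rough first check I use to tell the two apart (my own habit, not from the articles above): watch the us/sy columns versus wa and the blocked-task count in vmstat.
# 1-second samples, 5 times; high us/sy suggests CPU bound, high wa plus a non-zero "b" column suggests IO bound
vmstat 1 5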
perf list
The perf list command lists all events that perf can use as sampling points; perf list 'sched:*' lists the sched-related tracepoints.
Different systems will list different results; in the 2.6.35 kernel the list is already quite long, but however many there are, they fall into three categories:
- Hardware Event: events generated by the PMU hardware, such as cache hits; sample these when you need to understand how the program uses hardware features.
- Software Event: events generated by kernel software, such as process switches and tick counts. (man perf_event_open explains each of these events.)
- Tracepoint event: events triggered by static tracepoints in the kernel, used to examine details of kernel behaviour while the program runs, such as the number of allocations made by the slab allocator.
- Kernel tracepoints fall into the following categories:
- block: block device I/O
- ext4: file system operations
- kmem: kernel memory allocation events
- random: kernel random number generator events
- sched: CPU scheduler events
- syscalls: system call enter and exits
- task: task events
Particularly useful events:
- cpu-clock: the usual default; the sampled metric is the CPU clock.
- block:block_rq_issue: block I/O requests issued by the system.
perf stat
# CPU counter statistics for the specified command:
perf stat command
# Detailed CPU counter statistics (includes extras) for the specified command:
perf stat -d command
# CPU counter statistics for the specified PID, until Ctrl-C:
perf stat -p PID
# CPU counter statistics for the entire system, for 5 seconds:
perf stat -a sleep 5
# Various basic CPU statistics, system wide, for 10 seconds:
perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles -a sleep 10
# Various CPU level 1 data cache statistics for the specified command: cache相关
perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores command
# Various CPU data TLB statistics for the specified command:
perf stat -e dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses command
# Various CPU last level cache statistics for the specified command:
perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches command
# Using raw PMC counters, eg, counting unhalted core cycles:
perf stat -e r003c -a sleep 5
# PMCs: counting cycles and frontend stalls via raw specification:
perf stat -e cycles -e cpu/event=0x0e,umask=0x01,inv,cmask=0x01/ -a sleep 5
# Count syscalls per-second system-wide:
perf stat -e raw_syscalls:sys_enter -I 1000 -a
# Count system calls by type for the specified PID, until Ctrl-C:
perf stat -e 'syscalls:sys_enter_*' -p PID
# Count system calls by type for the entire system, for 5 seconds:
perf stat -e 'syscalls:sys_enter_*' -a sleep 5
# Count scheduler events for the specified PID, until Ctrl-C:
perf stat -e 'sched:*' -p PID
# Count scheduler events for the specified PID, for 10 seconds:
perf stat -e 'sched:*' -p PID sleep 10
# Count ext4 events for the entire system, for 10 seconds:
perf stat -e 'ext4:*' -a sleep 10
# Count block device I/O events for the entire system, for 10 seconds:
perf stat -e 'block:*' -a sleep 10
# Count all vmscan events, printing a report every second:
perf stat -e 'vmscan:*' -a -I 1000
perf record
perf record saves the sampled data; afterwards you use perf report to parse and display it.
- perf is often run with the -g option to sample call relationships, so you can inspect call chains.
- -p <pid> specifies the process ID to record.
Profiling
# Sample on-CPU functions for the specified command, at 99 Hertz:
perf record -F 99 command
# Sample on-CPU functions for the specified PID, at 99 Hertz, until Ctrl-C:
perf record -F 99 -p PID
# Sample on-CPU functions for the specified PID, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID sleep 10
# Sample CPU stack traces (via frame pointers) for the specified PID, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID -g -- sleep 10
# Sample CPU stack traces for the PID, using dwarf (dbg info) to unwind stacks, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID --call-graph dwarf sleep 10
# Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds (< Linux 4.11):
perf record -F 99 -ag -- sleep 10
# Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds (>= Linux 4.11):
perf record -F 99 -g -- sleep 10
# If the previous command didn't work, try forcing perf to use the cpu-clock event:
perf record -F 99 -e cpu-clock -ag -- sleep 10
# Sample CPU stack traces for a container identified by its /sys/fs/cgroup/perf_event cgroup:
perf record -F 99 -e cpu-clock --cgroup=docker/1d567f4393190204...etc... -a -- sleep 10
# Sample CPU stack traces for the entire system, with dwarf stacks, at 99 Hertz, for 10 seconds:
perf record -F 99 -a --call-graph dwarf sleep 10
# Sample CPU stack traces for the entire system, using last branch record for stacks, ... (>= Linux 4.?):
perf record -F 99 -a --call-graph lbr sleep 10
# Sample CPU stack traces, once every 10,000 Level 1 data cache misses, for 5 seconds:
perf record -e L1-dcache-load-misses -c 10000 -ag -- sleep 5
# Sample CPU stack traces, once every 100 last level cache misses, for 5 seconds:
perf record -e LLC-load-misses -c 100 -ag -- sleep 5
# Sample on-CPU kernel instructions, for 5 seconds:
perf record -e cycles:k -a -- sleep 5
# Sample on-CPU user instructions, for 5 seconds:
perf record -e cycles:u -a -- sleep 5
# Sample on-CPU user instructions precisely (using PEBS), for 5 seconds:
perf record -e cycles:up -a -- sleep 5
# Perform branch tracing (needs HW support), for 1 second:
perf record -b -a sleep 1
Static Tracing
# Trace new processes, until Ctrl-C:
perf record -e sched:sched_process_exec -a
# Sample (take a subset of) context-switches, until Ctrl-C:
perf record -e context-switches -a
# Trace all context-switches, until Ctrl-C:
# -c 1 sets sample_period to 1 (visible with -vv), i.e. every event is captured
perf record -e context-switches -c 1 -a
# Include raw settings used (see: man perf_event_open):
# -vv prints verbose details
perf record -vv -e context-switches -a
# Trace all context-switches via sched tracepoint, until Ctrl-C:
perf record -e sched:sched_switch -a
# Sample context-switches with stack traces, until Ctrl-C:
perf record -e context-switches -ag
# Sample context-switches with stack traces, for 10 seconds:
perf record -e context-switches -ag -- sleep 10
# Sample CS, stack traces, and with timestamps (< Linux 3.17, -T now default):
perf record -e context-switches -ag -T
# Sample CPU migrations, for 10 seconds:
perf record -e migrations -a -- sleep 10
# Trace all connect()s with stack traces (outbound connections), until Ctrl-C:
perf record -e syscalls:sys_enter_connect -ag
# Trace all accepts()s with stack traces (inbound connections), until Ctrl-C:
perf record -e syscalls:sys_enter_accept* -ag
# Trace all block device (disk I/O) requests with stack traces, until Ctrl-C:
perf record -e block:block_rq_insert -ag
# Sample at most 100 block device requests per second, until Ctrl-C:
perf record -F 100 -e block:block_rq_insert -a
# Trace all block device issues and completions (has timestamps), until Ctrl-C:
perf record -e block:block_rq_issue -e block:block_rq_complete -a
# Trace all block completions, of size at least 100 Kbytes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'nr_sector > 200'
# Trace all block completions, synchronous writes only, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'rwbs == "WS"'
# Trace all block completions, all types of writes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'rwbs ~ "*W*"'
# Sample minor faults (RSS growth) with stack traces, until Ctrl-C:
perf record -e minor-faults -ag
# Trace all minor faults with stack traces, until Ctrl-C:
perf record -e minor-faults -c 1 -ag
# Sample page faults with stack traces, until Ctrl-C:
perf record -e page-faults -ag
# Trace all ext4 calls, and write to a non-ext4 location, until Ctrl-C:
perf record -e 'ext4:*' -o /tmp/perf.data -a
# Trace kswapd wakeup events, until Ctrl-C:
perf record -e vmscan:mm_vmscan_wakeup_kswapd -ag
# Add Node.js USDT probes (Linux 4.10+):
perf buildid-cache --add `which node`
# Trace the node http__server__request USDT event (Linux 4.10+):
perf record -e sdt_node:http__server__request -a
perf report
# Show perf.data in an ncurses browser (TUI) if possible:
perf report
# Show perf.data with a column for sample count:
perf report -n
# Show perf.data as a text report, with data coalesced and percentages:
perf report --stdio
# Report, with stacks in folded format: one line per stack (needs 4.4):
perf report --stdio -n -g folded
# List all events from perf.data:
perf script
# List all perf.data events, with data header (newer kernels; was previously default):
perf script --header
# List all perf.data events, with customized fields (< Linux 4.1):
perf script -f time,event,trace
# List all perf.data events, with customized fields (>= Linux 4.1):
perf script -F time,event,trace
# List all perf.data events, with my recommended fields (needs record -a; newer kernels):
perf script --header -F comm,pid,tid,cpu,time,event,ip,sym,dso
# List all perf.data events, with my recommended fields (needs record -a; older kernels):
perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso
# Dump raw contents from perf.data as hex (for debugging):
perf script -D
# Disassemble and annotate instructions with percentages (needs some debuginfo):
perf annotate --stdio
perf probe
Dynamic Tracing
# Add a tracepoint for the kernel tcp_sendmsg() function entry ("--add" is optional):
perf probe --add tcp_sendmsg
# Remove the tcp_sendmsg() tracepoint (or use "--del"):
perf probe -d tcp_sendmsg
# Add a tracepoint for the kernel tcp_sendmsg() function return:
perf probe 'tcp_sendmsg%return'
# Show available variables for the kernel tcp_sendmsg() function (needs debuginfo):
perf probe -V tcp_sendmsg
# Show available variables for the kernel tcp_sendmsg() function, plus external vars (needs debuginfo):
perf probe -V tcp_sendmsg --externs
# Show available line probes for tcp_sendmsg() (needs debuginfo):
perf probe -L tcp_sendmsg
# Show available variables for tcp_sendmsg() at line number 81 (needs debuginfo):
perf probe -V tcp_sendmsg:81
# Add a tracepoint for tcp_sendmsg(), with three entry argument registers (platform specific):
perf probe 'tcp_sendmsg %ax %dx %cx'
# Add a tracepoint for tcp_sendmsg(), with an alias ("bytes") for the %cx register (platform specific):
perf probe 'tcp_sendmsg bytes=%cx'
# Trace previously created probe when the bytes (alias) variable is greater than 100:
perf record -e probe:tcp_sendmsg --filter 'bytes > 100'
# Add a tracepoint for tcp_sendmsg() return, and capture the return value:
perf probe 'tcp_sendmsg%return $retval'
# Add a tracepoint for tcp_sendmsg(), and "size" entry argument (reliable, but needs debuginfo):
perf probe 'tcp_sendmsg size'
# Add a tracepoint for tcp_sendmsg(), with size and socket state (needs debuginfo):
perf probe 'tcp_sendmsg size sk->__sk_common.skc_state'
# Tell me how on Earth you would do this, but don't actually do it (needs debuginfo):
perf probe -nv 'tcp_sendmsg size sk->__sk_common.skc_state'
# Trace previous probe when size is non-zero, and state is not TCP_ESTABLISHED(1) (needs debuginfo):
perf record -e probe:tcp_sendmsg --filter 'size > 0 && skc_state != 1' -a
# Add a tracepoint for tcp_sendmsg() line 81 with local variable seglen (needs debuginfo):
perf probe 'tcp_sendmsg:81 seglen'
# Add a tracepoint for do_sys_open() with the filename as a string (needs debuginfo):
perf probe 'do_sys_open filename:string'
# Add a tracepoint for myfunc() return, and include the retval as a string:
perf probe 'myfunc%return +0($retval):string'
# Add a tracepoint for the user-level malloc() function from libc:
perf probe -x /lib64/libc.so.6 malloc
# Add a tracepoint for this user-level static probe (USDT, aka SDT event):
perf probe -x /usr/lib64/libpthread-2.24.so %sdt_libpthread:mutex_entry
# List currently available dynamic probes:
perf probe -l
MISC
Mixed
# Trace system calls by process, showing a summary refreshing every 2 seconds:
perf top -e raw_syscalls:sys_enter -ns comm
# Trace sent network packets by on-CPU process, rolling output (no clear):
stdbuf -oL perf top -e net:net_dev_xmit -ns comm | strings
# Sample stacks at 99 Hertz, and, context switches:
perf record -F99 -e cpu-clock -e cs -a -g
# Sample stacks to 2 levels deep, and, context switch stacks to 5 levels (needs 4.8):
perf record -F99 -e cpu-clock/max-stack=2/ -e cs/max-stack=5/ -a -g
Special
# Record cacheline events (Linux 4.10+):
perf c2c record -a -- sleep 10
# Report cacheline events from previous recording (Linux 4.10+):
perf c2c report
Other subcommands
- perf c2c (Linux 4.10+): cache-to-cache and cacheline false sharing analysis.
- perf kmem: kernel memory allocation analysis.
- perf kvm: KVM virtual guest analysis.
- perf lock: lock analysis.
- perf mem: memory access analysis.
- perf sched: kernel scheduler statistics. Examples.
Using perf on a program inside a Docker container
- The simplest approach is to run perf record outside the container, then copy the data file into the container for analysis.
- First run perf record -g -p <pid>, let it run for a while (say 15 seconds), then press Ctrl+C to stop.
- Then copy the generated perf.data file into the container and analyze it there:
docker cp perf.data phpfpm:/tmp
docker exec -i -t phpfpm bash
perf top
- It shows in real time the functions or instructions that consume the most CPU cycles, so it can be used to find hot functions.
- perf is often run with the -g option to sample call relationships and inspect call chains; -p <pid> specifies the process ID.
- Each line has four columns:
  - Overhead: the proportion of all samples attributed to that symbol's performance events, as a percentage.
  - Shared: the Dynamic Shared Object the function or instruction belongs to, such as the kernel, a process name, a shared library name, or a kernel module name.
  - Object: the type of the dynamic shared object, e.g. [.] for a user-space executable or shared library, [k] for kernel space.
  - Symbol: the symbol name, i.e. the function name; when the function name is unknown, a hexadecimal address is shown.
# Sample CPUs at 49 Hertz, and show top addresses and symbols, live (no perf.data file):
perf top -F 49
# Sample CPUs at 49 Hertz, and show top process names and segments, live:
perf top -F 49 -ns comm,dso
Optimizing with perf
- Optimize system call usage (see the example after this list).
- Optimize cache utilization.
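For example (a sketch reusing perf stat invocations from the section above) to quantify both aspects before and after a change:
# Syscall counts by type for the target process (reduce or batch syscalls if these are high)
perf stat -e 'syscalls:sys_enter_*' -p <pid> sleep 10
# Cache behaviour of the target process (improve locality if the miss rates are high)
perf stat -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses -p <pid> sleep 10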
Generating flame graphs with perf
# 1. Sample
# -g tells perf record to also record the call relationships between functions
# -e cpu-clock sets the sampled event to the CPU clock
# -p specifies the PID of the process to record
# -a across all cpus
sudo perf record -a -e cpu-clock -p <pid> -F 1000 -g sleep 60
sudo perf record -a -e cpu-clock -F 1000 -g -p <pid>
sudo perf record --call-graph dwarf -p pid
# 2. Parse perf.data with perf script
perf script -i perf.data &> perf.unfold
# The following steps use the FlameGraph repo https://github.com/brendangregg/FlameGraph.git
# 3. Fold the stacks in perf.unfold
xxx/stackcollapse-perf.pl perf.unfold &> perf.folded
# 4. Finally generate the SVG
xxx/flamegraph.pl perf.folded > perf.svg
The flame-graph workflow from Brendan Gregg's blog
Once stackcollapse-perf.pl has produced the folded file, you can use grep to build exactly the flame graph you want:
perf script | ./stackcollapse-perf.pl > out.perf-folded
# Filter out the cpu_idle threads so the focus stays on the code that really consumes CPU
grep -v cpu_idle out.perf-folded | ./flamegraph.pl > nonidle.svg
grep ext4 out.perf-folded | ./flamegraph.pl > ext4internals.svg
egrep 'system_call.*sys_(read|write)' out.perf-folded | ./flamegraph.pl > rw.svg
I frequently elide the cpu_idle threads in this way, to focus on the real threads that are consuming CPU resources. If I miss this step, the cpu_idle threads can often dominate the flame graph, squeezing the interesting code paths.
Note that it would be a little more efficient to process the output of perf report instead of perf script; better still, perf report could have a report style (eg, “-g folded”) that output folded stacks directly, obviating the need for stackcollapse-perf.pl. There could even be a perf mode that output the SVG directly (which wouldn’t be the first one; see perf-timechart), although, that would miss the value of being able to grep the folded stacks (which I use frequently).
There are more examples of perf_events CPU flame graphs on the CPU flame graph page, including a summary of these instructions. I have also shared an example of using perf for a Block Device I/O Flame Graph.
Generating flame graphs with SystemTap
SystemTap can do the aggregation in-kernel and pass a (much smaller) report to user-land. The data collected and output generated can be customized much further via its scripting language.
The timer.profile probe was used, which samples all CPUs at 100 Hertz.
# -D MAXBACKTRACE=100 -D MAXSTRINGLEN=4096 keep the stack traces from being truncated
stap -s 32 -D MAXBACKTRACE=100 -D MAXSTRINGLEN=4096 -D MAXMAPENTRIES=10240 \
-D MAXACTION=10000 -D STP_OVERLOAD_THRESHOLD=5000000000 --all-modules \
-ve 'global s; probe timer.profile { s[backtrace()] <<< 1; }
probe end { foreach (i in s+) { print_stack(i);
printf("\t%d\n", @count(s[i])); } } probe timer.s(60) { exit(); }' \
> out.stap-stacks
./stackcollapse-stap.pl out.stap-stacks > out.stap-folded
cat out.stap-folded | ./flamegraph.pl > stap-kernel.svg
perf script
perf script converts the file recorded by perf record into readable per-sample records.
perf usage notes
Omitting frame pointers leaves perf with incomplete stacks. There are several ways to work around this (see the sketch after this list):
- Compile with -fno-omit-frame-pointer. If the build is over-optimized with -fomit-frame-pointer, stacks will be incomplete. I suspect this optimization only kicks in at -O2, so the safest is to build with -O0 -fno-omit-frame-pointer.
- If stacks are still incomplete with that option, my guess is that the libc library was built with the frame pointer optimized away.
- Run perf record with --call-graph lbr; this requires a CPU that supports the last branch record feature.
- Use dwarf (i.e. debug info): run perf record with --call-graph dwarf or -g dwarf.
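A minimal sketch of the first workaround, assuming a hypothetical app.c built with gcc (not from the original notes):
# Build with frame pointers preserved so perf can walk the stack via frame pointers
# (-O2 -fno-omit-frame-pointer also works, since the explicit flag overrides the -O2 default)
gcc -O0 -g -fno-omit-frame-pointer -o app app.c
# Sample call graphs at 99 Hz for 10 seconds and inspect them
perf record -F 99 -g -p $(pgrep -nx app) -- sleep 10
perf report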
Off-CPU analysis
Off-CPU time means the process has entered the kernel through a system call and is blocked there. The point of off-CPU analysis is to measure which system calls the process blocks in and how much time those system calls take.
As shown below, off-CPU analysis mainly targets blocking system calls (sys CPU is not high): the process is spending its time waiting for some resource (such as I/O).
Off-CPU Tracing -------------------------------------------->
| |
B B
A A
A(---------. .----------)
| |
B(--------. .--)
| | user-land
- - - - - - - - - - syscall - - - - - - - - - - - - - - - - -
| | kernel
X Off-CPU |
block . . . . . interrupt
Measuring off-CPU time
The simplest tool is the time command, which shows the time spent in sys. Below, the command took about 50.8 s of wall-clock time, but user plus sys add up to only about 12.7 s; roughly 38.1 s have disappeared, and that is the time spent off-CPU (reading and writing I/O).
$ time tar cf archive.tar linux-4.15-rc2
real 0m50.798s
user 0m1.048s
sys 0m11.627s
bcc commands
# Summarize off-CPU time as a histogram
/usr/share/bcc/tools/cpudist -O -p <pid>
# Summarize off-CPU time; user-mode and kernel-mode stacks are both captured (and delimited)
/usr/share/bcc/tools/offcputime -df -p `pgrep -nx mysqld` 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us < out.stacks > out.svg
# Count malloc() calls (with stacks) for a PID. Because BPF aggregates in the kernel, stackcount only prints its totals after Ctrl+C, which avoids a kernel/user transition for every malloc() the process makes.
# Functions accept * wildcards, e.g. c:*map matches 18 functions
/usr/share/bcc/tools/stackcount -p 19183 -U c:malloc > out.stacks
# Turn the output above into a flame graph
./stackcollapse.pl < out.stacks | ./flamegraph.pl --color=mem \
--title="malloc() Flame Graph" --countname="calls" > out.svg
Notes
Request-Synchronous Context
Many of the services we deploy have processes that sit mostly in a waiting state (network threads waiting for packets, worker threads waiting for tasks). That noise gets in the way of off-CPU analysis. What we actually want is the off-CPU time a thread accumulates while handling a single request, so we should pick out only the useful stacks from the offcputime output above for analysis (see the example below).
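For example (a sketch; do_command is a hypothetical request-handling function, substitute your own), keep only the folded stacks that pass through the request path before rendering the flame graph:
# Keep only off-CPU stacks that go through the (hypothetical) request handler
grep do_command out.stacks | ./flamegraph.pl --color=io --countname=us \
    --title="Off-CPU Time (request path only)" > request-offcpu.svg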
Scheduler Latency
Something that’s missing from these stacks is if the off-CPU time includes time spent waiting on a CPU run queue. This time is known as scheduler latency, run queue latency, or dispatcher queue latency. If the CPUs are running at saturation, then any time a thread blocks, it may endure additional time waiting its turn on a CPU after being woken up. That time will be included in the off-CPU time.
You can use extra trace events to tease apart off-CPU time into time blocked vs scheduler latency, but in practice, CPU saturation is pretty easy to spot, so you are unlikely to be spending much time studying off-CPU time when you have a known CPU saturation issue to deal with.
In other words, if CPU load is too high, part of the off-CPU time is actually spent waiting to be scheduled (see the runqlat example below).
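One way to look at that run-queue wait directly (my addition, not part of the quoted text) is bcc's runqlat, which prints scheduler/run-queue latency as a histogram:
# Run queue (scheduler) latency histogram, printed every second
/usr/share/bcc/tools/runqlat 1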
Involuntary context switching
Involuntary context switches also produce context switches, but they are usually not the problem we are trying to locate. The option below restricts the analysis to off-CPU time in the D state, TASK_UNINTERRUPTIBLE:
/usr/share/bcc/tools/offcputime -p `pgrep -nx mysqld` --state 2
On Linux, involuntary context switches occur for state TASK_RUNNING (0), whereas the blocking events we’re usually interested in are in TASK_INTERRUPTIBLE (1) or TASK_UNINTERRUPTIBLE (2), which offcputime can match on using --state. I used this feature in my Linux Load Averages: Solving the Mystery post.
IO
An increase in sync I/O latency shows up directly as increased service latency. We need a way to record the time of every sync I/O system call so we can draw a flame graph from it. Brendan wrote a tool for this, fileiostacks.py. Once we have established that sync I/O is causing the performance problem, we can analyze it this way.
The commands below produce a flame graph containing only the stacks of sync I/O going through the file system:
/usr/share/bcc/tools/fileiostacks.py -f 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="File I/O Time Flame Graph" --countname=us < out.stacks > out.svg
But file I/O is not the only kind of I/O; there is also block-device I/O and network I/O. Below is a block-device example.
Using perf
perf record -e block:block_rq_insert -a -g -- sleep 30
perf script --header > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./stackcollapse-perf.pl < out.stacks | ./flamegraph.pl --color=io \
--title="Block I/O Flame Graph" --countname="I/O" > out.svg
Using bcc
/usr/share/bcc/tools/biostacks.py -f 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="Block I/O Time Flame Graph" --countname=us < out.stacks > out.svg
Off-CPU
Unlike the above, here we want to analyze all off-CPU events.
/usr/share/bcc/tools/offcputime -df -p `pgrep -nx mysqld` 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us < out.stacks > out.svg
wake-up
This flame graph focuses on the waking threads, so we can see exactly which thread woke a given process. It seems quite useful: for example, when lock overhead is high, we can work out which processes are doing the waking.
The kernel wmb thundering-herd problem we ran into before should be diagnosable with this method.
/usr/share/bcc/tools/wakeuptime -f 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=wakeup --title="Wakeup Time Flame Graph" --countname=us < out.stacks > out.svg
Chain Graphs
Future readers, too, will be moved by these words.
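Chain graphs combine the off-CPU stacks with the stacks of whoever woke the blocked thread. A minimal sketch using bcc's offwaketime together with the same FlameGraph repo as above:
# Off-CPU stacks merged with waker stacks, folded output, 30 seconds
/usr/share/bcc/tools/offwaketime -f 30 > out.stacks
./flamegraph.pl --color=chain --title="Off-Wake Time Flame Graph" --countname=us < out.stacks > out.svg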
Overview of the bcc tool suite
cd /usr/share/bcc/tools/
Single Purpose Tools:
# Trace new processes:
execsnoop
# Trace file opens with process and filename:
opensnoop
# Summarize block I/O (disk) latency as a power-of-2 distribution by disk:
biolatency -D
# Summarize block I/O size as a power-of-2 distribution by program name:
bitesize
# Trace common ext4 file system operations slower than 1 millisecond:
ext4slower 1
# Trace TCP active connections (connect()) with IP address and ports:
tcpconnect
# Trace TCP passive connections (accept()) with IP address and ports:
tcpaccept
# Trace TCP connections to local port 80, with session duration:
tcplife -L 80
# Trace TCP retransmissions with IP addresses and TCP state:
# only retransmitted packets are printed
tcpretrans
# Sample stack traces at 49 Hertz for 10 seconds, emit folded format (for flame graphs):
profile -fd -F 49 10
# Trace details and latency of resolver DNS lookups:
gethostlatency
# Trace commands issued in all running bash shells:
bashreadline
Multi Tools: Kernel Dynamic Tracing
# Count "tcp_send*" kernel function, print output every second:
funccount -i 1 'tcp_send*'
# Count "vfs_*" calls for PID 185:
funccount -p 185 'vfs_*'
# Trace file names opened, using dynamic tracing of the kernel do_sys_open() function:
trace 'p::do_sys_open "%s", arg2'
# Same as before ("p:: is assumed if not specified):
trace 'do_sys_open "%s", arg2'
# Trace the return of the kernel do_sys_open() function, and print the retval:
trace 'r::do_sys_open "ret: %d", retval'
# Trace do_nanosleep() kernel function and the second argument (mode), with kernel stack traces:
trace -K 'do_nanosleep "mode: %d", arg2'
# Trace do_nanosleep() mode by providing the prototype (no debuginfo required):
trace 'do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode) "mode: %d", mode'
# Trace do_nanosleep() with the task address (may be NULL), noting the dereference:
trace 'do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode) "task: %x", t->task'
# Frequency count tcp_sendmsg() size:
argdist -C 'p::tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size):u32:size'
# Summarize tcp_sendmsg() size as a power-of-2 histogram:
argdist -H 'p::tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size):u32:size'
# Frequency count stack traces that lead to the submit_bio() function (disk I/O issue):
stackcount submit_bio
# Summarize the latency (time taken) by the vfs_read() function for PID 181:
# Summarize function latency; we can also use this to optimize our own processes
funclatency -p 181 -u vfs_read
Multi Tools: User Level Dynamic Tracing
# Trace the libc library function nanosleep() and print the requested sleep details:
trace 'p:c:nanosleep(struct timespec *req) "%d sec %d nsec", req->tv_sec, req->tv_nsec'
# Count the libc write() call for PID 181 by file descriptor:
argdist -p 181 -C 'p:c:write(int fd):int:fd'
# Summarize the latency (time taken) by libc getaddrinfo(), as a power-of-2 histogram in microseconds:
funclatency.py -u 'c:getaddrinfo'
Multi Tools: Kernel Static Tracing
# Count stack traces that led to issuing block I/O, tracing its kernel tracepoint:
stackcount t:block:block_rq_insert
Multi Tools: User Statically Defined Tracing (USDT)
# Trace the pthread_create USDT probe, and print arg1 as hex:
trace 'u:pthread:pthread_create "%x", arg1'
perf-tools
Some of its tools have been absorbed into bcc, but others, such as iosnoop, are still only available in perf-tools.
References
- linux 性能调优工具perf + 火焰图 常用命令
- 系统级性能分析工具perf的介绍与使用 (a very thorough introduction to perf)
- Flame Graphs (eye-opening; also carries the latest on system optimization)
- CPU Flame Graphs (Brendan Gregg; excellent)
- Off-CPU Flame Graphs (a series of bcc/perf tools, including off-CPU analysis, I/O analysis, and more)
- 关于-fno-omit-frame-pointer与-fomit-frame-pointer
- perf Examples (excellent)
- SystemTap新手指南中文翻译
- perf-tools (many useful tools, e.g. tcpretrans, which captures only retransmitted TCP packets)
- Linux Perf Tools Tips (very good; covers many problems, e.g. what to do when perf can't find symbols, plus SystemTap material)
- eBPF
- Linux Extended BPF (eBPF) Tracing Tools
- Userspace stack is not unwinded in most samples with offcputime.py
- Speed up SystemTap script monitoring of system calls
- What do ‘real’, ‘user’ and ‘sys’ mean in the output of time(1)?
- Linux Load Averages: Solving the Mystery
- 漫话性能:CPU上下文切换 (an excellent in-depth article)