Revision history
- 2023/03/23 Added and expanded the pidstat section
- 2023/06/13 Expanded blktrace (a genuinely powerful tool for I/O analysis)
- 2023/06/14 Added the "what to do in the first 60 seconds of analysis" subsection
Preface
To do a job well, one must first sharpen one's tools. To analyze the various problems that arise on a Linux server, such as performance issues or bugs in a service, you need a working knowledge of the analysis tools available on the system. This article is a brief summary of the mainstream CPU, memory, network and I/O tools, plus several debugging and tracing tools (blktrace, perf, systemtap).
Note: learning never ends, so this document is continuously updated.
Performance analysis tools

Linux performance analysis tools: www.brendangregg.com/linuxperf.html
Overview
What to do in the first 60 seconds of analysis
When analyzing a problem, first get a rough picture of the whole system. The article “分析问题的前60秒应该做的事情” (essentially the classic 60-second performance checklist) lays out the tools to reach for first; in short, the following steps:
uptime
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top
top
The most basic dashboard tool; shows per-process and system-wide CPU, memory and other statistics
atop
Besides a very complete dashboard (CPU, memory, disk, network), atop shows per-process CPU and disk I/O in real time; within each sampling interval it also records why exited processes terminated (a normal exit or a signal; in the former case the exit code, in the latter which signal was received)
It records the system state at a fixed interval; the collected data covers system resource usage (CPU, memory, disk and network) and per-process activity, and can be saved to log files on disk. When a server runs into trouble, we can pull the corresponding atop log files for after-the-fact analysis.
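A minimal sketch of that logging workflow (the path and interval are illustrative):
# write a raw sample every 10 seconds into a log file
atop -w /tmp/atop.raw 10
# replay the recorded file interactively later
atop -r /tmp/atop.raw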
http://bean-li.github.io/atop-exit-code/
https://www.cnblogs.com/xybaby/p/8098229.html
dstat
hzwuhongsong@pubt1-ceph2:~$ sudo dstat
You did not select any stats, using -cdngy by default.
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
0 0 99 0 0 0| 17M 4367k| 0 0 | 0 0 |3548 31k
0 0 99 0 0 0| 0 1733k|1643k 1608k| 0 0 |9497 16k
0 0 99 0 0 0|8192B 2309k|1636k 1672k| 0 0 |9916 18k
By default dstat prints totals for each resource; with the right flags you can restrict it to a single CPU, NIC or disk
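For example (the device names are illustrative):
# CPU stats for cores 0 and 3 only, disk stats for sda only, network stats for eth0 only
dstat -c -C 0,3 -d -D sda -n -N eth0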
sar
sar (System Activity Reporter) is one of the most comprehensive system performance analysis tools on Linux. It can report on many aspects of system activity, including file reads and writes, system call usage, disk I/O, CPU utilization, memory usage, process activity, and IPC-related activity.
To pin down a system bottleneck you often need to combine several sar options:
- If you suspect a CPU bottleneck
use sar -u and sar -q, among others (typical invocations follow this list)
- If you suspect a memory bottleneck
use sar -B, sar -r and sar -W
- If you suspect an I/O bottleneck
use sar -b, sar -u and sar -d
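Typical invocations (1-second interval, 5 samples):
# CPU utilization
sar -u 1 5
# run-queue length and load averages
sar -q 1 5
# memory utilization
sar -r 1 5
# per-device disk activity (-p prints friendly device names)
sar -d -p 1 5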
Option summary:
-A produce all reports
-a report file read/write activity
-B report paging statistics (pages swapped in/out and page faults)
-b report buffer/transfer (I/O) activity
-c report system call usage
-d report block device (disk) activity
-g report serial line activity
-h report buffer usage statistics
-m report IPC message queue and semaphore usage
-n report network statistics (e.g. sar -n DEV above)
-p report paging activity
-q report average run queue and swap queue lengths
-R report process activity
-r report unused memory pages and disk blocks
-u report CPU utilization
-v report status of the process, inode, file and lock tables
-w report system swapping activity
-y report TTY device activity
- Limitation
sar only shows system-wide figures; per-process details such as page faults can be obtained with pidstat instead
vmstat
vmstat (Virtual Memory Statistics) monitors the operating system's virtual memory, processes and CPU activity in real time
vmstat output fields (a sample invocation follows the list):
- Procs (processes):
r: number of processes in the run queue
b: number of processes blocked waiting for I/O
- Memory:
swpd: amount of virtual (swap) memory in use
free: amount of idle memory
buff: memory used as buffers
cache: memory used as page cache
- Swap:
si: amount of memory swapped in from disk per second
so: amount of memory swapped out to disk per second
- IO (the block size is 1024 bytes on current Linux):
bi: blocks read per second
bo: blocks written per second
- system:
in: interrupts per second, including the clock
cs: context switches per second
- CPU (as percentages):
us: time spent running user code (user time)
sy: time spent running kernel code (system time)
id: idle time (prior to Linux 2.5.41 this included I/O-wait time)
wa: time spent waiting for I/O
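A typical invocation (-w, wide output, needs a reasonably recent procps):
# sizes in MB, one sample per second, ten samples
vmstat -w -S m 1 10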
pidstat
pidstat - per-process (and per-thread) stack, disk, memory, CPU and context-switch statistics
pidstat monitors the system resources used by all or selected processes (it can even drill down into threads). It can report:
- CPU (-u; add -t to break the numbers down per thread),
- memory (-r), covering both memory usage and page faults (the key point being that it shows these for a single process),
- disk I/O (-d),
- context switches (-w),
- stack usage (-s).
On its first sample pidstat reports statistics accumulated since boot; each subsequent sample reports the delta since the previous one. You choose the sampling interval and count:
pidstat 2 10 (interval and count)
Per-process (or per-thread) memory
Besides the usual memory numbers, it also reports page-fault activity:
-r Report page faults and memory utilization.
minflt/s
Total number of minor faults the task has made per
second, those which have not required loading a
memory page from disk.
majflt/s
Total number of major faults the task has made per
second, those which have required loading a memory
page from disk.
VSZ Virtual Size: The virtual memory usage of entire
task in kilobytes.
RSS Resident Set Size: The non-swapped physical memory
used by the task in kilobytes.
%MEM The task's currently used share of available
physical memory.
Per-process (or per-thread) disk I/O
-d Report I/O statistics (kernels 2.6.20 and later only).
The following values may be displayed:
PID The identification number of the task being
monitored.
iodelay
Block I/O delay of the task being monitored,
measured in clock ticks. This metric includes the
delays spent waiting for sync block I/O completion
and for swapin block I/O completion.
Command
The command name of the task.
Per-process (or per-thread) stack usage
-s Report stack utilization. The following values may be displayed:
StkSize
The amount of memory in kilobytes reserved for the
task as stack, but not necessarily used.
StkRef The amount of memory in kilobytes used as stack,
referenced by the task.
Per-process (or per-thread) CPU usage
-u Report CPU utilization.
Per-process (or per-thread) context switches (-w)
cswch/s
Total number of voluntary context switches the task
made per second. A voluntary context switch occurs
when a task blocks because it requires a resource
that is unavailable.
nvcswch/s
Total number of non voluntary context switches the
task made per second. An involuntary context
switch takes place when a task executes for the
duration of its time slice and then is forced to
relinquish the processor.
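A combined invocation (the PID is illustrative):
# context switches (-w) and CPU usage (-u), broken down per thread (-t), of process 1234, once per second
pidstat -w -u -t -p 1234 1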
Call tracing
ltrace
Traces the library calls a process makes.
-f trace child processes.
-l only print calls from the given library.
-o, --output=file write output to a file.
-p PID attach to the process with the given PID.
-r print relative timestamps.
-S show system calls as well.
-t, -tt, -ttt print absolute timestamps.
-T show the time spent inside each call.
-u USERNAME run the command under the given user/group ID.
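For example (the binary name is illustrative):
# trace library calls of ./myprog and its children, with per-call timings, written to out.txt
ltrace -f -T -o out.txt ./myprog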
strace
strace is commonly used to trace the system calls a process makes and the signals it receives. It can capture each system call with its arguments, return value, and the time spent in it.
sudo strace -f -T -tt -e trace=all -o output.txt -p 164240
// The line above traces all system calls (-e trace=all) of process 164240, records the time spent in each call (-T) plus absolute hh:mm:ss timestamps (-tt), and saves the results to output.txt
For a multi-process program like ceph you generally want -f, which traces not just the main process but also the children it spawns
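When you only need an overview rather than every single call, -c prints a summary of call counts and cumulative times instead:
# count syscalls rather than printing each one; Ctrl+C prints the summary table
sudo strace -c -f -p 164240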
CPU tools
mpstat
mpstat (Multiprocessor Statistics) is a real-time monitoring tool. It reports CPU statistics sourced from /proc/stat. On multi-CPU systems it can show both the average across all CPUs and the figures for one specific CPU. Its biggest strength is per-core statistics; similar tools such as vmstat only show the system-wide CPU picture.
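Typical invocations:
# statistics for every core, refreshed each second
mpstat -P ALL 1
# or a single core, five samples
mpstat -P 0 1 5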
perf
perf top
-e <event>: the performance event to analyze.
-p <pid>: profile events on existing process IDs (comma-separated list); only the target process and the threads it creates are analyzed.
-d <n>: refresh period of the display, 2s by default, because perf top reads the mmap'd buffer every 2 seconds.
-g: capture the function call graph.
Events that can be passed to -e include:
cpu-clock: the processor time actually consumed by the task, in ms. CPUs utilized = task-clock / time elapsed, i.e. the CPU utilization.
context-switches: number of context switches while the program ran
cpu-migrations: number of CPU migrations while the program ran. To keep load balanced across processors, Linux will under certain conditions migrate a task from one CPU to another.
CPU migrations vs. context switches: a context switch does not necessarily imply a CPU migration, but a CPU migration always implies a context switch. A context switch may merely swap the context off the current CPU, and the scheduler may put the process back on the same CPU next time.
page-faults: number of page faults. A fault is triggered when the requested page has not been set up yet, is not in memory, or is in memory but its virtual-to-physical mapping has not been established; TLB misses and page-permission mismatches can also end up as faults.
cycles: processor cycles consumed. If you treat the cycles used by a task (ls in the original example) as coming from a single processor, its effective clock rate works out as cycles / task-clock (2.486 GHz in that example).
stalled-cycles-frontend: cycles in which the instruction fetch/decode stages stalled instead of reaching their ideal parallelism.
stalled-cycles-backend: cycles in which the execution stage stalled.
instructions: how many instructions were executed. IPC is the average number of instructions executed per CPU cycle.
branches: number of branch instructions encountered; branch-misses is the number of mispredicted branch instructions.
perf record -e cpu-clock -g -p 4966
-g tells perf record to also capture the function call graph
-e cpu-clock samples on the cpu-clock event (CPU time consumed)
-p selects the PID to record
Crucially, this also shows the call stacks on which the process spends its time.
Let it run for a while, then interrupt with Ctrl+C; a perf.data file is produced, which you can inspect with:
perf report -i perf.data
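To turn those stacks into a picture, a common follow-up is Brendan Gregg's FlameGraph scripts (https://github.com/brendangregg/FlameGraph); a sketch, assuming the repo is cloned into ./FlameGraph:
perf script -i perf.data | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > flame.svg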
tiptop
There is little written about it; see the man page:
The tiptop program provides a dynamic real-time view of the tasks running in the system. tiptop is very similar to top (1), but the information displayed comes from
hardware counters.
Memory tools
free
slabtop
usage: slabtop [options]
options:
--delay=n, -d n delay n seconds between updates
--once, -o only display once, then exit
--sort=S, -s S specify sort criteria S (see below)
--version, -V display version information and exit
--help display this help and exit
The following are valid sort criteria:
a: sort by number of active objects
b: sort by objects per slab
c: sort by cache size
l: sort by number of slabs
v: sort by number of active slabs
n: sort by name
o: sort by number of objects
p: sort by pages per slab
s: sort by object size
u: sort by cache utilization
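For example:
# one-shot dump, sorted by cache size
slabtop -o -s c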
pmap
hzwuhongsong@pubt1-ceph72:~$ sudo pmap -d 5475 (the process id)
5475: /usr/bin/ceph-osd -f --cluster ceph --id 34 --setuser ceph --setgroup ceph
Address Kbytes Mode Offset Device Mapping
0000561f25d4a000 19592 r-x-- 0000000000000000 008:00002 ceph-osd
0000561f2726b000 148 r---- 0000000001321000 008:00002 ceph-osd
0000561f27290000 92 rw--- 0000000001346000 008:00002 ceph-osd
0000561f272a7000 133296 rw--- 0000000000000000 000:00000 [ anon ]
0000561f2f903000 2090828 rw--- 0000000000000000 000:00000 [ anon ]
00007f0816807000 2044 ----- 0000000000004000 008:00002 libuuid.so.1.3.0
00007f08172c2000 1480 r-x-- 0000000000000000 008:00002 libstdc++.so.6.0.22
mapped: 2836640K writeable/private: 2684632K shared: 136K
This command shows the memory layout of a process. In the last line:
- mapped: the amount of memory the process has mapped to files.
- writeable/private: the private address space used by the process.
- shared: the address space shared with other processes.
I/O tools
iotop
Shows per-process I/O statistics
iostat
blktrace
- Analyzing I/O latency
The usual workflow:
- run blktrace to collect raw I/O events from the disk;
- run blkparse to decode the collected data (the output is very verbose and hard to read directly)
- run btt to summarize the blkparse results; the result (figure below) is far easier to digest

blktrace -d /dev/nvme1n1p2
blkparse -i nvme1n1p2 -d nvme1n1p2.blktrace.bin
btt -i nvme1n1p2.blktrace.bin
- Some interesting uses
- Latency plot of a single stage (e.g. d2c)
btt -i sdb.blktrace.bin -l sdb.d2c_latency
btt -i sdb.blktrace.bin -q sdb.q2c_latency
With the resulting data you can draw the curve in Excel or a similar tool
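If you would rather stay on the command line than reach for Excel, gnuplot can plot the two-column (time, latency) file directly. A sketch, assuming the output file follows the <prefix>_<major,minor>_<stage>.dat naming seen below:
# plot the d2c latency curve into a PNG
gnuplot -e "set terminal png size 1024,400; set output 'd2c.png'; plot 'sdb.d2c_latency_8,16_d2c.dat' using 1:2 with lines title 'd2c latency'"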

- Per-second IOPS and bandwidth plots
blkparse -i sdb -d sdb.blktrace.bin
btt -i sdb.blktrace.bin -q sdb.q2c_latency
Note that after this step we end up with the following files:
sdb.q2c_latency_8,16_q2c.dat
sys_iops_fp.dat
sys_mbps_fp.dat
8,16_iops_fp.dat
8,17_mbps_fp.dat
Note that when we only trace sdb (blktrace -d sdb), the overall IOPS and MBPS can be read from sys_iops_fp.dat and sys_mbps_fp.dat:
cat sys_iops_fp.dat
0 3453
1 4859
2 7765
3 6807
.......
then plot with Excel or a similar tool

- I/O size distribution
blkparse -i sdb -d sdb.blktrace.bin
btt -i sdb.blktrace.bin -B sdb.offset
This step generates three files:
sdb.offset_8,16_r.dat
sdb.offset_8,16_w.dat
sdb.offset_8,16_c.dat
where r holds the offset/size information of reads, w of writes, and c of reads plus writes.
The output format looks like this:
cat sdb.offset_8,16_w.dat
0.000006500 74196632 74196656
0.000194981 74196656 74196680
0.000423532 21923304 21923336
0.000597505 60868864 60868912
0.001046757 20481496 20481520
The first field is the timestamp, the second the start sector (i.e. the offset), and the third the end sector; the request size is the difference between the two. Units are sectors.
You can then draw the plot with Excel or a similar tool
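A one-liner sketch that turns that file into (time, size-in-bytes) pairs, assuming 512-byte sectors:
# end sector minus start sector = sectors; multiply by 512 for bytes
awk '{ print $1, ($3 - $2) * 512 }' sdb.offset_8,16_w.dat > sdb.wsize.dat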

- Plotting the I/O access pattern
The previous step tells us, for each point in time, which disk location was accessed and how many sectors; ignoring the sector counts, we can draw a 2D plot of the access trajectory:

- A script that summarizes I/O latency and size
It counts reads/writes, the worst latency, the latency distribution, and the block sizes with their counts. Source: the article 剖析生产系统的I/O模式 (profiling the I/O pattern of a production system)
#!/bin/bash
if [ $# -ne 1 ]; then
echo "Usage: $0 <block_device_name>"
exit
fi
if [ ! -b $1 ]; then
echo "could not find block device $1"
exit
fi
duration=10
echo "running blktrace for $duration seconds to collect data..."
timeout $duration blktrace -d $1 >/dev/null 2>&1
DEVNAME=`basename $1`
echo "parsing blktrace data..."
blkparse -i $DEVNAME |sort -g -k8 -k10 -k4 |awk '
BEGIN {
total_read=0;
total_write=0;
maxwait_read=0;
maxwait_write=0;
}
{
if ($6=="Q") {
queue_ts=$4;
block=$8;
nblock=$10;
rw=$7;
};
if ($6=="C" && $8==block && $10==nblock && $7==rw) {
await=$4-queue_ts;
if (rw=="R") {
if (await>maxwait_read) maxwait_read=await;
total_read++;
read_count_block[nblock]++;
if (await>0.001) read_count1++;
if (await>0.01) read_count10++;
if (await>0.02) read_count20++;
if (await>0.03) read_count30++;
}
if (rw=="W") {
if (await>maxwait_write) maxwait_write=await;
total_write++;
write_count_block[nblock]++;
if (await>0.001) write_count1++;
if (await>0.01) write_count10++;
if (await>0.02) write_count20++;
if (await>0.03) write_count30++;
}
}
} END {
printf("========\nsummary:\n========\n");
printf("total number of reads: %d\n", total_read);
printf("total number of writes: %d\n", total_write);
printf("slowest read : %.6f second\n", maxwait_read);
printf("slowest write: %.6f second\n", maxwait_write);
printf("reads\n> 1ms: %d\n>10ms: %d\n>20ms: %d\n>30ms: %d\n", read_count1, read_count10, read_count20, read_count30);
printf("writes\n> 1ms: %d\n>10ms: %d\n>20ms: %d\n>30ms: %d\n", write_count1, write_count10, write_count20, write_count30);
printf("\nblock size:%16s\n","Read Count");
for (i in read_count_block)
printf("%10d:%16d\n", i, read_count_block[i]);
printf("\nblock size:%16s\n","Write Count");
for (i in write_count_block)
printf("%10d:%16d\n", i, write_count_block[i]);
}'
Sample output
========
summary:
========
total number of reads: 1081513
total number of writes: 0
slowest read : 0.032560 second
slowest write: 0.000000 second
reads
> 1ms: 18253
>10ms: 17058
>20ms: 17045
>30ms: 780
writes
> 1ms: 0
>10ms: 0
>20ms: 0
>30ms: 0
block size: Read Count
256: 93756
248: 1538
64: 98084
56: 7475
8: 101218
48: 15889
240: 1637
232: 1651
224: 1942
40: 21693
216: 1811
32: 197893
208: 1907
24: 37787
128: 97382
16: 399850
- Other uses
Beyond the above, btt can of course do much more; see its help:
zwuhongsong@pubbeta1-nova-vhost-node1:~$ sudo btt
Usage: btt
// per-stage latencies
[ -l <output name> | --d2c-latencies=<output name> ]
[ -q <output name> | --q2c-latencies=<output name> ]
[ -z <output name> | --q2d-latencies=<output name> ]
// granularity (in seconds) of the rate data
[ -d <seconds> | --range-delta=<seconds> ]
// dump per-IO starting block numbers
[ -B <output name> | --dump-blocknos=<output name> ]
[ -p <output name> | --per-io-dump=<output name> ]
[ -P <output name> | --per-io-trees=<output name> ]
[ -D <dev;...> | --devices=<dev;...> ]
[ -e <exe,...> | --exes=<exe,...> ]
[ -h | --help ]
[ -i <input name> | --input-file=<input name> ]
[ -I <output name> | --iostat=<output name> ]
[ -L <freq> | --periodic-latencies=<freq> ]
[ -m <output name> | --seeks-per-second=<output name> ]
[ -M <dev map> | --dev-maps=<dev map>
[ -o <output name> | --output-file=<output name> ]
......
The usage above mainly follows the articles “blktrace分析IO” and “Beyond iostat: Storage performance analysis with blktrace”; thanks to their authors.
“使用blktrace排查iowait cpu高的问题” (using blktrace to debug high iowait): blktrace tells you which sectors of which disk are under heavy I/O; with the filesystem block size you can convert those sectors into a filesystem block number; debugfs maps the block number to an inode, and then the inode to a file name; finally lsof tells you which program keeps hammering that file
Analyzing I/O with perf
The block device layer of the Linux 4.6 kernel predefines 19 generic block-layer tracepoints. They can be listed with the following perf command:
hzwuhongsong@pubt1-ceph69:~$ sudo perf list block:*
List of pre-defined events (to be used in -e):
block:block_bio_backmerge [Tracepoint event]
block:block_bio_bounce [Tracepoint event]
block:block_bio_complete [Tracepoint event]
block:block_bio_frontmerge [Tracepoint event]
block:block_bio_queue [Tracepoint event]
block:block_bio_remap [Tracepoint event]
block:block_dirty_buffer [Tracepoint event]
block:block_getrq [Tracepoint event]
block:block_plug [Tracepoint event]
block:block_rq_abort [Tracepoint event]
block:block_rq_complete [Tracepoint event]
block:block_rq_insert [Tracepoint event]
block:block_rq_issue [Tracepoint event]
block:block_rq_remap [Tracepoint event]
block:block_rq_requeue [Tracepoint event]
block:block_sleeprq [Tracepoint event]
block:block_split [Tracepoint event]
block:block_touch_buffer [Tracepoint event]
block:block_unplug [Tracepoint event]
We can use block:block_rq_insert to capture the starting sector and sector count of the I/O requests a process submits to a block device (the original article traced an fio run; the capture below comes from ceph-osd):
sudo perf record -a -g --call-graph dwarf -e block:block_rq_insert sleep 10 (add -p <pid> to restrict to one process)
Because we asked for call stacks, perf script can print the complete call chain from user space down to the kernel's block:block_rq_insert tracepoint, along with the major/minor device numbers, the operation type, and the starting sector and sector count:
hzwuhongsong@pubt1-ceph69:~$ sudo perf script | head -n 20
tp_osd_tp 609749 [040] 14605298.276441: block:block_rq_insert: 8,192 WS 0 () 242453120 + 512 [tp_osd_tp]
4f7809 __elv_add_request+0xaaa000b9 (/usr/lib/debug/lib/modules/4.9.65-netease/vmlinux)
4f7809 __elv_add_request+0xaaa000b9 (/usr/lib/debug/lib/modules/4.9.65-netease/vmlinux)
4fec49 blk_flush_plug_list+0xaaa00209 (/usr/lib/debug/lib/modules/4.9.65-netease/vmlinux)
4ff067 blk_finish_plug+0xaaa00027 (/usr/lib/debug/lib/modules/4.9.65-netease/vmlinux)
454690 do_io_submit+0xaaa00330 (/usr/lib/debug/lib/modules/4.9.65-netease/vmlinux)
80861e system_call_fast_compare_end+0xaaa0000c (/usr/lib/debug/lib/modules/4.9.65-netease/vmlinux)
717 io_submit+0xffff000011580007 (/lib/x86_64-linux-gnu/libaio.so.1.0.1)
a4ff5a aio_queue_t::submit_batch+0xffff5555555580ba (/usr/bin/ceph-osd)
bstore_kv_sync 603916 [010] 14838588.484227: block:block_rq_insert: 8,80 WS 0 () 1125001616 + 16 [bstore_kv_sync]
4f7809 __elv_add_request+0xaaa000b9 (/usr/lib/debug/lib/modules/4.9.65-netease/vmlinux)
4f7809 __elv_add_request+0xaaa000b9 (/usr/lib/debug/lib/modules/4.9.65-netease/vmlinux)
4fec49 blk_flush_plug_list+0xaaa00209 (/usr/lib/debug/lib/modules/4.9.65-netease/vmlinux)
4ff067 blk_finish_plug+0xaaa00027 (/usr/lib/debug/lib/modules/4.9.65-netease/vmlinux)
454690 do_io_submit+0xaaa00330 (/usr/lib/debug/lib/modules/4.9.65-netease/vmlinux)
80861e system_call_fast_compare_end+0xaaa0000c (/usr/lib/debug/lib/modules/4.9.65-netease/vmlinux)
717 io_submit+0xffff000011580007 (/lib/x86_64-linux-gnu/libaio.so.1.0.1)
a4ff5a aio_queue_t::submit_batch+0xffff5555555580ba (/usr/bin/ceph-osd)
With some trivial post-processing we can see the I/O pattern this workload produces at the generic block layer:
hzwuhongsong@pubt1-ceph69:~$ sudo perf script | grep lock_rq_insert | head -n 20
tp_osd_tp 609749 [040] 14605298.276441: block:block_rq_insert: 8,192 WS 0 () 242453120 + 512 [tp_osd_tp]
tp_osd_tp 609749 [040] 14605298.276454: block:block_rq_insert: 8,192 WS 0 () 242453632 + 512 [tp_osd_tp]
tp_osd_tp 609749 [040] 14605298.276461: block:block_rq_insert: 8,192 WS 0 () 242454144 + 512 [tp_osd_tp]
tp_osd_tp 609749 [040] 14605298.276468: block:block_rq_insert: 8,192 WS 0 () 242454656 + 512 [tp_osd_tp]
tp_osd_tp 609749 [040] 14605298.276474: block:block_rq_insert: 8,192 WS 0 () 242455168 + 512 [tp_osd_tp]
tp_osd_tp 609749 [040] 14605298.276481: block:block_rq_insert: 8,192 WS 0 () 242455680 + 512 [tp_osd_tp]
tp_osd_tp 609749 [040] 14605298.276487: block:block_rq_insert: 8,192 WS 0 () 242456192 + 512 [tp_osd_tp]
tp_osd_tp 609749 [040] 14605298.276495: block:block_rq_insert: 8,192 WS 0 () 242456704 + 512 [tp_osd_tp]
bstore_kv_sync 603866 [042] 14605298.277208: block:block_rq_insert: 8,192 WS 0 () 4543200 + 16 [bstore_kv_sync]
tp_osd_tp 605294 [031] 14605298.289926: block:block_rq_insert: 8,96 WS 0 () 2313522944 + 512 [tp_osd_tp]
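A small sketch that aggregates those lines into an I/O-size histogram; the sector count is the second-to-last field (just before the trailing [comm]) in the perf script output above:
sudo perf script | grep block_rq_insert | awk '{ sz[$(NF-1)]++ } END { for (s in sz) printf "%8d sectors: %d\n", s, sz[s] }' | sort -n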
Per-thread I/O size distribution and counts
The stap script:
#!/usr/bin/stap
/*
* bitesize-nd.stp Measure storage (bio) I/O size distribution.
* For Linux, uses SystemTap (non-debuginfo).
*
* USAGE: ./bitesize-nd.stp
*
* This script uses the kernel tracepoint block_rq_insert. The output includes
* the name of the process or thread that was on-CPU when the I/O request was
* inserted on the issue queue.
*
* From systemtap-lwtools: https://github.com/brendangregg/systemtap-lwtools
*
* See the corresponding man page (in systemtap-lwtools) for more info.
*
* Copyright (C) 2015 Brendan Gregg.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* 01-Feb-2015 Brendan Gregg Created this.
*/
global sz;
probe begin
{
printf("Tracing block I/O... Hit Ctrl-C to end.\n");
}
probe kernel.trace("block_rq_insert") {
/*
* You aren't supposed to access __data_len directly as it is internal,
* but I don't see another way...
*/
sz[execname()] <<< $rq->__data_len;
}
probe end
{
printf("\nI/O size (bytes):\n\n");
foreach (name in sz+) {
printf("process name: %s\n", name);
print(@hist_log(sz[name]));
}
delete sz;
}
Running the script on a ceph storage node shows:
hzwuhongsong@pubt1-ceph69:~$ sudo stap 13.stp
process name: rocksdb:bg0
value |-------------------------------------------------- count
1024 | 0
2048 | 0
4096 | 312
8192 | 22
16384 | 6
32768 | 100
65536 | 71
131072 | 129
262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 28946
524288 | 0
1048576 | 0
process name: bstore_kv_final
value |-------------------------------------------------- count
1024 | 0
2048 | 0
4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 44412
8192 |@ 1294
16384 | 0
32768 | 0
process name: bstore_kv_sync
value |-------------------------------------------------- count
1024 | 0
2048 | 0
4096 |@ 1132
8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 46594
16384 | 277
32768 | 197
65536 | 0
131072 | 0
iosnoop/Heatmap
iosnoop reports not only the size of each I/O request at the block device interface, but also the latency from request issue to completion.
#!/bin/bash
#
# iosnoop - trace block device I/O.
# Written using Linux ftrace.
#
# This traces disk I/O at the block device interface, using the block:
# tracepoints. This can help characterize the I/O requested for the storage
# devices and their resulting performance. I/O completions can also be studied
# event-by-event for debugging disk and controller I/O scheduling issues.
#
# USAGE: ./iosnoop [-hQst] [-d device] [-i iotype] [-p pid] [-n name] [duration]
#
# Run "iosnoop -h" for full usage.
#
# REQUIREMENTS: FTRACE CONFIG, block:block_rq_* tracepoints (you may
# already have these on recent kernels).
#
# OVERHEAD: By default, iosnoop works without buffering, printing I/O events
# as they happen (uses trace_pipe), context switching and consuming CPU to do
# so. This has a limit of about 10,000 IOPS (depending on your platform), at
# which point iosnoop will be consuming 1 CPU. The duration mode uses buffering,
# and can handle much higher IOPS rates, however, the buffer has a limit of
# about 50,000 I/O, after which events will be dropped. You can tune this with
# bufsize_kb, which is per-CPU. Also note that the "-n" option is currently
# post-filtered, so all events are traced.
#
# This was written as a proof of concept for ftrace. It would be better written
# using perf_events (after some capabilities are added), which has a better
# buffering policy, or a tracer such as SystemTap or ktap.
#
# From perf-tools: https://github.com/brendangregg/perf-tools
#
# See the iosnoop(8) man page (in perf-tools) for more info.
#
# COPYRIGHT: Copyright (c) 2014 Brendan Gregg.
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
#
# (http://www.gnu.org/copyleft/gpl.html)
#
# 12-Jul-2014 Brendan Gregg Created this.
### default variables
tracing=/sys/kernel/debug/tracing
flock=/var/tmp/.ftrace-lock
bufsize_kb=4096
opt_duration=0; duration=; opt_name=0; name=; opt_pid=0; pid=; ftext=
opt_start=0; opt_end=0; opt_device=0; device=; opt_iotype=0; iotype=
opt_queue=0
trap ':' INT QUIT TERM PIPE HUP # sends execution to end tracing section
function usage {
cat <<-END >&2
USAGE: iosnoop [-hQst] [-d device] [-i iotype] [-p PID] [-n name]
[duration]
-d device # device string (eg, "202,1")
-i iotype # match type (eg, '*R*' for all reads)
-n name # process name to match on I/O issue
-p PID # PID to match on I/O issue
-Q # use queue insert as start time
-s # include start time of I/O (s)
-t # include completion time of I/O (s)
-h # this usage message
duration # duration seconds, and use buffers
eg,
iosnoop # watch block I/O live (unbuffered)
iosnoop 1 # trace 1 sec (buffered)
iosnoop -Q # include queueing time in LATms
iosnoop -ts # include start and end timestamps
iosnoop -i '*R*' # trace reads
iosnoop -p 91 # show I/O issued when PID 91 is on-CPU
iosnoop -Qp 91 # show I/O queued by PID 91, queue time
See the man page and example file for more info.
END
exit
}
function warn {
if ! eval "$@"; then
echo >&2 "WARNING: command failed \"$@\""
fi
}
function end {
# disable tracing
echo 2>/dev/null
echo "Ending tracing..." 2>/dev/null
cd $tracing
warn "echo 0 > events/block/$b_start/enable"
warn "echo 0 > events/block/block_rq_complete/enable"
if (( opt_device || opt_iotype || opt_pid )); then
warn "echo 0 > events/block/$b_start/filter"
warn "echo 0 > events/block/block_rq_complete/filter"
fi
warn "echo > trace"
(( wroteflock )) && warn "rm $flock"
}
function die {
echo >&2 "$@"
exit 1
}
function edie {
# die with a quiet end()
echo >&2 "$@"
exec >/dev/null 2>&1
end
exit 1
}
### process options
while getopts d:hi:n:p:Qst opt
do
case $opt in
d) opt_device=1; device=$OPTARG ;;
i) opt_iotype=1; iotype=$OPTARG ;;
n) opt_name=1; name=$OPTARG ;;
p) opt_pid=1; pid=$OPTARG ;;
Q) opt_queue=1 ;;
s) opt_start=1 ;;
t) opt_end=1 ;;
h|?) usage ;;
esac
done
shift $(( $OPTIND - 1 ))
if (( $# )); then
opt_duration=1
duration=$1
shift
fi
if (( opt_device )); then
major=${device%,*}
minor=${device#*,}
dev=$(( (major << 20) + minor ))
fi
### option logic
(( opt_pid && opt_name )) && die "ERROR: use either -p or -n."
(( opt_pid )) && ftext=" issued by PID $pid"
(( opt_name )) && ftext=" issued by process name \"$name\""
if (( opt_duration )); then
echo "Tracing block I/O$ftext for $duration seconds (buffered)..."
else
echo "Tracing block I/O$ftext. Ctrl-C to end."
fi
if (( opt_queue )); then
b_start=block_rq_insert
else
b_start=block_rq_issue
fi
### select awk
(( opt_duration )) && use=mawk || use=gawk # workaround for mawk fflush()
[[ -x /usr/bin/$use ]] && awk=$use || awk=awk
wroteflock=1
### check permissions
cd $tracing || die "ERROR: accessing tracing. Root user? Kernel has FTRACE?
debugfs mounted? (mount -t debugfs debugfs /sys/kernel/debug)"
### ftrace lock
[[ -e $flock ]] && die "ERROR: ftrace may be in use by PID $(cat $flock) $flock"
echo $$ > $flock || die "ERROR: unable to write $flock."
### setup and begin tracing
echo nop > current_tracer
warn "echo $bufsize_kb > buffer_size_kb"
filter=
if (( opt_iotype )); then
filter="rwbs ~ \"$iotype\""
fi
if (( opt_device )); then
[[ "$filter" != "" ]] && filter="$filter && "
filter="${filter}dev == $dev"
fi
filter_i=$filter
if (( opt_pid )); then
[[ "$filter_i" != "" ]] && filter_i="$filter_i && "
filter_i="${filter_i}common_pid == $pid"
[[ "$filter" == "" ]] && filter=0
fi
if (( opt_iotype || opt_device || opt_pid )); then
if ! echo "$filter_i" > events/block/$b_start/filter || \
! echo "$filter" > events/block/block_rq_complete/filter
then
edie "ERROR: setting -d or -t filter. Exiting."
fi
fi
if ! echo 1 > events/block/$b_start/enable || \
! echo 1 > events/block/block_rq_complete/enable; then
edie "ERROR: enabling block I/O tracepoints. Exiting."
fi
(( opt_start )) && printf "%-15s " "STARTs"
(( opt_end )) && printf "%-15s " "ENDs"
printf "%-12.12s %-6s %-4s %-8s %-12s %-6s %8s\n" \
"COMM" "PID" "TYPE" "DEV" "BLOCK" "BYTES" "LATms"
#
# Determine output format. It may be one of the following (newest first):
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# TASK-PID CPU# TIMESTAMP FUNCTION
# To differentiate between them, the number of header fields is counted,
# and an offset set, to skip the extra column when needed.
#
offset=$($awk 'BEGIN { o = 0; }
$1 == "#" && $2 ~ /TASK/ && NF == 6 { o = 1; }
$2 ~ /TASK/ { print o; exit }' trace)
### print trace buffer
warn "echo > trace"
( if (( opt_duration )); then
# wait then dump buffer
sleep $duration
cat trace
else
# print buffer live
cat trace_pipe
fi ) | $awk -v o=$offset -v opt_name=$opt_name -v name=$name \
-v opt_duration=$opt_duration -v opt_start=$opt_start -v opt_end=$opt_end \
-v b_start=$b_start '
# common fields
$1 != "#" {
# task name can contain dashes
comm = pid = $1
sub(/-[0-9][0-9]*/, "", comm)
sub(/.*-/, "", pid)
time = $(3+o); sub(":", "", time)
dev = $(5+o)
}
# block I/O request
$1 != "#" && $0 ~ b_start {
if (opt_name && match(comm, name) == 0)
next
#
# example: (fields1..4+o) 202,1 W 0 () 12862264 + 8 [tar]
# The cmd field "()" might contain multiple words (hex),
# hence stepping from the right (NF-3).
#
loc = $(NF-3)
starts[dev, loc] = time
comms[dev, loc] = comm
pids[dev, loc] = pid
next
}
# block I/O completion
$1 != "#" && $0 ~ /rq_complete/ {
#
# example: (fields1..4+o) 202,1 W () 12862256 + 8 [0]
#
dir = $(6+o)
loc = $(NF-3)
nsec = $(NF-1)
if (starts[dev, loc] > 0) {
latency = sprintf("%.2f",
1000 * (time - starts[dev, loc]))
comm = comms[dev, loc]
pid = pids[dev, loc]
if (opt_start)
printf "%-15s ", starts[dev, loc]
if (opt_end)
printf "%-15s ", time
printf "%-12.12s %-6s %-4s %-8s %-12s %-6s %8s\n",
comm, pid, dir, dev, loc, nsec * 512, latency
if (!opt_duration)
fflush()
delete starts[dev, loc]
delete comms[dev, loc]
delete pids[dev, loc]
}
next
}
$0 ~ /LOST.*EVENTS/ { print "WARNING: " $0 > "/dev/stderr" }
'
### end tracing
end
http://oliveryang.net/2016/08/linux-block-driver-basic-4/
http://www.brendangregg.com/blog/2014-07-16/iosnoop-for-linux.html
iosnoop produces a huge amount of output even over a short period, and per-request latencies can differ wildly, so is there a better way to present the latency data from an fio test? A heat map is exactly such a tool (script at https://github.com/brendangregg/HeatMap); combined with iosnoop it is used as follows:
$ sudo ./iosnoop -d 253,1 -s -t > iosnoop.log
$ grep '^[0-9]' iosnoop.log | awk '{ print $1, $9 }' | sed 's/\.//g' | sed 's/$/0/g' > trace.txt
$ ./trace2heatmap.pl --unitstime=us --unitslatency=us --maxlat=200 --grid trace.txt> heatmap.svg
Network tools
iftop
sudo iftop -i eth0

Just as iotop shows per-process I/O, iftop shows the traffic exchanged with every host this machine is currently talking to
Top line: the bandwidth scale
Middle: the list of external connections, i.e. which IPs are currently connected to this host
Right-hand columns of the middle section: the 2-, 10- and 40-second average traffic of each connection
=> marks transmitted data, <= marks received data
- Limitation
Like iptraf, it only tracks a single NIC; without -i it defaults to the first one
Reference: 从零开始学习iftop流量监控 (finding the IPs and ports that consume the most server traffic)
nethogs
Like top and iotop, nethogs shows real-time per-process network bandwidth usage
ifstat
hzwuhongsong@pubt1-ceph2:~$ sudo ifstat
eth0 eth0.100 eth0.101 eth0.102 eth0.104
KB/s in KB/s out KB/s in KB/s out KB/s in KB/s out KB/s in KB/s out KB/s in KB/s out
5731.76 10562.41 4540.13 1226.96 0.09 0.00 0.40 0.00 915.29 8978.31
2248.33 3823.92 925.83 2011.85 0.09 0.00 0.63 0.00 1185.01 1707.78
It shows the inbound and outbound traffic of every interface
netstat / ss
ss is dedicated to the socket layer, whereas netstat spans all the network layers
ss (Socket Statistics) reports socket statistics. Its output resembles netstat's, but it shows more and richer TCP connection state, and it is faster and more efficient: it uses tcp_diag (a kernel module for statistics and analysis) in the TCP stack to pull first-hand information straight from the kernel. Even without tcp_diag, ss still works.
When a server holds a very large number of sockets, both netstat and a plain cat /proc/net/tcp become painfully slow. You may not feel it day to day, but once a server maintains tens of thousands of connections, running netstat is a waste of your life; ss is the time-saver.
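Handy invocations:
# overall socket summary
ss -s
# established TCP connections, numeric addresses, with the owning process
ss -tnp state established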
tcpdump + wireshark
ip
A replacement for ifconfig.
Its functions include:
- showing per-interface statistics, e.g. packets received/sent, drops and packet errors
root@pubt1-ceph2:/home/hzwuhongsong# ip -s link
2: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 24:6e:96:13:cb:9c brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped overrun mcast
0 0 0 0 0 0
TX: bytes packets errors dropped carrier collsns
0 0 0 0 0 0
- viewing, adding and deleting IP addresses;
- configuring network interfaces
1) Bring a port up or down
Use ip link set with up or down to (de)activate a specific interface
# take interface eth0 down
ip link set eth0 down
# bring interface eth0 up
ip link set eth0 up
2) Change the transmit queue length
ip link set dev eth0 txqueuelen 100
or
ip link set dev eth0 txqlen 100
3) Change the MTU (maximum transmission unit)
ip link set dev eth0 mtu 1500
4) Change the NIC's MAC address
ip link set dev eth0 address 00:01:4f:00:15:f1
- viewing, adding, changing and deleting routes, and setting routing policy
For example, looking up the route towards baidu:
root@pubt1-ceph2:/home/hzwuhongsong# ping www.baidu.com
PING www.a.shifen.com (180.101.49.11) 56(84) bytes of data.
64 bytes from 180.101.49.11: icmp_seq=1 ttl=52 time=11.4 ms
64 bytes from 180.101.49.11: icmp_seq=2 ttl=52 time=11.5 ms
^V64 bytes from 180.101.49.11: icmp_seq=3 ttl=52 time=11.5 ms
64 bytes from 180.101.49.11: icmp_seq=4 ttl=52 time=11.4 ms
v^C
--- www.a.shifen.com ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 11.425/11.495/11.542/0.138 ms
root@pubt1-ceph2:/home/hzwuhongsong# ip route get 180.101.49.11
180.101.49.11 via 10.185.0.254 dev eth0.100 src 10.185.0.101
cache
root@pubt1-ceph2:/home/hzwuhongsong#
- viewing ARP entries
The Address Resolution Protocol (ARP) translates an IP address into its physical (MAC) address. With ip neigh (or neighbour) you can list the MAC addresses of the devices on your LAN.
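For example:
ip neigh show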
ping
Checks network latency and packet loss
iperf
iperf measures the attainable network bandwidth between servers, and whether that bandwidth is stable
Install: apt-get install iperf
On host 16, start the server side: iperf -s
From another machine, push traffic towards 16: iperf -c 10.192.132.19 -t 10
Host 16 then shows the per-second bandwidth
tc
tc is a network emulation tool: it can simulate latency, packet loss, duplication, reordering, corruption, and more
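A netem sketch (the interface name is illustrative; remember to remove the qdisc when done):
# add 100ms of delay and 1% packet loss on egress of eth0
tc qdisc add dev eth0 root netem delay 100ms loss 1%
# remove it again
tc qdisc del dev eth0 root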
iptables
Firewall-related commands
Tools for handling object files
From Computer Systems: A Programmer's Perspective (深入理解计算机系统), p. 473
ldd
Shows the shared libraries a program needs to run; commonly used to diagnose programs that fail to start because some library is missing.
/opt/app/todeav1/test$ldd test
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00000039a7e00000)
libm.so.6 => /lib64/libm.so.6 (0x0000003996400000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000039a5600000)
libc.so.6 => /lib64/libc.so.6 (0x0000003995800000)
/lib64/ld-linux-x86-64.so.2 (0x0000003995400000)
Column 1: the library the program depends on
Column 2: the library the system resolved that dependency to
Column 3: the address the library is loaded at
nm
Lists the symbols defined in an object file's symbol table
size
Lists the names and sizes of the sections in an object file
readelf
Displays the complete structure of an object file, including all the information encoded in the ELF header; it subsumes the functionality of the nm and size commands above.
objdump
The mother of all binary tools: it can display every piece of information in an object file; its most useful trick is disassembling the binary code in the .text section
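For example:
# disassemble the .text section of a binary
objdump -d /bin/ls | less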
Hardware
lscpu/lspci/lsscsi
Kernel analysis
sysrq
ftrace
https://www.ibm.com/developerworks/cn/linux/l-cn-ftrace1/index.html
https://zhuanlan.zhihu.com/p/33267453
systemtap
perf
debugfs
blktrace (or perf) tells you which sectors of which disk are under heavy I/O; with the filesystem block size you can convert those sectors into a filesystem block number; debugfs maps the block number to an inode, and then the inode to a file name; finally lsof tells you which program keeps hammering that file
- find the inode owning a filesystem block
root@pubt1-ceph72:/# debugfs -R 'icheck 333' /dev/sda2
debugfs 1.43.4 (31-Jan-2017)
Block Inode number
333 7
- find the file for an inode
root@pubt1-ceph72:/# ls -alhi
16 -rw-r--r-- 1 root root 736K May 29 10:20 MegaSAS.log
36569089 drwxr-xr-x 12 root root 4.0K May 21 10:44 mnt
root@pubt1-ceph72:/# debugfs -R 'ncheck 36569089' /dev/sda2
debugfs 1.43.4 (31-Jan-2017)
Inode Pathname
36569089 //mnt
root@pubt1-ceph72:/# debugfs -R 'ncheck 16' /dev/sda2
debugfs 1.43.4 (31-Jan-2017)
Inode Pathname
16 //MegaSAS.log
- inspect a file's on-disk layout
sudo debugfs -R "htree 2/cache/cachewrite" /dev/sdb
Root node dump:
Reserved zero: 0
Hash Version: 1
Info length: 8
Indirect levels: 1
Flags: 0
// the first-level hash block can hold 507 entries and currently holds 507
Number of entries (count): 507
Number of entries (limit): 507
Checksum: 0xd25a9670
Entry #0: Hash 0x00000000, block 508 (the 508th block)
Entry #1: Hash 0x00e561cc, block 74993
......
// the first block at the second hash level can hold 510 entries and currently holds 510
Entry #0: Hash 0x00000000, block 508
Number of entries (count): 510
Number of entries (limit): 510
Checksum: 0xbd6a7276
Entry #0: Hash 0x00000000, block 1
Entry #1: Hash 0x00009a0e, block 76528 (the 76528th block)
Entry #2: Hash 0x00012814, block 41037
Entry #3: Hash 0x0001c75e, block 77337
Entry #4: Hash 0x00021116, block 146121
......
// the leaf block stores the file names and related entries
Entry #1: Hash 0x00009a0e, block 76528
Reading directory block 76528, phys 422172612
105621626 0x0000f33c-1bdf30c6 (128) 2_45021452_45792853_0_0
112792663 0x0000fcac-ced840e8 (60) whs_file01 // file name
113065652 0x000122e6-ef3a9030 (60) whs_file02
leaf block checksum: 0xdf72790d // checksum of the leaf block
A Minimum Complete Tutorial of Linux ext4 File System
The killer tool: eBPF
Libbpf-tools —— 让 Tracing 工具身轻如燕 (Libbpf-tools: making tracing tools featherweight)
Miscellaneous
lsof
lsof (list open files) inspects the files currently open on the system. On Linux everything exists as a file: not just regular data, but also network connections and hardware. For Transmission Control Protocol (TCP) and User Datagram Protocol (UDP) sockets and the like, the system assigns the application a file descriptor behind the scenes, and that descriptor exposes a wealth of information about the application itself.
pstack
Prints a process's call stacks
swapon/swapoff
Enables and disables swap partitions
