BPF Performance Tools: Linux 60-Second Analysis

Linux 60-Second Analysis

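Tools and metrics can be aimed at the low-hanging performance problems: list a dozen or so common issues together with the analysis method for each, so that anyone can work through the checklist. This article is a translated excerpt of material published by Brendan Gregg and the Netflix performance engineering team.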

uptime


$ uptime
20:08:53 up 50 min,  2 users,  load average: 0.00, 0.01, 0.05

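A quick check of the load averages, which indicate how many tasks (processes) want to run right now.
The three numbers are exponentially damped moving sums with 1-minute, 5-minute, and 15-minute constants, giving a rough picture of how load changes over time.
Load averages are checked first during troubleshooting to confirm whether the performance problem is still present.
A high 15-minute load alongside a low 1-minute load may mean you have already missed the problem.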
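The comparison of the 1-minute and 15-minute values can be sketched in a few lines of Python. This is a minimal illustration, not part of the original checklist: the function names and the `tolerance` margin are assumptions chosen for the example.

```python
import re

def parse_load_averages(uptime_line):
    """Return the (1-min, 5-min, 15-min) load averages from one uptime line."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load average found in: %r" % uptime_line)
    return tuple(float(value) for value in match.groups())

def load_trend(one, five, fifteen, tolerance=0.1):
    """Classify the trend; `tolerance` is an arbitrary illustrative margin."""
    if one > fifteen + tolerance:
        return "rising"    # load is building up right now
    if one < fifteen - tolerance:
        return "falling"   # the problem may already be over
    return "steady"

sample = "20:08:53 up 50 min,  2 users,  load average: 0.00, 0.01, 0.05"
loads = parse_load_averages(sample)
print(loads, load_trend(*loads))
```

A "falling" result corresponds to the case described above, where the high-load period has probably passed.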

dmesg|tail

$ dmesg|tail
[   14.636073] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
[   16.323147] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[   18.243527] sda1: WRITE SAME failed. Manually zeroing.
[   20.586194] init: plymouth-upstart-bridge main process ended, respawning
[   21.207826] cgroup: systemd-logind (852) created nested cgroup for controller "memory" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.
[   21.207828] cgroup: "memory" requires setting use_hierarchy to 1 on the root.
[  314.984616] audit_printk_skb: 171 callbacks suppressed
[  314.984619] type=1400 audit(1608549828.128:69): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/cups/backend/cups-pdf" pid=3267 comm="apparmor_parser"
[  314.984623] type=1400 audit(1608549828.128:70): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/sbin/cupsd" pid=3267 comm="apparmor_parser"
[  314.984944] type=1400 audit(1608549828.128:71): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/sbin/cupsd" pid=3267 comm="apparmor_parser"

Shows the last 10 system messages, if there are any.
Look for errors that could cause performance problems.
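Scanning the messages for likely trouble can be automated with a simple keyword filter. The sketch below is only an illustration: the keyword list is an assumption, not an official set, and plain substring matching will produce false positives (for example "oom" also matches "room").

```python
# Hypothetical keyword list; adjust it for the systems you look after.
SUSPECT_KEYWORDS = ("error", "fail", "oom", "out of memory", "dropping", "timeout")

def suspicious_lines(log_lines, keywords=SUSPECT_KEYWORDS):
    """Return the log lines that contain any keyword, case-insensitively."""
    hits = []
    for line in log_lines:
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append(line)
    return hits

sample = [
    "[   16.323147] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None",
    "[   18.243527] sda1: WRITE SAME failed. Manually zeroing.",
    "[  314.984616] audit_printk_skb: 171 callbacks suppressed",
]
for line in suspicious_lines(sample):
    print(line)
```

In practice the input would be the captured output of `dmesg | tail`, split into lines.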

vmstat 1

# vmstat N: N is the interval, in seconds, between printed reports
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 1163168  30152 317236    0    0   494    24  244  430  3  3 80 14  0
 0  0      0 1163040  30152 317236    0    0     0     0  321  717  1  0 99  0  0
 0  0      0 1162948  30160 317228    0    0     0   580  377  882  1  0 99  0  0

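Columns to check closely:

  • r: the number of processes running on CPU plus those waiting for a turn. An r value greater than the CPU count means CPU resources are saturated.
  • free: free memory, in KB.
  • si and so: pages swapped in and swapped out. Nonzero values mean the system is short of memory.
  • us, sy, id, wa, and st: a breakdown of CPU time, averaged across all CPUs. They are user time, system (kernel) time, idle, wait I/O, and stolen time (in virtualized environments, time taken by other guests).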
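The checks in this list can be expressed as a small script. The following is a sketch under stated assumptions: the function names and warning strings are made up for the example, and `cpu_count` must be supplied by the caller (8 here, matching the mpstat banner shown later in this article).

```python
def parse_vmstat(header_line, data_line):
    """Map vmstat column names (r, b, swpd, ...) to the integers on one data line."""
    names = header_line.split()
    values = [int(value) for value in data_line.split()]
    return dict(zip(names, values))

def vmstat_warnings(row, cpu_count):
    """Apply the r and si/so checks described above."""
    warnings = []
    if row["r"] > cpu_count:
        warnings.append("CPU saturation: r exceeds the CPU count")
    if row["si"] > 0 or row["so"] > 0:
        warnings.append("swapping: the system is short of memory")
    return warnings

header = " r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st"
data = " 0  0      0 1163168  30152 317236    0    0   494    24  244  430  3  3 80 14  0"
row = parse_vmstat(header, data)
print(vmstat_warnings(row, cpu_count=8))
```

For the sample line, no warning is raised: r is 0 and both swap columns are 0.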

mpstat -P ALL 1


$ mpstat -P ALL 1
Linux 3.13.0-32-generic (ubuntu) 	12/21/2020 	_x86_64_	(8 CPU)

07:33:19 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
07:33:20 PM  all    0.63    0.00    0.13    0.00    0.00    0.00    0.00    0.00    0.00   99.25
07:33:20 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
07:33:20 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
07:33:20 PM    2    0.99    0.00    0.99    0.00    0.00    0.00    0.00    0.00    0.00   98.02
07:33:20 PM    3    0.99    0.00    0.99    0.00    0.00    0.00    0.00    0.00    0.00   98.02
07:33:20 PM    4    0.99    0.00    0.99    0.00    0.00    0.00    0.00    0.00    0.00   98.02
07:33:20 PM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
07:33:20 PM    6    1.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.00
07:33:20 PM    7    0.99    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.01

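Prints the time each CPU spends in each state.
%usr: user-mode time
%sys: kernel-mode time; investigate it with system call tracing and kernel tracing
%iowait: time spent waiting on disk I/O; iostat shows detailed information about the storage devices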
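To spot an unbalanced CPU in this per-CPU output, the rows can be parsed and sorted by idle time. This is a minimal sketch; the helper names are assumptions, and the parser locates the `CPU` column by name so that it also works when the timestamp has no AM/PM field.

```python
def parse_mpstat(header_line, data_lines):
    """Return one dict per row, keyed by the column names after the timestamp."""
    columns = header_line.split()
    start = columns.index("CPU")          # skip the timestamp fields
    rows = []
    for line in data_lines:
        fields = line.split()[start:]
        row = {"CPU": fields[0]}
        for name, value in zip(columns[start + 1:], fields[1:]):
            row[name] = float(value)
        rows.append(row)
    return rows

def busiest_cpu(rows):
    """Return the per-CPU row with the lowest %idle (the 'all' row is ignored)."""
    per_cpu = [row for row in rows if row["CPU"] != "all"]
    return min(per_cpu, key=lambda row: row["%idle"])

header = "07:33:19 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle"
lines = [
    "07:33:20 PM  all    0.63    0.00    0.13    0.00    0.00    0.00    0.00    0.00    0.00   99.25",
    "07:33:20 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00",
    "07:33:20 PM    2    0.99    0.00    0.99    0.00    0.00    0.00    0.00    0.00    0.00   98.02",
    "07:33:20 PM    6    1.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.00",
]
rows = parse_mpstat(header, lines)
print(busiest_cpu(rows)["CPU"])
```

A single CPU that is far busier than the rest is worth a closer look with pidstat, which is covered next.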

pidstat 1

$ pidstat 1
Linux 3.13.0-32-generic (ubuntu) 	12/21/2020 	_x86_64_	(8 CPU)

07:39:10 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:39:11 PM  1000      2669    0.94    0.00    0.00    0.94     6  compiz
07:39:11 PM  1000      4153    0.94    0.94    0.00    1.89     1  pidstat

07:39:11 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:39:12 PM     0      1473    0.99    0.99    0.00    1.98     3  Xorg
07:39:12 PM  1000      2669    6.93    1.98    0.00    8.91     1  compiz
07:39:12 PM  1000      2816    0.99    0.00    0.00    0.99     0  gnome-terminal
07:39:12 PM  1000      4153    0.99    1.98    0.00    2.97     1  pidstat

Shows CPU usage for each process.

iostat -xz 1


$ iostat -xz 1
Linux 3.13.0-32-generic (ubuntu) 	12/21/2020 	_x86_64_	(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.45    0.08    0.38    1.03    0.00   98.06

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.76     6.20   25.18    2.26   352.72    80.87    31.60     0.35   12.85   12.25   19.50   1.75   4.81

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.63    0.00    0.13    0.00    0.00   99.25

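Shows I/O metrics for storage devices.
Columns to check closely:

  • r/s, w/s, rkB/s, and wkB/s: reads and writes per second sent to the device, and kilobytes read and written per second. Use these to characterize the workload (some performance problems are simply the result of applying more load than the device can handle).
  • await: the average I/O response time, in milliseconds. A response time longer than expected can indicate device saturation or a problem at the device level.
  • avgqu-sz: the average length of the device request queue.
  • %util: device utilization.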
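The workload characterization described in this list can be derived from a single device line. The sketch below is illustrative only: the function names are invented, and the two thresholds (queue length above 1, utilization above 60%) are rule-of-thumb assumptions rather than hard limits.

```python
def parse_iostat_device(header_line, device_line):
    """Return (device_name, stats) for one line of `iostat -x` device output."""
    names = header_line.split()[1:]       # drop the leading "Device:" label
    fields = device_line.split()
    stats = dict(zip(names, (float(value) for value in fields[1:])))
    return fields[0], stats

def summarize(stats):
    """Derive workload figures; the thresholds are rule-of-thumb assumptions."""
    iops = stats["r/s"] + stats["w/s"]
    throughput_kb = stats["rkB/s"] + stats["wkB/s"]
    return {
        "iops": iops,
        "throughput_kB_s": throughput_kb,
        "avg_io_kB": throughput_kb / iops if iops else 0.0,
        "queue_saturated": stats["avgqu-sz"] > 1.0,
        "busy": stats["%util"] > 60.0,
    }

header = "Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util"
line = "sda               0.76     6.20   25.18    2.26   352.72    80.87    31.60     0.35   12.85   12.25   19.50   1.75   4.81"
name, stats = parse_iostat_device(header, line)
summary = summarize(stats)
print(name, summary)
```

Newer sysstat releases rename some of these columns, so the dictionary keys would need adjusting there.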

free -m


$ free -m
           total       used       free     shared    buffers     cached
Mem:        1987       1158        829          5        119        443
-/+ buffers/cache:        595       1391
Swap:         1021          0       1021

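The output shows available memory in MB. Check that available memory is not close to zero. In this older output format, read the free column of the "-/+ buffers/cache" row (1391 here), because buffers and page cache can be reclaimed when applications need the memory.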
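How the usable figure is obtained from this output can be shown with a short calculation. This is a sketch, assuming the two common layouts of `free`: the older one shown here (separate `buffers` and `cached` columns) and the newer procps one that prints an `available` column directly. The function name is made up for the example.

```python
def available_mb(free_output):
    """Estimate the memory usable by applications from `free -m` output."""
    lines = free_output.strip().splitlines()
    names = lines[0].split()
    mem_line = next(line for line in lines if line.startswith("Mem:"))
    values = [int(value) for value in mem_line.split()[1:]]
    row = dict(zip(names, values))
    if "available" in row:                # newer layout reports it directly
        return row["available"]
    # Older layout: buffers and page cache are reclaimable, so add them to free.
    return row["free"] + row["buffers"] + row["cached"]

sample = """
           total       used       free     shared    buffers     cached
Mem:        1987       1158        829          5        119        443
-/+ buffers/cache:        595       1391
Swap:         1021          0       1021
"""
print(available_mb(sample))
```

For the sample, 829 + 119 + 443 gives 1391, the same value as the free column of the "-/+ buffers/cache" row.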

sar -n DEV 1


$ sar -n DEV 1
Linux 3.13.0-32-generic (ubuntu) 	12/21/2020 	_x86_64_	(8 CPU)

08:01:27 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
08:01:28 PM      eth0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
08:01:28 PM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
08:01:28 PM   docker0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

Shows network device metrics.

sar -n TCP,ETCP 1


$ sar -n TCP,ETCP 1
Linux 3.13.0-32-generic (ubuntu) 	12/21/2020 	_x86_64_	(8 CPU)

08:03:08 PM  active/s passive/s    iseg/s    oseg/s
08:03:09 PM      0.00      0.00      0.00      0.00

08:03:08 PM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
08:03:09 PM      0.00      0.00      0.00      0.00      0.00

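  • active/s: the number of locally initiated TCP connections per second (connect())
  • passive/s: the number of remotely initiated TCP connections per second (accept())
  • retrans/s: the number of TCP retransmits per second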
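Retransmits are easier to judge relative to the outgoing traffic, so one option is to divide retrans/s by oseg/s. The sketch below is an illustration, not something the original article prescribes: the function names are assumptions, and the parser simply keeps the columns whose names end in "/s".

```python
def parse_sar_row(header_line, data_line):
    """Map the '/s' column names of a sar header to the values on a data line."""
    names = [column for column in header_line.split() if column.endswith("/s")]
    values = data_line.split()[-len(names):]   # skip the timestamp fields
    return dict(zip(names, (float(value) for value in values)))

def retransmit_ratio(retrans_per_s, oseg_per_s):
    """Fraction of sent segments that were retransmits (0.0 when nothing was sent)."""
    if oseg_per_s == 0:
        return 0.0
    return retrans_per_s / oseg_per_s

tcp = parse_sar_row(
    "08:03:08 PM  active/s passive/s    iseg/s    oseg/s",
    "08:03:09 PM      0.00      0.00      0.00      0.00",
)
etcp = parse_sar_row(
    "08:03:08 PM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s",
    "08:03:09 PM      0.00      0.00      0.00      0.00      0.00",
)
print(retransmit_ratio(etcp["retrans/s"], tcp["oseg/s"]))
```

The idle sample system sends no segments, so the ratio is reported as 0.0 rather than dividing by zero.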

top


$ top
top - 20:06:48 up 48 min,  2 users,  load average: 0.00, 0.01, 0.05
Tasks: 479 total,   1 running, 478 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.5 us,  0.3 sy,  0.0 ni, 98.6 id,  0.6 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   2035256 total,  1188352 used,   846904 free,   122112 buffers
KiB Swap:  1046524 total,        0 used,  1046524 free.   454056 cached Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                  
  2669 cyrench+  20   0 1475872 163472  39464 S  24.2  8.0   1:39.61 compiz                   
  1473 root      20   0  337108  43888  13436 S   6.1  2.2   0:22.28 Xorg                     
     1 root      20   0   33908   3212   1456 S   0.0  0.2   0:02.13 init                     
     2 root      20   0       0      0      0 S   0.0  0.0   0:00.05 kthreadd                 
     3 root      20   0       0      0      0 S   0.0  0.0   0:00.01 ksoftirqd/0              
     5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H             
     7 root      20   0       0      0      0 S   0.0  0.0   0:01.09 rcu_sched                
     8 root      20   0       0      0      0 S   0.0  0.0   0:00.58 rcuos/0                  
     9 root      20   0       0      0      0 S   0.0  0.0   0:00.62 rcuos/1                  
    10 root      20   0       0      0      0 S   0.0  0.0   0:00.32 rcuos/2                  
    11 root      20   0       0      0      0 S   0.0  0.0   0:00.33 rcuos/3                  
    12 root      20   0       0      0      0 S   0.0  0.0   0:00.45 rcuos/4                  
    13 root      20   0       0      0      0 S   0.0  0.0   0:00.37 rcuos/5                  
    14 root      20   0       0      0      0 S   0.0  0.0   0:00.27 rcuos/6                  
    15 root      20   0       0      0      0 S   0.0  0.0   0:00.22 rcuos/7

