[Resource Monitoring] How to tell when a Linux system has run out of CPU, memory, I/O, or network resources (work in progress)

mathemagics

Last modified 2024-04-14 21:02:26

Tags: linux, operations, servers

First published 2024-04-14 20:58:48

Permalink: https://blog.csdn.net/mathemagics/article/details/137754436

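TODO:

Finish running the experiments
Go through everything again against the interview questions
Go through my performance-tuning notes
Revise once more while reviewing OS fundamentals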

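The sar tool can monitor CPU, memory, I/O, disk, and network resources.

It ships in the sysstat package, which has to be installed first (on Ubuntu, for example, via apt install sysstat), and data collection has to be enabled:

# vim /etc/default/sysstat    (set the following line)
ENABLED="true"
# sudo service sysstat restart    (restart the service to apply)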
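
Before relying on historical data, it can help to confirm that collection really is switched on. The following is a minimal sketch, not part of sysstat: the function name sysstat_enabled is made up for illustration, and the default path assumes the Debian/Ubuntu layout used above.

```shell
# Hypothetical helper (not shipped with sysstat): succeeds if the given
# defaults file contains an uncommented ENABLED="true" line.
# Assumption: Debian/Ubuntu layout, where the file is /etc/default/sysstat.
sysstat_enabled() {
    grep -Eq '^[[:space:]]*ENABLED="?true"?[[:space:]]*$' \
        "${1:-/etc/default/sysstat}" 2>/dev/null
}
```

Usage: `sysstat_enabled && echo "collection enabled" || echo "collection disabled"`.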

Usage

sar -u 1 5 # sample CPU usage every 1 s, 5 times, then print the average
sar -u # show the historical data collected so far today

root@node01:~# sar --help
Usage: sar [ options ] [ <interval> [ <count> ] ]
Main options and reports (report name between square brackets):
        -B      Paging statistics [A_PAGE]
        -b      I/O and transfer rate statistics [A_IO]
        -d      Block devices statistics [A_DISK]
        -F [ MOUNT ]
                Filesystems statistics [A_FS]
        -H      Hugepages utilization statistics [A_HUGE]
        -I { <int_list> | SUM | ALL }
                Interrupts statistics [A_IRQ]
        -m { <keyword> [,...] | ALL }
                Power management statistics [A_PWR_...]
                Keywords are:
                CPU     CPU instantaneous clock frequency
                FAN     Fans speed
                FREQ    CPU average clock frequency
                IN      Voltage inputs
                TEMP    Devices temperature
                USB     USB devices plugged into the system
        -n { <keyword> [,...] | ALL }
                Network statistics [A_NET_...]
                Keywords are:
                DEV     Network interfaces
                EDEV    Network interfaces (errors)
                NFS     NFS client
                NFSD    NFS server
                SOCK    Sockets (v4)
                IP      IP traffic      (v4)
                EIP     IP traffic      (v4) (errors)
                ICMP    ICMP traffic    (v4)
                EICMP   ICMP traffic    (v4) (errors)
                TCP     TCP traffic     (v4)
                ETCP    TCP traffic     (v4) (errors)
                UDP     UDP traffic     (v4)
                SOCK6   Sockets (v6)
                IP6     IP traffic      (v6)
                EIP6    IP traffic      (v6) (errors)
                ICMP6   ICMP traffic    (v6)
                EICMP6  ICMP traffic    (v6) (errors)
                UDP6    UDP traffic     (v6)
                FC      Fibre channel HBAs
                SOFT    Software-based network processing
        -q      Queue length and load average statistics [A_QUEUE]
        -r [ ALL ]
                Memory utilization statistics [A_MEMORY]
        -S      Swap space utilization statistics [A_MEMORY]
        -u [ ALL ]
                CPU utilization statistics [A_CPU]
        -v      Kernel tables statistics [A_KTABLES]
        -W      Swapping statistics [A_SWAP]
        -w      Task creation and system switching statistics [A_PCSW]
        -y      TTY devices statistics [A_SERIAL]

CPU

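Watch whether %idle stays very low.

Q: Lower than what?

My current answer is 1%. CPU utilization fluctuates quickly, and a saturated CPU only makes things slower; it does not bring the system down the way running out of memory can. I therefore do not count swinging between 0% and 30% idle as exhaustion.

I heard in a lecture that for web services latency rises sharply once it falls below 40%, but I have not yet managed to reproduce that in my own simulations.

TODO: Some material I have seen also looks at the share of %system; to be investigated.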

root@node01:~# sar -u 1 5
Linux 5.4.0-171-generic (node01)        04/14/2024      _x86_64_        (4 CPU)

07:39:56 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
07:39:57 AM     all      0.50      0.00      0.25      0.00      0.00     99.25
07:39:58 AM     all      0.76      0.00      1.01      0.00      0.00     98.24
07:39:59 AM     all      0.25      0.00      0.76      0.00      0.00     98.99
07:40:00 AM     all      0.50      0.00      0.25      0.00      0.00     99.25
07:40:01 AM     all      0.50      0.00      1.50      0.00      0.00     98.00
Average:        all      0.50      0.00      0.75      0.00      0.00     98.74

Output fields:

CPU: all means the figures are averaged over all CPU cores.
%user: percentage of CPU time spent executing at user level (applications).
%nice: percentage of CPU time spent executing at user level with nice priority, i.e. running processes whose priority has been lowered with nice.
- On a CPU utilization graph or report, the "nice" CPU percentage is the share of CPU time occupied by user-level processes with a positive nice value (lower scheduling priority; see man nice for details). It is CPU time that is currently in use, but if a normal (nice value 0) or high-priority (negative nice value) process comes along, those programs will be kicked off the CPU.
- nice: the CPU scheduling priority. Higher values (up to +19) mean lower priority, and lower values (down to -20) mean higher priority (an inverse relationship).
%system: percentage of CPU time spent executing at kernel level.
%iowait: percentage of time the CPU was idle while the system had an outstanding disk I/O request.
%steal: percentage of time the virtual CPU spent waiting involuntarily while the hypervisor was servicing another virtual processor, e.g. CPU consumed by other VMs on the same physical host.
%idle: percentage of time the CPU was idle with no outstanding disk I/O request.

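Notes

If %iowait is high, the disk is likely an I/O bottleneck.
If %idle is high but the system still responds slowly, the CPU may be waiting on memory allocation; in that case adding memory may help.
If %idle stays very low, CPU capacity is insufficient and CPU is the resource that most needs attention.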
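
The "stays very low" rule can be turned into a rough check. The sketch below is only an illustration: the function name cpu_sustained_busy and both default parameters (1% idle, 10 consecutive samples) are assumptions, not established thresholds. It reads `sar -u` output on stdin and reports the longest run of consecutive samples whose %idle (the last column) is below the threshold.

```shell
# Hypothetical helper: detect a sustained run of low-%idle samples.
# usage: sar -u 1 60 | cpu_sustained_busy [idle_threshold] [min_consecutive]
cpu_sustained_busy() {
    awk -v thr="${1:-1}" -v need="${2:-10}" '
        BEGIN { run = 0; longest = 0 }
        # Per-interval rows for the "all" CPU. The CPU column is field 2
        # with a 24-hour clock and field 3 with AM/PM timestamps.
        # Header rows and the final Average row are skipped.
        $1 != "Average:" && ($2 == "all" || $3 == "all") {
            if ($NF + 0 < thr) run++; else run = 0
            if (run > longest) longest = run
        }
        END {
            if (longest >= need)
                print "CPU saturated: " longest " consecutive samples below " thr "% idle"
            else
                print "ok: longest low-idle run was " longest " samples"
        }'
}
```

Requiring several consecutive samples reflects the point made earlier that a CPU bouncing between 0% and 30% idle is not treated as exhausted.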

Experiment: CPU

Idle baseline:

root@node01:~# sar -u 1 5
Linux 5.4.0-171-generic (node01)        04/14/2024      _x86_64_        (4 CPU)

07:39:56 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
07:39:57 AM     all      0.50      0.00      0.25      0.00      0.00     99.25
07:39:58 AM     all      0.76      0.00      1.01      0.00      0.00     98.24
07:39:59 AM     all      0.25      0.00      0.76      0.00      0.00     98.99
07:40:00 AM     all      0.50      0.00      0.25      0.00      0.00     99.25
07:40:01 AM     all      0.50      0.00      1.50      0.00      0.00     98.00
Average:        all      0.50      0.00      0.75      0.00      0.00     98.74

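Saturating one core:

sysbench cpu run    # sysbench 1.0+ syntax; older releases use: sysbench --test=cpu run

sar shows about 25% %user, which on this 4-CPU machine corresponds to one fully busy core: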

root@node01:~# sar -u 1 5
Linux 5.4.0-171-generic (node01)        04/14/2024      _x86_64_        (4 CPU)

08:21:52 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
08:21:53 AM     all     25.94      0.00      0.76      0.00      0.00     73.30
08:21:54 AM     all     25.37      0.00      2.24      0.00      0.00     72.39
08:21:55 AM     all     25.56      0.00      0.25      0.00      0.00     74.19
08:21:56 AM     all     26.25      0.00      1.25      0.00      0.00     72.50
08:21:57 AM     all     26.12      0.00      1.99      0.00      0.00     71.89
Average:        all     25.85      0.00      1.30      0.00      0.00     72.85

top shows:

%Cpu(s): 25.4 us,  0.0 sy,  0.0 ni, 74.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

htop shows per-core usage: one core saturated, three idle.

Saturating all cores: TODO

Start at least 4 threads, one per core:

sysbench --threads=100 --events=80000  cpu --cpu-max-prime=8000000 run
https://cloud.tencent.com/developer/article/1867674
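
For the pending all-cores run, one option is to derive the thread count from the machine instead of hard-coding it. The sketch below only prints a command line so it can be inspected first. The function name is made up, the 60-second duration and prime limit of 20000 are arbitrary illustrative values, and it assumes sysbench 1.0+ (options `--threads`, `--time`, `--cpu-max-prime`) and `nproc` from coreutils.

```shell
# Hypothetical helper: print (not run) a sysbench command that starts one
# worker thread per online CPU.
# usage: sysbench_all_cores_cmd [threads] [seconds]
sysbench_all_cores_cmd() {
    threads="${1:-$(nproc)}"   # default: number of online CPUs
    seconds="${2:-60}"         # assumed duration, long enough to watch in sar
    echo "sysbench cpu --threads=${threads} --time=${seconds} --cpu-max-prime=20000 run"
}
```

Running the printed command in one terminal and `sar -u 1 5` in another should, if all cores are really busy, show %idle close to 0 instead of the roughly 73% seen with a single thread.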

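Memory

Q: How much memory usage counts as exhausted?

A: Look at swap.

Q: What if swap is turned off? // Come to think of it, I also disabled swap by following a tutorial; supposedly, with swap enabled the memory limits set through cgroups stop being effective. Are there other reasons?

A: (thinking) I have seen it reach 99%.

Q: How did you observe that?

I was too embarrassed to answer. The VM had frozen at the time, so what I was looking at was the host's Task Manager. Soon the host froze as well and the VM crashed.

TODO: How much counts as memory exhaustion? By the time nothing at all is left, the problem is already serious.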

root@node01:~# sar -r 10 5
Linux 5.4.0-171-generic (node01)        04/14/2024      _x86_64_        (4 CPU)

08:50:27 AM kbmemfree   kbavail kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
08:50:37 AM    988804   1752596   1833456     46.05     72412    857060   2447588     61.47   2176988    425532       192
08:50:47 AM    987204   1751020   1834924     46.08     72428    857060   2450288     61.54   2177068    425532       156
08:50:57 AM    987720   1751552   1834136     46.06     72444    857060   2453056     61.61   2177036    425532       164
08:51:07 AM    987960   1751804   1834324     46.07     72460    857064   2452184     61.59   2176864    425536       184
08:51:17 AM    988064   1751932   1833860     46.06     72476    857064   2445740     61.43   2177112    425536       180
Average:       987950   1751781   1834140     46.06     72444    857062   2449771     61.53   2177014    425534       175

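Output fields:

kbmemfree: roughly the same as the free column of the free command, so it excludes buffer and cache space.
kbmemused: used memory. In the sysstat version used here it is calculated as total memory - kbmemfree - kbbuffers - kbcached - kbslab, so it excludes buffers and cache (older versions reported total - free, which included them). It is close to, but not identical with, the used column of free.
%memused: kbmemused as a percentage of total memory (excluding swap).
kbbuffers and kbcached: the buffers and cache that free reports combined as buff/cache.
kbcommit: memory needed for the current workload, i.e. an estimate of how much RAM+swap is required to guarantee the system never runs out of memory.
%commit: kbcommit as a percentage of total memory (RAM+swap). It can exceed 100% because the kernel usually overcommits memory.

TODO: Should I look at %memused or %commit? The former should correspond to the percentage derived from free -m. The latter is what sar highlights, and it also seems the more reasonable choice.

Comparison with free -m: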

root@node01:~# sar -r 1 1
Linux 5.4.0-171-generic (node01)        04/14/2024      _x86_64_        (4 CPU)

10:07:49 AM kbmemfree   kbavail kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
10:07:50 AM    977376   1751528   1834060     46.06     80028    859308   2444136     61.38   2184788    427772       180
Average:       977376   1751528   1834060     46.06     80028    859308   2444136     61.38   2184788    427772       180

root@node01:~# free -m
              total        used        free      shared  buff/cache   available
Mem:           3888        1920         954           1        1013        1710
Swap:             0           0           0
root@node01:~# echo "scale=2;1920/3888" | bc
.49

root@node01:~# top
...
MiB Mem :   3888.3 total,    938.1 free,   1922.2 used,   1028.0 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   1708.3 avail Mem

TODO: Memory is barely fluctuating, so why do free/top and sar report different figures? Does sar itself use a lot of memory?
root@node01:~# ps aux | grep sar
root      493389  0.0  0.0   5604   720 pts/0    S+   12:28   0:00 sar -r 1 100
root      493433  0.0  0.0   6300   720 pts/2    S+   12:28   0:00 grep --color=auto sar
root@node01:~# ps aux | head -1
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
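Does sar invoke other commands? (Its RSS above is only 720 kB, so sar's own footprint cannot account for a gap of roughly 130 MiB between 1920 MiB in free and 1834060 kB in sar.)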
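
A likely explanation for the gap, offered as an assumption and not yet verified on this machine: the two tools subtract different amounts of slab memory. The sar man page for sysstat 12 defines kbmemused as total - free - buffers - cached - slab, whereas free from procps 3.3.x counts only reclaimable slab (SReclaimable) as cache. The difference would then be the unreclaimable slab. The sketch below computes both views from /proc/meminfo-style input; the function name mem_used_views is hypothetical.

```shell
# Hypothetical helper: show "used" memory the way free and sar are assumed
# to compute it, from /proc/meminfo-formatted text on stdin.
# usage: mem_used_views < /proc/meminfo
mem_used_views() {
    awk '
        { sub(":", "", $1); kb[$1] = $2 }      # e.g. kb["MemTotal"] = 3981600
        END {
            total = kb["MemTotal"]
            base  = total - kb["MemFree"] - kb["Buffers"] - kb["Cached"]
            free_used = base - kb["SReclaimable"]  # assumed procps 3.3.x rule
            sar_used  = base - kb["Slab"]          # assumed sysstat 12 rule
            printf "free-style used: %d kB (%.2f%%)\n", free_used, 100 * free_used / total
            printf "sar-style used:  %d kB (%.2f%%)\n", sar_used, 100 * sar_used / total
        }'
}
```

If the two printed values differ by about SUnreclaim from /proc/meminfo, that would support this explanation; newer procps releases compute used differently, so the comparison only applies to the versions assumed here.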

Experiment: Memory

I/O

Look at the CPU's %iowait, or the disk's %util.

TODO: How much?

`sar -d`

root@node01:~# sar -d  -p 10 3 
Linux 5.4.0-171-generic (node01)        04/14/2024      _x86_64_        (4 CPU)

12:36:52 PM       DEV       tps     rkB/s     wkB/s     dkB/s   areq-sz    aqu-sz     await     %util
12:37:02 PM     loop0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:37:02 PM     loop1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:37:02 PM     loop2      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:37:02 PM     loop3      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:37:02 PM     loop4      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:37:02 PM     loop5      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:37:02 PM       fd0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:37:02 PM       sda      0.40      0.00      5.20      0.00     13.00      0.00      0.50      0.08
12:37:02 PM ubuntu--vg-ubuntu--lv      1.30      0.00      5.20      0.00      4.00      0.00      0.00      0.08

12:37:02 PM       DEV       tps     rkB/s     wkB/s     dkB/s   areq-sz    aqu-sz     await     %util
12:37:12 PM     loop0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:37:12 PM     loop1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:37:12 PM     loop2      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:37:12 PM     loop3      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:37:12 PM     loop4      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:37:12 PM     loop5      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:37:12 PM       fd0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:37:12 PM       sda      0.40      0.00      4.00      0.00     10.00      0.00      0.25      0.12
12:37:12 PM ubuntu--vg-ubuntu--lv      1.00      0.00      4.00      0.00      4.00      0.00      0.00      0.12

12:37:12 PM       DEV       tps     rkB/s     wkB/s     dkB/s   areq-sz    aqu-sz     await     %util
12:37:22 PM     loop0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:37:22 PM     loop1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:37:22 PM     loop2      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:37:22 PM     loop3      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:37:22 PM     loop4      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:37:22 PM     loop5      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:37:22 PM       fd0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:37:22 PM       sda      0.60      0.00      4.80      0.00      8.00      0.00      0.50      0.12
12:37:22 PM ubuntu--vg-ubuntu--lv      1.20      0.00      4.80      0.00      4.00      0.00      0.00      0.12

Average:          DEV       tps     rkB/s     wkB/s     dkB/s   areq-sz    aqu-sz     await     %util
Average:        loop0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:        loop1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:        loop2      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:        loop3      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:        loop4      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:        loop5      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:          fd0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:          sda      0.47      0.00      4.67      0.00     10.00      0.00      0.43      0.11
Average:    ubuntu--vg-ubuntu--lv      1.17      0.00      4.67      0.00      4.00      0.00      0.00      0.11

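TODO CHECK:

tps: total number of I/O transfers (requests issued to the device) per second.
await: average time, in milliseconds, for an I/O request to be served, including both time spent waiting in the queue and time spent servicing it.
%util: percentage of elapsed time during which the device was busy with I/O requests. A value above 80% generally suggests the disk may be busy, although for devices that serve requests in parallel (RAID arrays, modern SSDs) this figure does not reflect their real limits.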
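
The 80% rule of thumb can be applied mechanically to the Average rows of `sar -d`. This is a sketch under assumptions: the function name busy_disks is made up, the threshold is the unverified 80% figure from above, and it relies on the column order shown in this sysstat version, where %util is the last column and await the one before it.

```shell
# Hypothetical helper: list devices whose average %util is at or above a
# threshold, based on the "Average:" rows of sar -d output.
# usage: sar -d -p 10 3 | busy_disks [util_threshold]
busy_disks() {
    awk -v thr="${1:-80}" '
        # Skip the Average header row (second field is the literal "DEV").
        $1 == "Average:" && $2 != "DEV" && $NF + 0 >= thr {
            printf "%s %%util=%s await=%s\n", $2, $NF, $(NF - 1)
        }'
}
```

On the idle output shown above it prints nothing, since sda averages only 0.11% utilization.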

`sar -u`

%iowait is the percentage of CPU time spent idle while waiting for I/O.

TODO: CHECK WHY - if %iowait exceeds 50%, or is clearly larger than %system, %user and %idle, I/O may be a problem.

root@node01:~# sar -u 1 5
Linux 5.4.0-171-generic (node01)        04/14/2024      _x86_64_        (4 CPU)

07:39:56 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
07:39:57 AM     all      0.50      0.00      0.25      0.00      0.00     99.25
07:39:58 AM     all      0.76      0.00      1.01      0.00      0.00     98.24
07:39:59 AM     all      0.25      0.00      0.76      0.00      0.00     98.99
07:40:00 AM     all      0.50      0.00      0.25      0.00      0.00     99.25
07:40:01 AM     all      0.50      0.00      1.50      0.00      0.00     98.00
Average:        all      0.50      0.00      0.75      0.00      0.00     98.74

Experiment: I/O

Network TODO

`sar -n DEV`: network interface statistics

root@node01:~# sar -n DEV 1 3
Linux 5.4.0-171-generic (node01)        04/14/2024      _x86_64_        (4 CPU)

12:46:13 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
12:46:14 PM     ens33     45.00    102.00      4.71     26.20      0.00      0.00      0.00      0.02
12:46:14 PM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

12:46:14 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
12:46:15 PM     ens33     26.00     48.00      2.57      6.02      0.00      0.00      0.00      0.00
12:46:15 PM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

12:46:15 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
12:46:16 PM     ens33     50.00    109.00      5.00     24.61      0.00      0.00      0.00      0.02
12:46:16 PM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

Average:        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
Average:        ens33     40.33     86.33      4.09     18.94      0.00      0.00      0.00      0.02
Average:           lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

`sar -n EDEV`: network error statistics

root@node01:~# sar -n EDEV  1 3
Linux 5.4.0-171-generic (node01)        04/14/2024      _x86_64_        (4 CPU)

12:48:15 PM     IFACE   rxerr/s   txerr/s    coll/s  rxdrop/s  txdrop/s  txcarr/s  rxfram/s  rxfifo/s  txfifo/s
12:48:16 PM     ens33      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:48:16 PM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

12:48:16 PM     IFACE   rxerr/s   txerr/s    coll/s  rxdrop/s  txdrop/s  txcarr/s  rxfram/s  rxfifo/s  txfifo/s
12:48:17 PM     ens33      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:48:17 PM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

12:48:17 PM     IFACE   rxerr/s   txerr/s    coll/s  rxdrop/s  txdrop/s  txcarr/s  rxfram/s  rxfifo/s  txfifo/s
12:48:18 PM     ens33      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:48:18 PM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

Average:        IFACE   rxerr/s   txerr/s    coll/s  rxdrop/s  txdrop/s  txcarr/s  rxfram/s  rxfifo/s  txfifo/s
Average:        ens33      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:           lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

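IFACE: network interface name
rxerr/s: bad packets received per second
txerr/s: errors per second while transmitting packets
coll/s: collisions per second while transmitting packets; this only occurs in half-duplex mode
rxdrop/s: received packets dropped per second because buffers were full
txdrop/s: transmitted packets dropped per second because buffers were full
txcarr/s: carrier errors per second while transmitting packets
rxfram/s: frame alignment errors per second on received packets
rxfifo/s: FIFO overrun errors per second on received packets
txfifo/s: FIFO overrun errors per second on transmitted packets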
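
Since every counter in this report should normally be zero, a simple check is to list the interfaces where any of them is not. The sketch below is illustrative only; the function name ifaces_with_errors is an assumption, and it sums all numeric columns of the Average rows without weighing which kind of error occurred.

```shell
# Hypothetical helper: print interfaces whose averaged error/drop counters
# in sar -n EDEV output are not all zero.
# usage: sar -n EDEV 1 3 | ifaces_with_errors
ifaces_with_errors() {
    awk '
        # Average rows only; skip the header row (second field "IFACE").
        $1 == "Average:" && $2 != "IFACE" {
            sum = 0
            for (i = 3; i <= NF; i++) sum += $i   # rxerr/s ... txfifo/s
            if (sum > 0) print $2
        }'
}
```

For the output shown above it prints nothing, because ens33 and lo report 0.00 in every column.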

Experiment: Network

Test tools

For CPU/memory/I/O, use the benchmark tool sysbench: https://cn.linux-console.net/?p=16153

For network, use Apache Benchmark (ab).

tmp

Error shown when sar data collection is not enabled:

Cannot open /var/log/sysstat/sa__: No such file or directory
Please check if data collecting is enabled