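When I got to the office in the morning, I found alert emails in my inbox showing that both CPU and I/O usage had exceeded their thresholds.
The alerts read as follows: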
Host: test-server-192.168.1.18
Time: 2015.11.15 15:25:17
Status: PROBLEM
Severity: Warning
Alert reason: Processor load is too high on test-server
Details: Processor load (1 min average per core):value=52.53
Original event ID: 30605

Host: test-server-192.168.1.19
Time: 2015.11.18 15:42:23
Status: PROBLEM
Severity: Warning
Alert reason: Disk I/O is overloaded on test-server
Details: CPU iowait time:value=68.7 %
Original event ID: 30812
Troubleshooting
1) Check the processes with top: there are nearly 2,000 of them
[root@test-server ~]# top
top - 10:00:32 up 184 days, 19:55, 2 users, load average: 49.39, 52.06, 53.04
Tasks: 1826 total, 1 running, 1825 sleeping, 0 stopped, 0 zombie
Cpu(s): 22.5%us, 3.8%sy, 0.0%ni, 31.7%id, 41.3%wa, 0.7%hi, 0.0%si, 0.0%st
Mem: 8058056k total, 7631808k used, 426248k free, 718780k buffers
Swap: 0k total, 0k used, 0k free, 358720k cached
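The alert reports load per core, while top prints the total load average. The conversion between the two can be sketched as follows. This is only an illustration: load_per_core is a hypothetical helper name, and the core count of the server is not given in this post.

```shell
# Divide a load average by the number of CPU cores, rounded to two decimals.
# Arguments: $1 = load average, $2 = core count.
# (load_per_core is a hypothetical helper, not part of any tool.)
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# On a Linux host the inputs could be taken from /proc/loadavg and nproc:
# load_per_core "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
```

A value well above 1.00 per core means that more tasks are runnable or blocked on I/O than the CPUs can serve, which fits the high iowait shown by top.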
2) Check the maillog, which keeps logging the warning: No space left on device
[root@test-server ~]# tail -f /var/log/maillog
Nov 19 10:12:15 test-server postfix/postdrop[19470]: warning: mail_queue_enter: create file maildrop/878633.19470: No space left on device
Nov 19 10:12:15 test-server postfix/postdrop[27287]: warning: mail_queue_enter: create file maildrop/900082.27287: No space left on device
Nov 19 10:12:15 test-server postfix/postdrop[12347]: warning: mail_queue_enter: create file maildrop/919377.12347: No space left on device
Nov 19 10:12:15 test-server postfix/postdrop[21222]: warning: mail_queue_enter: create file maildrop/937001.21222: No space left on device
Nov 19 10:12:16 test-server postfix/postdrop[25028]: warning: mail_queue_enter: create file maildrop/956095.25028: No space left on device
Nov 19 10:12:16 test-server postfix/postdrop[28123]: warning: mail_queue_enter: create file maildrop/980022.28123: No space left on device
Nov 19 10:12:16 test-server postfix/postdrop[26680]: warning: mail_queue_enter: create file maildrop/999360.26680: No space left on device
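3) lsof shows that there are far too many sendmail and postdrop processes, more than 2,000 in total! This is the root cause of the server alerts.
How the command below works:
lsof | awk '{print $1}'   prints the first column (the command name) of the lsof output
sort | uniq -c            collapses duplicates and counts how often each command name occurs
sort -k1 -rn              sorts by the first column (the count) in descending numeric order
head -5                   prints the first 5 lines
[root@test-server ~]# lsof
[root@test-server ~]# lsof |awk '{print $1}'|sort|uniq -c|sort -k1 -rn|head -5
[root@test-server ~]# lsof |grep sendmail |wc -l
24682
[root@test-server ~]# lsof |grep postdrop |wc -l
24108
Note that lsof prints one line per open file, so 24682 and 24108 are counts of open-file entries rather than process counts.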
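Because lsof lists open files, counting processes by name is more direct with ps. The sketch below applies the same sort/uniq pipeline to a list of command names, one per line. top_commands is a hypothetical helper name, and feeding it from ps -e -o comm= is an assumption about how one might use it, not something the original investigation did.

```shell
# Count how often each command name appears on stdin and print the
# N most frequent ones as "count name" (N defaults to 5).
# (top_commands is a hypothetical helper name.)
top_commands() {
    n="${1:-5}"
    sort | uniq -c | sort -k1,1nr | head -n "$n" | awk '{ print $1, $2 }'
}

# Against live data, one line per process (command name only):
# ps -e -o comm= | top_commands 5
```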
4) Check inode usage: the inodes on / are exhausted
[root@test-server log]# df -i
Filesystem     Inodes    IUsed     IFree IUse% Mounted on
/dev/xvda1    1310720  1310720         0  100% /
tmpfs         1007257        1   1007256    1% /dev/shm
/dev/xvdb1   13107200     6142  13101058    1% /u01
Compare with the output of df -Th:
[root@cwebser3 statistics]# df -Th
Filesystem  Type   Size  Used Avail Use% Mounted on
/dev/xvda1  ext4    20G  4.1G   15G  22% /
tmpfs       tmpfs  3.9G     0  3.9G   0% /dev/shm
/dev/xvdb1  ext3   197G   18G  170G  10% /u01
Disk space on / is only 22% used, so the "No space left on device" warnings come from the exhausted inodes rather than from a lack of free blocks.
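A check like this can be scripted so that inode exhaustion is caught before it reaches 100%. The following is a minimal sketch that filters df -i style output. inodes_over is a hypothetical helper name, and the default threshold of 90% is an arbitrary assumption.

```shell
# Print "mount-point IUse%" for every filesystem whose inode usage is at or
# above a threshold. Reads `df -i` style output on stdin.
# $1 = threshold in percent (defaults to 90, an arbitrary choice).
# (inodes_over is a hypothetical helper name.)
inodes_over() {
    awk -v limit="${1:-90}" '
        NR > 1 {                       # skip the header line
            use = $5
            sub(/%/, "", use)          # "100%" -> "100"
            if (use + 0 >= limit) print $6, $5
        }'
}

# Against live data:
# df -i | inodes_over 90
```

The sketch assumes each filesystem occupies a single output line and that mount points contain no spaces.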
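5) Free up space on the full filesystem
For example, move large files to a larger partition and symlink them back, or truncate large log files. Because the shortage here is in inodes, what has to shrink is the number of files on /, not just their total size.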
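The post does not say which directory was consuming the inodes. One way to find out is to count the entries under each subdirectory of a suspect path, as sketched below. files_per_dir is a hypothetical helper name, and the starting directory is whatever you choose to inspect.

```shell
# For each immediate subdirectory of $1 (default: current directory), print
# "entry-count path", largest first. Every file and directory uses one inode,
# so the subdirectory itself is included in its count.
# (files_per_dir is a hypothetical helper name.)
files_per_dir() {
    root="${1:-.}"
    for d in "$root"/*/; do
        [ -d "$d" ] || continue
        # -xdev keeps find on the same filesystem as the starting directory
        count="$(find "$d" -xdev | wc -l | tr -d ' ')"
        printf '%s %s\n' "$count" "${d%/}"
    done | sort -k1,1nr
}

# Example: files_per_dir /var | head -n 5
```

File names containing newlines would be counted more than once, which is acceptable for a rough survey.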
6) Kill all sendmail and postdrop processes
[root@test-server ~]# ps -ef|grep sendmail | grep -v grep | awk -F" " '{print $2}' | xargs kill -9
[root@test-server ~]# ps -ef|grep postdrop | grep -v grep | awk -F" " '{print $2}' | xargs kill -9
Run lsof again to confirm that no sendmail or postdrop entries remain:
[root@test-server ~]# lsof |grep sendmail |wc -l
0
[root@test-server ~]# lsof |grep postdrop |wc -l
0
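The grep | grep -v grep pattern above matches any line containing the name, including unrelated commands whose arguments mention it. A slightly stricter variant selects PIDs by exact command name, as in this sketch. pids_of is a hypothetical helper name, and the ps -e -o pid=,comm= input format is an assumption about how it would be fed.

```shell
# Print the PIDs whose command name matches $1 exactly.
# Reads "PID COMMAND" pairs on stdin, one process per line.
# (pids_of is a hypothetical helper name.)
pids_of() {
    awk -v name="$1" '$2 == name { print $1 }'
}

# Against live data (sends the default SIGTERM; escalate to -9 only if needed):
# ps -e -o pid=,comm= | pids_of postdrop | xargs kill
```

Matching on the command column alone means the awk process itself never appears in the result, so no grep -v grep step is needed.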
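7) Finally, restart sendmail (on this host the command answered "sendmail: unrecognized service"). top now shows only about 100 processes, the monitoring alerts have cleared, and the problem is solved!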
[root@test-server cron.d]# service sendmail restart
sendmail: unrecognized service
[root@cwebser3 cron.d]# top
top - 10:43:12 up 184 days, 20:37, 2 users, load average: 1.03, 1.54, 14.15
Tasks: 105 total, 1 running, 104 sleeping, 0 stopped, 0 zombie
Cpu(s): 43.4%us, 1.3%sy, 0.0%ni, 47.9%id, 7.0%wa, 0.3%hi, 0.0%si, 0.0%st
Mem: 8058056k total, 6762996k used, 1295060k free, 1422060k buffers
Swap: 0k total, 0k used, 0k free, 381392k cached