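1. The mailbox filled up with monitoring alerts like the two below. They point to an overloaded CPU, and to I/O trouble as well:
- Host: bwebser2__10.253.5.198
- Time: 2015.11.15 15:25:17
- Status: PROBLEM
- Severity: Warning
- Alert reason: Processor load is too high on bwebser2
- Details: Processor load (1 min average per core): value=52.53
- Original event ID: 30605
- Host: bwebser2__10.253.5.198
- Time: 2015.11.18 15:42:23
- Status: PROBLEM
- Severity: Warning
- Alert reason: Disk I/O is overloaded on bwebser2
- Details: CPU iowait time: value=68.7 %
- Original event ID: 30812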
2. Checking with top shows nearly 2,000 processes:
- [root@bwebser2 ~]# top
- top - 10:00:32 up 184 days, 19:55, 2 users, load average: 49.39, 52.06, 53.04
- Tasks: 1826 total, 1 running, 1825 sleeping, 0 stopped, 0 zombie
- Cpu(s): 22.5%us, 3.8%sy, 0.0%ni, 31.7%id, 41.3%wa, 0.7%hi, 0.0%si, 0.0%st
- Mem: 8058056k total, 7631808k used, 426248k free, 718780k buffers
- Swap: 0k total, 0k used, 0k free, 358720k cached
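top reports the task total but not which commands account for it. One way to see that is to group the process list by command name, roughly as sketched here. The function name `count_by_command` is made up for illustration, and the usage line assumes a procps-style `ps` that accepts `-eo comm=`.

```shell
# Hypothetical helper: read one command name per line on stdin and print
# the most frequent names with their counts, highest count first.
# The optional argument limits how many names are shown (default 10).
count_by_command() {
  sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Usage on a live host (not executed here):
#   ps -eo comm= | count_by_command 5
```

On a host in the state described above, sendmail and postdrop would be expected near the top of that list.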
3. sendmail looked like a possible culprit, so check the mail log. It keeps repeating the same warning, "No space left on device":
- [root@bwebser2 ~]# tail -f /var/log/maillog
- Nov 19 10:12:15 bwebser2 postfix/postdrop[19470]: warning: mail_queue_enter: create file maildrop/878633.19470: No space left on device
- Nov 19 10:12:15 bwebser2 postfix/postdrop[27287]: warning: mail_queue_enter: create file maildrop/900082.27287: No space left on device
- Nov 19 10:12:15 bwebser2 postfix/postdrop[12347]: warning: mail_queue_enter: create file maildrop/919377.12347: No space left on device
- Nov 19 10:12:15 bwebser2 postfix/postdrop[21222]: warning: mail_queue_enter: create file maildrop/937001.21222: No space left on device
- Nov 19 10:12:16 bwebser2 postfix/postdrop[25028]: warning: mail_queue_enter: create file maildrop/956095.25028: No space left on device
- Nov 19 10:12:16 bwebser2 postfix/postdrop[28123]: warning: mail_queue_enter: create file maildrop/980022.28123: No space left on device
- Nov 19 10:12:16 bwebser2 postfix/postdrop[26680]: warning: mail_queue_enter: create file maildrop/999360.26680: No space left on device
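Watching `tail -f` shows that the warning repeats, but not how often or from which program. The sketch below tallies the warnings per program. It assumes the traditional syslog layout seen in the lines above (month, day, time, host, then `program[pid]:`), and `enospc_by_program` is an illustrative name.

```shell
# Hypothetical helper: read syslog lines on stdin and count the
# "No space left on device" warnings per program.
enospc_by_program() {
  grep 'No space left on device' |
    awk '{ sub(/\[[0-9]+\]:$/, "", $5); print $5 }' |  # strip "[pid]:" from field 5
    sort | uniq -c | sort -rn
}

# Usage on a live host (not executed here):
#   enospc_by_program < /var/log/maillog
```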
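4. Use lsof to gauge the scale of the sendmail and postdrop pile-up. Each command matches more than 24,000 open-file entries (lsof prints one line per open file, not one per process), which fits the nearly 2,000 tasks that top reported. Why are there so many?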
- [root@bwebser2 ~]# lsof |grep sendmail |wc -l
- 24682
- [root@bwebser2 ~]# lsof |grep postdrop |wc -l
- 24108
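`lsof | grep NAME | wc -l` counts open-file lines, so a process holding a dozen files is counted a dozen times. If a process count is what is wanted, something like this sketch may be closer. `count_procs` is an illustrative name, and the sketch assumes `pgrep` from procps is installed.

```shell
# Hypothetical helper: print how many running processes have exactly
# the given command name.
count_procs() {
  # pgrep -x matches the process name exactly and prints one PID per line.
  # "|| true" keeps pgrep's no-match exit status from aborting a script
  # that runs under "set -e".
  { pgrep -x "$1" || true; } | wc -l
}

# Usage on a live host (not executed here):
#   count_procs sendmail
#   count_procs postdrop
```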
5. Check inode usage. The root filesystem has run out of inodes:
- [root@bwebser2 log]# df -i
- Filesystem Inodes IUsed IFree IUse% Mounted on
- /dev/xvda1 1310720 1310720 0 100% /
- tmpfs 1007257 1 1007256 1% /dev/shm
- /dev/xvdb1 13107200 6142 13101058 1% /u01
- df -Th, by contrast, shows that disk space itself is far from full:
- [root@cwebser3 statistics]# df -Th
- Filesystem Type Size Used Avail Use% Mounted on
- /dev/xvda1 ext4 20G 4.1G 15G 22% /
- tmpfs tmpfs 3.9G 0 3.9G 0% /dev/shm
- /dev/xvdb1 ext3 197G 18G 170G 10% /u01
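Because "No space left on device" can mean either exhausted blocks or exhausted inodes, it can help to check inode usage automatically. Here is a minimal sketch that filters `df -i` style output by a threshold. The name `inode_hogs` and the default of 90% are assumptions for illustration, and the parsing assumes mount points without spaces.

```shell
# Hypothetical helper: read "df -P -i" output on stdin and print the mount
# point and IUse% of every filesystem at or above the given percentage.
inode_hogs() {
  awk -v limit="${1:-90}" '
    NR > 1 {                      # skip the header line
      use = $5
      sub(/%/, "", use)           # "100%" -> "100"
      if (use + 0 >= limit) print $6, $5
    }'
}

# Usage on a live host (not executed here):
#   df -P -i | inode_hogs 90
```

Fed the `df -i` output shown above, this would report only the root filesystem.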
6. Free up inodes on the root filesystem by deleting the zookeeper monitoring logs:
- cd /home/zookeeper/monitor
- [root@bwebser2 monitor]# ll
- total 8
- drwxrwxr-x 163 zookeeper zookeeper 4096 Nov 12 00:16 charts
- drwxrwxr-x 167 zookeeper zookeeper 4096 Nov 18 17:31 statistics
- [root@bwebser2 monitor]# cd charts
- [root@bwebser2 charts]# rm -rf *
- [root@bwebser2 charts]# cd ../statistics/
- [root@bwebser2 statistics]# rm -rf 201506*
- [root@bwebser2 statistics]# rm -rf 201507*
- [root@bwebser2 statistics]# rm -rf 201508*
- [root@bwebser2 statistics]# rm -rf 201509*
- [root@bwebser2 statistics]# rm -rf 201510*
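The post does not show how the zookeeper directories were singled out. One plausible approach, sketched below, is to rank the subdirectories of a suspect path by how many entries they contain, since every file and directory consumes one inode. `count_entries` is an illustrative name, and the counts are approximate when file names contain newlines.

```shell
# Hypothetical helper: for each immediate subdirectory of $1, print
# "<entry count> <directory>", largest first. The count includes the
# subdirectory itself; -xdev keeps find on the same filesystem.
count_entries() {
  for d in "$1"/*/; do
    [ -d "$d" ] || continue
    printf '%s %s\n' "$(find "$d" -xdev | wc -l | tr -d ' ')" "$d"
  done | sort -rn
}

# Usage on a live host (not executed here):
#   count_entries /home/zookeeper/monitor
```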
7. Kill all sendmail and postdrop processes:
- [root@bwebser2 ~]# ps -ef | grep sendmail | grep -v grep | awk '{print "kill -9 " $2}' | sh
- [root@bwebser2 ~]# ps -ef | grep postdrop | grep -v grep | awk '{print "kill -9 " $2}' | sh
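The `grep` in the pipeline above matches the name anywhere on the `ps` line, including in arguments of unrelated commands. A slightly more selective variant, sketched here, picks PIDs by exact command name. `pids_of` is an illustrative name, and the usage line assumes a procps-style `ps` that accepts `-eo pid=,comm=`.

```shell
# Hypothetical helper: read "PID COMMAND" pairs on stdin and print the
# PIDs whose command name is exactly $1.
pids_of() {
  awk -v name="$1" '$2 == name { print $1 }'
}

# Usage on a live host (not executed here). Sending the default TERM
# signal first and keeping -9 as a fallback is usually gentler:
#   ps -eo pid=,comm= | pids_of postdrop | while read -r pid; do kill "$pid"; done
```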
8. Check again with lsof. Both counts are now 0:
- [root@bwebser2 ~]# lsof |grep sendmail |wc -l
- 0
- [root@bwebser2 ~]# lsof |grep postdrop |wc -l
- 0
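9. The overlooked piece is the sysstat job file under /etc/cron.d. cron mails any output a job produces to the job's owner, and each such mail goes through sendmail and postdrop, so with the mail queue unable to create files these processes most likely kept piling up. Edit sysstat so the jobs discard their output: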
- [root@bwebser2 cron.d]#cd /etc/cron.d/
- [root@bwebser2 cron.d]# ll
- total 12
- -rw-r--r--. 1 root root 113 Nov 23 2013 0hourly
- -rw-r--r--. 1 root root 108 Apr 7 2014 raid-check
- -rw-r--r--. 1 root root 235 Nov 23 2013 sysstat
- Edit sysstat with vi and append &>/dev/null to both job lines:
- # run system activity accounting tool every 10 minutes
- */10 * * * * root /usr/lib/sa/sa1 1 1 &>/dev/null
- # generate a daily summary of process accounting at 23:53
- 53 23 * * * root /usr/lib/sa/sa2 -A &>/dev/null
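Other files under /etc/cron.d may have the same problem, so a rough scan for jobs that do not discard their output can be useful. The sketch below is a crude text filter, not a crontab parser, and `unredirected_jobs` is an illustrative name. Note also that `&>` is a bash extension. Where cron's `/bin/sh` is not bash, the portable spelling `>/dev/null 2>&1` is the safer choice, and setting `MAILTO=""` in the file is another common way to stop cron from sending mail.

```shell
# Hypothetical helper: read crontab text on stdin and print the job lines
# that never mention /dev/null.
unredirected_jobs() {
  # Drop blank lines and comments, then environment assignments such as
  # MAILTO=root, then any job that already redirects to /dev/null.
  grep -Ev '^[[:space:]]*(#|$)' |
    grep -Ev '^[[:space:]]*[A-Za-z_][A-Za-z0-9_]*[[:space:]]*=' |
    grep -v '/dev/null' || true
}

# Usage on a live host (not executed here):
#   cat /etc/cron.d/* | unredirected_jobs
```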
10. Run top again: only about 100 processes remain and the monitoring alerts have stopped. Problem solved!
- [root@bwebser2 cron.d]# service sendmail restart
- sendmail: unrecognized service
- [root@cwebser3 cron.d]# top
- top - 10:43:12 up 184 days, 20:37, 2 users, load average: 1.03, 1.54, 14.15
- Tasks: 105 total, 1 running, 104 sleeping, 0 stopped, 0 zombie
- Cpu(s): 43.4%us, 1.3%sy, 0.0%ni, 47.9%id, 7.0%wa, 0.3%hi, 0.0%si, 0.0%st
- Mem: 8058056k total, 6762996k used, 1295060k free, 1422060k buffers
- Swap: 0k total, 0k used, 0k free, 381392k cached