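On loosely organized systems, servers in the field sometimes become unreachable or hang outright. The usual cause is either a memory leak or a disk that has filled up. If system monitoring data is written to log files continuously, the problem can be located quickly after the machine crashes and is rebooted. For example, Docker can fill the disk when container log size is not limited, and Firefox can leak memory when it is left open for a long time. The monitoring script is as follows: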
#!/bin/bash
# monitor.sh: periodically append top, df and free snapshots to log files,
# moving each log to a backup directory once it grows past max_log_size.

log_dir=/home/user/log
backup_dir=$log_dir/log_scp
max_log_size=209715200   # 200M = 209715200 bytes
interval=5               # seconds between samples (assumed value; the original loop had no delay)

# Colors for the rotation message
green='\033[0;32m'
NC='\033[0m'

# Make sure the log and backup directories exist before the first sample
mkdir -p "$log_dir" "$backup_dir/top" "$backup_dir/disk_usage" "$backup_dir/free_top"

# write_sample NAME LOG_FILE BACKUP_SUBDIR DATA
# Moves LOG_FILE into BACKUP_SUBDIR when it exceeds max_log_size,
# then appends a timestamped sample to LOG_FILE.
write_sample() {
    local name=$1 log_file=$2 backup_subdir=$3 data=$4
    local size
    # A missing log file counts as size 0 instead of breaking the comparison
    size=$(stat -c %s "$log_file" 2>/dev/null || echo 0)
    if [ "$size" -gt "$max_log_size" ]; then
        mv "$log_file" "$backup_subdir/$stamp"
        echo -e "${green}$name log exceeded 200M: moved to $backup_subdir/$stamp, starting a new file${NC}"
    else
        echo "$name: $size"
    fi
    printf '%s\n%s\n' "$now" "$data" >> "$log_file"
}

while true
do
    now=$(date '+%F %T')                  # timestamp written before each sample
    stamp=$(date +%Y-%m-%d-%H:%M:%S)      # file name used for rotated logs

    # -n 1 makes top exit after one snapshot; head keeps the summary header
    # plus the processes using the most memory
    top_data=$(top -b -n 1 -o %MEM | head -n 25)
    # Usage percentage of the root filesystem, without the % sign
    disk_usage=$(df -h | awk '$NF=="/"{print $5}' | sed 's/%//')
    free_data=$(free -h)

    write_sample top "$log_dir/monitor_top.log" "$backup_dir/top" "$top_data"
    write_sample disk_usage "$log_dir/disk_usage.log" "$backup_dir/disk_usage" "$disk_usage"
    write_sample free_top "$log_dir/free_top.log" "$backup_dir/free_top" "$free_data"

    sleep "$interval"
done
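The script above records live snapshots from the top, df and free commands in log files. After a failure and reboot, these log files can be checked to quickly locate the source of the problem.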
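As a rough illustration of checking the logs after a reboot, the sketch below scans disk_usage.log for samples at or above a threshold. It assumes the log layout written by monitor.sh (a timestamp line followed by a line holding the usage percentage). The function name disk_over_threshold and the default threshold of 90 are hypothetical choices, not part of the original script.

```shell
#!/bin/bash
# disk_over_threshold LOG [THRESHOLD]
# Hypothetical helper: prints "timestamp usage%" for every sample in LOG
# whose usage is at or above THRESHOLD (default 90, an assumed value).
disk_over_threshold() {
    local log=$1 threshold=${2:-90}
    awk -v limit="$threshold" '
        # Timestamp line such as "2024-01-01 10:00:05": remember it
        /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] / { stamp = $1 " " $2; next }
        # Usage line: a bare number, possibly followed by spaces
        /^[0-9]+ *$/ { if ($1 + 0 >= limit) print stamp, $1 "%" }
    ' "$log"
}
```

For example, `disk_over_threshold /home/user/log/disk_usage.log 95` lists the moments at which the root filesystem was at least 95% full, which helps tell a disk-exhaustion crash apart from a memory problem.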
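The figure below shows the top log from just before the system crashed. It shows that the Isolated W+ process was leaking memory. A search turned up articles reporting that leaving Firefox open for long periods carries a risk of memory leaks.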
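Instead of reading the top log by eye, a leak like this can also be spotted with a small script. The sketch below is only an illustration: it assumes the default column layout of top in batch mode (PID in column 1, %MEM in column 10, command from column 12 onward), and the function name mem_growth is hypothetical. For each PID it compares %MEM in the first and last snapshot where the process appears and lists those that grew, largest growth first.

```shell
#!/bin/bash
# mem_growth LOG
# Hypothetical helper: prints "growth<TAB>pid<TAB>command" for each process
# whose %MEM rose between its first and last appearance in LOG.
mem_growth() {
    LC_ALL=C awk '
        # Process lines start with a numeric PID; header and timestamp lines do not
        $1 ~ /^[0-9]+$/ && NF >= 12 {
            pid = $1
            cmd = $12
            for (i = 13; i <= NF; i++) cmd = cmd " " $i   # command names may contain spaces
            if (!(pid in first)) { first[pid] = $10; name[pid] = cmd }
            last[pid] = $10
        }
        END {
            for (pid in first) {
                delta = last[pid] - first[pid]
                if (delta > 0) printf "%.1f\t%s\t%s\n", delta, pid, name[pid]
            }
        }
    ' "$1" | LC_ALL=C sort -rn
}
```

Running `mem_growth /home/user/log/monitor_top.log` puts the processes with the largest increase in memory share at the top of the list. Because the log only keeps the first lines of each top snapshot, a process that never reaches the top of the list does not appear in the result.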