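Environment:
The company is an Internet business in online education. Its web architecture uses LVS + Heartbeat on the front end for load balancing, mainly Apache/Nginx + Tomcat on the back end, Redis and Memcached for caching, and Oracle and MySQL as databases.
Purpose of the script:
Check whether the Tomcat services are healthy by inspecting their logs.
Approach: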
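Check each Tomcat instance's current connection count and its number of connections to the database, and restart the application if either exceeds its configured limit. (The script approximates the connection count by the number of threads in the Tomcat JVM, since ps -eLf prints one line per thread.)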
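The database-connection part of this check can be sketched as two small shell functions. count_db_conns and over_limit are hypothetical names introduced here, not part of the original script, and the sketch assumes net-tools netstat -tunp output, where TCP rows have the foreign address in column 5 and "PID/Program name" in column 7. Anchoring the PID at the start of column 7 means an empty PID counts nothing and PID 4242 does not also match 14242.

```shell
#!/bin/bash
# count_db_conns PID DB_PORT (hypothetical helper)
# Reads `netstat -tunp` output on stdin and prints how many TCP sockets owned
# by PID have a foreign address ending in :DB_PORT.
#   $1 proto   $5 foreign address   $7 PID/Program name (TCP rows)
count_db_conns () {
    awk -v pid="$1" -v port="$2" '
        $1 ~ /^tcp/ && $5 ~ (":" port "$") && $7 ~ ("^" pid "/") { n++ }
        END { print n + 0 }'
}

# over_limit COUNT LIMIT (hypothetical helper)
# Succeeds (exit status 0) when COUNT is strictly greater than LIMIT,
# the same comparison the script makes with -gt before restarting.
over_limit () {
    [ "${1:-0}" -gt "${2:-0}" ]
}
```

For example, netstat -tunp | count_db_conns "$pid" 1521 counts one JVM's Oracle sessions. netstat only reports PIDs of other users' processes when run as root.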
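Error keywords are defined in the script in advance. Before each check, the script copies the last 10,000 lines of catalina.out to a temporary file (check.log) and searches that copy. If one of the specified errors is found, it triggers the corresponding action: it backs up catalina.out, empties it (so the same errors do not trigger the action again on the next run) and restarts Tomcat. All services are passed to the script through a configuration file, which makes batch deployment easy.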
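The snapshot, keyword-count and back-up-and-truncate cycle described above can be sketched as three shell functions. snapshot_log, count_keywords and rotate_log are hypothetical names for illustration; the keyword matching mirrors the grep -i -c calls in the script, so each count is a number of matching lines, not of occurrences.

```shell
#!/bin/bash
# snapshot_log LOG SNAP [N] (hypothetical helper)
# Copy the last N lines (default 10000) of LOG into SNAP.
snapshot_log () {
    tail -n "${3:-10000}" "$1" > "$2"
}

# count_keywords SNAP KEYWORD... (hypothetical helper)
# Print "count<TAB>keyword" for each keyword, counting matching lines case-insensitively.
count_keywords () {
    local snap=$1 kw
    shift
    for kw in "$@"; do
        printf '%s\t%s\n' "$(grep -i -c -- "$kw" "$snap")" "$kw"
    done
}

# rotate_log LOG (hypothetical helper)
# Back up LOG with a timestamp suffix, then truncate it in place. Truncating
# instead of deleting keeps the file Tomcat is still writing to.
rotate_log () {
    local stamp
    stamp=$(date +"%Y-%m-%d-%H:%M:%S")
    cp -- "$1" "$1-$stamp" && : > "$1"
}
```

A typical sequence is to call snapshot_log on catalina.out and count_keywords on the snapshot, then run rotate_log only when a count crosses its threshold.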
Script:
#!/bin/bash
# This shell script checks the Tomcat logs and connection status
# Created 2012-02-17
# Last changed 2013-04-26
source ~/.bash_profile &>/dev/null      # load JAVA_HOME etc. for the init scripts
dir=/tol/app                            # Tomcat instances live in $dir/<instance>
dir2=/tol/script                        # this script, its config file and its logs
cd "$dir2" || exit 1
dt2=$(date +"%Y-%m-%d")
mkdir -p "$dir2/logs"
log="$dir2/logs/tomcat-check-$dt2.log"
# First IPv4 address; parses the older net-tools "inet addr:" output format
host=$(/sbin/ifconfig | grep "inet addr" | cut -d ':' -f2 | awk '{print $1}' | head -1)
conf=$dir2/tomcat-check.conf
sh_name=$0

# shijian ("time"): refresh the timestamp in $dt
function shijian () {
    dt=$(date +"%Y-%m-%d-%H:%M:%S")
}

# pid: set $pid to the PID(s) of the current instance's java process (column 4 of ps -elf)
function pid () {
    pid=$(ps -elf | grep java | grep "$dir" | grep "$tom" | awk '{print $4}')
}

# tom_restart: run the configured stop command, kill -9 whatever is left,
# clear the compiled-JSP and temp directories, then run the start command
function tom_restart () {
    $stop_cmd                   # unquoted on purpose: "/etc/init.d/x stop" -> command + argument
    sleep 3
    pid
    if [ -n "$pid" ]; then
        kill -9 $pid            # unquoted: may hold several PIDs
    fi
    rm -rf "$dir/$tom"/work/Catalina/*
    rm -rf "$dir/$tom"/temp/*
    $start_cmd
}

# clean: back up catalina.out with a timestamp suffix, then truncate it
# (\cp bypasses a "cp -i" alias)
function clean () {
    \cp "$catalina2" "$catalina2-$dt" && > "$catalina2"
}

# chongqi ("restart"): restart the instance and refresh $dt
function chongqi () {
    tom_restart
    shijian
}

# report_restart REASON: look up the new PID, then log and mail whether the restart worked
function report_restart () {
    pid
    if [ -n "$pid" ]; then
        echo "$dt $tom $1 and restart by $sh_name newpid=$pid" >> "$log"
        echo "$dt $host $tom $1 and restart by $sh_name" | mail -s "$host $tom will restart" jiank
    else
        echo "$dt $tom $1 and restart fail by $sh_name" >> "$log"
        echo "$dt $host $tom $1 and restart fail by $sh_name" | mail -s "check $host $tom" jiank
    fi
}

#conf-check
if [ ! -f "$conf" ]; then
    echo "The $conf does not exist"
    exit 1
fi
shijian
echo "$dt" >> "$log"
echo "$host" >> "$log"

# One instance per line: name;stop command;start command;thread limit;DB port;DB connection limit;
# The file is read on fd 3 so that commands run inside the loop cannot swallow its lines.
while IFS=';' read -r tom stop_cmd start_cmd con_num db_port db_num _ <&3
do
    [ -z "$tom" ] && continue
    shijian
    catalina2=$dir/$tom/logs/catalina.out       # live log
    catalina=$dir/$tom/logs/check.log           # snapshot of its last 10,000 lines
    tail -n 10000 "$catalina2" > "$catalina"

    #check-logsize: at 2000 MB or more, truncate catalina.out, alert, and skip the other checks
    shijian
    size=$(du -m "$catalina2" | awk '{print $1}')
    if [ "${size:-0}" -ge 2000 ]; then
        > "$catalina2"
        echo "The $dt $host $catalina2 is too big, size=${size}M" | mail -s "$host $tom catalina.out" jiank
        continue
    fi
    echo "$dt check__$host $tom $catalina2 size__is done" >> "$log"

    #check-logkeyword
    for check in "OutOfMemoryError" "Too many open files" "Heap at VM Abort" "Cannot get a connection" "timeout"
    do
        keyword=$(grep -i -c "$check" "$catalina")
        if [ "$keyword" -gt 1 ]; then
            if [ "$check" = timeout ]; then
                # timeouts are only acted on when there are more than 500 of them
                if [ "$keyword" -gt 500 ]; then
                    clean
                    chongqi
                    pid                 # refresh $pid: otherwise it still holds the killed PID
                    echo "$dt $tom $check and restart by $sh_name newpid=$pid" >> "$log"
                    echo "$host $tom $check by $sh_name" | mail -s "$host $tom will check!!" jiank
                fi
            else
                clean
                chongqi
                report_restart "$check"
            fi
        else
            echo "$dt check__$host $tom ${check}__is done" >> "$log"
        fi
    done

    #check Oracle Io Exception
    io=$(grep -i "Connection timed out" "$catalina" | grep -i -c "java.sql.SQLException: Io")
    if [ "$io" != "0" ]; then
        clean
        chongqi
        report_restart "Oracle Io Exception"
    else
        echo "$dt check__$host $tom Oracle Io Exception__is done" >> "$log"
    fi

    #check process connect numbers (ps -eLf prints one line per thread, so this counts JVM threads)
    connect=$(ps -eLf | grep java | grep "$dir" | grep -c "$tom")
    if [ "$connect" -gt "$con_num" ]; then
        chongqi
        report_restart "connect more than $con_num proc"
    else
        echo "$dt check__$host $tom connect more than $con_num proc__is done" >> "$log"
    fi

    #check db connect numbers (skipped when the DB port field is empty)
    if [ -n "$db_port" ]; then
        pid=$(ps -elf | grep java | grep "$dir" | grep "$tom" | tail -n 1 | awk '{print $4}')
        dbconno=0
        if [ -n "$pid" ]; then
            # netstat -p prints "PID/Program name"; matching " $pid/" avoids counting every
            # line when $pid is empty, or other PIDs that merely contain $pid
            dbconno=$(netstat -tunp | grep ":$db_port " | grep -c " $pid/")
        fi
        if [ "$dbconno" -gt "$db_num" ]; then
            chongqi
            report_restart "db connect more than $db_num"
        else
            echo "$dt check__$host $tom db connect numbers more than $db_num __is done" >> "$log"
        fi
    fi
done 3< "$conf"
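tom_restart waits a fixed 3 seconds after the stop command and then sends kill -9 to anything still running. A gentler alternative is sketched below as a hypothetical helper, stop_pid, which is not part of the original script: it sends SIGTERM first, polls with kill -0 once a second, and falls back to SIGKILL only if the process outlives the timeout. The 30-second default is an assumption; tune it to how long your instances take to shut down.

```shell
#!/bin/bash
# stop_pid PID [TIMEOUT] (hypothetical helper)
# Returns 0 if the process exited after SIGTERM (or was already gone),
# 1 if it had to be killed with SIGKILL.
stop_pid () {
    local pid=$1 timeout=${2:-30} i
    kill -TERM "$pid" 2>/dev/null || return 0      # no such process
    for ((i = 0; i < timeout; i++)); do
        kill -0 "$pid" 2>/dev/null || return 0     # exited cleanly
        sleep 1
    done
    kill -KILL "$pid" 2>/dev/null                  # still alive: force it
    return 1
}
```

Inside tom_restart, the sleep 3 / kill -9 pair could then become something like: pid; for p in $pid; do stop_pid "$p" 30; done.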
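Configuration file (/tol/script/tomcat-check.conf), one Tomcat instance per line. The semicolon-separated fields are: instance name (its directory under /tol/app); stop command; start command; maximum connection (thread) count; database port; maximum number of database connections on that port. If the database port is left empty, the database check is skipped.
tomcata;/etc/init.d/tomcata stop;/etc/init.d/tomcata start;1000;1521;200;
tomcatb;/etc/init.d/tomcatb stop;/etc/init.d/tomcatb start;1000;1521;200;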
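Because every instance is described by one line of this file, a quick validation pass can catch typos before the file is deployed in bulk. The sketch below is a hypothetical helper, parse_conf, that reads the fields in the same order and with the same IFS=';' splitting as the check script; skipping blank lines and requiring numeric limits are assumptions of this sketch, not rules stated in the original.

```shell
#!/bin/bash
# parse_conf FILE (hypothetical validator for tomcat-check.conf)
# Field order, as the check script reads it:
#   1 instance  2 stop command  3 start command  4 thread limit  5 DB port  6 DB connection limit
parse_conf () {
    local tom stop_cmd start_cmd con_num db_port db_num rest
    # "|| [ -n "$tom" ]" also processes a last line that has no trailing newline
    while IFS=';' read -r tom stop_cmd start_cmd con_num db_port db_num rest || [ -n "$tom" ]; do
        [ -z "$tom" ] && continue                  # skip blank lines
        if [ -z "$stop_cmd" ] || [ -z "$start_cmd" ]; then
            echo "$tom: missing stop/start command" >&2
            return 1
        fi
        if ! [[ $con_num =~ ^[0-9]+$ ]]; then
            echo "$tom: thread limit '$con_num' is not a number" >&2
            return 1
        fi
        # the DB fields are optional, but a port needs a numeric connection limit
        if [ -n "$db_port" ] && ! [[ $db_num =~ ^[0-9]+$ ]]; then
            echo "$tom: DB connection limit '$db_num' is not a number" >&2
            return 1
        fi
        echo "$tom threads>$con_num db_port=${db_port:-none} db_conns>${db_num:-none}"
    done < "$1"
}
```

Running parse_conf /tol/script/tomcat-check.conf on the two example lines prints one summary line per instance and exits 0; a malformed line makes it print the problem to stderr and return 1.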
Reposted from: https://blog.51cto.com/rmeos/1423420