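Preface
Over the past two days I studied introductory shell programming and got a basic grasp of awk, sed, and grep, the "three musketeers" of text processing. Today I am working through a few scripts written by experienced engineers to consolidate what I learned, and recording them here so they are easy to find later.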
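Statistics scripts for production environments
1. Script for collecting device asset details
You can wrap the shell script in Python or Go, capture its output, and expose it through a backend API in a more presentable data format such as JSON.
Script contents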
#! /bin/bash
####get cpu info ####
cpu_num=`cat /proc/cpuinfo| grep "physical id"| sort| uniq| wc -l`
cpu_sum=`cat /proc/cpuinfo| grep processor| wc -l`
cpu_hz=`cat /proc/cpuinfo| grep "model name"| uniq -c| awk '{print $NF}'`
####get mem info####
# Assumes dmidecode reports module sizes in MB; newer versions may print GB instead
mem_m=0
for i in `dmidecode -t memory| grep Size: |grep -v "No Module Installed" |awk '{print $2}'`;
do
mem_m=`expr $mem_m + $i`
done
mem_sum=`echo $mem_m / 1024 | bc`
####get nic info####
wan_num=`lspci | grep Ethernet | grep -E "0-Gigabit |10 Gigabi" | wc -l`
####get disk num####
B=`date +%s`
ssd_num=0
sata_num=0
for i in `lsblk | grep "disk" | awk '{print $1}' | grep -Ev "ram" | sort`;
do
code=`cat /sys/block/$i/queue/rotational`
if [ "$code" = "0" ];then
ssd_num=`expr $ssd_num + 1` && echo $i >>/tmp/$B.ssd
else
sata_num=`expr $sata_num + 1` && echo $i >>/tmp/$B.sata
fi
done
####get disk sum####
C=`date +%N`
ssd_sum=0
sata_sum=0
if [ -f /tmp/$B.ssd ];then
for n in `cat /tmp/$B.ssd`;
do
fdisk -l /dev/$n >>/tmp/$C.ssd 2>&1
for x in `grep "Disk /dev" /tmp/$C.ssd | awk '{print $3}'`;
do
u=`echo $x / 1 |bc`
done
ssd_sum=`expr $ssd_sum + $u + 1`
done
fi
if [ -f /tmp/$B.sata ];then
for m in `cat /tmp/$B.sata`;
do
fdisk -l /dev/$m >> /tmp/$C.sata 2>&1
for y in `grep "Disk /dev" /tmp/$C.sata | awk '{print $3}'`;
do
v=`echo $y / 1|bc`
done
sata_sum=`expr $sata_sum + $v + 1`
done
fi
####ip info####
ip=`ifconfig eth0 | grep inet | awk '{print $2}'`
####show dev info####
echo -n "$ip `hostname` "
echo -n "CPU (physical CPUs, logical cores, frequency): $cpu_num $cpu_sum $cpu_hz "
echo -n "Memory (GB): $mem_sum "
echo "SSD count: ${ssd_num} SSD capacity: ${ssd_sum}GB SATA count: ${sata_num} SATA capacity: ${sata_sum}GB"
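The note before this script mentions exposing the results as JSON. As a minimal sketch of that idea in plain bash, the collected variables could be printed as a flat JSON object that a Python or Go wrapper then parses. The helper names `json_escape` and `asset_json` are hypothetical and not part of the original script, and only backslashes and double quotes are escaped, which is assumed to be enough for values like these.

```shell
#!/bin/bash
# Hypothetical helper: escape backslashes and double quotes for a JSON string.
json_escape() {
    local s=$1
    s=${s//\\/\\\\}
    s=${s//\"/\\\"}
    printf '%s' "$s"
}

# Hypothetical helper: print alternating key/value arguments as one flat
# JSON object. Every value is emitted as a string.
asset_json() {
    local sep=""
    printf '{'
    while [ "$#" -ge 2 ]; do
        printf '%s"%s":"%s"' "$sep" "$(json_escape "$1")" "$(json_escape "$2")"
        sep=","
        shift 2
    done
    printf '}\n'
}
```

At the end of the asset script this could be called as `asset_json ip "$ip" cpu_num "$cpu_num" mem_gb "$mem_sum"` in place of the echo lines.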
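2. Checking that critical business processes are running
Check that exactly one instance of the business process is running. Critical services can also be handed over to Supervisord for supervision.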
#!/bin/bash
sync_redis_status=`ps aux | grep sync_redis.py | grep -v grep | wc -l`
if [ ${sync_redis_status} -ne 1 ];then
echo "Critical! sync_redis is Dead"
exit 2
else
echo "OK! sync_redis is Alive"
exit 0
fi
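The same check can be sketched with `pgrep`, which avoids the `grep -v grep` step because it never reports itself. The function names below are hypothetical. The `-c` (count) option is assumed to be available, as it is in procps-ng.

```shell
#!/bin/bash
# Hypothetical helper: count processes whose full command line matches the pattern.
count_procs() {
    pgrep -f -c -- "$1"
}

# Hypothetical helper: map a process count to a Nagios-style status.
# Exactly one instance is OK (0); anything else is critical (2).
status_for_count() {
    if [ "$1" -eq 1 ]; then
        return 0
    fi
    return 2
}
```

A check could then read `status_for_count "$(count_procs sync_redis.py)"`. Because `-f` matches the full command line, a checker script whose own name contains the pattern may count itself.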
3. Counting the machine's IP connections
#!/bin/bash
# The alert thresholds $1 (warning) and $2 (critical) can be adjusted to fit the actual workload
# e.g. $1=5 $2=10
ip_conns=`netstat -an | grep tcp | grep EST | wc -l`
messages=`netstat -ant | awk '/^tcp/ {++S[$NF]} END {for (a in S) print a,S[a]}' | tr -s '\n' ',' | sed -r 's/(.*),/\1\n/g'`
# A count equal to a threshold falls into the higher level, so every value maps to a status
if [[ $ip_conns -lt $1 ]];then
echo "$messages, OK -connect counts is $ip_conns"
exit 0
elif [[ $ip_conns -lt $2 ]];then
echo "$messages, Warning -connect counts is $ip_conns"
exit 1
else
echo "$messages, Critical -connect counts is $ip_conns"
exit 2
fi
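The `messages` line is the densest part of this script: awk tallies connections per TCP state, keyed on the last column. Here is a small sketch of that tally as a function that reads `netstat -ant` style lines from stdin, so it can be tried on sample input. The function name is hypothetical. `sort` and `paste` stand in for the original `tr`/`sed` step so that the output order is stable.

```shell
#!/bin/bash
# Hypothetical helper: read "netstat -ant"-style lines on stdin and print
# "STATE COUNT" pairs joined by commas, sorted by state name.
tally_tcp_states() {
    awk '/^tcp/ { ++count[$NF] }    # $NF is the last column, the TCP state
         END    { for (s in count) print s, count[s] }' |
        sort |
        paste -sd, -
}
```

On a live machine it would be fed with `netstat -ant | tally_tcp_states`.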
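Monitoring scripts for production environments
1. Script for monitoring the Nginx process on an Nginx load balancer
The system uses an Nginx + keepalived architecture. The script checks Nginx every 5 seconds. If Nginx is down and a restart attempt does not bring it back, the script stops keepalived on the local machine so that the VIP fails over to the backup Nginx load balancer.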
#!/bin/bash
while :
do
nginxpid=`ps -C nginx --no-header | wc -l`
if [[ $nginxpid -eq 0 ]];then
ulimit -SHn 65535
/usr/local/nginx/sbin/nginx
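sleep 5
# Re-check after the restart attempt; the earlier count is stale
nginxpid=`ps -C nginx --no-header | wc -l`
if [[ $nginxpid -eq 0 ]];then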
/etc/init.d/keepalived stop
fi
fi
sleep 5
done
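The "restart, wait, check again, then fail over" pattern can be factored into a small retry helper. This is only a sketch: the names `retry_check` and `nginx_running` are hypothetical, and `pgrep -x nginx` is assumed to be an acceptable way to detect the process on your system.

```shell
#!/bin/bash
# Hypothetical helper: run a check command up to TRIES times, sleeping DELAY
# seconds between attempts. Succeeds as soon as the check succeeds.
retry_check() {
    local tries=$1 delay=$2
    shift 2
    local i
    for ((i = 1; i <= tries; i++)); do
        if "$@"; then
            return 0
        fi
        # Do not sleep after the final failed attempt.
        if ((i < tries)); then
            sleep "$delay"
        fi
    done
    return 1
}

# Example check: true when at least one process named exactly "nginx" exists.
nginx_running() {
    pgrep -x nginx > /dev/null
}
```

After a restart attempt, the failover step could then be written as `retry_check 3 5 nginx_running || /etc/init.d/keepalived stop`.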
2. Script for checking the open-file limit
Show the maximum number of open files allowed for each Nginx process:
#!/bin/bash
for pid in `ps aux | grep nginx | grep -v grep | awk '{print $2}'`
do
cat /proc/${pid}/limits | grep 'Max open files'
done
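Each `Max open files` line carries both a soft and a hard limit. A small sketch that extracts the two numbers, using a hypothetical function that reads the contents of a limits file from stdin:

```shell
#!/bin/bash
# Hypothetical helper: read the contents of a /proc/<pid>/limits file on stdin
# and print the soft and hard "Max open files" limits.
open_file_limits() {
    awk '/^Max open files/ { print "soft=" $4, "hard=" $5 }'
}
```

Inside the loop above it would be called as `open_file_limits < /proc/${pid}/limits`.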
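3. Checking the machine's CPU utilization
The check reports the user, system, iowait, and idle values.
Usage: ./check_cpu_utili.sh [-w <user,system,iowait>] [-c <user,system,iowait>] [-i <interval in seconds>] [-n <report number>]
Run the script with -h to print its usage hints.
Example: set the warning thresholds with -w. The script prints the CPU usage figures, and an exit status of 1 indicates that CPU usage has reached the warning level.
Set the critical thresholds with -c. The script prints the CPU usage figures, and an exit status of 2 indicates that CPU usage has reached the critical level.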
#!/bin/bash
# CPU Utilization Statistics plugin for Nagios
#
# USAGE : ./check_cpu_utili.sh [-w <user,system,iowait>] [-c <user,system,iowait>] ([-i <intervals in second>] [-n <report number>])
#
# Example: ./check_cpu_utili.sh
#          ./check_cpu_utili.sh -w 70,40,30 -c 90,60,40
#          ./check_cpu_utili.sh -w 70,40,30 -c 90,60,40 -i 3 -n 5
# Paths to commands used in this script. These may have to be modified to match your system setup.
IOSTAT="/usr/bin/iostat"
# Nagios return codes
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3
# Plugin parameters value if not define
LIST_WARNING_THRESHOLD="70,40,30"
LIST_CRITICAL_THRESHOLD="90,60,40"
INTERVAL_SEC=1
NUM_REPORT=1
# Plugin variable description
PROGNAME=$(basename $0)
if [[ ! -x $IOSTAT ]];then
    echo "UNKNOWN: iostat not found or is not executable by the nagios user."
    exit $STATE_UNKNOWN
fi
# The caller decides the exit status, so print_usage itself does not exit
print_usage() {
    echo ""
    echo "$PROGNAME $RELEASE - CPU Utilization check script for Nagios"
    echo ""
    echo "Usage: check_cpu_utili.sh -w <user,system,iowait> -c <user,system,iowait> ([-i <interval>] [-n <report number>])"
    echo ""
    echo " -w Warning threshold in % for warn_user,warn_system,warn_iowait CPU (default: 70,40,30)"
    echo "    Exit with WARNING status if cpu exceeds warn_n"
    echo " -c Critical threshold in % for crit_user,crit_system,crit_iowait CPU (default: 90,60,40)"
    echo "    Exit with CRITICAL status if cpu exceeds crit_n"
    echo " -i Interval in seconds for iostat (default: 1)"
    echo " -n Number report for iostat (default: 1)"
    echo " -h Show this page"
    echo ""
    echo "Usage: $PROGNAME"
    echo "Usage: $PROGNAME --help"
    echo ""
}
print_help() {
    print_usage
    echo ""
    echo "This plugin will check cpu utilization (user,system,CPU Iowait in %)"
    echo ""
    exit 0
}
# print_release is called below but missing from this listing; this minimal
# version is an assumption. RELEASE is never set here and expands to nothing.
print_release() {
    echo "$PROGNAME $RELEASE"
}
# Parse parameters
while [[ "$#" -gt "0" ]]
do
    case "$1" in
        -h | --help)
            print_help
            exit $STATE_OK
            ;;
        -v | --version)
            print_release
            exit $STATE_OK
            ;;
        -w | --warning)
            shift
            LIST_WARNING_THRESHOLD=$1
            ;;
        -c | --critical)
            shift
            LIST_CRITICAL_THRESHOLD=$1
            ;;
        -i | --interval)
            shift
            INTERVAL_SEC=$1
            ;;
        -n | --number)
            shift
            NUM_REPORT=$1
            ;;
        *)
            echo "Unknown argument: $1"
            print_usage
            exit $STATE_UNKNOWN
            ;;
    esac
    shift
done
# List to Table for warning threshold (compatibility with
TAB_WARNING_THRESHOLD=(`echo $LIST_WARNING_THRESHOLD | sed 's/,/ /g'`)
if [[ "${#TAB_WARNING_THRESHOLD[@]}" -ne "3" ]];then
    echo "ERROR : Bad count parameter in Warning threshold"
    exit $STATE_WARNING
else
    USER_WARNING_THRESHOLD=`echo ${TAB_WARNING_THRESHOLD[0]}`
    SYSTEM_WARNING_THRESHOLD=`echo ${TAB_WARNING_THRESHOLD[1]}`
    IOWAIT_WARNING_THRESHOLD=`echo ${TAB_WARNING_THRESHOLD[2]}`
fi
# List to Table for critical threshold
TAB_CRITICAL_THRESHOLD=(`echo $LIST_CRITICAL_THRESHOLD | sed 's/,/ /g'`)
if [[ "${#TAB_CRITICAL_THRESHOLD[@]}" -ne "3" ]];then
    echo "ERROR: Bad count parameter in CRITICAL Threshold"
    exit $STATE_WARNING
else
    USER_CRITICAL_THRESHOLD=`echo ${TAB_CRITICAL_THRESHOLD[0]}`
    SYSTEM_CRITICAL_THRESHOLD=`echo ${TAB_CRITICAL_THRESHOLD[1]}`
    IOWAIT_CRITICAL_THRESHOLD=`echo ${TAB_CRITICAL_THRESHOLD[2]}`
fi
# Reject the thresholds if any warning value is not below its critical value
if [[ "${TAB_WARNING_THRESHOLD[0]}" -ge "${TAB_CRITICAL_THRESHOLD[0]}" || "${TAB_WARNING_THRESHOLD[1]}" -ge "${TAB_CRITICAL_THRESHOLD[1]}" || "${TAB_WARNING_THRESHOLD[2]}" -ge "${TAB_CRITICAL_THRESHOLD[2]}" ]];then
    echo "ERROR: Critical CPU Threshold lower as Warning CPU Threshold"
    exit $STATE_WARNING
fi
# This matches the iostat output format on Alibaba Cloud; adjust the extraction for your environment.
# Note: line 3 is the first report, which iostat computes as the average since boot.
CPU_REPORT=`$IOSTAT -c ${INTERVAL_SEC} ${NUM_REPORT} | sed '/^$/d'| sed -n "3p"`
CPU_USER=`echo $CPU_REPORT | awk '{print $1}'`
CPU_SYSTEM=`echo $CPU_REPORT | awk '{print $3}'`
CPU_IOWAIT=`echo $CPU_REPORT | awk '{print $4}'`
CPU_STEAL=`echo $CPU_REPORT | awk '{print $5}'`
CPU_IDLE=`echo $CPU_REPORT | awk '{print $6}'`
NAGIOS_STATUS="user=${CPU_USER}%, system=${CPU_SYSTEM}%, iowait=${CPU_IOWAIT}%, idle=${CPU_IDLE}%"
NAGIOS_DATA="CpuUser=${CPU_USER};${TAB_WARNING_THRESHOLD[0]};${TAB_CRITICAL_THRESHOLD[0]};0"
CPU_USER_MAJOR=`echo $CPU_USER | cut -d "." -f 1`
CPU_SYSTEM_MAJOR=`echo $CPU_SYSTEM | cut -d "." -f 1`
CPU_IOWAIT_MAJOR=`echo $CPU_IOWAIT | cut -d "." -f 1`
CPU_IDLE_MAJOR=`echo $CPU_IDLE | cut -d "." -f 1`
# return
if [[ "${CPU_USER_MAJOR}" -ge "${USER_CRITICAL_THRESHOLD}" ]];then
    echo "CPU STATISTICS CRITICAL: ${NAGIOS_STATUS} | CPU_USER=${CPU_USER}%;70;90;0;100"
    exit $STATE_CRITICAL
elif [[ "${CPU_SYSTEM_MAJOR}" -ge "${SYSTEM_CRITICAL_THRESHOLD}" ]];then
    echo "CPU STATISTICS CRITICAL: ${NAGIOS_STATUS} | CPU_USER=${CPU_USER}%;70;90;0;100"
    exit $STATE_CRITICAL
elif [[ "${CPU_IOWAIT_MAJOR}" -ge "${IOWAIT_CRITICAL_THRESHOLD}" ]];then
    echo "CPU STATISTICS CRITICAL: ${NAGIOS_STATUS} | CPU_USER=${CPU_USER}%;70;90;0;100"
    exit $STATE_CRITICAL
elif [[ "${CPU_USER_MAJOR}" -ge "${USER_WARNING_THRESHOLD}" ]] && [[ "${CPU_USER_MAJOR}" -lt "${USER_CRITICAL_THRESHOLD}" ]];then
    echo "CPU STATISTICS WARNING: ${NAGIOS_STATUS} | CPU_USER=${CPU_USER}%;70;90;0;100"
    exit $STATE_WARNING
elif [[ "${CPU_SYSTEM_MAJOR}" -ge "${SYSTEM_WARNING_THRESHOLD}" ]] && [[ "${CPU_SYSTEM_MAJOR}" -lt "${SYSTEM_CRITICAL_THRESHOLD}" ]];then
    echo "CPU STATISTICS WARNING: ${NAGIOS_STATUS} | CPU_USER=${CPU_USER}%;70;90;0;100"
    exit $STATE_WARNING
elif [[ "${CPU_IOWAIT_MAJOR}" -ge "${IOWAIT_WARNING_THRESHOLD}" ]] && [[ "${CPU_IOWAIT_MAJOR}" -lt "${IOWAIT_CRITICAL_THRESHOLD}" ]];then
    echo "CPU STATISTICS WARNING: ${NAGIOS_STATUS} | CPU_USER=${CPU_USER}%;70;90;0;100"
    exit $STATE_WARNING
else
    echo "CPU STATISTICS OK: ${NAGIOS_STATUS} | CPU_USER=${CPU_USER}%;70;90;0;100"
    exit $STATE_OK
fi
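Two steps in this plugin are easy to get wrong: splitting the comma-separated threshold list, and comparing fractional percentages from iostat against integer thresholds. The sketch below isolates both. The function names are hypothetical and not taken from the plugin, and `read -a` is used in place of the plugin's `sed` substitution.

```shell
#!/bin/bash
# Hypothetical helper: split "user,system,iowait" into the global array
# THRESHOLDS. Fails unless exactly three fields are present.
parse_thresholds() {
    local IFS=,
    read -r -a THRESHOLDS <<< "$1"
    [ "${#THRESHOLDS[@]}" -eq 3 ]
}

# Hypothetical helper: compare a (possibly fractional) percentage against
# integer warning ($2) and critical ($3) thresholds.
# Returns 0 (OK), 1 (WARNING) or 2 (CRITICAL).
level_for() {
    local whole=${1%%.*}    # drop the fractional part: 75.31 -> 75
    whole=${whole:-0}
    if [ "$whole" -ge "$3" ]; then
        return 2
    elif [ "$whole" -ge "$2" ]; then
        return 1
    fi
    return 0
}
```

For example, `parse_thresholds 70,40,30` fills `THRESHOLDS` with three values, and `level_for 75.31 70 90` returns 1 because 75 is at or above the warning threshold but below the critical one.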
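Operations development scripts for production environments
1. Script for controlling the number of concurrent shell processes
The script launches the run.py program and keeps the number of processes at 8.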
#!/bin/bash
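# Run this script every 5 minutes
CE_HOME=/data/ContentEngine
LOG_PATH=/data/logs
# Cap the number of processes at 8
MAX_SPIDER_COUNT=8
# Current number of processes
count=`ps -ef | grep -v grep | grep run.py | wc -l`
# The logic below keeps the number of run.py processes at 8 to make full use of the machine.
# To avoid an infinite loop, the while loop is also bounded by try_time instead of running unconditionally.
try_time=0
cd $CE_HOME
while [[ "$count" -lt "$MAX_SPIDER_COUNT" && "$try_time" -lt "$MAX_SPIDER_COUNT" ]];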
do
let try_time+=1
nohup python run.py >> ${LOG_PATH}/spider.log 2>&1 &
count=`ps -ef | grep -v grep | grep run.py | wc -l`
done
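When the tasks are started from the same shell rather than found through `ps`, the concurrency cap can instead be enforced with the shell's own job table. The following is a sketch of that alternative, not the approach used above. The function name is hypothetical, and `wait -n` is assumed to be available (bash 4.3 or later).

```shell
#!/bin/bash
# Hypothetical helper: run_bounded MAX CMD [ARGS...]
# Reads one item per line from stdin and runs "CMD ARGS... item" for each,
# keeping at most MAX background jobs running at a time.
run_bounded() {
    local max=$1
    shift
    local item
    while IFS= read -r item; do
        # Block until a slot frees up; the finished job's own status is ignored.
        while [ "$(jobs -rp | wc -l)" -ge "$max" ]; do
            wait -n || true
        done
        "$@" "$item" &
    done
    # Wait for the remaining jobs before returning.
    wait
}
```

For example, `run_bounded 8 python run.py < tasks.txt` would start one `python run.py <task>` per line of a hypothetical `tasks.txt`, never running more than eight at once.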