Problem: monitoring alert, load on the database server is too high
Troubleshooting:
1. Check with the top command
top - 09:21:16 up 71 days, 10:25, 7 users, load average: 54.12, 10.79, 20.94
Tasks: 1228 total, 3 running, 1225 sleeping, 0 stopped, 0 zombie
Cpu0 : 60.1%us, 12.5%sy, 0.0%ni, 14.5%id, 3.4%wa, 0.3%hi, 9.1%si, 0.0%st
Cpu1 : 0.7%us, 37.7%sy, 0.0%ni, 5.6%id, 56.1%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 2.3%us, 0.7%sy, 0.0%ni, 95.3%id, 1.7%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.3%us, 0.3%sy, 0.0%ni, 98.7%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 0.7%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 1.0%us, 0.3%sy, 0.0%ni, 96.7%id, 2.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 13.2%us, 7.0%sy, 0.0%ni, 77.8%id, 2.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9 : 0.3%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 0.7%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 0.7%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 : 0.3%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 0.3%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 0.3%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu16 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu17 : 0.0%us, 0.7%sy, 0.0%ni, 97.0%id, 2.3%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu18 : 0.3%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu19 : 0.0%us, 0.7%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu20 : 53.7%us, 11.8%sy, 0.0%ni, 27.2%id, 7.3%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu21 : 0.7%us, 18.5%sy, 0.0%ni, 72.6%id, 8.3%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu22 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu23 : 0.3%us, 1.3%sy, 0.0%ni, 98.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu24 : 48.9%us, 2.6%sy, 0.0%ni, 42.6%id, 5.9%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu25 : 0.7%us, 0.7%sy, 0.0%ni, 97.4%id, 1.3%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu26 : 0.3%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu27 : 0.3%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu28 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu29 : 0.0%us, 0.7%sy, 0.0%ni, 98.4%id, 1.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu30 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu31 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu32 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu33 : 0.3%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu34 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu35 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu36 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu37 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu38 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu39 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 131905616k total, 128930896k used, 2974720k free, 152424k buffers
Swap: 16383992k total, 4172316k used, 12211676k free, 112751428k cached
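Analysis: the CPUs are mostly idle (most of the 40 cores show 95% idle or more), while memory usage and I/O are high (128930896k of 131905616k memory used, 112751428k of it cached, and Cpu1 at 56.1% iowait). No SQL is currently running in the database, so what is driving the I/O this high? Judging by how this server has been used over the past while, a scheduled job is the likely cause.
The scheduled RMAN jobs all run in the early morning, but because the database is larger than 2 TB, the full backup takes too long. It was still running when the workday started, which pushed disk I/O up.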
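A high load average with nearly idle CPUs usually points to tasks blocked in uninterruptible sleep (state D, typically waiting on disk I/O), because Linux counts those tasks toward the load average. The following is a minimal sketch for checking that, assuming a Linux host with the procps version of ps; the function name count_d_state is made up for illustration.

```shell
#!/bin/sh
# Count tasks in uninterruptible sleep (state D) in a "ps" listing read
# from stdin. The first column of each line must be the process state.
count_d_state() {
    awk '$1 ~ /^D/ { n++ } END { print n + 0 }'
}
```

On a live system it could be fed with `ps -eo state=,pid=,comm= | count_d_state`. If the count is close to the load average while the CPUs stay idle, the load is most likely I/O wait rather than CPU demand.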
Solution:
1. Stop the scheduled job
ps -ef|grep backup
kill -9 process_id
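After the scheduled job was stopped, the load did not drop. It turned out that although the job's process had been killed, the RMAN tasks it started were still running. The following query finds their sessions and generates the matching kill commands: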
SELECT sid, spid, client_info ,'kill -9 '||spid
FROM v$process p, v$session s
WHERE p.addr = s.paddr
AND client_info LIKE '%rman%';
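As an OS-side cross-check of the query above, a small sketch that filters a process listing for RMAN. It assumes the RMAN client shows up with "rman" in its command line; the server processes that do the actual backup work usually appear under the Oracle server process name instead, so the v$session query is still needed to find those. The function name list_rman_procs is made up for illustration.

```shell
#!/bin/sh
# Keep only the lines of a "ps" listing (read from stdin) that mention rman.
# The [r] bracket stops grep from matching its own command line.
list_rman_procs() {
    grep -i '[r]man'
}
```

A possible invocation is `ps -eo pid,ppid,etime,args | list_rman_procs`, where the etime column shows how long each process has been running.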
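2. Kill the RMAN processes (run the kill -9 commands generated by the query above)
Result: the server load came down and everything returned to normal.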
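3. Change the backup strategy
The backup was killed, but we still cannot do without backups. Sigh.
Most of the data in this database is synchronized from other databases, so it is heavily redundant. After evaluation, we concluded that backing up only the necessary tablespaces is enough.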
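One way to back up only selected tablespaces is RMAN's BACKUP TABLESPACE command. The sketch below only generates the command file from a list of names. The function name build_rman_cmdfile and the tablespace names in the usage note are hypothetical, since the post does not say which tablespaces were kept.

```shell
#!/bin/sh
# Print an RMAN command file that backs up only the tablespaces
# given as arguments (joined into a comma-separated list).
build_rman_cmdfile() {
    if [ "$#" -eq 0 ]; then
        echo "usage: build_rman_cmdfile TABLESPACE..." >&2
        return 1
    fi
    list=$1
    shift
    for ts in "$@"; do
        list="$list, $ts"
    done
    printf 'RUN {\n  BACKUP TABLESPACE %s;\n}\n' "$list"
}
```

For example, `build_rman_cmdfile USERS APP_DATA > /tmp/ts_backup.rman` writes the file, which could then be run with `rman target / cmdfile=/tmp/ts_backup.rman`. Review the generated file before running it against a real database.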
From the ITPUB blog, link: http://blog.itpub.net/30443223/viewspace-2219889/. If you repost this article, please credit the source; otherwise legal liability will be pursued.