Last edited by oracle_mao on 2013-12-19 09:08
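Any DBA who manages RAC has surely seen one node get rebooted by another. There are many possible causes, but in most cases a resource on one node (CPU, memory, disk, network, and so on) runs into trouble and blocks communication between the nodes; to protect data integrity and keep RAC working as designed, the faulty node is then evicted from the cluster. Since I started maintaining RAC I have run into this many times: either the OS was rebooted or the database resources were restarted, sometimes in the evening, sometimes in the early hours, sometimes during the day. What follows is the fault analysis and solution for the most recent RAC incident.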
Time of the incident: 2013-12-10 at 21:24, when the first node of the RAC was rebooted (a 2-node RAC cluster).
1. Check the RAC alert log
2013-12-10 21:16:05.917
[/u01/11.2.0/grid/bin/oraagent.bin(4278)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER.lsnr'. Details at (:CRSAGF00113:) {1:43062:2} in /u01/11.2.0/grid/log/rac1/agent/crsd/oraagent_grid/oraagent_grid.log.
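---- A CRS-5818 error shows up here. RAC checks the state of every resource once a minute and reports an error when a check does not succeed (here the 'check' command was aborted). The listener resource appears to be checked first, which is why every one of our incidents starts with a failed listener check.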
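When an alert log is long, it can help to pull out only the entries for one message code. The following is a minimal Python sketch, not an Oracle tool: it assumes the layout seen in this post (a timestamp line followed by a message line that may be hard-wrapped mid-word), and the function names `parse_alert_log` and `entries_with_code` are made up for illustration. The sample text is copied from the log above.

```python
import re
from datetime import datetime

# A timestamp line as it appears in this alert log, e.g. "2013-12-10 21:16:05.917"
TS_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}$")
# The message code that follows the process tag, e.g. "CRS-5818:"
CODE_RE = re.compile(r"\b(CRS-\d+):")


def parse_alert_log(text):
    """Group the log into (timestamp, message) entries.

    Assumption: every entry starts with a timestamp line; all following
    lines up to the next timestamp belong to the same message and are
    rejoined without a separator, because the wrap falls mid-word.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if TS_RE.match(line):
            ts = datetime.strptime(line, "%Y-%m-%d %H:%M:%S.%f")
            entries.append([ts, ""])
        elif entries:
            entries[-1][1] += line
    return [(ts, msg) for ts, msg in entries]


def entries_with_code(entries, code):
    """Keep only the entries whose message carries the given CRS code."""
    result = []
    for ts, msg in entries:
        match = CODE_RE.search(msg)
        if match and match.group(1) == code:
            result.append((ts, msg))
    return result


SAMPLE = """\
2013-12-10 21:16:05.917
[/u01/11.2.0/grid/bin/oraagent.bin(4278)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER.lsnr'. Details at (:CRSAGF001
13:) {1:43062:2} in /u01/11.2.0/grid/log/rac1/agent/crsd/oraagent_grid/oraagent_grid.log.
2013-12-10 21:25:47.436
[ohasd(2989)]CRS-2112:The OLR service started on node rac1.
"""

if __name__ == "__main__":
    for ts, msg in entries_with_code(parse_alert_log(SAMPLE), "CRS-5818"):
        print(ts, msg[:90])
```

Annotation lines pasted into the log (such as the "----" remarks in this post) would be appended to the preceding message, so strip them before feeding a copy of the log to the parser.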
2013-12-10 21:16:05.491
[/u01/11.2.0/grid/bin/oraagent.bin(4278)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER_SCAN1.lsnr'. Details at (:CRSAGF00113:) {1:43062:2} in /u01/11.2.0/grid/log/rac1/agent/crsd/oraagent_grid/oraagent_grid.log.
2013-12-10 21:25:47.436
[ohasd(2989)]CRS-2112:The OLR service started on node rac1. -- by this point the reboot had already happened and completed
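One way to bracket the reboot is to look for the widest silence between consecutive log timestamps. This is a rough Python sketch under the assumption that the excerpt is complete for that period (lines may have been trimmed when the log was pasted); `largest_gap` is a hypothetical helper, and the timestamps are the first four from the log above.

```python
from datetime import datetime


def largest_gap(timestamps):
    """Return (gap, before, after) for the widest gap between consecutive entries.

    The input is sorted first, because entries in the log are not always
    written in strict time order.
    """
    parsed = sorted(
        datetime.strptime(t, "%Y-%m-%d %H:%M:%S.%f") for t in timestamps
    )
    pairs = zip(parsed, parsed[1:])
    before, after = max(pairs, key=lambda pair: pair[1] - pair[0])
    return after - before, before, after


STAMPS = [
    "2013-12-10 21:16:05.917",
    "2013-12-10 21:16:05.491",
    "2013-12-10 21:25:47.436",
    "2013-12-10 21:25:47.622",
]

if __name__ == "__main__":
    gap, before, after = largest_gap(STAMPS)
    print(f"no entries between {before} and {after} (gap {gap})")
```

The silent window only brackets the event: the node went down somewhere inside it, which is consistent with the 21:24 reboot time, but the gap by itself says nothing about the cause.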
2013-12-10 21:25:47.622
[ohasd(2989)]CRS-1301:Oracle High Availability Service started on node rac1.
2013-12-10 21:25:47.864
[ohasd(2989)]CRS-8017:location: /var/opt/oracle/lastgasp has 34 reboot advisory log files, 0 were announced and 0 errors occurred
2013-12-10 21:25:52.660
[/u01/11.2.0/grid/bin/oraagent.bin(3270)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/11.2.0/grid/log/rac1/agent/ohasd/oraagent_grid/oraagent_grid.log"
2013-12-10 21:26:06.605
[gpnpd(3622)]CRS-2328:GPNPD started on node rac1.
2013-12-10 21:26:09.620
[cssd(3640)]CRS-1713:CSSD daemon is started in clustered mode
2013-12-10 21:26:11.437
[ohasd(2989)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2013-12-10 21:27:28.324
[cssd(3640)]CRS-1707:Lease acquisition for node rac1 number 1 completed
2013-12-10 21:27:29.718
[