Troubleshooting an Oracle RAC Cluster That Would Not Start

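A while ago I went to a customer site to deal with a RAC cluster that would not start. The background: while configuring the servers' connection to the tape drive, the engineer installing NBU found that node 1 could not detect the device. To get the tape drive recognized, the engineer asked for a system reboot, so the Oracle RAC cluster stack was shut down first. After the reboot, however, the cluster would not start. Troubleshooting did not reveal the cause, so the plan was to shut down the cluster on node 2 as well and then start both nodes together. In the end neither node would start.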

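A cluster can fail to start for many reasons, such as insufficient disk space or disk heartbeat problems. The grid alert log, shown below, indicated that the mdnsd process could not start.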

[ohasd(10850)]CRS-1301:Oracle High Availability Service started on node rh1.
2014-11-28 16:09:43.679
[ohasd(10850)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
2014-11-28 16:09:44.905
[/u01/app/11.2.0/grid/bin/orarootagent.bin(10894)]CRS-5016:Process "/u01/app/11.2.0/grid/bin/acfsload" spawned by agent "/u01/app/11.2.0/grid/bin/orarootagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/rh1/agent/ohasd/orarootagent_root/orarootagent_root.log"
2014-11-28 16:09:55.725
[mdnsd(10983)]CRS-5602:mDNS service stopping by request.
2014-11-28 16:11:55.604
[/u01/app/11.2.0/grid/bin/oraagent.bin(10971)]CRS-5818:Aborted command 'start' for resource 'ora.mdnsd'. Details at (:CRSAGF00113:) {0:0:2} in /u01/app/11.2.0/grid/log/rh1/agent/ohasd/oraagent_grid/oraagent_grid.log.
2014-11-28 16:11:59.616
[ohasd(10850)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.mdnsd'. Details at (:CRSPE00111:) {0:0:2} in /u01/app/11.2.0/grid/log/rh1/ohasd/ohasd.log.
2014-11-28 16:12:21.129
[mdnsd(11070)]CRS-5602:mDNS service stopping by request.
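
With a long alert log it can help to see which CRS message codes occur, and how often, before reading the entries one by one. The following is a minimal sketch and was not part of the original troubleshooting. The function name summarize_crs_codes is made up for illustration, and the alert log path in the usage comment assumes the usual 11.2 layout (log/<hostname>/alert<hostname>.log under the grid home), which should be verified on the actual system.

```shell
# Count how often each CRS-nnnn message code appears in text read from stdin.
# Prints one "CODE COUNT" pair per line, sorted by code.
summarize_crs_codes() {
    grep -o 'CRS-[0-9][0-9]*' | sort | uniq -c | awk '{ print $2, $1 }'
}

# Example (the file name is an assumption, see above):
# summarize_crs_codes < /u01/app/11.2.0/grid/log/rh1/alertrh1.log
```

A code that keeps repeating, such as CRS-5602 for mdnsd in the excerpt above, is a reasonable place to start digging.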

Looking further into oraagent_grid.log still did not pin down the cause:

[  clsdmc][1111406912]Fail to connect (ADDRESS=(PROTOCOL=ipc)(KEY=rh1DBG_MOND)) with status 9
2014-11-28 16:09:44.966: [ora.crf][1111406912] {0:0:2} [check] Error = error 9 encountered when connecting to MOND
2014-11-28 16:09:44.966: [ora.crf][1111406912] {0:0:2} [check] Calling PID check for daemon
2014-11-28 16:09:44.966: [ora.crf][1111406912] {0:0:2} [check] Process id 4862 translated to 
2014-11-28 16:09:45.081: [ COMMCRS][1080408384]clsc_connect: (0x2aaaac004a10) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=rh1DBG_CRSD))

[  clsdmc][1089182016]Fail to connect (ADDRESS=(PROTOCOL=ipc)(KEY=rh1DBG_CRSD)) with status 9
2014-11-28 16:09:45.082: [ora.crsd][1089182016] {0:0:2} [check] Error = error 9 encountered when connecting to CRSD
2014-11-28 16:09:45.084: [ora.crsd][1089182016] {0:0:2} [check] DaemonAgent::check returned 1
2014-11-28 16:09:45.085: [    AGFW][1109305664] {0:0:2} ora.crsd 1 1 state changed from: UNKNOWN to: OFFLINE
2014-11-28 16:09:45.085: [    AGFW][1109305664] {0:0:2} Agent sending last reply for: RESOURCE_PROBE[ora.crsd 1 1] ID 4097:107
2014-11-28 16:09:45.091: [    AGFW][1109305664] {0:0:2} Agent received the message: RESOURCE_DELETE[ora.crsd 1 1] ID 4358:177
2014-11-28 16:09:45.092: [    AGFW][1109305664] {0:0:2} Agent sending last reply for: RESOURCE_DELETE[ora.crsd 1 1] ID 4358:177
2014-11-28 16:09:45.092: [    AGFW][1109305664] {0:0:2} ora.crsd 1 1 marked as deleted.
2014-11-28 16:09:45.154: [ COMMCRS][1091283264]clsc_connect: (0x2aaab00033b0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=rh1DBG_CTSSD))

[  clsdmc][1113508160]Fail to connect (ADDRESS=(PROTOCOL=ipc)(KEY=rh1DBG_CTSSD)) with status 9
2014-11-28 16:09:45.155: [ COMMCRS][1119811904]clsc_connect: (0x2aaab0014bf0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=rh1DBG_MOND))

2014-11-28 16:09:45.155: [ora.ctssd][1113508160] {0:0:2} [check] Error = error 9 encountered when connecting to CTSSD
[  clsdmc][1111406912]Fail to connect (ADDRESS=(PROTOCOL=ipc)(KEY=rh1DBG_MOND)) with status 9
2014-11-28 16:09:45.156: [ora.crf][1111406912] {0:0:2} [check] Error = error 9 encountered when connecting to MOND
2014-11-28 16:09:45.158: [ora.ctssd][1113508160] {0:0:2} [check] Check return = 1, state detail = NULL
2014-11-28 16:09:45.160: [ora.crf][1111406912] {0:0:2} [check] Check return = 1, state detail = NULL
2014-11-28 16:09:45.161: [    AGFW][1109305664] {0:0:2} ora.ctssd 1 1 state changed from: UNKNOWN to: OFFLINE
2014-11-28 16:09:45.161: [    AGFW][1109305664] {0:0:2} Agent sending last reply for: RESOURCE_PROBE[ora.ctssd 1 1] ID 4097:109
2014-11-28 16:09:45.162: [    AGFW][1109305664] {0:0:2} ora.crf 1 1 state changed from: UNKNOWN to: OFFLINE
2014-11-28 16:09:45.162: [    AGFW][1109305664] {0:0:2} Agent sending last reply for: RESOURCE_PROBE[ora.crf 1 1] ID 4097:105
2014-11-28 16:09:45.168: [    AGFW][1089182016] {0:0:2} Agent has no resources to be monitored, Shutting down ..
2014-11-28 16:09:45.168: [    AGFW][1089182016] {0:0:2} Agent sending message to PE: AGENT_SHUTDOWN_REQUEST[Proxy] ID 20486:53
2014-11-28 16:09:45.169: [    AGFW][1109305664] {0:0:2} Agent received the message: RESOURCE_DELETE[ora.ctssd 1 1] ID 4358:179
2014-11-28 16:09:45.169: [    AGFW][1109305664] {0:0:2} Agent sending last reply for: RESOURCE_DELETE[ora.ctssd 1 1] ID 4358:179
2014-11-28 16:09:45.171: [    AGFW][1109305664] {0:0:2} Agent has no resources to be monitored, Shutting down ..
2014-11-28 16:09:45.171: [    AGFW][1109305664] {0:0:2} Agent sending message to PE: AGENT_SHUTDOWN_REQUEST[Proxy] ID 20486:56
2014-11-28 16:09:45.172: [    AGFW][1109305664] {0:0:2} ora.ctssd 1 1 marked as deleted.
2014-11-28 16:09:45.172: [    AGFW][1109305664] {0:0:2} Agent received the message: RESOURCE_DELETE[ora.crf 1 1] ID 4358:181
2014-11-28 16:09:45.172: [    AGFW][1109305664] {0:0:2} Agent sending last reply for: RESOURCE_DELETE[ora.crf 1 1] ID 4358:181
2014-11-28 16:09:45.173: [    AGFW][1109305664] {0:0:2} Agent has no resources to be monitored, Shutting down ..
2014-11-28 16:09:45.173: [    AGFW][1109305664] {0:0:2} Agent sending message to PE: AGENT_SHUTDOWN_REQUEST[Proxy] ID 20486:57
2014-11-28 16:09:45.175: [    AGFW][1109305664] {0:0:2} ora.crf 1 1 marked as deleted.
2014-11-28 16:09:45.175: [    AGFW][1109305664] {0:0:2} Agent is shutting down.
2014-11-28 16:09:45.175: [    AGFW][1109305664] {0:0:2} Agent is exiting with exit code: 1

Next I went to the /u01/app/11.2.0/grid/log/rh1/mdnsd directory and checked the logs there, which suggested a permissions problem:

2014-11-28 16:09:55.686: [ default][4228895824]mdnsd START pid=10983 
2014-11-28 16:09:55.723: [ COMMCRS][1088567616]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=rh1DBG_MDNSD))

2014-11-28 16:09:55.724: [  clsdmt][1085036864]Fail to listen to (ADDRESS=(PROTOCOL=ipc)(KEY=rh1DBG_MDNSD))
2014-11-28 16:09:55.724: [  clsdmt][1085036864]Terminating process
2014-11-28 16:09:55.724: [    MDNS][1085036864] clsdm requested mdnsd exit
2014-11-28 16:09:55.725: [    MDNS][1085036864] mdnsd exit
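
The "Permission denied" line above was found by reading the log manually. A quicker way to surface such lines is sketched below, assuming only that the daemon logs are plain text files under the directory named in the text. The function name find_permission_errors is hypothetical.

```shell
# Print every line containing "Permission denied" under a log directory,
# prefixed with the file name and line number.
find_permission_errors() {
    grep -rn 'Permission denied' "$1"
}

# Example, using the directory mentioned above:
# find_permission_errors /u01/app/11.2.0/grid/log/rh1/mdnsd
```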

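The investigation finally led to /var/tmp/.oracle, where the file ownership and permissions were as shown in the figure.


On a healthy RAC node, the entries in this directory should mostly be owned by grid, with a few owned by root, as in the following listing: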

srwxr-xr-x 1 grid oinstall 0 Nov 28 15:00 mdnsd
-rw-r--r-- 1 grid oinstall 5 Nov 28 15:00 mdnsd.pid
prw-r--r-- 1 root root     0 Nov 22 00:42 npohasd
srwxrwxrwx 1 grid oinstall 0 Nov 28 15:00 ora_gipc_gipcd_rh1
-rw-r--r-- 1 grid oinstall 0 Nov 28 10:41 ora_gipc_gipcd_rh1_lock
srwxrwxrwx 1 grid oinstall 0 Nov 28 15:00 ora_gipc_GPNPD_rh1
-rw-r--r-- 1 grid oinstall 0 Nov 22 00:44 ora_gipc_GPNPD_rh1_lock
srwxrwxrwx 1 root root     0 Nov 28 15:00 ora_gipc_srh1gridrhclusterCRFM_SIPC
-rw-r--r-- 1 root root     0 Nov 22 00:48 ora_gipc_srh1gridrhclusterCRFM_SIPC_lock
srwxrwxrwx 1 grid oinstall 0 Nov 22 02:46 s#5426.1
srwxrwxrwx 1 grid oinstall 0 Nov 22 02:46 s#5426.2
srwxrwxrwx 1 grid oinstall 0 Nov 28 11:34 s#5489.1
srwxrwxrwx 1 grid oinstall 0 Nov 28 11:34 s#5489.2
srwxrwxrwx 1 grid oinstall 0 Nov 28 11:34 s#5509.1
srwxrwxrwx 1 grid oinstall 0 Nov 28 11:34 s#5509.2
srwxrwxrwx 1 grid oinstall 0 Nov 28 12:17 s#5592.1
srwxrwxrwx 1 grid oinstall 0 Nov 28 12:17 s#5592.2
srwxrwxrwx 1 grid oinstall 0 Nov 28 12:17 s#5594.1
srwxrwxrwx 1 grid oinstall 0 Nov 28 12:17 s#5594.2
srwxrwxrwx 1 grid oinstall 0 Nov 26 13:30 s#5804.1
srwxrwxrwx 1 grid oinstall 0 Nov 26 13:30 s#5804.2
srwxrwxrwx 1 grid oinstall 0 Nov 26 13:30 s#5866.1
srwxrwxrwx 1 grid oinstall 0 Nov 26 13:30 s#5866.2
srwxrwxrwx 1 grid oinstall 0 Nov 22 00:51 s#6013.1
srwxrwxrwx 1 grid oinstall 0 Nov 22 00:51 s#6013.2
srwxrwxrwx 1 grid oinstall 0 Nov 22 02:14 s#6048.1
srwxrwxrwx 1 grid oinstall 0 Nov 22 02:14 s#6048.2
srwxrwxrwx 1 grid oinstall 0 Nov 22 02:14 s#6050.1
srwxrwxrwx 1 grid oinstall 0 Nov 22 02:14 s#6050.2
srwxrwxrwx 1 grid oinstall 0 Nov 28 14:15 s#6090.1
srwxrwxrwx 1 grid oinstall 0 Nov 28 14:15 s#6090.2
srwxrwxrwx 1 grid oinstall 0 Nov 28 14:15 s#6097.1
srwxrwxrwx 1 grid oinstall 0 Nov 28 14:15 s#6097.2
srwxrwxrwx 1 grid oinstall 0 Nov 22 00:59 s#7155.1
srwxrwxrwx 1 grid oinstall 0 Nov 22 00:59 s#7155.2
srwxrwxrwx 1 grid oinstall 0 Nov 28 15:05 sAevm
srwxrwxrwx 1 grid oinstall 0 Nov 28 15:05 sCevm
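
Comparing two long listings by eye is error-prone, so a small filter that prints only the entries with an unexpected owner may help. This is a sketch under assumptions: it relies on GNU find (for -printf), the function name list_unexpected_owners is made up, and the allowed owners grid and root are simply those seen in the healthy listing above. Other installations may use different OS users.

```shell
# List the entries directly under DIR whose owner is not in the allowed set.
# Usage: list_unexpected_owners DIR "owner1 owner2 ..."
# Output: one "OWNER PATH" line per offending entry, nothing if all is well.
list_unexpected_owners() {
    dir=$1
    allowed=$2
    # GNU find: %u is the owner name, %p is the path
    find "$dir" -mindepth 1 -maxdepth 1 -printf '%u %p\n' |
    awk -v allowed="$allowed" '
        BEGIN { n = split(allowed, a, " "); for (i = 1; i <= n; i++) ok[a[i]] = 1 }
        !($1 in ok) { print }
    '
}

# Example: list_unexpected_owners /var/tmp/.oracle "grid root"
```

This checks ownership only. Permission bits would still need to be compared against a healthy node.
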
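I suspected the ownership and permissions in this directory were the cause. They were in disarray on the failing node, so I decided to delete the files in the directory (as root).

Note: this directory is created when RAC is installed. On every subsequent cluster startup, the files are validated if they exist and regenerated if they do not. If they exist but have the wrong ownership or permissions, validation fails. Deleting them lets the cluster recreate them.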

cd /var/tmp/.oracle && rm -rf ./*
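
Since a recursive delete run from the wrong directory is hard to undo, a more defensive variant is sketched here. It is not what was run on site. The function name clear_socket_dir and the listing file path are hypothetical, and it should be run as root, ideally with the clusterware stack on that node already stopped.

```shell
# Save a listing of DIR to LISTING_FILE, then remove everything inside DIR.
# The directory itself is kept. Returns 1 if DIR is not a directory.
clear_socket_dir() {
    dir=$1
    listing=$2
    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
    # Keep a record of the old owners and modes for later comparison
    ls -la "$dir" > "$listing" || return 1
    # ${dir:?} aborts if dir is empty, so this can never expand to "/*"
    rm -rf "${dir:?}"/*
}

# Example: clear_socket_dir /var/tmp/.oracle /root/var_tmp_oracle.before.txt
```

The saved listing makes it possible to compare the regenerated files with the old ones after the cluster is back up.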

Make sure the cluster stack is completely stopped (as root):

 ./crsctl stop crs -f

Then start the cluster again, which succeeded this time:

 ./crsctl start crs
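
The start command returns before the whole stack is up, so it can be useful to poll until the cluster reports itself online. Below is a minimal sketch of such a retry loop. The helper names wait_for and crs_online are invented for this example, and the grep pattern is an assumption about the wording that crsctl check crs prints in 11.2, which should be confirmed against the real output.

```shell
# Run a command repeatedly until it succeeds.
# Usage: wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 on the first success, 1 if every attempt fails.
wait_for() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Hypothetical check: succeeds once "crsctl check crs" reports CRS as online.
# The message text matched here is an assumption, not taken from the source.
crs_online() {
    /u01/app/11.2.0/grid/bin/crsctl check crs |
        grep -q 'Cluster Ready Services is online'
}

# Example (as root): wait_for 30 10 crs_online
```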

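Note: if the restart still fails and no files are generated in /var/tmp/.oracle, check the output of ps -ef | grep ha. Here it showed several ohasd.bin processes running at once (listing below).

Reference: http://www.itpub.net/forum.php?mod=viewthread&tid=1781465 (post by jieyancai)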

root      4459     1  0 14:59 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run
root     10850     1  0 16:09 ?        00:00:01 /u01/app/11.2.0/grid/bin/ohasd.bin reboot
root     11169     1  0 16:18 ?        00:00:00 /u01/app/11.2.0/grid/bin/ohasd.bin reboot
root     11209     1  0 16:20 ?        00:00:00 /u01/app/11.2.0/grid/bin/ohasd.bin reboot
root     11250 11004  0 16:23 pts/5    00:00:00 tail -f /u01/app/11.2.0/grid/log/rh1/agent/ohasd/orarootagent_root/orarootagent_root.log
root     11255 11097  0 16:23 pts/6    00:00:00 tail -f /u01/app/11.2.0/grid/log/rh1/ohasd/ohasd.log
root     11260     1  0 16:24 ?        00:00:00 /u01/app/11.2.0/grid/bin/ohasd.bin reboot
root     11532     1  0 16:27 ?        00:00:00 /u01/app/11.2.0/grid/bin/ohasd.bin reboot
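
The grep ha filter also matches unrelated lines, such as the tail -f commands in the listing above. A narrower filter is sketched here, assuming the standard ps -ef column layout in which the command is the eighth field. The function name ohasd_pids is hypothetical.

```shell
# Read "ps -ef" output on stdin and print the PID of every process whose
# command (8th column) is an ohasd.bin binary.
ohasd_pids() {
    awk '$8 ~ /(^|\/)ohasd\.bin$/ { print $2 }'
}

# Example: ps -ef | ohasd_pids
```

Normally one would expect a single ohasd.bin per node, so more than one PID in the output suggests leftover processes from earlier start attempts.
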
kill -9 10850
kill -9 11260

Checking again, ohasd is now starting and the corresponding files have been generated in /var/tmp/.oracle:

root      4459     1  0 14:59 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run
root     11250 11004  0 16:23 pts/5    00:00:00 tail -f /u01/app/11.2.0/grid/log/rh1/agent/ohasd/orarootagent_root/orarootagent_root.log
root     11255 11097  0 16:23 pts/6    00:00:00 tail -f /u01/app/11.2.0/grid/log/rh1/ohasd/ohasd.log
root     11799  4459 34 16:35 ?        00:00:00 /u01/app/11.2.0/grid/bin/ohasd.bin restart

Check the /var/tmp/.oracle directory:

[root@rh1 tmp]# ll /var/tmp/.oracle
total 4
srwxr-xr-x 1 grid oinstall 0 Nov 28 16:35 mdnsd
-rw-r--r-- 1 grid oinstall 6 Nov 28 16:35 mdnsd.pid
prw-r--r-- 1 root root     0 Nov 28 16:18 npohasd
srwxrwxrwx 1 grid oinstall 0 Nov 28 16:36 ora_gipc_gipcd_rh1
-rw-r--r-- 1 grid oinstall 0 Nov 28 16:36 ora_gipc_gipcd_rh1_lock
srwxrwxrwx 1 grid oinstall 0 Nov 28 16:35 ora_gipc_GPNPD_rh1
-rw-r--r-- 1 grid oinstall 0 Nov 28 16:35 ora_gipc_GPNPD_rh1_lock
srwxrwxrwx 1 root root     0 Nov 28 16:36 ora_gipc_srh1gridrhclusterCRFM_SIPC
-rw-r--r-- 1 root root     0 Nov 28 16:36 ora_gipc_srh1gridrhclusterCRFM_SIPC_lock
srwxrwxrwx 1 grid oinstall 0 Nov 28 16:40 s#12669.1
srwxrwxrwx 1 grid oinstall 0 Nov 28 16:40 s#12669.2
srwxrwxrwx 1 grid oinstall 0 Nov 28 16:40 s#12686.1
srwxrwxrwx 1 grid oinstall 0 Nov 28 16:40 s#12686.2
srwxrwxrwx 1 grid oinstall 0 Nov 28 16:40 sAevm
srwxrwxrwx 1 grid oinstall 0 Nov 28 16:40 sCevm
......








