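1. Environment
Operating system: Linux RHEL 7.5
Database version: Oracle 11.2.0.4 RAC
Problem: Grid Infrastructure (GI) on node 1 fails to start
2. Analysis
2.1 First, determine which stage the cluster startup reached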
crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
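CRS-4535 only says that crsd cannot be reached; the most likely cause here is that the ohasd layer never came up. Either the init.ohasd script that starts the cluster was not invoked (on RHEL 7 it is launched by systemd rather than from /etc/inittab),
or the ohasd.bin daemon failed to start. This needs to be verified: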
[grid@rac1 ~]$ ps -ef|grep has
root 906 1 0 08:48 ? 00:00:00 /bin/sh /etc/init.d/init.ohasd run >/dev/null 2>&1 Type=simple
root 941 1 0 08:48 ? 00:00:00 /bin/sh /etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
grid 3751 3145 0 10:18 pts/0 00:00:00 grep --color=auto has
The init.ohasd script was invoked, but there is no ohasd.bin process, so the daemon failed to start. The question is why ohasd.bin fails to start.
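The check above can be made repeatable with a small filter. This is a minimal sketch, not an Oracle tool: `ohasd_state` is a hypothetical helper name, and it only looks for the strings `init.ohasd` and `ohasd.bin` in `ps -ef` output read from stdin.

```shell
#!/bin/sh
# Hypothetical helper: classify the ohasd layer from `ps -ef` output on stdin.
ohasd_state() {
    awk '
        /init\.ohasd/ { wrapper = 1 }   # the /etc/init.d/init.ohasd wrapper script
        /ohasd\.bin/  { daemon = 1 }    # the OHASD daemon itself
        END {
            if (daemon)       print "ohasd.bin running"
            else if (wrapper) print "init.ohasd running, ohasd.bin missing"
            else              print "init.ohasd not running"
        }'
}

# Usage on a live node:
#   ps -ef | ohasd_state
```

For the process list shown above, the helper reports the second case: the wrapper is present but the daemon is not.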
2.2 Check the ohasd log file
vi /u01/app/11.2.0/grid/log/rac1/ohasd/ohasd.log
2021-02-24 09:57:01.643: [ default][4158826304] OHASD Daemon Starting. Command string :restart
2021-02-24 09:57:01.644: [ default][4158826304] Initializing OLR
2021-02-24 09:57:01.647: [ OCROSD][4158826304] utopen:6m': failed in stat OCR file/disk /u01/app/11.2.0/grid/cdata/rac1.olr, errno=2, os err string=No such file or directory
2021-02-24 09:57:01.647: [ OCROSD][4158826304] utopen:7: failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2021-02-24 09:57:01.648: [ OCRRAW][4158826304] proprinit: Could not open raw device
2021-02-24 09:57:01.648: [ OCRAPI][4158826304] a_init:16!: Backend init unsuccessful : [26]
2021-02-24 09:57:01.648: [ CRSOCR][4158826304] OCR context init failure. Error: PROCL-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
2021-02-24 09:57:01.649: [ default][4158826304] Created alert : (:OHAS00106:) : OLR initialization failed, error: PROCL-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
2021-02-24 09:57:01.649: [ default][4158826304] [PANIC] OHASD exiting; Could not init OLR
2021-02-24 09:57:01.649: [ default][4158826304] Done.
The log shows that OHASD exits because it cannot open the OLR file rac1.olr (errno=2, No such file or directory).
Check whether rac1.olr exists:
[grid@rac1 cdata]$ ll
total 8
drwxr-xr-x 2 grid oinstall 6 Dec 7 16:28 localhost
drwxrwxr-x 2 grid oinstall 4096 Jan 11 07:52 rac-cluster
drwxr-xr-x 2 grid oinstall 108 Dec 10 16:36 rac1
The listing confirms that rac1.olr does not exist.
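The two manual steps (reading the path out of the log, then looking for the file) can be combined. The sketch below assumes the log contains a `failed in stat OCR file/disk <path>,` line in the format shown above; `olr_path_from_log` and `check_olr` are hypothetical helper names.

```shell
#!/bin/sh
# Print the first path reported by a "failed in stat OCR file/disk" log line.
olr_path_from_log() {
    sed -n 's/.*failed in stat OCR file\/disk \([^,]*\),.*/\1/p' "$1" | head -n 1
}

# Report whether the OLR file named in the log exists.
check_olr() {
    path=$(olr_path_from_log "$1")
    if [ -z "$path" ]; then
        echo "no OLR stat failure found in log"
        return 2
    fi
    if [ -e "$path" ]; then
        echo "present: $path"
    else
        echo "missing: $path"
        return 1
    fi
}

# Usage:
#   check_olr /u01/app/11.2.0/grid/log/rac1/ohasd/ohasd.log
```

On Linux the configured OLR location is normally also recorded in /etc/oracle/olr.loc, which is worth comparing against the path found in the log.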
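3. Resolution
A backup of the OLR file is created by default when the cluster is installed, so the file can be restored from the default backup location (here /u01/app/11.2.0/grid/cdata/rac1).
3.1 Check that a backup file exists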
[grid@rac1 rac1]$ ll
total 19932
-rwxrwxr-x 1 grid oinstall 6787072 Dec 7 17:08 backup_20201207_170836.olr
-rwxrwxr-x 1 grid oinstall 6811648 Dec 10 11:20 backup_20201210_112026.olr
-rw------- 1 root root 6811648 Dec 10 16:36 backup_20201210_163621.olr
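With several backups present, the most recent one is usually the one to restore. A minimal sketch for picking it, assuming the `backup_YYYYMMDD_HHMMSS.olr` naming seen in the listing above; `latest_olr_backup` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the backup with the latest timestamp in its file name.
# The names embed YYYYMMDD_HHMMSS, so the sorted glob order is chronological.
latest_olr_backup() {
    latest=
    for f in "$1"/backup_*.olr; do
        [ -e "$f" ] || continue   # the glob stays literal when nothing matches
        latest=$f                 # the last match in sorted order wins
    done
    [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Usage:
#   latest_olr_backup /u01/app/11.2.0/grid/cdata/rac1
```

The helper goes by file name rather than modification time, so a backup that was copied or touched later does not change the result.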
3.2 A backup file exists, so restore from it
[root@rac1 cdata]# touch rac1.olr
[root@rac1 bin]# ./ocrconfig -local -restore /u01/app/11.2.0/grid/cdata/rac1/backup_20201210_163621.olr
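Before running the restore as root, it can help to print the exact commands for review. The sketch below only echoes them and changes nothing; `olr_restore_plan` is a hypothetical helper name, and the final `ocrcheck -local` line is an added verification step that is not part of the original procedure.

```shell
#!/bin/sh
# Print (do not run) the restore commands for a given Grid home, node and backup.
olr_restore_plan() {
    grid_home=$1
    node=$2
    backup=$3
    olr="$grid_home/cdata/$node.olr"
    # The newest backup above is mode 600 and owned by root, so run this as root.
    if [ ! -r "$backup" ]; then
        echo "backup not readable: $backup" >&2
        return 1
    fi
    echo "touch $olr"
    echo "$grid_home/bin/ocrconfig -local -restore $backup"
    echo "$grid_home/bin/ocrcheck -local"   # assumed extra step: verify the restored OLR
}

# Usage:
#   olr_restore_plan /u01/app/11.2.0/grid rac1 \
#       /u01/app/11.2.0/grid/cdata/rac1/backup_20201210_163621.olr
```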
3.3 Restart the cluster
[root@rac1 bin]# ./crsctl start crs
-- After startup, check the cluster status
[root@rac1 bin]# ./crsctl stat res -t
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
ONLINE ONLINE rac1
ONLINE ONLINE rac2
ora.FRA.dg
ONLINE ONLINE rac1
ONLINE ONLINE rac2
ora.OCR.dg
ONLINE ONLINE rac1
ONLINE ONLINE rac2
ora.asm
ONLINE ONLINE rac1 Started
ONLINE ONLINE rac2 Started
ora.gsd
OFFLINE OFFLINE rac1
OFFLINE OFFLINE rac2
ora.net1.network
ONLINE ONLINE rac1
ONLINE ONLINE rac2
ora.ons
ONLINE ONLINE rac1
ONLINE ONLINE rac2
ora.registry.acfs
ONLINE ONLINE rac1
ONLINE ONLINE rac2
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE ONLINE rac2
ora.cvu
1 ONLINE ONLINE rac2
ora.oc4j
1 ONLINE ONLINE rac2
ora.orcl.db
1 ONLINE ONLINE rac1 Open
2 ONLINE ONLINE rac2 Open
ora.rac1.vip
1 ONLINE ONLINE rac1
ora.rac2.vip
1 ONLINE ONLINE rac2
ora.scan1.vip
1 ONLINE ONLINE rac2
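Scanning a long status table by eye is error-prone, so a filter that prints only the resources whose STATE is not ONLINE can help. This is a rough sketch that assumes the column layout of the `crsctl stat res -t` output shown above; `offline_resources` is a hypothetical helper name.

```shell
#!/bin/sh
# Read `crsctl stat res -t` output on stdin; print "resource server state"
# for every entry whose STATE column is not ONLINE.
offline_resources() {
    awk '
        NF == 0 { next }                              # blank lines
        /^ora\./ { name = $1; next }                  # resource name line
        /^-+$/ || /^NAME / || / Resources$/ { next }  # separators and headers
        {
            # Cluster resources carry a leading instance number before TARGET.
            i = ($1 ~ /^[0-9]+$/) ? 2 : 1
            if ($(i + 1) != "ONLINE") print name, $(i + 2), $(i + 1)
        }'
}

# Usage:
#   ./crsctl stat res -t | offline_resources
```

For the output above this lists only ora.gsd on rac1 and rac2. In 11.2 that resource is OFFLINE by default, so the result is consistent with a healthy cluster.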