Troubleshooting: Cluster Fails to Start - Diagnosing ohasd.bin Startup Failures

Case 1: OS is at the appropriate run level

The OS must be at the specified run level before CRS will try to start up. To find out at which run level the clusterware needs to come up:

1. OS versions older than Oracle Linux 6 (OL6) or Red Hat Linux 6 (RHEL6):

cat /etc/inittab|grep init.ohasd

h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null

2. Oracle Linux 6 (OL6) or Red Hat Linux 6 (RHEL6):

[root@host1 proc]# cat  /etc/init/oracle-ohasd.conf          

# Copyright (c) 2001, 2011, Oracle and/or its affiliates. All rights reserved.

# Oracle OHASD startup

start on runlevel [35]

stop  on runlevel [!35]

respawn

exec /etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null

3. Oracle Linux 7 (OL7) or Red Hat Linux 7 (RHEL7):

systemctl status oracle-ohasd.service

To find out the current run level:

who -r
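
A minimal check sketch (assumes a standard Linux host; systemctl applies only on systemd-based systems such as OL7/RHEL7):

runlevel                 # prints the previous and current run level, e.g. "N 3"
systemctl get-default    # on systemd systems; "multi-user.target" corresponds to run level 3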

Case 2: "init.ohasd run" is up

init.ohasd must be running before ohasd.bin can start. To check whether init.ohasd is running:

ps -ef|grep init.ohasd|grep -v grep

root      2279     1  0 18:14 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run

If it is not running, it can be started with:

 cd <location-of-init.ohasd>

 nohup ./init.ohasd run &
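
For example, assuming init.ohasd is in the usual /etc/init.d location (an assumption; verify the path on your system first):

ls -l /etc/init.d/init.ohasd     # confirm the script exists and is executable
cd /etc/init.d
nohup ./init.ohasd run &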

Case 3: Clusterware auto start is enabled (it is enabled by default)

To verify whether auto start is currently enabled:

$GRID_HOME/bin/crsctl config crs

By default, CRS is enabled for auto start upon node reboot. To enable it:

$GRID_HOME/bin/crsctl enable crs
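
When auto start is enabled, "crsctl config crs" typically prints something like:

CRS-4622: Oracle High Availability Services autostart is enabled.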

Case 4: Oracle Local Registry (OLR, $GRID_HOME/cdata/${HOSTNAME}.olr) is accessible and valid

Make sure the OLR is accessible and valid. Its location can be found with cat /etc/oracle/olr.loc:

[root@host1 oracle]# cat olr.loc

olrconfig_loc=/test_ora_rac/grid/crs_1/cdata/host1.olr

crs_home=/test_ora_rac/grid/crs_1

To check the OLR file itself:

ls -l $GRID_HOME/cdata/*.olr

-rw------- 1 root  oinstall 272756736 Feb  2 18:20 rac1.olr

If the OLR is inaccessible or corrupted, ohasd.log will likely contain messages similar to the following:

..

2010-01-24 22:59:10.470: [ default][1373676464] Initializing OLR

2010-01-24 22:59:10.472: [  OCROSD][1373676464]utopen:6m':failed in stat OCR file/disk /ocw/grid/cdata/rac1.olr, errno=2, os err string=No such file or directory

2010-01-24 22:59:10.472: [  OCROSD][1373676464]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory

2010-01-24 22:59:10.473: [  OCRRAW][1373676464]proprinit: Could not open raw device

2010-01-24 22:59:10.473: [  OCRAPI][1373676464]a_init:16!: Backend init unsuccessful : [26]

2010-01-24 22:59:10.473: [  CRSOCR][1373676464] OCR context init failure.  Error: PROCL-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]

2010-01-24 22:59:10.473: [ default][1373676464] OLR initalization failured, rc=26

2010-01-24 22:59:10.474: [ default][1373676464]Created alert : (:OHAS00106:) :  Failed to initialize Oracle Local Registry

2010-01-24 22:59:10.474: [ default][1373676464][PANIC] OHASD exiting; Could not init OLR

OR

..

2010-01-24 23:01:46.275: [  OCROSD][1228334000]utread:3: Problem reading buffer 1907f000 buflen 4096 retval 0 phy_offset 102400 retry 5

2010-01-24 23:01:46.275: [  OCRRAW][1228334000]propriogid:1_1: Failed to read the whole bootblock. Assumes invalid format.

2010-01-24 23:01:46.275: [  OCRRAW][1228334000]proprioini: all disks are not OCR/OLR formatted

2010-01-24 23:01:46.275: [  OCRRAW][1228334000]proprinit: Could not open raw device

2010-01-24 23:01:46.275: [  OCRAPI][1228334000]a_init:16!: Backend init unsuccessful : [26]

2010-01-24 23:01:46.276: [  CRSOCR][1228334000] OCR context init failure.  Error: PROCL-26: Error while accessing the physical storage

2010-01-24 23:01:46.276: [ default][1228334000] OLR initalization failured, rc=26

2010-01-24 23:01:46.276: [ default][1228334000]Created alert : (:OHAS00106:) :  Failed to initialize Oracle Local Registry

2010-01-24 23:01:46.277: [ default][1228334000][PANIC] OHASD exiting; Could not init OLR

OR

..

2010-11-07 03:00:08.932: [ default][1] Created alert : (:OHAS00102:) : OHASD is not running as privileged user

2010-11-07 03:00:08.932: [ default][1][PANIC] OHASD exiting: must be run as privileged user

OR

ohasd.bin comes up but the output of "crsctl stat res -t -init" shows no resource, and "ocrconfig -local -manualbackup" fails

OR

..

2010-08-04 13:13:11.102: [   CRSPE][35] Resources parsed

2010-08-04 13:13:11.103: [   CRSPE][35] Server [] has been registered with the PE data model

2010-08-04 13:13:11.103: [   CRSPE][35] STARTUPCMD_REQ = false:

2010-08-04 13:13:11.103: [   CRSPE][35] Server [] has changed state from [Invalid/unitialized] to [VISIBLE]

2010-08-04 13:13:11.103: [  CRSOCR][31] Multi Write Batch processing...

2010-08-04 13:13:11.103: [ default][35] Dump State Starting ...

..

2010-08-04 13:13:11.112: [   CRSPE][35] SERVERS:

:VISIBLE:address{{Absolute|Node:0|Process:-1|Type:1}}; recovered state:VISIBLE. Assigned to no pool

------------- SERVER POOLS:

Free [min:0][max:-1][importance:0] NO SERVERS ASSIGNED

2010-08-04 13:13:11.113: [   CRSPE][35] Dumping ICE contents...:ICE operation count: 0

2010-08-04 13:13:11.113: [ default][35] Dump State Done.

If the OLR is damaged, the solution is to restore a good backup of it with "ocrconfig -local -restore <ocr_backup_name>".

By default, the OLR is backed up to $GRID_HOME/cdata/$HOST/backup_$TIME_STAMP.olr once installation is complete.
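
A minimal restore sketch, run as root (the backup file name below is only an example; list the backup directory to find yours):

$GRID_HOME/bin/ocrcheck -local        # report OLR integrity
ls -l $GRID_HOME/cdata/$HOST/         # locate an automatic backup, e.g. backup_20100124_120000.olr
$GRID_HOME/bin/crsctl stop crs -f     # make sure the stack is down before restoring
$GRID_HOME/bin/ocrconfig -local -restore $GRID_HOME/cdata/$HOST/backup_20100124_120000.olr
$GRID_HOME/bin/crsctl start crs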

Case 5: ohasd.bin is able to access the network socket files

Make sure the network socket files are accessible; with the stack fully down they can be removed and will be recreated on restart (see the sketch at the end of this case). Network socket files can be located in /tmp/.oracle, /var/tmp/.oracle or /usr/tmp/.oracle. When they are inaccessible, ohasd.log shows errors like:

2010-06-29 10:31:01.570: [ COMMCRS][1206901056]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=procr_local_conn_0_PROL))

2010-06-29 10:31:01.571: [  OCRSRV][1217390912]th_listen: CLSCLISTEN failed clsc_ret= 3, addr= [(ADDRESS=(PROTOCOL=ipc)(KEY=procr_local_conn_0_PROL))]

2010-06-29 10:31:01.571: [  OCRSRV][3267002960]th_init: Local listener did not reach valid state

In a Grid Infrastructure cluster environment, ohasd-related socket files should be owned by root, but in an Oracle Restart environment they should be owned by the grid user; refer to the "Network Socket File Location, Ownership and Permission" section for example output.
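
A hedged cleanup sketch (only with the entire stack down; renaming instead of deleting keeps a fallback copy):

$GRID_HOME/bin/crsctl stop crs -f                          # ensure clusterware is fully down
ps -ef | grep -E 'ohasd|crsd|ocssd|evmd' | grep -v grep    # confirm nothing is left running
mv /var/tmp/.oracle /var/tmp/.oracle.bak                   # or /tmp/.oracle, depending on platform
$GRID_HOME/bin/crsctl start crs                            # socket files are recreated with the correct ownership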

Case 6: ohasd.bin is able to access the log file location

Symptom: the ohasd log directory ($GRID_HOME/log/<hostname>/ohasd) is inaccessible or missing.

OS messages/syslog shows:

Feb 20 10:47:08 racnode1 OHASD[9566]: OHASD exiting; Directory /ocw/grid/log/racnode1/ohasd not found.

Refer to the "Log File Location, Ownership and Permission" section for example output. If the expected directory is missing, create it with the proper ownership and permissions, for example:
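
A minimal sketch using the path from the error above (the ownership shown is an assumption; verify against a healthy node or the note's example output):

mkdir -p /ocw/grid/log/racnode1/ohasd
chown root:oinstall /ocw/grid/log/racnode1/ohasd
chmod 755 /ocw/grid/log/racnode1/ohasd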

Case 7: On SUSE Linux, ohasd may fail to start after a node reboot; refer to note 1325718.1 - OHASD not Starting After Reboot on SLES

Bug 12406573 is fixed in 11.2.0.3.

Case 8: OHASD fails to start; "ps -ef | grep ohasd.bin" shows ohasd.bin is started, but nothing is written to $GRID_HOME/log/<node>/ohasd/ohasd.log for many minutes. If both of the following symptoms appear, the cause is bug 11834289:

1. truss shows ohasd.bin looping, closing file handles that were never opened:

..

15058/1:         0.1995 close(2147483646)                               Err#9 EBADF

15058/1:         0.1996 close(2147483645)                               Err#9 EBADF

..

2. pstack shows the following:

[root@rac1 ~]# ps -ef|grep ohasd.bin

root      1919     1  1  2022 ?        2-02:24:48 /oracle/grid/crs_1/bin/ohasd.bin reboot

[root@rac1 ~]# pstack 1919  |grep closefiledescriptors

Call stack of ohasd.bin from pstack shows the following:

_close  sclssutl_closefiledescriptors  main ..

The cause is bug 11834289, which is fixed in 11.2.0.3 and above. Another symptom of the bug is that other clusterware processes may fail to start with the same call stack and truss output (looping on the OS call "close"). If the bug occurs while starting other resources, "CRS-5802: Unable to start the agent process" may show up as well.

Case 9: Other potential causes/solutions are listed in note 1069182.1 - OHASD Failed to Start: Inappropriate ioctl for device

Symptom: after an otherwise successful installation of 11gR2 Grid Infrastructure (aka CRS), root.sh fails with the following:

..

Creating trace directory

LOCAL ADD MODE

Creating OCR keys for user 'grid', privgrp 'oinstall'..

Operation successful.

CRS-4664: Node <node1> successfully pinned.

Adding daemon to inittab

ohasd failed to start: Inappropriate ioctl for device

ohasd failed to start: Inappropriate ioctl for device at /ocw/grid/crs/install/roothas.pl line 296.

Oracle root script execution aborted!

Subsequent crsctl commands report errors such as:

CRS-4639: Could not contact Oracle High Availability Services

OR

CRS-4124: Oracle High Availability Services startup failed

OR

CRS-0715: Oracle High Availability Service has timed out waiting for init.ohasd to be started

Possible causes:

Cause 1: Grid Infrastructure Standalone - wrong ownership/permissions on socket files

For Oracle Restart (Grid Infrastructure Standalone), it is possible that the network socket files have unexpected or inappropriate ownership or permissions. Note that the socket files are located in /tmp/.oracle, /var/tmp/.oracle or /usr/tmp/.oracle. The following ownership (for example) will cause this issue:

srwxrwxrwx 1 root root 0 Dec 4 19:22 sCRSD_UI_SOCKET

[oracle@rac1 ~]$ ls -ltr /var/tmp/.oracle/sCRSD_UI_SOCKET

srwxrwxrwx 1 root root 0 Oct 21 17:48 /var/tmp/.oracle/sCRSD_UI_SOCKET
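
A hedged fix sketch for Oracle Restart (socket files are recreated with the proper ownership when the stack restarts):

crsctl stop has -f                          # stop the Oracle Restart stack
mv /var/tmp/.oracle /var/tmp/.oracle.bak    # adjust the path per platform
crsctl start has                            # on restart the socket files are recreated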

Cause 2: Many other causes/solutions of the same issue are listed in note 1050908.1, section "Case 1: OHASD.BIN does not start". For findability: crsctl check crs; crsctl start crs

Case 10: ohasd.bin starts fine; however, "crsctl check crs" shows only the following and nothing else:

CRS-4638: Oracle High Availability Services is online

and "crsctl stat res -p -init" shows nothing.

Cause: the OLR is corrupted; refer to note 1193643.1 to restore it (see also Case 4 above).

Case 11: On RHEL7/OL7, see note 1959008.1 - Install of Clusterware fails while running root.sh on OL7 - ohasd fails to start

Symptom: when installing 11.2.0.4 GI on Red Hat Linux 7 or Oracle Linux 7, root.sh fails, and ohasd.log reports:

Oracle Database 11g Clusterware Release 11.2.0.4.0 - Production Copyright 1996, 2011 Oracle. All rights reserved.

2015-01-03 04:57:14.616: [ default][3592750912] Created alert : (:OHAS00117:) :  TIMED OUT WAITING FOR OHASD MONITOR 

Cause: Oracle Linux 7 (and Red Hat Linux 7) uses systemd rather than initd to start/restart processes and run them as services.

Solution:

1. Apply patch 18370031 for 11.2.0.4 before running root.sh.

2. As a temporary workaround, manually create an ohas service (a sketch follows this list); do not forget to remove it once the database installation is complete.

If the temporary service is not removed and the cluster is later rebooted, OHASD may fail to start and cannot be started manually either. The recovery in that case is:

  1. Reboot the host.
  2. Disable the ohas service.
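
A hedged sketch of the manual service; the unit name, path, and contents below are assumptions modeled on the note's workaround, so consult note 1959008.1 for the exact steps:

# /usr/lib/systemd/system/ohas.service (hypothetical unit file)
[Unit]
Description=Oracle High Availability Services
After=syslog.target

[Service]
Type=simple
ExecStart=/etc/init.d/init.ohasd run
Restart=always

[Install]
WantedBy=multi-user.target

Register and start it, then remove it once the installation completes:

systemctl daemon-reload
systemctl enable ohas.service
systemctl start ohas.service
# afterwards: systemctl disable ohas.service && rm /usr/lib/systemd/system/ohas.service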

Case 12: syslogd is up and the OS is able to execute init script S96ohasd

The OS may get stuck on some other Snn script while the node is coming up, and thus never get the chance to execute S96ohasd; if that is the case, the following message will not appear in the OS messages file:

Jan 20 20:46:51 rac1 logger: Oracle HA daemon is enabled for autostart.

If you do not see the above message, the other possibility is that syslogd (/usr/sbin/syslogd) is not fully up; Grid may fail to come up in that case as well. (This may not apply to AIX.)

To find out whether the OS is able to execute S96ohasd while the node is coming up, modify S96ohasd:

From:

    case `$CAT $AUTOSTARTFILE` in

      enable*)

        $LOGERR "Oracle HA daemon is enabled for autostart."

To:

    case `$CAT $AUTOSTARTFILE` in

      enable*)

        /bin/touch /tmp/ohasd.start."`date`"

        $LOGERR "Oracle HA daemon is enabled for autostart."

After a node reboot, if /tmp/ohasd.start.<timestamp> is not created, the OS got stuck on some other Snn script. If /tmp/ohasd.start.<timestamp> is created but "Oracle HA daemon is enabled for autostart" does not appear in messages, syslogd is likely not fully up. In both cases you will need to engage the System Administrator to find the issue at the OS level. For the latter case, the workaround is to "sleep" for about 2 minutes; modify S96ohasd:

From:

    case `$CAT $AUTOSTARTFILE` in

      enable*)

        $LOGERR "Oracle HA daemon is enabled for autostart."

To:

    case `$CAT $AUTOSTARTFILE` in

      enable*)

        /bin/sleep 120

        $LOGERR "Oracle HA daemon is enabled for autostart."

Case 13: The file system on which GRID_HOME resides is online when init script S96ohasd is executed; once S96ohasd runs, the following messages should appear in the OS messages file:

Jan 20 20:46:51 rac1 logger: Oracle HA daemon is enabled for autostart.

..

Jan 20 20:46:57 rac1 logger: exec /ocw/grid/perl/bin/perl -I/ocw/grid/perl/lib /ocw/grid/bin/crswrapexece.pl /ocw/grid/crs/install/s_crsconfig_rac1_env.txt /ocw/grid/bin/ohasd.bin "reboot"

If you see the first line but not the last, the file system containing GRID_HOME was likely not online when S96ohasd executed.
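
A quick check sketch (using the example GRID_HOME path from the messages above; adjust to your own mount point):

df -h /ocw/grid               # confirm the file system is mounted
grep /ocw/grid /etc/fstab     # review mount options; a late-mounting NAS/SAN volume is a common culprit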
