Case 1: OS is at the appropriate run level
The OS needs to be at the specified run level before CRS will try to start up.
To find out at which run level the clusterware needs to come up:
1. OS versions earlier than Oracle Linux 6 (OL6) or Red Hat Linux 6 (RHEL6):
cat /etc/inittab|grep init.ohasd
h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
2. Oracle Linux 6 (OL6) or Red Hat Linux 6 (RHEL6):
[root@host1 proc]# cat /etc/init/oracle-ohasd.conf
# Copyright (c) 2001, 2011, Oracle and/or its affiliates. All rights reserved.
# Oracle OHASD startup
start on runlevel [35]
stop on runlevel [!35]
respawn
exec /etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
3. Oracle Linux 7 (and Red Hat Linux 7):
systemctl status oracle-ohasd.service
To find out the current run level:
who -r
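The two checks above can be combined into a quick comparison. This is a minimal sketch using sample output; on a real node, feed it the actual output of `grep init.ohasd /etc/inittab` and `who -r`.

```shell
#!/bin/sh
# Sketch: compare the run levels init.ohasd requires with the current one.
# The two sample lines below stand in for real command output.
inittab_line='h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null'
who_r_line='         run-level 3  2024-01-20 20:40'

required=$(echo "$inittab_line" | cut -d: -f2)     # required run levels, e.g. "35"
current=$(echo "$who_r_line" | awk '{print $2}')   # current run level, e.g. "3"

case "$required" in
  *"$current"*) echo "run level $current OK (required: $required)" ;;
  *)            echo "run level $current NOT in required set $required" ;;
esac
```

With the sample lines above this prints "run level 3 OK (required: 35)".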
Case 2: "init.ohasd run" is up
init.ohasd must be running before ohasd.bin can start.
To check whether init.ohasd is running:
ps -ef|grep init.ohasd|grep -v grep
root 2279 1 0 18:14 ? 00:00:00 /bin/sh /etc/init.d/init.ohasd run
If it is not running, it can be started with:
cd <location-of-init.ohasd>
nohup ./init.ohasd run &
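The location of init.ohasd varies by platform. A hedged sketch to locate it before starting it manually; `find_init_ohasd` is a hypothetical helper, and the candidate directories are common defaults rather than a guaranteed list:

```shell
#!/bin/sh
# Sketch: locate the init.ohasd script among common candidate directories.
find_init_ohasd() {
  for d in "$@"; do
    if [ -x "$d/init.ohasd" ]; then
      echo "$d/init.ohasd"
      return 0
    fi
  done
  return 1   # not found in any candidate directory
}

# On a real node (run as root):
#   script=$(find_init_ohasd /etc/init.d /etc/rc.d/init.d /sbin/init.d) \
#     && cd "$(dirname "$script")" && nohup ./init.ohasd run &
```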
Case 3: Clusterware auto start is enabled - it's enabled by default
To verify whether auto start is currently enabled:
$GRID_HOME/bin/crsctl config crs
By default CRS is enabled for auto start upon node reboot; to enable it:
$GRID_HOME/bin/crsctl enable crs
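The enabled/disabled state can be parsed out of the crsctl output. A minimal sketch, assuming the usual CRS-4622 message format; the sample line stands in for the real output of `$GRID_HOME/bin/crsctl config crs`:

```shell
#!/bin/sh
# Sketch: decide whether CRS autostart is enabled from `crsctl config crs`
# output. On a real node: status=$($GRID_HOME/bin/crsctl config crs)
status='CRS-4622: Oracle High Availability Services autostart is enabled.'

autostart=unknown
case "$status" in
  *"autostart is enabled"*)  autostart=enabled ;;
  *"autostart is disabled"*) autostart=disabled ;;
esac
echo "CRS autostart: $autostart"
```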
Case 4: Oracle Local Registry (OLR, $GRID_HOME/cdata/${HOSTNAME}.olr) is accessible and valid
The location of the OLR can be found with "cat /etc/oracle/olr.loc":
[root@host1 oracle]# cat olr.loc
olrconfig_loc=/test_ora_rac/grid/crs_1/cdata/host1.olr
crs_home=/test_ora_rac/grid/crs_1
To check the OLR file itself:
ls -l $GRID_HOME/cdata/*.olr
-rw------- 1 root oinstall 272756736 Feb 2 18:20 rac1.olr
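A quick sanity check of the OLR can be built on top of olr.loc. This is a sketch; `check_olr` is a hypothetical helper, and on a real node it would be pointed at /etc/oracle/olr.loc:

```shell
#!/bin/sh
# Sketch: verify the OLR file recorded in olr.loc exists and is non-empty.
check_olr() {
  loc_file=$1
  # extract the olrconfig_loc= value
  olr=$(grep '^olrconfig_loc=' "$loc_file" | cut -d= -f2)
  if [ -z "$olr" ]; then
    echo "olrconfig_loc not set in $loc_file"
    return 1
  fi
  if [ ! -s "$olr" ]; then
    echo "OLR $olr missing or empty"
    return 1
  fi
  echo "OLR looks accessible: $olr"
}

# On a real node: check_olr /etc/oracle/olr.loc
```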
If the OLR is inaccessible or corrupted, ohasd.log will likely contain messages similar to the following:
..
2010-01-24 22:59:10.470: [ default][1373676464] Initializing OLR
2010-01-24 22:59:10.472: [ OCROSD][1373676464]utopen:6m':failed in stat OCR file/disk /ocw/grid/cdata/rac1.olr, errno=2, os err string=No such file or directory
2010-01-24 22:59:10.472: [ OCROSD][1373676464]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-01-24 22:59:10.473: [ OCRRAW][1373676464]proprinit: Could not open raw device
2010-01-24 22:59:10.473: [ OCRAPI][1373676464]a_init:16!: Backend init unsuccessful : [26]
2010-01-24 22:59:10.473: [ CRSOCR][1373676464] OCR context init failure. Error: PROCL-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
2010-01-24 22:59:10.473: [ default][1373676464] OLR initalization failured, rc=26
2010-01-24 22:59:10.474: [ default][1373676464]Created alert : (:OHAS00106:) : Failed to initialize Oracle Local Registry
2010-01-24 22:59:10.474: [ default][1373676464][PANIC] OHASD exiting; Could not init OLR
OR
..
2010-01-24 23:01:46.275: [ OCROSD][1228334000]utread:3: Problem reading buffer 1907f000 buflen 4096 retval 0 phy_offset 102400 retry 5
2010-01-24 23:01:46.275: [ OCRRAW][1228334000]propriogid:1_1: Failed to read the whole bootblock. Assumes invalid format.
2010-01-24 23:01:46.275: [ OCRRAW][1228334000]proprioini: all disks are not OCR/OLR formatted
2010-01-24 23:01:46.275: [ OCRRAW][1228334000]proprinit: Could not open raw device
2010-01-24 23:01:46.275: [ OCRAPI][1228334000]a_init:16!: Backend init unsuccessful : [26]
2010-01-24 23:01:46.276: [ CRSOCR][1228334000] OCR context init failure. Error: PROCL-26: Error while accessing the physical storage
2010-01-24 23:01:46.276: [ default][1228334000] OLR initalization failured, rc=26
2010-01-24 23:01:46.276: [ default][1228334000]Created alert : (:OHAS00106:) : Failed to initialize Oracle Local Registry
2010-01-24 23:01:46.277: [ default][1228334000][PANIC] OHASD exiting; Could not init OLR
OR
..
2010-11-07 03:00:08.932: [ default][1] Created alert : (:OHAS00102:) : OHASD is not running as privileged user
2010-11-07 03:00:08.932: [ default][1][PANIC] OHASD exiting: must be run as privileged user
OR
ohasd.bin comes up but the output of "crsctl stat res -t -init" shows no resource, and "ocrconfig -local -manualbackup" fails
OR
..
2010-08-04 13:13:11.102: [ CRSPE][35] Resources parsed
2010-08-04 13:13:11.103: [ CRSPE][35] Server [] has been registered with the PE data model
2010-08-04 13:13:11.103: [ CRSPE][35] STARTUPCMD_REQ = false:
2010-08-04 13:13:11.103: [ CRSPE][35] Server [] has changed state from [Invalid/unitialized] to [VISIBLE]
2010-08-04 13:13:11.103: [ CRSOCR][31] Multi Write Batch processing...
2010-08-04 13:13:11.103: [ default][35] Dump State Starting ...
..
2010-08-04 13:13:11.112: [ CRSPE][35] SERVERS:
:VISIBLE:address{{Absolute|Node:0|Process:-1|Type:1}}; recovered state:VISIBLE. Assigned to no pool
------------- SERVER POOLS:
Free [min:0][max:-1][importance:0] NO SERVERS ASSIGNED
2010-08-04 13:13:11.113: [ CRSPE][35] Dumping ICE contents...:ICE operation count: 0
2010-08-04 13:13:11.113: [ default][35] Dump State Done.
If the OLR is corrupted, the solution is to restore a good backup with "ocrconfig -local -restore <ocr_backup_name>".
By default, the OLR is backed up to $GRID_HOME/cdata/$HOST/backup_$TIME_STAMP.olr once installation is complete.
Case 5: ohasd.bin is able to access network socket files
If the socket files are inaccessible, they can be removed and the clusterware restart attempted.
Network socket files can be located in /tmp/.oracle, /var/tmp/.oracle or /usr/tmp/.oracle. If ohasd.bin cannot access them, ohasd.log may show:
2010-06-29 10:31:01.570: [ COMMCRS][1206901056]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=procr_local_conn_0_PROL))
2010-06-29 10:31:01.571: [ OCRSRV][1217390912]th_listen: CLSCLISTEN failed clsc_ret= 3, addr= [(ADDRESS=(PROTOCOL=ipc)(KEY=procr_local_conn_0_PROL))]
2010-06-29 10:31:01.571: [ OCRSRV][3267002960]th_init: Local listener did not reach valid state
In a Grid Infrastructure cluster environment, ohasd-related socket files should be owned by root, but in an Oracle Restart environment they should be owned by the grid user; refer to the "Network Socket File Location, Ownership and Permission" section for example output.
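A sketch to list socket-file owners so unexpected ownership stands out; `list_socket_owners` is a hypothetical helper and the directory list simply covers the usual locations:

```shell
#!/bin/sh
# Sketch: print "<owner> <name>" for every socket file in the given
# directories, skipping directories that do not exist.
list_socket_owners() {
  for d in "$@"; do
    [ -d "$d" ] || continue
    # ls -l marks sockets with 's' in column 1; owner is column 3
    ls -l "$d" | awk '$1 ~ /^s/ {print $3, $NF}'
  done
}

# On a real node:
#   list_socket_owners /tmp/.oracle /var/tmp/.oracle /usr/tmp/.oracle
```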
Case 6: ohasd.bin is able to access the log file location
Symptom: the ohasd log directory ($ORACLE_HOME/log/<hostname>/ohasd) is missing or inaccessible.
OS messages/syslog shows:
Feb 20 10:47:08 racnode1 OHASD[9566]: OHASD exiting; Directory /ocw/grid/log/racnode1/ohasd not found.
Refer to the "Log File Location, Ownership and Permission" section for example output; if the expected directory is missing, create it with proper ownership and permission.
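Recreating the missing directory can be sketched as follows; `ensure_log_dir` is a hypothetical helper, and the path and owner in the example are illustrative:

```shell
#!/bin/sh
# Sketch: create a missing ohasd log directory with sane permissions.
ensure_log_dir() {
  dir=$1
  owner=$2
  if [ ! -d "$dir" ]; then
    mkdir -p "$dir"
    chmod 755 "$dir"
    # chown requires root; skip silently when not privileged
    [ "$(id -u)" = "0" ] && chown "$owner" "$dir"
    echo "created $dir"
  else
    echo "exists $dir"
  fi
}

# Example (path from the OS message, owner = Grid software owner):
#   ensure_log_dir /ocw/grid/log/racnode1/ohasd grid:oinstall
```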
Case 7: ohasd may fail to start on SUSE Linux after a node reboot; refer to note 1325718.1 - OHASD not Starting After Reboot on SLES.
Bug 12406573 is fixed in 11.2.0.3.
Case 8: OHASD fails to start; "ps -ef | grep ohasd.bin" shows ohasd.bin is started, but nothing is written to $GRID_HOME/log/<node>/ohasd/ohasd.log for many minutes. If the following two symptoms appear, the cause is bug 11834289:
- truss shows ohasd.bin looping to close non-opened file handles:
..
15058/1: 0.1995 close(2147483646) Err#9 EBADF
15058/1: 0.1996 close(2147483645) Err#9 EBADF
..
- pstack shows the following:
[root@rac1 ~]# ps -ef|grep ohasd.bin
root 1919 1 1 2022 ? 2-02:24:48 /oracle/grid/crs_1/bin/ohasd.bin reboot
[root@rac1 ~]# pstack 1919 |grep closefiledescriptors
_close sclssutl_closefiledescriptors main ..
The cause is bug 11834289, which is fixed in 11.2.0.3 and above. Another symptom of the bug is that other clusterware processes may fail to start with the same call stack and truss output (looping on the OS call "close"). If the bug occurs while starting other resources, "CRS-5802: Unable to start the agent process" may show up as well.
Case 9: Other potential causes/solutions are listed in note 1069182.1 - OHASD Failed to Start: Inappropriate ioctl for device
Symptom: after an otherwise successful installation of 11gR2 Grid Infrastructure (aka CRS), root.sh fails with the following:
..
Creating trace directory
LOCAL ADD MODE
Creating OCR keys for user 'grid', privgrp 'oinstall'..
Operation successful.
CRS-4664: Node <node1> successfully pinned.
Adding daemon to inittab
ohasd failed to start: Inappropriate ioctl for device
ohasd failed to start: Inappropriate ioctl for device at /ocw/grid/crs/install/roothas.pl line 296.
Oracle root script execution aborted!
crsctl then reports errors such as:
CRS-4639: Could not contact Oracle High Availability Services
OR
CRS-4124: Oracle High Availability Services startup failed
OR
CRS-0715: Oracle High Availability Service has timed out waiting for init.ohasd to be started
Possible causes:
Cause 1: Grid Infrastructure Standalone - wrong ownership/permission on socket files
For Oracle Restart (Grid Infrastructure Standalone), it's possible that the network socket files have unexpected or inappropriate ownership or permission. Note that the socket files are located in /tmp/.oracle, /var/tmp/.oracle or /usr/tmp/.oracle. Ownership like the following (for example) will cause this issue:
srwxrwxrwx 1 root root 0 Dec 4 19:22 sCRSD_UI_SOCKET
[oracle@rac1 ~]$ ls -ltr /var/tmp/.oracle/sCRSD_UI_SOCKET
srwxrwxrwx 1 root root 0 Oct 21 17:48 /var/tmp/.oracle/sCRSD_UI_SOCKET
Cause 2: many other causes/solutions for the same issue are listed in note 1050908.1, section "Case 1: OHASD.BIN does not start".
For findability: crsctl check crs; crsctl start crs
Case 10: ohasd.bin starts fine, however:
1. "crsctl check crs" shows only the following and nothing else:
CRS-4638: Oracle High Availability Services is online
2. "crsctl stat res -p -init" shows nothing
Cause: the OLR is corrupted; refer to note 1193643.1 to restore it.
Case 11: On EL7/OL7: note 1959008.1 - Install of Clusterware fails while running root.sh on OL7 - ohasd fails to start
Symptom: when installing 11.2.0.4 GI on Red Hat 7 or Oracle Linux 7, root.sh fails and ohasd.log reports:
Oracle Database 11g Clusterware Release 11.2.0.4.0 - Production Copyright 1996, 2011 Oracle. All rights reserved.
2015-01-03 04:57:14.616: [ default][3592750912] Created alert : (:OHAS00117:) : TIMED OUT WAITING FOR OHASD MONITOR
Cause: Oracle Linux 7 (and Red Hat 7) use systemd rather than initd to start/restart processes and run them as services.
Solution:
1. Apply patch 18370031 for 11.2.0.4 before running root.sh.
2. As a temporary workaround, manually create an ohas service; do not forget to remove it after the database installation completes.
If the service is not removed and the cluster is later restarted, OHASD may fail to start (and manual restarts may fail as well). In that case:
- Reboot the host
- Disable the ohas service
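The manually created ohas service mentioned in step 2 is typically a systemd unit along these lines. This is an illustrative sketch, not the exact unit from note 1959008.1; the unit name, path, and options should be adapted to your environment:

```ini
# /usr/lib/systemd/system/ohas.service -- illustrative name and path
[Unit]
Description=Oracle High Availability Services
After=syslog.target

[Service]
ExecStart=/etc/init.d/init.ohasd run
Type=simple
Restart=always

[Install]
WantedBy=multi-user.target
```

After creating the file, reload and enable it with "systemctl daemon-reload", "systemctl enable ohas.service", and "systemctl start ohas.service"; remember to disable and remove the unit once the installation completes, as noted above.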
Case 12: syslogd is up and the OS is able to execute init script S96ohasd
The OS may get stuck on some other Snn script while the node is coming up and thus never get a chance to execute S96ohasd; if that's the case, the following message will not appear in the OS messages file:
Jan 20 20:46:51 rac1 logger: Oracle HA daemon is enabled for autostart.
If you don't see the above message, the other possibility is that syslogd (/usr/sbin/syslogd) is not fully up; Grid may fail to come up in that case as well. This may not apply to AIX.
To find out whether the OS is able to execute S96ohasd while the node is coming up, modify S96ohasd:
From:
case `$CAT $AUTOSTARTFILE` in
enable*)
$LOGERR "Oracle HA daemon is enabled for autostart."
To:
case `$CAT $AUTOSTARTFILE` in
enable*)
/bin/touch /tmp/ohasd.start."`date`"
$LOGERR "Oracle HA daemon is enabled for autostart."
After a node reboot, if /tmp/ohasd.start.<timestamp> is not created, the OS got stuck on some other Snn script. If /tmp/ohasd.start.<timestamp> is created but "Oracle HA daemon is enabled for autostart" does not appear in the messages file, syslogd is likely not fully up. In both cases you will need to engage a System Administrator to investigate at the OS level. For the latter case, a workaround is to "sleep" for about 2 minutes by modifying S96ohasd:
From:
case `$CAT $AUTOSTARTFILE` in
enable*)
$LOGERR "Oracle HA daemon is enabled for autostart."
To:
case `$CAT $AUTOSTARTFILE` in
enable*)
/bin/sleep 120
$LOGERR "Oracle HA daemon is enabled for autostart."
Case 13: The file system where GRID_HOME resides is online when init script S96ohasd is executed. Once S96ohasd has executed, the following messages should appear in the OS messages file:
Jan 20 20:46:51 rac1 logger: Oracle HA daemon is enabled for autostart.
..
Jan 20 20:46:57 rac1 logger: exec /ocw/grid/perl/bin/perl -I/ocw/grid/perl/lib /ocw/grid/bin/crswrapexece.pl /ocw/grid/crs/install/s_crsconfig_rac1_env.txt /ocw/grid/bin/ohasd.bin "reboot"
If you see the first line but not the last one, the filesystem containing the GRID_HOME was likely not online when S96ohasd executed.
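Whether the GRID_HOME filesystem is online can be checked with a sketch like this; `grid_home_online` is a hypothetical helper and /ocw/grid is just the example path used in this note:

```shell
#!/bin/sh
# Sketch: succeed only when the directory exists and df can resolve the
# filesystem it lives on.
grid_home_online() {
  dir=$1
  [ -d "$dir" ] && df -P "$dir" >/dev/null 2>&1
}

if grid_home_online /ocw/grid; then
  echo "GRID_HOME filesystem is online"
else
  echo "GRID_HOME filesystem is NOT online"
fi
```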