Preface
I recently did a 10gR2 CRS installation on SuSE Linux 9.3 (2.6.5.7-244 kernel) and noticed that after a reboot of the RAC nodes, CRS would not come up! The CSS daemon was stuck at the /etc/init.d/init.cssd startcheck command:
raclinux1:/tmp # ps -ef | grep css
root 6929 1 0 13:56 ? 00:00:00 /bin/sh /etc/init.d/init.cssd fatal
root 6960 6928 0 13:56 ? 00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root 6963 6929 0 13:56 ? 00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root 7064 6935 0 13:56 ? 00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
Debugging..
To debug this further, I went to $ORA_CRS_HOME/log/<nodename>/client and checked the latest files there:
raclinux1:/opt/oracle/product/10.2.0/crs/log/raclinux1/client # ls -ltr
total 435
-rw-r----- 1 root root 2561 May 18 23:20 ocrconfig_8870.log
-rw-r--r-- 1 root root 195 May 18 23:22 clscfg_8924.log
-rw-r----- 1 root root 172 May 18 23:29 ocr_15307_3.log
-rw-r----- 1 root root 172 May 18 23:29 ocr_15319_3.log
-rw-r----- 1 root root 172 May 18 23:29 ocr_15447_3.log
...
...
...
drwxr-x--- 2 oracle dba 3472 May 19 08:10 .
drwxr-xr-t 8 root dba 232 May 19 13:50 ..
-rw-r--r-- 1 root root 2946 May 19 14:11 clsc.log
-rw-r--r-- 1 root root 7702 May 19 14:11 css.log
I ran more on clsc.log and css.log and saw the following errors:
$ more clsc.log
...
...
...
2008-05-19 14:11:29.912: [ COMMCRS][1094672672]clsc_connect: (0x81c74b8) no listener at (ADDRESS=(PROTOCOL=IPC)(KEY=CRSD_UI_SOCKET))
2008-05-19 14:11:31.582: [ COMMCRS][1094672672]clsc_connect: (0x817e3f0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=SYSTEM.evm.acceptor.auth))
2008-05-19 14:11:31.583: [ default][1094672672]Terminating clsd session
$ more css.log
...
...
...
2008-05-19 02:42:48.307: [ OCROSD][1094672672]utopen:7:failed to open OCR file/disk /var/opt/oracle/ocr1 /var/opt/oracle/ocr2, errno=19, os err string=No such device
2008-05-19 02:42:48.308: [ OCRRAW][1094672672]proprinit: Could not open raw device
2008-05-19 02:42:48.308: [ default][1094672672]a_init:7!: Backend init unsuccessful : [26]
2008-05-19 02:42:48.308: [ CSSCLNT][1094672672]clsssinit: Unable to access OCR device in OCR init.
2008-05-19 02:43:41.982: [ OCROSD][1094672672]utopen:7:failed to open OCR file/disk /var/opt/oracle/ocr1 /var/opt/oracle/ocr2, errno=19, os err string=No such device
2008-05-19 02:43:41.983: [ OCRRAW][1094672672]proprinit: Could not open raw device
2008-05-19 02:43:41.983: [ default][1094672672]a_init:7!: Backend init unsuccessful : [26]
2008-05-19 02:43:41.983: [ CSSCLNT][1094672672]clsssinit: Unable to access OCR device in OCR init.
2008-05-19 02:46:40.204: [ CSSCLNT][1094672672]clsssInitNative: connect failed, rc 9
2008-05-19 14:11:28.217: [ CSSCLNT][1094672672]clsssInitNative: connect failed, rc 9
2008-05-19 14:11:37.186: [ CSSCLNT][1094672672]clsssInitNative: connect failed, rc 9
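When a log repeats the same few errors over and over, as css.log does above, collapsing the duplicates makes the pattern easier to see. The following is only a minimal sketch and was not part of the original troubleshooting: summarize_log is a hypothetical helper name, and it assumes every line starts with a timestamp in the format shown above.

```shell
#!/bin/sh
# summarize_log (hypothetical helper): strip the leading timestamp from each
# log line, then count how often each remaining message occurs.
summarize_log() {
    # Assumes lines start with "YYYY-MM-DD HH:MM:SS.mmm: " as in css.log above.
    sed -E 's/^[0-9-]+ [0-9:.]+: //' "$1" | sort | uniq -c | sort -rn
}

# Example (path as used in this post):
# summarize_log /opt/oracle/product/10.2.0/crs/log/raclinux1/client/css.log
```

The most frequent message ends up on the first line, which is usually the one worth chasing first.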
So the logs pointed to the OCR not being available, which the /tmp/crsctl.<PID> files confirmed:
raclinux1:/tmp # ls -ltr crsctl*
-rw-r--r-- 1 oracle dba 148 May 19 02:44 crsctl.6826
-rw-r--r-- 1 oracle dba 148 May 19 02:44 crsctl.6679
-rw-r--r-- 1 oracle dba 148 May 19 02:44 crsctl.6673
-rw-r--r-- 1 oracle dba 148 May 19 02:49 crsctl.7784
-rw-r--r-- 1 oracle dba 148 May 19 02:49 crsctl.7890
-rw-r--r-- 1 oracle dba 148 May 19 02:49 crsctl.7794
-rw-r--r-- 1 oracle dba 148 May 19 13:55 crsctl.7034
-rw-r--r-- 1 oracle dba 148 May 19 13:55 crsctl.6886
-rw-r--r-- 1 oracle dba 148 May 19 13:55 crsctl.6883
-rw-r--r-- 1 oracle dba 148 May 19 14:18 crsctl.6960
-rw-r--r-- 1 oracle dba 148 May 19 14:18 crsctl.7064
-rw-r--r-- 1 oracle dba 148 May 19 14:18 crsctl.6963
raclinux1:/tmp # more crsctl.6963
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [Permission denied] [13]
Permission issue!
Duh! So it was a permission issue on the OCR disk (at this moment), which could turn into a permission issue for the voting and ASM disks later:
raclinux1:/tmp # ls -ltr /dev/raw/raw*
crw-rw-r-- 1 root disk 162, 9 Nov 18 2005 /dev/raw/raw9
crw-rw-r-- 1 root disk 162, 8 Nov 18 2005 /dev/raw/raw8
crw-rw-r-- 1 root disk 162, 7 Nov 18 2005 /dev/raw/raw7
crw-rw-r-- 1 root disk 162, 6 Nov 18 2005 /dev/raw/raw6
crw-rw-r-- 1 root disk 162, 5 Nov 18 2005 /dev/raw/raw5
crw-rw-r-- 1 root disk 162, 4 Nov 18 2005 /dev/raw/raw4
crw-rw-r-- 1 root disk 162, 3 Nov 18 2005 /dev/raw/raw3
crw-rw-r-- 1 root disk 162, 2 Nov 18 2005 /dev/raw/raw2
crw-rw-r-- 1 root disk 162, 15 Nov 18 2005 /dev/raw/raw15
crw-rw-r-- 1 root disk 162, 14 Nov 18 2005 /dev/raw/raw14
crw-rw-r-- 1 root disk 162, 13 Nov 18 2005 /dev/raw/raw13
crw-rw-r-- 1 root disk 162, 12 Nov 18 2005 /dev/raw/raw12
crw-rw-r-- 1 root disk 162, 11 Nov 18 2005 /dev/raw/raw11
crw-rw-r-- 1 root disk 162, 10 Nov 18 2005 /dev/raw/raw10
crw-rw-r-- 1 root disk 162, 1 Nov 18 2005 /dev/raw/raw1
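Scanning a listing like this by eye gets tedious with many devices. As a rough sketch, the check can be scripted. report_wrong_owner is a hypothetical helper name, and the script assumes GNU stat (the -c option), which is what Linux distributions normally ship.

```shell
#!/bin/sh
# report_wrong_owner (hypothetical helper): print every path whose
# owner:group differs from the expected value given as the first argument.
report_wrong_owner() {
    expected="$1"; shift
    for dev in "$@"; do
        # GNU stat: %U is the owner name, %G the group name.
        actual=$(stat -c '%U:%G' "$dev") || continue
        if [ "$actual" != "$expected" ]; then
            echo "$dev is $actual, expected $expected"
        fi
    done
}

# Example, using the owner and group from this post:
# report_wrong_owner oracle:dba /dev/raw/raw*
```

No output means every device already has the expected owner and group.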
I enabled read and write permission on the raw devices with chmod +rw /dev/raw/raw*, but even after that the latest /tmp/crsctl.<PID> files showed this message:
raclinux1:/tmp # more crsctl.6960
Failure -2 opening file handle for (vote1)
Failure 1 checking the CSS voting disk 'vote1'.
Failure -2 opening file handle for (vote2)
Failure 1 checking the CSS voting disk 'vote2'.
Failure -2 opening file handle for (vote3)
Failure 1 checking the CSS voting disk 'vote3'.
Not able to read adequate number of voting disks
At this point, I simply changed the owner of /dev/raw/raw* to oracle:dba:
raclinux1:/tmp # chown oracle:dba /dev/raw/raw*
After 1-2 minutes, CSS came up:
raclinux1:/tmp # ps -ef | grep css
root 6929 1 0 13:56 ? 00:00:00 /bin/sh /etc/init.d/init.cssd fatal
root 10900 6929 0 14:39 ? 00:00:00 /bin/sh /etc/init.d/init.cssd daemon
oracle 10980 10900 0 14:40 ? 00:00:00 /bin/su -l oracle -c /bin/sh -c 'ulimit -c unlimited; cd /opt/oracle/product/10.2.0/crs/log/raclinux1/cssd; /opt/oracle/product/10.2.0/crs/bin/ocssd || exit $?'
oracle 10981 10980 0 14:40 ? 00:00:00 /bin/sh -c ulimit -c unlimited; cd /opt/oracle/product/10.2.0/crs/log/raclinux1/cssd; /opt/oracle/product/10.2.0/crs/bin/ocssd || exit $?
oracle 11007 10981 2 14:40 ? 00:00:00 /opt/oracle/product/10.2.0/crs/bin/ocssd.bin
root 12013 7414 0 14:40 pts/2 00:00:00 grep css
raclinux1:/tmp #
The CRS components came up fine automatically:
raclinux1:/opt/oracle/product/10.2.0/crs/bin # ./crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
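Instead of re-running the check by hand while waiting for the stack to start, the command can be polled. This is a minimal sketch under stated assumptions: wait_until_healthy is a hypothetical helper, the 5-second interval is arbitrary, and the "appears healthy" wording is taken from the 10gR2 output above and may differ in other releases.

```shell
#!/bin/sh
# wait_until_healthy (hypothetical helper): run a check command repeatedly
# until every line of its output contains "appears healthy", or until the
# timeout (in seconds, first argument) is used up.
wait_until_healthy() {
    timeout="$1"; shift
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        out=$("$@" 2>&1)
        # Healthy means: some output, and no line that lacks the phrase.
        if [ -n "$out" ] && ! printf '%s\n' "$out" | grep -qv 'appears healthy'; then
            return 0
        fi
        sleep 5
        elapsed=$((elapsed + 5))
    done
    return 1
}

# Example, with the CRS home used in this post:
# wait_until_healthy 300 /opt/oracle/product/10.2.0/crs/bin/crsctl check crs
```

The function returns 0 as soon as all components report healthy and 1 if the timeout expires first, so it can gate later steps in a script.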
The ASM and RAC instances also came up fine:
raclinux1:/opt/oracle/product/10.2.0/crs/bin # ps -ef |grep smon
oracle 12257 1 0 14:41 ? 00:00:00 asm_smon_+ASM1
oracle 13100 1 0 14:41 ? 00:00:02 ora_smon_o10g1
root 32282 7414 0 14:55 pts/2 00:00:00 grep smon
For the long term..
To make this change permanent, I put it in the /etc/init.d/boot.local file, along with the modprobe hangcheck-timer command:
raclinux1:/opt/oracle/product/10.2.0/crs/bin # more /etc/init.d/boot.local
#! /bin/sh
#
# Copyright (c) 2002 SuSE Linux AG Nuernberg, Germany. All rights reserved.
#
# Author: Werner Fink <werner@suse.de>, 1996
# Burchard Steinbild, 1996
#
# /etc/init.d/boot.local
#
# script with local commands to be executed from init on system startup
#
# Here you should add things, that should happen directly after booting
# before we're going to the first run level.
#
chown oracle:dba /dev/raw/raw*
modprobe hangcheck-timer hangcheck_tick=30 hangcheck_margin=180
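A plain chown in boot.local fails silently if a device is missing at boot. A slightly more defensive variant could look like the sketch below. fix_raw_owner is a hypothetical helper, not something from the SuSE scripts or from Oracle, and it only wraps the same chown shown above.

```shell
#!/bin/sh
# fix_raw_owner (hypothetical helper): chown each path that exists and
# report failures on stderr instead of continuing silently.
fix_raw_owner() {
    owner="$1"; shift
    rc=0
    for dev in "$@"; do
        # An unmatched glob is passed through literally, so skip missing paths.
        [ -e "$dev" ] || continue
        chown "$owner" "$dev" || { echo "chown failed for $dev" >&2; rc=1; }
    done
    return $rc
}

# In boot.local this would stand in for the plain chown line:
# fix_raw_owner oracle:dba /dev/raw/raw*
```

The non-zero return code makes a failed ownership change visible to whatever calls the script, which helps when CSS hangs in startcheck again after a later reboot.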