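Last Saturday, just as I was going to bed at midnight, the phone rang. A call at that hour is never good news. I didn't recognize the number; once I picked up, it turned out to be the HP engineer we had contracted, who was handling a fault at a customer site. The customer had two HP minicomputers running a two-node RAC. Because of something the customer did, the second node could no longer boot into multi-user mode. Presumably someone had been poking around and deleted some operating system files, so the machine would only enter maintenance mode, and the second node had to be reinstalled. The HP engineer cloned the system from the other node onto the second node, changed the IP address, hostname and so on, and configured Serviceguard. HA came up, but when starting CRS the second node reported the following error: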
Attempting to start CRS stack
Failure at scls_scr_create with code 1
Internal Error Information:
Category: 1234
Operation: scls_scr_create
Location: mkdir
Other: Unable to make user dir
Dep: 2
After a long time with no progress, I thought about rebooting so the system would bring CRS up by itself. But after talking with the HP engineer I learned that on these hosts CRS has to be started manually after boot, so a reboot would be pointless. On Unix and Linux, the CRS start/stop script lives in an init.d directory. I'm not very familiar with HP-UX, and only by asking did I learn that on HP-UX this directory is /sbin/init.d rather than /etc/init.d. CRS is started from that directory with the ./init.crs script; its usage is:
# ./init.crs xxx   <-- any unknown argument, just to make it print its usage
Usage: ./init.crs {stop|start|enable|disable}
# ./init.crs start
This time the error message was actually useful:
/sbin/init.d/init.cssd[537]: /var/opt/oracle/scls_scr/rqtmsdb2/root/cssrun: Cannot create the specified file.
Startup will be queued to init within 30 seconds.
The log shows that CRS could not create the file cssrun. Let's check:
# cd /var/opt/oracle/scls_scr/rqtmsdb2/root/
sh: /var/opt/oracle/scls_scr/rqtmsdb2/root/: not found.
Huh, the directory doesn't exist!
# cd /var/opt/oracle/scls_scr/
One ls -l made it all clear:
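# ls -l
total 0
drwxr-xr-x 4 root sys 96 Dec 31 2010 rqtmsdb1
Because this system was cloned from the first node, the directory that should be rqtmsdb2 is still named rqtmsdb1. No wonder! Fix it:
# mv rqtmsdb1 rqtmsdb2
# ls -l
total 0
drwxr-xr-x 4 root sys 96 Dec 31 2010 rqtmsdb2
# cd rq*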
# ls -l
total 16
drwxr-xr-x 2 orarac sys 96 Dec 31 2010 orarac
drwxr-xr-x 2 root sys 8192 Nov 17 09:55 root
# cd root
# ls -l
total 48
-rw-rw-rw- 1 root root 8 Nov 17 15:33 crsdboot
-rw-r--r-- 1 root sys 7 Dec 31 2010 crsstart
-rw-rw-rw- 1 root sys 6 Nov 17 15:33 cssrun
-rw-r--r-- 1 root sys 0 Nov 17 15:33 noclsmon
-rw-rw-rw- 1 root root 0 Nov 17 15:33 nooprocd
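Cloning copies the per-host directory under /var/opt/oracle/scls_scr verbatim, so after cloning a node it is worth checking that this directory matches the new hostname. Below is a minimal POSIX sh sketch, assuming the layout seen above (one subdirectory per host name). The function name `check_scls_scr` is hypothetical, and the function only reports; it does not rename anything.

```shell
# Hypothetical helper: check that the CRS per-host directory under
# scls_scr matches this host's name. Assumes the layout shown above:
#   <base>/<hostname>/{root,<crs owner>}
# Prints "ok" if <base>/<hostname> exists; otherwise lists the host
# directories that do exist (e.g. one copied over by cloning) and
# returns 1. It never renames anything.
check_scls_scr() {
    base=${1:-/var/opt/oracle/scls_scr}
    host=${2:-$(hostname)}

    if [ -d "$base/$host" ]; then
        echo "ok"
        return 0
    fi

    for dir in "$base"/*; do
        # Skip the literal glob pattern when the directory is empty.
        [ -d "$dir" ] || continue
        echo "found: ${dir##*/} (expected: $host)"
    done
    return 1
}
```

Run with no arguments on the second node before the fix, it would be expected to report the stale rqtmsdb1 directory; renaming it remains a manual, deliberate step.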
Start CRS again:
# cd /sbin/init.d
#
# ./init.crs start
Startup will be queued to init within 30 seconds.
# ps -ef|grep d.bin
root 18734 22410 1 02:22:49 pts/ta 0:00 grep d.bin
# ps -ef|grep d.bin
root 2059 1 0 22:03:36 ? 0:00 /ora_soft/oracle/product/crs/bin/crsd.bin reboot
orarac 18782 2057 0 02:23:09 ? 0:00 /ora_soft/oracle/product/crs/bin/evmd.bin
orarac 19013 19012 0 02:23:14 ? 0:00 /ora_soft/oracle/product/crs/bin/ocssd.bin
# /ora_soft/oracle/product/crs/bin/crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
# /ora_soft/oracle/product/crs/bin/crlctl stop crs
sh: /ora_soft/oracle/product/crs/bin/crlctl: not found.
# /ora_soft/oracle/product/crs/bin/crsctl stop crs
Stopping resources.
Successfully stopped CRS resources
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.
# ps -ef|grep d.bin
root 21987 22410 0 02:24:53 pts/ta 0:00 grep d.bin
# /ora_soft/oracle/product/crs/bin/crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly
# ps -ef|grep d.bin
root 23992 22410 0 02:32:59 pts/ta 0:00 grep d.bin
# ps -ef|grep d.bin
root 23995 22410 0 02:33:05 pts/ta 0:00 grep d.bin
# ps -ef|grep d.bin
root 21829 1 0 02:24:44 ? 0:00 /ora_soft/oracle/product/crs/bin/crsd.bin reboot
orarac 24152 21817 0 02:33:18 ? 0:00 /ora_soft/oracle/product/crs/bin/evmd.bin
orarac 24299 24298 0 02:33:21 ? 0:00 /ora_soft/oracle/product/crs/bin/ocssd.bin
root 24577 22410 0 02:33:31 pts/ta 0:00 grep d.bin
# /ora_soft/oracle/product/crs/bin/crsctl status
Unknown parameter: status
# /ora_soft/oracle/product/crs/bin/crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
#
This time CRS started normally!
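Because init.crs only queues the startup ("Startup will be queued to init within 30 seconds"), the daemons appear some time later, which is why `ps -ef|grep d.bin` had to be run several times above. Below is a minimal POSIX sh sketch of that check, assuming the three daemon names seen in this transcript (crsd.bin, evmd.bin, ocssd.bin). The function names and the 10-second poll interval are hypothetical choices.

```shell
# Hypothetical helper: read `ps -ef` output on stdin and print the
# names of the CRS daemons that are NOT running. Assumes the daemon
# binaries seen above; matching on "/<name>" skips the "grep d.bin"
# line itself.
missing_crs_daemons() {
    ps_out=$(cat)
    for daemon in crsd.bin evmd.bin ocssd.bin; do
        case $ps_out in
            *"/$daemon"*) ;;          # running
            *) echo "$daemon" ;;      # not found in ps output
        esac
    done
}

# Poll until all three daemons show up or the attempts run out.
# Returns 0 when everything is up, 1 on timeout.
wait_for_crs_daemons() {
    attempts=${1:-12}
    while [ "$attempts" -gt 0 ]; do
        if [ -z "$(ps -ef | missing_crs_daemons)" ]; then
            return 0
        fi
        sleep 10
        attempts=$((attempts - 1))
    done
    return 1
}
```

For example, `./init.crs start && wait_for_crs_daemons && crsctl check crs` would replace the repeated manual `ps` calls; the daemons being present still does not prove CRS is healthy, so `crsctl check crs` remains the final word.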
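Then I went back to the first node. The HP engineer had told me nothing on this node had been touched, and I believed him: cloning a system onto the other node shouldn't require any change here. But reality was harsh!
I typed the commands:
# cd /sbin/init.d
#
# ./init.crs start
Startup will be queued to init within 30 seconds.
The d.bin processes never appeared and nothing happened, so I went back to the operating system log:
Nov 18 03:26:00 rqtmsdb1 syslog: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.2104.
Nov 18 03:26:00 rqtmsdb1 syslog: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.2116.
Nov 18 03:26:00 rqtmsdb1 syslog: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.2154.
Nov 18 03:34:16 rqtmsdb1 syslog: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.2154.
There are some errors here. One of the diagnostic files:
# cat /tmp/crsctl.2104
Failed 3 to bind listening endpoint:(ADDRESS=(PROTOCOL=tcp)(HOST=rqtmsdb1-priv))
#
CRS could not bind its listening endpoint to the private IP. Checking /etc/hosts, I found that the first node's private IP was missing: only the second node's private IP was there. After comparing with the second node's /etc/hosts, I added the first node's private IP:
192.168.0.1 rqtmsdb1-priv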
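Not checking /etc/hosts at the very start was a real mistake! Whatever you are told, verify it yourself! Once again, in a RAC environment, I got tripped up by /etc/hosts!!! Earlier, while configuring RAC at another customer, an engineer had removed the system's default localhost entry, and it took me a whole day to discover that the missing localhost entry was the cause!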
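Since both incidents came down to /etc/hosts, a quick check on every node before starting CRS is cheap. Below is a minimal POSIX sh sketch, assuming the names used at this site (rqtmsdb1-priv, rqtmsdb2-priv) plus localhost. The function name is hypothetical, and it only checks that each name appears in the file, not that the addresses are correct.

```shell
# Hypothetical helper: print every name given as an argument that does
# not appear as a host name or alias in the given hosts file.
# Comments (from "#" to end of line) are ignored; the first field of
# each line is the address, the remaining fields are names.
# Returns 1 if any name is missing.
missing_host_entries() {
    hosts_file=$1
    shift
    rc=0
    for name in "$@"; do
        if ! awk -v n="$name" '
            { sub(/#.*/, "") }    # drop comments; $0 is re-split
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit (found ? 0 : 1) }
        ' "$hosts_file"; then
            echo "$name"
            rc=1
        fi
    done
    return $rc
}
```

Usage on each node: `missing_host_entries /etc/hosts localhost rqtmsdb1-priv rqtmsdb2-priv`. No output means every name was found; on the first node before the fix it would have printed rqtmsdb1-priv.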
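I started CRS again, and this time it came up normally! I thought everything was fine and I could finally go to bed, but I didn't expect the VIP to cause trouble next.
Starting the cluster resources with crs_start -all reported failures: the VIP would not come up, so everything after it failed too. This error is easy; I had solved it before. First, turn on debug logging for the VIP resource: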
# /ora_soft/oracle/product/crs/bin/crsctl debug log res "ora.rqtmsdb1.vip:5"
Then start the VIP resource on its own, here through the node applications:
# /ora_soft/oracle/product/crs/bin/srvctl start nodeapps -n rqtmsdb1
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:29 EAT 2012 [ 25193 ] Checking interface existance
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:29 EAT 2012 [ 25193 ] Calling getifbyip
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:29 EAT 2012 [ 25193 ] getifbyip: started for 172.16.7.22
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:29 EAT 2012 [ 25193 ] Completed getifbyip
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:29 EAT 2012 [ 25193 ] switched to standby : start/check operation
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] Completed with initial interface test
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] Broadcast = 172.16.7.255
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] Interface tests
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] checkIf: start for if=lan0
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] checkIf: get default gw
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] defaultgw: started
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] defaultgw: completed with
rqtmsdb1:ora.rqtmsdb1.vip:checkIf: Default gateway is not defined (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Interface lan0 checked failed (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] checkIf: end for if=lan0
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25193 ] DEBUG: FAIL_WHEN_ALL_LINK_DOWN = 1 and IF_USING =
rqtmsdb1:ora.rqtmsdb1.vip:Invalid parameters, or failed to bring up VIP (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25341 ] Checking interface existance
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25341 ] Calling getifbyip
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25341 ] getifbyip: started for 172.16.7.22
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25341 ] Completed getifbyip
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:33 EAT 2012 [ 25341 ] switched to standby : start/check operation
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] Completed with initial interface test
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] Broadcast = 172.16.7.255
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] Performing CRS_STAT testing
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] Completed CRS_STAT testing
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] Interface tests
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] checkIf: start for if=lan0
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] checkIf: get default gw
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] defaultgw: started
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] defaultgw: completed with
rqtmsdb1:ora.rqtmsdb1.vip:checkIf: Default gateway is not defined (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Interface lan0 checked failed (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] checkIf: end for if=lan0
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:37 EAT 2012 [ 25341 ] DEBUG: FAIL_WHEN_ALL_LINK_DOWN = 1 and IF_USING =
rqtmsdb1:ora.rqtmsdb1.vip:Invalid parameters, or failed to bring up VIP (host=rqtmsdb1)
CRS-1006: No more members to consider
CRS-0215: Could not start resource 'ora.rqtmsdb1.vip'.
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:48 EAT 2012 [ 25801 ] Checking interface existance
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:48 EAT 2012 [ 25801 ] Calling getifbyip
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:48 EAT 2012 [ 25801 ] getifbyip: started for 172.16.7.22
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:48 EAT 2012 [ 25801 ] Completed getifbyip
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:48 EAT 2012 [ 25801 ] switched to standby : start/check operation
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] Completed with initial interface test
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] Broadcast = 172.16.7.255
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] Interface tests
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] checkIf: start for if=lan0
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] checkIf: get default gw
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] defaultgw: started
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] defaultgw: completed with
rqtmsdb1:ora.rqtmsdb1.vip:checkIf: Default gateway is not defined (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Interface lan0 checked failed (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] checkIf: end for if=lan0
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25801 ] DEBUG: FAIL_WHEN_ALL_LINK_DOWN = 1 and IF_USING =
rqtmsdb1:ora.rqtmsdb1.vip:Invalid parameters, or failed to bring up VIP (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25949 ] Checking interface existance
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25949 ] Calling getifbyip
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25949 ] getifbyip: started for 172.16.7.22
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25949 ] Completed getifbyip
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:52 EAT 2012 [ 25949 ] switched to standby : start/check operation
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] Completed with initial interface test
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] Broadcast = 172.16.7.255
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] Performing CRS_STAT testing
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] Completed CRS_STAT testing
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] Interface tests
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] checkIf: start for if=lan0
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] checkIf: get default gw
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] defaultgw: started
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] defaultgw: completed with
rqtmsdb1:ora.rqtmsdb1.vip:checkIf: Default gateway is not defined (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Interface lan0 checked failed (host=rqtmsdb1)
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] checkIf: end for if=lan0
rqtmsdb1:ora.rqtmsdb1.vip:Sun Nov 18 04:19:56 EAT 2012 [ 25949 ] DEBUG: FAIL_WHEN_ALL_LINK_DOWN = 1 and IF_USING =
rqtmsdb1:ora.rqtmsdb1.vip:Invalid parameters, or failed to bring up VIP (host=rqtmsdb1)
CRS-0215: Could not start resource 'ora.rqtmsdb1.LISTENER_RQTMSDB1.lsnr'.
#
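No default gateway was configured. Checking the IP address configuration, I found that the public address was actually configured on lan2, while the VIP check above was looking at lan0. Only when I asked did I learn that, because lan0 kept having problems, they had switched to lan2 this time. Why didn't anyone tell me earlier, damn it!!
When the VIP starts, it checks (pings) the default gateway; if the gateway cannot be reached, the VIP will not come up. After the HP engineer configured the default gateway, I moved the VIP over to lan2: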
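The debug log above shows the VIP check stopping at "Default gateway is not defined". A similar pre-check can be done by hand: look for a default route in `netstat -rn` output. Below is a minimal POSIX sh sketch, assuming the default route's destination column reads `default` (HP-UX style) or `0.0.0.0` (Linux style) with the gateway in the second column. The function name is hypothetical, and this is not the logic Oracle's VIP script uses internally.

```shell
# Hypothetical helper: read `netstat -rn` output on stdin and print the
# default gateway address(es). Returns 1 if no default route is found.
# Assumes the destination column reads "default" or "0.0.0.0" for the
# default route and the gateway is in column 2.
default_gateway() {
    awk '
        $1 == "default" || $1 == "0.0.0.0" { print $2; found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

Usage: `netstat -rn | default_gateway || echo "no default gateway: the VIP will not start"`. Finding a route only shows that a gateway is defined; whether it actually answers still has to be checked separately, for example with ping.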
First delete the existing interface definitions:
su - oracle
oifcfg delif -global
Then configure them again:
$ oifcfg setif -global lan2/172.16.7.0:public
$ oifcfg setif -global lan3/192.168.0.0:cluster_interconnect
# /ora_soft/oracle/product/crs/bin/srvctl modify nodeapps -n rqtmsdb2 -A 172.16.7.23/255.255.255.0/lan2
# /ora_soft/oracle/product/crs/bin/srvctl modify nodeapps -n rqtmsdb1 -A 172.16.7.22/255.255.255.0/lan2
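Note that `oifcfg setif` takes the subnet (network address), not a host address: 172.16.7.0 for the public address 172.16.7.22 with netmask 255.255.255.0 passed to srvctl above. The network address is the bitwise AND of each address octet with the matching netmask octet. Below is a minimal POSIX sh sketch; `subnet_of` is a hypothetical helper, not an Oracle tool.

```shell
# Hypothetical helper: print the network address for an IPv4 address
# and a dotted netmask by ANDing them octet by octet.
# The body runs in a subshell ( ... ) so changing IFS does not leak
# into the caller.
subnet_of() (
    IFS=.
    set -- $1 $2        # unquoted on purpose: split both on "."
    echo "$(($1 & $5)).$(($2 & $6)).$(($3 & $7)).$(($4 & $8))"
)
```

For this site, `subnet_of 172.16.7.22 255.255.255.0` prints 172.16.7.0 and `subnet_of 192.168.0.1 255.255.255.0` prints 192.168.0.0, matching the values given to `oifcfg setif` for lan2 and lan3.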
With the changes in place, I ran crs_start -all again and RAC started successfully. Job done, time for bed!
http://blog.chinaunix.net/uid-26896647-id-3417998.html