Problems encountered while rebuilding the RAC OCR/voting disk

author:skate

time:2010-05-09


My test environment:

 

Host OS: win2003
Virtualization software: vmware3.2.1
Guest OS: centos4.7
Oracle DB: oracle 10.2.0.1

 

Below are the errors I ran into while rebuilding the RAC OCR/voting disk, together with how I resolved them, recorded here for reference.


Summary of RAC failure symptoms:

 

0. Checking the CRS status

 

Check the processes directly with "ps -ef |grep d.bin":


[root@rac1 oracle]# ps -ef |grep d.bin
root     15716  6979  0 11:04 pts/0    00:00:00 grep d.bin
root     28240     1  1 09:29 ?        00:01:00 /u01/crs/oracle/product/10.2.0/crs/bin/crsd.bin reboot
oracle   29059 28223  0 09:32 ?        00:00:11 /u01/crs/oracle/product/10.2.0/crs/bin/evmd.bin
oracle   29209 29181  0 09:32 ?        00:00:44 /u01/crs/oracle/product/10.2.0/crs/bin/ocssd.bin

 

If the processes above (crsd.bin, evmd.bin and ocssd.bin) are present, CRS has started normally.

 

Check with the command "crsctl check crs":

 

[oracle@rac2 ~]$ crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy

 

With the output above, CRS has started normally.
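The check above can be scripted. The following is only a sketch: the function name crs_healthy is my own, and it assumes the 10.2 wording "<component> appears healthy" shown in the transcript. It reads the command output from stdin and does not rely on the exit status of crsctl.

```shell
# crs_healthy: read `crsctl check crs` output on stdin and succeed only if
# CSS, CRS and EVM all report "appears healthy".
# (Hypothetical helper; assumes the 10.2 message wording shown above.)
crs_healthy() {
    local output component
    output=$(cat)
    for component in CSS CRS EVM; do
        # grep -q exits non-zero when the expected line is missing
        printf '%s\n' "$output" | grep -q "^${component} appears healthy" || return 1
    done
    return 0
}
```

It could be used as: crsctl check crs | crs_healthy && echo "CRS is up"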

 

Output like the following, on the other hand, means CRS did not start successfully:

 

# ps -ef | grep css

root      6929     1  0 19:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd fatal
root      6960  6928  0 19:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root      6963  6929  0 19:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root      7064  6935  0 19:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck

 

In that case, check the CRS logs: crsd.log, ocssd.log, evmd.log

 

 

 

 


1. CRS failures:

 

1.1 Error: Insufficient user privileges.

 

Symptom:
[oracle@green ~]$ crsctl stop crs
Insufficient user privileges.

 

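Solution:
"crsctl stop crs" has to be run as root. However, root's own environment does not set $ORACLE_HOME or the CRS environment variables, so a root login shell reports that the command is not found. Switching to root with plain "su" from the oracle session keeps oracle's environment, and the command then works: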


[root@rac2 ~]# crsctl check crs
-bash: crsctl: command not found


[root@rac2 ~]# su - oracle


[oracle@rac2 ~]$ crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy


[oracle@rac2 ~]$ crsctl stop crs
Insufficient user privileges.

 

[oracle@rac2 ~]$ su
Password:
[root@rac2 oracle]# crsctl stop crs
Stopping resources.
Successfully stopped CRS resources.
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.
[root@rac2 oracle]#

 

 


1.2 Error:

Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM


Symptom:
[root@rac2 oracle]# crsctl check crs
Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM

 

Check which CRS processes have started: ps -ef |grep d.bin

 

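After CRS is started (crsctl start crs), it takes a little while to come up; checking too soon produces the error above.
The same error also appears when the clocks on the nodes are out of sync, which was the cause in my case.
I used a simple rdate to keep the two nodes in sync. Other options include ntpdate or setting up
a time server.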
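Since CRS needs some time to come up, it can help to poll instead of checking once. A minimal sketch follows; wait_for is a hypothetical helper name, and the retry count and interval are arbitrary choices of mine.

```shell
# wait_for MAX_TRIES INTERVAL COMMAND...: run COMMAND every INTERVAL seconds
# until it succeeds, giving up after MAX_TRIES attempts.
# (Hypothetical helper, not part of the Oracle tooling.)
wait_for() {
    local max_tries interval attempt
    max_tries=$1
    interval=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_tries" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # No need to sleep after the final failed attempt
        if [ "$attempt" -lt "$max_tries" ]; then
            sleep "$interval"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

For example: wait_for 30 10 sh -c 'crsctl check crs | grep -q "CRS appears healthy"'. The grep is there because I am not sure the exit status of crsctl in 10.2 reflects the health of the stack.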

 

CRS can also be managed directly with the following scripts:
 /etc/rc.d/init.d/init.crs
 /etc/rc.d/init.d/init.crsd
 /etc/rc.d/init.d/init.cssd
 /etc/rc.d/init.d/init.evmd


 

Reference: http://www.dbspecialists.com/files/presentations/rac_quick_reference.html


######################################################################################


ASM on rac2 fails to start and reports the following error:


[oracle@rac1 ~]$ srvctl start asm -n rac2
PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [CRS-0215: Could not start resource 'ora.rac2.ASM2.asm'.]]
  [PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [CRS-0215: Could not start resource 'ora.rac2.ASM2.asm'.]]


I then tried starting it on its own with crs_start to see what error it would report, and got a whole pile of errors:


[oracle@rac2 ~]$ srvctl start asm -n rac2
PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:SQL*Plus: Release 10.2.0.1.0 - Production on Thu May 6 15:25:59 2010
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:Copyright (c) 1982, 2005, Oracle.  All rights reserved.
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:Enter user-name: Connected to an idle instance.
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:SQL> ORA-27504: IPC error creating OSD context
rac2:ora.rac2.ASM2.asm:ORA-27300: OS system dependent operation:if_not_found failed with status: 0
rac2:ora.rac2.ASM2.asm:ORA-27301: OS failure message: Error 0
rac2:ora.rac2.ASM2.asm:ORA-27302: failure occurred at: skgxpvaddr9
rac2:ora.rac2.ASM2.asm:ORA-27303: additional information: requested interface 192.0.22.0 not found. Check output from ifconfig command
rac2:ora.rac2.ASM2.asm:SQL> Disconnected
rac2:ora.rac2.ASM2.asm:
CRS-0215: Could not start resource 'ora.rac2.ASM2.asm'.]]
  [PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:SQL*Plus: Release 10.2.0.1.0 - Production on Thu May 6 15:25:59 2010
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:Copyright (c) 1982, 2005, Oracle.  All rights reserved.
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:Enter user-name: Connected to an idle instance.
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:SQL> ORA-27504: IPC error creating OSD context
rac2:ora.rac2.ASM2.asm:ORA-27300: OS system dependent operation:if_not_found failed with status: 0
rac2:ora.rac2.ASM2.asm:ORA-27301: OS failure message: Error 0
rac2:ora.rac2.ASM2.asm:ORA-27302: failure occurred at: skgxpvaddr9
rac2:ora.rac2.ASM2.asm:ORA-27303: additional information: requested interface 192.0.22.0 not found. Check output from ifconfig command
rac2:ora.rac2.ASM2.asm:SQL> Disconnected
rac2:ora.rac2.ASM2.asm:
CRS-0215: Could not start resource 'ora.rac2.ASM2.asm'.]]

 

The error "ORA-27504: IPC error creating OSD context" usually points to a problem with communication between the nodes.

 

First check the /etc/hosts file.

 

The correct format should look like this:

 

[oracle@rac1 ~]$ more /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1      localhost.localdomain localhost

#skate add

# Public
192.168.2.31   rac1.localdomain        rac1
192.168.2.22   rac2.localdomain        rac2
#Private
192.168.0.31   rac1-priv.localdomain   rac1-priv
192.168.0.22   rac2-priv.localdomain   rac2-priv
#Virtual
192.168.2.131   rac1-vip.localdomain    rac1-vip
192.168.2.122   rac2-vip.localdomain    rac2-vip
[oracle@rac1 ~]$

 

 

My file was fine. When I discussed it in a chat group, everyone, myself included, focused on the following error:
ORA-27303: additional information: requested interface 192.0.22.0 not found. Check output from ifconfig command


But what was causing it?

I first suspected the NIC configuration: perhaps the IP settings or the MTU were wrong. After checking, though, my NIC settings were all normal.

 

Network on node rac1:


[root@rac1 tmp]# ip a
1: lo: <LOOPBACK,UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:0c:29:2a:81:d3 brd ff:ff:ff:ff:ff:ff
    inet 192.168.2.31/24 brd 192.168.2.255 scope global eth0
    inet 192.168.2.131/24 brd 192.168.2.255 scope global secondary eth0:1
    inet6 fe80::20c:29ff:fe2a:81d3/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:0c:29:2a:81:dd brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.31/24 brd 192.168.0.255 scope global eth1
    inet6 fe80::20c:29ff:fe2a:81dd/64 scope link
       valid_lft forever preferred_lft forever
4: sit0: <NOARP> mtu 1480 qdisc noop
    link/sit 0.0.0.0 brd 0.0.0.0


Network on node rac2:


[root@rac2 ~]# ip a
1: lo: <LOOPBACK,UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:0c:29:81:22:38 brd ff:ff:ff:ff:ff:ff
    inet 192.168.2.22/24 brd 192.168.2.255 scope global eth0
    inet 192.168.2.122/24 brd 192.168.2.255 scope global secondary eth0:1
    inet6 fe80::20c:29ff:fe81:2238/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:0c:29:81:22:42 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.22/24 brd 192.168.0.255 scope global eth1
    inet6 fe80::20c:29ff:fe81:2242/64 scope link
       valid_lft forever preferred_lft forever
4: sit0: <NOARP> mtu 1480 qdisc noop
    link/sit 0.0.0.0 brd 0.0.0.0

 

 

After a lot of googling I found a post suggesting that the following changes might fix it:

 

1. Shut down the Oracle instance.
2. cd $ORACLE_HOME/rdbms/lib
3. make -f ins_rdbms.mk rac_off
4. make -f ins_rdbms.mk ioracle

 

I followed these steps, but they did not help. Instead I got the following error:

 

[oracle@rac2 lib]$ srvctl start asm -n rac2
PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:SQL*Plus: Release 10.2.0.1.0 - Production on Thu May 6 16:34:23 2010
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:Copyright (c) 1982, 2005, Oracle.  All rights reserved.
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:Enter user-name: Connected to an idle instance.
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:SQL> ORA-00439: feature not enabled: Real Application Clusters
rac2:ora.rac2.ASM2.asm:SQL> Disconnected
rac2:ora.rac2.ASM2.asm:
CRS-0215: Could not start resource 'ora.rac2.ASM2.asm'.]]
  [PRKS-1009 : Failed to start ASM instance "+ASM2" on node "rac2", [rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:SQL*Plus: Release 10.2.0.1.0 - Production on Thu May 6 16:34:23 2010
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:Copyright (c) 1982, 2005, Oracle.  All rights reserved.
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:Enter user-name: Connected to an idle instance.
rac2:ora.rac2.ASM2.asm:
rac2:ora.rac2.ASM2.asm:SQL> ORA-00439: feature not enabled: Real Application Clusters
rac2:ora.rac2.ASM2.asm:SQL> Disconnected
rac2:ora.rac2.ASM2.asm:
CRS-0215: Could not start resource 'ora.rac2.ASM2.asm'.]]

 

The error "ORA-00439: feature not enabled: Real Application Clusters" shows that the RAC feature had now been disabled, so I reversed the operation:

 

1. Shut down the Oracle instance. (I skipped this step, since my instance had never come up.)
2. cd $ORACLE_HOME/rdbms/lib
3. make -f ins_rdbms.mk rac_on
4. make -f ins_rdbms.mk ioracle

 

After this, the previous error came back. The other alert logs showed no errors, but the ASM2 alert log ended with two error lines:

 

[root@rac2 ~]# tail -50 /u01/app/oracle/admin/+ASM/bdump/alert_+ASM2.log |more
USER: terminating instance due to error 27504
Instance terminated by USER, pid = 27244
Fri May  7 03:26:47 2010
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 2
Using LOG_ARCHIVE_DEST_1 parameter default value as /u01/app/oracle/product/10.2.0/db_1/dbs/arch
Autotune of undo retention is turned off.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.1.0.
System parameters with non-default values:
  large_pool_size          = 12582912
  instance_type            = asm
  cluster_interconnects    = 192,168.0.22
  cluster_database         = TRUE
  instance_number          = 2
  remote_login_passwordfile= EXCLUSIVE
  background_dump_dest     = /u01/app/oracle/admin/+ASM/bdump
  user_dump_dest           = /u01/app/oracle/admin/+ASM/udump
  core_dump_dest           = /u01/app/oracle/admin/+ASM/cdump
  asm_diskgroups           = DATA
USER: terminating instance due to error 27504
Instance terminated by USER, pid = 29732
Fri May  7 03:28:40 2010
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 2
Using LOG_ARCHIVE_DEST_1 parameter default value as /u01/app/oracle/product/10.2.0/db_1/dbs/arch
Autotune of undo retention is turned off.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.1.0.
System parameters with non-default values:
  large_pool_size          = 12582912
  instance_type            = asm
  cluster_interconnects    = 192,168.0.22
  cluster_database         = TRUE
  instance_number          = 2
  remote_login_passwordfile= EXCLUSIVE
  background_dump_dest     = /u01/app/oracle/admin/+ASM/bdump
  user_dump_dest           = /u01/app/oracle/admin/+ASM/udump
  core_dump_dest           = /u01/app/oracle/admin/+ASM/cdump
  asm_diskgroups           = DATA
USER: terminating instance due to error 27504
Instance terminated by USER, pid = 31201


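This did not pinpoint the problem either. Meanwhile, I was able to start the database on rac1 with sqlplus. Finally a friend in the group suggested
looking at the contents of the ASM2 parameter file.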


On checking, I found that I had written +ASM2.cluster_interconnects='192,168.0.22' where it should have been +ASM2.cluster_interconnects='192.168.0.22'.


I had typed a comma in place of the first dot. I corrected it right away, started the ASM2 instance again, and this time it came up.
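A typo like this is easy to catch mechanically before starting the instance. The sketch below is a hypothetical helper of my own, not an Oracle tool: it checks that a string is a single dotted-quad IPv4 address.

```shell
# is_ipv4 ADDR: succeed only if ADDR is a dotted-quad IPv4 address
# (four decimal fields, each 0-255, separated by dots).
# (Hypothetical helper for sanity-checking a cluster_interconnects value.)
is_ipv4() {
    local addr field old_ifs
    addr=$1
    # Reject empty input, stray characters such as ',', and empty fields
    case $addr in
        '' | *[!0-9.]* | .* | *. | *..*) return 1 ;;
    esac
    # Split on dots into the positional parameters
    old_ifs=$IFS
    IFS=.
    set -- $addr
    IFS=$old_ifs
    [ $# -eq 4 ] || return 1
    for field in "$@"; do
        [ "${#field}" -le 3 ] && [ "$field" -le 255 ] || return 1
    done
    return 0
}
```

With it, is_ipv4 '192,168.0.22' fails while is_ipv4 '192.168.0.22' succeeds, so the bad value would have been flagged immediately.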

 

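Looking back, the error "ORA-27303: additional information: requested interface 192.0.22.0 not found. Check output from ifconfig command"
now makes sense: the malformed cluster_interconnects value was evidently read as 192.0.22.0, an address that no interface on the node carries, so the instance could not set up inter-node communication and reported this error.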

 

 


#####################################################################################

 

Error when starting the database

 

[oracle@rac2 ~]$ srvctl start database -d rac
PRKP-1001 : Error starting instance rac1 on node rac1
CRS-0215: Could not start resource 'ora.rac.rac1.inst'.
PRKP-1001 : Error starting instance rac2 on node rac2
CRS-0215: Could not start resource 'ora.rac.rac2.inst'.

 

Although srvctl could not start the database, sqlplus could start it normally on each of the two nodes.

For this error, some posts online say the following fixes it:

 

as root:
crsctl stop crs
rm -f /var/tmp/.oracle/*
crsctl start crs

 

Wait a while; once CRS is up again, the database should start normally.

 

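In my environment, however, the problem remained. Then it occurred to me that the case of the database and instance names might matter.
What I had just registered in the OCR was all lowercase, and I suspected that was the cause. So I removed the lowercase
entries and registered them again in uppercase.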

 

This is the original lowercase registration:


[oracle@rac2 ~]$ srvctl add database -d rac -o /u01/app/oracle/product/10.2.0/db_1
[oracle@rac2 ~]$ srvctl add instance -d rac -i rac1 -n rac1
[oracle@rac2 ~]$ srvctl add instance -d rac -i rac2 -n rac2
[oracle@rac2 ~]$ srvctl modify instance -d rac -i rac1 -s +ASM1
[oracle@rac2 ~]$ srvctl modify instance -d rac -i rac2 -s +ASM2

 

Remove the lowercase entries:

 

[oracle@rac2 ~]$ srvctl remove instance -d rac -i rac1
Remove instance rac1 from the database rac? (y/[n]) y
[oracle@rac2 ~]$ srvctl remove instance -d rac -i rac2
Remove instance rac2 from the database rac? (y/[n]) y
[oracle@rac2 ~]$ srvctl remove database -d rac
Remove the database rac? (y/[n]) y
[oracle@rac2 ~]$

 

Register the database and instances in uppercase:

 

[oracle@rac2 ~]$ srvctl add database -d RAC -o $ORACLE_HOME
[oracle@rac2 ~]$ srvctl add instance -d RAC -i RAC1 -n rac1
[oracle@rac2 ~]$ srvctl add instance -d RAC -i RAC2 -n rac2
[oracle@rac2 ~]$ srvctl modify  instance -d RAC -i RAC1 -s +ASM1
[oracle@rac2 ~]$ srvctl modify  instance -d RAC -i RAC2 -s +ASM2

 

Then I started the database again, and this time it actually started.

 

[oracle@rac2 ~]$ srvctl start database -d rac

 

 


######################################################################


Errors when removing the instance and database from the OCR:

 

[oracle@rac2 ~]$ srvctl remove instance -d rac -i rac1
Remove instance rac1 from the database rac? (y/[n]) y
PRKP-1023 : The instance {0} is still running.rac
[oracle@rac2 ~]$ srvctl remove instance -d rac -i rac1
Remove instance rac1 from the database rac? (y/[n]) y
PRKP-1023 : The instance {0} is still running.rac
[oracle@rac2 ~]$ srvctl remove instance -d rac
PRKO-2001 : Invalid command line syntax
[oracle@rac2 ~]$ srvctl remove database -d rac
Remove the database rac? (y/[n]) y
PRKP-1022 : The database rac is still running.

 

Solution: stop all services with crs_stop -all, then check the state of each service with crs_stat -t -v.
If any service is in state UNKNOWN, those have to be stopped one by one.
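Finding the resources stuck in UNKNOWN can be automated. This is a sketch under an assumption: it expects the NAME=/TYPE=/TARGET=/STATE= block format that plain crs_stat (without -t) prints in 10.2, which also gives the full resource names. unknown_resources is a hypothetical helper name.

```shell
# unknown_resources: read plain `crs_stat` output on stdin and print the
# name of every resource whose STATE line starts with UNKNOWN.
# (Hypothetical helper; assumes NAME=... / STATE=... lines per resource.)
unknown_resources() {
    awk -F= '
        $1 == "NAME"                     { name = $2 }   # remember current resource
        $1 == "STATE" && $2 ~ /^UNKNOWN/ { print name }  # report it if stuck
    '
}
```

Each name printed by crs_stat | unknown_resources could then be passed to crs_stop in a loop.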

 

################################################################

 

Problem starting ONS (onsctl):

 

[oracle@rac1 ~]$ onsctl ping
Number of onsconfiguration retrieved, numcfg = 0
ons is not running ...

 

Solution:
[oracle@rac1 ~]$ onsctl start
Number of onsconfiguration retrieved, numcfg = 0
Number of onsconfiguration retrieved, numcfg = 0
onsctl: ons started


[oracle@rac1 ~]$ onsctl ping
Number of onsconfiguration retrieved, numcfg = 0
ons is running ...


##########################################################

 

Symptom: oifcfg getif returns nothing

 

[root@rac2 public]# /u01/crs/oracle/product/10.2.0/crs/bin/oifcfg iflist
eth0  192.168.2.0
eth1  192.168.0.0
[root@rac2 public]# /u01/crs/oracle/product/10.2.0/crs/bin/oifcfg getif
[root@rac2 public]# /u01/crs/oracle/product/10.2.0/crs/bin/oifcfg getif -global
[root@rac2 public]# /u01/crs/oracle/product/10.2.0/crs/bin/oifcfg getif -global rac1
[root@rac2 public]# /u01/crs/oracle/product/10.2.0/crs/bin/oifcfg getif -global rac2

 

Manually register the network information in the OCR:

 

[oracle@rac2 public]# oifcfg setif -global eth0/192.168.2.0:public
[oracle@rac2 public]# oifcfg setif -global eth1/192.168.0.0:cluster_interconnect


After that, the query returns the interfaces:

 

[oracle@rac1 ~]$ oifcfg getif
eth0  192.168.2.0  global  public
eth1  192.168.0.0  global  cluster_interconnect
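Note that oifcfg setif expects the subnet (192.168.0.0), not a host address (192.168.0.22). When oifcfg iflist is not at hand, the subnet can be derived from an address and its netmask. A small sketch follows; subnet_of is a hypothetical helper, and it assumes plain decimal fields without leading zeros.

```shell
# subnet_of IP NETMASK: print the network address, i.e. the bitwise AND of
# each dotted-quad field of IP with the matching field of NETMASK.
# (Hypothetical helper; assumes well-formed dotted-quad input.)
subnet_of() {
    local ip mask old_ifs
    ip=$1
    mask=$2
    # Split both arguments on dots: $1..$4 are the IP, $5..$8 the netmask
    old_ifs=$IFS
    IFS=.
    set -- $ip $mask
    IFS=$old_ifs
    printf '%d.%d.%d.%d\n' "$(( $1 & $5 ))" "$(( $2 & $6 ))" "$(( $3 & $7 ))" "$(( $4 & $8 ))"
}
```

For the private interface above, subnet_of 192.168.0.22 255.255.255.0 gives the value used in the setif command.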

 

#########################################################################

 

 

These are some of the errors I ran into while rebuilding the OCR/voting disk.

 

 

----------end---------

 

 

 

 

