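In both test and production Oracle RAC environments, an operator error or a damaged disk can cause an OCR disk failure, after which CRS cannot start and the database stops serving requests. This article gives a concrete fix for each of several situations that can arise; every procedure below has been carried out in practice.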
I. Fault diagnosis:
(1) Check the system log
# tail -f /var/log/messages
Oracle Cluster Registry initialization failed with invalid format: PROC-22: The OCR backend has an invalid format
If this message appears, suspect an OCR disk failure.
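(2) Check the cluster status
# crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.
(3) Check the processes
# ps -ef|grep d.bin
root 26170 22963 0 11:51 pts/0 00:00:00 grep d.bin
None of the CRS daemons (crsd.bin, ocssd.bin, evmd.bin) is running.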
(4) crsctl check boot
# crsctl check boot
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
(5) crsctl check crs
# crsctl check crs
Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM
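(6) Check the OCR disk
# ocrcheck
PROT-602: Failed to retrieve data from the cluster registry
(7) Check the voting disk
# crsctl query css votedisk
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
At this point it is fairly certain that the OCR disk has failed. However, the same symptoms also appear when the disk array is not mounted or the raw devices are not bound. Both can be ruled out as follows: run fdisk -l to check the disk partitions and confirm that the array is mounted; if OCR and the voting disk use raw devices, run raw -qa to check that the raw devices are bound to the disk partitions.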
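The checks above can be condensed into a small helper that reads the collected command output and maps the error codes quoted in this article to a first guess. This is only a sketch: the function name classify_ocr_symptom is hypothetical, and the mapping covers nothing beyond the messages shown above.

```shell
#!/bin/sh
# Hypothetical helper: read diagnostic output on stdin and print a first guess.
# The mapping only covers the messages quoted in this article.
classify_ocr_symptom() {
    input=$(cat)
    case "$input" in
        *PROC-22*)  echo "OCR content invalid: suspect a damaged OCR disk" ;;
        *PROC-26*)  echo "OCR device unreachable: check array mount and raw bindings" ;;
        *PROT-602*) echo "OCR unreadable: suspect an OCR disk failure" ;;
        *CRS-0184*) echo "CRS daemon down: continue with the remaining checks" ;;
        *)          echo "no known OCR symptom found" ;;
    esac
}

# Example (run on a cluster node):
#   crsctl check boot 2>&1 | classify_ocr_symptom
```

The PROC-26 case deliberately points at the array mount and raw bindings first, because "No such file or directory" means the device could not be opened at all.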
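II. Symptoms and solutions:
Symptom 1: A disk array restart causes the CRS processes to exit, and CRS cannot be started
Solution:
(1) Rescan the HBA cards
# rescan-scsi-bus.sh
(2) Bind the raw devices
# /etc/init.d/raw start
(3) Scan the ASM disks
# oracleasm scandisks
CRS usually starts automatically at this point. If it still does not start, continue with the following steps:
(4) Disable CRS autostart (run as root on every node)
# /etc/init.d/init.crs disable
(5) Stop CRS (run as root on every node)
# /etc/init.d/init.crs stop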
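(6) List the OCR backups
# ocrconfig -showbackup
(7) On the node that holds the OCR backup, change to the backup directory and restore OCR from the backup file
# cd /home/oracle/crs/cdata/crs
# ocrconfig -restore backup00.ocr (backup00.ocr is normally the most recent backup)
(8) Re-enable CRS autostart (run as root on every node)
# /etc/init.d/init.crs enable
(9) Start CRS (run as root on every node)
# /etc/init.d/init.crs start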
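Because steps (4) to (9) mix commands that run on every node with commands that run only on the node holding the backup, it can help to print the sequence before touching the cluster. The sketch below does only that; print_ocr_restore_plan and the "all-nodes" / "backup-node" labels are hypothetical, and nothing is executed.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the restore sequence from steps (4)-(9).
# $1 = backup directory, $2 = backup file name (defaults to backup00.ocr).
# It only echoes text; run the printed commands by hand as root.
print_ocr_restore_plan() {
    backup_dir=$1
    backup_file=${2:-backup00.ocr}
    echo "all-nodes: /etc/init.d/init.crs disable"
    echo "all-nodes: /etc/init.d/init.crs stop"
    echo "backup-node: cd $backup_dir && ocrconfig -restore $backup_file"
    echo "all-nodes: /etc/init.d/init.crs enable"
    echo "all-nodes: /etc/init.d/init.crs start"
}

# Example:
#   print_ocr_restore_plan /home/oracle/crs/cdata/crs
```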
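Symptom 2: The OCR disk is damaged but a backup exists
Troubleshooting has established that CRS cannot start because the OCR disk is damaged, and an OCR backup exists on the local disk of a RAC node. OCR can then be restored and CRS started as follows:
(1) List the OCR backups
# ocrconfig -showbackup
rac2 2011/12/05 11:07:21 /home/oracle/crs/cdata/crs
(2) On the node that holds the OCR backup, change to the backup directory and restore OCR from the backup file
# cd /home/oracle/crs/cdata/crs
# ocrconfig -restore backup00.ocr (backup00.ocr is normally the most recent backup)
(3) Start CRS again (run as root on every node)
# /etc/init.d/init.crs start
(4) Check the cluster status
# crs_stat -t
If everything is reported as normal, OCR has been restored successfully.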
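When ocrconfig -showbackup lists several backups, the node and directory of the newest one can be picked out mechanically. The sketch below assumes every line has the layout shown in step (1), that is node, date as YYYY/MM/DD, time, path; the function name newest_ocr_backup is hypothetical, and other Clusterware versions may print a different layout.

```shell
#!/bin/sh
# Hypothetical helper: pick the most recent entry from lines of the form
#   <node> <YYYY/MM/DD> <HH:MM:SS> <path>
# read on stdin, and print "<node> <path>".
newest_ocr_backup() {
    # Move date and time to the front so a plain reverse sort puts the newest first.
    awk 'NF >= 4 && $2 ~ "^[0-9]+/[0-9]+/[0-9]+$" { print $2, $3, $1, $4 }' |
        LC_ALL=C sort -r |
        head -n 1 |
        awk '{ print $3, $4 }'
}

# Example (run as root on a cluster node):
#   ocrconfig -showbackup | newest_ocr_backup
```

Sorting as text works here only because the date and time fields are fixed-width and zero-padded.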
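Symptom 3: The OCR disk is damaged and no backup exists
For some reason the OCR disk is damaged and there is no backup file. OCR then has to be rebuilt and every resource registered in it again. The steps are as follows:
(1) Simulate the OCR disk damage with dd (this overwrites the OCR device, so do it only on a test cluster)
# dd if=/dev/zero of=/dev/raw/raw1 bs=8192 count=12800
(2) Stop the CRS processes (run as root on every node)
# crsctl stop crs
or
# /etc/init.d/init.crs stop
(3) Run $ORA_CRS_HOME/install/rootdelete.sh on every node (must be run as root)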
rac1:/home/oracle/crs/install # ./rootdelete.sh
Shutting down Oracle Cluster Ready Services (CRS):
OCR initialization failed with invalid format: PROC-22: The OCR backend has an invalid format
Shutdown has begun. The daemons should exit soon.
Checking to see if Oracle CRS stack is down...
Oracle CRS stack is not running.
Oracle CRS stack is down now.
Removing script for Oracle Cluster Ready services
Updating ocr file for downgrade
Cleaning up SCR settings in '/etc/oracle/scls_scr'
Cleaning up Network socket directories
(4) Run $ORA_CRS_HOME/install/rootdeinstall.sh on one RAC node
rac1:/home/oracle/crs/install # ./rootdeinstall.sh
Removing contents from OCR device
2560+0 records in
2560+0 records out
10485760 bytes (10 MB) copied, 0.52535 s, 20.0 MB/s
(5) Rebuild OCR by running $ORA_CRS_HOME/root.sh on each RAC node
rac1:/home/oracle/crs # ./root.sh
WARNING: directory '/home/oracle' is not owned by root
No value set for the CRS parameter CRS_OCR_LOCATIONS. Using Values in paramfile.crs
Checking to see if Oracle CRS stack is already configured
Setting the permissions on OCR backup directory
Setting up NS directories
Oracle Cluster Registry configuration upgraded successfully
WARNING: directory '/home/oracle' is not owned by root
Successfully accumulated necessary OCR keys.
Using ports: CSS=49895 CRS=49896 EVMC=49898 and EVMR=49897.
node <nodenumber>: <nodename> <private interconnect name> <hostname>
node 1: rac1 rac1-priv rac1
node 2: rac2 rac2-priv rac2
node 3: rac3 rac3-priv rac3
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
Now formatting voting device: /u02/oracrs/vote.crs
Format of 1 voting devices complete.
Startup will be queued to init within 30 seconds.
Adding daemons to inittab
Expecting the CRS daemons to be up within 600 seconds.
CSS is active on these nodes.
rac1
CSS is inactive on these nodes.
rac2
rac3
Local node checking complete.
Run root.sh on remaining nodes to start CRS daemons.
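Note that root.sh reports an error on the last node, because VIPCA fails when run in silent mode. VIPCA therefore has to be run manually in its graphical interface.
(6) Run VIPCA and configure the VIPs in the graphical interface, using the same values as before. The procedure is not described in detail here.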
(7) Check the CRS service status
rac1:/home/oracle/crs # crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
rac1:/home/oracle/crs # crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora.rac1.gsd   application    ONLINE    ONLINE    rac1
ora.rac1.ons   application    ONLINE    ONLINE    rac1
ora.rac1.vip   application    ONLINE    ONLINE    rac1
ora.rac2.gsd   application    ONLINE    ONLINE    rac2
ora.rac2.ons   application    ONLINE    ONLINE    rac2
ora.rac2.vip   application    ONLINE    ONLINE    rac2
ora.rac3.gsd   application    ONLINE    ONLINE    rac3
ora.rac3.ons   application    ONLINE    ONLINE    rac3
ora.rac3.vip   application    ONLINE    ONLINE    rac3
This shows that ONS, GSD and VIP are already registered in OCR. The listeners, the database and ASM still have to be registered, so continue with the steps below.
(8) Reconfigure the listeners with netca, which registers the listener resources
After the configuration, check the CRS status again:
rac1:/home/oracle/crs # crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....C1.lsnr application    ONLINE    ONLINE    rac1
ora.rac1.gsd   application    ONLINE    ONLINE    rac1
ora.rac1.ons   application    ONLINE    ONLINE    rac1
ora.rac1.vip   application    ONLINE    ONLINE    rac1
ora....C2.lsnr application    ONLINE    ONLINE    rac2
ora.rac2.gsd   application    ONLINE    ONLINE    rac2
ora.rac2.ons   application    ONLINE    ONLINE    rac2
ora.rac2.vip   application    ONLINE    ONLINE    rac2
ora....C3.lsnr application    ONLINE    ONLINE    rac3
ora.rac3.gsd   application    ONLINE    ONLINE    rac3
ora.rac3.ons   application    ONLINE    ONLINE    rac3
ora.rac3.vip   application    ONLINE    ONLINE    rac3
(9) Register ASM in OCR
# srvctl add asm -n rac1 -i +ASM1 -o $ORACLE_HOME
# srvctl add asm -n rac2 -i +ASM2 -o $ORACLE_HOME
# srvctl add asm -n rac3 -i +ASM3 -o $ORACLE_HOME
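On a cluster with more nodes, the registration commands in step (9) follow a simple pattern and can be generated rather than typed. The sketch below only prints them; print_add_asm_cmds is a hypothetical name, and it assumes the naming used in this article, where the Nth node listed runs instance +ASMN.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print one `srvctl add asm` command per node.
# $1 = Oracle home, remaining arguments = node names in instance order.
# Assumes the Nth node runs ASM instance +ASMN, as in this article.
print_add_asm_cmds() {
    oracle_home=$1
    shift
    i=1
    for node in "$@"; do
        echo "srvctl add asm -n $node -i +ASM$i -o $oracle_home"
        i=$((i + 1))
    done
}

# Example:
#   print_add_asm_cmds "$ORACLE_HOME" rac1 rac2 rac3
```

Review the printed commands before running them; if the node order does not match the instance numbering, the generated commands will be wrong.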
(10) Start ASM
crs_start ora.rac1.ASM1.asm
a) The start fails with:
PRKS-1009 : Failed to start ASM instance "+ASM1" on node "rac1", [PRKS-1009 : Failed to start ASM instance "+ASM1" on node "rac1", [CRS-1028: Dependency analysis failed because of:
CRS-0223: Resource 'ora.rac1.ASM1.asm' has placement error.]]
[PRKS-1009 : Failed to start ASM instance "+ASM1" on node "rac1", [CRS-1028: Dependency analysis failed because of:
CRS-0223: Resource 'ora.rac1.ASM1.asm' has placement error.]]
b) The ASM resource state is UNKNOWN:
oracle@rac1:~> crs_stat
NAME=ora.rac1.ASM1.asm
TYPE=application
TARGET=ONLINE
STATE=UNKNOWN on rac
c) Stopping it fails as well:
oracle@rac1:~> crs_stop ora.rac1.ASM1.asm
Attempting to stop `ora.rac1.ASM1.asm` on member `rac1`
`ora.rac1.ASM1.asm` on member `rac1` has experienced an unrecoverable failure.
Human intervention required to resume its availability.
CRS-0216: Could not stop resource 'ora.rac1.ASM1.asm'.
d) Check the ASM resource log with tail -f $ORACLE_HOME/log/rac1/racg/ora.rac1.ASM1.asm.log; it shows the following errors:
Enter user-name:
ERROR:
ORA-01031: insufficient privileges
Enter user-name: SP2-0306: Invalid option.
2011-12-06 19:32:14.009: [ RACG][4286338816] [16318][4286338816][ora.rac1.ASM1.asm]: Usage: CONN[ECT] [logon] [AS {SYSDBA|SYSOPER}]
where <logon> ::= <username>[/<password>][@<connect_identifier>] | /
Enter user-name: Enter password:
ERROR:
ORA-01005: null password given; logon denied
The ASM instance can nevertheless be started manually as follows:
export ORACLE_SID=+ASM1
sqlplus / as sysdba
startup
e) The cause turned out to be the permissions on the resource (it was registered as root above, so it was presumably not owned by oracle). Running the following commands fixed the problem:
rac1:/home/oracle # crs_setperm ora.rac1.ASM1.asm -o oracle
rac1:/home/oracle # crs_setperm ora.rac1.ASM1.asm -g oinstall
rac1:/home/oracle # crs_getperm ora.rac1.ASM1.asm
Name: ora.rac1.ASM1.asm
owner:oracle:rwx,pgrp:oinstall:r-x,other::r--,
oracle@rac1:~> crs_stop ora.rac1.ASM1.asm
Attempting to stop `ora.rac1.ASM1.asm` on member `rac1`
Stop of `ora.rac1.ASM1.asm` on member `rac1` succeeded.
oracle@rac1:~> crs_start ora.rac1.ASM1.asm
Attempting to start `ora.rac1.ASM1.asm` on member `rac1`
Start of `ora.rac1.ASM1.asm` on member `rac1` succeeded.
Check the status:
rac1:/home/oracle # crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....SM1.asm application    ONLINE    ONLINE    rac1
ora....C1.lsnr application    ONLINE    ONLINE    rac1
ora.rac1.gsd   application    ONLINE    ONLINE    rac1
ora.rac1.ons   application    ONLINE    ONLINE    rac1
ora.rac1.vip   application    ONLINE    ONLINE    rac1
ora....SM2.asm application    ONLINE    ONLINE    rac2
ora....C2.lsnr application    ONLINE    ONLINE    rac2
ora.rac2.gsd   application    ONLINE    ONLINE    rac2
ora.rac2.ons   application    ONLINE    ONLINE    rac2
ora.rac2.vip   application    ONLINE    ONLINE    rac2
ora....SM3.asm application    ONLINE    ONLINE    rac3
ora....C3.lsnr application    ONLINE    ONLINE    rac3
ora.rac3.gsd   application    ONLINE    ONLINE    rac3
ora.rac3.ons   application    ONLINE    ONLINE    rac3
ora.rac3.vip   application    ONLINE    ONLINE    rac3
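The permission fix in step (10) is the same pair of crs_setperm calls for each affected resource, so it lends itself to a small generator. The sketch below only prints the commands; print_setperm_cmds is a hypothetical name, and the owner and group (oracle and oinstall in this article) are passed in rather than assumed.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the two crs_setperm calls used in this
# article for every resource name given. Nothing is executed.
# $1 = owner, $2 = group, remaining arguments = resource names.
print_setperm_cmds() {
    owner=$1
    group=$2
    shift 2
    for res in "$@"; do
        echo "crs_setperm $res -o $owner"
        echo "crs_setperm $res -g $group"
    done
}

# Example:
#   print_setperm_cmds oracle oinstall ora.rac1.ASM1.asm ora.rac2.ASM2.asm
```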
(11) Register the database
# srvctl add database -d zxin -o $ORACLE_HOME
(12) Register the instances
# srvctl add instance -d zxin -i zxin1 -n rac1
# srvctl add instance -d zxin -i zxin2 -n rac2
# srvctl add instance -d zxin -i zxin3 -n rac3
(13) Set the dependency of each database instance on its ASM instance
# srvctl modify instance -d zxin -i zxin1 -s +ASM1
# srvctl modify instance -d zxin -i zxin2 -s +ASM2
# srvctl modify instance -d zxin -i zxin3 -s +ASM3
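Steps (12) and (13) repeat the same two commands per node, so they can be generated from a database name and a node list. The sketch below prints the commands in the same order as above, all add instance calls first and then all modify instance calls. print_register_instances is a hypothetical name, and the code assumes the article's naming scheme: instance <db>N and ASM instance +ASMN on the Nth node.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the srvctl commands from steps (12)-(13).
# $1 = database name, remaining arguments = node names in instance order.
# Assumes instance <db>N and ASM instance +ASMN live on the Nth node.
print_register_instances() {
    db=$1
    shift
    i=1
    for node in "$@"; do
        echo "srvctl add instance -d $db -i ${db}${i} -n $node"
        i=$((i + 1))
    done
    i=1
    for node in "$@"; do
        echo "srvctl modify instance -d $db -i ${db}${i} -s +ASM$i"
        i=$((i + 1))
    done
}

# Example:
#   print_register_instances zxin rac1 rac2 rac3
```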
(14) Start the database
# srvctl start database -d zxin
It fails with:
PRKP-1001 : Error starting instance zxin1 on node rac1
CRS-0215: Could not start resource 'ora.zxin.zxin1.inst'.
PRKP-1001 : Error starting instance zxin2 on node rac2
CRS-0215: Could not start resource 'ora.zxin.zxin2.inst'.
PRKP-1001 : Error starting instance zxin3 on node rac3
CRS-0215: Could not start resource 'ora.zxin.zxin3.inst'.
This is the same permission problem as above; the following commands resolve it:
# crs_setperm ora.zxin.zxin2.inst -o oracle
# crs_setperm ora.zxin.zxin2.inst -g oinstall
# crs_setperm ora.zxin.zxin3.inst -o oracle
# crs_setperm ora.zxin.zxin3.inst -g oinstall
# crs_setperm ora.zxin.db -o oracle
# crs_setperm ora.zxin.db -g oinstall
# srvctl stop database -d zxin
# srvctl start database -d zxin
Check the status; everything is now normal:
rac1:/home/oracle # crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....SM1.asm application    ONLINE    ONLINE    rac1
ora....C1.lsnr application    ONLINE    ONLINE    rac1
ora.rac1.gsd   application    ONLINE    ONLINE    rac1
ora.rac1.ons   application    ONLINE    ONLINE    rac1
ora.rac1.vip   application    ONLINE    ONLINE    rac1
ora....SM2.asm application    ONLINE    ONLINE    rac2
ora....C2.lsnr application    ONLINE    ONLINE    rac2
ora.rac2.gsd   application    ONLINE    ONLINE    rac2
ora.rac2.ons   application    ONLINE    ONLINE    rac2
ora.rac2.vip   application    ONLINE    ONLINE    rac2
ora....SM3.asm application    ONLINE    ONLINE    rac3
ora....C3.lsnr application    ONLINE    ONLINE    rac3
ora.rac3.gsd   application    ONLINE    ONLINE    rac3
ora.rac3.ons   application    ONLINE    ONLINE    rac3
ora.rac3.vip   application    ONLINE    ONLINE    rac3
ora.zxin.db    application    ONLINE    ONLINE    rac1
ora....n1.inst application    ONLINE    ONLINE    rac1
ora....n2.inst application    ONLINE    ONLINE    rac2
ora....n3.inst application    ONLINE    ONLINE    rac3
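OCR has now been recovered and all services are running normally. The crs_stop -all and crs_start -all commands can be used to test whether every service also comes up correctly after CRS is restarted.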
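After a restart test, scanning the status by eye becomes error-prone as the resource list grows. The sketch below filters the long-form crs_stat output (the NAME= / STATE= layout shown in step (10) b) and prints only resources whose state is not ONLINE. The function name list_not_online is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read long-form `crs_stat` output (NAME=... / STATE=...
# lines) on stdin and print "<resource> <state>" for every resource whose
# STATE does not start with ONLINE. Prints nothing when all are ONLINE.
list_not_online() {
    awk -F= '
        $1 == "NAME"  { name = $2 }
        $1 == "STATE" { if ($2 !~ /^ONLINE/) print name, $2 }
    '
}

# Example (run on a cluster node):
#   crs_stat | list_not_online
```

Empty output means every resource reported ONLINE, which makes the helper easy to use in a check after crs_start -all.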
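OCR disk failures are quite common. If OCR cannot be recovered, CRS has to be reinstalled, which directly affects testing progress and the normal operation of production environments. The methods above make it possible to locate the fault quickly and recover from it, and the steps are fairly simple.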